Title: | Forest Discretization |
---|---|
Description: | Supervised, multivariate, and non-parametric discretization algorithm based on tree ensembles learning and moment matching optimization. This version of the algorithm relies on random forest algorithm to learn a large set of split points that conserves the relationship between attributes and the target class, and on moment matching optimization to transform this set into a reduced number of cut points matching as well as possible statistical properties of the initial set of split points. For each attribute to be discretized, the set S of its related split points extracted through random forest is mapped to a reduced set C of cut points of size k. This mapping relies on minimizing, for each continuous attribute to be discretized, the distance between the four first moments of S and the four first moments of C subject to some constraints. This non-linear optimization problem is performed using k values ranging from 2 to 'max_splits', and the best solution returned correspond to the value k which optimum solution is the lowest one over the different realizations. ForestDisc is a generalization of RFDisc discretization method initially proposed by Berrado and Runger (2009) <doi:10.1109/AICCSA.2009.5069327>, and improved by Berrado et al. in 2012 by adopting the idea of moment matching optimization related by Hoyland and Wallace (2001) <doi: 10.1287/mnsc.47.2.295.9834>. |
Authors: | Haddouchi Maïssae |
Maintainer: | Haddouchi Maïssae <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.0 |
Built: | 2024-11-13 02:57:18 UTC |
Source: | https://github.com/cran/ForestDisc |
Extraction of the splits learned by random forest regarding continuous predictors.
Extract_cont_splits(SelectedTREES)
Extract_cont_splits(SelectedTREES)
SelectedTREES |
The output of the function RF2Selectedtrees() |
List with 2 components:
continuous_var |
Vector of continuous predictors. |
continuous_splits |
Data frame of splits learned by random forest algorithm regarding continuous predictors. |
Haddouchi Maïssae
data(iris) Mydata=iris id_target=5 set.seed(1234) X=Mydata[,1:(id_target-1)] Y=Mydata[,id_target] ntree=50 RFTREES=RF2Selectedtrees(X,Y,ntree) RFCONTSPLITS=Extract_cont_splits(RFTREES)
data(iris) Mydata=iris id_target=5 set.seed(1234) X=Mydata[,1:(id_target-1)] Y=Mydata[,id_target] ntree=50 RFTREES=RF2Selectedtrees(X,Y,ntree) RFCONTSPLITS=Extract_cont_splits(RFTREES)
ForestDisc is a supervised, multivariate and non-parametric discretization algorithm based on tree ensembles learning and moment matching optimization. This version of the algorithm relies on random forest algorithm to learn a large set of split points that conserves the relationship between attributes and the target class, and on moment matching optimization to transform this set into a reduced number of cut points matching as well as possible statistical properties of the initial set of split points. For each attribute to be discretized, the set S of its related split points extracted through random forest is mapped to a reduced set C of cut points of size k.
ForestDisc(data,id_target,ntree=50,max_splits=10,opt_meth="NelderMead")
ForestDisc(data,id_target,ntree=50,max_splits=10,opt_meth="NelderMead")
data |
Data frame to be discretized. |
id_target |
Column id of the target class. |
ntree |
Number of trees to grow using random forest algorithm in order to learn split points. The default value is 50. |
max_splits |
Maximum number of cut points to be used for discretizing continuous attributes in the data. Possible values for 'max_splits' range between 2 and 10. Default value = 10. |
opt_meth |
The non-linear optimization algorithm to use in order to get the optimal set of cut points matching as well as possible the set of split points. The possible values are DIviding RECTangles algorithm "directL", NelderMead Simplex method "NelderMead", Sequential Least-Squares Quadratic Programming "SLSQP". (more details about these non-linear optimization algorithms can be found in the documentation of the "NLopt" library). The default value used is "NelderMead". |
List with components:
Data_disc |
Discretized data. |
cont_variables |
Continuous attributes column ids. |
Listcutp |
List of cut points used to discretize continuous attributes. |
cut_points |
Data frame summarizing the best solution returned. |
opt_results |
Data frame summarizing all the solutions returned for different realizations. Each realization is determined by a size of the set of cut points, ranging between 2 and 'max_splits'. |
Haddouchi Maïssae
data(iris) Mydata=iris id_target=5 set.seed(1234) Mydata_Disc=ForestDisc(Mydata,id_target)
data(iris) Mydata=iris id_target=5 set.seed(1234) Mydata_Disc=ForestDisc(Mydata,id_target)
Learn decision splits from random forest algorithm. The resulting model consists of a set of trees where each tree is a collection of rules, and each rule is a combination of decision splits (pairs of variable/value(s)) defined from a root node to a terminal node.
RF2Selectedtrees (X,Y,ntree,max_TreeRules = 'default',min_RuleSupport = 'default')
RF2Selectedtrees (X,Y,ntree,max_TreeRules = 'default',min_RuleSupport = 'default')
X |
Descriptive attributes data frame. |
Y |
Target attribute (A response vector). |
ntree |
Number of trees to grow using Random Forest algorithm. |
max_TreeRules |
The maximum number of rules in each tree. It represents the maximum number of terminal nodes in each tree grown by random forest. The default value is the one set in random forest algorithm. |
min_RuleSupport |
The minimum support related to each rule (defined from a root node to a leef node). The support of a rule represents the size of its terminal node divided by the number of instances in the data. The default value used is the minimum size of terminal node set in random forest algorithm divided by the number of instances in the data. |
List with components:
ntree |
Number of trees. |
list |
List of 'ntree' matrix where each one corresponds to a tree grown by random forest algorithm. Each matrix consists of six columns and number of rows equal to the number of nodes in the tree. (more details can be found in the documentation of the function 'getTree' from “randomForest” package) |
RF |
The original call to randomForest algorithm used. |
xlevels |
vector of lists of size equal to the number of predictors. Each list corresponds to an attribute. In the case of categorical attribute, the categories are returned. In the case of continuous attribute, the distinct splits values performed by random Forest are returned. |
continuous_var |
Vector of continuous predictors. |
categorical_var |
Vector of categorical predictors. |
Haddouchi Maïssae
data(iris) Mydata=iris id_target=5 set.seed(1234) X=Mydata[,1:(id_target-1)] Y=Mydata[,id_target] ntree=50 RFTREES=RF2Selectedtrees(X,Y,ntree)
data(iris) Mydata=iris id_target=5 set.seed(1234) X=Mydata[,1:(id_target-1)] Y=Mydata[,id_target] ntree=50 RFTREES=RF2Selectedtrees(X,Y,ntree)
Build the optimal set of cut points C for discretization, based on moment matching. The set of split points S extracted through Extract_cont_splits() function is mapped to a reduced set of cut points C.
Select_cont_splits(cont_splits,max_splits,opt_meth)
Select_cont_splits(cont_splits,max_splits,opt_meth)
cont_splits |
Output of the function Extract_cont_splits(). |
max_splits |
Maximum number of cut points allowed. Possible values range between 2 and 10. Default value = 10. |
opt_meth |
The non-linear optimization algorithm to use in order to get the optimal set of cut points matching as well as possible the set of split points. The possible values are DIviding RECTangles algorithm "directL", NelderMead Simplex method "NelderMead", Sequential Least-Squares Quadratic Programming "SLSQP". (more details about these non-linear optimization algorithms can be found in the documentation of the "NLopt" library). |
List with 2 components:
All_splits |
Data frame of solutions returned for k values ranging from 2 to 'max_splits'. |
Selected_splits |
Data frame of the best solution returned. |
Haddouchi Maïssae
data(iris) Mydata=iris id_target=5 set.seed(1234) X=Mydata[,1:(id_target-1)] Y=Mydata[,id_target] ntree=50 RFTREES=RF2Selectedtrees(X,Y,ntree) RFCONTSPLITS=Extract_cont_splits(RFTREES) RFSELECTCONTSPLITS=Select_cont_splits(cont_splits=RFCONTSPLITS,max_splits=10,opt_meth="NelderMead")
data(iris) Mydata=iris id_target=5 set.seed(1234) X=Mydata[,1:(id_target-1)] Y=Mydata[,id_target] ntree=50 RFTREES=RF2Selectedtrees(X,Y,ntree) RFCONTSPLITS=Extract_cont_splits(RFTREES) RFSELECTCONTSPLITS=Select_cont_splits(cont_splits=RFCONTSPLITS,max_splits=10,opt_meth="NelderMead")