Introduction to the Package tree.bins

Piro Polo

2018-02-16

Introduction

When working with large data sets, there may be a need to recategorize factor variables by some criterion. The tree.bins package allows users to recategorize these variables through a decision tree method derived from the rpart() function of the rpart library. The tree.bins() function is especially useful if the data set contains several factor class variables, many of which have an unusually large number of levels. The intended purpose of the package is to recategorize predictors in order to limit the number of dummy variables created when applying a statistical method to model a response. This document illustrates a typical problem where the tree.bins package would be used and how it would be used.

Pre-Categorization: Typical Variable for Consideration

This section illustrates a typical variable that could be considered for recategorization.
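One quick way to spot candidate variables is to count the levels of every factor column. The sketch below uses a small hypothetical data frame (the column names zone, nbhd, and price are illustrative and are not taken from the Ames data) and base R only.

```r
# Hypothetical toy data frame; column names and values are illustrative only.
df <- data.frame(
  zone  = factor(c("RL", "RM", "RL", "FV", "RL", "RM")),
  nbhd  = factor(c("A", "B", "C", "D", "E", "F")),
  price = c(105, 244, 190, 196, 192, 237)
)

# Keep only the factor columns, then count the levels of each one.
factor.cols <- Filter(is.factor, df)
n.levels <- sapply(factor.cols, nlevels)

# Columns with the most levels are listed first; these are the likely
# candidates for recategorization.
sort(n.levels, decreasing = TRUE)
```

Factors with many levels relative to the number of observations are the ones most likely to benefit from binning.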

Visualization of Candidate Variable

I use a subset of the Ames data set to illustrate. The chunk below plots the average home sale price for each level of Neighborhood.

library(dplyr)    # %>%, select(), group_by(), summarise()
library(ggplot2)  # ggplot() and friends

# Average sale price (in thousands) per neighborhood, plotted in descending order
AmesSubset %>% 
  select(SalePrice, Neighborhood) %>% 
  group_by(Neighborhood) %>% 
  summarise(AvgPrice = mean(SalePrice)/1000) %>% 
  ggplot(aes(x = reorder(Neighborhood, -AvgPrice), y = AvgPrice, fill = Neighborhood)) +
  geom_bar(stat = "identity") + 
  labs(x = "Neighborhoods", y = "Avg Price (in thousands)", 
       title = "Average Home Prices of Neighborhoods", fill = "Neighborhoods") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Notice that many neighborhoods have similar average sale prices. This indicates that we could recategorize the Neighborhood variable into fewer levels.

Statistical Method Implementation of Candidate Variable

The following illustrates the results of using a statistical learning method – linear regression for this example – on a categorical variable with several levels.

fit <- lm(formula = SalePrice ~ Neighborhood, data = AmesSubset)
summary(fit)
#> 
#> Call:
#> lm(formula = SalePrice ~ Neighborhood, data = AmesSubset)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -163138  -27138   -4526   20405  433829 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         189845.3    11430.8  16.608  < 2e-16 ***
#> NeighborhoodBlueste -56359.6    22861.6  -2.465 0.013774 *  
#> NeighborhoodBrDale  -85937.8    16366.4  -5.251 1.67e-07 ***
#> NeighborhoodBrkSide -63894.3    12860.7  -4.968 7.33e-07 ***
#> NeighborhoodClearCr  14812.4    14903.9   0.994 0.320410    
#> NeighborhoodCollgCr  12799.6    12093.5   1.058 0.290009    
#> NeighborhoodCrawfor  15204.5    12971.2   1.172 0.241264    
#> NeighborhoodEdwards -57332.8    12269.8  -4.673 3.17e-06 ***
#> NeighborhoodGilbert   -403.9    12398.5  -0.033 0.974018    
#> NeighborhoodGreens    8454.7    26066.3   0.324 0.745703    
#> NeighborhoodGrnHill  90154.7    38763.8   2.326 0.020131 *  
#> NeighborhoodIDOTRR  -86378.8    13199.2  -6.544 7.56e-11 ***
#> NeighborhoodLandmrk -52845.3    53615.2  -0.986 0.324428    
#> NeighborhoodMeadowV -97588.5    15121.5  -6.454 1.36e-10 ***
#> NeighborhoodMitchel -27485.2    12878.0  -2.134 0.032940 *  
#> NeighborhoodNAmes   -45419.1    11794.3  -3.851 0.000121 ***
#> NeighborhoodNoRidge 131325.3    13581.8   9.669  < 2e-16 ***
#> NeighborhoodNPkVill -49223.4    17382.7  -2.832 0.004675 ** 
#> NeighborhoodNridgHt 127292.8    12422.5  10.247  < 2e-16 ***
#> NeighborhoodNWAmes    2118.9    12631.2   0.168 0.866794    
#> NeighborhoodOldTown -64336.5    12166.8  -5.288 1.37e-07 ***
#> NeighborhoodSawyer  -53419.4    12531.9  -4.263 2.11e-05 ***
#> NeighborhoodSawyerW -11867.9    12913.9  -0.919 0.358202    
#> NeighborhoodSomerst  42257.6    12326.2   3.428 0.000620 ***
#> NeighborhoodStoneBr 115745.8    14538.5   7.961 2.81e-15 ***
#> NeighborhoodSWISU   -53087.1    14311.7  -3.709 0.000213 ***
#> NeighborhoodTimber   66699.6    13581.8   4.911 9.79e-07 ***
#> NeighborhoodVeenker  64328.2    17090.1   3.764 0.000172 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 52380 on 2021 degrees of freedom
#> Multiple R-squared:  0.5627, Adjusted R-squared:  0.5568 
#> F-statistic:  96.3 on 27 and 2021 DF,  p-value: < 2.2e-16

Notice that 27 dummy variables are created to capture the 28 levels of the Neighborhood variable: one for every level except the reference level, Blmngtn, which is absorbed into the intercept.
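The relationship between factor levels and dummy variables can be checked directly with model.matrix(), which builds the design matrix that lm() uses. The sketch below relies on a hypothetical four-level factor and R's default treatment contrasts.

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  price = c(100, 120, 90, 200, 210, 150),
  nbhd  = factor(c("A", "B", "C", "D", "A", "B"))
)

# Design matrix used by lm(): an intercept plus one dummy column per
# non-reference level (default treatment contrasts).
X <- model.matrix(price ~ nbhd, data = toy)

colnames(X)   # "(Intercept)" "nbhdB" "nbhdC" "nbhdD"
ncol(X) - 1   # number of dummy columns, equal to nlevels(toy$nbhd) - 1
```

Every level removed by binning therefore removes one column from the design matrix.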

Visualizing the Bins Created by a Decision Tree

The steps below illustrate how rpart() sorts the different levels of Neighborhood into separate leaves. These leaves are used to generate the mappings extracted within tree.bins() to recategorize the current data.

library(rpart)
d.tree <- rpart(formula = SalePrice ~ Neighborhood, data = AmesSubset)
rpart.plot::rpart.plot(d.tree)

These five leaves are the categories that tree.bins() will use to recategorize the variable Neighborhood.
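To make the leaf-to-category idea concrete, the following sketch derives a level-to-bin lookup table from a fitted rpart object. It is only an illustration on hypothetical toy data, not the internal implementation of tree.bins(); the bin numbering here simply follows the order of the leaves in the fitted tree and need not match the numbering tree.bins() assigns.

```r
library(rpart)

# Hypothetical toy data: six levels whose mean prices fall into three groups.
toy <- data.frame(
  nbhd  = factor(rep(LETTERS[1:6], each = 10)),
  price = rep(c(100, 105, 200, 205, 300, 305), each = 10) +
          rep(seq(-2, 2, length.out = 10), times = 6)
)

fit <- rpart(price ~ nbhd, data = toy)

# fit$where gives, for each observation, the row of fit$frame that
# corresponds to the leaf the observation falls into.
lkup <- unique(data.frame(level = toy$nbhd, leaf = unname(fit$where)))

# Relabel the leaves as consecutive bins (assumed naming, for illustration).
lkup$bin <- paste0("bin#.", as.integer(factor(lkup$leaf)))
lkup[order(lkup$bin), c("level", "bin")]
```

Because the tree has a single categorical predictor, every level lands in exactly one leaf, so the table is a valid old-to-new mapping.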

Post-Categorization: Typical Variable for Consideration

This section illustrates the result of using tree.bins() to recategorize a typical variable.

Recategorization of Candidate Variable

Continuing from the above example, many of the levels of the Neighborhood variable are clearly similar in their relation to the response. To limit the number of dummy variables created by a statistical learning method, we would like to group the categories that display similar associations with the response into one bin. We could create visualizations to identify these similarities for each variable, but that would be extremely tedious, not to mention subjective.

A better method is to use the rules generated by a decision tree, which can be obtained with the rpart() function in the rpart library. Doing this by hand remains tedious, however, especially when there are numerous factor class variables. The tree.bins() function lets the user iteratively recategorize each factor class variable in the specified data set.

library(tree.bins)

sample.df <- AmesSubset %>% select(Neighborhood, MS.Zoning, SalePrice)
binned.df <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.",
                       control = rpart.control(cp = .01), return = "new.fctrs")
levels(sample.df$Neighborhood) #current levels of Neighborhood
#>  [1] "Blmngtn" "Blueste" "BrDale"  "BrkSide" "ClearCr" "CollgCr" "Crawfor"
#>  [8] "Edwards" "Gilbert" "Greens"  "GrnHill" "IDOTRR"  "Landmrk" "MeadowV"
#> [15] "Mitchel" "NAmes"   "NoRidge" "NPkVill" "NridgHt" "NWAmes"  "OldTown"
#> [22] "Sawyer"  "SawyerW" "Somerst" "StoneBr" "SWISU"   "Timber"  "Veenker"
unique(binned.df$Neighborhood) #new levels of Neighborhood
#> [1] "bin#.4" "bin#.3" "bin#.5" "bin#.2" "bin#.1"

The Different Return Options of tree.bins()

Depending on which information is most useful to the user, tree.bins() can return either the recategorized data.frame or a list of lookup tables. The lookup tables contain the old-to-new value mappings generated by tree.bins().

Setting return = "new.fctrs" returns the recategorized data.frame

head(binned.df)
#>    SalePrice Neighborhood MS.Zoning
#> 1:    105000       bin#.4    bin#.1
#> 2:    244000       bin#.4    bin#.2
#> 3:    189900       bin#.3    bin#.2
#> 4:    195500       bin#.3    bin#.2
#> 5:    191500       bin#.5    bin#.2
#> 6:    236500       bin#.5    bin#.2

Setting return = "lkup.list" returns a list of the lookup tables

lookup.list <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.",
                         control = rpart.control(cp = .01), return = "lkup.list")
head(lookup.list[[1]])
#>   Neighborhood Categories
#> 1       BrDale     bin#.1
#> 2      BrkSide     bin#.1
#> 3       IDOTRR     bin#.1
#> 4      MeadowV     bin#.1
#> 5      OldTown     bin#.1
#> 6      Somerst     bin#.2

Using the bin.oth() Function

With tree.bins(), the user can recategorize the factor class variables of one particular data.frame. Suppose that, down the road, they obtain a similar data set that follows the same original categorical convention. They may then want to recategorize the new data.frame with the same lookup tables that were generated from the first one, which is what bin.oth() does. The example below takes a subset of the AmesSubset data and returns a data.frame recategorized by the lookup list generated from the tree.bins() function.

oth.binned.df <- bin.oth(list = lookup.list, data = sample.df)
head(oth.binned.df)
#>    SalePrice Neighborhood MS.Zoning
#> 1:    105000       bin#.4    bin#.1
#> 2:    244000       bin#.4    bin#.2
#> 3:    189900       bin#.3    bin#.2
#> 4:    195500       bin#.3    bin#.2
#> 5:    191500       bin#.5    bin#.2
#> 6:    236500       bin#.5    bin#.2
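Conceptually, binning another data.frame with a lookup table is a keyed replacement of old values by new ones. The base R sketch below shows the idea for a single column. It is not the implementation of bin.oth(); the lookup rows reuse three mappings from the output shown earlier, while new.df and its SalePrice values are hypothetical.

```r
# Lookup table in the same shape as lookup.list[[1]] (three rows only).
lkup <- data.frame(
  Neighborhood = c("BrDale", "BrkSide", "Somerst"),
  Categories   = c("bin#.1", "bin#.1", "bin#.2"),
  stringsAsFactors = FALSE
)

# Hypothetical new data that follows the old categorical convention.
new.df <- data.frame(
  Neighborhood = c("Somerst", "BrDale", "BrkSide", "Somerst"),
  SalePrice    = c(230000, 105000, 125000, 228000),
  stringsAsFactors = FALSE
)

# Find each old value in the lookup table and swap in its bin label.
# Values that are missing from the lookup table would become NA.
idx <- match(new.df$Neighborhood, lkup$Neighborhood)
new.df$Neighborhood <- lkup$Categories[idx]
new.df
```

A level that never appeared in the original data has no entry in the lookup table, so it is worth checking the result for NA values before modeling.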