dataPreparation

2017-08-17

This vignette introduces dataPreparation: what it offers and how simple it is to use.

1 Introduction

1.1 Package presentation

Based on the data.table package, dataPreparation allows you to do most of the painful data preparation for a data science project with a minimal amount of code.

This package is fast and RAM-efficient: it builds on data.table and performs operations by reference and column-wise wherever possible, to avoid copying data.

data.table and the other dependencies are handled at installation.
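Getting started is a standard CRAN install (shown here for completeness):

```r
# data.table and the other dependencies are pulled in automatically.
install.packages("dataPreparation")
library(dataPreparation)
```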

1.2 Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are the following: correcting mistakes, transforming columns, filtering useless ones, handling NAs and shaping the result.

Here are the functions available in this package to tackle those issues:

- Correct: unFactor, findAndTransformDates, findAndTransformNumerics, setColAsCharacter, setColAsNumeric, setColAsDate, setColAsFactor
- Transform: generateDateDiffs, generateFactorFromDate, aggregateByKey, generateFromFactor, generateFromCharacter, fastRound
- Filter: fastFilterVariables, whichAreConstant, whichAreInDouble, whichAreBijection
- Handle NA: fastHandleNa
- Shape: shapeSet, sameShape, setAsNumericMatrix

All of those functions are integrated in the full pipeline function prepareSet.

In this tutorial we will detail all those steps and how to treat them with this package using an example data set.

1.3 Tutorial data

For this tutorial, we are going to use a messy version of the adult data base.

data(messy_adult)
print(head(messy_adult, n = 4))
#         date1      date2       date3           date4    num1    num2
# 1: 2017-10-07         NA 19-Jan-2017 21-January-2017 -3.0953  0,4954
# 2: 2017-31-12 1513465200 06-Jun-2017    08-June-2017  0.2227 -0,8202
# 3: 2017-12-10 1511305200 03-Jul-2017    05-July-2017 -0.2916  -0,713
# 4: 2017-06-09 1485126000 19-Jul-2017    21-July-2017  2.3236  0,7155
#    constant                             mail    num3 age    type_employer
# 1:        1            marie.cynthia@aol.com -3,0953  39        State-gov
# 2:        1     jake.caroline@protonmail.com  0,2227  50 Self-emp-not-inc
# 3:        1 caroline.caroline@protonmail.com -0,2916  38          Private
# 4:        1         caroline.marie@yahoo.com  2,3236  53          Private
#    fnlwgt education education_num            marital        occupation
# 1:  77516 Bachelors            13      Never-married      Adm-clerical
# 2:  83311 Bachelors            13 Married-civ-spouse   Exec-managerial
# 3: 215646   HS-grad             9           Divorced Handlers-cleaners
# 4: 234721      11th             7 Married-civ-spouse Handlers-cleaners
#     relationship  race  sex capital_gain capital_loss hr_per_week
# 1: Not-in-family White Male         2174            0          40
# 2:       Husband White Male            0            0          13
# 3: Not-in-family White Male            0            0          40
# 4:       Husband Black Male            0            0          40
#          country income
# 1: United-States  <=50K
# 2: United-States  <=50K
# 3: United-States  <=50K
# 4: United-States  <=50K

We added 9 really ugly columns to the data set:

- the first four columns (date1 to date4) store dates in four different formats;
- num1, num2 and num3 are numerics written with inconsistent decimal separators;
- constant contains a single value;
- mail is a character column stored as a factor;
- and the same info can be contained in two different columns.

2 Correct functions

2.1 Identifying factors that shouldn’t be

It often happens, when reading a data set, that R stores strings as factors even when it shouldn’t. In this tutorial data set, mail is a factor but shouldn’t be. It is automatically detected using the unFactor function:

print(class(messy_adult$mail))
# "factor"
messy_adult <- unFactor(messy_adult)
# "unFactor: I will identify variable that are factor but shouldn't be."
# "unFactor: I unfactor mail."
# "unFactor: It took me 0.14s to unfactor 1 column(s)."
print(class(messy_adult$mail))
# "character"

2.2 Identifying and transforming date columns

The next thing to do is to identify the columns that are dates (the first four) and transform them.

messy_adult <- findAndTransformDates(messy_adult)
# "findAndTransformDates: It took me 0.28s to identify formats"
# "findAndTransformDates: It took me 0.23s to transform 4 columns to a Date format."
Let’s have a look at the transformation performed on those 4 columns:

date1_prev  date2_prev  date3_prev   date4_prev           date1       date2                date3       date4
2017-10-07  NA          19-Jan-2017  21-January-2017  =>  2017-07-10  NA                   2017-01-19  2017-01-21
2017-31-12  1513465200  06-Jun-2017  08-June-2017     =>  2017-12-31  2017-12-17 00:00:00  2017-06-06  2017-06-08
2017-12-10  1511305200  03-Jul-2017  05-July-2017     =>  2017-10-12  2017-11-22 00:00:00  2017-07-03  2017-07-05
2017-06-09  1485126000  19-Jul-2017  21-July-2017     =>  2017-09-06  2017-01-23 00:00:00  2017-07-19  2017-07-21
2017-02-03  1498345200  16-May-2017  18-May-2017      =>  2017-03-02  2017-06-25 01:00:00  2017-05-16  2017-05-18
2017-04-10  1503183600  02-Apr-2017  04-April-2017    =>  2017-10-04  2017-08-20 01:00:00  2017-04-02  2017-04-04

As one can see, even though the formats were different and somewhat ugly, they were all handled.
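For reference, the same conversions can be sketched by hand with base R format strings (an illustration, not the package's internal code; month-name parsing assumes an English locale):

```r
# Year-day-month order, as in date1:
as.Date("2017-31-12", format = "%Y-%d-%m")   # "2017-12-31"
# Day-month-name-year, as in date3 (locale-dependent month abbreviations):
as.Date("19-Jan-2017", format = "%d-%b-%Y")  # "2017-01-19" in an English locale
# Unix timestamps, as in date2 (displayed time depends on the chosen time zone):
as.POSIXct(1513465200, origin = "1970-01-01", tz = "UTC")
```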

2.3 Identifying and transforming numeric columns

And now the same thing with numerics:

messy_adult <- findAndTransformNumerics(messy_adult)
# "findAndTransformNumerics: It took me 0.14s to identify 3 numerics column(s), i will set them as numerics"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the columnnum1"
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the columnnum2"
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I am doing the columnnum3"
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "findAndTransformNumerics: It took me 0.06s to transform 3 column(s) to a numeric format."
num1_prev  num2_prev  num3_prev       num1     num2     num3
-3.0953    0,4954     -3,0953    =>  -3.0953   0.4954  -3.0953
 0.2227    -0,8202     0,2227    =>   0.2227  -0.8202   0.2227
-0.2916    -0,713     -0,2916    =>  -0.2916  -0.7130  -0.2916
 2.3236    0,7155      2,3236    =>   2.3236   0.7155   2.3236
-0.9326    -0,3564    -0,9326    =>  -0.9326  -0.3564  -0.9326
 1.2396    NA          1,2396    =>   1.2396   NA       1.2396
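Under the hood, fixing the comma decimal separators amounts to a character substitution before coercion; a minimal base-R sketch (not the package's actual implementation):

```r
raw <- c("0,4954", "-0,8202", "-0,713")
as.numeric(gsub(",", ".", raw, fixed = TRUE))  # 0.4954 -0.8202 -0.713
```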

So now our data set is a bit less ugly.

3 Filter functions

The idea now is to identify useless columns:

3.1 Look for constant variables

constant_cols <- whichAreConstant(messy_adult)
# "whichAreConstant: constant is constant."
# "whichAreConstant: it took me 0.17s to identify 1 constant column(s)"
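A hand-rolled version of this check (an illustration, not whichAreConstant's internal code) simply counts distinct values per column:

```r
library(data.table)
dt <- data.table(constant = rep(1, 4), age = c(39, 50, 38, 53))
# Columns with a single distinct value are constant:
names(dt)[sapply(dt, uniqueN) == 1]  # "constant"
```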

3.2 Look for columns in double

double_cols <- whichAreInDouble(messy_adult)
# "whichAreInDouble: num3 is exactly equal to num1. I put it in drop list."
# "whichAreInDouble: it took me 0.19s to identify 1 column(s) to drop."

3.3 Look for columns that are bijections of one another

bijections_cols <- whichAreBijection(messy_adult)
# "whichAreBijection: date4 is a bijection of date3. I put it in drop list."
# "whichAreBijection: num3 is a bijection of num1. I put it in drop list."
# "whichAreBijection: education_num is a bijection of education. I put it in drop list."
# "whichAreBijection: it took me 0.47s to identify 3 column(s) to drop."

To control this, let’s have a look at the concerned columns:

kable(head(messy_adult[, .(constant, date3, date4, num1, num3, education, education_num)])) %>%
   kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 12)
constant  date3       date4        num1     num3     education  education_num
1         2017-01-19  2017-01-21  -3.0953  -3.0953  Bachelors  13
1         2017-06-06  2017-06-08   0.2227   0.2227  Bachelors  13
1         2017-07-03  2017-07-05  -0.2916  -0.2916  HS-grad     9
1         2017-07-19  2017-07-21   2.3236   2.3236  11th        7
1         2017-05-16  2017-05-18  -0.9326  -0.9326  Bachelors  13
1         2017-04-02  2017-04-04   1.2396   1.2396  Masters    14

Indeed:

- constant contains only one value;
- num3 is equal to num1;
- date4 is always date3 plus two days;
- education_num is an index for education: they are bijections of one another.
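The bijection check can be sketched by hand: two columns are bijections of one another when their pairing has exactly as many distinct values as each column alone (an illustration using data.table's uniqueN, not the package's internal code):

```r
library(data.table)
is_bijection <- function(a, b) {
  # Distinct (a, b) pairs must match distinct values of a and of b alone.
  uniqueN(data.table(a, b)) == uniqueN(a) && uniqueN(a) == uniqueN(b)
}
is_bijection(c("Bachelors", "HS-grad", "Bachelors"), c(13, 9, 13))  # TRUE
is_bijection(c("Bachelors", "HS-grad", "Masters"),   c(13, 9, 13))  # FALSE
```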

3.4 Filter them all

To directly filter all of them:

ncols <- ncol(messy_adult)
messy_adult <- fastFilterVariables(messy_adult)
print(paste0("messy_adult now has ", ncol(messy_adult), " columns; so ", ncols - ncol(messy_adult), " fewer than before."))
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 1 constant column(s) in dataSet."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I delete 1 column(s) that are in double in dataSet."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 2 column(s) that are bijections of another column in dataSet."
# "messy_adult now has 20 columns; so 4 fewer than before."

4 useless columns have been deleted. Without those useless columns, your machine learning algorithm will at least be faster, and maybe give better results.

4 Transform functions

Before sending this to a machine learning algorithm, a few transformations should be performed.

The idea with the functions presented here is to perform those transformations in a RAM efficient way.

4.1 Dates differences

Since no machine learning algorithm handles Dates, one needs to transform them or drop them. One way to transform dates is to compute the differences between every pair of dates.

We can also add an analysis date to compare dates against the date your data is from. For example, if you have a birth-date, you may want to compute an age by performing today - birth-date.
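For instance, computing an age from a hypothetical birth_date column (column name and values invented for illustration) is a plain date difference in base R:

```r
birth_date <- as.Date(c("1978-03-14", "1967-11-02"))  # hypothetical column
age_days   <- as.numeric(Sys.Date() - birth_date)     # today - birth-date, in days
age_years  <- floor(age_days / 365.25)                # rough conversion to years
```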

date_cols <- names(messy_adult)[sapply(messy_adult, is.POSIXct)]  # is.POSIXct from lubridate
messy_adult <- generateDateDiffs(messy_adult, cols = "auto", analysisDate = as.Date("2018-01-01"), units = "days")
# "generateDateDiffs: I will generate difference between dates."
# "generateDateDiffs: It took me 0.15s to create 6 column(s)."
date1.Minus.date3  date1.Minus.analysisDate  date2.Minus.date3  date2.Minus.analysisDate  date3.Minus.analysisDate
 172               -174.9583333                NA                  NA                      -346.9583
 208                 -0.9583333               193.95833           -15                      -208.9583
 101                -80.9583333               141.95833           -40                      -181.9583
  49               -116.9583333              -177.04167          -343                      -165.9583
 -75               -304.9583333                39.95833          -190                      -229.9583
 185                -88.9583333               139.95833          -134                      -273.9583

4.2 Transforming dates into aggregates

Another way to work around dates is to aggregate them at some level. This time drop is set to TRUE in order to drop the date columns once transformed.

messy_adult <- generateFactorFromDate(messy_adult, cols = date_cols, type = "quarter", drop = TRUE)
# "generateFactorFromDate: I will create a factor column from each date column."
# "generateFactorFromDate: It took me 0.19s to transform 3 column(s)."
date1.quarter  date2.quarter  date3.quarter
Q3             QNA            Q1
Q4             Q4             Q2
Q4             Q4             Q3
Q3             Q1             Q3
Q1             Q2             Q2
Q4             Q3             Q2
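The quarter aggregation matches what base R's quarters() computes from a date:

```r
quarters(as.Date(c("2017-01-19", "2017-06-06", "2017-11-22")))  # "Q1" "Q2" "Q4"
```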

4.3 Generate features from character columns

Character columns are not handled by any machine learning algorithm, so one should transform them. The generateFromCharacter function builds new features from them, and then drops them.

messy_adult <- generateFromCharacter(messy_adult, cols = "auto", drop = TRUE)
# "generateFromCharacter: it took me: 0s to transform 1 character columns into, 3 new columns."
mail.notnull  mail.num  mail.order
FALSE         195       1
FALSE         195       1
FALSE         195       1
FALSE         195       1
FALSE         195       1
FALSE         195       1

4.4 Aggregate according to a key

To model something by country, one would want to compute an aggregation of this table in order to have one row per country.

agg_adult <- aggregateByKey(messy_adult, key = "country")
# "aggregateByKey: I start to aggregate"
# "aggregateByKey: 139 columns have been constructed. It took 0.62 seconds. "
country   max.age  type_employer.Without-pay  education.Assoc-acdm  marital.Married-AF-spouse
?         90       0                          12                    0
Cambodia  65       0                           0                    0
Canada    80       0                           1                    0
China     75       0                           0                    0
Columbia  75       0                           3                    0
Cuba      77       0                           3                    0

Whenever you have more than one row per individual, this function comes in handy.
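A couple of those aggregates can be written by hand in data.table, which shows what aggregateByKey automates across every column (a sketch on toy data, not the full set of generated columns):

```r
library(data.table)
dt <- data.table(country = c("Cuba", "Cuba", "Canada"), age = c(45, 50, 42))
# One row per country, with a count, a mean and a max:
dt[, .(nbrLines = .N, mean.age = mean(age), max.age = max(age)), by = country]
```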

4.5 Rounding

One might want to round numeric variables in order to save some RAM, or for algorithmic reasons:

messy_adult <- fastRound(messy_adult, digits = 2)
num1   num2   age  type_employer  fnlwgt  education
  NA    1.12  40   Private        193524  Doctorate
-0.66     NA  53   Private        346253  HS-grad
 0.19  -0.05  38   Federal-gov    125933  Masters
 2.67  -1.58  47   Local-gov      287480  Masters
-0.10   0.26  23   Private        352139  Some-college
 1.51     NA  50   ?               23780  Masters

5 Handling NAs values

Next, let’s handle NAs:

messy_adult <- fastHandleNa(messy_adult)
#     num1  num2 age type_employer   ...       country income
# 1:  0.00  1.12  40       Private   ... United-States   >50K
# 2: -0.66  0.00  53       Private   ... United-States  <=50K
# 3:  0.19 -0.05  38   Federal-gov   ...          Iran   >50K
# 4:  2.67 -1.58  47     Local-gov   ... United-States  <=50K
#    date1.Minus.date2 date1.Minus.date3 date1.Minus.analysisDate
# 1:             56.04               146                   -34.96
# 2:           -297.96              -313                  -324.96
# 3:            112.04               189                  -146.96
# 4:              0.00                 0                     0.00
#    date2.Minus.date3 date2.Minus.analysisDate date3.Minus.analysisDate
# 1:             89.96                      -91                  -180.96
# 2:            -15.04                      -27                   -11.96
# 3:             76.96                     -259                  -335.96
# 4:            221.96                      -12                  -233.96
#    date1.quarter date2.quarter date3.quarter mail.notnull mail.num
# 1:            Q4            Q4            Q3        FALSE      195
# 2:            Q1            Q4            Q4        FALSE      195
# 3:            Q3            Q2            Q1        FALSE      195
# 4:           QNA            Q4            Q2        FALSE      195
#    mail.order
# 1:          1
# 2:          1
# 3:          1
# 4:          1

It sets default values in place of NAs. If you want to put in specific values (constants, or even a function, for example the mean of values), check the fastHandleNa documentation.
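For instance, replacing NAs by a column mean by hand in data.table (a sketch of the kind of custom fill the documentation covers, not fastHandleNa's own code):

```r
library(data.table)
dt <- data.table(num2 = c(1.12, NA, -0.05, NA))
m  <- mean(dt$num2, na.rm = TRUE)  # mean of the non-missing values
dt[is.na(num2), num2 := m]         # by-reference update, no copy
```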

6 Shape functions

There are two types of machine learning algorithms in R: those that accept data.tables and factors, and those that only accept numeric matrices.

Transforming a data set into something acceptable for a machine learning algorithm could be tricky.

The shapeSet function does it for you; you just have to choose whether you want a data.table or a numerical_matrix.

First with data.table:

clean_adult = shapeSet(copy(messy_adult), finalForm = "data.table", verbose = FALSE)
print(table(sapply(clean_adult, class)))
# 
#  factor integer numeric 
#      12       1      15

As one can see, only factors, integers and numerics remain.

Now with numerical_matrix:

clean_adult <- shapeSet(copy(messy_adult), finalForm = "numerical_matrix", verbose = FALSE)
num1   num2   age  type_employer?  type_employerFederal-gov  type_employerLocal-gov
 0.00   1.12  40   0               0                         0
-0.66   0.00  53   0               0                         0
 0.19  -0.05  38   0               1                         0
 2.67  -1.58  47   0               0                         1
-0.10   0.26  23   0               0                         0
 1.51   0.00  50   1               0                         0

As one can see, with finalForm = "numerical_matrix" every character and factor column has been binarized.
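This binarization is essentially what base R's model.matrix does for a single factor (an illustration; shapeSet applies it to the whole set at once):

```r
df <- data.frame(type_employer = factor(c("Private", "Federal-gov", "Local-gov")))
model.matrix(~ type_employer - 1, data = df)  # one 0/1 column per level
```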

7 Full pipeline

Doing it all with one function is possible:

To do that, we reload the ugly data set and perform the aggregation.

data("messy_adult")
agg_adult <- prepareSet(messy_adult, finalForm = "data.table", key = "country", analysisDate = Sys.Date(), digits = 2)
# "prepareSet: step one: correcting mistakes."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 1 constant column(s) in dataSet."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 3 column(s) that are bijections of another column in dataSet."
# "unFactor: I will identify variable that are factor but shouldn't be."
# "unFactor: I unfactor mail."
# "unFactor: It took me 0.14s to unfactor 1 column(s)."
# "findAndTransformNumerics: It took me 0.18s to identify 2 numerics column(s), i will set them as numerics"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the columnnum2"
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I am doing the columnnum3"
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "findAndTransformNumerics: It took me 0.04s to transform 2 column(s) to a numeric format."
# "findAndTransformDates: It took me 0.22s to identify formats"
# "findAndTransformDates: It took me 0.01s to transform 3 columns to a Date format."
# "prepareSet: step two: transforming dataSet."
# "generateDateDiffs: I will generate difference between dates."
# "generateDateDiffs: It took me 0.16s to create 6 column(s)."
# "generateFactorFromDate: I will create a factor column from each date column."
# "generateFactorFromDate: It took me 0.61s to transform 3 column(s)."
# "generateFromCharacter: it took me: 0s to transform 1 character columns into, 3 new columns."
# "aggregateByKey: I start to aggregate"
# "aggregateByKey: 164 columns have been constructed. It took 0.55 seconds. "
# "prepareSet: step three: filtering dataSet."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 11 constant column(s) in result."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 35 column(s) that are bijections of another column in result."
# "prepareSet: step four: handling NA."
# "prepareSet: step five: shaping result."
# "Transforming numerical variables into factors when length(unique(col)) <= 10."
# "Previous distribution of column types:"
# col_class_init
#  factor numeric 
#       1     117 
# "Current distribution of column types:"
# col_class_end
#  factor numeric 
#      28      90

As one can see, every one of the previous steps has been performed.

Let’s have a look at the result:

# "118 columns have been built; for 41 countries."
country   nbrLines  mean.num3  mean.age  min.age  max.age  type_employer.?
?         515       0          39.10     17       90       22
Cambodia   19       0          37.79     18       65        1
Canada    109       0          42.72     17       80       13
China      66       0          41.97     22       75        7
Columbia   53       0          40.32     18       75        3
Cuba       87       0          45.10     21       77        3

8 Description

Finally, to generate a description file from this data set, the description function is available.

It describes the set and its variables. Here we write the result to a txt file:

description(agg_adult, path_to_write = "report.txt")

9 Conclusion

We hope that this package is helpful and that it helped you prepare your data faster.

If you would like to add some features to this package, or if you notice any issues, please tell us on GitHub. Also, if you want to contribute, please don’t hesitate to contact us.