dataPreparation
2017-08-17
This vignette introduces the dataPreparation package: what it offers and how simple it is to use.
Introduction
Package presentation
Built on top of the data.table package, dataPreparation lets you perform most of the painful data preparation for a data science project with a minimal amount of code.
This package is:
- fast (it uses data.table and exponential search)
- RAM efficient (it performs operations by reference and column-wise to avoid copying data)
- stable (most exceptions are handled)
- verbose (it logs a lot)

data.table and the other dependencies are handled at installation.
Main preparation steps
Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are the following:
- Read: load the data set (this package doesn't cover this point; for csv files we recommend data.table::fread)
- Correct: most of the time there are some mistakes after reading (wrong formats, ...) that have to be corrected
- Transform: create new features from date, categorical, character... columns, in order to have information usable by a ML algorithm (i.e. numeric or categorical)
- Filter: get rid of useless information in order to speed up computation
- Handle NA: replace missing values
- Shape: put the data set in a nice shape usable by a ML algorithm
Here are the functions available in this package to tackle those issues:
| Correct | Transform | Filter | Handle NA | Shape |
|---|---|---|---|---|
| unFactor | generateDateDiffs | fastFilterVariables | fastHandleNa | shapeSet |
| findAndTransformDates | generateFactorFromDate | whichAreConstant | | sameShape |
| findAndTransformNumerics | aggregateByKey | whichAreInDouble | | setAsNumericMatrix |
| setColAsCharacter | generateFromFactor | whichAreBijection | | |
| setColAsNumeric | generateFromCharacter | | | |
| setColAsDate | fastRound | | | |
| setColAsFactor | | | | |
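Chained together, those individual functions cover the whole preparation. A minimal sketch of such a pipeline, using only functions and arguments shown in this vignette (default arguments are assumed elsewhere):

```r
library(dataPreparation)

data(messy_adult)
# Correct: un-factor wrongly typed columns, then parse dates and numerics
messy_adult <- unFactor(messy_adult)
messy_adult <- findAndTransformDates(messy_adult)
messy_adult <- findAndTransformNumerics(messy_adult)
# Filter: drop constant, duplicated and bijective columns
messy_adult <- fastFilterVariables(messy_adult)
# Handle NA, then shape the result for a ML algorithm
messy_adult <- fastHandleNa(messy_adult)
clean_adult <- shapeSet(messy_adult, finalForm = "data.table")
```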
All of those functions are integrated in the full pipeline function prepareSet.
In this tutorial, we will detail all those steps and how to handle them with this package, using an example data set.
Tutorial data
For this tutorial, we are going to use a messy version of the adult data set.
library(dataPreparation)
data(messy_adult)
print(head(messy_adult, n = 4))
# date1 date2 date3 date4 num1 num2
# 1: 2017-10-07 NA 19-Jan-2017 21-January-2017 -3.0953 0,4954
# 2: 2017-31-12 1513465200 06-Jun-2017 08-June-2017 0.2227 -0,8202
# 3: 2017-12-10 1511305200 03-Jul-2017 05-July-2017 -0.2916 -0,713
# 4: 2017-06-09 1485126000 19-Jul-2017 21-July-2017 2.3236 0,7155
# constant mail num3 age type_employer
# 1: 1 marie.cynthia@aol.com -3,0953 39 State-gov
# 2: 1 jake.caroline@protonmail.com 0,2227 50 Self-emp-not-inc
# 3: 1 caroline.caroline@protonmail.com -0,2916 38 Private
# 4: 1 caroline.marie@yahoo.com 2,3236 53 Private
# fnlwgt education education_num marital occupation
# 1: 77516 Bachelors 13 Never-married Adm-clerical
# 2: 83311 Bachelors 13 Married-civ-spouse Exec-managerial
# 3: 215646 HS-grad 9 Divorced Handlers-cleaners
# 4: 234721 11th 7 Married-civ-spouse Handlers-cleaners
# relationship race sex capital_gain capital_loss hr_per_week
# 1: Not-in-family White Male 2174 0 40
# 2: Husband White Male 0 0 13
# 3: Not-in-family White Male 0 0 40
# 4: Husband Black Male 0 0 40
# country income
# 1: United-States <=50K
# 2: United-States <=50K
# 3: United-States <=50K
# 4: United-States <=50K
We added 9 really ugly columns to the data set:
- 4 date columns with various formats, time stamps, and NAs
- 1 constant column
- 3 numeric columns with different decimal separators
- 1 email address column

Moreover, the same information may be contained in two different columns.
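A quick way to spot those problems by hand is to inspect the column classes with base R (no package function needed):

```r
# Inspect the class of every column: dates stored as character or factor,
# numerics read as factors because of "," decimal separators, etc.
print(sapply(messy_adult, class))
# A factor with almost as many levels as rows (like mail) is probably
# not a real categorical variable:
print(nlevels(messy_adult$mail))
```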
Correct functions
Identifying factors that shouldn't be
It often happens, when reading a data set, that R puts strings into factors even when it shouldn't. In this tutorial data set, mail is a factor but shouldn't be. It will automatically be detected using the unFactor function:
print(class(messy_adult$mail))
# "factor"
messy_adult <- unFactor(messy_adult)
# "unFactor: I will identify variable that are factor but shouldn't be."
# "unFactor: I unfactor mail."
# "unFactor: It took me 0.14s to unfactor 1 column(s)."
print(class(messy_adult$mail))
# "character"
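The idea behind this detection can be approximated in base R: a factor with too many distinct levels is probably a character column in disguise. This is only an illustrative sketch; the threshold below is an assumption, not the package's actual default:

```r
# Hypothetical re-implementation of the idea behind unFactor.
n_unfactor <- 53  # assumed threshold, for illustration only
is_fake_factor <- sapply(messy_adult, function(col) {
  is.factor(col) && nlevels(col) > n_unfactor
})
print(names(messy_adult)[is_fake_factor])
```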
Filter functions
The idea now is to identify useless columns:
- constant columns: they take the same value on every row,
- columns in double: they have an exact copy in the data set,
- bijection columns: another column contains the exact same information, but maybe coded differently (for example col1: Men/Women, col2: M/W).
Look for constant variables
constant_cols <- whichAreConstant(messy_adult)
# "whichAreConstant: constant is constant."
# "whichAreConstant: it took me 0.17s to identify 1 constant column(s)"
Look for columns in double
double_cols <- whichAreInDouble(messy_adult)
# "whichAreInDouble: num3 is exactly equal to num1. I put it in drop list."
# "whichAreInDouble: it took me 0.19s to identify 1 column(s) to drop."
Look for columns that are bijections of one another
bijections_cols <- whichAreBijection(messy_adult)
# "whichAreBijection: date4 is a bijection of date3. I put it in drop list."
# "whichAreBijection: num3 is a bijection of num1. I put it in drop list."
# "whichAreBijection: education_num is a bijection of education. I put it in drop list."
# "whichAreBijection: it took me 0.47s to identify 3 column(s) to drop."
To verify this, let's have a look at the columns concerned:
kable(head(messy_adult[, .(constant, date3, date4, num1, num3, education, education_num)])) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 12)
| constant | date3 | date4 | num1 | num3 | education | education_num |
|---|---|---|---|---|---|---|
| 1 | 2017-01-19 | 2017-01-21 | -3.0953 | -3.0953 | Bachelors | 13 |
| 1 | 2017-06-06 | 2017-06-08 | 0.2227 | 0.2227 | Bachelors | 13 |
| 1 | 2017-07-03 | 2017-07-05 | -0.2916 | -0.2916 | HS-grad | 9 |
| 1 | 2017-07-19 | 2017-07-21 | 2.3236 | 2.3236 | 11th | 7 |
| 1 | 2017-05-16 | 2017-05-18 | -0.9326 | -0.9326 | Bachelors | 13 |
| 1 | 2017-04-02 | 2017-04-04 | 1.2396 | 1.2396 | Masters | 14 |
Indeed:
- constant was built to be constant: it contains only 1s,
- num1 and num3 are equal,
- date3 and date4 are always separated by 2 days: date4 doesn't contain any new information for a ML algorithm,
- education and education_num contain the same information, one as a key index, the other as the corresponding character string. whichAreBijection keeps the character column.
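If you prefer to drop the identified columns yourself rather than call fastFilterVariables, the results of the whichAre* functions can be fed to data.table's delete-by-reference syntax. A sketch, assuming (as the logs above suggest) that those functions return column indexes:

```r
# Combine the indexes found by whichAreConstant, whichAreInDouble
# and whichAreBijection, then delete those columns by reference
# (no copy of the data set is made).
cols_to_drop <- unique(c(constant_cols, double_cols, bijections_cols))
drop_names <- names(messy_adult)[cols_to_drop]
messy_adult[, (drop_names) := NULL]
```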
Filter them all
To directly filter all of them:
ncols <- ncol(messy_adult)
messy_adult <- fastFilterVariables(messy_adult)
print(paste0("messy_adult now has ", ncol(messy_adult), " columns; that is ", ncols - ncol(messy_adult), " fewer than before."))
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 1 constant column(s) in dataSet."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I delete 1 column(s) that are in double in dataSet."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 2 column(s) that are bijections of another column in dataSet."
# "messy_adult now has 20 columns; that is 4 fewer than before."
4 useless columns have been deleted. Without them, your machine learning algorithm will at least run faster and may even give better results.
Handling NA values
Then, let's handle the NAs:
messy_adult <- fastHandleNa(messy_adult)
# num1 num2 age type_employer ... country income
# 1: 0.00 1.12 40 Private ... United-States >50K
# 2: -0.66 0.00 53 Private ... United-States <=50K
# 3: 0.19 -0.05 38 Federal-gov ... Iran >50K
# 4: 2.67 -1.58 47 Local-gov ... United-States <=50K
# date1.Minus.date2 date1.Minus.date3 date1.Minus.analysisDate
# 1: 56.04 146 -34.96
# 2: -297.96 -313 -324.96
# 3: 112.04 189 -146.96
# 4: 0.00 0 0.00
# date2.Minus.date3 date2.Minus.analysisDate date3.Minus.analysisDate
# 1: 89.96 -91 -180.96
# 2: -15.04 -27 -11.96
# 3: 76.96 -259 -335.96
# 4: 221.96 -12 -233.96
# date1.quarter date2.quarter date3.quarter mail.notnull mail.num
# 1: Q4 Q4 Q3 FALSE 195
# 2: Q1 Q4 Q4 FALSE 195
# 3: Q3 Q2 Q1 FALSE 195
# 4: QNA Q4 Q2 FALSE 195
# mail.order
# 1: 1
# 2: 1
# 3: 1
# 4: 1
It sets default values in place of the NAs. If you want to use specific values (constants, or even a function, for example the mean of the values), check the fastHandleNa documentation.
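For instance, filling numeric NAs with the column mean can be done directly with data.table's set(), staying in line with the package's by-reference philosophy. This is a plain-R sketch; see the fastHandleNa documentation for the package's own way of passing such a function:

```r
# Replace NAs in every numeric column by that column's mean, by reference.
num_cols <- names(messy_adult)[sapply(messy_adult, is.numeric)]
for (col in num_cols) {
  col_mean <- mean(messy_adult[[col]], na.rm = TRUE)
  set(messy_adult,
      i = which(is.na(messy_adult[[col]])),  # rows with NA in this column
      j = col,
      value = col_mean)
}
```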
Shape functions
There are two types of machine learning algorithms in R: those which accept data.tables and factors, and those which only accept numeric matrices.
Transforming a data set into something acceptable for a machine learning algorithm can be tricky.
The shapeSet function does it for you; you just have to choose whether you want a data.table or a numerical_matrix.
First with data.table:
clean_adult = shapeSet(copy(messy_adult), finalForm = "data.table", verbose = FALSE)
print(table(sapply(clean_adult, class)))
#
# factor integer numeric
# 12 1 15
As one can see, only factor, integer and numeric columns are left.
Now with numerical_matrix:
clean_adult <- shapeSet(copy(messy_adult), finalForm = "numerical_matrix", verbose = FALSE)
| num1 | num2 | age | type_employer? | type_employerFederal-gov | type_employerLocal-gov | … |
|---|---|---|---|---|---|---|
| 0.00 | 1.12 | 40 | 0 | 0 | 0 | … |
| -0.66 | 0.00 | 53 | 0 | 0 | 0 | … |
| 0.19 | -0.05 | 38 | 0 | 1 | 0 | … |
| 2.67 | -1.58 | 47 | 0 | 0 | 1 | … |
| -0.10 | 0.26 | 23 | 0 | 0 | 0 | … |
| 1.51 | 0.00 | 50 | 1 | 0 | 0 | … |
As one can see, with finalForm = "numerical_matrix" every character and factor column has been binarized.
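This binarization is one-hot encoding: one 0/1 column per level of the factor. Base R's model.matrix performs the same transformation, which makes it easy to see what the generated columns contain:

```r
# Toy example of one-hot encoding a factor, as done when shaping
# to a numerical_matrix. The "- 1" removes the intercept so that
# every level gets its own 0/1 column.
sex <- factor(c("Male", "Female", "Male"))
print(model.matrix(~ sex - 1))
```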
Full pipeline
It is possible to do all of this with a single function.
To illustrate it, we will reload the ugly data set and also perform an aggregation by country:
data("messy_adult")
agg_adult <- prepareSet(messy_adult, finalForm = "data.table", key = "country", analysisDate = Sys.Date(), digits = 2)
# "prepareSet: step one: correcting mistakes."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 1 constant column(s) in dataSet."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 3 column(s) that are bijections of another column in dataSet."
# "unFactor: I will identify variable that are factor but shouldn't be."
# "unFactor: I unfactor mail."
# "unFactor: It took me 0.14s to unfactor 1 column(s)."
# "findAndTransformNumerics: It took me 0.18s to identify 2 numerics column(s), i will set them as numerics"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the columnnum2"
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I am doing the columnnum3"
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "findAndTransformNumerics: It took me 0.04s to transform 2 column(s) to a numeric format."
# "findAndTransformDates: It took me 0.22s to identify formats"
# "findAndTransformDates: It took me 0.01s to transform 3 columns to a Date format."
# "prepareSet: step two: transforming dataSet."
# "generateDateDiffs: I will generate difference between dates."
# "generateDateDiffs: It took me 0.16s to create 6 column(s)."
# "generateFactorFromDate: I will create a factor column from each date column."
# "generateFactorFromDate: It took me 0.61s to transform 3 column(s)."
# "generateFromCharacter: it took me: 0s to transform 1 character columns into, 3 new columns."
# "aggregateByKey: I start to aggregate"
# "aggregateByKey: 164 columns have been constructed. It took 0.55 seconds. "
# "prepareSet: step three: filtering dataSet."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 11 constant column(s) in result."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 35 column(s) that are bijections of another column in result."
# "prepareSet: step four: handling NA."
# "prepareSet: step five: shaping result."
# "Transforming numerical variables into factors when length(unique(col)) <= 10."
# "Previous distribution of column types:"
# col_class_init
# factor numeric
# 1 117
# "Current distribution of column types:"
# col_class_end
# factor numeric
# 28 90
As one can see, all the previous steps have been performed.
Let's have a look at the result:
# "118 columns have been built; for 41 countries."
| country | nbrLines | mean.num3 | mean.age | min.age | max.age | type_employer.? | … |
|---|---|---|---|---|---|---|---|
| ? | 515 | 0 | 39.10 | 17 | 90 | 22 | … |
| Cambodia | 19 | 0 | 37.79 | 18 | 65 | 1 | … |
| Canada | 109 | 0 | 42.72 | 17 | 80 | 13 | … |
| China | 66 | 0 | 41.97 | 22 | 75 | 7 | … |
| Columbia | 53 | 0 | 40.32 | 18 | 75 | 3 | … |
| Cuba | 87 | 0 | 45.10 | 21 | 77 | 3 | … |
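Some of the aggregated columns can be recomputed by hand with plain data.table to check the result, since messy_adult has age and country columns:

```r
# Recompute a few of the aggregated statistics with data.table grouping.
data(messy_adult)
check <- messy_adult[, .(nbrLines = .N,
                         mean.age = round(mean(age), 2),
                         min.age  = min(age),
                         max.age  = max(age)),
                     by = country]
print(head(check[order(country)]))
```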
Description
Finally, to generate a description file from this data set, the description function is available.
It describes the set and its variables. Here we write the result to a txt file:
description(agg_adult, path_to_write = "report.txt")
Conclusion
We hope that this package is helpful and that it helps you prepare your data faster.
If you would like to add some features to this package, or if you notice some issues, please tell us on GitHub. Also, if you want to contribute, please don't hesitate to contact us.