Scatter Plots

David Gerbing

library("lessR")

First read the Employee data included as part of lessR.

d <- Read("Employee")
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------

lessR provides many versions of a scatter plot with its Plot() function.

Two Variables

The default scatterplot of two continuous variables.

Plot(Years, Salary)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
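
The correlation analysis in the output above can be reproduced in outline with base R. The sketch below is not the lessR implementation; it uses a small set of hypothetical values (the names years and salary and their data are assumptions, not the Employee data) to show how the t-value, degrees of freedom, and confidence interval follow from the sample correlation.

```r
# Hypothetical paired data, NOT the Employee values
years  <- c(1, 3, 5, 7, 9, 12, 15, 18)
salary <- c(42, 48, 55, 53, 67, 74, 80, 95)

ct <- cor.test(years, salary)        # Pearson correlation by default

r  <- unname(ct$estimate)            # sample correlation
df <- unname(ct$parameter)           # n - 2

# t-value computed directly from r and df
t_manual <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval from the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(length(years) - 3)
ci_manual <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(c(r = r, t = t_manual, df = df))
print(ci_manual)
```

Applying the same t formula to the rounded values reported above, r = 0.852 and df = 34, gives roughly 9.49, close to the reported 9.501. The small gap is consistent with r being rounded to three decimals.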

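The enhanced scatterplot, obtained by setting the enhance parameter to TRUE, combines several of the suggested options, including the data ellipse and an outlier analysis.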

Plot(Years, Salary, enhance=TRUE)
## [Ellipse with Murdoch and Chow's function ellipse from the ellipse package]

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
## >>> Outlier analysis with Mahalanobis Distance 
##  
##   MD                  ID 
## -----               ----- 
## 8.14     Correll, Trevon 
## 7.84       Capelle, Adam 
##  
## 5.63  Korhalkar, Jessica 
## 5.58       James, Leslie 
## 3.75         Hoang, Binh 
## ...                 ...
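
For readers who want to see how a Mahalanobis distance ranking like the one above arises, here is a minimal base R sketch. It uses hypothetical data (x, y, and their values are assumptions) and the stats function mahalanobis(), which returns squared distances. Whether lessR reports the squared or unsquared distance is not stated in the output, so treat this only as an illustration of the idea.

```r
# Hypothetical data; the last point is deliberately far from the others
x <- c(2, 4, 6, 8, 10, 12, 3)
y <- c(40, 50, 58, 71, 80, 92, 130)
xy <- cbind(x, y)

# Squared Mahalanobis distance of each row from the centroid,
# scaled by the sample covariance matrix
md <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# List the rows from most to least extreme
ord <- order(md, decreasing = TRUE)
print(data.frame(row = ord, MD = round(md[ord], 2)))
```

The point that departs most from the overall pattern of the two variables receives the largest distance, which is why this measure is used to flag potential bivariate outliers.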

Map the variable Pre to the size of the points with the size parameter to obtain a bubble plot.

Plot(Years, Salary, size=Pre)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

Distinguish the points by the levels of the categorical variable Gender with the by parameter.

Plot(Years, Salary, by=Gender)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

Two categorical variables result in a bubble plot of their joint frequencies.

Plot(Dept, Gender)

## >>> Suggestions
## Plot(Dept, Gender, size_cut=FALSE) 
## Plot(Dept, Gender, trans=.8, bg="off", grid="off") 
## SummaryStats(Dept, Gender)  # or ss 
## 
## 
## Joint and Marginal Frequencies 
## ------------------------------ 
##  
##        Dept 
## Gender   ACCT ADMN FINC MKTG SALE Sum 
##   F         3    4    1    5    5  18 
##   M         2    2    3    1   10  18 
##   Sum       5    6    4    6   15  36 
## 
## 
## Cramer's V: 0.415 
##  
## Chi-square Test:  Chisq = 6.200, df = 4, p-value = 0.185 
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate
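
The chi-square statistic and Cramer's V reported above can be checked with base R from the joint frequencies in the output. The following sketch enters that table by hand and applies the usual formula V = sqrt(Chisq / (n * (min(rows, cols) - 1))). It is a verification aid, not the lessR code.

```r
# Joint frequencies copied from the output above
tab <- matrix(c(3, 4, 1, 5, 5,
                2, 2, 3, 1, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("F", "M"),
                              Dept = c("ACCT", "ADMN", "FINC", "MKTG", "SALE")))

# Some expected counts are below 5, so chisq.test() warns here,
# just as the lessR output does; the warning is suppressed
res <- suppressWarnings(chisq.test(tab))

chisq <- unname(res$statistic)
n <- sum(tab)
k <- min(dim(tab)) - 1
cramers_v <- sqrt(chisq / (n * k))

print(round(c(Chisq = chisq, df = unname(res$parameter),
              p = res$p.value, V = cramers_v), 3))
```

The rounded results agree with the reported values: Chisq = 6.200, df = 4, p-value = 0.185, and Cramer's V = 0.415.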

Distribution of a Single Variable

The default plot for a single continuous variable includes not only the scatterplot, but also the violin plot and box plot, with outliers identified. This combined Violin/Box/Scatterplot display is called the VBS plot.

Plot(Salary)
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE)  # Label two outliers ...
## Plot(Salary, box_adj=TRUE)  # Adjust boxplot whiskers for asymmetry 
## 
## --- Salary --- 
## Present: 37 
## Missing: 0 
## Total  : 37 
##  
## Mean         : 73795.557 
## Stnd Dev     : 21799.533 
## IQR          : 31012.560 
## Skew         : 0.190   [medcouple, -1 to 1] 
##  
## Minimum      : 46124.970 
## Lower Whisker: 46124.970 
## 1st Quartile : 56772.950 
## Median       : 69547.600 
## 3rd Quartile : 87785.510 
## Upper Whisker: 122563.380 
## Maximum      : 134419.230 
## 
## 
## (Box plot) Outliers: 1 
##  
## Small      Large           
## -----      -----           
##            Correll, Trevon 134419.23 
## 
## 
## Number of duplicated values: 0 
## 
## 
## Parameter values (can be manually set) 
## ------------------------------------------------------- 
## size: 0.61      size of plotted points 
## jitter_y: 0.45  random vertical movement of points 
## jitter_x: 0.00  random horizontal movement of points 
## bw: 9529.04     set bandwidth higher for smoother edges
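
To see why the largest salary is flagged as a box plot outlier, the summary statistics above can be run through the conventional Tukey rule, which places fences 1.5 IQRs beyond the quartiles. The sketch below assumes that conventional rule. The output does not state the exact rule lessR applies, and the box_adj option mentioned in the suggestions would change the whiskers.

```r
# Values copied from the summary output above
q1         <- 56772.95
q3         <- 87785.51
min_salary <- 46124.97
max_salary <- 134419.23

iqr <- q3 - q1                      # interquartile range

# Conventional Tukey fences (an assumption about the rule used)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(max_salary > upper_fence)     # largest value lies beyond the upper fence
print(min_salary < lower_fence)     # smallest value does not lie below the lower fence
```

Under this rule the maximum lies just beyond the upper fence and the minimum lies well inside the lower fence, consistent with one large outlier and no small outliers in the output.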

For a single categorical variable, get the corresponding bubble plot of frequencies.

Plot(Dept)

## >>> Suggestions
## Plot(Dept, color_low="lemonchiffon2", color_hi="maroon3") 
## Plot(Dept, values="count")  # scatter plot of counts 
## 
## 
## --- Dept ---
## 
## 
##                 ACCT   ADMN   FINC   MKTG   SALE    Total 
## Frequencies:       5      6      4      6     15       36 
## Proportions:   0.139  0.167  0.111  0.167  0.417    1.000 
## 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 10.944, df = 4, p-value = 0.027
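
The proportions and the test of equal probabilities above can be reproduced in base R from the reported frequencies. This is a minimal check rather than the lessR code. The name freq is chosen here for illustration.

```r
# Frequencies copied from the output above
freq <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

print(round(freq / sum(freq), 3))     # proportions

# Goodness-of-fit test; chisq.test() assumes equal probabilities by default
gof <- chisq.test(freq)

# The same statistic computed by hand
expected <- sum(freq) / length(freq)  # 36 / 5 = 7.2 per department
chisq_manual <- sum((freq - expected)^2 / expected)

print(round(c(Chisq = chisq_manual, df = unname(gof$parameter),
              p = gof$p.value), 3))
```

The rounded results match the reported Chisq = 10.944, df = 4, and p-value = 0.027.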

Cleveland Dot Plot

The Cleveland dot plot, here for a single variable, has row names on the y-axis. The default plot sorts the rows by the value plotted.

Plot(Salary, row_names)

## >>> Suggestions
## Plot(Salary, y=row_names, sort_yx=FALSE, segments_y=FALSE)  
## 
## 
##  
## --- Salary --- 
##  
##      n   miss      mean        sd       min       mdn       max 
##      37      0   73795.6   21799.5   46125.0   69547.6  134419.2 
## 
## 
## (Box plot) Outliers: 1 
##  
## Small      Large 
## -----      ----- 
##             134419.2

The standard scatterplot version of a Cleveland dot plot, with the rows left unsorted and the line segments turned off.

Plot(Salary, row_names, sort_yx="0", segments_y=FALSE)

## >>> Suggestions 
## 
## 
##  
## --- Salary --- 
##  
##      n   miss      mean        sd       min       mdn       max 
##      37      0   73795.6   21799.5   46125.0   69547.6  134419.2 
## 
## 
## (Box plot) Outliers: 1 
##  
## Small      Large 
## -----      ----- 
##             134419.2

This Cleveland dot plot has two x-variables, specified as a standard R vector with the c() function. In this situation, the two points on each row are connected with a line segment. By default, the rows are sorted by the difference between the two values.

Plot(c(Pre, Post), row_names)

## >>> Suggestions
## Plot(c(Pre, Post), y=row_names, sort_yx=FALSE, segments_y=FALSE)  
## 
## 
##  
## --- Pre --- 
##  
##      n   miss    mean      sd     min     mdn     max 
##      37      0    78.8    12.0    59.0    80.0   100.0 
##  
##  
## --- Post --- 
##  
##      n   miss    mean      sd     min     mdn     max 
##      37      0    81.0    11.6    59.0    84.0   100.0 
## 
## 
## No (Box plot) outliers 
## 
## 
##  n  diff  Row 
## --------------------------- 
##  1 13.0 Korhalkar, Jessica 
##  2 13.0 Cooper, Lindsay 
##  3 12.0 Anastasiou, Crystal 
##  4 12.0 Wu, James 
##  5 10.0 Ritchie, Darnell 
##  6  8.0 Campagna, Justin 
##  7  7.0 Cassinelli, Anastis 
##  8  7.0 Hamide, Bita 
##  9  7.0 Sheppard, Cory 
## 10  6.0 LaRoe, Maria 
## 27 -1.0 Kimball, Claire 
## 28 -2.0 Capelle, Adam 
## 29 -2.0 Stanley, Emma 
## 30 -2.0 Adib, Hassan 
## 31 -2.0 Skrotzki, Sara 
## 32 -3.0 Anderson, David 
## 33 -3.0 Correll, Trevon 
## 34 -3.0 Kralik, Laura 
## 35 -3.0 Jones, Alissa 
## 36 -4.0 Gvakharia, Kimberly 
## 37 -4.0 Downs, Deborah
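
The sorted listing of differences above can be mimicked in base R. The sketch below uses a hypothetical data frame (scores, with row names A through D and made-up values) and assumes the difference is computed as Post minus Pre. The output does not state the direction explicitly, though the higher mean of Post and the mostly positive differences are consistent with that assumption.

```r
# Hypothetical scores, NOT the Employee data
scores <- data.frame(Pre  = c(70, 85, 90, 60),
                     Post = c(83, 82, 90, 72),
                     row.names = c("A", "B", "C", "D"))

# Assumed direction of the difference: Post minus Pre
scores$diff <- scores$Post - scores$Pre

# Sort rows from largest gain to largest loss
sorted <- scores[order(scores$diff, decreasing = TRUE), ]
print(sorted)
```

Sorting this way puts the rows with the largest gains at the top and those with the largest losses at the bottom, which makes the pattern of change easier to read than an arbitrary row order.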