Tutorial 6: Introduction to Machine Learning.
Table of Contents:
Exercise 1: Machine Learning Model
Exercise 2: The dplyr Package
a 1.1 Perform the exercise done during the tutorial on the dataset LifeCycleSavings (you can load it in base R by running LifeCycleSavings, https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/LifeCycleSavings). Use the linear model as an engine to check how well the variables pop15, pop75 and dpi predict sr (aggregate personal savings). Use set.seed(70). Why did the model remove one variable? Interpret whether these variables are good at predicting the value of sr.
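A minimal sketch of one way to produce output like the one below with tidymodels. The object names, the exact recipe steps (center, scale, correlation filter), and prop = 0.6 (which matches the 30/20/50 split shown) are assumptions, not the tutorial's original code:

library(tidymodels)

set.seed(70)
savings_split <- initial_split(LifeCycleSavings, prop = 0.6)  # assumed; yields a 30/20/50 split
savings_train <- training(savings_split)
savings_test  <- testing(savings_split)

savings_rec <- recipe(sr ~ pop15 + pop75 + dpi, data = savings_train) %>%
  step_center(all_predictors()) %>%   # center predictors
  step_scale(all_predictors()) %>%    # scale predictors
  step_corr(all_predictors())         # drops one of any highly correlated pair (pop15/pop75 here)

savings_fit <- workflow() %>%
  add_recipe(savings_rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = savings_train)

savings_pred <- predict(savings_fit, new_data = savings_test)  # predictions on the test set
bind_cols(savings_test, savings_pred) %>%
  metrics(truth = sr, estimate = .pred)                        # rmse, rsq, mae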
<Analysis/Assess/Total>
<30/20/50>
Recipe
Inputs:
role #variables
outcome 1
predictor 3
Training data contained 30 data points and no missing data.
Operations:
Centering for pop15, pop75, dpi [trained]
Scaling for pop15, pop75, dpi [trained]
Correlation filter on pop75 [trained]
# A tibble: 20 x 1
.pred
<dbl>
1 10.7
2 7.53
3 7.14
4 11.8
5 7.61
6 11.5
7 12.6
8 9.72
9 11.0
10 12.5
11 12.0
12 12.1
13 12.1
14 8.29
15 7.68
16 12.6
17 9.41
18 7.23
19 8.90
20 8.40
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.98
2 rsq standard 0.315
3 mae standard 3.09
b 1.2 Assess the performance of the previous model using a scatter plot. Put the actual sr (aggregate personal savings) on the horizontal axis (x) and the predicted sr (aggregate personal savings) on the vertical axis (y). What is your conclusion about the quality of the prediction?
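A possible ggplot2 sketch, reusing the (assumed) objects from the 1.1 sketch above; points close to the dashed 45-degree line would indicate good predictions:

bind_cols(savings_test, savings_pred) %>%
  ggplot(aes(x = sr, y = .pred)) +                             # actual sr on x, predicted sr on y
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") + # perfect-prediction reference line
  labs(x = "Actual sr", y = "Predicted sr")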
c 1.3 Perform the same exercise using the london_house_price dataset. Check how well house type, area (in sq ft), number of bedrooms and number of bathrooms predict the house price (using a linear model). Use set.seed(71).
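The same pattern as the 1.1 sketch, adapted to the London house prices. The column names follow the output below; prop = 0.6 (matching the 2088/1392/3480 split) and the object names are again assumptions:

set.seed(71)
london_split <- initial_split(london_house_price, prop = 0.6)
london_train <- training(london_split)
london_test  <- testing(london_split)

london_rec <- recipe(Price ~ House.Type + Area.in.sq.ft + No..of.Bedrooms + No..of.Bathrooms,
                     data = london_train) %>%
  step_center(all_numeric_predictors()) %>%   # House.Type is left untransformed
  step_scale(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors())

london_fit <- workflow() %>%
  add_recipe(london_rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = london_train)

bind_cols(london_test, predict(london_fit, new_data = london_test)) %>%
  metrics(truth = Price, estimate = .pred)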
<Analysis/Assess/Total>
<2088/1392/3480>
Recipe
Inputs:
role #variables
outcome 1
predictor 4
Training data contained 2088 data points and no missing data.
Operations:
Centering for Area.in.sq.ft, No..of.Bedrooms, No..of.Bathrooms [trained]
Scaling for Area.in.sq.ft, No..of.Bedrooms, No..of.Bathrooms [trained]
Correlation filter on No..of.Bedrooms [trained]
# A tibble: 1,392 x 1
.pred
<dbl>
1 2569291.
2 1707867.
3 842547.
4 559158.
5 1123157.
6 915861.
7 770459.
8 3358627.
9 6010800.
10 771794.
# ... with 1,382 more rows
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1501431.
2 rsq standard 0.461
3 mae standard 757525.
d 1.4 Use the heart.csv dataset (read more about this dataset here) to train a classification model predicting the probability of getting a heart attack (see the sketch after this list).
- Fit the model using all variables predicting the dependent variable target.
- Use a prop = .75 in initial_split.
- Calculate the confusion matrix.
- Calculate the accuracy, sensitivity, and specificity.
- Plot the ROC curve.
- Calculate the ROC-AUC.
- What is your conclusion about the model?
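A minimal classification sketch with tidymodels (assumed loaded as in the 1.1 sketch). The file path, object names, and the seed are assumptions, since the tutorial does not state them:

heart <- read.csv("heart.csv") %>%
  mutate(target = factor(target))              # outcome must be a factor for classification

set.seed(73)                                   # arbitrary; the tutorial does not give the seed here
heart_split <- initial_split(heart, prop = .75)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)

heart_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(target ~ ., data = heart_train)

heart_res <- bind_cols(
  heart_test,
  predict(heart_fit, heart_test, type = "prob"),  # class probabilities: .pred_0 / .pred_1
  predict(heart_fit, heart_test)                  # class predictions: .pred_class
)

conf_mat(heart_res, truth = target, estimate = .pred_class)
cls_metrics <- metric_set(accuracy, sens, spec)
cls_metrics(heart_res, truth = target, estimate = .pred_class)
# event_level = "second" treats class 1 (heart attack) as the event of interest
roc_curve(heart_res, truth = target, .pred_1, event_level = "second") %>% autoplot()
roc_auc(heart_res, truth = target, .pred_1, event_level = "second")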
parsnip model object
Call: stats::glm(formula = target ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) age sex cp trestbps chol
4.946390 -0.001948 -1.868410 0.867557 -0.026999 -0.007845
fbs restecg thalach exang oldpeak slope
-0.065102 0.217100 0.025711 -0.776306 -0.442561 0.806031
ca thal
-0.656924 -1.172524
Degrees of Freedom: 225 Total (i.e. Null); 212 Residual
Null Deviance: 311.5
Residual Deviance: 160.6 AIC: 188.6
# A tibble: 77 x 2
.pred_0 .pred_1
<dbl> <dbl>
1 0.332 0.668
2 0.0200 0.980
3 0.0165 0.984
4 0.0103 0.990
5 0.0467 0.953
6 0.749 0.251
7 0.0600 0.940
8 0.150 0.850
9 0.0611 0.939
10 0.0992 0.901
# ... with 67 more rows
Truth
Prediction 0 1
0 28 5
1 7 37
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.844
2 sens binary 0.8
3 spec binary 0.881
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.0776
Exercise 2: The dplyr Package
- We will use the same dataset as in 1.3 (London house prices) in this exercise.
- Answers must use functions from the dplyr package.
a 2.1 Create a new subset with houses where the area is equal to or greater than 1000 sq. feet and smaller than 2000 sq. feet. Use head() to print the first 6 observations of this subset.
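A possible dplyr sketch; the name of the subset is an assumption:

library(dplyr)

mid_size <- london_house_price %>%
  filter(Area.in.sq.ft >= 1000, Area.in.sq.ft < 2000)  # at least 1000 and below 2000 sq ft

head(mid_size)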
X Property.Name Price House.Type Area.in.sq.ft No..of.Bedrooms
1 3 Festing Road 1765000 House 1986 4
2 6 Alfriston Road 1475000 House 1548 4
3 8 Adam & Eve Mews 2500000 House 1308 3
4 16 Cambridge Park 1450000 Flat / Apartment 1702 3
5 18 Elsworthy Rise 2275000 New development 1173 3
6 25 St Mary's Grove 1300000 House 1101 3
No..of.Bathrooms No..of.Receptions Location City.County Postal.Code
1 4 4 Putney London SW15 1LP
2 4 4 London SW11 6NW
3 3 3 London W8 6UG
4 3 3 Twickenham TW1 2PF
5 3 3 Primrose Hill London NW3 3DS
6 3 3 Islington London N1 2NT
b 2.2 Arrange the dataset (PS: not the subset) in decreasing order of number of bedrooms. Use head() to print the first 6 observations.
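A possible sketch using arrange() with desc():

london_house_price %>%
  arrange(desc(No..of.Bedrooms)) %>%   # decreasing number of bedrooms
  head()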
X Property.Name Price House.Type Area.in.sq.ft No..of.Bedrooms
1 43 Old Battersea House 9975000 House 10100 10
2 1422 St. Petersburgh Place 5500000 House 4227 9
3 2619 Courtenay Avenue 16999999 House 11733 9
4 3394 Upper Wimpole Street 14750000 House 9053 9
5 224 Harper Lane 1650000 House 4016 8
6 286 Fentiman Road 3100000 House 3800 8
No..of.Bathrooms No..of.Receptions Location City.County Postal.Code
1 10 10 Battersea London SW11 3LD
2 9 9 London W2 4LA
3 9 9 Highgate London N6 4LR
4 9 9 Marylebone London W1G 6LG
5 8 8 Radlett Hertfordshire WD7 9HJ
6 8 8 London SW8 1QA
c 2.3 Create a new variable (i.e., save it in the dataframe) that gives the price per square foot of each house. Use mean() to check the mean of this new variable.
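One way to do this with mutate(); the new column name is an assumption:

london_house_price <- london_house_price %>%
  mutate(price_per_sqft = Price / Area.in.sq.ft)   # price per square foot

mean(london_house_price$price_per_sqft)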
[1] 1066.25
d 2.4 Create a new variable that takes the value 1 when the house type is House, 2 if the type is Penthouse, 3 if the type is Flat / Apartment or Studio, and 0 otherwise. Use the table() function on the new variable.
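A possible case_when() sketch; the new column name is an assumption, and the strings must match the House.Type values exactly:

london_house_price <- london_house_price %>%
  mutate(type_code = case_when(
    House.Type == "House"                           ~ 1,
    House.Type == "Penthouse"                       ~ 2,
    House.Type %in% c("Flat / Apartment", "Studio") ~ 3,
    TRUE                                            ~ 0   # everything else
  ))

table(london_house_price$type_code)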
0 1 2 3
375 1430 100 1575
e 2.5 Get the interquartile range of the Price per House Type.
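A possible group_by()/summarise() sketch using base R's IQR():

london_house_price %>%
  group_by(House.Type) %>%
  summarise(iqr_price = IQR(Price))   # interquartile range of Price per house type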
# A tibble: 8 x 2
House.Type iqr_price
<chr> <dbl>
1 Bungalow 275000
2 Duplex 337500
3 Flat / Apartment 700050
4 House 1512500
5 Mews 50000
6 New development 1520000
7 Penthouse 2963788.
8 Studio 150000