Exercise 1: Machine Learning Model

a 1.1 Perform the exercise done during the tutorial on the database LifeCycleSavings (you can load it in base R by running LifeCycleSavings, https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/LifeCycleSavings). Use the linear model as an engine to check how well the variables pop15, pop75 and dpi predict sr (aggregate personal savings). Use set.seed(70). Why did the model removed one variable? Interpret if these variables are good in predicting the value of sr.

<Analysis/Assess/Total>
<30/20/50>

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          3

Training data contained 30 data points and no missing data.

Operations:

Centering for pop15, pop75, dpi [trained]
Scaling for pop15, pop75, dpi [trained]
Correlation filter on pop75 [trained]

# A tibble: 20 x 1
   .pred
   <dbl>
 1 10.7 
 2  7.53
 3  7.14
 4 11.8 
 5  7.61
 6 11.5 
 7 12.6 
 8  9.72
 9 11.0 
10 12.5 
11 12.0 
12 12.1 
13 12.1 
14  8.29
15  7.68
16 12.6 
17  9.41
18  7.23
19  8.90
20  8.40

# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.98 
2 rsq     standard       0.315
3 mae     standard       3.09 

b 1.2 Assess the performance of the previous model using a scatter plot. Put the actual sr (aggregate personal savings) in the horizontal axis (x) and the predicted sr (aggregate personal savings) in the vertical axis. What is your conclusion about the quality of the prediction?

c 1.3 Perform the same exercise using the london_house_price dataset. Check how well house type, area(in sq ft), number of bedrooms and number of bathrooms predict the house price (using a linear model). Use set.seed(71).

<Analysis/Assess/Total>
<2088/1392/3480>

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          4

Training data contained 2088 data points and no missing data.

Operations:

Centering for Area.in.sq.ft, No..of.Bedrooms, No..of.Bathrooms [trained]
Scaling for Area.in.sq.ft, No..of.Bedrooms, No..of.Bathrooms [trained]
Correlation filter on No..of.Bedrooms [trained]

# A tibble: 1,392 x 1
      .pred
      <dbl>
 1 2569291.
 2 1707867.
 3  842547.
 4  559158.
 5 1123157.
 6  915861.
 7  770459.
 8 3358627.
 9 6010800.
10  771794.
#... with 1,382 more rows

# A tibble: 3 x 3
  .metric .estimator   .estimate
  <chr>   <chr>            <dbl>
1 rmse    standard   1501431.   
2 rsq     standard         0.461
3 mae     standard    757525.   

d 1.4 Use the heart.csv dataset, (read more about this dataset here), to train a classification model predicting the probability of getting a heart attack.

  • Fit the model using all variables predicting the dependent variable target.
  • Use a prop = .75 in initial_split.
  • Calculate the Confusion Matrix.
  • Calculate the accuracy, sensitivity, specificity.
  • Plot the ROC.
  • Calculate the ROC-AUC.
  • What is your conclusion about the model?
parsnip model object


Call:  stats::glm(formula = target ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)          age          sex           cp     trestbps         chol  
   4.946390    -0.001948    -1.868410     0.867557    -0.026999    -0.007845  
        fbs      restecg      thalach        exang      oldpeak        slope  
  -0.065102     0.217100     0.025711    -0.776306    -0.442561     0.806031  
         ca         thal  
  -0.656924    -1.172524  

Degrees of Freedom: 225 Total (i.e. Null);  212 Residual
Null Deviance:      311.5 
Residual Deviance: 160.6    AIC: 188.6

# A tibble: 77 x 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1  0.332    0.668
 2  0.0200   0.980
 3  0.0165   0.984
 4  0.0103   0.990
 5  0.0467   0.953
 6  0.749    0.251
 7  0.0600   0.940
 8  0.150    0.850
 9  0.0611   0.939
10  0.0992   0.901
# ... with 67 more rows

          Truth
Prediction  0  1
         0 28  5
         1  7 37

# A tibble: 3 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.844
2 sens     binary         0.8  
3 spec     binary         0.881

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary        0.0776

Exercise 2: Dplyr Package

  • We will use the same database in 1.2 (prices on London houses) in this exercise
  • Answers must be done using a function of the dplyr package

a 2.1 Create a new subset with houses where the area is equal to or greater than 1000 sq. feet and smaller than 2000 sq. feet. Use head() to print the firs 6 observation of this subset

   X   Property.Name   Price       House.Type Area.in.sq.ft No..of.Bedrooms
1  3    Festing Road 1765000            House          1986               4
2  6  Alfriston Road 1475000            House          1548               4
3  8 Adam & Eve Mews 2500000            House          1308               3
4 16  Cambridge Park 1450000 Flat / Apartment          1702               3
5 18  Elsworthy Rise 2275000  New development          1173               3
6 25 St Mary's Grove 1300000            House          1101               3
  No..of.Bathrooms No..of.Receptions      Location City.County Postal.Code
1                4                 4        Putney      London    SW15 1LP
2                4                 4                    London    SW11 6NW
3                3                 3                    London      W8 6UG
4                3                 3                Twickenham     TW1 2PF
5                3                 3 Primrose Hill      London     NW3 3DS
6                3                 3     Islington      London      N1 2NT

b 2.2 Arrange the dataset (PS: not the subset) in decreasing order of number of bedrooms. Use head() to print the first 6 observations.

     X         Property.Name    Price House.Type Area.in.sq.ft No..of.Bedrooms
1   43   Old Battersea House  9975000      House         10100              10
2 1422 St. Petersburgh Place  5500000      House          4227               9
3 2619      Courtenay Avenue 16999999      House         11733               9
4 3394  Upper Wimpole Street 14750000      House          9053               9
5  224           Harper Lane  1650000      House          4016               8
6  286         Fentiman Road  3100000      House          3800               8
  No..of.Bathrooms No..of.Receptions   Location   City.County Postal.Code
1               10                10  Battersea        London    SW11 3LD
2                9                 9                   London      W2 4LA
3                9                 9   Highgate        London      N6 4LR
4                9                 9 Marylebone        London     W1G 6LG
5                8                 8    Radlett Hertfordshire     WD7 9HJ
6                8                 8                   London     SW8 1QA

c 2.3 Create a new variable (i.e., save it in the dataframe) that gives you the price of the square foot in each house. Use mean() to check the mean of this new variable

[1] 1066.25

d 2.4 Create a new variable that assumes the value 1 when the house type is House, 2 if the type is Penthouse, 3 if the type is Flat / Apartment or Studio and 0 otherwise. Use the table() function on the new variable

   0    1    2    3 
 375 1430  100 1575 

e 2.5 Get the interquartile range of the Price per House Type.

# A tibble: 8 x 2
  House.Type       iqr_price
  <chr>                <dbl>
1 Bungalow           275000 
2 Duplex             337500 
3 Flat / Apartment   700050 
4 House             1512500 
5 Mews                50000 
6 New development   1520000 
7 Penthouse         2963788.
8 Studio             150000