Tutorial 4: Linear Regression.
About the Data
-
Panel Data from South Korea
-
Variables included:
- id
- year
- wave : from wave 1st in 2005 to wave 14th in 2018
- region: 1) Seoul 2) Kyeong-gi 3) Kyoung-nam 4) Kyoung-buk 5) Chung-nam 6) Gang-won &. Chung-buk 7) Jeolla & Jeju
- income: yearly income in 10,000 KRW(ten thousands Korean Won. 1100 KRW = 1 USD)
- family_member: no of family members
- gender: 1) male 2) female
- year_born
- education_level: 1) no education(under 7 yrs-old) 2) no education(7 & over 7 yrs-old) 3) elementary 4) middle school 5) high school 6) college 7) university degree 8) MA 9) doctoral degree
- marriage: marital status. 1) not applicable (under 18) 2) married 3) separated by death 4) separated 5) not married yet 6) others
- religion: 1) have religion 2) do not have
- occupation
- company_size
- reason_none_worker: 1) no capable 2) in military service 3) studying in school 4) prepare for school 5) prepare to apply job 6) house worker 7) caring kids at home 8) nursing 9) giving-up economic activities 10) no intention to work 11) others
Exercise 1: Linear Regression
a 1.1 Regress education level on income.
Call:
lm(formula = income ~ education_level, data = korea)
Residuals:
Min 1Q Median 3Q Max
-237119 -1461 -439 790 462253
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1118.833 36.118 -30.98 <2e-16 ***
education_level 1010.652 7.507 134.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3820 on 92855 degrees of freedom
Multiple R-squared: 0.1633, Adjusted R-squared: 0.1633
F-statistic: 1.812e+04 on 1 and 92855 DF, p-value: < 2.2e-16
b 1.2 Create an age variable and regress it on income
Call:
lm(formula = income ~ age, data = korea)
Residuals:
Min 1Q Median 3Q Max
-236901 -1633 -697 800 465548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8278.5654 48.6856 170.0 <2e-16 ***
age -82.6049 0.8013 -103.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3956 on 92855 degrees of freedom
Multiple R-squared: 0.1027, Adjusted R-squared: 0.1027
F-statistic: 1.063e+04 on 1 and 92855 DF, p-value: < 2.2e-16
c 1.3 Regress both age and education level on income
Call:
lm(formula = income ~ age + education_level, data = korea)
Residuals:
Min 1Q Median 3Q Max
-237303 -1463 -431 731 462951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1321.5122 92.4989 14.29 <2e-16 ***
age -28.3540 0.9902 -28.64 <2e-16 ***
education_level 837.7978 9.6077 87.20 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3803 on 92854 degrees of freedom
Multiple R-squared: 0.1706, Adjusted R-squared: 0.1706
F-statistic: 9551 on 2 and 92854 DF, p-value: < 2.2e-16
d 1.4 Regress log of age and education level on income
Call:
lm(formula = income ~ log_age + education_level, data = korea)
Residuals:
Min 1Q Median 3Q Max
-237212 -1459 -440 752 462701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3180.67 240.01 13.25 <2e-16 ***
log_age -948.16 52.33 -18.12 <2e-16 ***
education_level 903.97 9.53 94.85 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3813 on 92854 degrees of freedom
Multiple R-squared: 0.1662, Adjusted R-squared: 0.1662
F-statistic: 9257 on 2 and 92854 DF, p-value: < 2.2e-16
- 1.5 Regress gender, log of age and education level on income. Present only the coefficients
tip: explore the lm function on the help tab
(Intercept) log_age education_level gender
5307.3048 -934.4571 766.9034 -1206.0092
Exercise 2: T-test
- 2.1 Get the average number of family members for each one of groups: 1) have religion, 2) do not have a religion, 9) unknown.
1 2 9
2.409554 2.559758 3.129630
- 2.2 perform a t-test to know if the means are statistically significant different from each other
tip: you will have to remove the observations where religion = 9 before
Welch Two Sample t-test
data: family_member by religion
t = -17.737, df = 92793, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-0.1668026 -0.1336061
sample estimates:
mean in group 1 mean in group 2
2.409554 2.559758
Exercise 3: Visualization a Regression
- 3.1 Using ggplot, visualize the regression done in 1.1
`geom_smooth()` using formula 'y ~ x'
- 3.2 Visualize the regression done in 1.2.
`geom_smooth()` using formula 'y ~ x'
PS: Even though the correlation is significant it is not that clear when looking at the graphs (Too many outliers)
- 3.3 repeat 3.2, but only show observations with an income up to 20,000
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 505 rows containing non-finite values (stat_smooth).
Warning: Removed 505 rows containing missing values (geom_point).
Exercise 4
- We will use another dataset in this exercise: Data on house prices in london
- 4.1 Regress area (in sq feet) on the price. Interpret the coefficients (i.e., the value and their statistical significance)
Call:
lm(formula = Price ~ Area.in.sq.ft, data = london_house)
Residuals:
Min 1Q Median 3Q Max
-8755213 -503561 -167061 129088 33546963
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36674.05 45936.19 -0.798 0.425
Area.in.sq.ft 1109.68 20.98 52.897 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1688000 on 3478 degrees of freedom
Multiple R-squared: 0.4458, Adjusted R-squared: 0.4457
F-statistic: 2798 on 1 and 3478 DF, p-value: < 2.2e-16
- 4.2 Regress the log of area (in sq feet) on the price. Interpret the coefficients.
Call:
lm(formula = Price ~ log_area, data = london_house)
Residuals:
Min 1Q Median 3Q Max
-3447406 -834395 -205682 438856 34869633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13584801 344947 -39.38 <2e-16 ***
log_area 2138504 47561 44.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1803000 on 3478 degrees of freedom
Multiple R-squared: 0.3676, Adjusted R-squared: 0.3674
F-statistic: 2022 on 1 and 3478 DF, p-value: < 2.2e-16
- 4.3 Display the regression fit done in 4.1 graphically. Looking at the plot, do you think the area is enough to predict the value of a house
`geom_smooth()` using formula 'y ~ x'
!
- 4.4 Regress the log of area on the price again, but now control for the city county and number of bedrooms. Why would we want to control for different counties? Does the coefficient of N of Bedrooms has the expected sign?
Call:
lm(formula = Price ~ log_area + City.County + No..of.Bedrooms,
data = london_house)
Residuals:
Min 1Q Median 3Q Max
-3539955 -784231 -191495 446677 33697553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -23734985 1789948 -13.260 <2e-16 ***
log_area 3673635 95113 38.624 <2e-16 ***
City.County27 Carlton Drive 1838298 2384893 0.771 0.441
City.County311 Goldhawk Road 251035 2384362 0.105 0.916
City.County4 Circus Road West 637541 2384422 0.267 0.789
City.County52 Holloway Road 1594405 2384641 0.669 0.504
City.County6 Deal Street 600116 2384423 0.252 0.801
City.County82-88 Fulham High Street 1353683 2384563 0.568 0.570
City.CountyBattersea 833206 2065216 0.403 0.687
City.CountyBlackheath 672067 2384912 0.282 0.778
City.CountyBushey -1137737 2384441 -0.477 0.633
City.CountyChelsea 1106813 1946854 0.569 0.570
City.CountyChessington 130712 2384396 0.055 0.956
City.CountyCity Of London 1492139 2064974 0.723 0.470
City.CountyClapton 1539011 2384625 0.645 0.519
City.CountyClerkenwell 1616168 2384598 0.678 0.498
City.CountyDe Beauvoir 209057 2384847 0.088 0.930
City.CountyDeptford 1293718 2384780 0.542 0.588
City.CountyDowns Road -191611 2384341 -0.080 0.936
City.CountyE5 8DE -204646 2064925 -0.099 0.921
City.CountyEaling -211849 2384840 -0.089 0.929
City.CountyEssex -250797 1700106 -0.148 0.883
City.CountyFitzrovia 1869415 2384676 0.784 0.433
City.CountyFulham 592614 1847062 0.321 0.748
City.CountyFulham High Street 1732175 2384854 0.726 0.468
City.CountyGreenford 1032605 2385692 0.433 0.665
City.CountyHertfordshire -478510 1777923 -0.269 0.788
City.CountyHolland Park 1217343 2384485 0.511 0.610
City.CountyHornchurch 454640 2386091 0.191 0.849
City.CountyKensington 2114679 2384628 0.887 0.375
City.CountyKent -2186168 2385234 -0.917 0.359
City.CountyLambourne End -992796 2385091 -0.416 0.677
City.CountyLillie Square 2322928 2384467 0.974 0.330
City.CountyLittle Venice 1233507 2384533 0.517 0.605
City.CountyLondon 1194884 1686516 0.708 0.479
City.CountyLondon1500 46965 2384463 0.020 0.984
City.CountyMarylebone 2366145 1885108 1.255 0.210
City.CountyMiddlesex -215738 1697289 -0.127 0.899
City.CountyMiddx -1404958 2385323 -0.589 0.556
City.CountyN1 6FU 1722555 2385430 0.722 0.470
City.CountyN7 6QX 998057 1802625 0.554 0.580
City.CountyNorthwood 231127 2065368 0.112 0.911
City.CountyOxshott -1871493 2386503 -0.784 0.433
City.CountyQueens Park 648622 2384413 0.272 0.786
City.CountyRichmond -201387 2065328 -0.098 0.922
City.CountyRichmond Hill 397874 2384575 0.167 0.867
City.CountyRomford -1079758 2385756 -0.453 0.651
City.CountySpitalfields -982532 2384692 -0.412 0.680
City.CountySurrey -307927 1689877 -0.182 0.855
City.CountySurrey Quays 1698776 2384685 0.712 0.476
City.CountyThames Ditton -353279 2384799 -0.148 0.882
City.CountyThe Metal Works 748814 2384386 0.314 0.754
City.CountyThurleigh Road 208120 1803097 0.115 0.908
City.CountyTwickenham 191162 1755395 0.109 0.913
City.CountyWandsworth 34922 2065234 0.017 0.987
City.CountyWatford -759393 1886148 -0.403 0.687
City.CountyWimbledon -419106 2384650 -0.176 0.860
City.CountyWornington Road 1240295 1847103 0.671 0.502
No..of.Bedrooms -625580 40088 -15.605 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1686000 on 3421 degrees of freedom
Multiple R-squared: 0.4563, Adjusted R-squared: 0.447
F-statistic: 49.49 on 58 and 3421 DF, p-value: < 2.2e-16
- 4.5 Regress the square of area and area on the price. Do you think the relationship between the area and the price is linear or non-linear?
Call:
lm(formula = Price ~ area2 + Area.in.sq.ft, data = london_house)
Residuals:
Min 1Q Median 3Q Max
-8162658 -516343 -143281 160531 33506243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.440e+05 6.306e+04 -2.283 0.0225 *
area2 -1.286e-02 5.182e-03 -2.481 0.0131 *
Area.in.sq.ft 1.208e+03 4.494e+01 26.888 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1687000 on 3477 degrees of freedom
Multiple R-squared: 0.4468, Adjusted R-squared: 0.4465
F-statistic: 1404 on 2 and 3477 DF, p-value: < 2.2e-16
Exercise 5
- 5.1 Create a dummy variable that is equal to 1 if the number of bathrooms is greater than 2 and 0 otherwise. Regress this dummy on the price. Interpret the results
Call:
lm(formula = Price ~ d_bathroom, data = london_house)
Residuals:
Min 1Q Median 3Q Max
-2168411 -943411 -344676 130324 37206589
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 969676 54943 17.65 <2e-16 ***
d_bathroom 1573735 72877 21.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2129000 on 3478 degrees of freedom
Multiple R-squared: 0.1182, Adjusted R-squared: 0.118
F-statistic: 466.3 on 1 and 3478 DF, p-value: < 2.2e-16
- 5.2 Perform the same regression done in 5.1, but now include the area as an additional independent variable. What happens with the coefficient of the dummy? Interpret it. Why do you think this happens?
Call:
lm(formula = Price ~ d_bathroom + Area.in.sq.ft, data = london_house)
Residuals:
Min 1Q Median 3Q Max
-9001985 -477709 -189080 107772 33486687
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1445.92 48460.12 0.030 0.9762
d_bathroom -170118.65 69321.99 -2.454 0.0142 *
Area.in.sq.ft 1143.87 25.17 45.443 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1687000 on 3477 degrees of freedom
Multiple R-squared: 0.4468, Adjusted R-squared: 0.4465
F-statistic: 1404 on 2 and 3477 DF, p-value: < 2.2e-16
- 5.3 What is Omitted Variable Bias? Can you give examples of two additional variables that are not in the data that could influence the price of a house?