About the Data

  • Panel Data from South Korea

  • Variables included:

    • id
    • year
    • wave: from wave 1 (2005) to wave 14 (2018)
    • region: 1) Seoul 2) Kyeong-gi 3) Kyoung-nam 4) Kyoung-buk 5) Chung-nam 6) Gang-won & Chung-buk 7) Jeolla & Jeju
    • income: yearly income in units of 10,000 KRW (ten thousand Korean won; 1,100 KRW = 1 USD)
    • family_member: number of family members
    • gender: 1) male 2) female
    • year_born
    • education_level: 1) no education (under 7 years old) 2) no education (7 years old or over) 3) elementary 4) middle school 5) high school 6) college 7) university degree 8) MA 9) doctoral degree
    • marriage: marital status. 1) not applicable (under 18) 2) married 3) widowed 4) separated 5) never married 6) other
    • religion: 1) have religion 2) do not have
    • occupation
    • company_size
    • reason_none_worker: reason for not working. 1) not capable of working 2) in military service 3) studying in school 4) preparing for school 5) preparing to apply for a job 6) homemaker 7) caring for children at home 8) nursing 9) gave up economic activity 10) no intention to work 11) other

Exercise 1: Linear Regression

1.1 Regress income on education level.

Call:
lm(formula = income ~ education_level, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-237119   -1461    -439     790  462253 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1118.833     36.118  -30.98   <2e-16 ***
education_level  1010.652      7.507  134.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3820 on 92855 degrees of freedom
Multiple R-squared:  0.1633,    Adjusted R-squared:  0.1633 
F-statistic: 1.812e+04 on 1 and 92855 DF,  p-value: < 2.2e-16
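
To read the slope in more familiar units, a minimal sketch follows. It only reuses the estimate printed above and the exchange rate stated in the variable list; the variable names are illustrative, and treating education_level (an ordinal 1 to 9 code) as a number is the modelling choice already made in the regression.

```r
# Slope from the 1.1 output: change in income per one-step increase in education_level.
slope_units <- 1010.652           # income is measured in units of 10,000 KRW
slope_krw   <- slope_units * 10000
slope_usd   <- slope_krw / 1100   # exchange rate given in the variable list

slope_krw  # about 10.1 million KRW per education step
slope_usd  # roughly 9,200 USD per education step
```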

1.2 Create an age variable and regress income on it.

Call:
lm(formula = income ~ age, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-236901   -1633    -697     800  465548 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8278.5654    48.6856   170.0   <2e-16 ***
age          -82.6049     0.8013  -103.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3956 on 92855 degrees of freedom
Multiple R-squared:  0.1027,    Adjusted R-squared:  0.1027 
F-statistic: 1.063e+04 on 1 and 92855 DF,  p-value: < 2.2e-16
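
One way the age variable might be built is sketched below. The data frame korea_toy and its values are made up for illustration, and defining age as survey year minus year_born is an assumption; the handout does not show how age was computed.

```r
# Toy stand-in for the korea panel (values are invented for illustration).
korea_toy <- data.frame(
  year      = c(2005, 2005, 2010, 2018, 2018, 2012),
  year_born = c(1960, 1985, 1972, 1990, 1948, 1979),
  income    = c(3200, 2100, 4100, 2600,  900, 3800)
)

# Assumed definition: age in the survey year.
korea_toy$age <- korea_toy$year - korea_toy$year_born

# Same model form as in the output above: income explained by age.
fit_age <- lm(income ~ age, data = korea_toy)
coef(fit_age)
```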

1.3 Regress income on both age and education level.

Call:
lm(formula = income ~ age + education_level, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-237303   -1463    -431     731  462951 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1321.5122    92.4989   14.29   <2e-16 ***
age              -28.3540     0.9902  -28.64   <2e-16 ***
education_level  837.7978     9.6077   87.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3803 on 92854 degrees of freedom
Multiple R-squared:  0.1706,    Adjusted R-squared:  0.1706 
F-statistic:  9551 on 2 and 92854 DF,  p-value: < 2.2e-16

1.4 Regress income on the log of age and education level.

Call:
lm(formula = income ~ log_age + education_level, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-237212   -1459    -440     752  462701 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3180.67     240.01   13.25   <2e-16 ***
log_age          -948.16      52.33  -18.12   <2e-16 ***
education_level   903.97       9.53   94.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3813 on 92854 degrees of freedom
Multiple R-squared:  0.1662,    Adjusted R-squared:  0.1662 
F-statistic:  9257 on 2 and 92854 DF,  p-value: < 2.2e-16
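
A coefficient on a logged regressor is easier to read as the effect of a proportional change. The sketch below reuses the log_age estimate from the output above and assumes log_age is the natural log of age (R's default log()); the variable names are illustrative.

```r
# Coefficient on log_age from the 1.4 output.
b_log_age <- -948.16

# In a level-log model, multiplying age by a factor k changes predicted
# income by b * log(k), holding education_level fixed.
delta_1pct   <- b_log_age * log(1.01)  # age 1% higher
delta_double <- b_log_age * log(2)     # age twice as high

delta_1pct    # about -9.4 income units (10,000 KRW each)
delta_double  # about -657 income units
```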

1.5 Regress income on gender, the log of age, and education level. Present only the coefficients.

Tip: explore the lm function in the Help tab.

    (Intercept)         log_age education_level          gender 
      5307.3048       -934.4571        766.9034      -1206.0092 
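
A hedged sketch of how only the coefficients can be printed: coef() returns the named estimate vector of a fitted lm object. The data below are simulated stand-ins, not the Korean panel, so the numbers will not match the output above.

```r
set.seed(1)
n <- 200
# Simulated stand-in data; the coding of gender (1 = male, 2 = female) follows the variable list.
toy <- data.frame(
  age             = sample(20:70, n, replace = TRUE),
  education_level = sample(1:9, n, replace = TRUE),
  gender          = sample(1:2, n, replace = TRUE)
)
toy$log_age <- log(toy$age)
toy$income  <- 1000 + 800 * toy$education_level - 900 * toy$log_age -
  1200 * toy$gender + rnorm(n, sd = 500)

fit <- lm(income ~ log_age + education_level + gender, data = toy)
coef(fit)  # named numeric vector, no standard errors or tests
```

Because gender enters as a number coded 1 and 2, its coefficient is the estimated difference between group 2 (female) and group 1 (male), holding the other regressors fixed.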

Exercise 2: T-test

2.1 Get the average number of family members for each group: 1) has a religion, 2) does not have a religion, 9) unknown.
       1        2        9 
2.409554 2.559758 3.129630 
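
Group means of this shape can be obtained with tapply(), among other options. This is a minimal sketch on invented values; the handout does not say which function produced the table above.

```r
# Toy data: religion codes as in the exercise (1, 2, 9), made-up family sizes.
toy <- data.frame(
  religion      = c(1, 1, 2, 2, 2, 9),
  family_member = c(2, 3, 1, 4, 4, 3)
)

# Mean of family_member within each religion code.
group_means <- tapply(toy$family_member, toy$religion, mean)
group_means
```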

2.2 Perform a t-test to check whether the two group means are statistically significantly different from each other.

Tip: first remove the observations where religion = 9.

    Welch Two Sample t-test

data:  family_member by religion
t = -17.737, df = 92793, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -0.1668026 -0.1336061
sample estimates:
mean in group 1 mean in group 2 
       2.409554        2.559758 
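
The two steps from the tip (drop religion = 9, then test) might look like the sketch below. The data are simulated, so the statistics differ from the output above. t.test() with a formula runs the Welch version by default, which matches the title of the printed test.

```r
set.seed(42)
# Simulated stand-in data with the three religion codes.
toy <- data.frame(
  religion      = sample(c(1, 2, 9), 300, replace = TRUE, prob = c(0.45, 0.45, 0.10)),
  family_member = sample(1:6, 300, replace = TRUE)
)

# Keep only the two groups being compared.
two_groups <- subset(toy, religion != 9)

# Welch two-sample t-test (var.equal = FALSE is the default).
tt <- t.test(family_member ~ religion, data = two_groups)
tt$estimate  # the two group means
tt$p.value
```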

Exercise 3: Visualizing a Regression

3.1 Using ggplot, visualize the regression done in 1.1.
`geom_smooth()` using formula 'y ~ x'
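
A plot of this kind can be sketched with geom_point() plus geom_smooth(method = "lm"), which also prints the formula message shown above. The data are simulated and the styling choices (such as alpha) are assumptions, not the handout's actual code.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the education/income relationship.
toy <- data.frame(education_level = sample(1:9, 300, replace = TRUE))
toy$income <- 1000 * toy$education_level + rnorm(300, sd = 2000)

# Scatter plot with the fitted least-squares line and its confidence band.
p <- ggplot(toy, aes(x = education_level, y = income)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")
p
```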

3.2 Visualize the regression done in 1.2.
`geom_smooth()` using formula 'y ~ x'

PS: Although the relationship is statistically significant, it is hard to see in the graphs because of the many outliers.

3.3 Repeat 3.2, but only show observations with an income up to 20,000.
`geom_smooth()` using formula 'y ~ x'

Warning: Removed 505 rows containing non-finite values (stat_smooth).

Warning: Removed 505 rows containing missing values (geom_point).
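
Warnings like the two above are what ggplot2 typically emits when axis limits (for example ylim()) discard rows before plotting and fitting. The sketch below, on simulated data, contrasts two explicit alternatives: filtering the data first, or zooming with coord_cartesian() so the line is still fitted to all rows. Which approach the handout used is not shown, so treat both as illustrations.

```r
library(ggplot2)

set.seed(2)
# Simulated stand-in data with a few extreme incomes.
toy <- data.frame(age = sample(20:80, 500, replace = TRUE))
toy$income <- pmax(0, 8000 - 80 * toy$age + rnorm(500, sd = 3000))
toy$income[1:5] <- 100000

# Option A: drop high incomes first; the line is fitted to the remaining rows only.
toy_capped <- subset(toy, income <= 20000)
p_filter <- ggplot(toy_capped, aes(age, income)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")

# Option B: fit on all rows and only zoom the visible window.
p_zoom <- ggplot(toy, aes(age, income)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(0, 20000))
```

With Option A the fitted line no longer corresponds to the regression in 1.2, because the excluded observations do not enter the fit.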

Exercise 4

  • We will use another dataset in this exercise: data on house prices in London.

4.1 Regress the price on the area (in sq. feet). Interpret the coefficients (i.e., their values and their statistical significance).
Call:
lm(formula = Price ~ Area.in.sq.ft, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-8755213  -503561  -167061   129088 33546963 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -36674.05   45936.19  -0.798    0.425    
Area.in.sq.ft   1109.68      20.98  52.897   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1688000 on 3478 degrees of freedom
Multiple R-squared:  0.4458,    Adjusted R-squared:  0.4457 
F-statistic:  2798 on 1 and 3478 DF,  p-value: < 2.2e-16
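
For the interpretation step, the estimate and p-value of a single coefficient can be pulled out of the summary table, as in this sketch on simulated data (the column names mirror the output above; the values are invented).

```r
set.seed(7)
# Simulated stand-in for the London data.
toy <- data.frame(Area.in.sq.ft = runif(100, 300, 5000))
toy$Price <- 1100 * toy$Area.in.sq.ft + rnorm(100, sd = 5e5)

fit  <- lm(Price ~ Area.in.sq.ft, data = toy)
ctab <- coef(summary(fit))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# Slope: expected change in Price per additional square foot, with its p-value.
ctab["Area.in.sq.ft", c("Estimate", "Pr(>|t|)")]
```

Read this way, the output above says that each additional square foot is associated with a price about 1,110 higher, and that the intercept (p = 0.425) is not distinguishable from zero.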

4.2 Regress the price on the log of the area (in sq. feet). Interpret the coefficients.
Call:
lm(formula = Price ~ log_area, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-3447406  -834395  -205682   438856 34869633 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -13584801     344947  -39.38   <2e-16 ***
log_area      2138504      47561   44.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1803000 on 3478 degrees of freedom
Multiple R-squared:  0.3676,    Adjusted R-squared:  0.3674 
F-statistic:  2022 on 1 and 3478 DF,  p-value: < 2.2e-16

4.3 Display the regression fit from 4.1 graphically. Looking at the plot, do you think the area is enough to predict the value of a house?
`geom_smooth()` using formula 'y ~ x'

4.4 Regress the price on the log of the area again, but now control for the city/county and the number of bedrooms. Why would we want to control for different counties? Does the coefficient on the number of bedrooms have the expected sign?
Call:
lm(formula = Price ~ log_area + City.County + No..of.Bedrooms, 
    data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-3539955  -784231  -191495   446677 33697553 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         -23734985    1789948 -13.260   <2e-16 ***
log_area                              3673635      95113  38.624   <2e-16 ***
City.County27 Carlton Drive           1838298    2384893   0.771    0.441    
City.County311 Goldhawk Road           251035    2384362   0.105    0.916    
City.County4 Circus Road West          637541    2384422   0.267    0.789    
City.County52 Holloway Road           1594405    2384641   0.669    0.504    
City.County6 Deal Street               600116    2384423   0.252    0.801    
City.County82-88 Fulham High Street   1353683    2384563   0.568    0.570    
City.CountyBattersea                   833206    2065216   0.403    0.687    
City.CountyBlackheath                  672067    2384912   0.282    0.778    
City.CountyBushey                    -1137737    2384441  -0.477    0.633    
City.CountyChelsea                    1106813    1946854   0.569    0.570    
City.CountyChessington                 130712    2384396   0.055    0.956    
City.CountyCity Of London             1492139    2064974   0.723    0.470    
City.CountyClapton                    1539011    2384625   0.645    0.519    
City.CountyClerkenwell                1616168    2384598   0.678    0.498    
City.CountyDe Beauvoir                 209057    2384847   0.088    0.930    
City.CountyDeptford                   1293718    2384780   0.542    0.588    
City.CountyDowns Road                 -191611    2384341  -0.080    0.936    
City.CountyE5 8DE                     -204646    2064925  -0.099    0.921    
City.CountyEaling                     -211849    2384840  -0.089    0.929    
City.CountyEssex                      -250797    1700106  -0.148    0.883    
City.CountyFitzrovia                  1869415    2384676   0.784    0.433    
City.CountyFulham                      592614    1847062   0.321    0.748    
City.CountyFulham High Street         1732175    2384854   0.726    0.468    
City.CountyGreenford                  1032605    2385692   0.433    0.665    
City.CountyHertfordshire              -478510    1777923  -0.269    0.788    
City.CountyHolland Park               1217343    2384485   0.511    0.610    
City.CountyHornchurch                  454640    2386091   0.191    0.849    
City.CountyKensington                 2114679    2384628   0.887    0.375    
City.CountyKent                      -2186168    2385234  -0.917    0.359    
City.CountyLambourne End              -992796    2385091  -0.416    0.677    
City.CountyLillie Square              2322928    2384467   0.974    0.330    
City.CountyLittle Venice              1233507    2384533   0.517    0.605    
City.CountyLondon                     1194884    1686516   0.708    0.479    
City.CountyLondon1500                   46965    2384463   0.020    0.984    
City.CountyMarylebone                 2366145    1885108   1.255    0.210    
City.CountyMiddlesex                  -215738    1697289  -0.127    0.899    
City.CountyMiddx                     -1404958    2385323  -0.589    0.556    
City.CountyN1 6FU                     1722555    2385430   0.722    0.470    
City.CountyN7 6QX                      998057    1802625   0.554    0.580    
City.CountyNorthwood                   231127    2065368   0.112    0.911    
City.CountyOxshott                   -1871493    2386503  -0.784    0.433    
City.CountyQueens Park                 648622    2384413   0.272    0.786    
City.CountyRichmond                   -201387    2065328  -0.098    0.922    
City.CountyRichmond Hill               397874    2384575   0.167    0.867    
City.CountyRomford                   -1079758    2385756  -0.453    0.651    
City.CountySpitalfields               -982532    2384692  -0.412    0.680    
City.CountySurrey                     -307927    1689877  -0.182    0.855    
City.CountySurrey Quays               1698776    2384685   0.712    0.476    
City.CountyThames Ditton              -353279    2384799  -0.148    0.882    
City.CountyThe Metal Works             748814    2384386   0.314    0.754    
City.CountyThurleigh Road              208120    1803097   0.115    0.908    
City.CountyTwickenham                  191162    1755395   0.109    0.913    
City.CountyWandsworth                   34922    2065234   0.017    0.987    
City.CountyWatford                    -759393    1886148  -0.403    0.687    
City.CountyWimbledon                  -419106    2384650  -0.176    0.860    
City.CountyWornington Road            1240295    1847103   0.671    0.502    
No..of.Bedrooms                       -625580      40088 -15.605   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1686000 on 3421 degrees of freedom
Multiple R-squared:  0.4563,    Adjusted R-squared:  0.447 
F-statistic: 49.49 on 58 and 3421 DF,  p-value: < 2.2e-16
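
The long list of City.County rows appears because lm() expands a categorical regressor into one dummy per level, leaving out a reference level. A small sketch with three invented levels shows the mechanism; the data are simulated.

```r
set.seed(3)
# Simulated stand-in with a three-level location factor.
toy <- data.frame(
  log_area        = log(runif(90, 300, 5000)),
  City.County     = factor(rep(c("London", "Surrey", "Essex"), each = 30)),
  No..of.Bedrooms = sample(1:6, 90, replace = TRUE)
)
toy$Price <- 2e6 * toy$log_area + 3e5 * (toy$City.County == "London") +
  rnorm(90, sd = 5e5)

fit <- lm(Price ~ log_area + City.County + No..of.Bedrooms, data = toy)

levels(toy$City.County)[1]  # reference level (first alphabetically), absorbed in the intercept
names(coef(fit))            # one dummy per remaining level
```

Each City.County coefficient is a price shift relative to the reference level. In the output above, several levels look like street addresses or postcodes (for example "27 Carlton Drive" or "E5 8DE"), which suggests the column would benefit from cleaning before it is used as a control.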

4.5 Regress the price on the area and the square of the area. Do you think the relationship between the area and the price is linear or non-linear?
Call:
lm(formula = Price ~ area2 + Area.in.sq.ft, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-8162658  -516343  -143281   160531 33506243 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.440e+05  6.306e+04  -2.283   0.0225 *  
area2         -1.286e-02  5.182e-03  -2.481   0.0131 *  
Area.in.sq.ft  1.208e+03  4.494e+01  26.888   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1687000 on 3477 degrees of freedom
Multiple R-squared:  0.4468,    Adjusted R-squared:  0.4465 
F-statistic:  1404 on 2 and 3477 DF,  p-value: < 2.2e-16
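
With a squared term, the effect of area on price is no longer a single number. The sketch below reuses the two estimates from the output above and assumes area2 is Area.in.sq.ft squared; the helper name marginal_effect is illustrative.

```r
# Estimates from the 4.5 output.
b_area  <- 1.208e+03   # coefficient on Area.in.sq.ft
b_area2 <- -1.286e-02  # coefficient on area2 (assumed to be Area.in.sq.ft^2)

# Derivative of the fitted curve: price change per extra square foot at a given area.
marginal_effect <- function(area) b_area + 2 * b_area2 * area

marginal_effect(1000)  # about 1,182 per sq. ft
marginal_effect(5000)  # about 1,079 per sq. ft

# Area at which the fitted curve would stop rising.
turning_point <- -b_area / (2 * b_area2)
turning_point  # about 47,000 sq. ft
```

The negative squared term implies a slightly diminishing price per extra square foot.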

Exercise 5

5.1 Create a dummy variable that is equal to 1 if the number of bathrooms is greater than 2 and 0 otherwise. Regress the price on this dummy. Interpret the results.
Call:
lm(formula = Price ~ d_bathroom, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-2168411  -943411  -344676   130324 37206589 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   969676      54943   17.65   <2e-16 ***
d_bathroom   1573735      72877   21.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2129000 on 3478 degrees of freedom
Multiple R-squared:  0.1182,    Adjusted R-squared:  0.118 
F-statistic: 466.3 on 1 and 3478 DF,  p-value: < 2.2e-16
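
A minimal sketch of the dummy construction follows. The column name No..of.Bathrooms is an assumption (the handout only shows d_bathroom), and the values are invented so the arithmetic is easy to check by hand.

```r
# Toy data; No..of.Bathrooms is a hypothetical column name.
toy <- data.frame(
  No..of.Bathrooms = c(1, 2, 2, 3, 4, 3),
  Price            = c(400, 600, 500, 1500, 2500, 2000)
)

# 1 if more than two bathrooms, 0 otherwise.
toy$d_bathroom <- as.integer(toy$No..of.Bathrooms > 2)

fit <- lm(Price ~ d_bathroom, data = toy)
coef(fit)
# Intercept = mean price when d_bathroom is 0; slope = difference in group means.
```

Applied to the output above, houses with at most two bathrooms have an average price of about 969,676, and houses with more than two cost about 1,573,735 more on average.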

5.2 Repeat the regression from 5.1, but now include the area as an additional independent variable. What happens to the coefficient on the dummy? Interpret it. Why do you think this happens?
Call:
lm(formula = Price ~ d_bathroom + Area.in.sq.ft, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-9001985  -477709  -189080   107772 33486687 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1445.92   48460.12   0.030   0.9762    
d_bathroom    -170118.65   69321.99  -2.454   0.0142 *  
Area.in.sq.ft    1143.87      25.17  45.443   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1687000 on 3477 degrees of freedom
Multiple R-squared:  0.4468,    Adjusted R-squared:  0.4465 
F-statistic:  1404 on 2 and 3477 DF,  p-value: < 2.2e-16

5.3 What is omitted variable bias? Can you give two examples of additional variables, not in the data, that could influence the price of a house?
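
The change in the dummy's coefficient between 5.1 and 5.2 can be reproduced in a small simulation. This is a sketch under stated assumptions: the data are generated so that the bathroom dummy has no effect of its own on price, but is more likely to equal 1 in larger houses. All names and numbers are illustrative.

```r
set.seed(123)
n <- 5000

area <- rnorm(n, mean = 1500, sd = 500)
# Larger houses are more likely to have more than two bathrooms.
d_bathroom <- as.integer(area + rnorm(n, sd = 300) > 1700)
# True model: price depends on area only; the dummy's true effect is zero.
price <- 1000 * area + 0 * d_bathroom + rnorm(n, sd = 2e5)

short <- lm(price ~ d_bathroom)         # area omitted
long  <- lm(price ~ d_bathroom + area)  # area included

coef(short)["d_bathroom"]  # large and positive: it picks up the effect of area
coef(long)["d_bathroom"]   # close to the true value of zero
```

In the short regression the dummy absorbs the effect of the omitted, correlated variable, which is the pattern omitted variable bias describes.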