Tutorial 2: Descriptive Statistics.
About the Data
Description of the variables:
- instant: record index
- dteday : date
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not (extracted from [Web Link])
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
Load the bike data set
setwd("C:/Users/mglez/OneDrive/Documentos/PHD/15 Semester/18062021_utrecht_courses/year_2021_2022/04_period/ECB1ID_Introduction_to_Applied_Data_Science/material/tutorials/tutorial_excersises/02/")
bike <- read.csv("day.csv")
Exercise 1: Function 01
Instructions:
Create the following vectors:
a <- c(2, 10, 5, 4)
b <- c(6, 6, 4, 4)
c <- c(20, 1, 3, 40)
d <- c(2, 5, 10, 4, 3, 2, 10, 3, 2, 10, 7, 10, 11)
Solve:
- 1.1 Create a
functionthat returns the a variable to the power of 2, f(x) = x2. Use it on vectora.
[1] 4 100 25 16
- 1.2 Create a function that gives you the log of one variable divided
by the another variable: f(x,z) = l**n(x)/z. Use it on
vector
aandb
[1] 0.1155245 0.3837642 0.4023595 0.3465736
- 1.3 Create a function to get the weighted average of 3 variables,
where the third variable has weight 2. Use it on vectors
a,bandc
[1] 16.00000 6.00000 5.00000 29.33333
- 1.4 Create a function that randomly select three observations of a
variable and sum them. Use it on vector
d.
tip: use set.seed(10) so that the value is the same as the solution
[1] 19
- 1.5 Create a function that returns the a standardized version of
another variable f(x) = (x−x̄))/σx, where x̄
is the mean and σx is the standard deviation of
x. Use it on the first 25 values of thecasualvariable of the data setbike.
[1] 2.97799483 0.31621897 0.16982130 0.01011475 -0.33591611 -0.25606284
[7] 0.54246992 -0.52224042 -0.70856473 -0.88158016 -0.85496241 -1.09452223
[13] -0.92150680 -0.70856473 1.52732699 1.91328449 0.12989466 -1.30746430
[19] -0.38915163 -0.32260723 -0.42907827 -0.18951844 0.56908768 -0.28268060
[25] 1.04820733
- 1.6 Create a function that returns the number of unique values of a
variable. Use it on the Humidity (
hum) variable from the data setbike.
[1] 595
Exercise 2: Apply functions
Instructions: solve the next exercises using one of the apply functions:
- 2.1 Get the class of every variable of the
bikedataframe.
instant dteday season yr mnth holiday
"integer" "character" "integer" "integer" "integer" "integer"
weekday workingday weathersit temp atemp hum
"integer" "integer" "integer" "numeric" "numeric" "numeric"
windspeed casual registered cnt
"numeric" "integer" "integer" "integer"
- 2.2 Get the sum of every (numeric) variable of the
bikedataframe.
temp atemp hum windspeed
362.1263 346.7528 458.9906 139.2454
- 2.3 apply the function created in 1.1 to the first 10 observations
of the each
numericvariable inbike.
temp atemp hum windspeed
[1,] 0.11845092 0.13222314 0.6493668 0.025742919
[2,] 0.13211626 0.12513128 0.4845371 0.061771635
[3,] 0.03855882 0.03587425 0.1912077 0.061657359
[4,] 0.04000000 0.04499574 0.3486135 0.025694808
[5,] 0.05150948 0.05256473 0.1909314 0.034931610
[6,] 0.04175811 0.05438644 0.2685945 0.008021925
[7,] 0.03862090 0.04361373 0.2486977 0.028468463
[8,] 0.02722500 0.02632636 0.2871170 0.071184374
[9,] 0.01913602 0.01349663 0.1885010 0.131007802
[10,] 0.02275059 0.02276719 0.2332088 0.049848153
- 2.4 Get the maximum value of total users for each day of the week.
0 1 2 3 4 5 6
8227 7525 7767 8173 7804 8362 8714
- 2.5 Get the mean temperature for each season.
1 2 3 4
0.2977475 0.5444052 0.7063093 0.4229060
- 2.6 Apply a function that returns the interquartile range (third
quartile minus the first quartile) of every
integervariable of bike.
instant.75% season.75% yr.75% mnth.75% holiday.75%
365.0 1.0 1.0 6.0 0.0
weekday.75% workingday.75% weathersit.75% casual.75% registered.75%
4.0 1.0 1.0 780.5 2279.5
cnt.75%
2804.0
Exercise 3: Function 02
- 3.1 Create a function that gives the
modeof a variable (most common value). Use it on vectord.
Tip: look up the functions: table() and order()
[1] 10
- 3.2 Get the mode of the
temperaturevariable in thebikedata. Explain what this mode tell us.
[1] 0.265833
- 3.3 Get the mean number of total users for
holidaysand fornon-holidays. Interpret these numbers.
0 1
4527.104 3735.000
- 3.4 Get the maximum and minimum values of the last three variables of bike. After looking at the results, do you think there is an observation where the number of registered and casual users are both zero? Why?
casual registered cnt
2 20 22
casual registered cnt
3410 6946 8714
Exercise 4: Function 03
- 4.1 Create a function that returns the median if the number in the
series is more than 2 times the standard deviation from the mean.
Otherwise it returns the same number. Generate
100realizations of a variablesxwithx <- rnorm(100, mean = 2, sd = 1)and input the last numberx[100] <- 5to have at least one number that deviates from the mean two times. Print thefunctionand themedianin the console.
[1] 2.50466258 1.24603998 1.39414363 1.82278950 2.17061755 2.24281413
[7] 1.82059388 1.36948140 2.97869298 2.29329698 1.62967101 2.54361539
[13] 1.29237516 1.63055423 0.67802435 3.28059746 2.66741505 3.69175496
[19] 2.00126141 1.25753869 2.60968442 1.01039362 1.96515167 2.84715991
[25] 3.52549801 1.93262467 2.21014300 1.91827838 2.01324940 0.79689366
[31] 1.73884564 2.70991555 1.86804020 1.22489409 2.08640253 2.06456372
[37] 2.11520792 3.76600406 1.18645101 1.91062508 2.31481673 -0.06054126
[43] 1.40019739 1.04643414 2.55750589 2.04301918 2.97322126 2.10194801
[49] 0.11938063 0.45623429 1.76626038 3.21967133 1.94280588 1.81278610
[55] 3.00559998 2.07159930 3.36118735 2.81556240 2.52618604 1.57073665
[61] 2.16474018 1.22680843 1.62147553 3.65189056 0.32389464 1.26189533
[67] 1.42507079 3.08997497 2.12648948 1.27682666 3.42620243 2.92451306
[73] 2.28589468 3.30546194 1.14803539 0.99454779 1.08487030 1.44829559
[79] 0.73234433 2.04112404 1.78084660 1.35411387 1.95397877 2.22735950
[85] 2.71209430 2.54704841 1.87845808 1.28930077 1.62094382 0.86075871
[91] 1.93305021 2.32379563 2.46534289 1.09165400 0.54038095 2.04331155
[97] 1.87193408 1.41680231 2.57055482 1.95397877
[1] 1.953979
- 4.2 Make a function that tells us if the mean divided by the median
is between 0.9 and 1.1. Use it on the whole
dataframeusing ansapplyfunction. Why are the median and mean sometimes different?
$instant
[1] TRUE
$dteday
NULL
$season
[1] FALSE
$yr
[1] FALSE
$mnth
[1] TRUE
$holiday
[1] FALSE
$weekday
[1] TRUE
$workingday
[1] FALSE
$weathersit
[1] FALSE
$temp
[1] TRUE
$atemp
[1] TRUE
$hum
[1] TRUE
$windspeed
[1] TRUE
$casual
[1] FALSE
$registered
[1] TRUE
$cnt
[1] TRUE
Exercise 5: Theoretical
-
5.1 Explain how inductive reasoning is different from deductive reasoning?
-
5.2 Explain the three components or constructs of the framework of inductive reasoning from Christou and Papageorgiou (2007).
-
5.3 In the next picture, describe:
-
What is the relationship between the variables
advanceandraises. -
What is the relationship between the variables
privilegesandlearnig. -
What is the relationship between the variables
rattingandadvance.