Tutorial 2: Descriptive Statistics.
About the Data
Description of the variables:
- instant: record index
- dteday : date
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not (extracted from [Web Link])
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
Load the bike
data set
setwd("C:/Users/mglez/OneDrive/Documentos/PHD/15 Semester/18062021_utrecht_courses/year_2021_2022/04_period/ECB1ID_Introduction_to_Applied_Data_Science/material/tutorials/tutorial_excersises/02/")
bike <- read.csv("day.csv")
Exercise 1: Function 01
Instructions:
Create the following vectors
:
a <- c(2, 10, 5, 4)
b <- c(6, 6, 4, 4)
c <- c(20, 1, 3, 40)
d <- c(2, 5, 10, 4, 3, 2, 10, 3, 2, 10, 7, 10, 11)
Solve:
- 1.1 Create a
function
that returns the a variable to the power of 2, f(x) = x2. Use it on vectora
.
[1] 4 100 25 16
- 1.2 Create a function that gives you the log of one variable divided
by the another variable: f(x,z) = l**n(x)/z. Use it on
vector
a
andb
[1] 0.1155245 0.3837642 0.4023595 0.3465736
- 1.3 Create a function to get the weighted average of 3 variables,
where the third variable has weight 2. Use it on vectors
a
,b
andc
[1] 16.00000 6.00000 5.00000 29.33333
- 1.4 Create a function that randomly select three observations of a
variable and sum them. Use it on vector
d
.
tip: use set.seed(10) so that the value is the same as the solution
[1] 19
- 1.5 Create a function that returns the a standardized version of
another variable f(x) = (x−x̄))/σx, where x̄
is the mean and σx is the standard deviation of
x
. Use it on the first 25 values of thecasual
variable of the data setbike
.
[1] 2.97799483 0.31621897 0.16982130 0.01011475 -0.33591611 -0.25606284
[7] 0.54246992 -0.52224042 -0.70856473 -0.88158016 -0.85496241 -1.09452223
[13] -0.92150680 -0.70856473 1.52732699 1.91328449 0.12989466 -1.30746430
[19] -0.38915163 -0.32260723 -0.42907827 -0.18951844 0.56908768 -0.28268060
[25] 1.04820733
- 1.6 Create a function that returns the number of unique values of a
variable. Use it on the Humidity (
hum
) variable from the data setbike
.
[1] 595
Exercise 2: Apply functions
Instructions: solve the next exercises using one of the apply functions:
- 2.1 Get the class of every variable of the
bike
dataframe.
instant dteday season yr mnth holiday
"integer" "character" "integer" "integer" "integer" "integer"
weekday workingday weathersit temp atemp hum
"integer" "integer" "integer" "numeric" "numeric" "numeric"
windspeed casual registered cnt
"numeric" "integer" "integer" "integer"
- 2.2 Get the sum of every (numeric) variable of the
bike
dataframe.
temp atemp hum windspeed
362.1263 346.7528 458.9906 139.2454
- 2.3 apply the function created in 1.1 to the first 10 observations
of the each
numeric
variable inbike
.
temp atemp hum windspeed
[1,] 0.11845092 0.13222314 0.6493668 0.025742919
[2,] 0.13211626 0.12513128 0.4845371 0.061771635
[3,] 0.03855882 0.03587425 0.1912077 0.061657359
[4,] 0.04000000 0.04499574 0.3486135 0.025694808
[5,] 0.05150948 0.05256473 0.1909314 0.034931610
[6,] 0.04175811 0.05438644 0.2685945 0.008021925
[7,] 0.03862090 0.04361373 0.2486977 0.028468463
[8,] 0.02722500 0.02632636 0.2871170 0.071184374
[9,] 0.01913602 0.01349663 0.1885010 0.131007802
[10,] 0.02275059 0.02276719 0.2332088 0.049848153
- 2.4 Get the maximum value of total users for each day of the week.
0 1 2 3 4 5 6
8227 7525 7767 8173 7804 8362 8714
- 2.5 Get the mean temperature for each season.
1 2 3 4
0.2977475 0.5444052 0.7063093 0.4229060
- 2.6 Apply a function that returns the interquartile range (third
quartile minus the first quartile) of every
integer
variable of bike.
instant.75% season.75% yr.75% mnth.75% holiday.75%
365.0 1.0 1.0 6.0 0.0
weekday.75% workingday.75% weathersit.75% casual.75% registered.75%
4.0 1.0 1.0 780.5 2279.5
cnt.75%
2804.0
Exercise 3: Function 02
- 3.1 Create a function that gives the
mode
of a variable (most common value). Use it on vectord
.
Tip: look up the functions: table() and order()
[1] 10
- 3.2 Get the mode of the
temperature
variable in thebike
data. Explain what this mode tell us.
[1] 0.265833
- 3.3 Get the mean number of total users for
holidays
and fornon-holidays
. Interpret these numbers.
0 1
4527.104 3735.000
- 3.4 Get the maximum and minimum values of the last three variables of bike. After looking at the results, do you think there is an observation where the number of registered and casual users are both zero? Why?
casual registered cnt
2 20 22
casual registered cnt
3410 6946 8714
Exercise 4: Function 03
- 4.1 Create a function that returns the median if the number in the
series is more than 2 times the standard deviation from the mean.
Otherwise it returns the same number. Generate
100
realizations of a variablesx
withx <- rnorm(100, mean = 2, sd = 1)
and input the last numberx[100] <- 5
to have at least one number that deviates from the mean two times. Print thefunction
and themedian
in the console.
[1] 2.50466258 1.24603998 1.39414363 1.82278950 2.17061755 2.24281413
[7] 1.82059388 1.36948140 2.97869298 2.29329698 1.62967101 2.54361539
[13] 1.29237516 1.63055423 0.67802435 3.28059746 2.66741505 3.69175496
[19] 2.00126141 1.25753869 2.60968442 1.01039362 1.96515167 2.84715991
[25] 3.52549801 1.93262467 2.21014300 1.91827838 2.01324940 0.79689366
[31] 1.73884564 2.70991555 1.86804020 1.22489409 2.08640253 2.06456372
[37] 2.11520792 3.76600406 1.18645101 1.91062508 2.31481673 -0.06054126
[43] 1.40019739 1.04643414 2.55750589 2.04301918 2.97322126 2.10194801
[49] 0.11938063 0.45623429 1.76626038 3.21967133 1.94280588 1.81278610
[55] 3.00559998 2.07159930 3.36118735 2.81556240 2.52618604 1.57073665
[61] 2.16474018 1.22680843 1.62147553 3.65189056 0.32389464 1.26189533
[67] 1.42507079 3.08997497 2.12648948 1.27682666 3.42620243 2.92451306
[73] 2.28589468 3.30546194 1.14803539 0.99454779 1.08487030 1.44829559
[79] 0.73234433 2.04112404 1.78084660 1.35411387 1.95397877 2.22735950
[85] 2.71209430 2.54704841 1.87845808 1.28930077 1.62094382 0.86075871
[91] 1.93305021 2.32379563 2.46534289 1.09165400 0.54038095 2.04331155
[97] 1.87193408 1.41680231 2.57055482 1.95397877
[1] 1.953979
- 4.2 Make a function that tells us if the mean divided by the median
is between 0.9 and 1.1. Use it on the whole
dataframe
using ansapply
function. Why are the median and mean sometimes different?
$instant
[1] TRUE
$dteday
NULL
$season
[1] FALSE
$yr
[1] FALSE
$mnth
[1] TRUE
$holiday
[1] FALSE
$weekday
[1] TRUE
$workingday
[1] FALSE
$weathersit
[1] FALSE
$temp
[1] TRUE
$atemp
[1] TRUE
$hum
[1] TRUE
$windspeed
[1] TRUE
$casual
[1] FALSE
$registered
[1] TRUE
$cnt
[1] TRUE
Exercise 5: Theoretical
-
5.1 Explain how inductive reasoning is different from deductive reasoning?
-
5.2 Explain the three components or constructs of the framework of inductive reasoning from Christou and Papageorgiou (2007).
-
5.3 In the next picture, describe:
-
What is the relationship between the variables
advance
andraises
. -
What is the relationship between the variables
privileges
andlearnig
. -
What is the relationship between the variables
ratting
andadvance
.