About the Data

Description of the variables:

  • instant: record index
  • dteday : date
  • season : season (1:winter, 2:spring, 3:summer, 4:fall)
  • yr : year (0: 2011, 1:2012)
  • mnth : month ( 1 to 12)
  • hr : hour (0 to 23)
  • holiday : weather day is holiday or not (extracted from [Web Link])
  • weekday : day of the week
  • workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
  • weathersit :
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
  • atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
  • hum: Normalized humidity. The values are divided to 100 (max)
  • windspeed: Normalized wind speed. The values are divided to 67 (max)
  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes including both casual and registered

Load the bike data set

setwd("C:/Users/mglez/OneDrive/Documentos/PHD/15 Semester/18062021_utrecht_courses/year_2021_2022/04_period/ECB1ID_Introduction_to_Applied_Data_Science/material/tutorials/tutorial_excersises/02/")
bike <- read.csv("day.csv")

Exercise 1: Function 01

Instructions:

Create the following vectors:

a <- c(2, 10, 5, 4)
b <- c(6, 6, 4, 4)
c <- c(20, 1, 3, 40)
d <- c(2, 5, 10, 4, 3, 2, 10, 3, 2, 10, 7, 10, 11)

Solve:

  1. 1.1 Create a function that returns the a variable to the power of 2, f(x) = x2. Use it on vector a.
[1]   4 100  25  16
  1. 1.2 Create a function that gives you the log of one variable divided by the another variable: f(x,z) = l**n(x)/z. Use it on vector a and b
[1] 0.1155245 0.3837642 0.4023595 0.3465736
  1. 1.3 Create a function to get the weighted average of 3 variables, where the third variable has weight 2. Use it on vectors a, b and c
[1] 16.00000  6.00000  5.00000 29.33333
  1. 1.4 Create a function that randomly select three observations of a variable and sum them. Use it on vector d.

tip: use set.seed(10) so that the value is the same as the solution

[1] 19
  1. 1.5 Create a function that returns the a standardized version of another variable f(x) = (x))/σx, where is the mean and σx is the standard deviation of x. Use it on the first 25 values of the casual variable of the data set bike.
 [1]  2.97799483  0.31621897  0.16982130  0.01011475 -0.33591611 -0.25606284
 [7]  0.54246992 -0.52224042 -0.70856473 -0.88158016 -0.85496241 -1.09452223
[13] -0.92150680 -0.70856473  1.52732699  1.91328449  0.12989466 -1.30746430
[19] -0.38915163 -0.32260723 -0.42907827 -0.18951844  0.56908768 -0.28268060
[25]  1.04820733
  1. 1.6 Create a function that returns the number of unique values of a variable. Use it on the Humidity (hum) variable from the data set bike.
[1] 595

Exercise 2: Apply functions

Instructions: solve the next exercises using one of the apply functions:

  1. 2.1 Get the class of every variable of the bike dataframe.
    instant      dteday      season          yr        mnth     holiday 
  "integer" "character"   "integer"   "integer"   "integer"   "integer" 
    weekday  workingday  weathersit        temp       atemp         hum 
  "integer"   "integer"   "integer"   "numeric"   "numeric"   "numeric" 
  windspeed      casual  registered         cnt 
  "numeric"   "integer"   "integer"   "integer" 
  1. 2.2 Get the sum of every (numeric) variable of the bike dataframe.
     temp     atemp       hum windspeed 
 362.1263  346.7528  458.9906  139.2454 
  1. 2.3 apply the function created in 1.1 to the first 10 observations of the each numeric variable in bike.
            temp      atemp       hum   windspeed
 [1,] 0.11845092 0.13222314 0.6493668 0.025742919
 [2,] 0.13211626 0.12513128 0.4845371 0.061771635
 [3,] 0.03855882 0.03587425 0.1912077 0.061657359
 [4,] 0.04000000 0.04499574 0.3486135 0.025694808
 [5,] 0.05150948 0.05256473 0.1909314 0.034931610
 [6,] 0.04175811 0.05438644 0.2685945 0.008021925
 [7,] 0.03862090 0.04361373 0.2486977 0.028468463
 [8,] 0.02722500 0.02632636 0.2871170 0.071184374
 [9,] 0.01913602 0.01349663 0.1885010 0.131007802
[10,] 0.02275059 0.02276719 0.2332088 0.049848153
  1. 2.4 Get the maximum value of total users for each day of the week.
   0    1    2    3    4    5    6 
8227 7525 7767 8173 7804 8362 8714 
  1. 2.5 Get the mean temperature for each season.
        1         2         3         4 
0.2977475 0.5444052 0.7063093 0.4229060 
  1. 2.6 Apply a function that returns the interquartile range (third quartile minus the first quartile) of every integer variable of bike.
   instant.75%     season.75%         yr.75%       mnth.75%    holiday.75% 
         365.0            1.0            1.0            6.0            0.0 
   weekday.75% workingday.75% weathersit.75%     casual.75% registered.75% 
           4.0            1.0            1.0          780.5         2279.5 
       cnt.75% 
        2804.0 

Exercise 3: Function 02

  1. 3.1 Create a function that gives the mode of a variable (most common value). Use it on vector d.

Tip: look up the functions: table() and order()

[1] 10
  1. 3.2 Get the mode of the temperature variable in the bike data. Explain what this mode tell us.
[1] 0.265833
  1. 3.3 Get the mean number of total users for holidays and for non-holidays. Interpret these numbers.
       0        1 
4527.104 3735.000 
  1. 3.4 Get the maximum and minimum values of the last three variables of bike. After looking at the results, do you think there is an observation where the number of registered and casual users are both zero? Why?
    casual registered        cnt 
         2         20         22 

    casual registered        cnt 
      3410       6946       8714 

Exercise 4: Function 03

  1. 4.1 Create a function that returns the median if the number in the series is more than 2 times the standard deviation from the mean. Otherwise it returns the same number. Generate 100 realizations of a variables x with x <- rnorm(100, mean = 2, sd = 1) and input the last number x[100] <- 5 to have at least one number that deviates from the mean two times. Print the function and the median in the console.
  [1]  2.50466258  1.24603998  1.39414363  1.82278950  2.17061755  2.24281413
  [7]  1.82059388  1.36948140  2.97869298  2.29329698  1.62967101  2.54361539
 [13]  1.29237516  1.63055423  0.67802435  3.28059746  2.66741505  3.69175496
 [19]  2.00126141  1.25753869  2.60968442  1.01039362  1.96515167  2.84715991
 [25]  3.52549801  1.93262467  2.21014300  1.91827838  2.01324940  0.79689366
 [31]  1.73884564  2.70991555  1.86804020  1.22489409  2.08640253  2.06456372
 [37]  2.11520792  3.76600406  1.18645101  1.91062508  2.31481673 -0.06054126
 [43]  1.40019739  1.04643414  2.55750589  2.04301918  2.97322126  2.10194801
 [49]  0.11938063  0.45623429  1.76626038  3.21967133  1.94280588  1.81278610
 [55]  3.00559998  2.07159930  3.36118735  2.81556240  2.52618604  1.57073665
 [61]  2.16474018  1.22680843  1.62147553  3.65189056  0.32389464  1.26189533
 [67]  1.42507079  3.08997497  2.12648948  1.27682666  3.42620243  2.92451306
 [73]  2.28589468  3.30546194  1.14803539  0.99454779  1.08487030  1.44829559
 [79]  0.73234433  2.04112404  1.78084660  1.35411387  1.95397877  2.22735950
 [85]  2.71209430  2.54704841  1.87845808  1.28930077  1.62094382  0.86075871
 [91]  1.93305021  2.32379563  2.46534289  1.09165400  0.54038095  2.04331155
 [97]  1.87193408  1.41680231  2.57055482  1.95397877

[1] 1.953979
  1. 4.2 Make a function that tells us if the mean divided by the median is between 0.9 and 1.1. Use it on the whole dataframe using an sapply function. Why are the median and mean sometimes different?
$instant
[1] TRUE

$dteday
NULL

$season
[1] FALSE

$yr
[1] FALSE

$mnth
[1] TRUE

$holiday
[1] FALSE

$weekday
[1] TRUE

$workingday
[1] FALSE

$weathersit
[1] FALSE

$temp
[1] TRUE

$atemp
[1] TRUE

$hum
[1] TRUE

$windspeed
[1] TRUE

$casual
[1] FALSE

$registered
[1] TRUE

$cnt
[1] TRUE

Exercise 5: Theoretical

  1. 5.1 Explain how inductive reasoning is different from deductive reasoning?

  2. 5.2 Explain the three components or constructs of the framework of inductive reasoning from Christou and Papageorgiou (2007).

  3. 5.3 In the next picture, describe:

Is the relationship positive or
negative?

  • What is the relationship between the variables advance and raises.

  • What is the relationship between the variables privileges and learnig.

  • What is the relationship between the variables ratting and advance.