Lecture 1: Introduction to Data Science and Big Data
Introduction
Welcome to the Introduction to Applied Data Science! I am sure you have heard the terms Big Data, Machine Learning and Data Science in the media or perhaps from your peers, and it seems that they are being used more and more often. Data Science and Machine Learning (ML) rank among the most frequently searched fields of study in Google's index, and the trend has been steadily upward for the last 10 years.
Data Science is an emergent discipline that works at the intersection of statistics, mathematics and computer programming. This new discipline is growing fast owing to the large amounts of data that we are generating, the maturity of information and communication technologies (ICT) and the high availability of powerful computational systems. Data Science differs from statistics in what motivates its algorithms. Statistics develops algorithms to increase our inferential capabilities; that is, algorithms are needed to estimate certain parameters in a statistical or mathematical model, typically with the aim of claiming a ‘cause-effect’ relation. Data Science, in contrast, develops algorithms not necessarily for inference but for prediction and complex problem solving (Efron, 2021). For instance, Data Science and Machine Learning (ML) are growing rapidly at their intersection with econometrics (Athey and Imbens, 2019). Broadly speaking, Data Science and ML use statistics, algorithms, powerful numerical methods and heuristic computation to solve problems. A heuristic approach means that the process of problem-solving does not rely only on the analytical derivation of a neat ‘closed-form’ solution, common in optimization, but also on approximations. These approximations are sometimes better and more cost-efficient given the large amount of data involved in the computations.
But why all the fuss around Data Science and ML, and why should we care? The main reason is that they have great implications for the economy overall and for how we function as a society. You have probably heard that algorithms have beaten human champions in games like chess, or in the more intuitive Chinese board game Go. But far beyond games, the field is touching and finding solutions to very relevant problems. Take a look at the following examples:
- Environment: A ubiquitous concern today is the rapid increase in global temperatures year after year. Indeed, climate change has had a great impact on our economy and overall livelihood. To fight climate change, researchers and policymakers are pooling efforts and resources to find novel ways of addressing this common threat. In this regard, the study of Song and Wang (2018) pairs Big Data with econometric methods to study how participation in Global Value Chains positively affects the progress of green technologies. Likewise, to describe the evolution path of sustainable regional agricultural growth, the work of Song and Wang (2018) combines Big Data and ML to develop a framework (model) and give policy recommendations. The paper by Huntingford et al. (2019) summarizes well the intersection between Data Science and climate change; their work shows many contributions of ML to the development of Earth System models, weather forecasting and climate-impact studies.
- Cancer: ML and Data Science have proven to be effective methods in the field of cancer research. For instance, the study of Kourou et al. (2015) applies ML classification algorithms to sort cancer patients into high- and low-risk groups. Similarly, the work of Bi et al. (2019) shows that ML and AI are effective for the prompt detection of cancer, automating the initial interpretation of radiographic images to assess whether or not to administer an intervention, as well as the subsequent evaluation of patients. A third important example is the research by Hirasawa et al. (2018), which uses neural-network algorithms to assess thousands of endoscopic images for the evaluation and diagnosis of gastric cancer.
- Transportation: Our cities are growing in density, and to ensure economic growth we have to attend to the movement of labor within and between cities, regions and countries. In this area, Data Science and ML are also finding solutions to the complex problem of today’s labor mobility. The survey of Ghofrani et al. (2018) collects the major contributions of Data Science to the railway transportation industry over the last 15 years. The authors show how Data Science has contributed to solving logistics problems such as traffic-control systems, train tracking, ticket sales and automatic fare collection; in safety, the field has implemented systems for incident analysis and risk management. For all of these, researchers have applied ML and Data Science to image processing, string and semantic analysis, and clustering and classification algorithms, among many other applications.
- Economics: Data Science, ML and Big Data are evolving into new empirical methods in Economics. In particular, in the field of econometrics, Data Science contributes robust data-manipulation tools, new tools for variable selection and prediction, and more flexible functional forms that go beyond the linear OLS model (Varian, 2014). In the same line, Athey and Imbens (2019) argue that methods working at the intersection of ML and econometrics most of the time perform better than the sole use of ML for causal inference of average treatment effects and estimation of counterfactual effects in evidence-based policy and consumer-choice models. For instance, the survey of Koum et al. (2019) reviews developments in the field of financial risk assessment combined with ML, Big Data analysis, network analysis and sentiment analysis. The study of Gu et al. (2020) compares ML algorithms (trees and neural networks) vis-à-vis traditional linear-regression methods on the problem of asset pricing and shows that the ML methods perform better. Another interesting example is the research by Nickerson and Rogers (2014), which shows the diffusion of Data Science and ML techniques for making individual-level predictions about whether voters will support candidates and issues, or change their behavior when targeted with specific campaign interventions.
The landscape of Data Science in Economics
To better understand the scope of Data Science and ML, this section presents an overview of peer-reviewed publications in high-impact journals in the field of economics. First, we explore the top 10 publications within the field that have had the highest reach among top researchers using Data Science. What we can observe is a high overlap with common empirical disciplines like economics, but also connections to finance, market analysis and policymaking.
Top 10 Publications in High Impact Journals.
Source: Web of Science, 2022
Indeed, the histogram of publications in the field of Data Science shows that in recent years more economists have been adopting this empirical skill set. What is also clear is that the pattern seems to be exponential, with a marked increase in the trend between 2014 and 2015.
But what about the sub-disciplines within economics working with Data Science? The top 10 publications show that the field is growing fastest at the intersection of empirical economics, finance and policy analysis. The following figure looks at the top sub-disciplines within economics working with Data Science. The data show that fields like management, health economics and operations research are also incorporating Data Science into their research.
The rise of Big Data
In the last decade, we have seen an increase in the number of publications using Data Science in economics. But this is driven not only by the growth of the field itself but also by the massive collection of data. In 2014, it was calculated that over 2 billion people worldwide were connected to the Internet and 5 billion individuals had mobile phones, and it was predicted that by 2020 this would grow to 50 billion devices with Internet access. This example comes from the telecommunications sector alone; imagine how much data we are generating globally in the financial, eCommerce, transportation or health-care sectors.
It is difficult to define what Big Data is; it seems, somehow, to be a rather “vague term” that refers to just a large collection of data. Big Data is better understood as a term “used to describe a wide range of concepts: from the technological ability to store, aggregate, and process data, to the cultural shift that is pervasively invading business and society, both drowning in information overload” (De Mauro, Greco and Grimaldi, 2015). Their review provides an excellent set of definitions from different authors that can guide our understanding of what the term encompasses:
Towards a Definition of Big Data
No. | Definition |
---|---|
1 | High volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. |
2 | The four characteristics defining big data are Volume, Velocity, Variety and Value. |
3 | Complex, unstructured, or large amounts of data. |
4 | Can be defined using three data characteristics: Cardinality, Continuity and Complexity. |
5 | Big data is a combination of Volume, Variety, Velocity and Veracity that creates an opportunity for organizations to gain competitive advantage in today’s digitized marketplace. |
6 | Extensive datasets, primarily in the characteristics of volume, velocity and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis. |
7 | The storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning. |
8 | The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information. |
9 | Data that exceeds the processing capacity of conventional database systems. |
10 | Data that cannot be handled and processed in a straightforward manner. |
11 | A dataset that is too big to fit on a screen. |
12 | Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. |
13 | The data sets and analytical techniques in applications that are so large and complex that they require advanced and unique data storage, management, analysis, and visualization technologies. |
14 | A cultural, technological, and scholarly phenomenon that rests on the interplay of Technology, Analysis and Mythology. |
15 | Phenomenon that brings three key shifts in the way we analyze information that transform how we understand and organize society: 1. More data, 2. Messier (incomplete) data, 3. Correlation overtakes causality. |
16 | Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value. |
Source: De Mauro, Greco and Grimaldi (2015)
Indeed, the last definition, 16, provided by De Mauro, Greco and Grimaldi (2015), highlights the “raw value” of Big Data. Namely, Big Data itself is of no use without a specific set of skills: the skills of the Data Scientist. Big Data implies a large volume of data, but that data is not necessarily as clean and neat as the typical relational databases used by companies and governments.
Properties of Big Data
In this section, I try to identify some properties of Big Data that I have distilled from my own observation and practice.
- Cost-effective: In the past, empirical studies used to rely on large teams for data collection to implement surveys or deploy field experiments. Today, however, a Data Scientist with web-scraping skills can collect millions of observations with a single computer; see the web-scraping sketch after this list.
- Dimension: A feature that sets Big Data apart from survey data, which contains a smaller number of variables, is that Big Data typically contains multiple dimensions of variables on the same unit of observation.
- Scope: Another element of Big Data is that it can merge multiple layers of units. That means Big Data is often able to map observations from micro-units onto higher-order units. You can think of cellphone users from a district, merged with data from their state and even their country. Some authors refer to this as the granularity of the data.
- Unstructured: With the advent of Data Science and ML, more and more scientists are harvesting data available from different databases and websites. This data, however, is unstructured and fuzzy; it deviates from neat relational databases and requires special Data Science skills to harvest its value.
- Magnitude: Given that Big Data occupies a large volume of cyberspace, regular computers are not able to load and process it for analysis. For instance, the work of Qian (2014) proposes that a solution to overcome computer overloading is to process the Big Data sequentially; the chunk-by-chunk reading sketch after this list illustrates the idea. This points to an important and somewhat more obvious feature of Big Data: this kind of data is growing exponentially, from gigabytes (\(1024^3\) bytes) to petabytes (\(1024^5\) bytes), and perhaps shortly to even larger files.
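To make the Cost-effective point concrete, here is a minimal web-scraping sketch in R using the `rvest` package. The URL and the CSS selector are hypothetical placeholders, not a real data source; the point is only how few lines it takes to harvest structured observations from a web page.

```r
# Minimal web-scraping sketch with rvest.
# The URL and the ".price" selector are hypothetical placeholders.
library(rvest)

url  <- "https://example.com/prices"   # hypothetical page
page <- read_html(url)                 # download and parse the HTML

# Extract the text of every element matching a CSS selector
prices <- page |>
  html_elements(".price") |>           # hypothetical selector
  html_text2()

head(prices)
```

Looping such a script over thousands of pages is how a single computer can assemble millions of observations.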
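To illustrate the Magnitude point, here is a sketch of sequential (chunk-by-chunk) processing in the spirit of Qian (2014), using `readr::read_csv_chunked()`. The file name and the `amount` column are assumptions for illustration; the technique is that only one chunk is ever held in memory at a time.

```r
# Sequential-processing sketch: aggregate a large CSV chunk by chunk,
# so the full file never has to fit into memory at once.
# "transactions.csv" and its column "amount" are hypothetical.
library(readr)

total <- 0
read_csv_chunked(
  "transactions.csv",
  callback = function(chunk, pos) {
    # runs once per chunk; <<- updates the accumulator outside the function
    total <<- total + sum(chunk$amount, na.rm = TRUE)
  },
  chunk_size = 100000                  # rows held in memory at a time
)
total  # grand total computed without ever loading the whole file
```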
Data Science with R
stackoverflow.co is a platform of programmers and Data Scientists that serves 100 million people every month, making it one of the most popular websites in the world. Every year, the platform releases a survey of the market of computer languages; without exception, for at least the last 7 years R has been listed among the most-used languages. A great feature of R is that everything is free and open-source. That means you can deepen your knowledge by studying how the `functions`, `packages` and `snippets` of code written by expert developers actually work, because all of them are accessible to you. Indeed, in my practice I have learned a lot digging through the `Github` repositories of the packages that I use, and sometimes I have even contributed to or built upon this knowledge.

The community of R users is ever-evolving and highly supportive; to date, The Comprehensive R Archive Network hosts around 18,990 fully working `packages`. If you are not familiar with what `packages` are, think of them as add-ons; the short sketch after the list below shows how to install and load one. For instance, when you install R, it comes with the so-called R Base Package. This “out-of-the-shelf” R has the main functions that have been used and refined for many years. However, this core package evolves slowly, and other communities, for instance econometricians and statisticians, develop and publish state-of-the-art methodologies in the form of packages. You can read about the latest developments of these powerful packages and methodologies in The R Journal. The versatility of R, its collaborative community and the state-of-the-art methodologies implemented in it are powerful reasons to deploy your Data Science skills using R. The article of Weston and Yee (2017), “Why you should become a useR: A brief introduction to R”, lists three main reasons why you should learn R.
Reasons to learn R
- R will always be able to perform the newest statistical analyses as soon as anyone thinks of them;
- R will fix its bugs quickly and transparently; and
- R has brought together a community of programming and stats nerds (a.k.a. useRs) that you can turn to for help.
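As a minimal illustration of packages as add-ons, the snippet below installs a package from CRAN and loads it into the session; `ggplot2` is just an arbitrary, well-known example, not a course requirement.

```r
# Install a CRAN package once (ggplot2 is an arbitrary example),
# then load it at the start of every session that uses it.
install.packages("ggplot2")   # downloads the package from CRAN
library(ggplot2)              # attaches it to the current session

# The R Base Package works out of the box, with no installation:
mean(c(1, 2, 3, 4))           # returns 2.5
```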
But if that is not enough encouragement to climb the learning curve and start learning Data Science with R right away, take a look at the demand for Data Science skills from the biggest companies in the world.
Source: Deepanshu Bhalla, 2016
The scope of Data Science
What is a Data Scientist?
It is complicated to integrate all the skills and knowledge that the Data Science field encompasses, simply because the field is evolving rapidly. But perhaps starting with a simple definition of a Data Scientist can help us draw some distinctions with respect to the work of a programmer and that of a statistician/econometrician. According to Hicks and Irizarry (2018), a Data Scientist has two main features. The first is a focus on analysis; but, unlike statisticians, Data Scientists focus on how to process and interpret data to answer real-world questions. The second is very strong coding skills: Data Scientists know how to acquire, clean and manipulate data for analysis.
The research question is related to the type of analysis, and together they determine the strategy of data collection and manipulation. As we will learn in Lecture 2, the scope of Data Science is mostly driven by the use of inductive reasoning to derive conclusions about our data. In plain words, induction means collecting smaller pieces of information to draw general conclusions about our object of study. Our conclusions are the outcome of the type of analysis that we are conducting, and understanding the type of analysis is paramount for selecting the right methodology; this will be covered in detail in Lecture 3. The scope of Data Science is well described by the following diagram.
Source: Based on the Frameworks of Prakash, Padmapriy, and Kumar (2018) and De Mast, Nuijten & Kapitan (2021)
I use the frameworks of Prakash, Padmapriy and Kumar (2018) and De Mast, Nuijten and Kapitan (2021) to describe the three main processes of Data Science shown in the diagram above.
- Data: Involves all the processes of data management: acquisition, extraction, cleaning, storage and retrieval. In principle, retrieving the data is a requirement for the second step, which is data analysis. In reality, however, these two processes may end up being non-linear, forming an interactive loop. For instance, while conducting an analysis, you may want to add a new variable that was not properly extracted or stored; this happens often.
- Type of Analysis: The kind of analysis connects back to the question that we want to address. All Data Science questions say something about the relation between a dependent variable \(Y\) and a set of explanatory variable(s) \(X\). What we want to say, in other words the type of claim, determines whether the analysis is descriptive (covered in Lecture 3), predictive or causal (both discussed in Lecture 4). Descriptive analysis is characterized by questions such as “What is the current state of affairs?”, “How often, how many, when?” or “What is the association between two variables?”. Predictive analysis revolves around making forecasts about the outcome variable \(Y\) with a set of predictors \(X\); prediction is the default output of ML algorithms. Finally, causal analysis uses statistical modelling and econometrics, sometimes mixed with ML, to estimate the effect of \(X\) on the dependent variable \(Y\). Causal analysis typically answers “Why does \(Y\) behave in a certain manner?”, and the answer is: due to the effect of \(X\). The interest of this last type of analysis is thus quantifying or measuring the effect of \(X\) on \(Y\).
- Methods of Data Science: The methods used by Data Science are evolving rapidly, and the objective here is not to give you an exhaustive list, but rather an intuition of the scope of Data Science and of the overall process that connects a question to data acquisition and, finally, to analysis and conclusions. Broadly, descriptive methods can use measures of central tendency, scatter plots, histograms and pie charts to describe the state of affairs of certain variables. If the study is predictive, we can start by using linear regression and ML algorithms. Finally, if we aim for causal analysis, we must have a more refined model, based on theory and intuition, and we care about the statistical significance of the estimated relation between \(X\) and \(Y\). The state of the art of empirical econometrics nowadays is growing at the intersection between econometrics and ML techniques (Gu et al., 2020). The toy sketch after this list illustrates the three types of analysis in R.
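As a minimal, hedged illustration of the three types of analysis, the sketch below uses R's built-in `mtcars` data set: descriptive summaries first, then a linear model used for prediction, and finally the same model's coefficient read as an estimate of the relation between \(X\) and \(Y\) (a genuinely causal claim would require a richer design than this toy example).

```r
# Toy contrast of descriptive, predictive and causal-style analysis,
# using the built-in mtcars data (Y = mpg, fuel economy; X = wt, car weight).
data(mtcars)

# 1. Descriptive: state of affairs and association
summary(mtcars$mpg)             # central tendency and spread of Y
cor(mtcars$wt, mtcars$mpg)      # association between X and Y

# 2. Predictive: fit a model and forecast Y for new values of X
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = c(2.5, 3.5)))

# 3. Causal-style reading: the slope estimates the effect of X on Y,
# together with its statistical significance (standard error, p-value)
coef(summary(fit))["wt", ]
```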
Follow-Up
After this brief overview of the field of Data Science, you are ready to start developing your skills. The best way to learn is by doing; for that reason, every tutorial is designed to teach you the basics of the main pillars of Data Science using R. Start with Tutorial 1, where you will learn the syntax and semantics of R right from the start. While learning, use the analogy of learning a foreign language: syntax is the set of characters (or letters) that are arranged in a certain way to perform an action. The action is the meaning (semantics), namely the specific operation that R will perform when you write in the console and press `enter`, as the small example below shows.
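For instance, the two lines below are valid R syntax, and their semantics is the computation R carries out the moment you press `enter`:

```r
# Syntax: the characters you type; semantics: what R does with them.
x <- c(4, 8, 15)   # "<-" assigns the vector on the right to the name x
mean(x)            # computes the average of x and prints 9
```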
Happy Learning!
Mario H. Gonzalez Sauri.