Our goal in Session 00 is to prepare ourselves technically for what follows. We need to (1) install the programming language R, (2) install the RStudio IDE (IDE stands for: Integrated Development Environment), (3) understand how to organize our files and folders, and (4) make sure that everything is working as expected.
Our goal in Session 01 is to develop an intuition of R, a feeling for the programming language. Yes, we need to work on our emotions, dear students. Because programming in Data Science - as well as programming in general - is nothing but communication, and a very specific kind of communication. Programming languages are not very tolerant in the way they communicate with us: for example, they demand that we be very upfront and straightforward, and they have no tolerance for ambiguity. If they do not understand something that we say, they immediately respond with a cold - and at first cryptic - error message. That means that we need to be very patient while we learn the strict rules of communication with machines, in Data Science and elsewhere. I guarantee you that there is a specific emotional state, a specific state of mind, that accompanies the focused work of any programmer and Data Scientist. It is a kind of calmness coupled with a good, very good intuition about the nature of the programming language they use. This is our goal: we need to start developing our intuition about R. How does R like to work?
Data Science is, well, all about data. But data lives somewhere. Where? How do we find it? In this session we will learn about the basic I/O (Input/Output) operations in R: how to load data stored on disk in various formats into R, and how to store it back. We will learn that sometimes it makes no difference whether the dataset lives on the Internet or on our local hard drives. We will learn to perform the basic I/O operations in base R before we start meeting the friendly tidyverse packages - readr, for example - that provide improved and somewhat more comfortable procedures to work with. We will learn more about the data.frame in R: what subsetting is and how it is done, how to summarize a dataset represented by a data.frame in R, and how to bind dataframes together. We continue to play with lists too, and start learning about simple operations on strings in base R.
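To make this concrete, here is a minimal base R sketch of reading a CSV file from disk, subsetting the resulting data.frame, binding rows, and writing a result back; the file name and column names are hypothetical.

```r
# Minimal base R I/O sketch; "employees.csv" and its columns are hypothetical.
employees <- read.csv("employees.csv", header = TRUE, stringsAsFactors = FALSE)

# Inspect the structure and get a quick summary
str(employees)
summary(employees)

# Subsetting: select rows by a condition and columns by name
seniors <- employees[employees$age > 40, c("name", "age")]

# Bind two dataframes with identical columns row-wise
allRows <- rbind(employees, employees)

# Store the result back to disk
write.csv(seniors, "seniors.csv", row.names = FALSE)
```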
We study a practical problem of dealing with many dataframes with identical (or nearly identical) columns: that would be similar to a set of tables with identical schemata in the world of Relational Database Management Systems (RDBMS). In the near future we will start drawing parallels between working with dataframes in R and the relationally structured tables in RDBMS like MySQL, MariaDB, PostgreSQL, and similar. While learning how to manage a set of dataframes we start introducing R control flow: various iterations like for, repeat, and while loops, and decision making with if … else and switch. Of course, we recognize the nature of R as a functional and vectorized programming language, so we immediately show how to avoid using a loop when lapply() or any of its sisters - like Reduce(), which we will introduce in this session - can do the job. We want our code to be a series of mappings from input data through various transformations onto the results, and consider everything else to be what is called glue code. We will later show how {dplyr}, {tidyr}, and other packages are perfectly suited for that purpose.
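As a sketch of that functional style, and assuming a hypothetical data/ directory holding CSV files with identical columns, lapply() can map each file to a dataframe and Reduce() can then collapse the list into a single dataframe, with no explicit loop required.

```r
# A sketch: combine many dataframes with identical columns without a for loop.
# The "data" directory and the file pattern are hypothetical.
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# Functional style: map each file name to a data.frame...
frames <- lapply(files, read.csv, stringsAsFactors = FALSE)

# ...then reduce the list to a single data.frame by successive rbind() calls
allData <- Reduce(rbind, frames)

# The equivalent for loop would accumulate rows step by step - more glue code,
# same result:
# allData <- data.frame()
# for (f in files) allData <- rbind(allData, read.csv(f))
```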
In this session, we are consolidating and expanding our knowledge and understanding of functional programming in R by applying functionals to calculate basic statistics in a sample of measurements. Additionally, we will review basic descriptive statistics: mean, median, Q1, Q3, IQR, minimum, maximum, variance, and standard deviation. As the final new concept of functional programming in R introduced during the course, we will present function factories: functions that create functions!
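A small sketch of both ideas follows: applying functionals to compute descriptive statistics over a sample, and a function factory that manufactures quantile functions. The name quantile_factory() and the simulated sample are used only for illustration.

```r
# A function factory: a function that returns a function.
# quantile_factory() is a hypothetical name used here for illustration.
quantile_factory <- function(p) {
  function(x) quantile(x, probs = p, names = FALSE)
}

q1 <- quantile_factory(0.25)   # manufactured function: first quartile
q3 <- quantile_factory(0.75)   # manufactured function: third quartile

# Apply a set of descriptive statistics to a simulated sample via a functional
sample_data <- rnorm(1000, mean = 100, sd = 15)
stats <- sapply(list(mean = mean, median = median,
                     Q1 = q1, Q3 = q3,
                     min = min, max = max,
                     var = var, sd = sd),
                function(f) f(sample_data))
stats
```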
This is the beginning of our journey into Data Science and Analytics in R. Thus far, we have been learning about R programming, more or less treating R as any other programming language. Now we begin to master the real power of R in its areas of specialization: analytics, statistics, machine learning, visualization, and reporting. And as you will see: in statistics, nothing compares to R.
Today, we begin laying the fundamental mathematical foundations for working in Data Science, which includes an introduction to probability theory. Probability theory is a branch of mathematics that deals with uncertain events—events that, as the name suggests, can occur only with some probability. But what does that mean? If something is uncertain, inherently unpredictable, is it even possible to describe it mathematically and treat it with the appropriate mathematical framework? The development of mathematics over the past three centuries, culminating in the 20th century, has shown that this is indeed possible. Probability theory is a mathematical discipline that connects the natural language we use to intuitively express ourselves about uncertain events with precise mathematical expressions. Through set theory, algebra, combinatorics, and mathematical analysis, it allows us to incorporate uncertain events into science and engineering. The fundamental concepts we are introducing in this session are: event algebra, probability, random variables, distributions (discrete and continuous) of random variables, and the central limit theorem, whose significance in many empirical domains cannot be overstated.
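The central limit theorem, for instance, can already be illustrated by a short simulation: sample means drawn from a heavily skewed distribution still pile up in an approximately normal shape. The sketch below uses simulated exponential data chosen purely for illustration.

```r
# A sketch illustrating the central limit theorem by simulation.
# Draw many samples from a (non-normal) exponential distribution and
# inspect the distribution of their means.
set.seed(123)
n_samples <- 10000
sample_size <- 50

sample_means <- replicate(n_samples,
                          mean(rexp(sample_size, rate = 1)))

# The means cluster around the true mean (1) and look approximately normal,
# even though the underlying distribution is heavily skewed.
hist(sample_means, breaks = 50,
     main = "Distribution of sample means (CLT)",
     xlab = "Sample mean")
```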
We will dive deep into theory today in an effort to understand the logic of statistical modeling by error minimization. We will talk a bit about different types of mathematical models that are being developed and used in contemporary Data Science, Machine Learning, and AI. We will distinguish between supervised and unsupervised learning on the one hand, and reinforcement learning on the other. Then we focus on supervised learning and the simple regression model again in order to understand how mathematical models are estimated from empirical data. We will demonstrate several different approaches to estimating a given model’s parameters from data and discuss why they all converge towards one single, important insight: statistics is learning.
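As a sketch of that idea, using simulated data only, we can estimate the parameters of a simple regression by numerically minimizing the sum of squared residuals with optim() and compare the result with R's built-in lm().

```r
# Estimate a simple linear regression y = b0 + b1*x by minimizing the
# sum of squared residuals numerically; data are simulated for illustration.
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 2)

# The error (loss) function: sum of squared residuals
sse <- function(par, x, y) {
  residuals <- y - (par[1] + par[2] * x)
  sum(residuals^2)
}

# Numerical minimization of the loss
fit_optim <- optim(par = c(0, 0), fn = sse, x = x, y = y)
fit_optim$par          # estimated intercept and slope

# The least-squares solution via lm() should closely agree
coef(lm(y ~ x))
```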
We introduce the concepts of partial and part correlation, of essential importance in multiple linear regression, and take a look at what R has to offer in the {ppcor} package in that respect. Then we go back to the simple linear regression model and discuss its assumptions. All statistical models make some assumptions about the data, and only if those assumptions are satisfied is the application of the model to the data justified. We proceed by introducing the concept of bias of a statistical estimate and introduce a way to estimate it via numerical simulations: the parametric bootstrap.
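A minimal sketch of the parametric bootstrap follows, assuming a normal model and taking the (biased) maximum-likelihood estimator of the variance as the statistic whose bias we want to estimate; the data are simulated for illustration.

```r
# Parametric bootstrap sketch: estimate the bias of the maximum-likelihood
# estimator of variance, which divides by n instead of n - 1.
set.seed(123)
x <- rnorm(30, mean = 0, sd = 2)
n <- length(x)

var_mle <- function(v) sum((v - mean(v))^2) / length(v)
theta_hat <- var_mle(x)

# Parametric bootstrap: resample repeatedly from the fitted normal model
B <- 10000
boot_estimates <- replicate(B,
  var_mle(rnorm(n, mean = mean(x), sd = sqrt(theta_hat))))

# Bias estimate: mean of bootstrap estimates minus the original estimate
bias_hat <- mean(boot_estimates) - theta_hat
bias_hat
```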