Machine learning 101: a really simple explanation of linear regression

by Benedetta Tagliaferri

Hello everybody!

Predictive analytics has always been a challenging concept for me.

During my finance degree, and even earlier in high school, I found this type of analysis tricky. Often you know how to apply a model without really understanding what underpins it.

For this reason, I decided to make this blog series explaining machine learning and predictive analytics in a simple way, hoping to help you apply this analysis with Alteryx tools in a more meaningful way.

First, I must give credit to https://www.coursera.org, whose machine learning course gave me really simple and clear explanations of predictive analytics.

Let’s start with regression, which is one type of SUPERVISED LEARNING.

In supervised learning, we have a data set and already know what the correct output should look like, and we assume there is a relationship between the input and the output.

Supervised learning problems are categorized into “regression” and “classification” problems.

In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

For example: if we have data about the size of houses on the real estate market, can we predict their price? Price as a function of size is a continuous output, so this is a regression problem.

Linear regression with one variable

Univariate linear regression is used when you want to predict a single output value from a single input value. We’re doing supervised learning here, so we already have examples showing which output goes with each input.

Let’s do an example of this.

Let’s take a set of houses with different sizes and different prices. A friend of mine wants to sell his house, which is 1,250 square feet, and he wants to know its approximate value.

   ML1

The idea is to plot the line that best fits the data. Reading the price off that line at 1,250 square feet, I can tell my friend that the house is worth about 340k. This is an example of supervised learning and, more specifically, of a regression problem.

More formally, in supervised learning we have a data set, called the training set, containing house sizes and prices. From this data set I want to learn how to predict house prices.
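To make the "line that best fits the data" idea concrete, here is a minimal sketch in Python. It assumes that "best fit" means the ordinary least-squares line, which is one common choice. The sizes and prices are made up for illustration (they are not the points from the chart), and `fit_line` is a hypothetical helper name.

```python
# Hypothetical training data, invented for this sketch:
# house sizes in square feet and prices in thousands of dollars.
sizes = [850, 1000, 1100, 1400, 1600, 2000]
prices = [230, 275, 300, 380, 430, 540]


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(sizes, prices)

# Read the price off the fitted line at my friend's house size.
estimate = intercept + slope * 1250
print(f"Estimated price for 1,250 sq ft: about {estimate:.0f}k")
```

With real data you would replace the two lists with the sizes and prices from your own table. The rest of the sketch stays the same.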

ML2

 m = the number of training examples (number of rows)

 x = input

 y = output

 (x,y) = one training example (one single row)
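As a small illustration of this notation, the sketch below stores a training set as a list of rows in Python. The four rows are hypothetical values chosen for the example, and the variable names simply mirror the symbols above.

```python
# Hypothetical training set: each row is one training example (x, y),
# with x = size in square feet and y = price in thousands of dollars.
training_set = [
    (850, 230),
    (1100, 300),
    (1400, 380),
    (2000, 540),
]

m = len(training_set)                      # number of training examples (rows)
x = [size for size, _ in training_set]     # all inputs
y = [price for _, price in training_set]   # all outputs
first_example = training_set[0]            # one single row (x, y)

print(f"m = {m}")
print(f"first training example = {first_example}")
```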

 ML3

In general, supervised learning works like this: we feed a training set (house prices by size) to our learning algorithm. As output we get h (the hypothesis), a function that takes the size of a house as input and outputs an estimated value of y.

In other words, h is a function that maps from x to y.

So how do we represent h?

ML4

The function is predicting the value of y, given the value of x.
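Here is a minimal sketch of h as a Python function. It assumes the usual straight-line form for linear regression with one variable, h(x) = theta0 + theta1 * x. The names `theta0`, `theta1` and `make_hypothesis`, and the parameter values, are illustrative assumptions and not values learned from real data.

```python
def make_hypothesis(theta0, theta1):
    """Return h, a function that maps a house size x to a predicted price y."""
    def h(x):
        # A straight line: theta0 is the intercept, theta1 is the slope.
        return theta0 + theta1 * x
    return h


# Two hypothetical lines; each choice of parameters gives a different h.
h = make_hypothesis(theta0=5.0, theta1=0.27)
flatter = make_hypothesis(theta0=0.0, theta1=0.2)

# Both functions take the same input but predict different prices.
print(h(1250))
print(flatter(1250))
```

Each pair of parameters gives a different line and therefore a different prediction for the same house, so we need some way to decide which line is the best one.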

This model is called linear regression, or more precisely univariate linear regression, because we have just one input variable.

In my next blog post I am going to talk about the cost function, which is what we use to decide which model best fits our data.

Hope it’s useful!

Thanks