If you are working a job right now, do you seriously believe you are earning as much as you should be? What if I told you there is a way to find out roughly what range your salary should fall within, given the current job market? Or what if I told you there is a way to study how the stock market has behaved over time, so that you are better prepared to invest your hard-earned money? It may be hard to imagine, but these things can be explained really well using linear regression (…well, to some extent!).

In this article, we will talk about one of the most well-known and implemented algorithms – Linear Regression. After reading this article, you will –

  • Understand the concept behind Linear Regression
  • Understand the advantages and disadvantages of using Linear Regression
  • See examples of real-life problems solved by Linear Regression

Wait a minute, I think I’ve seen this before…

Photo by Burak Kebapci on Pexels.com

And yes, you are absolutely right. There is a very high chance that you have read or heard about Linear Regression before, especially if you have taken a Statistics 101 course (or maybe you excel at Excel and have a solid financial background, in which case perhaps we can talk about building my investment portfolio sometime?).

This algorithm, like most algorithms for that matter, works on the same basic principle – reducing the error with the help of some optimization function. Machine Learning is focused on reducing the error between what the training set says the true label/class value is and what the model predicts it to be. This shared principle is why we can pick and choose which algorithms to apply and where, and why you will see this kind of overlap between “plain and boring” statistics (obviously being sarcastic) and “cool and exciting” (not being sarcastic) machine learning time and again.

On the off chance you haven’t seen this before, don’t worry. You have come to the right place. We will go through the algorithm together. If you want to get a thorough understanding of Machine Learning, check out this cool article before proceeding –

Supervised v/s Unsupervised Machine Learning

Based on the article above, we can say that machine learning problems usually fall into two types –

  • Supervised Machine Learning Problem
    • The data we have contains the “right answers” as well. This means that for each input training record, we know what the output looks like.
    • For example, say we are identifying cats in photos. The data we will use is images with or without cats, along with a label class specifying whether or not a cat is present in a particular image. This will be the ground truth.
    • This way, the machine learning algorithm will see what its output should look like – hence the name, “supervised”.
    • Traditionally, a Supervised Machine Learning problem can be one of two types –
      • Classification – The output is made up of discrete classes. Like in the example above, the labels are {“Yes”, “No”}
      • Regression – The output is a continuous value. It could be a monetary value in some currency, or maybe the temperature at some point in the week. Say we are trying to find out the price of a house based on features such as {house size, house age, number of rooms, number of bathrooms, garage size, etc.} – the output would be a price value.
  • Unsupervised Machine Learning Problem
    • We are dealing with data that has no classes or labels. This means that it is the algorithm’s job to find some structure in the data.
    • For example, we might be trying to perform customer segmentation. Data we will use is the data on our customers – {demographics, web clickstream, buying pattern, etc.}.
    • The machine learning algorithm will cluster similar customers together and separate out customers in different clusters who are not similar.

Linear Regression

Linear regression is one of the most popular and widely used algorithms. It is also one of the oldest techniques, and one that has been studied immensely, both to understand and to implement. Hence you will find a million different implementations of, and names for, Linear Regression.

Intuition

We know that when we talk about Machine Learning problems, we always have independent variables (the features) and the dependent variables (label classes). The intuition behind linear regression suggests that we can find a linear model that explains the contribution of each independent variable taken together to form the dependent variable (literally why the label class is called the dependent variable).

If you have had some experience in linear algebra, you will know what I am talking about – the hypothesis function is directly modeled on the equation of a straight line.

Traditionally speaking, when we have only one feature x, we call it Simple Linear Regression, but when we have multiple features in X, we call it Multiple Linear Regression.

Simple Linear Regression

Straight Line Equation | Image by Author.

As mentioned above, the model hypothesis function of linear regression is based on the generic straight line equation-

h(x) = β0 + β1x

This straight-line equation gives us a way to express the hypothesis value as the contribution of the independent variable x plus a bias/intercept term β0 (the c in y = mx + c). The hypothesis value h(x) is then compared with the dependent variable (y) to find out the correctness of the model (more on this later). This equation deals with only one independent variable, whose contribution is captured by β1 (the m in y = mx + c), which is exactly what the name suggests – the slope of the regression line. Because we are dealing with only one variable, this is called Simple Linear Regression: a simple straight line governed by one independent variable can be fit through the data.
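
To make this concrete, here is a minimal sketch of Simple Linear Regression in Python with NumPy. The data (years of experience vs. salary) is entirely made up for illustration, and I am leaning on np.polyfit to do the least-squares fitting for us – we will see how the fitting itself works in the Cost Function section.

```python
import numpy as np

# Hypothetical data: years of experience (x) vs. salary in $1000s (y)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([45, 50, 60, 65, 75, 80], dtype=float)

# np.polyfit with degree 1 fits h(x) = b0 + b1*x by least squares;
# it returns the coefficients highest power first: [b1, b0]
beta1, beta0 = np.polyfit(x, y, deg=1)

print(f"Intercept (b0): {beta0:.2f}")
print(f"Slope     (b1): {beta1:.2f}")

# Use the fitted line as the hypothesis for a new x value
x_new = 7
print(f"Predicted salary at x = {x_new}: {beta0 + beta1 * x_new:.2f}")
```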

Multiple Linear Regression

When we are dealing with multiple independent variables, we call it Multiple Linear Regression. This algorithm allows us to find the contribution of each independent variable from ( x1, x2, x3, …, xn ) to form the hypothesis value h(x). The equation looks like this –

h(x) = β0 + β1x1 + β2x2 + β3x3 + … + βnxn

Here we are calculating the contributions, or the coefficients, for each independent variable to finally find out the hypothesis value h(x). We can attach an x0 as well to β0 where x0 is always equal to 1 to make a more generic hypothesis function. This hypothesis value is then compared with the y values given in the training dataset to find the correctness of the model.
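
In code, the trick of attaching x0 = 1 lets us compute h(x) for every record with a single matrix-vector product. Here is a small sketch, with made-up features and made-up coefficients purely to show the mechanics:

```python
import numpy as np

# Hypothetical feature matrix: each row is [house size (sq. ft.), rooms, bathrooms]
X = np.array([
    [1200.0, 3, 2],
    [1050.0, 2, 2],
])

# Prepend x0 = 1 to every row so that b0 is handled by the same dot product
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Made-up coefficients [b0, b1, b2, b3] just for illustration
beta = np.array([50.0, 0.2, 15.0, 10.0])

# h(x) = b0*x0 + b1*x1 + b2*x2 + b3*x3, computed for every row at once
h = X_design @ beta
print(h)  # one hypothesis value per training record
```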

Cost Function

Be it Simple Linear Regression or Multiple Linear Regression, if we have a dataset like this (Kindly ignore the erratically estimated house prices, I am not a realtor!)

House Size – x1 | Number of Rooms – x2 | Number of Bathrooms – x3 | Central Heating – x4 | House Price – y
1200 sq. ft.    | 3                    | 2                        | Yes                  | 400k
1050 sq. ft.    | 2                    | 2                        | No                   | 300k

First two rows of the dataset

We will feed the algorithm this dataset, which will then try to find the coefficients for the x values and calculate the h(x) function value. If we get a value far away from the corresponding y value for a data record, then there should be some way for the algorithm to change the values of the coefficients in order to better fit the dataset. This is where the cost function comes in. There are several ways to modify the β values in order to better fit the data. We will go through the two common ones here –

  • Ordinary Least Squares / Squared Error Function
  • Gradient Descent

Ordinary Least Squares / Squared Error Function

In this optimization method, we use the sum of all squared differences between the hypothesis value and the actual y value to make the regression line fit the data in a better way. Suppose we are dealing with the House Pricing problem again – we take the first row of data.

h(x) = β0 + β1 * (House size) + β2 * (Number of rooms) + β3 * (Number of Bathrooms) + β4 * (Central Heating)

Now that we have calculated the h(x) value for row #1, we compare it with its corresponding y value. The comparison is done with the help of OLS Cost Function –

J = 1/(2m) * ∑ ( h(x) – y )²

This J value is the cost of using the set of coefficients that are plugged into h(x). The objective is to find the set of coefficients that minimizes this cost function J. The higher the cost, the more the parameter values need to be changed in order to bring it down. If you look at the formula above, OLS calculates the squared error of each and every example and sums them up. There is also the aspect of averaging the errors over all the examples, which is why we divide by m, the number of records in the data. The extra 1/2 is there purely to make the derivative cleaner, don’t worry about it.
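
Here is how that cost function looks in code – a small sketch with a made-up design matrix and made-up β guesses, just to show that different coefficient choices produce different costs:

```python
import numpy as np

def ols_cost(X_design, y, beta):
    """J(beta) = 1/(2m) * sum((h(x) - y)^2), with h(x) = X_design @ beta."""
    m = len(y)
    residuals = X_design @ beta - y
    return (residuals @ residuals) / (2 * m)

# Tiny made-up example: one feature (house size in 100s of sq. ft.) plus the x0 = 1 column
X_design = np.array([[1.0, 12.0],
                     [1.0, 10.5]])
y = np.array([400.0, 300.0])  # prices in $1000s

print(ols_cost(X_design, y, np.array([0.0, 30.0])))  # cost for one guess of beta
print(ols_cost(X_design, y, np.array([0.0, 33.0])))  # a different guess gives a different cost
```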

So, how do we find the minimum? Well, there are a few possible ways. If you have had some experience in calculus, you will already know that one way to minimize a function is to take its derivative and set it to zero. For the ones who do not have a liking for calculus, skip the pdf below; but for the ones who like to live dangerously, here is the differentiation of the OLS function (they haven’t used the averaging or the multiplication by 1/2, but that is all okay, as the squared difference between h(x) and y remains the same) –

OLS-Derivation

So, as we can see, we take the derivative and find the values for all the parameters that give the minimum value of the cost function J. This way, the linear regression algorithm will produce one of the best-fitted models on this data.
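
Setting that derivative to zero leads to a closed-form solution, often written as the normal equation, β = (XᵀX)⁻¹Xᵀy. Below is a small sketch on synthetic data (the features, noise level, and “true” coefficients are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))               # two made-up features
X_design = np.hstack([np.ones((50, 1)), X])        # add the x0 = 1 column
true_beta = np.array([5.0, 2.0, -1.0])             # "true" coefficients we hope to recover
y = X_design @ true_beta + rng.normal(0, 0.5, 50)  # targets with a little noise

# Solve (X^T X) beta = X^T y directly; np.linalg.lstsq would solve the same
# least-squares problem in a more numerically stable way on larger problems.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat)  # should land close to [5, 2, -1]
```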

Gradient Descent

Gradient Descent is another cool optimization algorithm for minimizing the cost function. It is one of the most widely used optimizers and is applied to many other machine learning algorithms as well. It is especially helpful when we have a large number of parameters.

The algorithm works in a very sensible manner. We start with a random set of parameters (the β values) and then work our way towards a more optimal set of parameter values relative to this randomly chosen starting point. The random initialization is not always ideal, but it does take us to a point where J, the cost function, reaches at least a local minimum.

https://gfycat.com/angryinconsequentialdiplodocus
Gradient Descent in play | Image from Machine Learning by Andrew Ng

The gif (still don’t know whether it is called gif or gif if you know what I mean) above puts us on a 3D contour plot. The parameters (θ here) are taken as the axes and the cost is calculated and then plotted as the contour. These are the steps followed –

  1. Start with some random values for the θ parameters and calculate their cost (Gradient Descent is an optimization algorithm, so we apply the same cost function as before) –
    J = 1/(2m) * ∑ ( h(x) – y )²
  2. At each step of gradient descent, we look around to find a more optimal point, that is, values of the parameters which reduce the cost. We decide which way to go to reach the bottom of the plot quickly, while only taking baby steps. Hence, you can see that with each step of gradient descent we come down a slope and reach a point of minimum cost in the blue region. At that point, we say that the algorithm has converged.

As you can imagine, that point of minima in the blue region might not always be the point where the algorithm achieves global minima, that is, the point in the entire space where the cost will be the lowest. And the fact that we begin with random initialization is majorly the reason behind this. The algorithm can be encapsulated like this –

repeat until convergence {

θj = θj – α ( ∂J(θ) / ∂θj )

}

Again, for people who have a background in calculus, this would make a lot of sense. To change the value of our parameters so as to reduce the cost, we are finding the partial derivative of the cost function J with respect to each θ and then subtracting a portion of this calculated value from the original parameter. α here is the learning rate. Although covering all the derivation behind gradient descent is beyond the scope of this article, I would like to provide you with an intuition of the algorithm. This would be extremely helpful regardless of whether or not you are skilled in calculus. Keep reading!
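
Before we build that intuition, here is the update rule as code – a minimal sketch of batch gradient descent for linear regression on synthetic data (roughly y = 4 + 3x plus noise); the learning rate and iteration count are just values that happen to work for this toy example:

```python
import numpy as np

def gradient_descent(X_design, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent for linear regression.
    Cost: J(theta) = 1/(2m) * sum((h(x) - y)^2); its gradient is (1/m) * X^T (X theta - y)."""
    m, n = X_design.shape
    theta = np.zeros(n)                   # start from some initial theta (zeros here)
    for _ in range(n_iters):
        error = X_design @ theta - y      # h(x) - y for every record
        grad = (X_design.T @ error) / m   # dJ/dtheta_j for every j at once
        theta = theta - alpha * grad      # the update rule from above
    return theta

# Made-up data: y is roughly 4 + 3x with a little noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 100)
y = 4 + 3 * x + rng.normal(0, 0.3, 100)
X_design = np.column_stack([np.ones_like(x), x])

print(gradient_descent(X_design, y))  # should be close to [4, 3]
```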

Intuition

For the purpose of this explanation, let us look at how only one parameter affects the cost (Gradient Descent actually works with all the θ parameters affecting the cost at once). Let me start with a simpler definition of a derivative – when we take the derivative at a point on a plot with respect to something, we are simply finding the slope of the tangent line passing through that particular point. Does that make things simpler? I think we can do better with visualization, like all things in data science.

Plotting cost v/s θ | Image by Author

As we can see in this graph, there are two points at which slopes have been calculated – A and B.

At B, if we draw a tangent (black line), we will see that it slopes upward. We say that the slope of the tangent is positive. This is also the slope given by the derivative of the cost function with respect to θ, evaluated at B.

Similarly at A, if we find the derivative of the cost function, or if we simply find the slope of the tangent at A, we will get a negative value. This is because the tangent slopes downward.

So we have established that at A, we will get a negative derivative value and at B, we will get a positive derivative value. If we plug these values back into our equation –

At A and all points around it –

θj = θj – α (some negative value)

This implies that we are increasing the value of θ.

At B and all points around it –

θj = θj – α (some positive value)

This implies that we are decreasing the value of θ.

The algorithm runs until convergence, the point at which the change, i.e. the slope of the tangent, is practically zero. We can tell from the graph above that convergence will happen at the bottom of the curve. The tangent at that point will be a flat horizontal line. And, as is clear, the cost decreases with each step.

Role of Learning Rate

We can see that the above calculation has another factor affecting how much we increase or decrease the θ value – learning rate α. This directly affects how much we allow the derivative to change the current θ value. Usually, the values can be in the range of (1e-03, 1e-01).

If we do not include the learning rate (that would mean α = 1), the steps taken by gradient descent will be large. What I mean is that the change in θ values will be large. Gradient descent is more likely to overshoot in this scenario and may not converge to the minimum for a long, long time. In order to take baby steps in the right direction, we include this hyperparameter called the learning rate. It takes just a fraction of the derivative value each step, allowing small and steady changes and a more controlled progression of the algorithm.

But we still have to test out and find the right value for the learning rate. It can get too small or too large.

If the learning rate is too small, the algorithm will take minuscule steps and take a LOT of time to converge. This will result in an unnecessary increase in computation resources.

If the learning rate is too large, the algorithm will take large steps and consistently overshoot the minima. This will again result in a LOT of iterations and computation resources to converge, if it even converges.
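
You can see all three behaviours with a toy one-parameter cost function. This sketch minimizes the made-up cost J(θ) = (θ – 3)², whose gradient is 2(θ – 3) and whose minimum sits at θ = 3, using three different learning rates:

```python
def run_gd(alpha, n_iters=50):
    """Gradient descent on the toy cost J(theta) = (theta - 3)^2, gradient 2*(theta - 3)."""
    theta = 0.0
    for _ in range(n_iters):
        theta -= alpha * 2 * (theta - 3)
    return theta

print(run_gd(alpha=0.001))  # too small: after 50 steps theta is still far from 3
print(run_gd(alpha=0.1))    # reasonable: theta ends up very close to the minimum at 3
print(run_gd(alpha=1.1))    # too large: every step overshoots and theta moves further away
```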

To know more about the derivations in detail, check out these notes from Andrew Ng’s Machine Learning class at Stanford University: CS229

Applications of Linear Regression

Although Linear Regression is simple compared to other algorithms, it is still one of the most powerful ones. Certain attributes of this algorithm, such as explainability and ease of implementation, make it one of the most widely used algorithms in the business world.

There are several use-cases and applications of Linear Regression. Some of the common ones –

  1. Forecasting a revenue figure based on past performance: for example, you need to know how much revenue your firm will be able to generate based on how it has performed over the last 12 months.
  2. Understanding the impact of certain programmes and marketing campaigns and generating insights
  3. Predicting some continuous numerical figure, like a student’s marks, based on hours studied and other factors
  4. House Price prediction based on features such as house size, number of rooms, garage, area etc.
  5. Understanding the impact of individual features and attributes on the machine learning outcome – explainability through feature importance.

We have covered a lot of material in this post, and I hope you were able to grasp at least the essential concepts behind linear regression and how it works. Trust me, this is not all of it – we can go on and on about the various different things at play here. Let me know in the comments if there is anything in particular that you’d like me to cover!

As always, kindly like and subscribe to The Data Science Portal and share the article if you liked it!
Stay tuned, there’s a lot more to come!