Just as humans can tell jam from jelly (or can we? 😉), machines can now label and categorize the objects they see with the help of machine learning. But they must be trained for it. Let’s see what that means…
In this article I want to go over a very important topic in Machine Learning – Supervised Learning. This is the most popular form of machine learning used in the industry. After reading this post:
- You will understand what Supervised Learning is and how it works
- You will get to know about the types of Supervised Learning
- You will get to know about some example algorithms and real-world applications
Let’s start with the basics and quickly go over the definition of Machine Learning, just to cover all the bases.
Machine Learning
Machine Learning is a field of study concerned with building systems or programs that can learn without being explicitly programmed. Machine learning systems take in large amounts of data, learn patterns from it, and use those patterns to make predictions on never-seen-before data.
Here is a popular definition of machine learning:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tom Mitchell
I know this is difficult to fully comprehend, so let me break it down into simpler terms. Think of the experience as data, the task as predicting something with that data, and the performance measure as the answer to the question of whether the prediction is actually correct.
Let’s pick an example. Suppose we are trying to build the next-generation spam filter system for Google, to be used directly in Gmail. Here the experience would be millions and millions of emails, the task would be predicting whether a particular email is spam or not spam, and the performance would be measured by checking whether the system’s predictions were actually correct.
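To make the "performance measure" part concrete, here is a minimal sketch of one possible measure: accuracy, the fraction of predictions that match the true answer. The five emails and their predictions are made up for illustration, and a real spam filter would likely track other measures as well.

```python
def accuracy(predictions, true_labels):
    """Performance measure P: fraction of predictions that match the true labels."""
    if not true_labels or len(predictions) != len(true_labels):
        raise ValueError("need two non-empty lists of the same length")
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


# Hypothetical output of a spam filter on five emails, next to the true answers.
true_labels = ["spam", "not spam", "spam", "not spam", "not spam"]
predictions = ["spam", "not spam", "not spam", "not spam", "not spam"]

print(accuracy(predictions, true_labels))  # 4 of 5 correct -> 0.8
```

In Mitchell's terms, the system "learns" if this number goes up as it sees more emails.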
Machine Learning Model
A machine learning model is an algorithm that has been trained on some particular kind of historical data so that it can predict something when applied to never-seen-before data. The prediction could be a class label, a numeric value, or maybe even some interesting patterns in the data that lead to impactful insights.
The task of a machine learning model depends entirely on the problem at hand, which also decides what kind of data we are going to use. The machine learning problem we are trying to solve also dictates how we approach it. Two questions help here:
Do we have an output in the training data?
If yes, what kind of output data? Discrete classes or numeric values?
A training record is composed of features. As the name suggests, they are attributes of the data we are dealing with – a characteristic or a property of the object that the data is about.
A label tells the machine learning model whether the thing it is supposed to look for in new data is present in this particular training record – it is what we are predicting. Labels are discrete values which the machine learning model can predict for never-seen-before data. For such machine learning problems, features are the input and labels are the output.
Another way a machine learning model can work is by predicting a numerical value. Suppose we are working with car data and have car prices from the last 10 years. The data contains features like the company, year of manufacture, power, car type, etc., and of course the car price as the output. In this case, we will build a machine learning model which takes in all those features and tells us the price of a new car.
Back to our Gmail spam filter: we would train the machine learning model with millions and millions of emails. In this situation the features would be the email subject, the email body, the From field, etc., and each and every email would carry a label of “spam” or “not spam”. This way the model can learn which emails to let through and which to filter out.
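As a rough illustration of what such labeled training records could look like in code, here is a small sketch. The `EmailRecord` class, its field names, and the two sample emails are all hypothetical, chosen only to mirror the features and labels described above.

```python
from dataclasses import dataclass


@dataclass
class EmailRecord:
    # Features: the inputs the model gets to look at.
    subject: str
    body: str
    sender: str
    # Label: the output we want the model to learn to predict.
    label: str  # "spam" or "not spam"


# Two made-up training records.
training_data = [
    EmailRecord("You won a prize!!!", "Click here to claim your reward",
                "promo@example.com", "spam"),
    EmailRecord("Meeting tomorrow", "Can we move the call to 10 am?",
                "colleague@example.com", "not spam"),
]

# Supervised learning needs the inputs and the outputs side by side.
features = [(r.subject, r.body, r.sender) for r in training_data]
labels = [r.label for r in training_data]

print(labels)  # ['spam', 'not spam']
```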
And of course there would be a lot of data preprocessing to convert the text and the rest of the email content into something the machine learning model understands, such as an encoding or an embedding.
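One of the simplest encodings is a bag-of-words count vector: build a vocabulary from the training texts, then represent each text by how often each vocabulary word appears in it. The sketch below is a bare-bones version under that assumption; the helper names `tokenize`, `build_vocabulary`, and `encode` are invented for this example, and production systems usually rely on a library or on learned embeddings instead.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(texts):
    """Assign each distinct word an index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in tokenize(text):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn a text into a vector of word counts; unknown words are ignored."""
    vector = [0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


emails = ["Win a free prize", "Free lunch at the meeting"]
vocab = build_vocabulary(emails)
vectors = [encode(email, vocab) for email in emails]

print(vectors[0])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(vectors[1])  # [0, 0, 1, 0, 1, 1, 1, 1]
```

Both vectors have the same length, which is what lets a model compare emails of different sizes.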
Machine Learning can broadly take two forms:
- Supervised Machine Learning
- Unsupervised Learning
This article addresses only Supervised Learning, but don’t worry: as you scroll down you will find a link to an article dedicated to Unsupervised Learning as well 🙂
Supervised Learning
Supervised learning is a form of machine learning in which both the input and the output for our machine learning model are available to us, that is, we know what the output should look like by simply looking at the dataset. The name “supervised” refers to the fact that these known outputs guide the training: the data contains a relationship between the input features and their respective outputs. The aim of any supervised learning algorithm we implement is to estimate that relationship so that it can predict the output for new but similar never-seen-before data.
For example, identifying whether an orange is present in an image is something a machine learning model can handle. Another problem, maybe a slightly more useful one, could be identifying whether a certain piece of text contains profanity.
You can see that these two problems are clearly very different. Let’s look at the following table:
Problem | Data | Label | Features |
---|---|---|---|
Orange Detection | Images | Yes / No | Pixel data extracted from images |
Profanity Detection | Text | Clean / Dirty | Encoded vectors from input text |
But at the same time, these two problems are very similar… how so?
In both of these situations, we will be training a machine learning model with data in which each training record contains a label along with the actual data. That label tells us whether the orange is present in the image (Yes / No), or whether profanity is present in that particular text (Clean / Dirty). In other words, the machine learning model is supposed to choose the outcome from a known set of possible outcomes. This set of possible outcomes is formed by the set of labels present in the data. The model tries to learn the relationship between the input features and the output label during its training.
- In the spam detection problem, the model will analyze the new email and give out a label of “Spam” or “Not Spam” for it
- In the orange detection problem, the model will analyze the new image and tell us if an orange is present in the image – “Yes” – or not – “No”
Now if we revisit our car price prediction problem from before, we will notice that it is also somewhat similar. Here too, the data contains each car with its own features like the company, year of manufacture, etc., and along with that, the price. In this case, the machine learning model is supposed to estimate or predict the price of a new car based on the relationship it learns from that historical data during its training.
Based on this, let us now move on to the last part of this article. Supervised Machine Learning problems can be of two types:
- Classification
- Regression
Classification
The spam filter, the orange detection problem, and the profanity detection problem are machine learning problems in which we have well-defined, discrete labels as output. So the machine learning model only has to tell us that label based on what it learns from historical data during its training. This type of supervised learning is called Classification.
Those discrete labels are often called classes, and any such supervised machine learning problem is called a Classification Problem. Some of the most popular use-cases of machine learning are classification problems, and because of that some of the most widely implemented machine learning algorithms are classification algorithms. To name a few:
- Naive Bayes Classifier
- K-Nearest Neighbour
- Logistic Regression
- Support Vector Machines
- Decision Trees
- Random Forest
- Neural Networks
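To give a feel for how a classifier picks a class, here is a minimal from-scratch sketch of K-Nearest Neighbour from the list above: the model looks at the k training records closest to a new record and returns the most common label among them. The two numeric features (number of exclamation marks and number of links in an email) and all the data points are invented for this example; in practice you would normally use a library implementation.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Pair every training record's Euclidean distance to the query with its label.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data: (exclamation marks, links) for each email.
train_features = [(5, 3), (4, 4), (6, 2), (0, 0), (1, 0), (0, 1)]
train_labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

print(knn_predict(train_features, train_labels, (5, 2)))  # spam
print(knn_predict(train_features, train_labels, (1, 1)))  # not spam
```

Note that the output is always one of the labels seen during training, which is exactly what makes this a classification problem.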
Regression
The car price prediction problem from before is a machine learning problem in which we did not have discrete labels or classes. Instead, we had continuous numeric values: the price of each car. By getting trained on historical car price data, the machine learning model will learn the relationship between the car features and their prices. It will then be able to predict the price of a new car by looking at its features.
So in this case, we have a continuous output variable, a numeric value which depends on the features. One of the most talked-about use-cases for a supervised regression problem is stock price prediction. Although the perfect data set to train a model is very hard to find, people use regression techniques on sample data to get a rough estimate of real-world situations and do better in the stock market. Some regression algorithms:
- Linear Regression
- Multivariate Regression
- LASSO Regression
- Ridge Regression
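As a small illustration of regression, the sketch below fits a straight line with ordinary least squares, the simplest form of Linear Regression, using a single feature. The feature (car age in years), the prices (in thousands), and the function names are assumptions made up for this example; a real car price model would use many more features and a library implementation.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up historical data: car age in years and price in thousands.
ages = [1, 2, 3, 4, 5]
prices = [27, 24, 21, 18, 15]

slope, intercept = fit_line(ages, prices)


def predict_price(age):
    """Predict a continuous value (the price) for a never-seen-before age."""
    return slope * age + intercept


print(slope, intercept)   # -3.0 30.0
print(predict_price(6))   # 12.0
```

Unlike the classifier, the prediction here is not picked from a fixed set of labels: any age gives back a number on the fitted line.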
I hope this article provided you with some clarity on the topic of Supervised Machine Learning. It is a very important topic in Data Science and Machine Learning, and it is more understandable and explainable than some of the other cool stuff out there. Explainability of an ML model is highly desirable in the business world, as a lot of money gets invested and the model outputs are expected to be understandable not only by the business but also by the customers. Kindly like / subscribe / share to The Data Science Portal if you liked the article and want to see more of such content!
Thank you for reading!