Have you ever played one of those finish-the-pattern games? You are given a row of six shapes in different colors and have to guess what the seventh will be. It is fascinating that even though we have never seen that exact pattern before, we are still able to guess it correctly (well, most of the time…).
What is happening here is that we are performing pattern recognition based entirely on what we see, without any given rules to direct or streamline our judgement. We go through the data we are given, form our own rules, and then use them to guess what the next shape and color are going to be.
In the last post, we discussed Supervised Machine Learning in detail. We went over how it works and saw some really cool applications as well.
In this article, I want to go over another form of machine learning called Unsupervised Machine Learning which deals with patterns and unlabelled data. After reading this article:
- You will be able to understand what Unsupervised Machine Learning is and how it works
- You will know about some of the most important Unsupervised Machine Learning algorithms
- You will see some exciting real-world applications of Unsupervised Machine Learning
Let's start with the basics and quickly go over the definition of Machine Learning, just to cover all the bases.
Machine Learning
Machine Learning is a field of study concerned with building systems or programs which have the ability to learn without being explicitly programmed. Machine learning systems take in huge amounts of data and learn patterns and labels from it, in order to predict information about never-seen-before data.
Here is a popular definition of machine learning:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tom Mitchell
I know this is difficult to fully comprehend, so let me break it down into simpler terms. Think of the experience as data, the task as predicting something with that data, and the performance measure as the answer to the question of whether the prediction is actually correct.
Let’s pick an example. Suppose you are running the marketing division for a high fashion clothing brand (How about Calvin Klein?). You want to launch a digital campaign for a new line-up of beachwear in about a month’s time. Obviously, like any other campaign, this one has a budget and you want to make the best of it. So what do you do?
Well, one way of course would be to simply launch the line-up and release ads. This would reach all existing customers who have bought something from Calvin Klein previously, as well as new customers who happen to see the advertisements. So, do we exhaust our budget on some cool visible-to-all advertisements?
After all, the more the viewership, the better right? ….Right?
Well….not really. Turns out there is a better way if we treat this as an unsupervised machine learning problem. We can keep the advertising channels the same, but utilize them a little differently. We can look at past advertising results and see what kind of patterns emerge from there. For instance, assuming we have access to such information, we can look at the engagement data to answer a few questions:
- Which type of digital advertising led to the most engagement? – Text / Image / Video?
- Which age group does the line-up appeal to?
- Which locations are the most popular?
- Which price range or segment was most popular?
- Which products were most viewed or put in wishlists?
- Which were the highest selling products?
One thing you should realise at this point is that the data we are dealing with does not contain any labels, tags or classes. The data is as-is and we are just trying to make some sense out of it – this data is the “experience” from the definition earlier. Several things can be done here using different types of unsupervised machine learning techniques, and their results are the outcomes or predictions. For example, by clustering the customers we can form customer groups and extract information like –
- Which group has the maximum amount of engagement with which channel of digital advertising?
- How are people split based on age?
- How are they split on spending?
- How are they split on location?
- What are the top products for each such customer group?
- In which customer groups is the beachwear category more popular?
- For beach category products, what are the most common locations for our customers in each group?
- Are they closer to the beaches?
- How is their engagement with previous advertisements?
Knowing such things will help immensely in streamlining our advertising campaign and giving it a sense of direction. We will be able to target customers with custom campaigns for each group and expect a much higher response rate than we would get by going ahead with the older one-size-fits-all advertisements.
Machine Learning Model
A machine learning model is an algorithm which has been trained on a particular kind of historical data to predict something when applied to never-seen-before data. That something could be a class label, a numeric value, or even interesting patterns in the data that lead to impactful insights.
This task of a machine learning model depends entirely on the problem at hand, which also decides what kind of data we are going to use. The machine learning problem we are trying to solve also dictates how we actually approach the problem.
Do we have an output in the training data?
If yes, what kind of output data? Discrete classes or numeric values?
A training record is comprised of features. As the name suggests, they are attributes of the data we are dealing with – a characteristic or a property of the object that the data is about.
A label tells the machine learning model whether the thing it is supposed to look for in new data is actually present in a particular training record – it is what we are predicting. Labels are discrete values which the machine learning model can predict for never-seen-before data. For such machine learning problems, features are the input and labels are the output.
Another way a machine learning model can work is by understanding a pattern. The data does not always contain labels as described above. Sometimes the data is all we have, and the aim is to categorize it and provide the labels. The machine learning model in this case will go through the data and cluster it into several groups based on closeness and similarity.
And of course there would be a lot of data preprocessing to convert the data into something the machine learning model understands. Definitely check out these articles if you would like to go deeper into the definition of machine learning or data preprocessing –
Machine Learning can be in two forms:
- Supervised Machine Learning
- Unsupervised Machine Learning
The scope of this article is to address only Unsupervised Learning, but don’t worry as you scroll down you will find a link to an article dedicated to Supervised Learning as well 🙂
Unsupervised Machine Learning
Unsupervised Machine Learning is a form of machine learning in which the labels, classes or, basically, the target variable values are not available to us. The problems usually look like: here’s some data, can you make some sense out of it?
And although it might not make much sense to you right now ( 😀 ), this is usually how an unsupervised learning algorithm works. It goes through the data and finds patterns and associations among the data points, as there are no target variables that the data points can be classified into. Several closeness and similarity metrics are calculated so that data points which are similar get grouped or clustered together, while the ones which are not end up in different clusters.
The visual above shows how an unsupervised learning algorithm works. We start off with all the data points being the same color, but as we iterate over the data points again and again we find that some data points are quite similar and some are quite different from each other, resulting in all of them being placed in several clusters. These algorithms are quite intuitive and, as you can see, even points close to each other in this space might fall into different clusters.
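To make “closeness” a little more concrete, here is a minimal sketch (assuming scikit-learn and NumPy, with three made-up points) of the kind of distance and similarity computations such algorithms rely on:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

# Three made-up data points (rows) with two features each
points = np.array([
    [1.0, 2.0],
    [1.2, 1.9],   # very close to the first point
    [8.0, 9.0],   # far away from the other two
])

# Pairwise Euclidean distances: small values mean "close"
print(euclidean_distances(points))

# Pairwise cosine similarity: values near 1 mean "pointing the same way"
print(cosine_similarity(points))
```

Whichever metric is chosen, the idea is the same: points that score as close end up in the same cluster.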
If you look at the definition of Machine Learning above by Tom Mitchell again, you will see that there is also one performance measure which helps us estimate how well the machine learning model is working.
In the case of supervised learning, these performance measures usually revolve around the accuracy of predictions – verifying that the model gives the expected prediction for a known input. For example, consider a spam email detection system. We can measure its performance by simply taking an email which we already know is spam and trying to predict its label. If it comes out as spam, our model works fine; if it doesn’t, we need to make some adjustments.
But in the case of unsupervised learning, labels are not available, so performance cannot be measured as cleanly. Consider the same spam email detection system – the model will try to understand the patterns and underlying structure of the emails and will group similar-looking emails together. A bit of post-processing will be required to actually verify how well the model is working. But there is a BIG advantage to using such unsupervised methods – let’s carry on!
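As one hedged illustration of that post-processing step, internal metrics such as the silhouette score can grade a grouping without ever looking at labels. A minimal sketch with scikit-learn, using synthetic data purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, unlabelled data standing in for whatever numeric features we extracted
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Try a few cluster counts and score each grouping without using any labels at all
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # closer to 1 = tighter, better-separated clusters
```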
Unsupervised Machine Learning helps us find all kinds of patterns in the data in the absence of labels, and this property is super helpful and very much applicable in the real world. In fact, one of the most widely used implementations of unsupervised machine learning is anomaly detection. Here’s why – unsupervised learning methods are better than supervised learning methods at finding new patterns in unseen future data, and are therefore more adept at handling events which might be fraudulent. Fraudsters are modifying their ways all the time in attempts to hit you with new scams and viruses, so unsupervised learning methods shine here, as they are able to pick up fraudulent activities by analyzing real-time or near real-time data.
Along with this, these unsupervised algorithms may also find other interesting patterns which may very well help you in categorizing activities – for instance, going back to the spam email detection problem, unsupervised algorithms may even help us categorize our emails as “family”, “work” or something else, simply because those emails will be grouped separately.
So for problems where patterns are constantly changing, are relatively unknown, or for which we do not have enough labelled data, unsupervised learning algorithms are the way to go!
Let’s move on to the next part of this article now and take a closer look at a few unsupervised learning algorithms.
- Dimensionality Reduction
- Clustering
Dimensionality Reduction
This family of algorithms deserves an article of its own, but let’s go through it and try to keep things simple. The features of a dataset are also referred to as its dimensions – which in simple words means the columns of your dataset. Now, as the name of this family of algorithms suggests, you must be thinking: why are we trying to reduce the number of features?
Curse of Dimensionality
This expression was coined by Richard E. Bellman and refers to various unfavourable phenomena that come up when dealing with data which has a high number of features. Let’s break this down into easier terms.
Having such a high number of features means that the number of possible variations in the data increases many times over as well. What does this mean?
Consider having 3 features, with each one of them having 2 possible values – True or False. How many records do we need to cover all possible variations?
| Feature 1 | Feature 2 | Feature 3 |
|-----------|-----------|-----------|
| False     | False     | False     |
| False     | False     | True      |
| False     | True      | False     |
| False     | True      | True      |
| True      | False     | False     |
| True      | False     | True      |
| True      | True      | False     |
| True      | True      | True      |
We can see that we need 2³ = 8 records to cover all such variations. And this is an easy case – imagine what happens if, say, we have 220 features, with 100 of them having 3 unique values and the remaining 120 having 5 unique values. Then we would need at least 3¹⁰⁰ × 5¹²⁰ records!
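Just to make that counting concrete, here is a tiny sketch that enumerates the 2³ combinations (the three boolean features are hypothetical):

```python
from itertools import product

# All possible combinations of 3 boolean features: 2 * 2 * 2 = 8 rows
combinations = list(product([False, True], repeat=3))
for row in combinations:
    print(row)
print(len(combinations))  # 8
```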
As we can see, the number of records must increase quite drastically in order to provide a dataset which has relevant information. Otherwise, this leads to two major problems –
- Computational Challenges: Such high-dimensional data is very difficult to handle, as performing numerical computations on it is super expensive. It is equally challenging to gather such data, and because the number of records doesn’t usually increase as much as it should along with the number of features, in most cases we have to deal with sparse datasets, that is, empty values for many, many features. Sparsity leads to issues such as unreliable and ineffective statistical results. Both time and space complexities take a major hit, and even the most optimized algorithms fail to work properly. Just the sheer size of such a dataset causes all these issues!
- Generalization Challenges: With such a high-dimensional dataset and such a huge number of features, an issue called overfitting greatly affects model predictions. During training, the model goes through the training dataset and something called the peaking phenomenon occurs – the predictive power of the machine learning model increases with the number of features, but only up to a certain point, the peak, after which the predictions become worse and worse. As mentioned above, the increase in the number of features leads to sparsity, due to which it becomes relatively easy to find a “pattern” that the model memorizes so well that it is not able to generalize when it sees new data. In other words, the model practically memorizes the training data and is not able to predict correctly on new, never-seen-before data.
The basic aim of these algorithms is to reduce the number of features, bringing the dataset down to only its most important and relevant ones. The most widely used dimensionality reduction algorithms are listed below, followed by a short code sketch –
- Principal Component Analysis
- Singular Value Decomposition
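As hinted above, here is a minimal PCA sketch with scikit-learn; the dataset is synthetic and the 95% variance threshold is just an illustrative choice, not a rule:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 200 records with 50 correlated columns driven by 5 hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))                         # 5 underlying factors
mixing = rng.normal(size=(5, 50))                          # spread across 50 observed features
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))     # observed data plus a little noise

# Keep only enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # far fewer columns after reduction
print(pca.explained_variance_ratio_[:5])       # how much variance each top component explains
```

Because the 50 columns were generated from only 5 factors, PCA can throw most of them away with almost no loss of information.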
Clustering
This family of unsupervised learning algorithms works by grouping data into several clusters depending on pre-defined functions of similarity and closeness. Similar items or data records are clustered together in one cluster, while records with different properties are put into separate clusters.
This is similar to the animation we saw before, and shows how a form of clustering called K-Means clustering works. Centroids, which represent the centers of the clusters, are initialized at random positions, and at every iteration the closeness and similarity metrics decide where those centroids need to move. There are two optimization objectives working at the same time to make this happen (a small code sketch follows the list below) –
- Within any cluster, the sum of the distances between its centroid and its data points, called the intra-cluster distance, is minimized
- The distance between the centroids of different clusters, called the inter-cluster distance, is maximized
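Here is the promised sketch – a minimal K-Means example with scikit-learn, where the synthetic points and the choice of three clusters are assumptions made purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points with no labels attached
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=7)

# K-Means iteratively moves 3 centroids until the cluster assignments stop changing
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.inertia_)           # sum of squared intra-cluster distances (what gets minimized)
```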
There are several other ways to cluster data points like this, and the most widely used among them are –
- Connectivity-Based Clustering – Hierarchical Clustering
- Centroid-Based Clustering – K-Means, K-Medoids
- Distribution-Based Clustering – Gaussian, Binomial
- Density-Based Clustering
Applications of Unsupervised Machine Learning
There are several applications of unsupervised machine learning, a few of which are used every day in super crucial decision-making processes, for example in banks and hospitals. A few of the most popular ones are given below –
Customer Segmentation
Unsupervised Learning is very prominently used in customer segmentation to better understand the types of customers one might have. Due to the sheer number of customers and the huge variability in their behaviour, knowing what might please which customer is something that is getting more and more attention.
Customer segmentation is the process of grouping customers into several groups or clusters based on their characteristics, such as demographics, spending patterns, and likes and dislikes. This is super helpful when you want to target your customers according to these characteristics. Like the beachwear campaign above, another example could be a study of what types of cars people prefer, using data from a car dealership – it would help the dealership point each customer to the right car through digital advertisements.
This is also a very reliable way of knowing who your most valuable customers are – you can target them best with exclusive offers and whatnot. Along with this, customer segmentation models may also give information about the customers who never return – this helps in keeping and growing your customer base, as you can now apply the model to judge a new customer and use special marketing techniques to make them come back.
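As a rough sketch of what such a segmentation could look like in code, assuming we have already assembled a small per-customer table (the column names and the choice of three segments are invented for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer summary; in practice this comes from purchase and engagement history
customers = pd.DataFrame({
    "age":          [22, 25, 41, 38, 63, 59, 30, 45],
    "yearly_spend": [300, 450, 2200, 1800, 900, 750, 500, 2500],
    "ad_clicks":    [12, 18, 3, 5, 1, 2, 15, 4],
})

# Scale first so that spend (in dollars) does not dominate age and click counts
X = StandardScaler().fit_transform(customers)

# Assume three segments purely for illustration
customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Average profile of each segment, e.g. "young, low spend, high engagement"
print(customers.groupby("segment").mean())
```

Each segment’s average profile then becomes the starting point for a targeted campaign.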
Anomaly Detection
Unsupervised Machine Learning is used in Anomaly Detection in a very interesting manner. As explained above, dimensionality reduction is the process of reducing the number of features and extracting the ones which are most relevant. There is an industry-standard way to grade the newly formed or selected feature set – how about we reconstruct the original dataset from this new, reduced feature set?
If the dimensionality reduction algorithm actually understood the data properly and reduced it appropriately, then going back from the reduced features should give us a dataset which is reasonably close to the original one. The way to validate the reconstructed data is to use some metric which measures the difference between the original values and the reconstructed ones. If the dimensionality reduction worked fine, the metric should report very low differences. But what do you think happens for the observations, or data records, which do not follow the general pattern observed in the dataset?
What if we are dealing with the data comprised of credit card transactions? Do you think a fraudulent transaction will follow the general pattern of the credit card user?
Highly unlikely!
And so, after reconstruction, the difference metric will give a high value for such records, as the reconstruction is dominated by the general pattern and trends of the whole dataset. That is a clear indication that the transaction should be flagged as suspicious.
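A minimal sketch of that reconstruction idea, using PCA on synthetic “transactions” with one injected outlier (the shapes and numbers here are arbitrary illustration, not a production recipe):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic "transactions": most rows follow a common low-dimensional pattern
rng = np.random.default_rng(1)
normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))   # 500 records driven by 2 factors
outlier = rng.normal(loc=8.0, size=(1, 10))                     # one record that breaks the pattern
X = StandardScaler().fit_transform(np.vstack([normal, outlier]))

# Compress to 2 components, then map back to the original 10 features
pca = PCA(n_components=2).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# Reconstruction error per record: the odd transaction should stand out with a large error
errors = np.mean((X - X_reconstructed) ** 2, axis=1)
print("typical error:", errors[:-1].mean(), " outlier error:", errors[-1])
```

Records whose reconstruction error sits far above the typical value are the ones worth flagging for review.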
Feature Extraction
This is a direct application of dimensionality reduction algorithms – and as the name suggests, feature extraction is basically reducing the number of features in a dataset. It is usually done before any other kind of analysis, when a dataset is so big in terms of features that it is difficult to comprehend or run any sort of machine learning algorithm on it.
Some direct use cases are –
- Image dataset – image pixels are considered the features here and naturally they are huge in number
- Video dataset – image pixels are taken from frames and so they pose a similar problem
- Music dataset – features could be frequency or any other audio signal
- Textual dataset – features could be words or characters even
In all the above use cases, there is a need to perform feature extraction to cut down the number of features and the noise, so that any data analysis or machine learning model applied afterwards focuses only on the most relevant data.
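For instance, here is a hedged sketch for the textual case, where every distinct word starts out as a feature and we squeeze the matrix down before any further modelling (the example sentences are made up):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus; every distinct word becomes a feature (column)
docs = [
    "summer beachwear sale on swim shorts",
    "new swim shorts and beach towels in stock",
    "quarterly earnings report for the fashion brand",
    "marketing budget report for the digital campaign",
]

tfidf = TfidfVectorizer().fit_transform(docs)          # sparse matrix: documents x vocabulary
svd = TruncatedSVD(n_components=2, random_state=0)     # keep just 2 latent "topics"
compact = svd.fit_transform(tfidf)

print(tfidf.shape, "->", compact.shape)
print(compact)   # the two beach-related documents should land close together
```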
I hope this article provided you with some clarity on the topic of Unsupervised Machine Learning. It is a very important topic in Data Science and Machine Learning. Explainability of a machine learning model is highly desirable in the business world, as a lot of money gets invested and it is expected that the model outputs are understandable not only by the business but also by the customers. So the more models you build, the more you have to make sure that the concepts are understandable and explainable to all kinds of people – technical or non-technical. Kindly like / subscribe / share The Data Science Portal if you liked the article and want to see more such content!
Thank you for reading!