Data Scientist — The sexiest job of the 21st Century

Harvard Business Review

I am sure you have heard this even if you are not a data-head like I am. The phrase comes from a widely read Harvard Business Review article by Thomas H. Davenport and DJ Patil, in which they discuss how data is becoming more accessible and easier to tame as technology focused on big data emerges. Every major company today is devising a data science strategy to improve its services: Amazon with its recommender engine, Facebook with its feed ranking, Google with its whole suite of products, and many others.
To understand all that cool stuff, let us first go through the basics of Data Science and break down the big terms you will see if you google “Data Science”.

What qualifies as Data?

Let us start by defining the crux of it all: data.

Data is a collection of information, usually composed of pieces which may or may not be in the same format.

Let us take the case of a really, really popular service: Netflix. People love watching movies, but everyone has their own preferences. Say we are trying to build a system which recommends movies to people based on their taste. Some like action, some like thrillers, and others prefer horror, sci-fi or comedy. This makes the problem really interesting, and you want your system to get it right. You don’t want to recommend me a chick flick when I like action movies, do you?

Even a human who wanted to suggest a movie to me would first want to know about my taste, my preferences, my likes and dislikes. So you would ask me questions:
“Do you like this, or do you like that? What do you think about this? I see, you prefer movies which have an interesting storyline and, of course, action. Based on this, I would suggest you watch these movies.”

As is evident, you will first ask about my preferences before suggesting something to me, and with this exercise you are collecting data.

Which movie do you like the most? | Image by Author

The recommender system works the same way: we train it with the data we gather about the movie tastes of individual people. This is similar to instructing the system which type of movies to suggest whenever it sees a certain pattern. For example, if it sees Iron Man, Spider-Man, and Captain America, Thor might be a good suggestion. What do you think?
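To make the "pattern" idea concrete, here is a minimal sketch of one very simple way a system could arrive at that suggestion: look at people whose viewing history overlaps with yours and count what else they watched. The viewer names, the tiny `histories` dataset and the `recommend` function are all made up for illustration; real recommender engines are far more sophisticated than this.

```python
from collections import Counter

# Hypothetical viewing histories; the names and data are illustrative only.
histories = {
    "alice": {"Iron Man", "Spider-Man", "Captain America", "Thor"},
    "bob": {"Iron Man", "Captain America", "Thor"},
    "carol": {"Spider-Man", "Thor", "Iron Man"},
    "dave": {"The Notebook", "La La Land"},
}


def recommend(watched, histories, top_n=1):
    """Suggest unseen movies watched by people with overlapping taste."""
    scores = Counter()
    for other in histories.values():
        overlap = len(watched & other)  # number of movies in common
        if overlap == 0:
            continue  # no shared taste, ignore this person
        for movie in other - watched:
            # Weight each unseen movie by how similar the other viewer is.
            scores[movie] += overlap
    return [movie for movie, _ in scores.most_common(top_n)]


suggestions = recommend({"Iron Man", "Spider-Man", "Captain America"}, histories)
print(suggestions)
```

With this toy data, the only unseen movie among viewers who share your taste is Thor, so that is what gets suggested.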

What is Big Data?

For any data science problem, it is essential that we are able to gather a good amount of relevant data.

When we talk about Big Data, we usually mean datasets gathered in the manner described above, but so voluminous that managing and modelling them with commonly used software and tools takes more time and computational power than is tolerable. Hence the addition of the term BIG.
There is no fixed size threshold for big data; datasets can range from terabytes to many zettabytes. The data can be structured or unstructured, and it is difficult to process using traditional database and software techniques.

Characteristics of Big Data

There are several ways to define big data in terms of its characteristics, the most common being the popular 4 Vs of Big Data.

Volume

This is the quantity of the data, the justification for adding ‘Big’. Volume determines how much information and insight can be extracted from the data. Powerful systems typically need lots and lots of data to train a model that performs exceptionally well. But while having that amount of data is amazing, you should always make sure the data is rich enough to be useful. Years and years of movie data on a person is good in general, but it may not be so useful if that data is from that person’s childhood: we would end up including kids’ movies when we want to suggest movies to a grown-up!
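As a rough illustration of the childhood-data point, the sketch below keeps only viewing records from the last few years before they are used for training. The `records` list, the `recent_records` helper and the five-year cutoff are assumptions made for this example; the right window depends entirely on the problem.

```python
from datetime import date, timedelta

# Hypothetical records: (title, date watched). Values are illustrative only.
records = [
    ("Kids Cartoon Movie", date(2004, 6, 1)),
    ("Action Movie A", date(2022, 3, 15)),
    ("Thriller B", date(2023, 11, 2)),
]


def recent_records(records, today, max_age_years=5):
    """Keep only records watched within roughly the last max_age_years."""
    # Approximate a year as 365 days; good enough for a coarse cutoff.
    cutoff = today - timedelta(days=365 * max_age_years)
    return [(title, watched) for title, watched in records if watched >= cutoff]


for title, watched in recent_records(records, today=date(2024, 1, 1)):
    print(title, watched)
```

More data is only helpful when it still reflects the person we are recommending to, so a filter like this trades volume for relevance.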

Variety

This property relates to the nature and type of the data we are dealing with. We could be dealing with text, images, audio, video or maybe time-stamped events — structured, unstructured, semi-structured and even complex-structured data. The analysis, insights and the processes we derive and implement are highly dependent on the variety of data. This is because having a wide variety of data requires an equally wide variety of approaches for storing and processing that data. 

Veracity

Incoming data, however huge in size, is not of much use if it is not correct and relevant. The quality of the data we receive hugely affects the insights we derive, as well as our analysis results.
For instance, let us go back to the movie recommendation engine. The data included in the analysis must be correct, so our final dataset should not contain TV shows. We are recommending movies to a person, and having TV shows in that person’s records is not only unnecessary but also incorrect. TV shows can be categorized by genre, but that is exactly where the similarities with movies run out. They have a completely different structure: episodes, which may run 20 minutes or longer, grouped together in sequential seasons.
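In practice, a check like this often comes down to a simple filtering step before the analysis. Here is a minimal sketch, assuming each record carries a hypothetical `kind` field that tells movies and TV shows apart; the `catalog` entries are invented for illustration.

```python
# Hypothetical catalog records; the "kind" field is an assumption for this example.
catalog = [
    {"title": "Movie A", "kind": "movie", "genre": "action"},
    {"title": "Show B", "kind": "tv_show", "genre": "action"},
    {"title": "Movie C", "kind": "movie", "genre": "comedy"},
]


def keep_movies(items):
    """Drop anything that is not a movie, even if it shares a genre with one."""
    return [item for item in items if item["kind"] == "movie"]


movies_only = keep_movies(catalog)
print([item["title"] for item in movies_only])
```

Note that Show B shares a genre with Movie A, yet it is still removed, because genre alone does not make it a valid record for a movie recommender.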

Velocity

The speed at which data is generated and fed into our analysis systems largely guides how we handle and process it. Big Data is often available in real time, and the frequency at which this massive amount of data is generated, and in turn handled, are two of the most important factors driving the analysis. We want the latest data to train our recommender system, and we want to give out suggestions as quickly as possible!

Data Science

Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe

Wikipedia

I am sure you have seen topics like Data Analytics, Machine Learning, Artificial Intelligence, Natural Language Processing, and so on. Data Science can be called an umbrella term for all these fields: all of them involve data which first has to go through a data preprocessing pipeline. The pipeline converts the data into a format suitable for running our analysis models. If you want to know more about this, check out this article: Data Preprocessing.

Why is it science, you ask?

When solving a problem in data science, one draws on statistics, machine learning, data analysis and other related fields in order to analyze and understand actual phenomena using relevant data, and then presents the findings to the world. This is much like what happens in an actual scientific experiment. The problem is first observed, and then appropriate hypotheses are developed and tested over and over again to refine them and get better results. The correctness of any result is judged according to the type of problem we are trying to solve, in terms of problem-specific metrics and KPIs. A data science experiment follows exactly this pattern: there is a strong likeness in framing the problem, experimenting to generate understanding, and communicating that understanding to the rest of the world.
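To give a feel for what a problem-specific metric can look like, here is a small sketch of precision at k for a movie recommender: out of the top k suggestions, what fraction did the person actually like? The `precision_at_k` function and the sample lists are illustrative assumptions, and this is only one of many metrics one might choose.

```python
def precision_at_k(recommended, liked, k):
    """Fraction of the top-k recommendations that the person actually liked."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for movie in top_k if movie in liked)
    return hits / k


# Hypothetical output of a recommender and the movies the person really liked.
recommended = ["Thor", "The Notebook", "Black Panther", "Ant-Man"]
liked = {"Thor", "Black Panther"}

score = precision_at_k(recommended, liked, k=4)
print(score)
```

Each new hypothesis, such as a different way of scoring movies, can be compared against the previous one using the same metric, which is what makes the experimentation systematic.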


In this article, I wanted to give an introduction to Data Science and talk a little bit about Big Data in general. I hope you liked the article!
Kindly share / like / subscribe to thedatascienceportal. See you in the next post!