You’ve probably heard already that as a data scientist you really need to know the data you’re working with. And for good reason. Understanding a dataset before you start working with it is one of the most important parts of a data scientist’s job.
It is pretty vague when you just say “know” your data. What does it mean to know your data? What does it entail exactly?
Let me break it down for you and give you some of the things that come to my mind when I tell someone they need to know their data. I'll group these into four categories: overall data knowledge, understanding of features, awareness of potential problems, and data format.
This is the big picture. Looking at the data as a whole will give us better insight into what might go wrong and what to look out for when doing data exploration and cleaning. Not all of these will apply to every project, but some examples of things you should consider are:
1. An automated script was collecting the data and it was down on certain days, so there might be gaps in the data.
2. There was a character limit on the text field that held the news text content, so some data points were cut off before the full text was collected.
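Both of these examples can be checked with a few lines of code. The following is only a minimal sketch in plain Python: the function names, the dates and the texts are made up for illustration, and it assumes one timestamp per collected day. The idea is to list the calendar days with no records, and to count how many texts sit exactly at the longest observed length, which can hint at a character limit.

```python
from datetime import date, timedelta


def find_missing_days(collected_days):
    """Return the days between the first and last observation that have no records."""
    present = set(collected_days)
    first, last = min(present), max(present)
    total_days = (last - first).days + 1
    all_days = (first + timedelta(days=i) for i in range(total_days))
    return [d for d in all_days if d not in present]


def count_at_length_limit(texts):
    """Return the longest text length and how many texts have exactly that length."""
    longest = max(len(t) for t in texts)
    return longest, sum(1 for t in texts if len(t) == longest)


# Hypothetical data: the collector was down on March 3rd and 4th.
days = [date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 5)]
missing = find_missing_days(days)

# Hypothetical texts: two of them stop at exactly 20 characters.
texts = ["short", "x" * 20, "y" * 20, "medium text"]
longest, n_at_limit = count_at_length_limit(texts)
```

If many texts share the exact same maximum length, that is worth a closer look: it does not prove truncation, but it is a cheap signal to follow up on.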
On this level, you get more granular and consider each feature separately. Looking at the features closely will give you valuable insights into the problem you’re working on. Many important decisions about the design of the solution are made on this level.
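A simple way to start looking at features one by one is to build a small profile for each of them. The sketch below is one possible approach, not a fixed recipe: `summarize_feature` is a hypothetical helper, the rows are invented, and missing values are assumed to be stored as `None`.

```python
def summarize_feature(values):
    """Build a small profile of one feature: size, missing values, distinct values, range."""
    present = [v for v in values if v is not None]
    summary = {
        "n_rows": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
    }
    # A range only makes sense when every present value is numeric.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
    return summary


# Hypothetical rows with one numeric and one categorical feature.
rows = [
    {"age": 34, "country": "NL"},
    {"age": None, "country": "NL"},
    {"age": 51, "country": "DE"},
]
features = {name: summarize_feature([r[name] for r in rows]) for name in rows[0]}
```

Reading such a profile next to what you know about how the feature was collected is usually where the interesting questions come up.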
There can be problems with the dataset as a whole or with individual features. It is crucial for the success of any project to be aware of them, address them and make sure they are contained.
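Which problems matter depends on the project, but two that are cheap to check for are duplicated records and values outside a plausible range. Here is a rough sketch under stated assumptions: the sensor readings, the field names and the temperature bounds are all hypothetical, and the helpers only flag candidates for you to inspect.

```python
from collections import Counter


def find_duplicates(rows, key_fields):
    """Return the key combinations that occur in more than one row."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return [key for key, n in counts.items() if n > 1]


def find_out_of_range(rows, field, low, high):
    """Return the indices of rows whose value is present but outside [low, high]."""
    return [
        i
        for i, r in enumerate(rows)
        if r[field] is not None and not (low <= r[field] <= high)
    ]


# Hypothetical readings: one repeated record and one implausible temperature.
readings = [
    {"sensor": "A", "day": "2021-03-01", "temp_c": 12.5},
    {"sensor": "A", "day": "2021-03-01", "temp_c": 12.5},
    {"sensor": "B", "day": "2021-03-01", "temp_c": 999.0},
    {"sensor": "B", "day": "2021-03-02", "temp_c": None},
]
duplicates = find_duplicates(readings, ["sensor", "day"])
suspicious = find_out_of_range(readings, "temp_c", -50.0, 60.0)
```

Whether a flagged row is a real error or a legitimate extreme is a judgment call that needs knowledge of the domain, so treat the output as a to-do list rather than a list of rows to drop.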
Format of the data is the overall shape the data comes in. It is common to change the frequency or granularity of the data. The main questions to address here are:
Not all of these points will be applicable to all projects and for some, there might be other things you need to consider. But I think this list is a good starting point.
Data science is not an exact science and does not have rules to follow. It is very important to keep an open mind for possible pitfalls, problems and patterns when exploring and getting to know your data. And that is one of the charms of data science: the unique problems you face with every new dataset.
So next time you start a project, don’t just quickly clean your data and jump to modelling. Make sure that you can answer most of these questions. Go as granular as you can and understand how each feature works alone and how they work together. You'll see that it makes a big difference.