Data cleaning is messy. It is also perpetual. If you handle raw data, the chances are that it requires cleaning. You often hear that data scientists and data analysts spend up to 80% of their time cleaning data. For a beginner, however, it isn't always clear how to go about it. What counts as clean? What follows are some techniques and concepts that I have learnt on the matter so far.
Before starting, I’d like to give credit where credit is due. Much of this was learnt from a companion to Lecture 4 of Harvard CS109: Data Science by Chris Beaumont. Thanks Chris!
Let’s start by taking a look at the play_scraper library. This is a helpful library that allows you to scrape the Google Play Store for data. It’s straightforward and not very time intensive. The documentation is simple, and after a pip install we are good to go.
Although the purpose of this blog post is to discuss data cleaning methods, in reality the purpose is to gain insight from data. Of course, doing that requires some understanding of WHAT can be gleaned from the data, so let’s go ahead and take a look.
Part 1: Consolidation and HDFStore.
It is as simple as importing play_scraper and calling one of its methods. I called the collection method, which, given the various parameters, returns a detailed list of ‘Trending’ applications from the UK. I chose detailed as it gives us more scope to flex our data cleaning muscles.
The result is a list of dictionaries, which can easily be passed into a pandas DataFrame. Just like that, we have completed the first step towards clean data: consolidating it into a DataFrame.
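The call and the consolidation step might look like the sketch below. The collection name, country code, and result count are assumptions on my part, and the sample records stand in for real scraped output:

```python
import pandas as pd

def fetch_trending(n=60):
    # Deferred import so the sketch runs even without the library installed.
    # Parameter values here are assumptions, not the exact call I used.
    import play_scraper
    return play_scraper.collection(
        collection='TRENDING', detailed=True, gl='uk', results=n)

# Consolidation: a list of dictionaries drops straight into a DataFrame.
records = [{'title': 'Hole.io', 'score': 4.5},
           {'title': 'Candy Swipe', 'score': 3.9}]
df = pd.DataFrame(records)
```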
Or almost. As we explore the data, we will be manipulating it, and that may result in an accidental mistake or corruption. You may have noticed that it took 30.3 seconds of wall time to scrape the data. That is too long to reprocess, if need be. With this in mind, let’s glance at HDFStore.
HDFStore allows us to store and retrieve pre-processed data when needed. Here is how you store the data:
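A minimal storage sketch (the file name and key are hypothetical; HDF5 support in pandas requires the PyTables package):

```python
import pandas as pd

def save_raw(df, path='playstore.h5', key='raw'):
    """Persist the scraped DataFrame so we never have to re-scrape it."""
    with pd.HDFStore(path) as store:
        store[key] = df  # shorthand for store.put(key, df)
```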
We can now access the data when needed, next time we open up Jupyter Notebook or if we corrupt the data. Here is how you access it:
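Retrieval is the mirror image (again, path and key are assumptions; `pd.read_hdf(path, key)` is an equivalent one-liner):

```python
import pandas as pd

def load_raw(path='playstore.h5', key='raw'):
    """Read the stored DataFrame back into memory."""
    with pd.HDFStore(path) as store:
        return store[key]
```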
Part 2: Cleaning the Data
Not all data is retrieved in the same condition. This makes an algorithmic approach quite inappropriate; there is too much variety in the presentation of data. Rather, I think it is more helpful to follow a heuristic approach to cleaning a DataFrame. This will provide goals towards which you can work and will help direct you if you are unsure how to proceed. However, it is important to stress that it won’t always be necessary or desirable to accomplish all of these points; ultimately you will have to decide what is and what isn’t necessary.
In general, the properties of a clean DataFrame are fourfold:
1) Each row describes a single object.
2) Each column describes a property of that object.
3) Columns are numeric where appropriate.
4) The information contained within columns cannot be further reduced.
Conveniently, play_scraper already ensures that each row describes a single object, so we will move on to work with the columns.
Each Column Describes a Property of that Object.
Calling the .info() method on the raw DataFrame reveals what we are working with: 30 columns x 60 rows. Broadly speaking, the cleaning for this task falls into 3 categories:
1) Dropping useless or duplicate columns.
2) Splitting the columns into irreducible information.
3) Cleaning up the appropriate data types.
It is also worth noting that this data contains only 60 observations. That is not really enough to make an informed analysis with, so we will look to automate the process to increase efficiency with future batches of data.
Dropping Useless or Duplicate Columns and Missingno
Without pre-existing knowledge of the domain, there is no better way of finding out which columns are duplicate or useless than scanning through them.
A useful tool here is Missingno, a library that allows you to quickly visualise the missing data of a data set.
We soon see that the ‘description’ and ‘description_html’ columns contain no data. We can keep this in mind when we consider which columns we want to drop.
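A sketch of the inspection. The `msno.matrix` call draws the nullity matrix; the pandas lines underneath reach the same conclusion without a plot (the sample frame is illustrative):

```python
import pandas as pd

def plot_missing(df):
    # Deferred import: requires `pip install missingno`
    import missingno as msno
    msno.matrix(df)  # white bands mark missing values

# Plot-free check: which columns are entirely null?
sample = pd.DataFrame({'title': ['Hole.io', 'Candy Swipe'],
                       'description': [None, None],
                       'description_html': [None, None]})
empty_cols = sample.columns[sample.isna().all()].tolist()
```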
The numerous columns relating to developers provide an example of the problem of duplicate columns.
Each of these is a unique string associated with the developer of the app. This poses some dangers to the statistical methods we will run on the data set after cleaning it. The first relates to collinearity: these variables are nominal – they refer to a unique characteristic of the object under study, but they do not express any other characteristic. They are not ordinal, not a ratio, not continuous. As such, each of the developer-related variables carries no more information for the purpose of a statistical analysis than any other. If we were to keep the lot of them, any correlation with the dependent variable would be split between them. Certain methods, like the Random Forest, would deal with this more robustly than others; however, this takes us to the second problem: computing time.
In the current situation, of a DataFrame of 30x60, it is unlikely that we are going to run into any issues with regards to computation while conducting a statistical analysis. It is also unlikely that we are going to uncover any meaningful statistical relationships with a DataFrame of this size. As I have already mentioned, in the cleaning of a DataFrame, it is always worth keeping an eye on the next step of an analysis. In the likely situation that we are going to require larger amounts of data, it will be useful to streamline the process and to reduce any extra computation.
With this in mind, here is the function that removes the columns in question.
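Something along these lines, where the column names in the drop list are assumptions based on the columns discussed above:

```python
import pandas as pd

# Columns judged empty or duplicate; the names are assumptions.
DROP_COLS = ['description', 'description_html',
             'developer_id', 'developer_url', 'developer_email']

def drop_useless(df, cols=DROP_COLS):
    """Drop the listed columns, silently skipping any that are absent,
    so the same function works on future batches of data."""
    return df.drop(columns=[c for c in cols if c in df.columns])

sample = pd.DataFrame({'title': ['Hole.io'],
                       'description': [None],
                       'developer_id': ['abc123']})
cleaned = drop_useless(sample)  # only 'title' survives
```

Skipping absent columns keeps the function reusable: a future batch with a slightly different shape won’t raise a KeyError.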
Splitting the Columns into Irreducible Information
Whereas the last section dealt with nominal variables, this section is interested in dealing with categorical variables.
The Content Rating column provides a good example of this. There is a lot of potentially useful data here, but it is difficult to access in its current form. This becomes clear when we look at those instances that have two categories.
In the case of Hole.io, we have an observation that falls into two categories: PEGI 12 and Moderate Violence. Although a human can easily read this, at present an algorithm would read this combination as a single unique category (a unique list of strings).
This may or may not be significant, but by splitting the categories the algorithm will gain a better understanding of each category’s predictive power.
A solution to the problem of categorical variables is a method known as one hot encoding, which represents the variables as columns of binary values.
In the above example, this would result in Hole.io having a 1 (or True) value in both the PEGI12 column and the Moderate Violence column. This allows for more powerful predictions and a better understanding of feature importance. Here is an example of it applied to the Category column.
Here are some of the functions I wrote to do this.
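A sketch of one way to do it. `pd.get_dummies` covers a single-label column like Category; the helper below handles multi-label cells like Content Rating, assuming each cell holds a list of category strings:

```python
import pandas as pd

def one_hot_lists(df, col):
    """One-hot encode a column whose cells are lists of category strings."""
    # explode() gives one row per category, str.get_dummies() binarises,
    # and the groupby collapses back to one row per original observation.
    dummies = df[col].explode().str.get_dummies().groupby(level=0).max()
    return df.drop(columns=col).join(dummies)

sample = pd.DataFrame({
    'title': ['Hole.io', 'Candy Swipe'],
    'content_rating': [['PEGI 12', 'Moderate Violence'], ['PEGI 3']],
})
encoded = one_hot_lists(sample, 'content_rating')
# Hole.io ends up with a 1 in both 'PEGI 12' and 'Moderate Violence'.
```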
Columns Are Numeric Where Appropriate and The Information Cannot Be Reduced Any Further
This one is self-explanatory. If our purpose were to predict the number of installs of an app based on the data gathered, it would be helpful to have the number of installs available. At present they are stored as strings. While it is true that an algorithm could predict these values as categories, the values would lose all ordinal meaning. The algorithm would be unable to say that ‘5000+’ is smaller than ‘500,000+’, only that each is associated with a certain set of predictors.
Likewise, this applies to score and a number of other columns. The following function deals with cleaning the data types.
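A sketch of the idea (column names are assumptions; the key move is stripping the ‘+’ and thousands separators before casting):

```python
import pandas as pd

def clean_dtypes(df):
    """Convert numeric-looking string columns to real numbers."""
    out = df.copy()
    if 'installs' in out:
        # '5,000+' -> 5000: strip '+' and ',' then cast to int
        out['installs'] = (out['installs']
                           .str.replace(r'[+,]', '', regex=True)
                           .astype(int))
    if 'score' in out:
        out['score'] = pd.to_numeric(out['score'], errors='coerce')
    return out

sample = pd.DataFrame({'installs': ['5,000+', '500,000+'],
                       'score': ['4.5', '3.9']})
cleaned = clean_dtypes(sample)
```

Note that the cast keeps the ordinal information: 5000 < 500000 is now something an algorithm can use, at the cost of treating ‘5,000+’ as exactly 5000.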
Conclusion and A Note On Automation
It is at this point one can start considering the next ‘steps’ of an exploratory analysis. This would include the exploration of the global properties of the data, as well as group properties within the data. Ideally, this would be furnished with visualisations and descriptive statistics. But the reality is, in the uncertain realm of exploratory data analysis, unforeseen problems can easily arise.
Take, for instance, my decision to select the data set on the basis of ‘popularity’. Suppose that, having cleaned my data set, I suddenly became aware of the implications of this selection bias. Had I approached the problem of data cleaning haphazardly, I would have had to laboriously translate my previous code to a new set of data, in a frustratingly time-consuming manner. As it happens, I went about the cleaning process with automation in mind, and the functions written for cleaning will be equally applicable to any new data from the same source. A cleaning process approached in this manner helps enormously when the inevitable road-blocks arise.