Preparing data for analysis

Before we apply any learning technique, we must first prep our data (assuming we have data). You will almost never receive data that is 100% ready to use. Even when I receive datasets from my data engineering peers, the data often requires hours of additional cleanup.

In this part of the series, we’ll discuss cleaning up data and getting it ready. That doesn’t just mean fixing 'data typos', but also engineering different data types into something more malleable.

Understanding Dimensions and Measures

Datasets come in many shapes and sizes. You will be hard pressed to find a structure that works for data engineers and scientists across all domains. What is good for analysis may not be good for instrumentation (instrumentation is the professional word for data capture). Data engineers have a hard job. They have to build a path between a customer product and our databases without dropping, corrupting, or otherwise messing up the data; oh, and all of this often happens at the scale of thousands of gigabytes per minute. Read: a nearly impossible task.

So it should be no surprise that our data often arrives unsuitable for immediate use. To make sense of our messy data, we need to understand the anatomy of datasets.

More often than not, your data will come in the form of rows and columns, just like a spreadsheet. Rows are often individual records, or one set of observations, like a single person in a group or an octopus in a reef. Each column represents an attribute or 'feature' of that record. An octopus row might have 'species', 'sex', 'length', 'weight', or 'beak size' as attributes.

Each attribute might have a data type like number, date, or text. When writing code, it is important to understand different data types. But since this is a code-free course, we can simplify to just dimensions and measures.

Dimensions are the attributes that describe or categorize a record, like species or sex; they are the fields we group and filter by rather than add up. Measures are a bit more intuitive. Measures are things that are measured. Whoa, Elliott! Slow down. Weight, length, and beak size are all things that could change over the course of an octopus' lifetime. Typically, a measure is something that can be aggregated, meaning we could count it up, add or subtract it, or find a center value like the average.
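This course stays code-free, but for the curious, here is a tiny optional pandas sketch (with invented octopus numbers) showing how the two behave differently: measures get aggregated, while dimensions get grouped and filtered.

```python
import pandas as pd

# A tiny, made-up octopus dataset: each row is one record (one octopus).
octopuses = pd.DataFrame({
    "species":   ["giant pacific", "common", "common"],  # dimension
    "sex":       ["F", "M", "F"],                        # dimension
    "length_cm": [400.0, 90.0, 85.0],                    # measure
    "weight_kg": [50.0, 9.0, 8.5],                       # measure
})

# Measures can be aggregated...
print(octopuses["weight_kg"].mean())

# ...while dimensions are what we group and filter by.
print(octopuses.groupby("species")["weight_kg"].mean())
```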

Messy data

With a basic understanding of dataset anatomy, we can start to untangle the mess of data we receive. It is pretty common to receive data that has a strange structure, missing values, or typos. Our next steps will help address these issues to make our algorithms perform smoothly and, generally, make the process a lot more fun.

Tidy data

Hadley Wickham, the Tom Hanks of Data Science, wrote a fantastic paper on Tidy Data describing the many ways data gets messy and how to tidy it. The basic premise is that every record should be represented by a single row, every column makes up a single attribute (like name or height), and every cell is a value. This structures our data in a way that is easy to analyze and computation friendly.
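To make this concrete, here is a hypothetical sketch of tidying a 'wide' table, where one attribute (year) has been spread across the column headers; the reef counts are invented.

```python
import pandas as pd

# Messy: one row per reef, with a separate column per measurement year.
messy = pd.DataFrame({
    "reef": ["north", "south"],
    "2022": [12, 7],
    "2023": [15, 9],
})

# Tidy: one row per (reef, year) observation, one column per attribute.
tidy = messy.melt(id_vars="reef", var_name="year", value_name="octopus_count")
print(tidy)
```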

Tidy data pt. 2

With each row a record, we also give an end user of this data an understanding of the data's 'grain'. Grain is a term used to describe how we distinguish one record from another. In some simple cases, it might be a single column called 'id' that increments with each new record. More likely, it will be a combination of dimensions.

Think for a minute about how we uniquely identify people in society. In the USA we have a social security number, but that is hidden in most cases. Instead, we refer to people by their first and last name, age, and zip code. There might be millions of Elliotts and hundreds of Elliott Whitlings, but only one who is also living near Portland, Oregon, and in his thirties.

Understanding the data grain is key when prepping your data, and tidy data goes hand-in-hand with knowing and sharing your grain.
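As an illustrative sketch, here is one way you might check a grain assumption in pandas; the people and the grain columns are invented for the example.

```python
import pandas as pd

# Hypothetical people data where the assumed grain is (first_name, last_name, zip_code).
people = pd.DataFrame({
    "first_name": ["Elliott", "Elliott", "Maria"],
    "last_name":  ["Whitling", "Smith", "Lopez"],
    "zip_code":   ["97201", "10001", "97201"],
    "age":        [34, 52, 29],
})

grain = ["first_name", "last_name", "zip_code"]

# If any combination of the grain columns repeats, the grain is wrong
# (or the data contains duplicates that need cleaning).
assert not people.duplicated(subset=grain).any(), "grain is not unique"
```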

Encoding data

One of the most common tasks is data encoding. This turns either a dimension or a measure into a discrete flag, most often 0 or 1 (but occasionally True or False). Generally, a simple rule divides the data up: for measures it is often a greater-than or less-than statement, and for dimensions it is often a match on a string value.

By encoding data, we now have machine-intelligible fields to target. Since machine learning and statistics are math-based, feeding raw text to an algorithm is fruitless. Encoded data lets the machine think natively.

One additional benefit is that zeros and ones are wonderfully fast to aggregate. For example, imagine you want to know how many dogs are in a dataset. With unencoded data, you would have to count each row that equals 'dog'. With encoded data, you just sum the column, which is much faster for most data manipulation tools and languages.
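A minimal sketch of both ideas, using made-up animal data in pandas:

```python
import pandas as pd

animals = pd.DataFrame({"animal": ["dog", "cat", "dog", "parrot"]})  # made-up data

# Rule-based flag: 1 when the string matches, 0 otherwise.
animals["is_dog"] = (animals["animal"] == "dog").astype(int)

# Counting dogs is now a simple sum of the encoded column.
print(animals["is_dog"].sum())  # 2

# One-hot encoding expands a dimension into one flag column per value.
print(pd.get_dummies(animals["animal"], prefix="is"))
```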

Binning

Binning is the cousin of encoding. This technique changes a continuous value into a discrete one by grouping values into buckets, like turning exact ages into ranges such as 'child', 'young adult', 'adult', and 'senior'.
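For instance, a quick pandas sketch that bins some made-up ages into labeled groups:

```python
import pandas as pd

ages = pd.Series([4, 17, 23, 41, 68])  # made-up ages (a continuous measure)

# Cut the continuous values into labeled bins.
age_groups = pd.cut(ages, bins=[0, 17, 34, 64, 120],
                    labels=["child", "young adult", "adult", "senior"])
print(age_groups)
```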

Imputation

Earlier I mentioned that we sometimes have missing or bad data. There are two general approaches to dealing with this issue. Probably the most common is to throw the bad data out. This works really well if you have a huge dataset and can afford to lose the bad records.

Sometimes, in midsized datasets, it is possible to recover those records by using an imputation technique. There are a few approaches but, in general, we find records that are similar in other ways to our bad record. For example, if we are missing the number of goals per game an athlete makes, we might grab data from athletes with a similar league, position, years of experience, and height (obviously, this is all sport dependent). We could grab the nearby records and weight their goals-per-game metric so that the athletes closest to the bad record have the most impact. We then synthesize a new value for our bad record by weighting the goals-per-game of all the nearby athletes.
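One possible implementation of this idea is scikit-learn's KNNImputer, which fills a missing value from the most similar records and can weight closer records more heavily; the athlete numbers below are invented.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up athlete stats; one record is missing goals_per_game.
athletes = pd.DataFrame({
    "years_experience": [2, 3, 3, 10],
    "height_cm":        [170, 172, 171, 190],
    "goals_per_game":   [0.4, 0.5, None, 1.1],
})

# Fill the gap from nearby records, weighting the closest athletes most heavily.
imputer = KNNImputer(n_neighbors=2, weights="distance")
filled = pd.DataFrame(imputer.fit_transform(athletes), columns=athletes.columns)
print(filled)
```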

This will only give us a decent guess. Sometimes those guesses are really valuable, but occasionally they can skew the data. Imagine if we are comparing a world-class player against grade school equivalents: our imputation technique would give a wildly wrong answer. Any time there is no nearby data, or the record represents an outlier, we need to be careful.

Normalization

Normalization is a great tool to make a field much more manageable. Simply put, it recalculates the data to make the minimum value 0 (or sometimes -1) and the max value 1. Normalization not only makes most computations much easier but it also creates a contextual standard. Knowing something is normalized means any audience knows that a value close to 1 is a high value. Compare that to 322 — what does that number mean? Is it 322 out of 323 or out of 87 million?

I really like normalizing my data, but you do need to be careful as you are changing the units of the data. A normalized fuel mileage value is no longer in miles per gallon and will need to be converted back if you are using that value as a new input. Don't worry: this is an easy process, and I find it is usually worth the tradeoff.
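A minimal sketch of min-max normalization on made-up fuel mileage values, including the conversion back to the original units:

```python
import pandas as pd

mpg = pd.Series([18.0, 24.0, 31.0, 45.0])  # made-up fuel mileage values

# Min-max normalization: the smallest value becomes 0, the largest becomes 1.
mpg_norm = (mpg - mpg.min()) / (mpg.max() - mpg.min())
print(mpg_norm)

# Converting back to the original units is just the reverse calculation.
mpg_restored = mpg_norm * (mpg.max() - mpg.min()) + mpg.min()
print(mpg_restored)
```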

Standardization

Standardization takes data transformation to the next level. Standardization is a process that sets the average of the dataset to zero and shapes the data to have a standard deviation of 1. The math of this process isn't super important, but it is fairly straightforward: subtract the mean from each value, then divide by the standard deviation.
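In code, that one line of math looks like this (the values are made up):

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

# Standardization (z-score): subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()

print(round(standardized.mean(), 10))  # approximately 0
print(round(standardized.std(), 10))   # 1.0
```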

In theory, standardization is a great tool for clustering or other unsupervised learning algorithms (we'll talk about the difference between unsupervised and supervised learning later). In practice, I have found standardization to be a rarely utilized tool in my data prep tool belt. Normalization is a much more common transformation in application, although I'm sure other experts have diverging experiences.

Training and testing data

One of the final tasks when preparing data is to split the data into testing and training data. This only applies when we need to use a supervised machine learning algorithm (we will get to those in a few chapters). The key idea of splitting up our data is to create two sets of data. We'll use one to teach our model. The other is a way to make sure our teaching worked.

We don't have a perfect formula for how we create these two data sets, but the general best practice is to randomly assign two-thirds of the data to training and the rest to testing. While folks might squabble over the ratios or the assignment technique, the key is to make both sets representative. We want the training set to have enough data that the model can learn comprehensively, but we don't want the model to simply memorize everything.
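A common way to do this in practice is scikit-learn's train_test_split; here is a sketch using a made-up dataset and a one-third test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset of records we want to learn from.
data = pd.DataFrame({
    "length_cm": [90, 85, 400, 60, 75, 110],
    "weight_kg": [9.0, 8.5, 50.0, 4.0, 6.5, 12.0],
})

# Randomly hold out one-third of the rows for testing.
train, test = train_test_split(data, test_size=1/3, random_state=42)
print(len(train), len(test))  # 4 2
```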

Over- and underfitting

Memorization is bad for a few reasons but, practically speaking, it means our model is not generalizable. Data scientists focus a lot on 'generalization' because it describes whether our model works in the real world. A model that is too simplistic (underfit) will provide overly vague, low-confidence answers in the real world, while an overfit (way too specific) model makes bad predictions on new data.

Underfit models are generally a symptom of missing something in the experimental design; perhaps a key metric was not measured. Meanwhile, overfitting is the result of using too many parameters in our model: the data scientist threw every possible measurement at the model. Imagine trying to predict the weather by factoring into your calculation how many pinecones were stepped on by dogs in public parks. That measurement is way too convoluted and specific to help us make a good guess. Remember, models are not reality, only useful approximations!

The goal is to generate a model that lands in a Goldilocks zone: it provides accurate answers when new data is fed in, neither over-simplifying nor overcomplicating. On many occasions, you can tune your model to a good fit by reducing parameters or adjusting the metrics used. Sometimes it is as easy as changing the model type itself (from linear to quadratic, for example).
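As a rough illustration, here is a sketch (with made-up, curved data) comparing a linear fit to a quadratic fit; the simpler model underfits this particular data.

```python
import numpy as np

# Made-up data that curves upward: a straight line will underfit it.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 4.1, 8.9, 16.2, 24.8, 36.1])

linear = np.polyfit(x, y, deg=1)     # two parameters
quadratic = np.polyfit(x, y, deg=2)  # three parameters

# Compare how far each model's predictions fall from the observed values.
for name, coeffs in [("linear", linear), ("quadratic", quadratic)]:
    error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(name, round(error, 3))
```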

Connecting back to MLD

Preparing data is the junction between measurement and learning. It isn't the most glamorous part of a data worker's job, but it is impossible to avoid. Folks will complain that, in practice, data prep takes up more time than the actual analysis. This is partially true, but data prep is not just a necessary step before we apply our learning tools; it is also an opportunity to familiarize ourselves with the data.