Missing Values and Duplicated Data
Missing values and duplicate data are two more factors that degrade data quality. This video introduces both concepts and walks you through the process of dealing with these issues.
What You'll Learn
> Origin and impact of missing values
> Getting rid of missing values in a data set
> Handling duplication in data
Another issue that shows up very frequently is missing values. Sometimes, values are missing because the information was never collected. Whether you’re looking at census data or survey data in particular, people will often decline to give their age and weight, or decline to give their annual income, so you just have missing values. Other times, the attributes you’re collecting may not be applicable to all cases, right? If a survey asks for the annual income of each member of a household, the children in the household don’t have an annual income; the question doesn’t apply, so you just code that as a missing value.
We’ll talk a lot more about handling missing values when we get to data pre-processing, but the fundamental ways to handle them are these. We can throw out every data object that has any missing values. We can estimate the missing values using the mean, the median, or something else. With some algorithms (but not all), we can ignore missing values on a row-by-row basis. Or we can throw the attribute out entirely, which is something we might want to do when an attribute is, say, 80% missing; at that point we probably just want to drop that column.
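The strategies above can be sketched in pandas. This is a minimal illustration, not the course's own code; the column names and values are made up, and the 80% threshold matches the example in the text.

```python
import pandas as pd
import numpy as np

# Hypothetical survey data with missing entries coded as NaN
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 31, np.nan],
    "income": [52000, 61000, np.nan, 45000, 58000],
})

# Option 1: throw out every row that has any missing value
dropped = df.dropna()

# Option 2: estimate missing values with the column mean (or median)
imputed = df.fillna(df.mean())

# Option 3: drop any attribute that is mostly (here, >80%) missing
mostly_missing = df.columns[df.isna().mean() > 0.8]
trimmed = df.drop(columns=mostly_missing)
```

Note that option 3 keeps both columns here, since neither exceeds the 80% threshold; the choice between these options depends on how much data you can afford to lose versus how much bias imputation might introduce.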
One of the other things you can do with some algorithms is to replace missing values adaptively. This happens a lot with categorical attributes: you count the probability of each attribute value appearing over your whole dataset, and then replace the missing values in such a way that those probabilities don’t change. We’ll talk a little more about that when we get to pre-processing; for now, I just want to cover the basics of how you handle missing values.
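The adaptive replacement idea can be sketched as sampling fill-in values from the observed category distribution, so the overall proportions stay (approximately) unchanged. This is an illustrative sketch with made-up data, not the specific algorithm the course covers later.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical attribute with missing entries
color = pd.Series(["red", "blue", None, "red", None, "blue", "red"])

# Probability of each value over the non-missing entries
probs = color.value_counts(normalize=True)   # red: 0.6, blue: 0.4

# Fill each missing value by sampling from that distribution,
# so the category proportions are preserved in expectation
mask = color.isna()
color.loc[mask] = rng.choice(probs.index, size=mask.sum(), p=probs.values)
```

With only two missing entries the preserved proportions are approximate, but over a large dataset the filled column's distribution converges to the observed one.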
The third category, alongside missing values, noise, and outliers, is duplicate data. This is particularly a problem when we’re merging data from heterogeneous sources. Say we have some data from Google Analytics coming from our website, some other usage data (click counts, dwell time, and things like that) coming from another system, and maybe a Java applet (as much as those still exist on the internet) that collects some data inside of it. If we want to merge that data, we will sometimes get duplicate data objects: the same person with multiple email addresses, or the same person represented with two different IDs because they come from two different systems. Generally speaking, duplicate data is pretty easy to handle, assuming you can detect it properly: you get rid of the duplicates and merge the records together. But if your data is heterogeneous, coming from multiple sources, you do have to be really careful about filtering out your duplicates.
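The merge scenario above can be sketched in pandas: detect records that share an identifying field (here, email) and collapse them into one record. The IDs, emails, and click counts are invented for illustration, and aggregating by "first ID, summed clicks" is just one plausible merge policy.

```python
import pandas as pd

# Hypothetical records combined from two systems; the same person
# appears under two different IDs but the same email address
records = pd.DataFrame({
    "id":     ["ga-001", "ga-002", "app-17", "app-18"],
    "email":  ["ann@example.com", "bob@example.com",
               "ann@example.com", "cho@example.com"],
    "clicks": [12, 7, 30, 4],
})

# Flag every record involved in a duplicate, keyed on email
dupes = records.duplicated(subset="email", keep=False)

# One way to merge: one row per email, first ID kept, clicks summed
merged = (records.groupby("email", as_index=False)
                 .agg(id=("id", "first"), clicks=("clicks", "sum")))
```

The hard part in practice is the detection step: when the same person has *different* email addresses across systems, a simple key match like this misses the duplicate, which is why heterogeneous sources demand extra care.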
Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
© Copyright – Data Science Dojo