We will discuss the first type of data quality issue: noise and outliers. We will look at where noise in data comes from and how it affects data quality, then discuss how to measure and record noise and how to get a better understanding of outliers in a data set.
What You'll Learn
> Meaning of noise in a data set
> Sources and origin of noise
> The problem of outliers in a data set
So, those of you who have a scientific or signal processing background are probably familiar with the term "noise". Noise in a data science context is an invalid signal of some sort that overlaps valid data and obscures our actual attribute values. Fundamentally, what it means is that some of our data objects have invalid, in other words inaccurate, values in some of their attributes. Examples of this in real life are the distortion of a person’s voice over the phone and the snow on old television screens, particularly the old CRT screens.
Noise can also appear because of human inconsistency in labeling. You see this a lot in sports that require human judging, for instance, where there’s a lot of inconsistency in how competitors get labeled. And, just in general, if you’re trying to, say, rank websites, human inconsistency in labeling can be a real problem.
As a practical example of what noise can do when there’s a lot of it, take a pretty straightforward signal: two sine waves with different frequencies but the same amplitude, a blue one and a green one. Generated on their own, the waves look very clean, and we can easily distinguish the two. If we add those two waves together and then throw noise at the result, just basic white noise like you might get from any random process, we end up with something in which the noise has completely obscured our actual signal. Noise is, again, fundamentally invalid data points that obscure our signal. There’s always some noise in any system. It’s just the nature of the universe, sadly. But understanding where your noise is at its worst and how you can deal with it is very important. Even recognizing that it’s there is the first step: knowing which of your attributes are noisy and which are less noisy.
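To make the sine wave example concrete, here is a minimal sketch in Python using only the standard library. The frequencies (5 and 12 cycles per second), the noise standard deviation (2.0), and the sample count are assumptions chosen for illustration, not values from the talk. The signal-to-noise ratio (SNR) at the end is one common way to put a number on how much noise there is; the talk does not prescribe a particular measure.

```python
import math
import random


def make_signals(n_samples=500, duration=1.0, f1=5.0, f2=12.0,
                 noise_std=2.0, seed=0):
    """Build two clean sine waves, their sum, and a noisy copy of the sum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    t = [duration * i / n_samples for i in range(n_samples)]
    # Two sine waves: different frequencies, same amplitude (1.0).
    wave_a = [math.sin(2 * math.pi * f1 * x) for x in t]
    wave_b = [math.sin(2 * math.pi * f2 * x) for x in t]
    clean = [a + b for a, b in zip(wave_a, wave_b)]
    # White noise: independent Gaussian samples added to every point.
    noisy = [c + rng.gauss(0.0, noise_std) for c in clean]
    return t, wave_a, wave_b, clean, noisy


def rms(values):
    """Root-mean-square amplitude of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


t, wave_a, wave_b, clean, noisy = make_signals()

# In a simulation we know the clean signal, so the noise can be recovered
# exactly. With real data this is not possible and noise must be estimated.
noise = [n - c for n, c in zip(noisy, clean)]
snr = (rms(clean) / rms(noise)) ** 2  # power ratio; below 1 means noise dominates

print(f"signal RMS: {rms(clean):.2f}  noise RMS: {rms(noise):.2f}  SNR: {snr:.2f}")
```

With these assumed settings the noise carries more power than the signal, which is the situation described above: plotting `noisy` against `t` would show little trace of the two underlying waves.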
A complementary problem to noise is the problem of outliers. Outliers often look like noise at first. They’re data objects whose characteristics are considerably different from most of the other objects in the data set. In the visual here, we’ve got a two-dimensional plot of our data, where each dot represents a data object. We’ve got four very nicely defined clusters, and then we’ve got three other points just hanging out in the middle of nowhere, far away from all of the other data. The big distinction between outliers and noise is that outliers are actually valid values. The data was collected properly and it’s clean, but it’s outside of the normal range. The data object, for some reason, doesn’t look like a normal object. All right, so that’s outliers and noise. Those are the first category of data quality problems, and they get encountered a lot.
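The picture of four clusters plus three isolated points can be sketched in code as well. The example below is an illustration under stated assumptions, not the method used in the talk: the cluster centers, the spread, the three isolated points, and the threshold are all made up, and the scoring rule (distance to the k-th nearest neighbor) is just one simple heuristic for flagging points that sit far from everything else.

```python
import math
import random


def kth_neighbor_distance(points, k=5):
    """For each point, return the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Four tight clusters of 30 points each (hypothetical centers and spread).
centers = [(0, 0), (0, 10), (10, 0), (10, 10)]
clustered = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
             for cx, cy in centers for _ in range(30)]

# Three valid but unusual points, far from every cluster.
isolated = [(5.0, 5.0), (5.0, 15.0), (15.0, 5.0)]

points = clustered + isolated
scores = kth_neighbor_distance(points, k=5)

# Hypothetical cutoff: points in a cluster have close neighbors, so their
# score is small; an isolated point has to reach far to find 5 neighbors.
threshold = 4.0
outliers = [p for p, s in zip(points, scores) if s > threshold]

print("flagged as outliers:", outliers)
```

Note that a score like this only says a point is unusual. It cannot tell an outlier from noise; deciding whether the value was collected properly still takes knowledge of how the data was gathered.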
Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.