The first data cleaning strategy is data aggregation, where two or more attributes are combined into a single one. This video explains the concept of data aggregation with appropriate examples. The importance of aggregation in data pre-processing is highlighted along the way.
What You'll Learn
> Data aggregation as a data cleaning strategy
> The significance of data aggregation
> Examples of data aggregation
> Impact of aggregation on variability
So, the first strategy - and this one is first because we see it a lot - is aggregation.
We’ll combine two or more attributes or objects into a single attribute or object. This is useful when we’re trying to reduce the scale of our data, that is, the number of attributes or objects. For instance, we could combine a high-temperature attribute and a low-temperature attribute to get a temperature-difference attribute. We’ve now combined two columns into one column. Basically, every algorithm has some time dependence on the number of attributes it runs on, and certainly, in terms of visualization and exploration, there are only so many attributes that you can look at or hold in your head at the same time.
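The temperature example above can be sketched in pandas; the column names and values here are purely illustrative:

```python
import pandas as pd

# Hypothetical daily temperature readings (column names are illustrative).
df = pd.DataFrame({
    "high_temp": [71, 68, 75, 80],
    "low_temp":  [55, 52, 60, 63],
})

# Combine the two attributes into a single derived attribute,
# then drop the originals: two columns become one.
df["temp_range"] = df["high_temp"] - df["low_temp"]
df = df.drop(columns=["high_temp", "low_temp"])
print(df)
```

After this step the dataset carries one attribute instead of two, which is exactly the reduction in scale the transcript describes.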
On the other hand, we might want to combine a bunch of different objects. If we have users who have many different sessions, or who navigate to many different pages, we’ll have dwell times that differ for every page and every session. We might want to combine all those dwell times into one data object that captures the average behavior for each user, rather than keeping the 10 or 15 different sessions for that user. That change of scale is exactly why we aggregate.
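Aggregating many session objects into one object per user is a one-line group-by in pandas. This is a minimal sketch; the user names and dwell times are made up:

```python
import pandas as pd

# Hypothetical per-session dwell times in seconds (values are illustrative).
sessions = pd.DataFrame({
    "user":  ["ann", "ann", "ann", "bob", "bob"],
    "dwell": [30.0, 45.0, 60.0, 20.0, 40.0],
})

# Collapse many session rows into one row per user:
# the mean dwell time summarizes each user's average behavior.
per_user = sessions.groupby("user", as_index=False)["dwell"].mean()
print(per_user)
```

Five session objects become two user objects, one per user, each carrying the average dwell time.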
If we want to average user times, for instance, we’re changing our scale. We might aggregate cities into regions, states, or countries, or aggregate dwell times across sessions or across pages. One of the big advantages of aggregation, particularly averaging, is that aggregated data tends to have less variability: it’s a way of reducing the effect of random noise. If you’ve got human labeling errors, aggregation won’t fix them; if you’ve got sampling procedure errors, you still have sampling procedure errors. But if your errors are random noise, aggregation will very much tend to reduce them.
As an example of that - and I really like this next page for this - these two graphs show precipitation in Australia. They are histograms where the height of each block is the number of measurement locations whose precipitation had the standard deviation shown on the X-axis. We measured the average monthly precipitation, and the standard deviation of that monthly precipitation, at 500 different land locations in Australia. When we do that on a monthly basis, we get very widespread standard deviations: some places are very consistent in their rainfall, there are these two peaks, and then there is this long tail of places that are just all over the place in terms of the variability in precipitation. On the other hand, if we take those exact same land locations and instead find the average yearly precipitation and its standard deviation, we get this very nice, mostly single-peaked, very short-tailed histogram.
We’ve significantly reduced our variability. We’ve reduced the random noise in our dataset by increasing the scale, by aggregating our data over a longer time period. So that’s one of the big reasons we use aggregation.
Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
© Copyright – Data Science Dojo