Another very common method of pre-processing is sampling. Those of you who come from a statistics background will understand sampling quite well.

Sampling is the main technique that we use for data selection. It’s almost always used for preliminary investigation of the data, but it’s often used for the final data analysis as well, even in data science. Statisticians have been sampling for as long as the discipline has existed, because obtaining the entire set of data of interest is either too expensive, too time-consuming, or, in a lot of cases, theoretically impossible. For some kinds of data there is simply no way to obtain the entire set. So, you have to sample carefully.

Data miners often sample because processing the entire set of data is too expensive or time-consuming. If you talk about a company like LinkedIn or Facebook or Google, you’re talking about hundreds of terabytes into petabytes worth of data stored on their servers. You cannot process that kind of data in anything remotely resembling a human lifespan, even with modern technology. We can process a lot of data, but there’s still a fundamental limit on what we can process, and on top of that, there’s a fundamental limit on what we as humans can look at at one time. When you’re sampling, there is one thing more than anything else that you have to keep in mind, which is representation.

The key principle when you’re sampling is that the sample will work almost as well as the entire data set if, and only if, the sample is representative. Representative is one of those fun words that means something different for every data set, right? Sometimes, getting a representative sample is as easy as unweighted random sampling. Other times it takes more care, and this is particularly true if you are doing something like anomaly detection.
In that case, we need to make sure that whatever sample we take has an appropriate proportion of anomalies versus normal data. In other contexts, it gets even more complicated.

Sometimes we want to make sure we balance out our different classes in a classification context, or that certain kinds of attribute values that are needed (not even target values, but attribute values) are all represented in a certain way. And Balachander notes that sampling will typically exclude outliers and may have noise, and that’s absolutely true. Sampling, if done improperly, can absolutely introduce noise into your data, even if it doesn’t really add noise in the sense we mean in our context. And outliers are probably not going to appear, because you don’t sample enough to make them appear. That’s actually one of the advantages of sampling: it will exclude outliers most of the time. So, if we aren’t in an anomaly detection context, then we don’t care about outliers and we don’t want them muddying the waters, so to speak. We’ll want to exclude them, and sampling can help us do that.
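
To make the difference between the two kinds of "representative" concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a method from the lecture: the toy data set (990 normal records, 10 anomalies), the sample size of 100, and the helper names `simple_random_sample` and `stratified_sample` are all hypothetical. The first helper is plain unweighted random sampling; the second uses proportional stratified sampling, one common way to keep the share of each class in the sample close to its share in the full data.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(records, n, seed=0):
    """Unweighted random sample of n records, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, n)


def stratified_sample(records, n, label_of, seed=0):
    """Sample about n records, keeping each label's share close to its
    share in the full data set (proportional allocation)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[label_of(rec)].append(rec)

    sample = []
    for members in groups.values():
        # At least one record per label, so a rare class is never dropped.
        k = max(1, round(n * len(members) / len(records)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical toy data: 990 "normal" records and 10 "anomaly" records.
data = [("normal", i) for i in range(990)] + [("anomaly", i) for i in range(10)]

srs = simple_random_sample(data, 100)
strat = stratified_sample(data, 100, label_of=lambda rec: rec[0])

# The anomaly count in the simple random sample depends on the seed
# and can be zero; the stratified sample fixes it by construction.
print(Counter(label for label, _ in srs))
print(Counter(label for label, _ in strat))  # 99 normal, 1 anomaly
```

With anomalies at 1% of the data, a simple random sample of 100 has a real chance of containing none at all, which is exactly the outlier-excluding behavior described above. That is helpful when outliers are a nuisance and harmful when they are the thing you are looking for. If the goal were balanced classes rather than proportional ones, the allocation line for `k` is the part you would change.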