Data Sampling Types


r_subheading-Course Description-r_end This video introduces various types of data sampling along with appropriate examples. Depending on the nature of the data and the processes involved, one or more of these various types can prove useful in data analysis. r_break r_break r_subheading-What You'll Learn-r_end • Introduction to types of data sampling. r_break • Random and stratified sampling. r_break • Sampling with and without replacement.

-

There are several different types of sampling that are important. These will come up over the course of the Bootcamp. r_break r_break So, there’s simple random sampling, where there’s an equal probability of selecting any particular item. There’s stratified sampling, where we split the data into several partitions and draw random samples from each partition. If we’re doing stratified sampling with equal-sized partitions and drawing the same number of points from each, then every item has the same chance of being selected, just as in simple random sampling. But in a lot of cases, we don’t do it with equal-sized partitions; we have different-sized partitions to draw from, or we draw different numbers of points out of the different partitions, which is what makes it fundamentally different from simple random sampling. So, these are two fundamental ways of actually grouping the data. r_break r_break When we’re actually sampling, there are two kinds of sampling that come up. The first is sampling without replacement, which is what most people think of when they think of sampling. Sampling without replacement is as if we have a bag, and it’s got five red balls and four blue balls and three green balls in it. We reach into the bag and pull a ball out and we see, 'Aha, I drew a red ball.' Then we take that red ball and we put it on the table. If we want another item, we reach back in and pull out a different ball. So, the second time we draw, instead of there being five reds and four blues and three greens, there are four reds, four blues, and three greens. That’s sampling without replacement. We do not put what we’ve sampled back into the bag. r_break r_break On the other hand, there are uses for sampling with replacement. It is actually one of the most fundamental concepts behind a very common type of modeling. 
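The simple random and stratified schemes described above can be sketched in Python. This is only a minimal illustration using the standard library; the function names, the `records` data, and the "common"/"rare" labels are hypothetical and not from the video.

```python
import random
from collections import defaultdict


def simple_random_sample(items, k, rng):
    """Draw k items so that every item is equally likely to be chosen."""
    return rng.sample(items, k)


def stratified_sample(items, key, k_per_stratum, rng):
    """Partition items by key(item), then draw a random sample from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for label, members in strata.items():
        # Never ask for more points than the partition actually holds.
        k = min(k_per_stratum[label], len(members))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed, only so the sketch is repeatable

# Hypothetical data: 90 records labelled "common" and 10 labelled "rare".
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]

# Simple random sampling: "rare" records show up only by chance.
srs = simple_random_sample(records, 10, rng)

# Stratified sampling with different-sized partitions but equal draws:
# the "rare" partition is guaranteed 5 of the 10 sampled points.
strat = stratified_sample(
    records,
    key=lambda record: record[0],
    k_per_stratum={"common": 5, "rare": 5},
    rng=rng,
)
```

Drawing five points from each partition here means a "rare" record is far more likely to be selected than a "common" one, which is the sense in which stratified sampling departs from simple random sampling.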
In sampling with replacement, instead of taking the red ball out and then putting it on the table and drawing again, we reach into the bag, pull out a ball and say, 'Aha, it’s red,' note down on a piece of paper that it’s red, then put the red ball back, shake the bag up, and draw another ball out again. Record its color, put it back in the bag. r_break r_break So, without replacement and with replacement are exactly what they sound like, but they end up having very different mathematical results. Because of that, they are used in different contexts. r_break r_break All right, so another aspect we need to think about around sampling is what size of sample we want to take. And I really like this picture because I think it illustrates the problems with sample sizes very well. When we sample, we do lose information, just like with aggregation. So, you have to be careful not to make your sample too small. So if we look over here, we have this data set, and it’s just position data. This is, I think, some sort of lithography picture. We’ve got these black structures, and then we’ve got this sine wave in the background, and then a little bit of just random noise scattered all over the place. If we subsample this down to a quarter, so we sample 2000 points, we can see that the big thick structures are still represented. But the sine wave has almost entirely disappeared. We’ve lost that background image. And if we go down even farther, if we subsample by another quarter down to 500 points, we’ve lost even the information about the structures themselves. r_break r_break You can look at this and you can kind of see the structures, but only because you know what the structures need to look like. If I showed you just this graph first, you wouldn’t pick out the structures. You wouldn’t be able to; there’s just not enough information there. 
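Returning to the bag of balls described above, the two kinds of draws can be sketched with Python's standard `random` module. This is a minimal illustration, not something shown in the video; the variable names are made up, and `random.sample` and `random.choices` are used here as stand-ins for drawing without and with replacement.

```python
import random
from fractions import Fraction

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
bag = ["red"] * 5 + ["blue"] * 4 + ["green"] * 3

# Without replacement: a drawn ball goes on the table, so the bag shrinks.
without = rng.sample(bag, k=3)

# With replacement: each ball is recorded and put back, so every draw
# sees the full bag and we can draw more times than there are balls.
with_repl = rng.choices(bag, k=20)

# Chance that the second ball is red, given that the first one was red.
reds, total = bag.count("red"), len(bag)
p_without = Fraction(reds - 1, total - 1)  # 4 reds left among 11 balls
p_with = Fraction(reds, total)             # still 5 reds among 12 balls
```

The two probabilities, 4/11 and 5/12, are a small instance of the different mathematical results mentioned above. Asking `rng.sample` for more than 12 balls raises `ValueError` because the bag runs out, while `rng.choices` has no such limit.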
So, we want to reduce our sample size. We want to sample a small enough size that we can process it efficiently, analyze it efficiently, and explore it efficiently. But we have to be really careful not to take too small a sample. And, unfortunately, there really isn’t a good rule of thumb for this. You need to play with it. You need to take lots of different samples of different sizes to figure out when your information starts to disappear.
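One way to play with sample sizes is to draw repeated samples at several sizes and watch how much of a faint feature survives. The sketch below is only an illustration under made-up assumptions: the population is synthetic, with roughly 2% of points labelled "fine" as a stand-in for a faint background structure, and the sizes and repeat count are arbitrary.

```python
import random

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable

# Hypothetical stand-in for the position data: 8000 points, about 2% of
# which belong to a faint "fine" structure and the rest to a "coarse" one.
population = ["fine" if rng.random() < 0.02 else "coarse" for _ in range(8000)]


def fine_points_seen(size, repeats=20):
    """Count the 'fine' points captured in each of several samples of one size."""
    return [rng.sample(population, size).count("fine") for _ in range(repeats)]


# Try several sample sizes and compare how well the faint structure survives.
results = {size: fine_points_seen(size) for size in (2000, 500, 125, 30)}

for size, counts in results.items():
    print(f"size={size:5d}  fine points per sample: min={min(counts)}, max={max(counts)}")
```

As the sample size shrinks, the number of "fine" points per sample drops toward zero, which is the point at which that part of the information has effectively disappeared.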

Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.