To understand the nature of a data set and to visualize its behavior, measures of central tendency and spread are extremely helpful. This video introduces basic concepts of center and spread and their significance in data analysis.
What You'll Learn
> Median versus mean
> Visualizing skewness of data sets
> Range, variance, and standard deviation
Other measures we care about are the measures of center. Median versus mean is an age-old debate on the internet: which of the two is the better way to measure the center of data? And as is often the case with age-old debates on the internet, the answer is both.
Means are easy to calculate but very sensitive to outliers. Means can also give you a real sense of the skew if you have skewed data. On the other hand, the median is the number such that 50% of values are below it and 50% of values are above it. In other words, the median is the 50th percentile value. There’s also something called a trimmed mean which I want to talk about a great deal.
So, medians tell you exactly where your center is. If you really want to know what the exact middle of your data is, such that 50% of values are below it and 50% are above it, the median’s great. It’s basically immune to outliers. But it’s harder to calculate in some ways, because you have to put the values in order first, and it doesn’t tell you anything about the skew of your data. If you do have a really long tail, the mean will let you know about that, because the tail pulls the mean toward it. Often what we care about is the difference between the median and the mean, because that’s what tells us how our data is skewed. We want both numbers. One is not necessarily better than the other.
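To make the mean-versus-median comparison concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration and are not from the video; they are mostly small numbers with one long-tail value.

```python
import statistics

# Hypothetical sample: small values plus one long-tail value (100).
values = [2, 3, 3, 4, 5, 6, 100]

mean = statistics.mean(values)      # pulled upward by the outlier
median = statistics.median(values)  # middle value of the sorted data

# A mean well above the median hints at a long right tail;
# a mean well below the median hints at a long left tail.
gap = mean - median

print(f"mean={mean:.2f} median={median} gap={gap:.2f}")
```

Dropping the 100 from the list moves the mean a long way but barely moves the median, which is the outlier sensitivity described above.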
The last summary statistics that we tend to care about are measures of spread: range and variance. Variance, or its square root, the standard deviation, is the most common measure of the spread of a set of points. It tells us very directly how different the points are from one another. The range is the difference between maximum and minimum, which is definitely something we might care about. But the range, variance, and standard deviation are all very sensitive to outliers, so there are other measures that we use.
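As a rough illustration of range, variance, and standard deviation, the sketch below uses Python's standard `statistics` module on made-up values. It assumes the population form of the variance (dividing by the number of points); the video does not say which form is intended, and `statistics.variance` would give the sample form instead.

```python
import statistics

# Hypothetical sample, chosen so the arithmetic is easy to check by hand.
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)   # maximum minus minimum
variance = statistics.pvariance(values)  # mean squared deviation from the mean
std_dev = statistics.pstdev(values)      # square root of the variance

# One extreme value changes all three measures dramatically.
with_outlier = values + [100]
range_with_outlier = max(with_outlier) - min(with_outlier)
std_with_outlier = statistics.pstdev(with_outlier)

print(data_range, variance, std_dev)
print(range_with_outlier, round(std_with_outlier, 2))
```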
We use the interquartile range, which is the difference between the 75th percentile value and the 25th percentile value in a set of data. And we’ll sometimes use the median absolute deviation, which is the median of the absolute deviations of the points from the median. And sometimes, we’ll use the average absolute deviation too, which is the mean of the absolute deviations of the points, typically taken from the mean. All of these show up as we’re trying to calculate summary statistics.
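The three outlier-resistant measures can be sketched in a few lines of Python. The helper names (`iqr`, `median_abs_dev`, `mean_abs_dev`) and the sample values are hypothetical, and the quartiles use the "inclusive" interpolation method from `statistics.quantiles`; other quartile conventions give slightly different numbers.

```python
import statistics

def iqr(values):
    """75th percentile minus 25th percentile (inclusive interpolation)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def median_abs_dev(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def mean_abs_dev(values):
    """Mean of the absolute deviations from the mean."""
    center = statistics.mean(values)
    return statistics.mean([abs(v - center) for v in values])

# Hypothetical sample, then the same sample with one extreme value added.
values = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = values + [100]

for name, fn in [("IQR", iqr), ("MAD", median_abs_dev), ("AAD", mean_abs_dev)]:
    print(name, fn(values), fn(with_outlier))
```

In this sample the interquartile range and the median absolute deviation shift only slightly when the extreme value is added, while the average absolute deviation moves much more because it is still built on means.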