Mean, Median, Mode, Range and the Bell Curve

Let's start out super basic. 

I think the first time I came across mean, median, and mode was in either middle school or high school. They are easy enough to understand and use so you can pass your test/quiz, but at least in my case, my teachers never explained why anyone would everrrrrrr need to use these outside of this one specific test in school.

Mean, median, and mode are useful because each one summarizes a characteristic of your data set. This is something you will need to keep in mind the more advanced you become in mathematics/stats/data science/whatever.

This blog is meant to discuss data science and math in a general kind of way, to fill in any gaps left by a forgetful teacher or to offer a mildly different perspective on the material. I aim to help. That being said, there are a number of fantastic websites that let you practice everything I mention here, and I'm sure plenty more. For example, Khan Academy.


Let's begin by going over Mean. The mean of a data set (also called the average or arithmetic mean) is the sum of all the numbers in your set of data, divided by the number of entries. I'll make up a mini data set to explain this:

24, 25, 25, 25, 26, 27, 28, 29, 30. 

If we add all these together, our sum is 239. We have 9 entries, so 239/9 ≈ 26.56. Rounded to the nearest whole number, our mean becomes 27.
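If you'd like to check that arithmetic yourself, here is a minimal Python sketch of the same calculation. The variable names are just ones I picked for illustration, and it uses the statistics module from Python's standard library to double-check the by-hand answer.

```python
from statistics import mean

# The mini data set from above
scores = [24, 25, 25, 25, 26, 27, 28, 29, 30]

total = sum(scores)        # add every entry together
count = len(scores)        # how many entries we have
average = total / count    # the mean, done by hand

print(total)               # 239
print(count)               # 9
print(round(average, 2))   # 26.56

# The standard library computes the same value for us
library_average = mean(scores)
print(round(library_average))  # 27 when rounded to a whole number
```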

If this mini data set were your quiz or test scores for the semester, you could calculate your test average this way, and hopefully you have a passing average! Most of my background is in teaching, and my students do this all the time, especially the closer we get to finals haha.

Next up, Median. The median of our mini data set is the middle number when we order all our entries. Here's our same mini data set:

24, 25, 25, 25, 26, 27, 28, 29, 30.

Because we have 9 entries, the middle one is the 5th, so our median is 26. (With an even number of entries there is no single middle number, so you take the mean of the two middle ones.) In this small set of entries, all this really tells us is exactly what the middle number of the set is. Maybe the median of a data set you compile for a graduation project will have more meaning to it. For example, say you've been putting together a data set by taking water samples in a nearby creek. Maybe you realize the median is not only the halfway point of your data set, but it's also where you started seeing some type of change, like a large increase in water pH. This is the type of thing that requires you to know your data set well in order to understand what is happening or what you're looking for.
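Here is a small Python sketch of finding the median, first by hand and then with the standard library. The second list, with an even number of entries, is made up just to show what happens when there are two middle numbers.

```python
from statistics import median

scores = [24, 25, 25, 25, 26, 27, 28, 29, 30]

# By hand: sort the entries, then pick the one in the middle
ordered = sorted(scores)
middle_index = len(ordered) // 2      # 9 // 2 = 4, which is the 5th entry
median_by_hand = ordered[middle_index]
print(median_by_hand)                 # 26

# The standard library agrees
print(median(scores))                 # 26

# Made-up list with an even number of entries: there are two
# middle numbers (25 and 26), so the median is their mean
even_scores = [24, 25, 26, 27]
even_median = median(even_scores)
print(even_median)                    # 25.5
```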

Mode! Mode is the number that appears most frequently in your data set. In our mini data set, that would be 25, which shows up three times. Why is this useful? Let's say instead of only 9 entries in our data set, we had 100 entries. If 75 of the 100 entries were the number 25, chances are our average (mean) would hover very near 25 as well. This tells us a small piece of information about our data set as a whole.
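A quick Python sketch of both ideas: counting which value shows up most, and the 100-entry thought experiment. The 25 "other" entries in the big list are numbers I made up so there would be something to average; only the 75 copies of 25 come from the example above.

```python
from collections import Counter
from statistics import mean, mode

scores = [24, 25, 25, 25, 26, 27, 28, 29, 30]

# Count how often each value appears, then take the most frequent one
counts = Counter(scores)
most_common_value, times_seen = counts.most_common(1)[0]
print(most_common_value, times_seen)   # 25 3
print(mode(scores))                    # 25

# Thought experiment: 100 entries, 75 of them equal to 25.
# The remaining 25 entries are invented for illustration.
big_set = [25] * 75 + [24] * 10 + [26] * 10 + [27] * 5
big_mode = mode(big_set)
big_mean = mean(big_set)
print(big_mode)                        # 25
print(round(big_mean, 2))              # 25.1
```

With that many entries piled up on 25, the mean has very little room to drift away from the mode.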

Related: if most of your 100-entry data set is the number 25, but you have a couple of data entries that are muchhh higher or lower than what you have typically been finding as you do your (for example) water sampling, these odd entries are called outliers. They can affect other statistical tests you do later on and can skew your data.
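To get a feel for how much a single outlier can pull things around, here is a hedged little experiment in Python. I took the mini data set and swapped the last entry (30) for a made-up extreme value of 300. That number is invented purely for illustration.

```python
from statistics import mean, median

scores = [24, 25, 25, 25, 26, 27, 28, 29, 30]

# Same data, but the last entry is replaced by a made-up outlier
with_outlier = [24, 25, 25, 25, 26, 27, 28, 29, 300]

mean_before = mean(scores)
mean_after = mean(with_outlier)
median_before = median(scores)
median_after = median(with_outlier)

print(round(mean_before, 2), round(mean_after, 2))   # 26.56 56.56
print(median_before, median_after)                   # 26 26
```

One odd entry more than doubles the mean, while the median does not move at all. That is one reason it helps to look at more than one of these measures.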

Now let's look at the range of the data. Range is simply the difference between your highest and lowest values. Once again, our mini data set from before:

24, 25, 25, 25, 26, 27, 28, 29, 30.

Our highest value is 30 and our lowest value is 24, thus 30-24=6. Our range is 6, so the values in our data set are not very spread out. The data set itself is also small (we already knew that), but other data sets can be hugeeee!!! I've put together a data set with more than 400 entries and it would've had a much larger range.
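The same subtraction as a tiny Python sketch (variable names are my own). Notice that the range and the number of entries are two separate things: one measures spread, the other measures size.

```python
scores = [24, 25, 25, 25, 26, 27, 28, 29, 30]

highest = max(scores)
lowest = min(scores)
data_range = highest - lowest   # spread between the extremes

print(highest, lowest)          # 30 24
print(data_range)               # 6
print(len(scores))              # 9 entries, which is the size, not the range
```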

Lastly, let's discuss the Bell Curve.
(Image: a bell curve drawn as a red line. Picture Source)
If you Google 'bell curve' you'll come across a graph with a shape that looks like a camel hump. You'll also, at some point, hear about the bell curve in association with a normal distribution. In a normal distribution, the center of the bell curve, its highest point, is the mean. Because the curve is symmetrical with a single peak, the median (remember, that's the middle number of our mini data set) and the mode land in that same spot. In real data that is roughly normal, the mean (average), median, and mode will be approximately the same number. Why? The values near the center are the most frequent ones, and values get rarer the farther from the center you go. (Ex. most people are close to average.)
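You can watch the three measures line up by generating some pretend data. This is only a sketch under my own assumptions: I'm drawing 10,000 random values from a normal distribution centered at 26 with a standard deviation of 2 (both numbers made up), and I round the values to whole numbers before finding the mode, since raw decimal values almost never repeat.

```python
import random
from collections import Counter
from statistics import mean, median

random.seed(0)  # fixed seed so the run is repeatable

# Made-up data: 10,000 draws from a normal distribution
# centered at 26 with a standard deviation of 2
samples = [random.gauss(26, 2) for _ in range(10_000)]

sample_mean = mean(samples)
sample_median = median(samples)

# Round to whole numbers first, then find the most frequent one
rounded_counts = Counter(round(value) for value in samples)
rounded_mode = rounded_counts.most_common(1)[0][0]

print(round(sample_mean, 1))
print(round(sample_median, 1))
print(rounded_mode)
```

All three printed values should come out very close to 26, the center I chose for the distribution.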

Going back to our shape, you'll notice that the 'hump' is symmetrical and the red line approaches 0 (the x axis) as you move away from the center toward either end. The area under this curve represents all your water sampling entries, with the outliers being the data points under the red line that are farthest from the mean, median, and mode.
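If you want to poke at that shape with numbers instead of a picture, here is a minimal sketch using NormalDist from Python's statistics module (available in Python 3.8 and later). The center of 26 and standard deviation of 2 are made-up values for illustration.

```python
import math
from statistics import NormalDist

# A made-up bell curve: center 26, standard deviation 2
curve = NormalDist(mu=26, sigma=2)

# Height of the curve at the center and at matching
# distances on either side of it
peak = curve.pdf(26)
left = curve.pdf(26 - 3)
right = curve.pdf(26 + 3)

# Height of the curve far out in the tail (5 standard deviations away)
far_tail = curve.pdf(26 + 10)

print(math.isclose(left, right))   # True: the hump is symmetrical
print(peak > left)                 # True: the center is the highest point
print(far_tail < 0.00001)          # True: the curve hugs the x axis far out

# For a normal distribution these three are the same number
print(curve.mean, curve.median, curve.mode)
```

The far-tail height never actually reaches 0, it just gets extremely small, which is why values out there are treated as outliers.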

Phew! That was a good chunk of information. 
I'll continue going into more about bell curves and normal distributions later as we get more complex. 
