Unit 2 Review, and Assessing Correlation
Lesson 10 of 10
Objective: SWBAT compute (using technology) and interpret the correlation coefficient of a linear fit.
Today students will learn to compute and interpret the correlation coefficient. This lesson also serves as a review and summary of some other key ideas from Unit 2. To get started, we return to some of the data from our work with Gapminder.
On the second slide of the lesson notes I post today's opener. I tell students that by the end of today's lesson, we'll be able to use something called the "correlation coefficient" to quantify the correlation between two variables, but that to get started I'd like everyone to try to guess which of these six data pairs has the strongest positive correlation.
I've chosen six pairs of data. Students are familiar with these data sets because of their work on Gapminder, and some of these data pairings were originally proposed by students during that lesson. As you'll see, these pairs will allow students to see what it means when the correlation coefficient is positive, negative, and close to zero.
I invite students to turn and talk about their ideas in groups, and I tell them that in a few minutes I'll ask everyone to share what they think. This activity helps to set the stage for the lesson, and to build some suspense. Once students have made their predictions, they can't wait to find out if they're right.
After a few minutes, I ask each group to share their predictions, and then - with a healthy bit of argument - we rank the six data pairs in order from strongest correlation to weakest. I write the predictions on the side board, and I ask students if they're ok with the list as it is. It's impossible for everyone to be completely happy with the list - but that's the point! I invite everyone to write their own predictions, especially if their rankings differ from the class consensus. Once everyone has a grasp on their own ideas, we're ready to move on. "Now," I say, "I'll show you how to measure correlation and we'll see who is right!"
How to Calculate the Correlation Coefficient
This activity serves as a great review, because we're going to draw on a lot of what I hope students have learned over the last few weeks - both in terms of skills and knowledge. One skill is being able to run a linear regression on bivariate data on a TI-83/84. I tell students that if they know how to do that, then they already know how to find the correlation coefficient. "In fact, some of you have already noticed the correlation coefficient," I say. "To anyone who has asked me what that 'r' means on your calculator screen -- that's it! The value r is known as the correlation coefficient."
I distribute this two-sided handout, which provides the data that we'll use to figure out which correlation is strongest. Note that I'm only providing a small subset of all the data available on Gapminder. This list consists only of the world's 12 most populated countries. Later in the lesson, I'll ask students whether this sample is sufficient for us to draw conclusions about all countries, and whether they think the results might change if we chose a different set of countries, which serves to lay foundations for our deeper study of random sampling in Unit 3.
I assign each group to work with a different data set. Students should use their graphing calculators to find the equation of the regression line, and they should record the resulting value of r. Each student should do this before comparing their results in small groups to ensure accuracy. I circulate to get a quick check in on who can and can't run a regression on the calculator. If anyone is struggling, I provide a quick review and say that I expect these steps to be written in their notes.
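For anyone who wants to check a calculator result away from the TI, here is a minimal Python sketch of the same computation: the least squares line and the value of r. The function name `linear_regression` and the five data points are my own illustrative choices; they are not the Gapminder values from the handout.

```python
from math import sqrt

def linear_regression(xs, ys):
    """Return (slope, intercept, r) for the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r

# Made-up illustrative data, NOT the values from the class handout.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept, r = linear_regression(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f}x, r = {r:.2f}")
```

The value of r printed here should match what a calculator's linear regression reports for the same two lists, up to rounding.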
Results and Discussion
I give students about five minutes before we reconvene to share results, and then it's time for the big reveal. Here are the rankings:
- Median Age vs. % Internet Users r = 0.83
- Poverty Rate vs. Children Per Woman r = 0.67
- GDP/capita vs. Life Expectancy r = 0.63
- Aged 15+ Employment Rate vs. Food Supply r = 0.20
- % of Roads Paved vs. Traffic Deaths per 100,000 People r = -0.0094 (or essentially 0)
- Body Mass Index vs. Child Mortality r = -0.51
Remember that we're ranking these data pairs by the strength of the positive correlation (we'll address the fairly strong negative correlation between BMI and child mortality in a few minutes). I write the r-values next to our prediction list, and as I do so, I say, "the greater the correlation coefficient, the stronger the positive correlation is." I also remind students that when we talk about correlation, we're talking specifically about the linear relationship between two data sets.
It's fascinating to reveal these results. I have yet to have a class correctly predict that the correlation is strongest between Median Age and Internet Use. Students tend to think that the Internet is a tool of youth, so if anything, they'll often argue that "older countries" will have fewer people using the Internet. Seeing that notion turned on its head forces us to ask why, and we can talk about the other factors that might result in a population with a low median age. Similarly, students are often surprised to see the emphatic lack of correlation between paved roads and traffic deaths. It's easy to postulate that unpaved roads are dangerous, and therefore there are more deaths, or conversely that more paved roads result in faster driving and therefore more fatal accidents, but neither of these appears to be true. So, if not for the condition of the roads, what might make a country more or less safe for driving?
Once we have all the r values, I show students how to interpret this number. To help clarify the meaning of each, I've prepared slides #4-15 in the lesson notes, with data sets and graphs. I show students that a strong correlation means that the points are close to the regression line, and a weak correlation means that the points are all over the place.
To summarize, I give some generalized notes about the correlation coefficient. I remind students that the word correlation is defined as a measure of the linear relationship. There can be data that is very clearly positively associated, but that is not linear. We'll study that in the next unit. Another upcoming topic is sampling. I ask students if they believe that these results could be used to generalize for all countries. In the next section of today's lesson, we'll address this idea by taking a closer look at the BMI vs. Child Mortality example.
Important, but not imperative to today's lesson, is the interpretation of these linear models. If students are having an easy time with the technical aspects of computing the regression and finding r, I'll pay more attention to interpretation, but if they're struggling I won't emphasize it. This lesson will often take two days, and if everything else is going well, we'll spend more time with interpretation.
I've already spent time with literal interpretations of slope and intercept, and here is a chance to talk about the relationship less formally. Rather than saying that "Life expectancy increases by 0.0003 years for every $1 the per capita GDP increases..." it's more common to say that in countries with greater per capita GDP, the life expectancy is longer.
I should also note that I'm ignoring residuals, but touching on some related ideas. The purpose of this course is to provide initial exposure to some deep math topics while reinforcing core math skills. A lesson on calculating residuals and showing how the least squares method works is a great fit along those lines; in the limited time I have here, I've chosen to leave it out, but there are plenty of good reasons and resources for including such a lesson in a course like this.
Focus on "Data Set F": Body Mass Index vs. Child Mortality
Depending on how much time I have left, and whether or not I'm planning on extending this lesson another day, I want to take at least some time to explore some ideas in greater depth. First, I want to focus on the correlation between BMI and Child Mortality. To begin, I address the meaning of the negative correlation coefficient. "If the correlation coefficient is -1, that indicates a very strong negative correlation," I say. "Notice that the r-value doesn't just tell us how strong a correlation is, it also tells us whether the correlation is positive or negative."
With that in mind, I say that it's a little misleading to have ranked this data last in our list. If we were only to rank the data by strength of correlation, without concern as to whether it's positive or negative, then this should have been ranked fourth, above Data Sets D and E. But there's even more to it than that.
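The two ways of ranking can be compared in a few lines of Python. This is only a sketch: the r-values are the ones from our class results, while the variable names are my own.

```python
# r-values from the class results, keyed by data pair.
r_values = {
    "Median Age vs. % Internet Users": 0.83,
    "Poverty Rate vs. Children Per Woman": 0.67,
    "GDP/capita vs. Life Expectancy": 0.63,
    "Aged 15+ Employment Rate vs. Food Supply": 0.20,
    "% of Roads Paved vs. Traffic Deaths": -0.0094,
    "Body Mass Index vs. Child Mortality": -0.51,
}

# Ranking by strength of POSITIVE correlation: sort by r itself.
by_positive = sorted(r_values, key=lambda pair: r_values[pair], reverse=True)

# Ranking by strength alone, ignoring direction: sort by |r|.
by_strength = sorted(r_values, key=lambda pair: abs(r_values[pair]), reverse=True)

for rank, pair in enumerate(by_strength, start=1):
    print(rank, pair, r_values[pair])
```

Sorting by r puts BMI vs. Child Mortality last, while sorting by the absolute value of r moves it up to fourth.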
My students and I take a look at this scatterplot from Gapminder. I want everyone to have a clear idea that the set of countries we choose will have a big influence on our results and the conclusions we draw. It would have been possible to select a dozen countries with a near-perfect negative correlation. On the other hand, we could choose a different dozen from Japan to South Africa that would make it look like there's a positive correlation!
Even within the data for these 12 most-populated countries, we can change the results by ignoring certain extreme points. For example, Nigeria is far above the initial regression line (y = 271.7 - 9.49x), for which r = -0.51. If we omit Nigeria from the data set, we'll get a different model (y = 237.85 - 8.46x) with r = -0.70. If we also ignore Japan, r = -0.78, which would make this correlation nearly as strong as that of Data Set A.
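To see how much a single extreme point can move r, here is a small Python sketch. The nine labeled points are hypothetical and invented for illustration; they are not the BMI and child mortality values for real countries, and the helper name `correlation` is my own.

```python
from math import sqrt

def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: eight points with a clear downward trend,
# plus one point ("outlier") sitting far above that trend.
data = {
    "P1": (20, 100), "P2": (21, 90), "P3": (22, 85), "P4": (23, 70),
    "P5": (24, 60), "P6": (25, 50), "P7": (26, 45), "P8": (27, 30),
    "outlier": (24, 150),
}

r_all = correlation(list(data.values()))
r_trimmed = correlation([pt for name, pt in data.items() if name != "outlier"])

print(f"all points:      r = {r_all:.2f}")
print(f"outlier removed: r = {r_trimmed:.2f}")
```

In this made-up example, dropping the one extreme point makes the negative correlation much stronger, which is the same kind of shift we see when Nigeria is left out of Data Set F.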
After this introduction to the idea of sampling, which we'll explore in great depth soon, we review the distinction between correlation and causation. Even though correlation may be strong, can we say that one of these variables causes the other to change? Does increasing the BMI of your population cause fewer children to die? Could it be the other way around? Is there some third factor?
Another Example: BMI vs. Food Density
For our last example, we leave the Gapminder World data to look at a small data set adapted from Amstat's article about median-median lines. The purpose of this activity is for students to review what they know by using technology to compute the median-median line and the least squares line, and then to participate in a class discussion in which we compare the two. We will continue to investigate the role of extreme data points on a regression model. I provide the data set on slide #19 of the lesson notes, and I discuss the differences between median-median and least squares regression models in this narrative video.
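For reference, here is a rough Python sketch of the two models side by side. It assumes the usual textbook construction of the median-median line: sort by x, split the points into three groups, take the median x and median y of each group, use the two outer summary points for the slope, and average over all three for the intercept. The nine data points are invented for illustration and are not the data set from slide #19; the function names are my own.

```python
from statistics import median

def median_median_line(points):
    """Return (slope, intercept) of the median-median line for (x, y) pairs."""
    pts = sorted(points)  # order by x
    n = len(pts)
    if n < 3:
        raise ValueError("need at least 3 points")
    base, extra = divmod(n, 3)
    outer = base + (1 if extra == 2 else 0)  # outer groups share two extras
    middle = n - 2 * outer                   # a single extra goes to the middle
    groups = [pts[:outer], pts[outer:outer + middle], pts[outer + middle:]]
    # One summary point (median x, median y) per group.
    (x1, y1), (x2, y2), (x3, y3) = [
        (median([x for x, _ in g]), median([y for _, y in g])) for g in groups
    ]
    slope = (y3 - y1) / (x3 - x1)
    intercept = ((y1 + y2 + y3) - slope * (x1 + x2 + x3)) / 3
    return slope, intercept

def least_squares_line(points):
    """Return (slope, intercept) of the least squares line for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data with one extreme point at (8, 30).
points = [(1, 3), (2, 5), (3, 4), (4, 8), (5, 9), (6, 10), (7, 13), (8, 30), (9, 15)]

mm_slope, mm_intercept = median_median_line(points)
ls_slope, ls_intercept = least_squares_line(points)
print(f"median-median: y = {mm_intercept:.2f} + {mm_slope:.2f}x")
print(f"least squares: y = {ls_intercept:.2f} + {ls_slope:.2f}x")
```

With these invented points, the extreme value at (8, 30) pulls the least squares slope upward, while the median-median slope is less affected because it depends only on group medians.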
Up Next: Practice and Review
Students will need some time to solidify their new knowledge about the correlation coefficient. Here, I use some textbook exercises that give kids a chance to practice applying what they know. In particular, I want students to practice looking at scatter plots and matching them to phrases that describe the data ("Which of the scatterplots show a moderately strong linear association? Which show a negative correlation?") or to a set of correlation coefficients.
Today's lesson is followed by a work period in which students complete these sorts of practice exercises and a unit exam.