Comparing Apples to Oranges
Comparing two measures with radically different scales can be visually challenging. How can we compare two populations when the magnitude of one completely swamps the other? With a simple normalization step in Looker, one that end users can perform themselves, it becomes possible to discover insights and even make rough probability statements about the distributions of two or more populations. To motivate this discussion, we will go through an example using the CDC’s natality dataset from 2013.
This comparison will focus on two populations: women who gave birth in 2013 and who smoked one or more cigarettes a day during pregnancy, and women who gave birth in 2013 but who didn’t smoke any cigarettes during pregnancy. The area plots and tabular results are shown below.
Common sense might suggest a relationship between birth weight and the mother’s smoking habits during pregnancy, but it’s hard to tell from the plots. The problem is that so many more mothers don’t smoke than do that the smaller population is nearly invisible next to the larger one, which makes a visual comparison difficult.
Enter normalization. By normalizing the data, we force the two populations onto the same scale. In Looker, this is accomplished via a simple table calculation:
This divides each cell by the column total, yielding the following output:
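Outside Looker, the same divide-by-column-total step can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Looker table calculation itself; the function name `normalize` and the counts are hypothetical and are not the CDC figures.

```python
def normalize(counts):
    """Divide each bin count by the column total, giving shares that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a column whose total is zero")
    return {bin_label: n / total for bin_label, n in counts.items()}


# Hypothetical birth counts per weight bin (in pounds); NOT the CDC data.
smokers = {"2-3": 10, "3-4": 20, "4-5": 50, "5-6": 120}
non_smokers = {"2-3": 500, "3-4": 1000, "4-5": 3000, "5-6": 15500}

smokers_share = normalize(smokers)
non_smokers_share = normalize(non_smokers)

# Both columns are now on the same 0-to-1 scale, whatever their raw sizes.
for bin_label in smokers:
    print(f"{bin_label} lb: smokers {smokers_share[bin_label]:.2%}, "
          f"non-smokers {non_smokers_share[bin_label]:.2%}")
```

Although the made-up non-smoking population here is a hundred times larger, the two normalized columns can be plotted on one axis and compared directly.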
Dividing each cell by its column total tells us what percentage of that population falls in the cell. In the above tabular results, the green column on the right reads as follows: of newborns born to mothers who smoked during pregnancy, 0.73% weighed between 2 and 3 pounds, 1.41% between 3 and 4 pounds, 3.63% between 4 and 5 pounds, 11.71% between 5 and 6 pounds, and so on. Since we are now looking at relative distributions rather than the absolute magnitudes of these populations, we can easily compare them visually. Astute readers will recognize this as a discrete approximation of a probability density function, or pdf: each value estimates the probability that a birth in that population falls in the given weight range. In this case, the normalized distributions lend further support to our suspicion that smoking during pregnancy is a factor that affects birth weight.
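Because each normalized cell is a share of its population, adjacent bins can be added to make the kind of rough probability statement mentioned earlier. The sketch below uses only the four smoker percentages quoted above; the helper `share_between` is a hypothetical name, and bins outside the quoted 2 to 6 pound range are simply absent.

```python
# Shares (in percent) quoted in the text for mothers who smoked during pregnancy,
# keyed by (lower, upper) birth-weight bounds in pounds.
smoker_share_pct = {
    (2, 3): 0.73,
    (3, 4): 1.41,
    (4, 5): 3.63,
    (5, 6): 11.71,
}


def share_between(shares, low, high):
    """Sum the shares of every bin lying entirely within [low, high] pounds."""
    return sum(pct for (lo, hi), pct in shares.items() if lo >= low and hi <= high)


between_2_and_5 = share_between(smoker_share_pct, 2, 5)
print(f"{between_2_and_5:.2f}% of these newborns weighed between 2 and 5 pounds")
```

Running the same sum over the non-smoking column would give the matching figure for the other population, so the two can be compared as rough probabilities.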