Apples to Oranges: Comparing Distributions with Different Scales


(Daniel Nelson) #1

Comparing Apples to Oranges

Comparing two measures with radically different scales can be visually challenging. How can we make comparisons between two populations when the magnitude of one completely swamps the other? With a simple normalization step that end users can perform themselves in Looker, they can discover insights and even make rough probability statements about the distributions of two or more populations. To motivate this discussion, we will walk through an example using the CDC’s natality dataset from 2013.

This comparison will focus on two populations: women who gave birth in 2013 and who smoked one or more cigarettes a day during pregnancy, and women who gave birth in 2013 but who didn’t smoke any cigarettes during pregnancy. The area plots and tabular results are shown below.

The Problem

Common sense might suggest a possible relationship between birth weight and the mother’s smoking habits during pregnancy, but it’s hard to tell from the plots. The problem is that mothers who don’t smoke so heavily outnumber those who do that the smokers’ distribution is barely visible on the same axis, which makes a visual comparison difficult.

The Solution

Enter normalization. By normalizing the data, we force the two populations onto the same scale. In Looker, this is accomplished via a simple table calculation:

This divides each cell by the column total, yielding the following output:

Dividing each cell by its column total tells us what percentage of that column’s population falls in that cell. In the tabular results above, the green column on the right reads as follows: of newborns whose mothers smoked during pregnancy, 0.73% weighed between 2 and 3 pounds, 1.41% between 3 and 4 pounds, 3.63% between 4 and 5 pounds, 11.71% between 5 and 6 pounds, and so on. Since we are looking at relative distributions rather than the absolute magnitude of these populations, we can easily compare them visually. Astute readers will recognize this as a probability density function, or pdf. In this case, the normalized distributions lend further support to our suspicion that smoking during pregnancy is a factor that affects birth weight.
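Outside Looker, the same column-wise normalization can be sketched in a few lines of Python. This is only an illustration of the idea, not the table calculation itself: the counts below are made up for the example (they are not the CDC figures), and the names `counts`, `normalize`, and `shares` are hypothetical.

```python
# Hypothetical birth counts per weight bucket, one list per population.
# These numbers are invented for illustration; they are NOT the CDC data.
counts = {
    "non_smokers": [200, 500, 1500, 6000, 1800],
    "smokers": [10, 20, 50, 150, 20],
}


def normalize(column):
    """Divide each cell by the column total so the shares sum to 1."""
    total = sum(column)
    return [cell / total for cell in column]


# Each population is normalized against its own total, which puts
# both columns on the same 0-to-1 scale regardless of population size.
shares = {group: normalize(column) for group, column in counts.items()}

for group, column in shares.items():
    print(group, [f"{share:.2%}" for share in column])
```

Because each column is divided by its own total, the much larger population no longer dominates the chart, and the two shapes can be compared directly.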


#2

Dear Daniel,

First of all, thank you for this post. I am working on my master’s thesis, and this was exactly what I was looking for: a way to compare two datasets of different magnitude. It is a very simple but effective method.
I have tried to find a (scientific) source to back up this method in my thesis, but I was unable to find one. I was wondering if you have any source or reference I could use.

Best,
Kadir


(Daniel Nelson) #3

Hi Kadir,

Sorry this is a bit late, hopefully it still helps!

Since you’re writing a paper, do note that I played fast and loose with the term Probability Distribution Function. This is properly a PMF (probability mass function), which is discrete, as opposed to a PDF, which is continuous.

No refs offhand, but we can show this pretty easily with an example. Generally, if you have a histogram, you can get to a PMF by dividing each bar’s count by the total number of observations. Some people call this scaling.

Remember that PMFs must satisfy two properties: every probability is non-negative, and the probabilities sum to 1.

Let’s say you have a histogram that looks like this.

x
x x
x x x
1 2 3

If we divide each y value by the count of observations, n=6, then we get

Pr(x=1) = 3/6 = 1/2,
Pr(x=2) = 2/6 = 1/3,
Pr(x=3) = 1/6

These are all obviously non-negative, and if you add them together, you get 3/6 + 2/6 + 1/6 = 6/6 = 1. QED, this is a proper PMF :)
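The same check can be written as a short Python sketch. It assumes the histogram above corresponds to the six observations 1, 1, 1, 2, 2, 3; the variable names are hypothetical, and exact fractions are used so the sum is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

# The six x marks in the histogram: three at 1, two at 2, one at 3.
observations = [1, 1, 1, 2, 2, 3]

# Histogram: count of observations per value.
histogram = Counter(observations)
n = len(observations)

# PMF: divide each count by the number of observations.
pmf = {value: Fraction(count, n) for value, count in histogram.items()}

# The two PMF properties: non-negative, and summing to 1.
is_non_negative = all(p >= 0 for p in pmf.values())
sums_to_one = sum(pmf.values()) == 1

print(pmf, is_non_negative, sums_to_one)
```

Both checks hold for any non-empty histogram of counts, since counts are never negative and dividing by their total forces the sum to 1.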