Posted by Pranav Kakhandiki, Edited by Anika Asthana | Sep 23, 2018, 6:00:00 PM
Statistics is often thought of as straightforward: data always leans one way, supports one idea, and points to one conclusion. But what if data could trick you, appearing simple yet telling a completely different story when viewed from another angle? This is the case with Simpson’s paradox.

The table above shows how many Democrats and Republicans in the House of Representatives supported the landmark Civil Rights Act of 1964. Looking at each party’s total votes, it appears that Republicans favored the Act more than Democrats: overall, 80% of Republicans supported it, while only 61% of Democrats did. When each party is divided into northern and southern members, however, the data tells a different story. In each region, the percentage of Democrats who voted in favor of the Act is higher. In the North, the Democrats’ approval rate was 9 percentage points higher than the Republicans’; in the South, it was 7 points higher. So what’s going on? How can the overall percentage of Republicans voting for the Act be higher when the percentage of Democrats is higher in each region? More importantly, which figures should we trust? Let’s take a look at another example.
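Before turning to that example, the reversal can be checked with a few lines of arithmetic. The sketch below assumes the commonly reported House of Representatives roll-call tallies for the bill (yes and no votes by party and region). These counts do not appear in the text above and are used here only because they reproduce the 61%, 80%, 9-point, and 7-point figures quoted. The names `votes`, `approval`, and `totals` are illustrative.

```python
# Assumed (yes, no) tallies by region and party, taken from the commonly
# reported House roll call on the 1964 Civil Rights Act; not from the post.
votes = {
    "North": {"Democrat": (145, 9), "Republican": (138, 24)},
    "South": {"Democrat": (7, 87), "Republican": (0, 10)},
}

def approval(yes, no):
    """Share of votes cast in favor, as a percentage."""
    return 100 * yes / (yes + no)

# Within each region, Democrats have the higher approval rate.
for region, parties in votes.items():
    dem = approval(*parties["Democrat"])
    rep = approval(*parties["Republican"])
    print(f"{region}: Democrats {dem:.0f}%, Republicans {rep:.0f}%")

# Pooling the two regions reverses the comparison.
totals = {}
for party in ("Democrat", "Republican"):
    yes = sum(votes[region][party][0] for region in votes)
    no = sum(votes[region][party][1] for region in votes)
    totals[party] = approval(yes, no)
    print(f"Overall {party}: {totals[party]:.0f}%")
```

In these assumed tallies, 162 of the 172 Republican votes came from the North, while 94 of the 248 Democratic votes came from the South, so the two overall percentages weight the regions very differently.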
Test | Student A | Student B |
Test #1 | 99/100 = 99% | 1/1 = 100% |
Test #2 | 0/1 = 0% | 1/100 = 1% |
Total percentage | 99/101 = 98% | 2/101 = 2.0% |
Students A and B both take two tests, but comparing them can be tricky.
Take a look at the table above. Two students, A and B, have different teachers, each of whom gives two tests in the first semester. Student A has a 98% overall test average, while student B has only about 2%. Student B, however, achieved a higher grade on each individual test. This reversal occurs because the tests are worth very different numbers of points, so each test carries a different weight in the overall average. Although student A scored 0% on test #2, that 0/1 barely matters next to their 99/100 on test #1. Conversely, student B’s 1/1 on test #1 hardly affects their grade compared to their 1/100 on test #2.
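The arithmetic behind the table can be reproduced in a short sketch. It uses only the scores from the table; the names `scores`, `percent`, `overall`, and `weighted_average` are illustrative. It also checks that each overall percentage equals a points-weighted average of the per-test percentages, which is why the larger test dominates.

```python
from fractions import Fraction

# (points earned, points possible) per test, copied from the table.
scores = {
    "A": [(99, 100), (0, 1)],
    "B": [(1, 1), (1, 100)],
}

def percent(earned, possible):
    """Exact score on one test, as a percentage."""
    return Fraction(100 * earned, possible)

def overall(tests):
    """Total points earned over total points possible, as a percentage."""
    earned = sum(e for e, _ in tests)
    possible = sum(p for _, p in tests)
    return Fraction(100 * earned, possible)

def weighted_average(tests):
    """Per-test percentages averaged with weights equal to points possible."""
    total = sum(p for _, p in tests)
    return sum(percent(e, p) * Fraction(p, total) for e, p in tests)

for student, tests in scores.items():
    per_test = [float(percent(e, p)) for e, p in tests]
    print(student, per_test, f"overall {float(overall(tests)):.2f}%")
    # The overall score is exactly the points-weighted average.
    assert overall(tests) == weighted_average(tests)
```

Using `Fraction` keeps the comparison exact, so the equality check does not depend on floating-point rounding.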
Sadly, there is no single rule for resolving Simpson’s paradox. In the example above, the total percentages should be trusted over the individual test scores, because the tests were worth different numbers of points. In other cases, analyzing each category separately is more helpful than looking at the totals. For instance, when choosing the best hospital for cancer treatment, it is better to compare the recovery rates of cancer patients at each hospital than the recovery rates across all diseases. Furthermore, companies can exploit Simpson’s paradox to their advantage by presenting whichever view of the data flatters them, which makes the analysis much harder. The trick is to search for “lurking” variables: factors that are not obvious in the summary but change its meaning once they are accounted for.

Data can be misleading. Sometimes trusting the overall percentages is better than analyzing arbitrary categories, and other times the very opposite is true. To fully understand data and draw a sound conclusion, one must look objectively at how the data was collected and grouped, searching for hidden variables that could change the picture. Otherwise, we are susceptible to those who distort data, manipulating it to tell a story that promotes their own opinion.
Very interesting! I have a follow-up question – how does one find hidden features and variables in data?