Adam Smith, Wealth of Nations: “I have no great faith in political arithmetick …”
I am not being facetious when I call statistics “political arithmetic”. The British coined the phrase to describe the data collection and projections they were using to assess the population and determine public policy. The Britannica Online Encyclopedia reports: “In the 1680s the English political economist and statistician William Petty published a series of essays on a new science of ‘political arithmetic,’ which combined statistical records with bold—some thought fanciful—calculations, such as, for example, of the monetary value of all those living in Ireland. These studies accelerated in the 18th century and were increasingly supported by state activity, though ancien régime governments often kept the numbers secret. Administrators and savants used the numbers to assess and enhance state power…”.
Statistics is used to process the data gathered in scientific experiments and opinion polls. In these situations, the principled pollster spends a great deal of effort framing neutral questions that do not lead the respondent to a particular conclusion, devising a method of selecting a representative segment of the population, and carefully documenting the boundaries of the inquiry. Principled scientists likewise carefully detail the boundary conditions of their experiment or study, describe the methods used to collect and process the data, and preserve the original data in its entirety. Obtaining data that is not biased or otherwise polluted is a delicate art.
There is, of course, the other aspect of statistics, that portion that provides data to support or refute a particular assertion. This is where the aphorism “Figures don’t lie, but liars do figure” comes into play. We have all seen it; I have received uncounted stacks of opinion polls from political parties that blatantly lead one into filling in their desired answers. We see studies in newspapers and business journals that purportedly forecast economic futures, environmental disasters and so forth and are left wondering just who is right and by how much.
The global warming/climate change argument is a marvelous example. Without taking a position either way, we can use this ongoing statistical dispute to illustrate how each side uses its data to press its case.
Let’s look at this article from Range Magazine, “Are Climate Skeptics Wrong – or Right?” by S. Fred Singer, Ph.D. The data being pushed back and forth to prove or disprove global warming has carefully picked boundaries that favor the arguer’s assertion. Since any start or end point other than the entire history of the Earth, from its coalescence out of whirling gases to the present day, is necessarily arbitrary, there isn’t anything really wrong with this practice as long as the author explains why he chose his particular beginning and end points. It is interesting, however, what these choices do to the case being presented.
When addressing the question “Is the planet in fact warming?” Dr. Singer observes: “This crucial question cannot be answered honestly unless one specifies the time interval referred to. Clearly, the climate has warmed since the last ice age. It has also warmed since about 1850, in recovering from the Little Ice Age (roughly 1400-1800). But it has not warmed since the Medieval Warm Period of 1,000 years ago, or since the Holocene Optimum, which reached even higher temperatures 5,000 to 8,000 years ago. Nor has it warmed during the past decade.”
Those who insist global warming is real and caused by human activity have primarily been running with the 1850–2000 dates, since these neatly coincide with the ramping up of steam engines in the Industrial Revolution. Skeptics are fond of the data from the last decade, and of pointing out the Medieval Warm Period, when Vikings colonized Greenland, and Roman times, when wine grapes and olive trees grew in Germany.
Another popular way to sway public opinion in support of a stance is to choose exactly how to present the data. If a report declares that the level of a substance has risen 300%, it is wise to know whether it went from 1 part per billion to 3 parts per billion, which is hardly worth considering, or from 1 in every 10 samples to 3 in every 10 samples, which is considerable.

Conversely, the Air Resources Board, in supporting the draconian diesel regulations under which practically any sort of diesel engine operates today, is fond of declaring that they are saving 19,000 lives per year. That sounds like a lot until you remember there are 35 million people in California. That makes deaths/population = 0.00054, or 0.054%, which is scarcely the epidemic they claim. Looking at the cause-of-death declarations for 2010 (the latest currently available on the internet), there were 233,143 deaths overall, a death rate of under 1% of the population, which makes us remarkably robust overall. The most frequent listed causes of death ranged from heart disease at 58,034 down to Parkinson’s disease at 2,232. The closest cause of death that could relate to the effects of diesel pollution was chronic lower respiratory disease at 12,928, but it is quite a stretch to declare that every case of lung disease was caused by exposure to diesel particulates. Exposure to diesel emissions is not listed explicitly as a cause at all, which one might expect it to be, since ARB’s claimed diesel fatality figure would rank far above Parkinson’s disease.
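The arithmetic in that paragraph is simple enough to check directly. A quick Python sketch, using only the figures quoted above:

```python
# A "300% of former level" is a statement about ratios, not scale:
# 1 ppb -> 3 ppb and 1-in-10 -> 3-in-10 are the same ratio.
ratio = 3 / 1 * 100
print(ratio)  # -> 300.0 (the new level is 300% of the old)

# ARB's claimed lives saved, against California's population:
claimed = 19_000
population = 35_000_000
rate = claimed / population
print(f"{rate:.5f} = {rate * 100:.3f}%")  # -> 0.00054 = 0.054%

# Overall 2010 mortality, for scale:
total_deaths = 233_143
print(f"{total_deaths / population:.2%} overall death rate")  # -> 0.67%
```

Seeing both the fraction and the percentage side by side is exactly the cross-check the text recommends: either form alone can mislead.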
So what’s going on here? Where does ARB get this 19,000 figure? Frankly, I don’t know. They have not been transparent with this data. My best guess is that they are taking any death that is cancer or lung related, then multiplying in some probability factor that “estimates” the proportion of the fatalities that had diesel pollution as a contributing factor. Of course, people exposed to diesel may well have been exposed to other pollutants, too, like tobacco smoke. It’s hard to view any multiplier as anything more than pure supposition. Still, projections like this are common tools in “political arithmetic” and have been since its inception.
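Purely to illustrate the style of projection being guessed at here, a sketch of attributable-fraction arithmetic. The 0.5 multiplier is invented for the example; nothing below is ARB's actual, undisclosed method:

```python
# Hypothetical attributable-fraction projection -- the multiplier is
# pure supposition, invented for illustration; NOT ARB's documented method.
lung_deaths = 12_928      # 2010 chronic lower respiratory deaths (quoted above)
assumed_fraction = 0.5    # invented: share "attributed" to diesel exposure

estimate = lung_deaths * assumed_fraction
print(int(estimate))  # -> 6464

# Even attributing 100% of these deaths to diesel falls short of the
# claimed 19,000, so any such estimate must multiply a broader pool
# of deaths by an unverifiable factor.
print(lung_deaths < 19_000)  # -> True
```

The point is not the particular numbers but the structure: a hard count multiplied by a soft, unauditable fraction yields a figure that looks precise while resting on supposition.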
The take-away lesson when confronted with statistics in graphs and reports is to determine what hasn’t been presented. If percentages are given, look for the actual numbers. If numbers are given, look for the percentages. If one or the other is missing, the author may well be trying to hide something. Ask also why those particular boundary conditions were chosen. As we saw in the global warming example, what is in bounds and what is out of bounds determines the end result.

Ask yourself why the author uses medians instead of averages. A median only says that half the samples were higher than the median and half were lower. An average incorporates the values themselves into a processed picture of the whole. An average by itself is a great deal less useful than the average together with its standard deviation, which describes how widely the data is spread around the average. The data could be very tightly clustered around the average, which will produce a small standard deviation, or it could range widely, which will produce a large standard deviation. Then again, the distribution may look like a two-humped camel, with a dip in the middle, or lean more to one side or the other. This is important information, and it is often left out. Picture three classic bell curves with moderate, tight and wide data distributions.
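These points are easy to demonstrate with Python’s standard statistics module (the sample values are invented for illustration):

```python
import statistics as st

# Same mean and median, very different spreads:
tight = [9, 10, 10, 10, 11]
wide = [0, 5, 10, 15, 20]

print(st.mean(tight), st.mean(wide))        # both means are 10
print(st.median(tight), st.median(wide))    # both medians are 10
print(round(st.pstdev(tight), 2))           # -> 0.63 (tight cluster)
print(round(st.pstdev(wide), 2))            # -> 7.07 (wide spread)

# A "two-humped camel": mean and median both land in the dip,
# where almost no actual observations lie.
bimodal = [1, 1, 1, 9, 9, 9]
print(st.mean(bimodal), st.median(bimodal))  # both equal 5
```

An average or median quoted without any measure of spread hides exactly the difference these three samples display.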
Finally, there is the devilish problem of determining whether numbers that track each other demonstrate cause and effect or are merely correlated. Comparing the correlation between the results of the Super Bowl and the future of the stock markets makes for a fun parlor conversation, but using the Super Bowl results to decide when to invest in the stock market is silly.
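A tiny Python sketch shows how easily correlation appears without causation: any two series that merely share a trend correlate almost perfectly. The series below are invented for the example:

```python
# Pearson correlation of two made-up series that share nothing but an
# upward trend -- no causal link, yet a "perfect" correlation.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

years = range(10)
hot_dog_sales = [100 + 7 * t for t in years]    # invented: steady growth
phone_shipments = [5 + 3 * t for t in years]    # invented: steady growth

print(round(pearson(hot_dog_sales, phone_shipments), 4))  # -> 1.0
```

Two series driven by nothing more than the passage of time will correlate like this every time, which is precisely why correlation alone proves nothing about cause.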
So, the next time you look at a statistical report, try to see what isn’t there, then try to figure out why. It can be very illuminating, and may make a difference in how you view the situation. Good hunting!