Example 1.14: Sample Surveys
The following essays illustrate the importance of statistics in research and everyday life. To do this, we ponder a simple question: what would things be like in a world without statistics?
During the 1920s Ola Babcock Miller was active in the women’s suffrage movement and later became famous in Iowa for starting the State Highway Patrol in her first of three terms as Iowa’s Secretary of State. Her election in 1932 came as quite a surprise to the political pundits of the day, as she became the first woman, and the first Democrat since the Civil War, to hold statewide political office in Iowa. However, her election did not surprise her son-in-law, George Gallup, who predicted her victory using the first scientifically sampled election poll. Three years later Gallup founded the American Institute of Public Opinion and became well known for correctly predicting that Franklin Roosevelt would defeat Alf Landon in the 1936 Presidential election – in contrast to the predictions made by Literary Digest magazine based on a non-randomly gathered convenience sample more than 40 times as large.
The use of randomly generated samples pioneered by George Gallup is now the staple of hundreds of organizations that try to determine the outcome of elections before they happen. These organizations face severe challenges caused by the difficulty of reaching portions of the population, a response rate of less than 10% amongst those who are reachable, and the unknown demographic composition of the voters in an election yet to occur. New technological advances in reaching potential respondents and improved statistical modeling have helped to address some of these issues – but the level of bias in an individual election poll is typically on the same order as the random error. Luckily, the diversity of the methodology across many pollsters produces industry averages that still have very small errors in predicting the dozens of major races polled in each election cycle. Without Statistics we would have a poor understanding of public opinion … and without statistics, we wouldn’t know who won the election.
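The way diverse house biases can average out is easy to sketch in a short simulation. Everything below is hypothetical – an assumed true vote share, poll size, and bias spread – but it illustrates why an industry average can beat any single pollster:

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

def simulate_poll(true_share, n, house_bias):
    """One poll: n respondents, sampling error plus a fixed house bias."""
    hits = sum(random.random() < true_share + house_bias for _ in range(n))
    return hits / n

true_share = 0.52  # hypothetical true vote share
# 30 pollsters, each with its own house bias from a diverse pool of methods
polls = [simulate_poll(true_share, n=800, house_bias=random.gauss(0, 0.02))
         for _ in range(30)]

average = sum(polls) / len(polls)
worst_single_error = max(abs(p - true_share) for p in polls)
average_error = abs(average - true_share)
```

Because the average always lies between the most extreme polls, its error can never exceed the worst single poll’s error, and when house biases point in different directions it is usually far smaller.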
Example 1.15: Search Engines
The first Internet search engine was created in 1990 at McGill University in Canada and attempted to build a searchable census of all of the files on FTP sites at that time. But as the size of the Web grew exponentially, it soon became apparent that a method of sampling the Internet would be needed to produce the indexes required for searching. A solution was soon developed in which sampling is done by a web crawler or “spider” – software that records information from a web page, from all of the pages it links to, from all of the pages those pages link to, and so on.
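The crawling idea just described – visit a page, then the pages it links to, and so on – is a breadth-first traversal. A minimal sketch over a hypothetical in-memory link graph (invented site names, no real network access) might look like:

```python
from collections import deque

# Hypothetical link graph standing in for real fetched pages.
LINKS = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com", "d.com"],
    "d.com": [],
}

def crawl(seed):
    """Breadth-first crawl: visit a page, then the pages it links to, and so on."""
    seen, frontier, order = {seed}, deque([seed]), []
    while frontier:
        page = frontier.popleft()
        order.append(page)              # here a real spider would index the page
        for link in LINKS.get(page, []):
            if link not in seen:        # avoid revisiting pages
                seen.add(link)
                frontier.append(link)
    return order

pages = crawl("a.com")
```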
Statistical issues then arise in the indexing component of the process. What variables should be saved? For example, one variable in these indexes measures the number of inbound links to a page, weighted by the quality of the sites that link to the page in question. Interestingly, this is proportional to an estimate of the equilibrium probability of landing on a given page after a large number of clicks in a Markov model of Internet browsing. Next, what data structures and index size make for the quickest computation without sacrificing relevance? Even the large index maintained by Google, which is more than a hundred petabytes in size, holds just a small fraction of the estimated 30 trillion pages on the Internet.
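The equilibrium probability described above can be computed by power iteration on the Markov chain of clicks. Here is a minimal sketch on a hypothetical four-page web, using a damping factor of 0.85 (a conventional choice for modeling the occasional random jump):

```python
# Hypothetical web: page -> pages it links to
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N, d = 4, 0.85                            # number of pages, damping factor

rank = [1 / N] * N                        # start from a uniform distribution
for _ in range(100):                      # iterate the Markov chain to equilibrium
    new = [(1 - d) / N] * N               # probability of a random jump
    for page, targets in out_links.items():
        for t in targets:                 # each page splits its rank among links
            new[t] += d * rank[page] / len(targets)
    rank = new

# rank[i] now approximates the long-run probability of being on page i
```

Page 2, with the most inbound links from well-ranked pages, ends up with the highest equilibrium probability; page 3, with no inbound links, the lowest.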
Finally, searching algorithms must produce results in a split second. Results are rank orders of websites in the index that should be strongly related to the probability that the site is relevant to user intent and needs. Models for predicting relevance are constantly updated – partly to ensure that website owners improve their sites using best practices and not simply to artificially match with search ranking variables. Current models are based on several hundred variables continuously examined using variable selection and model building experiments against user responses to search engine results. Do users click more often on the highest ranked items? Do they stay longer on the sites they go to?
Thus, statistical issues are addressed in the sampling (web crawling), indexing, and ranking phases of the operation of a good search engine. Without statistics, we would have to search the Internet one site at a time.
Example 1.16: Weather Forecasting
When astronomer William Ernest Cooke was hired to run the new Perth Observatory at the end of the nineteenth century, he was also charged with providing weather forecasts for the surrounding areas in Western Australia. But Mr. Cooke was not satisfied with merely presenting his best forecast; he also wanted to provide the public with a measure of its uncertainty. So, in 1905 his team began attaching uncertainty values on a five-point scale to their twice-daily forecasts. Indeed, the forecasts he gave the highest rating turned out to be correct 98.5% of the time, while those rated in one of the two most uncertain categories were correct for only 56.5% of the forecasts. The century following these pioneering probability-based weather forecasts has brought vast improvements, due to the same themes that improve the quality of any statistical enterprise:
- Systematically Collected Data - remote sensing devices, including satellites, now automatically transfer data with nearly universal coverage of the planet;
- Sound Statistical Models - now informed by a scientific understanding of the relationships amongst dozens of measured variables;
- Computational Efficiency - weather forecasts and “cloud” computing make a perfect pair; and
- Statistical Reporting - forecasts now routinely carry both a prediction and an assessment of its uncertainty over time and space for multiple aspects of the weather.
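Cooke’s idea can be checked on any set of forecasts by grouping them by stated certainty and comparing the observed hit rates. The records below are invented for illustration, not Cooke’s actual data:

```python
from collections import defaultdict

# Hypothetical (certainty category, forecast was correct?) records,
# echoing Cooke's five-point scale -- not his actual forecasts.
records = [(5, True), (5, True), (5, True), (5, False),
           (3, True), (3, False), (1, True), (1, False), (1, False)]

totals, hits = defaultdict(int), defaultdict(int)
for category, correct in records:
    totals[category] += 1
    hits[category] += correct           # True counts as 1

hit_rate = {c: hits[c] / totals[c] for c in totals}
# A well-calibrated forecaster shows higher hit rates at higher certainty levels
```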
If it weren’t for statistics, we wouldn’t know the probability of rain tomorrow and we wouldn’t know how to dress for the weather today.
Example 1.17: Animal Models in Medicine
In his 1865 book An Introduction to the Study of Experimental Medicine, Claude Bernard, the French scientist often called the father of modern medicine, argued against the use of “statistics” in medicine. Interestingly, he was really arguing against poor statistical practice of the time in observational clinical studies and for using the scientific method in laboratory investigations based on many ideas that are statistical in nature. He argued against the misuse of statistics that displays only averages without understanding the sources of variability in the data. He argued that causative claims emerge more readily from experiments than from observation. He argued that experiments should have an underpinning of clear hypotheses that can be confirmed or refuted. He described how to eliminate sources of bias and was the first to suggest the use of blind experiments to foster objectivity. Claude Bernard often turned to the use of animals, especially mice, in experiments as a model of human physiology. Although mouse models are not appropriate for all human conditions, they have been very fruitful in investigating the mechanisms underlying many disease processes – especially those in cancer. Examples of such experimental systems that today provide great insight into the biology and genetics of human cancers include:
- purebred mice that lack immune systems;
- animals with a specific genetic aberration underlying a disease;
- mice that can have their genome manipulated to remove a specific cancer fighting mechanism; and
- mice that are amenable to transplantation of a human tumor.
Statistical ideas in handling variation are at the heart of all of these murine experiments. Statistics allows us to quantify the variability in measurements to decide on the scope of an experiment; to reduce the variability in designs through appropriate controls; to examine the variability in analyses through statistical modeling; and to precisely state the inferential conclusions that arise.
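Deciding on the scope of an experiment is, in practice, a sample-size calculation. A standard normal-approximation sketch for comparing two group means – with purely hypothetical numbers for a tumor-volume experiment – is:

```python
import math

def n_per_group(delta, sigma, ):
    """Normal-approximation sample size per group to detect a mean
    difference delta between two groups with common SD sigma,
    at a two-sided 5% significance level with 80% power."""
    z_alpha = 1.959964      # two-sided 5% critical value of the normal
    z_beta = 0.841621       # 80% power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical experiment: measurement SD of 4 units, aiming to
# detect a treatment difference of 5 units in mean tumor volume
n = n_per_group(delta=5, sigma=4)
```

Halving the detectable difference roughly quadruples the number of mice needed per group, which is why quantifying variability before the experiment matters so much.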
Without statistics pre-clinical medical science would be less efficient and more subject to ambiguous interpretation and without statistics lab mice would have nothing to do.
Example 1.18: Process Control
In 1950 the Japanese Union of Scientists and Engineers (JUSE) hosted W. Edwards Deming for a series of workshops on statistical process control for Japanese engineers and upper-level managers. The workshops focused on improving business processes and reducing the variation in results. They presented the endless feedback loop of “Plan-Do-Check-Act” for improvement pioneered by Deming’s mentor Walter Shewhart:
- Plan: Establish objectives and design/revise business processes to improve results
- Do: Implement the plan and systematically collect data
- Check: Analyze the data, especially differences from planned implementation and expected results to identify weaknesses in the plan
- Act: Determine root causes for variations from objectives and make changes to improve the process. Restart the cycle for continual improvement
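The “Check” step is often carried out with Shewhart’s control chart: estimate the process mean and standard deviation during stable operation, then flag any new measurement more than three standard deviations from the mean. A minimal sketch with invented measurements:

```python
import statistics

# Hypothetical process measurements from a stable period (e.g. part diameters, mm)
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
mean = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

ucl = mean + 3 * sigma   # upper control limit
lcl = mean - 3 * sigma   # lower control limit

# New measurements are checked against the control limits
new_points = [10.0, 10.1, 10.9, 9.95]
out_of_control = [x for x in new_points if not (lcl <= x <= ucl)]
```

Points inside the limits are treated as ordinary process variation; points outside trigger the “Act” step of looking for a root cause.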
Importantly, Deming’s workshops went beyond Shewhart’s methods for improving production processes and providing technical details to staff. He also focused on management processes and insisted that his audiences include executives in a position to make decisions. The results were stunning and led to the worldwide success of companies such as Sony and Toyota, which quickly became known for the reliability of their products and the resulting loyalty of their customers.
Deming continued to refine his ideas and synthesized them in his famous 14 points of quality management presented in his 1982 book Out of the Crisis. Adaptations of Dr. Deming’s integrated approach have had a prodigious influence on businesses worldwide.
Without statistics we wouldn’t know that fixing things that are broken is less efficient than avoiding broken things – and automobile warranties just wouldn’t be the same.
Example 1.19: Astronomy
Astronomy is perhaps the oldest science and the first to systematically collect data for analysis. For example, the ancient Greek astronomer and mathematician Hipparchus noticed the scatter in Babylonian measurements of the length of the year and wrote about the general problem of combining data to quantify a phenomenon – deciding on the middle of the range. Problems arising from the analysis of astronomical data continued to fuel the development of statistical methods for centuries – including Legendre’s suggestion of the least squares approach and Gauss’ presentation of the normal distribution in their studies of the orbits of comets and planets. In the 20th century, astronomy turned toward physics for insights, but important advances in the study of stochastic processes led to new discoveries about the clustering of galaxies and how they are distributed in the universe.
Today, the field of astrostatistics studies important problems like estimating how many Earth-like planets there might be in our Galaxy. It is very hard to find planets orbiting other stars, but since 1995 several thousand have been found by virtue of their tiny effects on the host stars (e.g. Doppler shifts as the planet orbits, or a 0.01% diminution of light as it transits in front of the star). A major goal is estimating the parameter dubbed eta-Earth, the fraction of stars with Earth-like planets in Earth-like orbits. The problem is survey bias: it is easier to detect bigger and more massive planets the size of Jupiter or Neptune, and planets closer to the star (within the orbit of Mercury). Because of this, virtually no known planets have both an Earth-like size and an Earth-like orbit ... but we can use statistical models and methods to extrapolate from the bigger surveys. Several researchers have done this in the last few years, and perhaps the most important analysis emerged at the end of 2013, based on data from the National Aeronautics and Space Administration’s Kepler mission. The results are converging: about 6%±2% of Sun-like stars have Earth-like planets that may be habitable - not too close to (hot) or too far from (cold) the star. That means there are billions of “Earths” in the Galaxy, with the closest probably only about 10 light-years away. Thus, without statistics, we wouldn’t know that the Universe is full of Earth-like planets and we wouldn’t know where the planets are.
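The extrapolation step amounts to correcting a raw detection rate for the planets a survey could not have seen. The sketch below uses invented numbers – not the Kepler team’s actual counts or completeness values – purely to show the shape of the calculation:

```python
import math

# Illustrative numbers only -- not the Kepler mission's actual values.
stars_surveyed = 40_000      # Sun-like stars searched
detections = 60              # small planets found in temperate orbits
completeness = 0.025         # fraction of such planets the survey could catch

# Correct the raw rate for the planets the survey necessarily missed
eta_earth = detections / (stars_surveyed * completeness)

# Approximate standard error on the detection count, propagated through
se = math.sqrt(detections) / (stars_surveyed * completeness)
```

With these made-up inputs the estimate lands near the 6% figure quoted above; the real analyses model completeness star by star rather than with a single number.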
Example 1.20: Epidemiology and the Effects of Smoking
At the end of World War II, Bradford Hill and Richard Doll in the Statistical Research Unit of the Medical Research Council decided to study the alarming increase in cases of lung cancer that was occurring in Great Britain. The two main causes for the increase being investigated were increased air pollution and increased use of tar in paving roads. They interviewed lung cancer patients in 20 London hospitals along with patients in the same hospitals with a different diagnosis. The interviews pointed not to pollution or tar but to a third factor: smoking.
The results convinced Richard Doll to quit smoking – but others were not convinced, so Doll and Hill sent letters to 59,000 medical doctors asking about their smoking habits and whether they would agree to be followed for health effects over time. About 40,500 replied, and by 1954 the results of this follow-up study had convinced Bradford Hill to quit smoking as well.
These landmark observational studies and others in the 1950s showed that the hallmarks of causation were all there:
- Time course
- Strong relationship
- Unlikely to occur by chance
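The “strong relationship” and “unlikely to occur by chance” hallmarks can be quantified from a case-control table with an odds ratio and a chi-square test. The counts below are hypothetical, not Doll and Hill’s actual data:

```python
# Hypothetical 2x2 case-control table (NOT Doll and Hill's actual counts):
#                 smokers  non-smokers
cases    = (130, 20)   # lung cancer patients
controls = (90, 60)    # patients with other diagnoses

a, b = cases
c, d = controls
n = a + b + c + d

odds_ratio = (a * d) / (b * c)   # strength of the smoking/cancer association

# Chi-square test of independence: compare observed counts to the counts
# expected if smoking and diagnosis were unrelated
row1, row2 = a + b, c + d
col1, col2 = a + c, b + d
expected = (row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n)
chi_sq = sum((obs - exp) ** 2 / exp for obs, exp in zip((a, b, c, d), expected))
```

Here the chi-square statistic far exceeds 3.84, the 5% critical value with one degree of freedom, so an association this strong would be very unlikely to occur by chance.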
Today the biological mechanisms underlying the disease-causing effects of smoking have been thoroughly studied in statistically designed and evaluated laboratory experiments, which have revealed genetic abnormalities, micro-RNA switches that change the behavior of genes, and even differences in the types of bacteria that inhabit a smoker’s body. But without statistics, the strength of the evidence could not be well evaluated and we wouldn’t know that smoking was bad for you.
Example 1.21: Unemployment Data
Between 1880 and 1935, census takers in several countries began asking questions about employment and unemployment to help their nations’ economic planning. Census enumerators were told to define a person as unemployed if they had generally been gainfully employed previously but were not currently working. Unfortunately, this led to many difficulties such as pinpointing when an individual left the work force voluntarily for retirement or further training. The “previous gainful employment” definition of the workforce also underestimated the true unemployment rate since those who had never held a job, but wanted one, were uncounted.
However, a group of U.S. statisticians working in the Division of Social Research of the Works Progress Administration (WPA) under the direction of John Nye Webb had a better idea. They reasoned, first, that a well-constructed survey would provide more accurate results and could be fielded more often than a costly census. Second, they developed an operational definition of the unemployed as people who were not working in the previous week but were searching for work. They tested their ideas in a pioneering random sample in 1937 – the first scientifically constructed national sample conducted by the Census Bureau and the first to include confidence intervals in the subsequent reporting of results. The WPA “searching for work” definition of the unemployed performed well and later became the international standard after its adoption by the United Nations’ International Labour Organization in 1954.
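A confidence interval for an unemployment rate estimated from a probability sample follows the standard formula for a proportion. With invented survey numbers:

```python
import math

# Hypothetical survey: 5,000 labor-force members, 400 searching for work
n, unemployed = 5000, 400

p_hat = unemployed / n                       # estimated unemployment rate
se = math.sqrt(p_hat * (1 - p_hat) / n)      # standard error of a proportion
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)  # approximate 95% confidence interval
```

Reporting the interval rather than the bare 8% estimate is exactly the practice the 1937 WPA survey pioneered.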
Without statistics governments wouldn’t understand key facts about their economies – and people wouldn’t know when they are unemployed.
Example 1.22: Computer Controlled Devices
The first computer-controlled machines were created soon after World War II, with milling tools developed as an early application of the new technology. Today an endless list of devices, from cars to appliances to telephones to airplanes to traffic lights to medical instruments to elevators, is run by computer software using statistical algorithms to guide decision making at their core. Designing an algorithm that operates a bank of elevators optimally involves knowing the arrival times of the people coming to use them, along with the matrix of probabilities that the next passenger will request a ride from any particular floor to another at that time of day. Optimality criteria then involve making the wait-time probability distribution for passengers, and the energy-usage or runtime distribution for the elevators, as small as possible (e.g., a small expected value, or a small probability of exceeding some threshold). Because the data necessary to fully know this matrix of probabilities are unavailable at the time of installation, recent approaches include a “learning” component, such as Bayesian updating. Without the statistical modeling and computational advances that underlie these algorithms, wait times for elevators and their energy consumption would increase. Without statistics we’d have to take the stairs.
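One simple version of such a “learning” component is a Dirichlet-style Bayesian update of the floor-request probabilities: start from uniform pseudo-counts and add one count per observed request. A sketch for a hypothetical four-floor building:

```python
# Hypothetical four-floor building. Start from a uniform Dirichlet prior
# (pseudo-count 1 per destination floor) and update as requests are observed.
floors = [1, 2, 3, 4]
counts = {f: 1.0 for f in floors}          # prior pseudo-counts

def update(counts, observed_floor):
    """Bayesian (Dirichlet) update: add one count for the observed request."""
    counts[observed_floor] += 1

def predictive(counts):
    """Posterior predictive probability of each destination floor."""
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}

for request in [3, 3, 1, 3, 2, 3]:         # invented morning traffic pattern
    update(counts, request)

probs = predictive(counts)
# The controller now expects floor 3 most often and can position a car there
```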
Example 1.23: The Census
Statistics has helped to maintain the United States’ representative form of government. According to the U. S. Constitution (1787), the number of representatives assigned to a particular state is in proportion to its population count determined by a census that is conducted every ten years. The United Nations recommends a ten-year interval between censuses as a minimum for all member states. That standard has been adopted by most nations, with only a few conducting a census more often (Australia, Canada, Japan, and New Zealand use a five-year interval).
Each decennial census is an attempt to provide complete counts of all people (including limited demographics such as name, age, date of birth, gender, race/ethnicity, and household relationships) and all habitable dwellings in the country using statistical methods. In the case of the United States 2010 Census, the data are used:
- To apportion the 435 seats in the U. S. House of Representatives
- To help draw boundaries to meet local, state, and federal requirements for representation
- To assist in the allocation of hundreds of billions of dollars per year in state and federal funding to local, state, and tribal governments
- To plan economic development and assess the need for schools, hospitals, job training, etc.
- To plan communities and to predict future needs
- To plan the location of roads and public facilities
- To analyze social and economic trends
- To plan and evaluate government programs and policies
- To help meet many local, state, and federal legal requirements
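The apportionment itself is a statistical procedure: since 1941 the U.S. has used the “method of equal proportions” (Huntington–Hill), which gives every state one seat and then awards each remaining seat to the state with the highest priority value, population divided by the square root of n(n+1), where n is its current number of seats. A sketch with hypothetical state populations and a ten-seat house (the real calculation uses 50 states and 435 seats):

```python
import heapq, math

# Hypothetical populations; real apportionment uses 50 states and 435 seats.
populations = {"A": 6_000_000, "B": 3_000_000, "C": 1_000_000}
total_seats = 10

# Every state starts with one seat; remaining seats go, one at a time,
# to the state with the highest priority value pop / sqrt(n * (n + 1)).
seats = {s: 1 for s in populations}
heap = [(-pop / math.sqrt(1 * 2), s) for s, pop in populations.items()]
heapq.heapify(heap)                       # max-priority via negated values

for _ in range(total_seats - len(populations)):
    _, state = heapq.heappop(heap)        # state with highest priority
    seats[state] += 1
    n = seats[state]
    heapq.heappush(heap, (-populations[state] / math.sqrt(n * (n + 1)), state))
```

With these invented populations the ten seats split 6-3-1, matching the 6:3:1 population ratio exactly; with real populations the method produces the fairest whole-number approximation to proportionality.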
With decennial censuses serving as anchors, nations may also conduct more detailed sample surveys far more frequently (monthly, quarterly, or annually), collecting information from carefully selected (usually probability-based) subsets of the population in order to estimate specific characteristics using statistical methods.
Without statistics, any nation would lack vital information about its people - not knowing:
- Who they are
- How many they are
- What they do
- How they live
- Where they are