Even with the promises of big data, open data, and data hacking it is important to remember that having more data does not necessarily mean being more informed. The real value of data, whatever its quantity or scope, comes from the question(s) the data can help answer.
There are various reasons any given set of data might or might not provide reliable answers, the most basic being the data’s accuracy. Clever new technologies that scan, scrape, geocode, or mobilize loads of data aren’t much use if the data are wrong. All we end up with is scanned, scraped, geocoded, and mobilized misinformation. Garbage in, garbage out, as they say.
Getting from data to answers requires understanding the meaning of the data and its relevance to our questions. With statistical data much of this meaning and relevance depends on three particular ideas:
- How the data were selected
- The group/population researchers are trying to learn about
- How these two relate
I am here to tell you that if you master these ideas your statistical knowledge will immediately quadruple! Okay. I admit my estimate of learning gain could be inflated (but maybe not). In any case, the importance of these ideas cannot be exaggerated. British statistician T.M.F. Smith called them “the most basic concepts in statistics” in his 1993 presidential address to the Royal Statistical Society. He also said:
In statistics data are interesting not for their own sake but for what they tell us about the phenomenon that they represent, and specifying the target population and the selection mechanism should be the starting point for any act of statistical inference.1
Although you may have already learned these ideas in research and statistics courses, I encourage you to revisit your understanding of them. I say so because your recall may have been colored by misinformation on this topic appearing in library literature and national research projects. I discuss some of this misinformation further on.
In the meantime, let’s explore these ideas using a library example: Suppose we are interested in learning about consumer demand for e-books in the U.S. And we happen to have access to a big datafile of e-book circulation for all U.S. public libraries—titles, authors, call numbers, reserve lists, e-reader devices, patron demographics, and the like. We analyze the data and then issue this pronouncement:
E-book demand in the U.S. is highest for these genres: romance novels, self-improvement, sci-fi/fantasy, biographies, and politics/current events.
Is our pronouncement correct? Not very. The list of genres is a poor reflection of consumer demand for e-books, first, because our data describe only public library borrowers instead of all e-book consumers. (Our datafile did not tap demand among e-book purchasers.) Second, in the libraries’ collections, the proportions of e-books in the genres are likely to differ from those for all e-books published nationally. Demand for a genre, say e-biographies, that is underrepresented in library holdings will be understated compared with demand for e-biographies among U.S. consumers as a whole. So, besides giving a slanted view of consumer behavior, the e-book datafile is also slanted in terms of the genres consumers have access to in the first place.
Third, the pronouncement is inaccurate even when we limit our question just to demand among library e-book borrowers. The small number of available e-book copies will have made it impossible for some borrowers to check out the e-books they wanted when they wanted them. This user demand will not necessarily be accounted for in the e-book datafile.
The reasons just given for doubting the pronouncement are all related to how the e-book data were collected in the first place: the selection mechanism, to use Professor Smith’s term. Understanding how collection methods affect what data can and cannot tell us is the knowledge-quadrupling information I’m talking about. Here is that information in a nutshell:
The way that data are selected either supports or detracts from the validity of conclusions drawn. Thus, data selection directly affects the accuracy of answers gleaned from the data. Inaccuracy due to data selection, called selection bias, comes from slantedness and/or incompleteness of data. This bias occurs when certain types of subjects/respondents are systematically over- or under-represented in the data. Relying on biased data is usually risky and sometimes irresponsible.
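To make the idea concrete, here is a minimal sketch in Python using invented numbers. The population of 1,000 consumers, the two genres, and the split between borrowers and purchasers are all assumptions made up for illustration, not figures from any real datafile. The sketch compares the share of demand for biographies among all consumers with the share we would see if we counted library borrowers only.

```python
from collections import Counter

# Hypothetical population of 1,000 e-book consumers (all counts invented).
# Each record is (how the person gets e-books, preferred genre).
population = (
    [("borrower", "romance")] * 250
    + [("borrower", "biography")] * 50
    + [("purchaser", "romance")] * 200
    + [("purchaser", "biography")] * 500
)


def genre_shares(records):
    """Return each genre's share of the records, as a fraction of the total."""
    counts = Counter(genre for _, genre in records)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


# What we would see with data on every consumer.
true_shares = genre_shares(population)

# What a library circulation file shows: borrowers only.
library_shares = genre_shares([r for r in population if r[0] == "borrower"])

# Selection bias: how far the library-only figure sits from the true one.
bias = library_shares["biography"] - true_shares["biography"]

print(f"Biography share, all consumers:  {true_shares['biography']:.1%}")
print(f"Biography share, borrowers only: {library_shares['biography']:.1%}")
print(f"Selection bias: {bias:+.1%}")
```

In this made-up case the library file puts biographies at roughly 17 percent of demand when the figure for all consumers is 55 percent. No amount of careful analysis of the borrower records would reveal the gap, because the purchasers were never in the file.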
Most library and information science professionals do not understand selection bias. Nor are they well-informed about survey sampling best practices. And, as I mentioned, some library literature and national projects have misinformed readers about these topics. I’d like to discuss a few examples as a way to clear up some of the confusion.
One example is an article in a peer-reviewed library journal about the generalizability2 of survey findings (also known as external validity). The researchers wondered whether specific user traits were so common that they were very likely true for all academic libraries. User traits is a term I have devised as shorthand for attributes, behaviors, or trends detected in survey or other library data. (It’s not an official term of any sort, nor did the researchers use it.) A trait might be something like:
The average length of time undergraduate students spend in university libraries is markedly shorter for males than for females.
The researchers figured that if a trait like this one were found to be true in surveys conducted at their own and a dozen or so peer libraries, then it should be true across the board for all academic libraries. They proceeded to identify several uniform traits detected in multiple surveys conducted at their own and their peer libraries. (Thus, their study was a survey of surveys.) They ended up advising other academic libraries not to bother studying these traits on behalf of their home institutions. Instead, the other libraries should just assume these traits would hold true exactly as they occurred at the libraries that had already done the surveys.
This is bad advice. The researchers’ sample of library survey results was too limited. They reached out only to the dozen or so libraries that were easily accessible. Choosing study subjects this way is called convenience sampling. Convenience samples are almost always poor representations of the larger group/population of interest. (Another type of convenience sample is the self-selected sample. This is when researchers announce the availability of a survey questionnaire and then accept any volunteers who show up to take it. We’ll revisit this type of slanted sampling further on.)
The best way to avoid selection bias in our studies is to use random (probability) sampling. Because every member of the population has a known chance of being chosen, random sampling gives us grounds to treat the subjects selected as a fair and balanced representation of the larger group/population of interest. The only thing we can surmise from a convenience (nonprobability) sample is that it represents the members of which it is composed.
Because they used an unrepresentative (nonprobability) sample rather than a representative (probability) sample, the researchers in the example above had no grounds for claiming that their findings applied to academic libraries in general.
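The contrast between the two sampling methods can be sketched with a small simulation. Everything here is an assumption for illustration: the 1,000 libraries, the trait values, and the idea that the researchers’ peers cluster at one end of the range. The only point is to compare a dozen hand-picked peers with a dozen libraries drawn at random.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 academic libraries, each with an invented
# value for some user trait (say, average minutes per undergraduate visit).
# The trait rises steadily from the first library to the last.
population = [30 + 0.06 * i for i in range(1000)]


def mean(values):
    return sum(values) / len(values)


# Convenience sample: the researchers' own library plus 11 similar peers,
# modeled here as 12 libraries clustered at the top of the list.
convenience_sample = population[-12:]

# Probability sample: 12 libraries drawn at random, each equally likely.
random_sample = random.sample(population, 12)

population_mean = mean(population)
convenience_error = abs(mean(convenience_sample) - population_mean)
random_error = abs(mean(random_sample) - population_mean)

print(f"population mean:          {population_mean:.2f}")
print(f"convenience sample error: {convenience_error:.2f}")
print(f"random sample error:      {random_error:.2f}")
```

A single random draw can still miss the mark, but its misses are a matter of chance rather than a built-in slant, and they tend to shrink as the sample grows. The convenience sample misses by the same wide margin every time, because the slant comes from how its members were chosen.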
Before moving to the next example some background information is necessary. I suspect that library researchers have taken statistics courses where they learned certain statistical rules-of-thumb that, later on, they end up mis-remembering. As you might expect, this leads to trouble.
Statistics textbooks usually talk about two basic types of statistics, descriptive and inferential. Descriptive statistics are summary measures calculated from data, like means, medians, percentages, percentiles, proportions, ranges, and standard deviations. Inferential statistics (also known as statistical inference) have to do with extrapolating from the sample data in order to say something about the larger group/population of interest. This is exactly what we’ve been discussing already, being able to generalize from a sample to a population (see footnote #2). This amounts to inferring that patterns seen in our samples are fair estimates of true patterns in our target groups/populations. Drawing representative samples provides the justification for this inference.
Inferential statistics also entail a second type of inference related to how random chance can cause apparent patterns in sample data that are not likely to be true in the larger population. This more esoteric form of inference involves things like hypothesis testing, the null hypothesis, statistical significance, and other convoluted issues I’ve written about before here. It so happens that statistics textbooks often recommend the use of random (probability) sampling in studies when researchers intend to conduct statistical significance testing, a rule-of-thumb that may have confused researchers in this next example.
This example is a study of public library programs published in a peer-reviewed journal in which researchers acknowledged their use of convenience (nonprobability) sampling. It seems they focused on the textbook recommendation I just mentioned in the prior paragraph. They apparently reasoned, “Since it’s bad form to apply statistical significance testing to a convenience sample, we won’t do that. Instead, we’ll stick to descriptive statistics for our sample data.” They proceeded to report means and medians and percentages and so on, but then announced these measures as applicable to U.S. public libraries in general. The researchers abided by the esoteric statistical rule, yet ignored the more mainstream rule. In fact, the mainstream rule is important enough to qualify for the knowledge-quadrupling category. Here it is:
Without representative sampling there is no justification for portraying survey findings as applicable (generalizable) to the larger population of interest.
Library organizations violate this rule every time they post a link to an online survey on their website. Inviting anyone and everyone to respond, they end up with the biased self-selected sample described already. Then, as in the prior example, they report survey findings to users and constituents as if these were accurate. But they are not. Because this practice is so common, it has become respectable. Nevertheless, the promotion of misinformation is unethical and—unless you work in advertising, marketing, public relations, law, or politics—professionally irresponsible.
Textbook rules-of-thumb may have also been a factor in this next example. By way of introduction, recall that we collect a sample only because it is impossible or impractical to poll every member of the group/population of interest. When it is possible to poll all members of the population, the resulting survey is called a census or a complete enumeration. If we have the good fortune of being able to conduct a census, then we do that, of course.
Unfortunately, researchers publishing in a peer-reviewed library journal missed this opportunity even though they happened to have a complete enumeration of their population—an electronic file, actually. Instead of analyzing all of the data in this datafile, for some reason (probably recalling cookbook steps from a statistics course) the researchers decided to extract a random sample from it. Then they analyzed the sample data and wrote the article based only on that data. Because these researchers relied on a portion rather than the entire dataset, they actually reduced the informativeness of the original data. They failed to understand this:
The more closely the composition of a sample matches the larger group/population, the more accurate the measures taken from the sample (means, medians, percentages, and so on) will be. As a corollary, a larger representative sample is better than a smaller representative sample because its composition typically matches the larger population more closely. When a sample is the equivalent of a census of the entire population, the sample is perfectly accurate (generally speaking).
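Here is a rough simulation of that principle, assuming a made-up population of 10,000 patrons with invented visit counts. It draws random samples of several sizes many times over, records how far each sample mean lands from the population mean on average, and then treats the full file as a census. The sizes, the number of repeat draws, and the function name `average_error` are all my own choices for the sketch.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 patrons, each with an invented
# whole-number count of annual library visits between 0 and 100.
population = [random.randint(0, 100) for _ in range(10_000)]
population_mean = sum(population) / len(population)


def average_error(sample_size, draws=200):
    """Average absolute error of the sample mean over repeated random draws."""
    total = 0.0
    for _ in range(draws):
        sample = random.sample(population, sample_size)
        total += abs(sum(sample) / sample_size - population_mean)
    return total / draws


errors = {n: average_error(n) for n in (50, 500, 5_000)}

# A "sample" containing every member is a census: its mean is the true mean.
census_error = average_error(len(population), draws=1)

for n, err in errors.items():
    print(f"random sample of {n:>5}: average error {err:.2f} visits")
print(f"census of {len(population)}: error {census_error:.2f} visits")
```

The average error falls as the random sample grows and disappears when the whole file is used. Drawing a sample from a complete enumeration can only give those errors back.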
And, yes, I should also add:
A small representative sample is better than a large unrepresentative sample. And an unrepresentative sample is possibly better than not conducting the survey at all, but (a) not by very much and (b) only if our luck is reasonably good. If our luck is bad, measures from the unrepresentative sample will be totally wrong, in which case not conducting the survey is the better option. (Better to be guided by no information than wrong information.)
If your library always has good luck, then it should by all means use an unrepresentative sampling method like convenience sampling. You can explain to the library’s constituents how the library’s consistently good luck justifies this use.
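For readers who would rather not depend on luck, the trade-off between sample size and representativeness can also be sketched in code. The population, the 50 percent satisfaction rate, and the response probabilities below are all invented. In particular, the assumption that satisfied users are three times as likely to answer a web survey is mine, chosen only to show what self-selection can do.

```python
import random

random.seed(2024)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20,000 library users, exactly half of them
# satisfied (1) and half not (0). All numbers are invented.
population = [1] * 10_000 + [0] * 10_000
true_rate = sum(population) / len(population)  # 0.5 by construction

# Small probability sample: 100 users drawn at random.
small_random = random.sample(population, 100)

# Large self-selected sample: everyone sees the web survey link, but
# satisfied users are assumed to be three times as likely to answer it.
response_prob = {1: 0.30, 0: 0.10}
self_selected = [u for u in population if random.random() < response_prob[u]]

small_rate = sum(small_random) / len(small_random)
big_rate = sum(self_selected) / len(self_selected)

print(f"true satisfaction rate: {true_rate:.1%}")
print(f"random sample (n={len(small_random)}): {small_rate:.1%}")
print(f"self-selected sample (n={len(self_selected)}): {big_rate:.1%}")
```

Under these assumptions the self-selected sample is dozens of times larger than the random one, yet it lands near 75 percent when the true rate is 50 percent. Collecting more volunteers only makes the wrong answer look more precise.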
Now onto a final case of unadulterated selection bias from a few years back. I believe the high visibility of this study justifies naming the organizations that were involved in its production. My purpose is to remind readers that the status of an institution and the quality of its data analysis practices are not necessarily related. Which, in a way, is good news since it means humble organizations with limited credentials and resources can learn data analysis best practices and outperform the big guys!
So to the story. This is about the study funded by a $1.2 million grant from the Bill & Melinda Gates Foundation to the Online Computer Library Center (OCLC) entitled From Awareness to Funding, published in 2008.3 The surveys used in the study were designed and conducted by the internationally acclaimed advertising firm Leo Burnett USA (à la Mad Men!).4
For this study Leo Burnett researchers surveyed two populations, U.S. voters and U.S. elected officials. Survey respondents from the voter group were selected using a particular type of probability sampling. (This is good, at least on the surface.) The resulting sample consisted of 1,900 respondents to an online questionnaire. The elected officials sample was made up of self-selected respondents to invitations mailed to subscribers of Governing, a professional trade journal. In other words, elected officials were selected via a convenience sample. (This is bad.) Nationwide, 84 elected officials completed Leo Burnett USA’s online questionnaire. (This is not so good either.)
Roughly, there are 3,000 counties in the U.S. and 36,500 cities, towns, townships, and villages.5 Let’s say each entity has on average 3 elected officials. Thus, a ballpark estimate of the total count of elected officials in the U.S. is 118,500. To omit officials in locales with no public library let’s just round the figure down to 100,000.
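The back-of-envelope estimate can be checked in a few lines of Python. The sketch uses only figures already given in this post: the two entity counts, the assumed average of 3 officials per entity, the rounded-down 100,000, and the 84 respondents. The variable names are mine.

```python
# Figures quoted in the text above; the 3-per-entity average is the
# author's own rough assumption, not a measured value.
counties = 3_000
municipalities = 36_500  # cities, towns, townships, and villages
officials_per_entity = 3

entities = counties + municipalities
estimated_officials = entities * officials_per_entity

# Working figure after rounding down for locales with no public library.
working_population = 100_000

# The elected-officials sample reported for the study.
respondents = 84
sampling_fraction = respondents / working_population

print(f"entities: {entities:,}")
print(f"estimated officials: {estimated_officials:,}")
print(f"sampling fraction: {sampling_fraction:.3%}")
```

Run as written, the sketch gives 39,500 entities, 118,500 officials before rounding down, and a sampling fraction of less than one tenth of one percent. The smallness of that fraction matters less than how the 84 were chosen.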
High-powered Leo Burnett USA settled for a very low-powered and quite unreliable sample—84 self-selected officials—to represent a population of about 100,000. The OCLC report acknowledged this deficiency, noting:
Due to the process by which respondents were recruited, they represent a convenience sample that is quantitative but not statistically representative of all local elected officials in the United States.6
Professional standards for marketing researchers caution against misrepresenting the quality of survey samples. The Code of Marketing Research Standards obliges marketing researchers to:
Report research results accurately and honestly… [and to] provide data representative of a defined population or activity and enough data to yield projectable results; [and] present the results understandably and fairly, including any results that may seem contradictory or unfavorable.7
So, the responsible thing for Leo Burnett USA to do was to announce that reporting any details from the 84 respondents would be inappropriate due to the inadequacy of the sample. Or, in light of marketing research professional standards, they could have made an effort to draw a probability sample to adequately represent the 100,000 U.S. elected officials—perhaps stratified by city/town/township/village.
But alas, Leo Burnett USA and presumably the OCLC authors chose a different strategy. First, as a technicality, admit the sample’s deficiency in the report text (their quotation above). Then, ignore both the deficiency and the admission by portraying the data as completely trustworthy. As a result, an entire chapter in the OCLC report is devoted to quite unreliable data. There you will find 18 charts and tables (an example is shown below) with dozens of interesting comparisons between U.S. elected officials and U.S. voters, thoughtfully organized and confidently discussed.
Source: From Awareness to Funding, OCLC, Inc. p. 3-5.
So what’s not to like? Well, we might dislike the fact that the whole thing is a meaningless exercise. When you compare data that are accurate (like the purple-circled 19.9 voters figure above) with data that are essentially guesswork (like the blue-circled 19.0 elected officials figure above), the results are also guesswork! This is elementary subtraction which works like this:
The Wild-Ass Answer is off by however much the Wild-Ass Guess is! This isn’t necessarily part of the knowledge-quadrupling information I mentioned. But it’s handy to know. You can see another example here.
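The subtraction can be spelled out in a short sketch. The two chart figures (19.9 and 19.0) come from the report. The "true" values for elected officials are invented, since nobody knows the real one, and the helper name `gap_and_error` is my own. Like the argument above, the sketch takes the voter figure as accurate.

```python
def gap_and_error(accurate, guess, truth):
    """Return (reported gap, true gap, error in the reported gap).

    accurate: a figure we trust
    guess:    the unreliable figure actually reported
    truth:    a hypothetical true value for the unreliable figure
    """
    reported_gap = accurate - guess
    true_gap = accurate - truth
    return reported_gap, true_gap, reported_gap - true_gap


voters_figure = 19.9     # from the chart: voter probability sample
officials_figure = 19.0  # from the chart: 84 self-selected officials

# Nobody knows the true officials figure, so try a few invented ones.
for hypothetical_truth in (12.0, 19.0, 27.5):
    reported, actual, error = gap_and_error(
        voters_figure, officials_figure, hypothetical_truth
    )
    guess_error = officials_figure - hypothetical_truth
    print(
        f"if truth were {hypothetical_truth}: reported gap {reported:.1f}, "
        f"true gap {actual:.1f}, gap off by {abs(error):.1f}, "
        f"guess off by {abs(guess_error):.1f}"
    )
```

Whatever true value we plug in, the reported gap is off by exactly as much as the guess is. An accurate number on one side of the comparison does nothing to rescue the other side.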
As I said, this 2008 OCLC report was promoted far and wide. And the dubious elected officials data were showcased in the GeekTheLibrary initiative (the $5+ million sequel to the 2008 OCLC study) as shown in these banners that appeared on the initiative’s website:
Banners posted on http://www.geekthelibrary.org.
Due to selection bias the statements in both banners are pure speculation. Incidentally, the GeekTheLibrary initiative was, shall we say, data-driven. It was designed based on findings from the 2008 OCLC study. We can only hope that there weren’t very many program strategies that relied on the study’s insights into U.S. elected officials.
That, of course, is the problem with unrepresentative survey samples. They are likely to produce unreliable information. If our objective is accurate and unbiased information then these samples are too risky to use. If our objective is going through the motions to appear data-driven and our audiences can’t verify our data on their own, then we can use these samples with no worries.
1 Smith, T.M.F. (1993). Populations and selection: Limitations of statistics, Journal of the Royal Statistical Society – Series A (Statistics in Society), 156(2), 149.
2 Generalizability refers to the extent to which we have ample grounds for concluding that patterns observed in our survey findings are also true for the larger group/population of interest. Other phrases describing this idea are: Results from our sample also apply to the population at large; and we can infer that patterns seen in our sample also exist in the larger population. Keep reading, as these ideas are explained throughout this blog entry!
3 De Rosa, C. and Johnson, J. (2008). From Awareness To Funding: A Study of Library Support in America, Dublin, Ohio: Online Computer Library Center.
4 On its Facebook page Leo Burnett Worldwide describes itself as “one of the most awarded creative communications companies in the world.” In 1998 Time Magazine described founder Leo Burnett as the “Sultan of Sell.”
5 The OCLC study purposely omitted U.S. cities with populations of 200,000 or more. Based on the 2010 U.S. Census there are 111 of these cities. For simplicity, I had already rounded the original count (36,643) of total cities, towns, townships, and villages down to 36,500. This reduction of 143 more than offsets the 111 largest U.S. cities, making 36,500 a reasonable estimate here.
6 De Rosa, C. and Johnson, J. (2008), 3-1. The phrase “sample that is quantitative” makes no sense. Samples are neither quantitative nor qualitative, although data contained in samples can be either quantitative or qualitative. The study researchers also misunderstand two other statistical concepts: Their glossary definition for convenience sample confuses statistical significance with selection bias. These two concepts are quite different.
7 Marketing Research Association. (2007). The Code of Marketing Research Standards, Washington, DC: Marketing Research Association, Inc., para. 4A; italics added.