Indentured Certitude

I want to share some information with you from a resource I mentioned last month. The resource is Edward Suchman’s 1967 book, Evaluative Research, and the information is this diagram, which presents a basic model of evaluation:1

I share the diagram because it presents two ideas that don’t always percolate to the top of discussions of library outcome assessment. The first idea is the need for programmatic values to be made explicit beforehand. Suchman, who worked in the public health field, gave this example:

Suppose we begin with the value that it is better for people to have their own teeth rather than false teeth. We may then set our goal that people shall retain their teeth as long as possible.2

Of course, it’s quite possible to hold different values. For instance, one might prefer false teeth over natural ones. This might be based on the belief that lifelong dental care is too expensive. And maybe on the realization that technological advances have made false teeth more durable than natural teeth. In this case the whole idea of program success would work differently. The deterioration of natural teeth would be seen as progress since it prepares citizens for the advantageous transition to artificial ones.

An absurd example, probably. But it’s a useful way to illustrate how evaluation works regardless of the specific values involved. You cannot “demonstrate value” without having first announced what values you subscribe to. These values, whatever their content, define the mission (long-term goals) of institutions and agencies—civic participation, lifelong learning, lifelong dentures, and so on. The demonstrable value of a program is a measure (an estimate, actually) of the extent to which the program accomplishes the goals implied by the values.

The espoused values will be ideal, something like “good dental health improves the quality of life of the public.” Goals will be more specific, like “80% of Americans will have access to affordable natural teeth replacement.” Related objectives would be along the lines of “Tooth extraction services will be convenient, inexpensive, painless, and safe.”

Best Laid Plans

The second idea in Suchman’s diagram is that assessment is part of a more general planning process. Planning is another key reason for specifying goals (desired outcomes) up front. Without this information program designers are in the dark about what the program is meant to accomplish. It will be no surprise, then, that interventions and methods which designers come up with turn out to be stabs in the dark!

A well-specified goal would be “70% of American adults will have false teeth within the next ten years.” Another would be “All American children will have false teeth by the time they are in 9th grade.” This sort of clarity tells program designers (and funders, also) exactly where the program is supposed to be headed.

Then it’s up to the program designers to figure out how to get there. In my dental health example, there will be multiple ways to convince the public that they will really love having manufactured teeth in their own and their children’s mouths. One way would be encouraging neglect of natural teeth, for instance, by making sugar readily available to pre-school children. Another might be levying a weighty tax on toothbrushes. You get the idea: potentially effective methods must be thought through with the basic goals in mind. This exercise will also lead to relevant drill-down objectives. (Sorry, I couldn’t resist.)

Suchman called this process Identifying Goal Activity (6pm in the diagram), which means devising intervention methods that seem likely to attain program objectives.3  Next is the actual enactment of the program, labeled Putting Goal Activity into Operation (8pm in the diagram).4

Only after the first five tasks (noon to 8pm on the circle) have been completed can program outcomes be assessed. Results from the outcome assessment(s) are then reconciled with original values. (I purposely avoid the term alignment because of its faddishness!) This might lead to clarifications of or adjustments to values, goals, objectives, measures, and program activities. And then the cycle repeats again, as the continuous clockwise arrows indicate.

Now consider Richard Orr’s classic evaluation model shown here:5

Notice that values are missing from this framework. The main point of the model is classifying traditional library statistics in a way that distinguishes these from ostensible benefits which library programs and services produce. Unlike the Suchman diagram, arrows in Orr’s model indicate causality. Resources necessarily produce (cause) capability which necessarily produces demand which necessarily produces utilization which necessarily produces beneficial effects. Obviously, Orr adopted a simplistic view of things. (Because of the certainty it implies, the model can be called deterministic.)

Ye Olde Evaluation Steps

In real life, cause-and-effect linkages between programs and outcomes are hardly automatic. (We could say they are probabilistic.) Program evaluation practitioners understood this sixty years ago as seen in these “essential steps in evaluation” taken from a 1955 U.S. Public Health Service report:

1.  Identification of the goals to be evaluated
2.  Analysis of the problems with which the [program] activity must cope
3.  Description and standardization of the [program] activity6
4.  Measurement of the degree of change that takes place [in the target population]
5.  Determination of whether the observed change is due to the [program] activity or to some other cause
6.  Some indication of the durability of the efforts.7

Step 5 is the one I’m talking about. Now take a look at this diagram from the Suchman book:8

We can explore this diagram in depth some later time, but you can get the gist pretty quickly: Early evaluation practitioners realized that cause-and-effect relationships are really complicated. They did not settle for the optimistic linkages of the sort appearing in Orr’s model.

As for the essential steps from the 1955 report, you’ll see these basic ideas repeated in introductory books and articles on library assessment, sometimes presented as if they were a 21st-century innovation! In actual fact, the ideas are well established. It is the library profession’s awareness of these that is new.

Truthfully, libraries haven’t really cut their teeth on the basic concepts of evaluation and assessment, like the steps listed above. I have been to assessment conferences and committee meetings where participants have only the vaguest understanding of essential ideas like these. Worse, these newcomers have to contend with misleading pronouncements made in (shall we call it) the professional narrative, the most egregious of which is the mantra that assessment is about demonstrating value. It is not. And I appeal to you to eschew such an ill-conceived idea!

Assessment is about determining value.

 
—————————

1  Suchman, E. A. (1967). Evaluative research: Principles and practice in public service and social action programs, New York: Russell Sage, p.34.
2  Suchman, E. A., p. 35.
3  This activity is the same as the idea of logic models mentioned in my prior blog entry. In evaluation literature, a description of how program designers believe a program will work, including causal linkages between program components, is also called program theory.
4  In current evaluation literature this phase is called program implementation.
5  Orr, R.H. (1973). Measuring the goodness of library services: A general framework for considering quantitative measures, Journal of documentation, 29(3), 315-332. Orr’s model also came up in my April 2009 and January 2012 posts.
6  Here the term standardization refers to the idea of program fidelity introduced in my earlier entry.
7  U.S. Department of Health, Education & Welfare (1955). Evaluation in mental health: A review of the problem of evaluating mental health activities (U.S. Public Health Service Publication No. 413), Washington DC: U.S. Government Printing Office, 21. Underlining added. This report borrowed these essential steps from French, D. G. (1952). An approach to measuring results in social work: A report on the Michigan reconnaissance study of evaluative research in social work, New York: Columbia University Press, 178.
8  Suchman, E. A., p. 84.

The campaign to assess public library outcomes got a tremendous boost from Library Journal’s Directors’ Summit held last month in Columbus, Ohio. It’s heartening to see library leaders getting serious about making outcome assessment integral to the management of U.S. public libraries! The excitement and determination are necessary for making progress on this front. And it sounds like the summit was designed to let folks absorb relevant ideas in ways that make them their own.

The onset of this newfound energy is the perfect time to commit ourselves to gaining a firm grasp on the core concepts and methods of outcome assessment. Although measurement of outcomes is a new undertaking for libraries, it has been around for a long time in other contexts. In fact, outcome evaluation approaches have been studied, debated, refined, and chronicled over the past forty-five years by theorists and practitioners in the field of program evaluation.1

For example, the Library Journal article mentions logic models, a framework (a structured exercise, actually) that organizations use to spell out a rationale about how program activities will, theoretically, produce short-term, intermediate, and long-term outcomes. The term surfaced in the late 1990s when it began to be applied to a framework developed in the 1960s and 1970s by Edward Suchman2 and Carol Weiss,3 and enhanced by Joseph Wholey4 and Peter Rossi and Howard Freeman in the 1980s.5

Incidentally, our profession can boast of having independently developed the same essential framework in Richard Orr’s groundbreaking 1973 article (cited in my April 2009 post). In his comprehensive book on library evaluation, Joe Matthews uses Orr’s framework as a springboard from which he progresses to descriptions of evaluation and measurement topics, including library outcomes.6 And there are other resources on library outcome evaluation in our own literature, like the book by Peter Hernon and Robert Dugan.7

The Wheel Doesn’t Need Re-Invented

Following the example set by the LJ Directors’ Summit, we also need to venture beyond our own profession and learn from other fields. There is a wealth of program evaluation and performance measurement knowledge that we can take advantage of. For instance, information about logic models8 can be found in a definitive guide made available by the W. K. Kellogg Foundation,9 in the program evaluation textbook by McDavid and Hawthorn,10 and in the authoritative handbook edited by Wholey, Hatry, and Newcomer.11  To get a sense of the range of issues involved in outcome assessment and evaluation in general, take a look at the tables of contents in the McDavid and Hawthorn book and in the latest edition of the leading evaluation textbook by Rossi, Lipsey, and Freeman.12

With the passage of the Government Performance and Results Act of 1993, performance measurement and program evaluation have been intensified on the federal level. An example is this model developed by the Centers for Disease Control and Prevention. Outcomes have gotten more attention in local and state governments, for instance, in the second edition of Hatry’s definitive guide to performance measurement.13 And the Urban Institute recently developed a compendium of outcome indicators for nonprofit organizations.

Tapping knowledge from the fields of program evaluation and performance measurement will help us master evaluation concepts and methods which, in turn, prepares us to confront roadblocks and challenges such as the summit attendees foresaw. I don’t know what specific roadblocks and challenges they identified, but I suspect two likely candidates to be: (1) producing timely and relevant evaluation results and (2) integrating evaluation results into management decision-making. (Hint: These two problems have been perpetual themes in the field of program evaluation for quite some time.14)

A Terminology Tip

As newcomers to outcome assessment, we would be wise to learn the relevant terminology in order to assimilate the basic ideas, rather than giving in to the temptation to simply parrot the jargon. Good examples are the terms outcomes and impacts. Though usage varies, traditionally these terms mean this: Outcomes are changes that can be confirmed to have occurred in the target population or situation that programs were designed to change. Impacts are outcomes that can be demonstrated to have been produced (caused) by the programs, services, or interventions applied. (Impacts are also referred to as program effects.)

Take for example a highway construction effort meant to decrease rush hour congestion in a given city. Say a project is approved that will double the lanes of a main highway. Somehow the project team measures and compares before- and after-project travel times along with traffic jam frequency and duration. When the project is completed they announce that their measurements show traffic congestion decreased by 30%, citing this percentage as the impact of the project. However, because the construction took two years to complete, a period which happened to include the commencement of the Great Recession, traffic volumes were decreasing anyway. Plus, more commuters began working at home than before and ride-sharing increased, both due to gasoline prices. This means the total measured outcome is not fully attributable to the highway expansion. Therefore, the 30% impact claim is too high (by how much we can’t be sure). The true impact is the percentage of improved traffic flow not attributable to other causes like those I listed. (A thorough account of impact evaluation methods can be found in the classic book by Lawrence Mohr.15 By the way, because the term impact is synonymous with program effect, impact evaluation is also called program effectiveness evaluation.)
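To make the arithmetic behind the outcome/impact distinction concrete, here is a minimal Python sketch. Only the 30% measured outcome comes from the highway example; the shares assigned to the other causes are numbers I invented for illustration, and the helper name estimated_impact is hypothetical. The simple subtraction also assumes the rival causes add up independently, which a real impact evaluation would not take for granted.

```python
# Hypothetical illustration of outcome vs. impact for the highway example.
# Only the 30% measured outcome comes from the example; the percentage-point
# shares assigned to other causes are invented.

def estimated_impact(measured_outcome, other_causes):
    """Return what is left of the measured outcome after subtracting
    changes attributed to causes other than the program.

    Assumes the contributions are additive, a strong simplification.
    """
    return measured_outcome - sum(other_causes.values())

measured_outcome = 30.0  # percent decrease in congestion (from the example)

# Invented figures, in percentage points, for the rival explanations.
other_causes = {
    "recession_lower_traffic": 8.0,
    "working_from_home": 4.0,
    "ride_sharing": 3.0,
}

impact = estimated_impact(measured_outcome, other_causes)
print(f"Measured outcome: {measured_outcome:.0f}%")
print(f"Estimated impact: {impact:.0f}%")
```

The hard part in practice is not the subtraction but finding defensible values for the rival causes, which is what impact evaluation methods are for.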

Again, definitions of outcomes and impacts are not carved in stone. Sometimes the terms are used interchangeably. Joe Matthews’s book defines outcomes, impacts, and effects as generally the same, all referring to results demonstrated to have been directly produced by library services. In academic library assessment causes and effects tend to be downplayed, as they are for the most part by Hernon and Dugan.16 To them the term outcomes means any and all relevant results, regardless of what combination of factors may have produced (or inhibited) these.

Another terminology conundrum is the difference between evaluation and assessment (if there is any difference). Some other time I’ll delve into that curiosity for those who might be interested.

Pre-Ordained Conclusions Are Not Data

Right now I want to offer two caveats that I hope will contribute to the success of this re-energized drive for outcome assessment: First, outcome evaluation is a quite sophisticated form of evaluation. For public libraries inexperienced with evaluation and assessment, attempting an outcome study as a first project is extremely ambitious. Libraries will do best if they approach this process in deliberately small and incremental steps. (Another topic to elaborate on at a later date.)

Second, the purpose of outcome evaluation is not to “share success stories” as the LJ article states. The purpose is to look impartially at both successes and failures. In fact, learning that programs, or portions of programs, have been ineffective or worked sub-optimally is a giant success! It helps organizations adjust program designs or replace them with something better. And reporting only wonderful program successes is a disservice to the community. (Just think about how people view spin in the political arena.)

Public libraries should not pursue outcome assessment merely to communicate glowing reports to their stakeholders. Being a data-driven organization does not mean collecting all the data you can for the purpose of reaching pre-ordained conclusions. It means the exact opposite, namely, that until you’ve measured and studied a situation systematically, your knowledge of it is mostly speculation and guesses.

I hope, then, that this advice will be a useful addition to the discussion. It is important that we be as methodical as possible as we venture down this new outcomes path. Or I should say up, as it is surely an incline with plenty of resistance for our professional leg-muscles. Fortunately, the directors’ summit shows that hilly terrain can be invigorating!

—————————

1  Works from the literature of program evaluation and evaluation research are rarely cited in library assessment and evaluation literature, suggesting that our profession is unaware of the literature from this other field. The only exception I’ve encountered is Powell, R. R., Evaluation Research: An Overview, Library Trends, 51:1 (2006), 102-120.
2  Suchman, E. A. (1967). Evaluative research: Principles and practice in public service and social action programs, New York: Russell Sage.
3  Weiss, C. (1972). Evaluation research: Methods for assessing program effectiveness, Englewood Cliffs, NJ: Prentice-Hall.
4  Wholey, J. S. (1983). Evaluation and effective public management, Boston: Little-Brown.
5  Rossi, P. H. and Freeman, H. E. (1987). Evaluation: A systematic approach, 3rd ed., Beverly Hills, CA: Sage Publications.
6  Matthews, J. R. (2007). The evaluation and measurement of library services, Westport, CT: Libraries Unlimited.
7  Hernon, P. and Dugan, R. E. (2002). Outcomes assessment in your library, Chicago: American Library Association.
8  I wish that the field of program evaluation had chosen a less esoteric-sounding label. The underlying concepts are not particularly complex. Incidentally, a concept nearly identical to logic models resurfaced in the mid-1990s in the business field in the balanced scorecard movement. That movement labeled the concept strategy maps.
9  W. K. Kellogg Foundation (2004). W. K. Kellogg Foundation logic model development guide. The foundation also provides an excellent primer on evaluation, The W. K. Kellogg Foundation evaluation handbook.
10  McDavid, J. C. and Hawthorn, L. R. (2006). Program evaluation & performance measurement: An introduction to practice, Thousand Oaks, CA: Sage Publications.
11  Wholey, J. S., Hatry, H. P., and Newcomer, K. E. (1994). Handbook of practical program evaluation, San Francisco: Jossey-Bass. These editors are legends in the field of program evaluation.
12  Rossi, P. H., Lipsey, M. W., and Freeman, H. E. (2007). Evaluation: A systematic approach, 7th ed., Thousand Oaks, CA: Sage Publications. These authors are legends in the field of program evaluation.
13  Hatry, H. (2006). Performance measurement: Getting results, 2nd ed., Washington DC: Urban Institute Press.
14   See Rutman, L., (1980). Planning useful evaluations: Evaluability assessment, Beverly Hills, CA: Sage Publications; Smith, M. F. (1989). Evaluability assessment: A practical approach, Boston: Kluwer Academic; Patton, M. Q. (1978). Utilization-focused evaluation, Beverly Hills, CA: Sage Publications; Patton, M. Q. (2012). Essentials of utilization-focused evaluation, Thousand Oaks, CA: Sage Publications.
15  Mohr, L. B. (1995). Impact analysis for program evaluation, Thousand Oaks, CA: Sage Publications.
16  Hernon, P. and Dugan, R.E. (2002).

Data Are Not Psychic


It’s great to see other librarians advocating for the same causes I harp on in this blog. I’m referring to Sarah Robbins, Debra Engel, and Christina Kulp of the University of Oklahoma, whose article appears in the current issue of College & Research Libraries. The article, entitled “How Unique Are Our Users?”1  warns against the folly of using convenience samples. It implores library researchers to honestly explain the limitations of their studies. And the authors are resolute about the importance of understanding the generalizability of survey findings, a topic which also happens to be the main focus of their study.

I bring up their article for a different reason, however. It is an example of how difficult and nuanced certain aspects of research and statistics can be. Despite the best of intentions, it’s amazingly easy to get tripped up by one or another detail. Robbins and her colleagues got caught in the briar patch that is statistics and research methods. I say so because the main conclusions reached in their study are not actually borne out by their survey results.

I’ll give the short explanation now and then follow with the whole story. Basically, the researchers argue that when certain user behaviors or perceptions are thought to be essentially the same regardless of the users’ home institutions, survey findings from one institution can be applied (“generalized”) to other institutions. Then they present findings from their own survey demonstrating that certain user behaviors and perceptions are, indeed, the same across institutions. However, the statistical generalizability2  of survey data depends solely on how the data were collected, that is, on the sampling method(s) used. Only when one’s own institution has been included in the initial sample selection process can survey results be justifiably generalized to one’s own institution.

In addition, the researchers’ method for confirming user behaviors and perceptions to be essentially the same across institutions relied on a mistaken understanding of the statistical procedure they used. They assumed the procedure, statistical significance testing, proved something that it cannot actually prove.

What Survey Data Can Tell Us

Now for the longer story. In their study Robbins and her colleagues sought to answer this research question: “To what extent can survey data from other institutions be generalized to one’s own institution?”  By this I assume they mean, “Should survey data from other institutions be viewed as trustworthy estimates of actual attributes or states of affairs at one’s own institution?”

Let me suggest an example that might help answer this question: Suppose we recruit two survey research teams to study opinions about the legendary Steve Jobs. We assign one team to conduct telephone interviews of individuals from a randomly selected sample of 100 adult residents of our community. The other team is assigned to conduct 100 woman-on-the-street interviews in various locations in the community (that is, we employ a convenience sample). Tallying the survey results, we learn that all respondents in both surveys reported having high admiration for Steve Jobs. This leads us to announce, “There is no reason to conduct additional surveys on this topic. It is clear that there is unanimity about Jobs. In fact, the responses were so consistently positive that we are confident that they are applicable to other U.S. communities also.”

But what evidence is there to confirm that opinions about Jobs are unanimous nationwide? At first, we might think that Steve Jobs’ reputation makes this unanimity credible. However, our beliefs about Jobs’ reputation are not objective evidence. Is the unanimity observed in our data evidence that the surveys provide a balanced account of opinions nationwide? No, it is not. For one thing, we obtained unanimous responses from the convenience sample. Yet we know that that sampling method is usually grossly inadequate due to built-in biases.

More importantly, to obtain a valid measurement of a phenomenon our measurement must span the entire scope of that phenomenon. If we wanted to find out how much a man weighs, we wouldn’t just weigh his big toe, unless we have a really tried-and-true formula for calculating body weight from big toe weight. Neither would we ascertain his weight by weighing a stranger standing nearby. Nor would we assume that the man’s weight will be equal to the average of the last ten people we weighed.

The same goes for survey sampling. Our sample must be selected from the specific population we’re interested in, in this case the U.S. population. And the sample must span the breadth of that population. Only by doing so are we justified in making the leap (generalizing or inferring) from the sample to the population.

Nothing in the content of survey data—including uniformity or diversity of responses—justifies our generalizing from the data to one or another population. To put this another way, data are not psychic. They cannot divine or foretell information about some other realm or context beyond their own. Examining patterns in a set of data we have on hand tells us very little about patterns that will appear in another setting. Sure, trends in the data might hint at possible scenarios in other settings. But these trends cannot confirm the scenarios to be factual.
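A small simulation can show why unanimity inside a sample says nothing about the population it was not drawn from. The sketch below is purely illustrative: the population size, the 60% admiration rate, and the assumption that every passer-by at the interview spots is an admirer are all invented, and the names (passers_by, share_admiring) are hypothetical.

```python
import random

# Hypothetical population of 10,000 adults: 60% admire Jobs, 40% do not.
# The first 500 people stand in for passers-by at the interview spots, and
# in this invented setup they all happen to be admirers.
population = [True] * 6000 + [False] * 4000
passers_by = population[:500]

rng = random.Random(42)  # fixed seed so the run is repeatable

convenience_sample = rng.sample(passers_by, 100)  # whoever is at hand
random_sample = rng.sample(population, 100)       # drawn from everyone

def share_admiring(sample):
    """Fraction of respondents in the sample who report admiration."""
    return sum(sample) / len(sample)

print("Convenience sample:", share_admiring(convenience_sample))
print("Random sample:     ", share_admiring(random_sample))
print("Whole population:  ", share_admiring(population))
```

The convenience sample comes back unanimous even though four in ten people in the invented population feel otherwise. Nothing in those 100 identical answers reveals that; only knowing how the sample was drawn does.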

The Bane That Is Statistical Significance

Yet Robbins et al. argued that the content of their survey data was an indicator both of that survey’s statistical generalizability and of the statistical generalizability of other surveys. They based their argument on something called a chi-square test. This is a statistical tool that identifies and confirms statistical trends in categorical data. A key component of this tool is statistical significance testing, a topic I also discussed in a recent blog entry.

I beg the reader’s pardon for the tedious material I am about to present. However, we have to cover the generally indigestible details of statistical significance testing in order to understand how the researchers got caught in its snare. So here goes. Statistical significance testing is basically a pass-fail quality test for survey data. The testing confirms, to a reasonable degree, that trends observed in data are not artifacts caused by sampling side-effects. These artifacts are often described in phrases that attribute trends in the data to “the luck of the draw” or to “chance alone.” Other phrases describe the trends as “statistical noise” or having occurred “by accident.”

Statistical significance testing begins with an assumption established for the sake of argument and known as the null hypothesis. The word null is used because this assumption describes a situation of no difference. An example null hypothesis is:

There are no differences in the average number of electronic journals accessed by faculty from one institution compared with averages for faculty at other institutions.

This hypothesis supposes (remember, this is all for the sake of argument) that the journal access averages will be essentially equal among all of the institutions studied.

The alternative hypothesis takes the contrary stance:

There are differences in the average number of electronic journals accessed by faculty from one institution compared with averages for faculty at other institutions.

The hypothesis supposes journal access averages among the institutions studied will be unequal (that is, they will vary).

Typically, researchers want to be able to dismiss (technically, to decide to reject) the null hypothesis in favor of the alternative hypothesis. When data for a particular survey questionnaire item pass statistical significance testing, they are declared statistically significant, meaning that differences observed between two or more groups (such as faculty from institution A compared to institution B) for the item are likely to reflect real differences rather than statistical noise.

Notice that a positive outcome from statistical significance testing is not proof that the alternative hypothesis is true. Rather, this outcome is a bet, based on calculated probabilities, that the null hypothesis is untrue. The outcome is considered to be evidence supporting, but not directly proving, the alternative hypothesis (in this example, the claim that there really are differences in electronic journal access averages depending on institution).

On the other hand, when data for a particular survey item fail significance testing, they are declared statistically insignificant, meaning that the differences observed cannot be confidently distinguished from statistical noise. In this case, the null hypothesis has not been dismissed (rejected). Thus, this outcome provides no evidence supporting the alternative hypothesis.

However—and this is a big however—a decision not to reject the null hypothesis is not the same as proving the null hypothesis to be true. Statistical significance testing is silent about the actual trueness or falseness of the null hypothesis. This is why statisticians are careful to say, “We fail to reject the null hypothesis” instead of, “We accept the null hypothesis to be true.”3
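For readers who like to see the mechanics, here is a minimal Python sketch of the pass-fail logic just described. It is not the analysis Robbins and her colleagues ran: the counts are invented, only two institutions are compared (their tables covered sixteen), the 0.05 threshold is simply the conventional choice, and the function name chi_square_2x2 is hypothetical. The p-value shortcut using erfc is valid only for a table with one degree of freedom, which is why the example sticks to a 2x2 table.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). With one degree of freedom the p-value
    equals erfc(sqrt(statistic / 2)), so no external library is needed.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if institution and answer were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Invented counts: faculty answering yes / no at two institutions.
table = [
    [14, 6],   # institution A: 70% yes
    [9, 11],   # institution B: 45% yes
]

statistic, p_value = chi_square_2x2(table)
alpha = 0.05  # conventional threshold, an assumption of this sketch

if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis")
```

With these invented counts the p-value comes out at roughly 0.11, so the test fails to reject the null hypothesis even though the two samples differ by 25 percentage points. The test has not shown the institutions to be alike; it has only declined to rule that possibility out.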

Getting Tripped Up

So that’s pretty much the story on statistical significance, tangled as it is. I invite you now to the next chapter, how statistical significance testing fits into the study by Robbins and her colleagues. (Unfortunately, various characters from the last chapter necessarily bleed into this one. I apologize for that.)

The researchers examined responses of engineering faculty to eight questionnaire items, noting how these varied among the sixteen institutions surveyed. They looked at which items passed statistical significance testing and which ones failed. They considered any items passing significance testing to be poor candidates for their purpose, which was to enable librarians to describe their own institution using another institution’s survey results. Since institutional differences on these items were real rather than statistical noise, the researchers concluded that for these items librarians could not rely on other institutions’ survey results.

Conversely, the researchers viewed items failing significance testing as good candidates for their purpose (described above). Because differences observed between institutions on these items were very likely due to statistical noise, the researchers concluded that the sixteen surveyed institutions were essentially the same on the items, as other non-surveyed institutions would be also. In other words, they believed that the outcome of statistical significance testing proved the null hypothesis of no difference to be true.

This is the snare Robbins and her colleagues got caught in. As I explained already, statistical significance testing does not evaluate the actual trueness of the null hypothesis. As statistician Howard Wainer describes this:

…even if two individuals are not ‘statistically different’ it does not mean that the best estimate of their difference is zero.4

For this reason the researchers’ chi-square results are not a sound basis for concluding that all academic institutions are essentially the same on these items.
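Wainer’s point can be illustrated with a short Python sketch. The counts are invented (14 of 20 faculty answering yes at one institution, 9 of 20 at another), and the Wald interval used here is a simple large-sample approximation that I chose for brevity; it is not a method taken from Wainer or from the Robbins study. The function name difference_with_interval is hypothetical.

```python
import math
from statistics import NormalDist

def difference_with_interval(yes_a, n_a, yes_b, n_b, confidence=0.95):
    """Difference between two sample proportions with a Wald interval.

    The Wald interval is a simple large-sample approximation, used here
    only for illustration.
    """
    p_a = yes_a / n_a
    p_b = yes_b / n_b
    diff = p_a - p_b
    std_err = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * std_err, diff + z * std_err

# Invented counts: 14 of 20 faculty say yes at A, 9 of 20 at B.
diff, low, high = difference_with_interval(14, 20, 9, 20)
print(f"Best estimate of the difference: {diff:.2f}")
print(f"Approximate 95% interval: {low:.2f} to {high:.2f}")
```

With these counts the interval straddles zero, which is what a non-significant result looks like, yet the best estimate of the difference is still 0.25 and the interval reaches past 0.5. Data like these are compatible with no difference and with a very large one, so they cannot establish that the institutions are the same.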

Who Needs Statistical Significance Anyway?

In their defense, I don’t think the researchers took the logic of statistical significance completely to heart. They ignored it when it seemed unreasonable, as these statements indicate:

…while the chi-square analysis indicates that, regardless of the institution surveyed, researchers could expect to receive similar results [if they were to survey faculty at other institutions], the results themselves suggest otherwise.5

While the association between institution and responses to this question [about departmental duties of faculty] were statistically insignificant for the most part, librarians at the institution are best poised to know the job requirements of their institutions’ faculty…6

The importance that faculty place on assistance from library personnel was not shown to be statistically significantly associated to institution (p=0.123). This suggests that, regardless of the institution surveyed, researchers could expect to receive similar results [if they were to survey faculty at other institutions]. However, a look at figure 2 tells a different story.7

And they freely analyzed between-institution variation of some of the items without regard for statistical significance. The graph below is an example:

Source:   Robbins et al., 2011, ‘How Unique Are Our Users?’ College & Research Libraries.     Click for larger image.

Pondering these visible differences may well have led Robbins and her colleagues to some questions not addressed in the article: If a single institution is going to adopt survey data from other institutions as their own, what specific data values should they adopt? Should average responses from the sixteen institutions in their study be used as the best estimates? Or perhaps averages gleaned from past or even future studies? Or should an institution choose one particular surveyed institution to extrapolate from? If so, on what criteria should they base their choice?

Returning to the more general theme of the study, I should mention a second question the researchers posed:

…is it reasonable to believe that faculty members or students at one institution are all that different from faculty and students at another institution?8

Some user perceptions or behaviors may well be uniform, for example, the belief by faculty that the electrical outlets in their offices ought to work, or the belief by students that tuition hikes are undesirable.

However, for the majority of issues pertinent to library assessment there will always be variation, within institutions and between them. To understand and appreciate this variation, even for purposes of ultimately ignoring it, we have to measure it. And this measurement has to occur in the actual settings and situations we want to understand.  

  
—————————

1  Robbins, S., Engel, D. and Kulp, C. (2011). How unique are our users? Comparing responses regarding the information-seeking habits of engineering faculty, College & Research Libraries, 72(6), 515-532.
2  Statistical generalizability is also referred to as external validity. Though this second term came to be used in connection with survey research studies, it was first applied to experimental and quasi-experimental studies. With these studies, external validity is contrasted with internal validity, which refers to how well experimental conditions were controlled so as to make the results trustworthy.
   Generalizability (or generalization) is a generic term that includes statistical generalizability. See Polit, D.F. and Beck, C.T. (2010). Generalization in quantitative and qualitative research: Myths and strategies, International Journal of Nursing Studies, 47, 1451–1458.
3  See Luk Arbuckle’s blog entry for a thorough discussion of the problem of proving the null hypothesis.
4  Wainer, H. (2009). Picturing the uncertain world: How to understand, communicate, and control uncertainty through graphical display, Princeton, NJ: Princeton University Press, p. 76.
5  Robbins, S. et al. (2011). p. 526. (Blue emphasis added.)
6  Robbins, S. et al. (2011). p. 521. (Blue emphasis added.)
7  Robbins, S. et al. (2011). p. 526. (Blue emphasis added.)
8  Robbins, S. et al. (2011). p. 519.

Beauty Is As Beauty Does

Brave New Words

Infographics is one of two new fashionable terms used nowadays to refer to statistical charts and graphs. The other term is visualizations, which replaces such archaic words as graphs, charts, pictures, diagrams, and illustrations. Sometimes the term is affectionately shortened to data viz by its really cool practitioners.

In the infographics/visualization/data viz movement there are two basic schools of thought. One school emphasizes principles of artistic design and the other emphasizes information clarity. The first prizes graphics that are beautiful and appealing, while the other judges visualizations based on how informative they are.1  Many adherents of the first approach to graphics are marketing and advertising professionals. Lest you presume that they subscribe to the motto
ars gratia artis,
I must point out that their interest is purely utilitarian. To them, beautiful graphics that don’t persuade audiences and reel in more customers are useless.

Infographics can have quantitative or qualitative content or both. In librarydom you may have seen infographics mentioned in Library Journal’s TheDigitalShift, in this issue of AL Direct, or in this YALSA blog. Many of the tools recommended by these sources pertain to qualitative content.

3-D Horrors

Here, though, I will be writing just about quantitative graphics, that is, data visualizations. And I want to show you how efforts to decorate and beautify these can lead to trouble. I begin with a good example of bad results as seen in the chart below. This chart shows libraries’ responses about the value of statistical measures as percentages. I created the chart for a small survey I conducted in 2008. At the time I felt it to be a respectable graphic. In hindsight I know better. Style-wise, its color choices and shading are garish.

But the chart’s main shortcoming is that it communicates really wrong information. Notice how none of the bars reaches the 100% mark. (In this type of chart each bar is supposed to total 100%.) When you create what Microsoft Excel calls a 100% stacked bar in 3-D, the software adjusts the perspective so the 100% line hovers above the actual data.

 

Click for larger image showing legend and horizontal axis labels.2

So now you have to wonder, since the bar tops are wrong, how about the lines dividing the bar segments? Are they off too? I’m not sure what Microsoft programmers were thinking, but the final result is misinformation.

Even without Excel’s display defects, 3-D stacked bar charts are difficult to read in general. It’s really hard to judge the relative heights of the segments. Take a look at the chart again. In the second bar from the left, would you say the aqua segment is longer than the aqua segment in the third bar from the right? Which bar edges should you judge by? Should you match up the front edges of the bars with the gridlines next to the percentage labels? Or the back edges to the rear wall?

The unadorned graph below displays the same data as my 3-D chart does. Creating one bar chart per response category (high value, medium value, low value and no value) and arranging these left to right makes for easier comparisons of the bar lengths. No need to worry about being deceived by ill-designed 3-D chart scaling.
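For readers who build charts from raw counts, here is a rough sketch of the arithmetic behind that plain layout, written in Python. The measures and counts are hypothetical (they are not my 2008 survey data), and the helper names are my own. It converts counts to percentages that really do total 100% and prints one small text bar chart per response category.

```python
# Hypothetical response counts per statistical measure (not the 2008 survey data).
responses = {
    "Visits":      {"High": 30, "Medium": 12, "Low": 6,  "None": 2},
    "Circulation": {"High": 25, "Medium": 15, "Low": 8,  "None": 2},
    "Reference":   {"High": 10, "Medium": 20, "Low": 15, "None": 5},
}
categories = ["High", "Medium", "Low", "None"]

def to_percentages(counts):
    """Convert raw counts to percentages of the row total."""
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.items()}

percentages = {measure: to_percentages(c) for measure, c in responses.items()}

def text_bars(percentages, categories, width=40):
    """One block of bars per response category, so bar lengths share a baseline."""
    lines = []
    for cat in categories:
        lines.append(f"{cat} value")
        for measure, pcts in percentages.items():
            bar = "#" * round(pcts[cat] / 100 * width)
            lines.append(f"  {measure:<12}{bar} {pcts[cat]:.0f}%")
    return "\n".join(lines)

print(text_bars(percentages, categories))
```

Because every bar within a category starts from the same left edge, comparing lengths takes no guessing about front edges, back edges, or hovering gridlines.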

Click for larger image.

Visual effects added to a graphic, like 3-D, bright colors, giant fonts, bold arrows, and such, are meant to enhance the graphic’s message. But quite often these fall flat. Consider the 3-D chart below, which tries to be tastefully unflat. Note that the non-perpendicular left axis exaggerates the lengths of the lower bars, pushing them further to the right than they should be. A minor defect, for sure. But there’s something else about this chart that makes it appear vaguely off-kilter.

Source: ALA, Condition of U.S. Libraries: Trends, 1999-2010.3     Click for larger image.

The graphic below shows the image rotated 90° counter-clockwise to a perspective our eyes are more accustomed to. Here the incline in the chart axis (which looks like a candlestick base) is a visual cue that the right-most bars recede slightly into the distance. In the prior chart the bars cascade downward, as if the viewer were looking from above the image. This is one reason for that chart’s visual oddness. Another reason is easier to see in the rotated chart.

90° counter-clockwise rotation of bars from prior chart.    Click for larger image.

If you ignore the chart axis (the base) and look only at the right two or three bars, they appear to be advancing into the foreground. This is because in a 3-D perspective rendered on a 2-D page, larger objects appear closer and smaller ones appear more distant. The chart is, in effect, an optical illusion. Click on the graphic to view a larger image and then stare at that for a moment. You should be able to make the left end of the axis recede into the background and the higher right edge shift into the foreground.

Rather than burden the audience with peculiar visual effects, I suggest a plain vanilla graphic something along these, er, lines:

Click for larger image.

More 3-D Horrors

The next chart produces some curious optical illusions. First, notice how the three right bar segments appear to have non-parallel edges, as if the stack were bending leftward. And when you first look at the left axis, it sometimes appears concave rather than flat. Finally, if you gaze a minute or so at the entire chart, you can get the bars to flip inside out! (Try it.)

Source: CLIR, Census of Institutional Repositories in the United States, 2007.4

Again, a simple bar chart gets this information across clearly without visual incident:

In the next chart can you tell whether the bars are truly horizontal or not?

Source: CLIR, Census of Institutional Repositories in the United States, 2007.5

The illusion is kind of fun, even. But fun isn’t the objective here. If the 3-D effects in these examples were pleasant and appealing, with no side effects, they’d be fine. But they aren’t pleasant or appealing. They are distracting, sometimes subliminally so.

Artistic License

Some proponents of the aesthetic approach to visualization will disagree with me. And they’ll probably dismiss my vanilla charts as boring and stupid. Escheresque side effects won’t bother them since viewer bewilderment isn’t their biggest challenge. When a viewer is bewildered, at least he’s engaged. What really bothers these proponents is graphical ordinariness, blandness, and tedium, ills they combat with entertaining and provocative images. For these purposes decorations of any and all kinds will do, like the cartoonish icons in this example:

Click for larger image.

The infographic is not very informative. We know there are some proportions or other of people, houses, cars, and animals in the U.K. But the pictures interfere with our appreciating these otherwise quite specific proportions.

The idea in a graphic like this one is that all of the icons are quantitatively equal, like tally marks. However, as you see, visually they are unequal. The thickly inked houses overpower the delicately drawn automobiles, suggesting at first glance that the U.K. has more houses per 100 persons than automobiles. (The opposite is true.) Children are tinier than adults. Canes make older men and women wider than able-bodied adults. And chickens are tiny chicks. (I don’t think those are eggs.) Incidentally, although the chickens are the most, shall we say, populous (275 chickens/100 people), they are allotted about the same space as dogs, cats, and sheep. (There is no eggscuse for that.)

The creator of this iconic infographic did include group counts in gray circles at the right. (Numbers can be such useful annotations to infographics.) Except these are mostly illegible. You can also try viewing the graphic in its original location if you like.

Chanelling Data

Because graphic designers tend to be more comfortable with geometry than arithmetic, they believe that shapes and forms automatically illuminate any subject matter. This occasionally leads them to underestimate the intelligence of their audiences. (Readers generally know what houses and chickens are shaped like.) But sometimes their arithmetical mucking around enters the realm of the sublime. Like this astounding depiction of unoccupied airline seats in the U.S. in a single year devised by a recognized data viz expert:

Click to view article. Scroll down to this image.

An intriguing mandala, to say the least. Can you unravel its clues? Maybe you guessed that the circle’s circumference represents 365 days of the year. Do you get that the pastel blue partial border represents 217 days? The white portion, which is 148 days long, has no meaning. So it’s a convenient place to put a labeling phrase. What do you suppose the blank areas in beige near January and July mean? That there are no unoccupied seats during these months?

Evidently, the spatial arrangement of the icons is connected to the numbers inscribed in the center of the image. There we learn that the icons symbolize certain cryptic quantities: 1 seat icon = 2.5 million seats and 1 plane icon = 6.5K empty 747 jets. Add the 2.5 and 6.5 together and you get 9. Add 7 + 4 + 7 together you get 18, and adding 1 + 8 together yields a second 9! Does the mandala reveal an esoteric numerology, some sort of Da Vinci code of airline statistics?

Hardly. I’m afraid the only relevant numeral here is zero, the informational content of this viz. You can only make sense of it by referring back to the supposedly inferior chart for which the mandala is intended as an improvement. This chart can be seen in the article under the heading ‘If the Client Wanted an Excel Chart, They Wouldn’t Need You.’ (That heading says it all.)

Engauging Designs

Good visualizations should not require viewers to solve puzzles, at least not without their prior consent. But standards of this sort are stifling to the more radical data viz artists. Take a look at how the designer of the mandala graphic above turned this humdrum chart…

Click to view article. Scroll down to this image.

…into this:

Click to view article. Scroll down to this image.

Plenty of pretty bluish content along with plenty of fuzzy information. Because the dial’s scale extends only to the largest data value (7200), the outer light blue band looks more like background than the Women’s World Cup data. Two of the needles are almost invisible and the one for New Year’s Eve 2010 in Japan is invisible. Then, we have to extrapolate the peak value in the dial (as I did already) because it’s hidden by a stylishly shadowed needle. Oh, wait. Could the shadow be the missing needle?

Almost all of the data portrayed in this image are difficult to see. Plus, the concentric pastel arcs depicting the categories are totally wrong. They suggest that the peak value, 7200, is so large that it surrounds the other four values. A quick glance at the upper chart debunks this. Comparisons of the arc lengths in the lower diagram are invalid because the arcs are portions of circumferences from circles with unequal diameters.6
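The geometry behind that last complaint is simple: an arc's length is its radius times its angle, so arcs drawn on circles of different sizes can't be compared by length. A minimal Python sketch follows. The 7200 peak comes from the chart, but the second value (3600), the radii, and the 180° sweep are assumptions I picked for illustration.

```python
import math

def arc_length(radius, value, scale_max, sweep_degrees=180):
    """Length of an arc whose angle is proportional to value / scale_max."""
    angle = math.radians(sweep_degrees) * value / scale_max
    return radius * angle

# Hypothetical dial: the peak value sits on the outer band, a value half
# as large sits on an inner band with a smaller radius.
outer = arc_length(radius=5.0, value=7200, scale_max=7200)
inner = arc_length(radius=2.0, value=3600, scale_max=7200)

value_ratio = 7200 / 3600        # the data say "twice as large"
length_ratio = outer / inner     # the arcs say something else

print(f"value ratio  = {value_ratio:.1f}")
print(f"length ratio = {length_ratio:.1f}")
```

In this sketch the outer arc is five times as long as the inner one even though the value is only twice as large. Only the angles carry the data; the lengths are an artifact of where each band happens to be drawn.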

Graphical beautification enthusiasts love iconic metaphors like the mechanical gauge, quantitative reasoning be damned. After all, this particular metaphor has been blessed by the Balanced Scorecard movement. So, it’s definitely state-of-the-art. However, like any tool, this metaphor has strengths and weaknesses. One weakness is that it is prone to misinterpretation. It can give viewers really wrong ideas about the data. Take the image below for example:

Click for larger image.

The pinpoint accuracy of the needles implies high-precision data. Yet, the data are anything but! Measurement scales for the reported propensity of human beings to behave one way or another are nothing like the scales used in altimeters, barometers, speedometers, tachometers, or thermometers. Neither are scales measuring attitudes and beliefs, which library advocacy surveys often use.

As with any survey, in this example the two data values are estimates of true figures in the population under study. Each will have some margin of error. So, the actual numbers in the population may or may not be nearly equal. They could easily differ by 0.5, 1.0 or more. Let’s say, though, that we know that the true figures from the population differ by 0.5, and that the dial needles show this difference. With these types of scales 0.5 might still be a negligible amount. Perhaps any number between 8.0 and 9.0 might have about the same meaning. For these kinds of scales pin-pointing numbers on dials is overkill.
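Here is a rough sketch of that margin-of-error reasoning in Python. The two dial readings (8.4 and 8.5) are from the graphic, but the standard deviation and the number of respondents are pure assumptions on my part, since the graphic reports neither.

```python
import math

def margin_of_error(std_dev, n, z=1.96):
    """Approximate 95% margin of error for a sample mean."""
    return z * std_dev / math.sqrt(n)

# The dial readings from the graphic.
scores = {"left dial": 8.4, "right dial": 8.5}

# Assumed for illustration only: ratings on a 0-10 scale with a standard
# deviation of 2.0, collected from 400 respondents per dial.
moe = margin_of_error(std_dev=2.0, n=400)

intervals = {name: (mean - moe, mean + moe) for name, mean in scores.items()}
overlap = intervals["left dial"][1] >= intervals["right dial"][0]

for name, (low, high) in intervals.items():
    print(f"{name}: {low:.2f} to {high:.2f}")
print("intervals overlap:", overlap)
```

Under these assumptions each reading is good to about plus or minus 0.2, so the two intervals overlap almost entirely. A needle drawn to a hair's width promises far more precision than that.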

Now, pretend that these dials are applause meters for two competing live shows, so that none of the issues I just raised applies. The dials are still difficult to read. First, our eyes must scan and then ignore the prominent numerals and tick-marks that occupy most of the graphic. The dials’ green regions help some by leading us to the needles (though the meaning of these regions is unexplained). On the left dial the needle points just shy of the first tick-mark past 8, on the right a tad beyond that same mark. But the tick-marks are weirdly spaced so that each one is 0.4 units away from its neighbors. In the left gauge the graphical software has fouled up. The needle should rest exactly on the tick-mark.

Struggling to decipher these miscues, we discover that the graphic’s most reliable information appears in the perimeter: the numbers 8.4 and 8.5. So, the question is, what is the purpose of the dials? Just to look pretty on the page? Though graphical beauty and fanciness might capture viewer attention, if these make information incomprehensible, they’re a waste of time.

 
—————————

1  Of course, it is possible for graphics to be simultaneously beautiful and informational. Well-designed graphics can be elegant in their clarity and visual appeal. See Edward Tufte’s book Beautiful Evidence.
2  In the larger image, notice that the angle and close spacing of the horizontal axis labels make them difficult to visually trace to their corresponding bars. This is a good reason to avoid angled labeling. And beware of Microsoft Excel on this. When you narrow the overall width of a bar chart, Excel may decide to realign your labels at an angle. Be kind to your readers and don’t let the software get away with it.
3  Davis, D. (2009). The condition of U.S. libraries: Trends, 1999‐2009, Chicago: American Library Association, p. 12.
4  Markey, K. et al. (2007). Census of institutional repositories in the United States, Washington, DC: Council on Library and Information Resources, p. 34.
5  Markey, K. et al. (2007). p. 24.
6  An example of the effective use of circular graphs to illuminate data is the beautifully informative rose charts designed by Florence Nightingale.

Library Science


Evaluation, assessment, and performance measurement are not what you’d call sciences. But these activities do share certain things in common with science and the scientific method.1  One is the requirement that theories be tested based on the compilation of objective evidence. Another is the idea of replication, which is carefully repeating a measurement or experiment in order to verify that the initial findings were not an accident or mistake of some sort.

Then there’s the more philosophical concept known as falsifiability. A scientific theory needs to be such that there is some way that it can be examined and possibly disproved. A credible scientific theory is one that holds up under repeated attempts to be proven wrong.

In everyday terms, there is a lot of transparency and double-checking in science. I bring these ideas up because, as it happens, there is a claim made in my prior blog entry that needs rechecking. The claim is:

On the basis of per capita statistics, smaller U.S. public libraries out-perform the largest U.S. public libraries.

After I made the claim a couple of things made me wonder more about it. First, Tord Høivik found quite opposite trends based on Norwegian public library statistics he analyzed from the KOSTRA municipal-county reporting system.2

Then, when I mentioned my claim to Keith Curry Lance, he said that these differences can be a side-effect of the data ranges that the libraries are sorted into (for instance, expenditure categories such as [a] under $10,000, [b] $10,000 to $49,999, [c] $50,000 to $99,999, and so on). I had forgotten about a chapter that statistician Howard Wainer wrote on this exact topic. In his book Picturing the Uncertain World he explains how it is possible to manipulate the data ranges to create trends that don’t really exist in the underlying data.3

Revisiting My Claim

Though I didn’t consciously manipulate the two categories I used,4   it wouldn’t hurt to look at a more complete range of data. For this purpose I devised a few alternate schemes shown here:5

Click for larger image

This table begins with the 11-category scheme that IMLS happens to use,6 followed by other schemes I created to contain fewer categories. This involved changes like expanding the highest category to include populations of 600,000 and above, defining the smallest libraries as serving populations up to 4,000, and so on. For fun, I included the binary scheme, libraries serving communities less than 200,000 and those serving 200,000 or more.7

As the table shows, with fewer categories the counts of libraries in each category are higher. And, regardless of the category breakdown, the larger population categories consist of very few libraries (88 for 300K-599.9K and 71 for 600K +) compared to the smaller categories (for instance, 2777 for 10K-49.9K).

Next, I plotted median values for several per capita measures, like volumes per capita as seen here:

Click for larger image

One effect of using fewer and expanded categories is the disappearance of the highest values seen in the upper (11-category) chart. Note in that chart that the 15.7 value for the <1K group disappears in the lower left (8-category) chart, where the median value for the <4K group is 8.5. This lowering of more extreme values is primarily due to the broader category sizes. More libraries in a group tend to moderate the median values, bringing them downward.
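To show how category boundaries alone can move a median, here is a small Python sketch. The eleven libraries are invented (they are not IMLS records), and I chose their values so the medians for the smallest group echo the 15.7 and 8.5 mentioned above. The function and the category labels are likewise my own.

```python
from bisect import bisect_right
from statistics import median

def median_by_category(records, edges, labels):
    """Group (population, per-capita value) pairs by population and take medians.

    edges are the lower bounds of every category after the first.
    """
    groups = {label: [] for label in labels}
    for population, per_capita in records:
        groups[labels[bisect_right(edges, population)]].append(per_capita)
    return {label: median(vals) for label, vals in groups.items() if vals}

# Hypothetical libraries: (service population, volumes per capita).
records = [
    (800, 16.0), (900, 15.4), (2500, 9.0), (3500, 8.0), (3900, 7.5),
    (3950, 7.0), (12000, 4.0), (30000, 3.5), (45000, 3.0),
    (250000, 2.2), (700000, 1.8),
]

fine = median_by_category(
    records,
    edges=[1000, 5000, 50000, 500000],
    labels=["<1K", "1K-4.9K", "5K-49.9K", "50K-499.9K", "500K+"],
)
coarse = median_by_category(
    records,
    edges=[4000, 50000],
    labels=["<4K", "4K-49.9K", "50K+"],
)

print("finer scheme:  ", fine)
print("coarser scheme:", coarse)
```

The same eleven libraries produce a top median near 15.7 under the finer scheme and 8.5 under the coarser one. Nothing about the libraries changed, only where I drew the lines.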

The volumes per capita measure directly supports my prior claim. The smallest libraries out-perform the largest libraries. In the top (11-category) chart, with each larger size group the median value decreases. In the 8-category chart the claim is mostly true, although the 300-599K category is even with the 600K + category at 1.8. With the bottom right chart above (for < 200K and 200K +) the claim is true, but not very convincing due to the simple binary categories.

But, let’s take a look at visits per capita:

Click for larger image

As with the volumes per capita charts, here we see that the highest values in the 11-category (upper) chart are gone in the 8-category chart. But the interesting thing about visits per capita is how jagged and uneven the patterns are in both of these charts. In the 11-category chart, the 500K-999.9K group out-performs all others, and the smallest (<1K) category places second. Even so, the 1M category earns 4th place, matching or exceeding two other smaller categories. In the 8-category scheme, the three smallest categories beat the 600K+ group, but this last group earns 4th place, with the other smaller categories lagging behind it.

It Depends

So the situation isn’t so cut-and-dried, is it? Now, I invite you to look at other library measures. Click on the two images below to see charts of library input and output measures (including volumes and visits):

                                   

Click images to view detailed charts

The basic patterns in the measures shown in the charts appear in the table below. From this summary, it seems that, in general, smaller public libraries do out-perform the largest ones. Still, it depends on the definition of largest and smaller libraries, and on whether the ten measures that I chose qualify as a reasonable collection of meaningful library measures. In any case, the claim is definitely untrue for operating expenditures and reference transactions. Also, it pertains only to the set of libraries represented in the 2009 IMLS data. The claim doesn’t apply to other years or to other countries, nor to subsets of U.S. libraries, for instance, all libraries west of the Mississippi River or to Public Library Data Service (PLDS) libraries.

Summary of Trends for 10 Library Input & Output Measures

So here you see something else about science-inspired measurement. The interpretation of the data has to stay within the bounds of that data. Because I extrapolated well beyond the actual data I had (see footnote 4), my original claim was kind of flimsy. I am lucky that examining more data didn’t end up disproving the claim completely!

  
—————————

1  Some of the foundational ideas in evaluation, assessment, and especially performance measurement have also been borrowed from the field of financial auditing. See Beryl Radin’s 2006 book, Challenging the Performance Movement: Accountability, Complexity, and Democratic Values and Michael Power’s 1997 book, The Audit Society: Rituals of Verification.
2  See Tord’s comment in my prior post.
3  Wainer, H., 2009, Picturing the Uncertain World: How to Understand, Communicate, and Control Uncertainty, Princeton, NJ: Princeton University Press. See chapter 14, “The Mendel Effect.” Actually, the trick Wainer describes applies to situations where two measures aren’t related at all. Meaning that higher or lower values in one measure are not reflected in the other measure. Even so, spacing the category boundaries just right can make the opposite appear true!
4  These were libraries serving communities from 15,000 to 20,000 and those serving communities of 100,000 and over.
5  If it weren’t so cumbersome, a better way to categorize library size would be considering both service area population and library expenditures together. This was described in a 1998 IMLS publication (NCES-98-310) by Keri Bassman and her colleagues entitled, How Does Your Public Library Compare?
6  See Table 1 in Henderson, E., Miller, K., Craig, T. et al. (2010). Public libraries survey: Fiscal year 2008 (IMLS-2010-PLS-02), Washington, DC: Institute of Museum and Library Services.
7  I borrowed this population threshold from de Rosa, C. and Johnson, J. (2010). From awareness to funding: A study of library support in America, Dublin, OH: Online Computer Library Center. Their study surveyed only libraries in communities with less than 200,000 population.

How Do You Know That?

I borrowed the title for this entry from a 2009 study of student research practices by Randall McClure and Kellian Clink. Their study is cited in an article in the current issue of College & Research Libraries that Joe Matthews brought to my attention. This article is Students Use More Books After Library Instruction by Rachel Cooke and Danielle Rosenthal. Both articles explore research sources and citations that undergraduate students use in writing assignments. Though it’s this second article I want to discuss, McClure’s and Clink’s well-chosen title is too good to pass up. In fact, I’m thinking of making it the motto of this blog!

Anyway, in their article Cooke and Rosenthal report that university English composition students “used more books, more types of sources, and more overall sources when a librarian provided instruction.”1  Their statement contains two separate claims: (1) the quantity of citations used by composition students receiving library instruction differs from the quantity that composition students not receiving instruction used; and (2) this difference is due to library instruction.

Let’s look at the first of these and save the second for some other time.2 I suggest we start by applying a basic principle of information literacy, which I paraphrase here:

Don’t just take the information in front of you at face value. Try to identify possible shortcomings that might compromise the accuracy, authenticity, or relevance of the information. The fewer shortcomings the information has, the more trustworthy it is.

Put another way, you have to be sure how you know that the information is true.

Statisticians also happen to be preoccupied with the relative trueness of information. And one thing they obsess over is inaccuracy that comes from measuring a subset (sample) of a larger group as a way to learn about that group. In the Cooke and Rosenthal study this larger group is all freshman English composition students enrolled in their university in the study year (or perhaps in the last couple of years). Their sample consists of those students whose papers the researchers examined.

Samples are only estimates of what is true for the larger population. The data always contain a certain amount of inaccuracy, making the sample an imperfect representation of the population.3    If you have ever played cards, you’ve experienced this same situation. You’ve seen how easy it is to be dealt a 7-card Rummy hand that contains only low cards, or only two suits, or a majority of face cards. You know hands like these aren’t very good representations of the contents of the whole deck of cards because you know the contents of that deck.

This misrepresentation happens in the same way with samples, except we are mostly ignorant about the contents (those characteristics we want to know about) of the whole population that we are studying. So, we can’t tell how far off our sample values are.
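If you'd rather not shuffle a thousand hands yourself, here is a small Python simulation of the card analogy. The seed, the number of hands, and the choice to track face cards are arbitrary decisions of mine. It deals 7-card hands from a standard deck and compares each hand's share of face cards with the deck's known share.

```python
import random

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]
FACE_RANKS = {"J", "Q", "K"}

def face_card_share(cards):
    """Proportion of the given cards that are jacks, queens, or kings."""
    return sum(rank in FACE_RANKS for rank, _ in cards) / len(cards)

rng = random.Random(2011)  # fixed seed so reruns deal the same hands
hands = [rng.sample(DECK, 7) for _ in range(1000)]

deck_share = face_card_share(DECK)                # the "population" value, 12/52
shares = [face_card_share(hand) for hand in hands]

# Hands holding two suits or fewer, another way a sample can be lopsided.
lopsided = sum(len({suit for _, suit in hand}) <= 2 for hand in hands)

print(f"face cards in the deck: {deck_share:.1%}")
print(f"face cards per hand ranged from {min(shares):.1%} to {max(shares):.1%}")
print(f"hands with two suits or fewer: {lopsided} of {len(hands)}")
```

Here we can see how far each hand strays because we know the deck. With a survey sample we only ever hold the hand.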

Keeping this in mind, let’s look at these data that Cooke and Rosenthal present:

Average citations per paper - Cooke_Rosenthal                            Source: Cooke and Rosenthal, 2011, p. 335.

Just as a Rummy hand might have one suit over-represented, the sample of non-instructed students could contain a high proportion of papers having only one citation—higher than the true proportion in the larger population. If this is the case, then the 3.2 average is too low.  The average for the population might really be 4.0 or so. And the opposite could be true for the instruction group. What if, by chance, two over-zealous students in the sample each put 20 citations in their papers? In a small sample of about 60 students, these two high values would inflate the sample average. So, maybe the real figure for all instructed students is more like 4.5.
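The arithmetic of that second scenario is easy to check. In the Python sketch below, the individual citation counts are invented. I picked them so a sample of 60 papers averages 5.3, the figure reported for the instruction group, and they are not Cooke and Rosenthal's data.

```python
from statistics import mean

# Hypothetical citation counts for 58 "typical" papers (invented values).
typical = [5] * 46 + [4] * 12

# Two over-zealous students who each cite 20 sources.
zealous = [20, 20]

sample = typical + zealous

print(f"papers in sample:          {len(sample)}")
print(f"average with the two:      {mean(sample):.2f}")
print(f"average without the two:   {mean(typical):.2f}")
```

In this made-up sample, two papers out of sixty lift the average from about 4.8 to 5.3. A half-citation swing from two students is exactly the sort of accident a small sample can hide.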

For these reasons we can’t simply judge survey data by appearances alone. While the 5.3 average appears much higher than 3.2, something like the scenarios I just described could be closer to the truth. Mind you, this is not to suggest that my alternate scenarios actually are true. This is merely to ask, how do you know that they are not?

Enter the statisticians. They are experts at looking for ways to make the murky waters of data clearer (kind of), in this case by taking sampling uncertainty into account. They realize that if scenarios like the ones I described can be shown to be very unlikely, this justifies our having more confidence in the sample results. They approach this challenge by putting the sample data through an extra hurdle which is essentially an additional quality test. This extra hurdle is known as statistical significance testing. If the data pass this testing, statisticians tell us we can be very (95% or more4) confident that the findings are not due to the vagaries of sampling. In other words, we can be reasonably sure that the numeric differences we observe are authentic, rather than being “due to the-luck-of-the-draw” or “attributable to chance.”
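One way to see what such a test does is a permutation test, sketched below in Python. I chose this technique because it is easy to follow. It is not necessarily the test a textbook would apply to these data, and it is not something Cooke and Rosenthal ran. The twelve citation counts are invented. The idea: assume for the sake of argument that instruction makes no difference, then count how often a random relabeling of the papers produces a gap in averages at least as large as the one observed.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = total = 0
    # Try every way of relabeling len(group_a) papers as "group A".
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical citation counts per paper (not the study's data).
instructed = [5, 6, 4, 7, 5, 5]
not_instructed = [3, 2, 4, 3, 4, 3]

p_value = permutation_p_value(instructed, not_instructed)
print(f"observed gap in averages: {mean(instructed) - mean(not_instructed):.2f}")
print(f"share of relabelings with a gap at least that large: {p_value:.4f}")
```

For these invented numbers, only 6 of the 924 possible relabelings produce a gap that large, well under the 5% threshold. Note what was calculated: how rare the result would be if there were no real difference, not the probability that the difference is real.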

Statistical significance testing is simply a tool that allows us to address one pesky data problem (among several), the possibility that our data primarily reflect the-luck-of-the-draw. You can read about it in introductory statistics textbooks. But beware. This is a troublesome topic that can be really confusing. As explained in the video5  below, a high probability that the numbers are not attributable to chance does not mean we can apply the same high probability to our belief that the data are accurate and reliable. This is the odd thing about statistical significance testing. It indicates what our data probably are not, not what they probably are.

As you can tell, this topic entails a lot of probablys, maybes, likelys, and possiblys, as well as twisted logic! But we can definitely say that, while statistical significance testing is not the be-all and end-all of data quality assurance, it does help alleviate some of the problems connected with sampling. If Cooke and Rosenthal, and also the other researchers cited in their article, were to put their data through this extra quality test, that would be one less worry for information-literate readers. Otherwise, there will always be the nagging doubt that the differences and contrasts they report are just statistical accidents.

  
—————————

1  Cooke, R. and Rosenthal, D. (2011). Students use more books after library instruction: An analysis of undergraduate paper citations, College & Research Libraries, 72(4), p. 332.
2  The second claim is more complicated than you might think. It requires us to consider what is sufficient proof that a library program or service is the sole (or even major) cause of an observed outcome. Plus, we’d need to delve into the objectives and desired learning outcomes of the instructional program. As Joe Matthews pointed out to me, counts of citations are probably not valid indicators of the quality of the student papers. McClure and Clink raise this same question. See: McClure, R. and Clink, K. (2009). How do you know that? An investigation of student research practices in the digital age, portal: Libraries and the Academy, 9(1), p. 120.
3  Statisticians call this sampling error. There is also nonsampling error, which we eventually have to consider.
4  If we choose the 95% level as our testing criterion, and the data pass this test, this means that results at least as extreme as ours would turn up less than 5% of the time, or fewer than 1 time in 20, if the luck of the draw were the only thing at work. At the 99% level, such results would turn up less than 1% of the time, or fewer than 1 time in 100.
5  I link to this video because the creator, Jose Silva, makes two excellent points: Our data are estimates; and statistical significance is “about your results coming up by accident.” Ironically, part of Silva’s explanation is wrong. His statement “Given our estimate of 320, the probability that the real effect size is zero is less than 5%” is backwards. 5% is the probability that our sample values (or any values more extreme than these) would occur, given a basic assumption we start with for the sake of argument. This assumption always involves zero, one way or another. For Cooke’s and Rosenthal’s data in the table above this would take the form, “Let’s assume the averages of the instructed and non-instructed groups are equal, meaning their difference is zero.”
    This assumption is the groundwork for the testing, not the result of it. So the 5% applies to the data, not to the assumption. It indicates how probable our sample values would be if, in the real world, the zero-related assumption were actually true. For more information see Kline, R., 2004, Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research, Washington, DC: American Psychological Association.

I recently ran across a series of studies suggesting that prayer tends to lessen anger and aggression. Researchers concluded that prayer helps people adopt a more positive view of adverse or irritating circumstances. There also happens to be a sideline to their findings that illustrates something you don’t hear much about from proponents of outcome assessment in libraries. It involves this statement by the researchers:

These results would only apply to the typical benevolent prayers that are advocated by most religions… Vengeful or hateful prayers, rather than changing how people view a negative situation, may actually fuel anger and aggression.

Though the aims of the prayer studies differ from those of outcome studies, the two research approaches are similar in this respect: When studying effects of a program, treatment, or intervention, if we’re not sure about the exact content of that program, treatment, or intervention, then we have a problem. In the field of program evaluation this problem falls under the rubric of program fidelity. Here’s a brief explanation:

In outcome research, an intervention can be said to satisfy fidelity requirements if it can be shown that each of its components is delivered in a comparable manner to all participants and is true to the theory and goals underlying the research.1

In the prayer studies, subjects were permitted to use whatever type of prayer they wanted. They might have recited traditional prayers, made up extemporaneous prayers, or chosen silent or contemplative forms of prayer. Or they may have just pretended to pray. The researchers considered any and all styles of prayer to be equivalent, except, as we learn later, vengeful and hateful prayer. (Which makes me wonder how they could be sure that no subjects chose this form!)

However, in the arena of publicly funded programs, insufficient information about the specific content of program interventions impedes good evaluation and decision-making. An example will help make this clearer. Say a rural literacy program includes specific educational materials for parents along with a specially prepared video. But suppose only a portion of the parents actually receive the materials. And suppose others don’t have a way to play the video, and that others don’t have the time to watch it. Ignorance of these facts can lead program managers to the wrong conclusions when they look at outcomes. They might decide, for instance, that the parental education component should be discontinued because it was not cost-effective.

It’s also possible that a program might be implemented uniformly but incorrectly. Maybe all participants in the family literacy program received an outdated version of the video. Or there may have been errors in the eligibility information communicated to the school district. Or enthusiastic staff may have decided to improvise by awarding prizes to children who completed program milestones early. Obviously, there are all kinds of ways that program implementation can deviate slightly or substantially from the original plan.

Of course, deviations from the official program or intervention design can lead to undesirable results. The problem of incomplete distribution of materials to parents mentioned above is an instance where a program variation interferes with desired outcomes. And so, apparently, is the case of vengeful prayer.  But the opposite is also possible. What if staff deliver a poorly conceived program so creatively that the outcomes are positive?

In either instance we’d have to concede that the delivered program is not the one that had been intended. So, the basic message is:

Without evidence that a program has been implemented properly, it is difficult to determine whether a program ‘works’ or meets its intended goals.2

Notice that evidence about (careful measurement of) program implementation is essential. We can’t just assume that programs consist of the right things and are delivered in the right ways. Nor should we rely on hopes and prayers that programs and services are delivered uniformly and correctly.

  
—————————

1  Dumas, J. E., Lynch, A. M., Laughlin, J. E., Smith, E. P., and Prinz, R. J. (2001). Promoting intervention fidelity: Conceptual issues, methods, and preliminary results from the Early Alliance Prevention Program, American Journal of Preventive Medicine, 20, 38-47.
2  Esbensen, F., Matsuda, K. N., Taylor, T. J., and Peterson, D. (2011). Multi-method strategy for assessing program fidelity: National evaluation of the revised G.R.E.A.T. program, Evaluation Review, 35(1), 14-39.

A recent article in AL Direct entitled The Smartest Readers presents some simple library rankings based on that stalwart library measure, circulation per capita. Rankings like these are, at least to me, a reminder of a perennial conundrum concerning the meaning of per capita library measures. For more than a century librarianship has puzzled over how to evaluate these statistics. Do per capita data tell us whether or not libraries are doing a good job? What amounts of materials made available or levels of services delivered are sufficient for libraries with specific missions and serving communities of a particular size and makeup?

Mainly, libraries have to rely on their own ingenuity to interpret per capita or per constituent data (like per student, faculty, employee, subscriber, stakeholder, and such). About the only official guidance they have gotten over the decades is advice about comparing (benchmarking) their data with appropriate peer libraries. Lacking some more objective gauge of statistical performance, libraries end up applying what might be called the more-is-better rule. Indeed, this convenient and popular assumption is at the heart of library ranking schemes like the LJ Index, HAPLR, the Bibliotheksindex BIX, and the ARL Index.

So it’s natural that ALA librarian Karen Muller, author of the article, would sort the data from high to low. And she limits the comparisons to roughly similar libraries, those serving communities of 100,000 or more, although she chose that group in order to cross-check her top 20 list with Amazon.com’s rankings of Most Well-Read Cities.

But here’s an interesting tidbit I have to offer on this topic: If statistics-based advocacy is our aim, we can put a better foot forward by advertising the performance of smaller libraries more than that of larger ones. As a group, smaller libraries almost always outshine the largest libraries on per capita measures.

Let’s see how this works. For comparison purposes I gathered 2009 IMLS data for a group of smaller public libraries to match the size of the group that Muller used in her article. She analyzed the 549 largest U.S. public libraries and my group consists of the 551 U.S. libraries serving communities of 15,000 to 20,000.1

Chart 1A below is a scatterplot of the data that Muller analyzed showing circulation and community population together.2  Notice that the circles in the chart are clustered toward the left since these libraries serve communities ranging from 100,000 to nearly 5 million. Most of the community populations are between 100,000 and 200,000, which accounts for this clustering.3
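
To see why these libraries bunch up on the left, and why a logarithmic axis helps, here is a small sketch. The five population values are hypothetical points spanning the range described above, and the helper names `linear_position` and `log_position` are my own. Each function returns how far along the horizontal axis (0 is the left edge, 1 is the right edge) a value would be drawn.

```python
import math

# Hypothetical community populations spanning the range described above.
populations = [100_000, 150_000, 200_000, 1_000_000, 5_000_000]
lo, hi = populations[0], populations[-1]


def linear_position(x):
    """Fraction of the axis width from the left edge on a linear scale."""
    return (x - lo) / (hi - lo)


def log_position(x):
    """Fraction of the axis width from the left edge on a log10 scale."""
    return (math.log10(x) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


for pop in populations:
    print(f"{pop:>9,}  linear={linear_position(pop):.3f}  log={log_position(pop):.3f}")
```

On the linear scale a community of 200,000 sits only about 2% of the way across the axis, so most libraries would pile up against the left edge. On the logarithmic scale the same community sits well away from the edge.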

Charts shown here omit some outlying data points. Click charts for complete views. Move cursor over circles in complete views to see individual library data.

Chart 1B shows libraries from chart 1A that make the top 20 rankings based on their circulation per capita measures. These are the same libraries that appear in Muller’s top 20 list (see footnote #2). Note also that horizontal lines in charts 1A and 1B indicate the median circulation per capita values for the groups.

Charts 2A and 2B show the same circulation measures for libraries with populations ranging from 15,000 to 20,000.

Notice in the charts that the median circulation per capita is higher for the smaller libraries (charts 2A and 2B) than for the larger libraries (charts 1A and 1B). For example, for the two top 20 lists (charts 1B and 2B), the median circulation per capita value is 23.1 for large libraries and 26.6 for the smaller libraries.
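
The comparison behind these medians is simple arithmetic: divide each library's circulation by its community population, then take the middle value for each group. Here is a minimal sketch using invented figures (not IMLS data). The function name `median_per_capita` is my own.

```python
from statistics import median

# Hypothetical (annual circulation, community population) pairs.
# These numbers are invented for illustration; they are not IMLS data.
large_libraries = [(2_400_000, 300_000), (1_500_000, 150_000), (900_000, 120_000)]
small_libraries = [(180_000, 15_000), (220_000, 20_000), (160_000, 16_000)]


def median_per_capita(libraries):
    """Median of circulation divided by population across a group."""
    return median(circulation / population for circulation, population in libraries)


print("Large libraries:", median_per_capita(large_libraries))
print("Small libraries:", median_per_capita(small_libraries))
```

The median is used rather than the mean so that one or two unusually high or low libraries do not drag the group figure around.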

Now let’s try another measure, volumes of print materials per capita. Chart 3A shows this statistic for large libraries and 3B does so for the top 20 large libraries ranked on the statistic.

Granted, in some instances a couple of large libraries might outperform all smaller libraries on a given measure, as Boston Public Library does with its 15.0 volumes per capita measure. (Click on chart 3B above to see larger chart. On that chart move cursor over the circle located in the top row and center column of the grid.) Yet charts 4A and 4B demonstrate that smaller libraries, as a group, consistently perform better. Again, median values shown in the charts exceed those for the larger libraries. Plus, the distribution of circles tells this story too. More circles in chart 4A extend higher on the vertical scale than in chart 3A. For example, only 5 larger libraries reported volumes per capita of 8 volumes or higher (chart 3A) while 16 smaller libraries did (chart 4A). And only 11 larger libraries exceed 6 volumes per capita compared to 48 smaller libraries that do.

Finally, for one more measure, total library staff per 1,000 community population, we see these same patterns in charts 5A through 6B. Medians are higher for smaller libraries and, overall, circles in charts 6A and 6B extend higher than in 5A and 5B. (When comparing charts 5A to 6A and 5B to 6B be sure to take note of the vertical axes value labels, as the scales differ on each.)


The bar charts shown here illustrate the differences between the two groups on median values of the per capita measures.

Of course, we would have to look at many more than three library measures to declare this a trend. But I bet you that it is. Actually, I’m sandbagging here because certain pioneers of public library performance statistics documented this phenomenon as early as 1973. In their landmark study that analyzed per capita measures Ernest DeProspo, Ellen Altman, Kenneth Beasley, and Ellen Clark wrote:

A statistical comparison of libraries of different sizes…suggests that small libraries give a greater return per dollar spent, and that the economy of scale normally expected in larger institutions is not evident.4

Weird, isn’t it?

  
—————————

1  Using 2009 IMLS data, public libraries in communities of 10,000 or less number 5,388, and those in communities from 10,000 to 20,000 number 1,400. So, the smallest comparably sized segment I could find is the group serving communities from 15,000 to 20,000, which consists of 551 libraries, excluding libraries outside of the 50 U.S. states.
2  Charts 1A, 3A, and 5A present data for 539 U.S. public libraries in the 50 states serving communities of 100,000 or more. The listing at the bottom of chart 1B (click chart above to see complete image) matches the table in Muller’s article exactly. This suggests that figures in the article are from 2009 IMLS data, although the article says they are from 2008.
3  The horizontal axis of Chart 1A and a couple others in this post use “logarithmic scaling.” This gives more space to smaller data values toward the left of the charts and makes the arrangement of values to the right more condensed. Thus, the axis values get progressively higher, left to right. The axis value labels indicate this progression clearly. (Without logarithmic scaling circles representing the smaller large libraries would be jammed next to the 100,000 axis gridline.)
4   DeProspo, E. R., Altman, E., Beasley, K. E., and Clark, E. C. (1973). Performance measures for public libraries, Chicago: Public Library Association, p. 22. Red emphasis added.

The field of program evaluation has grappled with the political context of institutional performance measurement for decades. For libraries and universities, though, the politics of accountability is newer terrain. In some instances these organizations have unwittingly enrolled in a crash course on the subject, learning in real-time how volatile the process can be.

A prime example is the recent controversy about faculty productivity within the University of Texas System (UT). At the request of its Board of Regents, UT released an 821-page spreadsheet disclosing detailed records on faculty compensation, course enrollment, class sections taught, research time allocations, and other related data. Each page of the document contains this curious disclaimer in red:

The data in its current draft form is incomplete and has not yet been fully verified or cross referenced. In its present raw form it cannot yield accurate analysis, interpretations or conclusions.

Essentially, they’re saying, “Well, here are our data, but they can’t be trusted.” How odd that UT administrators were so willing to imply that their institutional research capabilities are impaired rather than attribute any level of accuracy to their data. One would hope that the salary data are fairly accurate since UT’s financial records have to meet acceptable accounting standards. So maybe it’s the other data that are so shaky and unreliable.

Whatever the case, the professed squishiness of the data reminded me of the popular quotation by Sir Josiah Stamp, the early 20th century British inland revenue secretary, economist, banker, and jack-of-several-statistical-trades:

The Government are very keen on amassing statistics—they collect them, add them, raise them to the nth power, take the cube root, and prepare wonderful diagrams. But what you must never forget is that every one of those figures comes in the first instance from the chowty dar (village watchman) who just puts down what he damn pleases.1

Aside from the vagaries of self-reported data—a complex topic on its own—the point is that we must be cognizant of the quality of any data we are considering. And this assessment is always relative depending on our purposes. For some purposes we need highly accurate and precise data, for others less so. And the same data might be relevant for one investigation but irrelevant for another.

In his 1919 book Sir Josiah Stamp also had something to say about perceptions of the worthiness of data:

We are all familiar with the class of persons who despise and distrust statistics. They are the first to rush to statistics when they are in trouble, and use them without investigation or discrimination…  At another moment the fickle user of figures seeks to prove that statistics…have no real meaning; and because estimates made upon one particular principle are not really serviceable for every possible use, they are condemned as being useful for none.2

By dissing their own data the UT officials are trying to inoculate their institutions against unpleasant findings that might lurk in the data. (There are probably other motives for this tactic as well. I would have loved being a fly-on-the-wall in the meetings where that statement was crafted!)

But do the UT administrators seriously believe that sleuthing the data in their present form is a complete waste of time? I can’t imagine that they do. By claiming that the spreadsheet is basically 0% accurate the administrators imply that the final audited and cross-checked one will be 100% accurate. Neither of these estimates is very likely to be true.

“Accurate3 analysis, interpretations, and conclusions” can be derived from UT’s data as they are, as long as these are qualified by a fair estimate of the accuracy of the data. And you can bet that UT will be receiving requests for this very estimate. (“You mean you released garbage data now, intending to replace it with non-garbage data later?”)

Besides, of all people, academics realize that sound arguments depend upon the quality of the evidence and the logical consistency of the arguments themselves. People can draw really wrong conclusions from the most accurate of data.

The thing that bothers me, though, is that UT’s alarmist disclaimer is the mirror image of the sort of exaggeration I complain about in this blog. Eventually, libraries and universities are going to have to abandon their fickle, knee-jerk reactions of rushing to statistics that support their cases and condemning those that don’t.

  
—————————

1  Stamp, J. (1929). Some economic factors in modern life, London: P. S. King & Son, pp. 258-259.
2  Stamp, J. (1919). The wealth and income of the chief powers, London: Royal Statistical Society, p. 2.
3   A better wording of the disclaimer would use the term valid rather than accurate. The validity of a proposition is its relative weightiness and its logical consistency, including how well patterns in the evidence support the arguments made. Justifiable (or we might also say warranted) conclusions and interpretations are said to be valid, whereas trustworthy data are said to be accurate.
   Accuracy, a concept distinct from precision, pertains to the trueness of the data themselves and to their faithful preservation when quantitative techniques are applied. Data analysis typically means using quantitative techniques to examine and identify trends in data. However, when the term analysis refers to more general implications drawn from data, then the adjective valid would apply. (I think.)

The U Word

This week Chase Bank sent an email to its customers saying that one of its vendors’ computer systems was hacked. The bank stated that they:

…are confident that the information that was retrieved [i.e., stolen] included some Chase customer e-mail addresses, but did not include any customer account or financial information. Based on everything we know, your accounts and financial information remain secure.

Confidence based on whatever they happen to know, eh?  Because Chase could easily be mistaken, customers would be foolish to put their full trust in the bank’s assurances. I definitely plan to keep an eye on my Chase account for the next several months.

This same caution also applies to the most recent OCLC membership report, Perceptions of Libraries, 2010: Context and Community. The report’s energetic graphics and narrative make the information seem to be true. But, as my prior posts1 explain, surveys are always incomplete and imperfect. Findings from a single survey like OCLC’s are just not weighty enough to deserve our unconditional trust. At best, the findings are well-calculated estimates, at worst they can be really bad guesses. In a word (the U word, actually), some level of uncertainty is always embedded in the information.

As with the Chase Bank situation, we have to wonder what data the OCLC researchers don’t have. Specifically, what kinds of people didn’t respond to the survey? How likely were over-surveyed college students to respond? Which citizens relying solely on public library computers would spend their online time answering the survey? Which sorts of people won’t be found among the few millions of Harris Interactive survey panel members?2 And so on.

The OCLC study and the Chase data theft have something else in common. In both cases the institutions are selective about which information they pay attention to. In the jargon of scientific research this is called confirmation bias—intentionally or unintentionally considering only evidence that supports one’s preconceived notions and ignoring evidence to the contrary. Scientists and engineers strive to avoid this bias since it can be so dangerous.

Here’s a creative interpretation of data which otherwise don’t support the OCLC researchers’ viewpoint:

Millions of Americans, across all age groups, indicated that the value of the library has increased during the recent recession.3

The researchers base this statement on data provided in the study, which I have plotted in the top barchart (with green bars) shown here:

Note in the top barchart that among the different respondent age groups (labeled on the horizontal axis) from 16% to 36% perceived the library as more valuable. The researchers call these “double-digit percentages” to suggest that they are large. But their relative smallness becomes obvious when we look at the rest of the data. The lower barchart (with brown-beige bars) shows that from 64% to 84% did not perceive libraries as more valuable. Each age group has a majority of nearly 2 to 1 or higher (64% against 36% at the narrowest) reporting no increased library value. However, in the OCLC study the voices of these majorities fell upon deaf ears.

The minority also won in the chart shown here:

Source: Perceptions of Libraries, 2010, OCLC, Inc., p. 27.

The chart’s caption is untrue. Only 37% of survey respondents reported increased library use, and the OCLC study doesn’t say how much this use increased.

Eagerness to prove a point occasionally leads the OCLC researchers to misinterpret the data they do pay attention to. The study includes a wheel-shaped chart, shown here, that I also discussed in my prior post:

OCLC Circular Chart Comparing Respondent Group Library Use
Source: Perceptions of Libraries, 2010, p. 24.

The chart gives percentages of economically impacted respondents who use one or another library service monthly, compared to non-impacted respondents.4 Although the caption reads, “Economically impacted Americans use library services more frequently,” the data in the chart aren’t measures of more versus less frequent use. Instead, they indicate what the groups do with equal frequency, that is, monthly.

Based on data of the sort that’s in the chart, we might observe that one group has a higher proportion of its members represented in the entire group of monthly users than another group. But this would not mean that members of that group use library services more frequently than the other group does, nor that one group is more prominent among all monthly users than the other group.5  Neither would this sort of data indicate whether one group uses library services more frequently than they did in the past.
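
A small worked example may make the distinction easier to follow. The counts below are invented and are not from the OCLC report. The function names `rate_within_group` and `share_of_monthly_users` are my own. One function answers "what proportion of this group uses the library monthly?" and the other answers "what proportion of all monthly users come from this group?"

```python
# Hypothetical respondent counts, invented for illustration only.
groups = {
    "impacted": {"respondents": 200, "monthly_users": 80},
    "non_impacted": {"respondents": 1000, "monthly_users": 300},
}

total_monthly_users = sum(g["monthly_users"] for g in groups.values())


def rate_within_group(name):
    """Proportion of a group's members who are monthly users."""
    g = groups[name]
    return g["monthly_users"] / g["respondents"]


def share_of_monthly_users(name):
    """Proportion of all monthly users who belong to this group."""
    return groups[name]["monthly_users"] / total_monthly_users


for name in groups:
    print(
        f"{name}: rate within group = {rate_within_group(name):.0%}, "
        f"share of all monthly users = {share_of_monthly_users(name):.0%}"
    )
```

With these made-up counts the impacted group has the higher rate within its own membership yet supplies the smaller share of all monthly users. And neither figure says anything about how many times a month any individual visits, since everyone counted here is, by definition, a monthly user.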

Most readers of the study won’t recognize the one giant leap of faith the OCLC researchers make in order to confirm their preconceptions. They use the word Americans repeatedly, implying that they are reporting definitive information about the U.S. population at large. However, the study’s portrayal of Americans isn’t definitive and the researchers know this. The Americans portrayed in the survey are merely individuals in the U.S. who participate in Harris Interactive’s survey panels. Statistically, findings from the sample of 1,334 Americans who were polled 6  can describe only (statisticians say can be generalized to) this larger group of panel participants. Beyond this, this larger group is believed to be reasonably representative of all online users in the U.S.

But, here’s the glitch. In the last few pages of the report the researchers make this remarkable disclosure:

The online [American] population may or may not represent the general [American] population.7

Well, that pretty much covers the options, but what does it mean? It means the researchers have no idea how well the study figures resemble figures that are true for the entire U.S. population. Although statements like the following pervade the OCLC report…

A third of American families had at least one family member who experienced a negative job impact during the recession.8

…in the fine print we learn that the researchers aren’t very confident about these. (We would say they are 50% confident about the statements if we assume may or may not indicates two equal probabilities.) Thus, for the statement above the true amount could just as easily be a half of American families, three-eighths, a fourth, a fifth, one tenth, or something else. This uncertainty pertains also to extrapolations the researchers make from survey percentages to U.S. population counts, like:

13 million economically impacted Americans—that is more than the populations of New York, Chicago and Houston combined—are using the library more during the challenging economic time.9

It’s hard to know in which direction and how much this estimate might need to be adjusted to accurately reflect the U.S. population. Maybe we need to remove Houston from the count, or maybe we should add Philadelphia.

Does the sketchiness (uncertainty) in the OCLC findings mean the study itself is somehow defective? Not at all. All surveys, and all measurements for that matter, are sketchy to one degree or another. We just need to be intelligent about this sketchiness.

If the data are too inexact for our purposes, we need to improve our data collection methods and then re-measure. If the data are good enough, we are obliged to use them conscientiously, explaining their limitations (uncertainty) loudly and clearly so we don’t lead our audience astray.10  If people become overconfident about our study conclusions, they could end up making the wrong decisions, like choosing to blissfully ignore their bank statements.

  
—————————

1  See Discussing Accuracy, Checking It Twice, Stranger Than Fiction, and Objects In Mirror Are Closer Than They Appear.
2  Statisticians at Harris Interactive understand questions like these and, in some cases, apply statistical adjustments meant to address them. But these adjustments are not guaranteed to work (see Discussing Accuracy).
3  Online Computer Library Center, (2010). Perceptions of libraries, 2010: Context and community, Dublin, OH: Online Computer Library Center, p. 44.
4  The report defines economically impacted respondents on p. 20. Non-economically impacted respondents are those not meeting the criteria outlined in that definition.
5  Replacing the words “more frequently” with “at a higher rate” makes this easier to comprehend. Also, don’t confuse the idea of which group uses a service more (group A uses more services than group B) with which group uses a service more frequently (most group A members use the service twice a month while most group B members use it monthly). The former has to do with total number of services used, the latter with the rate of use.
6  Online Computer Library Center, (2010). p. 102. To arrive at this figure you have to total the left column in the table entitled Total U.S. Respondents.
7  Online Computer Library Center, (2010). p. 102. Red emphasis added. In case you didn’t figure it out, may or may not is the epitome of uncertainty!
8  Online Computer Library Center, (2010). p. 6.
9  Online Computer Library Center, (2010). p. 26.
10  The OCLC researchers do provide some information about the limitations of their data in the form of a margin of error estimate of +/- 2.7%. Add or subtract the margin to/from the survey figures and you get a good (statisticians say plausible) guess about where the true values from the population probably lie. By reporting this margin, the researchers are announcing they are very (95%) confident that the true figures from the larger population fall within this range. However, since the sample was drawn from Harris Interactive’s participant population, the +/- 2.7% range only applies to that population. Unfortunately, this margin of error doesn’t provide us with a plausible range of values for the U.S. population at large. For an entertaining primer on margin of error see Stranger Than Fiction.
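
For the curious, the margin of error in footnote 10 can be roughly reproduced with the usual textbook formula. This is a sketch under assumptions the report may not share: a simple random sample, the worst-case proportion of 50%, and a multiplier of 1.96 for 95% confidence. The function name `margin_of_error` is my own. The sample size of 1,334 is the figure cited in footnote 6.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random
    sample of size n (assumed formula: z * sqrt(p * (1 - p) / n))."""
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(1334)
print(f"Margin of error: +/- {moe * 100:.1f}%")
```

Notice that the formula only knows the sample size. It has no way of knowing whether the panel the sample was drawn from resembles the U.S. population, which is why the margin cannot be stretched to cover that larger group.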
