I Think That I Shall Never See…

This post is about a much discussed question: How did the Great Recession affect U.S. public libraries? I’m not really going to answer the question, as that would amount to a lengthy journal article or two. But I am going to suggest a way to approach the question using data from the Institute of Museum and Library Services (IMLS) Public Libraries in the United States Survey. Plus I’ll be demonstrating a handy data visualization tool known as a trellis chart that you might want to consider for your own data analysis tasks. (Here are two example trellis charts in case you’re curious. They are explained further on.)

As for the recession question, in the library world most of the discussion has centered on pronouncements made by advocacy campaigns: Dramatic cuts in funding. Unprecedented increases in demand for services. Libraries between a rock and hard place. Doing more with less. And so forth.

Two things about these pronouncements make them great as soundbites but problematic as actual information. First, the pronouncements are based on the presumption that looking at the forest—or at the big picture, to mix metaphors—tells us what we need to know about the trees. But it does not.

In the chart below you can see that the Great Recession had no general, across-the-board effect on public library funding. Some libraries endured severe funding cuts, others more moderate cuts, others lost little or no ground, and the majority of libraries actually had funding increases in the aftermath of the recession.

IMLS0611_CumChangeOpExp_500

 Bars to the left of the zero line reflect libraries with decreases; bars to the right, increases. Change of -10% = 10% decrease. Change of 10% = 10% increase. Click for larger image.

In the chart note that 35% of libraries had 5-year inflation-adjusted cumulative decreases of one size or another. Of these libraries, about half (18% of all libraries) had decreases of 10% or greater and half (17% of all libraries) had decreases less than 10%. 65% of libraries had cumulative increases of any size. Of libraries with increases, two-thirds (43% of all libraries) had increases of 10% or greater and one-third (22% of all libraries) had increases less than 10%. By the way, expenditure data throughout this post are adjusted for inflation because using unadjusted (face-value) figures would understate actual decreases and overstate actual increases.1

The second problem with the advocacy pronouncements as information is their slantedness. Sure, library advocacy is partial by definition. And we promote libraries based on strongly held beliefs about their benefits. So perhaps the sky-is-falling messages about the Great Recession were justified in case they actually turned out to be true. Yet many of these messages were contradicted by the available evidence. Most often the messages involved reporting trends seen only at a minority of libraries as if these applied to the majority of libraries. That is essentially what the pronouncements listed above do.

A typical example of claims that contradict actual evidence appeared in the Online Computer Library Center (OCLC) report Perceptions of Libraries, 2010. Data in that report showed that 69% of Americans did not feel the value of libraries had increased during the recession. Nevertheless, the authors pretended that the 31% minority spoke for all Americans, concluding that:

Millions of Americans, across all age groups, indicated that the value of the library has increased during the recession.2

In our enthusiasm for supporting libraries we should be careful not to be dishonest.

But enough about information accuracy and balance. Let’s move on to some nitty-gritty data exploration! For this I want to look at certain trees in the library forest. The data we’ll be looking at are just for urban and county public library systems in the U.S. Specifically, the 44 libraries with operating expenditures of $30 million or more in 2007.3 The time period analyzed will be 2007 to 2011, that is, from just prior to the onset of the Great Recession to two years past its official end.

Statistically speaking, a forest-perspective can still compete with a tree-perspective even with a small group of subjects like this one. Here is a graph showing a forest-perspective for the 44 libraries:

Median Coll Expend

Median collection expenditures for large U.S. urban libraries.  Click to see larger graph.

You may recall that a statistical median is one of a family of summary (or aggregate) statistics that includes totals, means, ranges, percentages/proportions, standard deviations, and the like. Aggregate statistics are forest statistics. They describe a collective as a whole (forest) but tell us very little about its individual members (trees).

To understand subjects in a group we, of course, have to look at those cases in the data. Trellis charts are ideal for examining individual cases. A trellis chart—also known as a lattice chart, panel chart, or small multiples—is a set of statistical graphs that have been arranged in rows and columns. To save space the graphs’ axes are consolidated in the trellis chart’s margins. Vertical axes appear in the chart’s left margin and the horizontal axes in the bottom or top margin or both.
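If you want to try building a trellis chart yourself, here is a minimal sketch in Python using the seaborn package. This is just one way to do it, not the tool used for the charts in this post, and the file and column names are hypothetical placeholders for an IMLS data extract.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One row per library per year; "library", "year", and "operating_exp" are
# hypothetical names for fields in an IMLS Public Libraries Survey extract.
df = pd.read_csv("imls_large_urban_libraries.csv")

g = sns.relplot(
    data=df, x="year", y="operating_exp",
    col="library",      # one small graph (panel) per library
    col_wrap=6,         # six panels per row; axes are consolidated in the margins
    kind="line", height=1.6, aspect=1.2,
)
g.set_titles("{col_name}")                           # label each panel with its library
g.set_axis_labels("Year", "Operating expenditures")
# To give a few high-spending libraries their own scale (like the shaded,
# red-lettered panels described further on), pass facet_kws=dict(sharey=False).
plt.show()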

Take a look at the chart below, which presents data from agricultural experiments done in Minnesota in the 1930s. It happens that the data depicted there are famous because legendary statistician R. A. Fisher published them in his classic 1935 book, The Design of Experiments. Viewing the data in a trellis chart helped AT&T Bell Laboratories statistician William Cleveland discover an error in the original data that went undetected for decades. The story of this discovery both opens and concludes Cleveland’s 1993 book Visualizing Data.4

The core message of Cleveland’s book is one I’ve echoed here and here: Good data visualization practices can help reveal things about data that would otherwise remain hidden.5
Trellis Chart Example

Trellis chart depicting 1930s agricultural experiments data.
Source: www.trellischarts.com.  Click to see larger image.

At the left side of the chart notice that a list of items (these are barley seed varieties) serves as labels for the vertical axes for three graphs in the top row. The list is repeated again, serving as axes labels for the graphs in the second row. On the bottom of the chart numbers (20 to 60) are repeated and form the horizontal scales for the two graphs in each column. The layout of a trellis chart provides more white space so that the eye can concentrate on the plotted data alone, in this case circles depicting experimental results for 1931 and 1932.

With multiple graphs arranged side by side, a trellis chart makes it easy to see how different cases (aka research subjects) compare on a single measure. The chart below shows more about how this works using library data:

Demo trellis chart

Trellis chart example with library collection expenditures data.  Click for larger image.

The chart presents collection expenditures as a percent of total operating expenditures from 2007 to 2011. The cases are selected libraries as labeled. Notice how easy it is to identify the line shapes—like the humped lines of Atlanta, Baltimore, Cuyahoga Co., and Hawaii. And the bird-shapes of Brooklyn and Hennepin Co. And the District of Columbia’s inverted bird. Trellis charts make it easy to find similarities among individual trends, such as the fairly flat lines for Baltimore Co., Broward Co., Cincinnati, Denver, and King Co. Nevertheless, the charts presented here are more about identifying distinct patterns in single graphs. Each graph tells a unique story about a given library’s changes in annual statistics.

Incidentally, the trellis charts to follow have been adapted slightly to accommodate cases with exceptionally high data values. Instead of appearing in alphabetical order with other libraries in the chart, graphs for cases with high values appear in the far right column as shown in this one-row example:

Trellis Chart Adaptation

Row from trellis chart with high value graph shaded and in red.
 Click for larger image.

Notice that the graph at the right is shaded with its vertical axis clearly labeled in red, whereas the vertical axes for the non-shaded/black-lettered graphs appear at the left margin of the chart row. In this post all shaded/red-lettered graphs have scaling different from the rest of the graphs in the chart. By using extended scaling just for libraries with high values, the scaling for the rest of the libraries can be left intact.6

With that explanation out of the way, let’s look for some stories about these 44 urban and county libraries beginning with total operating expenditures:

Oper Expend Chart #1

Chart #1 interactive version

Oper Expend Chart #2

Chart #2 interactive version

Total Operating Expenditures.  Click charts for larger images. Click text links for interactive charts.

Take a moment to study the variety of patterns among the libraries in these charts. For instance, in chart #1 Brooklyn, Broward County, Cleveland, and Cuyahoga Co. all had expenditure levels that decreased significantly by 2011. Others like Denver, Hawaii, Hennepin Co., Houston, Multnomah Co., and Philadelphia had dips in 2010 (the Great Recession officially ended in June the prior year) followed by immediate increases in 2011. And others like Boston, Orange Co. CA, San Diego Co., and Tampa had their expenditures peak in 2009 and decrease afterwards.

Now look at collection expenditures in these next two charts. You can see, for instance, that these dropped precipitously over the 4-year span for Cleveland, Los Angeles, Miami, and Queens. For several libraries including Atlanta, Baltimore, and Columbus expenditures dipped in 2010 followed by increases in 2011. Note also other variations like the stair-step upward trend of Hennepin Co., Houston’s bridge-shaped trend, the 2009 expenditure peaks for King Co., Multnomah, San Diego Co., and Seattle, and Chicago’s intriguing sideways S-curve.

Coll Expend Chart #1

Chart #1 interactive version

Coll Expend Chart #2

Chart #2 interactive version

Collection Expenditures.  Click charts for larger images. Click text links for interactive charts.

Again, with trellis charts the main idea is visually scanning the graphs to see what might catch your eye. Watch for unusual or unexpected patterns, although mundane patterns might be important also. It all depends on what interests you and the measures being viewed.

Once you spot an interesting case you’ll need to dig a little deeper. The first thing to do is view the underlying data since data values are typically omitted from trellis charts. For instance, I gathered the data seen in the single graph below for New York:

NYPL Coll Expend

Investigating a trend begins with gathering detailed data. Click for larger image.

The example trellis chart presented earlier showed collection expenditures as a percent of total operating expenditures. This same measure is presented in the next charts for all 44 libraries, including links to the interactive charts. Take a look to see if any trends pique your curiosity.

Coll Expend as pct chart #1

Chart #1 interactive version

Coll Expend as pct chart #2

Chart #2 interactive version

Percent Collection Expenditures.  Click charts for larger images. Click text links for interactive charts.

Exploring related measures at the same time can also reveal things about a data pattern we’re investigating. For example, collection expenditure patterns are made clearer by seeing how decreases in these compare to total expenditures. And how collection expenditures as a percentage of total expenditures relate to changes in the other two measures. The charts below make these comparisons possible for the 4 libraries mentioned earlier—Cleveland, Los Angeles, Miami, and Queens:

Multiple collection measures

Chart #1 interactive version

Multiple measures with data values

Chart #2 interactive version

Understanding collection expenditure trends via multiple measures. Chart #1, trends alone. Chart #2, data values visible.  Click charts for larger images. Click text links for interactive charts.

The next step is analyzing the trends and comparing relevant figures, with a few calculations (like percentage change) thrown in. Cleveland’s total expenditures fell continuously from 2007 to 2011, with a 20% cumulative decrease. The library’s collection expenditures decreased at nearly twice that rate (39%). As a percent of total expenditures collection expenditures fell from 20.4% to 15.6% over that period. Still, before and after the recession Cleveland outspent the other three libraries on collections.

From 2007 to 2010 Los Angeles’ total expenditures increased by 6% to $137.5 million, then dropped by 24% to $113.1 million. Over the 4-year span this amounted to a 13% decrease. For that same period Los Angeles’ collection expenditures decreased by 45%. By 2010 Miami’s total expenditures had steadily increased by 38% to $81.8 million. However, in 2011 its expenditures fell to $66.7 million, a 17% drop from the 2010 level but an increase of 13% over the 2007 level. Miami’s collection expenditures decreased by 78% from 2007 to 2011, from $7.4 million to $1.6 million.

Total expenditures for Queens increased by 17% from 2007 to 2009, the year the Great Recession ended. Then by 2011 these expenditures dropped to just below 2007 levels, a 2% cumulative loss over the 4 years and a 19% loss from the 2009 level. From 2007 to 2011, though, Queens collection expenditures declined by 63% or $7.3 million.
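If you’d like to reproduce this kind of arithmetic yourself, here is a minimal Python/pandas sketch of the cumulative-change calculations, assuming a tidy file with one row per library per year. The file and column names are hypothetical.

import pandas as pd

df = pd.read_csv("large_urban_libraries_2007_2011.csv")   # hypothetical extract

# Collection spending as a percent of total operating expenditures, each year.
df["collection_share"] = df["collection_exp"] / df["total_exp"] * 100

# Cumulative percent change from 2007 to 2011 for each library and measure.
wide = df.pivot(index="library", columns="year",
                values=["total_exp", "collection_exp"])
cum_change = (wide.xs(2011, axis=1, level=1) /
              wide.xs(2007, axis=1, level=1) - 1) * 100
print(cum_change.round(1).sort_values("total_exp"))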

Talk about data telling stories! In the aftermath of the recession, three of the 4 libraries saw collection spending fall below 6% of total operating expenditures. To investigate these figures further we would need to obtain more information from the libraries.

As you can see, trellis charts are excellent tools for traipsing through a data forest, chart by chart and tree by tree. Obviously this phase takes time, diligence, and curiosity. Just 44 libraries and 5 years’ worth of a half-dozen measures produces a lot of data! But the effort expended can produce quite worthwhile results.

If you’re curious about other interesting trends, the next two sets of charts show visits and circulation for the 44 urban and county public libraries. Looking quickly, I didn’t see much along the lines of unprecedented demand for services. Take a gander yourself and see if any stories emerge. I hope there isn’t bad news hiding there. (Knock on wood.)

Visits

Chart #1 interactive version

Visits chart #2

Chart #2 interactive version

Visits.  Click charts for larger images. Click text links for interactive charts.

Circ Chart #1

Chart #1 interactive version

Circ Chart #2

Chart #2 interactive version

Circulation.  Click charts for larger images. Click text links for interactive charts.

 
—————————

1   The 2007 through 2010 expenditure data presented here have been adjusted for inflation. The data have been re-expressed as constant 2011 dollars using the GDP Deflator method specified in IMLS Public Libraries in the United States Survey: Fiscal Year 2010 (p. 45). For example, because the cumulative inflation rate from 2007 to 2011 was 6.7%, if a library’s total expenditures were $30 million in 2007, then for this analysis that 2007 figure was adjusted to $32 million.
   Standardizing the dollar values across the 4-year period studied is the only way to get an accurate assessment of actual expenditure changes. A 2% expenditure increase in a year with 2% annual inflation is really no expenditure increase. Conversely, a 2% expenditure decrease in a year with 2% annual inflation is actually a 4% expenditure decrease.
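   In code, the example above amounts to a one-line adjustment. (A minimal sketch; only the 6.7% cumulative figure comes from the IMLS deflator data, the rest is illustration.)

cumulative_inflation_2007_to_2011 = 0.067      # 6.7% cumulative inflation
nominal_2007_expenditures = 30_000_000         # a library reporting $30M in 2007
constant_2011_dollars = nominal_2007_expenditures * (1 + cumulative_inflation_2007_to_2011)
print(f"${constant_2011_dollars:,.0f}")        # about $32 million in 2011 dollars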
2   Online Computer Library Center, Perceptions of Libraries, 2010: Context and Community, p. 44.
3   In any data analysis where you have to create categories, you end up drawing lines somewhere. To define large urban libraries I drew the line at $30 million in total operating expenditures, based on inflation-adjusted figures as described in footnote #1. So any library with unadjusted total operating expenditures equal to or exceeding $28.2 million in 2007 was included.
4   See anything unusual in the chart? (Hint: Look at the chart labeled Morris.) The complete story about this discovery can be found here. Page down to the heading Barley Yield vs. Variety and Year Given Site. See also William S. Cleveland’s book, Visualizing Data, pp. 4-5, 328-340.
5   Using ordinary graphical tools statistician Howard Wainer discovered a big mistake in data that were 400+ years old! His discovery is described in his 2005 book, Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Wainer uncovered anomalies in data appearing in an article published in 1710 by Queen Anne’s physician, John Arbuthnot. The original data were registered christenings and burials collected in England from 1620 to 1720 at the orders of Oliver Cromwell. See Wainer, H. Graphic Discovery, 2005, pp.1-4.
6   The chart below illustrates how a larger scale affects the shape of a trend line. The scale in the left graph ranges from $25M to $100M, while the scale of the right graph ranges from $25M to $200M. Because the left graph’s scaling is more spacious (a smaller range), its trend line angles are more accentuated.

Different Axes Example

Click for larger image.

Posted in Advocacy, Data visualization, Library statistics

Roughly Wrong

I decided to move right on to my first 2014 post without delay. The reason is the knot in my stomach that developed while viewing the Webjunction webinar on the University of Washington iSchool Impact Survey. The webinar, held last fall, presented a new survey tool designed for gathering data about how public library patrons make use of library technology and what benefits this use provides them.

Near the end of the webinar a participant asked whether the Impact Survey uses random sampling and whether results can be considered to be statistically representative. The presenter explained that the survey method is not statistically representative since it uses convenience sampling (a topic covered in my recent post). And she confirmed that the data only represent the respondents themselves. And that libraries will have no way of knowing whether the data provide an accurate description of their patrons or community.

Then she announced that this uncertainty and the whole topic of sampling were non-issues, saying, “It really doesn’t matter.” She urged attendees to set aside any worries they had about using data from unrepresentative samples, saying these samples portray “real people doing these real activities and experiencing real outcomes.” And that the samples provide “information you can put into use.”

As well-meaning as the Impact Survey project staff may be, you have to remember their goal is selling their product, which they just happen to have a time-limited introductory offer for. Right now the real issues of data accuracy and responsible use of survey findings are secondary or tertiary to the project team. They could have chosen the ethical high road by proactively discussing the strengths and weaknesses of the Impact Survey. And instructing attendees about appropriate ways to interpret the findings. And encouraging their customers to go the extra mile to augment the incomplete (biased) survey with data from other sources.

But this is not part of their business model. You won’t read about these topics on their website. Nor were they included in the prepared Webjunction presentation last fall. If the issue of sampling bias comes up, their marketing tactic is to “comfort” (the presenter’s word) anyone worried about how trustworthy the survey data are.

The presenter gave two reasons for libraries to trust data from unrepresentative samples: (1) A preeminent expert in the field of program evaluation said they should; and (2) the University of Washington iSchool’s 2010 national study compared its convenience sample of more than 50,000 respondents with a smaller representative sample and found the two samples to be pretty much equivalent.

Let’s see whether these are good reasons. First, the preeminent expert the presenter cited is Harry P. Hatry, a pioneer in the field of program evaluation.1  She used this quote by Hatry: “Better to be roughly right than to be precisely ignorant.”2  To understand Hatry’s statement we must appreciate the context he was writing about. He was referring mainly to federal program managers who opted to not survey their users at all rather than attempt to meet high survey standards promoted by the U.S. Office of Management and Budget. Hatry was talking about the black-and-white choice of high methodological rigor versus doing nothing at all. The only example of lower versus higher precision survey methods he mentioned is mail rather than telephone surveys. Nowhere in the article does he say convenience sampling is justified.

The Impact Survey team would have you believe that Hatry is fine with public agencies opting for convenient and cheap data collection methods without even considering the alternatives. Nevertheless, an Urban Institute manual for which Hatry served as advisor, Surveying Clients About Outcomes, encourages public agencies to first consider surveying their complete roster of clientele. If that is not feasible, public agencies should then use a sampling method that makes sure findings “can be projected reliably to the full client base.”3  The manual does not discuss convenience sampling as an option.

Data accuracy is a big deal to Hatry. He has a chapter in the Handbook of Practical Program Evaluation about using public agency records in evaluation research. There you can read page after page of steps evaluators should follow to assure the accuracy of the data collected. Hatry would never advise public agencies to collect whatever they can, however they can, and use it however they want regardless of how inaccurate or incomplete it is. But that is exactly the advice of the Impact Survey staff when they counsel libraries that sample representativeness doesn’t really matter.

The Impact Survey staff would like libraries to interpret roughly right to mean essentially right. But these are two very different things. When you have information that is roughly right, that information is also roughly wrong. (Statisticians call this situation uncertainty, and the degree of wrongness, error.) The responsibility of a quantitative analyst here is exactly that of an information professional. She must assess how roughly right/wrong the information is. And then communicate this assessment to users of the information so they can account for this in their decision-making. If they do not consider the degree of error in their data, the analyst and decision-makers are replacing Hatry’s precise ignorance with the more insidious ignorance of over-confidence in unvetted information.4

The second reason the presenter gave for libraries not worrying about convenience samples was an analysis from the 2010 U.S. Impact Public Library Study. She said that study researchers compared their sample of 50,000+ self-selected patrons with another sample they had which they considered to be representative. They found that patterns in the data from the large convenience sample were very similar to those in the small representative sample. She explained, “Once you get enough data you start seeing a convergence between what is thought of as a representative sample…and what happens in a convenience sample.”

So, let me rephrase this. You start by attracting thousands and thousands of self-selected respondents from the population you’re interested in. And you continue getting more and more self-selected respondents added to this. When your total number of respondents gets really large, then the patterns in this giant convenience sample begin to change so that they now match patterns found in a small representative sample drawn from that same population. Therefore, very large convenience samples should be just as good as fairly small representative samples.

Assuming this statistical effect is true, how would this help improve the accuracy of small convenience samples at libraries that sign up for the Impact Survey? Does this statistical effect somehow trickle down to the libraries’ small samples, automatically making them the equivalent of representative samples? I don’t think so. I think that, whatever statistical self-correction occurred in the project’s giant national sample, libraries using this survey tool are still stuck with their small unrepresentative samples.5
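A toy simulation helps show why. The numbers below are invented purely for illustration (they are not from the iSchool study): heavy library users respond to a web survey at much higher rates and also hold different views than everyone else, so enlarging the sample just piles up more of the same slanted mix.

import numpy as np

rng = np.random.default_rng(42)
POP = 1_000_000
heavy_user = rng.random(POP) < 0.30                 # 30% of patrons are heavy users
values_more = np.where(heavy_user,
                       rng.random(POP) < 0.70,      # 70% of heavy users agree
                       rng.random(POP) < 0.27)      # ~27% of everyone else agrees
print("true population rate:", round(values_more.mean(), 2))   # about 0.40

respond_prob = np.where(heavy_user, 0.10, 0.01)     # heavy users 10x as likely to respond
for scale in (0.02, 0.2, 2.0):                      # small, medium, and huge samples
    responded = rng.random(POP) < respond_prob * scale
    sample = values_more[responded]
    print(f"n={sample.size:>6}   convenience-sample rate={sample.mean():.2f}")
# The convenience-sample rate hovers around 0.62 at every sample size; a bigger
# sample from the same biased selection mechanism never converges to the 0.40 truth.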

While it is certainly intriguing, this convergence idea doesn’t quite jibe with the methodology of the 2010 study. You can read in the study appendix or in my prior post about how the analysis worked in the opposite direction. The researchers took great pains to statistically adjust the data in their convenience sample (web survey) in order to counter its intrinsic slantedness. Using something called propensity scoring they statistically reshaped the giant set of data to align it with the smaller (telephone) sample, which they considered to be representative. All of the findings in the final report were based on these adjusted data. It would be very surprising to learn that they later found propensity scoring to be unnecessary because of some statistical effect that caused the giant sample to self-correct.
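For readers curious about what propensity scoring involves, here is a generic sketch of one common form, inverse-odds weighting via logistic regression. This illustrates the idea only; it is not the study’s actual procedure, and the file and column names are hypothetical.

import pandas as pd
from sklearn.linear_model import LogisticRegression

web = pd.read_csv("web_sample.csv")       # hypothetical convenience (web) sample
phone = pd.read_csv("phone_sample.csv")   # hypothetical representative (phone) sample

covariates = ["age", "income", "education", "internet_at_home"]
combined = pd.concat([web[covariates], phone[covariates]], ignore_index=True)
in_reference = [0] * len(web) + [1] * len(phone)     # 1 = representative sample

model = LogisticRegression(max_iter=1000).fit(combined, in_reference)
p_ref = model.predict_proba(web[covariates])[:, 1]   # P(representative | covariates)

# Weight each web respondent by the odds of belonging to the representative
# sample, up-weighting the kinds of people the web survey under-recruited.
web["weight"] = p_ref / (1 - p_ref)
outcome = web["values_library_more"]                 # hypothetical outcome column
weighted_rate = (outcome * web["weight"]).sum() / web["weight"].sum()
print(round(weighted_rate, 3))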

As you can see, the Impact Survey staff’s justifications for the use of convenience sampling aren’t convincing. We need to rethink the idea of deploying quick-and-easy survey tools for the sake of library advocacy. As currently conceived, these tools require libraries to sacrifice certain of their fundamental values. Gathering and presenting inaccurate and incomplete data is not something libraries should be involved in.

 
—————————

1   The presenter said Hatry “wrote the book on evaluation.” Hatry is legendary in the field of program evaluation. But the book on evaluation has had numerous co-authors and still does. See Marvin Alkin’s 2013 book, Evaluation Roots.
2   The complete quotation is, “I believe that the operational principle for most programs is that it is better to be roughly right than to be precisely ignorant.” Hatry, H.P. (2002). Performance Measurement: Fashions and Fallacies, Public Performance & Management Review, 25:4, 356.
3   Abravanel, M.D. (2003). Surveying Clients About Outcomes, Urban Institute, Appendix C.
4   Yes, convenience samples produce unvetted information. They share the same weakness that focus groups have. Both data collection methods provide real information from real customers. But you take a big risk assuming these customers speak for the entire target group you hope to reach.
5   As I mentioned in my recent post, there is a known statistical effect that can make a library’s convenience sample perfectly match a representative sample drawn from the population of interest. This effect is known as luck or random chance. Just by the luck of the draw your convenience sample could, indeed, end up exactly matching the data from a random sample. The problem is, without an actual random sample to cross-check this with your library will never know whether this has happened. Nor how lucky the library has been!

Posted in Advocacy, Probability, Research, Statistics

Wasting Time Bigtime

We all know that the main function of libraries is to make information accessible in ways that satisfy user needs. Following Ranganathan’s Fourth Law of Library Science, library instructions guiding users to information must be clear and simple in order to save the user’s time. This is why library signage avoids exotic fonts, splashy decorations, and any embellishments that can muddle the intended message. Library service that wastes the user’s time is bad service.

So I am baffled by how lenient our profession is when it comes to muddled and unclear presentations of quantitative information in the form of data visualizations. We have yet to realize that the sorts of visualizations that are popular nowadays actually waste the user’s time—bigtime!  As appealing as these visualizations may be, from an informational standpoint they violate Ranganathan’s Fourth Law.

Consider the data visualization shown below from the American Library Association’s (ALA) Digital Inclusion Study:

Digital Inclusion Total Dash

ALA Digital Inclusion Study national-level dashboard. Click to access original dashboard.

This visualization was designed to keep state data coordinators (staff at U.S. state libraries) informed. The coordinators were called upon to encourage local public libraries to participate in a survey conducted last fall for this study. The graphic appears on the project website as a tool for monitoring progress of the survey state by state.

Notice that the visualization is labeled a dashboard, a data display format popularized by the Balanced Scorecard movement. The idea is a graphic containing multiple statistical charts, each one indicating the status of an important dimension of organizational performance. As Stephen Few observed in his 2006 book, Information Dashboard Design, many dashboard software tools are created by computer programmers who know little to nothing about the effective presentation of quantitative information. Letting programmers decide how to display quantitative data is like letting me tailor your coat. The results will tend towards the Frankensteinian. Few’s book provides several scary examples.

Before examining the Digital Inclusion Study dashboard, I’d like to show you a different example: the graphic below, designed by the programmers at Zoomerang and posted on The Center for What Works website. It gives you some idea of the substandard designs that programmers can dream up:1

What Works Chart

Zoomerang chart posted on http://www.whatworks.org. Click to see larger version.

The problems with this chart are:

  • There are no axis labels explaining what data are being displayed. The data seem to be survey respondents’ self-assessment of areas for improvement based on a pre-defined list in a questionnaire.
  • There is no chart axis indicating scaling. There are no gridlines to assist readers in evaluating bar lengths.
  • Long textual descriptions interlaced between the blue bars interfere with visually evaluating bar lengths.
  • 3D-shading on the blue bars has a visual effect not far from something known as moiré, visual “noise” that makes the eye work harder to separate the visual cues in the chart. The gray troughs to the right of the bars are extra cues the eye must decipher.
  • The quantities at the far right are too far away from the blue bars, requiring extra reader effort. The quantities are located where the maximum chart axis value typically appears. This unorthodox use of the implied chart axis is confusing.
  • The questionnaire items are not sorted in a meaningful order, making comparisons more work.

We should approach data visualizations the way we approach library signage. The visualizations should make the reader’s task quick and easy—something the Zoomerang chart fails at. Here’s a better design:2

What Works Revision

Revision of original (blue) Zoomerang chart posted above. Click to see larger version.
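If you would like to reproduce this kind of cleanup yourself, here is a minimal Python/matplotlib sketch that follows the same principles: sorted horizontal bars, a labeled axis, data values beside the bars, and no 3-D effects. The categories and counts are placeholders, not the Zoomerang survey data.

import matplotlib.pyplot as plt

items = ["Measuring outcomes", "Using data for planning", "Reporting to funders",
         "Staff training", "Survey design"]           # placeholder categories
counts = [42, 35, 28, 19, 11]                         # placeholder values, sorted high to low

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(items[::-1], counts[::-1], color="#c8b98a")   # pastel tone helps avoid moiré
ax.set_xlabel("Number of respondents selecting item")
for y, value in enumerate(counts[::-1]):
    ax.text(value + 0.5, y, str(value), va="center")  # value right next to its bar
for side in ("top", "right"):
    ax.spines[side].set_visible(False)                # no decoration beyond the data
plt.tight_layout()
plt.show()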

WARNING:  Beware of statistical, graphical, and online survey software. Nine times out of ten the companies that create this software are uninformed about best practices in graphical data presentation. (This applies to a range of vendors, from Microsoft and Adobe to upstart vendors that hawk visualization software for mobile devices.) Indiscriminate use of these software packages can cause you to waste the user’s time.

The Digital Inclusion Study dashboard appearing at the beginning of this post wastes the user’s time. Let’s see how. Note that the dashboard contains three charts—a gauge, line chart, and map of the U.S. The titles for these are imprecise, but probably okay for the study’s purposes (assuming the state data coordinators were trained in use of the screen). Still, for people unfamiliar with the project or users returning to this display a year later, the titles could be worded more clearly. (Is a goal different from a target? How about a survey submission versus a completion?)

Understandability is a definite problem with the map’s color-coding scheme. The significance of the scheme is likely to escape the average user. It uses the red-amber-green traffic signal metaphor seen in the map legend (bottom left). With this metaphor green usually represents acceptable/successful performance, yellow/amber, borderline/questionable performance, and red, unacceptable performance.

Based on the traffic signal metaphor, when a state’s performance is close to, at, or exceeds 100%, the state should appear in some shade of green on the map. But you can see that this is not the case. Instead, the continental U.S. is colored in a palette ranging from light reddish to bright yellow. Although Oregon, Washington, Nevada, Michigan, and other states approach or exceed 100%, they are coded orangeish-yellow.3  And states like Colorado, North Carolina, and Pennsylvania, which reported 1.5 to 2 times the target rate, appear in bright yellow.

This is all due to the statistical software reserving green for the highest value in the data, namely, Hawaii’s 357% rate. Generally speaking, color in a statistical chart is supposed to contain (encode) information. If the encoding imparts the wrong message, then it detracts from the informativeness of the chart. In other words, it wastes user time—specifically, time spent wondering what the heck the coding means!

Besides misleading color-coding, the shading in the Digital Inclusion Study dashboard map is too subtle to interpret reliably. (The dull haze covering the entire map doesn’t help.) Illinois’ shading seems to match Alabama’s, Michigan’s, and Mississippi’s, but these three differ from Illinois by 13 – 22 points. At the same time, darker-shaded California is only 5 points lower than Illinois.

The Digital Inclusion map’s interactive feature also wastes time. To compare data for two or more states the user must hover her device pointer over each state, one at a time. And then remember each percentage as it is displayed and then disappears.

Below is a well-designed data visualization that clarifies information rather than making it inaccessible. Note that the legend explains the color-coding so that readers can determine which category each state belongs to. And the colors have enough contrast to allow readers to visually assemble the groupings quickly—dark blue, light blue, white, beige, and gold. Listing the state abbreviations and data values on the map makes state-to-state comparisons easy.

BEA_GDP Map

A well-designed data visualization. Source: U.S. Bureau of Economic Analysis. Click to see larger version.

This map is definitely a time saver!

Now let’s turn to an…er…engaging feature of the ALA dashboard above—the dial/gauge. To the dismay of Stephen Few and others, dials/gauges are ubiquitous in information dashboards despite the fact that they are poor channels for the transmission of information. Almost always these virtual gadgets obscure information rather than reveal it.4  Meaning, again, that they are time wasters.

The gauge in the dashboard above presents a single piece of data—the number 88. It is astonishing that designers of this virtual gadget have put so many hurdles in the way of users trying to comprehend this single number. I hope this bad design comes from ignorance rather than malice. Anyway, here are the hurdles:

  1. The dial’s scaling is all but invisible. The dial is labeled, but only at the beginning (zero) and end (100) of the scale, and in a tiny font. To determine values for the rest of the scale the user must ignore the prominent white lines in favor of the obscured black lines (both types of lines are unlabelled). Then she has to study the spacing to determine that the black lines mark the 25, 50, and 75 points on the dial. The white lines turn out to be superfluous.
  2. The needle is impossible to read. The green portion of the banding causes the red tick-marks to be nearly invisible. The only way to tell exactly where the needle is pointing is by referring to the ‘88’ printed on the dial, a requirement that renders the needle useless.
  3. The uninitiated user cannot tell what is being measured. The text at the center of the image is masked at both edges because it has been squeezed into too small a space. And the gauge’s title is too vague to tell us much. I am guessing that the dial measures completed survey questionnaires as a percentage of some target quantity set for the U.S. public libraries that were polled. (And, honestly, I find it irritating that the 88 is not followed by a percent symbol.)
  4. The time period for the data depicted by the gauge is unspecified. Not helpful that the line chart at the right contains no scale values on the horizontal axis. Or, technically, the axis has one scale value—the entirety of 2013. (Who ever heard of a measurement scale with one point on it?) The dial and line chart probably report questionnaires submitted to date. So it would be especially informative for the programmers to have included the date on the display.
  5. Although the red-amber-green banding seems to be harmless decoration, it actually can lead the reader to false conclusions. Early on in the Digital Inclusion Study survey period, submissions at a rate of, say, 30%, would be coded ‘unacceptable’ even though the rate might be quite acceptable. The same misclassification can occur in the amber region of the dial. Perhaps users should have been advised to ignore the color-coding until the conclusion of the survey period. (See also the discussion of this scheme earlier in this post.)

The graphic below reveals a serious problem with these particular gauges. The graphic is from a second dashboard visible on the Digital Inclusion Study website, one that appears when the user selects any given U.S. state (say, Alaska) from the dashboard shown earlier:

Digital Inclusion Alaska Chart

ALA Digital Inclusion Study state-level dashboard. Click to see larger version.

Notice that this dashboard contains five dials—one for the total submission rate for Alaska (overall) and one for each of four location categories (city, suburban, town, and rural). While the scaling in all five dials spans from 0% to 100%, two of the dials—city and town—depict quantities far in excess of 100%. I’ll skip the questions of how and why the survey submission rate could be so high, as I am uninformed about the logistics of the survey. But you can see that, regardless of the actual data, the needles in these two gauges extend only a smidgen beyond the 100% mark.

Turns out these imitation gauges don’t bother to display values outside the range of the set scaling, which, if you think about it, is tantamount to withholding information.5  Users hastily scanning just the needle positions (real-life instrument dials are designed for quick glances) will get a completely false impression of the data. Obviously, the gauges are unsatisfactory for the job of displaying this dataset correctly.

So now the question becomes, why use these gauges at all? Why not just present the data in a single-row table? This is all the dials are doing anyway, albeit with assorted visual aberrations. Besides, there are other graphical formats capable of displaying these data intelligently. (I won’t trouble you with the details of these alternatives.)

One point about the line chart in the Alaska (state-level) dashboard. Well, two points, actually. First, the weekly survey submission counts should be listed near the blue plotted line—again, to save the user’s time. Second, the horizontal axis is mislabeled. Or, technically, untitled. The tiny blue square and label are actually the chart legend, which has been mislocated. As it is, its location suggests that both chart axes measure survey completions, which makes no sense. The legend pertains only to the vertical axis, not to the horizontal. The horizontal axis represents the survey period measured in weeks. So perhaps the label “Weeks” would work there.

In charts depicting a single type of data (i.e., a single plotted line) there is no need for a color-coded legend at all. This is the sort of detail that software programmers typically know nothing about.

Finally, a brief word about key information the dashboard doesn’t show—the performance thresholds (targets) that states had to meet to earn an acceptable rating. Wouldn’t it be nice to know what these are? They might provide some insight into the wide variation in states’ overall submission rates, which ranged from 12% to 357%. And the curiously high levels seen among the location categories. Plus, including these targets would have required the dashboard designers to select a more effective visualization format instead of the whimsical gauges.

Bottom line, the Digital Inclusion Study dashboard requires a lot of user time to obtain a little information, some of which is just plain incorrect. Maybe this is no big deal to project participants who have adjusted to the visualization’s defects in order to extract what they need. Or maybe they just ignore it. (I’m still confused about the purpose of the U.S. map.)

But this a big deal in another way. It’s not a good thing when nationally visible library projects model such unsatisfactory methods for presenting information. Use of canned visualizations from these software packages is causing our profession to set the bar too low. And libraries mimicking these methods in their own local projects will be unaware of the methods’ shortcomings. They might even assume that Ranganathan would wholeheartedly approve!

 
—————————

1   Convoluted designs by computer programmers are not limited to data visualizations. Alan Cooper, the inventor of Visual Basic, describes how widespread this problem is in his book, The Inmates Are Running the Asylum: Why High Tech Products Drive Us Crazy and How to Restore the Sanity.
2   Any chart with closely spaced bars can be subject to moiré, especially when bold colors are used. Pastel shades, like the tan in this chart, help minimize this.
3   Delaware also falls into this category and illustrates the distortion intrinsic to maps used to display non-spatial measures. (Shale deposit areas by state is a spatial measure; prevalence of obesity by state is a non-spatial measure.) Large states will be visually over-emphasized while tiny states like Delaware and Rhode Island struggle to be seen at all.
4   My favorite example, viewable in Stephen Few’s blog, is how graphic artists add extra realism as swatches of glare on the dials’ transparent covers. These artists don’t think twice about hiding information for the sake of a more believable image.
5   This is extremely bad form—probably misfeasance—on the part of the software companies. More responsible software companies, like SAS and Tableau Software, are careful to warn chart designers when data extend beyond the scaling that chart designers define.

Posted in Data visualization

Strength in Numbers

I want to tell you about a group of U.S. public libraries that are powerhouses when it comes to providing services to the American public. You might suppose that I’m referring to the nation’s large urban and county systems that serve the densest populations with large collections and budgets. These are the libraries you’d expect to dominate national library statistics. However, there’s a group of libraries with modest means serving moderate size communities that are the unsung heroes in public library service provision. These are libraries with operating expenditures ranging from $1 million to $4.9 million.1   Due to their critical mass combined with their numbers (there are 1,424 of them) these unassuming libraries pack a wallop in the service delivery arena.

Their statistical story is an interesting one. Let me introduce it to you by means of the patchwork graphic below containing 6 charts known as treemaps.

MeasMapsClrD_540

Click to view larger graphic.

From a data visualization standpoint treemaps (and pie charts also) have certain drawbacks that were identified in my prior post.2 Still, treemaps do have their place when used judiciously. And their novelty and color are refreshing. So, let’s go with them!

At first glance, treemaps are not that easy to decipher. Let me offer a hint to begin with and then follow with a fuller explanation. The hint: Notice how prominent the gold rectangles are among the 6 treemaps shown above. As the graph legend indicates, gold represents the $1 million to $4.9 million expenditure group that this post is focused on. (I purposely color-coded them gold!) And the story is about the appearance and meaning of these gold rectangles. Or more exactly, the rectangles representing this expenditure group—however the rectangles are colored. (Below you’ll see I also use monochrome-shaded treemaps to tell the story.)

Now let’s see how treemaps work. Treemaps are like rectangular pie charts in that they use geometrical segments to depict parts-of-a-whole relationships. In other words, treemaps present a categorical breakdown of quantitative data (not the statistician). A single treemap represents 100% of the data and the categories are represented by inset rectangles rather than pie wedges. The sizes of treemap segments reflect the data quantities. In some cases treemaps also use color to represent data quantities, as this green treemap does:

LibbyExpAreaD_450

Number of Libraries by Expenditure Group  
Click to view larger interactive chart.

Before getting to the quantitative aspects of this green chart, let me explain that the ballooned text is an interactive feature of Tableau Public, the statistical software used to generate the chart. If you would, click the treemap now to view the interactive version. At the top right of that chart is a legend indicating how color-shading works. Also, below the treemap is a bar chart displaying percentages data—the same figures visible on the treemap balloons.

In monochrome treemaps like the green one above, the largest and darkest rectangle represents the highest number in the data, and the smallest and lightest represents the lowest number. The largest rectangle is always located at the top left of the treemap and the smallest at the bottom right. All treemaps, including the 6 charts above, follow this top-left-to-bottom-right arrangement. But only monochrome treemaps use color-shading in addition to rectangle sizing to portray quantities. With multi-color treemaps, color is used instead to encode the data categories, in our case, library expenditure ranges.
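If you would like to experiment with treemaps outside of Tableau, here is a minimal Python sketch using the third-party squarify package (which has to be installed separately). The sizes are placeholder percentages, not the IMLS figures, and squarify’s corner placement won’t exactly match Tableau’s top-left-to-bottom-right layout.

import matplotlib.pyplot as plt
import squarify   # third-party package: pip install squarify

labels = ["Group A", "Group B", "Group C", "Group D", "Group E", "Group F"]
sizes = [21, 16, 15, 15, 9, 7]      # placeholder percentages, sorted high to low

# The largest value gets the biggest rectangle, the smallest value the smallest one;
# squarify normalizes the sizes, so they need not sum to 100.
squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.show()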

In all of the treemaps included in this post each inset rectangle represents one of the total operating expenditure ranges listed in the first column of this table:

Table-LibsPopbyExp

Data Source: Public Libraries in the U.S. Survey, 2011, Institute of Museum & Library Services.3

Although these expenditure categories pertain to quantities (dollar ranges), remember that categories are always qualitative, that is, non-numerical.4  To emphasize the fact that the categories are non-numerical, in the table their labels have letter prefixes.

In the green treemap the rectangles are arranged according to the number of public libraries falling within each expenditure range, left to right as described above. However, this order does not apply to the bar chart appearing below the interactive version of the treemap. That bar chart is instead sorted by the category ranges low to high.

The largest rectangle in the treemap is the $50K or less category. Hovering the pointer over the $50K or less rectangle in the interactive chart or looking at the corresponding bar in the bar chart shows this category’s percentage as 21.1%. Similarly, 16.1% of public libraries fall under the next-largest rectangle which represents the $400K – $999.9K category. And the categories with the third and fourth largest rectangles, $1.0M – $4.9M and $100K – $299.9K, account for 15.4% and 14.7% of the total number of libraries. (The complete data appear in the table above.)
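The parts-of-a-whole percentages behind these treemaps are easy to reproduce. Here is a minimal pandas sketch, assuming an IMLS extract with one row per library; the file and column names are hypothetical.

import pandas as pd

libs = pd.read_csv("imls_pls_2011.csv")      # hypothetical IMLS extract

by_group = libs.groupby("expenditure_group").agg(
    libraries=("expenditure_group", "size"),
    population=("population_served", "sum"),
)
shares = by_group / by_group.sum() * 100     # percent of all libraries / of population served
print(shares.round(1).sort_values("libraries", ascending=False))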

This next treemap (below in blue) depicts population data for each expenditure category. To see detailed figures click on the chart. (Again, a bar chart appears below the treemap there.) Hover the pointer over the top left rectangle (the $1.0 million to $4.9 million category) of the interactive treemap. Notice that this category serves the largest population, 88.4 million or nearly 29% of the total U.S. population served by public libraries. Next is the $10.0 million to $49.9 million group, the rectangle just below. Libraries in this category served 83 million or 27% of the population in 2011.

PopbyExpArea_450

Population Served by Expenditure Group
Click to view larger interactive chart.

Now the question is, how do these two measures—U.S. public library counts and population served—relate to each other? The next bar chart offers an answer:

PopLibsbyExpBar_400

Click to view larger interactive chart.

Again, click on the bar chart to see the interactive version. Then, click on the legend (Libraries / Population Served) to highlight one set of bars at a time. You’ll see that from left to right the libraries percentages (green bars) drop, remain fairly level, and then drop again. The population served percentages (blue bars) swoop up from left to right to the $1.0 million to $4.9 million category, step down, up, and then down again.

These trends are not surprising since we know that the smallest libraries serve the smallest communities on the smallest budgets. And that these communities are very numerous. Likewise, the largest libraries serve the largest communities, which are few in number. But it is surprising that the $1.0 million to $4.9 million category serves as large a swath of the population as it does. And that the next highest libraries expenditure-wise, the $5.0 million to $9.9 million group, does not keep up with this first group. (I’m curious about why this would be the case.)

In a moment I’ll get to other key measures for this spectacular $1.0 million to $4.9 million category of libraries. But first I thought I would lay out the population data differently, looking at how libraries are distributed across community sizes independent of library expenditures, just as a reminder of how that distribution works. The chart below shows the distribution of public libraries among 11 population categories labeled on the horizontal axis. Adding up the percentages for the left 5 bars, you see that 77% of public libraries serve communities with populations below 25,000. Note in the bar chart (and the treemap in the interactive version) that the 10K to 24.99K population group contains the most libraries.5

PopOnlyBarD_480

Distribution of Population Served
Click to view larger interactive chart.

Okay. Now let’s look at total operating expenditures by expenditure category in the purple chart here:

OperExpbyExpArea_450

Total Expenditures by Expenditure Category
Click to view larger interactive chart.

In this chart the two left-most rectangles look identical in size, don’t they? Click on the interactive version and you can see that the $10 million to $49.9 million and the $1.0 million to $4.9 million groups each account for more than $3 billion in annual public library operating expenditures. And their expenditure levels are nearly equal. The $10 million to $49.9 million group outspends the $1.0 million to $4.9 million group only by 1.2% ($38 million).

Where service provision is concerned, however, the $1.0 million to $4.9 million libraries shine. First, as seen in the interactive version of the chart below, their 2011 total visits surpassed the $10 million to $49.9 million group by 2% or 18 million. Granted, if we were to combine all libraries with expenditures exceeding $10 million into a single category, that category would win out. But the point here is that the 1,424 members of the $1.0 million to $4.9 million group are able to generate library services at nearly the same level as the largest urban libraries in the country. Without a doubt, the productivity of these moderate size libraries is substantial.

visitsbyexpareaD_450

Total Visits by Expenditure Category
Click to view larger interactive chart.

On circulation, however, the $10 million to $49.9 million libraries out-perform the $1.0 million to $4.9 million group. The former group accounts for 4% more of total U.S. public library circulation than the latter. These larger libraries account for 31.6% of all circulation nationwide, compared to the $1.0 million to $4.9 million group which accounts for 27.6%. (Click on the gray chart below to view these and other figures.)

circbyexpareaD_450

Total Circulation by Expenditure Category
Click to view larger interactive chart.

Yet circulation is the only major output measure where the $1.0 million to $4.9 million libraries play second fiddle to libraries from the other expenditure categories. Besides total visits, our 1,424 libraries excel in total program attendance and public Internet computer users. The next (olive) treemap shows the 7% margin (6.5 million) for total program attendance this group holds over the second-place group.

attenbyexpareaD_450

Total Program Attendance by Expenditure Category
Click to view larger interactive chart.

The final treemap below gives data on public Internet computer users. Again, these middling libraries exceed the $10 million to $49.9 million libraries by 2.5%, or about 8.3 million computer users. It is rather startling that this group of libraries would outpace the nation’s large and well-equipped libraries in the delivery of technology services to communities.

pitsusrbyexpareaD_450

Total Public Computer Users by Expenditure Category
Click to view larger interactive chart.

To recap the data presented here let’s revisit the 6 multi-color treemaps introduced at the beginning of this post. We can see the gold rectangle is the largest among all expenditure groups for population, visits, program attendance, and public Internet computer users. And it is 2nd highest in operating expenditures and circulation.

As I mentioned, the standing of the largest expenditure categories could be enhanced by merging the $10 million to $49.9 million and the $50 million or more categories into a single category. (Of course, any boundary within this wide range of expenditures would be arbitrary.) Even so, the $1.0 million to $4.9 million group would still show a strong presence, leaving its next largest peers, the $5.0 million to $9.9 million category, in the dust. No matter how you slice the data, the $1.0 million to $4.9 million group is a major player in national library statistics. Now we need to think of some appropriate recognition for them…

 
—————————

1   Based on the Public Libraries in the United States Survey, 2011, Institute of Museum and Library Services.
2   It was Willard Brinton who identified the problem in his 1914 book. In my prior post scroll down to the sepia graphic of squares arranged laterally. There you see Brinton’s words, “The eye cannot fit one square into another on an area basis so as to get the correct ratio.” Bingo. With treemaps this is even more problematic since a single quantity in the data can be represented by different-shaped but equivalent rectangles—stubby ones or more elongated ones. You’ll see in the examples that it is impossible to visually determine which of two similarly-sized rectangles is larger. This difficulty also applies to pie wedges.
3   For purposes of this post I used only libraries reporting to the Institute of Museum and Library Services in 2011 that were located in the continental U.S., Alaska, and Hawaii.
4   The expenditure groups are examples of categorical data. Other examples are geographical regions of the U.S. and library expenditure types (collection, staffing, technology, capital, and so forth). Categorical data are also called nominal data or data on a nominal scale.
5   For detailed information about the statistical and geographic distributions of small libraries see the new report, The State of Small and Rural Libraries in the United States, IMLS Research Brief. No. 5., Sept. 2013.

Posted in Data visualization, Measurement, Statistics

Quadruple Your Statistical Knowledge In One Easy Lesson

Even with the promises of big data, open data, and data hacking it is important to remember that having more data does not necessarily mean being more informed. The real value of data, whatever its quantity or scope, comes from the question(s) the data can help answer.

There are various reasons any given set of data might or might not provide reliable answers, the most basic being the data's accuracy. Clever new technologies that scan, scrape, geocode, or mobilize loads of data aren't much use if the data are wrong. All we end up with is scanned, scraped, geocoded, and mobilized misinformation. Garbage in, garbage out, as they say.

Getting from data to answers requires understanding the meaning of the data and its relevance to our questions. With statistical data much of this meaning and relevance depends on three particular ideas:

  1. How the data were selected
  2. The group/population researchers are trying to learn about
  3. How these two relate

I am here to tell you that if you master these ideas your statistical knowledge will immediately quadruple! Okay. I admit my estimate of learning gain could be inflated (but maybe not). In any case, the importance of these ideas cannot be exaggerated. British statistician T.M.F. Smith called them “the most basic concepts in statistics” in his 1993 presidential address to the Royal Statistical Society. He also said:

In statistics data are interesting not for their own sake but for what they tell us about the phenomenon that they represent, and specifying the target population and the selection mechanism should be the starting point for any act of statistical inference.1

Although you may have already learned these ideas in research and statistics courses, I encourage you to revisit your understanding of them. I say so because your recall may have been colored by misinformation on this topic appearing in library literature and national research projects. I discuss some of this misinformation further on.

In the meantime, let’s explore these ideas using a library example: Suppose we are interested in learning about consumer demand for e-books in the U.S. And we happen to have access to a big datafile of e-book circulation for all U.S. public libraries—titles, authors, call numbers, reserve lists, e-reader devices, patron demographics, and the like. We analyze the data and then issue this pronouncement:

E-book demand in the U.S. is highest for these genres: romance novels, self-improvement, sci-fi/fantasy, biographies, and politics/current events.

Is our pronouncement correct? Not very. The list of genres is a poor reflection of consumer demand for e-books, first, because our data describe only public library borrowers instead of all e-book consumers. (Our datafile did not tap demand among e-book purchasers.) Second, the list is probably inaccurate for another reason. In the libraries' collections, the proportions of e-books in the genres are likely to differ from those for all e-books published nationally. Demand for a genre, say e-biographies, that is underrepresented in library holdings will be understated compared with demand for that genre among U.S. consumers as a whole. So, besides giving a slanted view of consumer behavior, the e-book datafile is also slanted in terms of the genres consumers have access to in the first place.

Third, the pronouncement is inaccurate even when we limit our question just to demand among library e-book borrowers. The small number of available e-book copies will have made it impossible for some borrowers to check out the e-books they wanted when they wanted them. This user demand will not necessarily be accounted for in the e-book datafile.

The reasons just given for doubting the pronouncement are all related to how the e-book data were collected in the first place—the selection mechanism, to use Professor Smith's term. Understanding how collection methods affect what data can and cannot tell us is the knowledge-quadrupling information I'm talking about. Here is that information in a nutshell:

The way that data are selected either supports or detracts from the validity of the conclusions drawn. Thus, data selection directly affects the accuracy of answers gleaned from the data. Inaccuracy due to data selection, called selection bias, comes from slantedness and/or incompleteness of data. This bias occurs when certain types of subjects/respondents are systematically over- or under-represented in the data. Relying on biased data is usually risky and sometimes irresponsible.
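To see how selection bias plays out in practice, here is a small simulation sketch of the e-book scenario. It is written in Python, and every genre share and group size in it is invented for illustration only; no actual survey or circulation data are involved.

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical genre preferences for two groups. Every number here is
# invented purely for illustration; no real survey or circulation data.
genres = ["romance", "self-improvement", "sci-fi/fantasy", "biography", "politics"]
all_consumers     = [0.20, 0.15, 0.20, 0.25, 0.20]  # the population we care about
library_borrowers = [0.35, 0.20, 0.25, 0.10, 0.10]  # the group our datafile covers

def simulate_demand(weights, n):
    """Draw n checkouts/purchases according to the given genre weights."""
    return Counter(random.choices(genres, weights=weights, k=n))

observed = simulate_demand(library_borrowers, 50_000)  # what the datafile records
truth = simulate_demand(all_consumers, 50_000)         # what we want to know about

print("Ranking from the borrowers-only datafile:", [g for g, _ in observed.most_common()])
print("Ranking in the full consumer population: ", [g for g, _ in truth.most_common()])
# In this made-up example the rankings disagree because the selection
# mechanism (library borrowing) over-represents some genres and
# under-represents others. More data from the same slanted source won't help.
```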

Most library and information science professionals do not understand selection bias. Nor are they well-informed about survey sampling best practices. And, as I mentioned, some library literature and national projects have misinformed readers about these topics. I’d like to discuss a few examples as a way to clear up some of the confusion.

One example is an article in a peer-reviewed library journal about the generalizability2 of survey findings (also known as external validity). The researchers wondered whether specific user traits were so common that they were very likely true for all academic libraries. User traits is a term I have devised as shorthand for attributes, behaviors, or trends detected in survey or other library data. (It’s not an official term of any sort, nor did the researchers use it.) A trait might be something like:

The average length of time undergraduate students spend in university libraries is markedly shorter for males than for females.

The researchers figured that if a trait like this one were found to be true in surveys conducted at theirs and a dozen or so peer libraries, then it should be true across the board for all academic libraries. They proceeded to identify several uniform traits detected in multiple surveys conducted at theirs and their peer libraries. (Thus, their study was a survey of surveys.) They ended up advising other academic libraries not to bother studying these traits on behalf of their home institutions. Instead, the other libraries should just assume these traits would hold true exactly as they occurred at the libraries that had already done the surveys.

This is bad advice. The researchers’ sample of library survey results was too limited. They reached out only to the dozen or so libraries that were easily accessible. Choosing study subjects this way is called convenience sampling. Almost always convenience samples are poor representations of the larger group/population of interest. (There is another type of convenience sampling called a self-selected sample. This is when researchers announce the availability of a survey questionnaire and then accept any volunteers who show up to take it. We’ll revisit this type of slanted sampling further on.)

The best way to avoid selection bias in our studies is the use of random (probability) sampling. Random sampling assures that the subjects selected provide a fair and balanced representation of the larger group/population of interest. The only thing we can surmise from a convenience (nonprobability) sample is that it represents the members of which it is composed.

Because they used an unrepresentative (nonprobability) sample rather than a representative (probability) sample, the researchers in the example above had no grounds for claiming that their findings applied to academic libraries in general.
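A quick simulation makes the contrast vivid. The sketch below (Python, with an entirely made-up population and made-up volunteering behavior) compares a small simple random sample with a much larger self-selected sample. The assumption that enthusiastic users volunteer more often is mine, purely for illustration.

```python
import random
import statistics

random.seed(2)

# A made-up population of 100,000 library users with satisfaction scores on a
# 0-10 scale (true mean near 6.0). Assumption for illustration: enthusiastic
# users (scores of 8 or more) are three times as likely to volunteer for a
# survey link posted on the library website.
population = [min(10, max(0, random.gauss(6.0, 2.0))) for _ in range(100_000)]

srs = random.sample(population, 200)  # simple random sample: equal chance for everyone
volunteers = [x for x in population if random.random() < (0.09 if x >= 8 else 0.03)]

print(f"True population mean:                {statistics.mean(population):.2f}")
print(f"Mean from random sample of 200:      {statistics.mean(srs):.2f}")
print(f"Mean from {len(volunteers)} self-selected volunteers: {statistics.mean(volunteers):.2f}")
# The self-selected sample is tilted upward by construction, and recruiting
# more volunteers will not remove that tilt; the small random sample has no
# built-in slant to begin with.
```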

Before moving to the next example some background information is necessary. I suspect that library researchers have taken statistics courses where they learned certain statistical rules-of-thumb that they later end up misremembering. As you might expect, this leads to trouble.

Statistics textbooks usually talk about two basic types of statistics, descriptive and inferential. Descriptive statistics are summary measures calculated from data, like means, medians, percentages, percentiles, proportions, ranges, and standard deviations. Inferential statistics (also known as statistical inference) have to do with extrapolating from the sample data in order to say something about the larger group/population of interest. This is exactly what we’ve been discussing already, being able to generalize from a sample to a population (see footnote #2). This amounts to inferring that patterns seen in our samples are fair estimates of true patterns in our target groups/populations. Drawing representative samples provides the justification for this inference.
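As a simple illustration of the two types, the sketch below (Python, invented data) computes a descriptive statistic, the sample mean, and then makes one inferential statement about the larger population, a rough 95% confidence interval. The sample values are hypothetical.

```python
import math
import random
import statistics

random.seed(3)

# Invented data: weekly library visits reported by a random sample of 100
# patrons drawn from some larger population of interest.
sample = [max(0, random.gauss(3.0, 1.5)) for _ in range(100)]

# Descriptive statistics summarize the sample itself.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Inferential statistics extrapolate from the sample to the population, e.g.,
# a rough 95% confidence interval for the population mean (normal approximation).
margin = 1.96 * sd / math.sqrt(len(sample))
print(f"Sample mean (descriptive):                {mean:.2f}")
print(f"95% CI for population mean (inferential): ({mean - margin:.2f}, {mean + margin:.2f})")
# The leap from sample to population is justified only because the sample was
# drawn at random; a convenience sample gives no such justification.
```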

Inferential statistics also entail a second type of inference related to how random chance can cause apparent patterns in sample data that are not likely to be true in the larger population. This more esoteric form of inference involves things like hypothesis testing, the null hypothesis, statistical significance, and other convoluted issues I've written about before here. It so happens that statistics textbooks often recommend the use of random (probability) sampling in studies when researchers intend to conduct statistical significance testing, a rule-of-thumb that may have confused researchers in this next example.

This example is a study of public library programs published in a peer-reviewed journal in which the researchers acknowledged their use of convenience (nonprobability) sampling. It seems they focused on the textbook recommendation I just mentioned in the prior paragraph. They apparently reasoned, "Since it's bad form to apply statistical significance testing to a convenience sample, we won't do that. Instead, we'll stick to descriptive statistics for our sample data." They proceeded to report means and medians and percentages and so on, but then announced these measures as applicable to U.S. public libraries in general. The researchers abided by the esoteric statistical rule, yet ignored the more mainstream rule. In fact, that mainstream rule is important enough to qualify for the knowledge-quadrupling category, and it is stated here:

Without representative sampling there is no justification for portraying survey findings as applicable (generalizable) to the larger population of interest.

Library organizations violate this rule every time they post a link to an online survey on their website. Inviting anyone and everyone to respond, they end up with the biased self-selected sample described already. Then, as in the prior example, they report survey findings to users and constituents as if these were accurate. But they are not. Because this practice is so common, it has become respectable. Nevertheless, the promotion of misinformation is unethical and—unless you work in advertising, marketing, public relations, law, or politics—professionally irresponsible.

Textbook rules-of-thumb may have also been a factor in this next example. By way of introduction, recall that we collect a sample only because it is impossible or impractical to poll every member from the group/population of interest. When it is possible to poll all members of the population, the resulting survey is called a census or a complete enumeration. If we have the good fortune of being able to conduct a census, then we do that, of course.

Unfortunately, researchers publishing in a peer-reviewed library journal missed this opportunity even though they happened to have a complete enumeration of their population—an electronic file, actually. Instead of analyzing all of the data in this datafile, for some reason (probably recalling cookbook steps from a statistics course) the researchers decided to extract a random sample from it. Then they analyzed the sample data and wrote the article based only on that data. Because these researchers relied on a portion of the dataset rather than the entire dataset, they actually reduced the informativeness of the original data. They failed to understand this:

The more the composition of a sample matches the larger group/population, the more accurate measures taken from the sample (means, medians, percentages, and so on) are. As a corollary, a larger representative sample is better than a smaller representative sample because its composition typically matches the larger population more closely. When a sample is the equivalent of a census of the entire population, the sample is perfectly accurate (generally speaking).
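The sketch below (Python, invented data) shows what is lost by subsampling a file you already hold in full: the figure computed from the complete datafile is exact for that population, while estimates from random subsamples wobble around it from draw to draw.

```python
import random
import statistics

random.seed(4)

# Suppose we already hold a complete datafile (a census) of 20,000 records,
# say annual e-book checkouts per cardholder. All values here are invented.
census = [random.expovariate(1 / 12) for _ in range(20_000)]
census_mean = statistics.mean(census)  # exact for this population; no inference needed

# Analyzing only a random subsample throws information away: its estimates
# wobble around the census figure from one draw to the next.
subsample_means = [statistics.mean(random.sample(census, 500)) for _ in range(5)]

print(f"Mean from the complete datafile: {census_mean:.2f}")
print("Means from five random subsamples of 500:", [round(m, 2) for m in subsample_means])
```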

And, yes, I should also add:

A small representative sample is better than a large unrepresentative sample. And an unrepresentative sample is possibly better than not conducting the survey at all, but (a) not by very much and (b) only if our luck is reasonably good. If our luck is bad, measures from the unrepresentative sample will be totally wrong, in which case not conducting the survey is the better option. (Better to be guided by no information than wrong information.)

If your library always has good luck, then it should by all means use an unrepresentative sampling method like convenience sampling. You can explain to the library’s constituents how the library’s consistently good luck justifies this use.

Now on to a final case of unadulterated selection bias from a few years back. I believe the high visibility of this study justifies naming the organizations that were involved in its production. My purpose is to remind readers that the status of an institution and the quality of its data analysis practices are not necessarily related. Which, in a way, is good news since it means humble organizations with limited credentials and resources can learn data analysis best practices and outperform the big guys!

So to the story. This is about the study funded by a $1.2 million grant from the Bill & Melinda Gates Foundation to the Online Computer Library Center (OCLC) entitled From Awareness to Funding, published in 2008.3  The surveys used in the study were designed and conducted by the internationally acclaimed advertising firm Leo Burnett USA (à la Mad Men!).4

For this study Leo Burnett researchers surveyed two populations, U.S. voters and U.S. elected officials. Survey respondents from the voter group were selected using a particular type of probability sampling. (This is good, at least on the surface.) The resulting sample consisted of 1900 respondents to an online questionnaire. The elected officials sample was made up of self-selected respondents to invitations mailed to subscribers of Governing, a professional trade journal. In other words, elected officials were selected via a convenience sample. (This is bad.) Nationwide, 84 elected officials completed Leo Burnett USA’s online questionnaire. (This is not so good either.)

Roughly, there are 3,000 counties in the U.S. and 36,500 cities, towns, townships, and villages.5  Let's say each entity has on average 3 elected officials. Thus, a ballpark estimate of the total count of elected officials in the U.S. is 118,500 (39,500 entities × 3). To omit officials in locales with no public library let's just round the figure down to 100,000.

High-powered Leo Burnett USA settled for a very low-powered and quite unreliable sample—84 self-selected officials—to represent a population of about 100,000. The OCLC report acknowledged this deficiency, noting:

Due to the process by which respondents were recruited, they represent a convenience sample that is quantitative but not statistically representative of all local elected officials in the United States.6

Professional standards for marketing researchers caution against misrepresenting the quality of survey samples. The Code of Marketing Research Standards obliges marketing researchers to:

Report research results accurately and honestly… [and to] provide data representative of a defined population or activity and enough data to yield projectable results; [and] present the results understandably and fairly, including any results that may seem contradictory or unfavorable.7

So, the responsible thing for Leo Burnett USA to do was to announce that reporting any details from the 84 respondents would be inappropriate due to the inadequacy of the sample. Or, in light of marketing research professional standards, they could have made an effort to draw a probability sample to adequately represent the 100,000 U.S. elected officials—perhaps stratified by city/town/township/village.

But alas, Leo Burnett USA and presumably the OCLC authors chose a different strategy. First, as a technicality, admit the sample’s deficiency in the report text (their quotation above). Then, ignore both the deficiency and the admission by portraying the data as completely trustworthy. As a result, an entire chapter in the OCLC report is devoted to quite unreliable data. There you will find 18 charts and tables (an example is shown below) with dozens of interesting comparisons between U.S. elected officials and U.S. voters, thoughtfully organized and confidently discussed.

OCLCFundingCh3Chart

Source: From Awareness to Funding, OCLC, Inc. p. 3-5.

So what’s not to like? Well, we might dislike the fact that the whole thing is a meaningless exercise. When you compare data that are accurate (like the purple-circled 19.9 voters figure above) with data that are essentially guesswork (like the blue-circled 19.0 elected officials figure above), the results are also guesswork! This is elementary subtraction which works like this:

WildAssGuessSubraction

The Wild-Ass Answer is off by however much the Wild-Ass Guess is! This isn’t necessarily part of the knowledge-quadrupling information I mentioned. But it’s handy to know. You can see another example here.
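For the arithmetically inclined, here is the same point as a tiny Python sketch. The 19.9 and 19.0 figures come from the chart above; the range of possible guess errors is hypothetical, since by definition nobody knows how far off the guess is.

```python
# Toy arithmetic only: the 19.9 (voters) figure is treated as accurate, while
# the 19.0 (elected officials) figure is a guess whose unknown error we vary.
accurate_voters = 19.9
reported_officials = 19.0

for guess_error in (-2.0, -1.0, 0.0, 1.0, 2.0):
    true_officials = reported_officials + guess_error
    reported_gap = accurate_voters - reported_officials  # what the comparison shows
    true_gap = accurate_voters - true_officials          # what reality would be
    print(f"guess off by {guess_error:+.1f} -> reported gap {reported_gap:.1f}, true gap {true_gap:.1f}")
# The reported gap is wrong by exactly the amount the guess is wrong by.
```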

As I said, this 2008 OCLC report was promoted far and wide. And the dubious elected officials data were showcased in the GeekTheLibrary initiative (the $5+ million sequel to the 2008 OCLC study) as shown in these banners that appeared on the initiative’s website:

GeekTheLibElectedOfficialsBanners

Banners posted on http://www.geekthelibrary.org.

Due to selection bias the statements in both banners are pure speculation. Incidentally, the GeekTheLibrary initiative was, shall we say, data-driven. It was designed based on findings from the 2008 OCLC study. We can only hope that there weren’t very many program strategies that relied on the study’s insights into U.S. elected officials.

That, of course, is the problem with unrepresentative survey samples. They are likely to produce unreliable information. If our objective is accurate and unbiased information then these samples are too risky to use. If our objective is going through the motions to appear data-driven and our audiences can’t verify our data on their own, then we can use these samples with no worries.

 
—————————

1   Smith, T.M.F. (1993). Populations and selection: Limitations of statistics, Journal of the Royal Statistical Society – Series A (Statistics in Society), 156(2), 149.
2    Generalizability refers to the extent to which we have ample grounds for concluding that patterns observed in our survey findings are also true for the larger group/population of interest. Other phrases describing this idea are: Results from our sample also apply to the population at large; and we can infer that patterns seen in our sample also exist in the larger population. Keep reading, as these ideas are explained throughout this blog entry!
3   De Rosa, C. and Johnson, J. (2008). From Awareness To Funding: A Study of Library Support in America, Dublin, Ohio: Online Computer Library Center.
4   On its Facebook page Leo Burnett Worldwide describes itself as “one of the most awarded creative communications companies in the world.” In 1998 Time Magazine described founder Leo Burnett as the “Sultan of Sell.”
5   The OCLC study purposely omitted U.S. cities with populations of 200,000 or more. Based on the 2010 U.S. Census there are 111 of these cities. For simplicity, I had already rounded the original count (36,643) of total cities, towns, townships, and villages to 36,500. This 143 adjustment cancels out the 111 largest U.S. cities, making 36,500 a reasonable estimate here.
6   De Rosa, C. and Johnson, J. (2008), 3-1. The phrase “sample that is quantitative” makes no sense. Samples are neither quantitative nor qualitative, although data contained in samples can be either quantitative or qualitative. The study researchers also misunderstand two other statistical concepts: Their glossary definition for convenience sample confuses statistical significance with selection bias. These two concepts are quite different.
7   Marketing Research Association. (2007). The Code of Marketing Research Standards, Washington, DC: Marketing Research Association, Inc., para. 4A; italics added.

Posted in Accountability, Library assessment, Measurement, Research, Statistics

Putting the Best Findings Forward

I think I’m getting jaded. I am beginning to wonder whether lobbying for balanced reporting of evaluation and research findings is a waste of time. With voices more influential than mine weighing in on the opposite side, I’m having trouble staying positive. Granted, I do find inspiration in the work of people much wiser than me who have confronted this issue. One such source is my favorite sociologist, Stanislav Andreski, who wrote the following in his book, Social Sciences as Sorcery:

In matters where uncertainty prevails and information is accepted mostly on trust, one is justified in trying to rouse the reading public to a more critical watchfulness by showing that in the study of human affairs evasion and deception are as a rule much more profitable than telling the truth.1

The problem is, wisdom like Andreski’s languishes on dusty library shelves and the dust-free shelves of the Open Library. Much more (dare I call it?) airtime goes to large and prestigious institutions that are comfortable spinning research results to suit their purposes.

Fortunately, I am not so demoralized as to pass up the opportunity to share yet another institution-stretching-the-truth-about-research-data story with you. This involves an evaluation project funded by the Robert Wood Johnson Foundation and conducted by Mathematica Policy Research and the John W. Gardner Center for Youth and Their Communities at Stanford University. In case you are not aware, the Robert Wood Johnson Foundation is well known for its longstanding commitment to rigorous evaluation research. I have not surveyed their publications or project output, and I cannot say whether the instance I'll be describing here is typical or not. I hope it isn't.

Before getting to the story details, a little background information on research methodology is necessary. Randomized controlled trials are the optimal research method for determining whether a given activity or program (an intervention) really works. This method is designed specifically to confirm that the intervention actually produced (i.e., caused) changes observed in the target population. The approach, also known as experiments and clinical trials, is the highest level of evidence in evidence-based practice. In evaluation, the approach falls under the more general categories of impact evaluation and effectiveness evaluation.
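For readers who like to see the mechanics, here is a bare-bones sketch (Python, entirely invented data) of what a randomized controlled trial does: randomly assign units, here hypothetical schools, to treatment or control, then compare average outcomes. The 0.3 "true effect" and the 0-4 outcome scale are assumptions for illustration, not figures from the Playworks study.

```python
import random
import statistics

random.seed(5)

# Thirty hypothetical schools; the "intervention" adds a true effect of 0.3 to
# an outcome scored on a 0-4 scale. Both numbers are assumptions for
# illustration, not figures from any actual study.
schools = list(range(30))
random.shuffle(schools)                        # random assignment is the key step
treatment, control = schools[:15], schools[15:]

def outcome(treated):
    baseline = random.gauss(2.0, 0.5)          # school-to-school variation
    return baseline + (0.3 if treated else 0.0)

treated_scores = [outcome(True) for _ in treatment]
control_scores = [outcome(False) for _ in control]

print(f"Treatment group mean: {statistics.mean(treated_scores):.2f}")
print(f"Control group mean:   {statistics.mean(control_scores):.2f}")
# Because assignment was random, the groups differ on average only by the
# intervention, so the gap in means estimates the intervention's effect.
```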

Now to the specifics of the Robert Wood Johnson Foundation project. This was a randomized controlled study of the effectiveness of a public school recess program, known as Playworks, conducted in 2011 and 2012. The image below is a link to an issue brief published by the foundation. (The smoking gun.) If you would, click on the image and read the headlines and the bullet points appearing at the middle of the page:

Robt Wood Johnson Brief

Click to see Robert Wood Johnson Foundation issue brief.

Notice that the headline reports that the Playworks study confirmed “widespread benefits” of the program. In the body of the text note also the statement about a “growing body of evidence that…organized recess…has the potential to be a key driver of better behavior and learning.” (Phrases like widespread benefits and key driver are sure signs that public relations or marketing professionals had a hand in the writing. When publicists and marketeers are nearby, evasion and deception cannot be far behind.)

The bullet points specify the main positive impacts the Playworks program had and also address that nagging quantitative question—how much? How much less bullying was there? A 43 percent difference in rating scores (amounting to a 0.4 point difference). How much increase in feelings of safety? A 20 percent score difference (0.6 points). What increases in vigorous activities? Playworks participants were vigorously active 14% of recess time compared to 10% for non-participants. How much enhancement to learning readiness? Following recess Playworks students were ready to learn 34 percent more quickly (a 3-minute-per-day readiness advantage). You can read these and a myriad of other quantitative comparisons between Playworks schools (the treatment group) and other schools (the control group) in the full report by Mathematica and the Gardner Center (see Appendix 2). I'll return to these comparisons further on.

But first, I want to say something about the Less Bullying bullet point. My purpose here is to stress the importance of knowing, in any research study, exactly what got measured. In the brief, notice in the non-bold text that the measure is actually "bullying and exclusionary behaviors." Now what would those be?

A footnote in Appendix 2 says that researchers measured these concepts by averaging teachers’ responses to seven questionnaire items. Four items were about students or parents reporting students being bossed or bullied. The other three items were about reported incidents of name-calling, pushing and hitting, and students feeling “isolated from their normal peer group.” (Interestingly, elsewhere the report says that 20% of the Playworks students reported they “felt left out at recess” compared to 23% of non-Playworks students.)

The measurement scale was something like this:

RWJ Safety Scale

Rating Scale for Reported Incidents of Bullying and Exclusionary Behaviors

In this figure the average score for the Playworks schools was 0.6 and for the non-Playworks schools it was 1.0. So the averages for these two groups were towards the lower end of the scale. Also, since the scale measured two things it’s hard to say what portion of the 0.4 Playworks difference pertained to bullying versus exclusionary behaviors. In any case and very roughly, on average the non-Playworks schools reported one more incident of either of these types of behaviors than the Playworks schools did (I believe this would be per surveyed teacher.)

Now consider the next bullet point, Increased Feelings of Safety at School. The study found teachers’ ratings of how safe students felt at school to be 20% (0.6 points) higher at the Playworks schools. But what about students’ own feelings of safety? Referring to the Mathematica and Gardner Center full report, the average level of safety that students felt at Playworks schools was 4% (0.1 point) higher than for the other schools. And the average level of feelings of safety at recess was 8% (0.2 point) higher at Playworks schools than at the other schools. Of course, designers of the Playworks study infographic (below) mixed up these findings, stating that students felt as safe as teachers thought they did. But the study shows otherwise.

Playworks infographic

Playworks study infographic.   Click for larger image.

As you’d expect, the issue brief reported those outcomes where the greatest impacts were detected.2  Designers of the infographic went overboard, cherry-picking items that teachers agreed with wholeheartedly. (I say “overboard” because the infographic designers decided to deceive readers by hiding an important report finding that disputes two of their 90+ percentages. For the un-cherry-picked version, see the last item in the list of excerpted findings in the next paragraph.)

In both cases the idea is putting the best findings forward. So you have to refer to the full report to get a fix on the real story. If you study the study in detail, you’ll see that the brief paints a rather rosy picture (and the infographic is way out in left field). Consider these not-as-rosy findings:

Significant impacts were observed in domains covering school climate, conflict resolution and aggression, learning and academic performance, and recess experience, suggesting that Playworks had positive effects. No significant impacts were detected in the other two domains addressing outcomes related to youth development and student behavior.3 [underlining added]

Playworks had a positive impact on two of the five teacher-reported measures of school climate but had no significant impact on the three student-reported measures of school climate.4 [underlining added]

Teachers in treatment schools reported significantly less bullying and exclusionary behavior [this is the 0.4 point difference graphed above]. However, no significant impacts were found on teacher reports of more general aggressive behavior…student reports of aggressive behavior, students’ beliefs about aggression, or students’ reports on their relationships with other students.5  [underlining added]

There were no significant differences on six additional outcome measures that assessed student engagement with classroom activities and academic performance, homework completion and motivation to succeed academically.6  [underlining added]

Playworks had no significant impact on students’ perceptions of recess, as measured in the student survey. In particular, there was no significant impact on six items that measured the type of recess activities in which students were engaged, such as talking with friends or playing games and sports with adults during recess. There was also no impact on six items that measured student perceptions of recess, such as enjoyment of recess or getting to play the games they wanted to play. In addition, no impact was found on six items that measured student perceptions of how they handle conflict at recess, such as asking an adult to help them solve a conflict or getting into an argument with other students during recess.7  [underlining added]

There were no significant impacts of Playworks on eight measures of youth development. In particular, students in treatment and control schools had similar reports on a six-item scale that measured feelings about adult interactions…In addition, a similar percentage of treatment and control students reported getting along well with other students. There was also no significant difference on a scale that included eight items asking students to indicate their effectiveness at interacting with peers in conflict situations, such as their ability to tell kids to stop teasing a friend. Teachers in treatment and control schools also reported similar perceptions of students’ abilities to regulate their emotions, act responsibly and engage in prosocial and altruistic behavior.8  [underlining added]

Despite the fact that most treatment teachers who responded to the survey felt that Playworks reinforced positive behavior during recess (96 percent) and resulted in fewer students getting into trouble (91 percent) [shown in the the infographic above], there were no significant impacts of Playworks on multiple indicators of student behavior. Treatment and control group students who took the student survey reported similar levels of disruptive behavior in class and behavioral problems at school. Teachers in treatment and control schools reported similar amounts of student misbehavior, absences, tardiness, suspensions and detentions among their students.9  [underlining added]

Reads like those warnings and contraindications that accompany pharmacy prescriptions, doesn’t it? Nevertheless, these facts are important for understanding the truth about this study:  There was a mixture of positive impacts in certain areas and a lack of impacts in several others.

As a quick-and-dirty way to try to make sense of this I tallied results for five “outcome domains” reported in tables in Appendix 2 of the full report:10  school climate (table 3), conflict resolution and aggression (table 5), academic performance (table 7), youth development (table 10), and student behavior (table 12). The chart below shows the point differences11 between the Playworks and non-Playworks schools:

Playworks Bar Chart

Click for larger image.

For 70% (19) of the 27 outcome measures the difference between Playworks schools' and non-Playworks schools' scores was 1/10th of a point or less. Seventy-eight percent (21) of the measures had differences of 2/10ths of a point or less. On the remaining 22% (6) of the measures treatment and control schools differed by 3/10ths of a point or more. So, on nearly 80% of the outcome measures the Playworks schools were almost identical to the control group schools. Meaning that for these outcomes the Playworks program made no appreciable difference.
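For anyone who wants to replicate the tally, the logic is nothing more than binning absolute point differences, as in the Python sketch below. The values listed in the code are placeholders, not the 27 differences from the report; to reproduce the counts above you would substitute the figures from the Appendix 2 tables.

```python
# The values below are placeholders, NOT the 27 point differences from the
# Playworks report; substitute the figures from the Appendix 2 tables to
# reproduce the counts given in the text.
point_differences = [0.05, 0.10, 0.00, 0.15, 0.30, 0.40, 0.10, 0.05, 0.20]

n = len(point_differences)
tiny = sum(1 for d in point_differences if abs(d) <= 0.1)    # 1/10th of a point or less
small = sum(1 for d in point_differences if abs(d) <= 0.2)   # 2/10ths or less
larger = sum(1 for d in point_differences if abs(d) >= 0.3)  # 3/10ths or more

print(f"{tiny} of {n} ({tiny / n:.0%}) differ by 0.1 or less")
print(f"{small} of {n} ({small / n:.0%}) differ by 0.2 or less")
print(f"{larger} of {n} ({larger / n:.0%}) differ by 0.3 or more")
```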

Several factors could have interfered with measuring effects that the Playworks program may have had in reality. For instance, in the full report researchers noted that the program was not implemented consistently at each school. Only half of the treatment group schools followed the Playworks program regimen closely (five schools followed the regimen moderately closely). Another potential problem is the multi-item questionnaire scales and their 4-point measurements. Reliability and validity issues could have compromised accurate measurements.

Still, it’s fair to say that the study was methodologically sound (e.g. randomization and sufficient sample sizes) and thorough (e.g. extensive scope of survey questions). It’s a shame that the hard work invested in assuring the quality of the study was diminished by an issue brief containing such casually drawn conclusions. Based on this study alone, the chances that organized recess could perform as a key driver of better [school] behavior and learning appear to be slight. And the idea that the Playworks program had widespread benefits is untrue. As for the infographic, its insubstantial content matches its cartoonish design perfectly!

The point being that the study needs to be summarized in its entirety, which is to say, honestly. Certainly, Playworks, Mathematica, the Gardner Center, and the Robert Wood Johnson Foundation will have done a comprehensive review of the full report in private. But why not share something about this important activity with the public at large? Include a few sentences in published summaries explaining that measurements are approximate and findings can be uncertain. And that researchers often need to sort through mixed signals in the data to draw the soundest conclusions possible.

These are the kinds of messages worth putting forward. Something to help improve the truth-to-lie quotient out there.

 
—————————

1    Andreski, S. (1973). Social Sciences As Sorcery, p. 13. 
2   The issue brief bullet point More Vigorous Physical Activity is a finding from a separate study. Additional positive impacts of the Playworks program reported in the executive summary of the full study by Mathematica and the Gardner Center were omitted from the issue brief. Namely, Playworks schools scored higher on teachers’ perceptions of: (a) students feeling included during recess; (b) student behavior during recess; (c) student behavior “after sports, games, and play”; and (d) how much students enjoyed adult-organized recess activities.
3   Bleeker, M. et al. 2012. Findings from a Randomized Experiment of Playworks: Selected Results from Cohort 1. Robert Wood Johnson Foundation, p. 10.
4   Bleeker, M. et al. 2012. p. 11.
5   Bleeker, M. et al. 2012. p. 12.
6   Bleeker, M. et al. 2012. p. 13.
7   Bleeker, M. et al. 2012. p. 14.
8   Bleeker, M. et al. 2012. p. 15.
9   Bleeker, M. et al. 2012. p. 16.
10   I included 27 of the 39 outcome measures listed in these tables. I intentionally omitted twelve measures that are percentages of respondents reporting certain outcomes of interest—like per cent of teachers reporting student detentions in the last 30 days. Because these twelve measures are indirect indicators, they are misleading. One teacher in ten (10%) having three incidents of detention will not be equivalent to three teachers in ten (30%) each having one incident of detention.
11   My quick-and-dirty approach intentionally avoids the topics of statistical inference and statistically significant differences, which are sometimes more confusing than helpful (see the heading “The Bane That Is Statistical Significance” in my prior post.) A large majority of measured differences between Playworks schools and other schools failed to be statistically significant. Besides, the hurdle of statistical significance was even higher in this study due to the need to account for what is called multiple hypotheses testing. My quick-and-dirty look at tiny measured differences is a simpler way to conceive what statistical significance would basically tell us. Due to the formulas for statistical significance testing, tiny differences like 1/10th of a point would only pass the test if sample sizes were much higher than they were in the study.

Posted in Advocacy, Outcome assessment, Program evaluation, Reporting Evaluation/Assessment Results, Research

Paved with Good Intentions

It never hurts to revisit the basics of a method that we've chosen to apply to a task we want to accomplish or a problem that needs solving. So, the recent announcement of the Library Edge benchmarks is a good occasion to discuss that particular performance assessment method. In the third edition of his book, Municipal Benchmarks, University of North Carolina professor David Ammons describes three types of benchmarking:1

1. Comparison of performance statistics
2. Visioning initiatives
3. "Best practices" benchmarking

The idea behind item #1 is that the sufficiency of an organization's performance can be judged by comparing its performance data with those of other organizations or against externally defined standards. Comparison of different organizations using only performance data, without any reference to standards, is called comparative performance measurement. An Urban Institute handbook of the same name by Elaine Morley, Scott Bryant, and Harry Hatry gives an in-depth explanation of this method.2 Libraries are already familiar with this approach as comparisons made among selected peers using statistical data for circulation, visits, volumes held, expenditures, registered users, and so on. Library rating systems like the LJ Index of Public Library Service and BIX/Der Bibliotheksindex belong to this category also.

The next type of benchmarking (item #2 above) is visioning initiatives. This approach involves targets set as long-term community goals. These typically are community-wide outcomes meant to represent overall social welfare or quality of life as measured by what are called social indicators. These projects track measures of things such as life expectancy, literacy, crime, civic engagement, and so forth. An example of this approach is the Oregon Benchmarks project.

Best practices benchmarking (item #3 above) comes from the field of quality management. This approach focuses on a specific work process or procedure and studies organizations with outstanding performance records for this process or procedure. The purpose is obtaining ideas for improving the process or procedure in the organization. This is also known as “corporate-style benchmarking” because of its prevalence among private sector organizations. Due to the narrow focus, this approach usually does not reveal much about organizational accomplishments. (Later on I give an example about city street maintenance that illustrates this and should also explain the title I’ve chosen for this post.)

Ammons says that the first benchmarking method—comparison of performance statistics—is the most meaningful for public sector organizations. Comparative statistics provide important context and perspective for an organization’s performance data. Still, as with any tool or method, these have their limitations. For example, consider the fact that statistical comparisons always involve ranking data high to low. The result is that half of the organizations end up classified as higher performers and the other half as lower (using the statistical median to indicate the middle of the group, rather than average). Yet, sometimes the difference between higher and lower performers may be insignificant, especially for organizations that fall near the middle of the distribution. Those slightly below the median are classified as sub-standard while those slightly above are not. It is also possible that the performance of the entire group is unsatisfactory on some organizational measure. However, those at the top of the list still receive high ratings.

This is where standards are useful, including the Library Edge benchmarks (which are essentially standards). Standards can provide an anchor by establishing some minimal level of satisfactory performance. And their content will cover a range of important aspects of a program, service, or intervention. At the same time, the fact that some body or another has issued standards does not automatically make these valid or appropriate. As Ammons observes:

Many so-called [municipal performance] standards are vague or ambiguous, have been developed from limited data or by questionable methods, may focus more on establishing favored policies or practices than on prescribing results, or may be self-serving, more clearly advancing the interests of service providers than service recipients.3

The purpose of comparing performance data—either to other organizations or to externally defined standards—is twofold: (1) to be a catalyst for improving performance and (2) to enable the general public and the elected officials representing them to assess the degree to which a public institution is productive and effective. Although this second purpose is not mentioned in the Library Edge benchmarks, it has been a recognized principle for some time in the library field. The statement below, written in 1956 by historian Gerald W. Johnson and appearing in the Minimum Standards for Public Library Systems, 1966, illustrates this:

One obligation resting upon every public institution in a democracy is that of standing ready at all times to render an account of itself to the people and to show cause why they should continue to support it. No institution is so lordly that its right to existence is beyond challenge, and none, except perhaps public monuments, can rightfully claim present consideration on the basis of past distinction.4

For 21st century libraries the closest thing to Johnson’s statement is the catch phrase “demonstrating library value,” which usually means painting as positive a picture of libraries as possible. Nevertheless, Johnson was referring to the notion of accountability that comes from governmental accounting and auditing. The main standard-setting body in that field, the Governmental Accounting Standards Board (GASB), describes it this way:

Public accountability is based on the belief that the taxpayer has a “right to know,” a right to receive openly declared facts that may lead to public debate by the citizens and their elected representatives…Financial reporting should assist in fulfilling government‘s duty to be publicly accountable and should enable users to assess that accountability by… providing information to assist users in assessing the service efforts, costs, and accomplishments of the governmental entity.5

In accountability the main ideas are accuracy, full disclosure, verifiable information, and, as the GASB concept paper notes, assessing accomplishments. This last item is the reason Ammons recommends that standards and benchmarks take the form of “thresholds of effectiveness, service quality, efficiency, outcomes, or results.”6 Citizens evaluate public agencies primarily on accomplishments and benefits delivered at a reasonable cost, rather than on efforts and intentions. Incidentally, best practice benchmarking and continuous improvement regimens don’t really assess accomplishments and benefits. For example, reporting that a city’s street maintenance department uses the most advanced tools and highest quality materials to patch potholes does not inform the public about how responsive and thorough the department is at getting the holes patched, nor how long the patches actually last.

Rather than accomplishments and benefits, the Library Edge benchmarks are about intentions. That is, capacity (sufficient and reliable technology equipment, sufficiently trained staff, user training available, technical assistance offered, relevant website content, organized employment, health, e-government, and education materials, etc.) and service outputs (WiFi/public computer sessions, training and individual assistance delivered, website usage, etc.). This limited scope is appropriate in this early rendition of the benchmarks. But pretty quickly here 21st century libraries are going to need to master more advanced performance measurement methods than just monitoring capacity and outputs.

High quality benchmarks for public library technology services will need to follow the lead of other municipal services. For instance, there are city traffic management departments that evaluate their efforts by measuring improvement in average travel speeds for major traffic arteries during rush hours; and evaluate neighborhood speed-control programs based on measured decreases in traffic speeds and in speeding citations. Public health programs are evaluated on measures like obesity rates and infant mortality rates in the general population. And fire department performance is often gauged not only by elapsed time within which fires are suppressed, but also on the rate of reduction of the incidence of fires community-wide. Of course, these different measures are each compared to data from other towns and cities.

Public library technology proponents should consider developing similar measures beginning with raw counts, for instance, numbers of e-government transactions, vocational exams, or online education courses successfully completed by patrons. Then, to conduct true benchmarking these counts need to be made comparable across libraries and communities (as Ammons advises). That is, they need to be computed as rates, such as number of e-government transactions successfully completed per public computer or WiFi session, or per e-government training session hour offered; or percentage of e-government training class attendees who test competent in the use of selected e-government sites.
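As a rough sketch of what such rate-based benchmarking might look like, the Python snippet below converts hypothetical raw counts into rates per 1,000 sessions. The library names and all the numbers are invented for illustration.

```python
# Hypothetical raw counts for three invented libraries. Converting counts to
# rates makes libraries and communities of very different sizes comparable.
libraries = {
    "Library A": {"egov_transactions": 1200, "computer_wifi_sessions": 40_000},
    "Library B": {"egov_transactions": 300,  "computer_wifi_sessions": 6_000},
    "Library C": {"egov_transactions": 2500, "computer_wifi_sessions": 150_000},
}

for name, counts in libraries.items():
    rate = counts["egov_transactions"] / counts["computer_wifi_sessions"]
    print(f"{name}: {rate * 1000:.1f} e-government transactions per 1,000 sessions")
# Ranked by raw counts, Library C looks strongest; ranked by rate, Library B
# does, which is exactly the point of benchmarking with rates.
```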

When public library technology benchmarking progresses to this stage, then our field will be leaders rather than followers in performance measurement. We will also have extraordinarily meaningful data to report to library boards, elected officials, and taxpayers. Quite a shift from reporting routine counts of this year’s service outputs, bandwidth increases, or even splashy new technologies installed. And, by the way, with measures of accomplishments and benefits the Library Edge value-of-public-computing stories will document the actual destinations public technology leads to.

 
—————————

1   Ammons, D. N. (2012). Municipal benchmarks: Assessing local performance and establishing community standards, Armonk, NY: M.E. Sharpe, p. 15.
2  Morley, E., Bryant, S. P., & Hatry, H. P. (2001). Comparative performance measurement, Washington, DC: The Urban Institute Press.
3  Ammons, D. N. (2012). p. 4.
4  American Library Association. (1967). Minimum Standards for Public Library Systems, 1966, Chicago: American Library Association, p. 1.
5  Governmental Accounting Standards Board. (1987). Concept Statement 1: Objectives of Financial Reporting, p. ii. Retrieved from http://www.gasb.org/jsp/GASB/Page/GASBSectionPage&cid=1176160042391#gasbcs5.
6  Ammons, D. N. (2012). p. 16.

Posted in Accountability, Advocacy, Measurement, Reporting Evaluation/Assessment Results