Data Accessibility

Summer Reading (Part 2)

Last week I wrote about my favorite new papers on mountains and phenology after a summer of scientific reading. In the second half of my top ten list, I’m highlighting some plant mysteries and best practices of 2018. 

“Plant mysteries” is a label that I’m using to lump together three plant papers that I can’t stop thinking about. They cover some of my favorite methodological quirks — historical field notes, herbarium digitization, citizen science — and two genera that I think are cool — Sibbaldia and Erythronium. The mysteries range from Is this still here? to Why is this here in two colors? to Can I get this specimen to tell me what else grew here? There isn’t much thematic overlap, but all three papers tell gripping stories. If nothing else, they share a strong natural history foundation and well-executed scientific writing that made for lovely hammock-reading.

“Best practices” are just that — descriptions of how we can improve our science as individuals and collectively. We can design better spreadsheets for our data and we can support gender equity in our scientific societies. I strongly recommend that all ecologists read up on both. 

Plant Mysteries

I didn’t particularly notice [trophy collecting/associated taxa/pollen color polymorphism] before, but now I can’t not see it…

1. Sperduto, D.D., Jones, M.T. and Willey, L.L., 2018. Decline of Sibbaldia procumbens (Rosaceae) on Mount Washington, White Mountains, NH, USA. Rhodora, 120 (981), pp.65-75.

I love this deep dive into the history of a snowbank-community alpine plant that occurs in exactly one ravine in New England (though it’s globally widespread across Northern Hemisphere arctic-alpine habitats). Over the past four decades, surveys in Tuckerman’s Ravine have documented a continuous decline in the abundance of creeping sibbaldia, and recently researchers have been unable to find it at all. This would make creeping sibbaldia the first documented extirpation of an alpine vascular plant in New England. Dr. Daniel Sperduto and coauthors revisit the photographs and notes from those surveys and find that mountain alders are encroaching on the creeping sibbaldia’s snowbank habitats. These notes also include anecdotes of local disturbances like turf slumping at the sites where creeping sibbaldia used to be found. In herbaria across New England, Sperduto and coauthors discovered sheets covered with dozens of specimens — this “trophy collection activity” in the 19th century led them to calculate that “there are more than three times as many plants with roots at the seven herbaria examined than the maximum number of plants counted in the field within the last 100 years.” I am obviously partial to New England alpine plants, and I got to see Sperduto present this research as part of an engaging plenary session at the Northeast Alpine Stewardship Gathering in April, so you could write this off as a niche interest. Still, I see creeping sibbaldia as a lens for considering the universal mysteries of population decline and extirpation, and the challenges of tying extirpation to concrete cause-and-effect stories.

2. Pearson, K.D., 2018. Rapid enhancement of biodiversity occurrence records using unconventional specimen data. Biodiversity and Conservation, pp.1-12.

Leveraging herbarium data for plant research is so hot right now. But what if you could squeeze even more information from a specimen label? For example, many collectors note “associated taxa” along with the date and location of collection. The associated taxa are plants that were seen nearby, but not collected — a kind of ghostly palimpsest of the community that grew around the chosen specimen. Herbaria across the globe have spent the past decades digitizing specimens and uploading photographs of their pressed plants. In this process, the associated taxa on specimen labels are often stored in a ‘habitat’ database field. In this impressive single-author paper, Dr. Kaitlin Pearson extracts the associated taxa data from Florida State University’s Robert K. Godfrey Herbarium database with elegant code that can recognize abbreviated binomial names and identify misspellings. She then compares the county-level distributions in the associated-taxa dataset with known county-level distributions from floras and herbarium specimens. Incredibly, “the cleaned associated taxon dataset contained 247 new county records for 217 Florida plant species when compared to the Atlas of Florida Plants.” There are plenty of caveats: the associated taxa can’t be evaluated for misidentification the way a specimen can, and lists of associated taxa are obviously subject to the same spatial biases as herbarium specimens. But this is clearly a clever study with a beautifully simple conclusion: “broadening our knowledge of species distributions and improving data- and specimen-collection practices may be as simple as examining the data we already have.”
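
If you’re curious what that kind of label-mining looks like in practice, here is a minimal sketch of the general idea (my own toy example, not Pearson’s actual code): split the free-text field on delimiters, expand abbreviated genus names, and fuzzy-match each candidate against a reference checklist to catch misspellings. The field contents, species list, and function names below are invented placeholders.

```python
import difflib
import re

# Hypothetical reference checklist of accepted binomials (a real pipeline
# would load a full flora, e.g., the Atlas of Florida Plants).
REFERENCE_TAXA = [
    "Quercus alba",
    "Quercus virginiana",
    "Pinus palustris",
    "Serenoa repens",
]

def expand_abbreviation(name, last_genus):
    """Expand 'Q. virginiana' to 'Quercus virginiana' using the most recent full genus seen."""
    match = re.match(r"^([A-Z])\.\s+([a-z-]+)$", name.strip())
    if match and last_genus and last_genus.startswith(match.group(1)):
        return f"{last_genus} {match.group(2)}"
    return name.strip()

def match_taxa(habitat_text):
    """Split an 'associated taxa' string and fuzzy-match each name to the checklist."""
    matched = []
    last_genus = None
    for raw in re.split(r"[,;]", habitat_text):
        name = expand_abbreviation(raw, last_genus)
        # difflib tolerates small misspellings ('Serenoa repans' still maps to 'Serenoa repens')
        hits = difflib.get_close_matches(name, REFERENCE_TAXA, n=1, cutoff=0.85)
        if hits:
            matched.append(hits[0])
            last_genus = hits[0].split()[0]
    return matched

print(match_taxa("Quercus alba, Q. virginiana; Serenoa repans"))
# -> ['Quercus alba', 'Quercus virginiana', 'Serenoa repens']
```

The cleaning steps in the paper are more sophisticated than this, but the core trick is just careful string matching against names we already trust.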

3. Austen, E.J., Lin, S.Y. and Forrest, J.R., 2018. On the ecological significance of pollen color: a case study in American trout lily (Erythronium americanum). Ecology, 99(4), pp.926-937.

Did you read Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models in a seminar and think, this seems like an amazing resource, but I’m an ecologist and examples about school children watching Sesame Street or incumbency in US congressional races just don’t resonate with me? The ecological and evolutionary mystery of red/yellow pollen polymorphism is super interesting in its own right, and Dr. Emily Austen and coauthors thoroughly attack this question. For me — and I’ve admitted here before that I am the kind of learner who benefits from repetition — Austen’s statistical methods are the star. Austen demonstrates GLM best practices and brings stunningly clear plant ecology examples to the Gelman and Hill framework. I would probably teach this paper in a field botany course (trout lilies are charismatic! look at this fun map of pollen color polymorphism!), but I would absolutely prefer to assign it in a statistical methods course, especially as a supplement or set of alternative exercises to Gelman and Hill.
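
If you want a feel for the workflow before picking up the paper, here is a bare-bones sketch of the kind of binomial GLM the study builds on (this is not Austen’s analysis; the tiny data frame and the variable names are invented for illustration): does fruit set differ between pollen color morphs once flower size is accounted for?

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented plant-level data: pollen color morph, flower size, and whether the plant set fruit
flowers = pd.DataFrame({
    "pollen_color": ["red", "yellow", "red", "yellow", "red", "yellow", "red", "yellow"],
    "flower_diameter_mm": [21.0, 22.5, 23.1, 18.2, 18.7, 22.4, 24.0, 19.1],
    "set_fruit": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Binomial GLM (logistic regression): fruit set as a function of morph and flower size
model = smf.glm(
    "set_fruit ~ pollen_color + flower_diameter_mm",
    data=flowers,
    family=sm.families.Binomial(),
).fit()
print(model.summary())
```

Swap in real trout lily data and add the grouping structure (plants within populations) and you are squarely in Gelman and Hill’s multilevel territory, which is exactly why the paper pairs so well with the book.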

Best Practices

Do this…

1. Potvin, D.A., Burdfield-Steel, E., Potvin, J.M. and Heap, S.M., 2018. Diversity begets diversity: A global perspective on gender equality in scientific society leadership. PloS one, 13(5), p.e0197280.

Gender equality in biology dramatically decreases as you look up the ladder in academia — compare the gender breakdown of graduate students to that of tenured professors and the disparity is stark. Leadership in our field is still heavily male-skewed. Dr. Dominique Potvin and her coauthors asked, is this true in scientific societies too? Scientific societies are generally more open than academic departments, and there is more transparency in the process of electing governing boards and leadership positions. Potvin and coauthors leveraged these traits to ask: what is the role of scientific societies in rectifying gender inequity? why are some societies better than others at promoting women in leadership? After considering 202 societies in the zoological sciences, they found that the culture of the society — its age, the size of its board, and whether or not it had an outward commitment or statement of equality — was the best predictor of the gender ratio of society boards and leadership positions. This “outward commitment or statement of equality” covered anything published on the society website — a statement, committee, or other form of affirmative action program — that “implies that the society is dedicated to increasing diversity or improving gender equality.” Of the 202 societies they studied, only 39 (19.3%) had one of these visible commitments to equality. Whether societies with high proportions of female board members were more likely to draft and publish these statements, or whether societies that invested time and energy in producing such commitments attracted more women to leadership positions, is a bit of a chicken-and-egg riddle. Societies looking to reflect on their own state of gender equality can take advantage of the resource presented in Table 6: “Health checklist for scientific societies aiming for gender equality.” Assessing gender equality is kind of low-hanging fruit — the authors encourage societies to reflect on intersectionality and on race, age, ethnicity, sexuality, religion, and income level as well. Basically, if a scientific society is struggling to support white women in 2018, there’s an excellent chance it is failing its brown, LGBTQ, and first-generation members to a much greater extent.

2. Broman, K.W. and Woo, K.H., 2018. Data organization in spreadsheets. The American Statistician, 72(1), pp. 2-10.

If I could send a paper in a time machine, I would immediately launch Broman and Woo’s set of principles for spreadsheet data entry and storage back to 2009, when I started my master’s project. Reading through this list of best practices made me realize how many lessons I learned the hard way — how many times have I violated the commandments to “be consistent”, “choose good names for things”, or “do not use font color or highlighting as data”? Way too many! Eventually, I pulled it together and developed a data entry system of spreadsheets that mostly conforms to the rules outlined in this paper. But if I’d read this first, I would have skipped a lot of heartache and saved a lot of time. This is an invaluable resource for students as they prepare for field seasons and dissertation projects. Thank you, Broman and Woo, for putting these simple rules together in one place with intuitive and memorable examples!
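
To make a couple of those commandments concrete, here is a tiny sketch of what a rule-abiding data table and a quick sanity check might look like (the column names, codes, and file name are invented): one rectangular table, a single header row, consistent category codes, ISO 8601 dates, and missing values recorded explicitly rather than with blank cells or highlighting.

```python
import pandas as pd

# One observation per row, one header row, consistent codes, YYYY-MM-DD dates,
# and an explicit missing value instead of a color-coded blank cell.
observations = pd.DataFrame({
    "plot_id": ["A01", "A01", "B02"],
    "date": ["2018-06-01", "2018-07-01", "2018-06-01"],
    "species": ["Erythronium americanum"] * 3,
    "flower_count": [12, 7, pd.NA],
})

# Quick consistency checks: every date should parse, every plot code should match the pattern.
assert pd.to_datetime(observations["date"], format="%Y-%m-%d").notna().all()
assert observations["plot_id"].str.fullmatch(r"[A-Z]\d{2}").all()

# Save as plain CSV with a consistent missing-value code; no fonts, colors, or merged cells.
observations.to_csv("observations.csv", index=False, na_rep="NA")
```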

Happy Fall Reading! 

The Hidden Gems of Data Accessibility Statements

Sometimes the best part of reading a scientific paper is an unexpected moment of recognition — not in the science, but in the humanity of the scientists. It’s reassuring in a way to find small departures from the staid scientific formula: a note that falls outside of the expected syntax of Abstract-Introduction-Methods-Results-Discussion. As an early career scientist who is very much in the middle of sculpting dissertation chapters into manuscripts, it’s nice to remember that the #365papers I read are the products of authors who, like me, struggled through revisions and goofed off with coauthors and found bleak humor in the dark moments. 

Ecology blogs, Twitter, and the wider media also love noting the whimsical titles, funny (and serious) acknowledgements, memorable figures, and unique determinations of co-authorship order that have appeared in the pages of scientific journals.

I enjoy stumbling on these moments of levity in my TO READ file; last spring I procrastinated formatting my dissertation by avidly reading the Acknowledgements sections of papers by anyone I’d even vaguely overlapped with in my PhD program. One place I have not thought to look for serendipitous science humor: the Data Availability Statement. As it turns out, I have been missing an interesting story.

A recent PLOS ONE paper set out to analyze the Data Availability Statements of nearly 50,000 recent PLOS ONE papers. This may sound like a dull topic, but Lisa Federer and coauthors’ work is surprisingly engaging, topical, and thought-provoking. In March 2014, PLOS unveiled a data policy requiring Research Articles to include a Data Availability Statement providing readers with details on how to access the relevant data for each paper. But, as Federer et al. point out, “‘availability’ can be interpreted in ways that have vastly different practical outcomes in terms of who can access the data and how.”

Why do Data Availability Statements matter? In ecology, open data advocates make the case for reproducibility and re-use. So many of us work on small study areas and amass isolated spreadsheets of data, and then publish on our system, maybe throwing a subset of the data we collected into a supplementary file. But big picture questions that look across scales, ecosystems, and approaches rely on big data — and big data is often an amalgam of many small datasets from a wide array of scientists. Small (or any size) datasets that are publicly available, and easy to access in data repositories instead of old lab notebooks or defunct lab computers, are much more likely to have legs, to get re-used and re-tested, and contribute to the field at large.

While PLOS was on the vanguard of Data Availability Statements among peer-reviewed journals, Federer’s review of the contents of these statements makes it clear that we are not yet in the shiny future of Open Data. PLOS’s data availability policy “strongly recommends” that data be deposited in a public repository; Federer found that only 18.2% of PLOS papers named a specific repository or source where data were available. Most Data Availability Statements direct the reader to the paper itself or its supplementary information. Even among the articles that pointed to a repository, some statements named the repository but failed to include a URL, DOI, or accession number — basically sending readers on a wild goose chase to locate the data within the repository.

Other statements seem to have been entered as placeholders, potentially intended to be replaced upon publication of the article, such as “All raw data are available from the XXX [sic] database (accession number(s) XXX, XXX [sic])” or “The data and the full set of experimental instructions from this study can be found at <repository name>. [This link will be made publically [sic] accessible upon publication of this article.]” These two articles, published in 2016 and 2015, respectively, still contain this placeholder text as of this writing.

These examples of placeholders that made it into publication are embarrassing, but human, and as Federer points out, Data Availability Statements should be reviewed by editors and peer reviewers with the same scrutiny that we apply to study design, statistical analyses, and citations. I have worked on meta-analyses and projects that depend on data from existing digital archives. The frustration of chasing down supplementary information, Dryad DOIs, and GitHub addresses only to find a dead end or a broken corresponding-author email address is a feeling akin to discovering squirrels chewing through temperature logger wires halfway through the field season. Federer notes that the tide is turning towards open data: after a rocky start in 2014 — Federer’s team parsed many papers likely submitted before (but published after) the Data Availability policy went into effect — 2015 and 2016 saw the percentage of papers lacking a Data Availability Statement drop dramatically. Over the same time period, Federer notes slight increases in the number of statements referring to data in a repository and fewer that claim the data are in the paper or — shudder — available upon request.

At a broader level, open data has become a newly politicized topic. The EPA recently proposed new standards that would bar scientific studies from informing regulation unless all of the raw data were publicly available and the results could be reproduced. This is not so much a gold standard as a gag rule.

In a PLOS editorial, John P. A. Ioannidis points out that while “making scientific data, methods, protocols, software, and scripts widely available is an exciting, worthy aspiration,” by eliminating all but so-called perfect science from the regulatory process the EPA is committing to making decisions that “depend uniquely on opinion and whim.” Most of the raw data from past studies are not publicly available — and as Federer’s research shows, even in an age of required Data Availability Statements, open data is still a work in progress. And so we beat on — scientists against anti-science Environmental Protection Agency administrators, borne back ceaselessly in support of publishing accessible, open data as a kind of green light to past research.

References:

Federer LM, Belter CW, Joubert DJ, Livinski A, Lu Y-L, Snyders LN, et al. (2018) Data sharing in PLOS ONE: An analysis of Data Availability Statements. PLoS ONE 13(5): e0194768. https://doi.org/10.1371/journal.pone.0194768

Ioannidis JPA (2018) All science should inform policy and regulation. PLoS Med 15(5): e1002576. https://doi.org/10.1371/journal.pmed.1002576