Research outcomes of multi-author collaboration using open data

Q: What do you get when you mix a room full of zooarchaeologists with 200,000 records from seventeen archaeological sites?
A. An exercise in herding cats
B. A research paper in PLoS ONE
C. Both of the above

For better or for worse, the answer, in this case, is “C. Both of the above.”

In 2012, with support from the Encyclopedia of Life and the National Endowment for the Humanities, we brought a group of scholars together to integrate faunal data from seventeen archaeological sites in Anatolia and to collaboratively address a research question using those data. Our interest in organizing this project came from a desire to see more actual research outcomes drawing on data from multiple, open datasets. Up to that point, there had been a lot of discussion of the potential for data integration, but very little applied research showing how it actually happens, what the results might be, and what we can learn from the process of data sharing.

A better understanding of data sharing and reuse is important because funders of archaeology are increasingly requiring data management plans and open data, but researchers lack information on how to meet these requirements. Good data management should mean that our data can be accessed, understood, and reused by others. But achieving those goals involves solving some hairy problems. We thought that a good starting point would be to gain a better understanding of how people use data that they didn’t create. Collaborating with researchers in the actual process of data reuse could help identify key requirements for effective and meaningful data management.

Organizing collaboration on this scale often feels like “herding cats”. Collaboration takes hard work and trust, and collaborating around data also requires patience, skills, methods, and expectations that we hope will become more mainstream. Everyone has other research, teaching, and service commitments, and we know time is precious. We are grateful that so many researchers participated in this study, committing not only their hard-earned data but also their creativity and thoughts on how to analyze these disparate datasets together. The success of this project was not a foregone conclusion; it depended on the trust and commitment shown by this team!

For this project, everybody in our Anatolia bone study shared their raw datasets (mainly Excel spreadsheets). No individual dataset was a significant challenge on its own, but when viewed as a whole, the group of more than a dozen datasets was daunting in its complexity. Though the projects all recorded similar fields, recording styles varied greatly. The datasets took many hours of editing and alignment before they were ready for integrated analysis. When we met at a mid-project workshop in Kiel, Germany, we had to work through many different opinions on just which aspects of these data could be compared with confidence, and where methodological, sampling, and other factors made comparisons problematic. The details of this process can be found in a paper we published this summer in the International Journal of Digital Curation. The paper outlines our editorial process, including the data cleaning and annotation steps that we performed to set the stage for analysis. It also discusses how these processes need to fit into larger systems of scholarly communications, including digital repositories, version control systems, and incentive structures.

As for the research paper in PLoS ONE… This is the part that comes after much “data wrangling” and discussion. Ben Arbuckle, of UNC Chapel Hill, spearheaded the data sharing effort, and his years of work building trust with this community were key to the project’s success. Project participants agreed to openly share their datasets in Open Context. The data came from archaeological sites in Turkey, spanning the Epipaleolithic through the Chalcolithic. Our aim was to explore how integrated datasets can inform us about the spread of early domestic animals westward across Turkey. The project highlighted a complex regional picture in the spread of agriculture, with particularly notable differences between coasts and inland regions. The research outcomes were published this summer in PLoS ONE. This project is the first of its kind involving the large-scale, digital publication and integration of zooarchaeological datasets. We hope that this model for archaeological collaboration will encourage others to build on the datasets published in this project, thereby contributing more data to further inform this particular research question and to address new questions.