Tuesday, 28 July 2009

Provenance metadata: what and how to record it?

To effectively curate the research data produced by the two research groups participating in the EIDCSR project, it is crucial to capture provenance metadata that explains how the data was generated in the first place. This information enables validation and increases the value of the data.

The research groups in our case collaborate as part of a BBSRC-funded project and generate MRI and histology data in laboratories using a variety of instruments and techniques. These datasets are then manipulated through processes such as segmentation to create 3D meshes (volumetric elements) that are used to run computational simulations.

So what provenance metadata should be recorded, and are there any subject-specific metadata standards appropriate for these datasets?

Interviews with the researchers involved in the generation of data have shown that they are well versed in recording information about their experiments in their lab notebooks. When writing research articles they go back to these notebooks in order to document their methodologies. Therefore, I believe it is fair to assume that researchers know what information needs to be recorded about their experiments and simulations.

Discussions with the person responsible for the BBSRC Data Sharing Policy around metadata standards pointed us to the Minimum Information for Biological and Biomedical Investigations (MIBBI) portal. This resource provides minimum information guidelines for diverse bioscience domains and provides a registry of projects developing those guidelines.

A metadata standard widely used internationally for experimental data is the Scientific Metadata Model developed by CCLRC (now STFC). At the top level, the model includes information describing the study and the set of investigations (i.e. experiments, measurements, simulations, etc.) involved in that study. Then, for each investigation, it records specific information about the data:

Data holding - A logical hierarchy of the Data Collections and Atomic Data Objects and their directory-style grouping. The Data Holding can be considered as the ‘root’ of the data file/object system.

  • Data description - A description of the data kept in this data holding from the data archive perspective, including information such as name, type, status, quality and software.

      - Logical description - Reference to a set of logical description fields such as parameter [name, id, class, units, value, facilities used, range], time period or facility used.

  • Data collection - Data Collections in the hierarchy of data organisation used in this Investigation; much like directories in a file system, they can be nested.

  • Atomic data object - Atomic Data Objects (files, blobs, named selects, etc.)

  • Related reference - Other Studies/Investigations related to this Data Holding and their type of relationship, e.g. derived from or used by.

  • Data holding locator - A locator for addressing the overall Data Holding (URI of top-level directory or data).
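To make the shape of this hierarchy more concrete, here is a minimal sketch in Python. It is only an illustration of the structure described above, not the actual schema of the Scientific Metadata Model: all class names, field names and example values (the locator, file names and relationship) are my own assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Parameter:
    """One logical description field (hypothetical subset of the fields)."""
    name: str
    units: str
    value: float


@dataclass
class DataDescription:
    """Archive-level description of the data kept in a holding."""
    name: str
    type: str
    status: str
    software: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class AtomicDataObject:
    """A single file, blob or similar object."""
    name: str


@dataclass
class DataCollection:
    """Directory-like grouping; collections can be nested."""
    name: str
    objects: List[AtomicDataObject] = field(default_factory=list)
    children: List["DataCollection"] = field(default_factory=list)

    def walk(self, prefix: str = "") -> Iterator[str]:
        """Yield the path of every atomic object below this collection."""
        path = f"{prefix}/{self.name}"
        for obj in self.objects:
            yield f"{path}/{obj.name}"
        for child in self.children:
            yield from child.walk(path)


@dataclass
class RelatedReference:
    """Link to another study or investigation, with the relationship type."""
    target: str
    relationship: str  # e.g. "derived from" or "used by"


@dataclass
class DataHolding:
    """Root of the data file/object system for one investigation."""
    locator: str  # URI of the top-level directory
    description: DataDescription
    root: DataCollection
    related: List[RelatedReference] = field(default_factory=list)

    def paths(self) -> List[str]:
        """Full locations of all atomic objects in this holding."""
        return [self.locator + p for p in self.root.walk()]


# Entirely hypothetical example values, for illustration only.
holding = DataHolding(
    locator="file:///data/heart01",
    description=DataDescription(
        name="Heart MRI", type="image", status="raw", software="scanner-sw",
        parameters=[Parameter(name="resolution", units="um", value=25.0)],
    ),
    root=DataCollection(
        name="mri",
        objects=[AtomicDataObject("scan001.raw")],
        children=[
            DataCollection("segmented", objects=[AtomicDataObject("mesh.vtk")])
        ],
    ),
    related=[RelatedReference("histology-investigation", "derived from")],
)

for location in holding.paths():
    print(location)
```

The nesting of collections is what lets a derived dataset (here, a segmented mesh) sit alongside the raw data it came from, while the related reference records the provenance link to another investigation.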

How can this complex workflow process, which involves several research groups with specialist skills and a variety of tools and techniques, be recorded?

An answer to this question may be obtained by looking at the work of our colleagues in Southampton. Some weeks ago Simon Coles and Jeremy Frey visited the OeRC to tell us about their work on electronic lab notebooks. They have been involved in projects such as Smart Tea and CombeChem that deal with the management of laboratory information. Initially they explored the idea of replicating printed lab notebooks using tablet interfaces that capture structured information, which has the benefit of yielding good semantic information. In addition, they have experimented with laboratory blogs, which allow the process followed to be recorded step by step and the data to be discussed, while providing the flexibility and power of web 2.0 technologies.

Monday, 13 July 2009

EIDCSR website launched

The EIDCSR website was launched last week at http://eidcsr.oucs.ox.ac.uk

This new site will contain information about this JISC-funded project, including background and methodology. It will aggregate posts from this blog and project bookmarks, and will link to the reports, presentations and papers resulting from project activities.

Friday, 3 July 2009

Response to the DCC Data Management Plan Content Checklist

In mid-June the Digital Curation Centre (DCC) circulated for consultation a draft template of a Data Management Plan Content Checklist. This checklist is intended to act as an aid for researchers when producing data management plans (DMPs). The aim of the public consultation was to obtain feedback about the draft checklist as well as the desired functionality for an online tool to be developed.

The following response has been gathered from internal discussions at the University of Oxford amongst members of the JISC-funded Embedding Institutional Data Curation Services in Research (EIDCSR) project.

Members of the EIDCSR project consider that the draft template represents a significant step forward towards the support and standardization of data management plans as an integral part of an application for funding. The document covers many of the issues that need to be considered at the outset of a research project, and the web-based tool might be of real benefit to researchers and those supporting the application process within universities.

The feedback below is organized into two sections covering the checklist and the desired functionality for the online tool.

Specific feedback about the checklist

  • More than a checklist where researchers can tick boxes, this seems to be a form to gather qualitative information about the research project, the plans and intentions for managing research data as well as researchers’ perceptions on issues like anticipated volumes or foreseeable uses of their data.
  • In order for the sections in the document to follow the DCC lifecycle model, section 3 on access and data sharing should be placed after section 6 on short-term storage.
  • Section 6.2 deals with where the data will be stored, yet the section is not marked in bold. The storage media chosen is a crucial aspect of data management and needs to be a core section.
  • It may be worth starting this exercise from another perspective: if such a plan is going to be peer-reviewed, what practice would be accepted and what practice would fail a peer-review process?
  • Some of the sections need to be expanded to become more comprehensive. Section 2.3 could include questions about whether the data will contain personal or health information and whether consent forms will be used.
  • Section 4 on data collection should ask who will create or capture the data and in what country this will happen (different countries will have different laws for data collection and sharing).
  • Section 7 should ask who will take responsibility over time for making decisions about the data when the original participants have gone and whether there is a process in place for transferring responsibility.
  • Section 3 could mention access to and re-use of metadata (e.g. harvesting) separately from access to and re-use of the actual data.
  • Quality of data needs to be addressed too. Will the data be peer-reviewed? Is there some sort of kite-mark or indicator that data has been peer-reviewed?
  • Issues such as the closure of the data store and the associated responsibilities should also be covered in this checklist.

Desired functionality for online tool

  • It is crucial to define clearly what the aim of this interactive web-based tool will be and what it will do for researchers and those supporting them in the application process. It may be worth discussing the functionality with researchers who currently need to provide a DMP with their applications, to better understand their needs as well as those of other staff involved in the application process.
  • It may be worth thinking about how to encourage researchers to use this online tool to generate a DMP and then include it in their application. Could they get a sort of “seal of approval” from the DCC saying that they have used its tool and guidance to develop their DMP?
  • Acceptability of the resulting checklist to funding agencies matters: if a funding agency supported, encouraged, or required its use, there would be more chance of it being taken up.
  • Apart from the examples of best practice, how can researchers get guidance to develop these plans if they don’t have the required expertise to fill in one of the sections? Would the DCC provide the support required?
  • Particular areas of functionality that such a system may need to have include:
  1. The capacity to export the data so that the information can be included with the actual funding application proposal. Could it also be adapted for use as a reporting mechanism later in the project, as some of the data management actions take place? Plans may have to change because of circumstances, and the tool should be able to record such changes.
  2. Examples of best practice in data management across several distinct research disciplines.
  3. Advice on: legal and ethical issues for collecting and sharing data, standards for file and metadata formats, storage options, back-up, secure archives for long-term curation, etc.