Friday 18 December 2009

Scientific data repositories workshop in Barcelona


A couple of weeks ago I was invited to talk at an incredibly inspiring event organized by the Centre de Supercomputació de Catalunya titled "Repositorios de datos cientificos" under their Jornadas Catalanas de Supercomputació.

We had an extraordinary day with a fantastic group of speakers who discussed issues around supporting researchers with their data management, alongside disciplinary perspectives provided by real researchers.

The whole event was filmed and is available online (for those who speak Spanish!). I was also interviewed on camera for Global Talent, an online publication; you can watch that video too (again in Spanish!).

Thursday 26 November 2009

Data management policy in the University of Oxford

Neil Grindley, our JISC Programme Manager, has just posted "Data Management Policy: An Interview with Paul Taylor" in the JISC Information Environment Team blog.

Dr. Paul Taylor from the University of Melbourne is now finishing a secondment at the University of Oxford as part of the EIDCSR Project. He was instrumental in the development of the University of Melbourne's data management policy, and his work in Oxford will help produce a draft research records and data management policy together with a plan of action to implement it.



Image from: http://www.flickr.com/photos/28481088@N00/3352123533/

Friday 23 October 2009

EIDCSR technical analysis: from soft to hard

After having conducted the EIDCSR audit and requirements analysis exercise, we have started converting the high level requirements gathered into technical requirements. The idea is to produce a systems design document for a Systems Developer to start with the implementation. Howard Noble, from Computing Services, is leading this exercise for the next two months.

To start with the technical analysis, Howard and I have had a very fruitful meeting this morning. We have brainstormed ideas for a high level system design trying to identify the practical things that can be done to support the data management workflows of the research groups taking part in EIDCSR.


Using a board to produce a "rich picture" recording the processes we have encountered and our thoughts was extremely useful. We will now produce a "cleaner" version of this picture and present it to key people in the research groups at a workshop. This will hopefully help us to communicate what the project aims to achieve, as well as to get feedback on the design so that researchers' requirements drive any development.




Thursday 15 October 2009

First EIDCSR workshop and executive board meeting

Yesterday was a busy day for the EIDCSR Project.

In the morning, the first project event took place at Rewley House in Oxford with an exciting group of speakers brought together under the theme of "Data curation: from lab to reuse". Their presentations are now available on the project website and a report will be produced shortly.

The afternoon was used to hold the first EIDCSR Executive Board meeting, where progress and next steps for the project were discussed with the extraordinarily helpful and encouraging members of the board.

Overall, a great day providing loads of food for thought.

Monday 12 October 2009

"Science these days has basically turned into a data-management problem"

The New York Times has an article about future scientists' ability to manage the large amounts of digital data being generated and how the likes of IBM or Google are trying to help, "Training to Climb an Everest of Digital Data", http://www.nytimes.com/2009/10/12/technology/12data.html. IBM and Google are contributing tools, computational power and access to large-scale datasets. It was actually two years ago this month that Google and IBM announced their partnership to provide universities with dedicated cluster computing resources, open source software, a dedicated website for collaboration, and a Creative Commons-licensed curriculum. In April this year the NSF funded projects at 14 US universities to take advantage of the IBM/Google Cloud Computing University Initiative. The New York Times article highlights some of these projects. The emphasis is certainly on the massive -- big compute clusters, big datasets -- and on data analysis. Not much though on the ongoing management of, access to, and preservation of data, even if Professor Jimmy Lin (University of Maryland) is quoted as saying, “Science these days has basically turned into a data-management problem”.

Wednesday 23 September 2009

EIDCSR workshop on 14 October

The first EIDCSR project workshop is taking place on 14 October. More details below:


Date and location

14 October at Rewley House, 1 Wellington Square, Oxford OX1 2JA

The event will start at 10.30 and will finish with lunch at 13.00


Description

This workshop is organized as part of the dissemination activities of the JISC-funded EIDCSR Project. The aim of the workshop is to hear about proven practice in selected data management areas identified as challenging for researchers through the EIDCSR audit and requirements analysis exercise. Whilst the EIDCSR Project is addressing the requirements of researchers working within medical and life sciences, the event is likely to be of interest to those working in, or supporting, other disciplinary areas.

The expected audience includes researchers who generate data in labs and computing simulations and staff from service units with an interest in research data management and curation issues.


Outcomes

Participants in the workshop will have the opportunity to learn about, and contribute to discussion of, different approaches to ensuring the flow of data between laboratory and in silico experimentation. In particular, the workshop will discuss:

* methods for the capture, storage and reuse of metadata in the laboratory;

* lifecycles integrating wet lab and in silico experimental data;

* for delivery and visualisation of large-scale data.

Programme

Some of the speakers will include:


Alan Garny, Oxford Department of Physiology, Anatomy and Genetics - Alan will discuss his research group's data management workflow and challenges.

Brian Brooks, Unilever Cambridge Centre for Molecular Informatics - Brian will talk about their Chemical Laboratory Repository In/Organic Notebooks (CLARION) Project.

Angus Whyte, Digital Curation Centre - Angus will share the experiences from the DCC SCARP Project on data management best practice.

Booking

To book a place please email eidcsr@oucs.ox.ac.uk

Wednesday 9 September 2009

Data audit and requirements analysis

One of the initial exercises conducted as part of the EIDCSR project was an audit and requirements analysis based on the Data Audit Framework (DAF). It aimed to document the data practices and assets of the research groups participating in the project and to capture their requirements for tools and services. This exercise took place throughout the summer and the report describing the results will be available soon.

As I explained in a previous post, these research groups collaborate under a BBSRC grant to conduct research on ventricular architecture. They use novel techniques such as Magnetic Resonance Imaging (MRI) and Diffusion Tensor MRI (DTMRI) and combine them with traditional histological techniques, image processing with data registration, and computational models for bio-mathematical simulation.

Their research workflow is well described by Plank et al. (2009)* in the diagram below. It starts with the generation of complementary image stacks that are then processed in different ways to generate meshes that can be used for computational modelling of the heart.


The result of this complex process produces the following data outputs:
  • Histology data: large high resolution images produced by microscopes in the lab representing sections of a heart.
  • MRI and DTMRI data: stacks of TIFF images resulting from the raw data produced by the magnet in a lab.
  • Segmentation data: outputs resulting from applying image segmentation techniques to the histology and MRI data.
  • Mesh data: volumetric model produced from segmented data in a mesh generator.
  • Simulations: electrophysiological simulation using the mesh data and other input files that define the models and the parameters.
  • 3D heart atlas: an average representation of the heart ventricles obtained from the histology and MRI data.
And the research group requirements can be grouped under three themes:
  • Secure storage: all the data outputs presented above are stored on a combination of desktop computers and a project NAS system and researchers realize the need to keep the data safe by having appropriate and resilient back-up procedures.
  • Data transfer: the histology data are large and need to be accessed by researchers within the groups and by others.
  • Metadata: currently the provenance metadata for some of the data presented above is recorded in printed lab-books. This information is crucial when making the data available to others and it is required when publishing articles based on the data. In addition to this, it may be helpful to improve searching within the NAS system.
*Gernot Plank, Rebecca A.B. Burton, Patrick Hales, Martin Bishop, Tahir Mansoori, Miguel O. Bernabeu, Alan Garny, Anton J. Prassl, Christian Bollensdorff, Fleur Mason, Fahd Mahmood, Blanca Rodriguez, Vicente Grau, Jürgen E. Schneider, David Gavaghan, and Peter Kohl. Generation of histo-anatomically representative models of the individual heart: tools and application. Phil Trans R Soc A 2009 367: 2257-2292.

Tuesday 28 July 2009

Provenance metadata: what and how to record it?

To effectively curate the research data produced by the two research groups participating in the EIDCSR project, it is crucial to capture provenance metadata that explains how the data was generated in the first place. This information enables validation and increases the value of the data.

The research groups in our case collaborate as part of a BBSRC-funded project and generate MRI and histology data in laboratories using a variety of instruments and techniques. These datasets are then manipulated through processes such as segmentation to create 3D meshes (volumetric elements) that serve as input for computational simulations.

So what provenance metadata should be recorded and are there any subject specific metadata standards appropriate for these datasets?

Interviews with the researchers involved in the generation of data have shown that they are well versed in recording information about their experiments in their lab notebooks. When writing research articles they go back to these notebooks in order to document their methodologies. Therefore, I believe it is fair to assume that researchers know what information needs to be recorded about their experiments and simulations.


Discussions with the person responsible for the BBSRC Data Sharing Policy around metadata standards pointed us to the Minimum Information for Biological and Biomedical Investigations (MIBBI) portal. This resource provides minimum information guidelines for diverse bioscience domains and provides a registry of projects developing those guidelines.


A metadata standard widely used internationally for experimental data is the Scientific Metadata Model developed by CCLRC (now STFC). At the top level, the model includes information describing the study and the set of investigations (experiment, measurement, simulation, etc.) involved in this study. Then, for each investigation, it records specific information about the data:

Data holding - A logical hierarchy of the Data Collections and Atomic Data Objects and their directory-style grouping. The Data Holding can be considered as the ‘root’ of the data file/object system.
  • Data description - A description of the data kept in this data holding from the data archive perspective, including information like name, type, status, quality and software.
      • Logical description - Reference to a set of logical description fields such as parameter [name, id, class, units, value, facilities used, range], time period or facility used.
  • Data collection - Data Collections in the hierarchy of data organisation used in this Investigation; they are much like directories in a file system and can be nested.
  • Atomic data object - Atomic Data Objects (files, blobs, named selects, etc.).
  • Related reference - Other Studies/Investigations related to this Data Holding and their type or relationship, e.g. derived from or used by.
  • Data holding locator - A locator for addressing the overall Data Holding (URI of top-level directory or data).
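To make this hierarchy a little more concrete, here is a minimal Python sketch of a Data Holding. It is only an illustration: the class and field names (DataHolding, DataCollection, walk, paths and so on) are hypothetical and are not taken from the CCLRC schema, and the locator and file names in the example are made up.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Parameter:
    """One logical-description parameter (hypothetical subset of fields)."""
    name: str
    units: Optional[str] = None
    value: Optional[str] = None


@dataclass
class DataDescription:
    """Archive-level description of the data in a holding."""
    name: str
    type: str
    status: str = "unknown"
    quality: Optional[str] = None
    software: Optional[str] = None
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class AtomicDataObject:
    """A single file, blob or similar object."""
    name: str


@dataclass
class DataCollection:
    """Directory-like grouping; collections can be nested."""
    name: str
    objects: List[AtomicDataObject] = field(default_factory=list)
    collections: List["DataCollection"] = field(default_factory=list)

    def walk(self, prefix: str) -> Iterator[str]:
        """Yield a path for every atomic object below this collection."""
        path = f"{prefix}/{self.name}"
        for obj in self.objects:
            yield f"{path}/{obj.name}"
        for sub in self.collections:
            yield from sub.walk(path)


@dataclass
class RelatedReference:
    """Link to another study/investigation, e.g. 'derived from'."""
    target: str
    relationship: str


@dataclass
class DataHolding:
    """Root of the data file/object system for one investigation."""
    locator: str  # URI of the top-level directory or data
    description: DataDescription
    collections: List[DataCollection] = field(default_factory=list)
    related: List[RelatedReference] = field(default_factory=list)

    def paths(self) -> Iterator[str]:
        for collection in self.collections:
            yield from collection.walk(self.locator.rstrip("/"))


# Made-up example: an MRI stack with a nested segmentation collection.
holding = DataHolding(
    locator="file:///example/heart01",
    description=DataDescription(name="heart01", type="image stacks"),
    collections=[
        DataCollection(
            "mri",
            objects=[AtomicDataObject("slice_0001.tif")],
            collections=[
                DataCollection(
                    "segmentation",
                    objects=[AtomicDataObject("mask_0001.tif")],
                )
            ],
        )
    ],
    related=[RelatedReference(target="mesh01", relationship="used by")],
)

paths = list(holding.paths())
for p in paths:
    print(p)
```

Nesting collections inside collections mirrors the "directories in a file system" idea, and the related references give a place to record links such as "derived from" between, say, a segmentation and the image stack it came from.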


How can this complex workflow, which involves several research groups with specialist skills and a variety of tools and techniques, be recorded?

An answer to this question may be obtained by looking at the work of our colleagues in Southampton. Some weeks ago Simon Coles and Jeremy Frey visited the OeRC to tell us about their work on electronic lab notebooks. They have been involved in projects such as Smart Tea and CombeChem that deal with the management of laboratory information. Initially they explored the idea of replicating printed lab notebooks using tablet interfaces that capture structured information, which has the benefit of good semantic information. In addition, they have experimented with laboratory blogs, which allow the process followed to be recorded step by step and the data to be discussed, while providing the flexibility and power of web 2.0 technologies.




Monday 13 July 2009

EIDCSR website launched











The EIDCSR website was launched last week at http://eidcsr.oucs.ox.ac.uk

This new site will contain information about this JISC-funded project, including background and methodology. It will aggregate posts from this blog and project bookmarks, and will link to the reports, presentations and papers resulting from project activities.

Friday 3 July 2009

Response to the DCC Data Management Plan Content Checklist

In mid-June the Digital Curation Centre (DCC) circulated a draft template of a Data Management Plan Content Checklist for consultation. This checklist is intended to act as an aid for researchers when producing data management plans (DMPs). The aim of the public consultation was to obtain feedback about the draft checklist as well as about the desired functionality for an online tool to be developed.

The following response has been gathered from internal discussions in the University of Oxford amongst members of the JISC funded Embedding Institutional Data Curation Services in Research (EIDCSR) project.

Members of the EIDCSR project consider that the draft template represents a significant step forward towards the support and standardization of data management plans as an integral part of an application for funding. The document covers many of the issues that need to be thought about at the outset of a research project, and the web-based tool might be of real benefit to researchers and those supporting the application process within universities.

The feedback below is organized into two sections covering the checklist and the desired functionality for the online tool.

Specific feedback about the checklist

  • More than a checklist where researchers can tick boxes, this seems to be a form to gather qualitative information about the research project, the plans and intentions for managing research data as well as researchers’ perceptions on issues like anticipated volumes or foreseeable uses of their data.
  • In order for the sections in the document to follow the DCC lifecycle model, section 3 on access and data sharing should be placed after section 6 on short-term storage.
  • Section 6.2 deals with where the data will be stored and the section is not marked bold. The storage media chosen is a crucial aspect of data management and needs to be a core section.
  • It may be worth starting this exercise from another perspective: if such a plan is going to be peer-reviewed, what practice would be accepted and what practice would fail a peer-review process?
  • Some of the sections need to be unfolded to become more comprehensive. Section 2.3 could include questions about whether the data will contain personal or health information and whether consent forms will be used.
  • Section 4 on data collection should be asking about who will be creating/capturing the data and in what country will this happen (different countries will have different laws for data collection and sharing).
  • Section 7 should ask who will take responsibility over time for making decisions about the data when the original participants have gone and whether there is a process in place for transferring responsibility.
  • Section 3 could mention access and re-use of metadata (eg harvesting) as separate to access and re-use of the actual data.
  • Quality of data needs to be addressed too. Will the data be peer-reviewed? Is there some sort of kite-mark or indicator that data has been peer-reviewed?
  • Issues such as the closure of the data store and the responsibilities should also be covered on this checklist.

Desired functionality for online tool

  • It is crucial to define clearly what the aim of this interactive web-based tool will be and what it will do for researchers and those supporting them in the application process. It may be worth discussing the functionality with researchers who currently need to provide a DMP with their applications, to better understand their needs as well as those of other staff involved in the application process.
  • It may be worth thinking about how to encourage researchers to use this online tool to generate a DMP and then include it in their application. Could they get a sort of “seal of approval” from the DCC saying that they have used its tool and guidance to develop their DMP?
  • Acceptability of the resulting checklist with funding agencies: if a funding agency supported, encouraged, or required its use, there would be more chance of it being taken up.
  • Apart from the examples of best practice, how can researchers get guidance to develop these plans if they don’t have the required expertise to fill in one of the sections? Would the DCC provide the support required?
  • Particular areas of functionality that such a system may need to have include:
  1. The capacity to export the data so that the information can be included with the actual funding application proposal. Could it also be adapted for use as a reporting mechanism later in the project, as some of the data management actions take place? Plans may have to change because of circumstances, and the tool should be able to accommodate that sort of situation.
  2. Examples of best practice in data management across several and distinct research disciplines.
  3. Advice on: legal and ethical issues for collecting and sharing data, standards for file and metadata formats, storage options, back-up, secure archives for long-term curation, etc




      Monday 29 June 2009

      Update on recent data related activities

      The Australian National Data Service (ANDS) has recently launched two new services: register my data and identify my data. These services allow researchers to upload descriptions of their datasets for these to be published and to have persistent identifiers for their datasets to enable continuity of access.
      ANDS also advertises on their website that they will be working with a number of research projects to assist them with their data management.


      The Open Data Commons, a project from the Open Knowledge Foundation (OKF), has just released v1.0 of their Open Database licence, an open share-alike licence for data and databases. It allows users to freely share, edit and use a database while maintaining this same freedom for others. The licence only governs the rights over the database and not its contents, which in some cases can be covered by other rights.




      The OKF has also started a distributed storage project called Open Data Grid, an open distributed grid for open data.

      Friday 5 June 2009

      Data imperative event

      The data imperative event organized by the RLUK/SCONUL Task Force on e-Research was held on Wednesday 3 June in Oxford with support from the Oxford e-Research Centre, RLUK, SCONUL and RIN.

      This was an excellent opportunity to confirm the extraordinary interest of librarians in this area, as well as the difficulty of clarifying their role and where the funding needed to address the challenge will come from. Chris Keene shares his notes on the event on his blog and RLUK will shortly be making the presentations available.

      In the meantime you can access Prof. Paul Jeffrey's introduction to the workshop and my talk describing Oxford's recent work in this area.


      Tuesday 2 June 2009

      Welcome to the EIDCSR blog

      Welcome to this new blog, where the JISC-funded Embedding Institutional Data Curation Services in Research (EIDCSR) project will be reporting on project outputs and other initiatives of interest.

      The EIDCSR project is an institutional collaboration in Oxford aiming to address the preservation requirements of two collaborating research groups in Oxford. Their data management and curation requirements will be scoped and selected elements of the digital curation lifecycle will be embedded, including policy, workflow, and sustainability solutions within the research process at an early stage. By joining up existing institutional and departmental services, as shown in the figure below, the project expects to use an approach that can scale up to address the data preservation challenges of research groups in Oxford.










