How to share and store data in an electronic lab notebook

Posted by Rory on September 23rd, 2010 @ 5:16 pm

In this blog I usually look at data sharing from the point of view of the core research unit, the lab. That was the perspective I adopted a couple of weeks ago in a presentation, Electronic lab notebooks in biomedical research, at the Storing, Accessing and Sharing Data: Addressing the Challenges and Solutions event co-hosted by the Scottish Bioinformatics Forum and S3 in Edinburgh. I’ll come back to that perspective in a minute, but first I’d like to contrast two very different institutional perspectives on data management described at the conference.

Sanger Institute:  centralized institutional data management

Phil Butcher, head of IT at the Sanger Institute, started with a high-level overview of data management issues at Sanger. He focussed mainly on the rapid growth in the amount of data generated at Sanger and at the other institutes with which it has large-scale collaborations, and on the issues of storing and finding data when there is so much of it. The impression I came away with is that at Sanger data is viewed as an institutional matter, not something that individual labs or scientists manage or, apparently, have much of a say in. That makes sense, because the research projects Phil mentioned were all large scale, involving large numbers of scientists and the generation of huge amounts of data. The title of Phil’s talk, Scaling up Science and IT: Sanger Institute’s Perspective, reflects the centralized approach.

London Research Institute:  decentralized institutional data management

The next speaker, Jeremy Olsen, head of IT at the London Research Institute, started by saying that, based on Phil’s description of Sanger, the London Research Institute was very different indeed, more a collection of individual research groups. In describing his LRI perspective Jeremy said that he would be sticking up for the “little guy”. He proceeded to give a brief overview of how research is carried out at the LRI, introducing the various research groups and their research interests. The LRI represents a very different paradigm from Sanger; at the LRI decentralization rules, as reflected in the title of Jeremy’s talk, Data Growth and Management in a Diverse Life Sciences Environment. At the LRI there are fundamental issues relating to getting a handle on what research the various groups are involved in, what data they generate, and how they manage it. Progress would need to be made on understanding these issues before it would be possible even to consider a centralized approach to data management and what that might entail.

The lab: bottom up data management

When it came time for my presentation, I started by saying that if Phil was representing the centralized institutional approach, and Jeremy was looking at the “little guys” from an institutional perspective, I was going to look at the issue of data management and sharing from the point of view of the little guy him/herself, i.e. the PI. In the academic context, it’s important to note that the Sanger model is the exception and the LRI’s decentralized model is the rule. In fact it is almost certainly the case that the LRI, decentralized as it is, still sits towards the more organized and centralized end of the spectrum of academic biomedical institutions. That point was reinforced for me when speaking recently with the IT director of a medium-to-large biomedical research institute in Australia (800 people, including 700 scientific staff). His description of the issues he faced in getting a grip on what data there was in the labs at the institute, how they managed it (if they managed it at all), and his uncertainty about how to help PIs get a better handle on their data was uncannily reminiscent of Jeremy’s description of the situation at the LRI.

From the perspective of IT managers tasked with, among other things, trying to bring some order to the data generated by the research groups at their institution, to store it in a cost effective fashion and have it archived in a way that is useful in the future, multiple PIs generating ever increasing amounts of data may be a ‘problem’ to be managed or dealt with.  But from the PIs’ point of view it is their data and theirs to manage (or not) as they want.  There is a pretty fundamental difference in outlook here.

Electronic lab notebooks — part of the solution?

In my presentation I asked where electronic lab notebooks might fit into this picture, and whether they could have a role to play in crafting better data management solutions that meet the objectives of both PIs and IT directors.

ELNs tick some of the key boxes IT directors look for in best practice in data storage and sharing, including:

  1. Storing metadata in a structured fashion and ensuring controlled access.
  2. Effectively managing different data types, including attachments and imports.
  3. Allowing improved indexing and search, through the use of structured metadata.
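To make those three points concrete, here is a minimal sketch, in Python, of what an ELN record with structured metadata, controlled access and metadata-driven search could look like. The class and field names are hypothetical illustrations, not the schema of any actual ELN product:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an ELN record: structured metadata fields plus a
# simple access-control list. Not modelled on any real product's schema.
@dataclass
class ElnEntry:
    title: str
    metadata: dict                                  # e.g. {"organism": "E. coli"}
    attachments: list = field(default_factory=list) # imported files, images, etc.
    allowed_users: set = field(default_factory=set) # controlled access

    def can_read(self, user: str) -> bool:
        return user in self.allowed_users

def search(entries, **criteria):
    """Search over structured metadata fields (box 3 in the list above)."""
    return [e for e in entries
            if all(e.metadata.get(k) == v for k, v in criteria.items())]

entries = [
    ElnEntry("PCR run 12", {"organism": "E. coli", "assay": "PCR"},
             allowed_users={"alice"}),
    ElnEntry("Western blot", {"organism": "mouse", "assay": "blot"},
             allowed_users={"alice", "bob"}),
]
hits = search(entries, assay="PCR")
```

Because the metadata is structured rather than free text, the same fields that drive search can also drive archiving and export later on.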

Electronic lab notebooks can also solve the key data management problem facing many PIs: coordinating the wide diversity of data types generated by a large number of people within the lab. They can, that is, if they meet the following key requirements of today’s PIs:

  1. The ELN is flexible and can be set up the way the PI and their lab want it set up.
  2. It’s easy for the lab to transfer to the ELN.
  3. The ELN facilitates better exchange of information between members of the lab and, over time, better archiving.
  4. The ELN is web-based and hence accessible anywhere, anytime.

So, electronic lab notebooks can help to solve the key data management issue faced by the core unit in academic institutions — labs. And they provide a platform for data management that IT directors looking at the problem from an institutional perspective can work with. As such they can be part of a solution which benefits both PIs, who are concerned with the research done in their group, and IT directors, who are concerned with the data generated throughout their institution.

Provenance in electronic lab notebooks

Posted by Rory on August 11th, 2010 @ 7:00 am

In this post I’d like to stimulate some discussion about provenance in electronic lab notebooks, and more generally in documenting biomedical research. This issue is of interest to various groups of people, but those groups usually don’t talk to each other. I’ll begin with observations on the issue from three people. One is a biochemist who is also a leading commentator on documenting and communicating about biomedical research, the second is a thoughtful scientist who works in a lab and is constantly looking for ways to get better organized in capturing her research, and the third is an informatician working on a project to bring the benefits of databases — including provenance — to wikis, with a particular focus on biomedical research. Perhaps this post will stimulate some cross-fertilization of ideas that otherwise might not take place.

The first person is Cameron Neylon.  Cameron has written a lot about different aspects of provenance in research, and helped organize a workshop on the issue in April where he delivered a presentation called In your worst nightmare:  how experimental scientists are doing provenance for themselves.  For the purposes of this discussion I’m going to focus on some comments Cameron made recently in a discussion started by Jonathan Eisen about possible electronic lab notebook systems.  Commenting on versioning and provenance, Cameron said,

“. . .versioning systems (generally) fail to provide a good way of capturing or thinking about the process that converts one thing to another. So I think the provenance problem or the process problem is the more interesting one.”

The second person is Kim Martin, who works at the Division of Pathway Medicine at Edinburgh University.  Kim has a strong interest in organizing her research and communicating with colleagues in an efficient way. Like Cameron, she feels that a simple audit trail showing all past versions of a particular record provides only a very limited perspective on the research process.

Kim has developed the idea of a journal view or ‘journalling’ in an electronic lab notebook as a way of looking back at the process of her work during a particular period of time. To do this she wants to be able to very easily create a snapshot of everything she was doing on a particular day. Here is Kim’s sketch of how such a ‘journal view’ might look:

Kim’s concept is that the electronic lab notebook would, through automatic linking, support the creation with a single click of a ‘journal view’ of research and related activity undertaken on any given day. One of Kim’s key objectives is to gain insight, as a mnemonic device, into the process of research that may have been undertaken some time ago. I think she shares this objective with Cameron — it would be interesting to get Cameron’s views on this.
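Mechanically, the journal view amounts to grouping everything touched on a given day, wherever it lives in the notebook. The following is my own minimal sketch of that idea, not Kim’s design; the record fields and section names are invented for illustration:

```python
from collections import defaultdict
from datetime import date, datetime

def journal_view(records, day):
    """One-click daily snapshot: collect every record from `day`,
    grouped by the notebook section it belongs to."""
    view = defaultdict(list)
    for rec in records:
        if rec["timestamp"].date() == day:
            view[rec["section"]].append(rec["title"])
    return dict(view)

# Invented example records; a real ELN would supply these via its own links.
records = [
    {"title": "Gel photo",  "section": "Imaging",
     "timestamp": datetime(2010, 8, 10, 9, 30)},
    {"title": "PCR set-up", "section": "Protocols",
     "timestamp": datetime(2010, 8, 10, 14, 0)},
    {"title": "Old notes",  "section": "Protocols",
     "timestamp": datetime(2010, 8, 9, 11, 0)},
]
snapshot = journal_view(records, date(2010, 8, 10))
```

The point of the sketch is that if records are already linked and timestamped automatically, the day-by-day view falls out of a simple query rather than any manual curation.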

The third person is James Cheney, at the Laboratory for Foundations of Computer Science, Edinburgh University.  I met James when we both spoke at the Biomedical-data day held at Edinburgh University in June. James gave a presentation called Databases + Wikis = Curated Databases.  Among the core areas of expertise of James’ group, which is led by Peter Buneman, is provenance for database queries and updates. They are working on a project aimed at bringing the benefits of databases, including the ability to deploy more sophisticated provenance, to wikis.  The project involves developing a “database wiki” which includes support for provenance and user queries about provenance, including the following planned features:

  • Basic provenance: record basic information about changes (userids of logged-in users, IP addresses of unknown users).
  • [DONE 0.2] Copy-paste provenance: record provenance links relating data in consecutive versions of the tree.
  • Import provenance: provide the ability to import data from other sources (including other DatabaseWikis) while automatically recording source information.
  • Query provenance: propagate provenance along with queries embedded in pages, to support user queries about provenance.
  • Bulk update provenance: provide the ability to rearrange data within DBWiki pages using bulk updates while automatically recording provenance for these transformations.
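To give a flavour of what the first few features involve, here is a toy sketch of an append-only provenance log. It is my own illustration of the general idea, not the DatabaseWiki implementation, and all page names, users and sources in it are made up:

```python
from datetime import datetime, timezone

# Toy illustration only: records who changed a page, when, what kind of
# change it was, and (for copies/imports) where the data came from.
class ProvenanceLog:
    def __init__(self):
        self.records = []

    def record_edit(self, page, user, action, source=None):
        self.records.append({
            "page": page,
            "user": user,      # userid, or IP address for an unknown user
            "action": action,  # e.g. "edit", "copy-paste", "import"
            "source": source,  # origin of copied or imported data, if any
            "time": datetime.now(timezone.utc),
        })

    def history(self, page):
        """Answer a simple user query: who touched this page, in order?"""
        return [r for r in self.records if r["page"] == page]

log = ProvenanceLog()
log.record_edit("GeneX", "alice", "edit")
log.record_edit("GeneX", "192.168.0.7", "copy-paste", source="GeneY rev 3")
log.record_edit("GeneZ", "bob", "import", source="another DatabaseWiki")
```

Even this crude log shows why provenance is more than versioning: the "copy-paste" and "import" records capture the *process* that moved data around, which a plain audit trail of page versions would lose.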

With that background, it would be great to hear more from Cameron, Kim, James and others about:

  1. The nature and details of the research ‘process’ that needs to be captured.
  2. Reactions to Kim’s journalling idea — general reactions and also views on whether it provides a good (or at least a useful) angle on the research process, and how it might be modified to capture other aspects of the research process.
  3. Reactions to James’ planned provenance features — e.g. are these features likely to be useful to biomedical researchers, what other kinds of provenance would be useful in capturing the research process?
  4. Other thoughts on process and provenance in biomedical research stimulated by the above.