Motivating data producers

Chris Taylor

Monday, 01 Jun 2009 14:13 UTC

Hi.

So I’d like to kick this off by going straight to what I at least perceive to be the root: Why should a data producer want to (a) make the effort to share and (b) do more than the bare minimum?

At present, there are many examples of data sharing policies promoted by funders, and a trend amongst publishers to require more information to accompany published articles. This is good, but in both cases these are ‘sticks’. What you get with a stick is the bare minimum; bare both in additional detail, and in the effort to ensure the robustness and digestibility of data.

Add to that the paucity of mechanisms for sharing any but the smallest and simplest data (noting that there are a small number of exceptions) and we have the current state of affairs: Limited numbers of minimally-annotated data sets scattered across a number of locations (a huge issue in itself), arbitrarily-structured and frequently in something computationally unpleasant like a PDF (looks pretty, but an arse to deal with).

If I get it, this meeting is about making data sets amenable to thorough quality assessment, and more generally, making them accessible to all (subject to IP constraints). This requires that data producers do a lot more work, but why should they?

After a significant stint in standards, I’ve learned that pro bono arguments don’t wash, neither do ‘obvious’ ones like pointing at GenBank as a demonstrably useful artefact. The sticks aren’t that effective either tbh. What matters, and what will get data producers to do the right thing and more, is meaningful credit.

At a recent RIN meeting a group of us discussed all this and drew the conclusion that the missing piece is a mechanism by which many kinds of credit can accrue to an individual; publications (obviously), but also citations of publications, teaching/training input, reviewing, adding value (~curation/wikipedia-style), and crucially for this (probably overlong) argument, data sets and their onward citation.

The argument is that the better a data set is annotated, the more chance of it being useful, and the more diverse the ways in which it might be reused. If credit is awarded for the reuse of a data set, then the producer benefits (assuming that assessors respond appropriately). Therefore it is in the producer’s interest to share as much as possible, and to annotate as richly as they can.

If data producers do a good job, their data can be validated by a form of community oversight, and where feasible (cash-wise) can be reused in resources developed for the general public. Also the chronic lack of curators becomes less of an issue (though does not go away).

So what we need are (1) DOIs (leading candidate imho) on data sets etc., and (2) some sort of universal digital ID for producers (which would incidentally get rid of the problems around author identities on papers). Good deeds can then be rewarded.

Incidentally, consensual guidance on what to report is available in many cases, as are vocabularies and formats to mark data up and move them around. There are even some public databases for some uses, though the official plan for the UK seems to be that as (first and) last resort, individual institutions will maintain repositories (a user’s nightmare [fragmentation = loss of potential] that could be mitigated slightly by a central registry of holdings, though only a UK data centre would really do the job).

It’s all very Darwinian really: You get what you select for (and there’s no such thing as altruism).

Cheers, Chris.

Updated 01 Jun 2009 18:36 UTC

  • Replies

    • Hey Chris – thanks for being the first to post!

      I guess I’d pick up on two broad points.

      Firstly, the forum topics are broader than data sharing, though this is mentioned. We wanted to explore the idea of how the research paper as the ‘article’ of information that scientists use to inform their work is changing/may change. Indeed, inherent to this – I think – is the opportunity (and challenges) to discover, understand, link, cite, etc. down to the level of individual pieces of information within it, whether that be a dataset, image or whatever.

      Secondly, I agree with what you say about data. In my previous job I was involved in establishing data sharing policy for the UK Medical Research Council, and trying to get some practical tools and guidance and pilot a service to support it – chicken and egg I know!

      You mention DOIs – this BL presentation on the background to a new European DOI registration agency for research datasets may be of interest to you.

    • You raise many interesting points, Chris, but I wanted to ask about your suggestion of a doi for data sets. Don’t some already have them, in effect, via their accession numbers? Or is your point that these are too specific to a specific database, eg PDB, to be scalable?

    • Omg a couple of people told me at that RIN meeting that our group / I needed to talk to Adam Farquhar. It’s been on my to-do list for a few weeks but it just got promoted (= thanks for the link to the presentation, which I just read).

      Slides 6-11 are just bang on and the stuff about the JRA from 12 onwards is potentially fantastic — is that well-backed do you know?

      Anyway so @ Allan (with thanks again for the link): I did only address a small portion of this forum’s scope, but because I see that as the linchpin for the rest. I certainly am a subscriber to the idea that the new publication is generally (especially in the sciences) more than the pages in the journal (however delivered). It is a body of work that is published and the paper is kind of a news piece based on it. In fact that was always the case, it’s just that getting the ‘rest’ of the publication used to be a lot trickier.

      Really what I was getting at wasn’t so much data sharing; for a start I think that is increasingly a given (looks around for experimentalist with garotte). The big issue as I see it (crucial to expectations of community oversight, data reuse or attempts to make data accessible to a wider audience) is whether people do a good job of releasing data.

      People strive to make papers the best they can because they know that a badly written paper limits its appeal and if no-one reads it / gets it they won’t get cited (=lifeblood). There is though not much selection pressure for good data set annotation / presentation / accessibility. I think that passing credit back for a job well done (via an analogous peer mechanism) is the route to success.

      @ Maxine, thanks. There are absolutely examples of DOIs already fulfilling the role yes, but what I’m after is a more general mechanism. And as luck would have it, in storm the BL with it all set to go, it seems. How tickled am I :)

      If we can find a similarly-mature personal ID (maybe OpenID, I don’t know enough) and if failsafes can be created for identity theft and all that sort of thing then we’re well on the way.

      Anyway I’m off to email Adam Farquhar…

    • Incidentally, Allan are you actually working with / over / under Adam directly and am I right to assume he’s the best person to annoy?

    • @Chris – yes Adam is your main contact re. BL-wide role in DOI registration. I work in the Science Technology & Medicine team which is involved in a range of activities (engagement with researchers, new services and new types of content that add value, etc). Thus sort of in parallel to Adam, for example the Science team may work with a group of say – oceanographic researchers – to understand how they currently discover, access, use and share datasets and see if there are any barriers…

      @Maxine – yes there are indeed identifiers for some types of data. And I think (and this is not a plug) early on Nature and some other journals helped push the agenda by including requirements for authors to include accession numbers, etc. As you imply, cross-linking and scalability may be an issue.

      We are not looking to duplicate or usurp but to provide an option where there may be an unmet need – if the need’s met, then we don’t mind!

    • Hiya. Thanks for the clarification.

      I know it was only an off-hand example, but are you actually working with oceanographers at the moment? It’s just that I’ve been having a slew of conversations around oceanography and reporting/standards this morning (in the context of a handful of large collaborations) and I know of someone that would be very eager to talk to you wrt grants and things.

      So if it wasn’t plucked at random, can I follow up on this with you off line?

      Cheers, Chris.

    • Just to say, yes indeed, Nature does require deposition and accession numbers to appropriate databases. We just had an editorial this week about genetically modified mice – which apparently only very few researchers make available via the excellent repositories available. It’s free to read online. Given the vast range of types of material and databases available, my mind boggles at unifying them all – but that is one of the dreams of the semantic web, of course. Our listing of some of the databases in which we require authors to deposit data and materials is here, we always welcome suggestions for additions – we are very short on the physical sciences for example. (Suggestions can be sent to authors@nature.com)
