Nature Precedings forum: topic

Who wants to be eating dirt? Musings on search, self-archiving, citations and "findability"

Hilary Spencer

Wednesday, 30 Apr 2008 22:54 UTC

A lively (and long) discussion follows Jennifer Rohn’s post, In which I get into a little muddle about archiving. I’d like to respond to something said about 2/3rds of the way down the comment thread, but I’m starting a new post as I’m not sure how many people will take the time to read all 109 comments (go Jennifer!)

In comment 50 or 60-something (ok-I didn’t count), Henry Gee poses the following hypothetical example:

Group X discovers something which they write down with a time stamp. Group Y makes the same discovery – by the time Group X gets wind of this and has mobilized its army of intellectual property droids, Group Y has published it and has gotten the credit. In the eyes of the world, Group Y has made the discovery and Group X is eating dirt.

Let’s modify this a bit: Say Group X posts a preprint of their manuscript on a preprint server like ArXiv, time-stamping it, but more importantly, making it easily available via Google. Even though Group Y publishes first, say they publish in a closed-access journal to which few universities have subscriptions (at the worst, say the journal doesn’t even make their articles available online—it’s print only). Now whenever someone is looking for articles on Z in Google, they tend to find Group X’s paper on ArXiv, but not the journal article. Group X’s article on ArXiv is cited a couple of times, while Group Y’s languishes behind a subscription wall, or, at the worst, in the university library’s stacks… So is “getting credit” getting the first publication, or getting one’s work consistently cited and used as a reference point?

Now, most journals have been moving towards making their content available online, and many universities have extensive journal subscriptions, so this might be an extreme example. Most people will cite a peer-reviewed article when given the choice of either citing a preprint or a peer-reviewed article. But people will almost always cite articles that they can find and read, over those that they can’t. (Isn’t it a breach of ethics to cite documents one hasn’t read?) Most publishers realize this, and are working to increase the “findability” of these articles, but sometimes they don’t do the best that they can. Sometimes sites like Google also rank papers on ArXiv higher than those on journal sites (though the question of the role of page rank in literature reviews might best be left for another day).

One might also argue that no one uses Google to search for research, though trends suggest that this is changing, especially with the current crop of undergraduates. 1, 2

If the published version is difficult to find, but one is able to read the preprint and finds it useful, then I suspect that one is more likely to ferret out a copy of the published version (even in the university stacks) for citation purposes. One is more likely to spend the time and effort trying to get a copy of an article that one already knows will be useful over an article that may or may not be. (This might explain why papers with posted preprints tend to get more citations than those without available preprints 3, 4 and why open access articles tend to have higher citation rates 5).

There are many stories of disputes over claims of inventions (the modern computer, photography, the radio, the telephone, the steam engine…). A recent article from the NY Times’s Week in Review discusses Thomas Edison’s invention of the phonograph, noting that 17 years prior to Edison’s patent, a Parisian inventor had already created a device to make visual recordings of songs. Who was he? Who knows? The article goes on to note that Edison was perhaps not the first to invent the lightbulb, and was only credited as doing so when the Supreme Court ruled that the prior inventor’s patent was too broad. Did you know this? I didn’t.

The Times article suggests that credit for inventions is often correlated with who is able to make theirs accessible to the public, and not necessarily with who was first, even in the filing of patent documents. The author also notes the importance of timing: “Great ideas, while perhaps not novel, are delivered to us…just as we’re hungry for them.” Perhaps being the first to publication isn’t always the key to receiving credit, just as being the first to patent doesn’t mean that you won’t be eating dirt later.


1 Student Searching Behavior and the Web: Use of Academic Resources and Google

2 Information Illiterate or Lazy: How College Students Use the Web for Research *Disclosure: I didn’t read this article because I don’t have access, so I’m citing the abstract.

3 The Citation Impact of Digital Preprint Archives for Solar Physics Papers
Preprint version

4 E-prints and Journal Articles in Astronomy: a Productive Co-existence
Preprint version

5 Citation Advantage of Open Access Articles

Updated 01 May 2008 19:59 UTC

  • Replies

    • To answer Noah I’m going to provide anecdote rather than evidence I’m afraid as I haven’t time right now to do a systematic study. Also I’ve only got access to Google Scholar for citation numbers.

      Anyway, compare two reports of the identification of the auxin receptor from 2005:

      Dharmasiri N., Dharmasiri S. & Estelle M. “The F-box protein TIR1 is an auxin receptor.” Nature 435, 441-445.

      Kepinski S. & Leyser O. “The Arabidopsis F-box protein TIR1 is an auxin receptor.” Nature 435, 446-451.

      Same issue of the same journal, back-to-back, practically the same title AND submitted on the same day. So why has the first got 267 citations and the second only 236?

    • From Scopus, the count is 213 vs. 299.

    • Thanks for the example Chris. I guess I was not expecting a systematic study, but had hoped that perhaps you had already seen data out there from others who had completed such a study. FYI, Scopus results give 213 and 199 for those two citations respectively. And ISI gives 201 and 194, respectively.

      I think that one has to be extremely careful with which numbers are used. Google’s counts typically include citations from any paper listed on any server anywhere, including multiple versions of the same paper referring back to the original. Therefore, there is a strong chance of counting the same citation twice at different locations. Scopus does state that it includes pre-print servers in their counts. I couldn’t find the exact resources for ISI, but I believe that they may only use post-print published materials from journals (online or otherwise) in their counts.

      Therefore, taking all of that information into account, the relationship between TOC location and citations breaks down quite a bit for these two papers (201 vs. 194). Even the Scopus data is a lot closer together. Is this really a systematic bias towards citing the first paper? I don’t know, but I find it hard to believe.

      The funny part is that when one does a PubMed search for “TIR1 auxin receptor”, you actually get the second paper listed first, since PubMed displays manuscript PMIDs in descending order. Since I tried to argue above that people would usually search PubMed to flesh out their bibliography with exactly the search I used, your positional premise combined with my bibliography-building premise would suggest that the second paper would actually be cited more since authors would come across that paper first in PubMed. Well, the citation numbers don’t bear this out, so the way the argument is structured is important.

      Perhaps the (TOC) first paper used different techniques that were more widely adopted by the field, was more detailed in their analysis of the receptor, or provided insights not mentioned by the second paper. All of those reasons could cause an increase in citation numbers as authors referred back to a particular diagnostic or experimental protocol in that particular paper. And by a quirk of editorial fate, it just happened to be published with earlier page numbers.

      A recent example off the top of my head from my journal:

      “Non–cell autonomous effect of glia on motor neurons in an embryonic stem cell–based ALS model.”
      Francesco Paolo Di Giorgio, Monica A Carrasco, Michelle C Siao, Tom Maniatis & Kevin Eggan
      Nature Neuroscience 10, 608 – 614 (2007)
      34 Citations

      “Astrocytes expressing ALS-linked mutated SOD1 release factors selectively toxic to motor neurons.”
      Makiko Nagai, Diane B Re, Tetsuya Nagata, Alcmène Chalazonitis, Thomas M Jessell, Hynek Wichterle & Serge Przedborski
      Nature Neuroscience 10, 615 – 622 (2007)
      41 Citations

      (For the record, Google Scholar gave 37 and 44 citations, respectively, to these papers…)

      Opposite result, but obviously, not enough time has passed to allow these manuscripts to have thoroughly sunk into the literature. In this case, I would argue that the Przedborski paper may continue to increase in citations over the Eggan paper simply because the former relies on a primary culture system that is more likely to be adopted and used by other researchers, as opposed to the embryonic stem cell-derived motor neurons used in the latter.

      Anyway, these are all interesting topics to debate, but I am still curious as to both your and Hilary’s thoughts on what I had suggested in an earlier post here. Namely, that although OA articles may on average receive more citations, with increased author diligence to archive their papers openly when they are able to, does a 6-month head start by the OA paper really make that big of a difference in the eventual overall citations? Has anyone seen research out there where the “closed-access” papers were parsed into bins representing those that were immediately and openly self-archived anywhere (everywhere) versus those where the author failed to complete this task? It would be interesting to see if self-archiving allows the non-OA papers to catch up in citation counts…

      Do I get the award for the longest comment ever on a NN forum posting?

    • Whoops!—mistyped the Scopus count above. It’s 213 vs. 199.

    • Chris writes: if two papers report the same thing at the same time in the same place they should by rights get cited together and so get the same number of citations. One shouldn’t get cited more simply due to a relatively arbitrary decision of the journal’s editors, and yet it invariably does.

      The subsequent discussion has focused on citation counts – and I am sure that Scopus, ISI and GS are accurate only to a certain degree (not the same in all cases, and not in a predictably systematic way in all cases).

      But apart from that aspect, two papers published simultaneously will often be of different quality. I can think of two obvious examples from my own field. In the first of these, the first paper was far skimpier than the second. After a while, nobody cited it, everyone cited the second, solid study. In the second example, one journal published two papers and another journal published an inferior paper at the same time (all by the same group, as it happens). Again, nobody cited the inferior paper.

      I think it would be very hard to perform a systematic study of this fascinating question because of qualitative factors such as the relative merits of the publications concerned, quite apart from errors within the citation databases (or number of downloads, or whatever metric is being used).

    • Great post and discussion. I would say that the central issues revolve around precedence and attribution (which in turn is influenced by citability). The issue is who did something ‘first’. And this is why most journals include submission date as well as publication date on papers. The submission date is the clear third party certified date on when this work was done.

      Now my argument would be that to get the earliest possible recognition of what you’ve done you should release the raw data and your conclusions immediately to get yourself precedence. The problem with that is, as Hilary questioned, will anyone find it, and if so how do they cite it. We’re at a slightly awkward point in terms of the technology at the moment but in a few years there won’t be any excuse for not searching and finding these things. But we need a way of citing them.

      To put it another way, I’m hardly going to leave my brilliant new scientific idea in this comment because a) I can’t cite it – it has no independent existence and b) I don’t know that this post will be here in two years time. We need to link up all the elements of the scientific process, proposals, data, papers, and discussion, and make it all citeable and linked up. Then the issue of findability goes away because everything is linked, the issue of citeability goes away, because each element has its own existence, and if there are third party or reliable datestamps there can be no argument over who did a specific thing first.

      And with my ‘open science’ hat on. If in the example Noah gave these people had known that they were both doing the same thing, how much time and effort could have been saved? Maybe no-one had to eat dirt? Would the papers have been better had the two teams worked together?

    • Then the issue of findability goes away because everything is linked

      I’m not actually sure that the issue of “findability” goes away… I think that the question of findability has to do with what tools we’re using to find things. Just because something is linked to doesn’t mean that it won’t still be buried. Journals in a library are easy to find if you’re using the card catalog, but Google won’t tell you which shelf to look on. Highly linked documents online are easy to find using Google because of their page rank algorithm. But if, in 10 years, everyone is using some other tool, then other articles might be more “findable”.

      Most of the time, authors don’t need to worry about this—it’s why they publish in journals, and why journals have websites, issue press releases, and send copies to libraries. What I was hoping to suggest is that with the constant focus on being first to publish, other factors related to credit are perhaps being overlooked.

    • If I might add a few thoughts to the excellent comment that Cameron made. If you use someone else’s work that requires citation and you don’t cite it, you have committed scientific misconduct. If the journal you would like to publish in won’t allow you to cite what you need to cite to avoid scientific misconduct, then you can’t publish your work in that journal.

      If you come across a brilliant idea in a blog comment (such as one of my comments;) and want to expand on it such that it requires citation, and a journal such as PreterNature (to use a fictitious name) won’t allow such a citation, then you cannot publish a derivative work expanding on the brilliant idea in PreterNature no matter how brilliant that derivative work is.

      The reasoning behind why a journal would set up a system for scientists to communicate (such as in a blog/comment system), and then not allow citations to that communication system should sufficiently important and brilliant ideas be communicated is not something I understand. Is it their purpose to stifle scientific discussions? To limit discussions to mundane ideas? To enforce the editors’ monopoly power on what constitutes “science”?

      Perhaps Maxine or Henry Gee could comment on that? I suspect it is simply like the many other petty inane stupid nonsensical arcane rules of all most some science editors, it seemed like a good idea at the time. Or perhaps the editors feel their stylistic whims should dictate where and how science is done as well as communicated?

      (note, I had wanted to use strike through in some of the adjectives that I used to make my intent appear more humorous (instead of harsh) but couldn’t find an explanation of how to do so. The formatting explanatory link didn’t explain how to do so, and I am having great difficulty with my (dial-up) ISP connection (now at 28.8 kbs) and can’t find it elsewhere.)

    • Regarding differences in citations of (almost) simultaneously-published articles reporting (almost) identical observations

      As pointed out before, a difference could reflect other aspects of the work such as techniques that were developed or reagents that were generated during its course.

      For the examples given in the posts here (the works on the auxin receptor and on ALS/astrocytes), ‘findability’ or access to full-text cannot be a reason for the differential citation as both papers of the pairs are published in the same journal.

      Could it be that there actually is no significant difference when citations in original research articles are tallied? I have tried to do so using Google Scholar.

      Auxin receptor work
        Total citations: 267 vs 232
        Citations in original research: 122 vs 117

      ALS/astrocyte work
        Total citations: 42 vs 35
        Citations in original research: 14 vs 15
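
      As a rough check on the tallies above, the relative gap within each pair can be worked out directly from the Google Scholar counts quoted in this comment. This is only a sketch: the function name `relative_gap` and the dictionary layout are illustrative assumptions, not anything used by the commenters, and the measure (difference as a fraction of the larger count) is just one reasonable choice.

```python
def relative_gap(a: int, b: int) -> float:
    """Absolute difference between two counts, as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


# Google Scholar counts quoted in the comment above
# (first paper vs. second paper of each back-to-back pair).
tallies = {
    ("auxin receptor", "total"): (267, 232),
    ("auxin receptor", "original research"): (122, 117),
    ("ALS/astrocyte", "total"): (42, 35),
    ("ALS/astrocyte", "original research"): (14, 15),
}

# Relative gap for every pair/category combination.
gaps = {key: relative_gap(*pair) for key, pair in tallies.items()}

for (topic, kind), gap in gaps.items():
    print(f"{topic:15s} {kind:18s} gap = {gap:.1%}")
```

      On these numbers the gap within each pair shrinks when only original research articles are counted, though with so few papers this says nothing about statistical significance.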

    • Hilary, I probably was a bit unclear. What I meant to suggest was that once the tools work reasonably well then the excuse ‘I didn’t see that example of prior work’ just doesn’t cut it any more so that strong attribution should be much easier to enforce.

      But I am absolutely with you on the idea that there are much more important things than ‘being first’. Particularly when things get rushed out in half baked form.

      David makes an interesting point though. I’d not really thought that one through but he is perfectly right to suggest that if you can’t cite the appropriate source in any specific medium then you shouldn’t use that medium. I don’t know how you balance that against the (perceived?) need for citations in journal articles to be stable over extended periods of time.
