Who wants to be eating dirt? Musings on search, self-archiving, citations and "findability"
Hilary Spencer
Wednesday, 30 April 2008 22:54 UTC
A lively (and long) discussion follows Jennifer Rohn’s post, In which I get into a little muddle about archiving. I’d like to respond to something said about 2/3rds of the way down the comment thread, but I’m starting a new post as I’m not sure how many people will take the time to read all 109 comments (go Jennifer!)
In comment 50- or 60-something (OK, I didn’t count), Henry Gee poses the following hypothetical example:
Group X discovers something which they write down with a time stamp. Group Y makes the same discovery – by the time Group X gets wind of this and has mobilized its army of intellectual property droids, Group Y has published it and has gotten the credit. In the eyes of the world, Group Y has made the discovery and Group X is eating dirt.
Let’s modify this a bit: Say Group X posts a preprint of their manuscript on a preprint server like ArXiv, time-stamping it, but more importantly, making it easily available via Google. Even though Group Y publishes first, say they publish in a closed-access journal to which few universities have subscriptions (at the worst, say the journal doesn’t even make their articles available online—it’s print only). Now whenever someone is looking for articles on Z in Google, they tend to find Group X’s paper on ArXiv, but not the journal article. Group X’s article on ArXiv is cited a couple of times, while Group Y’s languishes behind a subscription wall, or, at the worst, in the university library’s stacks… So is “getting credit” getting the first publication, or getting one’s work consistently cited and used as a reference point?
Now, most journals have been moving towards making their content available online, and many universities have extensive journal subscriptions, so this might be an extreme example. Most people will cite a peer-reviewed article when given the choice between a preprint and a peer-reviewed article. But people will almost always cite articles that they can find and read over those that they can’t. (Isn’t it a breach of ethics to cite documents one hasn’t read?) Most publishers realize this, and are working to increase the “findability” of these articles, but sometimes they don’t do the best that they can. Sometimes sites like Google also rank papers on ArXiv higher than those on journal sites (though the question of the role of page rank in literature reviews might best be left for another day).
One might also argue that no one uses Google to search for research, though trends suggest that this is changing, especially with the current crop of undergraduates. 1, 2
If the published version is difficult to find, but one is able to read the preprint and finds it useful, then I suspect that one is more likely to ferret out a copy of the published version (even in the university stacks) for citation purposes. One is more likely to spend the time and effort trying to get a copy of an article that one already knows will be useful over an article that may or may not be. (This might explain why papers with posted preprints tend to get more citations than those without available preprints 3, 4 and why open access articles tend to have higher citation rates 5).
There are many stories of disputes over claims of inventions (the modern computer, photography, the radio, the telephone, the steam engine…) A recent article from the NY Times’s Week in Review discusses Thomas Edison’s invention of the phonograph, noting that 17 years prior to Edison’s patent, a Parisian inventor had already created a device to make visual recordings of songs. Who was he? Who knows? The article goes on to note that Edison was perhaps not the first to invent the lightbulb, and was only credited as doing so when the Supreme Court ruled that the prior inventor’s patent was too broad. Did you know this? I didn’t.
The Times article suggests that credit for inventions is often correlated with who is able to make theirs accessible to the public, and not necessarily with who was first, even in the filing of patent documents. The author also notes the importance of timing: “Great ideas, while perhaps not novel, are delivered to us…just as we’re hungry for them.” Perhaps being the first to publication isn’t always the key to receiving credit, just as being the first to patent doesn’t mean that you won’t be eating dirt later.
1 Student Searching Behavior and the Web: Use of Academic Resources and Google
2 Information Illiterate or Lazy: How College Students Use the Web for Research *Disclosure: I didn’t read this article because I don’t have access, so I’m citing the abstract.
3 The Citation Impact of Digital Preprint Archives for Solar Physics Papers
Preprint version
4 E-prints and Journal Articles in Astronomy: a Productive Co-existence
Preprint version
Updated 01 May 2008 19:59 UTC
-
Replies
-
There are many examples of the “Group X” and “Group Y” scenario in science; enough that it is probably reasonable to go back by hand and, based on the ones that we can remember, consider what the citation fall-out was from a publication competition (I ignored patents here simply for the sake of discussing this issue with regards to the benefit of pre-prints).
One recent neuroscience example involving Nature was the publication of the Halorhodopsin light-activated chloride pump, which could be used to eliminate neuronal firing, in response to yellow light, in those cells expressing the pump. One group from MIT published their description of the pump and its properties in PLoS ONE 15 days ahead of the characterization of Halorhodopsin published in Nature. To my knowledge, both papers are often cited together in this case. However, I believe the Nature authors feel that they are in fact eating dirt, since technically they still were not the first. Now had they used a pre-print server, would they feel differently? I don’t know. But in their case, posting a pre-print would have been unwise and extraordinarily risky because that would have meant revealing the genetic sequence of the pump, allowing others to easily manufacture the protein themselves for a quick characterization. So they had little choice but to just go forward and attempt to get their story out into the public in a published form, if only to (try and) protect their results.
Switching gears a bit, it sounds very reasonable to state that the more highly-cited paper is the one with the greater impact, despite another version coming out first. And it also sounds reasonable to state that the more highly-cited paper may become so because it was more easily available. But if authors are diligent with their manuscripts, stewarding their public exposure (like self-archiving after 6 months in the case of NPG journal policy), the comparison described above, of who came first versus number of citations, should be moot. With the biomedical citation half-life approaching 2-3 years, losing 6 months of open access should only cause a minor dent in the eventual citation counts, as long as the authors stuck behind the firewall release their study the day after the 6-month embargo has lapsed. I know that studies reveal that open-access papers are more highly cited, but perhaps a portion of this difference is due to author negligence when it comes to actively making sure their article lands in an appropriate repository at the right time.
-
This highlights the importance of the coverage of preprint servers by indexing services, metadata harvesters, and academic search engines.
For Nature Precedings, non-inclusion in PubMed is a big issue in the case of the bio/medical sciences. I believe fewer than one percent of bio/medical researchers use something besides PubMed for literature discovery.
Perhaps someday PubMed will cover reputable preprint servers.
-
Great post Hilary – thanks for continuing on here from that part of the comments from Jenny’s post.
I’m not sufficiently qualified to leave a detailed comment (such as that of Noah).
What I would say is those who have and continue to deposit in ArXiv should be well placed to comment on these issues.
-
it sounds very reasonable to state that the more highly-cited paper is the one with the greater impact
I can think of so many examples where this is not the case. Some of the worst cases are where two papers from different groups describing essentially the same discovery are published back-to-back in the same journal. The one appearing first in the journal always gets the higher number of citations.
-
Noah—thanks for providing a concrete example, although it isn’t quite indicative of the situation I was alluding to. Both PLoS and NPG have tried to make their respective articles highly accessible (and thus citable). All PLoS ONE papers are open access, and NPG is allowing free access (at least at the moment) to the Nature article. But how would this situation have changed if the PLoS ONE article was not easily accessible/“findable”? If Nature had not issued press releases and provided a special web focus to highlight the Nature paper? I think a lot of publishers try to increase the “findability” and awareness of their papers, and PLoS and NPG are both examples of publishers that are doing this. But sometimes this doesn’t happen or their efforts go awry.
Chris—For the back-to-back articles in the same journal, are you suggesting that the second article has greater impact even though the first article is cited more?
-
Hi Hilary. My initial scenario was basically a response to the more philosophical question you raised:
So is “getting credit” getting the first publication, or getting one’s work consistently cited and used as a reference point?
I am actually more interested in your response to my off-the-cuff hypothesis in the last paragraph, namely that with increased author diligence to make their published papers available the minute they can be released to the public, the differences in citations between OA articles and “firewall-blocked” articles could (should?) shrink. I am aware that several publishing companies have even greater restrictions on the dissemination of their produced manuscripts than NPG, but perhaps we can simplify the issue and use the 6 month NPG embargo for this discussion.
-
Hilary, what I meant was that if two papers report the same thing at the same time in the same place they should by rights get cited together and so get the same number of citations. One shouldn’t get cited more simply due to a relatively arbitrary decision of the journal’s editors, and yet it invariably does.
-
Chris—so by “the one appearing first”, you mean the one appearing first in the TOC? (I initially thought you meant the one appearing first in time). That’s really interesting.
-
I was also just reminded of a recent study on positioning on ArXiv:
The Importance of Being First: Position Dependent Citation Rates on arXiv:astro-ph
“We find that e-prints appearing at or near the top of the astro-ph mailings receive significantly more citations than those further down the list.”
-
Hi Chris,
I guess I find it hard to believe that there is a categorical pattern of papers being cited more because they are physically situated first in the publication order (TOC) of a journal. To only cite one paper when two clearly present the same advance from the same journal issue is borderline unethical. In addition, a good portion of most bibliographies are built from PubMed searches where TOCs and publication order become irrelevant.
The ArXiv paper doesn’t really look into this back-to-back issue, but its result is almost as surprising. Basically, there was no good explanation for this effect since the authors could not disentangle “self-promotion” bias from the “visibility” bias (although these are somewhat related). With manuscripts representing different sub-disciplines semi-randomized within the list (since the pre-prints are ordered according to submission time), perhaps this effect arises from a few very competitive sub-disciplines that consistently grapple for the top position of the list every day, with the intent to increase visibility. If these same sub-disciplines were also higher producers of citations, then the mystery would be solved: a competitive subset of the community chronically fights for visibility (self-promotion), leading to a disproportionate clustering of their papers (from highly cited fields) near the top of the list. When comparing these papers to those lower on the list (which could disproportionately arise from low-citation rate sub-disciplines that do not bias their submission time to make the top of the next day’s list), there will be an obvious effect.
In short, normalize for subject matter and see if the effect disappears.
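To make the normalization idea concrete, here is a minimal sketch in Python. The records are invented toy numbers, not data from the arXiv study, and the subfield labels and the "top"/"lower" grouping are assumptions made only for illustration. It compares the citation gap with subfields pooled together against the gap within each subfield.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (subfield, list-position group, citation count).
# These numbers are made up to illustrate confounding; they are not real data.
papers = [
    ("cosmology", "top", 40), ("cosmology", "top", 44),
    ("cosmology", "top", 36), ("cosmology", "lower", 40),
    ("stellar", "top", 10), ("stellar", "lower", 8),
    ("stellar", "lower", 12), ("stellar", "lower", 10),
]

def pooled_gap(records):
    """Mean citations of top-listed minus lower-listed papers, ignoring subfield."""
    top = [c for _, pos, c in records if pos == "top"]
    lower = [c for _, pos, c in records if pos == "lower"]
    return mean(top) - mean(lower)

def within_subfield_gaps(records):
    """The same gap, computed separately inside each subfield."""
    by_field = defaultdict(lambda: {"top": [], "lower": []})
    for field, pos, cites in records:
        by_field[field][pos].append(cites)
    return {
        field: mean(groups["top"]) - mean(groups["lower"])
        for field, groups in by_field.items()
        if groups["top"] and groups["lower"]  # skip subfields missing a group
    }

print(pooled_gap(papers))
print(within_subfield_gaps(papers))
```

In this toy set the pooled comparison shows a large advantage for top-listed papers, while the gap inside each subfield is zero, because the high-citation subfield simply lands at the top more often. If a gap survived within subfields in real data, subject matter alone would not explain it.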