Is taxonomy the old way of doing science?
Roderic Page
Tuesday, 17 February 2009 08:11 UTC
In a provocative article entitled The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, Chris Anderson (of “the long tail” fame) wrote about J. Craig Venter and shotgun sequencing:
If the words “discover a new species” call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn’t know what they look like, how they live, or much of anything else about their morphology. He doesn’t even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.
This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It’s just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.
In a blog post about this article (The end of science, and the end of taxonomy) I commented on what I thought was the yawning chasm between the multicellular, David Attenborough-esque view of life that dominates efforts such as EOL, and the metagenomic world of Craig Venter. Do initiatives such as EOL represent the “old way of doing science”?
-
Replies
-
In my view, the “old way” is the better way. Venter’s data are all very nice and very, very useful to scientists, but, in themselves, they’re not science. Data are not science. Science is what you do with data.
These methods only work on the microbial scale – you can hardly extract DNA from a jungle and sequence the metagenome. The “old school” science still very much has a place. The EOL is of the old school and is an excellent initiative.
All this gathering of data and filling up of GenBank with sequences of things that might be a bit similar to (for example) cytochrome c oxidase, but might actually be something else because we’ve not got the actual organism to work on, is very handy and definitely has a use… but it’s not what I’d call science. Where’s the hypothesis? Where are the results? All I see are raw data and a method.
-
Playing Devil’s advocate, I’d argue that we could sequence the jungle metagenome, and the way deforestation is going, maybe we should be doing this ASAP. Imagine following behind the bulldozers and sequencing what they leave in their wake, if only to have a record of things we may never actually see again. I’m not being totally serious; nor, however, am I being totally flippant.
The argument Chris Anderson was making was partly, I think, one of scale. He’s arguing that hypothesis-driven science doesn’t scale up to handle the massive data streams we are generating, so science at this scale becomes a matter of sophisticated pattern analysis.
-
Pattern analysis is fine and can be hypothesis driven: e.g. “Organism X will be dominant in Environment A but Organism Y will be dominant in Environment B” can be tested by a metagenomics effort. However, one has to be careful when interpreting the data: they may suggest there is a lot of DNA from Organism X, but this of course doesn’t mean that those organisms are alive, let alone active. A dead hippo still sequences as a hippo in terms of DNA.
I think trying to sequence a jungle would be impossible to do purely because of the scale and something resembling the Heisenberg uncertainty principle would kick in – stuff running away from the bulldozer doesn’t get sequenced, therefore in sampling, you’re perturbing the system dramatically and not actually “seeing” the system itself. That, of course, also happens when you throw a bucket into seawater but possibly not on the same level.
I have to say I’m not a fan of some of the “stamp collecting” approaches. Having a list of organisms or a list of active organisms in an environment is nice and it’s useful, but I’d much rather know what those organisms were doing and were capable of doing, and have this actually studied rather than guessed at based on sequence data.
-
There are a number of issues here, but on the question of whether the Venter approach to taxonomy is science, I strongly disagree with Rich Boden. Venter is still a collector in the traditional sense. It’s just that instead of discovering new species, he is discovering new genes – something like 20 million from his recent ocean voyage around the equator. Each ml of seawater Venter sampled contained over 1 million bacteria and 10 million viruses. Traditional taxonomic methods have little to contribute to the study of these groups, especially since most of them cannot be grown or cultured in the lab. This is not to decry traditional taxonomic methods. It is just that different methods are appropriate to different organisms and yield different kinds of information.
I was fortunate enough to be part of a discussion on this topic at Google, as part of the Nature and O’Reilly publishing groups’ SciFoo Camp 2008. In Chris Anderson’s session he posed the question “What can Google do for Science?” and used Venter’s approach to make the case for the “Googlization” of science. My response to Chris was that Google (and search tools in general) don’t intrinsically change our science. Rather, they simply lower the transaction costs of doing science, such that we can explore scientific data more quickly to discover and test new hypotheses.
Data-rich sciences lend themselves to this approach, and in principle taxonomy is made for it. Unfortunately, the way we currently publish taxonomy (i.e. in static “papers”, not databases, that exclude most of the underlying data) means that we have no way to capitalize on this approach. Thus, the solution for taxonomy’s woes is to embrace reductionism in the way that other data-rich sciences have; to focus on publishing data rather than synthesized treatises about data; and to move away from our obsession with publishing “papers”, since these should be synthesized on demand from the underlying databases, much in the same way that we do a Google search.
One further consequence of this is that our discipline (or more specifically our databases) needs to approach the way we store information about “species” concepts differently. Species / taxa are hypotheses about collections of data, and should be treated as such, but our databases tend to treat them as almost immutable objects around which we navigate all other information. This would be one of my criticisms of EOL, whose sole point of reference and navigation is the taxon. Instead all data should be equally navigable, so that we can aggregate data around other objects such as authors, specimens, published papers, DNA sequences, etc., to name but a few.
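To make that last point concrete, here is a minimal sketch in Python. It is an illustration only: the record fields, the identifiers and the `aggregate_by` helper are invented for this example and do not reflect the schema of EOL or of any real database. Each record is a flat set of links, and the taxon is just one more field (a revisable hypothesis attached to the data) rather than the root that everything else hangs from.

```python
from collections import defaultdict

# Hypothetical records: every field is an equal entry point, and
# "taxon_hypothesis" is just one more (revisable) assertion about the data.
RECORDS = [
    {"specimen": "SPEC-001", "sequence": "SEQ-A", "paper": "Smith 2001",
     "author": "Smith", "taxon_hypothesis": "Aus bus"},
    {"specimen": "SPEC-002", "sequence": "SEQ-B", "paper": "Smith 2001",
     "author": "Smith", "taxon_hypothesis": "Aus cus"},
    {"specimen": "SPEC-003", "sequence": "SEQ-C", "paper": "Jones 2005",
     "author": "Jones", "taxon_hypothesis": "Aus bus"},
]


def aggregate_by(records, key):
    """Group records around any field, not only the taxon."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Same data, two different navigation axes.
by_taxon = aggregate_by(RECORDS, "taxon_hypothesis")
by_author = aggregate_by(RECORDS, "author")
print(sorted(by_taxon))
print(sorted(by_author))

# Revising a taxon hypothesis changes one field; the specimen, sequence
# and paper links are untouched.
RECORDS[1]["taxon_hypothesis"] = "Aus bus"
print(len(aggregate_by(RECORDS, "taxon_hypothesis")["Aus bus"]))
```

The design choice being illustrated is simply that no field is privileged: grouping by author, paper or sequence uses exactly the same call as grouping by taxon.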
-
Don’t get me wrong, what Venter has done is very useful, but it’s “just” raw data at the moment. On its own, it’s meaningless. It’s not until you start to mine those data that you start to make actual results.
The traditional collectors (e.g. the guys who went into the jungle with the killing jars and collected various butterflies) didn’t just collect. They described. They measured. Some of them observed the organisms in situ or in the laboratory and noted their behaviour, their diet, their growth patterns, their bone structure, their colouration. From a metagenome, we can’t yet predict these things. We can make some quite good guesses in some cases from genes that are present but we can’t infer that they’re being used or what they’re being used for. Metatranscriptomics is of much more use, I feel.
I’d like to see as much investment in trying to cultivate these “uncultivable” (I don’t like “unculturable” as it sounds like the organisms refuse to listen to opera or something) Bacteria and Archaea as there is in “giving up” and “just” doing a metagenome. A bird in the hand is worth two in the bush, after all. Now that 454 sequencing etc. has come down in price, metagenomics is going to become “the norm” and everyone will be doing it – and I worry that it’s detracting from the real need to find out what these genes we see in databases actually do. The number of “putative”, “hypothetical” and “theoretical” proteins you see in the GenBank™ database now is amazing – wouldn’t it be great to find out what they are before we add any more to the database?
Unfortunately as physiology and biochemistry are seen as increasingly unsexy compared to molecular ecology, a lot of genes are going to continue to sit there without a proper annotation and with us still pondering their role.
Hmph.
-
I completely take your point, but the issue here is one of scale (see my comments under the automation thread). How many of those microbes can we study in a traditional way when there are so many? So many in fact that we can’t reliably estimate their diversity, let alone describe the “natural history” of more than a handful. My professional background is in entomology, but I am an amateur aquarist and birder. Metagenomics does little to fire my interest in these groups. On the contrary, it is the descriptive natural history that makes me excited about them. But genomic techniques scale to some of our tasks in taxonomy, in a way that traditional techniques do not. When we have so little cash to work with in taxonomy, we would be crazy not to make use of these methods in an attempt to answer some of taxonomy’s big questions. Contrary to some claims, this need not happen at the expense of traditional natural history studies. Rather, I’d argue that they would make the work of traditional taxonomists more valuable, because the molecular data will yield a host of questions only more traditional methods of studying natural history can answer. It will also yield a resurgence in collecting.
Rod closed his original point by asking, “Do initiatives such as EOL represent the old way of doing science?” As EOL stands right now, the answer is unquestionably yes, but it need not be this way. Indeed, if EOL is to realize its potential as a tool for scientists, professional and citizen scientists alike, it will have to integrate these molecular stories and find a way to accommodate these data, even if for many putative taxa the only story we initially have is a molecular one. Part of EOL’s challenge is that it is trying to be all things to all people. At the moment I suspect this is not possible, and in the longer term this endangers the broader mission of EOL.
-
Data crunching tools and systems don’t make old science obsolete, but rather make new science more accurate and, perhaps most important of all, increase the speed of discovery. A few years ago, when my brother was dying of ALS, trust me, we all felt a bit more data crunching couldn’t hurt. I find a good deal of fear and protectionism in the debate, reminiscent of the resistance to technology found in KM.
I am often stunned at the ease with which otherwise well-educated scientists will suggest (particularly if they think their funding is in jeopardy) that we throw out Darwinism as it relates to anthropology, as if tools don’t matter in the evolution of our species, particularly given our influence on all others.
Of course it’s only one tool; so are the microscope and ruler, or for that matter the language filling a book. Who here doesn’t use a computer? So should we ignore the potential benefits? No, though we should be fully aware of the potential harm as well. I recently wrote a related white paper, within the realm of the semantic web, for those with an interest, as well as an intro to our MAESTRO series that takes a look at the strengths and weaknesses of network computing more generally, and specifically in crisis prevention. http://www.kyield.com/publications.html – MM
-