One of the hot topics for computer scientists nowadays is cloud computing. Good examples are most web 2.0 applications, in which the computation and the data reside on a number of servers somewhere on the net while presenting an interface no different from that of a regular desktop application, all combined with social features that allow the system to establish relations between the different items of data.
Apparently, some people see potential in cloud computing not just as an aid to science but as a completely new approach to doing it. An article in Wired magazine argues precisely that. With the provocative title of The end of theory, the article concludes that, with plenty of data and clever algorithms (like those developed by Google), it is possible to obtain patterns that can be used to predict outcomes… and all that without the need for scientific models.
I would start arguing now about how models are more than just predictive machines, but people have already stepped up to the challenge. John Timmer at Ars Technica (not a science website either) makes some very valid points (in my opinion). Does anyone else think that traditional science is a thing of the past and that cloud computing will drive us modelers to the employment office?
I’m with you (wait, I’m a statistical modeller, so I might be biased). Cloud computing might drive us out of existence, if we don’t want to actually understand anything.
Thanks Bob. In any case, I would still be interested to see whether information and data mining alone (of the Google kind) is enough to produce models capable of accurate predictions of natural phenomena. That, if it happens, would be a major step, but it is still unlikely to be the end of science as we know it if those models are not understandable. If the algorithm comes up with some huge aggregation of several observational items and it is difficult to infer from it the mechanisms by which it works, then there is still work to do. My idea of Science is that it is not only prediction but also understanding.
I read that Wired article, David, and although I am an admirer (and long-ago ex-colleague) of the author, Chris Anderson, I think he confounds issues in it.
For example, he seems to be saying that we can’t formulate hypotheses, and hence can’t get anywhere trying to understand biology, because it has got “too big”. But this is to confuse “masses of data” with “complexity”. It doesn’t follow that the reason we don’t understand more completely how biology works is that there are too many data being collected (though this is certainly a challenge).
If Henry Gee were to comment on this Wired article, for example, he might say that there is an awful lot you can do with a fragment of a fossil. (But as you know there is loads of biological work going on that is not “big”.)
By the way, there was quite a critical comment thread on the Wired piece when I read it the other night, including a link to a very good “rebuttal” article at Ars Technica.
PS: Sorry, I missed the last paragraph of your blog post when I wrote my comment. Glad we agree on the Ars Technica piece!
And no, I don’t think modellers are headed for the unemployment office any time soon. ;-)
See the Nature Network group Scientific Researchers and Web 2.0: Social Not Working?, and please join it if you are interested in this general topic.
I almost think that the Wired article by Chris Anderson is intentionally provocative. Maybe it tells us more of what the press (and general public) thinks about today’s science. And the Ars Technica piece is a nice reply.
I agree with Martin, it’s just a teaser. But I think that we need to rethink the way we are doing science. We must open the data and learn from Linux how to build real things with the internet, not just fancy pages where people post cat pictures.
I think nobody has shown this here yet. I found it really interesting. It is called CARMEN, and here is a video
And what’s wrong with that, eh? :-)
Seriously, I think we only have to look at how projects like GenBank are used to understand Sebastian’s point. I would love to have more data available for me to play with, but it’s going to take a culture shift to make it happen.
Journals will/can help, by making data deposition a condition of publication. But the formats, annotation, curation etc are certainly a problem.
I think we must go beyond that, Maxine. A lot of experiments are funded by government grants. The data produced thanks to those grants is as real as highways. The data must be accessible to the people who are paying for it with their taxes.
I agree with Bob about the cats, I love them :). Seriously, we need a cultural shift in the way we look at science. It is not possible to continue to pay scientists who are not serious about their work. I’m talking about ITER.
And the format. I don’t think that’s a hard problem. XML is the universal language for the cloud, and you only need to implement translators from the proprietary format of the experiment to XML.
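To make the translator idea a little more concrete, here is a minimal Python sketch. Everything specific in it is an assumption made up for illustration: the input layout (one "time, energy" pair per line), the function name `records_to_xml`, and the element names `experiment`, `sample` and `energy`. A real instrument format and a community-agreed schema would certainly look different.

```python
import xml.etree.ElementTree as ET


def records_to_xml(text):
    """Translate hypothetical 'time, energy' lines into an XML string."""
    root = ET.Element("experiment")
    for line in text.strip().splitlines():
        time_s, energy = line.split(",")
        # One <sample> per input line, with the time stored as an attribute.
        sample = ET.SubElement(root, "sample", time=time_s.strip())
        ET.SubElement(sample, "energy").text = energy.strip()
    return ET.tostring(root, encoding="unicode")


# Made-up input standing in for a proprietary export.
raw = "0.0, -100.0\n0.1, -101.0"
xml_text = records_to_xml(raw)
print(xml_text)
```

The translation itself is the easy part; agreeing on what the elements mean, and on which ones every dataset must carry, is where the real work would be.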
I imagine a world where the experiments are made with your iPhone :)
In this context it is a nice move by GlaxoSmithKline (GSK) to make a large dataset of cancer genomics data available to the public.
@Maxine: we don’t need the journals. Let’s make a wikidata, where you can post anonymously. We just need some PhD students that hack their labs, lol.
We need to start a wiki for the standards and the basic measurements that every dataset must contain. For example, in molecular simulation, a “photo” of the system every n seconds, and a fine measurement of the energy, center of mass and stuff like that every n/100 seconds.
Then, anyone can compute the fluctuations in the energy or whatever.
Is something like that feasible for biology?
Sebastian, I agree that the way scientists work should (and will) change. That the internet is allowing for a fundamental shift in the speed and quantity at which science is communicated seems very real to many of us.
But science communication, important as it is, is one thing, and how science is performed is a very different one. Models allow us to integrate data from experiments into a framework that should describe the phenomena that we try to understand, as well as predict things for which we still don’t have (or can’t have) data. That the theoretical models should be understandable is important. At the end of the day, what a scientist aims to do is to understand whatever it is that she or he is studying. Without that you don’t have science, just engineering (and don’t get me wrong, I am not criticising engineering here). The article in Wired I was writing about seemed to be advocating a way of doing science in which understandability would be optional and secondary to its power to make predictions.
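As a small illustration of the "anyone can compute the fluctuations" point, here is a Python sketch that takes a series of energy samples and returns their mean and fluctuation (the standard deviation, sqrt(<E^2> - <E>^2)). The function name `energy_fluctuation` and the sample values are invented for the example; they do not come from any real simulation.

```python
import math
import statistics


def energy_fluctuation(energies):
    """Return (mean, standard deviation) of a list of energy samples."""
    mean = statistics.fmean(energies)
    # Population standard deviation, i.e. sqrt(<E^2> - <E>^2).
    std = statistics.pstdev(energies)
    return mean, std


# Made-up energy samples in arbitrary units, as if read from a shared dataset.
samples = [-100.0, -101.0, -99.0, -100.5, -99.5]
mean, std = energy_fluctuation(samples)
print(f"mean energy = {mean:.3f}, fluctuation = {std:.3f}")
```

With a shared standard for what gets recorded, this kind of check needs nothing beyond the published samples.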
Maxine, thanks for your comment and for pointing out that NN group on web2.0!
David, yes, we need to understand. Otherwise we will kill ourselves like Jacques Vache who said “the art is a stupidity.”
About the article in Wired: personally I think it is a really bad dossier. The articles are from like 2 years ago (map-reduce? ok, 6 months). But it addresses a crucial point: we need to do something different with the data.
I should add that genomic data, especially genomic sequence data, as well as protein structure data, is pretty much all open and always has been. I am the world’s biggest open data advocate, but it’s not like we don’t have access to all this sequence data. Now, how we can access that data more easily and programmatically is a different story.
And FWIW, people are looking at the data differently, building more advanced and very complex models. Are we being very smart about it, and have we changed our approaches fundamentally? No, not yet, but if this was deliberately provocative, I am not sure who Chris was trying to provoke. Scientists are already provoked, at least many of them are.
There have always been “top down”, data-analysis-derived and “bottom up”, mechanism-based approaches to understanding science. Top down analyses can be very useful in predicting the behaviour of systems and of real practical benefit, although they are probably best when interpolating within the space defined by the data, rather than extrapolating to new situations. Mechanism-based modelling has more potential for extrapolating, because we understand the system properly. It seems to me that “Cloud Computing” is neutral as to which approach is best, since it supports both. Sure, it gives access to compute and data resources for data mining, but it could also support bottom up modelling. Take the BioModels repository for example: not quite “in the cloud” yet, but it could be.
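A toy Python sketch of the interpolation versus extrapolation point, under loud assumptions: the "mechanism" is simply taken to be y = x², the "top down" model is a least-squares straight line fitted to five samples from the range 0 to 1, and all names (`mechanism`, `fit_line`, `data_model`) are made up for the illustration.

```python
def mechanism(x):
    """Stand-in for a known mechanism (assumed here to be y = x^2)."""
    return x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Observations only cover the range 0..1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [mechanism(x) for x in xs]
slope, intercept = fit_line(xs, ys)


def data_model(x):
    """Purely data-derived prediction, with no knowledge of the mechanism."""
    return slope * x + intercept


# Error inside the sampled range versus far outside it.
error_inside = abs(data_model(0.6) - mechanism(0.6))
error_outside = abs(data_model(3.0) - mechanism(3.0))
print(f"error at x=0.6: {error_inside:.3f}, error at x=3.0: {error_outside:.3f}")
```

The fitted line stays reasonably close within the sampled range and drifts badly outside it, whereas a model that encodes the mechanism would extrapolate without trouble.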
Hi David,
Thanks for your comment. I agree that cloud computing can support bottom up, mechanism-based science, but the catch is that, at least at present, an automated, exclusively algorithmic approach to producing models is unlikely to yield a meaningful mechanistic description of any natural phenomenon. It is not the internet part of cloud computing that is unsuitable; it is the algorithmic part, which otherwise serves Google so well in finding correlations between keywords and customers, for instance.
Sebastian, when I wrote about journals making data available, I meant just that. Require deposition of data into a public database, where one exists. That’s the Nature journals’ policy and probably other publishers’ too. Also, all the “Supplemental information” Nature journals publish is free to access. Finally, we encourage authors to submit additional details of protocols and methods to the (free to access) Nature Protocols, which also provides an online annotation and discussion facility.
My real point is, so long as there are public databases that curate, annotate etc, journals can require that authors deposit into them, and then those data are available to all. If a journal does not do that, the data could be lost for all kinds of reasons after a while, eg when an author moves labs. Supporting and maintaining data repositories seems to me to be a very good use of public funds.
Maxine, I cannot agree with you, because I think that just as Encyclopædia Britannica belongs to the 19th century, journals belong to the 20th century. You say that “If a journal does not do that, the data could be lost…” but we, as people, can gently ask scientists to do it and the result is the same: the data is stored.
I can understand the importance you give to journals, since you work at one, but I think that Wikipedia has shown that cooperation between peers is far more powerful than institutions. All the more so when the mission of this institution is “First, to serve scientists…” and not “…to place before the general public the grand results of Scientific Work…”. Journals can do a lot of things, but people can do more.
We can ask the journals to free the data, but I think it is like asking your dictator for a little bit of freedom. And I want all the freedom. I see that it is possible to build knowledge without an authority and that peer reviewing can be done on a cooperative basis. Dialogic, as Bakhtin said.
Journals, as I see it, are fundamentally wrong because I cannot read the article that my teachers put in Nature from my house without a VPN connection to the university’s server, and to have access to that server I have to pay something like 2.5 Chilean minimum wages monthly in fees, or pay US$32 for the article.
Doing open science means, at least for me, opening the knowledge to everyone, not just to the elite who have the money to pay for it. And I see that journals don’t want that.
Sebastian, there is much more to a journal article and a journal issue than the data contained within the papers. And there is nothing in our policies to stop scientists sharing data freely between themselves outside the journal.
There are also much cheaper ways to buy and read Nature than purchasing a single article, as you can see from our website.
I think this part of the discussion has got a bit away from the topic of the post, which was about cloud computing.
A lot depends on definitions of cloud computing, to go right back to the initial question. The original Wired article suggested that machine learning, sitting on top of unlimited data storage and unlimited compute resource, will be the only way to do science. Obviously this is a wind-up, but it is also based on a different view of what cloud computing is. To me the cloud is just a stack of resources (hardware and software) that needs no maintenance and can be paid for “on demand”. If you think of the cloud stack as only the lower level storage and processing, services such as those Amazon provides mean we already have “a cloud”. But for me it is the domain-specific services (the cloud’s applications) that are the most exciting from a scientist’s point of view, and these can support hypothesis-driven science just as effectively as they can machine learning methods (my colleague, Paul Watson, put it a lot better than me in his recent Google Talk presentation). Implementation of science-specific services in the cloud is the next stage. The most exciting thing about the cloud (in all its definitions) is the way it could enable more people to do quality science at much lower cost. It has the potential to enable a new and much bigger “long tail” of independent scientists. I think it will create far more opportunities for scientists than it destroys.
If I told you I had a massively parallel device which takes in huge amounts of raw data and finds patterns via computation to make predictions, would you be able to tell if I was speaking of a computer cloud or a human brain?
[I ended up writing an entire blog post as a reply, so if you are interested the rest is here]
Rafe,
Read your blog post and found it interesting. As I said on your blog, there might be limits to human understanding. There is probably a threshold of complexity beyond which a single mind will be unable to understand a physical/biological phenomenon, but I hope we will not yield to the temptation to produce models we cannot understand when a better one is possible. Finding correlations with models that we cannot understand can, undoubtedly, be useful, but it is, in the best case, only a lesser kind of science.