I saw this on Slashdot and wanted to make sure I posted it here…it’s a blogger who teaches data mining at Stanford (oh, and what do YOU do?) weighing in with the results of his work on whether or not better algorithms trump more data. Interesting analysis especially in re: Google.
Harrumph. With apologies to George Clinton, Free Your Data... And Your Searchers Will Follow.
(Edited post to add a link to the Slashdot thread.)
Thanks, John, for this posting. I think that there is many a biomed researcher who could learn from this article. Somewhere along the way researchers forgot that the sample size matters as much as or more than the number of data points gathered from a single sample.
Mmm… I approve of this post. This is a great example of how “mashups” (I really dislike that term for some reason) can be used to sort of bootstrap the power of a dataset. In the case of the Stanford teams, the incorporation of data from an external source enabled them to improve their algorithm. In the case of Open Access science, the ability to better combine data from a variety of studies and fields will in turn lead to more discoveries.
Interesting post. I might be stating the obvious, but this example also emphasizes the value of having data in compatible formats. I don’t know how easy it was to merge the Netflix data with the IMDB data, but for datasets that share standard formats and identifiers, this seems like an easy win.
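To make the "easy win" concrete, here is a minimal sketch of what a merge looks like when two datasets share an identifier. Everything in it is hypothetical: the records, the field names, and the `merge_on_id` helper are made up for illustration and are not the actual Netflix or IMDB formats, nor the method the Stanford teams used.

```python
# Hypothetical records; IDs and field names are invented for illustration.
ratings = [
    {"movie_id": "m001", "avg_rating": 4.2},
    {"movie_id": "m002", "avg_rating": 3.1},
    {"movie_id": "m003", "avg_rating": 4.8},
]
metadata = [
    {"movie_id": "m001", "genre": "drama"},
    {"movie_id": "m003", "genre": "comedy"},
]


def merge_on_id(left, right, key):
    """Inner-join two lists of dicts on a shared identifier field."""
    # Index the right-hand dataset by identifier for constant-time lookups.
    right_by_id = {row[key]: row for row in right}
    # Keep only rows whose identifier appears in both datasets.
    return [
        {**row, **right_by_id[row[key]]}
        for row in left
        if row[key] in right_by_id
    ]


merged = merge_on_id(ratings, metadata, "movie_id")
for row in merged:
    print(row)
```

With a shared key the join is a dictionary lookup. Without one, you are stuck matching on titles and release years, which is where most of the effort (and most of the errors) would likely go.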