• john wilbanks' blog

    Agitating for innovation through open licensing and good technology.

    • More Data = WIN

      Tuesday, 01 Apr 2008 - 19:39 GMT

      I saw this on Slashdot and wanted to make sure I posted it here. It’s from a blogger who teaches data mining at Stanford (oh, and what do YOU do?), weighing in with the results of his work on whether better algorithms trump more data. Interesting analysis, especially in re: Google.

      Money quote: if you have limited resources, add more data rather than fine-tuning the weights on your fancy machine-learning algorithm.

      Harrumph. With apologies to George Clinton, Free Your Data ...And Your Searchers Will Follow.

      (Edited post to add a link to the Slashdot thread.)

      Last updated: Tuesday, 01 Apr 2008 - 19:39 GMT

      • Comments

        • Date:
          Tuesday, 01 Apr 2008 - 20:11 GMT
          Craig Rowell said:

          Thanks, John, for this posting. I think that there is many a biomed researcher who could learn from this article. Somewhere along the way, researchers forgot that sample size matters as much as, or more than, the number of data points gathered from a single sample.

        • Date:
          Tuesday, 01 Apr 2008 - 21:16 GMT
          Plausible Accuracy said:

          Mmm… I approve of this post. This is a great example of how “mashups” (I really dislike that term for some reason) can be used to sort of bootstrap the power of a dataset. In the case of the Stanford teams, the incorporation of data from an external source enabled them to improve their algorithm. In the case of Open Access science, the ability to better combine data from a variety of studies and fields will in turn lead to more discoveries.

        • Date:
          Saturday, 26 Apr 2008 - 20:26 GMT
          Hilary Spencer said:

          Interesting post. I might be saying the obvious, but this example also emphasizes the value of having data in compatible formats. I don’t know how easy it was to merge the Netflix data with the IMDB data, but for datasets that use standard formats / identifiers, this seems like an easy win.
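
To make the point about shared identifiers concrete, here is a minimal sketch in Python of joining two datasets on a common key. Everything in it is hypothetical: the `title_id` field, the records, and the `merge_on_key` helper are invented for illustration and do not reflect the actual Netflix or IMDB data or how the Stanford teams merged them.

```python
# Hypothetical records: the ids, ratings, and genres are made up for illustration.
ratings = [
    {"title_id": "t001", "avg_rating": 4.2},
    {"title_id": "t002", "avg_rating": 3.1},
    {"title_id": "t003", "avg_rating": 2.5},
]
metadata = [
    {"title_id": "t001", "genre": "drama"},
    {"title_id": "t003", "genre": "comedy"},
]


def merge_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared identifier field."""
    # Index the right-hand rows by identifier for constant-time lookup.
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Combine the fields of both rows into one record.
            merged.append({**row, **match})
    return merged


combined = merge_on_key(ratings, metadata, "title_id")
for row in combined:
    print(row)
```

When both sources already use the same identifier, the join is a dictionary lookup. Without a shared identifier, the same step would need fuzzy matching on titles, which is slower and more error-prone.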

