Data-intensive science

A really interesting post, especially given its source: Prof Kell is the Chief Exec of BBSRC. I agree that data mining or processing has become something in its own right. It makes me think about ‘Everything Is Miscellaneous’ and how we need to process on the way out. But will scientists have the foresight to do this processing in an open and distributed online way?


via Professor Douglas Kell's blog by DKell on 3/17/09

We arguably recognise three main approaches to generating new knowledge: experimental and theoretical research are classically the first two, while more recently computer simulations of natural phenomena (and of engineering artefacts) have contributed a third. Now Bell, Hey and Szalay have proposed a fourth – data-intensive science.

Like any other major shift in scientific thinking – as Kuhn’s re-coined term ‘paradigm’ is intended to signify – data-intensive science both represents and is driven by a change in the scientific landscape. It is not just a re-statement of the significance of data-driven rather than hypothesis-dependent science. In this case, the change is the ability of modern instrumentation to generate data at rates 100-1000-fold those of the devices being replaced. In biology, an obvious set of examples is the so-called next-generation sequencing methods for nucleic acids. Mardis and Shendure & Ji give recent reviews of these.

As the need for cost-effective computation on non-specialist hardware developed, Beowulf clusters of commodity PCs became a de facto standard in University laboratories and elsewhere. However, these were designed more for parallel computation than for accessing and analysing huge datasets, and rarely included database software. As data volumes grow to petabytes and beyond, it becomes infeasible (for reasons of bandwidth) to transport such large amounts of data frequently over a Grid or the interweb (or even the Cloud), and localised processing is to be preferred. A team of collaborators including Szalay have consequently realised a computer architecture, the Graywulf (named after Jim Gray), suitable for data-intensive science. It won the award for best storage solution at Supercomputing08. Bell and colleagues give an example in which a query over a large astronomical database, the Sloan Digital Sky Survey, took 12 minutes on a Graywulf (compared with 13 days on a non-parallel database). We have exploited other parallel architectures such as the Condor pool, but as with the Beowulf they are unsuitable for very large datasets.
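
To make the bandwidth argument concrete, here is a rough back-of-envelope sketch in Python. The link speeds (1 and 10 Gbit/s) and the idea that a link runs at its full rate without interruption are illustrative assumptions of mine, not figures from Bell and colleagues; only the 12 minutes and 13 days are taken from the example above.

```python
def transfer_days(size_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move size_bytes over a link at full, sustained line rate."""
    seconds = size_bytes * 8 / link_bits_per_s
    return seconds / 86_400  # seconds in a day


PETABYTE = 1e15  # decimal petabyte, in bytes

# Assumed link speeds, chosen only for illustration.
for label, rate in [("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"1 PB over {label}: {transfer_days(PETABYTE, rate):.1f} days")

# Speedup implied by the quoted query times: 13 days versus 12 minutes.
speedup = (13 * 24 * 60) / 12
print(f"Query speedup: {speedup:.0f}-fold")
```

Even under these optimistic assumptions, moving a single petabyte takes days to months, which is why it looks more sensible to bring the computation to the data than the other way round.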

As mentioned in my first blog, Sydney Brenner has remarked (Nature, June 5, 1980) that “Progress in science depends on new techniques, new discoveries, and new ideas, probably in that order.” Bell and colleagues conclude in a similar vein: “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization and cloud computing technologies.” Workflows (e.g. using Taverna) for data integration have proved useful for many purposes, including for systems biology and for bio-statistical analyses of microarray data. Big databases with fast access and information visualisation look like being the next important areas for biology.

