A really interesting post, especially given its source: Prof Kell is the Chief Executive of BBSRC. I agree that data mining or processing has become something in its own right. It makes me think about ‘everything is miscellaneous’ and how we need to process on the way out. But will scientists have the foresight to do this processing in an open and distributed online way?
We arguably recognise three main approaches to generating new knowledge: experimental and theoretical research are classically the first two, while more recently computer simulations of natural phenomena (and of engineering artefacts) have contributed a third. Now Bell, Hey and Szalay have proposed a fourth – data-intensive science.
Like any other major shift in scientific thinking – as Kuhn’s re-coined term ‘paradigm’ is intended to signify – data-intensive science both represents and is driven by a change in the scientific landscape. It is not just a re-statement of the significance of data-driven rather than hypothesis-dependent science. In this case, the change is the ability of modern instrumentation to generate data at rates 100- to 1000-fold those of the devices it is replacing. In biology, an obvious set of examples is represented by the so-called next-generation sequencing methods for nucleic acids. Mardis and Shendure & Ji give recent reviews of these.
As the need for cost-effective computation on non-specialist hardware developed, Beowulf clusters of commodity PCs became a de facto standard in University laboratories and elsewhere. However, these were designed more for parallel computation than for accessing and analysing huge datasets, and rarely included database software. As data volumes grow to petabytes and beyond, it becomes infeasible (for reasons of bandwidth) to transport such large amounts of data frequently over a Grid or the interweb (or even the Cloud), and localised processing is to be preferred. A team of collaborators including Szalay have consequently realised a computer architecture, the Graywulf (named after Jim Gray), suitable for data-intensive science. It won the award for best storage solution at Supercomputing08. Bell and colleagues give an example using a Graywulf of the execution of a query over a large astronomical database, the Sloan Digital Sky Survey, that took 12 minutes (compared with 13 days on a non-parallel database). We have exploited other parallel architectures such as the Condor pool, but as with the Beowulf they are unsuitable for very large datasets.
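To put rough numbers on the bandwidth argument and on the Sloan query timings above, here is a minimal back-of-the-envelope sketch in Python. The link speeds are illustrative assumptions rather than figures from Bell and colleagues, a decimal petabyte is assumed, and protocol overhead and contention are ignored, so real transfers would be slower still.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 1e15  # assumption: decimal petabyte, in bytes


def transfer_days(dataset_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move a dataset over a link, ignoring all overhead."""
    return dataset_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


def speedup(slow_minutes: float, fast_minutes: float) -> float:
    """Ratio of two run times expressed in the same unit."""
    return slow_minutes / fast_minutes


# Hypothetical sustained link speeds, chosen only for illustration.
links = [("100 Mbit/s", 1e8), ("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10)]
for label, bits_per_second in links:
    days = transfer_days(PETABYTE, bits_per_second)
    print(f"1 PB over {label}: about {days:,.0f} days")

# The timings quoted above: 13 days on a non-parallel database, 12 minutes on a Graywulf.
sdss_speedup = speedup(slow_minutes=13 * 24 * 60, fast_minutes=12)
print(f"Query speed-up: about {sdss_speedup:,.0f}-fold")
```

Under these assumptions a single petabyte needs roughly three months at a sustained 1 Gbit/s, which is why moving the computation to the data looks more attractive than moving the data; the quoted query times correspond to a speed-up of about 1560-fold.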
As mentioned in my first blog, Sydney Brenner has remarked (Nature, June 5, 1980) that “Progress in science depends on new techniques, new discoveries, and new ideas, probably in that order.” Bell and colleagues conclude in a similar vein: “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization and cloud computing technologies.” Workflows (e.g. using Taverna) for data integration have proved useful for many purposes, including for systems biology and for bio-statistical analyses of microarray data. Big databases with fast access and information visualisation look like being the next important areas for biology.
- Bell, G., Hey, T. & Szalay, A. (2009). Beyond the data deluge. Science 323, 1297-1298
- Kell, D. B. & Oliver, S. G. (2004). Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. Bioessays 26, 99-105
- Kuhn, T. S. (1996). The structure of scientific revolutions. University of Chicago Press, Chicago
- Li, P., Oinn, T., Soiland, S. & Kell, D. B. (2008). Automated manipulation of systems biology models using libSBML within Taverna workflows. Bioinformatics 24, 287-289
- Li, P., Castrillo, J. I., Velarde, G., Wassink, I., Soiland-Reyes, S., Owen, S., Withers, D., Oinn, T., Pocock, M. R., Goble, C. A., Oliver, S. G. & Kell, D. B. (2008). Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially expressed genes from microarray data. BMC Bioinformatics 9, 334
- Mardis, E. R. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9, 387-402
- Shendure, J. & Ji, H. (2008). Next-generation DNA sequencing. Nat Biotechnol 26, 1135-1145
- Simmhan, Y., Barga, R., Ingen, C. v., Nieto-Santisteban, M., Dobos, L., Li, N., Shipway, M., Szalay, A. S., Werner, S. & Heasley, J. (2009). GrayWulf: scalable software architecture for data intensive computing. 42nd Hawaii International Conference on System Sciences, 1-10
- Sterling, T. L., Salmon, J., Becker, D. J. & Savarese, D. F. (1999). How to Build a Beowulf: A Guide to Implementation and Application of PC Clusters. The MIT Press, Cambridge, MA
- Ware, C. (2000). Information visualization. Morgan Kaufmann, San Francisco
- Wedge, D. & Kell, D. B. (2008). Rapid prediction of optimum population size in genetic programming using a novel genotype – fitness correlation. GECCO 2008 (M. Keizer et al., eds), 1315-1322