With the discovery of DNA, the completion of genome sequencing for a number of organisms, and the advent of powerful high-throughput measurement technologies such as microarrays, it is now commonly said that biology has gone through a revolution. But I have also heard it said that biology is only about to go through a scientific revolution, much as physics did in the 17th century. In messianic hopes, people foretell the coming of the Newton of biology, but it is up to us, the scientific community, to set the stage for that to happen.

Both views are valid, each in its own sense. The discovery of DNA and the more recent development of powerful new technologies have certainly revolutionized our understanding of the inner workings of life and allowed us to probe deep into the machinery of living organisms, much as the Copernican system and Galileo’s telescope helped revolutionize astronomy. It was Sir Isaac Newton, however, who placed science on a solid footing by formalizing existing knowledge in terms of mathematical models and universal laws. In some sense, this was the real scientific revolution, because it permitted the prediction of physical phenomena in a general setting, as opposed to merely describing individual observations. The difference is profound. Whereas a mathematical equation can adequately describe a given set of observations, it may lack the universality needed for making predictions. Kepler’s equations pertained only to the planets in our solar system; Newton’s laws could be used to predict what would happen to two arbitrary bodies anywhere in the universe. The universality of a scientific theory, coupled with mathematical modeling, allows us to make testable predictions. This ability will have a profound effect on the field of biology.

The hallmarks of a great scientific theory are universality and simplicity. Newton’s law of gravity is a case in point. The fact that the force of attraction between any two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them is both universal and simple. These issues are especially important today in the rapidly evolving field of genomics, where formal mathematical and computational methods are becoming indispensable. So what should be our guiding principles, our beacons of scientific inquiry? One such fundamental principle underpinning all scientific investigation is Ockham’s razor, also called the “law of parsimony.”

Consider the following, seemingly straightforward problem. We are presented with a set of data, represented as pairs of numbers (x, y). In each pair, the first number (x) is an independent variable and the second number (y) is a dependent variable. The problem is to choose whether to fit a line (of the form y = a + bx) or a parabolic function (of the form y = a + bx + cx²). The knee-jerk response might be as follows: let’s fit the parabolic function, since the linear function is clearly a special case of it, obtained by letting c = 0; thus, the parabola will always fit our data set at least as well as the line. After all, if it so happens that our data points are arranged on a line, the estimation of the parameters (a, b, and c) will simply reveal that c is indeed equal to zero and the parabolic function will reduce to a linear one. Thus, it would seem, three “adjustable” parameters are better than two. Of course, such reasoning could be taken ad absurdum if we had the freedom to choose as many parameters as we liked. Thus, there must be a tradeoff: although three parameters surely provide a better fit to the data, the model becomes more complex, and so we sacrifice simplicity. But why is that bad?
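This tradeoff can be made concrete with a minimal sketch (the data set, noise level, and helper names below are invented for illustration). Both models are fitted by ordinary least squares to data that are truly linear plus noise; because the line is the parabola's special case c = 0, the parabola's residual error can never be larger, yet its fitted c hovers near zero.

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.
    Returns coefficients [a0, a1, ...] of a0 + a1*x + a2*x^2 + ..."""
    n = degree + 1
    # Normal-equation matrix A and right-hand side b for the monomial basis.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j]
                                for j in range(i + 1, n))) / A[i][i]
    return coeffs

def sse(xs, ys, coeffs):
    """Sum of squared residuals of the fitted polynomial."""
    return sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))

random.seed(0)
xs = [i / 10 for i in range(20)]
ys = [2.0 + 3.0 * x + random.gauss(0, 0.1) for x in xs]  # truly linear + noise

line = polyfit(xs, ys, 1)      # y = a + b*x
parabola = polyfit(xs, ys, 2)  # y = a + b*x + c*x^2

# The parabola never fits worse than the line it contains as a special case.
assert sse(xs, ys, parabola) <= sse(xs, ys, line) + 1e-9
```

Running the sketch, the fitted line recovers a ≈ 2 and b ≈ 3; the parabola shaves a little off the residual error, but only by chasing the noise.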

The general answer is that by making a model overly complex, we forfeit predictive accuracy. A complex model may describe the observed data very well, but will it accurately predict future instances? For example, if the data contain random fluctuations or noise, an excessively complex model will “overfit” the data along with the noise and will consequently provide a poor fit to future (unseen) data. The chief goal of model selection is to find the right balance between simplicity and goodness-of-fit.

Consider gene expression–based cancer classification. The basic idea is simple: take a number of tumor samples of known type, measure the expression of thousands of genes in each one, and, on the basis of these observations, construct a classifier (model) that will predict the tumor type when presented with an unknown sample. A fundamental question is “What type of classifier should we choose?” This is the crucial step of model selection (in machine learning, the model class is called the “hypothesis space”). The next step—actually selecting a particular classifier from the model class (i.e., selecting a particular hypothesis)—is fairly well understood, as it involves the estimation of parameters.

As discussed, it would be unwise to devise an overly complex classifier consisting of hundreds or thousands of parameters, especially in light of the rather small sample sizes (numbers of tumors) available, typically below 100. Such a classifier may have an extremely small error, or even none at all, on the seen data but may exhibit very high error on unseen data. Hence, its predictive accuracy would be very poor.
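A sketch of this danger, using purely synthetic data (the gene and tumor counts below are invented for illustration, not drawn from any real study): a one-nearest-neighbour classifier is an extremely flexible model whose effective parameters are the training points themselves. On random “expression profiles” with random labels, it classifies every training sample perfectly while having, by construction, no real predictive power.

```python
import random

def nearest_neighbor_predict(train, labels, x):
    """1-nearest-neighbour classifier: predict the label of the
    training point closest (in squared Euclidean distance) to x."""
    dists = [(sum((a - b) ** 2 for a, b in zip(t, x)), i)
             for i, t in enumerate(train)]
    return labels[min(dists)[1]]

random.seed(1)
GENES, TUMORS = 1000, 40  # thousands of features, far fewer samples
# Random "expression profiles" and random labels: there is no signal to learn.
data = [[random.gauss(0, 1) for _ in range(GENES)] for _ in range(TUMORS)]
labels = [random.randint(0, 1) for _ in range(TUMORS)]

train_x, test_x = data[:20], data[20:]
train_y, test_y = labels[:20], labels[20:]

train_acc = sum(nearest_neighbor_predict(train_x, train_y, x) == y
                for x, y in zip(train_x, train_y)) / len(train_x)
test_acc = sum(nearest_neighbor_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y)) / len(test_x)

print(train_acc)  # 1.0: each training tumor is its own nearest neighbour
print(test_acc)   # near chance: the perfect training fit predicts nothing
```

The perfect score on seen data is an artifact of memorization; on unseen samples the classifier can do no better than coin-flipping, which is exactly the gap between fit and prediction that the text describes.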

So, suitable criteria or methods are needed to help us strike the right balance between simplicity and goodness-of-fit, such that predictive accuracy is maximized. Fortunately, the recent statistical literature is replete with such approaches, including the Bayesian information criterion, Akaike’s information criterion, the minimum description length principle, and cross-validation methods.
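As a sketch of how one such criterion arbitrates, the Bayesian information criterion for regression under Gaussian noise can be written as BIC = n·ln(SSE/n) + k·ln(n), where n is the sample size, SSE the residual sum of squares, and k the number of parameters; lower is better. The residual sums of squares below are hypothetical numbers chosen only to illustrate the penalty at work.

```python
import math

def bic(n, sse, k):
    """Bayesian information criterion for Gaussian-noise regression:
    n*ln(SSE/n) rewards fit, k*ln(n) penalizes extra parameters."""
    return n * math.log(sse / n) + k * math.log(n)

# Hypothetical fits to n = 50 points: the parabola fits slightly better
# (smaller SSE) but must pay ln(50) for its third parameter.
n = 50
bic_line = bic(n, sse=1.02, k=2)      # y = a + b*x
bic_parabola = bic(n, sse=1.00, k=3)  # y = a + b*x + c*x^2

# The tiny improvement in fit does not cover the complexity penalty,
# so the criterion selects the simpler model.
assert bic_line < bic_parabola
```

Note the Ockham-like structure of the formula: complexity is tolerated only when it buys a large enough improvement in fit, and the penalty grows with sample size.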

In the field of toxicogenomics, issues related to prediction and model selection are of vital importance. For example, toxicogenomic biomarkers should reliably predict toxic effects to help us develop safer drugs and chemicals and understand molecular mechanisms of pathogenesis. Models of genetic networks and gene expression–based classifiers are expected to predict consistently a cell’s response to a stressful challenge and to classify unknown compounds. A keen awareness of Ockham’s razor will help guide us on our quest to understand the nature of living systems and their behavior under various environmental conditions.
