Bioinformatics, Computational and Systems Biology - Poster Session D
Poster M2 - Solving the “Blind men and the elephant problem”: Additive statistical learning of complex high dimensional models from partial faceted datasets
Introduction: Biological systems are complex networks with thousands of interacting molecular components. Biological function and dysfunction are often emergent properties of these networks. Quantifying the contributions of all variables to a biological function simultaneously is challenging, which makes it difficult to obtain a full understanding of the system. More often, only a subset of variables is measured and quantified, yielding a projection (or facet) of the relationship between the biological output and the underlying variables. It is therefore desirable to reconstruct the full relationship between the biological output and all underlying variables from many sets of faceted data. In this paper, we first describe the general idea of faceted learning based on multiple data subsets of the same problem. We then illustrate the method using machine learning models based on polynomial regression and neural networks, respectively. Two concrete examples are discussed: a spring network system under random force and a small biological network that includes the cellular senescence marker P53. The full system is successfully reconstructed from faceted data in both cases. We further discuss the additive property of the model: model accuracy increases with the number of simultaneously measured variables (the dimension of the subsets). Our model provides a novel approach that uses conditional distributions to integrate different pieces of information and reconstruct complex high-dimensional systems (e.g. cell motility control, gene regulatory networks).
Materials and Methods: In this paper, we use machine learning models based on conditional probability distributions to reconstruct the full system from faceted partial datasets. We use the conditional expectation and conditional variance to approximate the probability distribution function, and we minimize the loss between the predicted distribution and the true distribution. The underlying distribution of the input variables is approximated by Gaussian models or Gaussian mixture models. Both polynomial regression and neural network models are examined. For the polynomial regression model, the function is expanded to second order. For both models, simulated annealing is used for parameter optimization.
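The conditional-expectation step described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a single Gaussian input distribution with known mean `mu` and covariance `sigma`, and a second-order polynomial written as f(x) = c + b·x + xᵀAx. When only the variables in `obs_idx` are measured, the facet prediction is E[f(x) | x_obs] = c + b·m + mᵀAm + tr(AC), where m and C are the conditional mean and covariance of the full input vector. The function names and the parameterization (`c`, `b`, `A`) are hypothetical and chosen only for illustration; the conditional-variance term, the loss function, and the simulated annealing fit are not shown.

```python
import numpy as np

def conditional_gaussian(mu, sigma, obs_idx, x_obs):
    """Mean and covariance of the full vector x given x[obs_idx] = x_obs.

    Assumes x ~ N(mu, sigma) and at least one observed variable.
    Observed entries have zero conditional variance.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x_obs = np.asarray(x_obs, dtype=float)
    n = mu.size
    obs = np.asarray(obs_idx, dtype=int)
    hid = np.setdiff1d(np.arange(n), obs)  # unmeasured variables

    m = np.zeros(n)
    C = np.zeros((n, n))
    m[obs] = x_obs
    if hid.size:
        s_oo = sigma[np.ix_(obs, obs)]
        s_ho = sigma[np.ix_(hid, obs)]
        s_hh = sigma[np.ix_(hid, hid)]
        # gain = S_ho S_oo^{-1}, using the symmetry of S_oo
        gain = np.linalg.solve(s_oo, s_ho.T).T
        m[hid] = mu[hid] + gain @ (x_obs - mu[obs])
        C[np.ix_(hid, hid)] = s_hh - gain @ s_ho.T
    return m, C

def facet_expectation(c, b, A, mu, sigma, obs_idx, x_obs):
    """E[c + b.x + x^T A x | x[obs_idx] = x_obs] under a Gaussian input model."""
    m, C = conditional_gaussian(mu, sigma, obs_idx, x_obs)
    b = np.asarray(b, dtype=float)
    A = np.asarray(A, dtype=float)
    return c + b @ m + m @ A @ m + np.trace(A @ C)

# Illustrative use: f(x) = x0 + x1^2 with independent standard normal inputs,
# where only x0 is measured. The unmeasured x1 contributes its variance.
pred = facet_expectation(
    c=0.0,
    b=[1.0, 0.0],
    A=[[0.0, 0.0], [0.0, 1.0]],
    mu=[0.0, 0.0],
    sigma=np.eye(2),
    obs_idx=[0],
    x_obs=[2.0],
)
print(pred)
```

In a fitting loop, predictions of this kind would be compared with the measured facet data for each subset of variables, and the shared parameters `c`, `b`, and `A` would be adjusted to reduce the mismatch, so that every facet constrains the same full model.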
Results, Conclusions, and Discussions: In this paper, we develop an additive machine learning model to reconstruct the full system from faceted partial datasets. The general idea is to reconstruct the full probability distribution function from partial datasets. We illustrate the method using machine learning models based on polynomial regression and neural networks, respectively. Two concrete examples are discussed: a spring network system under random force and a small biological network that includes the cellular senescence marker P53. The full system is successfully reconstructed from faceted data in both cases. Interestingly, we find that the intrinsic P53 regulation mechanism is the same for cells in different conditions; the only difference between conditions is the distribution of the input variables. Our method cleanly separates the intrinsic governing law from the natural distribution of the input variables. The polynomial regression model also allows us to explore synergistic and antagonistic effects in a biological network. We further discuss the additive property of the model: model accuracy increases with the number of simultaneously measured variables (the dimension of the subsets). This work provides a novel approach for predicting biological function, such as cell motility and gene regulatory network structure, from fragmented pieces of data.