Assistant Professor, Yale University, New Haven, Connecticut, United States
Introduction: A key step towards rational microbiome engineering is the in silico sampling of realistic microbial communities that correspond to desired host phenotypes, and vice versa. This remains challenging due to a lack of generative models that simultaneously model compositions of host-associated microbiomes and host phenotypes. To that end, we present a machine learning model based on the consumer/resource (C/R) framework. In the model, variation in microbial ecosystem composition arises due to differences in the availability of effective resources (latent variables) while species’ resource preferences remain conserved. Variation in the same latent variables is used to model phenotypic variation across hosts. In silico microbiomes generated by our model accurately reproduce universal and dataset-specific statistics of bacterial communities. The model allows us to address two salient questions in microbiome design: (1) which host phenotypes maximally constrain the composition of the host-associated microbiome? and (2) what are plausible microbiome compositions corresponding to user-specified host phenotypes? Thus, our model aids the design and analysis of microbial communities associated with host phenotypes of interest.
Materials and Methods: We studied microbiomes associated with three host species: the rumen microbiome of Holstein cows, the cecal microbiome of chickens, and the human gut microbiome. All microbiomes were characterized at the level of operational taxonomic units (OTUs). For the bovine hosts, approximately 50 host phenotypes were measured, including traits related to rumen chemistry (e.g., volatile fatty acids) and animal physiology (e.g., milk production, feed conversion efficiency).
We trained our latent variable model on these data. Details can be found here: https://www.biorxiv.org/content/10.1101/2023.04.28.538625v3.full
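The structure described in the Introduction, in which shared latent variables (effective resource availabilities) generate both the community composition and the host phenotypes, can be illustrated with a minimal sketch. This is not the implementation from the preprint: the softmax link, the linear-Gaussian phenotype readout, the standard-normal prior, and all names (`sample_hosts`, `preferences`, `phenotype_weights`, `noise_sd`) are assumptions made for illustration only.

```python
import numpy as np


def sample_hosts(n_hosts, preferences, phenotype_weights, noise_sd=0.1, seed=0):
    """Draw joint (composition, phenotype) samples from shared latent variables.

    preferences:       (n_latents, n_otus) array of conserved resource
                       preferences per OTU (hypothetical parameterization).
    phenotype_weights: (n_latents, n_phenotypes) array mapping latents to
                       phenotypes (hypothetical parameterization).
    """
    rng = np.random.default_rng(seed)
    n_latents = preferences.shape[0]

    # Host-specific effective resource availabilities (the latent variables).
    # A standard-normal prior is an assumption of this sketch.
    z = rng.standard_normal((n_hosts, n_latents))

    # Composition: softmax over OTUs of resource-weighted preferences
    # (assumed link function), so each row is a relative-abundance vector.
    logits = z @ preferences
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    abundances = np.exp(logits)
    abundances /= abundances.sum(axis=1, keepdims=True)

    # Phenotypes: linear readout of the same latents plus Gaussian noise
    # (assumed observation model).
    noise = noise_sd * rng.standard_normal((n_hosts, phenotype_weights.shape[1]))
    phenotypes = z @ phenotype_weights + noise
    return z, abundances, phenotypes
```

Because the preferences are shared across hosts and only `z` varies, all between-host variation in both outputs traces back to the latent variables, which is the property the model relies on.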
Results, Conclusions, and Discussions: Rationalizing the observed variation in complex, high-dimensional biological systems using mechanistic modeling is seldom possible. At the same time, we can now collect large amounts of data. This has enabled data-driven generative models that describe possible variations in biological systems. Unfortunately, with few exceptions, studies of host-associated microbiomes operate with small sample sizes. Moreover, the context of the microbiome sample (the abiotic environment or the phenotypic state of the host organism) is of paramount importance in determining ecosystem composition. This context dependence likely contributes to the low reproducibility of microbiome studies.
In this work, we presented a generative model that captures the simultaneous variation in host-associated microbiomes and host phenotypes. Notably, our model can assign probabilities to specific communities and phenotypes. This allows us to identify realistic communities that are hypothesized to correspond to desired host states.
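To illustrate how a joint latent-variable model can propose communities for a user-specified host state, the sketch below conditions on a target phenotype vector and samples compositions from the resulting posterior over latents. It assumes a linear-Gaussian phenotype model with a standard-normal prior, for which the posterior is available in closed form, and a softmax link to compositions. These assumptions, and every name used (`sample_communities_given_phenotype`, `preferences`, `phenotype_weights`, `noise_sd`), are hypothetical and are not taken from the preprint.

```python
import numpy as np


def sample_communities_given_phenotype(y_target, preferences, phenotype_weights,
                                       noise_sd=0.1, n_samples=100, seed=0):
    """Sample compositions consistent with a target phenotype vector.

    Assumed model: z ~ N(0, I), y = z @ W + noise, noise ~ N(0, noise_sd^2 I),
    composition = softmax(z @ preferences).
    """
    rng = np.random.default_rng(seed)
    W = phenotype_weights                     # (n_latents, n_phenotypes)
    k = W.shape[0]

    # Closed-form Gaussian posterior over latents given the phenotypes.
    precision = np.eye(k) + W @ W.T / noise_sd**2
    cov = np.linalg.inv(precision)
    cov = (cov + cov.T) / 2                   # enforce exact symmetry
    mean = cov @ W @ y_target / noise_sd**2

    # Draw latents from the posterior, then map them to compositions.
    z = rng.multivariate_normal(mean, cov, size=n_samples)
    logits = z @ preferences
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    abundances = np.exp(logits)
    abundances /= abundances.sum(axis=1, keepdims=True)
    return z, abundances
```

In this toy setting, the spread of the sampled compositions indicates how strongly a given set of phenotypes constrains the community: a tight posterior over the latents yields nearly identical compositions, while a broad one leaves many compositions plausible.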
Importantly, our model does not establish causal connections. We have shown that certain host traits, such as pH and nutrient intake, are associated with more specific microbial communities. This does not necessarily mean that those traits can drive the microbiome towards a particular state, although such relationships are useful hypotheses. Rather, the ability to generate an arbitrary number of realistic samples facilitates the study of complex communities by enabling data stratification beyond what is experimentally possible.