A recently published article, “How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities”, outlines a vision for the creation of an AI Virtual Cell (AIVC)1. Such a model would tremendously benefit biological research, from drug development to biological engineering, and provide insights into basic biological questions. The paper envisions an AIVC model of a complete human cell where all relationships, across all scales, are learned from data via machine learning or other inference methods (figure 1). It is a bold vision and naturally sparks a discussion about how cellular models could and should be built and used, and how we can use what is available today to help bridge the gap to tomorrow’s vision.
In this article, we’ll review this vision for an AIVC and contrast it with what we’ve built and are building to model and simulate cells across scales.
Vision and Practical Constraints of an AI Virtual Cell
The article paints an ambitious picture, comprising an all-encompassing list of features and capabilities that an AIVC should have. As the authors note, enabling such a model will require enormous quantities of data across a highly diverse range of data types and cellular contexts.
The rate at which data is currently generated is invoked as evidence that this need can be satisfied. However, while we can safely assume that at some point in the future there will be enough data to train such a model, we have no idea when that might be, what technological challenges we will face in getting there, or how we might overcome them.
Even with the technologies available today, the cost of data generation can be prohibitive at scale. For example, it has been almost a decade since Perturb-seq was developed, yet at ~$100 per gene the cost of a single genome-wide Perturb-seq experiment is currently estimated at about $2M per cell line2. If this were performed for 1,000 cell lines, a similar scale to the DepMap dataset, this dataset alone would cost $2B. Given this cost, one practical question is how data-driven approaches like the AIVC might work in concert with the myriad other approaches developed over the last decades for modeling cellular systems.
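The cost estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the per-gene cost and the number of cell lines come from the text, while the figure of roughly 20,000 genes per genome-wide screen is our assumption, chosen because it is what the quoted ~$2M per cell line implies.

```python
# Back-of-the-envelope cost of genome-wide Perturb-seq at DepMap-like scale.
COST_PER_GENE_USD = 100    # from the text: ~$100 per perturbed gene
GENES_PER_SCREEN = 20_000  # assumption: implied by ~$2M per cell line
N_CELL_LINES = 1_000       # from the text: similar scale to DepMap


def perturb_seq_cost(n_genes: int, cost_per_gene: int, n_cell_lines: int = 1) -> int:
    """Total cost in USD, assuming cost scales linearly with genes and cell lines."""
    return n_genes * cost_per_gene * n_cell_lines


cost_per_cell_line = perturb_seq_cost(GENES_PER_SCREEN, COST_PER_GENE_USD)
cost_all_lines = perturb_seq_cost(GENES_PER_SCREEN, COST_PER_GENE_USD, N_CELL_LINES)

print(f"One cell line: ${cost_per_cell_line:,}")
print(f"{N_CELL_LINES:,} cell lines: ${cost_all_lines:,}")
```

The linear scaling is itself a simplification; it ignores any economies of scale or changes in sequencing cost over time.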
Our Take on ‘All Models are Wrong, But Some Are Useful’
At Deep Origin we’ve built cellular modeling and simulation across scales, from atomistic simulations of proteins through to cellular-level models, and like the authors we share a vision for how such models can transform our understanding of biology and our ability to treat and cure disease. However, we offer a different take on how to proceed most effectively to realize these goals as soon as possible. As a commercial entity this is partly out of necessity, since grand long-term visions must be balanced with nearer-term utility, but ultimately it is shaped by the well-worn modeling adage that “all models are wrong, but some are useful”.
Practically, what is important for a model is that it is useful, and usefulness is highly context dependent. What kind of model best serves the needs of a bench scientist trying to interpret their data before a presentation the next day? What types of biological components should their model contain, and at what granularity should they be modeled? For their questions of interest and data at hand, is a data-driven or a mechanistic model more suitable, or something else entirely? Is it always better for a model to be larger and contain more detail, or might a simpler model of fewer components be more predictive? Rather than aiming for a ‘one-size-fits-all’ universal model developed with a single modeling formalism, such as the AIVC, we can instead give users the ability to define and create their own models from a range of modeling approaches. This in effect opens up and democratizes the modeling process and gives domain experts a means of experimenting with different kinds of models. If we do so, the potential of cellular simulations as a valuable tool in the life sciences, and their adoption, will likely be realized far sooner.
We believe that for any given biological system of interest and set of questions we have about it, we should consider, and if appropriate combine, the full gamut of modeling approaches available to create the most useful model possible (see table for a comparison of these approaches).
These modeling approaches lie on a continuum. At one end are purely data-driven machine learning approaches, like the AIVC, while at the other end are purely mechanistic approaches, such as the recently updated SPARCED model3. Between these two extremes lies a rich variety of approaches with differing trade-offs. One example is LEMBAS, a neural-network-based approach in which the network structure is defined by the known pathway structure but the nature of the relationships is learned4. (Outside of solely biological applications there are concerted efforts to combine data-driven and physically constrained models, such as universal differential equations5, chemical reaction neural networks6, and physics-informed neural networks (PINNs)7.) Another example is Genome-scale Metabolic Models (GEMMs), constraint-based ‘non-dynamical’ models which capture mechanism and can be easily scaled to capture all metabolism for a given cell type8,9. A further approach uses the output of mechanistic models to train machine-learning-based ‘surrogate’ models, which aim to faithfully capture the dynamics of the mechanistic model but can be simulated far more efficiently10. Combining the available modeling approaches, each perhaps chosen to best represent a separate biological sub-system, into a single ‘hybrid’ model is referred to as ‘Compositional Systems Biology’. It was the basis for the original whole-cell mycoplasma model11 and the ongoing efforts with the E. coli model12, and is now being formalized13 and made accessible through the Vivarium framework14.
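To make the idea of a ‘hybrid’ model concrete, here is a minimal sketch in plain Python of two sub-models with different formalisms sharing one state. It is not the Vivarium API or any published model: the class names, the `update` interface, and all parameter values are hypothetical and chosen purely for illustration. One process is mechanistic (first-order protein degradation), while the other stands in for a data-driven component (a linear fit from a signal to a synthesis rate, with made-up weights that would in practice be learned from data).

```python
from dataclasses import dataclass


@dataclass
class MechanisticDegradation:
    """Known mechanism: d[protein]/dt = -k * protein."""
    k: float = 0.1  # hypothetical rate constant

    def update(self, state: dict, dt: float) -> dict:
        return {"protein": -self.k * state["protein"] * dt}


@dataclass
class LearnedSynthesis:
    """Stand-in for a data-driven model: rate = w * signal + b.
    The weights are invented here; a real component would be fit to data."""
    w: float = 2.0
    b: float = 0.5

    def update(self, state: dict, dt: float) -> dict:
        rate = max(0.0, self.w * state["signal"] + self.b)
        return {"protein": rate * dt}


def simulate(processes, initial_state: dict, dt: float, n_steps: int) -> dict:
    """Forward-Euler composition: every process reads the same state,
    and the changes they propose are summed at each time step."""
    state = dict(initial_state)
    for _ in range(n_steps):
        deltas = [process.update(state, dt) for process in processes]
        for delta in deltas:
            for key, change in delta.items():
                state[key] += change
    return state


final_state = simulate(
    [MechanisticDegradation(), LearnedSynthesis()],
    {"protein": 0.0, "signal": 1.0},
    dt=0.01,
    n_steps=20_000,
)
print(final_state["protein"])
```

With these illustrative values the protein level settles near the synthesis rate divided by the degradation constant (2.5 / 0.1 = 25). The design point is that either process could be swapped for a different formalism without changing the other, as long as both agree on the shared state.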
The State of Mechanistic Models
Mechanistic models, that is, models which aim to be a replica of the real system, have made substantial progress in the last 10 years, both in the range of models available to users3,11,12,15,16 and in the frameworks for building and defining them17–20. Their primary advantage is explainability: the ability to map what happened in a simulation back to the real world.
When used as a predictive tool, for example to predict novel drug targets, this can provide valuable insight into the mechanism of action, which can inform important target characteristics such as propensity for toxicity or the likelihood and identity of subsequent resistance mutations. Alternatively, as a tool for interpreting and understanding experimental data, it is debatable whether it is possible to get to a true mechanistic explanation without a mechanistic model, since a data-driven model will always require a final leap of interpretation into a mechanistic context. A second advantage may be lower susceptibility to over-fitting and increased generalizability outside of the training data, as the defined mechanistic model structure constrains how far the model can deviate from reality in order to satisfy the training data. Nonetheless, data-driven models have their own, often complementary, advantages. Chief among them is scalability. For example, GEARS and scGPT, a graph neural network and a transformer-based model respectively21,22, can in principle, and data permitting, make genome-wide predictions of transcriptome perturbations. While whole-cell mechanistic models have been developed for the simplest organisms11,23 and efforts are ongoing for E. coli12, we are a long way from achieving a model of human cells at such a scale.
Data-driven models can also typically incorporate a wider range of data more easily than mechanistic models can. In many contexts data is scarce and expensive to generate, so an ability to make use of what is available can be a significant practical advantage. While there has been caution over whether these complex machine-learning architectures actually lead to a predictive benefit over simpler models (and some evidence to suggest they currently do not24), they will undoubtedly improve and provide a much-needed route to model scalability. Whatever modeling strategy is used, it is clear that to achieve large and predictive models, data collection and curation efforts will need to be substantially expanded. Fortunately, much of this data can be used to improve construction and parameterization across a range of model types.
Deep Origin’s Approach
At Deep Origin our approach aims to combine different modeling formalisms using compositional frameworks like Vivarium to achieve, where possible, both scalability and explainability (figure 2). From a user’s perspective, someone wanting to use a model should be able to define its format, scale, and granularity to find the most useful, tractable model for their needs.
Our goal at Deep Origin is to transform cellular simulations from a fringe activity into an accessible and valuable tool for the life sciences. Data-driven models, such as the vision outlined for the AIVC, are part of a broad patchwork of models needed to achieve this and make biology a more predictive science.
Citations
1. Bunne, C. et al. How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities. Preprint at http://arxiv.org/abs/2409.11654 (2024).
2. https://douglasyao.github.io/blogs/2023/10/30/Summary-of-Scalable-genetic-screening-for-regulatory-circuits-using-compressed-Perturb-seq-Yao-et-al-2023-Nature-Biotechnology.html.
3. Erdem, C. et al. A scalable, open-source implementation of a large-scale mechanistic model for single cell proliferation and death signaling. Nat. Commun. 13, 3555 (2022).
4. Nilsson, A., Peters, J. M., Meimetis, N., Bryson, B. & Lauffenburger, D. A. Artificial neural networks enable genome-scale simulations of intracellular signaling. Nat. Commun. 13, 3069 (2022).
5. Rackauckas, C. et al. Universal Differential Equations for Scientific Machine Learning. Preprint at http://arxiv.org/abs/2001.04385 (2021).
6. Ji, W. & Deng, S. Autonomous Discovery of Unknown Reaction Pathways from Data by Chemical Reaction Neural Network.
7. Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. Preprint at http://arxiv.org/abs/1711.10561 (2017).
8. Orth, J. D., Thiele, I. & Palsson, B. Ø. What is flux balance analysis? Nat. Biotechnol. 28, 245–248 (2010).
9. Agren, R. et al. Identification of anticancer drugs for hepatocellular carcinoma through personalized genome-scale metabolic modeling. Mol. Syst. Biol. 10, 721–721 (2014).
10. Pestourie, R., Mroueh, Y., Rackauckas, C., Das, P. & Johnson, S. G. Physics-enhanced deep surrogates for partial differential equations. Nat. Mach. Intell. 5, 1458–1465 (2023).
11. Karr, J. R. et al. A Whole-Cell Computational Model Predicts Phenotype from Genotype. Cell 150, 389–401 (2012).
12. Ahn-Horst, T. A., Mille, L. S., Sun, G., Morrison, J. H. & Covert, M. W. An expanded whole-cell model of E. coli links cellular physiology with mechanisms of growth rate control. Npj Syst. Biol. Appl. 8, 30 (2022).
13. Agmon, E. Prelude to a Compositional Systems Biology. Preprint at http://arxiv.org/abs/2408.00942 (2024).
14. Agmon, E. et al. Vivarium: an interface and engine for integrative multiscale modeling in computational biology. Bioinformatics 38, 1972–1979 (2022).
15. Malik-Sheriff, R. S. et al. BioModels—15 years of sharing computational models in life science. Nucleic Acids Res. gkz1055 (2019) doi:10.1093/nar/gkz1055.
16. Fröhlich, F. et al. Efficient Parameter Estimation Enables the Prediction of Drug Response Using a Mechanistic Pan-Cancer Pathway Model. Cell Syst. 7, 567-579.e6 (2018).
17. Hucka, M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531 (2003).
18. Harris, L. A. et al. BioNetGen 2.2: advances in rule-based modeling. Bioinformatics 32, 3366–3368 (2016).
19. Boutillier, P., Feret, J., Krivine, J. & Fontana, W. The Kappa Language and Tools. kappalanguage.org.
20. Lloyd, C. M., Halstead, M. D. B. & Nielsen, P. F. CellML: its future, present and past. Prog. Biophys. Mol. Biol. 85, 433–450 (2004).
21. Roohani, Y., Huang, K. & Leskovec, J. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nat. Biotechnol. (2023) doi:10.1038/s41587-023-01905-6.
22. Cui, H. et al. scGPT: Towards Building a Foundation Model for Single-Cell Multi-Omics Using Generative AI. http://biorxiv.org/lookup/doi/10.1101/2023.04.30.538439 (2023) doi:10.1101/2023.04.30.538439.
23. Thornburg, Z. R. et al. Fundamental behaviors emerge from simulations of a living minimal cell. Cell 185, 345-360.e28 (2022).
24. Ahlmann-Eltze, C., Huber, W. & Anders, S. Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods. Preprint at https://doi.org/10.1101/2024.09.16.613342 (2024).