An AI can predict how cells react when a gene is altered

A team of researchers from the University of Cambridge has developed an artificial intelligence system capable of predicting how cells react when a gene is modified. The model, called PT-RAG, outperforms previous approaches and could accelerate biomedical research, drug discovery, and the study of complex diseases.

The challenge of predicting how cells respond to genetic changes

Understanding how cells react when a gene is altered is one of the most complex problems in molecular biology. When scientists deactivate or modify a gene within a cell, the effect is not confined to that gene. Instead, the modification triggers a cascade of changes that can affect thousands of other genes, altering multiple cellular processes at the same time.

This cascade makes the outcome of a genetic perturbation extremely difficult to predict. In the laboratory, researchers run targeted experiments to observe these changes, but each one requires time, resources, and carefully controlled conditions. Responses also vary between cell types, which forces scientists to repeat experiments across multiple contexts.

In recent years, deep learning models have attempted to address this problem through computational simulations. The idea is to train algorithms using large datasets of gene expression so they can learn to predict how a cell will react to a genetic modification.

However, previous models had an important limitation: they generalized poorly. They worked well in scenarios similar to their training data but failed when predicting responses in different cell types or under new perturbations. In many cases, the issue was a lack of contextual information during the prediction process.

Solving this limitation is crucial for advancing fields such as genomics, biomedicine, and pharmaceutical research. This is where the new model developed by Cambridge researchers comes into play.

How PT-RAG works and why it represents an advance

The PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation) model introduces an innovative strategy inspired by techniques used in natural language processing. Instead of generating predictions solely from what it learned during training, the system retrieves relevant examples before producing an answer.

The model operates in two main stages.

The first stage is the retrieval phase. In this step, the system searches within a database of previous experiments for genetic perturbations that are most similar to the case it is trying to predict. To achieve this, it uses embeddings generated by GenePT, a language model specifically designed to represent genes and their biological functions.
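The retrieval idea can be illustrated with a minimal sketch. It assumes that each perturbation is represented by an embedding vector and that similarity is measured with cosine similarity; the random vectors below merely stand in for GenePT embeddings, and the function names and the `gene_…` labels are hypothetical, not taken from the PT-RAG code.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, bank, k=3):
    """Return the k (name, similarity) pairs from `bank` closest to `query`."""
    scored = [(cosine(query, emb), name) for name, emb in bank.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:k]]


# Toy "database": random vectors stand in for real gene embeddings.
rng = random.Random(0)
bank = {f"gene_{i}": [rng.gauss(0, 1) for _ in range(16)] for i in range(50)}

# A query that is a slightly noisy copy of one stored perturbation.
query = [x + 0.01 * rng.gauss(0, 1) for x in bank["gene_7"]]

hits = retrieve_top_k(query, bank, k=3)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

In this toy setup the stored perturbation closest to the query comes back first, followed by the next most similar candidates, which would then be passed on to the second stage.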

The second stage introduces the key element of the model: an adaptive refinement step. Conventional RAG systems use the retrieved examples directly. PT-RAG instead employs a mechanism based on Gumbel-Softmax that selects the most informative examples in a differentiable way, depending on the cellular state and the perturbation being analyzed.
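A small sketch can show what a Gumbel-Softmax selection looks like in practice. It is only the generic forward pass of the technique, not the PT-RAG implementation: the relevance scores are hypothetical placeholders for what a learned, cell-state-aware scorer would produce, and the temperature `tau` is an assumed value. Plain Python has no automatic differentiation, so the sketch shows only why the output is a smooth function of the scores; in a deep learning framework the same arithmetic would let gradients flow back to the scorer.

```python
import math
import random


def gumbel_softmax(logits, tau=0.5, rng=random):
    """Draw one relaxed (soft) one-hot sample over the candidates."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled scores.
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
scores = [2.0, 0.5, -1.0, 0.1]  # hypothetical scores for 4 retrieved examples
weights = gumbel_softmax(scores, tau=0.5, rng=rng)
print([round(w, 3) for w in weights])

# Over many draws, the highest-scoring example is selected most often.
wins = [0] * len(scores)
for _ in range(2000):
    sample = gumbel_softmax(scores, tau=0.5, rng=rng)
    wins[sample.index(max(sample))] += 1
print(wins)
```

Lower temperatures push each sample closer to a hard choice of a single example, while higher temperatures spread the weight more evenly across the candidates.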

This process allows the system to adapt its predictions to the specific biological context, something previous models struggled to do with sufficient accuracy.

Researchers also discovered an important detail: applying a conventional RAG system without this cell-type–aware refinement can actually worsen predictions. This finding highlights that artificial intelligence techniques must be carefully adapted when transferred to complex scientific domains.

Results and potential applications in medicine

The performance of PT-RAG was evaluated using the Replogle-Nadig dataset, one of the most comprehensive datasets available on single-gene perturbations in individual cells.

The results show that the model surpasses systems considered state of the art in this field. Improvements were especially notable in the W1 and W2 distributional similarity metrics, which measure how well a model reproduces the full distribution of gene expression.

This is important because many previous tools only predict the average change in gene expression. In contrast, PT-RAG is capable of capturing the natural variability that exists between individual cells, a critical factor for understanding real biological processes.
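The difference between matching an average and matching a distribution can be made concrete with a toy calculation. The sketch below assumes that W1 and W2 refer to the Wasserstein-1 and Wasserstein-2 distances, and it uses the standard shortcut for one-dimensional samples of equal size, where the distance is computed by pairing sorted values. The expression values are simulated and purely illustrative; they are not taken from the Replogle-Nadig dataset.

```python
import random


def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-sized 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal pairing matches sorted values in order.
    paired = zip(sorted(xs), sorted(ys))
    return (sum(abs(a - b) ** p for a, b in paired) / len(xs)) ** (1 / p)


rng = random.Random(0)
n_cells = 1000

# Simulated expression of one gene across heterogeneous cells.
observed = [rng.gauss(5.0, 2.0) for _ in range(n_cells)]

# Predictor A outputs only the average value for every cell.
mean_only = [5.0] * n_cells

# Predictor B reproduces the cell-to-cell spread as well as the average.
with_spread = [rng.gauss(5.0, 2.0) for _ in range(n_cells)]

w1_mean_only = wasserstein_1d(observed, mean_only, p=1)
w1_with_spread = wasserstein_1d(observed, with_spread, p=1)
w2_mean_only = wasserstein_1d(observed, mean_only, p=2)

print(f"W1, mean-only prediction:   {w1_mean_only:.3f}")
print(f"W1, prediction with spread: {w1_with_spread:.3f}")
```

Both toy predictors get the average roughly right, but only the one that reproduces the variability between cells scores well on the distributional metric.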

The potential applications of this technology are broad. In drug discovery, the model could help predict how different cell types respond to modifications of target genes, accelerating the identification of promising therapeutic candidates.

It also opens new possibilities for personalized medicine, as it could model how different genetic profiles respond to specific interventions. This could lead to treatments better tailored to each patient.

Additionally, the system provides a powerful tool for fundamental biology, allowing researchers to computationally explore genetic perturbations that would be too costly or difficult to perform in laboratory experiments.

The PT-RAG model demonstrates how artificial intelligence can transform genetic research. By predicting more accurately how cells respond to genetic changes, this technology could accelerate drug development, improve personalized medicine, and open new paths for understanding the fundamental mechanisms of biology.

Reference:

Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation. arXiv.