Inspired by ChatGPT, Google DeepMind predicts the effects of 71 million genetic mutations! AI that deciphers the human genetic code is published in Science

Original source: Xinzhiyuan

Image source: Generated by Unbounded AI

After the protein prediction model AlphaFold set off a tsunami-level wave in the AI world, the Alpha family has welcomed a newcomer.

Today, Google DeepMind released a new AI model, AlphaMissense, which predicts the effects of all 71 million possible "missense mutations."

Specifically, AlphaMissense classified 89% of these missense mutations: 57% as likely benign and 32% as likely pathogenic.

By contrast, only 0.1% of these mutations have been confirmed by human experts.

In order for researchers to better understand its possible impact, Google has also made public the entire catalog of tens of millions of "missense mutations."

Discovering the underlying cause of disease has long been one of the greatest challenges in human genetics.

Missense mutations are genetic mutations that can affect the function of "human proteins" and can lead to diseases such as cystic fibrosis, sickle cell anemia, and cancer.

The birth of AlphaMissense demonstrates the huge potential of AI in the medical field, especially in genetics.

It is of great significance for understanding the relationship between genetic variation and disease and developing targeted drug treatments.

Following AlphaFold, AlphaMissense may become an AI that can change the world and is expected to overcome the problems of human genetics!

**What is a "missense mutation"?**

"Missense mutation" is a term used in biomedicine and molecular biology for a particular kind of mutation in protein-coding genes:

The substitution of a single letter in DNA results in a different amino acid in a protein.

If you think of DNA as a language, then the substitution of a single letter can change a word and completely change the meaning of a sentence.

In this case, changes to the DNA lead to changes in amino acids that affect the function of the protein.
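To make the "single letter" idea concrete, here is a minimal Python sketch that translates one codon (a three-letter DNA word) before and after a one-letter substitution and labels the result. The codon table is the standard genetic code; the function name `classify_substitution` is a hypothetical helper invented for this illustration. The example uses the well-known sickle cell substitution in the HBB gene, where GAG becomes GTG.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Label a single-base change inside one codon (position is 0-based)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"      # same amino acid, protein unchanged
    if after == "*":
        return "nonsense"        # new stop codon, protein truncated
    if before == "*":
        return "stop-lost"       # stop codon replaced by an amino acid
    return f"missense ({before} -> {after})"


print(classify_substitution("GAG", 1, "T"))  # missense (E -> V)
print(classify_substitution("GAG", 2, "A"))  # synonymous
```

Only the first call changes the protein: glutamate (E) is replaced by valine (V), while the second substitution leaves the amino acid untouched.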

The average person carries more than 9,000 missense mutations.

Generally speaking, most of these missense mutations are benign and have little impact on the human body. But the remaining few are pathogenic and can severely disrupt protein function.

Missense mutations can be used for the diagnosis of rare genetic diseases, because a few or even a single missense mutation may directly cause the disease.

In addition, they are important for studying complex diseases, such as type 2 diabetes, which may be caused by many different types of genetic variants.

Therefore, classifying missense mutations is an important step in understanding which protein changes may contribute to disease.

Of the more than 4 million missense mutations that have appeared in humans, only 2% have been labeled by experts as pathogenic or benign.

This represents only about 0.1% of all 71 million possible missense mutations.
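The arithmetic behind these percentages can be checked in a few lines of Python. The inputs are the rounded figures quoted in the article (4 million observed, 2% labeled, 71 million possible), so the result is only approximate.

```python
observed = 4_000_000        # missense mutations seen in humans (article figure)
labeled_fraction = 0.02     # share labeled by experts (article figure)
possible = 71_000_000       # all possible missense mutations (article figure)

labeled = observed * labeled_fraction   # about 80,000 expert-labeled mutations
share = labeled / possible              # fraction of all possible mutations

print(f"{labeled:,.0f} labeled, {share:.2%} of all possible")
```

Roughly 80,000 labeled mutations out of 71 million is a little over one in a thousand, which matches the "about 0.1%" in the text.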

The remaining mutations are classified as "variants of unknown significance" because there is no experimental or clinical data on their effects.

But with AlphaMissense, we now have the clearest picture yet of these mutations' effects:

Using a score threshold that yields 90% precision on a database of known disease variants, AlphaMissense can classify 89% of all mutations.

Built based on AlphaFold, inspired by the ChatGPT large model

So, how exactly is AlphaMissense built?

Since their release, AlphaFold and AlphaFold 2 have predicted, from amino acid sequences alone, the structures of almost all proteins known to science: more than 200 million in total.

Building on this, Google researchers adapted AlphaFold (hereinafter referred to as AF) so that it can predict the pathogenicity of missense mutations, each of which changes a single amino acid in a protein.

Simply put, AlphaMissense takes an amino acid sequence as input and predicts the pathogenicity of every possible single amino acid change at a given position in the sequence.
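As a rough illustration of what "all possible single amino acid changes at a given position" means, the sketch below enumerates them for a short peptide. The function name `single_substitutions` and the example sequence are illustrative assumptions, not part of AlphaMissense; a real model would attach a pathogenicity score to each of these candidates.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence: str, position: int) -> list[str]:
    """List variants such as 'E7V' for a 1-based position in the sequence."""
    ref = sequence[position - 1]
    return [f"{ref}{position}{alt}" for alt in AMINO_ACIDS if alt != ref]


peptide = "MVHLTPEEKS"  # illustrative peptide only
variants = single_substitutions(peptide, 7)
print(len(variants))   # 19
print(variants[:3])    # ['E7A', 'E7C', 'E7D']
```

Every position has 19 alternatives, so a protein of length n has 19n candidate substitutions. Not all of them can be produced by changing a single DNA letter, which is presumably why the count of possible missense mutations is smaller than the count of all possible substitutions.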

AlphaMissense is trained in two stages:

The first stage

A neural network is trained in the same way as AF. This network is inspired by large models like ChatGPT.

By predicting the identity of amino acids masked at random positions in multiple sequence alignments (MSAs), the network learns protein language modeling alongside single-chain structure prediction.
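The masking step can be pictured with a toy example. The sketch below hides random residues in a small alignment and records what was hidden, which is the target a masked language model would be trained to recover. The mask rate, mask token, and function name `mask_msa` are assumptions chosen for illustration and are not taken from the paper.

```python
import random


def mask_msa(msa, mask_rate=0.15, mask_token="#", seed=0):
    """Mask random residues in an MSA; return the masked rows and the targets.

    Gap characters ('-') are never masked. `targets` maps (row, column)
    to the original amino acid the model would have to predict.
    """
    rng = random.Random(seed)
    masked_rows, targets = [], {}
    for r, row in enumerate(msa):
        chars = list(row)
        for c, aa in enumerate(chars):
            if aa != "-" and rng.random() < mask_rate:
                targets[(r, c)] = aa
                chars[c] = mask_token
        masked_rows.append("".join(chars))
    return masked_rows, targets


toy_msa = ["MVHLTPEEKS", "MVHLTPEDKA", "MV-LSPEEKS"]  # invented alignment
masked, targets = mask_msa(toy_msa, mask_rate=0.3)
for row in masked:
    print(row)
```

Training then amounts to asking the network for a probability distribution over the 20 amino acids at each masked cell and rewarding it for assigning high probability to the hidden residue.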

The researchers made some minor architectural modifications to AF and increased the loss weights for protein language modeling, while still achieving comparable structure prediction performance to AF.

After pre-training, the masked language modeling head can already be used for mutation effect prediction by calculating the log-likelihood ratio between the reference amino acid and alternative amino acid probabilities, as is done in MSA Transformer and Evolutionary Scale Modeling (ESM).
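A minimal sketch of that log-likelihood ratio follows, assuming the model has already produced a probability for each amino acid at the masked position. The probabilities are invented for illustration, and the sign convention (alternative over reference) is an assumption; some tools report the opposite sign.

```python
import math


def log_likelihood_ratio(probs: dict[str, float], ref: str, alt: str) -> float:
    """Return log p(alt) - log p(ref) for one masked position."""
    return math.log(probs[alt]) - math.log(probs[ref])


# Invented model output for a single position (remaining mass omitted).
position_probs = {"E": 0.60, "D": 0.25, "Q": 0.10, "V": 0.01}

print(log_likelihood_ratio(position_probs, ref="E", alt="D"))
print(log_likelihood_ratio(position_probs, ref="E", alt="V"))
```

With this convention, a strongly negative value means the model finds the alternative amino acid much less plausible than the reference at that position, which is taken as a hint that the substitution may be damaging.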

These neural networks have proven good at predicting protein structures and designing new proteins, and they are especially useful for variant prediction because they have already learned which sequences are plausible and which are not.

The second stage

At this stage, the researchers fine-tuned the model on human proteins, placing the mutated sequence in the second row of the MSA and adding a variant pathogenicity classification objective.

Mutations were then labeled using data from human and primate populations, following the PrimateAI approach:

Mutations that are common in these populations are treated as benign, and mutations that have never been observed are treated as pathogenic.
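The labeling rule can be sketched as a small function. Everything specific here is an assumption made for illustration: the function name `weak_label`, the frequency cutoff, and the choice to leave rare-but-observed mutations unlabeled are not taken from the paper.

```python
from typing import Optional


def weak_label(observed_frequency: Optional[float],
               common_cutoff: float = 1e-3) -> Optional[int]:
    """Assign a training label from population data alone.

    observed_frequency is None when the mutation has never been seen.
    Returns 1 (treated as pathogenic), 0 (treated as benign), or None
    (rare but observed: left unlabeled in this sketch).
    """
    if observed_frequency is None:
        return 1
    if observed_frequency >= common_cutoff:
        return 0
    return None


print(weak_label(0.05))   # 0
print(weak_label(None))   # 1
print(weak_label(1e-6))   # None
```

Labels of this kind are noisy, since many never-observed mutations are harmless, but they require no clinical annotation, so they can be generated for a very large number of mutations.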

The researchers stopped training once the model began to overfit on the validation set (2,526 ClinVar variants, with equal numbers of benign and pathogenic variants per gene).

However, AlphaMissense does not predict changes in protein structure following mutations or other effects on protein stability.

Instead, it uses AlphaFold's "intuition" about structure to identify possible disease-causing mutations in proteins.

Specifically, it uses related protein sequences from databases, together with the structural context of the mutation, to generate a continuous score between 0 and 1 that approximates the probability that the mutation is pathogenic.

This continuous score allows users to select a threshold to classify mutations as pathogenic or benign, depending on their accuracy requirements.
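One way to pick such a threshold is to search a labeled set for the lowest cutoff whose "pathogenic" calls still reach a target precision. The sketch below does this on invented scores and labels; the helper names and the benign cutoff of 0.3 are hypothetical and do not come from AlphaMissense.

```python
def lowest_threshold_for_precision(scores, labels, target=0.9):
    """Lowest cutoff whose calls (score >= cutoff) reach the target precision.

    labels use 1 for pathogenic and 0 for benign. Returns None if no
    cutoff reaches the target.
    """
    best = None
    for cutoff in sorted(set(scores), reverse=True):
        called = [y for s, y in zip(scores, labels) if s >= cutoff]
        if sum(called) / len(called) >= target:
            best = cutoff
    return best


def classify(score, pathogenic_cutoff, benign_cutoff):
    """Turn a continuous score into one of three classes."""
    if score >= pathogenic_cutoff:
        return "likely pathogenic"
    if score <= benign_cutoff:
        return "likely benign"
    return "uncertain"


# Invented validation data: model scores with known labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

cutoff = lowest_threshold_for_precision(scores, labels, target=0.9)
print(cutoff)                                    # 0.7
print(classify(0.85, cutoff, benign_cutoff=0.3))  # likely pathogenic
print(classify(0.50, cutoff, benign_cutoff=0.3))  # uncertain
```

A stricter precision target pushes the cutoff up and leaves more mutations in the "uncertain" band, which is the trade-off a user makes when choosing a threshold.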

How AlphaMissense classifies human missense mutations

In experimental evaluation, AlphaMissense achieved state-of-the-art predictions across a wide range of genetic and experimental benchmarks, all without requiring explicit training on such data.

AlphaMissense outperforms other computational methods when classifying variants from ClinVar, a public data archive on the relationship between human variation and disease.

AlphaMissense was also the most accurate method for predicting laboratory results, which suggests it is consistent with different ways of measuring pathogenicity.

AlphaMissense outperforms other computational methods in predicting missense variant effects

AI changes genetics

A year ago, Google DeepMind released 200 million protein structures predicted using AlphaFold.

This initiative has helped millions of scientists around the world accelerate research and paved the way for new discoveries.

Now, AlphaMissense, based on AlphaFold, has further deepened the world's understanding of proteins by tracing them back to their origin in DNA.

Again, a key step in translating this research is collaboration with the scientific community.

Google DeepMind has been working with Genomics England to explore how AlphaMissense's predictions can help study the genetics of rare diseases.

Genomics England cross-referenced AlphaMissense's findings with previously compiled data on the pathogenicity of known human mutations.

The results of that evaluation are consistent with AlphaMissense's predictions, which provides AlphaMissense with a real-world benchmark.

Google DeepMind has published a lookup table of missense mutations and shared expanded predictions for all 216 million possible single amino acid substitutions in more than 19,000 human proteins.

The published data also includes an average predicted value for each gene, which is similar to a measure of a gene's evolutionary constraints, indicating how important that gene is to an organism's survival.
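The per-gene average is a simple aggregation, sketched below on invented rows. The gene names, scores, and the function name `mean_score_per_gene` are placeholders; the layout of the actual published files is not described in the article.

```python
from collections import defaultdict
from statistics import mean


def mean_score_per_gene(rows):
    """Average the pathogenicity scores of all variants in each gene."""
    by_gene = defaultdict(list)
    for gene, score in rows:
        by_gene[gene].append(score)
    return {gene: mean(scores) for gene, scores in by_gene.items()}


# Invented (gene, score) pairs standing in for per-variant predictions.
rows = [
    ("GENE_A", 0.9), ("GENE_A", 0.8), ("GENE_A", 0.7),
    ("GENE_B", 0.1), ("GENE_B", 0.3),
]

for gene, avg in mean_score_per_gene(rows).items():
    print(gene, round(avg, 2))
```

A gene whose variants are mostly predicted pathogenic, like the hypothetical GENE_A here, tolerates little change, which is why the average can serve as a rough proxy for how constrained the gene is.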

Examples predicted by AlphaMissense superimposed on structures predicted by AlphaFold

(Red = predicted to be pathogenic, blue = predicted to be benign, gray = uncertain)

Left: Beta-hemoglobin subunit (HBB protein). Variations in this protein can cause sickle cell anemia.

Right: Cystic fibrosis transmembrane conductance regulator protein (CFTR protein). Variations in this protein can lead to cystic fibrosis.

Moreover, Google DeepMind is also working with EMBL-EBI: through the Ensembl Variant Effect Predictor, researchers will be able to apply AlphaMissense's predictions more easily.

It is believed that in the near future, AlphaMissense will help solve core problems in genomics and the entire biological sciences.
