Proteins are the body’s real-time messengers, revealing what is happening beneath the surface – often before symptoms appear. But decoding them has long been a bottleneck. Now, researchers have built a method driven by artificial intelligence (AI) that can read proteins more rapidly, more accurately and without needing a reference. This is a leap forward that could transform medicine, microbiology – and even the study of ancient life.
When we get sick, we usually focus on the symptoms. But beneath the surface, tiny proteins in our body already signal what is wrong – often before we feel it.
These proteins do nearly everything: fighting infections, healing wounds and keeping our systems running. Doctors and scientists rely on them to understand what is really happening, especially when DNA does not tell the full story.
But reading these proteins has long been a major challenge – especially for complex diseases such as infections or cancer. Researchers from the Technical University of Denmark and the AI company InstaDeep therefore set out to solve this problem. Using AI, they developed a new method that can read proteins with far greater precision – even when there is no DNA to guide the way.
“With our models, we found about 50% more of the important protein pieces in a sample compared with the previous best method,” says senior author Timothy Patrick Jenkins, Associate Professor, Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby.
“This is not just a technical upgrade – it means we can actually see what is happening in the body, even when the usual tools fail. We are not just matching to a database and hoping for the best. We are reading what is really there, even if no one has seen it before. That can be the difference between guessing and knowing – especially in diagnosing a disease, tracking an infection or finding a new treatment target,” Timothy Patrick Jenkins adds.
Why reading proteins matters
We often hear about DNA as the code of life – but DNA is just a starting point. It is like a recipe, not the final dish. To understand what is really happening inside a body, scientists need to examine the proteins that do most of the actual work in our cells. DNA and RNA can be sequenced quickly and accurately, but proteins are much harder to read.
“Researchers have solved DNA sequencing and RNA sequencing, but we are still pretty terrible at protein sequencing,” says Timothy Patrick Jenkins. “So technically speaking, this was a really interesting challenge.”
But this is more than just a technical puzzle. Proteins show what is going on in real time. DNA tells you what could happen, but proteins show what is happening – including the effects of disease, infection or treatment. And they can act in surprising ways.
“Even if you have DNA, it is just a blueprint. It is not the final product. And a lot can happen to the final product – the protein – along the way.”
The process gets even slower
Sometimes, the DNA might look completely normal, but the protein it produces has been altered in subtle but important ways. These changes can be especially critical in cancer or viral infections, in which the protein acts differently from what the DNA suggests.
“You are just measuring a proxy – like an indirect clue – instead of measuring what is really in the system.”
That is why scientists are so interested in proteomics – the large-scale study of proteins. But until now, the main way to read proteins was to compare them with huge digital databases. These methods work well when you already know what is in the sample – but not when the contents are unknown, as in a complex infection or a mixed sample from the gut or a wound.
“It works when you know what you are expecting, but it is awful when you do not. With microbiomes, for example – bacteria, viruses, everything – it takes weeks.”
And as more entries are added to the database to try to catch unknowns, the search becomes even slower and harder to use, and its accuracy drops dramatically. Faced with all these limitations, Timothy Patrick Jenkins and his colleague Konstantinos Kalogeropoulos saw an opportunity. What if you could skip the database entirely and simply let a machine-learning model read the proteins directly, from scratch?
“There is a big space for de novo sequencing,” Timothy Patrick Jenkins explains. “That means figuring out the protein just by looking at the raw data – without needing a reference. Nobody had really built a great model for that, and many data are available. It was kind of ripe for the picking.”
How the AI model works
To tackle the challenge, they teamed up with the British AI company InstaDeep – bringing in the computing power and machine-learning expertise needed to finally make protein sequencing faster, smarter and more accessible.
“To read proteins, we typically use a method called database searching. It is similar to Googling a sentence and hoping to find an exact match online. It works if you already know what you are looking for.”
But for unknowns, such as a mysterious infection or an ancient sample, the method falls apart.
So instead of searching for matches in a database, the researchers built something completely new – AI that can read the protein fragments directly from the raw data without needing prior knowledge. Until now, this approach, known as de novo sequencing, has often been too inaccurate and too slow for real-world use.
To solve this, Timothy Patrick Jenkins and his team turned to the same kind of AI that powers ChatGPT – a deep learning model called a transformer.
“We trained it on the largest protein data set ever assembled at the time.”
In this way, it could learn how to recognise patterns in the spectral data from mass spectrometry and instantly translate them into protein sequences.
“We transformed the database search – which is like Google search – into a ChatGPT solution,” says Timothy Patrick Jenkins. “We have trained a big model to just look at a spectrum – it does not have to have seen it ever before, and then it just knows, based on the peaks, what the sequences were, with very high accuracy.”
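The contrast between database searching and de novo reading can be illustrated with a toy sketch. This is not InstaNovo, which uses a trained transformer on real spectra that are noisy and incomplete. The example below assumes an idealised, noise-free ladder of fragment masses, and the peptide names, the function names and the tolerance are all hypothetical. It only shows why a lookup fails for a peptide that is not in the database, while reading the gaps between peaks still recovers the sequence.

```python
# Toy illustration only: idealised, noise-free fragment ladders.
# Real spectra have missing peaks and noise, which is why a learned model is needed.

# Monoisotopic residue masses (Da) for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259,
}


def prefix_masses(peptide):
    """Idealised fragment ladder: cumulative residue masses, starting at 0."""
    ladder = [0.0]
    for residue in peptide:
        ladder.append(ladder[-1] + RESIDUE_MASS[residue])
    return ladder


def database_search(ladder, database, tol=0.01):
    """Return the first known peptide whose ladder matches, else None."""
    for candidate in database:
        theoretical = prefix_masses(candidate)
        if len(theoretical) == len(ladder) and all(
            abs(a - b) <= tol for a, b in zip(theoretical, ladder)
        ):
            return candidate
    return None


def de_novo_read(ladder, tol=0.01):
    """Read one residue from each gap between consecutive ladder peaks."""
    sequence = []
    for low, high in zip(ladder, ladder[1:]):
        gap = high - low
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol]
        # Mark the position as unknown unless exactly one residue fits the gap.
        sequence.append(matches[0] if len(matches) == 1 else "?")
    return "".join(sequence)


# Hypothetical database and a hypothetical peptide that is absent from it.
known_peptides = ["GASP", "VALET"]
observed = prefix_masses("PLEASE")

db_hit = database_search(observed, known_peptides)
de_novo_hit = de_novo_read(observed)
print(db_hit, de_novo_hit)
```

In this sketch the database lookup returns nothing for the unseen peptide, whereas the gap-by-gap reading still spells it out. In practice the peaks are far less tidy, which is the part the trained model is meant to handle.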
A powerful one-two punch
But they did not stop there. Around the same time, a new kind of AI model – known as a diffusion model – was making waves in image generation and protein design. Timothy Patrick Jenkins and his collaborators saw the potential to apply this to protein sequencing too.
“We thought we could use diffusion models together with our ChatGPT transformer, so we can translate the spectrum into a sequence, but also refine it – like a researcher would do – step by step,” he explains.
Think of writing a rough draft and then going back with a red pen to edit and improve the text. That is what the diffusion model does – it takes an initial guess and improves it bit by bit.
“With the diffusion model, we can go back and say, did we actually get everything as good as we can? Or are we a bit uncertain about a certain amino acid in the sequence? If so, we can see whether we can optimise,” Timothy Patrick Jenkins notes. “It refines and improves the predicted sequence in a way that mimics how a human would work through the data, just infinitely faster.”
In the end, the team combined the two models – calling them InstaNovo and InstaNovo+ – to deliver a powerful one-two punch. The transformer makes the initial highly accurate prediction, and the diffusion model goes over it to fine-tune the results.
Finding what others missed
Together, these tools are not just faster – they are more accurate, more flexible and far better at detecting unknown or rare proteins. And they are already being applied in fields from cancer research to ancient biology.
“Once the models were built, the big question was: would they work in the real world?”
To find out, the team put InstaNovo and InstaNovo+ to the test. They compared their performance against the best existing methods using benchmark data sets – including some of the most commonly used samples in the proteomics community. The difference was clear.
“We detected 60,000 protein fragments in one of the test samples – the same data set in which the previous state-of-the-art model, Casanovo, found 40,000,” says Timothy Patrick Jenkins. “That is a 50% increase. And this was not just luck. It was because InstaNovo+ could go back and refine the prediction – like creating a second draft of the answer.”
In numerical terms, one of the models uncovered 3,495 new protein pieces (peptides) that existing tools had missed. The more advanced InstaNovo+ model found more than 10,000 additional matches – and predicted nearly 13,000 new ones. For researchers hunting for elusive disease markers, this opens a whole new playing field.
“In combination, the tests show that these models can really help to understand what is happening – even in samples in which we have no idea what is in there,” adds Timothy Patrick Jenkins. “That is a big deal for biomedicine.”
Fighting very tricky wounds
The use of the new tool is not limited to cancer or wounds. The team also demonstrated success in microbiomes, ancient proteins, antibody discovery and even snake venom – showing just how wide the potential applications could be. Although the numbers prove that the models work, the real-world applications show just how powerful this breakthrough can be.
“We did not just want to beat the previous best method – we wanted to make something useful across fields, from hospitals to museums. We are not just predicting – we are identifying what is really going on,” says Timothy Patrick Jenkins.
One powerful example came from the clinic: a group of people with chronic venous ulcers – wounds that do not heal properly and often resist treatment. Infections in such wounds are notoriously hard to diagnose because the samples contain little DNA and a mix of bacteria. Using their new model, the researchers could read the proteins directly and identify the culprits.
“These wounds are very tricky – very little DNA, lots of unknowns,” Timothy Patrick Jenkins explains. “With our method, we could directly read the proteins and identify two specific bacteria, including a multidrug-resistant one. And we confirmed this in the laboratory. This kind of information could completely change how these patients are treated.”
Building AlphaFold 2.0
Another striking application is in cancer treatment. Modern immunotherapy depends on identifying specific protein fragments on the surface of cancer cells – fragments that tell the immune system to attack here. But many of these targets are invisible to older tools. Using InstaNovo+, the team discovered thousands of potential new attack points.
“In cancer, the key is finding the right target,” Timothy Patrick Jenkins says. “With our models, we found more than 50,000 peptides that standard methods missed – that is seven times more potential targets. This means that we can now help to design vaccines or therapies that are tailored to that patient’s cancer and not just a generic version.”
According to the researchers, this is just the beginning. The team is already working on the next generation of the model – aiming to raise accuracy from 70% to as high as 95%.
“AlphaFold made headlines for figuring out how proteins fold into 3D shapes – something researchers thought might take another 50 years to crack. Now we want to bring that same kind of breakthrough to the way we read and identify proteins.”
The new tool will help scientists not just guess what proteins might be there but know for sure.
“This is kind of the AlphaFold 1 of protein sequencing,” says Timothy Patrick Jenkins. “We pushed the accuracy from 40% to 70%. Now we’re trying to build AlphaFold 2 – and we have already gotten European Union funding to do this.”
Insights into inaccessible landscapes
Crucially, the researchers want to make the tool widely accessible – not just to AI experts or elite laboratories. They have built a soon-to-be-released user-friendly interface so that scientists can simply upload their files and get results, with no coding required.
“We have built a graphical user interface. You just drop your mass spectrometry file, and it spits out the results. You do not have to do anything else,” Timothy Patrick Jenkins says.
And the scope is not limited to medicine or history. From environmental monitoring to industrial biotechnology, the ability to identify unknown proteins – rapidly and accurately – could transform how we understand and manage biological systems.
“With these tools, we can improve our understanding of the biological world as a whole – not only in terms of healthcare but also in industry and academia,” says Timothy Patrick Jenkins. “Within every field using proteomics – be it plant science, veterinary science, industrial biotechnology, environmental monitoring or archaeology – we can obtain insight into protein landscapes that have been inaccessible until now,” he concludes.