Can artificial intelligence spot dangerous bacteria before they make us sick?

Tech Science | 31 May 2026 | 12 min | Data Scientist and Bioinformatician Alfred Ferrer Florensa | Written by Morten Busch

An artificial intelligence (AI) tool can now scan an entire bacterial genome and ask a difficult question: do its combined protein patterns reveal the capacity to cause disease in humans? PathogenFinder2 performs particularly well on bacterial species that do not resemble anything in existing databases, making it a potential tool for early surveillance of infectious threats – including those with pandemic potential.

Most bacteria around us never make us sick. Some even help us. But hidden among the vast numbers in nature, animals and wastewater are strains that can harm humans – often only recognised once they reach the clinic. Infectious diseases remain a leading cause of death worldwide, and as climate change and globalisation expose us to new microbial environments, the need to identify threats earlier is growing.

Using PathogenFinder2, researchers from the Technical University of Denmark (DTU) and international collaborators are trying to answer the question: can an unknown bacterium be recognised as potentially dangerous before it has made anyone sick?

The tool determines whether a bacterium has pathogenic capacity – the genetic potential to harm humans – based on its genome.

“PathogenFinder 1 was developed in 2013, and since then both the quantity of bacterial genome data and the machine-learning methods have changed enormously. So the question became: can we do better now?” explains Alfred Ferrer Florensa, Data Scientist and Bioinformatician from the Technical University of Denmark, Kongens Lyngby.

Instead of looking for known warning signals or close relatives among established pathogens, PathogenFinder2 uses protein language models to translate each bacterial protein into a mathematical representation. This allows the model to compare functional patterns across the proteome – even when the underlying sequences differ. A key strength is that it works even on bacterial species that do not resemble anything known.

“Any model based only on bacterial genomes cannot predict whether a bacterium will succeed in infecting a specific person. That also depends on the host. So we changed the question: is this bacterium capable of being pathogenic to humans?” explains Alfred Ferrer Florensa. The reframing matters because surveillance cannot wait until every host factor is known.

“If you sample seawater, wastewater or a canal and find bacteria with pathogenic capacity, you may be able to act before they have infected anyone,” he says.

To test the new model, the researchers trained it on tens of thousands of bacterial genomes from global databases and applied it to real-world samples, including wastewater. “All the data we use comes from public repositories. If you only train on your own data, you risk biasing the model towards a specific region or dataset.”

To probe how well the model handles unfamiliar bacteria, the researchers tested it on novel species, recently added genomes and metagenome-assembled genomes from wastewater – situations where many bacteria are poorly known or completely unknown, and where the need to identify potential threats is greatest.

Why “dangerous bacteria” is harder to define than it sounds

Determining whether a bacterium is dangerous remains one of the most fundamental – and frustrating – challenges in microbiology. The question goes back to early attempts to define disease causation, from Koch’s postulates to modern frameworks that recognise disease as an interaction between microbe and host.

Traditionally, researchers have tried to identify disease-causing bacteria by asking whether they share genes or characteristics with known pathogens. But that approach quickly reaches its limits.

“The problem is that nature does not organise bacteria according to our categories. Two bacteria can be very closely related, and yet one can make us sick while the other cannot.”

In some cases, only a handful of mutations – or the acquisition of mobile genetic elements – can turn a harmless bacterium into one capable of causing disease. In addition, the concept of a “pathogen” has become more fluid: some bacteria are harmless in one context and harmful in another, depending on the host and environment.

“It is not a black-and-white world. Many bacteria only cause harm under certain conditions.”

Further, the quantity of bacterial data has exploded.

“When PathogenFinder 1 was developed, a little more than 1,000 genomes were available. For PathogenFinder 2, we could collect around 20,000 – and in a few years that number will be much higher.”

Thousands of new bacteria are being sequenced from environments such as soil, oceans and wastewater – often without a clear understanding of whether they pose a risk to humans.

“We are dealing with an enormous quantity of data, but we lack good ways to interpret it.”

Traditional methods also come with practical limitations. Animal experiments can be slow and difficult to translate to humans, and database-based approaches depend on the bacterium resembling something already recorded – precisely what fails when something truly new appears.

Disease potential is written across the whole genome

It is increasingly clear that a bacterium’s ability to cause disease does not depend on a single gene but on the way numerous proteins and functions combine across the genome.

“The paradox is that many of the genetic ingredients associated with disease are also found in harmless bacteria. It is not about individual genes but how they are combined. Pathogenic capacity is a pattern across the entire genome – a combination of functions that together allow the bacterium to survive, spread and cause damage.”

Pathogenic capacity is therefore not a single feature but a whole-genome phenotype – emerging from the combined behaviour of many genes. PathogenFinder2 is designed to capture this complexity by recognising patterns that recur across very different bacteria.

“We are trying to move away from the idea of single ‘dangerous genes’ and towards a more holistic understanding of what enables a bacterium to harm a human.”

Teaching AI to read proteins as patterns

To move beyond existing approaches, the researchers designed PathogenFinder2 to treat a bacterium not as a list of known risk factors but as a complex system encoded across its genome. The core idea is to let the model learn patterns directly from protein sequences – without relying on predefined databases or alignments.

“Instead of asking whether a bacterium has this or that known virulence gene, we let the model look at the whole proteome and learn what patterns are associated with pathogenic capacity.”

Many existing methods depend on matching sequences to known protein families. That works well when the right reference exists but breaks down when biology is unfamiliar. PathogenFinder2 avoids that step.

“We start with the bacterial genome and translate it into its full set of predicted proteins,” Alfred Ferrer Florensa explains.

Proteins are the working parts encoded by the genome, so they give the model a practical way to read what the bacterium may be able to do. Because whole genomes are too large to analyse directly, each protein is converted into a numerical representation the model can use to detect patterns across the genome.
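As a rough illustration of what converting a protein into numbers can look like, the Python sketch below turns each protein sequence into a fixed-length vector. It is only a stand-in: a protein language model learns far richer representations, whereas this toy version simply counts amino-acid frequencies. The function names and the two example sequences are invented for illustration and are not taken from PathogenFinder2.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_embedding(protein: str) -> list[float]:
    """Toy stand-in for a learned embedding: amino-acid frequencies (20 numbers)."""
    counts = Counter(protein.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[aa] / total for aa in AMINO_ACIDS]


def embed_proteome(proteins: list[str]) -> list[list[float]]:
    """One vector per protein, kept in genome order."""
    return [toy_embedding(p) for p in proteins]


# Made-up sequences, used only to show the shape of the result.
proteome = ["MKTAYIAKQR", "MSLLTEVETY"]
matrix = embed_proteome(proteome)
print(len(matrix), "proteins,", len(matrix[0]), "numbers each")
```

The point of the sketch is the shape of the data: a genome becomes a table with one row per protein, which a downstream model can then scan for patterns.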

How the model reads a bacterium

These protein representations are combined into structured input that reflects the entire genome. Unlike many previous methods, the model keeps track of the full set of proteins and their relationships, enabling it to detect patterns that span the whole genome.

“Pathogenic capacity is distributed across the genome. If you reduce it to a few features, you lose important information.”

PathogenFinder2 then processes this information in layers. One part scans for local patterns, such as groups of genes that often appear together, including operons and mobile genetic elements. Another assigns more weight to proteins that appear especially informative for the prediction.
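The two steps described here, a local scan over neighbouring proteins followed by a weighting of the most informative ones, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the published architecture: the neighbourhood average stands in for learned convolution filters, and the `query` vector is a hypothetical placeholder for parameters a real model would learn during training.

```python
import math


def sliding_window_mean(vectors, width=3):
    """Local step: blend each protein vector with its neighbours in genome order."""
    half = width // 2
    dim = len(vectors[0])
    blended = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - half): i + half + 1]
        blended.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return blended


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighting step: score each protein against `query`, then take a weighted sum."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


# Made-up two-dimensional protein vectors and a hypothetical query vector.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
query = [2.0, 0.0]
pooled, weights = attention_pool(sliding_window_mean(vectors), query)
print(weights)
```

The weights returned alongside the pooled vector are what makes this kind of design inspectable: each protein ends up with a number saying how much it counted.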

“PathogenFinder 2 not only makes a prediction but also tells you which proteins it focused on when making that prediction.”

Unlike many black-box models, PathogenFinder2 leaves a trace: a prediction and a ranked list of the proteins that contributed most to it. Those proteins can then be mapped to known databases to explore possible biological functions.
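A ranked list of this kind is simple to produce once each protein carries a weight. The sketch below assumes a list of protein identifiers and matching attention weights; both are made-up values, and `rank_proteins` is a hypothetical helper rather than part of the tool.

```python
def rank_proteins(protein_ids, attention_weights, top_k=3):
    """Return the top_k (protein, weight) pairs, highest weight first."""
    ranked = sorted(
        zip(protein_ids, attention_weights),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


protein_ids = ["prot_001", "prot_002", "prot_003", "prot_004"]  # hypothetical IDs
attention_weights = [0.05, 0.60, 0.10, 0.25]                    # made-up weights
top = rank_proteins(protein_ids, attention_weights, top_k=2)
print(top)
```

The identifiers at the top of such a list are the natural candidates to look up in functional databases.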

“For someone working with a new strain, this can be a way to explore which proteins might matter for pathogenicity.”

Another key feature is that the model is taxonomy-agnostic: it does not depend on matching a bacterium to a known relative.

“If you already know what you are dealing with, you often already have a clear answer. The challenge is when you do not.”

Training the model to recognise potential and not labels

To ensure that the model generalises to unseen bacteria, closely related genomes were separated between training and testing.

“We made sure that the test set contained species that were not present in the training set. That is why we can claim that PathogenFinder 2 can make predictions on novel species,” says Alfred Ferrer Florensa.
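Keeping whole species out of training is a grouping problem: the split is made over species, not over individual genomes. A minimal sketch, assuming each genome record carries a species label (the record layout, species names and `split_by_species` function are invented for illustration):

```python
import random


def split_by_species(genomes, test_fraction=0.25, seed=0):
    """Assign every species, with all its genomes, to either train or test."""
    species = sorted({g["species"] for g in genomes})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(species)
    n_test = max(1, round(len(species) * test_fraction))
    test_species = set(species[:n_test])
    train = [g for g in genomes if g["species"] not in test_species]
    test = [g for g in genomes if g["species"] in test_species]
    return train, test


# Placeholder records; real data would come from public genome repositories.
genomes = [
    {"id": "g1", "species": "species_A"},
    {"id": "g2", "species": "species_A"},
    {"id": "g3", "species": "species_B"},
    {"id": "g4", "species": "species_C"},
    {"id": "g5", "species": "species_C"},
    {"id": "g6", "species": "species_D"},
]
train, test = split_by_species(genomes)
print(len(train), "training genomes,", len(test), "test genomes")
```

A random split over genomes would let near-identical strains land on both sides and inflate the measured accuracy, which is why the grouping matters.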

Rather than defining strict “pathogens” and “non-pathogens”, the researchers labelled bacteria based on whether they had ever been observed to cause infection among humans – regardless of severity.

“When we calculate pathogenic capacity, we do not need to decide whether this is a rare infection. If it happened once, then there is potential.”

Defining the opposite category was more difficult.

“You cannot just say, ‘we have never seen this bacterium cause disease.’ So we used bacteria that cannot live in the human body or that are repeatedly in contact with humans without causing disease.”

The result is a model trained to recognise pathogenic capacity as a broad property – including bacteria that only cause disease under specific conditions. It also means that some uncertainty in the labels is unavoidable, because public databases do not always capture the full context of an infection. To reduce the impact of this, the system combines multiple neural networks into an ensemble, lowering dependence on any single model.
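The idea of an ensemble can be shown in a few lines. The article does not say how the networks' outputs are combined, so the plain average used here is an assumption, and the three "models" are fixed-output placeholders standing in for trained networks.

```python
from statistics import mean


def ensemble_probability(models, features):
    """Average the probabilities returned by several independently trained models."""
    return mean(model(features) for model in models)


# Placeholder models: each returns a fixed probability regardless of input.
models = [lambda x: 0.9, lambda x: 0.7, lambda x: 0.2]
probability = ensemble_probability(models, features=None)
print(round(probability, 2))
```

One model that learned a quirk of noisy labels is outvoted by the others, which is the practical benefit of combining them.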

The real test: bacteria the model had never seen

When the researchers tested PathogenFinder2 on completely new bacterial species – organisms absent from the training data – the model did what it had been built to do: it outperformed existing methods.

“The accuracy we show is based on species that had not been seen during training.”

This is a critical benchmark, because in real-world surveillance the most important cases are those where no close reference exists.

On these unseen species, the model was both more accurate overall and better at the difficult balance: catching bacteria with pathogenic capacity without raising too many false alarms. Some existing methods showed high sensitivity but flagged too many harmless bacteria as threats, whereas others were overly conservative.

“We see that some models are very good at not missing pathogens, but they classify almost everything as dangerous. That is not useful in practice – you need a balance.”

PathogenFinder2 stood out by maintaining both high sensitivity and specificity while keeping false positives relatively low. It also showed strong calibration, meaning that its probability scores more reliably reflected actual risk.
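For readers who want the terms pinned down, the sketch below computes sensitivity and specificity from a set of predictions, plus the Brier score as one common way to summarise calibration. The article does not state which calibration measure the study used, so the Brier score is an assumption, and the labels and probabilities are made up.

```python
def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Return (sensitivity, specificity) at a given probability threshold."""
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold
        if label == 1 and predicted_positive:
            tp += 1  # bacterium with pathogenic capacity, correctly flagged
        elif label == 1:
            fn += 1  # missed
        elif predicted_positive:
            fp += 1  # false alarm
        else:
            tn += 1  # harmless bacterium, correctly cleared
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


y_true = [1, 1, 0, 0]          # 1 = known to have caused human infection
y_prob = [0.9, 0.4, 0.2, 0.6]  # made-up model outputs
sens, spec = sensitivity_specificity(y_true, y_prob)
print(sens, spec, brier_score(y_true, y_prob))
```

A model that labels everything as dangerous scores a sensitivity of 1.0 and a specificity of 0.0 on this measure, which is the imbalance described in the quote above.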

“It is not just about making a prediction but about how much you can trust that prediction.”

The model performs particularly well when there is no close match in existing databases – precisely when traditional methods struggle, and where new threats are most likely to emerge. Species-based approaches, for example, showed a sharp drop in performance in these scenarios.

“If your method depends on finding something similar in a database, it will fail exactly when you need it most – when a new type of bacterium appears.”

What the model saw beneath the prediction

Beyond performance metrics, the model also provided biological insights. By analysing which proteins received the highest attention scores, the researchers found that PathogenFinder2 consistently highlighted proteins associated with known virulence mechanisms – such as toxins, adhesion factors, secretion systems and biofilm formation.

“We saw that the model focused on known factors, such as toxins and other proteins related to pathogenicity. But it also highlighted many proteins that were hypothetical or not characterised before.”

The model also pointed to less obvious features, including metabolic pathways, iron acquisition systems and proteins involved in genetic mobility – factors that may indirectly support infection.

“What is interesting is that it does not only highlight the classical virulence factors. It also picks up supporting systems that help bacteria to survive and adapt in the host.”

From laboratory benchmark to sewage samples

In some cases, the model highlighted proteins that are not yet well characterised.

“That is where it becomes really exciting. We can start to generate hypotheses about proteins that have not previously been linked to pathogenicity,” says Alfred Ferrer Florensa.

To test its practical utility, the researchers applied PathogenFinder2 to 2,739 bacterial genomes reconstructed from global sewage samples – many of which are poorly characterised or completely unknown.

Most were predicted to lack pathogenic capacity: 1,839 genomes. But 370 were flagged as potentially capable of infecting humans under certain conditions, and another 530 fell into an intermediate zone in which the model could not confidently assign a label. The flagged genomes formed clusters that could represent previously unrecognised risk groups.

“In environmental samples, you often have a mix of known and unknown bacteria, and you need a way to set priorities for what to look at.”

Not every bacterium fits yes or no

This “uncertain” category reflects a practical reality: not all predictions can – or should – be forced into a binary decision. Instead, the model highlights ambiguous cases that warrant closer investigation.
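A three-way outcome like this can be expressed as two cut-offs on the predicted probability. The thresholds below (0.4 and 0.6) are hypothetical, since the article does not report the values used; the counts are the sewage figures quoted in the article, and the code checks that they add up to the 2,739 genomes analysed.

```python
def triage(probability, low=0.4, high=0.6):
    """Map a predicted probability to one of three labels (cut-offs are assumed)."""
    if probability >= high:
        return "pathogenic capacity"
    if probability <= low:
        return "no pathogenic capacity"
    return "uncertain"


# Counts reported in the article for the sewage genomes.
sewage_counts = {
    "no pathogenic capacity": 1839,
    "pathogenic capacity": 370,
    "uncertain": 530,
}
total = sum(sewage_counts.values())
shares = {label: count / total for label, count in sewage_counts.items()}

print(total)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Widening the gap between the two cut-offs sends more genomes to the uncertain group, trading fewer confident calls for fewer wrong ones.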

Some of the bacteria predicted to have pathogenic capacity belonged to groups not previously recognised as human pathogens, suggesting that the model can reach beyond existing knowledge.

“We are not saying that these bacteria will necessarily cause disease. But we can flag them as worth investigating further.”

In addition, the researchers emphasise that the model predicts capacity, not actual disease outcomes.

“The outcome of an infection depends on both the bacterium and the host. We are only modelling one side of that interaction.”

More time to respond to microbial threats

Being able to assess completely unknown bacteria changes the timing of infectious disease surveillance. Instead of reacting to outbreaks after they occur, tools such as PathogenFinder2 could shift the focus toward earlier detection and prioritisation – before infections begin to spread.

“The aim is to give us more tools for epidemic events. I do not know if we can stop them, but we can be more equipped and more ready when they happen,” says Alfred Ferrer Florensa.

“If you can identify potentially dangerous bacteria before they cause infections, you gain a completely different kind of response time.”

The model is therefore less a diagnostic tool than an early warning system.

One immediate application is large-scale surveillance of environmental samples – such as wastewater, soil or animal reservoirs – in which thousands of bacterial genomes can be analysed in parallel. Here, the challenge is not detection but deciding which bacteria matter.

“We are moving into a situation in which sequencing is no longer the bottleneck. The bottleneck is interpretation – understanding which organisms are relevant for human health.”

Deciding which bacteria deserve attention

By ranking bacteria according to their predicted pathogenic capacity, the model can help to direct attention towards organisms that warrant closer investigation.

“It is about setting priorities. You cannot study everything in detail, so you need a way to decide where to focus your resources.”

This means that the tool is most relevant when researchers do not already know what they are dealing with.

“It is not a tool that will necessarily be used every day, because often you already know what you are dealing with.”

The model also changes how bacteria can be compared. By placing genomes in the same mathematical landscape, the researchers can begin to map bacteria not only by species but by how they interact with hosts.

In this landscape, distantly related bacteria can appear close together because they share similar strategies for surviving in the human body – revealing patterns that go beyond traditional taxonomy.
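One way to picture this landscape is as a set of points, one per genome, where closeness is measured by the angle between vectors. The sketch below uses cosine similarity, a standard choice for comparing embeddings; whether the researchers used this exact measure is not stated, and the genome names, genus labels and three-dimensional vectors are invented.

```python
import math


def cosine_similarity(a, b):
    """Similarity of direction between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbour(name, embeddings):
    """Return the other genome whose vector points in the most similar direction."""
    others = [n for n in embeddings if n != name]
    return max(others, key=lambda n: cosine_similarity(embeddings[name], embeddings[n]))


# Made-up genome vectors; genus labels are placeholders.
embeddings = {
    "genome_A": [0.9, 0.1, 0.2],
    "genome_B": [0.1, 0.9, 0.3],
    "genome_C": [0.8, 0.2, 0.1],
}
genus = {"genome_A": "genus_X", "genome_B": "genus_X", "genome_C": "genus_Y"}

closest = nearest_neighbour("genome_A", embeddings)
print(closest, genus[closest])
```

In this toy example the nearest neighbour of `genome_A` belongs to a different genus, mirroring the situation in which distantly related bacteria sit side by side because their proteomes suggest similar strategies.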

“We started looking at pathogens more globally and not just one species at a time.”

Mapping bacteria by what they do – and what they cannot tell us

In the longer term, this approach could feed into early stages of drug and vaccine development. If certain proteins or pathways are repeatedly highlighted across bacteria, they may represent conserved mechanisms that can be targeted more broadly.

“If we start to see recurring patterns across many pathogens, that could point to common vulnerabilities that we can exploit.”

Alfred Ferrer Florensa stresses, however, that the model is an exploratory tool and not a definitive answer. It can highlight proteins that are informative for prediction but does not prove that they directly cause disease.

“We have to be careful not to overinterpret the signals. The model tells us what is informative for prediction and not necessarily what is causally responsible.”

Further, the model only captures one side of the biology. Whether an infection occurs – and how severe it becomes – depends greatly on the host.

“A bacterium can have the capacity to cause disease, but whether it actually does depends on the host.”

For surveillance, however, this limitation also defines the practical question the model can answer.

“If you want to do surveillance, you cannot compare all the genomes from all the humans in a population.”

Learning to read bacteria before they show their hand

This distinction is especially important for opportunistic pathogens and bacteria that are part of the normal microbiome, in which the boundary between harmless and harmful depends on the context.

“The same bacterium can be completely harmless in one situation and problematic in another. That is why we use the term ‘pathogenic capacity’.”

Taken together, the work points to a broader shift in how biological data can be read. By applying protein language models to whole genomes, researchers are beginning to extract patterns that were previously difficult to access.

“I would say this is among the first works in which biological language models are used on whole bacterial genomes.”

The approach could extend beyond bacteria to other organisms and traits encoded across many genes.

“This opens the door to more complex models that can answer more questions than just whether a bacterium has pathogenic capacity.”

The next question: where does the risk matter?

The next question is not only whether a bacterium can cause disease but where – and under what conditions – that risk matters most.

“The site of infection could be a good next step, and maybe expanding to other species.”

Ultimately, the goal is not to replace experimental biology but to guide it – by narrowing the vast search space of possible targets.

“The goal is to generate better hypotheses. Instead of searching blindly, we can start from a more informed position.”

As genomic data continues to grow, the challenge is no longer to collect more information but to understand what it is already telling us.

“The data are already there. The question is whether we can learn to read them in time.”

© All rights reserved, Sciencenews 2020