Artificial intelligence (AI) tools in diabetes care risk leaving some people behind if they are trained on unrepresentative data. A new study shows that when diabetes AI tools are trained mostly on one racial group, they can perform worse – and therefore be dangerous – for others. The study also suggests a simple fix that could help such tools work fairly for everyone, even when few data points are available to train them.
Continuous glucose monitors are wearable devices that help people with diabetes track their blood glucose levels. When paired with insulin pumps, they help decide how much insulin to administer to prevent dangerous spikes or dips in blood glucose. Some of these are closed-loop systems that adjust insulin doses without needing constant input.
More and more devices use AI to predict where a person’s blood glucose is headed – about to rise or fall – based on recent trends and patterns from other people. But the machine learning models behind continuous glucose monitors, along with their training data, are proprietary: companies neither explain how the models work nor share the data with the public, according to Adam Hulman, associate professor and diabetes researcher at Aarhus University and the Steno Diabetes Center Aarhus.
White people dominate diabetes research
Most diabetes research has focused on white people, which means that many AI models are trained on data that does not reflect everyone with diabetes.
“Most research is done on white, Western populations,” Hulman says. And since previous studies have found racial differences in how people process and store glucose, models trained on data from white people may not be as accurate for people of other races.
“Prediction models do not automatically work the same for everyone,” Hulman says. This is why we need to test how well they work across groups.
Hulman and Helene Bei Thomsen, a PhD student at Aarhus University who researches continuous glucose monitor data, set out to determine how the racial composition of training data influences the performance of a machine learning model. Their work is part of the growing conversation on algorithmic fairness – the idea that AI should work equally well for everyone, regardless of background.
The researchers asked whether a model trained exclusively on data from white people would do a worse job of predicting the blood glucose of Black people – and whether such an imbalance could be corrected even when data are limited.
“Many people say that technology and AI can contribute to reducing health inequalities,” Hulman says. “But if the models are not developed with care, the algorithms can actually contribute to increasing inequalities.”
Old language models used in new ways
Since researchers do not have access to the data used to train the AI in commercially available continuous glucose monitors, Thomsen turned to a bank of data that had been gathered to study racial differences in diabetes by the T1D Exchange, a non-profit organisation in the United States.
For the T1D Exchange study, researchers at diabetes centres across the United States had collected continuous glucose monitor readings from just over 200 people with type 1 diabetes, about half of whom were white and half of whom were Black. The devices logged blood glucose levels every 15 minutes for 14 weeks. Although such a sample size is fairly small for epidemiological studies, it is relatively large within research on continuous glucose monitors. “Getting access to large, open-access datasets of continuous glucose monitor data is challenging,” especially those that include information on race and ethnicity, Hulman says.
Next, Thomsen and her team built a suite of machine learning models, called long short-term memory models, to predict blood glucose levels 60 minutes in advance.
Originally developed to process language, long short-term memory models have been outpaced in that arena by generative AI such as ChatGPT. (The “long short-term memory” in the name distinguishes them from even earlier models whose short-term memory was so limited that they “forgot” the beginning of a sentence by the time they reached the end.) Nevertheless, long short-term memory models are “also really good with time-series data,” such as a sequence of blood glucose readings, Thomsen says.
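With one reading every 15 minutes, a 60-minute horizon means predicting four steps ahead. A minimal sketch of how such a series might be turned into training pairs for a sequence model – the window length of eight readings and the sample values are illustrative choices, not taken from the study:

```python
# Sketch: turn a glucose series (one reading every 15 minutes) into
# (input window, 60-minute-ahead target) pairs for a sequence model.
# The 8-reading window (2 hours of history) is a hypothetical choice.

def make_training_pairs(readings, window=8, horizon_steps=4):
    """horizon_steps=4 because 4 x 15 min = 60 minutes ahead."""
    pairs = []
    for i in range(len(readings) - window - horizon_steps + 1):
        x = readings[i : i + window]                   # recent trend
        y = readings[i + window + horizon_steps - 1]   # value 60 min later
        pairs.append((x, y))
    return pairs

# Synthetic glucose readings in mmol/L, for illustration only
glucose = [5.1, 5.3, 5.6, 6.0, 6.4, 6.1, 5.8, 5.5, 5.2, 5.0, 4.9, 4.8]
pairs = make_training_pairs(glucose)
```

Each pair feeds the model two hours of recent readings and asks it to predict the value an hour beyond the window.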
To determine how the racial composition of the training data affected the accuracy of the models’ predictions, Thomsen and her team trained models on 11 different mixes, ranging from 0% Black and 100% white to 100% Black and 0% white. They then assessed each model’s accuracy for white and for Black people with diabetes.
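That design can be sketched as a simple sweep over training mixes; `train_model`, `evaluate` and the data samples in the commented loop are hypothetical stand-ins for the study’s actual LSTM training and error measurement:

```python
# Sketch of the experimental design: one model per racial mix of the
# training data, each evaluated on held-out white and Black test groups.

def training_mixes(steps=11):
    """Fractions of Black participants in the training data: 0%, 10%, ..., 100%."""
    return [i / (steps - 1) for i in range(steps)]

mixes = training_mixes()

# Hypothetical outline of the sweep (train_model/evaluate are stand-ins):
# for frac_black in mixes:
#     model = train_model(sample(black_train, frac_black),
#                         sample(white_train, 1 - frac_black))
#     errors[frac_black] = {"black": evaluate(model, black_test),
#                           "white": evaluate(model, white_test)}
```

Keeping the test groups fixed while only the training mix varies is what lets any performance gap be attributed to the composition of the training data.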
More white data made the model less fair
For real people, a poor prediction could mean taking too much – or too little – insulin. Thomsen and Hulman quickly realised that their model was not ready to be used in care: trained on very limited data, it had an average error of about 2 mmol/L. For comparison, normal blood glucose after fasting is around 5 mmol/L, so an error of 2 is quite a lot. “I do not think that would be safe,” Thomsen says.
Thankfully, the overall error rate did not matter for their purposes – the researchers were interested in whether the model performed worse when the training data had a racial skew.
“We thought that as we increased the proportion of Black people in the training data, the model’s blood glucose predictions for Black people would improve,” Hulman says.
To their surprise, the authors say, they did not find evidence of a statistically significant difference in performance, though they emphasise that larger datasets might tell a different story.
“The overall difference between models trained only on white data versus only on Black data was small,” Hulman says, “but we cannot assume that this holds in larger or more complex datasets.”
However, as they compared the error rates of each model, the researchers found a small but reliable difference in performance depending on whom the model was trained on. As the proportion of white people in the training data increased, the “performance differences favouring white individuals became more pronounced,” the authors write.
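One way to make such a fairness check concrete is to compute a per-group error and the gap between groups. The numbers below are invented for illustration and are not from the study:

```python
# Sketch: a simple group-fairness check - mean absolute prediction
# error per test group, and the gap between groups. All values are
# made-up mmol/L readings, for illustration only.

def mean_abs_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

predictions = [5.0, 6.0, 7.0]
white_actual = [5.5, 6.5, 7.5]   # hypothetical white test group
black_actual = [6.0, 7.0, 8.0]   # hypothetical Black test group

white_err = mean_abs_error(predictions, white_actual)
black_err = mean_abs_error(predictions, black_actual)
gap = black_err - white_err      # > 0 means worse for the Black test group
```

Tracking how this gap changes as the training mix shifts is what revealed the pattern favouring the majority group.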
Teaching AI with borrowed knowledge
To determine whether this divergence could be corrected, the researchers added an extra step to the models’ training – transfer learning.
Thomsen explains transfer learning through her hobbies. “I do a lot of knitting, so I have a lot of knowledge about yarn, about gauge and how to read knitting patterns,” Thomsen says. “If I want to learn how to crochet, I do not have to start from scratch, because knitting and crochet share a lot of the same logic.”
This is like giving the model a head start. If it has already seen a wide mix of blood glucose data, it can learn faster when focusing on one specific group.
“It already knows the basics, so I do not need as much data,” she says.
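The same idea can be sketched on a toy model: pretrain a single-weight predictor on a large mixed dataset, then fine-tune it with only a couple of examples from a specific group. All data and numbers here are synthetic, not from the study:

```python
# Sketch of transfer learning on a toy one-weight model: pretrain on a
# large "mixed" dataset, then fine-tune on a few group-specific examples.

def fit(w, data, lr=0.01, epochs=100):
    """Least-squares fit of y ~ w * x by per-example gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

mixed = [(x, 1.0 * x) for x in [1, 2, 3, 4, 5]]  # large pretraining set (y = 1.0x)
group = [(2, 2.4), (3, 3.6)]                     # few group examples (y = 1.2x)

w_pretrained = fit(0.0, mixed)                   # learns the general pattern
w_finetuned = fit(w_pretrained, group, epochs=20)  # small adjustment, little data
```

Starting the fine-tuning from the pretrained weight means only a small correction has to be learned from the scarce group data, rather than the whole relationship from scratch.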
The researchers were pleased to find that, after transfer learning, the divergence in performance disappeared, Hulman says.
A new way to test for fairness
According to Hulman, the main takeaway from this study is not necessarily its findings about continuous glucose monitor predictions – it is the process itself.
The real breakthrough is not just in diabetes but in how to determine whether health AI works equally well for everyone, even when limited data are available, Hulman says.
To that end, they have published the code behind their algorithms for other researchers to access. “It can be reused, not just for continuous glucose monitoring, but you can imagine using it in many other fields, from grading eye scans to analysing medical images or monitoring heart rhythms,” he explains.
“We want to build bridges between data science and clinical research,” Hulman says. “This kind of work gives clinical teams tools to test fairness – and gives data scientists a real medical problem to work on,” he concludes.
