Artificial intelligence (AI) can help blind and visually impaired people make sense of the world around them – while also exposing where it falls short. AI systems solve many tasks correctly, but they lack a built-in stop button for when the evidence is too thin. In other words, they cannot reliably distinguish between knowledge and educated guesswork – and often respond without signalling uncertainty, new research shows.
A person with a visual impairment stands holding two almost identical medicine boxes and reaches for their phone. AI has become a regular aid for reading labels. The answer comes quickly and sounds certain, because the model is trained to produce the most likely response based on patterns in data – not to judge whether it actually has enough information. It therefore appears as a reliable guide – and in practice acts as a decision-maker.
The person follows the answer and picks the box the AI points to. Later, she double-checks – and realises the instruction was wrong.
Researchers examined the risk of AI misleading people with visual impairment in a new study. They presented their findings at the Annual Meeting of the Association for Computational Linguistics in Austria in 2025. The work is part of research into multimodal language models – systems that combine image and text understanding.
The study was carried out by a team from the University of Copenhagen and Heriot-Watt University in Edinburgh, with Anders Søgaard among the lead authors. He is a Professor of Natural Language Processing at the University of Copenhagen and notes that the results confirm a suspicion among researchers: the models perform best on simple tasks and struggle when precision and context are required.
“The paradox is that the risk of error is greatest in complex situations – when you need help the most and when it can also be most difficult to sense that you are moving onto uncertain ground,” says Anders Søgaard, adding:
“From a broader perspective, it is about what happens when we begin to hand over everyday functions and assistive tools to AI. If systems sound certain while they are in fact guessing, they blur the line between knowledge and guesswork – and risk doing users a disservice by misleading rather than guiding them.”
Users set the course
The seed for the study was sown when Anders Søgaard and colleagues set out to examine whether the tests they and others typically rely on actually say anything about AI as an assistive tool in practice. They went through the tests one by one and kept encountering the same pattern. The images were usually taken by sighted people for entirely ordinary purposes, shared online and later prepared for research by others. The subject was clear, the questions were short and often in English and the answers were obvious.
The tests could therefore readily measure certain things – typically whether a model could recognise objects in well-composed images.
But they rarely measured what an assistive tool requires in practice: the ability to act under uncertainty, such as interpreting blurred photos, partial text and incomplete visual cues from a mobile phone camera in a real-life situation.
“We could not just take a standard test off the shelf and expect it to capture what a visual aid actually needs to be able to do. We had to tailor the evaluation so that it reflected the situations in which people use the technology,” notes Anders Søgaard.
That is why the researchers began with a questionnaire, he explains. They first developed it in collaboration with blind and visually impaired people with varying degrees of vision loss, refining the questions over two rounds to ensure that they worked in practice and were sufficiently precise. They then invited a larger group to respond, with a total of 106 participants.
The researchers subsequently filtered out responses from people without visual impairment and systematically reviewed the free-text answers so they could be grouped into themes rather than individual accounts.
From experience to testing
The responses formed the basis for realistic test scenarios – and were thus directly linked to situations in which errors have practical consequences.
The researchers used them to identify the tasks on which the models should be assessed, namely those in which errors have the greatest impact – for example, tasks involving packaging, images in messages, Braille, assistive devices and short video clips.
“We spent more time on the questionnaire than you might think, because the format is everything. If the questions do not make sense to participants, you will not draw out the experiences on which the test needs to be built,” explains Anders Søgaard.
The responses indicated the challenges that shape everyday life, in which information is often presented visually and is therefore not immediately accessible.
The next step was to build the evaluation package. The researchers developed five subtests, each covering an area described by users, and had 13 free and publicly available AI models attempt the same types of questions a user might ask.
Søgaard notes that they tested the models in languages other than English to determine whether the assistance still holds when users are not asking questions in the models’ native language. They also deliberately included tasks in which the image or video did not contain enough information for a reliable answer. This enabled something crucial to be measured: whether the model can refrain from answering when the evidence is insufficient – an ability researchers refer to as abstention.
“We designed the method to test both whether the models can answer and whether they can refrain from answering when the material is insufficient. In practice, this is at least as important for an assistive tool,” says Anders Søgaard.
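The two things the method measures can be illustrated with a minimal scoring sketch. It is not the researchers' evaluation code: the `Item` record, the `ABSTAIN` marker and the two metric names are hypothetical, chosen only to show that accuracy on answerable questions and abstention on unanswerable ones are counted separately.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker that a model is assumed to return when it declines to answer.
ABSTAIN = "ABSTAIN"


@dataclass
class Item:
    answerable: bool       # does the image or video contain enough evidence?
    gold: Optional[str]    # reference answer; None when the item is unanswerable
    prediction: str        # model output, or ABSTAIN


def score(items):
    """Score answering and abstaining separately."""
    answerable = [i for i in items if i.answerable]
    unanswerable = [i for i in items if not i.answerable]
    correct = sum(i.prediction == i.gold for i in answerable)
    abstained = sum(i.prediction == ABSTAIN for i in unanswerable)
    return {
        "accuracy_on_answerable": correct / len(answerable) if answerable else 0.0,
        "abstention_on_unanswerable": abstained / len(unanswerable) if unanswerable else 0.0,
    }


# Invented toy data: two answerable and two unanswerable questions.
items = [
    Item(True, "ibuprofen", "ibuprofen"),
    Item(True, "paracetamol", "ibuprofen"),
    Item(False, None, ABSTAIN),
    Item(False, None, "paracetamol"),
]
result = score(items)
```

Keeping the two numbers apart matters: a model that always answers can score well on the first metric while scoring zero on the second.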
Errors lie in the difficult tasks
The results paint a clear picture, says Anders Søgaard: the models perform best when the question is simple and the image offers clear cues.
The answer becomes far less reliable when the task demands precision, context and linguistic understanding: when the model has to combine several types of information rather than simply recognise an object. This increases the risk of hallucinations – cases in which the model does not just make a mistake but fills gaps in its knowledge with a plausible yet incorrect answer without signalling it as such.
“We see again and again that the most difficult questions receive the most confident answers, even when the material is insufficient,” says Anders Søgaard.
Descriptions of images with cultural content illustrate why things can go wrong. Several models can provide a general description but overlook names, symbols and text, which often carry the main point. Multilingual questions reveal the same pattern.
Models that perform reliably in English tend to be less consistent in other languages, partly because they are typically trained on far larger volumes of English-language data. This means that their experience base is narrower when users ask questions in other languages, and some models revert to English in their response, even when the question was asked in a different language.
Where things begin to slip
Braille exposes another weak point. Most models struggle to read the dots directly from a photo, because Braille requires fine-grained spatial resolution and precise interpretation of light and shadow – for which standard models are not optimised.
Assistive devices reveal a related limitation: models are more likely to recognise everyday objects than the equipment many blind people rely on in daily life. And in video, the uncertainty becomes even harder to detect, since a clip can easily be missing a crucial detail.
In practice, this creates an imbalance in the assistance. A model may be useful in many minor situations and still fall short at the moments when the answer needs to stand on its own. That is why the ability to express doubt becomes central – not as a weakness but as a condition for trusting the answer.
Without it, the user receives the same confident tone whether the model is drawing on clear information or making an educated guess – and the difference disappears precisely when it matters most.
The next step requires doubt
The way forward lies not only in increasing the share of correct answers but also in rethinking what we consider a good answer – and in developing systems that can assess their own certainty and respond accordingly.
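One simple way to picture a system that "responds accordingly" is a confidence threshold. The sketch below is an illustration only, not a method described in the study: the `respond` function, the threshold of 0.8 and the assumption that the model exposes a trustworthy confidence score per candidate answer are all hypothetical.

```python
def respond(candidates, threshold=0.8):
    """Return an answer only when confidence is high enough.

    candidates: dict mapping each candidate answer to a confidence in [0, 1].
    The confidences are assumed to be well calibrated, which real models
    often are not.
    """
    if not candidates:
        return "I cannot tell from this image. Please take a new photo."
    answer, confidence = max(candidates.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        # Signal doubt and suggest a next step instead of sounding certain.
        return f"I am not sure. It might be {answer}, but please take a clearer photo."
    return answer


clear_case = respond({"ibuprofen": 0.95, "paracetamol": 0.05})
unclear_case = respond({"ibuprofen": 0.55, "paracetamol": 0.45})
no_evidence = respond({})
```

In this sketch the user hears a plain answer only in the clear case; in the other two the reply itself carries the doubt and points to a next step.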
The study also suggests that some capabilities can be improved through targeted training, including in Braille. But the most important improvement concerns how the system acts when it does not know enough.
“The key next step is for systems to become better at signalling uncertainty in time, so that users can act on this rather than receiving an answer that sounds more certain than it is,” says Anders Søgaard.
The next wave of studies therefore needs to move closer to real-world use. Real-time navigation and longer processes under time pressure are not included here – and these situations can place the technology under greater strain than a controlled test. It is also here that we can truly determine whether the system helps the user to take a new photo, ask a better question or choose a different solution before an error turns into a decision.
Ultimately, the story comes down to trust. Blind and visually impaired people already use AI because the technology can provide rapid access to information. Nevertheless, the study shows how easily that sense of security can become misleading if the system does not clearly distinguish between answers based on sufficient evidence and those based on filling in the gaps.
“If we begin to use AI as an everyday aid, it must also act like one that speaks up when it lacks a basis. Otherwise, we risk making people less independent by giving them something they cannot trust,” concludes Anders Søgaard.
The problem is not just that AI can make mistakes – but that it sounds as though it does not.
