Lab scientists are studying noise in data to know when they can and cannot trust AI to answer complex science questions.
March 31, 2025
“The AI community tends to say that noisy data is fine because the model is smarter than the noise,” says Diane Oyen, an AI expert at Los Alamos. “But a lot of scientists are skeptical; they tend to adhere to the philosophy that if you put garbage in, you’ll get garbage out. I think they’re both right.”
Large language models (LLMs), the machine-learning algorithms that comprehend and generate text and images for AI platforms, have shown themselves incredibly adept at handling tons of noisy data. OpenAI’s ChatGPT was trained on the internet—the whole thing. The ability to find patterns in enormous datasets makes LLMs appealing to scientists, but what worries many is that these models can be imprecise. Although loose margins of error are tolerable in consumer-facing AI applications like recipes and advertising strategies, in mission-critical Lab applications like weapons performance or energy forecasts, errors are unacceptable. This high bar means some scientists avoid AI entirely and could be missing out on the advantages it offers.
Los Alamos is keen to deploy AI to solve science and security problems that inform high-stakes decisions and has made it a signature institutional commitment, setting itself the task of accelerating the development of AI-enabled capabilities. A key part of that is ensuring that an AI model is constructed, trained, and applied in a way that makes it reliable for its intended use.
Oyen and a handful of colleagues are working to understand when and why an AI model might produce inaccurate results. Specifically, they are asking: Which noise in the data can safely be ignored, and which noise is actually misleading? The team has written a suite of computer codes: statistical tools that help validate pre-trained AI models (publicly available, off-the-shelf LLMs) and assure their performance for high-stakes scientific and national security applications. The codes provide a kind of statistical security camera: They don’t stop the problem, if there is one, but they make it observable, so that it can be tended to.
The project focuses on uncertainty and uncertainty quantification. Uncertainty is the mathematical wiggle room in the estimates and averages that go into modeling any complex system. Uncertainty quantification is the process of characterizing and measuring the likelihoods of certain outcomes given that some inputs are inexact. Major companies like OpenAI and Google have developed the most successful LLMs, but so far they have focused their uncertainty quantification mainly on model output. Input data matters too, and one thing that Oyen’s team is doing is uncertainty quantification for input data.
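To make the definition concrete, here is a minimal Monte Carlo sketch of uncertainty quantification, assuming a made-up toy model (the quadratic below stands in for a complex simulation and is not any actual Lab code): an inexact input is sampled many times, and the spread of outcomes is summarized.

```python
import random
import statistics

random.seed(0)

# Hypothetical stand-in for a complex simulation (for illustration only).
def model(x):
    return x ** 2 + 0.5 * x

# Suppose the input is inexact: measured as 3.0 with standard deviation 0.2.
# Propagate that input uncertainty through the model by Monte Carlo sampling.
samples = sorted(model(random.gauss(3.0, 0.2)) for _ in range(100_000))

mean = statistics.fmean(samples)
sd = statistics.stdev(samples)
lo, hi = samples[2_500], samples[97_500]  # central 95% of outcomes

print(f"output ~ {mean:.2f} +/- {sd:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

The interval is the kind of likelihood statement the definition describes: given the inexact input, roughly 95 percent of plausible outcomes fall inside it.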
Mislabeling is an example of noise in the input data that can derail a model’s performance. The data that a model is trained on must be labeled, typically by the human who is building the model. Labeling—this is a photo of a dog, this is not a photo of a dog, etc.—teaches the model what its job is, so that it can then do the same job on a novel dataset. But mislabeling can happen—maybe the image is blurry, or maybe there are multiple dogs pictured—and that can skew the model’s training and subsequent interpretation.
“Most people oversimplify it and treat all noise as random because it’s hard to figure out how noise gets into data, and random noise is easy to ignore,” Oyen says. “But a lot of noise isn’t actually random, it’s correlated with some feature of the data, so you shouldn’t ignore it.” For example, maybe photos of huskies are slightly more likely to be mislabeled as drawings because huskies are fuzzy and often colored in greyscale. The label is wrong, but the wrongness isn’t random.
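A toy simulation, assuming a made-up “greyness” feature, illustrates the distinction: random label noise is flat across the data, while correlated noise concentrates in one region of it, which is why it cannot be safely ignored.

```python
import random

random.seed(0)

n = 10_000
# Hypothetical "greyness" feature for each image, in [0, 1].
greyness = [random.random() for _ in range(n)]
true_label = [random.randint(0, 1) for _ in range(n)]  # 0 = photo, 1 = drawing

# Random noise: every example has the same 5% chance of a flipped label.
noisy_random = [y ^ (random.random() < 0.05) for y in true_label]

# Correlated noise: the flip probability grows with greyness, so grey
# (husky-like) images are mislabeled more often than colorful ones.
noisy_corr = [y ^ (random.random() < 0.10 * g)
              for y, g in zip(true_label, greyness)]

def error_rate(noisy, keep):
    """Fraction of mislabeled examples among those where keep(greyness) holds."""
    pairs = [(y, z) for y, z, g in zip(true_label, noisy, greyness) if keep(g)]
    return sum(y != z for y, z in pairs) / len(pairs)

# Under random noise the error rate is flat across the feature;
# under correlated noise it climbs with greyness.
print("random:    ", error_rate(noisy_random, lambda g: g < 0.5),
      error_rate(noisy_random, lambda g: g >= 0.5))
print("correlated:", error_rate(noisy_corr, lambda g: g < 0.5),
      error_rate(noisy_corr, lambda g: g >= 0.5))
```

A model trained on the correlated version will see a genuine-looking (but spurious) association between greyness and the “drawing” label, which no amount of averaging washes out.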
To understand how bad labels can skew AI models, Oyen and her team deliberately mislabeled data and watched how far the algorithm went astray. It turns out that mislabeling matters more when it occurs at the extreme ends of the data distribution than when it occurs in the middle of the bell curve. A mislabeled point in the distribution tails can create a new classification boundary that shouldn’t be there. The model learns the wrong thing (huskies are drawings), then flips it (drawings are huskies) and starts identifying all drawings as huskies. It takes only a few bad labels in the tails to create these spurious signatures and confuse the algorithm.
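The tail effect can be reproduced with a deliberately simple stand-in model, not the team’s actual code: a 1-nearest-neighbor classifier on synthetic one-dimensional data. Flipping a single label in the tail carves out a spurious decision region, while flipping one in the crowded middle leaves distant predictions unchanged.

```python
import random

random.seed(1)

# Synthetic training set: feature x, label 0 ("photo") or 1 ("drawing").
train = [(random.gauss(-2, 1), 0) for _ in range(500)] + \
        [(random.gauss(+2, 1), 1) for _ in range(500)]

def predict_1nn(data, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    return min(data, key=lambda p: abs(p[0] - x))[1]

# A query far out in the left tail: every clean neighbor has label 0.
x_min = min(x for x, _ in train)
query = x_min - 1.0
print("clean model:", predict_1nn(train, query))  # 0

# Flip the label of the single most extreme left-tail point.
corrupt_tail = [(x, 1 if x == x_min else y) for x, y in train]
print("tail flip:  ", predict_1nn(corrupt_tail, query))  # 1: a spurious region

# Flip a label in the crowded middle instead: the same query is unaffected,
# because disagreeing neighbors surround the bad point.
x_mid = min((x for x, y in train if y == 0), key=abs)
corrupt_mid = [(x, 1 if x == x_mid else y) for x, y in train]
print("middle flip:", predict_1nn(corrupt_mid, query))  # 0
```

One bad label in the tail was enough to make the model call everything beyond it a “drawing,” which mirrors the team’s finding that a few tail mislabels can flip an entire region of the model’s output.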
“We showed that if you have a huge data set with a lot of mistakes, there is no way to autocorrect those,” says Oyen. “It’s better to focus on the important parts of the data set, have human brains scan through any data points that look unusual, and make sure the labels are right.”
In addition to looking at where uncertainty arises within the data, Oyen’s team did experiments to suss out where uncertainty lurks in the models themselves. Most machine-learning algorithms build models dynamically as they learn, creating steps to interpret data based on the patterns they detect. As a result, these models are black boxes—their inner workings aren’t usually apparent to the user. To make matters worse, current AI models are often built with simulated data, meaning that the representations a model builds into its latent spaces—the internal spaces between input and output—could be inappropriate for applications outside its training distribution.
To interrogate AI models’ latent spaces, the team took publicly available datasets and tweaked them in a variety of ways. They trimmed the data down to varying degrees of sparsity or adjusted the uncertainty to low or high levels, then fed the altered data to the models to see what was going on under the hood. From these experiments, the team could define the minimum sample sizes and number of input features necessary for a model to get the right result under low- and high-uncertainty conditions.
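A stripped-down version of such a sweep, using a toy midpoint classifier on synthetic data rather than the team’s models, shows the shape of the result: the sample size needed for a reliable answer depends on how much uncertainty is in the inputs.

```python
import random

def trial_accuracy(n, noise_sd, rng, n_test=2000):
    """Fit a midpoint classifier to n noisy samples per class (true class
    centers at -1 and +1), then score it on fresh samples from the same
    distribution."""
    mean0 = sum(rng.gauss(-1, noise_sd) for _ in range(n)) / n
    mean1 = sum(rng.gauss(+1, noise_sd) for _ in range(n)) / n
    threshold = (mean0 + mean1) / 2
    correct = sum(rng.gauss(-1, noise_sd) < threshold for _ in range(n_test))
    correct += sum(rng.gauss(+1, noise_sd) >= threshold for _ in range(n_test))
    return correct / (2 * n_test)

rng = random.Random(0)
results = {}
for noise_sd in (0.5, 2.0):        # low- vs. high-uncertainty data
    for n in (5, 20, 80, 320):     # sample sizes to sweep
        accs = [trial_accuracy(n, noise_sd, rng) for _ in range(20)]
        results[(noise_sd, n)] = sum(accs) / len(accs)
        print(f"noise sd={noise_sd}  n={n:3d}  "
              f"accuracy={results[(noise_sd, n)]:.3f}")
```

With low input uncertainty, even small samples approach the best achievable accuracy; with high input uncertainty, accuracy is capped well below it no matter how many samples are added, so the sweep reveals both a minimum sample size and a ceiling set by the data itself.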
Uncertainty quantification isn’t uncertainty correction; it doesn’t fix the problem. Instead, it provides a metric for gauging whether a model is doing the right thing or not. Quantifying these types of uncertainty will help bring powerful data science methods to programs that do not yet trust AI for critical missions.
“The next step? For me, that would be ArtIMis,” Oyen says, referring to the Artificial Intelligence for Mission project, a large, multidisciplinary effort that will develop AI capabilities for a broad set of Laboratory missions. “We have big questions about which data we train on and how to clean it up. Unlike some places, we can’t just hope the data is good enough. We have to know that it is.”
People Also Ask:
- Can AI be trusted? Mostly. The trustworthiness of AI is paramount to its utility as a tool for modern society. It’s the responsibility of AI developers to make their tools as reliable as possible, and it’s the responsibility of AI users to validate the results they get.
- What is uncertainty quantification? In mathematical modeling, uncertainty quantification is the process of determining the likelihood of a particular outcome, given that some model inputs are inexact.