University of Wisconsin–Madison

With AI, field ecologists can finally keep up with environmental change

Across the globe, freshwater scientists have amassed “universes” of local ecological data. A new tool designed by a UW computer scientist uses AI to process it faster than humanly possible, enabling researchers to predict and solve water quality issues. 

By Rachel Robey 

Over the last several decades, a technology-driven “data deluge” has hit nearly all fields of research. Scientists today generate data faster than it can be organized, shared, and understood, creating new opportunities for computing and AI to help make sense of an increasingly complex world. Zoom out and you begin to see the problem: Every researcher, across every subfield, stranded on their own island of information. It’s hard enough to share findings with peers, let alone the public or its policymakers.

The freshwater scientists at UW–Madison’s Center for Limnology understand this better than most. Not only is our university the birthplace of freshwater science in North America, but the Center sits right on Lake Mendota, “the best studied lake in the world.” Trout Lake Station, the Center’s year-round field station in Vilas County, has been conducting research in the Northwoods since 1925.   

In other words: They’ve collected a lot of data over the last century.  

Beyond that, global field ecologists have yet another obstacle to contend with: Environmental change is happening faster than the research. “About a decade ago, [Limnology] recognized a need [for help with the data],” said Professor Paul Hanson, a researcher in water quality. “Fortunately, the university already happened to have a known expert.” 

A “known expert” in making messy data usable 

AnHai Doan, a professor of computer science at the College of Computing & Artificial Intelligence (CAI), helps researchers across disciplines make large, complex datasets more discoverable, usable, and meaningful. His work sits at the intersection of data management, AI, and scientific discovery, helping experts spend less time organizing information and more time generating insights. The result is high-quality data that can be used for various research-accelerating applications, from data science to building AI models.  

“My goal is to make messy data usable at scale,” he explains. “Modern research requires a lot of data, but often it’s ‘trapped’ in multiple datasets that are dirty, meaning they contain errors, duplicates, or inconsistent labeling. There’s no way to do effective research unless one can find, clean, and combine it.” 

The work reflects the growing role of computing in modern research. As datasets become larger and scientific questions become more complex, researchers increasingly rely on collaborations that combine deep domain expertise with advanced computing tools. Together, they can uncover patterns and connections that would be difficult for either discipline to achieve alone.  

Doan and Hanson began working together about a decade ago on projects related to the Environmental Data Initiative (EDI), an open-source environmental data repository funded by the National Science Foundation (NSF) and co-hosted with the University of New Mexico. Many of the contributors are individual researchers collecting data in their proverbial backyards, not unlike the students and faculty regularly seen skimming across Lake Mendota monitoring algal blooms. At first, Doan's role was limited to entity matching, an important step in data cleaning that finds matches across datasets even when the labels are “fuzzy.”

“Field ecologists are really creative,” said Hanson. Many use their own enigmatic shorthand, plus there are synonyms to account for — consider how “fauna,” “wildlife,” and “animals” can all mean the same thing. How do you organize all that information across data tables when everyone is speaking a different language? 

“Imagine all the work it takes to be able to translate that for just one column,” Hanson continued. “If you have thousands, which we do, it’s beyond human capacity.” 
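
To give a sense of what matching “fuzzy” labels involves, here is a minimal sketch in Python. It is not SmartCat's method or the approach Doan's group used; it only illustrates the general idea with the standard library. The synonym table, the column names, and the 0.8 threshold are all assumptions made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table: maps variant terms to one canonical label.
SYNONYMS = {"fauna": "animals", "wildlife": "animals"}


def normalize(label: str) -> str:
    """Lowercase, turn separators into spaces, and apply the synonym table word by word."""
    words = label.lower().replace("_", " ").replace("-", " ").split()
    return " ".join(SYNONYMS.get(word, word) for word in words)


def similarity(a: str, b: str) -> float:
    """Character-level similarity between two normalized labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_columns(left, right, threshold=0.8):
    """Pair each label in `left` with its best match in `right` if the score clears the threshold."""
    if not right:
        return {}
    matches = {}
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            matches[a] = best
    return matches


# Illustrative column names, not taken from any real dataset.
pairs = match_columns(
    ["Wildlife_Count", "water_temp_c"],
    ["fauna count", "Water Temp C", "site id"],
)
```

In this toy case, "Wildlife_Count" pairs with "fauna count" only because the synonym table maps both words to the same term. Doing that by hand for thousands of columns, each with its own shorthand, is the scale problem Hanson describes.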

In 2025, Doan began developing SmartCat, a collection of “AI-driven data catalog management systems” inspired by the decade-long collaboration with limnology. The new tool allows EDI to explore and organize its 87,000 (and growing) datasets more easily. It also generates descriptive text and metadata.

“AI is speeding up the rate at which you can pick up patterns or tease apart creative nomenclatures,” says Hanson. “It’s a game changer.”

AI for accelerated open science 

In the 2010s, the NSF funded the $10M LAGOS project to harvest water quality data from individual researchers and federal and state agencies around the U.S. All of the LAGOS data, about one TB, is open source and available on the EDI website, where it drives thousands of weekly downloads.

“A lot of the data is its own universe,” says Hanson. “Tens of thousands of ecosystems are represented. We’re a big community that’s highly distributed and heterogeneous, with resources to contribute but no standardized way of bringing them together for a higher purpose.”

This is where AI excels, explained Doan. Modern AI systems like SmartCat can spot patterns across big data, helping researchers discover desired datasets and missed connections, organize their knowledge, and focus on the questions that matter most.

“Without SmartCat, trying to find useful datasets is like trying to search for information without Google. It can take months,” said Doan. “SmartCat allows researchers to find desired datasets in minutes.”   

Such tools can also make scientific knowledge more accessible. By improving metadata, searchability, and discovery, the platform helps researchers, agencies, policymakers, and communities find and use information that might otherwise remain buried within thousands of individual datasets. 
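
A small sketch shows why metadata matters for discovery. This is a toy keyword search, not how SmartCat or the EDI website works, and the catalog records below are hypothetical entries invented for illustration.

```python
import re
from collections import Counter

# Hypothetical catalog records; titles and keywords are made up for illustration.
CATALOG = [
    {"id": "ds-1", "title": "Lake chlorophyll and algal bloom survey",
     "keywords": ["water quality", "algae"]},
    {"id": "ds-2", "title": "Stream temperature loggers",
     "keywords": ["temperature", "streams"]},
    {"id": "ds-3", "title": "Wetland bird counts",
     "keywords": ["wildlife", "wetlands"]},
]


def tokens(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, catalog):
    """Rank records by how often the query terms occur in the title and keywords."""
    query_terms = set(tokens(query))
    scored = []
    for record in catalog:
        text = record["title"] + " " + " ".join(record["keywords"])
        counts = Counter(tokens(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, record["id"]))
    # Highest score first; ties are broken by record id for a stable order.
    return [rid for score, rid in sorted(scored, key=lambda p: (-p[0], p[1]))]


hits = search("water quality", CATALOG)
```

Here the first record is found for "water quality" only because someone added that phrase as a keyword; its title never mentions it. Generating that kind of descriptive metadata automatically is part of what the article attributes to SmartCat.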

“Traditionally it’s a long and slow process to figure out how we as local ecologists fit into the broader tapestry of environmental science,” said Hanson. “[With AI], we’re doing things we never could have done before.”

Already, Doan is looking ahead to the next evolution of the SmartCat project. “There needs to be something like a Google search but for datasets, which could allow people all over the world to quickly find the data they need,” he said.

Applied to global freshwater science, SmartCat’s ability to quickly share, discover, and interpret new data could help turn the tide in protecting lakes, rivers, streams, and wetlands. Applied beyond, it could support experts using AI to advance medical research or democratize food networks.  

Plenty of today’s most pressing answers are currently trapped in datasets, said Doan. By bringing together human expertise and powerful computing tools, researchers are expanding what’s possible — not only for science, but for the communities that depend on it. All of the UW campus — all of the state, all of society — stands to benefit.   
