Donohue et al. (2025) Structured dataset of reported cloud seeding activities in the United States (2000–2025) using an LLM
Identification
- Journal: Scientific Data
- Year: 2025
- Date: 2025-12-11
- Authors: Jared Joseph Donohue, Kara D. Lamb
- DOI: 10.1038/s41597-025-06273-1
Research Groups
- Data Science Institute, Columbia University, New York, NY, USA
- Department of Earth and Environmental Engineering, Columbia University, New York, NY, USA
Short Summary
This study presents a structured dataset of reported cloud seeding activities in the United States from 2000 to 2025, extracted from 832 historical NOAA reports using a multi-stage PDF-to-text pipeline combined with an LLM, achieving an estimated 98.38% accuracy. The dataset addresses a critical data gap and demonstrates a scalable framework for unlocking historical environmental data using large language models.
Objective
- To create a comprehensive, structured dataset of reported cloud seeding activities in the United States from 2000 to 2025 to address the lack of accessible data for analysis.
- To demonstrate the potential of large language models (LLMs) for extracting structured information from historical, inconsistently formatted environmental documents.
- To provide a scalable framework for unlocking historical data from scanned documents across various scientific domains.
Study Configuration
- Spatial Scale: United States
- Temporal Scale: 2000–2025
Methodology and Data
- Models used: OpenAI's o3 large language model (LLM) for information extraction. Preprocessing used pymupdf for native text extraction, and pytesseract (Tesseract OCR engine) and llm-whisperer for optical character recognition (OCR) of scanned documents.
- Data sources: 832 historical Form 17-4 reports on weather modification activities from the National Oceanic and Atmospheric Administration (NOAA) Weather Modification Project Reports Archive, originally stored as scanned PDF files.
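The preprocessing step described above (native text extraction first, OCR for scanned pages) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `min_chars` threshold of 50, and the 300 dpi rendering are assumptions, and the llm-whisperer stage and the LLM call are omitted entirely.

```python
from typing import Callable, Iterable, List


def extract_pages(
    pages: Iterable[object],
    native_text: Callable[[object], str],
    ocr_text: Callable[[object], str],
    min_chars: int = 50,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Return one text string per page, falling back to OCR for sparse pages."""
    results = []
    for page in pages:
        text = native_text(page).strip()
        # A scanned page usually has little or no embedded text layer,
        # so a very short native result triggers the OCR fallback.
        if len(text) < min_chars:
            text = ocr_text(page).strip()
        results.append(text)
    return results


def pymupdf_native(page) -> str:
    """Read the embedded text layer of a PyMuPDF page."""
    return page.get_text()


def tesseract_ocr(page, dpi: int = 300) -> str:
    """Render a PyMuPDF page to an image and run Tesseract on it."""
    import io

    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image)


def extract_pdf(path: str) -> str:
    """Extract text from every page of one PDF report (hypothetical helper)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n\n".join(extract_pages(doc, pymupdf_native, tesseract_ocr))
```

Passing the two extractors in as callables keeps the fallback rule separate from the PDF libraries, so the rule can be checked with plain strings before running it on real scans.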
Main Results
- A structured dataset containing 832 unique cloud seeding projects in the United States from 2000 to 2025 was successfully created and made publicly available on Zenodo.
- Each project record contains the source filename plus 12 key fields: project name, year, season, state, operator affiliation, seeding agent, apparatus, purpose, target area, control area, start date, and end date.
- The estimated overall accuracy of the extracted data is 98.38%, based on a manual review of 200 randomly sampled records.
- Cloud seeding activity was primarily concentrated in western states (California, Colorado, Utah) and Texas, with the main stated purpose being to increase snowpack, followed by increasing precipitation.
- Silver iodide was the most common seeding agent, predominantly deployed using ground-based apparatus.
- The number of weather modification events peaked in the early to mid-2000s, declined through the 2010s, and rebounded in 2024 and 2025.
- Deliberate prompt design, particularly chain-of-thought reasoning, and the choice of model (OpenAI's o3) improved LLM-based data extraction accuracy.
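To make the record structure and the accuracy check above concrete, here is a small sketch of how one extracted record might be represented and compared against a manually reviewed copy. The snake_case field names, the JSON input format, and the field-by-field scoring rule are all assumptions made for illustration; the published dataset's column names and the authors' exact review procedure may differ. The sketch treats the filename as an identifier and scores only the remaining fields.

```python
import json
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SeedingRecord:
    """One reported project; field names are hypothetical snake_case labels."""
    filename: str
    project_name: Optional[str] = None
    year: Optional[int] = None
    season: Optional[str] = None
    state: Optional[str] = None
    operator_affiliation: Optional[str] = None
    seeding_agent: Optional[str] = None
    apparatus: Optional[str] = None
    purpose: Optional[str] = None
    target_area: Optional[str] = None
    control_area: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None


# Every field except the filename is treated as an extracted value.
EXTRACTED_FIELDS = [f.name for f in fields(SeedingRecord) if f.name != "filename"]


def parse_record(filename: str, llm_json: str) -> SeedingRecord:
    """Build a record from a JSON object, ignoring keys outside the schema."""
    raw = json.loads(llm_json)
    known = {name: raw.get(name) for name in EXTRACTED_FIELDS}
    return SeedingRecord(filename=filename, **known)


def field_accuracy(extracted: List[SeedingRecord], reviewed: List[SeedingRecord]) -> float:
    """Share of extracted fields that match the reviewed values exactly."""
    checked = 0
    correct = 0
    for got, want in zip(extracted, reviewed):
        for name in EXTRACTED_FIELDS:
            checked += 1
            correct += getattr(got, name) == getattr(want, name)
    return correct / checked if checked else float("nan")
```

For example, a record whose state was extracted incorrectly but whose other fields match the reviewed copy would score 11/12 under this rule. Exact string equality is a strict criterion; a real review would likely need to tolerate harmless differences such as capitalization or date formatting.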
Contributions
- This work provides the first comprehensive, structured dataset of reported cloud seeding activities in the U.S. for the 2000–2025 period, filling a significant gap in existing literature and enabling quantitative analysis of weather modification practices.
- It introduces a robust and scalable methodology for extracting structured data from unstructured, inconsistently formatted historical environmental documents using LLMs, which can be applied to other government-mandated reporting systems.
- The dataset serves as a valuable resource for researchers to study long-term patterns, analyze the evolution of seeding agents and deployment methods, and assess geographic and seasonal trends in cloud seeding operations.
- It demonstrates the practical utility of LLMs in scientific data synthesis, particularly for unlocking previously inaccessible historical records across various scientific domains.
Funding
- Columbia University’s Data Science Institute
- Zegar Family Foundation (for Kara D. Lamb)
Citation
@article{Donohue2025Structured,
author = {Donohue, Jared Joseph and Lamb, Kara D.},
title = {Structured dataset of reported cloud seeding activities in the United States (2000–2025) using an LLM},
journal = {Scientific Data},
year = {2025},
doi = {10.1038/s41597-025-06273-1},
url = {https://doi.org/10.1038/s41597-025-06273-1}
}
Original Source: https://doi.org/10.1038/s41597-025-06273-1