Gauch et al. (2025) How to deal w___ missing input data

Identification

Journal: Hydrology and Earth System Sciences
Year: 2025
Date: 2025-11-13
Authors: Martin Gauch, Frederik Kratzert, Daniel Klotz, Grey Nearing, Déborah Cohen, Oren Gilon
DOI: 10.5194/hess-29-6221-2025

Research Groups

Google Research, Zurich, Switzerland
Google Research, Vienna, Austria
IT:U Interdisciplinary Transformation University, Linz, Austria
Google Research, Tel Aviv, Israel

Short Summary

This paper addresses the critical challenge of missing input data in operational deep learning hydrologic models by introducing and comparing three strategies: input replacing, masked mean, and attention. The study concludes that the masked mean approach generally performs best across various missing data scenarios, offering a robust solution for real-world applications.

Objective

To introduce and compare different deep learning strategies for generating streamflow predictions when meteorological input data are missing, a common challenge in operational hydrologic modeling.

Study Configuration

Spatial Scale: 531 basins from the CAMELS dataset across the contiguous USA. One experiment specifically focused on 51 basins within the Ohio, Cumberland, and Tennessee River basins.
Temporal Scale: Daily input time steps. Models were trained with a 365-day lookback period. Training data spanned 1 October 1999 to 30 September 2008, validation from 1 October 1980 to 30 September 1989, and testing from 1 October 1989 to 30 September 1999.

Methodology and Data

Models used: Long Short-Term Memory (LSTM) networks. Three distinct mechanisms were developed and compared to handle missing input data:
- Input replacing: Missing values are set to zero, and binary flags are added to indicate outages.
- Masked mean: Each forcing product is embedded separately, and the embeddings of available products are averaged.
- Attention: A generalization of the masked mean, dynamically weighting the embeddings of available forcings based on static attributes, positional encoding, and availability flags.
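
The three mechanisms above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, array shapes, and the `scores` argument are assumptions made here. In the paper the embeddings and attention scores come from learned networks; in this sketch they are plain arrays, and every time step is assumed to have at least one available product.

```python
import numpy as np

def input_replacing(forcings, available):
    """forcings: (T, P, V), possibly NaN where missing; available: (T, P) bool.
    Returns (T, P*V + P): zero-filled values plus one outage flag per product."""
    filled = np.where(available[..., None], forcings, 0.0)
    flags = (~available).astype(float)  # 1.0 marks an outage
    return np.concatenate([filled.reshape(forcings.shape[0], -1), flags], axis=1)

def masked_mean(embeddings, available):
    """embeddings: (T, P, D); available: (T, P) bool.
    Averages the embeddings of the available products at each time step."""
    kept = np.where(available[..., None], embeddings, 0.0)
    count = np.maximum(available.sum(axis=1, keepdims=True), 1)  # avoid 0-division
    return kept.sum(axis=1) / count

def masked_attention(embeddings, scores, available):
    """scores: (T, P) unnormalized weights (hypothetical input; learned in practice).
    Softmax over available products only, then a weighted sum of embeddings."""
    masked = np.where(available, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                             # unavailable -> 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    kept = np.where(available[..., None], embeddings, 0.0)
    return (kept * weights[..., None]).sum(axis=1)
```

With equal scores for all products, `masked_attention` returns the same result as `masked_mean`, which is the sense in which attention generalizes the masked mean.
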
Data sources:
- CAMELS dataset (Catchment Attributes and Meteorology for Large-sample Studies).
- Three sets of daily meteorological forcings: Daymet, Maurer, and NLDAS.
- 15 forcing variables: five per forcing product (precipitation, solar radiation, minimum and maximum temperature, and vapor pressure) across the three products.
- 26 static catchment attributes.
- Streamflow as the target variable.

Main Results

All three proposed methods successfully enabled deep learning hydrologic models to produce streamflow predictions even with significant amounts of missing input data.
Model accuracy (Nash–Sutcliffe efficiency, NSE; Kling–Gupta efficiency, KGE) consistently decreased with an increasing probability of missing data across all methods.
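
For reference, the two metrics can be computed as in the sketch below. It uses the standard definitions: NSE as one minus the ratio of squared errors to the variance of the observations, and KGE in its original three-component form (correlation, variability ratio, bias ratio). Whether the paper uses exactly this KGE variant is an assumption here, and the function names are illustrative.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the mean of obs."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta efficiency (original three-component form)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Both metrics equal 1 for a perfect simulation. A simulation that always predicts the observed mean has an NSE of 0.
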
In scenarios with random time step dropout, the masked mean approach generally performed best in terms of KGE, showing statistically significant improvements over input replacing in most cases. The attention mechanism often underperformed at lower missing data probabilities.
For scenarios involving entire forcing sequences missing, the masked mean and attention mechanisms showed similar performance, with input replacing typically performing the worst. The differences in accuracy between methods were often small.
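
The two missing-data scenarios above can be simulated with availability masks, as in the following sketch. The summary does not give the paper's exact sampling procedure, so independent Bernoulli draws per product are an assumption, and the function names are hypothetical.

```python
import numpy as np

def time_step_dropout(n_steps, n_products, p_missing, rng):
    """Availability mask of shape (n_steps, n_products).
    Each product is dropped independently at each time step."""
    return rng.random((n_steps, n_products)) >= p_missing

def sequence_dropout(n_steps, n_products, p_missing, rng):
    """Availability mask where a dropped product is missing for the
    entire input sequence (every row of the mask is identical)."""
    keep = rng.random(n_products) >= p_missing
    return np.broadcast_to(keep, (n_steps, n_products)).copy()

rng = np.random.default_rng(0)
step_mask = time_step_dropout(365, 3, 0.3, rng)  # 365-day lookback, 3 products
seq_mask = sequence_dropout(365, 3, 0.3, rng)
```

Neither function guarantees that at least one product remains available, so time steps with every product missing would need separate handling.
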
When incorporating regional forcing products (NLDAS available only in a subset of basins), all three methods improved predictions compared to a globally trained model using only two forcings. The masked mean approach performed comparably to a baseline model trained exclusively on the regional basins with all three forcings.
The attention mechanism, despite its theoretical expressiveness, largely converged to a solution similar to the masked mean and did not consistently yield superior performance in the tested configurations.

Contributions

Introduces and rigorously evaluates three novel deep learning strategies for robustly handling missing meteorological input data in operational hydrologic models.
Provides empirical evidence for the effectiveness of these strategies across diverse missing data scenarios, including random outages, complete product unavailability, and regional data limitations.
Identifies the masked mean approach as a practical and effective solution, offering a good balance between performance and architectural simplicity for real-world applications.
Advances the applicability of deep learning in hydrology by addressing a critical practical challenge that hinders the transition of research models to operational systems.
Demonstrates the potential for training global models that can seamlessly integrate local, high-quality forcing data, thereby optimizing the use of diverse data sources.

Funding

Explicit funding projects, programs, or reference codes are not listed in the paper. However, the work includes an invited contribution by Martin Gauch, recipient of the EGU Hydrological Sciences Virtual Outstanding Student and PhD candidate Presentation Award 2021. The authors are affiliated with Google Research and IT:U Interdisciplinary Transformation University.

Citation

@article{Gauch2025How,
  author = {Gauch, Martin and Kratzert, Frederik and Klotz, Daniel and Nearing, Grey and Cohen, Déborah and Gilon, Oren},
  title = {How to deal w___ missing input data},
  journal = {Hydrology and Earth System Sciences},
  year = {2025},
  doi = {10.5194/hess-29-6221-2025},
  url = {https://doi.org/10.5194/hess-29-6221-2025}
}

Original Source: https://doi.org/10.5194/hess-29-6221-2025