Michel et al. (2025) Temporal attention multi-resolution fusion of satellite image time-series, applied to Landsat-8/9 and Sentinel-2: all bands, any time, at best spatial resolution

Identification

Journal: Remote Sensing of Environment
Year: 2025
Date: 2025-12-11
Authors: Julien Michel, Jordi Inglada
DOI: 10.1016/j.rse.2025.115159

Research Groups

Univ Toulouse, France
CNES (Centre National d'Études Spatiales), France
CNRS (Centre National de la Recherche Scientifique), France
INRAE (Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement), France
IRD (Institut de Recherche pour le Développement), France
CESBIO (Centre d'Études Spatiales de la BIOsphère), France

Short Summary

This paper proposes a general formulation for fusing Satellite Image Time Series (SITS) from multiple sensors with varying spatial resolutions and acquisition times. It introduces TAMRF-SITS, a novel deep learning architecture and training strategy, which predicts all spectral bands from all input sensors at the best spatial resolution and any requested acquisition time, outperforming or matching existing ad-hoc methods across various tasks while relaxing unrealistic assumptions.

Objective

To predict all bands from all input sensors at the best observed spatial resolution and for any acquisition time, given observed SITS from two or more sensors over the same geographical area, without making unrealistic assumptions found in existing literature.

Study Configuration

Spatial Scale: Areas of Interest (AOIs) covering 9.9 km x 9.9 km (990 x 990 pixels for Sentinel-2, 330 x 330 pixels for Landsat-8/9). Input Sentinel-2 bands are at 10 meters (m) or 20 m (up-sampled to 10 m), Landsat-8/9 bands are at 30 m or 100 m (up-sampled to 30 m). The model outputs all bands at 10 m spatial resolution.
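The grid sizes quoted above can be checked with a short sketch: both sensors cover the same 9.9 km footprint, and a factor-3 up-sampling places a 30 m Landsat band on the 10 m Sentinel-2 grid. The summary does not say which interpolation is used for up-sampling, so the nearest-neighbour scheme and the function names below are assumptions made for illustration only.

```python
import numpy as np

S2_RES_M, LS_RES_M = 10, 30          # grid spacings quoted in the text
S2_SIZE_PX, LS_SIZE_PX = 990, 330    # AOI sizes quoted in the text


def extent_km(size_px: int, res_m: int) -> float:
    """Side length of a square AOI in kilometres."""
    return size_px * res_m / 1000.0


def upsample_nearest(band: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour up-sampling (assumed): repeat each pixel along both axes."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)


# Toy Landsat band on its 30 m grid, moved onto the 10 m grid.
landsat_band = np.zeros((LS_SIZE_PX, LS_SIZE_PX), dtype=np.float32)
on_s2_grid = upsample_nearest(landsat_band, LS_RES_M // S2_RES_M)
```

Both extents evaluate to 9.9 km, and the up-sampled array has the 990 x 990 shape of the Sentinel-2 grid.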
Temporal Scale: SITS spanning one full year (2022 for eu22 split) or random one-year periods between 2022-01-01 and 2024-06-01 (wwmy split). The model can predict at any observed or non-observed acquisition time.
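As an illustration of the wwmy temporal sampling, the sketch below draws a random one-year window inside the quoted date range. The uniform draw, the 365-day definition of "one year", and the function name are assumptions; the summary does not specify how the periods were chosen.

```python
import random
from datetime import date, timedelta

RANGE_START = date(2022, 1, 1)   # bounds quoted in the text
RANGE_END = date(2024, 6, 1)
WINDOW = timedelta(days=365)     # assumption: "one year" taken as 365 days


def sample_one_year_window(rng: random.Random) -> tuple[date, date]:
    """Draw a (start, end) pair uniformly so the window fits inside the range."""
    latest_start = RANGE_END - WINDOW
    offset_days = rng.randint(0, (latest_start - RANGE_START).days)
    start = RANGE_START + timedelta(days=offset_days)
    return start, start + WINDOW


start, end = sample_one_year_window(random.Random(0))
```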

Methodology and Data

Models used:
- TAMRF-SITS (Temporal Attention Multi-Resolution Fusion of Satellite Image Time-Series): A novel Deep Learning architecture.
- Architecture: Combines Residual Convolutional Neural Networks (CNNs) for spatial encoding and a Transformer for temporal encoding. Follows an encoder-decoder scheme.
- Training Strategy: Masked Auto-Encoder (MAE) with a diversified masking strategy.
- Loss Functions:
  - Huber loss (L_smooth1) for reconstruction.
  - Novel mask-contrastive term to ignore clouds and non-informative areas.
  - Novel Linear-Regression Learned Perceptual Image Patch Similarity (LPIPS) term to favor high spatial frequency details.
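The masked auto-encoder idea and the Huber reconstruction term listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform random choice of hidden dates stands in for the "diversified masking strategy", which the summary does not detail, and the plain validity masking shown here is not the paper's mask-contrastive term. All function names are hypothetical. With `delta=1.0`, the Huber loss coincides with the smooth-L1 loss.

```python
import numpy as np


def hide_random_dates(n_dates, hide_fraction, rng):
    """Boolean vector, True for dates withheld from the input (assumed uniform masking)."""
    n_hidden = max(1, int(round(hide_fraction * n_dates)))
    hidden = np.zeros(n_dates, dtype=bool)
    hidden[rng.choice(n_dates, size=n_hidden, replace=False)] = True
    return hidden


def huber(residual, delta=1.0):
    """Element-wise Huber loss: quadratic near zero, linear beyond `delta`."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * residual**2, delta * (abs_r - 0.5 * delta))


def masked_huber_loss(pred, target, valid, delta=1.0):
    """Mean Huber loss over pixels flagged valid; invalid pixels contribute nothing."""
    per_pixel = np.where(valid, huber(pred - target, delta), 0.0)
    n_valid = int(np.count_nonzero(valid))
    return float(per_pixel.sum() / n_valid) if n_valid else 0.0


rng = np.random.default_rng(0)
hidden = hide_random_dates(n_dates=10, hide_fraction=0.3, rng=rng)
target = rng.random((10, 4, 4))            # toy series shaped (date, row, col)
pred = target + 0.01                       # stand-in for a model reconstruction
valid = np.ones_like(target, dtype=bool)
valid[~hidden] = False                     # score only the withheld dates
loss = masked_huber_loss(pred, target, valid)
```

Scoring only the withheld dates mirrors the self-supervised setup: the model never sees the images it is asked to reconstruct.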
Data sources:
- LS2S2 (Landsat-8/9 to Sentinel-2) dataset, comprising joint Sentinel-2 and Landsat-8/9 SITS.
- Two splits: eu22 (64 training, 41 testing AOIs in Europe, Equatorial Africa, French Guiana) and wwmy (138 training, 69 testing AOIs worldwide).
- Top-of-Canopy surface reflectances from Level 2 products.
- Validity masks derived from product metadata (clouds, shadows, out-of-swath).
- Data gathered via the OpenEO API.

Main Results

A single pre-trained TAMRF model consistently performs on par with or better than existing ad-hoc methods across four distinct tasks.
Gap-filling: TAMRF achieves lower Root Mean Square Error (RMSE) on masked dates (e.g., 0.015 for Landsat B1, 0.015 for Sentinel-2 B2) and markedly better Image Quality (IQ), measured as lower BRISQUE scores (e.g., 49.7 for Landsat B1, 35.2 for Sentinel-2 B2), than naive interpolation and U-TILISE. It also improves high spatial frequency content (positive Frequency Restoration, FR).
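For reference, an RMSE restricted to valid pixels of a withheld date could be computed as in the sketch below. It assumes reflectance arrays and a boolean validity mask (True for clear pixels); the function name and the toy values are illustrative and are not taken from the paper's evaluation code.

```python
import numpy as np


def masked_rmse(pred, ref, valid):
    """RMSE between prediction and reference over pixels flagged valid."""
    n_valid = int(np.count_nonzero(valid))
    if n_valid == 0:
        return float("nan")  # nothing to evaluate on this date
    sq_err = np.where(valid, (pred - ref) ** 2, 0.0)
    return float(np.sqrt(sq_err.sum() / n_valid))


# Toy 2 x 2 example: the bottom-right pixel is cloudy and therefore ignored.
pred = np.array([[0.10, 0.20], [0.30, 0.90]])
ref = np.array([[0.10, 0.22], [0.28, 0.10]])
valid = np.array([[True, True], [True, False]])
rmse = masked_rmse(pred, ref, valid)
```

Excluding the cloudy pixel matters here: including it would let a single unusable reference value dominate the score.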
Band-sharpening (Sentinel-2 20 m bands to 10 m): TAMRF achieves better IQ (lower BRISQUE, e.g., 14.18 for B6) and sharpening (higher FR, e.g., 5.5 for B6) compared to DSen2, while seamlessly interpolating cloudy pixels.
Spatio-Temporal Fusion: TAMRF systematically yields lower RMSE (e.g., 0.014 for B4) and the lowest BRISQUE scores (e.g., 21.4 for B4) compared to STAIR, Sen2Like, Deep-Harmonization, and DSTFN. It effectively uses Landsat-8/9 low-resolution information and handles cloudy target dates.
Thermal Sharpening (Landsat-8/9 LST): With residual compensation, TAMRF achieves a lower RMSE (0.381 Kelvin) than DMS (0.477 Kelvin), while maintaining better IQ and FR.
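The summary does not describe how residual compensation is carried out. A generic form, common in thermal sharpening, adds back the coarse-scale difference between the observed temperature and the aggregated sharpened temperature, so that the sharpened image agrees with the observation once averaged to the coarse grid. The sketch below illustrates that generic idea only; the block-mean aggregation, the nearest-neighbour spreading of the residual, and the factor of 3 are assumptions.

```python
import numpy as np


def block_mean(fine, factor):
    """Aggregate a fine grid to a coarse grid by averaging factor x factor blocks."""
    h, w = fine.shape
    return fine.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def residual_compensation(sharpened, coarse_obs, factor):
    """Add the coarse-scale residual back onto the sharpened image."""
    residual = coarse_obs - block_mean(sharpened, factor)
    # Spread each coarse residual uniformly over its block (assumed scheme).
    residual_fine = np.repeat(np.repeat(residual, factor, axis=0), factor, axis=1)
    return sharpened + residual_fine


rng = np.random.default_rng(1)
sharpened = 290.0 + rng.random((6, 6))    # toy sharpened LST in Kelvin
coarse_obs = 291.0 + rng.random((2, 2))   # toy observed coarse LST in Kelvin
compensated = residual_compensation(sharpened, coarse_obs, factor=3)
```

After compensation, averaging the result back to the coarse grid reproduces the observation, while the fine-scale variations inside each block are unchanged.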
The model relaxes unrealistic assumptions common in literature, such as requiring similar spectral bands, same-day acquisitions, or scale-invariance.
TAMRF has 2.3 million parameters. Processing time and memory consumption scale approximately linearly with the number of input dates (e.g., 20 GB of memory for 128 input images).
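A back-of-the-envelope reading of these figures is sketched below. It assumes 32-bit floating-point weights and a purely proportional memory model through the single quoted data point (20 GB for 128 images); neither assumption is confirmed by the summary, so the outputs are rough orders of magnitude rather than measured values.

```python
N_PARAMS = 2_300_000                     # parameter count quoted in the text
REF_IMAGES, REF_MEMORY_GB = 128, 20.0    # single data point quoted in the text


def weights_size_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Storage for the weights alone, assuming float32 (4 bytes per parameter)."""
    return n_params * bytes_per_param / 1e6


def estimated_memory_gb(n_images: int) -> float:
    """Proportional extrapolation from the quoted point (assumed model)."""
    return REF_MEMORY_GB * n_images / REF_IMAGES


weights_mb = weights_size_mb(N_PARAMS)
memory_for_64_images = estimated_memory_gb(64)
```

Under these assumptions the weights occupy about 9.2 MB, so the memory footprint is dominated by the input series and intermediate activations rather than by the model itself.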

Contributions

Proposes a general mathematical formulation for SITS fusion that unifies temporal interpolation, spatial resolution enhancement, and spatio-temporal fusion, overcoming limitations of existing methods.
Introduces TAMRF, a novel Deep Learning architecture combining Residual CNNs for spatial encoding and a Transformer for temporal encoding, capable of solving the generic problem formulation.
Develops an original self-supervised training framework using a Masked Auto-Encoder strategy with new Linear-Regression Learned Perceptual Image Patch Similarity (LPIPS) and mask-contrastive loss terms.
Demonstrates unmatched versatility with a single pre-trained model that can process any number of Sentinel-2 and Landsat-8/9 dates, predicting all bands from both sensors at 10 m spatial resolution for any target date.
Evaluates TAMRF on a new worldwide, multi-year LS2S2 dataset, showing superior or comparable performance to task-specific state-of-the-art methods.
Highlights the potential for redefining Level 3 satellite products into multi-sensor, spatial-resolution enhanced, temporally accurate, and artifact-free outputs.

Funding

EvoLand project (Evolution of the Copernicus Land Service portfolio), grant agreement No 101082130, funded by the European Union’s Horizon Europe research and innovation program.
HPC resources from GENCI-IDRIS (Grant 2023-AD010114835).
HPC resources from CNES Computing Center.

Citation

@article{Michel2025Temporal,
  author = {Michel, Julien and Inglada, Jordi},
  title = {Temporal attention multi-resolution fusion of satellite image time-series, applied to Landsat-8/9 and Sentinel-2: all bands, any time, at best spatial resolution},
  journal = {Remote Sensing of Environment},
  year = {2025},
  doi = {10.1016/j.rse.2025.115159},
  url = {https://doi.org/10.1016/j.rse.2025.115159}
}

Original Source: https://doi.org/10.1016/j.rse.2025.115159