Yu et al. (2025) Cloud and Snow Segmentation via Transformer-Guided Multi-Stream Feature Integration
Identification
- Journal: Remote Sensing
- Year: 2025
- Date: 2025-09-29
- Authors: Ka Chun Yu, Kai Chen, Liguo Weng, Min Xia, Shengyan Liu
- DOI: 10.3390/rs17193329
Research Groups
- Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, B-DAT, Nanjing University of Information Science and Technology, Nanjing 210044, China
- Department of Computer Science, University of Reading, Whiteknights, Reading RG6 6DH, UK
Short Summary
This paper introduces a Transformer-guided dual-branch deep learning architecture for cloud and snow semantic segmentation in remote sensing images. By integrating global contextual features with local spatial details, the network overcomes the spectral similarity between clouds and snow and achieves state-of-the-art performance on challenging datasets.
Objective
- To design a Transformer-guided architecture with complementary feature-extraction branches that accurately discriminates and segments spectrally similar clouds and snow in satellite observations, addressing sensitivity to noise, small-target detection, and coarse boundary delineation.
Study Configuration
- Spatial Scale:
- CSWV Dataset: High-resolution WorldView-2 images, processed as 256 × 256 pixel patches.
- SPARCS Dataset: Landsat-8 images, 1000 × 1000 pixels, cropped into 256 × 256 pixel patches.
- Utilizes specific spectral bands (Blue, Green, Red, Near-Infrared) from Landsat-8.
- Temporal Scale:
- CSWV Dataset: Images collected between June 2014 and July 2016.
- SPARCS Dataset: Used for generalization tests on multi-spectral Landsat-8 imagery; the paper does not specify the collection period for this dataset.
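The 256 × 256 patching described above can be sketched as follows. `crop_to_patches` is a hypothetical helper, and the non-overlapping tiling shown is the simplest variant; the paper's exact cropping stride is not stated (80 SPARCS scenes yield 2000 samples, i.e. 25 per 1000 × 1000 scene, which implies a denser or overlapping scheme):

```python
import numpy as np

def crop_to_patches(image: np.ndarray, patch: int = 256) -> list[np.ndarray]:
    """Tile an (H, W, C) scene into non-overlapping patch x patch tiles,
    discarding partial tiles at the right and bottom edges."""
    h, w = image.shape[:2]
    return [
        image[r:r + patch, c:c + patch]
        for r in range(0, h - patch + 1, patch)
        for c in range(0, w - patch + 1, patch)
    ]

# A 1000 x 1000 SPARCS-style scene with 4 bands (Blue, Green, Red, NIR)
scene = np.zeros((1000, 1000, 4), dtype=np.float32)
patches = crop_to_patches(scene)  # 3 x 3 = 9 full 256 x 256 tiles
```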
Methodology and Data
- Models used:
- Proposed model: Transformer-guided multi-stream feature integration network.
- Encoder: Dual-path structure integrating a Transformer Encoder Module (TEM) and a ResNet18-based convolutional branch.
- Feature-Enhancement Module (FEM) for bidirectional interaction.
- Deep Feature-Extraction Module (DFEM) at the deepest convolutional layer.
- Decoder: Transformer Fusion Module (TFM) and Strip Pooling Auxiliary Module (SPAM).
- Comparison models: FCN, PAN, PSPNet, DeepLabV3Plus, BiSeNetV2, DFANet, ESPNetV2, MFANet, SGBNet, ENet, PADANet, DDRNet, SP_CSANet, DenseASPP, PVT, MSPFANet, MFENet, DABNet, Restormer, CvT, CcNet, LCDNet, MCANet, CDUNet, ACFNet, HRNet, UNet, SegNet, DBNet, CSDNet, CloudNet.
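The summary does not detail the internals of the Strip Pooling Auxiliary Module (SPAM); purely as an illustration, here is a minimal NumPy sketch of generic strip pooling (Hou et al., 2020), which captures long-range context along each spatial axis with 1 × W and H × 1 pooling windows:

```python
import numpy as np

def strip_pool(x: np.ndarray) -> np.ndarray:
    """Generic strip pooling on a (C, H, W) feature map: average along
    each spatial axis, broadcast the strips back, and fuse by addition.
    Illustrative only -- not the paper's SPAM implementation."""
    h_strip = x.mean(axis=2, keepdims=True)  # (C, H, 1): pooled across width
    w_strip = x.mean(axis=1, keepdims=True)  # (C, 1, W): pooled across height
    return h_strip + w_strip                 # broadcasts back to (C, H, W)

feat = np.ones((64, 32, 32), dtype=np.float32)
out = strip_pool(feat)  # constant input of 1.0 -> constant output of 2.0
```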
- Data sources:
- CSWV (Cloud and Snow) Dataset [43]: 27 high-resolution WorldView-2 images, 3200 samples (256 × 256 pixels), augmented to 10,240 training and 2560 validation images.
- SPARCS Dataset [17]: 80 Landsat-8 images (1000 × 1000 pixels), cropped to 2000 samples (256 × 256 pixels), augmented to 6400 training and 1600 validation images.
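The reported counts are consistent with a 4× augmentation (3200 → 12,800 = 10,240 train + 2560 val; 2000 → 8000 = 6400 + 1600). The exact augmentation scheme is not stated in this summary; a flip-based 4× expansion is one common possibility:

```python
import numpy as np

def augment_4x(patch: np.ndarray) -> list[np.ndarray]:
    """Expand one (H, W, C) patch into 4 variants via flips. A hypothetical
    scheme matching the 4x sample counts, not the paper's stated method."""
    return [
        patch,
        np.flip(patch, axis=0),       # vertical flip
        np.flip(patch, axis=1),       # horizontal flip
        np.flip(patch, axis=(0, 1)),  # both flips (180-degree rotation)
    ]

variants = augment_4x(np.zeros((256, 256, 4), dtype=np.float32))
```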
Main Results
- Achieved state-of-the-art performance on both CSWV and SPARCS datasets for cloud and snow segmentation.
- CSWV Dataset:
- Cloud detection: Recall (R) of 91.64% and F1 score of 92.19%.
- Snow detection: Recall (R) of 93.59% and F1 score of 94.25%.
- Overall performance: Pixel Accuracy (PA) of 94.81%, Frequency-Weighted Intersection over Union (FWIoU) of 90.19%, and Mean Intersection over Union (MIoU) of 89.23%.
- Ablation studies attributed the following MIoU gains to individual modules: FEM (+0.50%), TEM (+1.19%), DFEM (+0.12%), SPAM (+0.16%), and TFM (+1.43%).
- SPARCS Dataset:
- Achieved the highest F1 scores for snow/ice (94.07%), water (91.22%), and land (95.72%).
- Overall performance: PA of 93.02%, FWIoU of 87.33%, and MIoU of 81.49%.
- Demonstrated superior detection accuracy for thin clouds and small scattered snow patches, and improved segmentation of cloud shadows projected onto snow layers.
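The overall metrics reported above (PA, FWIoU, MIoU) are standard confusion-matrix statistics. A minimal sketch, with `segmentation_metrics` a hypothetical helper and a toy binary cloud/clear confusion matrix as example data:

```python
import numpy as np

def segmentation_metrics(cm: np.ndarray) -> dict:
    """PA, MIoU and FWIoU from a confusion matrix cm[true_class, pred_class]."""
    tp = np.diag(cm).astype(float)
    # Per-class union: pixels of the class (row sum) plus pixels predicted
    # as the class (column sum), minus the true positives counted twice.
    union = cm.sum(axis=1) + cm.sum(axis=0) - tp
    iou = tp / union
    freq = cm.sum(axis=1) / cm.sum()  # class frequency weights for FWIoU
    return {
        "PA": tp.sum() / cm.sum(),
        "MIoU": iou.mean(),
        "FWIoU": (freq * iou).sum(),
    }

# Toy 2-class example: 100 pixels, classes {cloud, clear}
cm = np.array([[50, 5],
               [10, 35]])
m = segmentation_metrics(cm)  # PA = 0.85
```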
Contributions
- Proposed a novel Transformer-driven multi-branch architecture for end-to-end cloud and snow semantic segmentation in visible and multispectral high-resolution remote sensing images.
- Integrated a Transformer Encoder Module (TEM) and a ResNet18-based convolutional branch in the encoder to effectively combine global semantic information with local spatial details.
- Introduced a Feature-Enhancement Module (FEM) to facilitate mutual guidance and adaptive feature fusion between the Transformer and convolutional branches, improving the detection of subtle and scattered cloud-snow structures.
- Embedded a Deep Feature-Extraction Module (DFEM) at the deepest convolutional layer to refine channel-level information and enhance the clarity of object boundaries.
- Designed a Transformer Fusion Module (TFM) and a Strip Pooling Auxiliary Module (SPAM) in the decoding stage to boost robustness against noise, enhance attention to snow detection, and improve segmentation of irregular cloud-snow junctions.
- Achieved state-of-the-art performance on the CSWV and SPARCS datasets, demonstrating strong effectiveness and applicability in complex real-world cloud and snow detection scenarios.
Funding
- National Natural Science Foundation of China (grant 42075130)
Citation
@article{Yu2025Cloud,
author = {Yu, Ka Chun and Chen, Kai and Weng, Liguo and Xia, Min and Liu, Shengyan},
title = {Cloud and Snow Segmentation via Transformer-Guided Multi-Stream Feature Integration},
journal = {Remote Sensing},
year = {2025},
doi = {10.3390/rs17193329},
url = {https://doi.org/10.3390/rs17193329}
}
Original Source: https://doi.org/10.3390/rs17193329