Multilingual Vision-Language Pre-training for the Remote Sensing Domain

João Daniel Silva, João Magalhães, Devis Tuia, Bruno Martins

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using existing human-labeled image-caption datasets, or using synthetic data corresponding to image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective. Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by the use of automated machine translation into nine additional languages. We show that the translated data is indeed helpful, e.g., improving performance even on English. Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks, including cross-modal and multilingual image-text retrieval, as well as zero-shot image classification.
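The "standard CLIP objective" mentioned in the abstract is a symmetric contrastive (InfoNCE) loss computed over a batch of image-caption pairs: matched pairs sit on the diagonal of a similarity matrix, and cross-entropy is applied in both the image-to-text and text-to-image directions. The sketch below, in plain NumPy with an illustrative temperature value, shows the general idea; it is not the authors' actual implementation, and the function name and defaults are assumptions for illustration:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss in the style of CLIP.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    The temperature value 0.07 is illustrative (CLIP learns it during training).
    """
    # L2-normalise embeddings so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(img))         # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned embeddings the diagonal dominates and the loss approaches zero; with unrelated embeddings the loss is substantially higher, which is what drives the matched image and caption representations together during fine-tuning.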

Original language: English
Title of host publication: SIGSPATIAL '24
Subtitle of host publication: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems
Editors: Mario A. Nascimento, Li Xiong, Andreas Zufle, Yao-Yi Chiang, Ahmed Eldawy, Peer Kroger
Publisher: ACM - Association for Computing Machinery
Pages: 220-232
Number of pages: 13
ISBN (Electronic): 9798400711077
DOIs
Publication status: Published - 22 Nov 2024
Event: 32nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL 2024 - Atlanta, United States
Duration: 29 Oct 2024 – 1 Nov 2024

Publication series

Name: 32nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL 2024

Conference

Conference: 32nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL 2024
Country/Territory: United States
City: Atlanta
Period: 29/10/24 – 1/11/24

Keywords

  • Contrastive Language-Image Pre-training
  • Cross-Modal Retrieval
  • Remote Sensing
  • Self-Supervised Pre-training
  • Vision and Language
