GeoMMBench & GeoMMAgent: A Multimodal Benchmark and Multi-Agent Framework for GeoScience and Remote Sensing


1 RIKEN AIP     2 Wuhan University     3 Linköping University     4 University of Tokyo

5 Nanjing University of Information Science and Technology

GeoMMBench overview examples

GeoMMBench provides expert-level, image-based multiple-choice questions across remote sensing, photogrammetry, GIS, and GNSS, with diverse sensor modalities and tasks. GeoMMAgent follows a plan–execute–evaluate pipeline with toolkits for general preprocessing, knowledge retrieval, perception (YOLO11 / DeepLabV3+), and multimodal reasoning.

Abstract

We present GeoMMBench, a comprehensive multimodal question-answering benchmark for geoscience and remote sensing, featuring 1,053 expert-level, image-based multiple-choice questions spanning four disciplines (Remote Sensing, Photogrammetry, GIS, GNSS), six sensor modalities (Optical, SAR, Hyperspectral, LiDAR, DEM, Thermal), and diverse tasks such as scene classification, object and change detection, spectral analysis, and spatial reasoning.

We further introduce GeoMMAgent, a multi-agent framework that follows a plan–execute–evaluate paradigm: a coordinator decomposes tasks; specialized agents handle general preprocessing, web and multimodal retrieval (including GME-based filtering), perception with YOLO11 and DeepLabV3+, and reasoning for option alignment. We evaluate 36+ vision-language models on GeoMMBench in zero-shot settings; GeoMMAgent achieves strong performance. The dataset is released on Hugging Face; code and configurations are publicly available.

News

  • 2026/04/09: GeoMMBench & GeoMMAgent was selected as a CVPR 2026 Highlight!
  • 2026/03/23: GeoMMBench & GeoMMAgent was accepted to CVPR 2026.
  • 2026/03/23: GeoMMBench released on Hugging Face (1,053 questions).
  • 2026/03/23: GeoMMAgent code (coordinator, exec_agents, toolkit) open-sourced.

Method / Framework Overview

GeoMMAgent framework

Coordinator orchestrates agents; toolkits cover general utilities, knowledge retrieval, perception, and reasoning.

Toolkits (this repository)

Toolkit      Capability
General      Format conversion, filtering, scaling, optional neural super-resolution, etc.
Knowledge    Web search (multi-engine fallback); optional image search; GME text–image similarity for candidate ranking
Perception   YOLO11 classification & detection; DeepLabV3+ (Xception) segmentation
Reasoning    Multimodal LLM reasoning and in-agent option alignment
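The Knowledge toolkit's GME-based candidate ranking boils down to scoring retrieved images against the query in a shared embedding space. A minimal sketch of that ranking step, assuming the text and image embeddings have already been computed (the `rank_candidates` helper and its signature are illustrative, not the repository's actual API):

```python
import numpy as np

def rank_candidates(query_emb: np.ndarray, cand_embs: np.ndarray,
                    top_k: int = 3) -> list[int]:
    """Rank candidate images by cosine similarity to a query embedding.

    query_emb: (d,) embedding of the question text.
    cand_embs: (n, d) embeddings of retrieved candidate images.
    Returns indices of the top_k most similar candidates, best first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q                        # cosine similarity per candidate
    return np.argsort(-sims)[:top_k].tolist()
```

In the actual pipeline, the embeddings would come from the GME model; only the top-ranked evidence images are forwarded to the VLM.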

Agents

Agent                          Role
Coordinator                    Planning, decomposition, orchestration
Perception (Cls / Det / Seg)   Classification, detection, DeepLab segmentation
Search                         Retrieval and evidence images for VLMs
Reasoning / Matching           Multi-step inference and MCQ alignment
Self-Evaluation                Optional quality check
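The coordinator's plan–execute–evaluate loop can be sketched as follows. This is a hypothetical skeleton: the agent interface (callables taking the question and a shared context dict) and the names `run_pipeline`, `evaluate`, and `max_rounds` are assumptions for illustration, not the repository's actual API.

```python
from typing import Callable

def run_pipeline(question: str,
                 agents: dict[str, Callable[[str, dict], dict]],
                 plan: list[str],
                 evaluate: Callable[[dict], bool],
                 max_rounds: int = 2) -> dict:
    """Plan-execute-evaluate sketch: run planned agents in order,
    merging each agent's output into a shared context, then gate on
    an optional self-evaluation check before returning or retrying."""
    context: dict = {"question": question}
    for _ in range(max_rounds):
        for step in plan:                      # execute agents in planned order
            context.update(agents[step](question, context))
        if evaluate(context):                  # optional self-evaluation gate
            break
    return context
```

In this sketch a failed self-evaluation simply re-runs the plan; the real coordinator additionally handles multi-image context and sequential dispatch.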

Benchmark Results

We evaluate 36+ vision-language models on GeoMMBench under zero-shot conditions. GeoMMAgent achieves strong performance. See the paper for full results.

Benchmark results

GeoMMAgent Architecture

coordinator/
  ├── coordinator.py        ← Dispatch → sequential execution, multi-image context
  └── prompts.py

exec_agents/
  ├── general/              ← Preprocess agents
  ├── perception/           ← ClsAgent, DetAgent, SegAgent (DeepLab)
  ├── knowledge/            ← SearchAgent (+ evidence images)
  ├── reasoning/            ← ReasoningAgent, MatchingAgent
  └── evaluation/           ← SelfEvaluationAgent (optional)

configs/
  └── GeoMMBench.yaml       ← Coordinator & agent settings

toolkit/
  ├── general.py
  ├── classification_toolkit.py   ← YOLO11 classification
  ├── detection_toolkit.py        ← YOLO11 OBB detection
  ├── segmentation_toolkit.py     ← DeepLabV3+ (Xception)
  ├── deeplabv3plus_xception/
  ├── gme_filter.py
  ├── knowledge.py
  ├── reasoning.py
  ├── super_resolution.py
  └── data_loader.py            ← GeoMMBench data loader
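The coordinator and agents are wired together through configs/GeoMMBench.yaml. A sketch of its shape is below; the key names and values are illustrative assumptions, not the shipped configuration:

```yaml
# Illustrative shape only; key names are assumptions, not the actual config.
coordinator:
  model: <multimodal-llm-name>
  max_rounds: 2
agents:
  perception:
    detector: yolo11-obb
    segmenter: deeplabv3plus-xception
  knowledge:
    web_search: true
    gme_filter: true
  evaluation:
    self_check: false
```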

Dataset

from datasets import load_dataset
ds = load_dataset("AR-X/GeoMMBench")

Each sample includes image, question, option texts A–D, and answer (see the dataset card for splits).
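Given letter predictions from a model and the gold answers from the dataset, zero-shot accuracy is a direct letter match. A minimal scoring helper, assuming answers are stored as option letters "A"–"D" (the helper name is illustrative):

```python
def mcq_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice predictions matching the gold letters.

    Both lists hold option letters such as "A".."D"; comparison is
    case-insensitive and whitespace-tolerant.
    """
    if len(predictions) != len(answers):
        raise ValueError("prediction/answer length mismatch")
    if not answers:
        return 0.0
    hits = sum(p.strip().upper() == a.strip().upper()
               for p, a in zip(predictions, answers))
    return hits / len(answers)
```

Usage with the dataset would pair each sample's `answer` field against the model's extracted letter choice.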

BibTeX

@inproceedings{xiao2026geomm,
  title={GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing},
  author={Xiao, Aoran and Cheng, Shihao and Xu, Yonghao and Ren, Yexian and Chen, Hongruixuan and Yokoya, Naoto},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

GeoMMBench dataset: CC BY 4.0 · Code: Apache 2.0