GeoMMBench & GeoMMAgent: A Multimodal Benchmark and Multi-Agent Framework for GeoScience and Remote Sensing


1 RIKEN AIP     2 Wuhan University     3 Linköping University     4 University of Tokyo

5 Nanjing University of Information Science and Technology

GeoMMBench overview examples

GeoMMBench provides expert-level, image-based multiple-choice questions across remote sensing, photogrammetry, GIS, and GNSS, with diverse sensor modalities and tasks. GeoMMAgent follows a plan–execute–evaluate pipeline with toolkits for general preprocessing, knowledge retrieval, perception (YOLO11 / DeepLabV3+), and multimodal reasoning.

Abstract

We present GeoMMBench, a comprehensive multimodal question-answering benchmark for geoscience and remote sensing, featuring 1,053 expert-level, image-based multiple-choice questions spanning four disciplines (Remote Sensing, Photogrammetry, GIS, GNSS), six sensor modalities (Optical, SAR, Hyperspectral, LiDAR, DEM, Thermal), and diverse tasks such as scene classification, object and change detection, spectral analysis, and spatial reasoning.

We further introduce GeoMMAgent, a multi-agent framework that follows a plan–execute–evaluate paradigm: a coordinator decomposes tasks; specialized agents handle general preprocessing, web and multimodal retrieval (including GME-based filtering), perception with YOLO11 and DeepLabV3+, and reasoning for option alignment. We evaluate 36+ vision-language models on GeoMMBench in zero-shot settings; GeoMMAgent achieves strong performance. The dataset is released on Hugging Face; code and configurations are publicly available.

News

  • 2026/04/09: GeoMMBench & GeoMMAgent was selected as a CVPR 2026 Highlight!
  • 2026/03/23: GeoMMBench & GeoMMAgent was accepted to CVPR 2026.
  • 2026/03/23: GeoMMBench released on Hugging Face (1,053 questions).
  • 2026/03/23: GeoMMAgent code (coordinator, exec_agents, toolkit) open-sourced.

Method / Framework Overview

GeoMMAgent framework

Coordinator orchestrates agents; toolkits cover general utilities, knowledge retrieval, perception, and reasoning.

Toolkits (this repository)

Toolkit      Capability
General      Format conversion, filtering, scaling, optional neural super-resolution, etc.
Knowledge    Web search (multi-engine fallback); optional image search; GME text–image similarity for candidate ranking
Perception   YOLO11 classification & detection; DeepLabV3+ (Xception) segmentation
Reasoning    Multimodal LLM reasoning and in-agent option alignment
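The Knowledge toolkit's GME-based candidate ranking boils down to scoring retrieved images against the query in a shared embedding space. A minimal sketch of that ranking step, assuming the text and image embeddings have already been computed (the `rank_candidates` helper and its signature are illustrative, not the repository's actual API):

```python
import numpy as np

def rank_candidates(query_emb: np.ndarray, cand_embs: np.ndarray,
                    top_k: int = 3) -> list[int]:
    """Rank candidate images by cosine similarity to a query embedding.

    query_emb: (d,) embedding of the question text.
    cand_embs: (n, d) embeddings of retrieved candidate images.
    Returns indices of the top_k most similar candidates, best first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q                        # cosine similarity per candidate
    return np.argsort(-sims)[:top_k].tolist()
```

In the actual pipeline, the embeddings would come from the GME model; only the top-ranked evidence images are forwarded to the VLM.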

Agents

Agent                          Role
Coordinator                    Planning, decomposition, orchestration
Perception (Cls / Det / Seg)   Classification, detection, DeepLab segmentation
Search                         Retrieval and evidence images for VLMs
Reasoning / Matching           Multi-step inference and MCQ alignment
Self-Evaluation                Optional quality check
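The coordinator's plan–execute–evaluate loop can be sketched as follows. This is a hypothetical skeleton: the agent interface (callables taking the question and a shared context dict) and the names `run_pipeline`, `evaluate`, and `max_rounds` are assumptions for illustration, not the repository's actual API.

```python
from typing import Callable

def run_pipeline(question: str,
                 agents: dict[str, Callable[[str, dict], dict]],
                 plan: list[str],
                 evaluate: Callable[[dict], bool],
                 max_rounds: int = 2) -> dict:
    """Plan-execute-evaluate sketch: run planned agents in order,
    merging each agent's output into a shared context, then gate on
    an optional self-evaluation check before returning or retrying."""
    context: dict = {"question": question}
    for _ in range(max_rounds):
        for step in plan:                      # execute agents in planned order
            context.update(agents[step](question, context))
        if evaluate(context):                  # optional self-evaluation gate
            break
    return context
```

In this sketch a failed self-evaluation simply re-runs the plan; the real coordinator additionally handles multi-image context and sequential dispatch.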

Benchmark Results

We evaluate 36+ vision-language models on GeoMMBench under zero-shot conditions. GeoMMAgent achieves strong performance. See the paper for full results.

Benchmark results

GeoMMAgent Architecture

coordinator/
  ├── coordinator.py        ← Dispatch → sequential execution, multi-image context
  └── prompts.py

exec_agents/
  ├── general/              ← Preprocess agents
  ├── perception/           ← ClsAgent, DetAgent, SegAgent (DeepLab)
  ├── knowledge/            ← SearchAgent (+ evidence images)
  ├── reasoning/            ← ReasoningAgent, MatchingAgent
  └── evaluation/           ← SelfEvaluationAgent (optional)

configs/
  └── GeoMMBench.yaml       ← Coordinator & agent settings

toolkit/
  ├── general.py
  ├── classification_toolkit.py   ← YOLO11 classification
  ├── detection_toolkit.py        ← YOLO11 OBB detection
  ├── segmentation_toolkit.py     ← DeepLabV3+ (Xception)
  ├── deeplabv3plus_xception/
  ├── gme_filter.py
  ├── knowledge.py
  ├── reasoning.py
  ├── super_resolution.py
  └── data_loader.py            ← GeoMMBench data loader
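The coordinator and agents are wired together through configs/GeoMMBench.yaml. A sketch of its shape is below; the key names and values are illustrative assumptions, not the shipped configuration:

```yaml
# Illustrative shape only; key names are assumptions, not the actual config.
coordinator:
  model: <multimodal-llm-name>
  max_rounds: 2
agents:
  perception:
    detector: yolo11-obb
    segmenter: deeplabv3plus-xception
  knowledge:
    web_search: true
    gme_filter: true
  evaluation:
    self_check: false
```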

Dataset

from datasets import load_dataset
ds = load_dataset("AR-X/GeoMMBench")

Each sample includes image, question, option texts A–D, and answer (see the dataset card for splits).
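Given letter predictions from a model and the gold answers from the dataset, zero-shot accuracy is a direct letter match. A minimal scoring helper, assuming answers are stored as option letters "A"–"D" (the helper name is illustrative):

```python
def mcq_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice predictions matching the gold letters.

    Both lists hold option letters such as "A".."D"; comparison is
    case-insensitive and whitespace-tolerant.
    """
    if len(predictions) != len(answers):
        raise ValueError("prediction/answer length mismatch")
    if not answers:
        return 0.0
    hits = sum(p.strip().upper() == a.strip().upper()
               for p, a in zip(predictions, answers))
    return hits / len(answers)
```

Usage with the dataset would pair each sample's `answer` field against the model's extracted letter choice.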

BibTeX

@inproceedings{xiao2026geomm,
  title={GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing},
  author={Xiao, Aoran and Cheng, Shihao and Xu, Yonghao and Ren, Yexian and Chen, Hongruixuan and Yokoya, Naoto},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

GeoMMBench dataset: CC BY 4.0 · Code: Apache 2.0