MangaVQA and MangaLMM
A Benchmark and Specialized Model for Multimodal Manga Understanding

The University of Tokyo
*: Equal Contribution

Overview of MangaVQA and MangaLMM. We present MangaVQA, a newly proposed benchmark for multimodal context understanding, consisting of 526 manually constructed question–answer pairs. We also develop MangaLMM, a manga-specialized model jointly trained to handle both MangaOCR and MangaVQA tasks.

Abstract

Why should LMMs understand Manga?

Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways, featuring intricate panel layouts, expressive visual elements, and text embedded directly within images. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories: a model that reads and understands manga the way humans do could assist them like a skilled editor. This calls for evaluating how well models process visual-textual content and follow the narrative context in a coherent, human-like manner.

1. We introduce MangaVQA and MangaOCR

We present MangaVQA, a benchmark of 526 manually constructed question–answer pairs designed to evaluate an LMM's ability to accurately answer targeted, factual questions grounded in both visual and textual context. The questions are categorized along four key axes: required information (panel vs. page level), answer type (exact extraction vs. descriptive answering), 5W1H question types, and author familiarity.

We also present MangaOCR, focusing on in-page text detection and recognition. MangaOCR consolidates annotations from the Manga109 dataset and the manga onomatopoeia dataset, containing approximately 209K narrative text instances including dialogue and onomatopoeia. Together, these benchmarks enable comprehensive evaluation of multimodal manga understanding.

2. We develop MangaLMM

MangaLMM is a manga-specialized version of Qwen2.5-VL, finetuned to jointly address both VQA and OCR tasks. We construct training data by using the MangaOCR annotations directly for OCR training, and by generating 39,837 synthetic VQA samples using GPT-4o with OCR annotations as guidance. This joint training enables MangaLMM to perform human-like manga understanding by combining text recognition with contextual comprehension.
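
As a rough sketch of how these two supervision sources can be combined, the snippet below mixes OCR and VQA samples into a single chat-style training list. The message format and field names are illustrative assumptions, not the exact format used to train MangaLMM.

import random

def to_chat_sample(image_path, user_prompt, assistant_target):
    """Wrap one supervision pair in a generic chat-style format (assumed for illustration)."""
    return {
        "image": image_path,
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": assistant_target},
        ],
    }

def build_joint_training_set(ocr_samples, vqa_samples, seed=0):
    """Concatenate OCR and VQA samples and shuffle, so each batch mixes both tasks.

    ocr_samples: iterable of (image_path, target_string) pairs, where the target
                 lists every text region on the page.
    vqa_samples: iterable of (image_path, question, answer) triples.
    """
    data = [to_chat_sample(img, "Detect and read all text on this manga page.", tgt)
            for img, tgt in ocr_samples]
    data += [to_chat_sample(img, q, a) for img, q, a in vqa_samples]
    random.Random(seed).shuffle(data)
    return data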

3. We show the superiority of MangaLMM over State-of-the-Art LMMs

We perform extensive analysis and evaluate MangaLMM against proprietary models such as GPT-4o, Gemini 2.5, and Claude Sonnet 4.5, as well as various open-source LMMs. Our results reveal that even state-of-the-art proprietary models achieve near-zero OCR scores on manga, highlighting the limitations of general-purpose LMMs in stylized visual domains. In contrast, MangaLMM achieves 71.5% OCR Hmean and a competitive 6.68 VQA score, demonstrating promising performance on both tasks.

Manga109 Dataset and MangaOCR Benchmark

Manga109: A Widely Used Dataset for Manga Research

We selected Manga109 for its open-access license, diverse manga titles, and rich annotations. Manga109 is a dataset composed of 109 volumes of Japanese comics (manga), capturing many distinctive features:

  • Predominantly black-and-white artwork
  • Two-page spreads with right-to-left reading order
  • Vertical text layout
  • Frequent use of stylized onomatopoeia (e.g., Boom, Bang)
  • Culturally specific dialogue with honorifics and idiomatic expressions

Illustration of a two-page spread from the Manga109 dataset.

MangaOCR: A Consolidated Dataset for Manga Text Recognition

Text in manga carries essential narrative information, appearing as speech balloons and stylized onomatopoeia integrated into the artwork. MangaOCR addresses this challenge by targeting two key categories of embedded text:

  • Dialogue: ~148K instances
  • Onomatopoeia: ~61K instances

We construct the MangaOCR dataset by consolidating existing annotations from the Manga109 dataset and the manga onomatopoeia dataset. It contains approximately 209K narrative text instances, spanning a wide variety of visual styles and layouts.

The MangaOCR task is performed on two-page spreads and consists of two sub-tasks:

  • Text Detection: Localizes textual regions in the image
  • Text Recognition: Reads the localized text
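
For reference, here is a minimal sketch of how an end-to-end score such as the Hmean reported later could be computed. It assumes a prediction counts as correct when its box overlaps a ground-truth box with IoU ≥ 0.5 and the recognized text matches exactly; the benchmark's actual matching protocol may differ.

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def ocr_hmean(preds, gts, iou_thr=0.5):
    """preds / gts: lists of (box, text) pairs. Greedy one-to-one matching."""
    matched_gt = set()
    tp = 0
    for p_box, p_text in preds:
        for i, (g_box, g_text) in enumerate(gts):
            if i in matched_gt:
                continue
            if iou(p_box, g_box) >= iou_thr and p_text == g_text:
                matched_gt.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0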

Author-Aware Dataset Split

We adopt a dataset split protocol based on author information to evaluate different types of generalization:

  • Intra-series: 5 test volumes from the same series as volumes in the training set (different volumes)
  • Intra-author: 5 test volumes by authors who have other works in the training set
  • Unseen Author: 3 test volumes by authors who do not appear in the training set
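
As an illustration, the assignment of test volumes to these three buckets could be implemented from per-volume metadata as sketched below. The metadata schema shown is a hypothetical example; the actual split follows the Manga109 annotations.

from collections import defaultdict

def author_aware_split(test_volumes, train_series, train_authors):
    """Assign each test volume to a generalization bucket.

    test_volumes: list of dicts like {"volume": "...", "series": "...", "author": "..."}
                  (hypothetical metadata schema for illustration).
    train_series: set of series titles that appear in the training set.
    train_authors: set of authors with at least one training volume.
    """
    buckets = defaultdict(list)
    for v in test_volumes:
        if v["series"] in train_series:
            buckets["intra-series"].append(v["volume"])
        elif v["author"] in train_authors:
            buckets["intra-author"].append(v["volume"])
        else:
            buckets["unseen-author"].append(v["volume"])
    return buckets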

Dataset Statistics

Count Type       Total     Train     Valid    Test

Comic volumes    109       89        7        13
Images           10,602    8,763     673      1,166

MangaOCR
Dialogue         148K      120K      9K       18K
Onomatopoeia     61K       50K       4K       7K
Total            209K      170K      13K      26K

MangaVQA
QA pairs         40,363    39,837    –        526

MangaVQA Benchmark

Question Types and Categories

To evaluate model performance under realistic conditions, we manually constructed 526 question–answer (QA) pairs based on images from Manga109. Five annotators carefully developed a high-quality evaluation set, incorporating thorough human inspection and verification.

The question types are designed based on four key axes:

  • Required Information: Whether solving the question requires information from a single panel or multiple panels at the page level
  • Answer Type: Exact Extraction (word-level answers) vs. Descriptive Answering (sentence-level or explanatory answers)
  • 5W1H: Who, What, When, Where, Why, How
  • Author Type: Seen Title, Seen Author (Different Title), Unseen Author
MangaVQA Distributions
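
For concreteness, a single evaluation item tagged along these four axes might be represented roughly as follows; the field names and values are illustrative assumptions rather than the released schema.

# Hypothetical MangaVQA record illustrating the four categorization axes.
qa_example = {
    "image": "Manga109/SomeTitle/010.jpg",   # hypothetical path
    "question": "What does the protagonist shout in the last panel?",
    "answer": "...",                          # word(s) taken from the page
    "required_information": "panel",          # "panel" or "page"
    "answer_type": "exact_extraction",        # or "descriptive_answering"
    "w1h": "what",                            # who / what / when / where / why / how
    "author_type": "unseen_author",           # seen_title / seen_author / unseen_author
}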

Answer Type Examples: Exact Extraction vs. Descriptive Answering

(1) Exact Extraction (240 questions): Questions whose answers are specific words that must be accurately retrieved from the manga page. This category assesses the LMM's basic comprehension ability to identify and extract the correct answer span from the manga panels.

(2) Descriptive Answering (286 questions): Questions that require contextual or explanatory responses, going beyond simple word extraction to comprehension of the surrounding narrative. This category evaluates whether the LMM can not only recognize the dialogue but also understand its underlying meaning within the context of the story.

Answer Type Examples

Training Data Construction

OCR Training Set (T_OCR)

For the OCR task, we use the MangaOCR training set. For each image, we format the sequence of text annotations as:

{"bbox_2d": [xtop_left, ytop_left, xbottom_right, ybottom_right], "text_content": "text"}

Synthetic VQA Training Set (T_VQA)

For the VQA task, we generate synthetic training data using GPT-4o (gpt-4o-2024-11-20). Following the synthetic data construction approach used in LLaVA, we generate five questions per image using both the image and its OCR annotation from T_OCR.

As a result, we created a total of 39,837 synthetic VQA samples from 8,379 images.
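
A rough sketch of this generation step with the OpenAI Python client is shown below; the prompt wording, output format, and parsing are assumptions, and the paper's actual prompt may differ.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_vqa(image_path, ocr_text):
    """Ask GPT-4o for five question-answer pairs grounded in the page and its OCR text."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "You are given a manga page and its OCR transcription.\n"
        f"OCR transcription:\n{ocr_text}\n\n"
        "Write five question-answer pairs about this page, one per line, "
        "formatted as 'Q: ... | A: ...'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content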

Experimental Results

The following table shows the comparison of LMMs on MangaOCR and MangaVQA.

Overall Results

Method                    MangaOCR Hmean (%)    MangaVQA LLM score (/10.0)

Proprietary Models
GPT-4o                    0.0                   6.00
Gemini 2.5 Flash          0.0                   7.26
Claude Sonnet 4.5         0.0                   5.84

Open-source Models
Phi-4-Multimodal-5.6B     0.0                   3.39
Pangea-7B                 0.0                   3.23
LLaVA-OV-1.5-8B           0.0                   3.46
Sarashina2-Vision-8B      0.0                   4.45
Gemma-3-12B               0.0                   4.13
Heron-NVILA-Lite-15B      0.0                   3.76
Qwen2.5-VL-7B             0.9                   5.65

Ours
MangaLMM (Ours)           71.5                  6.68


F1. MangaLMM achieves strong performance on both tasks.

MangaLMM handles both tasks effectively: it achieves an OCR Hmean above 70% and competitive VQA performance. On VQA, it falls short of Gemini 2.5 Flash but outperforms the other proprietary models, GPT-4o and Claude Sonnet 4.5, and it achieves the best performance among open-source models by a clear margin.

F2. All LMMs except MangaLMM show near-zero OCR scores.

As shown in the results table, all LMMs except MangaLMM show near-zero scores on the MangaOCR benchmark. Most of their predictions consist of meaningless repetitions or short repeated tokens. The extremely low OCR score before finetuning is likely due to two main factors: (1) these models are not familiar with manga data, and (2) their weak detection capabilities may limit OCR performance.

F3. Proprietary models manage to answer VQA questions even with near-zero OCR scores.

Despite their near-zero OCR scores, which reflect failures not only in localizing text but also in generating the correct text content, these models still manage to answer certain VQA questions that require interpreting text within the image. This suggests that they can extract the information needed to answer VQA questions even without performing OCR correctly.

Category-wise Analysis

Performance Breakdown by Category

We observe consistent performance gains across all tags, indicating that our training contributes to stable improvement in VQA capability without favoring specific categories. Interestingly, the model also generalizes well to questions from unseen authors.


Qualitative Analysis

MangaVQA Examples

We compare outputs of the original Qwen model and our trained MangaLMM. In the left and middle examples, performance improves markedly after training: the original model gives generic or irrelevant answers, whereas the trained model leverages the speech-bubble text to produce more specific and correct ones.

In the right example, the trained model still struggles to produce an accurate answer due to character-level recognition errors, highlighting remaining challenges in manga understanding.


BibTeX

@inproceedings{baek2025mangavqa,
  author    = {Baek, Jeonghun and Egashira, Kazuki and Onohara, Shota and Miyai, Atsuyuki and Imajuku, Yuki and Ikuta, Hikaru and Aizawa, Kiyoharu},
  title     = {MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding},
  booktitle = {Findings of the Association for Computational Linguistics: EACL 2026},
  year      = {2026},
}