This page is excerpted from https://github.com/zhaochen0110/Awesome_Think_With_Images and is for personal study only.



The Three-Stage Evolution of Thinking with Images

This section provides a conceptual map to navigate the paper list. The following papers are organized according to the primary mechanism they employ, aligning with the three-stage framework from our survey.


Stage 1: Tool-Driven Visual Exploration

In this stage, the model acts as a planner, orchestrating a predefined suite of external visual tools. Intelligence is demonstrated by selecting the right tool for the right sub-task.
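The planner idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy grid "image", the two tools (`crop`, `count_bright`), the `TOOLS` registry, and the hard-coded plan standing in for what a model would produce. It is not taken from any paper in the list.

```python
from typing import Callable, Dict, List, Tuple

# A toy "image": a 2D grid of integer pixel values (stand-in for a real image).
Image = List[List[int]]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-grid inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def count_bright(image: Image, threshold: int = 128) -> int:
    """Count pixels at or above the threshold."""
    return sum(1 for row in image for px in row if px >= threshold)


# Hypothetical tool registry: tool name -> callable.
TOOLS: Dict[str, Callable] = {"crop": crop, "count_bright": count_bright}


def run_plan(image: Image, plan: List[Tuple[str, dict]]):
    """Execute a plan step by step, feeding each tool's output into the next tool."""
    state = image
    for tool_name, kwargs in plan:
        if tool_name not in TOOLS:
            raise KeyError(f"unknown tool: {tool_name}")
        state = TOOLS[tool_name](state, **kwargs)
    return state


# In a real system the plan would come from the model; here it is hard-coded.
example_image = [
    [0, 200, 0, 0],
    [0, 255, 130, 0],
    [0, 0, 0, 10],
]
example_plan = [
    ("crop", {"box": (0, 1, 2, 3)}),
    ("count_bright", {"threshold": 128}),
]
result = run_plan(example_image, example_plan)
```

The model's contribution in this stage is the plan itself (which tool, in which order, with which arguments); the tools stay fixed.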

➤ Prompt-Based Approaches

Leveraging in-context learning to guide tool use without parameter updates.

➤ SFT-Based Approaches

Fine-tuning models on data demonstrating how to invoke tools and integrate their outputs.

➤ RL-Based Approaches

Using rewards to train agents to discover optimal tool-use strategies.


💻 Stage 2: Programmatic Visual Manipulation

Here, models evolve into “visual programmers,” generating executable code (e.g., Python) to create custom visual analyses. This unlocks compositional flexibility and interpretability.

➤ Prompt-Based Approaches

Guiding models to generate code as a transparent, intermediate reasoning step.
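A rough sketch of code as an intermediate reasoning step, under stated assumptions: `GENERATED_PROGRAM` is a hard-coded stand-in for model output, `run_visual_program` is a hypothetical helper, and the image is a toy grid of integers. The restricted builtins only keep the example small; they are not a security sandbox.

```python
from typing import Any, List

# A toy "image": a 2D grid of integer pixel values.
Image = List[List[int]]

# Stand-in for code a model might emit; hard-coded here for illustration.
GENERATED_PROGRAM = """
left_half = [row[: len(row) // 2] for row in image]
right_half = [row[len(row) // 2 :] for row in image]
left_sum = sum(sum(row) for row in left_half)
right_sum = sum(sum(row) for row in right_half)
answer = "left" if left_sum > right_sum else "right"
"""


def run_visual_program(source: str, image: Image) -> Any:
    """Run generated code against an image and return its `answer` variable.

    Only `len` and `sum` are exposed as builtins. This is NOT a real sandbox;
    untrusted code needs proper isolation (separate process, container, etc.).
    """
    namespace = {"__builtins__": {"len": len, "sum": sum}, "image": image}
    exec(source, namespace)
    return namespace.get("answer")


# Which half of the image is brighter?
answer = run_visual_program(GENERATED_PROGRAM, [[1, 2, 9, 9], [0, 1, 8, 7]])
```

Because the program is plain text, each intermediate variable (`left_sum`, `right_sum`) can be inspected, which is where the interpretability benefit comes from.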

➤ SFT-Based Approaches

Distilling programmatic logic into models or using code to bootstrap high-quality training data.

➤ RL-Based Approaches

Optimizing code generation policies using feedback from execution results.
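One way execution feedback could be turned into a scalar reward is sketched below. The function name `execution_reward` and the specific reward values (-1.0, 0.0, 1.0) are assumptions chosen for illustration; actual methods define their own reward shaping.

```python
from typing import Any


def execution_reward(source: str, inputs: dict, expected: Any) -> float:
    """Score a generated program from its execution result.

    Assumed reward scheme (illustrative only):
      -1.0  the program raises, fails to parse, or never sets `answer`
       0.0  the program runs but `answer` is wrong
       1.0  the program runs and `answer` equals `expected`
    """
    namespace = dict(inputs)  # copy so the caller's inputs are not mutated
    try:
        exec(source, namespace)
    except Exception:
        return -1.0
    if "answer" not in namespace:
        return -1.0
    return 1.0 if namespace["answer"] == expected else 0.0


# Three candidate programs for the same task, scored by execution.
pixels = {"pixels": [1, 2, 3]}
rewards = [
    execution_reward("answer = sum(pixels)", pixels, 6),
    execution_reward("answer = max(pixels)", pixels, 6),
    execution_reward("answer = pixels[10]", pixels, 6),
]
```

A policy-gradient or similar RL method would then use these scores to favor programs that both run and produce the right answer.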


Stage 3: Intrinsic Visual Imagination

The most advanced stage, where the model no longer depends on external tools or generated programs. It generates new images or visual representations internally as integral steps in a closed-loop thought process.

➤ SFT-Based Approaches

Training on interleaved text-image data to teach models the grammar of multimodal thought.
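To make "interleaved text-image data" concrete, here is a minimal sketch of how one training sample might be represented. The class names, the `<image>` placeholder token, and the file path are all hypothetical; real datasets and tokenizers use their own formats.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    path: str  # hypothetical path to an intermediate image in the reasoning trace


Segment = Union[TextSegment, ImageSegment]


def flatten(segments: List[Segment], image_token: str = "<image>") -> str:
    """Render an interleaved sample as one string, with a placeholder per image."""
    parts = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            parts.append(seg.text)
        else:
            parts.append(image_token)
    return " ".join(parts)


# One reasoning trace that alternates between text and a generated image.
sample = [
    TextSegment("Rotate the shape."),
    ImageSegment("step1.png"),
    TextSegment("It now matches option B."),
]
rendered = flatten(sample)
```

The placeholder marks where image tokens or embeddings would be spliced in, so text and visual steps share one ordered sequence.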

➤ RL-Based Approaches

Empowering models to discover generative reasoning strategies through trial, error, and reward.


Evaluation & Benchmarks

Essential resources for measuring progress. These benchmarks are specifically designed to test the multi-step, constructive, and simulative reasoning capabilities required for “Thinking with Images”.

Benchmarks for Thinking with Images

About

Resources and paper list for “Thinking with Images for LVLMs”. This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation. Survey: arxiv.org/pdf/2506.23918
