Multimodal Interaction Papers – Max AI Group

This page is excerpted from https://github.com/zhaochen0110/Awesome_Think_With_Images and is for personal study only.

📜 Table of Contents

The Three-Stage Evolution of Thinking with Images

This section provides a conceptual map to navigate the paper list. The following papers are organized according to the primary mechanism they employ, aligning with the three-stage framework from our survey.

Stage 1: Tool-Driven Visual Exploration

In this stage, the model acts as a planner, orchestrating a predefined suite of external visual tools. Intelligence is demonstrated by selecting the right tool for the right sub-task.
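The planner-and-tools pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from any listed paper: the toy image format, the `crop` and `count_bright` tools, and the rule-based `plan` function (a stand-in for the model) are all hypothetical.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities

def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the sub-image of the given size starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def count_bright(image: Image, threshold: int = 128) -> int:
    """Count pixels whose intensity is at least the threshold."""
    return sum(1 for row in image for px in row if px >= threshold)

# The predefined tool suite the planner can choose from (hypothetical tools).
TOOLS: Dict[str, Callable] = {"crop": crop, "count_bright": count_bright}

def plan(question: str) -> List[dict]:
    """Hypothetical stand-in for the model: map a question to tool calls."""
    steps = []
    if "top-left" in question:
        steps.append({"tool": "crop",
                      "args": {"top": 0, "left": 0, "height": 2, "width": 2}})
    steps.append({"tool": "count_bright", "args": {}})
    return steps

def run(question: str, image: Image):
    """Execute the planned tool calls in order, feeding image outputs forward."""
    current = image
    result = None
    for step in plan(question):
        result = TOOLS[step["tool"]](current, **step["args"])
        if isinstance(result, list):  # a tool that returns an image updates the working image
            current = result
    return result

image = [
    [200, 10, 10, 10],
    [220, 240, 10, 10],
    [10, 10, 10, 250],
    [10, 10, 10, 10],
]
answer = run("How many bright pixels are in the top-left corner?", image)
print(answer)  # 3
```

In a real system the plan would come from the model's output rather than keyword rules; the point here is only that the tool set is fixed and the model's job is to select and sequence calls.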

➤ Prompt-Based Approaches

Leveraging in-context learning to guide tool use without parameter updates.

➤ SFT-Based Approaches

Fine-tuning models on data demonstrating how to invoke tools and integrate their outputs.

➤ RL-Based Approaches

Using rewards to train agents to discover optimal tool-use strategies.

💻 Stage 2: Programmatic Visual Manipulation

Here, models evolve into “visual programmers,” generating executable code (e.g., Python) to create custom visual analyses. This unlocks compositional flexibility and interpretability.
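To make the contrast with fixed tool calls concrete, here is a rough sketch of the "visual programmer" idea: the model emits a small program tailored to the question, and a harness executes it. The program text in `GENERATED_PROGRAM`, the `solve` entry-point convention, and the toy image are assumptions made for illustration. Note that `exec` is not a sandbox; a real system would need proper isolation.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities

# Hypothetical model output: a program written for one specific question
# ("Which half of the image is brighter?").
GENERATED_PROGRAM = '''
def solve(image):
    # Compare mean brightness of the left and right halves.
    mid = len(image[0]) // 2
    left = [px for row in image for px in row[:mid]]
    right = [px for row in image for px in row[mid:]]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return "left" if left_mean > right_mean else "right"
'''

def execute_program(source: str, image: Image):
    """Run generated source and call its `solve` function on the image."""
    namespace = {}
    exec(source, namespace)  # illustration only: exec provides no isolation
    return namespace["solve"](image)

image = [
    [200, 190, 10, 20],
    [180, 210, 30, 0],
]
result = execute_program(GENERATED_PROGRAM, image)
print(result)  # left
```

Because the intermediate step is ordinary source code, it can be read and checked directly, which is where the interpretability benefit comes from.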

➤ Prompt-Based Approaches

Guiding models to generate code as a transparent, intermediate reasoning step.

➤ SFT-Based Approaches

Distilling programmatic logic into models or using code to bootstrap high-quality training data.

➤ RL-Based Approaches

Optimizing code generation policies using feedback from execution results.
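One way to picture "feedback from execution results" is a reward function that runs a candidate program and scores the outcome. The sketch below is an assumption-laden illustration: the reward values (1.0, 0.0, -1.0), the `solve` entry-point convention, and the candidate programs are all hypothetical, and no policy update is shown.

```python
def execution_reward(source: str, image, expected) -> float:
    """Score a generated program by executing it (illustrative reward values)."""
    namespace = {}
    try:
        exec(source, namespace)  # illustration only: exec provides no isolation
        output = namespace["solve"](image)
    except Exception:
        return -1.0  # program failed to compile or crashed at run time
    return 1.0 if output == expected else 0.0

image = [[200, 10], [220, 240]]

# Hypothetical candidate programs sampled from a policy.
candidates = [
    "def solve(image):\n    return sum(px >= 128 for row in image for px in row)\n",
    "def solve(image):\n    return len(image)\n",
    "def solve(image):\n    return image[99][0]\n",
]

# The image has three pixels at or above 128, so the expected answer is 3.
rewards = [execution_reward(src, image, expected=3) for src in candidates]
print(rewards)  # [1.0, 0.0, -1.0]
```

An RL method would use scores like these to shift probability toward programs that run and answer correctly; how that update is done differs between approaches and is outside this sketch.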

Stage 3: Intrinsic Visual Imagination

The most advanced stage, where models achieve full cognitive autonomy. They generate new images or visual representations internally as integral steps in a closed-loop thought process.

➤ SFT-Based Approaches

Training on interleaved text-image data to teach models the grammar of multimodal thought.
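As a rough picture of what an interleaved text-image training example might look like, the sketch below represents a reasoning trace as an ordered list of text and image segments and flattens it into a single string with image placeholders. The segment classes, the file names, and the `<image>` placeholder token are assumptions for illustration, not a format taken from any listed paper.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    path: str  # hypothetical file path to an input or intermediate image

Segment = Union[TextSegment, ImageSegment]

def to_training_string(segments: List[Segment], image_token: str = "<image>") -> str:
    """Flatten an interleaved trace, replacing each image with a placeholder token."""
    parts = []
    for seg in segments:
        if isinstance(seg, ImageSegment):
            parts.append(image_token)
        else:
            parts.append(seg.text)
    return " ".join(parts)

trace: List[Segment] = [
    TextSegment("Question: which block is on top?"),
    ImageSegment("input.png"),
    TextSegment("Thought: sketch the stack from the side."),
    ImageSegment("sketch_step1.png"),
    TextSegment("Answer: the red block."),
]
rendered = to_training_string(trace)
print(rendered)
```

The second image sits in the middle of the reasoning rather than only at the input, which is the property that distinguishes interleaved traces from ordinary image-question-answer pairs.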

➤ RL-Based Approaches

Empowering models to discover generative reasoning strategies through trial, error, and reward.

Evaluation & Benchmarks

Essential resources for measuring progress. These benchmarks are specifically designed to test the multi-step, constructive, and simulative reasoning capabilities required for “Thinking with Images”.

Benchmarks for Thinking with Images

About

Resources and paper list for “Thinking with Images for LVLMs”. This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation. Survey: arxiv.org/pdf/2506.23918
