Thinking with Visual Primitives
Ruijie Lu1,2,∗ , Yiyang Ma1,∗ , Xiaokang Chen1,∗, ‡ , Lingxiao Luo1,3,∗ , Zhiyu Wu1,∗ , Zizheng Pan1,∗ ,
Xingchao Liu1,∗ , Yutong Lin1 , Hao Li1 , Wen Liu1 , Zhewen Hao1 , Xi Gao1 , Shaoheng Nie1 ,
Yixuan Wei1 , Zhenda Xie1 , Ting Chen3 , Gang Zeng2
1 DeepSeek-AI
2 Peking University
3 Tsinghua University
∗: Core contributors
‡: Project lead
Abstract
Despite the remarkable progress in Multimodal Large Language Models (MLLMs), the prevailing Chain-of-Thought (CoT) paradigms remain predominantly confined to the linguistic
space. While recent advancements have focus...
1. Introduction
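While recent works [6, 27, 32] have explored integrating bounding boxes into the chain-of-thought process, they primarily treat grounding as a post-hoc verification mechanism to enhance perception-heavy tasks. These approaches are often confined to high-resolution benchmarks where the challenge is “seeing” rather than “reasoning”, and their reliance on labor-intensive supervision further limits scalability. More importantly, they fail to address the Reference Gap in complex structural reasoning, such as topological navigation, where visual markers must function as the intrinsic medium of thought rather than merely as verifiable evidence.
In this work, we propose a paradigm shift: Thinking with Visual Primitives. We no longer treat visual grounding as a secondary task or a final output ... interleaving it directly into the model's reasoning trajectory. This mechanism is inspired by human cognition. When navigating a complex maze or counting dense objects, humans naturally use deictic pointers (such as finger gestures) to reduce cognitive load and maintain logical consistency. ... a synergistic mechanism, thereby anchoring abstract linguistic thought to concrete spatial coordinates. ... designed for long-context multimodal interaction. ... compressing the key-value (KV) cache of every 𝑚² visual tokens into a single entry. ... model architecture and training pipeline. ... our model acquires a foundational ability to generate visual primitives during pretraining. It then enters the post-training stage, which adopts an expert specialization and integration paradigm.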
Through extensive benchmarking, we demonstrate that Thinking with Visual Primitives
delivers a significant leap in reasoning accuracy. Our model achieves competitive performance,
standing on par with or surpassing the latest iterations of GPT, Claude, and Gemini across
a wide spectrum of challenging spatial reasoning and visual QA tasks (see Fig. 1). Our
findings suggest that the future of multimodal intelligence lies not just in seeing more pixels,
but in developing more precise and less ambiguous referential mechanisms that bridge the gap
between language and the visual world.
2. Method
2.1. Overview
This section first introduces the model architecture. Next, we elaborate on the training pipeline,
as illustrated in Fig. 2, and describe the corresponding data used across the pretraining and
post-training phases.
2.2. Architecture
Our model adopts a standard architecture similar to LLaVA [18, 19]. Specifically, input images are
processed by a Vision Transformer (ViT) to extract visual features, which are then concatenated
with language instructions to form an interleaved sequence of vision-language tokens. This
sequence is subsequently fed into the Large Language Model (LLM) to...
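abstract visual referencing, such as tracking motion trajectories or solving topological reasoning
problems.
2.3.2. Motivation for Large-Scale Data Curation
While existing public datasets, such as COCO [17] and Pixmo-Points [4], provide relatively
accurate box or point annotations, they suffer from insufficient scale and a notable lack of
diversity. To ensure the generalizability of our “Thinking with Visual Primitives” paradigm, it is
imperative to curate large-scale web data with rich semantics and high diversity. We prioritize
the extensive scaling of bounding box data for the following reasons:
• Annotation determinacy: A bounding box tightly encloses the target object, so its annotation is relatively unambiguous. A point annotation, in contrast, is highly ambiguous: any coordinate within the object's boundary is a valid reference, so there is no strict ground truth. In extreme cases involving occlusion, a point intended for a background object may land on the foreground occluder, creating significant ambiguity.
• Task generalizability: A model trained to output bounding boxes generalizes readily to point-based formats. Because a box is defined by two points (its top-left and bottom-right corners), it inherently subsumes the point representation.
• Information richness: Boxes support a broader range of downstream tasks than points. A point provides only spatial localization, whereas a box also encodes geometric information such as width and height. ... perform more complex reasoning within the ... framework.
To balance performance and computational cost, we limit the number of visual tokens output by the ViT to between 81 and 384. Images whose resolution falls outside this range are resized while preserving the original aspect ratio.
Raw Data Acquisition. We collect a large volume of internet data related to box grounding through large-scale web crawling across multiple websites. ...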
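The argument that a box subsumes a point, and additionally carries width and height, can be made concrete with a small sketch. The `Box` type and the choice of the box center as the derived point are illustrative assumptions, not the paper's conversion rule; any interior coordinate would be an equally valid reference point.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical box type: top-left (x1, y1) and bottom-right (x2, y2)."""
    x1: int
    y1: int
    x2: int
    y2: int


def box_to_point(box: Box) -> Tuple[float, float]:
    # Assumption: use the center as the reference point.
    return ((box.x1 + box.x2) / 2, (box.y1 + box.y2) / 2)


def box_size(box: Box) -> Tuple[int, int]:
    # Width and height are available from a box but not from a point.
    return (box.x2 - box.x1, box.y2 - box.y1)


bear = Box(452, 23, 804, 411)
print(box_to_point(bear))  # (628.0, 217.0)
print(box_size(bear))      # (352, 388)
```

The reverse direction is not possible: a single point does not determine the extent of the object it refers to.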
data contamination (i.e., data leakage) during model evaluation. Furthermore, we employ an
LLM-based agent to parse the README.md files of these repositories, automatically converting
the diverse dataset structures into our predefined, unified storage format. Following extensive
crawling and deduplication across these websites, we ultimately curate 97,984 box-grounding-related data sources. Manual inspection of the sampled data reveals a highly diverse range
of object categories, spanning from common targets (e.g., people, faces) to domain-specific
entities (e.g., lesion regions in CT scans or ...
For each dataset, we sample three images and prompt the model to calculate a quality score
(ranging from 0 to 10) based on the aforementioned criteria. The model then outputs a definitive
“KEEP” or “DISCARD” decision, accompanied by a clear justification. This review stage retains
43,141 out of the initial 97,984 data sources, which are subsequently advanced to the next
filtering phase.
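As a quick check on the scale of this first filter, the retention rate follows directly from the two counts reported above. The function name is illustrative only.

```python
def retention_rate(kept: int, total: int) -> float:
    """Fraction of data sources that survive a filtering stage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


kept, total = 43_141, 97_984
print(f"{retention_rate(kept, total):.1%}")  # 44.0%
print(total - kept)                          # 54843 sources discarded
```

In other words, this review stage discards a little more than half of the crawled sources before any geometric check is applied.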
Step II: Visual-Geometric Quality Review. We further evaluate the geometric quality and
annotation completeness of the bounding boxes to ensure the model learns precise region-text
alignments. This process sp...
For general multimodal data, we predominantly utilize large-scale web-crawled data rather than
synthetic data generated via model distillation (e.g., synthetic image captions). The raw data undergoes careful curation, and we refrain from utilizing LLMs to rewrite the data content. Regarding the specialized data designed to equip the model with foundational capabilities to output
visual primitives, in addition to the aforementioned web crawling and filtering, we also incorporate several high-quality public datasets, such as [4, 15, 17, 25, 29, 33]. We establish a unified formatting standard for ...
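boxes/points) or programmatically generated, (ii) automatic verifiers (e.g., rule-based checkers)
whenever possible to reduce label noise. We selected representative tasks that benefit from
visual primitive-based reasoning (via boxes or points), and designed our cold-start data across
four key dimensions: counting, spatial reasoning & general visual QA, maze navigation, and
path tracing.
2.4.1. Counting
Multimodal Large Language Models consistently struggle with accurate counting, particularly
in dense scenes. Unlike humans, who typically employ a systematic scanning-and-accumulation
strategy, language-based models often fail to establish precise object correspondences when the number of objects is large. We address this fundamental bottleneck by adopting bounding boxes as visual primitives that provide explicit reference anchors.
Task Decomposition. We divide counting into two categories: coarse-grained and fine-grained. The former counts generic categories (e.g., “dogs”), whereas the latter requires distinguishing objects by specific attributes or spatial constraints (e.g., “white dogs” or “dogs on the left”).
Coarse-Grained Counting. We aggregate data from multiple dense detection datasets, including [2, 9, 14, 22, 28, 29, 35]. To ensure data quality, we apply a filtering pipeline based on three main criteria: avoiding excessive object density, ensuring that boxes are large enough for clear recognition, and maintaining high recall against the ground-truth box annotations. For the filtered samples, we prompt an MLLM to generate the thinking content and a concise final response from the image and its box annotations. The thinking content follows a structured three-step process: (1) intent analysis, in which the model identifies the target category; (2) batch grounding, in which the model uses visual primitives to localize all candidates simultaneously (we find batch grounding more efficient for coarse-grained tasks, since it exploits the model's inherent grounding strength while avoiding repetitive enumeration); and (3) statistical summation based on the visual primitives. To eliminate noise during cold-start training, we apply strict verification to ensure that all box visual primitives in the thinking content align strictly with the metadata coordinates, follow the predefined syntax, and match the final numerical count.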
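The verification applied to coarse-grained counting samples can be sketched as a rule-based check. The following is a minimal illustration, assuming the `[[x1,y1,x2,y2]]` and batched `[[...],[...]]` syntax seen in the paper's examples; the function names are hypothetical, and treating "strict alignment" as exact coordinate equality is an assumption rather than the paper's stated criterion.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# One box "[x1,y1,x2,y2]"; a group is one or more boxes wrapped in "[ ... ]".
ONE_BOX = r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]"
GROUP_RE = re.compile(r"\[" + ONE_BOX + r"(?:\s*,\s*" + ONE_BOX + r")*\]")
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(thinking: str) -> List[Box]:
    """Collect every box primitive in order of appearance."""
    boxes: List[Box] = []
    for group in GROUP_RE.finditer(thinking):
        for m in BOX_RE.finditer(group.group(0)):
            boxes.append(tuple(int(v) for v in m.groups()))
    return boxes


def verify(thinking: str, gt_boxes: List[Box], final_count: int) -> bool:
    """Check syntax, alignment with metadata, and count consistency."""
    boxes = extract_boxes(thinking)
    well_formed = all(x1 < x2 and y1 < y2 for x1, y1, x2, y2 in boxes)
    aligned = sorted(boxes) == sorted(gt_boxes)  # assumption: exact match
    return well_formed and aligned and len(boxes) == final_count


thinking = "I spot a bear[[452,23,804,411]] and two more [[10,20,30,40],[50,60,70,80]]."
gt = [(452, 23, 804, 411), (10, 20, 30, 40), (50, 60, 70, 80)]
print(verify(thinking, gt, 3))  # True
print(verify(thinking, gt, 2))  # False
```

A sample that fails any of the three checks would simply be dropped from the cold-start set under this sketch.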
Fine-Grained Counting. Due to the scarcity of publicly available datasets specifically for
fine-grained counting, we developed a specialized data construction pipeline. (1) Question
Generation: Leveraging the images and scene-graph metadata from GQA [10], we prompt
an MLLM to curate informative fine-grained counting questions. Samples that fail to yield
meaningful questions are discarded. For each valid sample, we record the ground-truth object
IDs, the IDs of excluded negative candidates, and the underlying rationale for the QA pair
construction. (2) Thinking Content Synthesis: Using the imag...
Fine-grained Counting
Thinking with Visual Primitives
[Trigger_Placeholder]
How many bears are on the ground?
Original Image
Image with Visual Primitives
1. **What am I looking for**
The question asks me to count the bears that are on the ground. I need to scan the entire scene, locate all
the bears, and check their positions to ensure I only count those standing on the earth or rocks, excluding
any that might be elevated or climbing.
2. **Evaluating each bear's position**
Starting my scan near the top of the image, I spot a bear[[452,23,804,411]]
clinging vertic...
the hypothesis that the capability to think with visual primitives developed here will naturally
generalize to broader VQA scenarios. Our data curation covers both natural and synthetic
environments.
Data Construction in Natural Scenes. Utilizing the images and scene graphs from GQA [10],
Spatial Reasoning
Thinking with Visual Primitives
[Trigger_Placeholder]
Is there a purple rubber object that has
the same size as the gray metallic object?
Original Image
Image with Visual Primitives
1. **Analyzing the request**
The user is asking to verify if there is a "purple rubber object" that shares t...
and yellow objects, but absolutely no purple ones. Therefore, no purple rubber object of the same size exists
in the scene.
Response
False, there is no purple rubber object that matches the size of the gray metallic object.
Figure 4 | Illustrative cold-start data for spatial reasoning. The model performs intent decomposition and utilizes visual primitives to anchor all pertinent entities, facilitating sophisticated
multi-hop logical inference. Best viewed on screen with zoom.
we prompt an MLLM to design questions centered on spatial relations and object interactions,
along with corresponding t...
if any. Display
\boxed{True} at the
end if there is a path,
else display
\boxed{False}.
Original Image
Image with Visual Primitives
Thinking with Visual Primitives
I'll use a trial-and-error strategy to explore this maze. First locate the starting point: [[309,512]], and the destination:
[[652,644]].
**Start Exploring**:
**Step1**: Reaching [[357,369]] after going upper-right, I face 2 directions to choose from (lower-right, left).lower-right
direction might also be correct, but I'll explore left direction first.
**Step2**: Moving left: ...
**Final Path**: After exploration, the correct route is:
[[310,513],[357,369],[408,510],[457,650],[554,648],[506,507],[603,507],[652,645]]
Successfully reaching the destination: [[652,644]] !
Response
The maze is solvable. The verified path is:
[[310,513],[357,369],[408,510],[457,650],[554,648],[506,507],[603,507],[652,645]]
[\boxed{True}]
Figure 5 | Example of cold-start data for the maze navigation task. The model first identifies
the start and end points, then explores possible paths in a DFS manner. Best viewed on screen
with zoom.
d...
Difficulty Control. The difficulty of maze navigation largely depends on how many visual
reasoning steps the model needs to chain together. We control this by changing the grid size.
As the grid becomes larger, the model has to parse more cells, track connectivity over longer
distances, and deal with more dead ends that require backtracking. Each of these adds to
the overall reasoning complexity. Concretely, easy mazes require the model to chain only a
handful of local connectivity checks, while nightmare-level mazes demand sustained, long-range composition of hundreds of such primitive operati...
primitive to decide which branch continues the target curve. To ensure this primitive is genuinely
tested, we carefully prevent any endpoint from overlapping with or being crossed by an
unrelated line, discarding and regenerating configurations that violate these constraints. We
further include a uniform-style mode in which every line shares the same color and stroke
Path Tracing
[Trigger_Placeholder]
Where does the crown icon connect to? Put the destination icon name in \boxed{}.
Thinking with Visual Primitives
I find the starting point you mentioned, it's located here:
[[176,758]...
all randomized to prevent superficial pattern matching.
Thinking Content Synthesis. We explicitly represent the path-tracing process as a sequence
of coordinates sampled along the target curve, which reflects how the model attends to and
follows the path across the image. The process starts by locating the queried start point, then
follows the curve through a series of intermediate waypoints, and finally identifies the endpoint
reached. Importantly, the density of these waypoints adapts to the local geometry of the curve.
Straightforward segments are represented with fewer points, while highly...
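One way to obtain waypoints whose density follows the local geometry is polyline simplification. The sketch below uses the Ramer-Douglas-Peucker algorithm, which is a stand-in chosen for illustration and not the procedure the paper describes; the `tolerance` parameter is hypothetical. Straight runs collapse to their endpoints, while bends retain their corner points.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def _distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points: List[Point], tolerance: float) -> List[Point]:
    """Ramer-Douglas-Peucker: keep only points that deviate from a chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_segment(points[i], first, last)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= tolerance:
        return [first, last]
    left = simplify(points[: idx + 1], tolerance)
    right = simplify(points[idx:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


straight = [(0, 0), (1, 0), (2, 0), (3, 0)]
corner = [(0, 0), (5, 0), (10, 0), (10, 5), (10, 10)]
print(simplify(straight, 0.5))  # [(0, 0), (3, 0)]
print(simplify(corner, 0.5))    # [(0, 0), (10, 0), (10, 10)]
```

A smaller tolerance keeps more waypoints on curved sections and leaves straight sections unchanged, which matches the qualitative behavior described above.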
for each task from three perspectives: format constraints, quality constraints, and accuracy
constraints. The first two constraints are shared across different tasks, while the final accuracy
constraint requires specific designs tailored to the task type.
Format RM. This RM evaluates the output based on rules to generate a reward score ranging
from 0 to 1. Specifically, it verifies whether the representation format of the visual primitives
generated by the model is correct. For thinking with grounding, this RM additionally checks for
redundancy in the model’s output, such as generating duplica...
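A rule-based format check of this kind might look like the following sketch. The scoring rule is entirely an assumption made for illustration (zero for any malformed primitive, otherwise the fraction of unique boxes); the paper does not give its exact formula in this excerpt, and `format_reward` is a hypothetical name.

```python
import re

BOX = r"\[\d+,\d+,\d+,\d+\]"
SPAN_RE = re.compile(r"\[\[.*?\]\]")                       # any "[[...]]" span
VALID_SPAN_RE = re.compile(r"\[" + BOX + r"(?:," + BOX + r")*\]")
BOX_RE = re.compile(r"\[(\d+),(\d+),(\d+),(\d+)\]")


def format_reward(text: str) -> float:
    """Hypothetical score in [0, 1]: format validity, then a duplicate penalty."""
    spans = SPAN_RE.findall(text)
    if not spans:
        return 0.0
    boxes = []
    for span in spans:
        if not VALID_SPAN_RE.fullmatch(span):
            return 0.0  # malformed primitive
        boxes.extend(BOX_RE.findall(span))
    return len(set(boxes)) / len(boxes)  # assumption: penalize repeated boxes


print(format_reward("a dog[[1,2,3,4]] and a cat[[5,6,7,8]]"))  # 1.0
print(format_reward("a dog[[1,2,3,4]] and again[[1,2,3,4]]"))  # 0.5
print(format_reward("a dog[[1,2,3]]"))                         # 0.0
```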
scale and decay rate, respectively. In practice, we set 𝛼 = 0.7 and 𝛽 = 3, which are empirically
chosen to provide stable and smooth learning signals.
Accuracy RM for Spatial Reasoning and General VQA. For these tasks, we design an LLM-based GRM. We feed the model’s thinking content, its final response, the user query, and the
ground-truth answer into the GRM to independently evaluate and score the thinking process
and the response. The final reward is calculated as the average of the two scores.
Accuracy RM for Maze Navigation. To encourage the model to explore the maze, we design a
rule-based...
wall violations) and that the path forms a continuous route from start to end. This item is
a binary score for solvable mazes. For unsolvable mazes, this item is kept as 1.
• Answer correctness. A binary score for whether the model’s solvability judgment matches
the ground truth.
This decomposition ensures that the reward signal is dense and informative: the model receives
credit for each correctly applied visual primitive rather than only for the final binary answer.
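The two reward items listed above can be sketched on a simplified maze. The representation here is an assumption: the maze is a dictionary from each cell to the set of cells reachable without crossing a wall, whereas the paper's mazes use image coordinates. How the items are weighted into a final reward is not shown in this excerpt, so the sketch leaves them separate.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]
Maze = Dict[Cell, Set[Cell]]  # cell -> neighbors reachable without a wall


def path_validity(path: List[Cell], maze: Maze, start: Cell, end: Cell,
                  solvable: bool) -> int:
    """Binary item: continuous, wall-respecting route from start to end."""
    if not solvable:
        return 1  # the text keeps this item at 1 for unsolvable mazes
    if not path or path[0] != start or path[-1] != end:
        return 0
    for a, b in zip(path, path[1:]):
        if b not in maze.get(a, set()):
            return 0  # wall violation or discontinuity
    return 1


def answer_correctness(predicted_solvable: bool, solvable: bool) -> int:
    """Binary item: solvability judgment matches the ground truth."""
    return int(predicted_solvable == solvable)


maze: Maze = {
    (0, 0): {(0, 1)},
    (0, 1): {(0, 0), (1, 1)},
    (1, 1): {(0, 1)},
    (1, 0): set(),  # walled-off cell
}
print(path_validity([(0, 0), (0, 1), (1, 1)], maze, (0, 0), (1, 1), True))  # 1
print(path_validity([(0, 0), (1, 0), (1, 1)], maze, (0, 0), (1, 1), True))  # 0
print(answer_correctness(True, True))                                       # 1
```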
Accuracy RM for Path Tracing. To make the model follow the lines while tracing, we propose
a rule-based RM to judge the generated...
cold-start model (FTwG or FTwP) to perform rollouts over the data pool, generating 𝑁 rollouts for
each sample. Subsequently, based on the RM scores, we count the number of correct responses
among the 𝑁 rollouts for each sample and categorize the data pool into three difficulty levels:
• Easy-Level: All 𝑁 rollouts are correct.
• Normal-Level: The number of correct rollouts 𝑘 satisfies 1 ≤ 𝑘
Kullback-Leibler (KL) divergence loss, and 𝜋𝜃 denotes the student model. We adopt full-vocabulary logit distillation for our OPD implementation. In practice, we utilize two teacher
models, ETwG and ETwP.
3.1. Implementation Details
Our model is trained and evaluated using HAI-LLM [7], a lightweight and efficient distributed
training framework built upon PyTorch. During the pre-training stage, we employ a sequence
length of 64K and FP8 precision; in the post-training stage, the sequence length is extended to
256K. To maximize the performance of the domain experts, we utilize FP8 precision during the
Specialized SFT and Specialized RL phases, and subsequently apply FP4 (MXFP4) quantization
in the Unified RFT and OPD phases.
3.2. Evaluation Setup
Our evaluation framework integrates widely adopted...
64.9/74.1
Topological Reasoning
DS_Maze_Navigation (ACC): 49.4, 50.6, 48.9, 49.8, 49.6, 66.9
DS_Path_Tracing (ACC): 41.4, 46.5, 30.6, 33.9, 24.5, 56.7
• Fine-grained Counting: Existing fine-grained counting benchmarks, such as TallyQA [1],
often suffer from annotation errors and ambiguities, rendering them suboptimal for
rigorously evaluating a model’s fine-grained counting capabilities. To address this, we
introduce the DS_Finegrained_Counting evaluation set. Specifically, we prompt an
MLLM to generate counting queries constrained by specific attributes or spatial locations,
deliberately ensuring the pre...
3. Experiments
multimodal large language models.
3.4. Qualitative Results
3.4.1. Boxes as Visual Primitives
As shown in Figs. 7 to 9, our model demonstrates strong performance on coarse-grained and
fine-grained counting tasks through thinking with grounding, while also exhibiting emergent
capability synergies. For instance, the model is able to integrate world knowledge for visual
question answering, perform counterfactual reasoning, and provide actionable suggestions with
spatial coordinates tailored to users’ everyday needs. Although our post-training data about
visual primitives does not include any Chinese corpus, the model can still think and respond in Chinese, owing to the multilingual capability inherited from the base model.
As shown in Fig. 10, our model demonstrates topological reasoning through thinking with pointing, generating step-by-step exploration trajectories for mazes and sequential tracking trajectories for path tracing. On in-domain instances, the model is able to identify and follow paths, a capability reinforced during specialized RL by mitigating cold-start data issues and applying the reward mechanism.
Despite these promising results, our current work has certain limitations. First, constrained
by input resolution, the model’s performance in fine-grained scenarios remains sub-optimal,
leading to occasionally imprecise outputs of visual primitives. This could potentially be addressed by integrating our framework with existing methods targeting the "Perception Gap" to
achieve complementary benefits. Second, the current “thinking with visual primitives” capability relies on explicit trigger words for activation. In the future, we aim to enable the model to
autonomously determine whether to invo...
5. Conclusion
To address the inherent “Reference Gap” in Multimodal Large Language Models (MLLMs)
during complex reasoning, we introduce “Thinking with Visual Primitives”, a novel reasoning
framework. Moving beyond the conventional reliance on simply increasing perceptual resolution, we elevate spatial markers—such as points and bounding boxes—to "minimal units
of thought" and interleave them directly into the model’s thinking process. This mechanism
endows the model with the ability to “point while it reasons”, precisely anchoring abstract
linguistic concepts onto physical image coordinates. Furthermore, l...
appear in the first row (second and fourth columns), the second row (first and third columns),
and the third row (second and fourth columns). They are mostly a light tan or brownish-white
color, with large dark eyes and large ears.
3. **Confirming the final count**
Looking at the identified dogs, I have counted 6 Chihuahuas in total.
Response
There are 6 Chihuahuas in the picture.
Image with Visual Primitives
Figure 7 | Showcases of thinking with grounding. Examples include fine-grained counting and
counter-commonsense visual question answering. Best viewed on screen with zoom.
[Trigger_Pla...
788,320,882,601],[824,470,919,889],[828,370,930,625],[856,529,989,891]]. This includes people in
the very back standing tall, those standing in the middle slightly hunched forward to be seen, and the
individuals seated on chairs in the very front.
3. **Final count confirmation**
Counting the distinct individuals I just located across the entire group, I find 29 people in total.
Response
There are 29 people in the image.
Image with Visual Primitives
Figure 9 | Showcases of thinking with grounding. Examples include humor comprehension in
images, escape room game guidance, and counting. B...
passage characteristics are more like a main path.
**Step8**: Reaching [[532,720]], I face 2 forks (clockwise, outward).The outward direction is also passable, but from the maze
layout, clockwise direction's path looks wider.
**Step9**: At [[471,722]], I notice 2 directions to choose from (clockwise, inward).Although inward direction is also walkable,
after observing path structure, I judge clockwise direction more likely leads to exit.
**Step10**: Reaching [[415,683]], I face 2 forks (clockwise, outward).Although outward direction is also walkable, after observing path structure, I judge clockwise direction more likely leads to exit.
**Step11**: Reaching [[371,637]], I face 2 forks (inward, outward).Compared with outward direction, inward direction's passage characteristics are more like a main path.
**Step12**: At [[378,567]], I notice 3 directions to choose from (clockwise, inward, outward).The inward and outward directions are also passable, but from the maze layout, clockwise direction's path looks wider.
**Step13**: At [[396,467]], I notice 2 directions to choose from (outward, outward).After ruling out dead ends, outward direction is the best choice.
**Step14**: Reaching [[368,397]], I face 2 forks (clockwise, outward).The outward direction is also passable, but from the maze layout, clockwise direction's path looks wider.
**Step15**: At [[330,355]], I notice 2 directions to choose from (outward, outward).After ruling out dead ends, clockwise direction is the best choice. ...
[Trigger_Placeholder]
What number does C connect to? Show the path and output
\boxed{endpoint number}.
Thinking with Visual Primitives
I find the starting point you mentioned, it's located here:
[[195,594]].
Following this beige line, the visual path I observe is:
[[195,594],[319,617],[338,626],[342,588],[375,547],[438,522],[508,512],[565,505],[604,490],[609,464],[595,427],[579,391],[580,364],[606,358],[661,372],[720,408],[765,442],[803,458]]
Following this path, it connects to: [[803,457]]
Response
Based on the visual trace, the endpoint is 5. [\boxed{5}]
Figure 10 | Showcases of thinking with pointing. Examples include maze navigation and path tracing. Best viewed on screen with zoom.
[1] M. Acharya, K. Kafle, and C. Kanan. TallyQA: Answering complex counting questions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8076–8084, 2019.
[4] M. Deitke, et al. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025.
[5] M. Du, et al. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355, 2024.
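[9] M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu. Drone-based object counting by spatially regularized regional proposal network. In Proceedings of the IEEE International Conference on Computer Vision, pages 4145–4153, 2017.
[10] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[11] M. Jia, Z. Qi, S. Zhang, W. Zhang, X. Yu, J. He, H. Wang, and L. Yi. OmniSpatial: Towards comprehensive spatial reasoning benchmark...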