BabyVision-v2 · Benchmark + Platform v0.1.1

Measuring the impact of 3D vision on reasoning.

Executable ground truth — no LLM-as-judge. Tasks span scientific imagery (medical, geographic, biological) and interactive physics, where models discover the world by acting on it, not just viewing it.

executable GT · 3D-grounded · scientific + interactive

Motivation

Why this benchmark exists

Three commitments separate BabyVision-v2 from caption-similarity VQA leaderboards.

Executable GT

Every answer is produced by a deterministic program — geometry, mask argmax, physics solver. We compare model output to that program's output, not to a natural-language gold caption.

Protocol →
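As a minimal sketch of what comparing against a program's output can look like, the snippet below derives a ground-truth label from a toy mask (the label covering the most pixels, in the spirit of "mask argmax") and scores predictions by exact match or numeric tolerance. The function names, the tie-breaking rule, and the 2% tolerance are illustrative assumptions, not the benchmark's actual protocol.

```python
import math
from collections import Counter

def gt_largest_label(mask):
    """Hypothetical GT recipe: the non-zero label covering the most pixels."""
    counts = Counter(v for row in mask for v in row if v != 0)
    # Break ties by the smallest label id so the result is reproducible.
    return min(counts, key=lambda k: (-counts[k], k))

def score_exact(pred, truth):
    """1.0 when the prediction equals the program output, else 0.0."""
    return 1.0 if pred == truth else 0.0

def score_numeric(pred, truth, rel_tol=0.02):
    """1.0 when the prediction is within rel_tol of the program output (assumed tolerance)."""
    return 1.0 if math.isclose(pred, truth, rel_tol=rel_tol) else 0.0

# Toy label mask: 0 is background, 1..3 are object labels.
mask = [
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [3, 2, 2, 2],
]
truth = gt_largest_label(mask)
print(truth, score_exact(2, truth), score_numeric(10.1, 10.0))
```

Because the ground truth is recomputed by code, rerunning the scorer on the same inputs gives the same result each time.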

No LLM-as-judge

Terminal answers are scored only by pixel-level and numeric metrics computed in code. An LLM may assess trace fidelity (step ordering), never the terminal answer.

Protocol →

3D-grounded

True 3D geometry — meshes, point clouds, volumes — is the source of truth. Caption matching against a single 2D view is rejected.

Protocol →

Doc

Platform docs · benchmark protocol · MCP tool reference

Domain 3D Assets

QA Tasks

V2 / V3 / V4 / V6 · Interactive Physics · Spatial Vision

Tasks

Six task families across two demos

Each family stresses a distinct reasoning skill. Every task ships with a deterministic ground-truth recipe; metrics are computed by code, not by an LLM judge.

V2 — Cross-section / topology

2D-view → 3D inference · cannot be shortcut by code · geometric

Family details →
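To illustrate how a cross-section question can have a purely geometric answer, here is a small sketch that slices a binary voxel volume and counts connected regions in the slice. The `[z][y][x]` layout, the 4-connectivity choice, and the helper names are assumptions made for this example; the actual V2 recipe is not described on this page.

```python
from collections import deque

def cross_section(volume, z):
    """Return the 2D slice at height z of a volume indexed as [z][y][x]."""
    return volume[z]

def count_components(slice2d):
    """Count 4-connected foreground regions in a binary 2D grid (BFS flood fill)."""
    h, w = len(slice2d), len(slice2d[0])
    seen = [[False] * w for _ in range(h)]
    regions = 0
    for y in range(h):
        for x in range(w):
            if not slice2d[y][x] or seen[y][x]:
                continue
            regions += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and slice2d[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions

# A tiny U-shaped solid: one bar at z=0, two separate prongs at z=1.
u_shape = [
    [[1, 1, 1]],
    [[1, 0, 1]],
]
print(count_components(cross_section(u_shape, 0)))
print(count_components(cross_section(u_shape, 1)))
```

The answer depends on where the cut is made, so a single external view of the object does not determine it.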

V3 — Segmentation / referring

pixel-level GT · SSIM / LPIPS · defends against LLM-as-judge

Family details →
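For readers unfamiliar with SSIM, the sketch below computes a simplified single-window (global) variant over two grayscale images. Standard SSIM averages the same expression over local windows, and the window size and constants BabyVision-v2 uses are not stated here, so treat `global_ssim` and its defaults as assumptions for illustration only.

```python
def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM over whole grayscale images given as lists of rows."""
    xs = [float(v) for row in a for v in row]
    ys = [float(v) for row in b for v in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("images must be non-empty and the same size")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) * (x - mean_x) for x in xs) / n
    var_y = sum((y - mean_y) * (y - mean_y) for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Conventional stabilizing constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

checker = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
print(global_ssim(checker, checker))
print(global_ssim(checker, inverted))
```

Identical images score 1, and the inverted pattern scores below zero because the two images are anti-correlated.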

V4 — Manipulation / counterfactual

rotate · slice · deform · geometry-grounded GT

Family details →

V6 — Multi-step reasoning

multi-tool trace · scored on answer + trace fidelity

Family details →
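One way to score step ordering in code is to compare the sequence of tool calls with a reference sequence. The sketch below uses a longest-common-subsequence ratio. This is an assumed stand-in for illustration, since this page does not define the trace-fidelity metric, and the tool names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def trace_fidelity(pred_steps, ref_steps):
    """Share of steps that appear in the same relative order in both traces (0..1)."""
    if not pred_steps and not ref_steps:
        return 1.0
    return lcs_length(pred_steps, ref_steps) / max(len(pred_steps), len(ref_steps))

# Hypothetical tool names, used only for this example.
reference = ["load_volume", "slice", "measure"]
predicted = ["load_volume", "measure", "slice"]
print(trace_fidelity(reference, reference))
print(round(trace_fidelity(predicted, reference), 3))
```

Swapping two steps lowers the score without zeroing it, which keeps partially correct traces distinguishable from unrelated ones.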

Interactive Physics

force control · MPM fracture · MPM granular · FEM cloth · P1/P2 — discover by acting on objects

Family details →

Spatial Vision

M3D · IXI brain · distance / angle / topology / volume

Family details →
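Distance, angle, and volume questions of this kind reduce to short geometric computations on 3D coordinates and voxel masks. The following sketch shows one plausible form, assuming points in millimetres and a binary mask with known voxel spacing; the function names and units are illustrative and not taken from the benchmark.

```python
import math

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)

def angle_deg(a, vertex, b):
    """Angle at `vertex` between rays vertex->a and vertex->b, in degrees."""
    u = [ai - vi for ai, vi in zip(a, vertex)]
    v = [bi - vi for bi, vi in zip(b, vertex)]
    cos_t = sum(ui * vi for ui, vi in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def mask_volume_mm3(mask, spacing_mm):
    """Volume of a binary voxel mask: foreground voxel count times voxel volume."""
    voxels = sum(1 for plane in mask for row in plane for v in row if v)
    sx, sy, sz = spacing_mm
    return voxels * sx * sy * sz

cube = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]  # 2 x 2 x 2 voxels, all foreground
print(distance((0, 0, 0), (3, 4, 0)))
print(angle_deg((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(mask_volume_mm3(cube, (1.0, 1.0, 2.0)))
```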

Get started

Two entry points

Pick the path that matches what you're doing.

Co-build a new domain

For partners onboarding a new professional domain.

Start in docs →

Verify on the leaderboard

For model teams submitting baseline runs.

Open leaderboard →

Domain coverage

Eleven domains · two shipped today

The platform supports eleven professional domains. v0.1.1 ships data for two (medical + embodied via the Genesis_VP demo); the other nine arrive as partner teams onboard with the partner SDK.

Compare

How BabyVision-v2 differs

Eight prior benchmarks plus BV2, scored on the capabilities that matter for measuring vision's contribution to reasoning. Rows 1–4 are strengths BV2 inherits from prior work; rows 5–8 are gaps that no single predecessor closes fully and that BV2 closes together.

Legend: ✓ supported · ◐ partial · — not supported

Capability	MMMU-Pro	BLINK	MathVerse	SciVerse	EmbodiedBench	Embodied3DBench	EPIC-Bench	BV1	BV2
Filters text-only shortcuts	✓	✓	✓	✓	—	◐	✓	✓	✓
Fine visual-perception primitives	—	✓	◐	◐	—	✓	◐	✓	✓
Real 3D geometry as input	—	—	—	—	✓	✓	◐	—	✓
Multi-step agent / tool-use trace evaluated	—	—	—	—	◐	—	◐	—	✓
Geometric GT from scene (distance / angle / volume)	—	—	—	—	—	◐	◐	—	✓
Pixel-level GT (SSIM / LPIPS)	—	—	—	—	—	—	—	—	✓
Tool-trace replayed against server	—	—	—	—	—	—	—	—	✓
Profile separation isolates vision (coding / vision / fusion)	—	—	—	—	—	—	—	—	✓

Per-benchmark descriptions, source links, and the broader comparison table live in the full positioning doc →
