BabyVision-v2
BabyVision-v2 · Benchmark + Platform v0.1.1

Measuring the impact of
3D vision on reasoning.

Executable ground truth — no LLM-as-judge. Tasks span scientific imagery (medical, geographic, biological) and interactive physics, where models discover the world by acting on it, not just viewing it.

executable GT · 3D-grounded · scientific + interactive

Why this benchmark exists

Three commitments separate BabyVision-v2 from caption-similarity VQA leaderboards.

Executable GT

Every answer is produced by a deterministic program — geometry, mask argmax, physics solver. We compare model output to that program's output, not to a natural-language gold caption.

Protocol →
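
To make the idea of executable ground truth concrete, here is a minimal sketch, assuming a distance-style question and a relative tolerance of 5%. The function names, the landmark coordinates, and the tolerance are illustrative assumptions, not the benchmark's actual protocol.

```python
import math


def ground_truth_distance(p, q):
    """Deterministic GT: Euclidean distance between two 3D points."""
    return math.dist(p, q)


def score_numeric(predicted, truth, rel_tol=0.05):
    """Return 1.0 if the prediction is within rel_tol of the GT, else 0.0.

    The 5% default tolerance is an assumption made for this sketch.
    """
    return 1.0 if math.isclose(predicted, truth, rel_tol=rel_tol) else 0.0


# Hypothetical landmark coordinates (e.g. millimetres in a scan volume).
landmark_a = (0.0, 0.0, 0.0)
landmark_b = (3.0, 4.0, 12.0)

truth = ground_truth_distance(landmark_a, landmark_b)
print(truth, score_numeric(13.4, truth), score_numeric(15.0, truth))
```

The point of the design is that the reference value comes from running code on the geometry, so the score never depends on how a gold caption happens to be worded.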

No LLM-as-judge

Pixel-level and numeric metrics defend the score. An LLM may evaluate trace fidelity (step ordering) but never the terminal answer.

Protocol →

3D-grounded

True 3D geometry — meshes, point clouds, volumes — is the source of truth. Caption matching against a single 2D view is rejected.

Protocol →
Doc
Platform docs · benchmark protocol · MCP tool reference
Domain 3D Assets
QA Tasks
V2 / V3 / V4 / V6 · Interactive Physics · Spatial Vision

Six task families across two demos

Each family stresses a distinct reasoning skill. Every task ships with a deterministic ground-truth recipe; metrics are computed by code, not by an LLM judge.

V2 — Cross-section / topology

2D-view → 3D inference · cannot be shortcut by code · geometric

Family details →

V3 — Segmentation / referring

pixel-level GT · SSIM / LPIPS · defends against LLM-as-judge

Family details →
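
As an illustration of a pixel-level metric computed by code, the sketch below implements a simplified, single-window (global) SSIM over flattened grayscale images. It is an assumption-laden stand-in: standard SSIM is computed over local windows and averaged, and nothing here is taken from the benchmark's own scoring code. The pixel values are made up.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over two equal-length grayscale pixel sequences.

    Simplification: statistics are taken over the whole image rather than
    over sliding local windows as in standard SSIM.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilising constants from the usual SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical 2x2 images, flattened row by row.
reference = [0, 64, 128, 255]
reversed_image = [255, 128, 64, 0]

print(global_ssim(reference, reference))
print(global_ssim(reference, reversed_image))
```

An identical prediction scores 1, and structurally different output scores lower, with no judgment call by a language model involved.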

V4 — Manipulation / counterfactual

rotate · slice · deform · geometry-grounded GT

Family details →

V6 — Multi-step reasoning

multi-tool trace · scored on answer + trace fidelity

Family details →
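
One deterministic way to quantify step ordering in a tool trace is the longest common subsequence (LCS) between a reference trace and the model's trace. The sketch below is only an illustration under that assumption; the page does not specify how trace fidelity is computed, and the tool names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trace_fidelity(reference_trace, model_trace):
    """Fraction of steps that appear in the same relative order (0.0 to 1.0)."""
    if not reference_trace and not model_trace:
        return 1.0
    longest = max(len(reference_trace), len(model_trace))
    return lcs_length(reference_trace, model_trace) / longest


# Hypothetical tool names, for illustration only.
reference_trace = ["load_volume", "slice", "segment", "measure"]
model_trace = ["load_volume", "segment", "slice", "measure"]

print(trace_fidelity(reference_trace, model_trace))
```

Swapping two adjacent steps leaves three of the four steps in order, so the score drops without the terminal answer being involved at all.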

Interactive Physics

force control · MPM fracture · MPM granular · FEM cloth · P1/P2 — discover by acting on objects

Family details →

Spatial Vision

M3D · IXI brain · distance / angle / topology / volume

Family details →
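
Angle and volume questions of this kind can be answered directly from geometry. The following sketch, assuming a closed and consistently oriented triangle mesh, computes an angle between two direction vectors and a mesh volume by summing signed tetrahedra. The function names and the unit tetrahedron are illustrative and not taken from the platform.

```python
import math


def angle_degrees(u, v):
    """Angle between two non-zero 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def mesh_volume(vertices, triangles):
    """Volume of a closed, consistently oriented triangle mesh.

    Sums the signed volumes of tetrahedra formed by each face and the origin.
    """
    total = 0.0
    for i, j, k in triangles:
        a, b, c = vertices[i], vertices[j], vertices[k]
        # Scalar triple product a . (b x c)
        total += (
            a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0])
        )
    return abs(total) / 6.0


# Unit right tetrahedron with outward-facing triangles.
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
triangles = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]

print(angle_degrees((1, 0, 0), (0, 1, 0)), mesh_volume(vertices, triangles))
```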

Two entry points

Pick the path that matches what you're doing.

Eleven domains · two shipped today

The platform supports eleven professional domains. v0.1.1 ships data for two (medical + embodied via the Genesis_VP demo); the other nine arrive as partner teams onboard with the partner SDK.

How BabyVision-v2 differs

The same eight benchmarks, scored on the capabilities that matter for measuring vision's contribution to reasoning. Rows 1–4 are strengths BV2 inherits from prior work; rows 5–8 are gaps that no single predecessor closes and that BV2 closes together.

Legend: supported · partial · not supported

Capability MMMU-Pro BLINK MathVerse SciVerse EmbodiedBench Embodied3DBench EPIC-Bench BV1 BV2
Filters text-only shortcuts
Fine visual-perception primitives
Real 3D geometry as input
Multi-step agent / tool-use trace evaluated
Geometric GT from scene (distance / angle / volume)
Pixel-level GT (SSIM / LPIPS)
Tool-trace replayed against server
Profile separation isolates vision (coding / vision / fusion)

Per-benchmark description, source links, and the broader comparison table live in the full positioning doc →