Executable ground truth, no LLM-as-judge. Tasks span scientific imagery (medical, geographic, biological) and interactive physics, where models discover the world by acting on it, not just viewing it.
Three commitments separate BabyVision-v2 from caption-similarity VQA leaderboards.
Every answer is produced by a deterministic program — geometry, mask argmax, physics solver. We compare model output to that program's output, not to a natural-language gold caption.
Pixel-level and numeric metrics defend the score. An LLM may evaluate trace fidelity (step ordering) but never the terminal answer.
True 3D geometry (meshes, point clouds, volumes) is the source of truth. Caption matching against a single 2D view is rejected.
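As a rough illustration of what a geometry-backed, executable ground truth could look like, the sketch below computes a distance and an angle from 3D coordinates and then checks a model's numeric answer against the computed value. The function names, the coordinate format, and the 2% relative tolerance are assumptions made for this example; they are not taken from the BabyVision-v2 scoring code.

```python
import math

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)

def angle_deg(a, vertex, b):
    """Angle at `vertex` (in degrees) formed by the rays toward a and b."""
    u = [ai - vi for ai, vi in zip(a, vertex)]
    w = [bi - vi for bi, vi in zip(b, vertex)]
    dot = sum(x * y for x, y in zip(u, w))
    norm = math.hypot(*u) * math.hypot(*w)
    if norm == 0.0:
        raise ValueError("angle is undefined for a zero-length ray")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_t = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_t))

def score_numeric(predicted, truth, rel_tol=0.02):
    """Hypothetical pass/fail rule: the answer must be within rel_tol of the truth."""
    return math.isclose(predicted, truth, rel_tol=rel_tol, abs_tol=1e-9)

# The ground truth comes from the geometry itself, never from a caption.
truth = distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
print(truth, score_numeric(5.05, truth), score_numeric(5.5, truth))
```

Because the reference value is recomputed from the scene on every run, two evaluators with the same scene data should always agree on the score.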
Each family stresses a distinct reasoning skill. Every task ships with a deterministic ground-truth recipe; metrics are computed by code, not by an LLM judge.
2D-view → 3D inference · cannot be shortcut by code · geometric
pixel-level GT · SSIM / LPIPS · defends against LLM-as-judge
rotate · slice · deform · geometry-grounded GT
multi-tool trace · scored on answer + trace fidelity
force control · MPM fracture · MPM granular · FEM cloth · P1/P2 — discover by acting on objects
M3D · IXI brain · distance / angle / topology / volume
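To make the pixel-level scoring idea concrete, here is a minimal sketch of an SSIM-style comparison between a predicted image and a ground-truth image. It is a deliberate simplification: standard SSIM averages the index over local windows, whereas this version uses one set of statistics for the whole image. The function name and the flattened-grayscale input format are assumptions for this example, and LPIPS is omitted because it requires a learned network.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over two equal-length lists of grayscale pixels.

    Simplified variant: real SSIM is computed per local window and averaged.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the original SSIM formulation.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

reference = [0, 255, 0, 255]   # ground-truth render, flattened
same = [0, 255, 0, 255]        # a perfect prediction
inverted = [255, 0, 255, 0]    # a structurally opposite prediction
print(global_ssim(reference, same), global_ssim(reference, inverted))
```

A production pipeline would more likely call an established implementation with windowing; the point here is only that the score is a deterministic function of pixels.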
Pick the path that matches what you're doing.
For partners onboarding a new professional domain: start in the docs.
For model teams submitting baseline runs: open the leaderboard.
The platform supports eleven professional domains. v0.1.1 ships data for two (medical + embodied via the Genesis_VP demo); the other nine arrive as partner teams onboard with the partner SDK.
The same eight benchmarks, alongside BV2, scored on the capabilities that matter for measuring vision's contribution to reasoning. Rows 1–4 are strengths BV2 inherits from prior work; rows 5–8 are gaps that no single predecessor fully closes and that BV2 closes together.
Legend: ✓ supported · ◐ partial · — not supported
| Capability | MMMU-Pro | BLINK | MathVerse | SciVerse | EmbodiedBench | Embodied3DBench | EPIC-Bench | BV1 | BV2 |
|---|---|---|---|---|---|---|---|---|---|
| Filters text-only shortcuts | ✓ | ✓ | ✓ | ✓ | — | ◐ | ✓ | ✓ | ✓ |
| Fine visual-perception primitives | — | ✓ | ◐ | ◐ | — | ✓ | ◐ | ✓ | ✓ |
| Real 3D geometry as input | — | — | — | — | ✓ | ✓ | ◐ | — | ✓ |
| Multi-step agent / tool-use trace evaluated | — | — | — | — | ◐ | — | ◐ | — | ✓ |
| Geometric GT from scene (distance / angle / volume) | — | — | — | — | — | ◐ | ◐ | — | ✓ |
| Pixel-level GT (SSIM / LPIPS) | — | — | — | — | — | — | — | — | ✓ |
| Tool-trace replayed against server | — | — | — | — | — | — | — | — | ✓ |
| Profile separation isolates vision (coding / vision / fusion) | — | — | — | — | — | — | — | — | ✓ |
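The trace rows in the table score the sequence of tool calls separately from the final answer. One plausible way to quantify step ordering in code is a longest-common-subsequence ratio between the predicted and reference tool sequences, sketched below. This is an illustrative assumption rather than BV2's documented trace metric, and the tool names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def trace_order_score(predicted, reference):
    """Fraction of steps that appear in matching order, in [0, 1]."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return lcs_length(predicted, reference) / max(len(predicted), len(reference))

reference = ["load_volume", "slice", "measure"]   # hypothetical tool names
swapped = ["slice", "load_volume", "measure"]     # first two steps out of order
print(trace_order_score(reference, reference), trace_order_score(swapped, reference))
```

Dividing by the longer of the two sequences penalizes both missing steps and padded traces.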
Per-benchmark descriptions, source links, and the broader comparison table live in the full positioning doc.