Building in public

Evaluate AI agents by what they create

Go beyond traces and spans. Focus on end results, whether that's a 3D model, a working app, or a dataset.

Real-world evaluations

Test what matters

Build Blender agents that generate wishbone chairs from text descriptions

Visual similarity score

Evaluate 3D models by comparing renders against reference images

Reference

Agent Output
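As a rough illustration of comparing a render against a reference image, here is a minimal sketch. It assumes both images are already loaded as equally sized grids of grayscale values in [0, 255]; the function name `visual_similarity` and the mean-absolute-difference metric are illustrative assumptions, not the scoring method used by this product. A real pipeline would more likely use a perceptual metric such as SSIM or an embedding distance.

```python
from typing import Sequence

# Assumption: an image is a grid of grayscale pixel values in [0, 255].
Image = Sequence[Sequence[float]]


def visual_similarity(reference: Image, output: Image) -> float:
    """Return a score in [0, 1]; 1.0 means the two renders are pixel-identical.

    Hypothetical metric: one minus the normalized mean absolute pixel difference.
    """
    same_shape = len(reference) == len(output) and all(
        len(ref_row) == len(out_row) for ref_row, out_row in zip(reference, output)
    )
    if not same_shape:
        raise ValueError("renders must have the same dimensions")

    total_diff = 0.0
    pixel_count = 0
    for ref_row, out_row in zip(reference, output):
        for ref_px, out_px in zip(ref_row, out_row):
            total_diff += abs(ref_px - out_px)
            pixel_count += 1

    if pixel_count == 0:
        raise ValueError("renders must not be empty")

    # 255 is the largest possible per-pixel difference.
    return 1.0 - total_diff / (255.0 * pixel_count)
```

An evaluation could then pass or fail an agent run by comparing this score against a threshold chosen for the task.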

Generate apartment layouts with proper BIM data and spatial constraints

Code quality + spatial accuracy

Test both code compilation and architectural validity

Generated Floor Plan
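The two checks named above can be sketched roughly as follows, under simplifying assumptions: the agent's output is a Python script (checked with the built-in `compile()`, which only catches syntax errors), and a layout is a list of axis-aligned rectangular rooms on a rectangular plot. The `Room` type and the `compiles` and `layout_errors` helpers are hypothetical names for illustration; real BIM validation would cover far more than overlap and bounds.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Room:
    """Hypothetical axis-aligned room; (x, y) is the lower-left corner."""
    name: str
    x: float
    y: float
    width: float
    height: float


def compiles(source: str) -> bool:
    """Check that agent-generated Python source parses without syntax errors."""
    try:
        compile(source, "<agent_output>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def _overlaps(a: Room, b: Room) -> bool:
    # Strict inequalities: rooms that only share a wall do not overlap.
    return (
        a.x < b.x + b.width and b.x < a.x + a.width
        and a.y < b.y + b.height and b.y < a.y + a.height
    )


def layout_errors(rooms: Sequence[Room], plot_width: float, plot_height: float) -> List[str]:
    """Return a list of spatial-constraint violations; empty means valid."""
    errors: List[str] = []
    for room in rooms:
        if room.width <= 0 or room.height <= 0:
            errors.append(f"{room.name}: non-positive size")
        if (room.x < 0 or room.y < 0
                or room.x + room.width > plot_width
                or room.y + room.height > plot_height):
            errors.append(f"{room.name}: outside plot")
    for i, first in enumerate(rooms):
        for second in rooms[i + 1:]:
            if _overlaps(first, second):
                errors.append(f"{first.name} overlaps {second.name}")
    return errors
```

A run would count as passing only when `compiles(...)` is true and `layout_errors(...)` returns an empty list.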

Transform SketchUp files into photorealistic architectural renders with Blender

Visual quality + lighting accuracy

Test camera positioning, lighting setup, and render output quality

Reference

Agent Output
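One way a camera-positioning check might look, as a sketch only: compare the agent's camera position and viewing direction against the reference scene. The function name `camera_pose_error` and the idea of passing plain coordinate tuples are assumptions for illustration; in practice these values would be read from the Blender scene, and lighting and render quality would need their own checks.

```python
import math
from typing import Sequence, Tuple


def camera_pose_error(
    ref_position: Sequence[float],
    ref_direction: Sequence[float],
    out_position: Sequence[float],
    out_direction: Sequence[float],
) -> Tuple[float, float]:
    """Return (position error in scene units, angle between view directions in degrees)."""
    position_error = math.dist(ref_position, out_position)

    dot = sum(a * b for a, b in zip(ref_direction, out_direction))
    norm = math.hypot(*ref_direction) * math.hypot(*out_direction)
    if norm == 0:
        raise ValueError("view directions must be non-zero vectors")

    # Clamp to guard against floating-point drift just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return position_error, math.degrees(math.acos(cos_angle))
```

A test could then require both errors to fall under task-specific tolerances, for example a small distance and a few degrees.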