blog · 6 posts

Writing from the inside.

Engineering deep-dives, research notes, and the occasional rant. Updated roughly every other week.

What our $4,300/month LLM bill taught us about caching

We routed everything through a semantic cache and raised our cache-hit rate from 47% to 71%. Here's exactly what we changed.

Read →

Claude Opus 4.7 vs GPT-5: a side-by-side on our internal evals

Four task categories, 1,200 prompts, blind-rated. The headline result isn't what the benchmarks suggest.

Read →

Shipping the Vibe Pulse — a real-time AI leaderboard

From idea to launch in 11 days. The stack, the data pipeline, and the three things that almost killed it.

Read →

Astro + edge functions: our 60-page marketing site recipe

Why we moved the public site from Next.js to Astro, and the four trade-offs we live with.

Read →

Why we publish our prompts

Transparency is a moat, not a leak. A short defense of open prompt libraries.

Read →

Evals that survive contact with reality

Most public benchmarks are useless in production. Here's the eval rig we actually run before every model swap.

Read →