Benchmark against Structured Output Endpoints

Neat idea, and it's especially great that there are measurements!

However, the really interesting benchmark would run it against the actual structured output endpoints of LLM providers, e.g. those fine-tuned to return valid JSON.
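To make the suggestion concrete, here is a rough sketch of how such a comparison could be scored: the fraction of responses that parse as valid JSON. Everything in it is an assumption for illustration. `valid_json_rate`, `fake_structured_endpoint`, and `fake_plain_endpoint` are hypothetical names, and the two fake functions are local stand-ins, not real provider APIs.

```python
import json
from typing import Callable, Iterable


def valid_json_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Return the fraction of prompts for which generate() yields parseable JSON."""
    total = 0
    valid = 0
    for prompt in prompts:
        total += 1
        try:
            json.loads(generate(prompt))
            valid += 1
        except json.JSONDecodeError:
            # Output was not valid JSON; count it as a failure.
            pass
    return valid / total if total else 0.0


# Hypothetical stand-in for a provider's structured output endpoint:
# always returns well-formed JSON.
def fake_structured_endpoint(prompt: str) -> str:
    return json.dumps({"answer": prompt})


# Hypothetical stand-in for an unconstrained endpoint: drops the closing
# brace for prompts of odd length, so some outputs fail to parse.
def fake_plain_endpoint(prompt: str) -> str:
    closing = "}" if len(prompt) % 2 == 0 else ""
    return '{"answer": "' + prompt + '"' + closing


if __name__ == "__main__":
    prompts = ["ab", "abc", "abcd", "a"]
    print("structured:", valid_json_rate(fake_structured_endpoint, prompts))
    print("plain:", valid_json_rate(fake_plain_endpoint, prompts))
```

In a real run, each `generate` callable would wrap one provider's endpoint (or the approach from the post), and the same prompt set would be fed to all of them so the rates are comparable.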

What do you think of this?