Evals V2 by scosman · Pull Request #1454 · Kiln-AI/Kiln

scosman · 2026-06-08T17:07:26Z

Kiln Evals V2 — Implementation

This project implements Kiln Evals V2: a typed, extensible eval framework that replaces V1's single-type LLM-as-judge model with many EvalConfigTypes producing scores per Eval, backed by purpose-built EvalInput datasets with per-case structured reference data. It is designed for "many small evals" — lowering the floor for creation so a small focused eval beats both "no eval at all" and an oversized eval that overfits.

V2 is additive over V1. V1 records on disk are never rewritten. V1 EvalConfigs (g_eval, llm_as_judge) continue to load and run unchanged under V2 via legacy adapter dispatch. New eval types, new dataset entities, and new scoring paths live alongside V1 — they never replace it on disk.

How this spec is organized (read this first)

This spec was produced from a long, deliberate design phase. The detail lives in components/ — one focused design doc per topic. The top-level files are thin maps into it:

project_overview.md (this file) — what we're building and how to navigate.
functional_spec.md — the feature/behavior surface (the eval-type catalog + data model + contracts), as pointers into components/.
architecture.md — the technical index: data model, runner dispatch, and the component map. Start here for the build.
implementation_plan.md — the phased build order (the checklist /spec implement walks).
components/NN_*.md — the real design detail. These are the meat; they already carry implementation-grade detail (schemas, signatures, test plans).
reference/ — supporting context (see note below). You usually don't need it.

On the `Ax.x` tags you'll see throughout

A full alignment phase preceded this design and is archived in reference/ALIGNMENT.md. Throughout the components/ docs you'll see tags like A2.1, B.13, E.18, C.runner.2 — these reference locked alignment decisions. You do not need to load reference/ALIGNMENT.md unless you genuinely want the history of how a decision was made. The design docs in components/ capture the output of alignment — they are less verbose, more complete, lower-churn, and are the best reference for building. Treat the tags as provenance breadcrumbs, not required reading.

reference/ also holds two docs that two components cite by name — backwards_compat_plan_grounded.md (code-grounded Phase-0 plan with eval.py line numbers) and batch_agent_eval_expansion.md (the tool_call_check/step_count_check expansion). Consult them only if a component points you there.

Scope at a glance

In V2.0: the data model (EvalInput, V2 EvalConfig, additive EvalRun/Eval fields), V1/V2 coexistence, 8 eval types (llm_judge, exact_match, pattern_match, set_check, contains, tool_call_check, step_count_check, code_eval), a Jinja2 template+extraction layer, the V2 runner, 6 RAG judge templates, the extensibility seams, and minimal-but-complete create/view UI for the new types.

Explicitly out of scope (owned by future, separate projects): the feedback-to-eval pipeline / triage, the goal-first onboarding redesign, spec-builder reliability, and the deferred eval types (composite, threshold, json_schema, event_ordering, embedding_similarity, dag_metric). See components/00_overview.md §3 for the full scope map and components/90_open_risks.md for residual risks.

Coordination & prerequisites

Multi-turn-synthetic is a parallel project. V2 commits only to the contract surface (components/26* is excluded from this folder; the MultiTurnSyntheticEvalInputData slot in the data model accommodates it). Don't build multi-turn scoring here.
Prereq already landed in main: the general Kiln input_transform (Jinja2) capability — libs/core/kiln_ai/utils/jinja_engine.py + InputTransform on KilnAgentRunConfigProperties. components/06 is its design, kept for reference only — do not rebuild.
Prereqs still to build (Phase 1): the SingleTurnR1ThinkingFormatter fix (components/05) and the V1 validate_template_properties bug fix (components/15 §4.3). See implementation_plan.md.

The full components/00_overview.md is the canonical front door — it expands every point above with links.

Add the V2 eval data model alongside V1 without breaking backwards compatibility. Key changes: - Add V2EvalConfigProperties typed model with discriminated union dispatch, mode="before" validator for V1 dict compatibility, and validate_properties V2 branch enforcing typed properties and no root-level model fields - Add EvalInput parented model, EvalInputFilter protocol/registry, and dataset_filters module with TagFilter implementation - Extend EvalRun with eval_input_id, reference_data, skipped_reason fields and relaxed score/output validators when skipped - Add Eval.eval_input_filter_id with mutual-exclusivity guard against eval_set_filter_id, plus Task.eval_inputs() accessor - Fix V1 validate_template_properties bug (was validating wrong field set) - Add forward_thinking_instructions opt-in to SingleTurnR1ThinkingFormatter - Extract legacy_model_fields helper and refactor eval_adapter_from_type dispatch in registry - Regenerate OpenAPI schema and fix all downstream TypeScript type errors for nullable model_name/model_provider/eval_set_filter_id fields Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… runner backbone Template/extraction layer consuming input_transform: EvalTaskInput assembly, extract() usage, required_var skip-with-reason pre-check, save-time template/expression Jinja2 compilation validation. V2 runner backbone: EvalInput flow with B2.1 TaskRun-to-EvalInput runtime translation, two-level adapter dispatch (config_type -> properties.type) with empty adapter map, multi-turn incompatible_input_shape skip handling, and per-job error isolation with skip records. Supporting infrastructure: BaseV2Eval ABC with adapter contract and typed property validation, scoring_utils extraction from GEval with legacy-model-field helpers, GEval characterization tests preserving backward compatibility, on-read n_used/n_excluded inline aggregation in eval score API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add exact_match, pattern_match, contains, set_check, tool_call_check, and step_count_check adapters, each registered into _V2_ADAPTER_MAP for two-level dispatch. Includes build_binary_scores helper in v2_eval_helpers and full test coverage for all adapters plus dispatch integration tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…/continuous scoring, and RAG judge templates - Add V2 LlmJudgeEval adapter with g_eval toggle, structured JSON output scoring, and reasoning-model thinking-formatter wiring (forward_thinking_instructions) - Extract shared scoring utilities (score_from_token_string, build_llm_as_judge_score) into scoring_utils.py; keep score_from_token_string strict for G-Eval logprob filtering while handling continuous 0-1 floats directly in build_llm_as_judge_score - Add six first-party RAG judge templates over the canonical reference-key contract: faithfulness, answer_relevance, context_relevance, context_precision, hallucination, answer_correctness - Wire forward_thinking_instructions through AdapterConfig -> build_chat_formatter -> SingleTurnR1ThinkingFormatter for reasoning-model judges - Update async V2 eval contract tests and dispatch registration - Comprehensive test coverage: 23 llm_judge tests, 33 RAG template tests, 4 forward-thinking tests, scoring_utils unit tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… gate Adds the code_eval V2 eval type, which runs user-authored Python scorer functions in a multiprocessing spawn-based worker process with wall-clock timeout and serialized spawn lock (daemon child for safety). Key components: - sandbox_worker.py: _execute_scorer (child target) + run_scorer (parent orchestrator) with timeout/kill/reap lifecycle - KilnEvalHelpers: pure-Python helper library injected as the `kiln` arg to user scorer functions (ref(), reference_data, task_input, trace) - CodeEvalAdapter (BaseV2Eval): trust-gate check returning code_eval_not_trusted skip when project is untrusted; score-shape validation (rejects bools, converts int→float, checks key sets) - Trust gate: ephemeral per-project in-memory store with grant/check/revoke helpers; grant and check API endpoints in eval_api.py (DENY_AGENT) - CodeEvalProperties.timeout_seconds field with save-time code validation (must define a score() function) - freeze_support() + set_start_method("spawn") wiring in desktop.py and dev_server.py entry points for PyInstaller compatibility - Registered as code_eval in _V2_ADAPTER_MAP (all 8 V2 types now covered) Note: OpenAPI schema was NOT regenerated for the new grant/check trust endpoints because schema generation is sandbox-blocked (mktemp). To be regenerated in Phase 6 when the frontend consumes them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-08T17:07:43Z

Walkthrough

This PR adds Evals V2 data models, adapter dispatch, runner paths, scoring utilities, code-eval trust and sandbox execution, backend API support, web UI creation and result rendering flows, broader compatibility guards, multiprocessing startup setup, and extensive tests and project specifications.

Changes

Evals V2 end-to-end rollout

Layer / File(s)	Summary
Runtime and data-model foundation `libs/core/kiln_ai/datamodel/`, `libs/core/kiln_ai/adapters/eval/base_eval.py`, `.../registry.py`, `.../eval_runner.py`, `.../eval_utils/`, `.../v2_eval_*`	Adds V2 eval types, properties, eval inputs, skipped-run fields, eval-input filtering, V2 adapter dispatch, runner collection and execution paths, shared scoring helpers, deterministic adapters, LLM judge, code-eval sandbox execution, and trust helpers.
Desktop API and startup integration `app/desktop/desktop.py`, `app/desktop/dev_server.py`, `app/desktop/studio_server/eval_api.py`	Adds multiprocessing spawn bootstrap for desktop entrypoints, updates score summaries with `n_used` and `n_excluded`, adds V2 test-eval handling, makes eval-config model fields conditionally optional, and exposes project trust endpoints.
Web UI V2 authoring and rendering `app/web_ui/src/lib/api/v2_eval_api.ts`, `app/web_ui/src/lib/utils/eval_types/registry.ts`, `app/web_ui/src/lib/components/eval_types/*`, `app/web_ui/src/routes/.../create_eval_config/+page.svelte`, `app/web_ui/src/routes/.../run_result/+page.svelte`	Adds V2 API helpers, eval-type registry metadata, per-type form and result components, a CodeMirror-based code editor, the V2 create-config flow with test and trust dialogs, and V2-aware result rendering and naming.
Compatibility and display guards `app/web_ui/src/routes/(app)/generate/...`, `app/web_ui/src/routes/(app)/specs/...`, `app/web_ui/src/routes/(app)/prompt_optimization/...`, `app/web_ui/src/lib/utils/formatters.ts`, `app/web_ui/src/lib/utils/form_element.svelte`	Adds null-safe model and provider formatting, guards flows that require `eval_set_filter_id`, updates V2 judge labels in existing pages, and adds numeric input min/max support.
Validation and regression coverage `app/desktop/studio_server/test_eval_api.py`, `libs/core/kiln_ai/adapters/eval/test_`, `libs/core/kiln_ai/datamodel/test_`, `app/web_ui/src/lib/*/.test.ts`, `app/web_ui/src/routes/.../page.test.ts`	Adds backend, runner, adapter, sandbox, datamodel, frontend component, API client, and page tests covering V2 execution, skipped aggregation, trust handling, formatter forwarding, and V1/V2 coexistence.
Specifications and plans `specs/projects/evals_v2/`, `specs/projects/deep-cr-cleanup/`	Adds Evals V2 architecture, component specs, functional and implementation indexes, phase plans, and cleanup planning documents aligned with the implemented runtime, UI, API, and test changes.

Sequence Diagram(s)

sequenceDiagram
  participant UI as Create Eval Config Page
  participant API as eval_api
  participant Adapter as V2 Eval Adapter
  participant Trust as CodeEvalTrust

  UI->>API: POST test_v2_eval(properties, eval_input)
  API->>Adapter: evaluate(eval_input)
  Adapter-->>API: scores or skipped_reason
  API-->>UI: TestV2EvalResponse

  UI->>API: GET code_eval_trust
  API->>Trust: check project trust
  Trust-->>API: trusted
  API-->>UI: CodeEvalTrustResponse

  UI->>API: POST grant_code_eval_trust
  API->>Trust: grant project trust
  Trust-->>API: trusted
  API-->>UI: CodeEvalTrustResponse

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Kiln-AI/Kiln#503: Both changes touch eval runner and run-config plumbing used by the new V2 adapter execution paths.
Kiln-AI/Kiln#1297: Both changes modify app/desktop/studio_server/eval_api.py score-summary and eval-results aggregation behavior.
Kiln-AI/Kiln#358: Both changes update the chat formatter path around SingleTurnR1ThinkingFormatter and get_chat_formatter.

Poem

🐇 I hopped through judges old and new,
With scorecards split by used and skew.
A sandbox hums, trust gates swing wide,
While V2 paths now stride with pride.
Spawned little burrows start just right—
And tests keep watch all through the night.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.63% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Evals V2' is concise and directly describes the main feature being added—a new V2 eval framework. It clearly indicates the primary change and would be meaningful to someone scanning PR history.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering what V2 does, how it relates to V1, how the spec is organized, scope at a glance, and coordination notes. It follows the spirit of the template by explaining the goals and rationale, though it does not explicitly check the CLA or provide all template checkbox items.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch scosman/evals_v2

⚔️ Resolve merge conflicts

Resolve merge conflict in branch scosman/evals_v2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request implements the backend infrastructure for Kiln Evals V2, introducing the EvalInput data model, V2-specific fields for evals and runs, and six deterministic adapters alongside a sandboxed execution model for custom Python scorers. The review feedback identifies several critical issues: eval_helpers.py incorrectly parses standard OpenAI traces for tool calls and results; the score key validation in v2_eval_code_eval.py is overly strict and violates the specification; and eval_api.py contains a potential ZeroDivisionError and lacks native V2 eval support when eval_set_filter_id is missing. Additionally, minor docstring mismatches in base_v2_eval.py and v2_eval_step_count_check.py should be corrected.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-08T17:11:28Z

+    def get_tool_calls(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:
+        """Return all tool-call entries from a trace (empty list if trace is None)."""
+        if not trace:
+            return []
+        return [
+            entry
+            for entry in trace
+            if entry.get("role") == "tool_call" or entry.get("type") == "tool_call"
+        ]


In standard OpenAI chat completion traces (which Kiln uses), tool calls are not top-level trace entries with role: \"tool_call\". Instead, they are stored within the tool_calls field of assistant messages (as correctly handled in v2_eval_tool_call_check.py). Using the current implementation, get_tool_calls will return empty lists for standard OpenAI traces, causing user-authored scorers to fail or return incorrect results.

@staticmethod\n def get_tool_calls(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:\n \"\"\"Return all tool-call entries from a trace (empty list if trace is None).\"\"\"\n if not trace:\n return []\n calls = []\n for entry in trace:\n if entry.get(\"role\") == \"assistant\" and \"tool_calls\" in entry:\n for tc in entry[\"tool_calls\"]:\n func = tc.get(\"function\", {})\n calls.append({\n \"name\": func.get(\"name\", \"\"),\n \"arguments\": func.get(\"arguments\", {}),\n \"id\": tc.get(\"id\", \"\"),\n })\n elif entry.get(\"role\") == \"tool_call\" or entry.get(\"type\") == \"tool_call\":\n calls.append(entry)\n return calls

gemini-code-assist · 2026-06-08T17:11:28Z

+    def get_tool_results(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:
+        """Return all tool-result entries from a trace."""
+        if not trace:
+            return []
+        return [
+            entry
+            for entry in trace
+            if entry.get("role") == "tool_result" or entry.get("type") == "tool_result"
+        ]


In standard OpenAI chat completion traces, tool results have role: \"tool\", not role: \"tool_result\". We should update get_tool_results to support standard OpenAI traces.

@staticmethod\n def get_tool_results(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:\n \"\"\"Return all tool-result entries from a trace.\"\"\"\n if not trace:\n return []\n results = []\n for entry in trace:\n if entry.get(\"role\") == \"tool\" or entry.get(\"role\") == \"tool_result\" or entry.get(\"type\") == \"tool_result\":\n results.append(entry)\n return results

gemini-code-assist · 2026-06-08T17:11:28Z

+        expected_keys = {score.json_key() for score in eval_obj.output_scores}
+        actual_keys = set(raw.keys())
+        if actual_keys != expected_keys:
+            raise RuntimeError(
+                f"Score key mismatch: got {sorted(actual_keys)}, expected {sorted(expected_keys)}"
+            )
+
+        validated: EvalScores = {}
+        for key, value in raw.items():
+            if isinstance(value, bool):
+                raise RuntimeError(
+                    f"Score '{key}' returned a bool. Use a float (e.g. 1.0 for pass, 0.0 for fail)."
+                )
+            if isinstance(value, int):
+                value = float(value)
+            if not isinstance(value, float):
+                raise RuntimeError(
+                    f"Score '{key}' must be a float, got {type(value).__name__}"
+                )
+            validated[key] = value


According to the specification in components/27_type_code_eval.md, missing keys are allowed (and should be mapped to None or skipped), and extra keys should be silently ignored. However, this strict equality check (actual_keys != expected_keys) will raise a RuntimeError if there are any missing or extra keys. We should relax this validation to allow missing keys and ignore extra keys.

expected_keys = {score.json_key() for score in eval_obj.output_scores}\n validated: EvalScores = {}\n for score in eval_obj.output_scores:\n key = score.json_key()\n if key not in raw:\n continue\n value = raw[key]\n if isinstance(value, bool):\n raise RuntimeError(\n f\"Score '{key}' returned a bool. Use a float (e.g. 1.0 for pass, 0.0 for fail).\"\n )\n if isinstance(value, int):\n value = float(value)\n if not isinstance(value, float):\n raise RuntimeError(\n f\"Score '{key}' must be a float, got {type(value).__name__}\"\n )\n validated[key] = value

gemini-code-assist · 2026-06-08T17:11:28Z

+        n_processed = len(expected_dataset_ids) - incomplete_count
+        percent_complete = (n_processed) / len(expected_dataset_ids)


If expected_dataset_ids is empty, this division will raise a ZeroDivisionError. We should guard against this by checking if expected_dataset_ids is empty.

n_processed = len(expected_dataset_ids) - incomplete_count\n percent_complete = (n_processed) / len(expected_dataset_ids) if expected_dataset_ids else 0.0

gemini-code-assist · 2026-06-08T17:11:28Z

+
+
+class BaseV2Eval(ABC):
+    """Base class for V2 eval adapters -- separate from BaseEval because V2 adapters are synchronous, produce EvalScores dicts directly, and do not need the LLM-based score schema machinery."""


The docstring states that V2 adapters are synchronous, but evaluate is defined as an asynchronous method (async def evaluate). We should update the docstring to reflect this.

\"\"\"Base class for V2 eval adapters -- separate from BaseEval because V2 adapters produce EvalScores dicts directly and do not need the LLM-based score schema machinery.\"\"\"

gemini-code-assist · 2026-06-08T17:11:28Z

+
+    count_type determines what is counted:
+    - "tool_calls": total number of individual tool call invocations across all assistant messages
+    - "model_responses": number of assistant messages, including tool-call-only


The docstring states that model_responses excludes tool-call-only messages, but the implementation and tests include them. We should update the docstring to match the intended behavior.

Suggested change

- "model_responses": number of assistant messages, including tool-call-only

- \"model_responses\": number of assistant messages, including tool-call-only

gemini-code-assist · 2026-06-08T17:11:28Z

+        if eval.eval_set_filter_id is None:
+            raise HTTPException(
+                status_code=400,
+                detail="This eval does not have a V1 eval set filter.",
+            )


Instead of raising a 400 error when eval_set_filter_id is None, we can support native V2 evals by resolving the expected dataset IDs from eval_input_filter_id using eval_input_filter_from_id.

if eval.eval_set_filter_id is None:\n if eval.eval_input_filter_id is None:\n raise HTTPException(\n status_code=400,\n detail=\"This eval does not have a V1 or V2 dataset filter.\",\n )\n from kiln_ai.datamodel.dataset_filters import eval_input_filter_from_id\n filter_fn = eval_input_filter_from_id(eval.eval_input_filter_id)\n expected_dataset_ids = {\n ei.id for ei in task.eval_inputs(readonly=True) if filter_fn(ei)\n }\n else:\n expected_dataset_ids = dataset_ids_in_filter(\n task, eval.eval_set_filter_id, readonly=True\n )

gemini-code-assist · 2026-06-08T17:11:29Z

+            if eval.eval_set_filter_id is None:
+                continue
            expected_dataset_ids = dataset_ids_in_filter(
                task, eval.eval_set_filter_id, readonly=True
            )


Instead of skipping when eval_set_filter_id is None, we can support native V2 evals by resolving the expected dataset IDs from eval_input_filter_id using eval_input_filter_from_id.

if eval.eval_set_filter_id is None:\n if eval.eval_input_filter_id is None:\n continue\n from kiln_ai.datamodel.dataset_filters import eval_input_filter_from_id\n filter_fn = eval_input_filter_from_id(eval.eval_input_filter_id)\n expected_dataset_ids = {\n ei.id for ei in task.eval_inputs(readonly=True) if filter_fn(ei)\n }\n else:\n expected_dataset_ids = dataset_ids_in_filter(\n task, eval.eval_set_filter_id, readonly=True\n )

coderabbitai

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

app/web_ui/src/lib/api_schema.d.ts (1)
2-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regenerate OpenAPI bindings; checked-in schema is stale and breaking API contract checks.

CI already confirms this generated file is out of sync with backend OpenAPI (missing trust endpoints and timeout_seconds), so frontend/backend typing contracts are currently inconsistent. Please update backend schema sources, then regenerate with app/web_ui/src/lib/generate_schema.sh and commit the result (instead of manual edits in this file).
As per coding guidelines, “Keep OpenAPI client schema up to date with app/web_ui/src/lib/generate_schema.sh.” Based on learnings, this file is generated and should be changed via backend schema + regeneration, not direct edits.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/api_schema.d.ts` around lines 2 - 4, The checked-in
generated OpenAPI TypeScript file is stale (missing trust endpoints and the
timeout_seconds field) and must not be edited directly; update the backend
OpenAPI schema sources to include the missing trust endpoints and
timeout_seconds, then run the regeneration script
app/web_ui/src/lib/generate_schema.sh to produce a fresh api_schema.d.ts, verify
the new types include the trust endpoints and timeout_seconds, and commit the
regenerated file so frontend/backend typing contracts stay in sync.
Sources: Coding guidelines, Learnings, Pipeline failures
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (1)
684-689: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Forwarding flag is skipped for tuned chat strategies.

On Line 684, the tuned-strategy formatter call includes thinking_instructions but does not pass forward_thinking_instructions, so AdapterConfig.forward_thinking_instructions=True is ignored for that path.
Proposed fix
         if tuned_chat_strategy and tuned_chat_strategy != ChatStrategy.single_turn:
             return get_chat_formatter(
                 strategy=tuned_chat_strategy,
                 system_message=system_message,
                 user_input=input,
                 thinking_instructions=cot_prompt,
+                forward_thinking_instructions=self.base_adapter_config.forward_thinking_instructions,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/model_adapters/base_adapter.py` around lines 684 -
689, The tuned-strategy path calling get_chat_formatter (with
tuned_chat_strategy, system_message, input, thinking_instructions=cot_prompt)
omits the forward_thinking_instructions flag so
AdapterConfig.forward_thinking_instructions is ignored; update that
get_chat_formatter invocation to forward the flag (e.g.
forward_thinking_instructions=AdapterConfig.forward_thinking_instructions or the
instance equivalent) alongside thinking_instructions so the configured
forwarding behavior is respected.

🧹 Nitpick comments (1)

libs/core/kiln_ai/adapters/eval/test_eval_runner.py (1)
1229-1237: ⚡ Quick win

Add a runner-level regression test for eval-input jobs.

Current V2 tests validate collection and direct run_job, but not the collect_tasks() → run() path for eval-input mode. A single async test covering that path would catch the missing stored_output contract break.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/test_eval_runner.py` around lines 1229 -
1237, The test suite is missing a regression test that exercises the full
collect_tasks() → run() path for eval-input jobs and would catch the
stored_output contract break; add an async test (similar to
test_collects_all_inputs) that calls mock_v2_runner.collect_tasks(), then
executes the runner's run() for those tasks (or awaits
mock_v2_runner.run(collected_tasks)), and asserts that each produced job has the
expected stored_output structure and types for EvalInput jobs (referencing
collect_tasks(), run(), EvalInput, and the test_collects_all_inputs pattern) so
the stored_output contract is validated end-to-end.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/eval_config_instruction.svelte:
- Around line 9-19: The helper functions get_eval_steps and get_task_description
must validate the untyped properties before casting: in get_eval_steps (taking
eval_config: EvalConfig) check eval_config?.properties exists, ensure
props["eval_steps"] is an array with Array.isArray and that every element typeof
=== "string" (or coerce safe values via String and filter out null/undefined)
and then return a string[]; in get_task_description ensure
props["task_description"] is a string (typeof === "string") and only return it,
otherwise return the default "No description provided." This prevents non-string
values from being rendered and keeps the code robust against malformed payloads.

In `@libs/core/kiln_ai/adapters/eval/eval_helpers.py`:
- Around line 65-67: In has_tool_call, guard against non-dict tool call
arguments before calling .get: when extracting args (the variable assigned from
call.get("arguments") or call.get("args")), check isinstance(args, dict) and if
not, set args = {} (or convert to a dict if appropriate) so the subsequent
all(args.get(k) == v for k, v in expected_args.items()) cannot raise; update the
code around the args assignment in has_tool_call to perform this check and then
use the safe args variable in the matching logic.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Around line 176-189: The created EvalJob instances are missing required
execution artifacts (stored_output/stored_trace) which later causes ValueError
in execution; update the job-creation loop in eval_runner where EvalJob(...) is
instantiated (inside the for eval_input in self.task.eval_inputs(...) block) to
include stored_output and stored_trace pulled from the EvalInput (e.g.,
eval_input.stored_output and eval_input.stored_trace) or, if artifacts aren't
present, ensure the task is executed/wired before building jobs so those fields
are populated; make the change to the EvalJob constructor usage so collected
jobs are runnable by eval_run execution.

In `@libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py`:
- Around line 70-80: The current key-offset detection in metric_offsets uses raw
substring count/find on metric_name which can match quoted metric text inside
values; instead use a JSON-key-aware regex search to locate keys only (e.g.
compile pattern rf'"{re.escape(m)}"\s*:') and use re.finditer on raw_output to
collect matches, ensure exactly one match per metric, and set offsets[m] =
match.start(); replace the existing count/find checks with this regex-based
logic to avoid false positives from quoted values.

In `@libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py`:
- Around line 33-40: Wrap the call to extract(expression, data) in a try/except
so any exception (e.g., ValueError or runtime errors from compiled(**data)) is
caught and translated into the same skip tuple used for Undefined; keep the
existing check for isinstance(result, Undefined) but add an except Exception as
e that returns (None, SkippedReason.extraction_failed, f"Expression
'{expression}' raised error: {e}") and ensure you import/keep references to
extract, Undefined, and SkippedReason from the current module.

In `@libs/core/kiln_ai/adapters/eval/sandbox_worker.py`:
- Around line 122-127: The current try/except using q.get_nowait() can race with
the child process and miss a just-enqueued result; in the block that references
q and process p (the scorer subprocess), replace the non-blocking get_nowait()
with a blocking wait (e.g., q.get with a short timeout or a loop that waits
while p.is_alive() and then drains the queue) so you only raise the RuntimeError
after you are sure the child has exited and the queue is empty; keep the
existing checks on p.exitcode (the RuntimeError messages) but ensure you attempt
a blocking read from q (or repeated reads until timeout/empty after p.exitcode
is set) to avoid the race.

In `@libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py`:
- Around line 76-77: The test contains literal "print('debug info')" substrings
inside code strings which trigger the debug-detector; replace those occurrences
with a non-print alternative (e.g., use a logger call like logger.debug(...)
inside the string or remove the debug line entirely) in the string literals that
include "print('debug info')" (the string immediately before "return {'x': 1.0}"
and the similar occurrence later) so the test no longer contains the raw
"print(" pattern while preserving test behavior.

In `@libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py`:
- Around line 343-346: The test test_g_eval_raises_when_provider_lacks_logprobs
currently calls
asyncio.get_event_loop().run_until_complete(adapter.evaluate(_inp())), which
bypasses the file's async style; change the test to be async (async def
test_g_eval_raises_when_provider_lacks_logprobs(...)) and replace the
run_until_complete call with an awaited call (await adapter.evaluate(_inp()))
inside the existing pytest.raises context; also remove the now-unnecessary
import asyncio. This keeps the test consistent with the file's async/await
pattern and pytest's asyncio_mode="auto".

In `@libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py`:
- Around line 117-123: The current validation block (the one assigning into
validated[key]) accepts float values like NaN or Infinity; update it to reject
non-finite floats before populating validated (and before returning EvalScores)
by using math.isfinite on the float-converted value and raising a RuntimeError
if not finite. Keep the existing int->float conversion and type check for value,
then add a finiteness check referencing the same symbols (value, key, validated)
and include a clear error message like "Score '{key}' must be a finite float,
got {value}". Ensure math is imported where this validation lives.

In `@libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py`:
- Around line 31-38: The code is treating an invalid regex as an extraction
skip; update the error handling in the block around re.search(props.pattern,
actual) so an invalid pattern raises a hard configuration/validation error
instead of returning SkippedReason.extraction_failed—replace the except re.error
branch that returns ( {}, SkippedReason.extraction_failed, ... ) with a raised
exception (e.g., ValueError or a dedicated ConfigValidationError) that includes
the pattern and original exception message so the misconfiguration surfaces
immediately; keep references to re.search(props.pattern, actual) and
SkippedReason.extraction_failed to locate and change the logic.

In `@libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py`:
- Around line 25-27: Replace the runtime-optimizable assert with an explicit
type check: after props = self.properties, verify with if not isinstance(props,
StepCountCheckProperties): raise ValueError("expected StepCountCheckProperties
for self.properties") so the guard in V2EvalStepCountCheck (using properties and
StepCountCheckProperties) remains effective in optimized builds.

In `@libs/core/kiln_ai/datamodel/eval.py`:
- Around line 381-384: EvalRun's output is allowed to be None even for
non-skipped runs; add a Pydantic validation that enforces output is non-None
unless the run is explicitly marked as skipped. In the EvalRun model (and the
other similar model around lines 471-477), add a validator or root_validator
that checks the run's status/skipped flag (e.g., compare to the skipped
enum/value used in this module) and raises a ValueError if status != SKIPPED and
output is None, leaving output optional only when status indicates a
skipped-before-execution run. Ensure the validator references the class name
EvalRun and the output field so it triggers on model construction and prevents
invalid records.

In `@specs/projects/evals_v2/components/70_builder_and_onboarding.md`:
- Line 280: The renderer guidance incorrectly references the old property name
g_eval_mode; update that reference to the canonical V2 field g_eval so the
llm_judge binding uses g_eval (ensure any text mentioning g_eval_mode in the
llm_judge renderer guidance is replaced and consistent with the documented
g_eval property).

---

Outside diff comments:
In `@app/web_ui/src/lib/api_schema.d.ts`:
- Around line 2-4: The checked-in generated OpenAPI TypeScript file is stale
(missing trust endpoints and the timeout_seconds field) and must not be edited
directly; update the backend OpenAPI schema sources to include the missing trust
endpoints and timeout_seconds, then run the regeneration script
app/web_ui/src/lib/generate_schema.sh to produce a fresh api_schema.d.ts, verify
the new types include the trust endpoints and timeout_seconds, and commit the
regenerated file so frontend/backend typing contracts stay in sync.

In `@libs/core/kiln_ai/adapters/model_adapters/base_adapter.py`:
- Around line 684-689: The tuned-strategy path calling get_chat_formatter (with
tuned_chat_strategy, system_message, input, thinking_instructions=cot_prompt)
omits the forward_thinking_instructions flag so
AdapterConfig.forward_thinking_instructions is ignored; update that
get_chat_formatter invocation to forward the flag (e.g.
forward_thinking_instructions=AdapterConfig.forward_thinking_instructions or the
instance equivalent) alongside thinking_instructions so the configured
forwarding behavior is respected.

---

Nitpick comments:
In `@libs/core/kiln_ai/adapters/eval/test_eval_runner.py`:
- Around line 1229-1237: The test suite is missing a regression test that
exercises the full collect_tasks() → run() path for eval-input jobs and would
catch the stored_output contract break; add an async test (similar to
test_collects_all_inputs) that calls mock_v2_runner.collect_tasks(), then
executes the runner's run() for those tasks (or awaits
mock_v2_runner.run(collected_tasks)), and asserts that each produced job has the
expected stored_output structure and types for EvalInput jobs (referencing
collect_tasks(), run(), EvalInput, and the test_collects_all_inputs pattern) so
the stored_output contract is validated end-to-end.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 50c2dc65-f3e2-4c33-8799-85e807724f13

📥 Commits

Reviewing files that changed from the base of the PR and between 58608b4 and f3777c8.

📒 Files selected for processing (102)

app/desktop/desktop.py
app/desktop/dev_server.py
app/desktop/studio_server/eval_api.py
app/desktop/studio_server/test_eval_api.py
app/web_ui/src/lib/api_schema.d.ts
app/web_ui/src/lib/utils/formatters.ts
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/data_gen_intro.svelte
app/web_ui/src/routes/(app)/prompt_optimization/[project_id]/[task_id]/create_prompt_optimization_job/+page.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/+page.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/[eval_config_id]/[run_config_id]/run_result/+page.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/+page.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/eval_config_instruction.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
libs/core/kiln_ai/adapters/chat/chat_formatter.py
libs/core/kiln_ai/adapters/chat/test_chat_formatter.py
libs/core/kiln_ai/adapters/eval/__init__.py
libs/core/kiln_ai/adapters/eval/base_eval.py
libs/core/kiln_ai/adapters/eval/base_v2_eval.py
libs/core/kiln_ai/adapters/eval/eval_helpers.py
libs/core/kiln_ai/adapters/eval/eval_runner.py
libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py
libs/core/kiln_ai/adapters/eval/eval_utils/test_scoring_utils.py
libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py
libs/core/kiln_ai/adapters/eval/g_eval.py
libs/core/kiln_ai/adapters/eval/rag_judge_templates.py
libs/core/kiln_ai/adapters/eval/registry.py
libs/core/kiln_ai/adapters/eval/sandbox_worker.py
libs/core/kiln_ai/adapters/eval/test_eval_helpers.py
libs/core/kiln_ai/adapters/eval/test_eval_runner.py
libs/core/kiln_ai/adapters/eval/test_forward_thinking_instructions.py
libs/core/kiln_ai/adapters/eval/test_g_eval.py
libs/core/kiln_ai/adapters/eval/test_g_eval_characterization.py
libs/core/kiln_ai/adapters/eval/test_rag_judge_templates.py
libs/core/kiln_ai/adapters/eval/test_registry.py
libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py
libs/core/kiln_ai/adapters/eval/test_v2_contains.py
libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/test_v2_exact_match.py
libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
libs/core/kiln_ai/adapters/eval/test_v2_step_count_check.py
libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py
libs/core/kiln_ai/adapters/model_adapters/test_base_adapter.py
libs/core/kiln_ai/datamodel/dataset_filters.py
libs/core/kiln_ai/datamodel/eval.py
libs/core/kiln_ai/datamodel/task.py
libs/core/kiln_ai/datamodel/test_dataset_filters.py
libs/core/kiln_ai/datamodel/test_eval_model.py
specs/projects/evals_v2/architecture.md
specs/projects/evals_v2/components/00_overview.md
specs/projects/evals_v2/components/05_prereq_thinking_formatter_fix.md
specs/projects/evals_v2/components/06_prereq_input_transform.md
specs/projects/evals_v2/components/10_data_model.md
specs/projects/evals_v2/components/15_v1_v2_coexistence.md
specs/projects/evals_v2/components/20_eval_config_types_overview.md
specs/projects/evals_v2/components/21_type_llm_judge.md
specs/projects/evals_v2/components/22_type_deterministic_basics.md
specs/projects/evals_v2/components/27_type_code_eval.md
specs/projects/evals_v2/components/29_rag_judge_templates.md
specs/projects/evals_v2/components/40_template_and_extraction.md
specs/projects/evals_v2/components/45_runner_architecture.md
specs/projects/evals_v2/components/50_reference_data.md
specs/projects/evals_v2/components/70_builder_and_onboarding.md
specs/projects/evals_v2/components/80_extensibility_contract.md
specs/projects/evals_v2/components/85_observability_and_audit.md
specs/projects/evals_v2/components/90_open_risks.md
specs/projects/evals_v2/functional_spec.md
specs/projects/evals_v2/implementation_plan.md
specs/projects/evals_v2/phase_plans/phase_1.md
specs/projects/evals_v2/phase_plans/phase_2.md
specs/projects/evals_v2/phase_plans/phase_3.md
specs/projects/evals_v2/phase_plans/phase_4.md
specs/projects/evals_v2/phase_plans/phase_5.md
specs/projects/evals_v2/project_overview.md
specs/projects/evals_v2/reference/ALIGNMENT.md
specs/projects/evals_v2/reference/backwards_compat_plan_grounded.md
specs/projects/evals_v2/reference/batch_agent_eval_expansion.md
specs/projects/input_transform_create_ui/architecture.md
specs/projects/input_transform_create_ui/functional_spec.md
specs/projects/input_transform_create_ui/phase_plans/phase_1.md
specs/projects/input_transform_create_ui/phase_plans/phase_2.md
specs/projects/input_transform_create_ui/project_overview.md
specs/projects/input_transform_ui/architecture.md
specs/projects/input_transform_ui/functional_spec.md
specs/projects/input_transform_ui/phase_plans/phase_1.md
specs/projects/input_transform_ui/phase_plans/phase_2.md
specs/projects/input_transform_ui/project_overview.md
specs/projects/sse_cancel_fix/architecture.md
specs/projects/sse_cancel_fix/functional_spec.md
specs/projects/sse_cancel_fix/implementation_plan.md
specs/projects/sse_cancel_fix/project_overview.md

coderabbitai · 2026-06-08T17:26:40Z

+    const props = eval_config.properties as Record<string, unknown>
+    if (!props["eval_steps"]) return undefined
+    if (!Array.isArray(props["eval_steps"])) return undefined
+    return props["eval_steps"] as string[]
+  }
+
+  function get_task_description(eval_config: EvalConfig): string {
+    if (!eval_config?.properties) return "No description provided."
+    const props = eval_config.properties as Record<string, unknown>
+    return (props["task_description"] as string) || "No description provided."
  }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate properties value types before rendering.

get_eval_steps and get_task_description still trust unknown payload types after casting. Non-string values can render incorrectly in UI.

Suggested fix

function get_eval_steps(eval_config: EvalConfig): string[] | undefined { if (!eval_config) return undefined if (!eval_config.properties) return undefined const props = eval_config.properties as Record<string, unknown> if (!props["eval_steps"]) return undefined if (!Array.isArray(props["eval_steps"])) return undefined - return props["eval_steps"] as string[] + if (!props["eval_steps"].every((s) => typeof s === "string")) return undefined + return props["eval_steps"] as string[] } function get_task_description(eval_config: EvalConfig): string { if (!eval_config?.properties) return "No description provided." const props = eval_config.properties as Record<string, unknown> - return (props["task_description"] as string) || "No description provided." + return typeof props["task_description"] === "string" && + props["task_description"].trim() !== "" + ? props["task_description"] + : "No description provided." }

Also applies to: 26-26

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/eval_config_instruction.svelte around lines 9 - 19, The helper functions get_eval_steps and get_task_description must validate the untyped properties before casting: in get_eval_steps (taking eval_config: EvalConfig) check eval_config?.properties exists, ensure props["eval_steps"] is an array with Array.isArray and that every element typeof === "string" (or coerce safe values via String and filter out null/undefined) and then return a string[]; in get_task_description ensure props["task_description"] is a string (typeof === "string") and only return it, otherwise return the default "No description provided." This prevents non-string values from being rendered and keeps the code robust against malformed payloads.

coderabbitai · 2026-06-08T17:26:40Z

+            args = call.get("arguments") or call.get("args") or {}
+            if all(args.get(k) == v for k, v in expected_args.items()):
+                return True


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard non-dict tool-call arguments before .get().

has_tool_call can raise when arguments/args is not a dict (common when args arrive as strings), which turns a matching check into a scorer failure.

Proposed fix

- args = call.get("arguments") or call.get("args") or {} - if all(args.get(k) == v for k, v in expected_args.items()): + args = call.get("arguments") or call.get("args") or {} + if not isinstance(args, dict): + continue + if all(args.get(k) == v for k, v in expected_args.items()): return True

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/eval_helpers.py` around lines 65 - 67, In has_tool_call, guard against non-dict tool call arguments before calling .get: when extracting args (the variable assigned from call.get("arguments") or call.get("args")), check isinstance(args, dict) and if not, set args = {} (or convert to a dict if appropriate) so the subsequent all(args.get(k) == v for k, v in expected_args.items()) cannot raise; update the code around the args assignment in has_tool_call to perform this check and then use the safe args variable in the matching logic.

coderabbitai · 2026-06-08T17:26:40Z

+    for m in metrics:
+        metric_name = f'"{m}"'
+        count = raw_output.count(metric_name)
+        if count != 1:
+            raise ValueError(
+                f"Metric {m} should appear exactly once in the output. Found {count} times"
+            )
+        offset = raw_output.find(metric_name)
+        if offset == -1:
+            raise ValueError(f"Metric {m} not found in raw output")
+        offsets[m] = offset


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Key offset detection is too broad and can fail on valid outputs.

metric_offsets uses raw substring counting (count/find) instead of matching JSON key positions. If a quoted metric name appears in string content, this path can raise a false “appear exactly once” error and block scoring.

Suggested fix

+import re @@ def metric_offsets(raw_output: str, metrics: List[str]) -> Dict[str, int]: @@ offsets: Dict[str, int] = {} for m in metrics: - metric_name = f'"{m}"' - count = raw_output.count(metric_name) - if count != 1: + key_pattern = re.compile(rf'"{re.escape(m)}"\s*:') + matches = list(key_pattern.finditer(raw_output)) + if len(matches) != 1: raise ValueError( - f"Metric {m} should appear exactly once in the output. Found {count} times" + f"Metric {m} should appear exactly once as a JSON key. Found {len(matches)} times" ) - offset = raw_output.find(metric_name) - if offset == -1: - raise ValueError(f"Metric {m} not found in raw output") - offsets[m] = offset + offsets[m] = matches[0].start() return offsets

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for m in metrics:

metric_name = f'"{m}"'

count = raw_output.count(metric_name)

if count != 1:

raise ValueError(

f"Metric {m} should appear exactly once in the output. Found {count} times"

)

offset = raw_output.find(metric_name)

if offset == -1:

raise ValueError(f"Metric {m} not found in raw output")

offsets[m] = offset

for m in metrics:

key_pattern = re.compile(rf'"{re.escape(m)}"\s*:')

matches = list(key_pattern.finditer(raw_output))

if len(matches) != 1:

raise ValueError(

f"Metric {m} should appear exactly once as a JSON key. Found {len(matches)} times"

)

offsets[m] = matches[0].start()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py` around lines 70 - 80, The current key-offset detection in metric_offsets uses raw substring count/find on metric_name which can match quoted metric text inside values; instead use a JSON-key-aware regex search to locate keys only (e.g. compile pattern rf'"{re.escape(m)}"\s*:') and use re.finditer on raw_output to collect matches, ensure exactly one match per metric, and set offsets[m] = match.start(); replace the existing count/find checks with this regex-based logic to avoid false positives from quoted values.

coderabbitai · 2026-06-08T17:26:40Z

+    result = extract(expression, data)
+    if isinstance(result, Undefined):
+        return (
+            None,
+            SkippedReason.extraction_failed,
+            f"Expression '{expression}' resolved to undefined",
+        )
+    return result, None, None


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Locate jinja extract implementation:" fd -i 'jinja_engine.py' echo echo "Show extract implementation (read-only):" fd -i 'jinja_engine.py' --exec sed -n '1,240p' {} echo echo "Find eval helper call sites relying on extract behavior:" rg -n --type=py -C2 '\bextract\(' libs/core/kiln_ai/adapters/eval libs/core/kiln_ai/utils

Repository: Kiln-AI/Kiln

Length of output: 16213

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Inspect v2_eval_helpers.py around the referenced lines:" sed -n '1,140p' libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py echo echo "Locate SkippedReason definition:" rg -n "class SkippedReason|SkippedReason\s*=" libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py libs/core/kiln_ai/adapters/eval -S echo # If found via rg, print the most relevant match file(s) area matches=$(rg -n "class SkippedReason|SkippedReason\s*=" libs/core/kiln_ai/adapters/eval -S -o --no-filename | head -n 1 || true) echo "done"

Repository: Kiln-AI/Kiln

Length of output: 2817

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Locate SkippedReason in kiln_ai datamodel eval:" rg -n "class SkippedReason|SkippedReason\s*=" libs/core/kiln_ai/datamodel/eval* libs/core/kiln_ai/datamodel -S echo echo "Show likely SkippedReason definition file(s):" files=$(rg -l "SkippedReason" libs/core/kiln_ai/datamodel/eval* libs/core/kiln_ai/datamodel -S || true) for f in $files; do echo "---- $f ----" sed -n '1,220p' "$f" | sed -n '1,220p' >/dev/null done # Print exact matches with surrounding context rg -n "SkippedReason" libs/core/kiln_ai/datamodel/eval* libs/core/kiln_ai/datamodel -S -C2 | head -n 80

Repository: Kiln-AI/Kiln

Length of output: 3141

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "v2_eval_helpers.py (line-numbered excerpt):" nl -ba libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py | sed -n '1,120p' echo echo "SkippedReason enum (line-numbered excerpt):" nl -ba libs/core/kiln_ai/datamodel/eval.py | sed -n '220,310p' echo echo "jinja_engine.py extract() (line-numbered excerpt):" nl -ba libs/core/kiln_ai/utils/jinja_engine.py | sed -n '60,140p'

Repository: Kiln-AI/Kiln

Length of output: 161

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "v2_eval_helpers.py (line-numbered excerpt):" cat -n libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py | sed -n '1,140p' echo echo "SkippedReason enum (line-numbered excerpt):" cat -n libs/core/kiln_ai/datamodel/eval.py | sed -n '220,310p' echo echo "jinja_engine.py extract() (line-numbered excerpt):" cat -n libs/core/kiln_ai/utils/jinja_engine.py | sed -n '60,170p'

Repository: Kiln-AI/Kiln

Length of output: 8853

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Search extract exception expectations in test_jinja_engine.py:" rg -n "extract$" libs/core/kiln_ai/utils/test_jinja_engine.py -n echo rg -n "extract\([^$]*\).*pytest\.raises|pytest\.raises.*extract\(" libs/core/kiln_ai/utils/test_jinja_engine.py -S echo rg -n "SecurityError|UndefinedError|TemplateSyntaxError" libs/core/kiln_ai/utils/test_jinja_engine.py -S

Repository: Kiln-AI/Kiln

Length of output: 631

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Find all extract() call sites in libs/core (short context):" rg -n --type=py -C2 '\bextract\(' libs/core/kiln_ai | head -n 200 echo echo "Look for SecurityError/UndefinedError tests that involve extract():" rg -n --type=py '(SecurityError|UndefinedError).*extract\(|extract\(.*(SecurityError|UndefinedError)' libs/core/kiln_ai/utils/test_jinja_engine.py

Repository: Kiln-AI/Kiln

Length of output: 17252

Handle extract(...) exceptions and map them to SkippedReason.extraction_failed (v2_eval_helpers.py lines 33-39 and 73-78).

Eval helpers only skip when extract(...) returns Undefined. But kiln_ai.utils.jinja_engine.extract() also raises (e.g., malformed expressions become ValueError, and sandbox/runtime errors can raise during compiled(**data)), which can hard-fail eval instead of returning a skip tuple.

Suggested fix

def extract_value( expression: str | None, eval_input: EvalTaskInput, ) -> tuple[Any, SkippedReason | None, str | None]: @@ if expression is None: return eval_input.final_message, None, None data = eval_input.model_dump() - result = extract(expression, data) + try: + result = extract(expression, data) + except Exception as exc: + return ( + None, + SkippedReason.extraction_failed, + f"Expression '{expression}' raised: {exc}", + ) if isinstance(result, Undefined): @@ def check_required_vars( @@ data = eval_input.model_dump() for var_expr in required_vars: - result = extract(var_expr, data) + try: + result = extract(var_expr, data) + except Exception as exc: + return ( + SkippedReason.extraction_failed, + f"required_var '{var_expr}' raised: {exc}", + ) if isinstance(result, Undefined): return ( SkippedReason.extraction_failed, f"required_var '{var_expr}' resolved to undefined", )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py` around lines 33 - 40, Wrap the call to extract(expression, data) in a try/except so any exception (e.g., ValueError or runtime errors from compiled(**data)) is caught and translated into the same skip tuple used for Undefined; keep the existing check for isinstance(result, Undefined) but add an except Exception as e that returns (None, SkippedReason.extraction_failed, f"Expression '{expression}' raised error: {e}") and ensure you import/keep references to extract, Undefined, and SkippedReason from the current module.

coderabbitai · 2026-06-08T17:26:41Z

+            if isinstance(value, int):
+                value = float(value)
+            if not isinstance(value, float):
+                raise RuntimeError(
+                    f"Score '{key}' must be a float, got {type(value).__name__}"
+                )
+            validated[key] = value


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject non-finite score values before returning EvalScores.

Line 119 currently accepts NaN/Infinity as valid floats; those values can corrupt score aggregation and persisted metrics.

🔧 Proposed fix

import asyncio +import math from threading import Lock from typing import Any @@ if isinstance(value, int): value = float(value) if not isinstance(value, float): raise RuntimeError( f"Score '{key}' must be a float, got {type(value).__name__}" ) + if not math.isfinite(value): + raise RuntimeError( + f"Score '{key}' must be a finite float, got {value!r}" + ) validated[key] = value

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if isinstance(value, int):

value = float(value)

if not isinstance(value, float):

raise RuntimeError(

f"Score '{key}' must be a float, got {type(value).__name__}"

)

validated[key] = value

if isinstance(value, int):

value = float(value)

if not isinstance(value, float):

raise RuntimeError(

f"Score '{key}' must be a float, got {type(value).__name__}"

)

if not math.isfinite(value):

raise RuntimeError(

f"Score '{key}' must be a finite float, got {value!r}"

)

validated[key] = value

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py` around lines 117 - 123, The current validation block (the one assigning into validated[key]) accepts float values like NaN or Infinity; update it to reject non-finite floats before populating validated (and before returning EvalScores) by using math.isfinite on the float-converted value and raising a RuntimeError if not finite. Keep the existing int->float conversion and type check for value, then add a finiteness check referencing the same symbols (value, key, validated) and include a clear error message like "Score '{key}' must be a finite float, got {value}". Ensure math is imported where this validation lives.

coderabbitai · 2026-06-08T17:26:41Z

+        try:
+            match = re.search(props.pattern, actual)
+        except re.error as e:
+            return (
+                {},
+                SkippedReason.extraction_failed,
+                f"Invalid regex pattern '{props.pattern}': {e}",
+            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Invalid regex should fail fast as config error, not be reported as extraction skip.

Returning SkippedReason.extraction_failed here masks a static eval-config bug and can silently skip every run. This should raise a hard error (or a dedicated config-validation error) so misconfiguration is surfaced immediately.

Suggested fix

try: match = re.search(props.pattern, actual) except re.error as e: - return ( - {}, - SkippedReason.extraction_failed, - f"Invalid regex pattern '{props.pattern}': {e}", - ) + raise ValueError(f"Invalid regex pattern '{props.pattern}': {e}") from e

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py` around lines 31 - 38, The code is treating an invalid regex as an extraction skip; update the error handling in the block around re.search(props.pattern, actual) so an invalid pattern raises a hard configuration/validation error instead of returning SkippedReason.extraction_failed—replace the except re.error branch that returns ( {}, SkippedReason.extraction_failed, ... ) with a raised exception (e.g., ValueError or a dedicated ConfigValidationError) that includes the pattern and original exception message so the misconfiguration surfaces immediately; keep references to re.search(props.pattern, actual) and SkippedReason.extraction_failed to locate and change the logic.

coderabbitai · 2026-06-08T17:26:41Z

+        props = self.properties
+        assert isinstance(props, StepCountCheckProperties)
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use an explicit runtime check instead of assert for properties typing.

Line 26’s assert can be optimized out (-O), removing this guard in runtime builds. Prefer an explicit if not isinstance(...): raise ValueError(...) to keep behavior deterministic.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py` around lines 25 - 27, Replace the runtime-optimizable assert with an explicit type check: after props = self.properties, verify with if not isinstance(props, StepCountCheckProperties): raise ValueError("expected StepCountCheckProperties for self.properties") so the guard in V2EvalStepCountCheck (using properties and StepCountCheckProperties) remains effective in optimized builds.

coderabbitai · 2026-06-08T17:26:41Z

+
+| Type | Result renderer adds (under the score badge) | Read-only config-detail shows |
+|---|---|---|
+| `llm_judge` | The judge's **reasoning text** + per-criterion pass/fail breakdown (the structured judge output already persisted). | The criteria/rubric, `g_eval_mode` flag, judge model. (`system_prompt`/`prompt_template` collapsed under "Advanced".) |


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the canonical g_eval field name in the renderer guidance.

This line references g_eval_mode, but the V2 property is documented elsewhere in this PR as g_eval. Keeping the old name here risks implementing the wrong binding in UI details.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/projects/evals_v2/components/70_builder_and_onboarding.md` at line 280, The renderer guidance incorrectly references the old property name g_eval_mode; update that reference to the canonical V2 field g_eval so the llm_judge binding uses g_eval (ensure any text mentioning g_eval_mode in the llm_judge renderer guidance is replaced and consistent with the documented g_eval property).

Per-type registry with exhaustive switch over V2EvalType (assertNever guard) and pluggable create container with type picker. Six deterministic create forms (exact_match, pattern_match, contains, set_check, tool_call_check, step_count_check) plus llm_judge form extracted from the monolithic page. Code_eval (Beta) create UI: lazy-loaded CodeMirror 6 Python editor, starter-code examples, ephemeral trust modal (grant/check endpoints), sandboxed preview via the test-run endpoint. Eight per-type result renderers integrated into the three view surfaces (run_result page, eval detail page, eval_configs page). New `test_v2_eval` preview endpoint for try-it-before-save. OpenAPI schema regenerated (includes test_v2_eval + grant/check trust endpoints). Follow-ups: (a) Persist V2 run-level detail (llm_judge reasoning, code_eval stdout/stderr/traceback) so result renderers can show live per-run detail — deferred because the V2 evaluate() contract returns only (scores, skip, detail) with no run-detail slot. (b) run_config_comparison_table.svelte intentionally left unchanged (score-type-agnostic, per product decision). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

socket-security · 2026-06-08T20:54:50Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	pypi/pillow@11.1.0
	pypi/pystray@0.19.5
	pypi/pyinstaller@6.15.0
	npm/codemirror@6.0.2
	npm/@codemirror/state@6.6.0
	pypi/pillow@12.2.0
	npm/@codemirror/lang-python@6.2.1
	npm/@codemirror/commands@6.10.3
	npm/@codemirror/language@6.12.3
	npm/@codemirror/view@6.43.0
	pypi/kiln-server@1.0.0

View full report

coderabbitai

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/web_ui/src/lib/components/code_editor.svelte`:
- Around line 15-90: The editor only sets its document once in onMount via
EditorState.create so subsequent updates to the exported prop value are ignored;
add a reactive watcher that, when value changes after mount and view exists,
compares value to view.state.doc.toString() and, if different, calls
view.dispatch(...) to replace the document (same change shape used in export
function setValue) to keep the editor in sync with external prop updates (refer
to onMount, EditorState.create, view, value and export function setValue).

In `@app/web_ui/src/lib/components/eval_types/code_eval_form.svelte`:
- Around line 33-44: The timeout value may be outside the documented 1–300s
range when submitted; clamp/normalize timeout_seconds to that range before
exposing it in getProperties (and the other similar exporter around 161–168).
Locate the timeout_seconds variable and the export function getProperties(),
coerce timeout_seconds to a number, clamp it to Math.max(1, Math.min(300,
Number(timeout_seconds) || 1)), assign the normalized value back to
properties.timeout_seconds, and return that normalized timeout_seconds in the
payload so invalid numeric states cannot leak into the submitted properties.

In `@app/web_ui/src/lib/components/eval_types/code_eval_form.test.ts`:
- Around line 62-68: The tests for CodeEvalForm currently check for the timeout
label and help text but never assert the actual timeout value; update the
"renders the timeout input with default value of 30" test (render(CodeEvalForm))
to locate the timeout input (e.g., via getByLabelText("Timeout (seconds)") or
querySelector for the timeout input) and assert its value/prop equals "30" or 30
(depending on input type); likewise, in the "updated timeout" test (the test
that simulates changing timeout_seconds to 60) add an assertion after the change
to verify the component state/prop or the input value reflects 60 (e.g.,
expect(timeoutInput.value).toBe("60") or
expect(wrapper.vm.timeout_seconds).toBe(60)), referencing the CodeEvalForm
timeout input and the timeout_seconds property to locate where to add the
assertions.

In `@app/web_ui/src/lib/components/eval_types/exact_match_form.svelte`:
- Around line 17-22: The type union for the local variable `source` in
ExactMatchForm (the `source` declaration) incorrectly includes
"value_expression" even though the initializer only ever sets "expected_value"
or "reference_key"; remove "value_expression" from the union and update the
declaration to be only "expected_value" | "reference_key" so the type matches
the actual assignment and consumers of `source` (e.g., any branches or bindings
that check `source`) no longer need to handle an unreachable variant.

In `@app/web_ui/src/lib/components/eval_types/llm_judge_form.svelte`:
- Around line 190-194: The suggested-model click handler incorrectly assigns
display labels to provider_name and model_name which breaks capability
resolution in update_unsupported_algos_and_default_algo; change the handler to
set provider_id and model_id (e.g., provider_id = model.provider_id and model_id
= model.model_id) and keep combined_model_name as
`${model.provider_id}/${model.model_id}` while optionally also setting
provider_name/model_name to their display labels if UI text needs
updating—ensure update_unsupported_algos_and_default_algo continues to receive
the ID pair (provider_id + model_id) for capability detection.

In `@app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte`:
- Around line 11-19: The current reactive assignment for props casts
eval_config.properties to the llm_judge shape whenever a "type" key exists;
tighten the guard to only cast when eval_config?.properties?.type ===
"llm_judge" (and optionally verify model_name/model_provider keys exist) so
non-llm_judge configs won't be treated as llm_judge; update the $: props = ...
expression to check the explicit type value before casting to the { type:
"llm_judge", model_name: string, model_provider: string, g_eval: boolean }
shape.

In `@app/web_ui/src/lib/components/eval_types/set_check_form.svelte`:
- Around line 23-27: The reactive assignment to properties.expected_set (the $:
block that sets properties.expected_set from expected_set_text) overwrites the
explicit nulling when the source is reference_key; change that reactive update
to be conditional so it only runs when the source is not "reference_key" (e.g.,
guard on properties.source or selected_source !== "reference_key"), or
alternatively set properties.expected_set = null inside that reactive block when
properties.source === "reference_key"; update the $: properties.expected_set =
... expression to respect the source and prevent restoring stale values (this
also fixes the same issue in the block around lines 53-59).

In `@app/web_ui/src/lib/components/eval_types/tool_call_check_form.svelte`:
- Around line 158-170: Buttons rendering argument controls (the remove and add
actions) are missing explicit type attributes so they default to submit inside
the parent form causing unintended form submission; update the two button
elements that call remove_arg(item_index, arg_index) and add_arg(item_index) to
include type="button" so they act as plain buttons and do not submit the
enclosing form (look for the button with on:click={() => remove_arg(item_index,
arg_index)} and the button with on:click={() => add_arg(item_index)} and add
type="button" to each).

In `@app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte`:
- Around line 15-23: The reactive assignment for props currently assumes any
eval_config.properties with a type field is the tool_call_check shape; change
the guard to explicitly check eval_config?.properties?.type ===
"tool_call_check" before casting to the tool_call shape (the variable is props)
and ensure downstream uses (e.g. accessing props.expected_tools.length around
the code that uses props) handle props being null/undefined by conditional
checks (or optional chaining) so the component never dereferences undefined when
a different eval type is provided.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte:
- Around line 159-165: When switching eval types you need to reset the test-run
state so a previously run test doesn't allow saving a new untested type: in
select_v2_type(type: V2EvalType) and go_back_to_type_picker() clear the test
flag (reset test_has_run to false and any related test result state) immediately
after changing selected_v2_type; ensure the same reset happens both when setting
selected_v2_type = type and when setting it back to null so the save gate that
checks test_has_run cannot be bypassed.
- Around line 291-297: The grant_trust_and_retry function is currently assigning
failures to test_error instead of the main submit error channel; change the
catch handler inside grant_trust_and_retry (which calls
grantCodeEvalTrust(project_id)) to set create_evaluator_error =
createKilnError(e) rather than test_error and then return false so the UI
surfaces trust-grant failures on the main submit error path.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/page.test.ts:
- Around line 293-369: The test "retries test after granting trust via dialog"
never exercises the retry path: it only asserts mockTestV2Eval was called once
and never invokes the dialog's async action or mockGrantCodeEvalTrust, so the
grant-and-retry behavior isn't validated. Fix by simulating the dialog action
after locating the dialog stub (querySelectorAll('[data-testid="dialog-stub"]')
and matching data-title "Allow Code Execution"), then call the grant action
sequence: invoke mockGrantCodeEvalTrust (or simulate the dialog's
asyncAction/grant_trust_and_retry) so that mockGrantCodeEvalTrust is awaited and
mockTestV2Eval is invoked a second time; finally update assertions to expect
mockGrantCodeEvalTrust toHaveBeenCalled() and mockTestV2Eval
toHaveBeenCalledTimes(2) to confirm retry occurred.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 727249cd-184f-435a-a580-8335684c48e8

📥 Commits

Reviewing files that changed from the base of the PR and between f3777c8 and 99f0c43.

⛔ Files ignored due to path filters (1)

app/web_ui/package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (54)

app/desktop/studio_server/eval_api.py
app/desktop/studio_server/test_eval_api.py
app/web_ui/package.json
app/web_ui/src/lib/api/v2_eval_api.test.ts
app/web_ui/src/lib/api/v2_eval_api.ts
app/web_ui/src/lib/api_schema.d.ts
app/web_ui/src/lib/components/code_editor.svelte
app/web_ui/src/lib/components/eval_types/__tests__/code_editor_stub.svelte
app/web_ui/src/lib/components/eval_types/__tests__/dialog_stub.svelte
app/web_ui/src/lib/components/eval_types/code_eval_form.svelte
app/web_ui/src/lib/components/eval_types/code_eval_form.test.ts
app/web_ui/src/lib/components/eval_types/code_eval_result.svelte
app/web_ui/src/lib/components/eval_types/code_eval_result.test.ts
app/web_ui/src/lib/components/eval_types/contains_form.svelte
app/web_ui/src/lib/components/eval_types/contains_result.svelte
app/web_ui/src/lib/components/eval_types/contains_result.test.ts
app/web_ui/src/lib/components/eval_types/eval_result_scores.svelte
app/web_ui/src/lib/components/eval_types/eval_result_scores.test.ts
app/web_ui/src/lib/components/eval_types/exact_match_form.svelte
app/web_ui/src/lib/components/eval_types/exact_match_result.svelte
app/web_ui/src/lib/components/eval_types/exact_match_result.test.ts
app/web_ui/src/lib/components/eval_types/llm_judge_form.svelte
app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte
app/web_ui/src/lib/components/eval_types/llm_judge_result.test.ts
app/web_ui/src/lib/components/eval_types/pattern_match_form.svelte
app/web_ui/src/lib/components/eval_types/pattern_match_result.svelte
app/web_ui/src/lib/components/eval_types/pattern_match_result.test.ts
app/web_ui/src/lib/components/eval_types/set_check_form.svelte
app/web_ui/src/lib/components/eval_types/set_check_result.svelte
app/web_ui/src/lib/components/eval_types/set_check_result.test.ts
app/web_ui/src/lib/components/eval_types/step_count_check_form.svelte
app/web_ui/src/lib/components/eval_types/step_count_check_result.svelte
app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
app/web_ui/src/lib/components/eval_types/tool_call_check_form.svelte
app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte
app/web_ui/src/lib/components/eval_types/tool_call_check_result.test.ts
app/web_ui/src/lib/utils/eval_types/eval_steps_utils.ts
app/web_ui/src/lib/utils/eval_types/registry.test.ts
app/web_ui/src/lib/utils/eval_types/registry.ts
app/web_ui/src/lib/utils/exhaustive.ts
app/web_ui/src/lib/utils/formatters.ts
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/+page.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/[eval_config_id]/[run_config_id]/run_result/+page.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/app_page_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/collapse_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/dialog_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/form_container_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/llm_judge_form_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/v2_form_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/page.test.ts
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/+page.svelte
specs/projects/evals_v2/implementation_plan.md
specs/projects/evals_v2/phase_plans/phase_6.md

✅ Files skipped from review due to trivial changes (9)

app/web_ui/src/lib/components/eval_types/tests/dialog_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/collapse_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/form_container_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/llm_judge_form_stub.svelte
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/v2_form_stub.svelte
app/web_ui/package.json
app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/app_page_stub.svelte
app/web_ui/src/lib/api_schema.d.ts

🚧 Files skipped from review as they are similar to previous changes (2)

app/desktop/studio_server/test_eval_api.py
app/desktop/studio_server/eval_api.py

coderabbitai · 2026-06-08T21:03:12Z

+  onMount(async () => {
+    const [
+      { EditorView, keymap, placeholder: placeholderExt, lineNumbers },
+      { EditorState },
+      { python },
+      { defaultKeymap, history, historyKeymap },
+      { syntaxHighlighting, defaultHighlightStyle },
+    ] = await Promise.all([
+      import("@codemirror/view"),
+      import("@codemirror/state"),
+      import("@codemirror/lang-python"),
+      import("@codemirror/commands"),
+      import("@codemirror/language"),
+    ])
+
+    const extensions = [
+      lineNumbers(),
+      history(),
+      syntaxHighlighting(defaultHighlightStyle),
+      python(),
+      keymap.of([...defaultKeymap, ...historyKeymap]),
+      EditorView.updateListener.of((update) => {
+        if (update.docChanged) {
+          value = update.state.doc.toString()
+          dispatch("change", value)
+        }
+      }),
+      EditorView.theme({
+        "&": { minHeight: min_height },
+        ".cm-scroller": { overflow: "auto" },
+        ".cm-content": { fontFamily: "monospace", fontSize: "14px" },
+        ".cm-gutters": {
+          backgroundColor: "transparent",
+          borderRight: "1px solid oklch(var(--bc) / 0.2)",
+        },
+      }),
+    ]
+
+    if (placeholder) {
+      extensions.push(placeholderExt(placeholder))
+    }
+
+    if (readonly) {
+      extensions.push(EditorState.readOnly.of(true))
+    }
+
+    view = new EditorView({
+      state: EditorState.create({
+        doc: value,
+        extensions,
+      }),
+      parent: container,
+    })
+
+    loading = false
+  })
+
+  onDestroy(() => {
+    view?.destroy()
+  })
+
+  export function setValue(newValue: string) {
+    if (view) {
+      const currentValue = view.state.doc.toString()
+      if (currentValue !== newValue) {
+        view.dispatch({
+          changes: {
+            from: 0,
+            to: view.state.doc.length,
+            insert: newValue,
+          },
+        })
+      }
+    }
+    value = newValue
+  }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

value is treated as init-only; external prop updates are ignored after mount.

If the parent updates value later, the editor content stays stale because the doc is only set during initial EditorState.create (Line 62-64).

Suggested fix

onDestroy(() => { view?.destroy() }) + $: if (view) { + const currentValue = view.state.doc.toString() + if (currentValue !== value) { + view.dispatch({ + changes: { + from: 0, + to: view.state.doc.length, + insert: value, + }, + }) + } + } + export function setValue(newValue: string) {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/components/code_editor.svelte` around lines 15 - 90, The editor only sets its document once in onMount via EditorState.create so subsequent updates to the exported prop value are ignored; add a reactive watcher that, when value changes after mount and view exists, compares value to view.state.doc.toString() and, if different, calls view.dispatch(...) to replace the document (same change shape used in export function setValue) to keep the editor in sync with external prop updates (refer to onMount, EditorState.create, view, value and export function setValue).

coderabbitai · 2026-06-08T21:03:12Z

+  let timeout_seconds: number = properties.timeout_seconds ?? 30
+
+  $: properties.timeout_seconds = timeout_seconds
+
+  export function getProperties(): components["schemas"]["CodeEvalProperties"] & {
+    timeout_seconds?: number
+  } {
+    return {
+      type: "code_eval",
+      code: properties.code,
+      timeout_seconds,
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize timeout to the documented 1–300 range before submit.

The UI text says 1–300 seconds, but getProperties() forwards raw timeout_seconds. Invalid numeric states can leak into the payload.

Proposed fix

export function getProperties(): components["schemas"]["CodeEvalProperties"] & { timeout_seconds?: number } { + const normalized_timeout = Number.isFinite(timeout_seconds) + ? Math.min(300, Math.max(1, Math.trunc(timeout_seconds))) + : 30 return { type: "code_eval", code: properties.code, - timeout_seconds, + timeout_seconds: normalized_timeout, } }

Also applies to: 161-168

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/components/eval_types/code_eval_form.svelte` around lines 33 - 44, The timeout value may be outside the documented 1–300s range when submitted; clamp/normalize timeout_seconds to that range before exposing it in getProperties (and the other similar exporter around 161–168). Locate the timeout_seconds variable and the export function getProperties(), coerce timeout_seconds to a number, clamp it to Math.max(1, Math.min(300, Number(timeout_seconds) || 1)), assign the normalized value back to properties.timeout_seconds, and return that normalized timeout_seconds in the payload so invalid numeric states cannot leak into the submitted properties.

coderabbitai · 2026-06-08T21:03:12Z

+  it("renders the timeout input with default value of 30", () => {
+    const { container } = render(CodeEvalForm)
+    expect(container.textContent).toContain("Timeout (seconds)")
+    expect(container.textContent).toContain(
+      "Maximum time allowed for the score function to execute",
+    )
+  })


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Timeout tests don’t verify timeout values, so regressions can slip through.

Line 62’s test title says default is 30 but never asserts input/property value. Line 78’s “updated timeout” test also never asserts timeout_seconds changed to 60.

Suggested fix

it("renders the timeout input with default value of 30", () => { - const { container } = render(CodeEvalForm) + const { container, component } = render(CodeEvalForm) expect(container.textContent).toContain("Timeout (seconds)") expect(container.textContent).toContain( "Maximum time allowed for the score function to execute", ) + expect(component.getProperties().timeout_seconds).toBe(30) }) @@ it("produces CodeEvalProperties with updated timeout", async () => { const { component, container } = render(CodeEvalForm) @@ - if (input) { - await fireEvent.input(input, { target: { value: "60" } }) - } + expect(input).not.toBeNull() + await fireEvent.input(input, { target: { value: "60" } }) const props = component.getProperties() expect(props.type).toBe("code_eval") + expect(props.timeout_seconds).toBe(60) })

Also applies to: 78-90

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/components/eval_types/code_eval_form.test.ts` around lines 62 - 68, The tests for CodeEvalForm currently check for the timeout label and help text but never assert the actual timeout value; update the "renders the timeout input with default value of 30" test (render(CodeEvalForm)) to locate the timeout input (e.g., via getByLabelText("Timeout (seconds)") or querySelector for the timeout input) and assert its value/prop equals "30" or 30 (depending on input type); likewise, in the "updated timeout" test (the test that simulates changing timeout_seconds to 60) add an assertion after the change to verify the component state/prop or the input value reflects 60 (e.g., expect(timeoutInput.value).toBe("60") or expect(wrapper.vm.timeout_seconds).toBe(60)), referencing the CodeEvalForm timeout input and the timeout_seconds property to locate where to add the assertions.

coderabbitai · 2026-06-08T21:03:13Z

+            on:click={() => {
+              model_name = model.model_name
+              provider_name = model.provider_name
+              combined_model_name = `${model.provider_id}/${model.model_id}`
+            }}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Suggested-model click handler stores display names instead of IDs.

update_unsupported_algos_and_default_algo (Line 147-149) resolves model capabilities via provider_id + model id, but the suggested-model click path sets provider_name/model_name to display labels. This breaks capability detection and can expose unsupported algo choices/defaults.

Proposed fix

on:click={() => { - model_name = model.model_name - provider_name = model.provider_name + model_name = model.model_id + provider_name = model.provider_id combined_model_name = `${model.provider_id}/${model.model_id}` }}

Also applies to: 141-149

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/components/eval_types/llm_judge_form.svelte` around lines 190 - 194, The suggested-model click handler incorrectly assigns display labels to provider_name and model_name which breaks capability resolution in update_unsupported_algos_and_default_algo; change the handler to set provider_id and model_id (e.g., provider_id = model.provider_id and model_id = model.model_id) and keep combined_model_name as `${model.provider_id}/${model.model_id}` while optionally also setting provider_name/model_name to their display labels if UI text needs updating—ensure update_unsupported_algos_and_default_algo continues to receive the ID pair (provider_id + model_id) for capability detection.

coderabbitai · 2026-06-08T21:03:13Z

+          <button
+            class="btn btn-ghost btn-sm btn-square"
+            on:click={() => remove_arg(item_index, arg_index)}
+            title="Remove argument"
+          >
+            ✕
+          </button>
+        </div>
+      {/each}
+      <button
+        class="btn btn-ghost btn-sm self-start"
+        on:click={() => add_arg(item_index)}
+      >


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set explicit type="button" on non-submit controls.

Line 158 and Line 167 buttons default to submit behavior in form contexts, which can prematurely submit the parent config form while editing arg rows.

Proposed fix

<button + type="button" class="btn btn-ghost btn-sm btn-square" on:click={() => remove_arg(item_index, arg_index)} title="Remove argument" > @@ <button + type="button" class="btn btn-ghost btn-sm self-start" on:click={() => add_arg(item_index)} >

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

<button

class="btn btn-ghost btn-sm btn-square"

on:click={() => remove_arg(item_index, arg_index)}

title="Remove argument"

>

✕

</button>

</div>

{/each}

<button

class="btn btn-ghost btn-sm self-start"

on:click={() => add_arg(item_index)}

>

<button

type="button"

class="btn btn-ghost btn-sm btn-square"

on:click={() => remove_arg(item_index, arg_index)}

title="Remove argument"

>

✕

</button>

</div>

{/each}

<button

type="button"

class="btn btn-ghost btn-sm self-start"

on:click={() => add_arg(item_index)}

>

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/components/eval_types/tool_call_check_form.svelte` around lines 158 - 170, Buttons rendering argument controls (the remove and add actions) are missing explicit type attributes so they default to submit inside the parent form causing unintended form submission; update the two button elements that call remove_arg(item_index, arg_index) and add_arg(item_index) to include type="button" so they act as plain buttons and do not submit the enclosing form (look for the button with on:click={() => remove_arg(item_index, arg_index)} and the button with on:click={() => add_arg(item_index)} and add type="button" to each).

coderabbitai · 2026-06-08T21:03:13Z

+  function select_v2_type(type: V2EvalType) {
+    selected_v2_type = type
+  }

-  function select_evaluator(algo: EvalConfigType) {
-    selected_algo = algo
+  function go_back_to_type_picker() {
+    selected_v2_type = null
  }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset test-run state when switching eval types.

test_has_run is never cleared when changing selection (Line 159 and Line 163), so the save gate at Line 325 can be bypassed by testing one type and then saving a different untested type.

Suggested fix

+ function reset_test_state() { + test_error = null + test_result = null + test_has_run = false + } + function select_v2_type(type: V2EvalType) { selected_v2_type = type + reset_test_state() } function go_back_to_type_picker() { selected_v2_type = null + reset_test_state() }

Also applies to: 325-327

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte around lines 159 - 165, When switching eval types you need to reset the test-run state so a previously run test doesn't allow saving a new untested type: in select_v2_type(type: V2EvalType) and go_back_to_type_picker() clear the test flag (reset test_has_run to false and any related test result state) immediately after changing selected_v2_type; ensure the same reset happens both when setting selected_v2_type = type and when setting it back to null so the save gate that checks test_has_run cannot be bypassed.

coderabbitai · 2026-06-08T21:03:13Z

+  async function grant_trust_and_retry(): Promise<boolean> {
+    try {
+      await grantCodeEvalTrust(project_id)
+    } catch (e) {
+      test_error = createKilnError(e)
+      return false
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Route trust-grant errors to the correct UI error channel.

When trust grant fails during a save flow, the error is written to test_error instead of create_evaluator_error, so users can miss the failure in the main submit path.

Suggested fix

async function grant_trust_and_retry(): Promise<boolean> { + const action = pending_trust_action try { await grantCodeEvalTrust(project_id) } catch (e) { - test_error = createKilnError(e) + const kilnError = createKilnError(e) + if (action === "save") { + create_evaluator_error = kilnError + } else { + test_error = kilnError + } return false } - const action = pending_trust_action pending_trust_action = null if (action === "test") { await run_test()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte around lines 291 - 297, The grant_trust_and_retry function is currently assigning failures to test_error instead of the main submit error channel; change the catch handler inside grant_trust_and_retry (which calls grantCodeEvalTrust(project_id)) to set create_evaluator_error = createKilnError(e) rather than test_error and then return false so the UI surfaces trust-grant failures on the main submit error path.

coderabbitai · 2026-06-08T21:03:13Z

+    it("retries test after granting trust via dialog", async () => {
+      const { container } = await renderPage()
+
+      await selectEvalType(container, "Code Eval")
+      await tick()
+
+      const textarea = container.querySelector(
+        "#test_final_message",
+      ) as HTMLTextAreaElement
+      await fireEvent.input(textarea, { target: { value: "test output" } })
+      await tick()
+
+      // First call: not trusted. Second call (after grant): success.
+      mockTestV2Eval
+        .mockResolvedValueOnce({
+          scores: {},
+          skipped_reason: "code_eval_not_trusted",
+          skipped_detail: null,
+        })
+        .mockResolvedValueOnce({
+          scores: { quality: 0.9 },
+          skipped_reason: null,
+          skipped_detail: null,
+        })
+
+      mockGrantCodeEvalTrust.mockResolvedValueOnce({ trusted: true })
+
+      // Click Try It – triggers the not-trusted path
+      const tryBtn = container.querySelector(
+        "button.btn-primary.btn-sm",
+      ) as HTMLButtonElement
+      await fireEvent.click(tryBtn)
+      await tick()
+      await new Promise((r) => setTimeout(r, 0))
+      await tick()
+
+      // Trust dialog was shown
+      expect(showCalls).toContain("Allow Code Execution")
+
+      // Now find the trust dialog's action button in the DOM and examine it.
+      // Since our Dialog stub records calls but doesn't render action buttons,
+      // we invoke grant_trust_and_retry directly by finding the dialog's
+      // asyncAction and calling it. We can do this by looking at the
+      // data-action-buttons attribute or by accessing the component directly.
+      //
+      // The dialog stub serialises action_buttons. Find the trust dialog.
+      const dialogs = container.querySelectorAll('[data-testid="dialog-stub"]')
+      let trustDialogEl: Element | null = null
+      for (const d of dialogs) {
+        if (d.getAttribute("data-title") === "Allow Code Execution") {
+          trustDialogEl = d
+        }
+      }
+      expect(trustDialogEl).not.toBeNull()
+      const actionBtns = JSON.parse(
+        trustDialogEl!.getAttribute("data-action-buttons") || "[]",
+      )
+      const grantBtn = actionBtns.find(
+        (b: Record<string, unknown>) => b.isWarning,
+      )
+      expect(grantBtn).toBeTruthy()
+
+      // The asyncAction in the real component is grant_trust_and_retry.
+      // Since the stub doesn't wire up action_buttons, we simulate
+      // what the real Dialog does: call the asyncAction.
+      // We imported the page component; the function is internal.
+      // Instead, let's invoke it via the mock chain: grantCodeEvalTrust
+      // should be called, then testV2Eval retried.
+
+      // We can get the page component instance and call the method,
+      // but it's private. The cleaner approach: directly test the
+      // observable side effects by verifying mock calls.
+
+      // Since we can't click a real dialog button (stub doesn't render them),
+      // verify the state: testV2Eval was called once, trust dialog shown.
+      expect(mockTestV2Eval).toHaveBeenCalledTimes(1)
+    })


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The retry-path test never executes the retry path.

On Line 293, the case says it verifies “retries test after granting trust,” but it never triggers the dialog action and ends by asserting mockTestV2Eval was called only once (Line 368). This will pass even if grant-and-retry is broken.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/page.test.ts around lines 293 - 369, The test "retries test after granting trust via dialog" never exercises the retry path: it only asserts mockTestV2Eval was called once and never invokes the dialog's async action or mockGrantCodeEvalTrust, so the grant-and-retry behavior isn't validated. Fix by simulating the dialog action after locating the dialog stub (querySelectorAll('[data-testid="dialog-stub"]') and matching data-title "Allow Code Execution"), then call the grant action sequence: invoke mockGrantCodeEvalTrust (or simulate the dialog's asyncAction/grant_trust_and_retry) so that mockGrantCodeEvalTrust is awaited and mockTestV2Eval is invoked a second time; finally update assertions to expect mockGrantCodeEvalTrust toHaveBeenCalled() and mockTestV2Eval toHaveBeenCalledTimes(2) to confirm retry occurred.

never mode and on_unexpected_tools="fail" used name-only matching and ignored expected_args (spec components/22 §3.5, J.37). Both now use the full _call_matches_spec matcher. Adds tests for never + expected_args and on_unexpected + expected_args, the previously-uncovered cases that let the bug ship. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds the coexistence tests architecture.md §5 elevates to first-class: - V1 EvalRun loads with new optional fields defaulting to None (+ round-trip) - V1 EvalConfig with absent config_type dispatches via the legacy path - V1 g_eval whose properties dict carries a "type" key is not misrouted into the V2 discriminated union (in-memory + round-trip) - V1 g_eval / llm_as_judge create + report unchanged through the API No production code changed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aseEval V2 EvalConfigs silently judged the dataset item's stale output in task_run_eval mode (ignoring the selected run config) because V2 adapters were built on a judge-only BaseV2Eval fork with no fresh-output path. - Unfork BaseV2Eval -> single BaseEval via thin BaseV2EvalBridge (C.11c) - V2 task_run_eval generates fresh output (run_task) then scores it, with skip handling matching eval_config_eval - Extract run_task from run_task_and_eval (pure refactor; legacy unchanged) - Design B: build EvalTaskInput per source via EvalTaskInput.from_task_run; remove dead EvalJob.stored_output/stored_trace - EvalInput-source execution deferred to Task B (clean NotImplementedError) - Tests: fresh-gen distinctness + skip-during-task_run_eval coverage Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…l_config_eval EvalInput-source V2 evals were non-functional: the runner rejected task_run_eval + eval_input_filter_id at construction and the EvalInput branch raised NotImplementedError. - Constructor accepts EvalInput sources in task_run_eval (C.runner.3) - collect_tasks_for_eval_input crosses inputs x configs x run_configs (3-level dedup) - EvalInput + task_run_eval: generate fresh output from user_message via run_task, judge with reference_data from EvalInput.reference; persist EvalRun with eval_input_id - EvalInput + eval_config_eval: clean skip (incompatible_input_shape) — deferred per B2.1/K.3 (golden subsets stay TaskRuns in V2.0) - run_task widened to TaskRun | EvalInput; EvalTaskInput.from_eval_input added - Tests: fresh-gen + reference_data, skip paths, collection dedup, run_task Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

libs/core/kiln_ai/adapters/eval/eval_runner.py (1)

245-247: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist source dataset IDs for V2 task_run_eval runs.

At Line 494, dataset_id is set to run_output.id (fresh run), but dedupe/coverage logic keys by source dataset IDs (Lines 245-247 and Line 260). This breaks rerun deduplication and percent-complete aggregation for V2 task-run evals.

Suggested fix

-            dataset_id = run_output.id
+            dataset_id = job.item.id

Also applies to: 260-260, 492-495

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` around lines 245 - 247, The
dedupe/coverage logic is storing the fresh run id (run_output.id) instead of the
original source dataset id, which breaks rerun deduplication for V2 task-run
evals; update the places that add to and check already_run (references:
already_run, eval_config.id, run.task_run_config_id, run.dataset_id,
run_output.id, task_run_eval) so they persist the source dataset id extracted
from the incoming run (use run.dataset_id or the source id field from the V2
task_run_eval payload) rather than run_output.id when recording and comparing
entries—ensure both the addition (where
already_run[eval_config.id][run.task_run_config_id].add(...)) and any
checks/aggregation that reference already_run use that source id consistently.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Around line 245-247: The dedupe/coverage logic is storing the fresh run id
(run_output.id) instead of the original source dataset id, which breaks rerun
deduplication for V2 task-run evals; update the places that add to and check
already_run (references: already_run, eval_config.id, run.task_run_config_id,
run.dataset_id, run_output.id, task_run_eval) so they persist the source dataset
id extracted from the incoming run (use run.dataset_id or the source id field
from the V2 task_run_eval payload) rather than run_output.id when recording and
comparing entries—ensure both the addition (where
already_run[eval_config.id][run.task_run_config_id].add(...)) and any
checks/aggregation that reference already_run use that source id consistently.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 92090d71-5642-42cb-a644-882a7bb677d2

📥 Commits

Reviewing files that changed from the base of the PR and between 99f0c43 and 55ec5e8.

📒 Files selected for processing (25)

app/desktop/studio_server/test_eval_api.py
libs/core/kiln_ai/adapters/eval/__init__.py
libs/core/kiln_ai/adapters/eval/base_eval.py
libs/core/kiln_ai/adapters/eval/eval_runner.py
libs/core/kiln_ai/adapters/eval/registry.py
libs/core/kiln_ai/adapters/eval/test_eval_runner.py
libs/core/kiln_ai/adapters/eval/test_registry.py
libs/core/kiln_ai/adapters/eval/test_v2_contains.py
libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/test_v2_exact_match.py
libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
libs/core/kiln_ai/datamodel/eval.py
libs/core/kiln_ai/datamodel/test_eval_model.py

💤 Files with no reviewable changes (1)

libs/core/kiln_ai/adapters/eval/init.py

🚧 Files skipped from review as they are similar to previous changes (15)

libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
libs/core/kiln_ai/adapters/eval/registry.py
libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/test_eval_runner.py
libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/test_registry.py
libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
libs/core/kiln_ai/datamodel/eval.py

… V2.0 Restore V1's battle-tested discrete scoring for V2 launch. RAG judge templates (faithfulness, answer_relevance, context_relevance, context_precision, hallucination, answer_correctness) were net-new in V2 and introduced direct-float model output that deviated from V1's discrete convention, leaked leniency into the shared scorer, and lacked first-class continuous-score support (UI, data model, docs, g-eval compatibility). Changes: - Delete rag_judge_templates.py and its tests (576 lines removed) - Revert V2 llm_judge to discrete output (allow_float_scores=False) - Restore pristine strict scorer in scoring_utils.py (remove float leniency: isinstance int/float early-return and float() fallback) - Remove _filter_output_to_score_keys (RAG-specific rich JSON filtering) - Add tests: discrete score schema assertion, strict scorer rejection of non-mapping values (float strings, raw floats) - Annotate 13 spec files as deferred/punted with pointers to /specs/projects/rag_templates/ - Create bring-back overview at specs/projects/rag_templates/project_overview.md documenting what was removed, why, design references, and complexities to solve before restoration Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Verify the RAG judge template + continuous scoring removal (commits 5efc626, 74154c0) landed correctly on the evals_v2 branch. All 5 checklist items pass: V1 scoring identical to main, V2 llm_judge discrete, no RAG refs in source, specs annotated as punted, and rag_templates bring-back plan records the removal SHA. Key finding: _filter_output_to_score_keys was removed with the RAG templates — Phase 2 item 1b is moot and should be skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d, dead-except removal Phase 2 of deep-cr-cleanup: - 1a: extract_value, check_required_vars, and check_reference_key now treat None as a skip signal alongside Jinja Undefined, preventing garbage from reaching the judge when reference data is missing or null. - 1b: DROPPED — _filter_output_to_score_keys was already removed with the RAG templates (commit 5efc626). - 1c: V1 EvalRun validate_output_fields now guards against output=None for non-skipped runs, catching the bug before it becomes a downstream AttributeError. - 5.4: Removed dead except SyntaxError: pass in CodeEvalProperties.validate_code — compile() already catches syntax errors upstream, so ast.parse() never raises SyntaxError for code that survived compile(). 22 new tests covering all changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…from_type; dedupe dispatch tests Rename the legacy V1 dispatcher to clearly signal it is the legacy path. Update V2-branch error message to point callers at v2_eval_adapter_from_config. Remove two exact-duplicate dispatch tests in test_registry.py (covered by parametrized and V2-perspective tests that remain). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…utput_scores, clearer error 5.8: Replace fragile double-get_args _V2_PROPERTY_TYPES derivation with an explicit V2_PROPERTY_TYPES tuple defined alongside the V2EvalConfigProperties union in eval.py, imported by base_eval.py and registry.py. 5.9: Cache output_scores in BaseV2EvalBridge.__init__ so build_binary_scores and CodeEvalAdapter._validate_scores avoid per-item parent_eval() disk I/O. Signature changed from (eval_config, passed) to (output_scores, passed); all 6 deterministic adapters + code_eval + llm_judge updated. 5.6: Clarify the dataset_id/eval_input_id mutual-exclusivity error message to explain V1 (TaskRun source) vs V2 (EvalInput source) distinction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ce affordances (5.3) In V2.0 no UI path populates reference_data on EvalInputs, so any eval-type form offering a "Reference Data Key" source option would produce configs that always silently skip at runtime. Add TODO(pre-ship 5.3) comments at the source selector in exact_match_form, set_check_form, and contains_form. CI's no-TODO-on-main rule makes these a hard pre-ship gate: either wire reference_data population or remove the affordance before merging to main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…luation_data_type default (5.1) docs to code Update two evals_v2 spec docs to match actual implementation: - components/27 §2.2: document strict exact-key-match validation in _validate_scores (raises RuntimeError on missing/extra keys), replacing the old lenient spec rule. Add key-mismatch row to the error-cases table. Correct int-coercion behavior in the no-bool-convenience bullet. - components/15 §4.1: change evaluation_data_type default from None to EvalDataType.final_answer, matching eval.py:793. Document the V1 back-compat rationale (omitted field loads as true V1 behavior). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- 6.1: Extract shared StubV2Eval/SkippingStubV2Eval into new conftest.py; remove duplicate definitions from test_v2_dispatch_and_contract.py and test_eval_runner.py. - 6.3: Add make_v2_eval_config/make_eval_task_input factory helpers to conftest.py; adopt across all 6 deterministic matcher test files, preserving each file's original default final_message. - 6.5: Convert test_g_eval_raises_when_provider_lacks_logprobs from run_until_complete to @pytest.mark.asyncio / async def / await. - 5.5: Add TestRunnerSkipReasonsAreValidEnumMembers — AST-inspects eval_runner.py to assert every skipped_reason= value (enum .value access or raw string literal) is a valid SkippedReason member. - 6.12: Add provenance docstring to test_g_eval_data.py documenting what the serialized RunOutput fixture is, how it was generated, and how to regenerate it (no directory exists for a README). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add TestV1LegacyRunnerCoexistence with 3 regression guards: - V1 g_eval dispatches through legacy_eval_adapter_from_type (not V2) - EvalConfig from dict without config_type key defaults to g_eval and runs through the legacy path - V1 g_eval config with properties["type"]="exact_match" is not misrouted to the V2 runner Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s (2.1) Replace the open dict[str, bool] response of the two code_eval trust endpoints with a named CodeEvalTrustResponse{trusted: bool} model, so the OpenAPI/TS contract is a closed object. Regenerate the web_ui OpenAPI schema (api_schema.d.ts) and update the frontend client return types. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- De-leak two single-resource 400 messages to a type-agnostic string (get_eval_progress, get_eval_config_score_summary); list endpoints unchanged. - create_eval_config: return a clean 400 for invalid V2 properties (ValidationError handler ordered before ValueError) + test. - Move trust-endpoint lazy imports (project_from_id, v2_eval_code_eval) to module top-level (no import cycle); retarget test patches accordingly. - test_nothing_persisted: assert against the real data-model directory. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ter timeout test (6.11) Add a module-level asyncio.Lock in CodeEvalAdapter.evaluate() around the run_in_executor call so only one code_eval subprocess runs at a time, with waiting evals suspended as coroutines (not blocking executor threads for the full timeout). Keep the narrow threading.Lock around p.start() for the PyInstaller spawn race (#7410). Add a serialization test. Speed up the sandbox_worker timeout test (sleep(60)/2s → sleep(10)/1s, same margin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- code_eval timeout input: client-side min=1 max=300 (FormElement gains optional min/max props; existing usages unaffected). - Trust dialog: remove all execution-method/reassurance language; keep a stark "runs arbitrary Python code on your machine" warning. - five_star docstring: describe the raising behavior (was wrongly "clamped"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…l-type forms (4.2/4.4) Replace any/SvelteComponent<any> on the V2 eval-type form imperative API with a shared EvalTypeFormApi interface ({ getProperties(): V2EvalConfigProperties; validate?(): string | null }). Type the registry's createFormComponent, narrow the create_eval_config call sites, and drop the eslint-disable suppressions. One isolated, commented Svelte-4 bind:this cast remains (unavoidable). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- formatEvalConfigName: collapse identical V2 compact/non-compact ternary. - exact_match_form: remove dead "value_expression" source union member. - Add extractV2Props<T> type-guard util; convert 8 V2 result components off repeated as-casts (now null-safe on type mismatch); + unit tests. - tool_call_check_result: use generated ToolCallSpec instead of a local mirror. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Extend the with/without-eval_config render test to the other 7 V2 result renderers (null-props path renders scores, hides config detail). - Parametrize the 8 v2_eval_api error-path tests with describe.each. - registry.test.ts: derive the expected count from the types array. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(7.1) The editor imports only the granular @codemirror/* packages (in code_editor.svelte); the codemirror umbrella was imported nowhere in src. Remove it and re-resolve the lockfile, dropping the umbrella plus its unused transitive deps (@codemirror/lint, @codemirror/search). Editor still builds. Completes the deep-cr-cleanup project (all 16 phases). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Leave only the single "runs Python code on your machine" warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/web_ui/src/lib/utils/eval_types/registry.ts`:
- Around line 206-218: The extractV2Props function lacks proper validation of
the config type itself and only checks if the properties.type field matches the
expected type. A legacy config with a type property could incorrectly pass
validation. Add a guard condition to check that eval_config.config_type === "v2"
alongside the existing property validation checks before performing the cast and
returning eval_config.properties as V2PropsMap[T].

In `@app/web_ui/src/lib/utils/form_element.svelte`:
- Around line 51-52: The exported `min` and `max` properties are not being
validated within the component's validation logic. Update the `validator`
function to include bounds checking that verifies numeric input values fall
within the specified minimum and maximum range when these bounds are set.
Additionally, ensure the `inline_error` mechanism properly reports validation
failures when values exceed the defined bounds. This validation should apply
consistently across all numeric input scenarios handled by the validator
function, ensuring that out-of-range values are rejected before they pass
validation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f138eecf-1f19-45c5-aabc-e230a9f4150d

📥 Commits

Reviewing files that changed from the base of the PR and between 55ec5e8 and 89a69f2.

⛔ Files ignored due to path filters (1)

app/web_ui/package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (95)

app/desktop/studio_server/eval_api.py
app/desktop/studio_server/test_eval_api.py
app/web_ui/package.json
app/web_ui/src/lib/api/v2_eval_api.test.ts
app/web_ui/src/lib/api/v2_eval_api.ts
app/web_ui/src/lib/api_schema.d.ts
app/web_ui/src/lib/components/eval_types/code_eval_form.svelte
app/web_ui/src/lib/components/eval_types/code_eval_result.svelte
app/web_ui/src/lib/components/eval_types/code_eval_result.test.ts
app/web_ui/src/lib/components/eval_types/contains_form.svelte
app/web_ui/src/lib/components/eval_types/contains_result.svelte
app/web_ui/src/lib/components/eval_types/contains_result.test.ts
app/web_ui/src/lib/components/eval_types/exact_match_form.svelte
app/web_ui/src/lib/components/eval_types/exact_match_result.svelte
app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte
app/web_ui/src/lib/components/eval_types/llm_judge_result.test.ts
app/web_ui/src/lib/components/eval_types/pattern_match_result.svelte
app/web_ui/src/lib/components/eval_types/pattern_match_result.test.ts
app/web_ui/src/lib/components/eval_types/set_check_form.svelte
app/web_ui/src/lib/components/eval_types/set_check_result.svelte
app/web_ui/src/lib/components/eval_types/set_check_result.test.ts
app/web_ui/src/lib/components/eval_types/step_count_check_result.svelte
app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte
app/web_ui/src/lib/components/eval_types/tool_call_check_result.test.ts
app/web_ui/src/lib/utils/eval_types/registry.test.ts
app/web_ui/src/lib/utils/eval_types/registry.ts
app/web_ui/src/lib/utils/form_element.svelte
app/web_ui/src/lib/utils/formatters.ts
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
libs/core/kiln_ai/adapters/eval/__init__.py
libs/core/kiln_ai/adapters/eval/base_eval.py
libs/core/kiln_ai/adapters/eval/conftest.py
libs/core/kiln_ai/adapters/eval/eval_helpers.py
libs/core/kiln_ai/adapters/eval/eval_runner.py
libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py
libs/core/kiln_ai/adapters/eval/eval_utils/test_scoring_utils.py
libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py
libs/core/kiln_ai/adapters/eval/registry.py
libs/core/kiln_ai/adapters/eval/test_eval_runner.py
libs/core/kiln_ai/adapters/eval/test_g_eval_data.py
libs/core/kiln_ai/adapters/eval/test_phase4_data_model.py
libs/core/kiln_ai/adapters/eval/test_registry.py
libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py
libs/core/kiln_ai/adapters/eval/test_v2_contains.py
libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_helpers.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/test_v2_exact_match.py
libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
libs/core/kiln_ai/adapters/eval/test_v2_step_count_check.py
libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
libs/core/kiln_ai/datamodel/eval.py
libs/core/kiln_ai/datamodel/test_eval_model.py
specs/projects/deep-cr-cleanup/implementation_plan.md
specs/projects/deep-cr-cleanup/phase_plans/phase_1.md
specs/projects/deep-cr-cleanup/phase_plans/phase_10.md
specs/projects/deep-cr-cleanup/phase_plans/phase_11.md
specs/projects/deep-cr-cleanup/phase_plans/phase_12.md
specs/projects/deep-cr-cleanup/phase_plans/phase_13.md
specs/projects/deep-cr-cleanup/phase_plans/phase_14.md
specs/projects/deep-cr-cleanup/phase_plans/phase_15.md
specs/projects/deep-cr-cleanup/phase_plans/phase_16.md
specs/projects/deep-cr-cleanup/phase_plans/phase_2.md
specs/projects/deep-cr-cleanup/phase_plans/phase_3.md
specs/projects/deep-cr-cleanup/phase_plans/phase_4.md
specs/projects/deep-cr-cleanup/phase_plans/phase_5.md
specs/projects/deep-cr-cleanup/phase_plans/phase_6.md
specs/projects/deep-cr-cleanup/phase_plans/phase_7.md
specs/projects/deep-cr-cleanup/phase_plans/phase_8.md
specs/projects/deep-cr-cleanup/phase_plans/phase_9.md
specs/projects/deep-cr-cleanup/project_overview.md
specs/projects/evals_v2/architecture.md
specs/projects/evals_v2/components/00_overview.md
specs/projects/evals_v2/components/15_v1_v2_coexistence.md
specs/projects/evals_v2/components/21_type_llm_judge.md
specs/projects/evals_v2/components/27_type_code_eval.md
specs/projects/evals_v2/components/29_rag_judge_templates.md
specs/projects/evals_v2/components/50_reference_data.md
specs/projects/evals_v2/components/80_extensibility_contract.md
specs/projects/evals_v2/components/90_open_risks.md
specs/projects/evals_v2/functional_spec.md
specs/projects/evals_v2/implementation_plan.md
specs/projects/evals_v2/phase_plans/phase_4.md
specs/projects/rag_templates/project_overview.md

💤 Files with no reviewable changes (3)

app/web_ui/package.json
libs/core/kiln_ai/adapters/eval/init.py
libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py

✅ Files skipped from review due to trivial changes (17)

specs/projects/deep-cr-cleanup/phase_plans/phase_16.md
specs/projects/deep-cr-cleanup/implementation_plan.md
specs/projects/deep-cr-cleanup/phase_plans/phase_13.md
specs/projects/deep-cr-cleanup/phase_plans/phase_8.md
specs/projects/deep-cr-cleanup/phase_plans/phase_6.md
libs/core/kiln_ai/adapters/eval/test_g_eval_data.py
specs/projects/deep-cr-cleanup/phase_plans/phase_5.md
specs/projects/deep-cr-cleanup/phase_plans/phase_12.md
specs/projects/deep-cr-cleanup/phase_plans/phase_15.md
specs/projects/deep-cr-cleanup/phase_plans/phase_10.md
specs/projects/deep-cr-cleanup/phase_plans/phase_14.md
specs/projects/deep-cr-cleanup/phase_plans/phase_7.md
specs/projects/deep-cr-cleanup/phase_plans/phase_11.md
specs/projects/deep-cr-cleanup/phase_plans/phase_2.md
specs/projects/evals_v2/architecture.md
specs/projects/evals_v2/components/00_overview.md
app/web_ui/src/lib/api_schema.d.ts

🚧 Files skipped from review as they are similar to previous changes (46)

app/web_ui/src/lib/components/eval_types/code_eval_result.svelte
libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
app/web_ui/src/lib/components/eval_types/tool_call_check_result.test.ts
app/web_ui/src/lib/components/eval_types/set_check_result.test.ts
libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
app/web_ui/src/lib/components/eval_types/contains_result.svelte
app/web_ui/src/lib/components/eval_types/set_check_result.svelte
libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py
app/web_ui/src/lib/components/eval_types/contains_form.svelte
app/web_ui/src/lib/components/eval_types/contains_result.test.ts
app/web_ui/src/lib/components/eval_types/step_count_check_result.svelte
app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte
app/web_ui/src/lib/components/eval_types/llm_judge_result.test.ts
libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
libs/core/kiln_ai/adapters/eval/eval_helpers.py
app/web_ui/src/lib/components/eval_types/set_check_form.svelte
app/web_ui/src/lib/utils/formatters.ts
app/web_ui/src/lib/components/eval_types/pattern_match_result.svelte
app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte
app/web_ui/src/lib/components/eval_types/code_eval_form.svelte
libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
libs/core/kiln_ai/adapters/eval/test_v2_contains.py
libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py
app/web_ui/src/lib/api/v2_eval_api.test.ts
libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
app/web_ui/src/lib/components/eval_types/exact_match_result.svelte
app/web_ui/src/lib/components/eval_types/code_eval_result.test.ts
libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
libs/core/kiln_ai/adapters/eval/test_v2_step_count_check.py
libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
app/web_ui/src/lib/utils/eval_types/registry.test.ts
libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
app/web_ui/src/lib/components/eval_types/exact_match_form.svelte
app/web_ui/src/lib/components/eval_types/pattern_match_result.test.ts
libs/core/kiln_ai/adapters/eval/base_eval.py
libs/core/kiln_ai/adapters/eval/eval_utils/test_scoring_utils.py
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
app/desktop/studio_server/eval_api.py
app/desktop/studio_server/test_eval_api.py
libs/core/kiln_ai/datamodel/eval.py
libs/core/kiln_ai/adapters/eval/eval_runner.py

coderabbitai · 2026-06-18T15:48:46Z

+export function extractV2Props<T extends V2EvalType>(
+  eval_config: EvalConfig | null,
+  expectedType: T,
+): V2PropsMap[T] | null {
+  if (
+    !eval_config?.properties ||
+    !("type" in eval_config.properties) ||
+    eval_config.properties.type !== expectedType
+  ) {
+    return null
+  }
+  return eval_config.properties as V2PropsMap[T]
+}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard extractV2Props by config_type === "v2" before casting.

At Line 210, the helper only checks properties.type; it can accept a legacy config that happens to contain a type key and then cast it to V2 props incorrectly. Add a config_type check before returning.

Suggested fix

export function extractV2Props<T extends V2EvalType>( eval_config: EvalConfig | null, expectedType: T, ): V2PropsMap[T] | null { if ( + eval_config?.config_type !== "v2" || !eval_config?.properties || !("type" in eval_config.properties) || eval_config.properties.type !== expectedType ) { return null } return eval_config.properties as V2PropsMap[T] }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/utils/eval_types/registry.ts` around lines 206 - 218, The extractV2Props function lacks proper validation of the config type itself and only checks if the properties.type field matches the expected type. A legacy config with a type property could incorrectly pass validation. Add a guard condition to check that eval_config.config_type === "v2" alongside the existing property validation checks before performing the cast and returning eval_config.properties as V2PropsMap[T].

coderabbitai · 2026-06-18T15:48:46Z

+  export let min: number | null = null
+  export let max: number | null = null


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

min/max are forwarded but not enforced by component validation.

Lines 51-52 and 261-262 add bounds, but validator/inline_error never check them. This creates a mismatch where out-of-range numeric values can still pass component validation.

Suggested fix

export let validator: (value: unknown) => string | null = () => { if (inputType === "header_only") { return null } @@ if (max_length && typeof value === "string" && value.length > max_length) { @@ } + if (inputType === "input_number" && typeof value === "number") { + if (min !== null && value < min) { + return `"${label}" must be at least ${min}.` + } + if (max !== null && value > max) { + return `"${label}" must be at most ${max}.` + } + } return null } @@ } else if ( max_length && typeof value === "string" && value.length > max_length ) { inline_error = "" + value.length + "/" + max_length + } else if (inputType === "input_number" && typeof value === "number") { + if (min !== null && value < min) { + inline_error = `Min ${min}` + } else if (max !== null && value > max) { + inline_error = `Max ${max}` + } else { + inline_error = null + } } else { inline_error = null } }

Also applies to: 65-82, 86-101, 261-262

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/utils/form_element.svelte` around lines 51 - 52, The exported `min` and `max` properties are not being validated within the component's validation logic. Update the `validator` function to include bounds checking that verifies numeric input values fall within the specified minimum and maximum range when these bounds are set. Additionally, ensure the `inline_error` mechanism properly reports validation failures when values exceed the defined bounds. This validation should apply consistently across all numeric input scenarios handled by the validator function, ensuring that out-of-range values are rejected before they pass validation.

scosman and others added 6 commits June 6, 2026 13:55

Spec for evals v2

cd56926

gemini-code-assist Bot reviewed Jun 8, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

scosman and others added 2 commits June 8, 2026 17:52

chiang-daniel mentioned this pull request Jun 9, 2026

feat: Multiturn eval builder #1457

Draft

6 tasks

scosman and others added 2 commits June 9, 2026 20:44

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

scosman and others added 12 commits June 17, 2026 16:59

docs(rag_templates): record removal commit id as restore starting point

74154c0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scosman and others added 8 commits June 18, 2026 00:40

fix(evals): shorten code_eval trust-dialog warning copy

8e04841

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(evals): drop secondary paragraph from code_eval trust dialog

89a69f2

Leave only the single "runs Python code on your machine" warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

		n_processed = len(expected_dataset_ids) - incomplete_count
		percent_complete = (n_processed) / len(expected_dataset_ids)



		class BaseV2Eval(ABC):
		"""Base class for V2 eval adapters -- separate from BaseEval because V2 adapters are synchronous, produce EvalScores dicts directly, and do not need the LLM-based score schema machinery."""

	- "model_responses": number of assistant messages, including tool-call-only
	- \"model_responses\": number of assistant messages, including tool-call-only

		props = self.properties
		assert isinstance(props, StepCountCheckProperties)

		export let min: number \| null = null
		export let max: number \| null = null

Conversation

scosman commented Jun 8, 2026

Kiln Evals V2 — Implementation

How this spec is organized (read this first)

On the Ax.x tags you'll see throughout

Scope at a glance

Coordination & prerequisites

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

socket-security Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

On the `Ax.x` tags you'll see throughout

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

socket-security Bot commented Jun 8, 2026 •

edited

Loading