Skip to content

Evals V2#1454

Open
scosman wants to merge 31 commits into
mainfrom
scosman/evals_v2
Open

Evals V2#1454
scosman wants to merge 31 commits into
mainfrom
scosman/evals_v2

Conversation

@scosman

@scosman scosman commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Kiln Evals V2 — Implementation

This project implements Kiln Evals V2: a typed, extensible eval framework that replaces V1's single-type LLM-as-judge model with many EvalConfigTypes producing scores per Eval, backed by purpose-built EvalInput datasets with per-case structured reference data. It is designed for "many small evals" — lowering the floor for creation so a small focused eval beats both "no eval at all" and an oversized eval that overfits.

V2 is additive over V1. V1 records on disk are never rewritten. V1 EvalConfigs (g_eval, llm_as_judge) continue to load and run unchanged under V2 via legacy adapter dispatch. New eval types, new dataset entities, and new scoring paths live alongside V1 — they never replace it on disk.

How this spec is organized (read this first)

This spec was produced from a long, deliberate design phase. The detail lives in components/ — one focused design doc per topic. The top-level files are thin maps into it:

  • project_overview.md (this file) — what we're building and how to navigate.
  • functional_spec.md — the feature/behavior surface (the eval-type catalog + data model + contracts), as pointers into components/.
  • architecture.md — the technical index: data model, runner dispatch, and the component map. Start here for the build.
  • implementation_plan.md — the phased build order (the checklist /spec implement walks).
  • components/NN_*.md — the real design detail. These are the meat; they already carry implementation-grade detail (schemas, signatures, test plans).
  • reference/ — supporting context (see note below). You usually don't need it.

On the Ax.x tags you'll see throughout

A full alignment phase preceded this design and is archived in reference/ALIGNMENT.md. Throughout the components/ docs you'll see tags like A2.1, B.13, E.18, C.runner.2 — these reference locked alignment decisions. You do not need to load reference/ALIGNMENT.md unless you genuinely want the history of how a decision was made. The design docs in components/ capture the output of alignment — they are less verbose, more complete, lower-churn, and are the best reference for building. Treat the tags as provenance breadcrumbs, not required reading.

reference/ also holds two docs that two components cite by name — backwards_compat_plan_grounded.md (code-grounded Phase-0 plan with eval.py line numbers) and batch_agent_eval_expansion.md (the tool_call_check/step_count_check expansion). Consult them only if a component points you there.

Scope at a glance

In V2.0: the data model (EvalInput, V2 EvalConfig, additive EvalRun/Eval fields), V1/V2 coexistence, 8 eval types (llm_judge, exact_match, pattern_match, set_check, contains, tool_call_check, step_count_check, code_eval), a Jinja2 template+extraction layer, the V2 runner, 6 RAG judge templates, the extensibility seams, and minimal-but-complete create/view UI for the new types.

Explicitly out of scope (owned by future, separate projects): the feedback-to-eval pipeline / triage, the goal-first onboarding redesign, spec-builder reliability, and the deferred eval types (composite, threshold, json_schema, event_ordering, embedding_similarity, dag_metric). See components/00_overview.md §3 for the full scope map and components/90_open_risks.md for residual risks.

Coordination & prerequisites

  • Multi-turn-synthetic is a parallel project. V2 commits only to the contract surface (components/26* is excluded from this folder; the MultiTurnSyntheticEvalInputData slot in the data model accommodates it). Don't build multi-turn scoring here.
  • Prereq already landed in main: the general Kiln input_transform (Jinja2) capability — libs/core/kiln_ai/utils/jinja_engine.py + InputTransform on KilnAgentRunConfigProperties. components/06 is its design, kept for reference only — do not rebuild.
  • Prereqs still to build (Phase 1): the SingleTurnR1ThinkingFormatter fix (components/05) and the V1 validate_template_properties bug fix (components/15 §4.3). See implementation_plan.md.

The full components/00_overview.md is the canonical front door — it expands every point above with links.

scosman and others added 6 commits June 6, 2026 13:55
Add the V2 eval data model alongside V1 without breaking backwards
compatibility. Key changes:

- Add V2EvalConfigProperties typed model with discriminated union dispatch,
  mode="before" validator for V1 dict compatibility, and validate_properties
  V2 branch enforcing typed properties and no root-level model fields
- Add EvalInput parented model, EvalInputFilter protocol/registry, and
  dataset_filters module with TagFilter implementation
- Extend EvalRun with eval_input_id, reference_data, skipped_reason fields
  and relaxed score/output validators when skipped
- Add Eval.eval_input_filter_id with mutual-exclusivity guard against
  eval_set_filter_id, plus Task.eval_inputs() accessor
- Fix V1 validate_template_properties bug (was validating wrong field set)
- Add forward_thinking_instructions opt-in to SingleTurnR1ThinkingFormatter
- Extract legacy_model_fields helper and refactor eval_adapter_from_type
  dispatch in registry
- Regenerate OpenAPI schema and fix all downstream TypeScript type errors
  for nullable model_name/model_provider/eval_set_filter_id fields

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… runner backbone

Template/extraction layer consuming input_transform: EvalTaskInput assembly,
extract() usage, required_var skip-with-reason pre-check, save-time
template/expression Jinja2 compilation validation.

V2 runner backbone: EvalInput flow with B2.1 TaskRun-to-EvalInput runtime
translation, two-level adapter dispatch (config_type -> properties.type) with
empty adapter map, multi-turn incompatible_input_shape skip handling, and
per-job error isolation with skip records.

Supporting infrastructure: BaseV2Eval ABC with adapter contract and
typed property validation, scoring_utils extraction from GEval with
legacy-model-field helpers, GEval characterization tests preserving
backward compatibility, on-read n_used/n_excluded inline aggregation
in eval score API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add exact_match, pattern_match, contains, set_check, tool_call_check,
and step_count_check adapters, each registered into _V2_ADAPTER_MAP for
two-level dispatch. Includes build_binary_scores helper in v2_eval_helpers
and full test coverage for all adapters plus dispatch integration tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/continuous scoring, and RAG judge templates

- Add V2 LlmJudgeEval adapter with g_eval toggle, structured JSON output scoring,
  and reasoning-model thinking-formatter wiring (forward_thinking_instructions)
- Extract shared scoring utilities (score_from_token_string, build_llm_as_judge_score)
  into scoring_utils.py; keep score_from_token_string strict for G-Eval logprob
  filtering while handling continuous 0-1 floats directly in build_llm_as_judge_score
- Add six first-party RAG judge templates over the canonical reference-key contract:
  faithfulness, answer_relevance, context_relevance, context_precision,
  hallucination, answer_correctness
- Wire forward_thinking_instructions through AdapterConfig -> build_chat_formatter ->
  SingleTurnR1ThinkingFormatter for reasoning-model judges
- Update async V2 eval contract tests and dispatch registration
- Comprehensive test coverage: 23 llm_judge tests, 33 RAG template tests,
  4 forward-thinking tests, scoring_utils unit tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… gate

Adds the code_eval V2 eval type, which runs user-authored Python scorer
functions in a multiprocessing spawn-based worker process with wall-clock
timeout and serialized spawn lock (daemon child for safety).

Key components:
- sandbox_worker.py: _execute_scorer (child target) + run_scorer (parent
  orchestrator) with timeout/kill/reap lifecycle
- KilnEvalHelpers: pure-Python helper library injected as the `kiln` arg
  to user scorer functions (ref(), reference_data, task_input, trace)
- CodeEvalAdapter (BaseV2Eval): trust-gate check returning
  code_eval_not_trusted skip when project is untrusted; score-shape
  validation (rejects bools, converts int→float, checks key sets)
- Trust gate: ephemeral per-project in-memory store with grant/check/revoke
  helpers; grant and check API endpoints in eval_api.py (DENY_AGENT)
- CodeEvalProperties.timeout_seconds field with save-time code validation
  (must define a score() function)
- freeze_support() + set_start_method("spawn") wiring in desktop.py and
  dev_server.py entry points for PyInstaller compatibility
- Registered as code_eval in _V2_ADAPTER_MAP (all 8 V2 types now covered)

Note: OpenAPI schema was NOT regenerated for the new grant/check trust
endpoints because schema generation is sandbox-blocked (mktemp). To be
regenerated in Phase 6 when the frontend consumes them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

This PR adds Evals V2 data models, adapter dispatch, runner paths, scoring utilities, code-eval trust and sandbox execution, backend API support, web UI creation and result rendering flows, broader compatibility guards, multiprocessing startup setup, and extensive tests and project specifications.

Changes

Evals V2 end-to-end rollout

Layer / File(s) Summary
Runtime and data-model foundation
libs/core/kiln_ai/datamodel/*, libs/core/kiln_ai/adapters/eval/base_eval.py, .../registry.py, .../eval_runner.py, .../eval_utils/*, .../v2_eval_*
Adds V2 eval types, properties, eval inputs, skipped-run fields, eval-input filtering, V2 adapter dispatch, runner collection and execution paths, shared scoring helpers, deterministic adapters, LLM judge, code-eval sandbox execution, and trust helpers.
Desktop API and startup integration
app/desktop/desktop.py, app/desktop/dev_server.py, app/desktop/studio_server/eval_api.py
Adds multiprocessing spawn bootstrap for desktop entrypoints, updates score summaries with n_used and n_excluded, adds V2 test-eval handling, makes eval-config model fields conditionally optional, and exposes project trust endpoints.
Web UI V2 authoring and rendering
app/web_ui/src/lib/api/v2_eval_api.ts, app/web_ui/src/lib/utils/eval_types/registry.ts, app/web_ui/src/lib/components/eval_types/*, app/web_ui/src/routes/.../create_eval_config/+page.svelte, app/web_ui/src/routes/.../run_result/+page.svelte
Adds V2 API helpers, eval-type registry metadata, per-type form and result components, a CodeMirror-based code editor, the V2 create-config flow with test and trust dialogs, and V2-aware result rendering and naming.
Compatibility and display guards
app/web_ui/src/routes/(app)/generate/..., app/web_ui/src/routes/(app)/specs/..., app/web_ui/src/routes/(app)/prompt_optimization/..., app/web_ui/src/lib/utils/formatters.ts, app/web_ui/src/lib/utils/form_element.svelte
Adds null-safe model and provider formatting, guards flows that require eval_set_filter_id, updates V2 judge labels in existing pages, and adds numeric input min/max support.
Validation and regression coverage
app/desktop/studio_server/test_eval_api.py, libs/core/kiln_ai/adapters/eval/test_*, libs/core/kiln_ai/datamodel/test_*, app/web_ui/src/lib/**/*.test.ts, app/web_ui/src/routes/.../page.test.ts
Adds backend, runner, adapter, sandbox, datamodel, frontend component, API client, and page tests covering V2 execution, skipped aggregation, trust handling, formatter forwarding, and V1/V2 coexistence.
Specifications and plans
specs/projects/evals_v2/*, specs/projects/deep-cr-cleanup/*
Adds Evals V2 architecture, component specs, functional and implementation indexes, phase plans, and cleanup planning documents aligned with the implemented runtime, UI, API, and test changes.

Sequence Diagram(s)

sequenceDiagram
  participant UI as Create Eval Config Page
  participant API as eval_api
  participant Adapter as V2 Eval Adapter
  participant Trust as CodeEvalTrust

  UI->>API: POST test_v2_eval(properties, eval_input)
  API->>Adapter: evaluate(eval_input)
  Adapter-->>API: scores or skipped_reason
  API-->>UI: TestV2EvalResponse

  UI->>API: GET code_eval_trust
  API->>Trust: check project trust
  Trust-->>API: trusted
  API-->>UI: CodeEvalTrustResponse

  UI->>API: POST grant_code_eval_trust
  API->>Trust: grant project trust
  Trust-->>API: trusted
  API-->>UI: CodeEvalTrustResponse
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • Kiln-AI/Kiln#503: Both changes touch eval runner and run-config plumbing used by the new V2 adapter execution paths.
  • Kiln-AI/Kiln#1297: Both changes modify app/desktop/studio_server/eval_api.py score-summary and eval-results aggregation behavior.
  • Kiln-AI/Kiln#358: Both changes update the chat formatter path around SingleTurnR1ThinkingFormatter and get_chat_formatter.

Poem

🐇 I hopped through judges old and new,
With scorecards split by used and skew.
A sandbox hums, trust gates swing wide,
While V2 paths now stride with pride.
Spawned little burrows start just right—
And tests keep watch all through the night.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Evals V2' is concise and directly describes the main feature being added—a new V2 eval framework. It clearly indicates the primary change and would be meaningful to someone scanning PR history.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering what V2 does, how it relates to V1, how the spec is organized, scope at a glance, and coordination notes. It follows the spirit of the template by explaining the goals and rationale, though it does not explicitly check the CLA or provide all template checkbox items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch scosman/evals_v2
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch scosman/evals_v2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the backend infrastructure for Kiln Evals V2, introducing the EvalInput data model, V2-specific fields for evals and runs, and six deterministic adapters alongside a sandboxed execution model for custom Python scorers. The review feedback identifies several critical issues: eval_helpers.py incorrectly parses standard OpenAI traces for tool calls and results; the score key validation in v2_eval_code_eval.py is overly strict and violates the specification; and eval_api.py contains a potential ZeroDivisionError and lacks native V2 eval support when eval_set_filter_id is missing. Additionally, minor docstring mismatches in base_v2_eval.py and v2_eval_step_count_check.py should be corrected.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +16 to +24
def get_tool_calls(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:
"""Return all tool-call entries from a trace (empty list if trace is None)."""
if not trace:
return []
return [
entry
for entry in trace
if entry.get("role") == "tool_call" or entry.get("type") == "tool_call"
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In standard OpenAI chat completion traces (which Kiln uses), tool calls are not top-level trace entries with role: \"tool_call\". Instead, they are stored within the tool_calls field of assistant messages (as correctly handled in v2_eval_tool_call_check.py). Using the current implementation, get_tool_calls will return empty lists for standard OpenAI traces, causing user-authored scorers to fail or return incorrect results.

    @staticmethod\n    def get_tool_calls(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:\n        \"\"\"Return all tool-call entries from a trace (empty list if trace is None).\"\"\"\n        if not trace:\n            return []\n        calls = []\n        for entry in trace:\n            if entry.get(\"role\") == \"assistant\" and \"tool_calls\" in entry:\n                for tc in entry[\"tool_calls\"]:\n                    func = tc.get(\"function\", {})\n                    calls.append({\n                        \"name\": func.get(\"name\", \"\"),\n                        \"arguments\": func.get(\"arguments\", {}),\n                        \"id\": tc.get(\"id\", \"\"),\n                    })\n            elif entry.get(\"role\") == \"tool_call\" or entry.get(\"type\") == \"tool_call\":\n                calls.append(entry)\n        return calls

Comment on lines +36 to +44
def get_tool_results(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:
"""Return all tool-result entries from a trace."""
if not trace:
return []
return [
entry
for entry in trace
if entry.get("role") == "tool_result" or entry.get("type") == "tool_result"
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In standard OpenAI chat completion traces, tool results have role: \"tool\", not role: \"tool_result\". We should update get_tool_results to support standard OpenAI traces.

    @staticmethod\n    def get_tool_results(trace: list[dict[str, Any]] | None) -> list[dict[str, Any]]:\n        \"\"\"Return all tool-result entries from a trace.\"\"\"\n        if not trace:\n            return []\n        results = []\n        for entry in trace:\n            if entry.get(\"role\") == \"tool\" or entry.get(\"role\") == \"tool_result\" or entry.get(\"type\") == \"tool_result\":\n                results.append(entry)\n        return results

Comment on lines +104 to +123
expected_keys = {score.json_key() for score in eval_obj.output_scores}
actual_keys = set(raw.keys())
if actual_keys != expected_keys:
raise RuntimeError(
f"Score key mismatch: got {sorted(actual_keys)}, expected {sorted(expected_keys)}"
)

validated: EvalScores = {}
for key, value in raw.items():
if isinstance(value, bool):
raise RuntimeError(
f"Score '{key}' returned a bool. Use a float (e.g. 1.0 for pass, 0.0 for fail)."
)
if isinstance(value, int):
value = float(value)
if not isinstance(value, float):
raise RuntimeError(
f"Score '{key}' must be a float, got {type(value).__name__}"
)
validated[key] = value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the specification in components/27_type_code_eval.md, missing keys are allowed (and should be mapped to None or skipped), and extra keys should be silently ignored. However, this strict equality check (actual_keys != expected_keys) will raise a RuntimeError if there are any missing or extra keys. We should relax this validation to allow missing keys and ignore extra keys.

        expected_keys = {score.json_key() for score in eval_obj.output_scores}\n        validated: EvalScores = {}\n        for score in eval_obj.output_scores:\n            key = score.json_key()\n            if key not in raw:\n                continue\n            value = raw[key]\n            if isinstance(value, bool):\n                raise RuntimeError(\n                    f\"Score '{key}' returned a bool. Use a float (e.g. 1.0 for pass, 0.0 for fail).\"\n                )\n            if isinstance(value, int):\n                value = float(value)\n            if not isinstance(value, float):\n                raise RuntimeError(\n                    f\"Score '{key}' must be a float, got {type(value).__name__}\"\n                )\n            validated[key] = value

Comment on lines +585 to +586
n_processed = len(expected_dataset_ids) - incomplete_count
percent_complete = (n_processed) / len(expected_dataset_ids)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If expected_dataset_ids is empty, this division will raise a ZeroDivisionError. We should guard against this by checking if expected_dataset_ids is empty.

        n_processed = len(expected_dataset_ids) - incomplete_count\n        percent_complete = (n_processed) / len(expected_dataset_ids) if expected_dataset_ids else 0.0



class BaseV2Eval(ABC):
"""Base class for V2 eval adapters -- separate from BaseEval because V2 adapters are synchronous, produce EvalScores dicts directly, and do not need the LLM-based score schema machinery."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring states that V2 adapters are synchronous, but evaluate is defined as an asynchronous method (async def evaluate). We should update the docstring to reflect this.

    \"\"\"Base class for V2 eval adapters -- separate from BaseEval because V2 adapters produce EvalScores dicts directly and do not need the LLM-based score schema machinery.\"\"\"


count_type determines what is counted:
- "tool_calls": total number of individual tool call invocations across all assistant messages
- "model_responses": number of assistant messages, including tool-call-only

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring states that model_responses excludes tool-call-only messages, but the implementation and tests include them. We should update the docstring to match the intended behavior.

Suggested change
- "model_responses": number of assistant messages, including tool-call-only
- \"model_responses\": number of assistant messages, including tool-call-only

Comment on lines +1180 to +1184
if eval.eval_set_filter_id is None:
raise HTTPException(
status_code=400,
detail="This eval does not have a V1 eval set filter.",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of raising a 400 error when eval_set_filter_id is None, we can support native V2 evals by resolving the expected dataset IDs from eval_input_filter_id using eval_input_filter_from_id.

        if eval.eval_set_filter_id is None:\n            if eval.eval_input_filter_id is None:\n                raise HTTPException(\n                    status_code=400,\n                    detail=\"This eval does not have a V1 or V2 dataset filter.\",\n                )\n            from kiln_ai.datamodel.dataset_filters import eval_input_filter_from_id\n            filter_fn = eval_input_filter_from_id(eval.eval_input_filter_id)\n            expected_dataset_ids = {\n                ei.id for ei in task.eval_inputs(readonly=True) if filter_fn(ei)\n            }\n        else:\n            expected_dataset_ids = dataset_ids_in_filter(\n                task, eval.eval_set_filter_id, readonly=True\n            )

Comment on lines +1486 to 1490
if eval.eval_set_filter_id is None:
continue
expected_dataset_ids = dataset_ids_in_filter(
task, eval.eval_set_filter_id, readonly=True
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of skipping when eval_set_filter_id is None, we can support native V2 evals by resolving the expected dataset IDs from eval_input_filter_id using eval_input_filter_from_id.

            if eval.eval_set_filter_id is None:\n                if eval.eval_input_filter_id is None:\n                    continue\n                from kiln_ai.datamodel.dataset_filters import eval_input_filter_from_id\n                filter_fn = eval_input_filter_from_id(eval.eval_input_filter_id)\n                expected_dataset_ids = {\n                    ei.id for ei in task.eval_inputs(readonly=True) if filter_fn(ei)\n                }\n            else:\n                expected_dataset_ids = dataset_ids_in_filter(\n                    task, eval.eval_set_filter_id, readonly=True\n                )

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
app/web_ui/src/lib/api_schema.d.ts (1)

2-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regenerate OpenAPI bindings; checked-in schema is stale and breaking API contract checks.

CI already confirms this generated file is out of sync with backend OpenAPI (missing trust endpoints and timeout_seconds), so frontend/backend typing contracts are currently inconsistent. Please update backend schema sources, then regenerate with app/web_ui/src/lib/generate_schema.sh and commit the result (instead of manual edits in this file).
As per coding guidelines, “Keep OpenAPI client schema up to date with app/web_ui/src/lib/generate_schema.sh.” Based on learnings, this file is generated and should be changed via backend schema + regeneration, not direct edits.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/api_schema.d.ts` around lines 2 - 4, The checked-in
generated OpenAPI TypeScript file is stale (missing trust endpoints and the
timeout_seconds field) and must not be edited directly; update the backend
OpenAPI schema sources to include the missing trust endpoints and
timeout_seconds, then run the regeneration script
app/web_ui/src/lib/generate_schema.sh to produce a fresh api_schema.d.ts, verify
the new types include the trust endpoints and timeout_seconds, and commit the
regenerated file so frontend/backend typing contracts stay in sync.

Sources: Coding guidelines, Learnings, Pipeline failures

libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (1)

684-689: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Forwarding flag is skipped for tuned chat strategies.

On Line 684, the tuned-strategy formatter call includes thinking_instructions but does not pass forward_thinking_instructions, so AdapterConfig.forward_thinking_instructions=True is ignored for that path.

Proposed fix
         if tuned_chat_strategy and tuned_chat_strategy != ChatStrategy.single_turn:
             return get_chat_formatter(
                 strategy=tuned_chat_strategy,
                 system_message=system_message,
                 user_input=input,
                 thinking_instructions=cot_prompt,
+                forward_thinking_instructions=self.base_adapter_config.forward_thinking_instructions,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/model_adapters/base_adapter.py` around lines 684 -
689, The tuned-strategy path calling get_chat_formatter (with
tuned_chat_strategy, system_message, input, thinking_instructions=cot_prompt)
omits the forward_thinking_instructions flag so
AdapterConfig.forward_thinking_instructions is ignored; update that
get_chat_formatter invocation to forward the flag (e.g.
forward_thinking_instructions=AdapterConfig.forward_thinking_instructions or the
instance equivalent) alongside thinking_instructions so the configured
forwarding behavior is respected.
🧹 Nitpick comments (1)
libs/core/kiln_ai/adapters/eval/test_eval_runner.py (1)

1229-1237: ⚡ Quick win

Add a runner-level regression test for eval-input jobs.

Current V2 tests validate collection and direct run_job, but not the collect_tasks()run() path for eval-input mode. A single async test covering that path would catch the missing stored_output contract break.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/test_eval_runner.py` around lines 1229 -
1237, The test suite is missing a regression test that exercises the full
collect_tasks() → run() path for eval-input jobs and would catch the
stored_output contract break; add an async test (similar to
test_collects_all_inputs) that calls mock_v2_runner.collect_tasks(), then
executes the runner's run() for those tasks (or awaits
mock_v2_runner.run(collected_tasks)), and asserts that each produced job has the
expected stored_output structure and types for EvalInput jobs (referencing
collect_tasks(), run(), EvalInput, and the test_collects_all_inputs pattern) so
the stored_output contract is validated end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/eval_config_instruction.svelte:
- Around line 9-19: The helper functions get_eval_steps and get_task_description
must validate the untyped properties before casting: in get_eval_steps (taking
eval_config: EvalConfig) check eval_config?.properties exists, ensure
props["eval_steps"] is an array with Array.isArray and that every element typeof
=== "string" (or coerce safe values via String and filter out null/undefined)
and then return a string[]; in get_task_description ensure
props["task_description"] is a string (typeof === "string") and only return it,
otherwise return the default "No description provided." This prevents non-string
values from being rendered and keeps the code robust against malformed payloads.

In `@libs/core/kiln_ai/adapters/eval/eval_helpers.py`:
- Around line 65-67: In has_tool_call, guard against non-dict tool call
arguments before calling .get: when extracting args (the variable assigned from
call.get("arguments") or call.get("args")), check isinstance(args, dict) and if
not, set args = {} (or convert to a dict if appropriate) so the subsequent
all(args.get(k) == v for k, v in expected_args.items()) cannot raise; update the
code around the args assignment in has_tool_call to perform this check and then
use the safe args variable in the matching logic.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Around line 176-189: The created EvalJob instances are missing required
execution artifacts (stored_output/stored_trace) which later causes ValueError
in execution; update the job-creation loop in eval_runner where EvalJob(...) is
instantiated (inside the for eval_input in self.task.eval_inputs(...) block) to
include stored_output and stored_trace pulled from the EvalInput (e.g.,
eval_input.stored_output and eval_input.stored_trace) or, if artifacts aren't
present, ensure the task is executed/wired before building jobs so those fields
are populated; make the change to the EvalJob constructor usage so collected
jobs are runnable by eval_run execution.

In `@libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py`:
- Around line 70-80: The current key-offset detection in metric_offsets uses raw
substring count/find on metric_name which can match quoted metric text inside
values; instead use a JSON-key-aware regex search to locate keys only (e.g.
compile pattern rf'"{re.escape(m)}"\s*:') and use re.finditer on raw_output to
collect matches, ensure exactly one match per metric, and set offsets[m] =
match.start(); replace the existing count/find checks with this regex-based
logic to avoid false positives from quoted values.

In `@libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py`:
- Around line 33-40: Wrap the call to extract(expression, data) in a try/except
so any exception (e.g., ValueError or runtime errors from compiled(**data)) is
caught and translated into the same skip tuple used for Undefined; keep the
existing check for isinstance(result, Undefined) but add an except Exception as
e that returns (None, SkippedReason.extraction_failed, f"Expression
'{expression}' raised error: {e}") and ensure you import/keep references to
extract, Undefined, and SkippedReason from the current module.

In `@libs/core/kiln_ai/adapters/eval/sandbox_worker.py`:
- Around line 122-127: The current try/except using q.get_nowait() can race with
the child process and miss a just-enqueued result; in the block that references
q and process p (the scorer subprocess), replace the non-blocking get_nowait()
with a blocking wait (e.g., q.get with a short timeout or a loop that waits
while p.is_alive() and then drains the queue) so you only raise the RuntimeError
after you are sure the child has exited and the queue is empty; keep the
existing checks on p.exitcode (the RuntimeError messages) but ensure you attempt
a blocking read from q (or repeated reads until timeout/empty after p.exitcode
is set) to avoid the race.

In `@libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py`:
- Around line 76-77: The test contains literal "print('debug info')" substrings
inside code strings which trigger the debug-detector; replace those occurrences
with a non-print alternative (e.g., use a logger call like logger.debug(...)
inside the string or remove the debug line entirely) in the string literals that
include "print('debug info')" (the string immediately before "return {'x': 1.0}"
and the similar occurrence later) so the test no longer contains the raw
"print(" pattern while preserving test behavior.

In `@libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py`:
- Around line 343-346: The test test_g_eval_raises_when_provider_lacks_logprobs
currently calls
asyncio.get_event_loop().run_until_complete(adapter.evaluate(_inp())), which
bypasses the file's async style; change the test to be async (async def
test_g_eval_raises_when_provider_lacks_logprobs(...)) and replace the
run_until_complete call with an awaited call (await adapter.evaluate(_inp()))
inside the existing pytest.raises context; also remove the now-unnecessary
import asyncio. This keeps the test consistent with the file's async/await
pattern and pytest's asyncio_mode="auto".

In `@libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py`:
- Around line 117-123: The current validation block (the one assigning into
validated[key]) accepts float values like NaN or Infinity; update it to reject
non-finite floats before populating validated (and before returning EvalScores)
by using math.isfinite on the float-converted value and raising a RuntimeError
if not finite. Keep the existing int->float conversion and type check for value,
then add a finiteness check referencing the same symbols (value, key, validated)
and include a clear error message like "Score '{key}' must be a finite float,
got {value}". Ensure math is imported where this validation lives.

In `@libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py`:
- Around line 31-38: The code is treating an invalid regex as an extraction
skip; update the error handling in the block around re.search(props.pattern,
actual) so an invalid pattern raises a hard configuration/validation error
instead of returning SkippedReason.extraction_failed—replace the except re.error
branch that returns ( {}, SkippedReason.extraction_failed, ... ) with a raised
exception (e.g., ValueError or a dedicated ConfigValidationError) that includes
the pattern and original exception message so the misconfiguration surfaces
immediately; keep references to re.search(props.pattern, actual) and
SkippedReason.extraction_failed to locate and change the logic.

In `@libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py`:
- Around line 25-27: Replace the runtime-optimizable assert with an explicit
type check: after props = self.properties, verify with if not isinstance(props,
StepCountCheckProperties): raise ValueError("expected StepCountCheckProperties
for self.properties") so the guard in V2EvalStepCountCheck (using properties and
StepCountCheckProperties) remains effective in optimized builds.

In `@libs/core/kiln_ai/datamodel/eval.py`:
- Around line 381-384: EvalRun's output is allowed to be None even for
non-skipped runs; add a Pydantic validation that enforces output is non-None
unless the run is explicitly marked as skipped. In the EvalRun model (and the
other similar model around lines 471-477), add a validator or root_validator
that checks the run's status/skipped flag (e.g., compare to the skipped
enum/value used in this module) and raises a ValueError if status != SKIPPED and
output is None, leaving output optional only when status indicates a
skipped-before-execution run. Ensure the validator references the class name
EvalRun and the output field so it triggers on model construction and prevents
invalid records.

In `@specs/projects/evals_v2/components/70_builder_and_onboarding.md`:
- Line 280: The renderer guidance incorrectly references the old property name
g_eval_mode; update that reference to the canonical V2 field g_eval so the
llm_judge binding uses g_eval (ensure any text mentioning g_eval_mode in the
llm_judge renderer guidance is replaced and consistent with the documented
g_eval property).

---

Outside diff comments:
In `@app/web_ui/src/lib/api_schema.d.ts`:
- Around line 2-4: The checked-in generated OpenAPI TypeScript file is stale
(missing trust endpoints and the timeout_seconds field) and must not be edited
directly; update the backend OpenAPI schema sources to include the missing trust
endpoints and timeout_seconds, then run the regeneration script
app/web_ui/src/lib/generate_schema.sh to produce a fresh api_schema.d.ts, verify
the new types include the trust endpoints and timeout_seconds, and commit the
regenerated file so frontend/backend typing contracts stay in sync.

In `@libs/core/kiln_ai/adapters/model_adapters/base_adapter.py`:
- Around line 684-689: The tuned-strategy path calling get_chat_formatter (with
tuned_chat_strategy, system_message, input, thinking_instructions=cot_prompt)
omits the forward_thinking_instructions flag so
AdapterConfig.forward_thinking_instructions is ignored; update that
get_chat_formatter invocation to forward the flag (e.g.
forward_thinking_instructions=AdapterConfig.forward_thinking_instructions or the
instance equivalent) alongside thinking_instructions so the configured
forwarding behavior is respected.

---

Nitpick comments:
In `@libs/core/kiln_ai/adapters/eval/test_eval_runner.py`:
- Around line 1229-1237: The test suite is missing a regression test that
exercises the full collect_tasks() → run() path for eval-input jobs and would
catch the stored_output contract break; add an async test (similar to
test_collects_all_inputs) that calls mock_v2_runner.collect_tasks(), then
executes the runner's run() for those tasks (or awaits
mock_v2_runner.run(collected_tasks)), and asserts that each produced job has the
expected stored_output structure and types for EvalInput jobs (referencing
collect_tasks(), run(), EvalInput, and the test_collects_all_inputs pattern) so
the stored_output contract is validated end-to-end.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 50c2dc65-f3e2-4c33-8799-85e807724f13

📥 Commits

Reviewing files that changed from the base of the PR and between 58608b4 and f3777c8.

📒 Files selected for processing (102)
  • app/desktop/desktop.py
  • app/desktop/dev_server.py
  • app/desktop/studio_server/eval_api.py
  • app/desktop/studio_server/test_eval_api.py
  • app/web_ui/src/lib/api_schema.d.ts
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/data_gen_intro.svelte
  • app/web_ui/src/routes/(app)/prompt_optimization/[project_id]/[task_id]/create_prompt_optimization_job/+page.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/+page.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/[eval_config_id]/[run_config_id]/run_result/+page.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/+page.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/eval_config_instruction.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
  • libs/core/kiln_ai/adapters/chat/chat_formatter.py
  • libs/core/kiln_ai/adapters/chat/test_chat_formatter.py
  • libs/core/kiln_ai/adapters/eval/__init__.py
  • libs/core/kiln_ai/adapters/eval/base_eval.py
  • libs/core/kiln_ai/adapters/eval/base_v2_eval.py
  • libs/core/kiln_ai/adapters/eval/eval_helpers.py
  • libs/core/kiln_ai/adapters/eval/eval_runner.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/test_scoring_utils.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py
  • libs/core/kiln_ai/adapters/eval/g_eval.py
  • libs/core/kiln_ai/adapters/eval/rag_judge_templates.py
  • libs/core/kiln_ai/adapters/eval/registry.py
  • libs/core/kiln_ai/adapters/eval/sandbox_worker.py
  • libs/core/kiln_ai/adapters/eval/test_eval_helpers.py
  • libs/core/kiln_ai/adapters/eval/test_eval_runner.py
  • libs/core/kiln_ai/adapters/eval/test_forward_thinking_instructions.py
  • libs/core/kiln_ai/adapters/eval/test_g_eval.py
  • libs/core/kiln_ai/adapters/eval/test_g_eval_characterization.py
  • libs/core/kiln_ai/adapters/eval/test_rag_judge_templates.py
  • libs/core/kiln_ai/adapters/eval/test_registry.py
  • libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py
  • libs/core/kiln_ai/adapters/eval/test_v2_contains.py
  • libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/test_v2_exact_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
  • libs/core/kiln_ai/adapters/model_adapters/base_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_base_adapter.py
  • libs/core/kiln_ai/datamodel/dataset_filters.py
  • libs/core/kiln_ai/datamodel/eval.py
  • libs/core/kiln_ai/datamodel/task.py
  • libs/core/kiln_ai/datamodel/test_dataset_filters.py
  • libs/core/kiln_ai/datamodel/test_eval_model.py
  • specs/projects/evals_v2/architecture.md
  • specs/projects/evals_v2/components/00_overview.md
  • specs/projects/evals_v2/components/05_prereq_thinking_formatter_fix.md
  • specs/projects/evals_v2/components/06_prereq_input_transform.md
  • specs/projects/evals_v2/components/10_data_model.md
  • specs/projects/evals_v2/components/15_v1_v2_coexistence.md
  • specs/projects/evals_v2/components/20_eval_config_types_overview.md
  • specs/projects/evals_v2/components/21_type_llm_judge.md
  • specs/projects/evals_v2/components/22_type_deterministic_basics.md
  • specs/projects/evals_v2/components/27_type_code_eval.md
  • specs/projects/evals_v2/components/29_rag_judge_templates.md
  • specs/projects/evals_v2/components/40_template_and_extraction.md
  • specs/projects/evals_v2/components/45_runner_architecture.md
  • specs/projects/evals_v2/components/50_reference_data.md
  • specs/projects/evals_v2/components/70_builder_and_onboarding.md
  • specs/projects/evals_v2/components/80_extensibility_contract.md
  • specs/projects/evals_v2/components/85_observability_and_audit.md
  • specs/projects/evals_v2/components/90_open_risks.md
  • specs/projects/evals_v2/functional_spec.md
  • specs/projects/evals_v2/implementation_plan.md
  • specs/projects/evals_v2/phase_plans/phase_1.md
  • specs/projects/evals_v2/phase_plans/phase_2.md
  • specs/projects/evals_v2/phase_plans/phase_3.md
  • specs/projects/evals_v2/phase_plans/phase_4.md
  • specs/projects/evals_v2/phase_plans/phase_5.md
  • specs/projects/evals_v2/project_overview.md
  • specs/projects/evals_v2/reference/ALIGNMENT.md
  • specs/projects/evals_v2/reference/backwards_compat_plan_grounded.md
  • specs/projects/evals_v2/reference/batch_agent_eval_expansion.md
  • specs/projects/input_transform_create_ui/architecture.md
  • specs/projects/input_transform_create_ui/functional_spec.md
  • specs/projects/input_transform_create_ui/phase_plans/phase_1.md
  • specs/projects/input_transform_create_ui/phase_plans/phase_2.md
  • specs/projects/input_transform_create_ui/project_overview.md
  • specs/projects/input_transform_ui/architecture.md
  • specs/projects/input_transform_ui/functional_spec.md
  • specs/projects/input_transform_ui/phase_plans/phase_1.md
  • specs/projects/input_transform_ui/phase_plans/phase_2.md
  • specs/projects/input_transform_ui/project_overview.md
  • specs/projects/sse_cancel_fix/architecture.md
  • specs/projects/sse_cancel_fix/functional_spec.md
  • specs/projects/sse_cancel_fix/implementation_plan.md
  • specs/projects/sse_cancel_fix/project_overview.md

Comment on lines +9 to 19
const props = eval_config.properties as Record<string, unknown>
if (!props["eval_steps"]) return undefined
if (!Array.isArray(props["eval_steps"])) return undefined
return props["eval_steps"] as string[]
}

function get_task_description(eval_config: EvalConfig): string {
if (!eval_config?.properties) return "No description provided."
const props = eval_config.properties as Record<string, unknown>
return (props["task_description"] as string) || "No description provided."
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate properties value types before rendering.

get_eval_steps and get_task_description still trust unknown payload types after casting. Non-string values can render incorrectly in UI.

Suggested fix
 function get_eval_steps(eval_config: EvalConfig): string[] | undefined {
   if (!eval_config) return undefined
   if (!eval_config.properties) return undefined
   const props = eval_config.properties as Record<string, unknown>
   if (!props["eval_steps"]) return undefined
   if (!Array.isArray(props["eval_steps"])) return undefined
-  return props["eval_steps"] as string[]
+  if (!props["eval_steps"].every((s) => typeof s === "string")) return undefined
+  return props["eval_steps"] as string[]
 }
 
 function get_task_description(eval_config: EvalConfig): string {
   if (!eval_config?.properties) return "No description provided."
   const props = eval_config.properties as Record<string, unknown>
-  return (props["task_description"] as string) || "No description provided."
+  return typeof props["task_description"] === "string" &&
+    props["task_description"].trim() !== ""
+    ? props["task_description"]
+    : "No description provided."
 }

Also applies to: 26-26

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/eval_config_instruction.svelte
around lines 9 - 19, The helper functions get_eval_steps and
get_task_description must validate the untyped properties before casting: in
get_eval_steps (taking eval_config: EvalConfig) check eval_config?.properties
exists, ensure props["eval_steps"] is an array with Array.isArray and that every
element typeof === "string" (or coerce safe values via String and filter out
null/undefined) and then return a string[]; in get_task_description ensure
props["task_description"] is a string (typeof === "string") and only return it,
otherwise return the default "No description provided." This prevents non-string
values from being rendered and keeps the code robust against malformed payloads.

Comment on lines +65 to +67
args = call.get("arguments") or call.get("args") or {}
if all(args.get(k) == v for k, v in expected_args.items()):
return True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard non-dict tool-call arguments before .get().

has_tool_call can raise when arguments/args is not a dict (common when args arrive as strings), which turns a matching check into a scorer failure.

Proposed fix
-            args = call.get("arguments") or call.get("args") or {}
-            if all(args.get(k) == v for k, v in expected_args.items()):
+            args = call.get("arguments") or call.get("args") or {}
+            if not isinstance(args, dict):
+                continue
+            if all(args.get(k) == v for k, v in expected_args.items()):
                 return True
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_helpers.py` around lines 65 - 67, In
has_tool_call, guard against non-dict tool call arguments before calling .get:
when extracting args (the variable assigned from call.get("arguments") or
call.get("args")), check isinstance(args, dict) and if not, set args = {} (or
convert to a dict if appropriate) so the subsequent all(args.get(k) == v for k,
v in expected_args.items()) cannot raise; update the code around the args
assignment in has_tool_call to perform this check and then use the safe args
variable in the matching logic.

Comment thread libs/core/kiln_ai/adapters/eval/eval_runner.py Outdated
Comment on lines +70 to +80
for m in metrics:
metric_name = f'"{m}"'
count = raw_output.count(metric_name)
if count != 1:
raise ValueError(
f"Metric {m} should appear exactly once in the output. Found {count} times"
)
offset = raw_output.find(metric_name)
if offset == -1:
raise ValueError(f"Metric {m} not found in raw output")
offsets[m] = offset

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Key offset detection is too broad and can fail on valid outputs.

metric_offsets uses raw substring counting (count/find) instead of matching JSON key positions. If a quoted metric name appears in string content, this path can raise a false “appear exactly once” error and block scoring.

Suggested fix
+import re
@@
 def metric_offsets(raw_output: str, metrics: List[str]) -> Dict[str, int]:
@@
     offsets: Dict[str, int] = {}
     for m in metrics:
-        metric_name = f'"{m}"'
-        count = raw_output.count(metric_name)
-        if count != 1:
+        key_pattern = re.compile(rf'"{re.escape(m)}"\s*:')
+        matches = list(key_pattern.finditer(raw_output))
+        if len(matches) != 1:
             raise ValueError(
-                f"Metric {m} should appear exactly once in the output. Found {count} times"
+                f"Metric {m} should appear exactly once as a JSON key. Found {len(matches)} times"
             )
-        offset = raw_output.find(metric_name)
-        if offset == -1:
-            raise ValueError(f"Metric {m} not found in raw output")
-        offsets[m] = offset
+        offsets[m] = matches[0].start()
     return offsets
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for m in metrics:
metric_name = f'"{m}"'
count = raw_output.count(metric_name)
if count != 1:
raise ValueError(
f"Metric {m} should appear exactly once in the output. Found {count} times"
)
offset = raw_output.find(metric_name)
if offset == -1:
raise ValueError(f"Metric {m} not found in raw output")
offsets[m] = offset
for m in metrics:
key_pattern = re.compile(rf'"{re.escape(m)}"\s*:')
matches = list(key_pattern.finditer(raw_output))
if len(matches) != 1:
raise ValueError(
f"Metric {m} should appear exactly once as a JSON key. Found {len(matches)} times"
)
offsets[m] = matches[0].start()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py` around lines 70
- 80, The current key-offset detection in metric_offsets uses raw substring
count/find on metric_name which can match quoted metric text inside values;
instead use a JSON-key-aware regex search to locate keys only (e.g. compile
pattern rf'"{re.escape(m)}"\s*:') and use re.finditer on raw_output to collect
matches, ensure exactly one match per metric, and set offsets[m] =
match.start(); replace the existing count/find checks with this regex-based
logic to avoid false positives from quoted values.

Comment on lines +33 to +40
result = extract(expression, data)
if isinstance(result, Undefined):
return (
None,
SkippedReason.extraction_failed,
f"Expression '{expression}' resolved to undefined",
)
return result, None, None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Locate jinja extract implementation:"
fd -i 'jinja_engine.py'

echo
echo "Show extract implementation (read-only):"
fd -i 'jinja_engine.py' --exec sed -n '1,240p' {}

echo
echo "Find eval helper call sites relying on extract behavior:"
rg -n --type=py -C2 '\bextract\(' libs/core/kiln_ai/adapters/eval libs/core/kiln_ai/utils

Repository: Kiln-AI/Kiln

Length of output: 16213


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect v2_eval_helpers.py around the referenced lines:"
sed -n '1,140p' libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py

echo
echo "Locate SkippedReason definition:"
rg -n "class SkippedReason|SkippedReason\s*=" libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py libs/core/kiln_ai/adapters/eval -S

echo
# If found via rg, print the most relevant match file(s) area
matches=$(rg -n "class SkippedReason|SkippedReason\s*=" libs/core/kiln_ai/adapters/eval -S -o --no-filename | head -n 1 || true)
echo "done"

Repository: Kiln-AI/Kiln

Length of output: 2817


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Locate SkippedReason in kiln_ai datamodel eval:"

rg -n "class SkippedReason|SkippedReason\s*=" libs/core/kiln_ai/datamodel/eval* libs/core/kiln_ai/datamodel -S

echo
echo "Show likely SkippedReason definition file(s):"
files=$(rg -l "SkippedReason" libs/core/kiln_ai/datamodel/eval* libs/core/kiln_ai/datamodel -S || true)
for f in $files; do
  echo "---- $f ----"
  sed -n '1,220p' "$f" | sed -n '1,220p' >/dev/null
done
# Print exact matches with surrounding context
rg -n "SkippedReason" libs/core/kiln_ai/datamodel/eval* libs/core/kiln_ai/datamodel -S -C2 | head -n 80

Repository: Kiln-AI/Kiln

Length of output: 3141


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "v2_eval_helpers.py (line-numbered excerpt):"
nl -ba libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py | sed -n '1,120p'

echo
echo "SkippedReason enum (line-numbered excerpt):"
nl -ba libs/core/kiln_ai/datamodel/eval.py | sed -n '220,310p'

echo
echo "jinja_engine.py extract() (line-numbered excerpt):"
nl -ba libs/core/kiln_ai/utils/jinja_engine.py | sed -n '60,140p'

Repository: Kiln-AI/Kiln

Length of output: 161


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "v2_eval_helpers.py (line-numbered excerpt):"
cat -n libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py | sed -n '1,140p'

echo
echo "SkippedReason enum (line-numbered excerpt):"
cat -n libs/core/kiln_ai/datamodel/eval.py | sed -n '220,310p'

echo
echo "jinja_engine.py extract() (line-numbered excerpt):"
cat -n libs/core/kiln_ai/utils/jinja_engine.py | sed -n '60,170p'

Repository: Kiln-AI/Kiln

Length of output: 8853


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Search extract exception expectations in test_jinja_engine.py:"
rg -n "extract\(" libs/core/kiln_ai/utils/test_jinja_engine.py -n
echo
rg -n "extract\([^\)]*\).*pytest\.raises|pytest\.raises.*extract\(" libs/core/kiln_ai/utils/test_jinja_engine.py -S
echo
rg -n "SecurityError|UndefinedError|TemplateSyntaxError" libs/core/kiln_ai/utils/test_jinja_engine.py -S

Repository: Kiln-AI/Kiln

Length of output: 631


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Find all extract() call sites in libs/core (short context):"
rg -n --type=py -C2 '\bextract\(' libs/core/kiln_ai | head -n 200

echo
echo "Look for SecurityError/UndefinedError tests that involve extract():"
rg -n --type=py '(SecurityError|UndefinedError).*extract\(|extract\(.*(SecurityError|UndefinedError)' libs/core/kiln_ai/utils/test_jinja_engine.py

Repository: Kiln-AI/Kiln

Length of output: 17252


Handle extract(...) exceptions and map them to SkippedReason.extraction_failed (v2_eval_helpers.py lines 33-39 and 73-78).

Eval helpers only skip when extract(...) returns Undefined. But kiln_ai.utils.jinja_engine.extract() also raises (e.g., malformed expressions become ValueError, and sandbox/runtime errors can raise during compiled(**data)), which can hard-fail eval instead of returning a skip tuple.

Suggested fix
 def extract_value(
     expression: str | None,
     eval_input: EvalTaskInput,
 ) -> tuple[Any, SkippedReason | None, str | None]:
@@
     if expression is None:
         return eval_input.final_message, None, None
     data = eval_input.model_dump()
-    result = extract(expression, data)
+    try:
+        result = extract(expression, data)
+    except Exception as exc:
+        return (
+            None,
+            SkippedReason.extraction_failed,
+            f"Expression '{expression}' raised: {exc}",
+        )
     if isinstance(result, Undefined):
@@
 def check_required_vars(
@@
     data = eval_input.model_dump()
     for var_expr in required_vars:
-        result = extract(var_expr, data)
+        try:
+            result = extract(var_expr, data)
+        except Exception as exc:
+            return (
+                SkippedReason.extraction_failed,
+                f"required_var '{var_expr}' raised: {exc}",
+            )
         if isinstance(result, Undefined):
             return (
                 SkippedReason.extraction_failed,
                 f"required_var '{var_expr}' resolved to undefined",
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py` around lines
33 - 40, Wrap the call to extract(expression, data) in a try/except so any
exception (e.g., ValueError or runtime errors from compiled(**data)) is caught
and translated into the same skip tuple used for Undefined; keep the existing
check for isinstance(result, Undefined) but add an except Exception as e that
returns (None, SkippedReason.extraction_failed, f"Expression '{expression}'
raised error: {e}") and ensure you import/keep references to extract, Undefined,
and SkippedReason from the current module.

Comment on lines +117 to +123
if isinstance(value, int):
value = float(value)
if not isinstance(value, float):
raise RuntimeError(
f"Score '{key}' must be a float, got {type(value).__name__}"
)
validated[key] = value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject non-finite score values before returning EvalScores.

Line 119 currently accepts NaN/Infinity as valid floats; those values can corrupt score aggregation and persisted metrics.

🔧 Proposed fix
 import asyncio
+import math
 from threading import Lock
 from typing import Any
@@
             if isinstance(value, int):
                 value = float(value)
             if not isinstance(value, float):
                 raise RuntimeError(
                     f"Score '{key}' must be a float, got {type(value).__name__}"
                 )
+            if not math.isfinite(value):
+                raise RuntimeError(
+                    f"Score '{key}' must be a finite float, got {value!r}"
+                )
             validated[key] = value
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if isinstance(value, int):
value = float(value)
if not isinstance(value, float):
raise RuntimeError(
f"Score '{key}' must be a float, got {type(value).__name__}"
)
validated[key] = value
if isinstance(value, int):
value = float(value)
if not isinstance(value, float):
raise RuntimeError(
f"Score '{key}' must be a float, got {type(value).__name__}"
)
if not math.isfinite(value):
raise RuntimeError(
f"Score '{key}' must be a finite float, got {value!r}"
)
validated[key] = value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py` around lines 117 - 123,
The current validation block (the one assigning into validated[key]) accepts
float values like NaN or Infinity; update it to reject non-finite floats before
populating validated (and before returning EvalScores) by using math.isfinite on
the float-converted value and raising a RuntimeError if not finite. Keep the
existing int->float conversion and type check for value, then add a finiteness
check referencing the same symbols (value, key, validated) and include a clear
error message like "Score '{key}' must be a finite float, got {value}". Ensure
math is imported where this validation lives.

Comment on lines +31 to +38
try:
match = re.search(props.pattern, actual)
except re.error as e:
return (
{},
SkippedReason.extraction_failed,
f"Invalid regex pattern '{props.pattern}': {e}",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Invalid regex should fail fast as config error, not be reported as extraction skip.

Returning SkippedReason.extraction_failed here masks a static eval-config bug and can silently skip every run. This should raise a hard error (or a dedicated config-validation error) so misconfiguration is surfaced immediately.

Suggested fix
         try:
             match = re.search(props.pattern, actual)
         except re.error as e:
-            return (
-                {},
-                SkippedReason.extraction_failed,
-                f"Invalid regex pattern '{props.pattern}': {e}",
-            )
+            raise ValueError(f"Invalid regex pattern '{props.pattern}': {e}") from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py` around lines 31 -
38, The code is treating an invalid regex as an extraction skip; update the
error handling in the block around re.search(props.pattern, actual) so an
invalid pattern raises a hard configuration/validation error instead of
returning SkippedReason.extraction_failed—replace the except re.error branch
that returns ( {}, SkippedReason.extraction_failed, ... ) with a raised
exception (e.g., ValueError or a dedicated ConfigValidationError) that includes
the pattern and original exception message so the misconfiguration surfaces
immediately; keep references to re.search(props.pattern, actual) and
SkippedReason.extraction_failed to locate and change the logic.

Comment on lines +25 to +27
props = self.properties
assert isinstance(props, StepCountCheckProperties)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use an explicit runtime check instead of assert for properties typing.

Line 26’s assert can be optimized out (-O), removing this guard in runtime builds. Prefer an explicit if not isinstance(...): raise ValueError(...) to keep behavior deterministic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py` around lines 25
- 27, Replace the runtime-optimizable assert with an explicit type check: after
props = self.properties, verify with if not isinstance(props,
StepCountCheckProperties): raise ValueError("expected StepCountCheckProperties
for self.properties") so the guard in V2EvalStepCountCheck (using properties and
StepCountCheckProperties) remains effective in optimized builds.

Comment thread libs/core/kiln_ai/datamodel/eval.py

| Type | Result renderer adds (under the score badge) | Read-only config-detail shows |
|---|---|---|
| `llm_judge` | The judge's **reasoning text** + per-criterion pass/fail breakdown (the structured judge output already persisted). | The criteria/rubric, `g_eval_mode` flag, judge model. (`system_prompt`/`prompt_template` collapsed under "Advanced".) |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the canonical g_eval field name in the renderer guidance.

This line references g_eval_mode, but the V2 property is documented elsewhere in this PR as g_eval. Keeping the old name here risks implementing the wrong binding in UI details.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/projects/evals_v2/components/70_builder_and_onboarding.md` at line 280,
The renderer guidance incorrectly references the old property name g_eval_mode;
update that reference to the canonical V2 field g_eval so the llm_judge binding
uses g_eval (ensure any text mentioning g_eval_mode in the llm_judge renderer
guidance is replaced and consistent with the documented g_eval property).

Per-type registry with exhaustive switch over V2EvalType (assertNever
guard) and pluggable create container with type picker. Six deterministic
create forms (exact_match, pattern_match, contains, set_check,
tool_call_check, step_count_check) plus llm_judge form extracted from
the monolithic page.

Code_eval (Beta) create UI: lazy-loaded CodeMirror 6 Python editor,
starter-code examples, ephemeral trust modal (grant/check endpoints),
sandboxed preview via the test-run endpoint.

Eight per-type result renderers integrated into the three view surfaces
(run_result page, eval detail page, eval_configs page). New
`test_v2_eval` preview endpoint for try-it-before-save. OpenAPI schema
regenerated (includes test_v2_eval + grant/check trust endpoints).

Follow-ups:
(a) Persist V2 run-level detail (llm_judge reasoning, code_eval
    stdout/stderr/traceback) so result renderers can show live per-run
    detail — deferred because the V2 evaluate() contract returns only
    (scores, skip, detail) with no run-detail slot.
(b) run_config_comparison_table.svelte intentionally left unchanged
    (score-type-agnostic, per product decision).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@socket-security

socket-security Bot commented Jun 8, 2026

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/web_ui/src/lib/components/code_editor.svelte`:
- Around line 15-90: The editor only sets its document once in onMount via
EditorState.create so subsequent updates to the exported prop value are ignored;
add a reactive watcher that, when value changes after mount and view exists,
compares value to view.state.doc.toString() and, if different, calls
view.dispatch(...) to replace the document (same change shape used in export
function setValue) to keep the editor in sync with external prop updates (refer
to onMount, EditorState.create, view, value and export function setValue).

In `@app/web_ui/src/lib/components/eval_types/code_eval_form.svelte`:
- Around line 33-44: The timeout value may be outside the documented 1–300s
range when submitted; clamp/normalize timeout_seconds to that range before
exposing it in getProperties (and the other similar exporter around 161–168).
Locate the timeout_seconds variable and the export function getProperties(),
coerce timeout_seconds to a number, clamp it to Math.max(1, Math.min(300,
Number(timeout_seconds) || 1)), assign the normalized value back to
properties.timeout_seconds, and return that normalized timeout_seconds in the
payload so invalid numeric states cannot leak into the submitted properties.

In `@app/web_ui/src/lib/components/eval_types/code_eval_form.test.ts`:
- Around line 62-68: The tests for CodeEvalForm currently check for the timeout
label and help text but never assert the actual timeout value; update the
"renders the timeout input with default value of 30" test (render(CodeEvalForm))
to locate the timeout input (e.g., via getByLabelText("Timeout (seconds)") or
querySelector for the timeout input) and assert its value/prop equals "30" or 30
(depending on input type); likewise, in the "updated timeout" test (the test
that simulates changing timeout_seconds to 60) add an assertion after the change
to verify the component state/prop or the input value reflects 60 (e.g.,
expect(timeoutInput.value).toBe("60") or
expect(wrapper.vm.timeout_seconds).toBe(60)), referencing the CodeEvalForm
timeout input and the timeout_seconds property to locate where to add the
assertions.

In `@app/web_ui/src/lib/components/eval_types/exact_match_form.svelte`:
- Around line 17-22: The type union for the local variable `source` in
ExactMatchForm (the `source` declaration) incorrectly includes
"value_expression" even though the initializer only ever sets "expected_value"
or "reference_key"; remove "value_expression" from the union and update the
declaration to be only "expected_value" | "reference_key" so the type matches
the actual assignment and consumers of `source` (e.g., any branches or bindings
that check `source`) no longer need to handle an unreachable variant.

In `@app/web_ui/src/lib/components/eval_types/llm_judge_form.svelte`:
- Around line 190-194: The suggested-model click handler incorrectly assigns
display labels to provider_name and model_name which breaks capability
resolution in update_unsupported_algos_and_default_algo; change the handler to
set provider_id and model_id (e.g., provider_id = model.provider_id and model_id
= model.model_id) and keep combined_model_name as
`${model.provider_id}/${model.model_id}` while optionally also setting
provider_name/model_name to their display labels if UI text needs
updating—ensure update_unsupported_algos_and_default_algo continues to receive
the ID pair (provider_id + model_id) for capability detection.

In `@app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte`:
- Around line 11-19: The current reactive assignment for props casts
eval_config.properties to the llm_judge shape whenever a "type" key exists;
tighten the guard to only cast when eval_config?.properties?.type ===
"llm_judge" (and optionally verify model_name/model_provider keys exist) so
non-llm_judge configs won't be treated as llm_judge; update the $: props = ...
expression to check the explicit type value before casting to the { type:
"llm_judge", model_name: string, model_provider: string, g_eval: boolean }
shape.

In `@app/web_ui/src/lib/components/eval_types/set_check_form.svelte`:
- Around line 23-27: The reactive assignment to properties.expected_set (the $:
block that sets properties.expected_set from expected_set_text) overwrites the
explicit nulling when the source is reference_key; change that reactive update
to be conditional so it only runs when the source is not "reference_key" (e.g.,
guard on properties.source or selected_source !== "reference_key"), or
alternatively set properties.expected_set = null inside that reactive block when
properties.source === "reference_key"; update the $: properties.expected_set =
... expression to respect the source and prevent restoring stale values (this
also fixes the same issue in the block around lines 53-59).

In `@app/web_ui/src/lib/components/eval_types/tool_call_check_form.svelte`:
- Around line 158-170: Buttons rendering argument controls (the remove and add
actions) are missing explicit type attributes so they default to submit inside
the parent form causing unintended form submission; update the two button
elements that call remove_arg(item_index, arg_index) and add_arg(item_index) to
include type="button" so they act as plain buttons and do not submit the
enclosing form (look for the button with on:click={() => remove_arg(item_index,
arg_index)} and the button with on:click={() => add_arg(item_index)} and add
type="button" to each).

In `@app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte`:
- Around line 15-23: The reactive assignment for props currently assumes any
eval_config.properties with a type field is the tool_call_check shape; change
the guard to explicitly check eval_config?.properties?.type ===
"tool_call_check" before casting to the tool_call shape (the variable is props)
and ensure downstream uses (e.g. accessing props.expected_tools.length around
the code that uses props) handle props being null/undefined by conditional
checks (or optional chaining) so the component never dereferences undefined when
a different eval type is provided.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte:
- Around line 159-165: When switching eval types you need to reset the test-run
state so a previously run test doesn't allow saving a new untested type: in
select_v2_type(type: V2EvalType) and go_back_to_type_picker() clear the test
flag (reset test_has_run to false and any related test result state) immediately
after changing selected_v2_type; ensure the same reset happens both when setting
selected_v2_type = type and when setting it back to null so the save gate that
checks test_has_run cannot be bypassed.
- Around line 291-297: The grant_trust_and_retry function is currently assigning
failures to test_error instead of the main submit error channel; change the
catch handler inside grant_trust_and_retry (which calls
grantCodeEvalTrust(project_id)) to set create_evaluator_error =
createKilnError(e) rather than test_error and then return false so the UI
surfaces trust-grant failures on the main submit error path.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/page.test.ts:
- Around line 293-369: The test "retries test after granting trust via dialog"
never exercises the retry path: it only asserts mockTestV2Eval was called once
and never invokes the dialog's async action or mockGrantCodeEvalTrust, so the
grant-and-retry behavior isn't validated. Fix by simulating the dialog action
after locating the dialog stub (querySelectorAll('[data-testid="dialog-stub"]')
and matching data-title "Allow Code Execution"), then call the grant action
sequence: invoke mockGrantCodeEvalTrust (or simulate the dialog's
asyncAction/grant_trust_and_retry) so that mockGrantCodeEvalTrust is awaited and
mockTestV2Eval is invoked a second time; finally update assertions to expect
mockGrantCodeEvalTrust toHaveBeenCalled() and mockTestV2Eval
toHaveBeenCalledTimes(2) to confirm retry occurred.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 727249cd-184f-435a-a580-8335684c48e8

📥 Commits

Reviewing files that changed from the base of the PR and between f3777c8 and 99f0c43.

⛔ Files ignored due to path filters (1)
  • app/web_ui/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (54)
  • app/desktop/studio_server/eval_api.py
  • app/desktop/studio_server/test_eval_api.py
  • app/web_ui/package.json
  • app/web_ui/src/lib/api/v2_eval_api.test.ts
  • app/web_ui/src/lib/api/v2_eval_api.ts
  • app/web_ui/src/lib/api_schema.d.ts
  • app/web_ui/src/lib/components/code_editor.svelte
  • app/web_ui/src/lib/components/eval_types/__tests__/code_editor_stub.svelte
  • app/web_ui/src/lib/components/eval_types/__tests__/dialog_stub.svelte
  • app/web_ui/src/lib/components/eval_types/code_eval_form.svelte
  • app/web_ui/src/lib/components/eval_types/code_eval_form.test.ts
  • app/web_ui/src/lib/components/eval_types/code_eval_result.svelte
  • app/web_ui/src/lib/components/eval_types/code_eval_result.test.ts
  • app/web_ui/src/lib/components/eval_types/contains_form.svelte
  • app/web_ui/src/lib/components/eval_types/contains_result.svelte
  • app/web_ui/src/lib/components/eval_types/contains_result.test.ts
  • app/web_ui/src/lib/components/eval_types/eval_result_scores.svelte
  • app/web_ui/src/lib/components/eval_types/eval_result_scores.test.ts
  • app/web_ui/src/lib/components/eval_types/exact_match_form.svelte
  • app/web_ui/src/lib/components/eval_types/exact_match_result.svelte
  • app/web_ui/src/lib/components/eval_types/exact_match_result.test.ts
  • app/web_ui/src/lib/components/eval_types/llm_judge_form.svelte
  • app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte
  • app/web_ui/src/lib/components/eval_types/llm_judge_result.test.ts
  • app/web_ui/src/lib/components/eval_types/pattern_match_form.svelte
  • app/web_ui/src/lib/components/eval_types/pattern_match_result.svelte
  • app/web_ui/src/lib/components/eval_types/pattern_match_result.test.ts
  • app/web_ui/src/lib/components/eval_types/set_check_form.svelte
  • app/web_ui/src/lib/components/eval_types/set_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/set_check_result.test.ts
  • app/web_ui/src/lib/components/eval_types/step_count_check_form.svelte
  • app/web_ui/src/lib/components/eval_types/step_count_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
  • app/web_ui/src/lib/components/eval_types/tool_call_check_form.svelte
  • app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/tool_call_check_result.test.ts
  • app/web_ui/src/lib/utils/eval_types/eval_steps_utils.ts
  • app/web_ui/src/lib/utils/eval_types/registry.test.ts
  • app/web_ui/src/lib/utils/eval_types/registry.ts
  • app/web_ui/src/lib/utils/exhaustive.ts
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/+page.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/[eval_config_id]/[run_config_id]/run_result/+page.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/app_page_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/collapse_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/dialog_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/form_container_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/llm_judge_form_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/__tests__/v2_form_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/page.test.ts
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/eval_configs/+page.svelte
  • specs/projects/evals_v2/implementation_plan.md
  • specs/projects/evals_v2/phase_plans/phase_6.md
✅ Files skipped from review due to trivial changes (9)
  • app/web_ui/src/lib/components/eval_types/tests/dialog_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/collapse_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/form_container_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/llm_judge_form_stub.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/v2_form_stub.svelte
  • app/web_ui/package.json
  • app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/tests/app_page_stub.svelte
  • app/web_ui/src/lib/api_schema.d.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • app/desktop/studio_server/test_eval_api.py
  • app/desktop/studio_server/eval_api.py

Comment on lines +15 to +90
onMount(async () => {
const [
{ EditorView, keymap, placeholder: placeholderExt, lineNumbers },
{ EditorState },
{ python },
{ defaultKeymap, history, historyKeymap },
{ syntaxHighlighting, defaultHighlightStyle },
] = await Promise.all([
import("@codemirror/view"),
import("@codemirror/state"),
import("@codemirror/lang-python"),
import("@codemirror/commands"),
import("@codemirror/language"),
])

const extensions = [
lineNumbers(),
history(),
syntaxHighlighting(defaultHighlightStyle),
python(),
keymap.of([...defaultKeymap, ...historyKeymap]),
EditorView.updateListener.of((update) => {
if (update.docChanged) {
value = update.state.doc.toString()
dispatch("change", value)
}
}),
EditorView.theme({
"&": { minHeight: min_height },
".cm-scroller": { overflow: "auto" },
".cm-content": { fontFamily: "monospace", fontSize: "14px" },
".cm-gutters": {
backgroundColor: "transparent",
borderRight: "1px solid oklch(var(--bc) / 0.2)",
},
}),
]

if (placeholder) {
extensions.push(placeholderExt(placeholder))
}

if (readonly) {
extensions.push(EditorState.readOnly.of(true))
}

view = new EditorView({
state: EditorState.create({
doc: value,
extensions,
}),
parent: container,
})

loading = false
})

onDestroy(() => {
view?.destroy()
})

export function setValue(newValue: string) {
if (view) {
const currentValue = view.state.doc.toString()
if (currentValue !== newValue) {
view.dispatch({
changes: {
from: 0,
to: view.state.doc.length,
insert: newValue,
},
})
}
}
value = newValue
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

value is treated as init-only; external prop updates are ignored after mount.

If the parent updates value later, the editor content stays stale because the doc is only set during initial EditorState.create (Line 62-64).

Suggested fix
   onDestroy(() => {
     view?.destroy()
   })
 
+  $: if (view) {
+    const currentValue = view.state.doc.toString()
+    if (currentValue !== value) {
+      view.dispatch({
+        changes: {
+          from: 0,
+          to: view.state.doc.length,
+          insert: value,
+        },
+      })
+    }
+  }
+
   export function setValue(newValue: string) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/components/code_editor.svelte` around lines 15 - 90, The
editor only sets its document once in onMount via EditorState.create so
subsequent updates to the exported prop value are ignored; add a reactive
watcher that, when value changes after mount and view exists, compares value to
view.state.doc.toString() and, if different, calls view.dispatch(...) to replace
the document (same change shape used in export function setValue) to keep the
editor in sync with external prop updates (refer to onMount, EditorState.create,
view, value and export function setValue).

Comment on lines +33 to +44
let timeout_seconds: number = properties.timeout_seconds ?? 30

$: properties.timeout_seconds = timeout_seconds

export function getProperties(): components["schemas"]["CodeEvalProperties"] & {
timeout_seconds?: number
} {
return {
type: "code_eval",
code: properties.code,
timeout_seconds,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize timeout to the documented 1–300 range before submit.

The UI text says 1–300 seconds, but getProperties() forwards raw timeout_seconds. Invalid numeric states can leak into the payload.

Proposed fix
   export function getProperties(): components["schemas"]["CodeEvalProperties"] & {
     timeout_seconds?: number
   } {
+    const normalized_timeout = Number.isFinite(timeout_seconds)
+      ? Math.min(300, Math.max(1, Math.trunc(timeout_seconds)))
+      : 30
     return {
       type: "code_eval",
       code: properties.code,
-      timeout_seconds,
+      timeout_seconds: normalized_timeout,
     }
   }

Also applies to: 161-168

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/components/eval_types/code_eval_form.svelte` around lines
33 - 44, The timeout value may be outside the documented 1–300s range when
submitted; clamp/normalize timeout_seconds to that range before exposing it in
getProperties (and the other similar exporter around 161–168). Locate the
timeout_seconds variable and the export function getProperties(), coerce
timeout_seconds to a number, clamp it to Math.max(1, Math.min(300,
Number(timeout_seconds) || 1)), assign the normalized value back to
properties.timeout_seconds, and return that normalized timeout_seconds in the
payload so invalid numeric states cannot leak into the submitted properties.

Comment on lines +62 to +68
it("renders the timeout input with default value of 30", () => {
const { container } = render(CodeEvalForm)
expect(container.textContent).toContain("Timeout (seconds)")
expect(container.textContent).toContain(
"Maximum time allowed for the score function to execute",
)
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Timeout tests don’t verify timeout values, so regressions can slip through.

Line 62’s test title says default is 30 but never asserts input/property value. Line 78’s “updated timeout” test also never asserts timeout_seconds changed to 60.

Suggested fix
   it("renders the timeout input with default value of 30", () => {
-    const { container } = render(CodeEvalForm)
+    const { container, component } = render(CodeEvalForm)
     expect(container.textContent).toContain("Timeout (seconds)")
     expect(container.textContent).toContain(
       "Maximum time allowed for the score function to execute",
     )
+    expect(component.getProperties().timeout_seconds).toBe(30)
   })
@@
   it("produces CodeEvalProperties with updated timeout", async () => {
     const { component, container } = render(CodeEvalForm)
@@
-    if (input) {
-      await fireEvent.input(input, { target: { value: "60" } })
-    }
+    expect(input).not.toBeNull()
+    await fireEvent.input(input, { target: { value: "60" } })
 
     const props = component.getProperties()
     expect(props.type).toBe("code_eval")
+    expect(props.timeout_seconds).toBe(60)
   })

Also applies to: 78-90

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/components/eval_types/code_eval_form.test.ts` around lines
62 - 68, The tests for CodeEvalForm currently check for the timeout label and
help text but never assert the actual timeout value; update the "renders the
timeout input with default value of 30" test (render(CodeEvalForm)) to locate
the timeout input (e.g., via getByLabelText("Timeout (seconds)") or
querySelector for the timeout input) and assert its value/prop equals "30" or 30
(depending on input type); likewise, in the "updated timeout" test (the test
that simulates changing timeout_seconds to 60) add an assertion after the change
to verify the component state/prop or the input value reflects 60 (e.g.,
expect(timeoutInput.value).toBe("60") or
expect(wrapper.vm.timeout_seconds).toBe(60)), referencing the CodeEvalForm
timeout input and the timeout_seconds property to locate where to add the
assertions.

Comment thread app/web_ui/src/lib/components/eval_types/exact_match_form.svelte Outdated
Comment on lines +190 to +194
on:click={() => {
model_name = model.model_name
provider_name = model.provider_name
combined_model_name = `${model.provider_id}/${model.model_id}`
}}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Suggested-model click handler stores display names instead of IDs.

update_unsupported_algos_and_default_algo (Line 147-149) resolves model capabilities via provider_id + model id, but the suggested-model click path sets provider_name/model_name to display labels. This breaks capability detection and can expose unsupported algo choices/defaults.

Proposed fix
             on:click={() => {
-              model_name = model.model_name
-              provider_name = model.provider_name
+              model_name = model.model_id
+              provider_name = model.provider_id
               combined_model_name = `${model.provider_id}/${model.model_id}`
             }}

Also applies to: 141-149

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/components/eval_types/llm_judge_form.svelte` around lines
190 - 194, The suggested-model click handler incorrectly assigns display labels
to provider_name and model_name which breaks capability resolution in
update_unsupported_algos_and_default_algo; change the handler to set provider_id
and model_id (e.g., provider_id = model.provider_id and model_id =
model.model_id) and keep combined_model_name as
`${model.provider_id}/${model.model_id}` while optionally also setting
provider_name/model_name to their display labels if UI text needs
updating—ensure update_unsupported_algos_and_default_algo continues to receive
the ID pair (provider_id + model_id) for capability detection.

Comment on lines +158 to +170
<button
class="btn btn-ghost btn-sm btn-square"
on:click={() => remove_arg(item_index, arg_index)}
title="Remove argument"
>
</button>
</div>
{/each}
<button
class="btn btn-ghost btn-sm self-start"
on:click={() => add_arg(item_index)}
>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set explicit type="button" on non-submit controls.

Line 158 and Line 167 buttons default to submit behavior in form contexts, which can prematurely submit the parent config form while editing arg rows.

Proposed fix
           <button
+            type="button"
             class="btn btn-ghost btn-sm btn-square"
             on:click={() => remove_arg(item_index, arg_index)}
             title="Remove argument"
           >
@@
       <button
+        type="button"
         class="btn btn-ghost btn-sm self-start"
         on:click={() => add_arg(item_index)}
       >
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<button
class="btn btn-ghost btn-sm btn-square"
on:click={() => remove_arg(item_index, arg_index)}
title="Remove argument"
>
</button>
</div>
{/each}
<button
class="btn btn-ghost btn-sm self-start"
on:click={() => add_arg(item_index)}
>
<button
type="button"
class="btn btn-ghost btn-sm btn-square"
on:click={() => remove_arg(item_index, arg_index)}
title="Remove argument"
>
</button>
</div>
{/each}
<button
type="button"
class="btn btn-ghost btn-sm self-start"
on:click={() => add_arg(item_index)}
>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/components/eval_types/tool_call_check_form.svelte` around
lines 158 - 170, Buttons rendering argument controls (the remove and add
actions) are missing explicit type attributes so they default to submit inside
the parent form causing unintended form submission; update the two button
elements that call remove_arg(item_index, arg_index) and add_arg(item_index) to
include type="button" so they act as plain buttons and do not submit the
enclosing form (look for the button with on:click={() => remove_arg(item_index,
arg_index)} and the button with on:click={() => add_arg(item_index)} and add
type="button" to each).

Comment thread app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte Outdated
Comment on lines +159 to 165
function select_v2_type(type: V2EvalType) {
selected_v2_type = type
}

function select_evaluator(algo: EvalConfigType) {
selected_algo = algo
function go_back_to_type_picker() {
selected_v2_type = null
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset test-run state when switching eval types.

test_has_run is never cleared when changing selection (Line 159 and Line 163), so the save gate at Line 325 can be bypassed by testing one type and then saving a different untested type.

Suggested fix
+  function reset_test_state() {
+    test_error = null
+    test_result = null
+    test_has_run = false
+  }
+
   function select_v2_type(type: V2EvalType) {
     selected_v2_type = type
+    reset_test_state()
   }
 
   function go_back_to_type_picker() {
     selected_v2_type = null
+    reset_test_state()
   }

Also applies to: 325-327

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
around lines 159 - 165, When switching eval types you need to reset the test-run
state so a previously run test doesn't allow saving a new untested type: in
select_v2_type(type: V2EvalType) and go_back_to_type_picker() clear the test
flag (reset test_has_run to false and any related test result state) immediately
after changing selected_v2_type; ensure the same reset happens both when setting
selected_v2_type = type and when setting it back to null so the save gate that
checks test_has_run cannot be bypassed.

Comment on lines +291 to +297
async function grant_trust_and_retry(): Promise<boolean> {
try {
await grantCodeEvalTrust(project_id)
} catch (e) {
test_error = createKilnError(e)
return false
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Route trust-grant errors to the correct UI error channel.

When trust grant fails during a save flow, the error is written to test_error instead of create_evaluator_error, so users can miss the failure in the main submit path.

Suggested fix
 async function grant_trust_and_retry(): Promise<boolean> {
+  const action = pending_trust_action
   try {
     await grantCodeEvalTrust(project_id)
   } catch (e) {
-    test_error = createKilnError(e)
+    const kilnError = createKilnError(e)
+    if (action === "save") {
+      create_evaluator_error = kilnError
+    } else {
+      test_error = kilnError
+    }
     return false
   }
-  const action = pending_trust_action
   pending_trust_action = null
   if (action === "test") {
     await run_test()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
around lines 291 - 297, The grant_trust_and_retry function is currently
assigning failures to test_error instead of the main submit error channel;
change the catch handler inside grant_trust_and_retry (which calls
grantCodeEvalTrust(project_id)) to set create_evaluator_error =
createKilnError(e) rather than test_error and then return false so the UI
surfaces trust-grant failures on the main submit error path.

Comment on lines +293 to +369
it("retries test after granting trust via dialog", async () => {
const { container } = await renderPage()

await selectEvalType(container, "Code Eval")
await tick()

const textarea = container.querySelector(
"#test_final_message",
) as HTMLTextAreaElement
await fireEvent.input(textarea, { target: { value: "test output" } })
await tick()

// First call: not trusted. Second call (after grant): success.
mockTestV2Eval
.mockResolvedValueOnce({
scores: {},
skipped_reason: "code_eval_not_trusted",
skipped_detail: null,
})
.mockResolvedValueOnce({
scores: { quality: 0.9 },
skipped_reason: null,
skipped_detail: null,
})

mockGrantCodeEvalTrust.mockResolvedValueOnce({ trusted: true })

// Click Try It – triggers the not-trusted path
const tryBtn = container.querySelector(
"button.btn-primary.btn-sm",
) as HTMLButtonElement
await fireEvent.click(tryBtn)
await tick()
await new Promise((r) => setTimeout(r, 0))
await tick()

// Trust dialog was shown
expect(showCalls).toContain("Allow Code Execution")

// Now find the trust dialog's action button in the DOM and examine it.
// Since our Dialog stub records calls but doesn't render action buttons,
// we invoke grant_trust_and_retry directly by finding the dialog's
// asyncAction and calling it. We can do this by looking at the
// data-action-buttons attribute or by accessing the component directly.
//
// The dialog stub serialises action_buttons. Find the trust dialog.
const dialogs = container.querySelectorAll('[data-testid="dialog-stub"]')
let trustDialogEl: Element | null = null
for (const d of dialogs) {
if (d.getAttribute("data-title") === "Allow Code Execution") {
trustDialogEl = d
}
}
expect(trustDialogEl).not.toBeNull()
const actionBtns = JSON.parse(
trustDialogEl!.getAttribute("data-action-buttons") || "[]",
)
const grantBtn = actionBtns.find(
(b: Record<string, unknown>) => b.isWarning,
)
expect(grantBtn).toBeTruthy()

// The asyncAction in the real component is grant_trust_and_retry.
// Since the stub doesn't wire up action_buttons, we simulate
// what the real Dialog does: call the asyncAction.
// We imported the page component; the function is internal.
// Instead, let's invoke it via the mock chain: grantCodeEvalTrust
// should be called, then testV2Eval retried.

// We can get the page component instance and call the method,
// but it's private. The cleaner approach: directly test the
// observable side effects by verifying mock calls.

// Since we can't click a real dialog button (stub doesn't render them),
// verify the state: testV2Eval was called once, trust dialog shown.
expect(mockTestV2Eval).toHaveBeenCalledTimes(1)
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The retry-path test never executes the retry path.

On Line 293, the case says it verifies “retries test after granting trust,” but it never triggers the dialog action and ends by asserting mockTestV2Eval was called only once (Line 368). This will pass even if grant-and-retry is broken.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/page.test.ts
around lines 293 - 369, The test "retries test after granting trust via dialog"
never exercises the retry path: it only asserts mockTestV2Eval was called once
and never invokes the dialog's async action or mockGrantCodeEvalTrust, so the
grant-and-retry behavior isn't validated. Fix by simulating the dialog action
after locating the dialog stub (querySelectorAll('[data-testid="dialog-stub"]')
and matching data-title "Allow Code Execution"), then call the grant action
sequence: invoke mockGrantCodeEvalTrust (or simulate the dialog's
asyncAction/grant_trust_and_retry) so that mockGrantCodeEvalTrust is awaited and
mockTestV2Eval is invoked a second time; finally update assertions to expect
mockGrantCodeEvalTrust toHaveBeenCalled() and mockTestV2Eval
toHaveBeenCalledTimes(2) to confirm retry occurred.

scosman and others added 2 commits June 8, 2026 17:52
never mode and on_unexpected_tools="fail" used name-only matching and
ignored expected_args (spec components/22 §3.5, J.37). Both now use the
full _call_matches_spec matcher. Adds tests for never + expected_args
and on_unexpected + expected_args, the previously-uncovered cases that
let the bug ship.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the coexistence tests architecture.md §5 elevates to first-class:
- V1 EvalRun loads with new optional fields defaulting to None (+ round-trip)
- V1 EvalConfig with absent config_type dispatches via the legacy path
- V1 g_eval whose properties dict carries a "type" key is not misrouted
  into the V2 discriminated union (in-memory + round-trip)
- V1 g_eval / llm_as_judge create + report unchanged through the API

No production code changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chiang-daniel chiang-daniel mentioned this pull request Jun 9, 2026
6 tasks
scosman and others added 2 commits June 9, 2026 20:44
…aseEval

V2 EvalConfigs silently judged the dataset item's stale output in
task_run_eval mode (ignoring the selected run config) because V2 adapters
were built on a judge-only BaseV2Eval fork with no fresh-output path.

- Unfork BaseV2Eval -> single BaseEval via thin BaseV2EvalBridge (C.11c)
- V2 task_run_eval generates fresh output (run_task) then scores it, with
  skip handling matching eval_config_eval
- Extract run_task from run_task_and_eval (pure refactor; legacy unchanged)
- Design B: build EvalTaskInput per source via EvalTaskInput.from_task_run;
  remove dead EvalJob.stored_output/stored_trace
- EvalInput-source execution deferred to Task B (clean NotImplementedError)
- Tests: fresh-gen distinctness + skip-during-task_run_eval coverage

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l_config_eval

EvalInput-source V2 evals were non-functional: the runner rejected
task_run_eval + eval_input_filter_id at construction and the EvalInput
branch raised NotImplementedError.

- Constructor accepts EvalInput sources in task_run_eval (C.runner.3)
- collect_tasks_for_eval_input crosses inputs x configs x run_configs (3-level dedup)
- EvalInput + task_run_eval: generate fresh output from user_message via run_task,
  judge with reference_data from EvalInput.reference; persist EvalRun with eval_input_id
- EvalInput + eval_config_eval: clean skip (incompatible_input_shape) — deferred per
  B2.1/K.3 (golden subsets stay TaskRuns in V2.0)
- run_task widened to TaskRun | EvalInput; EvalTaskInput.from_eval_input added
- Tests: fresh-gen + reference_data, skip paths, collection dedup, run_task

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libs/core/kiln_ai/adapters/eval/eval_runner.py (1)

245-247: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist source dataset IDs for V2 task_run_eval runs.

At Line 494, dataset_id is set to run_output.id (fresh run), but dedupe/coverage logic keys by source dataset IDs (Lines 245-247 and Line 260). This breaks rerun deduplication and percent-complete aggregation for V2 task-run evals.

Suggested fix
-            dataset_id = run_output.id
+            dataset_id = job.item.id

Also applies to: 260-260, 492-495

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` around lines 245 - 247, The
dedupe/coverage logic is storing the fresh run id (run_output.id) instead of the
original source dataset id, which breaks rerun deduplication for V2 task-run
evals; update the places that add to and check already_run (references:
already_run, eval_config.id, run.task_run_config_id, run.dataset_id,
run_output.id, task_run_eval) so they persist the source dataset id extracted
from the incoming run (use run.dataset_id or the source id field from the V2
task_run_eval payload) rather than run_output.id when recording and comparing
entries—ensure both the addition (where
already_run[eval_config.id][run.task_run_config_id].add(...)) and any
checks/aggregation that reference already_run use that source id consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Around line 245-247: The dedupe/coverage logic is storing the fresh run id
(run_output.id) instead of the original source dataset id, which breaks rerun
deduplication for V2 task-run evals; update the places that add to and check
already_run (references: already_run, eval_config.id, run.task_run_config_id,
run.dataset_id, run_output.id, task_run_eval) so they persist the source dataset
id extracted from the incoming run (use run.dataset_id or the source id field
from the V2 task_run_eval payload) rather than run_output.id when recording and
comparing entries—ensure both the addition (where
already_run[eval_config.id][run.task_run_config_id].add(...)) and any
checks/aggregation that reference already_run use that source id consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 92090d71-5642-42cb-a644-882a7bb677d2

📥 Commits

Reviewing files that changed from the base of the PR and between 99f0c43 and 55ec5e8.

📒 Files selected for processing (25)
  • app/desktop/studio_server/test_eval_api.py
  • libs/core/kiln_ai/adapters/eval/__init__.py
  • libs/core/kiln_ai/adapters/eval/base_eval.py
  • libs/core/kiln_ai/adapters/eval/eval_runner.py
  • libs/core/kiln_ai/adapters/eval/registry.py
  • libs/core/kiln_ai/adapters/eval/test_eval_runner.py
  • libs/core/kiln_ai/adapters/eval/test_registry.py
  • libs/core/kiln_ai/adapters/eval/test_v2_contains.py
  • libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/test_v2_exact_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
  • libs/core/kiln_ai/datamodel/eval.py
  • libs/core/kiln_ai/datamodel/test_eval_model.py
💤 Files with no reviewable changes (1)
  • libs/core/kiln_ai/adapters/eval/init.py
🚧 Files skipped from review as they are similar to previous changes (15)
  • libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
  • libs/core/kiln_ai/adapters/eval/registry.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/test_eval_runner.py
  • libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/test_registry.py
  • libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
  • libs/core/kiln_ai/datamodel/eval.py

scosman and others added 12 commits June 17, 2026 16:59
… V2.0

Restore V1's battle-tested discrete scoring for V2 launch. RAG judge
templates (faithfulness, answer_relevance, context_relevance,
context_precision, hallucination, answer_correctness) were net-new in V2
and introduced direct-float model output that deviated from V1's discrete
convention, leaked leniency into the shared scorer, and lacked first-class
continuous-score support (UI, data model, docs, g-eval compatibility).

Changes:
- Delete rag_judge_templates.py and its tests (576 lines removed)
- Revert V2 llm_judge to discrete output (allow_float_scores=False)
- Restore pristine strict scorer in scoring_utils.py (remove float
  leniency: isinstance int/float early-return and float() fallback)
- Remove _filter_output_to_score_keys (RAG-specific rich JSON filtering)
- Add tests: discrete score schema assertion, strict scorer rejection
  of non-mapping values (float strings, raw floats)
- Annotate 13 spec files as deferred/punted with pointers to
  /specs/projects/rag_templates/
- Create bring-back overview at specs/projects/rag_templates/project_overview.md
  documenting what was removed, why, design references, and complexities
  to solve before restoration

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verify the RAG judge template + continuous scoring removal (commits
5efc626, 74154c0) landed correctly on the evals_v2 branch. All 5
checklist items pass: V1 scoring identical to main, V2 llm_judge
discrete, no RAG refs in source, specs annotated as punted, and
rag_templates bring-back plan records the removal SHA.

Key finding: _filter_output_to_score_keys was removed with the RAG
templates — Phase 2 item 1b is moot and should be skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d, dead-except removal

Phase 2 of deep-cr-cleanup:

- 1a: extract_value, check_required_vars, and check_reference_key now treat
  None as a skip signal alongside Jinja Undefined, preventing garbage from
  reaching the judge when reference data is missing or null.
- 1b: DROPPED — _filter_output_to_score_keys was already removed with the
  RAG templates (commit 5efc626).
- 1c: V1 EvalRun validate_output_fields now guards against output=None for
  non-skipped runs, catching the bug before it becomes a downstream
  AttributeError.
- 5.4: Removed dead except SyntaxError: pass in CodeEvalProperties.validate_code
  — compile() already catches syntax errors upstream, so ast.parse() never
  raises SyntaxError for code that survived compile().

22 new tests covering all changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…from_type; dedupe dispatch tests

Rename the legacy V1 dispatcher to clearly signal it is the legacy path.
Update V2-branch error message to point callers at v2_eval_adapter_from_config.
Remove two exact-duplicate dispatch tests in test_registry.py (covered by
parametrized and V2-perspective tests that remain).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…utput_scores, clearer error

5.8: Replace fragile double-get_args _V2_PROPERTY_TYPES derivation with an
explicit V2_PROPERTY_TYPES tuple defined alongside the V2EvalConfigProperties
union in eval.py, imported by base_eval.py and registry.py.

5.9: Cache output_scores in BaseV2EvalBridge.__init__ so build_binary_scores
and CodeEvalAdapter._validate_scores avoid per-item parent_eval() disk I/O.
Signature changed from (eval_config, passed) to (output_scores, passed);
all 6 deterministic adapters + code_eval + llm_judge updated.

5.6: Clarify the dataset_id/eval_input_id mutual-exclusivity error message
to explain V1 (TaskRun source) vs V2 (EvalInput source) distinction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce affordances (5.3)

In V2.0 no UI path populates reference_data on EvalInputs, so any
eval-type form offering a "Reference Data Key" source option would
produce configs that always silently skip at runtime. Add TODO(pre-ship
5.3) comments at the source selector in exact_match_form, set_check_form,
and contains_form. CI's no-TODO-on-main rule makes these a hard
pre-ship gate: either wire reference_data population or remove the
affordance before merging to main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…luation_data_type default (5.1) docs to code

Update two evals_v2 spec docs to match actual implementation:

- components/27 §2.2: document strict exact-key-match validation in
  _validate_scores (raises RuntimeError on missing/extra keys), replacing
  the old lenient spec rule. Add key-mismatch row to the error-cases
  table. Correct int-coercion behavior in the no-bool-convenience bullet.

- components/15 §4.1: change evaluation_data_type default from None to
  EvalDataType.final_answer, matching eval.py:793. Document the V1
  back-compat rationale (omitted field loads as true V1 behavior).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- 6.1: Extract shared StubV2Eval/SkippingStubV2Eval into new conftest.py;
  remove duplicate definitions from test_v2_dispatch_and_contract.py and
  test_eval_runner.py.
- 6.3: Add make_v2_eval_config/make_eval_task_input factory helpers to
  conftest.py; adopt across all 6 deterministic matcher test files,
  preserving each file's original default final_message.
- 6.5: Convert test_g_eval_raises_when_provider_lacks_logprobs from
  run_until_complete to @pytest.mark.asyncio / async def / await.
- 5.5: Add TestRunnerSkipReasonsAreValidEnumMembers — AST-inspects
  eval_runner.py to assert every skipped_reason= value (enum .value
  access or raw string literal) is a valid SkippedReason member.
- 6.12: Add provenance docstring to test_g_eval_data.py documenting
  what the serialized RunOutput fixture is, how it was generated, and
  how to regenerate it (no directory exists for a README).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add TestV1LegacyRunnerCoexistence with 3 regression guards:
- V1 g_eval dispatches through legacy_eval_adapter_from_type (not V2)
- EvalConfig from dict without config_type key defaults to g_eval and
  runs through the legacy path
- V1 g_eval config with properties["type"]="exact_match" is not
  misrouted to the V2 runner

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s (2.1)

Replace the open dict[str, bool] response of the two code_eval trust
endpoints with a named CodeEvalTrustResponse{trusted: bool} model, so the
OpenAPI/TS contract is a closed object. Regenerate the web_ui OpenAPI
schema (api_schema.d.ts) and update the frontend client return types.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- De-leak two single-resource 400 messages to a type-agnostic string
  (get_eval_progress, get_eval_config_score_summary); list endpoints unchanged.
- create_eval_config: return a clean 400 for invalid V2 properties
  (ValidationError handler ordered before ValueError) + test.
- Move trust-endpoint lazy imports (project_from_id, v2_eval_code_eval) to
  module top-level (no import cycle); retarget test patches accordingly.
- test_nothing_persisted: assert against the real data-model directory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scosman and others added 8 commits June 18, 2026 00:40
…ter timeout test (6.11)

Add a module-level asyncio.Lock in CodeEvalAdapter.evaluate() around the
run_in_executor call so only one code_eval subprocess runs at a time, with
waiting evals suspended as coroutines (not blocking executor threads for the
full timeout). Keep the narrow threading.Lock around p.start() for the
PyInstaller spawn race (#7410). Add a serialization test. Speed up the
sandbox_worker timeout test (sleep(60)/2s → sleep(10)/1s, same margin).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- code_eval timeout input: client-side min=1 max=300 (FormElement gains
  optional min/max props; existing usages unaffected).
- Trust dialog: remove all execution-method/reassurance language; keep a
  stark "runs arbitrary Python code on your machine" warning.
- five_star docstring: describe the raising behavior (was wrongly "clamped").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l-type forms (4.2/4.4)

Replace any/SvelteComponent<any> on the V2 eval-type form imperative API
with a shared EvalTypeFormApi interface ({ getProperties(): V2EvalConfigProperties;
validate?(): string | null }). Type the registry's createFormComponent, narrow
the create_eval_config call sites, and drop the eslint-disable suppressions.
One isolated, commented Svelte-4 bind:this cast remains (unavoidable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- formatEvalConfigName: collapse identical V2 compact/non-compact ternary.
- exact_match_form: remove dead "value_expression" source union member.
- Add extractV2Props<T> type-guard util; convert 8 V2 result components off
  repeated as-casts (now null-safe on type mismatch); + unit tests.
- tool_call_check_result: use generated ToolCallSpec instead of a local mirror.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Extend the with/without-eval_config render test to the other 7 V2 result
  renderers (null-props path renders scores, hides config detail).
- Parametrize the 8 v2_eval_api error-path tests with describe.each.
- registry.test.ts: derive the expected count from the types array.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(7.1)

The editor imports only the granular @codemirror/* packages (in
code_editor.svelte); the codemirror umbrella was imported nowhere in src.
Remove it and re-resolve the lockfile, dropping the umbrella plus its unused
transitive deps (@codemirror/lint, @codemirror/search). Editor still builds.

Completes the deep-cr-cleanup project (all 16 phases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Leave only the single "runs Python code on your machine" warning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/web_ui/src/lib/utils/eval_types/registry.ts`:
- Around line 206-218: The extractV2Props function lacks proper validation of
the config type itself and only checks if the properties.type field matches the
expected type. A legacy config with a type property could incorrectly pass
validation. Add a guard condition to check that eval_config.config_type === "v2"
alongside the existing property validation checks before performing the cast and
returning eval_config.properties as V2PropsMap[T].

In `@app/web_ui/src/lib/utils/form_element.svelte`:
- Around line 51-52: The exported `min` and `max` properties are not being
validated within the component's validation logic. Update the `validator`
function to include bounds checking that verifies numeric input values fall
within the specified minimum and maximum range when these bounds are set.
Additionally, ensure the `inline_error` mechanism properly reports validation
failures when values exceed the defined bounds. This validation should apply
consistently across all numeric input scenarios handled by the validator
function, ensuring that out-of-range values are rejected before they pass
validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f138eecf-1f19-45c5-aabc-e230a9f4150d

📥 Commits

Reviewing files that changed from the base of the PR and between 55ec5e8 and 89a69f2.

⛔ Files ignored due to path filters (1)
  • app/web_ui/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (95)
  • app/desktop/studio_server/eval_api.py
  • app/desktop/studio_server/test_eval_api.py
  • app/web_ui/package.json
  • app/web_ui/src/lib/api/v2_eval_api.test.ts
  • app/web_ui/src/lib/api/v2_eval_api.ts
  • app/web_ui/src/lib/api_schema.d.ts
  • app/web_ui/src/lib/components/eval_types/code_eval_form.svelte
  • app/web_ui/src/lib/components/eval_types/code_eval_result.svelte
  • app/web_ui/src/lib/components/eval_types/code_eval_result.test.ts
  • app/web_ui/src/lib/components/eval_types/contains_form.svelte
  • app/web_ui/src/lib/components/eval_types/contains_result.svelte
  • app/web_ui/src/lib/components/eval_types/contains_result.test.ts
  • app/web_ui/src/lib/components/eval_types/exact_match_form.svelte
  • app/web_ui/src/lib/components/eval_types/exact_match_result.svelte
  • app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte
  • app/web_ui/src/lib/components/eval_types/llm_judge_result.test.ts
  • app/web_ui/src/lib/components/eval_types/pattern_match_result.svelte
  • app/web_ui/src/lib/components/eval_types/pattern_match_result.test.ts
  • app/web_ui/src/lib/components/eval_types/set_check_form.svelte
  • app/web_ui/src/lib/components/eval_types/set_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/set_check_result.test.ts
  • app/web_ui/src/lib/components/eval_types/step_count_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
  • app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/tool_call_check_result.test.ts
  • app/web_ui/src/lib/utils/eval_types/registry.test.ts
  • app/web_ui/src/lib/utils/eval_types/registry.ts
  • app/web_ui/src/lib/utils/form_element.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
  • libs/core/kiln_ai/adapters/eval/__init__.py
  • libs/core/kiln_ai/adapters/eval/base_eval.py
  • libs/core/kiln_ai/adapters/eval/conftest.py
  • libs/core/kiln_ai/adapters/eval/eval_helpers.py
  • libs/core/kiln_ai/adapters/eval/eval_runner.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/test_scoring_utils.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py
  • libs/core/kiln_ai/adapters/eval/registry.py
  • libs/core/kiln_ai/adapters/eval/test_eval_runner.py
  • libs/core/kiln_ai/adapters/eval/test_g_eval_data.py
  • libs/core/kiln_ai/adapters/eval/test_phase4_data_model.py
  • libs/core/kiln_ai/adapters/eval/test_registry.py
  • libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py
  • libs/core/kiln_ai/adapters/eval/test_v2_contains.py
  • libs/core/kiln_ai/adapters/eval/test_v2_dispatch_and_contract.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_helpers.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/test_v2_exact_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
  • libs/core/kiln_ai/datamodel/eval.py
  • libs/core/kiln_ai/datamodel/test_eval_model.py
  • specs/projects/deep-cr-cleanup/implementation_plan.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_1.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_10.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_11.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_12.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_13.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_14.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_15.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_16.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_2.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_3.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_4.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_5.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_6.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_7.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_8.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_9.md
  • specs/projects/deep-cr-cleanup/project_overview.md
  • specs/projects/evals_v2/architecture.md
  • specs/projects/evals_v2/components/00_overview.md
  • specs/projects/evals_v2/components/15_v1_v2_coexistence.md
  • specs/projects/evals_v2/components/21_type_llm_judge.md
  • specs/projects/evals_v2/components/27_type_code_eval.md
  • specs/projects/evals_v2/components/29_rag_judge_templates.md
  • specs/projects/evals_v2/components/50_reference_data.md
  • specs/projects/evals_v2/components/80_extensibility_contract.md
  • specs/projects/evals_v2/components/90_open_risks.md
  • specs/projects/evals_v2/functional_spec.md
  • specs/projects/evals_v2/implementation_plan.md
  • specs/projects/evals_v2/phase_plans/phase_4.md
  • specs/projects/rag_templates/project_overview.md
💤 Files with no reviewable changes (3)
  • app/web_ui/package.json
  • libs/core/kiln_ai/adapters/eval/init.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/scoring_utils.py
✅ Files skipped from review due to trivial changes (17)
  • specs/projects/deep-cr-cleanup/phase_plans/phase_16.md
  • specs/projects/deep-cr-cleanup/implementation_plan.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_13.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_8.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_6.md
  • libs/core/kiln_ai/adapters/eval/test_g_eval_data.py
  • specs/projects/deep-cr-cleanup/phase_plans/phase_5.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_12.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_15.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_10.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_14.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_7.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_11.md
  • specs/projects/deep-cr-cleanup/phase_plans/phase_2.md
  • specs/projects/evals_v2/architecture.md
  • specs/projects/evals_v2/components/00_overview.md
  • app/web_ui/src/lib/api_schema.d.ts
🚧 Files skipped from review as they are similar to previous changes (46)
  • app/web_ui/src/lib/components/eval_types/code_eval_result.svelte
  • libs/core/kiln_ai/adapters/eval/v2_eval_exact_match.py
  • app/web_ui/src/lib/components/eval_types/tool_call_check_result.test.ts
  • app/web_ui/src/lib/components/eval_types/set_check_result.test.ts
  • libs/core/kiln_ai/adapters/eval/v2_eval_contains.py
  • app/web_ui/src/lib/components/eval_types/contains_result.svelte
  • app/web_ui/src/lib/components/eval_types/set_check_result.svelte
  • libs/core/kiln_ai/adapters/eval/eval_utils/v2_eval_helpers.py
  • app/web_ui/src/lib/components/eval_types/contains_form.svelte
  • app/web_ui/src/lib/components/eval_types/contains_result.test.ts
  • app/web_ui/src/lib/components/eval_types/step_count_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/step_count_check_result.test.ts
  • app/web_ui/src/lib/components/eval_types/tool_call_check_result.svelte
  • app/web_ui/src/lib/components/eval_types/llm_judge_result.test.ts
  • libs/core/kiln_ai/adapters/eval/v2_eval_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/eval_helpers.py
  • app/web_ui/src/lib/components/eval_types/set_check_form.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/lib/components/eval_types/pattern_match_result.svelte
  • app/web_ui/src/lib/components/eval_types/llm_judge_result.svelte
  • app/web_ui/src/lib/components/eval_types/code_eval_form.svelte
  • libs/core/kiln_ai/adapters/eval/v2_eval_code_eval.py
  • libs/core/kiln_ai/adapters/eval/test_v2_contains.py
  • libs/core/kiln_ai/adapters/eval/test_sandbox_worker.py
  • app/web_ui/src/lib/api/v2_eval_api.test.ts
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_code_eval.py
  • app/web_ui/src/lib/components/eval_types/exact_match_result.svelte
  • app/web_ui/src/lib/components/eval_types/code_eval_result.test.ts
  • libs/core/kiln_ai/adapters/eval/test_v2_pattern_match.py
  • libs/core/kiln_ai/adapters/eval/test_v2_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_step_count_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_eval_llm_judge.py
  • libs/core/kiln_ai/adapters/eval/v2_eval_tool_call_check.py
  • app/web_ui/src/lib/utils/eval_types/registry.test.ts
  • libs/core/kiln_ai/adapters/eval/v2_eval_set_check.py
  • app/web_ui/src/lib/components/eval_types/exact_match_form.svelte
  • app/web_ui/src/lib/components/eval_types/pattern_match_result.test.ts
  • libs/core/kiln_ai/adapters/eval/base_eval.py
  • libs/core/kiln_ai/adapters/eval/eval_utils/test_scoring_utils.py
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/[spec_id]/[eval_id]/create_eval_config/+page.svelte
  • libs/core/kiln_ai/adapters/eval/test_v2_tool_call_check.py
  • libs/core/kiln_ai/adapters/eval/test_v2_set_check.py
  • app/desktop/studio_server/eval_api.py
  • app/desktop/studio_server/test_eval_api.py
  • libs/core/kiln_ai/datamodel/eval.py
  • libs/core/kiln_ai/adapters/eval/eval_runner.py

Comment on lines +206 to +218
export function extractV2Props<T extends V2EvalType>(
eval_config: EvalConfig | null,
expectedType: T,
): V2PropsMap[T] | null {
if (
!eval_config?.properties ||
!("type" in eval_config.properties) ||
eval_config.properties.type !== expectedType
) {
return null
}
return eval_config.properties as V2PropsMap[T]
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard extractV2Props by config_type === "v2" before casting.

At Line 210, the helper only checks properties.type; it can accept a legacy config that happens to contain a type key and then cast it to V2 props incorrectly. Add a config_type check before returning.

Suggested fix
 export function extractV2Props<T extends V2EvalType>(
   eval_config: EvalConfig | null,
   expectedType: T,
 ): V2PropsMap[T] | null {
   if (
+    eval_config?.config_type !== "v2" ||
     !eval_config?.properties ||
     !("type" in eval_config.properties) ||
     eval_config.properties.type !== expectedType
   ) {
     return null
   }
   return eval_config.properties as V2PropsMap[T]
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/utils/eval_types/registry.ts` around lines 206 - 218, The
extractV2Props function lacks proper validation of the config type itself and
only checks if the properties.type field matches the expected type. A legacy
config with a type property could incorrectly pass validation. Add a guard
condition to check that eval_config.config_type === "v2" alongside the existing
property validation checks before performing the cast and returning
eval_config.properties as V2PropsMap[T].

Comment on lines +51 to +52
export let min: number | null = null
export let max: number | null = null

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

min/max are forwarded but not enforced by component validation.

Lines 51-52 and 261-262 add bounds, but validator/inline_error never check them. This creates a mismatch where out-of-range numeric values can still pass component validation.

Suggested fix
   export let validator: (value: unknown) => string | null = () => {
     if (inputType === "header_only") {
       return null
     }
@@
     if (max_length && typeof value === "string" && value.length > max_length) {
@@
     }
+    if (inputType === "input_number" && typeof value === "number") {
+      if (min !== null && value < min) {
+        return `"${label}" must be at least ${min}.`
+      }
+      if (max !== null && value > max) {
+        return `"${label}" must be at most ${max}.`
+      }
+    }
     return null
   }
@@
     } else if (
       max_length &&
       typeof value === "string" &&
       value.length > max_length
     ) {
       inline_error = "" + value.length + "/" + max_length
+    } else if (inputType === "input_number" && typeof value === "number") {
+      if (min !== null && value < min) {
+        inline_error = `Min ${min}`
+      } else if (max !== null && value > max) {
+        inline_error = `Max ${max}`
+      } else {
+        inline_error = null
+      }
     } else {
       inline_error = null
     }
   }

Also applies to: 65-82, 86-101, 261-262

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/utils/form_element.svelte` around lines 51 - 52, The
exported `min` and `max` properties are not being validated within the
component's validation logic. Update the `validator` function to include bounds
checking that verifies numeric input values fall within the specified minimum
and maximum range when these bounds are set. Additionally, ensure the
`inline_error` mechanism properly reports validation failures when values exceed
the defined bounds. This validation should apply consistently across all numeric
input scenarios handled by the validator function, ensuring that out-of-range
values are rejected before they pass validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant