This repository provides an autonomous agent framework for generating and repairing Coq proofs using a multi-stage planning and execution pipeline. The system integrates a variety of components, from HTTP clients interfacing with Coq servers to language-model-driven planners and executors.
This code is not runnable as-is because its dependencies are under NDA.
```
├── agent.py                # Core agent logic and state machine
├── coq_project_client.py   # HTTP client for Coq project operations
├── mcp_client.py           # JSON-RPC client for MCP (Coq) server
├── tools.py                # Dynamic tool provider and tool wrappers
├── planning/               # Proof planning strategies
│   ├── mad.py              # Multi-Agent Debate (MAD) planner
│   └── simple.py           # Single-shot simple planner
└── README.md               # (This file) Overview and documentation
```
Purpose (agent.py):
Serves as the orchestrator for end-to-end proof generation and repair by modeling the entire workflow as a finite state machine (FSM) built on langgraph.StateGraph. It ties together session management, strategy planning, execution, error handling, and iterative refinement—all driven by language models and Coq-tool interactions.
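The FSM-driven workflow can be sketched without langgraph as a plain node table; the node names mirror the stages described in this README, but the transitions and state keys are illustrative assumptions:

```python
# Hand-rolled sketch of the agent's top-level FSM; the real implementation
# builds this as a langgraph.StateGraph. Node bodies are stubs.

def init(state):
    # Seed the state with candidate plans (stubbed here).
    state["plans"] = ["induction on n", "apply lemma_foo"]
    return "score_plans"

def score_plans(state):
    # Keep plans as-is; the real node ranks them with an LLM.
    state["plans_to_try"] = state["plans"]
    return "plan_loop"

def plan_loop(state):
    # Pretend the first strategy closes the proof.
    state["is_proof_finished"] = True
    return None  # terminal node

NODES = {"init": init, "score_plans": score_plans, "plan_loop": plan_loop}

def run(state, entry="init"):
    node = entry
    while node is not None:
        node = NODES[node](state)
    return state

print(run({})["is_proof_finished"])  # → True
```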
Key Components & Workflow:
- Initialization (`init`):
  - Parses the incoming user message to extract `file_path` and `theorem_name`.
  - Starts a Coq session via `CoqProjectClient.start_session`, storing `session_id` and `proof_version_hash`.
  - Instantiates `McpHttpClient`, initializes the MCP session, and dynamically retrieves tool definitions via `McpCoqToolProvider`.
  - Builds and names tool wrappers (`McpCoqTool`) for methods like `check_proof`, `get_premises`, etc.
  - Creates LLM clients (`ChatGrazie`) for the executor, critic, plan_ranker, planner, summarizer, and failure_summarizer, each configured from `self.config`.
  - Generates an initial set of candidate strategies (plans) by invoking either `simple_plan_generation` (single-shot) or `multi_agent_proof_debate` (MAD), based on `planning_config.mode`.
  - Seeds the FSM state with:
    - `messages`: the system prompt and user message
    - `plans`, `plan_scores`, `plans_to_try`, `current_plan_index`
    - Session identifiers, counters (`tool_calling_iterations`, `failed_proof_checks`), and an empty `summary`.
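The first initialization step (extracting `file_path` and `theorem_name` from the user message) might look like the sketch below; the message format and the `parse_request` helper name are assumptions, since the README does not specify them:

```python
import re

# Hypothetical request format: "prove <theorem_name> in <file_path>".
def parse_request(message: str) -> dict:
    m = re.match(r"prove\s+(\S+)\s+in\s+(\S+)", message)
    if m is None:
        raise ValueError(f"unrecognized request: {message!r}")
    return {"theorem_name": m.group(1), "file_path": m.group(2)}

print(parse_request("prove add_comm in theories/Arith.v"))
# → {'theorem_name': 'add_comm', 'file_path': 'theories/Arith.v'}
```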
- Plan Scoring (`score_plans`):
  - When more than one plan exists, iterates over each strategy string:
    - Prompts the plan_ranker model to assign a JSON score (1–10).
    - Parses the output, defaulting to 5 on parse failure.
  - Sorts the plans in descending order of score and retains the top `best_plan_samples_number` in `plans_to_try`.
  - Logs detailed `plan_scores` for later analysis.
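The scoring-and-ranking step above can be sketched as follows; `parse_score` and `rank_plans` are hypothetical helper names, and the `{"score": n}` reply shape is an assumption about the plan_ranker's output:

```python
import json

def parse_score(raw: str, default: int = 5) -> int:
    """Parse a {"score": n} JSON reply; fall back to the default on failure."""
    try:
        score = int(json.loads(raw)["score"])
        return score if 1 <= score <= 10 else default
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return default

def rank_plans(plans, raw_scores, best_plan_samples_number=2):
    """Sort plans by score, descending, and keep the top-k to try."""
    scored = sorted(zip(plans, (parse_score(r) for r in raw_scores)),
                    key=lambda p: p[1], reverse=True)
    return [plan for plan, _ in scored[:best_plan_samples_number]]

print(rank_plans(["A", "B", "C"], ['{"score": 3}', "oops", '{"score": 9}']))
# → ['C', 'B']   ("oops" fails to parse and gets the default score of 5)
```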
- Plan Loop (`plan_loop`):
  - Retrieves the initial proof goals via the `check_proof` tool to populate `current_goals`.
  - Iterates through `plans_to_try` up to the configured maximum:
    - Sets `current_plan_index` and logs the selected strategy.
    - Calls `execute_single_plan(plan, summary, state)` to attempt the proof.
    - On success, marks `is_proof_finished = True`, saves `finished_proof`, and exits the loop.
    - On failure, invokes `summarize_plan_failure` to produce concise bullet points, resets the tool counters, and continues.
  - If all strategies fail, exits with `is_proof_finished = False`.
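A minimal sketch of this loop, with the executor and failure summarizer stubbed out as plain two-argument callables (the real ones are LLM-driven and also receive the FSM state):

```python
def plan_loop(plans_to_try, execute_single_plan, summarize_plan_failure):
    """Try each candidate strategy in order until one closes the proof."""
    summary = ""
    for i, plan in enumerate(plans_to_try):
        ok, proof = execute_single_plan(plan, summary)
        if ok:
            return {"is_proof_finished": True, "finished_proof": proof,
                    "current_plan_index": i}
        # Carry a short failure summary into the next attempt.
        summary = summarize_plan_failure(plan, proof)
    return {"is_proof_finished": False}

# Stub executor: only the second plan succeeds.
result = plan_loop(
    ["plan-1", "plan-2"],
    lambda p, s: (p == "plan-2", "Qed." if p == "plan-2" else "error"),
    lambda p, err: f"- {p} failed: {err}",
)
print(result["is_proof_finished"], result["current_plan_index"])  # → True 1
```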
- Executor Subgraph (`build_executor_subgraph`):
  - Defines sub-FSM nodes:
    - `call_llm`: sends the messages plus summary to the executor LLM and appends the reply.
    - `call_tool`: extracts the first `tool_call` from the last AI message, invokes it via `ToolNode`, and merges the responses.
    - `critique`: after N consecutive `check_proof` failures, runs the critic to diagnose deviations.
    - `fetch_similar`: queries `get_premises` + `get_theorem` to assemble similar theorems for inspiration.
    - `replan`: refines the current strategy using the replanner and prior criticism.
    - `summarize`: condenses long histories with the proof_progress_summarizer when the message length exceeds `MAX_RAW`.
    - `early_stopping`: halts when the proof is complete or the maximum number of tool calls is reached.
  - Uses the `should_call_tool` and `should_continue` predicates to route between nodes.
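The two routing predicates might be sketched as below; the state keys and threshold names are assumptions inferred from the description above:

```python
def should_call_tool(state) -> str:
    """Route to call_tool when the last AI message carries a tool call."""
    last = state["messages"][-1]
    return "call_tool" if last.get("tool_calls") else "should_continue"

def should_continue(state) -> str:
    """Pick the next executor node from counters in the state."""
    if state.get("is_proof_finished"):
        return "early_stopping"
    if state["tool_calling_iterations"] >= state["max_tool_calls"]:
        return "early_stopping"
    if state["failed_proof_checks"] >= state["critique_threshold"]:
        return "critique"
    return "call_llm"

state = {"messages": [{"tool_calls": []}], "is_proof_finished": False,
         "tool_calling_iterations": 3, "max_tool_calls": 20,
         "failed_proof_checks": 4, "critique_threshold": 3}
print(should_continue(state))  # → critique
```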
- Tool Interaction & Error Handling:
  - Tools (e.g., `check_proof`) are wrapped in `McpCoqTool`, which auto-injects `coqSessionId` and `proofVersionHash`.
  - `format_check_proof_response`:
    - Parses the JSON from `check_proof`, updating `current_goals`, `failed_proof_checks`, and `proof_version_hash`.
    - Distinguishes between syntax/runtime errors, incomplete-but-valid proofs, and completion.
  - Tracks `failed_proof_checks` to trigger critique once the threshold is breached.
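A sketch of `format_check_proof_response` under an assumed reply schema; the `goals`, `error`, and `proof_version_hash` keys are illustrative, not the server's actual contract:

```python
import json

def format_check_proof_response(raw: str, state: dict) -> str:
    """Classify a check_proof reply and update the mutable agent state."""
    reply = json.loads(raw)
    if "proof_version_hash" in reply:
        state["proof_version_hash"] = reply["proof_version_hash"]
    if reply.get("error"):
        # Syntax or runtime error: count it toward the critique threshold.
        state["failed_proof_checks"] += 1
        return f"error: {reply['error']}"
    goals = reply.get("goals", [])
    state["current_goals"] = goals
    if not goals:
        return "complete"
    return f"incomplete: {len(goals)} goal(s) remain"

state = {"failed_proof_checks": 0}
print(format_check_proof_response('{"goals": [], "proof_version_hash": "abc"}', state))
# → complete
```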
- Critique & Replanning:
  - The critic LLM (`critic`) is invoked when repeated proof checks fail, producing improvement suggestions without calling tools.
  - Gathers insights on similar proven theorems and feeds them to the replanner LLM, which outputs a refined strategy string.
  - Integrates the refined plan back into the subgraph for continued execution.
- Summarization:
  - Monitors the raw message length (`MAX_RAW`); when exceeded, extracts the tail (`TAIL_SIZE`) to preserve recent context.
  - The proof_progress_summarizer collapses earlier messages into a 4–5 bullet summary, reducing memory use while retaining key facts.
Purpose (coq_project_client.py): Synchronous HTTP client for interacting with a Coq project server.
Highlights:
- Methods to start sessions, fetch theorem statements, validate proofs (`check_proof`), and retrieve premises or full proofs.
- Uses `requests` to perform REST calls and returns JSON responses.
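A minimal sketch of such a client; the endpoint path and payload shape are assumptions, and the session object is injectable so the example runs without a live server:

```python
import requests

class CoqProjectClient:
    """Sketch of the synchronous client; endpoint layout is hypothetical."""

    def __init__(self, base_url: str, session=None):
        self.base_url = base_url.rstrip("/")
        self.session = session or requests.Session()

    def check_proof(self, session_id: str, proof: str) -> dict:
        # POST the candidate proof and return the server's JSON verdict.
        resp = self.session.post(
            f"{self.base_url}/sessions/{session_id}/check_proof",
            json={"proof": proof},
        )
        resp.raise_for_status()
        return resp.json()
```

Injecting the session also makes the client easy to unit-test with a stub that records the request instead of sending it.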
Purpose (mcp_client.py): Asynchronous JSON-RPC client for the MCP Coq server, leveraging SSE for streaming responses.
Highlights:
- `initialize()`: performs the `initialize` JSON-RPC handshake and captures the `mcp-session-id`.
- `list_tools()`: streams available tactics/tools via Server-Sent Events.
- `call_tool()`: invokes named MCP tools, updates the proof-version hash, and returns textual output.
- Custom error handling with `MCPServerError`.
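The JSON-RPC 2.0 framing behind the handshake might look like this; the `initialize` params shown are illustrative, not the actual MCP schema:

```python
import itertools
import json

_ids = itertools.count(1)  # monotonically increasing request ids

def jsonrpc_request(method: str, params: dict) -> str:
    """Frame a JSON-RPC 2.0 request of the kind sent to the MCP server."""
    return json.dumps({"jsonrpc": "2.0", "id": next(_ids),
                       "method": method, "params": params})

# Hypothetical initialize call; real params follow the MCP spec.
msg = jsonrpc_request("initialize", {"clientInfo": {"name": "coq-agent"}})
print(msg)
```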
Purpose (tools.py): Adapts MCP JSON-RPC tools into LangChain-compatible `BaseTool` instances and provides them dynamically.
Key Classes:
- `McpCoqTool`: wraps a single MCP tool, injecting session and proof-version context.
- `McpCoqToolProvider`: discovers available MCP tools at runtime and constructs `McpCoqTool` objects.
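A simplified, LangChain-free stand-in showing the context-injection idea; the constructor signature and `run` method are assumptions, not the real class's API:

```python
class McpCoqTool:
    """Stand-in for the wrapper: injects the Coq session id and
    proof-version hash into every underlying MCP tool call."""

    def __init__(self, name, call_tool, session_id, proof_version_hash):
        self.name = name
        self._call_tool = call_tool       # async in the real client
        self.session_id = session_id
        self.proof_version_hash = proof_version_hash

    def run(self, **kwargs):
        # Auto-inject session context unless the caller overrides it.
        kwargs.setdefault("coqSessionId", self.session_id)
        kwargs.setdefault("proofVersionHash", self.proof_version_hash)
        return self._call_tool(self.name, kwargs)

tool = McpCoqTool("check_proof", lambda name, args: (name, args), "s1", "h1")
print(tool.run(proof="Qed."))
# → ('check_proof', {'proof': 'Qed.', 'coqSessionId': 's1', 'proofVersionHash': 'h1'})
```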
Purpose (planning/mad.py): Implements a Multi-Agent Debate (MAD) proof planner, in which two AI debaters (A and B) iteratively propose and critique strategies, followed by a judge that selects the best plan.
Flow:
- Initialization: Record theorem and available tools.
- Debate Rounds: Alternate between Debater A (proposes) and Debater B (critiques/improves) for n rounds.
- Judgment: A judge LLM decides the winner and consolidates a final proof strategy.
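The debate flow above can be sketched with the three LLM roles stubbed out as plain callables; the transcript handling and judge signature are assumptions:

```python
def multi_agent_proof_debate(theorem, debater_a, debater_b, judge, rounds=2):
    """Alternate proposal (A) and critique (B), then let the judge decide."""
    transcript = [f"Theorem: {theorem}"]
    for _ in range(rounds):
        transcript.append(debater_a(transcript))   # propose / defend
        transcript.append(debater_b(transcript))   # critique / improve
    return judge(transcript)

# Stub LLMs that just report their turn; the real ones are ChatGrazie models.
plan = multi_agent_proof_debate(
    "add_comm",
    lambda t: f"A proposes (turn {len(t)})",
    lambda t: f"B critiques (turn {len(t)})",
    lambda t: t[-1],  # trivial judge: keep the last improvement
)
print(plan)  # → B critiques (turn 4)
```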
Purpose (planning/simple.py): Provides a straightforward, single-shot plan-generation strategy using a single Grazie chat model.
Flow:
- Lists available tools.
- Sends a prompt asking the model to produce a numbered step-by-step proof outline.
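The prompt construction for this single-shot planner might look like the sketch below; the exact wording and the helper signature are assumptions:

```python
def simple_plan_generation(theorem: str, tools: list[str]) -> str:
    """Build the single-shot planning prompt sent to the chat model."""
    tool_list = "\n".join(f"- {t}" for t in tools)
    return (
        f"You are planning a Coq proof of `{theorem}`.\n"
        f"Available tools:\n{tool_list}\n"
        "Produce a numbered, step-by-step proof outline."
    )

prompt = simple_plan_generation("add_comm", ["check_proof", "get_premises"])
print(prompt.splitlines()[0])  # → You are planning a Coq proof of `add_comm`.
```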