This repo is organized around a three-stage single-LoRA competition pipeline:
- harness-aligned chat_thinking prompts
- local competition-proxy evaluation with competition_correct
- staged fine-tuning: stage1 format align -> stage2 distill -> stage3 repair
Run these in order before packaging a submission:
python scripts/probe_chat_template.py
python scripts/inspect_target_modules.py --config configs/train_stage1_format.yaml
bash scripts/train_stage1_format_align.sh
bash scripts/train_stage2_distill.sh
bash scripts/train_stage3_repair.sh
python scripts/validate_submission.py \
--config configs/train_stage3_repair.yaml \
--adapter-dir artifacts/adapter_stage3_repair \
--smoke-input data/processed/official_train_tagged.jsonl \
--labels data/processed/official_train_tagged.jsonl \
--splits data/splits/official/splits.json \
--package-output submission.zip
Prepare the canonical official dataset and maintained splits:
python scripts/prepare_data.py --config configs/data_official.yaml
This writes:
- data/processed/official_train_tagged.jsonl
- data/splits/official/splits.json
Generate hard-triad synthetic data for stage2:
python -m src.teacher.synth_generator --config configs/synth_hard_triads.yaml
Stage1 build:
python -m src.student.sft_dataset_builder \
--input data/processed/official_train_tagged.jsonl \
--output data/processed/stage1_format_align_train.jsonl \
--selection-profile stage1 \
--prompt-mode chat_thinking \
--tokenizer-path artifacts/_tokenizer_cache/metric_nemotron-3-nano-30b-a3b-bf16_transformers_default \
--split-file data/splits/official/splits.json \
--split-name rule_novelty_all \
--split-role train \
--exclude-split-file data/splits/official/splits.json \
--exclude-split-name hard_triad_rule_novelty \
--exclude-split-role valid
Because rule_novelty_all and hard_triad_rule_novelty are built
independently (different seeds, different subsets), stage1 train explicitly
excludes hard_triad_rule_novelty/valid so later hard-triad validation stays
unseen across the full staged pipeline (stage1 -> stage2 init_adapter_dir=stage1 -> stage3 init_adapter_dir=stage2).
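The exclusion described above amounts to a set difference over sample ids. The sketch below is illustrative only: it assumes splits.json maps split name -> role -> list of ids, which is a guess about the file layout, and select_ids is a hypothetical helper rather than the builder's actual code.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed layout (hypothetical): {split_name: {role: [sample_id, ...]}}
Splits = Dict[str, Dict[str, List[str]]]


def select_ids(
    splits: Splits,
    include: Tuple[str, str],
    exclude: Optional[Tuple[str, str]] = None,
) -> Set[str]:
    """Return ids in the include (split, role) minus ids in the exclude (split, role)."""
    include_name, include_role = include
    selected = set(splits[include_name][include_role])
    if exclude is not None:
        exclude_name, exclude_role = exclude
        selected -= set(splits[exclude_name][exclude_role])
    return selected


# Toy data: "c" sits in rule_novelty_all/train and in hard_triad_rule_novelty/valid,
# which can happen when the two splits are drawn independently.
toy_splits: Splits = {
    "rule_novelty_all": {"train": ["a", "b", "c"], "valid": ["d"]},
    "hard_triad_rule_novelty": {"train": ["a"], "valid": ["c"]},
}

stage1_ids = select_ids(
    toy_splits,
    include=("rule_novelty_all", "train"),
    exclude=("hard_triad_rule_novelty", "valid"),
)
print(sorted(stage1_ids))  # ['a', 'b']
```

Since stage2 and stage3 start from the previous stage's adapter, an id that leaks into stage1 training is effectively seen by every later stage, which is why the exclusion is applied at the first stage.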
Stage2 build with family balancing and stronger teacher search:
python -m src.student.sft_dataset_builder \
--input data/processed/official_train_tagged.jsonl,data/synthetic/synth_hard_triads.jsonl \
--output data/processed/stage2_distill_train.jsonl \
--selection-profile stage2 \
--prompt-mode chat_thinking \
--tokenizer-path artifacts/_tokenizer_cache/metric_nemotron-3-nano-30b-a3b-bf16_transformers_default \
--completion-style token_trace \
--beam-width 10 \
--max-depth 3 \
--top-k 3 \
--balance-by-family \
--hard-triad-repeat-factor 2 \
--max-per-signature-bucket 64 \
--report-output data/processed/stage2_distill_report.json \
--split-file data/splits/official/splits.json \
--split-name rule_novelty_all \
--split-role train
Stage3 build from stage2 model failures plus all-family replay:
python -m src.student.sft_dataset_builder \
--input data/processed/official_train_tagged.jsonl \
--output data/processed/stage3_repair_train.jsonl \
--selection-profile stage3 \
--prompt-mode chat_thinking \
--tokenizer-path artifacts/_tokenizer_cache/metric_nemotron-3-nano-30b-a3b-bf16_transformers_default \
--completion-style short_trace \
--beam-width 8 \
--max-depth 2 \
--top-k 2 \
--repair-artifact data/processed/stage2_model_failures_train.json \
--replay-input data/processed/stage2_model_successes_all_train.json \
--replay-ratio 0.25 \
--report-output data/processed/stage3_repair_train_report.json
Stage3 failure / success buckets are produced by scripts/train_stage3_repair.sh:
- repair failures come from hard_triad_rule_novelty/train: the stage2 adapter predictions that missed on the hard triad, the only samples we want to correct
- replay successes come from rule_novelty_all/train: the stage2 adapter predictions that were correct across all six families, so replay can keep easy-triad families anchored and avoid catastrophic forgetting
- the builder receives data/processed/official_train_tagged.jsonl as input so build_repair_set can materialise replay records for ids that live outside the hard-triad subset
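One way to picture how the two buckets combine is sketched below. This is not the real build_repair_set: mix_repair_and_replay is a hypothetical name, and the sketch assumes --replay-ratio means "replay records per repair record", which the builder may define differently (for example as a fraction of the final set).

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def mix_repair_and_replay(
    official: Dict[str, Record],
    failure_ids: List[str],
    success_ids: List[str],
    replay_ratio: float = 0.25,
    seed: int = 0,
) -> List[Record]:
    """Keep every failure record and add a sampled slice of success records."""
    # Records are looked up in the full official set, so replay ids may live
    # outside the hard-triad subset.
    failures = [official[i] for i in failure_ids if i in official]
    failure_set = set(failure_ids)
    replay_pool = sorted(
        i for i in set(success_ids) - failure_set if i in official
    )
    # Assumption: ratio is relative to the number of repair records.
    n_replay = min(len(replay_pool), round(replay_ratio * len(failures)))
    rng = random.Random(seed)  # fixed seed keeps the build reproducible
    replay = [official[i] for i in rng.sample(replay_pool, n_replay)]
    return failures + replay


# Toy data: 13 official records, 8 failures, 7 successes (2 overlap with failures).
official = {f"id{n}": {"id": f"id{n}"} for n in range(13)}
failure_ids = [f"id{n}" for n in range(8)]
success_ids = [f"id{n}" for n in range(6, 13)]

mixed = mix_repair_and_replay(official, failure_ids, success_ids)
print(len(mixed))  # 10: 8 repair records plus 2 replay records
```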
- canonical train configs are:
  - configs/train_stage1_format.yaml
  - configs/train_stage2_selected_trace.yaml
  - configs/train_stage3_repair.yaml
- stage2 and stage3 continue from the previous stage adapter via training.init_adapter_dir
- lora_train.py supports --force-train; canonical stage scripts pass it explicitly
- target_modules are hard-validated before training starts
- max_seq_length: auto uses tokenizer-aware BPE accounting when a tokenizer path is configured
- stage2 / stage3 local inference defaults to max_new_tokens: 2048 to avoid truncating chat_thinking generations
- max_depth=3 is intentionally more expensive for stage2 selection; expect noticeably higher CPU time than max_depth=2
- stage2_distill_valid.jsonl is the SFT loss monitor (teacher-solvable subset); it is not the hard-triad headline metric. Trust the proxy eval artifacts below instead.
- after stage2 and stage3 training the canonical scripts run a per-stage bestproxy selector (scripts/select_best_proxy_checkpoint.py). It iterates over every checkpoint-* plus the final adapter, scores each against hard-triad and all-family proxies, and writes the winner to artifacts/adapter_stage{2,3}_bestproxy/. The canonical artifact pairs consumed by select_final_adapter.py are:
  - hard-triad proxy: data/processed/stage{2,3}_bestproxy_hard_eval.json
  - all-family proxy (leak-free, rule_novelty_all/valid minus hard_triad_rule_novelty/train): data/processed/stage{2,3}_bestproxy_all_eval.json
  The stage-root proxy eval artifacts still exist (stage{2,3}_proxy_valid_eval.json / stage{2,3}_proxy_all_valid_eval.json) as monitoring signals for comparing "final step" vs "bestproxy", but the submission path reads the bestproxy pair.
- final adapter selection is automated by scripts/select_final_adapter.py: it compares the two bestproxy pairs at half-sample tolerance (primary: all-family; tie-break: hard-triad; default on complete tie: stage2) and copies the winner into artifacts/adapter_final_selected/. It also writes data/processed/final_adapter_selection.json with the full decision trace. Package submission.zip from artifacts/adapter_final_selected/, not from the per-stage adapter directories.
- stage2 enables a silver hard-triad official pool by default (--stage2-enable-silver-official): official hard-triad (bit/cipher/equation) samples that fail the strict gate but satisfy weaker thresholds (teacher_confidence >= 0.65, support_coverage >= 0.67) are admitted to the train set with trace_style=answer_only, capped at min(0.25 * (strict + synth), 800) and sampled equation -> cipher -> bit. The stage2 build report now carries selection_counts and official_rejection_diagnostics so the next iteration can decide whether to tune the silver thresholds.
- stage2 train also runs a hard-triad rescue search (--stage2-second-pass-hard-triad, default opt-in only at the shell layer). For every rejected official sample whose family is in --stage2-rescue-families (default equation, the family most consistently missing program_signature in diagnostics), a second chain-search pass is run with beam=12 / depth=4 / top-k=3. The new annotation is promoted only if the quality tuple (solver_verifiable, support_coverage, has_signature, teacher_confidence, top1_top2_margin) strictly improves; otherwise the first-pass annotation is restored. Rescue cannot degrade any candidate. stage2_distill_report.json now carries rescue_diagnostics.{rescue_attempted, rescue_promoted, rescue_families, rescue_settings} so you can tell how many rejected equation samples were pulled back into strict-quality annotation.
- configs/synth_hard_triads.yaml (and the legacy configs/synth.yaml) set hard_negative_ratio: 0.0 because the hard_negative / negative_answer metadata fields have no downstream consumer in the current training pipeline. Re-enable only when a real consumer (loss term, filter, etc.) lands.
- stage3 can skip itself when stage2 produced zero hard-triad train failures. In that case scripts/train_stage3_repair.sh copies the stage2 bestproxy adapter to artifacts/adapter_stage3_repair/ and artifacts/adapter_stage3_bestproxy/, reuses the stage2 bestproxy eval JSONs, and writes stage3_skipped.json next to the weights so downstream packaging / validation does not need to branch. See data/processed/stage3_decision.json and data/processed/stage3_best_checkpoint_selection.json for the gate outcome.
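Two of the rules in the notes above are compact enough to sketch: the silver pool cap and the rescue promotion check. Both functions are hypothetical illustrations, not the repo's code. The sketch assumes the cap is floored to an integer and that "strictly improves" means lexicographic comparison of the quality tuple in the listed field order; the real builder may round or compare differently.

```python
from typing import NamedTuple


def silver_cap(strict: int, synth: int, ratio: float = 0.25, hard_cap: int = 800) -> int:
    """Upper bound on silver samples: min(ratio * (strict + synth), hard_cap)."""
    # Assumption: fractional caps are floored.
    return min(int(ratio * (strict + synth)), hard_cap)


class Quality(NamedTuple):
    """Quality tuple in the order given in the notes above."""
    solver_verifiable: bool
    support_coverage: float
    has_signature: bool
    teacher_confidence: float
    top1_top2_margin: float


def promote_if_better(first_pass: Quality, second_pass: Quality) -> Quality:
    """Keep the second-pass annotation only when it strictly beats the first."""
    # Assumption: tuples compare lexicographically, so earlier fields dominate.
    return second_pass if second_pass > first_pass else first_pass


print(silver_cap(1000, 600))   # 400, the ratio term is the binding limit
print(silver_cap(3000, 1000))  # 800, the hard cap is the binding limit

first = Quality(True, 0.67, False, 0.70, 0.10)
second = Quality(True, 0.67, True, 0.60, 0.05)
# has_signature improves and is compared before teacher_confidence,
# so under the lexicographic assumption the second pass is promoted.
print(promote_if_better(first, second) is second)  # True
```

Because the first-pass annotation is returned on any tie or regression, a rescue pass built this way can only leave a candidate unchanged or improve it.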
H100 smoke path:
bash scripts/train_stage1_smoke.sh
Legacy local smoke entry now forwards to the canonical stage1 smoke script:
bash scripts/train_smoke_local.sh
Canonical training order (run top-to-bottom on an H100 box):
python scripts/probe_chat_template.py
python scripts/inspect_target_modules.py --config configs/train_stage1_format.yaml
bash scripts/train_stage1_format_align.sh
bash scripts/train_stage2_distill.sh
bash scripts/train_stage3_repair.sh
python scripts/select_final_adapter.py \
--stage2-hard-eval data/processed/stage2_bestproxy_hard_eval.json \
--stage2-all-eval data/processed/stage2_bestproxy_all_eval.json \
--stage2-adapter-dir artifacts/adapter_stage2_bestproxy \
--stage3-hard-eval data/processed/stage3_bestproxy_hard_eval.json \
--stage3-all-eval data/processed/stage3_bestproxy_all_eval.json \
--stage3-adapter-dir artifacts/adapter_stage3_bestproxy \
--output-adapter-dir artifacts/adapter_final_selected \
--output-json data/processed/final_adapter_selection.json
python scripts/validate_submission.py \
--config configs/train_stage3_repair.yaml \
--adapter-dir artifacts/adapter_final_selected \
--smoke-input data/processed/official_train_tagged.jsonl \
--labels data/processed/official_train_tagged.jsonl \
--splits data/splits/official/splits.json \
--max-new-tokens 2048 \
--package-output submission.zip
Inside each stage, scripts/select_best_proxy_checkpoint.py iterates over
every checkpoint-* directory plus the final adapter, runs the hard-triad
and all-family proxy evals, and materialises the winner at
artifacts/adapter_stage{2,3}_bestproxy. The selector uses the shared
rule from src/student/proxy_selection.py (all-family primary,
hard-triad tie-break, prefer the final checkpoint on complete tie) so the
same comparison logic is used at both the checkpoint level and the
stage2-vs-stage3 level. Expect roughly 1.5-2.5h extra H100 time per stage
for the selector passes; budget 3-5h total on top of the 24-36h training
run.
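The shared comparison rule can be sketched as follows. This is a minimal illustration, not the contents of src/student/proxy_selection.py: ProxyScore, compare_with_tolerance, and pick_adapter are hypothetical names, and "half-sample tolerance" is assumed to mean that two accuracies tie when they differ by less than 0.5 / n.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyScore:
    name: str
    all_correct: int
    all_total: int
    hard_correct: int
    hard_total: int


def compare_with_tolerance(correct_a: int, total_a: int, correct_b: int, total_b: int) -> int:
    """Return 1 if a wins, -1 if b wins, 0 if within half a sample."""
    diff = correct_a / total_a - correct_b / total_b
    tolerance = 0.5 / max(total_a, total_b)  # assumed meaning of "half-sample"
    if abs(diff) < tolerance:
        return 0
    return 1 if diff > 0 else -1


def pick_adapter(a: ProxyScore, b: ProxyScore, prefer_on_tie: ProxyScore) -> ProxyScore:
    """All-family proxy first, hard-triad as tie-break, then the stated default."""
    criteria = (
        (a.all_correct, a.all_total, b.all_correct, b.all_total),
        (a.hard_correct, a.hard_total, b.hard_correct, b.hard_total),
    )
    for correct_a, total_a, correct_b, total_b in criteria:
        outcome = compare_with_tolerance(correct_a, total_a, correct_b, total_b)
        if outcome != 0:
            return a if outcome > 0 else b
    return prefer_on_tie


stage2 = ProxyScore("stage2", all_correct=180, all_total=200, hard_correct=40, hard_total=60)
stage3 = ProxyScore("stage3", all_correct=180, all_total=200, hard_correct=43, hard_total=60)

# All-family ties, so the hard-triad proxy decides.
print(pick_adapter(stage2, stage3, prefer_on_tie=stage2).name)  # stage3
```

At the checkpoint level prefer_on_tie would be the final adapter; at the stage2-vs-stage3 level it would be stage2, matching the defaults described above.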
The validator hard-fails when:
- adapter_config.json is missing or its rank cannot be parsed
- the parsed rank exceeds 32
- --labels is passed without --smoke-input, or --package-output is passed without both --smoke-input and --labels
- --package-output runs before a successful local_eval
--max-new-tokens is optional; it overrides the inference token budget from
the config (defaults to 2048 for chat_thinking in stage2 / stage3 configs).
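The hard-fail conditions listed above can be pictured as a small precondition check. The sketch below is not scripts/validate_submission.py: SubmissionError, read_adapter_rank, and check_preconditions are hypothetical names, the rank is assumed to live under the "r" key of adapter_config.json (the key PEFT uses for LoRA rank), and the local_eval outcome is stood in for by a boolean flag.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

MAX_RANK = 32


class SubmissionError(Exception):
    """Raised when a submission precondition is violated."""


def read_adapter_rank(adapter_dir: Path) -> int:
    config_path = adapter_dir / "adapter_config.json"
    if not config_path.is_file():
        raise SubmissionError("adapter_config.json is missing")
    try:
        # Assumption: LoRA rank is stored under "r".
        return int(json.loads(config_path.read_text())["r"])
    except (KeyError, TypeError, ValueError) as exc:
        raise SubmissionError("adapter rank cannot be parsed") from exc


def check_preconditions(
    adapter_dir: Path,
    smoke_input: Optional[str] = None,
    labels: Optional[str] = None,
    package_output: Optional[str] = None,
    local_eval_passed: bool = False,
) -> int:
    """Return the adapter rank, or raise SubmissionError on any violation."""
    rank = read_adapter_rank(adapter_dir)
    if rank > MAX_RANK:
        raise SubmissionError(f"rank {rank} exceeds {MAX_RANK}")
    if labels and not smoke_input:
        raise SubmissionError("--labels requires --smoke-input")
    if package_output and not (smoke_input and labels):
        raise SubmissionError("--package-output requires --smoke-input and --labels")
    if package_output and not local_eval_passed:
        raise SubmissionError("--package-output requires a successful local_eval")
    return rank


# Demo: a rank-64 adapter is rejected before any packaging happens.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "adapter_config.json").write_text(json.dumps({"r": 64}))
    try:
        check_preconditions(demo_dir)
        demo_result = "ok"
    except SubmissionError as err:
        demo_result = str(err)

print(demo_result)  # rank 64 exceeds 32
```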
Run the test suite:
python -m pytest -q