NVIDIA Nemotron Reasoning Challenge Repo

This repo is organized around a three-stage single-LoRA competition pipeline:

harness-aligned chat_thinking prompts
local competition-proxy evaluation with competition_correct
staged fine-tuning: stage1 format align -> stage2 distill -> stage3 repair

Canonical Training Order

Run these in order before packaging a submission:

python scripts/probe_chat_template.py
python scripts/inspect_target_modules.py --config configs/train_stage1_format.yaml

bash scripts/train_stage1_format_align.sh
bash scripts/train_stage2_distill.sh
bash scripts/train_stage3_repair.sh

python scripts/validate_submission.py \
  --config configs/train_stage3_repair.yaml \
  --adapter-dir artifacts/adapter_stage3_repair \
  --smoke-input data/processed/official_train_tagged.jsonl \
  --labels data/processed/official_train_tagged.jsonl \
  --splits data/splits/official/splits.json \
  --package-output submission.zip

Data Preparation

Prepare the canonical official dataset and maintained splits:

python scripts/prepare_data.py --config configs/data_official.yaml

This writes:

data/processed/official_train_tagged.jsonl
data/splits/official/splits.json

Generate hard-triad synthetic data for stage2:

python -m src.teacher.synth_generator --config configs/synth_hard_triads.yaml

SFT Dataset Builders

Stage1 build:

python -m src.student.sft_dataset_builder \
  --input data/processed/official_train_tagged.jsonl \
  --output data/processed/stage1_format_align_train.jsonl \
  --selection-profile stage1 \
  --prompt-mode chat_thinking \
  --tokenizer-path artifacts/_tokenizer_cache/metric_nemotron-3-nano-30b-a3b-bf16_transformers_default \
  --split-file data/splits/official/splits.json \
  --split-name rule_novelty_all \
  --split-role train \
  --exclude-split-file data/splits/official/splits.json \
  --exclude-split-name hard_triad_rule_novelty \
  --exclude-split-role valid

Because rule_novelty_all and hard_triad_rule_novelty are built independently (different seeds, different subsets), stage1 train explicitly excludes hard_triad_rule_novelty/valid so later hard-triad validation stays unseen across the full staged pipeline (stage1 -> stage2 init_adapter_dir=stage1 -> stage3 init_adapter_dir=stage2).

Stage2 build with family balancing and stronger teacher search:

python -m src.student.sft_dataset_builder \
  --input data/processed/official_train_tagged.jsonl,data/synthetic/synth_hard_triads.jsonl \
  --output data/processed/stage2_distill_train.jsonl \
  --selection-profile stage2 \
  --prompt-mode chat_thinking \
  --tokenizer-path artifacts/_tokenizer_cache/metric_nemotron-3-nano-30b-a3b-bf16_transformers_default \
  --completion-style token_trace \
  --beam-width 10 \
  --max-depth 3 \
  --top-k 3 \
  --balance-by-family \
  --hard-triad-repeat-factor 2 \
  --max-per-signature-bucket 64 \
  --report-output data/processed/stage2_distill_report.json \
  --split-file data/splits/official/splits.json \
  --split-name rule_novelty_all \
  --split-role train

Stage3 build from stage2 model failures plus all-family replay:

python -m src.student.sft_dataset_builder \
  --input data/processed/official_train_tagged.jsonl \
  --output data/processed/stage3_repair_train.jsonl \
  --selection-profile stage3 \
  --prompt-mode chat_thinking \
  --tokenizer-path artifacts/_tokenizer_cache/metric_nemotron-3-nano-30b-a3b-bf16_transformers_default \
  --completion-style short_trace \
  --beam-width 8 \
  --max-depth 2 \
  --top-k 2 \
  --repair-artifact data/processed/stage2_model_failures_train.json \
  --replay-input data/processed/stage2_model_successes_all_train.json \
  --replay-ratio 0.25 \
  --report-output data/processed/stage3_repair_train_report.json

Stage3 failure / success buckets are produced by scripts/train_stage3_repair.sh:

repair failures come from hard_triad_rule_novelty/train - the stage2 adapter predictions that missed on the hard triad, the only samples we want to correct
replay successes come from rule_novelty_all/train - the stage2 adapter predictions that were correct across all six families, so replay can keep easy-triad families anchored and avoid catastrophic forgetting
the builder receives data/processed/official_train_tagged.jsonl as input so build_repair_set can materialise replay records for ids that live outside the hard-triad subset

Training Notes

canonical train configs are:
- configs/train_stage1_format.yaml
- configs/train_stage2_selected_trace.yaml
- configs/train_stage3_repair.yaml
stage2 and stage3 continue from the previous stage adapter via training.init_adapter_dir
lora_train.py supports --force-train; canonical stage scripts pass it explicitly
target_modules are hard-validated before training starts
max_seq_length: auto uses tokenizer-aware BPE accounting when a tokenizer path is configured
stage2 / stage3 local inference defaults to max_new_tokens: 2048 to avoid truncating chat_thinking generations
max_depth=3 is intentionally more expensive for stage2 selection; expect noticeably higher CPU time than max_depth=2
stage2_distill_valid.jsonl is the SFT loss monitor (teacher-solvable subset); it is not the hard-triad headline metric. Trust the proxy eval artifacts below instead.
after stage2 and stage3 training the canonical scripts run a per-stage bestproxy selector (scripts/select_best_proxy_checkpoint.py). It iterates over every checkpoint-* plus the final adapter, scores each against hard-triad and all-family proxies, and writes the winner to artifacts/adapter_stage{2,3}_bestproxy/. The canonical artifact pairs consumed by select_final_adapter.py are:
- hard-triad proxy: data/processed/stage{2,3}_bestproxy_hard_eval.json
- all-family proxy (leak-free, rule_novelty_all/valid minus hard_triad_rule_novelty/train): data/processed/stage{2,3}_bestproxy_all_eval.json The stage-root proxy eval artifacts still exist (stage{2,3}_proxy_valid_eval.json / stage{2,3}_proxy_all_valid_eval.json) as monitoring signals for comparing "final step" vs "bestproxy", but the submission path reads the bestproxy pair.
final adapter selection is automated by scripts/select_final_adapter.py: it compares the two bestproxy pairs at half-sample tolerance (primary: all-family; tie-break: hard-triad; default on complete tie: stage2) and copies the winner into artifacts/adapter_final_selected/. It also writes data/processed/final_adapter_selection.json with the full decision trace. Package submission.zip from artifacts/adapter_final_selected/, not from the per-stage adapter directories.
stage2 enables a silver hard-triad official pool by default (--stage2-enable-silver-official): official hard-triad (bit/cipher/equation) samples that fail the strict gate but satisfy weaker thresholds (teacher_confidence >= 0.65, support_coverage >= 0.67) are admitted to the train set with trace_style=answer_only, capped at min(0.25 * (strict + synth), 800) and sampled equation -> cipher -> bit. The stage2 build report now carries selection_counts and official_rejection_diagnostics so the next iteration can decide whether to tune the silver thresholds.
stage2 train also runs a hard-triad rescue search (--stage2-second-pass-hard-triad, default opt-in only at the shell layer). For every rejected official sample whose family is in --stage2-rescue-families (default equation, the family most consistently missing program_signature in diagnostics), a second chain-search pass is run with beam=12/depth=4/top-k=3. The new annotation is promoted only if the quality tuple (solver_verifiable, support_coverage, has_signature, teacher_confidence, top1_top2_margin) strictly improves; otherwise the first-pass annotation is restored. Rescue cannot degrade any candidate. stage2_distill_report.json now carries rescue_diagnostics.{rescue_attempted, rescue_promoted, rescue_families, rescue_settings} so you can tell how many rejected equation samples were pulled back into strict-quality annotation.
configs/synth_hard_triads.yaml (and the legacy configs/synth.yaml) set hard_negative_ratio: 0.0 because the hard_negative / negative_answer metadata fields have no downstream consumer in the current training pipeline. Re-enable only when a real consumer (loss term, filter, etc.) lands.
stage3 can skip itself when stage2 produced zero hard-triad train failures. In that case scripts/train_stage3_repair.sh copies the stage2 bestproxy adapter to artifacts/adapter_stage3_repair/ and artifacts/adapter_stage3_bestproxy/, reuses the stage2 bestproxy eval JSONs, and writes stage3_skipped.json next to the weights so downstream packaging / validation does not need to branch. See data/processed/stage3_decision.json and data/processed/stage3_best_checkpoint_selection.json for the gate outcome.

Smoke and Validation

H100 smoke path:

bash scripts/train_stage1_smoke.sh

Legacy local smoke entry now forwards to the canonical stage1 smoke script:

bash scripts/train_smoke_local.sh

Canonical training order (run top-to-bottom on an H100 box):

python scripts/probe_chat_template.py
python scripts/inspect_target_modules.py --config configs/train_stage1_format.yaml

bash scripts/train_stage1_format_align.sh
bash scripts/train_stage2_distill.sh
bash scripts/train_stage3_repair.sh

python scripts/select_final_adapter.py \
  --stage2-hard-eval data/processed/stage2_bestproxy_hard_eval.json \
  --stage2-all-eval  data/processed/stage2_bestproxy_all_eval.json \
  --stage2-adapter-dir artifacts/adapter_stage2_bestproxy \
  --stage3-hard-eval data/processed/stage3_bestproxy_hard_eval.json \
  --stage3-all-eval  data/processed/stage3_bestproxy_all_eval.json \
  --stage3-adapter-dir artifacts/adapter_stage3_bestproxy \
  --output-adapter-dir artifacts/adapter_final_selected \
  --output-json        data/processed/final_adapter_selection.json

python scripts/validate_submission.py \
  --config configs/train_stage3_repair.yaml \
  --adapter-dir artifacts/adapter_final_selected \
  --smoke-input data/processed/official_train_tagged.jsonl \
  --labels data/processed/official_train_tagged.jsonl \
  --splits data/splits/official/splits.json \
  --max-new-tokens 2048 \
  --package-output submission.zip

Inside each stage, scripts/select_best_proxy_checkpoint.py iterates over every checkpoint-* directory plus the final adapter, runs the hard-triad and all-family proxy evals, and materialises the winner at artifacts/adapter_stage{2,3}_bestproxy. The selector uses the shared rule from src/student/proxy_selection.py (all-family primary, hard-triad tie-break, prefer the final checkpoint on complete tie) so the same comparison logic is used at both the checkpoint level and the stage2-vs-stage3 level. Expect roughly 1.5-2.5h extra H100 time per stage for the selector passes; budget 3-5h total on top of the 24-36h training run.

The validator hard-fails when:

adapter_config.json is missing or its rank cannot be parsed
the parsed rank exceeds 32
--labels is passed without --smoke-input, or --package-output is passed without both --smoke-input and --labels
--package-output runs before a successful local_eval

--max-new-tokens is optional; it overrides the inference token budget from the config (defaults to 2048 for chat_thinking in stage2 / stage3 configs).

Tests

python -m pytest -q

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
configs		configs
data		data
docs		docs
notebooks		notebooks
scripts		scripts
src		src
tests		tests
官方资料		官方资料
.gitignore		.gitignore
3663529.3663827.pdf		3663529.3663827.pdf
Makefile		Makefile
README.md		README.md
codex_takeover_plan.md		codex_takeover_plan.md
kaggle_overview.html		kaggle_overview.html
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NVIDIA Nemotron Reasoning Challenge Repo

Canonical Training Order

Data Preparation

SFT Dataset Builders

Training Notes

Smoke and Validation

Tests

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NVIDIA Nemotron Reasoning Challenge Repo

Canonical Training Order

Data Preparation

SFT Dataset Builders

Training Notes

Smoke and Validation

Tests

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages