248 commits
71344f9
Implemented a distributed solution which refactors current run_model …
coketaste Jul 3, 2025
ea2dc0c
Fixed the test cases for distributed solution
coketaste Jul 3, 2025
bb64b73
Updated the interface and fix the issue due to updating
coketaste Jul 3, 2025
86d1790
Reorganize the cli interface of distributed solution
coketaste Jul 3, 2025
dd71dfa
Updated the interface of distributed solution and refine the code wit…
coketaste Jul 4, 2025
8236a7b
Added setup.py for installation with dev
coketaste Jul 4, 2025
0c42bbf
Fix the test case of distributed cli
coketaste Jul 4, 2025
f942a45
Fixed the flow of manifest and run_phase to work properly
coketaste Jul 4, 2025
08ff29b
Updated setup.py for the cases of modern and legacy installation
coketaste Jul 4, 2025
d82d78e
Fixed and enhanced live log in build and run phases
coketaste Jul 4, 2025
c848419
Fixed the log generate for different phase, and correct log name
coketaste Jul 4, 2025
3e2a44c
Fix the perf.csv generation in distributed execution
coketaste Jul 5, 2025
8a359ac
Fixed the data which update to perf.csv
coketaste Jul 5, 2025
ac32cbe
Fixed the columns in perf.csv due to parsing issue
coketaste Jul 5, 2025
bb6d3fc
Fix the incorrect regex escaping in the container runner that prevent…
coketaste Jul 5, 2025
a255c50
Update the patterns of performance and metric
coketaste Jul 5, 2025
8988508
Fixed the issue of docker_image column in perf.csv
coketaste Jul 5, 2025
f0a10a7
Improve the interface and reduce erro in registry flow
coketaste Jul 5, 2025
8caae5c
Updated the flow of run phase, fix the docker pull, fix the creds ver…
coketaste Jul 5, 2025
942e666
updated the tagged name for docker image and add a docker_image_tagge…
coketaste Jul 5, 2025
a7baa17
Updated the sequence of operations in build phas
coketaste Jul 6, 2025
2e613ca
Fixed the registry_image
coketaste Jul 6, 2025
b4e7d22
Update the registry_image
coketaste Jul 6, 2025
d1ecb97
Updated the process of run phase
coketaste Jul 6, 2025
799cce7
Refactored the file structure of package for distributed_cli
coketaste Jul 6, 2025
9875fda
Fixed the errors in unit tests
coketaste Jul 6, 2025
168ffe5
Fix the error in unit test of distributed cli
coketaste Jul 6, 2025
756d82a
Refactored constants to make design as best practices
coketaste Jul 6, 2025
9431d7f
Cleanup:
coketaste Jul 6, 2025
cf50f13
Fixed the regex pattern
coketaste Jul 6, 2025
c909a93
Fix ensures that distributed_cli logs will now contain the same detai…
coketaste Jul 6, 2025
c04435d
Implemented new test cases for pre/post scripts and profiling cases
coketaste Jul 6, 2025
72bc7bc
Debug the test cases
coketaste Jul 6, 2025
c0dd6ca
Fixed the test cases in distributed integration
coketaste Jul 6, 2025
92db9fb
Refactor context class to make it work on usages of build-only on cpu…
coketaste Jul 6, 2025
9628a01
Update the validation function and GPU detection in additional context
coketaste Jul 6, 2025
68e19fb
tests now automatically detect machine capabilities and skip GPU-depe…
coketaste Jul 6, 2025
5dfa775
Create a new madengine CLI application
coketaste Jul 6, 2025
901c12b
Fixed the test cases of distrubted integration and profiling
coketaste Jul 6, 2025
6caf244
Fix the python version compatible issue
coketaste Jul 7, 2025
d87e9b0
Fixed the error of model dict
coketaste Jul 7, 2025
61ac4f7
Update the input arg of clean docker cache and it guide
coketaste Jul 7, 2025
9469d8b
Updated distributed-execution-solution
coketaste Jul 7, 2025
b94a118
Ensures that when you run the example command on a build-only node, t…
coketaste Jul 7, 2025
802a36c
Fix the docker env vars set during build phase
coketaste Jul 7, 2025
50267e7
Filter out redundent MAD env vars
coketaste Jul 7, 2025
a52f853
Refine the docs and add diagrams of flow
coketaste Jul 7, 2025
c77cee7
Updated images of flow chart
coketaste Jul 7, 2025
df8cb08
Updated the madengien cli guide
coketaste Jul 7, 2025
2d1ae9d
Removed the execution config and enhanced implementation of manifest.…
coketaste Jul 7, 2025
9ee383b
clean up the code
coketaste Jul 7, 2025
3c1da45
Updated the distributed cli interface and clean up the code
coketaste Jul 7, 2025
0fb0e53
Fix the pulling issue from registry
coketaste Jul 7, 2025
ab0bbe6
Updated the docs
coketaste Jul 7, 2025
81bc4e4
Created a professional, comprehensive, and maintainable documentation…
coketaste Jul 8, 2025
ab36c76
make a well-formatted documentation of README
coketaste Jul 8, 2025
85c66de
Fix the MODEL_DIR setup issue
coketaste Jul 8, 2025
91805ae
Fixed the out of date unit tests in distributed cli
coketaste Jul 8, 2025
0a1a679
All syntax errors resolved - file compiles successfully in distribute…
coketaste Jul 8, 2025
ef64de6
Fix the test case of distributed integration
coketaste Jul 8, 2025
23b3bbb
Fixed the test profiling
coketaste Jul 8, 2025
0fec233
Updated the fix to handle permssion erro
coketaste Jul 8, 2025
b5f6486
Refine the assertion
coketaste Jul 8, 2025
7060f76
Added test cases of mad_cli and distributed integration
coketaste Jul 8, 2025
b65bf0d
Massively enhanced distributed execution with runners of SSH, Ansbile…
coketaste Jul 9, 2025
661a9ae
Reverted somme missing functions
coketaste Jul 9, 2025
29ac831
new functionality allows users to provide Docker Hub credentials via …
coketaste Jul 9, 2025
8e26033
Merge branch 'coketaste/refactor' into coketaste/refactor-runners
coketaste Jul 9, 2025
db75808
Changed docker.io to dockerhub
coketaste Jul 9, 2025
14cc12e
Merge branch 'coketaste/refactor' into coketaste/refactor-runners
coketaste Jul 9, 2025
9b09f01
Fix the test case of context
coketaste Jul 9, 2025
2a26dbf
Updated README.md
coketaste Jul 9, 2025
b35508b
Fix the unit test of e2e distributed run with profiling
coketaste Jul 9, 2025
a61c287
Fixed the issue of mocks gpu
coketaste Jul 9, 2025
96d7e27
Rewrite the unit test gpu version
coketaste Jul 9, 2025
566f1cb
Fixed the manfiest name error
coketaste Jul 10, 2025
cbd86c1
Fixed the missing manifest file
coketaste Jul 10, 2025
b3052f5
Updated the warning message of missing cred
coketaste Jul 10, 2025
4955bcf
Merge pull request #14 from ROCm/coketaste/refactor-runners
coketaste Jul 10, 2025
71fe348
Updated the MAD_DOCKERHUB_ creds parsing logic
coketaste Jul 10, 2025
49f60dc
Merge branch 'coketaste/refactor' of https://github.com/ROCm/madengin…
coketaste Jul 10, 2025
32b5ff7
Updatd README
coketaste Jul 11, 2025
b22bc7b
Implemented a batch input arg for madengine-cli build
coketaste Jul 11, 2025
768dcf9
enhanced logging system is now active and will automatically highligh…
coketaste Jul 11, 2025
a4b324f
Fix the error local variable docker_image referenced before assignment
coketaste Jul 11, 2025
ebfb472
Updated the perf dataframe output
coketaste Jul 11, 2025
e47572e
The fixes are backward compatible and maintain existing functionality…
coketaste Jul 11, 2025
3a73edc
Fixed the problematic log
coketaste Jul 11, 2025
e1000a4
Fixed the error pattern, removed the wrong string
coketaste Jul 11, 2025
06934d3
Fixed the error of test prof
coketaste Jul 12, 2025
59dd584
Updated the interface of mad_cli
coketaste Jul 12, 2025
d696784
Merge pull request #17 from ROCm/coketaste/refactor-stage
coketaste Jul 12, 2025
5821b3b
Update README.md
coketaste Jul 14, 2025
30f1329
ensure that the DistributedOrchestrator.build_phase method and the un…
coketaste Jul 21, 2025
f6c18fa
Updated the build batch manifest to distributed orchestrator
coketaste Jul 21, 2025
11895f9
Debug the batch manifest
coketaste Jul 21, 2025
27627aa
Update the flow use per-model registry settings for both build and ru…
coketaste Jul 23, 2025
c7c6d37
correct registry image will be used for each model as intended
coketaste Jul 23, 2025
7449493
The push_image function now accepts and uses the explicit registry_im…
coketaste Jul 23, 2025
7f2c63b
Updated the explicit_registry_image assignment
coketaste Jul 23, 2025
9f50d04
Debug the registry info setting
coketaste Jul 23, 2025
05f8a26
Updated the function of export build manifest
coketaste Jul 23, 2025
8f8dc88
Add verbose for debugging
coketaste Jul 24, 2025
de6b49c
Debug the export build manifest
coketaste Jul 24, 2025
f1a3905
Debug the registry extract from batch build metadata
coketaste Jul 24, 2025
d412956
Debug the exaction
coketaste Jul 24, 2025
a03fa0d
Merge pull request #23 from ROCm/coketaste/refactor-batch
coketaste Jul 24, 2025
624cc29
Corrected the content of synthetic image which built_new is false in …
coketaste Jul 24, 2025
af7ddb4
Fixed the type error in additional context
coketaste Jul 25, 2025
b5a800b
Debug the parsing of gpu vendoer and guest os
coketaste Jul 25, 2025
bc18784
Correct the pattern of Dockerfile
coketaste Jul 25, 2025
558b7af
Updated the print
coketaste Jul 25, 2025
0b7eba6
Update the rich print
coketaste Jul 25, 2025
f4778ec
Merge pull request #25 from ROCm/coketaste/refactor-cleanup
coketaste Jul 25, 2025
57c4bce
Figured out a critical issue about dual CLI implementation creating m…
Jul 26, 2025
7ca3147
Fixed the dockerfile matched
coketaste Jul 27, 2025
55f630d
Resolved conflicts of merge
coketaste Jul 27, 2025
56eda87
refactored the logic in _process_batch_manifest_entries() to include …
coketaste Jul 27, 2025
6b60a37
Added unit tests for new unified error handlers
coketaste Jul 27, 2025
bc9153e
Updated README.md
coketaste Jul 28, 2025
55d378d
Implemented a SLURM runner follows the same comprehensive pattern as …
coketaste Jul 28, 2025
e369f1f
Fixed the errors in unit tests
coketaste Jul 28, 2025
aa9d39f
Merge pull request #29 from ROCm/coketaste/refactor-update-runner
coketaste Jul 28, 2025
90ec534
Used Rich console print to replace part of regular print to enhance t…
coketaste Jul 31, 2025
4256588
Updated rich conosle print to enhance the log readability
coketaste Jul 31, 2025
226b6a4
Update the new line
coketaste Jul 31, 2025
9090d23
Updated the new line for all sections
coketaste Jul 31, 2025
279223a
Updated final table of dataframe
coketaste Jul 31, 2025
bd16f88
Updated the display of dataframe from head to tail
coketaste Jul 31, 2025
af89326
Updated the checking gpu status
coketaste Jul 31, 2025
1c8f17c
Cleanup
coketaste Jul 31, 2025
1445618
Merge pull request #30 from ROCm/coketaste/refactor-update-log
coketaste Jul 31, 2025
4b57f4b
Merge pull request #27 from ROCm/coketaste/refactor-update
coketaste Aug 5, 2025
72982f8
Updated README
coketaste Aug 5, 2025
b6b79ca
Added discover command to mad_cli
coketaste Aug 5, 2025
00f4a5e
Implemented CLI detect MAD_CONTAINER_IMAGE in additional context, pro…
coketaste Aug 6, 2025
ee5740d
Merge pull request #31 from ROCm/coketaste/refactor-interface
coketaste Aug 7, 2025
364bef4
Implemented the core multi-GPU architectures support for docker image…
coketaste Aug 8, 2025
156bcfe
Implemented unit tests for the feature of multi-gpu arch
coketaste Aug 8, 2025
8457257
Debug and fix the unit test of multi gpu arch
coketaste Aug 8, 2025
3a0b4c7
Debug the issue of display results table
coketaste Aug 8, 2025
682bec2
Enhanced the results table, and improved the flow of handle gpu arch …
coketaste Aug 8, 2025
89784ca
Creates architecture-specific images with proper naming and metadata,…
coketaste Aug 9, 2025
23bbf57
Fixed the syntax error
coketaste Aug 9, 2025
4e61147
Merge pull request #32 from ROCm/coketaste/refactor-multi-gpu-archs
coketaste Aug 13, 2025
5444a67
ported changes from coketaste/amd-smi
Boss2002n Oct 3, 2025
9dfe5d8
Revert "ported changes from coketaste/amd-smi"
Boss2002n Oct 3, 2025
d5c3402
Resolved merging conflicts
coketaste Oct 20, 2025
e9202c2
Fixed the tools for distributed mode
coketaste Oct 21, 2025
b49ed4b
Fixed the cleanup
coketaste Oct 21, 2025
0ac1855
Merge pull request #52 from ROCm/coketaste/refactor-tools
coketaste Oct 21, 2025
15cbeaa
Fixed the table of resutls
coketaste Oct 21, 2025
026fec3
Fixed the GPU Product Name
coketaste Nov 27, 2025
9b7b347
Fixed the issue in selftest
coketaste Nov 27, 2025
eca075a
Enhanced unit tests and cleanup
coketaste Nov 27, 2025
ef9a2a8
Refactor the architecture and flow
coketaste Nov 28, 2025
ec49ed4
Updated the REFACTOR Plan
coketaste Nov 28, 2025
ab53bb1
Updated PLAN
coketaste Nov 29, 2025
4706e45
Update Plan
coketaste Nov 29, 2025
df83196
Update Plan
coketaste Nov 29, 2025
dea7f71
Refactor the new madengine cli architecture and flow
coketaste Nov 29, 2025
6fb79a7
Fixed the copy issue of scripts
coketaste Nov 29, 2025
50d35b5
Fixed the migration issues and fixed the tags multitag inputs
coketaste Nov 29, 2025
6fc0dad
Improved error handler and updated unit tests
coketaste Nov 30, 2025
983cd9d
Fixed the prescript with prof power and vram
coketaste Nov 30, 2025
447b9f8
Fixed the issues of data provider and tags
coketaste Nov 30, 2025
3a63f40
Refactored the GPU Tool Manager with factory to handle AMD ROCm and N…
coketaste Nov 30, 2025
99368c8
Reorganize codebase: move docker_builder, discover_models, update_per…
coketaste Nov 30, 2025
9a86eb4
Reorganize file structure and cleanup
coketaste Nov 30, 2025
0b86dff
cleanup runners
coketaste Nov 30, 2025
e153e12
fixed the unit tests for new madengine cli and depreciated unit tests…
coketaste Dec 1, 2025
a804065
Implemented k8s deployment
coketaste Dec 1, 2025
66d0f98
Fixed the perf csv unified format
coketaste Dec 1, 2025
7f64114
Updated the flow of k8s
coketaste Dec 1, 2025
7975dd9
Init examples of k8s config for different use cases
coketaste Dec 1, 2025
24ad08e
Fixed the data provider for minio
coketaste Dec 2, 2025
be95b27
Fixed the data nas
coketaste Dec 2, 2025
27fa4ac
Implemented k8s tools package and Enabled download and pre-script too…
coketaste Dec 2, 2025
6b0e389
Fixed the tools encapsulated
coketaste Dec 3, 2025
a2f2dce
Added PVC storage support for results generated by workload on k8s pod
coketaste Dec 4, 2025
fd88e4c
Implemented the torchrun as runner for multigpu and multinode
coketaste Dec 4, 2025
c9095e0
Fixed the torchrun on multigpu on k8s
coketaste Dec 4, 2025
0830de5
Fixed the tools as pre/post scripts running on multigpu k8s
coketaste Dec 5, 2025
1124433
Fixed the error handler of performance empty due to benchmark failed …
coketaste Dec 5, 2025
e21a163
Fixed the chain tools
coketaste Dec 5, 2025
63a97c4
Fixed the multinode on k8s
coketaste Dec 5, 2025
76b0654
Updated k8s config
coketaste Dec 5, 2025
1818da8
Updated the pod creation with PVC
coketaste Dec 6, 2025
0bcf6d2
Clean up
coketaste Dec 6, 2025
f83238b
Refactored context to handle new madengine cli to support local and d…
coketaste Dec 7, 2025
6396e36
Migrated mad_cli with new cli structure desgin
coketaste Dec 7, 2025
7a609dc
Fixed the unit tests for new madengine cli
coketaste Dec 8, 2025
21ca7bb
Fixed the unit tests
coketaste Dec 9, 2025
9b7d922
Improved the interface of k8s-configs to simplify the complex config …
coketaste Dec 10, 2025
e931698
Fixed the timeout issue
coketaste Dec 10, 2025
41d2a8b
Added models for testing distribution with different launchers
coketaste Dec 10, 2025
c391800
Removed non-existent environment variables; Kept only standard MIOpen…
coketaste Dec 11, 2025
2aaed6e
Added new models of dummy_megatron_lm and dummy_deepspeed to validate…
coketaste Dec 11, 2025
44e586b
Implemented launcher of Slurm running distributed workload
Dec 11, 2025
779d6b9
Fixed the issue of depolyment config
coketaste Dec 11, 2025
7c70212
Fixed the gpu resolution for setting num gpus
coketaste Dec 12, 2025
1e3f9a0
Merge branch 'coketaste/refactor-dis-slurm' of https://github.com/ROC…
coketaste Dec 12, 2025
54a7a03
Fixed the multinode configs and cleanup old MULTI_NODE args which are…
coketaste Dec 12, 2025
1d406ad
Fixed the streaming log
coketaste Dec 12, 2025
daf309f
Refactored the torchrun on slurm with multi gpu and multi node
coketaste Dec 13, 2025
669e230
Fixed the job template of slurm for single and multinode
coketaste Dec 14, 2025
b5073cf
Fixed the multinode job
coketaste Dec 14, 2025
8cc2a48
Fixed the multinode on slurm node
coketaste Dec 15, 2025
d119cd1
Fixed the multinode slurm depoly case which no SIGBUS crashes and wor…
coketaste Dec 15, 2025
92b5356
Validate madengine-cli on compute node
coketaste Dec 15, 2025
15e2c42
Fixed the unit tests for slurm deploy update
coketaste Dec 15, 2025
a2d5d82
Created run script of dummy deepspeed
coketaste Dec 15, 2025
7ebf275
Implemented inference serving launchers using vllm and sglang for dis…
coketaste Dec 16, 2025
906691b
Fixed the issue in vllm for v1 engine
coketaste Dec 16, 2025
641383d
Debug and test vllm deploy on multinode of slurm with v1 engine and r…
coketaste Dec 17, 2025
13068ce
Implemented launchers of vllm and sglang to run workload on single an…
coketaste Dec 18, 2025
7c2842f
Fixed the unit tests of skip gpu arch
coketaste Dec 18, 2025
975ea12
Fixed the issue of copy common and add pyt_huggingface_gpt2 and pyt_h…
coketaste Dec 19, 2025
0698b2b
v2.0 development:
coketaste Dec 19, 2025
56a0de4
Updated the gpu_vendor and guest_os fields in config
coketaste Dec 19, 2025
ff522cf
Updated README.md and its sections in docs
coketaste Dec 19, 2025
b9f7634
Updated the README and cleanup
coketaste Dec 19, 2025
8aff6df
Replace sleep to cat
coketaste Dec 20, 2025
17f0d51
Updated torchrun with run.sh pattern on k8s deployment
coketaste Dec 20, 2025
23eaded
Updated Megatron-lm base image using ROCm/megatron-lm:latest
coketaste Dec 20, 2025
912217b
Implemented report and database commands
coketaste Dec 20, 2025
e7977cc
Added the feature of perf superset to collect configs and multi-results
coketaste Dec 20, 2025
07753bd
Fixed the k8s pvc issue
coketaste Dec 20, 2025
c999f28
Updated the context saving logic
coketaste Dec 21, 2025
780d3b9
Reorganize unit tests, remove reduntant and edge cases, add examples …
coketaste Dec 21, 2025
5942717
Make an universal soluton with docker exec
coketaste Dec 21, 2025
70ca5cd
Replaced sleep with tail
coketaste Dec 21, 2025
cd02965
Fixed teh docker pull issue on compute node if the layer of image cra…
coketaste Dec 21, 2025
4ca22a2
Removed MySQL database interface
coketaste Dec 22, 2025
f05eee3
Updated the Megatron-lm launcher on both k8s and slurm
coketaste Dec 22, 2025
d1f07cf
Updated k8s-configs megatron-lm and deepspeed
coketaste Dec 22, 2025
096e337
Updated the config of gradient_accumulation_steps
coketaste Dec 22, 2025
77b9735
Removed legacy madengine and its relative, use madengine as unified m…
coketaste Dec 23, 2025
f336db1
Cleanup and enhance README of project
coketaste Dec 23, 2025
1af2531
Merge pull request #59 from ROCm/coketaste/refactor-dis-slurm
coketaste Dec 23, 2025
259df29
Fixed the vllm multinode on k8s
coketaste Dec 23, 2025
1438f7b
Fixed the format error in kubernetes
coketaste Dec 24, 2025
7555c14
Implemented sglang-disagg launcher for slurm and k8s on multinode
coketaste Dec 24, 2025
2fe8ab0
Updated docs of project refer to recent changes
coketaste Dec 24, 2025
b28b4f7
Updated README of project
coketaste Dec 24, 2025
f3878bc
Fixed the stack tools for tracing
coketaste Dec 24, 2025
e5094ed
Fixed the tools stack for gpu info power and gpu info vram profilers
coketaste Dec 26, 2025
3e056cd
Fixed the error of regex pattern mismatch
coketaste Dec 27, 2025
43 changes: 41 additions & 2 deletions .gitignore
@@ -6,6 +6,22 @@ __pycache__/
# C extensions
*.so

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# Distribution / packaging
.Python
build/
@@ -36,7 +52,7 @@ MANIFEST
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
# Testing and coverage
htmlcov/
.tox/
.nox/
@@ -49,6 +65,23 @@ coverage.xml
*.py,cover
.hypothesis/
.pytest_cache/

# MADEngine specific
credential.json
data.json
*.log
*.csv
*.html
library_trace.csv
library_perf.csv
perf.csv
perf.html

# Temporary and build files
temp/
tmp/
*.tmp
.pytest_cache/
cover/

# Translations
@@ -101,4 +134,10 @@ scripts/
.*_env/
.vscode/

tmp/
build_manifest.json
tmp/
k8s_manifests/
k8s_results/
rocprof_output/
slurm_output/
MagicMock/
36 changes: 36 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,36 @@
# Pre-commit hooks configuration
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-toml
      - id: check-added-large-files
      - id: check-merge-conflict
      - id: debug-statements

  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
        language_version: python3

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.3.0
    hooks:
      - id: mypy
        additional_dependencies: [types-requests, types-PyYAML]
        exclude: ^(tests/|scripts/)
110 changes: 110 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,110 @@
# Changelog

All notable changes to madengine will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Fixed
- Removed stale compiled Python file (`__init__.pyc`) from source tree
- Cleaned up unused `typing_extensions` import in `core/console.py`
- Improved type hint accuracy in `Console.sh()` method docstring

### Breaking Changes
- **CLI Unification**: Simplified command-line interface
  - ✅ `madengine` is now the unified CLI command (previously `madengine-cli`)
  - ❌ Removed legacy `madengine` v1.x CLI (previously `mad.py`)
  - ❌ Removed `madengine-cli` alias (use `madengine` instead)
  - **Migration**: Simply replace `madengine-cli` with `madengine` in your scripts
  - All functionality remains identical, just cleaner command naming
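
The migration above amounts to a textual rename of the command. A minimal sketch of that rename in Python follows, assuming your scripts invoke the tool by the bare name `madengine-cli`. The helper `migrate_command_name` and the sample script text are hypothetical and not part of madengine.

```python
import re

# Match `madengine-cli` only as a standalone command name, so that longer
# names such as `madengine-cli-guide.md` are left untouched.
_OLD_COMMAND = re.compile(r"(?<![\w.-])madengine-cli(?![\w-])")


def migrate_command_name(script_text: str) -> str:
    """Return script_text with the old command name replaced by `madengine`."""
    return _OLD_COMMAND.sub("madengine", script_text)


if __name__ == "__main__":
    # Hypothetical script content used only for illustration.
    old_script = "madengine-cli build\n# see docs/madengine-cli-guide.md\n"
    print(migrate_command_name(old_script), end="")
```

The lookaround guards are a design choice: a plain string replace would also rewrite file names that merely contain the old command name.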

### Removed
- **Legacy CLI Components**:
  - `src/madengine/mad.py` - Legacy CLI entry point (v1.x)
  - `src/madengine/tools/run_models.py` - Legacy model runner
  - `docs/legacy-cli.md` - Legacy CLI documentation
  - Justification: Modern `madengine` CLI (formerly `madengine-cli`) provides all functionality plus K8s, SLURM, and distributed support

### Security
- **CRITICAL:** Fixed SQL injection vulnerability in legacy database module (`src/madengine/db/database_functions.py`)
  - Replaced string formatting with parameterized queries using SQLAlchemy `text()`
  - Prevents potential SQL injection attacks in `get_matching_db_entries()` function
- Fixed 4 instances of bare `except:` blocks that could mask critical exceptions
  - `kubernetes.py`: Replaced with specific exception types (`ConfigException`, `FileNotFoundError`, `ApiException`)
  - `console.py`: Replaced with specific exception types (`OSError`, `ValueError`) for resource cleanup
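
To illustrate why the parameterized form matters, here is a minimal sketch using Python's standard-library `sqlite3` instead of the SQLAlchemy `text()` call named above. This is a deliberate substitution: it shows the same idea of bound parameters, not the project's actual fix. The table, column, and function names are hypothetical and are not taken from `database_functions.py`.

```python
import sqlite3


def find_entries_unsafe(conn, model_name):
    # Vulnerable: the value is spliced into the SQL text, so a crafted
    # string can change the structure of the query.
    query = "SELECT id FROM runs WHERE model = '%s'" % model_name
    return conn.execute(query).fetchall()


def find_entries_safe(conn, model_name):
    # Parameterized: the driver binds the value separately from the SQL
    # text, so it is always treated as data.
    return conn.execute(
        "SELECT id FROM runs WHERE model = ?", (model_name,)
    ).fetchall()


def make_demo_db():
    # Hypothetical schema, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, model TEXT)")
    conn.executemany(
        "INSERT INTO runs (model) VALUES (?)", [("dummy",), ("other",)]
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "x' OR '1'='1"
    print(len(find_entries_unsafe(conn, payload)))  # every row matches
    print(len(find_entries_safe(conn, payload)))    # no row matches
```

With the crafted input, the string-formatted query returns every row, while the parameterized query treats the whole string as a model name and returns nothing.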

### Added
- **Comprehensive Launcher Support**: Full K8s and SLURM support for 6 distributed frameworks
  - TorchTitan: LLM pre-training with FSDP2+TP+PP+CP parallelism
  - vLLM: High-throughput LLM inference with continuous batching
  - SGLang: Fast LLM inference with structured generation
  - DeepSpeed: ZeRO optimization training (K8s support added)
  - Megatron-LM: Large-scale transformer training (K8s + SLURM)
  - torchrun: Standard PyTorch DDP/FSDP
- **Centralized Launcher Documentation**: `docs/distributed-launchers.md` with comprehensive guide
- **Example Configurations**: 6 new minimal configs for distributed launchers (K8s)
- Comprehensive development tooling and configuration
- Pre-commit hooks for code quality
- Makefile for common development tasks
- Developer guide with coding standards
- Type checking with mypy
- Code formatting with black and isort
- Enhanced .gitignore for better file exclusions
- CI/CD configuration templates
- **Major Documentation Refactor**: Complete integration of distributed execution and CLI guides into README.md
  - Professional open-source project structure with badges and table of contents
  - Comprehensive MAD package integration documentation
  - Enhanced model discovery and tag system documentation
  - Modern deployment scenarios and configuration examples

### Changed
- **README.md**: Added launcher ecosystem highlights to v2.0 features
- **K8s README**: Updated with new launcher configs and comprehensive launcher section
- **Documentation Structure**: Consolidated all launcher docs into single comprehensive guide
- Improved package initialization and imports
- Replaced print statements with proper logging in main CLI
- Enhanced error handling and logging throughout codebase
- Cleaned up setup.py for better maintainability
- Updated development dependencies in pyproject.toml
- **Complete README.md overhaul**: Merged all documentation into a single, comprehensive source
  - Restructured documentation to emphasize MAD package integration
  - Enhanced CLI usage examples and distributed execution workflows
  - Improved developer contribution guidelines and legacy compatibility notes
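
As an illustration of the "print statements replaced with logging" item in the list above, a minimal sketch using the standard `logging` module follows. The logger name `madengine.demo`, the functions `configure_logging` and `run_step`, and the message text are hypothetical and not taken from the madengine source.

```python
import io
import logging


def configure_logging(stream, level=logging.INFO):
    """Attach one stream handler to a demo logger and return the logger."""
    logger = logging.getLogger("madengine.demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(level)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


def run_step(logger, step_name):
    # Before: print("Running", step_name)
    logger.info("Running %s", step_name)
    logger.debug("Details for %s", step_name)  # hidden unless level is DEBUG


if __name__ == "__main__":
    buffer = io.StringIO()
    run_step(configure_logging(buffer), "build")
    print(buffer.getvalue(), end="")
```

Unlike `print`, the logger carries a severity level and a name, so verbosity can be changed in one place without editing each call site.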

### Changed (Previous)
- Removed Python cache files from repository
- Fixed import organization and structure
- Improved docstring formatting and consistency
- Cleaned up documentation fragmentation

### Removed
- Unnecessary debug print statements
- Python cache files and build artifacts
- **Legacy documentation files**: `docs/distributed-execution-solution.md` and `docs/madengine-cli-guide.md`
- **Duplicate documentation**: `docs/TORCHTITAN_LAUNCHER.md` (consolidated into distributed-launchers.md)
- Redundant documentation scattered across multiple files

## [Previous Versions]

For changes in previous versions, please refer to the git history.

---

## Guidelines for Changelog Updates

### Categories
- **Added** for new features
- **Changed** for changes in existing functionality
- **Deprecated** for soon-to-be removed features
- **Removed** for now removed features
- **Fixed** for any bug fixes
- **Security** for vulnerability fixes

### Format
- Keep entries brief but descriptive
- Include ticket/issue numbers when applicable
- Group related changes together
- Use the imperative mood ("Add feature" not "Added feature")
- Target audience: users and developers of the project