With MLX-LM-LoRA you can train Large Language Models locally on Apple Silicon using MLX. Training works with all models supported by MLX-LM, including:
- Llama 3, 4
- Phi 2, 3
- Mistral
- Mixtral
- Qwen 2, 2.5, 3
- Qwen3 MoE
- Qwen3 Next
- Gemma 1, 2, 3
- OLMo, OLMoE
- MiniCPM, MiniCPM3
- and more...
Training Types:
- LoRA: Low-Rank Adaptation for efficient fine-tuning
- DoRA: Weight-Decomposed Low-Rank Adaptation
- Full-precision: Train all model parameters
- Quantized training: QLoRA with 4-bit, 6-bit, or 8-bit quantization
Training Algorithms:
- SFT: Supervised Fine-Tuning
- DPO: Direct Preference Optimization
- CPO: Contrastive Preference Optimization
- ORPO: Odds Ratio Preference Optimization
- GRPO: Group Relative Policy Optimization
- GSPO: Group Sequence Policy Optimization
- Dr. GRPO: Decoupled Reward Group Relative Policy Optimization
- DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
- Online DPO: Online Direct Preference Optimization
- XPO: Extended Preference Optimization
- RLHF Reinforce KL: Reinforcement Learning from Human Feedback with REINFORCE (with KL regularization)
- PPO: Proximal Policy Optimization
Synthetic Dataset Creation:
- Prompts: Create a synthetic prompt dataset using a base model
- SFT: Create a synthetic SFT dataset using a teacher model
- Preferences: Create a synthetic preference dataset using a base and a teacher model
Training Your Custom Preference Model:
- You can now train a custom preference model for online preference training
- 🧪 Fine-Tuning (Simple) – Shows how to fine-tune a model using LoRA on a standard SFT dataset.
- 🧠 Fine-Tuning (Detailed) – Uses full model weights instead of LoRA for supervised fine-tuning.
- ⚖️ ORPO Training – Monolithic preference optimization without the need for a reference model.
- 📈 DPO Training – Direct preference optimization to improve the model on human preferences.
- 👥 GRPO Training – Group-based reinforcement training with multiple completions per prompt.
- YAML configuration – Example YAML configuration file.
- Install
- Quick Start
- Training Methods
- Supervised Fine-Tuning (SFT)
- Direct Preference Optimization (DPO)
- Contrastive Preference Optimization (CPO)
- Odds Ratio Preference Optimization (ORPO)
- Group Relative Policy Optimization (GRPO)
- Group Sequence Policy Optimization (GSPO)
- Decoupled Reward Group Relative Policy Optimization (Dr. GRPO)
- Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
- Online DPO
- eXtended Preference Optimization (XPO)
- Reinforcement Learning from Human Feedback Reinforce (RLHF Reinforce)
- Proximal Policy Optimization
- Other Features
- Configuration
- Dataset Formats
- Memory Optimization
- Evaluation & Generation
pip install -U mlx-lm-lora
The main command is mlx_lm_lora.train. To see all options:
mlx_lm_lora.train --help
Basic training command:
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--data mlx-community/wikisql \
--iters 600
You can specify a YAML config with -c/--config:
mlx_lm_lora.train --config /path/to/config.yaml
Command-line flags will override corresponding values in the config file.
Standard instruction tuning using prompt-completion pairs.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode sft \
--data mlx-community/hermes-3 \
--batch-size 4 \
--learning-rate 1e-5 \
--iters 1000
Key Parameters:
- --train-type: Choose lora (default), dora, or full
- --mask-prompt: Apply loss only to assistant responses
- --max-seq-length: Maximum sequence length (default: 2048)
- --gradient-accumulation-steps: Accumulate gradients over multiple steps
Dataset Format:
{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"prompt": "Explain quantum computing", "completion": "Quantum computing uses..."}
{"text": "Complete text for language modeling"}Train models using preference pairs without a separate reward model.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode dpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid \
--reference-model-path Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
Key Parameters:
- --beta: KL penalty strength (default: 0.1)
- --dpo-cpo-loss-type: Loss function - sigmoid, hinge, ipo, or dpop
- --delta: Margin for hinge loss (default: 50.0)
- --reference-model-path: Reference model path (uses main model if not specified)
Dataset Format:
{"prompt": "User question", "chosen": "Good response", "rejected": "Bad response"}
{"system": "You are helpful", "prompt": "Question", "chosen": "Good", "rejected": "Bad"}Variant of DPO designed for machine translation and other structured tasks.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode cpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid
Key Parameters: Same as DPO. Uses the same dataset format as DPO.
Monolithic preference optimization without requiring a reference model.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode orpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--reward-scaling 1.0
Key Parameters:
- --beta: Temperature for logistic function (default: 0.1)
- --reward-scaling: Reward scaling factor (default: 1.0)
Dataset Format:
{"prompt": "Question", "chosen": "Good response", "rejected": "Bad response"}
{"prompt": "Question", "chosen": "Good", "rejected": "Bad", "preference_score": 8.0}
{"prompt": "Question", "chosen": {"messages": [...]}, "rejected": {"messages": [...]}}Generate multiple responses per prompt and learn from their relative quality.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--data mlx-community/gsm8k \
--group-size 4 \
--epsilon 1e-4 \
--max-completion-length 512 \
--temperature 0.8 \
--reward-functions "accuracy_reward,format_reward" \
--reward-weights "[0.7, 0.3]"
Key Parameters:
- --group-size: Number of generations per prompt (default: 4)
- --epsilon: Numerical stability constant (default: 1e-4)
- --max-completion-length: Max generation length (default: 512)
- --temperature: Sampling temperature (default: 0.8)
- --reward-functions: Comma-separated reward function names
- --reward-functions-file: Path to custom reward functions file
- --reward-weights: JSON list of weights for each reward function
- --grpo-loss-type: Loss variant - grpo, bnpo, or dr_grpo
Dataset Format:
{"prompt": "Math problem", "answer": "42"}
{"prompt": "Question", "answer": "Response", "system": "You are helpful"}
{"prompt": "Question", "answer": "Response", "type": "math"}Custom Reward Functions: Create a Python file with reward functions:
# my_rewards.py
from mlx_lm_lora.reward_functions import register_reward_function

@register_reward_function()
def my_custom_reward(prompt, completion, reference_answer, **kwargs):
    """Custom reward function: returns a float between 0 and 1."""
    # Example logic: full reward if the reference answer appears in the completion
    return 1.0 if reference_answer.strip() in completion else 0.0
Then use: --reward-functions-file ./my_rewards.py --reward-functions "my_custom_reward"
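As a purely illustrative sketch (not part of the library), a second reward registered in the same file could, for example, penalise very long completions; the relative influence of each function is then controlled with --reward-weights:
# my_rewards.py (continued) - hypothetical reward that penalises very long completions
from mlx_lm_lora.reward_functions import register_reward_function

@register_reward_function()
def length_penalty_reward(prompt, completion, reference_answer, **kwargs):
    """Return 1.0 for completions up to 64 words, decaying linearly to 0.0 at 512 words."""
    n_words = len(completion.split())
    return max(0.0, min(1.0, (512 - n_words) / (512 - 64)))
Combine both with: --reward-functions "my_custom_reward,length_penalty_reward" --reward-weights "[0.7, 0.3]"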
GSPO extends GRPO with importance sampling at token or sequence level for improved sample efficiency.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--grpo-loss-type grpo \
--importance-sampling-level token \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8
Key Parameters:
- --importance-sampling-level: Choose token, sequence, or None (default: None)
- All other GRPO parameters apply
Dataset Format: Same as GRPO
Dr. GRPO decouples the reward computation from the policy optimization for more stable training.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--grpo-loss-type dr_grpo \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8
Key Parameters:
- --grpo-loss-type dr_grpo: Enables the Dr. GRPO variant
- All other GRPO parameters apply
Dataset Format: Same as GRPO
DAPO uses dual epsilon values for more flexible clipping in policy optimization.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--epsilon 1e-4 \
--epsilon-high 1e-2 \
--group-size 4 \
--temperature 0.8
Key Parameters:
- --epsilon: Lower bound for clipping (default: 1e-4)
- --epsilon-high: Upper bound for clipping (uses the epsilon value if not specified)
- All other GRPO parameters apply
Dataset Format: Same as GRPO
Online preference optimization using a judge model or human feedback.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode online_dpo \
--data ./online_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5
Key Parameters:
- --judge: Judge model ID or "human" for human feedback
- --alpha: Learning rate for online updates (default: 1e-5)
- --judge-config: Additional configuration for the judge model
Dataset Format:
{"prompt": [{"role": "user", "content": "Question"}]}
{"messages": [{"role": "user", "content": "Question"}]}XPO extends online DPO with additional preference learning mechanisms.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode xpo \
--data ./xpo_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5 \
--beta 0.1
Key Parameters:
- --judge: Judge model ID or "human"
- --alpha: Online learning rate (default: 1e-5)
- --beta: KL penalty strength (default: 0.1)
- --judge-config: Additional judge configuration
Dataset Format: Same as Online DPO
Full RLHF REINFORCE pipeline, Ziegler-style, with a reward model and policy optimization.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode rlhf-reinforce \
--data Goekdeniz-Guelmez/ultrafeedback-prompt-flat \
--judge mlx-community/reward-model \
--alpha 1e-5 \
--beta 0.1
Key Parameters:
- --judge: Reward model ID
- --alpha: Policy learning rate (default: 1e-5)
- --beta: KL penalty strength (default: 0.1)
Dataset Format: Same as Online DPO
Full PPO pipeline with reward model and policy optimization.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode ppo \
--data Goekdeniz-Guelmez/ultrafeedback-prompt-flat \
--judge mlx-community/reward-model \
--epsilon 0.2
Key Parameters:
- --judge: Reward model ID
- --epsilon: Epsilon for numerical stability (default: 0.2)
Dataset Format: Same as Online DPO
This feature lets you use mlx-lm's powerful batch generation to create synthetic datasets with a teacher model. It can be used for knowledge distillation and similar workflows, and is a powerful tool for building custom models, fully locally.
With this you can create a synthetic user-prompt dataset using a model. It creates multiple files: the first is a JSONL file containing the generated samples, and the others are Parquet versions for HF compatibility. Example:
python -m mlx_lm_lora.synthetic_prompts \
--model mlx-community/Josiefied-Qwen3-4B-Instruct-2507-abliterated-v1-8bit \
--topics 'ML' 'politics' 'web security' \
--docs-dir ./docs-pdfs \
--output-dir ./sft_dataset \
--system-prompt "You are Josie, a cool and fresh AI assistant that talks like a gangster" \
--num-samples 1000 \
--valid-split 0.01 \
--batch-size 4 \
--max-tokens 4096
Resulting Dataset Format:
{"prompt": "Question", "section": "only happens when using files via --docs-dir", "topic": "only happens when using topics via --topics"}
...
You can feed the result directly into the synthetic SFT dataset creation once generation has finished (see the sketch below).
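A minimal inspection sketch before handing the prompts to the SFT stage; the file name ./sft_dataset/train.jsonl is an assumption, so adjust the path to whatever the prompt-generation step actually wrote:
# Inspect the generated prompts JSONL before the synthetic SFT stage.
import json

records = []
with open("./sft_dataset/train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Keep only non-empty prompts; "section"/"topic" are optional metadata
        if record.get("prompt", "").strip():
            records.append(record)

print(f"{len(records)} usable prompts")
print(records[0])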
With this you can create a synthetic SFT dataset using a teacher model. It creates multiple files: the first is a JSONL file containing the generated samples, and the others are Parquet versions for HF compatibility. Example:
python -m mlx_lm_lora.synthetic_sft \
--dataset-path Goekdeniz-Guelmez/Josiefication-prompts-online-po \
--model mlx-community/Josiefied-Qwen3-4B-Instruct-2507-abliterated-v1-8bit \
--output-dir ./sft_dataset \
--num-samples 1000 \
--valid-split 0.01 \
--batch-size 16 \
--max-tokens 4096 \
--use-ground-truth
Dataset Format:
{"prompt": "Question"}
{"prompt": "Question"}
{"prompt": "Question"}With this you can create a synthetic DPO flatt-dataset using a teacher model. this creates multible files just like sft. Example:
python -m mlx_lm_lora.synthetic_dpo \
--dataset-path Goekdeniz-Guelmez/Josiefication-prompts-online-po \
--base-model mlx-community/Qwen3-4B-Instruct-2507-4bit \
--teacher-model mlx-community/Qwen3-4B-Instruct-2507-4bit \
--system-prompt "can be a normal string or the path to a .txt file for longer prompts" \
--output-dir ./dpo_dataset \
--num-samples 10000 \
--valid-split 0.0001 \
--test-split 0.2 \
--batch-size 16 \
--max-tokens 8192
Dataset Format: Same as above.
This feature adds a second training stage on top of the judge (preference) stage: a reward model scores the policy's generations, and the policy is updated with a KL-penalised PPO-style loss.
- Collect preference data → judge‑mode (online DPO) → reward model
- Run RLHF (policy optimisation) using the reward model → final policy
python -m mlx_lm_lora.train_judge \
--model Goekdeniz-Guelmez/Josiefied-Qwen3-0.6B-abliterated-v1 \
--train-type full \
--optimizer adamw \
--steps-per-report 1 \
--iters 50 \
--max-seq-length 1024 \
--adapter-path ./judge_adapters \
--data mlx-community/Human-Like-DPO \
--gradient-accumulation-steps 1
Dataset Format: Same as DPO (with prompt, chosen, and rejected pairs).
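If you use a local preference dataset, a quick sanity check of the records can catch formatting problems before a run. This is a minimal sketch; the path is a placeholder:
# Sanity-check a local preference dataset before judge/DPO training.
import json

required = {"prompt", "chosen", "rejected"}
with open("preference_data/train.jsonl") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        missing = required - record.keys()
        if missing:
            raise ValueError(f"line {i} is missing fields: {sorted(missing)}")
print("all records contain prompt, chosen, and rejected")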
# Model and data
--model <model_path> # Model path or HF repo
--data <data_path> # Dataset path or HF dataset name
--train-type lora # lora, dora, or full
--train-mode sft # sft, dpo, cpo, orpo, grpo, etc.
# Training schedule
--batch-size 4 # Batch size
--iters 1000 # Training iterations
--epochs 3 # Training epochs (ignored if iters set)
--learning-rate 1e-5 # Learning rate
--gradient-accumulation-steps 1 # Gradient accumulation
# Model architecture
--num-layers 16 # Layers to fine-tune (-1 for all)
--max-seq-length 2048 # Maximum sequence length
# LoRA parameters
--lora-parameters '{"rank": 8, "dropout": 0.0, "scale": 10.0}'
# Optimization
--optimizer adam # adam, adamw, qhadam, muon
--lr-schedule cosine # Learning rate schedule
--grad-checkpoint # Enable gradient checkpointing
# Quantization
--load-in-4bits # 4-bit quantization
--load-in-6bits # 6-bit quantization
--load-in-8bits # 8-bit quantization
# Monitoring
--steps-per-report 10 # Steps between loss reports
--steps-per-eval 200 # Steps between validation
--val-batches 25 # Validation batches (-1 for all)
--wandb project_name # WandB logging
# Checkpointing
--adapter-path ./adapters # Save/load path for adapters
--save-every 100 # Save frequency
--resume-adapter-file <path> # Resume from checkpoint
--fuse # Fuse and save trained model
Preference Optimization Methods:
DPO/CPO:
--beta 0.1 # KL penalty strength
--dpo-cpo-loss-type sigmoid # sigmoid, hinge, ipo, dpop
--delta 50.0 # Margin for hinge loss
--reference-model-path <path> # Reference model path
ORPO:
--beta 0.1 # Temperature parameter
--reward-scaling 1.0 # Reward scaling factor
Group-Based Methods:
GRPO (Base):
--group-size 4 # Generations per prompt
--epsilon 1e-4 # Numerical stability constant
--temperature 0.8 # Sampling temperature
--max-completion-length 512 # Max generation length
--reward-functions "func1,func2" # Comma-separated reward functions
--reward-functions-file <path> # Custom reward functions file
--reward-weights "[0.5, 0.5]" # JSON list of reward weights
--grpo-loss-type grpo # grpo, bnpo, dr_grpo
GSPO (GRPO + Importance Sampling):
--importance-sampling-level token # token, sequence, or None
# Plus all GRPO parameters
Dr. GRPO (Decoupled Rewards):
--grpo-loss-type dr_grpo # Enable Dr. GRPO variant
# Plus all GRPO parameters
DAPO (Dynamic Clipping):
--epsilon 1e-4 # Lower bound for clipping
--epsilon-high 1e-2 # Upper bound for clipping
# Plus all GRPO parameters
Online Methods:
Online DPO:
--judge <model_id> # Judge model or "human"
--alpha 1e-5 # Online learning rate
--beta 0.1 # KL penalty strength
--judge-config '{}' # Additional judge configuration
XPO (Extended Preference Optimization):
--judge <model_id> # Judge model or "human"
--alpha 1e-5 # Online learning rate
--beta 0.1 # KL penalty strength
--judge-config '{}' # Judge configuration
# Plus additional XPO-specific parameters
RLHF Reinforce:
--judge <reward_model_id> # Reward model
--alpha 1e-5 # Policy learning rate
--beta 0.1 # KL penalty strength
--group-size 4 # Samples for policy optimization
--judge-config '{}' # Reward model configuration
PPO:
--judge <reward_model_id> # Reward model
--alpha 1e-5 # Policy learning rate
--epsilon 0.2 # Numerical stability value
--group-size 4 # Samples for policy optimization
--judge-config '{}' # Reward model configuration
Place JSONL files in a directory:
data/
├── train.jsonl
├── valid.jsonl
└── test.jsonl
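For reference, here is a minimal sketch that writes such a directory from a list of records; the record shape (the SFT completion format shown below) and split sizes are only an illustration:
# Write train/valid/test JSONL splits into a data/ directory.
import json, os, random

records = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(1000)]
random.seed(0)
random.shuffle(records)

os.makedirs("data", exist_ok=True)
n_valid, n_test = 50, 50
splits = {
    "valid.jsonl": records[:n_valid],
    "test.jsonl": records[n_valid:n_valid + n_test],
    "train.jsonl": records[n_valid + n_test:],
}
for name, rows in splits.items():
    with open(os.path.join("data", name), "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")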
mlx_lm_lora.train --model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 --data ./data --train
Configure custom field names:
--text-feature "content" # For text datasets
--chat-feature "conversation" # For chat datasets
--prompt-feature "question" # For prompt-completion
--completion-feature "answer" # For prompt-completion
--chosen-feature "preferred" # For preference datasets
--rejected-feature "dispreferred" # For preference datasets
--system-feature "instruction" # For system messages
SFT - Chat Format:
{"messages": [
{"role": "system", "content": "You are helpful"},
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"}
]}
SFT - Completion Format:
{"prompt": "What is 2+2?", "completion": "2+2 equals 4"}SFT - Text Format:
{"text": "The complete text for language modeling"}DPO/CPO Format:
{"prompt": "Explain AI", "chosen": "AI is artificial intelligence", "rejected": "AI is magic"}ORPO Format:
{"prompt": "What is AI?", "chosen": "Good explanation", "rejected": "Bad explanation", "preference_score": 0.8}GRPO Format:
{"prompt": "Solve: 2+2=?", "answer": "4", "system": "You are a math tutor"}RLHF (Online DPO, XPO, RLHF Reinforced, PPO) Format:
{"prompt": [{"role": "user", "content": "Question"}]}or:
{"prompt": "Question"}Use quantized models to reduce memory usage:
# 4-bit quantization (most memory efficient)
mlx_lm_lora.train --model <model> --load-in-4bits --train
# 6-bit quantization (balanced)
mlx_lm_lora.train --model <model> --load-in-6bits --train
# 8-bit quantization (higher quality)
mlx_lm_lora.train --model <model> --load-in-8bits --train
# Reduce batch size
--batch-size 1
# Train fewer layers
--num-layers 8
# Enable gradient checkpointing
--grad-checkpoint
# Reduce sequence length
--max-seq-length 1024
# Use gradient accumulation
--gradient-accumulation-steps 4 --batch-size 1
# Smaller LoRA rank
--lora-parameters '{"rank": 4, "dropout": 0.1, "scale": 10.0}'
# Train specific layers only
--num-layers 8
Evaluate on test set:
mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--data <data_path> \
--test \
--test-batches 500
Use mlx-lm for generation with trained adapters:
mlx_lm.generate \
--model <model_path> \
--adapter-path <adapter_path> \
--prompt "Your prompt here" \
--max-tokens 100 \
--temperature 0.7
Merge LoRA weights into base model:
mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--fuse
Learning rate schedules:
--lr-schedule cosine # Cosine annealing
--lr-schedule linear # Linear decay
--lr-schedule constant # Constant rate
Optimizers:
--optimizer adam # Adam optimizer
--optimizer adamw # AdamW with weight decay
--optimizer qhadam # Quasi-hyperbolic Adam
--optimizer muon # Muon optimizer
List available reward functions:
mlx_lm_lora.train --list-reward-functions
Use multiple reward functions:
--reward-functions "accuracy_reward,format_reward,length_reward" \
--reward-weights "[0.5, 0.3, 0.2]"--wandb my_project_name| Method | Type | Reference Model | Judge Model | Multiple Generations | Key Benefit |
|---|---|---|---|---|---|
| SFT | Supervised | ❌ | ❌ | ❌ | Simple, fast training |
| DPO | Preference | ✅ | ❌ | ❌ | No reward model needed |
| CPO | Preference | ✅ | ❌ | ❌ | Better for structured tasks |
| ORPO | Preference | ❌ | ❌ | ❌ | Monolithic optimization |
| GRPO | Policy | ❌ | ❌ | ✅ | Group-based learning |
| GSPO | Policy | ❌ | ❌ | ✅ | Importance sampling |
| Dr. GRPO | Policy | ❌ | ❌ | ✅ | Decoupled rewards |
| DAPO | Policy | ❌ | ❌ | ✅ | Dynamic clipping |
| Online DPO | Online RL | ✅ | ✅ | ✅ | Real-time feedback |
| XPO | Online RL | ✅ | ✅ | ✅ | Extended preferences |
| RLHF Reinforce | Online RL | ✅ | ✅ | ✅ | Full RL pipeline |
| PPO | Online RL | ✅ | ✅ | ✅ | Full RL pipeline |
# SFT
mlx_lm_lora.train --model <model> --train-mode sft --data <data>
# DPO
mlx_lm_lora.train --model <model> --train-mode dpo --data <data> --beta 0.1
# CPO
mlx_lm_lora.train --model <model> --train-mode cpo --data <data> --beta 0.1
# ORPO
mlx_lm_lora.train --model <model> --train-mode orpo --data <data> --beta 0.1
# GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> --group-size 4
# GSPO (GRPO with importance sampling)
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--importance-sampling-level token --group-size 4
# Dr. GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--grpo-loss-type dr_grpo --group-size 4
# DAPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--epsilon 1e-4 --epsilon-high 1e-2 --group-size 4
# Online DPO
mlx_lm_lora.train --model <model> --train-mode online_dpo --data <data> \
--judge <judge_model> --alpha 1e-5
# XPO
mlx_lm_lora.train --model <model> --train-mode xpo --data <data> \
--judge <judge_model> --alpha 1e-5
# RLHF Reinforce
mlx_lm_lora.train --model <model> --train-mode rlhf-reinforce --data <data> \
--judge <reward_model> --alpha 1e-5 --group-size 4
# PPO
mlx_lm_lora.train --model <model> --train-mode ppo --data <data> \
--judge <reward_model> --epsilon 0.2 --group-size 4
- Out of Memory: Reduce batch size, use quantization, enable gradient checkpointing
- Slow Training: Increase batch size, reduce validation frequency
- Poor Quality: Increase LoRA rank, train more layers, check data quality
- Convergence Issues: Adjust learning rate, try different optimizers
| Model Size | Recommended Settings |
|---|---|
| 1-3B | --batch-size 4 --num-layers 16 |
| 7B | --batch-size 2 --num-layers 8 --load-in-8bits |
| 13B+ | --batch-size 1 --num-layers 4 --load-in-4bits --grad-checkpoint |
model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./my_data
train_type: lora
train_mode: sft
batch_size: 4
learning_rate: 1e-5
iters: 1000
lora_parameters:
  rank: 8
  dropout: 0.0
  scale: 10.0
model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./preference_data
train_mode: dpo
beta: 0.1
dpo_cpo_loss_type: sigmoid
batch_size: 2
learning_rate: 5e-6
iters: 500
model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./grpo_data
train_mode: grpo
group_size: 4
temperature: 0.8
reward_functions: "accuracy_reward,format_reward"
reward_weights: [0.7, 0.3]
max_completion_length: 512
@software{MLX-LM-LoRA,
author = {Gökdeniz Gülmez},
title = {{MLX-LM-LoRA}: Train LLMs on Apple silicon with MLX and the Hugging Face Hub},
url = {https://github.com/Goekdeniz-Guelmez/mlx-lm-lora},
version = {0.1.0},
year = {2025},
}
