With MLX-LM-LoRA you can train Large Language Models locally on Apple Silicon using MLX. Training works with all models supported by MLX-LM, including:
- Llama 3, 4
- Phi 2, 3
- Mistral
- Mixtral
- Qwen 2, 2.5, 3
- Qwen3 MoE
- Qwen3 Next
- Gemma 1, 2, 3
- OLMo, OLMoE
- MiniCPM, MiniCPM3
- and more...
Training Types:
- LoRA: Low-Rank Adaptation for efficient fine-tuning
- DoRA: Weight-Decomposed Low-Rank Adaptation
- Full-precision: Train all model parameters
- Quantized training: QLoRA with 4-bit, 6-bit, or 8-bit quantization
Training Algorithms:
- SFT: Supervised Fine-Tuning
- DPO: Direct Preference Optimization
- CPO: Contrastive Preference Optimization
- ORPO: Odds Ratio Preference Optimization
- GRPO: Group Relative Policy Optimization
- GSPO: Group Sequence Policy Optimization
- Dr. GRPO: Decoupled Reward Group Relative Policy Optimization
- DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
- Online DPO: Online Direct Preference Optimization
- XPO: Extended Preference Optimization
- RLHF Reinforce KL: Reinforcement Learning from Human Feedback with REINFORCE (with KL regularization)
- PPO: Proximal Policy Optimization
Synthetic Dataset Creation:
- Prompts: Create a synthetic prompt dataset using a base model
- SFT: Create a synthetic SFT dataset using a teacher model
- Preferences: Create a synthetic preference dataset using a base and a teacher model
Training Your Custom Preference Model:
- You can now train a custom preference model for online preference training
- 🧪 Fine-Tuning (Simple) – Shows how to fine-tune a model using LoRA on a standard SFT dataset.
- 🧠 Fine-Tuning (Detailed) – Uses full model weights instead of LoRA for supervised fine-tuning.
- ⚖️ ORPO Training – Monolithic preference optimization without the need for a reference model.
- 📈 DPO Training – Direct preference optimization to improve the model on human preferences.
- 👥 GRPO Training – Group-based reinforcement training with multiple completions per prompt.
- YAML configuration – Example YAML configuration file.
- Install
- Quick Start
- Training Methods
- Supervised Fine-Tuning (SFT)
- Direct Preference Optimization (DPO)
- Contrastive Preference Optimization (CPO)
- Odds Ratio Preference Optimization (ORPO)
- Group Relative Policy Optimization (GRPO)
- Group Sequence Policy Optimization (GSPO)
- Decoupled Reward Group Relative Policy Optimization (Dr. GRPO)
- Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
- Online DPO
- eXtended Preference Optimization (XPO)
- Reinforcement Learning from Human Feedback Reinforce (RLHF Reinforce)
- Proximal Policy Optimization
- Other Features
- Configuration
- Dataset Formats
- Memory Optimization
- Evaluation & Generation
pip install -U mlx-lm-lora
The main command is mlx_lm_lora.train. To see all options:
mlx_lm_lora.train --help
Basic training command:
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--data mlx-community/wikisql \
--iters 600
You can specify a YAML config with -c/--config:
mlx_lm_lora.train --config /path/to/config.yaml
Command-line flags will override corresponding values in the config file.
Standard instruction tuning using prompt-completion pairs.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode sft \
--data mlx-community/hermes-3 \
--batch-size 4 \
--learning-rate 1e-5 \
--iters 1000
Key Parameters:
- --train-type: Choose lora (default), dora, or full
- --mask-prompt: Apply loss only to assistant responses
- --max-seq-length: Maximum sequence length (default: 2048)
- --gradient-accumulation-steps: Accumulate gradients over multiple steps
Dataset Format:
{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"prompt": "Explain quantum computing", "completion": "Quantum computing uses..."}
{"text": "Complete text for language modeling"}Train models using preference pairs without a separate reward model.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode dpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid \
--reference-model-path Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
Key Parameters:
- --beta: KL penalty strength (default: 0.1)
- --dpo-cpo-loss-type: Loss function - sigmoid, hinge, ipo, or dpop
- --delta: Margin for hinge loss (default: 50.0)
- --reference-model-path: Reference model path (uses main model if not specified)
Dataset Format:
{"prompt": "User question", "chosen": "Good response", "rejected": "Bad response"}
{"system": "You are helpful", "prompt": "Question", "chosen": "Good", "rejected": "Bad"}Variant of DPO designed for machine translation and other structured tasks.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode cpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid
Key Parameters: Same as DPO. Uses the same dataset format as DPO.
Monolithic preference optimization without requiring a reference model.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode orpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--reward-scaling 1.0
Key Parameters:
- --beta: Temperature for logistic function (default: 0.1)
- --reward-scaling: Reward scaling factor (default: 1.0)
Dataset Format:
{"prompt": "Question", "chosen": "Good response", "rejected": "Bad response"}
{"prompt": "Question", "chosen": "Good", "rejected": "Bad", "preference_score": 8.0}
{"prompt": "Question", "chosen": {"messages": [...]}, "rejected": {"messages": [...]}}Generate multiple responses per prompt and learn from their relative quality.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--data mlx-community/gsm8k \
--group-size 4 \
--epsilon 1e-4 \
--max-completion-length 512 \
--temperature 0.8 \
--reward-functions "accuracy_reward,format_reward" \
--reward-weights "[0.7, 0.3]"
Key Parameters:
- --group-size: Number of generations per prompt (default: 4)
- --epsilon: Numerical stability constant (default: 1e-4)
- --max-completion-length: Max generation length (default: 512)
- --temperature: Sampling temperature (default: 0.8)
- --reward-functions: Comma-separated reward function names
- --reward-functions-file: Path to custom reward functions file
- --reward-weights: JSON list of weights for each reward function
- --grpo-loss-type: Loss variant - grpo, bnpo, or dr_grpo
Dataset Format:
{"prompt": "Math problem", "answer": "42"}
{"prompt": "Question", "answer": "Response", "system": "You are helpful"}
{"prompt": "Question", "answer": "Response", "type": "math"}Custom Reward Functions: Create a Python file with reward functions:
# my_rewards.py
from mlx_lm_lora.reward_functions import register_reward_function

@register_reward_function()
def my_custom_reward(prompt, completion, reference_answer, **kwargs):
    """Custom reward function: returns a float between 0 and 1."""
    # Example logic: full reward if the reference answer appears in the completion
    return 1.0 if reference_answer.strip() in completion else 0.0
Then use: --reward-functions-file ./my_rewards.py --reward-functions "my_custom_reward"
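As a purely illustrative sketch (not part of the library), a second reward registered in the same file could, for example, penalise very long completions; the relative influence of each function is then controlled with --reward-weights:
# my_rewards.py (continued) - hypothetical reward that penalises very long completions
from mlx_lm_lora.reward_functions import register_reward_function

@register_reward_function()
def length_penalty_reward(prompt, completion, reference_answer, **kwargs):
    """Return 1.0 for completions up to 64 words, decaying linearly to 0.0 at 512 words."""
    n_words = len(completion.split())
    return max(0.0, min(1.0, (512 - n_words) / (512 - 64)))
Combine both with: --reward-functions "my_custom_reward,length_penalty_reward" --reward-weights "[0.7, 0.3]"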
GSPO extends GRPO with importance sampling at token or sequence level for improved sample efficiency.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--grpo-loss-type grpo \
--importance-sampling-level token \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8
Key Parameters:
- --importance-sampling-level: Choose token, sequence, or None (default: None)
- All other GRPO parameters apply
Dataset Format: Same as GRPO
Dr. GRPO decouples the reward computation from the policy optimization for more stable training.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--grpo-loss-type dr_grpo \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8
Key Parameters:
- --grpo-loss-type dr_grpo: Enables the Dr. GRPO variant
- All other GRPO parameters apply
Dataset Format: Same as GRPO
DAPO uses dual epsilon values for more flexible clipping in policy optimization.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--epsilon 1e-4 \
--epsilon-high 1e-2 \
--group-size 4 \
--temperature 0.8
Key Parameters:
- --epsilon: Lower bound for clipping (default: 1e-4)
- --epsilon-high: Upper bound for clipping (uses the epsilon value if not specified)
- All other GRPO parameters apply
Dataset Format: Same as GRPO
Online preference optimization using a judge model or human feedback.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode online_dpo \
--data ./online_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5
Key Parameters:
- --judge: Judge model ID or "human" for human feedback
- --alpha: Learning rate for online updates (default: 1e-5)
- --judge-config: Additional configuration for the judge model
Dataset Format:
{"prompt": [{"role": "user", "content": "Question"}]}
{"messages": [{"role": "user", "content": "Question"}]}XPO extends online DPO with additional preference learning mechanisms.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode xpo \
--data ./xpo_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5 \
--beta 0.1
Key Parameters:
- --judge: Judge model ID or "human"
- --alpha: Online learning rate (default: 1e-5)
- --beta: KL penalty strength (default: 0.1)
- --judge-config: Additional judge configuration
Dataset Format: Same as Online DPO
Full RLHF REINFORCE pipeline, Ziegler-style, with a reward model and policy optimization.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode rlhf-reinforce \
--data Goekdeniz-Guelmez/ultrafeedback-prompt-flat \
--judge mlx-community/reward-model \
--alpha 1e-5 \
--beta 0.1
Key Parameters:
- --judge: Reward model ID
- --alpha: Policy learning rate (default: 1e-5)
- --beta: KL penalty strength (default: 0.1)
Dataset Format: Same as Online DPO
Full PPO pipeline with reward model and policy optimization.
mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode ppo \
--data Goekdeniz-Guelmez/ultrafeedback-prompt-flat \
--judge mlx-community/reward-model \
--epsilon 0.2
Key Parameters:
- --judge: Reward model ID
- --epsilon: Epsilon for numerical stability (default: 0.2)
Dataset Format: Same as Online DPO
This feature lets you use mlx-lm's powerful batch generation to create synthetic datasets with a teacher model. It can be used for knowledge distillation and similar workflows, and is a powerful tool for building custom models, fully locally.
With this you can create a synthetic user-prompt dataset using a model. It creates multiple files: the first is a JSONL file containing the generated samples, and the others are Parquet versions for HF compatibility. Example:
python -m mlx_lm_lora.synthetic_prompts \
--model mlx-community/Josiefied-Qwen3-4B-Instruct-2507-abliterated-v1-8bit \
--topics 'ML' 'politics' 'web security' \
--docs-dir ./docs-pdfs \
--output-dir ./sft_dataset \
--system-prompt "You are Josie, a cool and fresh AI assistant that talks like a gangster" \
--num-samples 1000 \
--valid-split 0.01 \
--batch-size 4 \
--max-tokens 4096
Resulting Dataset Format:
{"prompt": "Question", "section": "only happens when using files via --docs-dir", "topic": "only happens when using topics via --topics"}
...
You can feed the result directly into the synthetic SFT dataset creation once generation has finished (see the sketch below).
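A minimal inspection sketch before handing the prompts to the SFT stage; the file name ./sft_dataset/train.jsonl is an assumption, so adjust the path to whatever the prompt-generation step actually wrote:
# Inspect the generated prompts JSONL before the synthetic SFT stage.
import json

records = []
with open("./sft_dataset/train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Keep only non-empty prompts; "section"/"topic" are optional metadata
        if record.get("prompt", "").strip():
            records.append(record)

print(f"{len(records)} usable prompts")
print(records[0])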
With this you can create a synthetic SFT dataset using a teacher model. It creates multiple files: the first is a JSONL file containing the generated samples, and the others are Parquet versions for HF compatibility. Example:
python -m mlx_lm_lora.synthetic_sft \
--dataset-path Goekdeniz-Guelmez/Josiefication-prompts-online-po \
--model mlx-community/Josiefied-Qwen3-4B-Instruct-2507-abliterated-v1-8bit \
--output-dir ./sft_dataset \
--num-samples 1000 \
--valid-split 0.01 \
--batch-size 16 \
--max-tokens 4096 \
--use-ground-truth
Dataset Format:
{"prompt": "Question"}
{"prompt": "Question"}
{"prompt": "Question"}With this you can create a synthetic DPO flatt-dataset using a teacher model. this creates multible files just like sft. Example:
python -m mlx_lm_lora.synthetic_dpo \
--dataset-path Goekdeniz-Guelmez/Josiefication-prompts-online-po \
--base-model mlx-community/Qwen3-4B-Instruct-2507-4bit \
--teacher-model mlx-community/Qwen3-4B-Instruct-2507-4bit \
--system-prompt "can be a normal string or the path to a .txt file for longer prompts" \
--output-dir ./dpo_dataset \
--num-samples 10000 \
--valid-split 0.0001 \
--test-split 0.2 \
--batch-size 16 \
--max-tokens 8192
Dataset Format: Same as above.
This feature adds a second training stage on top of the judge (preference) stage: a reward model scores the policy's generations, and the policy is updated with a KL-penalised PPO-style loss.
- Collect preference data → judge‑mode (online DPO) → reward model
- Run RLHF (policy optimisation) using the reward model → final policy
python -m mlx_lm_lora.train_judge \
--model Goekdeniz-Guelmez/Josiefied-Qwen3-0.6B-abliterated-v1 \
--train-type full \
--optimizer adamw \
--steps-per-report 1 \
--iters 50 \
--max-seq-length 1024 \
--adapter-path ./judge_adapters \
--data mlx-community/Human-Like-DPO \
--gradient-accumulation-steps 1
Dataset Format: Same as DPO (with prompt, chosen, and rejected pairs).
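If you use a local preference dataset, a quick sanity check of the records can catch formatting problems before a run. This is a minimal sketch; the path is a placeholder:
# Sanity-check a local preference dataset before judge/DPO training.
import json

required = {"prompt", "chosen", "rejected"}
with open("preference_data/train.jsonl") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        missing = required - record.keys()
        if missing:
            raise ValueError(f"line {i} is missing fields: {sorted(missing)}")
print("all records contain prompt, chosen, and rejected")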
# Model and data
--model <model_path> # Model path or HF repo
--data <data_path> # Dataset path or HF dataset name
--train-type lora # lora, dora, or full
--train-mode sft # sft, dpo, cpo, orpo, grpo, etc.
# Training schedule
--batch-size 4 # Batch size
--iters 1000 # Training iterations
--epochs 3 # Training epochs (ignored if iters set)
--learning-rate 1e-5 # Learning rate
--gradient-accumulation-steps 1 # Gradient accumulation
# Model architecture
--num-layers 16 # Layers to fine-tune (-1 for all)
--max-seq-length 2048 # Maximum sequence length
# LoRA parameters
--lora-parameters '{"rank": 8, "dropout": 0.0, "scale": 10.0}'
# Optimization
--optimizer adam # adam, adamw, qhadam, muon
--lr-schedule cosine # Learning rate schedule
--grad-checkpoint # Enable gradient checkpointing
# Quantization
--load-in-4bits # 4-bit quantization
--load-in-6bits # 6-bit quantization
--load-in-8bits # 8-bit quantization
# Monitoring
--steps-per-report 10 # Steps between loss reports
--steps-per-eval 200 # Steps between validation
--val-batches 25 # Validation batches (-1 for all)
--wandb project_name # WandB logging
# Checkpointing
--adapter-path ./adapters # Save/load path for adapters
--save-every 100 # Save frequency
--resume-adapter-file <path> # Resume from checkpoint
--fuse # Fuse and save trained model
Preference Optimization Methods:
DPO/CPO:
--beta 0.1 # KL penalty strength
--dpo-cpo-loss-type sigmoid # sigmoid, hinge, ipo, dpop
--delta 50.0 # Margin for hinge loss
--reference-model-path <path> # Reference model path
ORPO:
--beta 0.1 # Temperature parameter
--reward-scaling 1.0 # Reward scaling factor
Group-Based Methods:
GRPO (Base):
--group-size 4 # Generations per prompt
--epsilon 1e-4 # Numerical stability constant
--temperature 0.8 # Sampling temperature
--max-completion-length 512 # Max generation length
--reward-functions "func1,func2" # Comma-separated reward functions
--reward-functions-file <path> # Custom reward functions file
--reward-weights "[0.5, 0.5]" # JSON list of reward weights
--grpo-loss-type grpo # grpo, bnpo, dr_grpo
GSPO (GRPO + Importance Sampling):
--importance-sampling-level token # token, sequence, or None
# Plus all GRPO parameters
Dr. GRPO (Decoupled Rewards):
--grpo-loss-type dr_grpo # Enable Dr. GRPO variant
# Plus all GRPO parameters
DAPO (Dynamic Clipping):
--epsilon 1e-4 # Lower bound for clipping
--epsilon-high 1e-2 # Upper bound for clipping
# Plus all GRPO parameters
Online Methods:
Online DPO:
--judge <model_id> # Judge model or "human"
--alpha 1e-5 # Online learning rate
--beta 0.1 # KL penalty strength
--judge-config '{}' # Additional judge configuration
XPO (Extended Preference Optimization):
--judge <model_id> # Judge model or "human"
--alpha 1e-5 # Online learning rate
--beta 0.1 # KL penalty strength
--judge-config '{}' # Judge configuration
# Plus additional XPO-specific parameters
RLHF Reinforce:
--judge <reward_model_id> # Reward model
--alpha 1e-5 # Policy learning rate
--beta 0.1 # KL penalty strength
--group-size 4 # Samples for policy optimization
--judge-config '{}' # Reward model configuration
PPO:
--judge <reward_model_id> # Reward model
--alpha 1e-5 # Policy learning rate
--epsilon 0.2 # Numerical stability value
--group-size 4 # Samples for policy optimization
--judge-config '{}' # Reward model configuration
Place JSONL files in a directory:
data/
├── train.jsonl
├── valid.jsonl
└── test.jsonl
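For reference, here is a minimal sketch that writes such a directory from a list of records; the record shape (the SFT completion format shown below) and split sizes are only an illustration:
# Write train/valid/test JSONL splits into a data/ directory.
import json, os, random

records = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(1000)]
random.seed(0)
random.shuffle(records)

os.makedirs("data", exist_ok=True)
n_valid, n_test = 50, 50
splits = {
    "valid.jsonl": records[:n_valid],
    "test.jsonl": records[n_valid:n_valid + n_test],
    "train.jsonl": records[n_valid + n_test:],
}
for name, rows in splits.items():
    with open(os.path.join("data", name), "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")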
mlx_lm_lora.train --model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 --data ./data --train
Configure custom field names:
--text-feature "content" # For text datasets
--chat-feature "conversation" # For chat datasets
--prompt-feature "question" # For prompt-completion
--completion-feature "answer" # For prompt-completion
--chosen-feature "preferred" # For preference datasets
--rejected-feature "dispreferred" # For preference datasets
--system-feature "instruction" # For system messages
SFT - Chat Format:
{"messages": [
{"role": "system", "content": "You are helpful"},
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"}
]}
SFT - Completion Format:
{"prompt": "What is 2+2?", "completion": "2+2 equals 4"}SFT - Text Format:
{"text": "The complete text for language modeling"}DPO/CPO Format:
{"prompt": "Explain AI", "chosen": "AI is artificial intelligence", "rejected": "AI is magic"}ORPO Format:
{"prompt": "What is AI?", "chosen": "Good explanation", "rejected": "Bad explanation", "preference_score": 0.8}GRPO Format:
{"prompt": "Solve: 2+2=?", "answer": "4", "system": "You are a math tutor"}RLHF (Online DPO, XPO, RLHF Reinforced, PPO) Format:
{"prompt": [{"role": "user", "content": "Question"}]}or:
{"prompt": "Question"}Use quantized models to reduce memory usage:
# 4-bit quantization (most memory efficient)
mlx_lm_lora.train --model <model> --load-in-4bits --train
# 6-bit quantization (balanced)
mlx_lm_lora.train --model <model> --load-in-6bits --train
# 8-bit quantization (higher quality)
mlx_lm_lora.train --model <model> --load-in-8bits --train
# Reduce batch size
--batch-size 1
# Train fewer layers
--num-layers 8
# Enable gradient checkpointing
--grad-checkpoint
# Reduce sequence length
--max-seq-length 1024
# Use gradient accumulation
--gradient-accumulation-steps 4 --batch-size 1
# Smaller LoRA rank
--lora-parameters '{"rank": 4, "dropout": 0.1, "scale": 10.0}'
# Train specific layers only
--num-layers 8
Evaluate on test set:
mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--data <data_path> \
--test \
--test-batches 500
Use mlx-lm for generation with trained adapters:
mlx_lm.generate \
--model <model_path> \
--adapter-path <adapter_path> \
--prompt "Your prompt here" \
--max-tokens 100 \
--temperature 0.7
Merge LoRA weights into base model:
mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--fuse
Learning rate schedules:
--lr-schedule cosine # Cosine annealing
--lr-schedule linear # Linear decay
--lr-schedule constant # Constant rate
Optimizers:
--optimizer adam # Adam optimizer
--optimizer adamw # AdamW with weight decay
--optimizer qhadam # Quasi-hyperbolic Adam
--optimizer muon # Muon optimizer
List available reward functions:
mlx_lm_lora.train --list-reward-functions
Use multiple reward functions:
--reward-functions "accuracy_reward,format_reward,length_reward" \
--reward-weights "[0.5, 0.3, 0.2]"--wandb my_project_name| Method | Type | Reference Model | Judge Model | Multiple Generations | Key Benefit |
|---|---|---|---|---|---|
| SFT | Supervised | ❌ | ❌ | ❌ | Simple, fast training |
| DPO | Preference | ✅ | ❌ | ❌ | No reward model needed |
| CPO | Preference | ✅ | ❌ | ❌ | Better for structured tasks |
| ORPO | Preference | ❌ | ❌ | ❌ | Monolithic optimization |
| GRPO | Policy | ❌ | ❌ | ✅ | Group-based learning |
| GSPO | Policy | ❌ | ❌ | ✅ | Importance sampling |
| Dr. GRPO | Policy | ❌ | ❌ | ✅ | Decoupled rewards |
| DAPO | Policy | ❌ | ❌ | ✅ | Dynamic clipping |
| Online DPO | Online RL | ✅ | ✅ | ✅ | Real-time feedback |
| XPO | Online RL | ✅ | ✅ | ✅ | Extended preferences |
| RLHF Reinforce | Online RL | ✅ | ✅ | ✅ | Full RL pipeline |
| PPO | Online RL | ✅ | ✅ | ✅ | Full RL pipeline |
# SFT
mlx_lm_lora.train --model <model> --train-mode sft --data <data>
# DPO
mlx_lm_lora.train --model <model> --train-mode dpo --data <data> --beta 0.1
# CPO
mlx_lm_lora.train --model <model> --train-mode cpo --data <data> --beta 0.1
# ORPO
mlx_lm_lora.train --model <model> --train-mode orpo --data <data> --beta 0.1
# GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> --group-size 4
# GSPO (GRPO with importance sampling)
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--importance-sampling-level token --group-size 4
# Dr. GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--grpo-loss-type dr_grpo --group-size 4
# DAPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--epsilon 1e-4 --epsilon-high 1e-2 --group-size 4
# Online DPO
mlx_lm_lora.train --model <model> --train-mode online_dpo --data <data> \
--judge <judge_model> --alpha 1e-5
# XPO
mlx_lm_lora.train --model <model> --train-mode xpo --data <data> \
--judge <judge_model> --alpha 1e-5
# RLHF Reinforce
mlx_lm_lora.train --model <model> --train-mode rlhf-reinforce --data <data> \
--judge <reward_model> --alpha 1e-5 --group-size 4
# PPO
mlx_lm_lora.train --model <model> --train-mode ppo --data <data> \
--judge <reward_model> --epsilon 0.2 --group-size 4
- Out of Memory: Reduce batch size, use quantization, enable gradient checkpointing
- Slow Training: Increase batch size, reduce validation frequency
- Poor Quality: Increase LoRA rank, train more layers, check data quality
- Convergence Issues: Adjust learning rate, try different optimizers
| Model Size | Recommended Settings |
|---|---|
| 1-3B | --batch-size 4 --num-layers 16 |
| 7B | --batch-size 2 --num-layers 8 --load-in-8bits |
| 13B+ | --batch-size 1 --num-layers 4 --load-in-4bits --grad-checkpoint |
model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./my_data
train_type: lora
train_mode: sft
batch_size: 4
learning_rate: 1e-5
iters: 1000
lora_parameters:
  rank: 8
  dropout: 0.0
  scale: 10.0
model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./preference_data
train_mode: dpo
beta: 0.1
dpo_cpo_loss_type: sigmoid
batch_size: 2
learning_rate: 5e-6
iters: 500
model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./grpo_data
train_mode: grpo
group_size: 4
temperature: 0.8
reward_functions: "accuracy_reward,format_reward"
reward_weights: [0.7, 0.3]
max_completion_length: 512
@software{MLX-LM-LoRA,
author = {Gökdeniz Gülmez},
title = {{MLX-LM-LoRA}: Train LLMs on Apple silicon with MLX and the Hugging Face Hub},
url = {https://github.com/Goekdeniz-Guelmez/mlx-lm-lora},
version = {0.1.0},
year = {2025},
}
