Direct Preference Optimization (DPO)

Alignment training that teaches models to generate preferred responses based on human feedback.

Table of contents

  1. Overview
    1. What is DPO?
  2. How DPO Works
    1. Core Concept
    2. Mathematical Formulation
    3. Understanding the Formula
  3. Technical Details
    1. Data Requirements
    2. Data Format
  4. Configuration
    1. LLaMA-Factory YAML Config
  5. Data Collection
    1. Human Feedback Collection
    2. AI-Assisted Preference Generation
  6. Training Process
    1. Launch DPO Training
    2. Monitor Training
  7. Best Practices
    1. Hyperparameter Guidelines
    2. Data Quality
  8. Evaluation
    1. Win Rate Evaluation
    2. Benchmark Evaluation
  9. Common Issues
    1. Training Instability
    2. Model Degradation
    3. Reward Hacking
  10. Online DPO Loop
    1. Continuous Improvement Workflow
  11. Advanced Techniques
    1. Multi-Objective DPO
    2. Iterative DPO
  12. Resources
    1. Papers
    2. Tools
    3. Datasets

Overview

What is DPO?

Direct Preference Optimization (DPO) is a simpler alternative to RLHF (Reinforcement Learning from Human Feedback) that directly optimizes the model to prefer chosen responses over rejected ones.

Training Pipeline Position:

graph LR
    A[SFT Model] --> B[Collect Preferences]
    B --> C[DPO Training]
    C --> D[Aligned Model]
    D --> E[Deploy & Monitor]
    E --> B
    style C fill:#FFB6C1

Key Advantages over RLHF:

  • No need for separate reward model
  • More stable training
  • Lower computational cost
  • Easier to implement

How DPO Works

Core Concept

DPO’s Key Insight: Instead of training a separate reward model (like RLHF), DPO directly optimizes the language model to prefer chosen responses over rejected ones. It does this by:

  1. Comparing probabilities: Increase probability of chosen response, decrease probability of rejected response
  2. Maintaining reference: Keep model close to reference model (SFT) to prevent degradation
  3. Optimizing preference gap: Maximize the gap between chosen and rejected response probabilities

Training Process:

graph LR
    A[Preference Pair] --> B[Compute Probabilities]
    B --> C[Calculate Preference Gap]
    C --> D[Optimize Model]
    D --> E[Update Weights]
    E --> F[Repeat with Next Pair]

Mathematical Formulation

DPO Loss Function:

Given a preference pair $(x, y_w, y_l)$ where:

  • $x$ is the input prompt
  • $y_w$ is the preferred (chosen) response
  • $y_l$ is the rejected response

The DPO objective is:

\[\mathcal{L}_{\text{DPO}} = -\log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)\]

Symbol Definitions:

  • $\pi_\theta$: The model being trained (current policy)
  • $\pi_{\text{ref}}$: The reference model (usually the SFT model)
  • $\beta$: The temperature parameter (typically 0.1-0.5), controls how much the model can deviate from reference
  • $\sigma$: The sigmoid function, converts log-odds to probability
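
Worked Example (illustrative numbers, not from a real training run): suppose $\beta = 0.1$, the chosen log ratio is $\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} = 2.0$, and the rejected log ratio is $-1.0$. Then:

\[\mathcal{L}_{\text{DPO}} = -\log \sigma\big(0.1 \cdot 2.0 - 0.1 \cdot (-1.0)\big) = -\log \sigma(0.3) \approx -\log 0.574 \approx 0.55\]

A larger gap between the two log ratios drives the loss toward 0; a gap of 0 gives $-\log 0.5 \approx 0.69$, the loss value at initialization (when $\pi_\theta = \pi_{\text{ref}}$).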

Understanding the Formula

What the Formula Does:

  1. Log Ratio Terms:
    • $\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}$ measures how much more likely the current model is to generate the chosen response compared to the reference model
    • $\log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$ measures the same for the rejected response
  2. Preference Gap:
    • The difference between these two log ratios represents the preference gap
    • We want this gap to be large (chosen much more likely than rejected)
  3. Sigmoid Function:
    • Converts the gap into a probability (0 to 1)
    • Larger gap → higher probability → lower loss
  4. Negative Log:
    • We minimize negative log probability, which maximizes the probability
    • This encourages the model to prefer chosen over rejected

Intuitive Explanation:

  • If chosen response becomes more likely and rejected becomes less likely → Loss decreases ✅
  • If the chosen and rejected responses stay equally likely (relative to the reference) → Loss stays near its starting value (~0.69) ❌
  • If model deviates too far from reference → KL penalty (via $\beta$) prevents degradation
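
The following is a minimal PyTorch sketch of the loss above. It assumes the per-sequence log-probabilities have already been computed (by summing token log-probs under the policy and the frozen reference model); it is an illustration of the formula, not LLaMA-Factory's implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss for a batch of preference pairs.

    Each argument is a tensor of shape (batch_size,) holding summed
    per-token log-probabilities of the chosen/rejected response under
    the current policy or the frozen reference model.
    """
    # Log ratios: how much more (or less) likely each response is
    # under the current policy than under the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Preference gap scaled by beta; training pushes it to be large and positive.
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log sigmoid(gap); the beta-scaled log ratios act as implicit rewards.
    loss = -F.logsigmoid(logits).mean()
    reward_chosen = (beta * chosen_logratio).detach()
    reward_rejected = (beta * rejected_logratio).detach()
    return loss, reward_chosen, reward_rejected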

Technical Details

Data Requirements

Data Format

What is a Preference Pair? A preference pair consists of:

  • A conversation context (user prompt, possibly with previous turns)
  • A chosen response (preferred by humans)
  • A rejected response (less preferred by humans)

The model learns to increase the probability of chosen responses and decrease the probability of rejected responses.

Preference Pair Format:

{
  "conversations": [
    {"from": "user", "value": "Explain quantum computing"}
  ],
  "chosen": "Quantum computing uses quantum mechanical phenomena like superposition...",
  "rejected": "Quantum computing is when computers use quantum."
}

Multi-Turn Preference:

{
  "conversations": [
    {"from": "user", "value": "What is Python?"},
    {"from": "assistant", "value": "Python is a programming language."},
    {"from": "user", "value": "Show me an example"}
  ],
  "chosen": "Here's a simple Python example:\n\n```python\nprint('Hello, World!')\n```\n\nThis is the traditional first program...",
  "rejected": "Sure: print('hi')"
}
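
Before training, it can help to sanity-check the preference file. The sketch below checks only the fields shown in the format above; the JSONL path is the one used later in the collection script and is just an example.

import json

REQUIRED_KEYS = {"conversations", "chosen", "rejected"}

def validate_preference_file(path):
    """Check that each JSONL record follows the preference pair format above."""
    valid, invalid = 0, 0
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                print(f"line {i}: missing fields {sorted(missing)}")
                invalid += 1
            elif record["chosen"] == record["rejected"]:
                print(f"line {i}: chosen and rejected are identical")
                invalid += 1
            else:
                valid += 1
    print(f"{valid} valid pairs, {invalid} invalid pairs")

# Example (hypothetical file):
# validate_preference_file("data/llmops/dpo/preferences.jsonl")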

Configuration

LLaMA-Factory YAML Config

# DPO Configuration
model_name_or_path: checkpoints/sft/chat-model
ref_model: checkpoints/sft/chat-model  # Reference model
stage: dpo
do_train: true
dataset: comparison_gpt4_en
template: qwen

# DPO-specific parameters
pref_beta: 0.1  # KL penalty coefficient
pref_loss: sigmoid  # or hinge, ipo

# Training hyperparameters
cutoff_len: 2048
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
learning_rate: 5.0e-7  # Much lower than SFT!
num_train_epochs: 1.0  # Usually just 1 epoch
lr_scheduler_type: cosine
warmup_ratio: 0.1

# Optimization
fp16: true
gradient_checkpointing: true

# LoRA configuration
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 32

# Evaluation
val_size: 0.1
eval_strategy: steps
eval_steps: 100

# Output
output_dir: checkpoints/dpo/aligned-model
logging_steps: 5
save_steps: 500

Data Collection

Human Feedback Collection

UI for Preference Labeling:

import json

import gradio as gr

def create_preference_ui(model_a, model_b):
    """Create UI for collecting preference data"""
    
    def get_responses(prompt):
        response_a = model_a.generate(prompt)
        response_b = model_b.generate(prompt)
        return response_a, response_b
    
    def save_preference(prompt, response_a, response_b, preference):
        """Save user preference"""
        if preference == "A":
            chosen, rejected = response_a, response_b
        else:
            chosen, rejected = response_b, response_a
        
        preference_data = {
            "conversations": [{"from": "user", "value": prompt}],
            "chosen": chosen,
            "rejected": rejected
        }
        
        with open('data/llmops/dpo/preferences.jsonl', 'a') as f:
            f.write(json.dumps(preference_data) + '\n')
    
    # Create Gradio interface
    with gr.Blocks() as demo:
        prompt_input = gr.Textbox(label="Enter prompt")
        get_btn = gr.Button("Generate Responses")
        
        with gr.Row():
            response_a = gr.Textbox(label="Response A")
            response_b = gr.Textbox(label="Response B")
        
        with gr.Row():
            prefer_a = gr.Button("Prefer A")
            prefer_b = gr.Button("Prefer B")
        
        get_btn.click(
            get_responses, 
            inputs=[prompt_input],
            outputs=[response_a, response_b]
        )
        
        prefer_a.click(
            lambda p, a, b: save_preference(p, a, b, "A"),
            inputs=[prompt_input, response_a, response_b]
        )
        prefer_b.click(
            lambda p, a, b: save_preference(p, a, b, "B"),
            inputs=[prompt_input, response_a, response_b]
        )
    
    return demo

# Launch UI
demo = create_preference_ui(model_a, model_b)
demo.launch()

AI-Assisted Preference Generation

Using GPT-4 as Judge:

from openai import OpenAI

client = OpenAI()

def generate_preference_with_gpt4(prompt, response_a, response_b):
    """Use GPT-4 to judge which response is better"""
    
    judge_prompt = f"""Compare the following two responses to the user prompt and determine which is better.

User Prompt: {prompt}

Response A:
{response_a}

Response B:
{response_b}

Evaluate based on:
1. Helpfulness
2. Accuracy
3. Clarity
4. Safety

Respond with ONLY "A" or "B" to indicate which response is better."""
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0.0
    )
    
    choice = response.choices[0].message.content.strip()
    
    if choice == "A":
        return {"chosen": response_a, "rejected": response_b}
    else:
        return {"chosen": response_b, "rejected": response_a}

Training Process

Launch DPO Training

llamafactory-cli train config/dpo_config.yaml

Monitor Training

Key Metrics:

  • Rewards/Chosen: Should increase (model prefers chosen responses)
  • Rewards/Rejected: Should decrease (model disfavors rejected responses)
  • Rewards/Margins: Gap between chosen and rejected rewards (should increase)
  • DPO Loss: Should decrease from its starting value of about 0.69 (ln 2) into roughly the 0.3-0.6 range
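
The Rewards/* metrics above are not outputs of a separate reward model; they correspond (up to batch averaging, and possibly minor differences between trainer versions) to the implicit rewards DPO defines, i.e. the β-scaled log ratios from the loss:

\[r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\]

Rewards/Chosen and Rewards/Rejected report this quantity for the chosen and rejected responses, and Rewards/Margins is their difference.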

TensorBoard:

tensorboard --logdir=checkpoints/dpo/aligned-model/runs

Best Practices

Hyperparameter Guidelines

Parameter            Conservative    Recommended
Learning Rate        1e-7            5e-7
Beta (KL Penalty)    0.05            0.1
Epochs               1               1-2
Batch Size           16              32

Critical: Use much lower learning rate than SFT! High LR causes instability.

Data Quality

Good Preference Pairs:

  • Clear quality difference between chosen and rejected
  • Both responses are on-topic
  • Differences are meaningful (not just length)
  • Diverse types of improvements (accuracy, helpfulness, safety)

Bad Preference Pairs:

  • Both responses are equally good/bad
  • Rejected response is completely off-topic
  • Difference is only formatting
  • Biased or inconsistent preferences
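
A hedged filtering sketch based on the criteria above; the length-ratio threshold is illustrative, not a tuned value, and real pipelines usually add further checks (topicality, dedup, annotator agreement).

import json

def keep_pair(record, max_len_ratio=3.0):
    """Heuristic filter for preference pairs, following the criteria above."""
    chosen, rejected = record["chosen"], record["rejected"]

    # Drop pairs with no real difference between the two responses.
    if chosen.strip() == rejected.strip():
        return False

    # Drop pairs where one response is drastically longer than the other,
    # a common proxy for length bias in the preference signal.
    len_ratio = max(len(chosen), 1) / max(len(rejected), 1)
    if len_ratio > max_len_ratio or len_ratio < 1 / max_len_ratio:
        return False

    return True

def filter_preferences(in_path, out_path):
    """Write only the pairs that pass the heuristics to a new JSONL file."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            if keep_pair(record):
                fout.write(json.dumps(record) + "\n")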

Evaluation

Win Rate Evaluation

def evaluate_win_rate(test_prompts, model_before, model_after):
    """Evaluate improvement after DPO"""
    
    wins = 0
    total = len(test_prompts)
    
    for prompt in test_prompts:
        response_before = model_before.generate(prompt)
        response_after = model_after.generate(prompt)
        
        # 'judge' is a placeholder for a human rater or an LLM judge
        # (e.g. adapted from generate_preference_with_gpt4 above)
        winner = judge(prompt, response_before, response_after)
        
        if winner == "after":
            wins += 1
    
    win_rate = wins / total
    return win_rate

# Expect 60-80% win rate for successful DPO

Benchmark Evaluation

Standard benchmarks:

  • MT-Bench: Multi-turn conversation quality
  • AlpacaEval: Instruction following
  • TruthfulQA: Factual accuracy
  • SafetyBench: Harmful content avoidance

Common Issues

Training Instability

Symptoms: Loss spikes, NaN values

Solutions:

# Lower learning rate
learning_rate: 1.0e-7

# Increase KL penalty
pref_beta: 0.2

# Enable gradient clipping
max_grad_norm: 0.5

Model Degradation

Symptoms: Model becomes less helpful, repetitive

Solutions:

  • Reduce number of epochs (try 0.5 epochs)
  • Use higher beta (more regularization)
  • Check preference data quality

Reward Hacking

Symptoms: Model learns to game the preference signal

Solutions:

  • Diverse preference data
  • Regular evaluation on held-out set
  • Mix preferences from multiple sources

Online DPO Loop

Continuous Improvement Workflow

graph TB
    A[Deploy Model v1.0] --> B[Collect User Interactions]
    B --> C[Sample for Annotation]
    C --> D[Human Preference Labels]
    D --> E[DPO Training]
    E --> F[Model v1.1]
    F --> G{Quality Gate}
    G -->|Pass| H[Deploy v1.1]
    G -->|Fail| I[Debug & Retry]
    H --> B
    I --> E

Weekly Iteration Cycle:

  1. Monday: Collect previous week’s data
  2. Tuesday: Sample and annotate preferences
  3. Wednesday: Run DPO training
  4. Thursday: Evaluate new model
  5. Friday: Deploy if passing quality checks

Advanced Techniques

Multi-Objective DPO

Optimize for multiple objectives simultaneously:

# Data format with multiple preference dimensions
{
  "conversations": [...],
  "chosen": "...",
  "rejected": "...",
  "dimensions": {
    "helpfulness": {"chosen": 5, "rejected": 3},
    "safety": {"chosen": 5, "rejected": 4},
    "accuracy": {"chosen": 4, "rejected": 2}
  }
}
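
The standard DPO loss only sees a single chosen/rejected pair, so dimension scores like these have to be folded into how pairs are constructed. One simple option is sketched below: aggregate the scores with weights (the weights here are illustrative assumptions) and swap the pair if the weighted score disagrees with the original label.

# Illustrative weights; tune per project.
WEIGHTS = {"helpfulness": 0.5, "safety": 0.3, "accuracy": 0.2}

def aggregate_score(dimension_scores, side):
    """Weighted sum of per-dimension scores for 'chosen' or 'rejected'."""
    return sum(WEIGHTS[dim] * scores[side]
               for dim, scores in dimension_scores.items())

def reorder_pair(record):
    """Swap chosen/rejected if the weighted score disagrees with the label."""
    dims = record["dimensions"]
    if aggregate_score(dims, "rejected") > aggregate_score(dims, "chosen"):
        record["chosen"], record["rejected"] = record["rejected"], record["chosen"]
    return record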

Iterative DPO

Apply DPO multiple times with different preference data:

# Round 1: General helpfulness
dataset: preferences_helpfulness

# Round 2: Safety
dataset: preferences_safety

# Round 3: Domain-specific
dataset: preferences_domain
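
A hedged sketch of chaining the rounds above, where each round starts from (and is regularized toward) the previous round's model. The output paths are assumptions, and with a LoRA setup like the config above you would first need to merge the adapter into the base model between rounds (e.g. via LLaMA-Factory's export/merge step), which this sketch does not show.

import subprocess

import yaml

ROUNDS = [
    ("preferences_helpfulness", "checkpoints/dpo/round1"),
    ("preferences_safety", "checkpoints/dpo/round2"),
    ("preferences_domain", "checkpoints/dpo/round3"),
]

base_model = "checkpoints/sft/chat-model"

with open("config/dpo_config.yaml") as f:
    base_config = yaml.safe_load(f)

for dataset, output_dir in ROUNDS:
    config = dict(base_config)
    config.update({
        # NOTE: assumes full-parameter checkpoints; with LoRA, merge the
        # adapter into the base model before using it as the next start/reference.
        "model_name_or_path": base_model,
        "ref_model": base_model,
        "dataset": dataset,
        "output_dir": output_dir,
    })
    round_config_path = f"{output_dir}.yaml"
    with open(round_config_path, "w") as f:
        yaml.safe_dump(config, f)

    subprocess.run(["llamafactory-cli", "train", round_config_path], check=True)
    base_model = output_dir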

Resources

Papers

Tools

Datasets