LLMOps Training Pipeline

A complete large language model training and optimization platform supporting continuous improvement through CPT → SFT → DPO workflow.

System Overview
Architecture
Training Workflow Overview
Training Stages
Online Feedback Loop
Quick Start
Best Practices
Troubleshooting
1. Training Issues
  1. OOM (Out of Memory)
  2. Training Not Converging
2. Data Issues
  1. Data Format Error
  2. Dataset Not Registered
Technical Reference
Related Resources
Glossary

System Overview

Key Features

🔄 Complete Training Workflow: CPT → SFT → DPO three-stage training
📊 Data Generation Pipeline: Self-Instruct generation, preference data collection
🔁 Online Feedback Loop: A/B testing → Preference collection → DPO training → Deployment
📈 Experiment Management: Configuration management, training monitoring, result comparison
🎯 Offline/Online Modes: Support for major version development and continuous optimization

Offline Training vs Online Optimization

Feature	Offline Training	Online Optimization
Purpose	Major version development (v1.0 → v2.0)	Continuous iteration (v1.0 → v1.1)
Stages	CPT + SFT	DPO
Frequency	Quarterly/Semi-annually	Weekly/Monthly
Cost	High (full params or large-scale LoRA)	Low (LoRA only)
Data	Large-scale corpus + instruction sets	Real user feedback
Output	Completion + Chat models	Incremental chat model versions

System Highlights

✅ LLaMA-Factory Integration: Built on mature training framework
✅ Web UI: Gradio-based friendly interface
✅ Modular Design: Separation of data, training, and inference
✅ Automated Workflow: Configuration generation, data conversion, result logging
✅ Demo Mode: Experience full workflow without GPU

Architecture

System Architecture

graph TB
    subgraph "Data Layer"
        A1[Domain Corpus] --> A[DomainCorpusProcessor]
        A2[Preloaded Docs] --> A
        A --> A3[domain_corpus.jsonl]
        
        B1[Seed Instructions] --> B[SelfInstructGenerator]
        B --> B2[sft_data.json]
        
        C1[User Feedback] --> C[PreferenceCollector]
        C2[A/B Testing] --> C
        C --> C3[prefs.jsonl]
    end
    
    subgraph "Configuration Layer"
        D[LLaMAFactoryConfig]
        D --> D1[CPT Config]
        D --> D2[SFT Config]
        D --> D3[DPO Config]
    end
    
    subgraph "Training Layer"
        E[LLaMAFactoryTrainer]
        A3 --> E
        B2 --> E
        C3 --> E
        D1 --> E
        D2 --> E
        D3 --> E
        
        E --> E1[CPT Training]
        E --> E2[SFT Training]
        E --> E3[DPO Training]
    end
    
    subgraph "Model Layer"
        E1 --> F1[Completion Model]
        E2 --> F2[Chat Model v1.0]
        E3 --> F3[Chat Model v1.1]
    end
    
    subgraph "Inference Layer"
        G[InferenceModel]
        F1 --> G
        F2 --> G
        F3 --> G
        
        G --> G1[A/B Testing]
        G --> G2[Production Deployment]
    end
    
    subgraph "Feedback Layer"
        G1 --> H[User Rating]
        H --> C
    end

Data Flow

sequenceDiagram
    participant U as User
    participant UI as Web UI
    participant DG as Data Generator
    participant TC as Training Config
    participant TR as Trainer
    participant IF as Inference
    participant PC as Preference Collector
    
    Note over U,PC: Offline Training Flow
    U->>UI: 1. Generate domain corpus
    UI->>DG: Process documents
    DG-->>UI: domain_corpus.jsonl
    
    U->>UI: 2. Generate instruction data
    UI->>DG: Self-Instruct
    DG-->>UI: sft_data.json
    
    U->>UI: 3. Configure CPT/SFT
    UI->>TC: Generate config files
    TC-->>U: YAML config
    
    U->>TR: 4. Execute training
    TR-->>TR: CPT → Completion Model
    TR-->>TR: SFT → Chat Model v1.0
    
    Note over U,PC: Online Optimization Flow
    U->>UI: 5. A/B testing
    UI->>IF: Call both models
    IF-->>UI: Compare results
    U->>UI: Choose better response
    UI->>PC: Record preference
    PC-->>PC: Accumulate data
    
    Note over PC: Weekly/Monthly
    U->>UI: 6. DPO training
    UI->>TC: Generate DPO config
    U->>TR: Execute DPO
    TR-->>TR: Chat Model v1.1
    
    Note over U,PC: Continuous Iteration
    U->>IF: 7. Deploy new version
    IF-->>IF: Update inference service

Directory Structure

src/search_engine/training_tab/
├── llmops_tab.py                    # Main UI entry
├── llmops_engine.py                 # Engine coordinator
├── llmops_manager.py                # Component manager
├── llmops_runner.py                 # Training runner
├── llmops_control.py                # Control logic
├── llmops_models.py                 # Model configuration
├── llama_factory_config.py          # LLaMA-Factory config generation
├── llamafactory_trainer.py          # Training service
├── self_instruct_generator.py       # Self-Instruct data generation
├── domain_corpus_processor.py       # Domain corpus processing
├── preference_collector.py          # Preference data collection
└── inference_model.py               # Inference service

data/llmops/
├── dataset_info.json                # Dataset registry
├── cpt/                            # CPT data
│   └── domain_corpus_*.jsonl
├── sft/                            # SFT data
│   └── sft_data_*.json
└── dpo/                            # DPO data
    └── prefs.jsonl

checkpoints/
├── cpt/                            # CPT model outputs
├── sft/                            # SFT model outputs
└── dpo/                            # DPO model outputs

configs/llmops/
├── cpt_config_*.yaml               # CPT configuration files
├── sft_config_*.yaml               # SFT configuration files
└── dpo_config_*.yaml               # DPO configuration files

Training Workflow Overview

Three-Stage Training Pipeline

The LLMOps system implements a complete training pipeline: CPT → SFT → DPO, each stage serving a specific purpose in model development:

CPT (Continued Pre-Training): Domain knowledge injection
SFT (Supervised Fine-Tuning): Instruction-following capability
DPO (Direct Preference Optimization): Human preference alignment

Training Stage Relationships

graph LR
    A[Base Model] --> B[CPT Training]
    B --> C[Domain-Aware Model]
    C --> D[SFT Training]
    D --> E[Chat Model v1.0]
    E --> F[DPO Training]
    F --> G[Chat Model v1.1]
    
    H[User Feedback] --> F
    I[Preference Data] --> F

Key Points:

CPT adapts base model to specific domain (optional, for domain-specific use cases)
SFT converts model to conversational format (required for chat models)
DPO optimizes model based on human preferences (continuous improvement)

When to Use Each Stage

Stage	When to Use	Input	Output	Frequency
CPT	Domain adaptation needed	Domain corpus	Completion model	Once per domain
SFT	Building chat model	Instruction data	Chat model	Major versions
DPO	Continuous improvement	Preference pairs	Optimized chat model	Weekly/Monthly

Training Stages

CPT - Continued Pre-Training

Overview

What is CPT? Continued Pre-Training (CPT) is the first stage where we inject domain-specific knowledge into a base language model. Unlike fine-tuning which adapts to tasks, CPT adapts the model to a specific domain or corpus.

Why Use CPT?

Base models have general knowledge but lack domain-specific expertise
CPT helps models understand domain terminology, concepts, and patterns
Provides foundation for subsequent SFT training

Training Objective:

Inject domain knowledge into base model
Enhance language understanding in specific domains
Lay domain foundation for subsequent SFT

Use Cases:

Professional domains (medical, legal, finance)
Specific language or dialect adaptation
Code generation and specialized tasks

Output Model:

Completion Model: Can be directly used for text completion, code generation, etc.

Data Requirements

Data Format:

{
    "text": "This is a domain-specific text passage. Can be complete paragraphs, documents, or code snippets."
}

Data Characteristics:

Large-scale: Typically 10M-100M tokens
High quality: Strong domain relevance
Unlabeled: No Q&A pairs needed

Configuration Example

@staticmethod
def create_cpt_config(
    model_name: str = "Qwen/Qwen2-1.5B",
    dataset: str = "domain_corpus",
    output_dir: str = "checkpoints/cpt/domain-cpt",
    num_train_epochs: int = 1,
    learning_rate: float = 1e-5,
    per_device_train_batch_size: int = 2,
    gradient_accumulation_steps: int = 8,
    max_seq_length: int = 2048,
    lora_r: int = 16,
    lora_alpha: int = 32
) -> Dict[str, Any]:
    """CPT configuration generation"""
    return {
        "stage": "pt",  # pretrain
        "model_name_or_path": model_name,
        "dataset": dataset,
        "finetuning_type": "lora",
        "lora_rank": lora_r,
        "lora_alpha": lora_alpha,
        "num_train_epochs": num_train_epochs,
        "learning_rate": learning_rate,
        ...
    }

Training Process

# 1. Generate configuration
llamafactory-cli export configs/llmops/cpt_config_20231115.yaml

# 2. Execute training
llamafactory-cli train configs/llmops/cpt_config_20231115.yaml

# 3. Training outputs
checkpoints/cpt/domain-cpt/
├── adapter_config.json      # LoRA configuration
├── adapter_model.safetensors # LoRA weights
├── trainer_log.jsonl        # Training logs
└── training_args.bin        # Training arguments

SFT - Supervised Fine-Tuning

Overview

What is SFT? Supervised Fine-Tuning (SFT) converts a pre-trained or CPT-trained model into a conversational model that can follow instructions and engage in dialogue.

Why Use SFT?

Base models are completion models, not conversational
SFT teaches models to understand and respond to user instructions
Enables models to participate in multi-turn conversations

Training Objective:

Convert base model to conversational model
Learn instruction-following capabilities
Adapt to specific task formats

Use Cases:

General conversational assistants
Task-oriented dialogue systems
Multi-turn conversation scenarios

Output Model:

Chat Model: Ready for production conversational service

Data Requirements

Data Format:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is machine learning?"
        },
        {
            "role": "assistant",
            "content": "Machine learning is a subfield of artificial intelligence..."
        }
    ]
}

Data Characteristics:

Medium scale: 10K-100K samples
High quality: Human-annotated or GPT-generated
Diversity: Cover multiple task types

Configuration Example

@staticmethod
def create_sft_config(
    model_name: str = "Qwen/Qwen2-1.5B",
    dataset: str = "sft_data",
    output_dir: str = "checkpoints/sft/sft-lora",
    num_train_epochs: int = 3,
    learning_rate: float = 5e-5,
    template: str = "qwen"
) -> Dict[str, Any]:
    """SFT configuration generation"""
    return {
        "stage": "sft",
        "model_name_or_path": model_name,
        "dataset": dataset,
        "template": template,  # Model conversation template
        "finetuning_type": "lora",
        ...
    }

DPO - Direct Preference Optimization

Overview

What is DPO? Direct Preference Optimization (DPO) is a training method that optimizes model outputs based on human preferences, without requiring a separate reward model.

Why Use DPO?

SFT models may generate responses that are technically correct but not preferred by users
DPO directly optimizes for what humans actually want
Enables continuous model improvement through user feedback

Training Objective:

Optimize model output based on human preferences
Improve generation quality and user satisfaction
Enable online feedback loop iteration

Use Cases:

Continuous optimization after model deployment
Correct model biases
Adapt to user preferences

Output Model:

Chat Model v1.1, v1.2…: Incremental optimized versions

Data Requirements

Data Format:

{
    "conversations": [
        {
            "role": "user",
            "content": "Which is better, Python or Java?"
        }
    ],
    "chosen": {
        "role": "assistant",
        "content": "Python and Java each have advantages, depending on specific scenarios..."
    },
    "rejected": {
        "role": "assistant",
        "content": "Python is better."
    }
}

Data Characteristics:

Small scale: 100-1000 samples can be effective
Authenticity: From real user feedback
Comparative: Must have chosen/rejected pairs

Configuration Example

@staticmethod
def create_dpo_config(
    model_name: str = "Qwen/Qwen2-1.5B",
    adapter_path: str = "checkpoints/sft/sft-lora",
    dataset: str = "prefs",
    output_dir: str = "checkpoints/dpo/dpo-lora",
    num_train_epochs: int = 1,
    learning_rate: float = 5e-6,
    pref_beta: float = 0.1
) -> Dict[str, Any]:
    """DPO configuration generation"""
    return {
        "stage": "dpo",
        "model_name_or_path": model_name,
        "adapter_name_or_path": adapter_path,  # Load SFT weights
        "dataset": dataset,
        "pref_beta": pref_beta,  # KL divergence weight
        "pref_loss": "sigmoid",  # Loss function
        ...
    }

Training Stage Comparison

Stage	Data Volume	Training Time	Cost	Frequency	Output
CPT	10M-100M tokens	2-8 hours	High	Quarterly	Completion Model
SFT	10K-100K samples	1-4 hours	Medium	Quarterly	Chat Model v1.0
DPO	100-1K samples	0.5-2 hours	Low	Weekly/Monthly	Chat Model v1.x

Online Feedback Loop

Feedback Loop Diagram

graph LR
    A[Chat Model v1.0] --> B[Production Deployment]
    B --> C[User Interaction]
    C --> D[A/B Testing]
    D --> E[Collect Preferences]
    E --> F{Enough Data?}
    F -->|No| C
    F -->|Yes| G[DPO Training]
    G --> H[Chat Model v1.1]
    H --> I[Evaluation]
    I --> J{Performance Improved?}
    J -->|Yes| B
    J -->|No| K[Adjust Parameters]
    K --> G

A/B Testing Interface

The system provides a user-friendly A/B testing interface:

Input test questions
Generate responses from two models (e.g., v1.0 vs v1.1)
Users vote for the better response
Preferences are automatically recorded for DPO training

Iteration Strategy

Data Collection Plan:

Minimum: 100 samples/week
Target: 300 samples/week
Optimal: 500 samples/week

Monthly Iteration Plan:

Week 1-2: Collect preference data
Week 3: DPO training
Week 4: Deployment and validation

Quick Start

1. Launch System

cd /Users/tyler/courseware-1/projects/Testbed
python start_system.py

Access the Web UI and navigate to Training & Experiments → LLMOps Loop tab.

2. Offline Training Workflow

Step 1: Data Generation

Domain Corpus:
- Click “Load Preloaded Documents”
- Click “Process Corpus”
- Click “Save Corpus”
Instruction Data:
- Set generation count (50-100)
- Click “Generate Instruction Data”
- Click “Save Instruction Data”

Step 2: Configure Training

CPT Configuration:
- Navigate to “Training Configuration” → “CPT”
- Set base model: Qwen/Qwen2-1.5B
- Click “Generate CPT Config”
SFT Configuration:
- Navigate to “Training Configuration” → “SFT”
- Set model path: checkpoints/cpt/domain-cpt
- Click “Generate SFT Config”

Step 3: Execute Training

# CPT training
llamafactory-cli train configs/llmops/cpt_config_*.yaml

# SFT training
llamafactory-cli train configs/llmops/sft_config_*.yaml

3. Online Optimization Workflow (Recommended)

Step 1: Collect Preferences

Navigate to “A/B Testing & Feedback Loop”
Input question, generate comparison responses
Choose the better response
System automatically records preference

Goal: Collect 100-200 real preferences per week

Step 2: DPO Training

Navigate to “Training Configuration” → “DPO”
Set model path (current version)
Click “Generate DPO Config”
Execute training:

llamafactory-cli train configs/llmops/dpo_config_*.yaml

Step 3: Deploy New Version

After training completes, obtain new LoRA weights
Update inference service configuration
Continue collecting feedback

Best Practices

Data Best Practices

CPT Data

✅ High quality: Ensure strong domain relevance
✅ Large scale: At least 10M tokens
✅ Cleaned: Remove noise and duplicates
✅ Chunked: Reasonable chunk_size (512-2048)

SFT Data

✅ Diversity: Cover multiple task types
✅ High quality: Human-annotated or GPT-generated
✅ Standard format: Strictly follow ShareGPT format
✅ Moderate scale: 10K-100K samples

DPO Data

✅ Authenticity: From real user feedback
✅ Comparative: Clear good/bad contrast
✅ Continuous collection: 100-200 samples/week
✅ Quality control: Regularly review preference data

Training Best Practices

Hyperparameter Tuning

# CPT parameter recommendations
CPT_PARAMS = {
    "learning_rate": 1e-5,  # Lower learning rate
    "num_epochs": 1,        # 1-3 epochs
    "lora_rank": 16,        # Larger rank
    "batch_size": 2-4,      # Adjust based on GPU memory
}

# SFT parameter recommendations
SFT_PARAMS = {
    "learning_rate": 5e-5,  # Medium learning rate
    "num_epochs": 3,        # 3-5 epochs
    "lora_rank": 8,         # Medium rank
    "batch_size": 4-8,      # Adjust based on GPU memory
}

# DPO parameter recommendations
DPO_PARAMS = {
    "learning_rate": 5e-6,  # Very low learning rate
    "num_epochs": 1,        # 1 epoch
    "lora_rank": 8,         # Match SFT
    "pref_beta": 0.1,       # KL divergence weight
}

Training Monitoring

Monitor key metrics:

Loss: Training loss (should continuously decrease)
Learning Rate: Learning rate changes
Gradient Norm: Gradient norm (avoid gradient explosion)
Samples per Second: Training speed

Deployment Best Practices

Gradual Rollout

GRADUAL_ROLLOUT = {
    "stage_1": {"traffic": 0.05, "duration": "1 day"},   # 5% traffic
    "stage_2": {"traffic": 0.20, "duration": "2 days"},  # 20% traffic
    "stage_3": {"traffic": 0.50, "duration": "3 days"},  # 50% traffic
    "stage_4": {"traffic": 1.00, "duration": "stable"}   # 100% traffic
}

Automatic Rollback Conditions

ROLLBACK_CONDITIONS = {
    "error_rate": 0.05,      # Error rate exceeds 5%
    "latency_p99": 2000,     # P99 latency exceeds 2s
    "user_feedback": -0.2    # User feedback drops 20%
}

Troubleshooting

Training Issues

OOM (Out of Memory)

Problem: RuntimeError: CUDA out of memory

Solutions:

# Reduce batch size
per_device_train_batch_size: 1

# Increase gradient accumulation
gradient_accumulation_steps: 16

# Reduce sequence length
cutoff_len: 512

# Use gradient checkpointing
gradient_checkpointing: true

# Use 8-bit quantization
quantization_bit: 8

Training Not Converging

Problem: Loss not decreasing or oscillating

Solutions:

# Lower learning rate
learning_rate: 1.0e-6

# Increase warmup
warmup_steps: 500

# Use gradient clipping
max_grad_norm: 1.0

# Check data quality
# - Ensure correct data format
# - Remove anomalous samples

Data Issues

Data Format Error

Problem: KeyError: 'conversations'

Solution:

# Validate data format
def validate_sft_data(data_file):
    with open(data_file, 'r') as f:
        data = json.load(f)
    
    for i, item in enumerate(data):
        # Check required fields
        assert "conversations" in item, f"Item {i} missing conversations"
        
        # Check conversation format
        for conv in item["conversations"]:
            assert "from" in conv, f"Item {i} conversation missing from"
            assert "value" in conv, f"Item {i} conversation missing value"
            assert conv["from"] in ["human", "gpt"], f"Invalid from field value"

Dataset Not Registered

Problem: Dataset 'xxx' not found in dataset_info.json

Solution:

{
    "your_dataset_name": {
        "file_name": "path/to/your_data.jsonl",
        "columns": {
            "prompt": "text"
        },
        "formatting": "sharegpt",
        "ranking": true
    }
}

Technical Reference

Key Technologies

Component	Technology	Description
Training Framework	LLaMA-Factory	Unified training framework for LLMs
Parameter-Efficient Fine-Tuning	LoRA	Low-rank adaptation for efficient training
Base Models	Qwen2-1.5B	Alibaba’s Qwen2 series models
Preference Alignment	DPO	RLHF-free preference optimization
Data Generation	Self-Instruct	Automated instruction data generation

Configuration Templates

Complete configuration templates are available in configs/llmops/:

cpt_config_template.yaml - CPT training template
sft_config_template.yaml - SFT training template
dpo_config_template.yaml - DPO training template

API Reference

# Training API
from llamafactory_trainer import get_trainer
trainer = get_trainer()
trainer.start_training(config)

# Data Generation API
from self_instruct_generator import SelfInstructGenerator
generator = SelfInstructGenerator()
data = generator.generate_instructions(num_instructions=100)

# Preference Collection API
from preference_collector import PreferenceCollector
collector = PreferenceCollector()
collector.collect_preference(question, response_a, response_b, choice)

# Inference API
from inference_model import InferenceModel
model = InferenceModel()
model.load_model("chat-v1", "Qwen/Qwen2-1.5B", "checkpoints/sft/sft-lora")
response = model.generate("chat-v1", "Hello", max_length=512)

Glossary

Term	Full Name	Description
CPT	Continued Pre-Training	Continued pre-training on domain data
SFT	Supervised Fine-Tuning	Instruction-following training
DPO	Direct Preference Optimization	RLHF-free preference alignment
LoRA	Low-Rank Adaptation	Parameter-efficient fine-tuning method
RLHF	Reinforcement Learning from Human Feedback	Traditional preference learning approach

LLMOps Training Pipeline

Table of contents

System Overview

Key Features

Offline Training vs Online Optimization

System Highlights

Architecture

System Architecture

Data Flow

Directory Structure

Training Workflow Overview

Three-Stage Training Pipeline

Training Stage Relationships

When to Use Each Stage

Training Stages

CPT - Continued Pre-Training

Overview

Data Requirements

Configuration Example

Training Process

SFT - Supervised Fine-Tuning

Overview

Data Requirements

Configuration Example

DPO - Direct Preference Optimization

Overview

Data Requirements

Configuration Example

Training Stage Comparison

Online Feedback Loop

Feedback Loop Diagram

A/B Testing Interface

Iteration Strategy

Quick Start

1. Launch System

2. Offline Training Workflow

3. Online Optimization Workflow (Recommended)

Best Practices

Data Best Practices

CPT Data

SFT Data

DPO Data

Training Best Practices

Hyperparameter Tuning

Training Monitoring

Deployment Best Practices

Gradual Rollout

Automatic Rollback Conditions

Troubleshooting

Training Issues

OOM (Out of Memory)

Training Not Converging

Data Issues

Data Format Error

Dataset Not Registered

Technical Reference

Key Technologies

Configuration Templates

API Reference

Related Resources

Glossary

Table of contents