Context-Enhanced RAG
Retrieval-Augmented Generation for fact-based answers grounded in local documents.
Table of contents
- Overview
- How It Works
- Usage Guide
- Best Practices
- Advantages
- Limitations
- Troubleshooting
- Performance Optimization
- Advanced Topics
- Next Steps
- Related Resources
Overview
What is Context-Enhanced Mode?
Context-Enhanced mode (RAG - Retrieval-Augmented Generation) combines document retrieval with LLM generation to provide fact-based answers grounded in your local knowledge base.
Workflow
```mermaid
graph LR
    A[User Query] --> B[Retrieve Top-K Documents]
    B --> C[Build Context]
    C --> D[Generate Prompt]
    D --> E[LLM Inference]
    E --> F[Grounded Answer]
```
When to Use
✅ Questions about local documents
✅ Fact-based answers required
✅ Source attribution needed
✅ Domain-specific knowledge
❌ General knowledge (use Direct Chat)
❌ Creative tasks (use Direct Chat)
❌ Multi-document comparison (use Multi-step Reasoning)
How It Works
Core Concept
RAG works by:
- Retrieving relevant documents from your knowledge base
- Enhancing the LLM prompt with retrieved context
- Generating answers grounded in the provided context
This approach ensures answers are based on actual documents rather than the LLM’s potentially outdated or incorrect pre-trained knowledge.
Three-Stage Pipeline
Stage 1: Document Retrieval
```python
def retrieve_documents(query: str, top_k: int = 5):
    """Retrieve relevant documents using TF-IDF"""
    results = index_service.search(query, top_k)
    return results
```
Stage 2: Context Enhancement
```python
def enhance_with_context(query: str, documents: List[Document]):
    """Build enhanced prompt with context"""
    context = "\n\n".join([
        f"Document {i+1}: {doc.content}"
        for i, doc in enumerate(documents)
    ])

    prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
    return prompt
```
Stage 3: LLM Generation
```python
def generate_answer(prompt: str, model: str):
    """Generate answer using Ollama"""
    response = ollama_client.generate(
        model=model,
        prompt=prompt
    )
    return response
```
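Chaining the three stages gives the complete query path. The sketch below simply reuses the functions above; the default model name is only an example, and a `rag_query` helper along these lines is what the parallel-processing example later in this page assumes.

```python
def rag_query(query: str, model: str = "llama3.2", top_k: int = 5):
    """Run the full pipeline: retrieve, build context, generate."""
    documents = retrieve_documents(query, top_k)      # Stage 1
    prompt = enhance_with_context(query, documents)   # Stage 2
    answer = generate_answer(prompt, model)           # Stage 3
    return answer, documents
```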
Usage Guide
Basic Usage
1. Navigate to the “🤖 RAG Retrieval Enhancement” tab
2. Enable “Enable Context Engineering” ✅
3. Disable “Enable Multi-step Reasoning”
4. Set Top-K (default: 5)
5. Enter your question about local documents
6. Click “🚀 Execute Query”
Understanding Results
Output Components:
- Generated Answer: LLM’s response based on retrieved context
- Retrieved Documents: Table showing doc IDs and relevance scores
- Context Used: The actual context sent to LLM
- Prompt: Complete prompt template with context
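The exact shape of the result object depends on the implementation; purely as an illustration, the four components map naturally onto a small structure like this (names are hypothetical):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RAGResult:
    answer: str            # Generated Answer returned by the LLM
    documents: List[dict]  # Retrieved Documents: doc IDs and relevance scores
    context: str           # Context Used: the text sent to the LLM
    prompt: str            # Prompt: full template with context filled in
```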
Example
Query: “What is TF-IDF?”
Retrieved Documents:
- Doc #42: “TF-IDF stands for Term Frequency-Inverse Document Frequency…”
- Doc #87: “TF-IDF is a statistical measure used to evaluate…”
Generated Answer:
Based on the provided documents, TF-IDF (Term Frequency-Inverse Document
Frequency) is a statistical measure used in information retrieval to
evaluate how important a word is to a document in a collection. It combines
two metrics: term frequency (how often a word appears in a document) and
inverse document frequency (how rare the word is across all documents).
Best Practices
Top-K Selection
```python
TOP_K_GUIDELINES = {
    "simple_questions": 3,    # "What is X?"
    "standard_queries": 5,    # Most queries (default)
    "complex_questions": 10,  # Requires multiple sources
    "exploratory": 15         # Broad research
}
```
Context Length Management
```python
# Avoid context overflow
MAX_CONTEXT_LENGTH = {
    "llama3.2": 4096,      # tokens
    "llama3.1:8b": 8192,
    "qwen2.5": 8192
}

# Truncate if necessary
def truncate_context(context, max_tokens=2000):
    """Keep context within model limits"""
    # Rough estimation: 1 token ≈ 4 characters
    max_chars = max_tokens * 4
    if len(context) > max_chars:
        return context[:max_chars] + "\n...(truncated)"
    return context
```
Prompt Template Design
```python
PROMPT_TEMPLATES = {
    "basic": """Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:""",

    "with_constraints": """Based on the following context, answer the question.
If the context doesn't contain the answer, say "I cannot answer based on
the provided context."
Context:
{context}
Question: {query}
Answer in Chinese:""",

    "with_citation": """Based on the following context, answer the question
and cite the document numbers you used.
Context:
{context}
Question: {query}
Answer (cite sources):"""
}
```
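A template is just a format string, so using one is a matter of filling in the retrieved context and the question before generation. The snippet below reuses `generate_answer` from Stage 3 and assumes `context` was built as in Stage 2.

```python
# Fill a template with the retrieved context and the user's question
prompt = PROMPT_TEMPLATES["with_citation"].format(
    context=context,
    query="What is TF-IDF?"
)
answer = generate_answer(prompt, model="llama3.2")
```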
Advantages
✅ Factual Accuracy: Grounded in actual documents
✅ Source Attribution: Can trace answer to specific docs
✅ Reduced Hallucination: LLM constrained by context
✅ Domain Expertise: Works with specialized knowledge
✅ Privacy: All data stays local
Limitations
⚠️ Retrieval Quality: Depends on TF-IDF effectiveness
⚠️ Context Window: Limited to model’s max tokens
⚠️ Single-hop: Cannot synthesize across multiple searches
⚠️ Latency: Slower than direct chat (~10s vs ~5s)
Troubleshooting
Poor Retrieval Quality
Problem: Retrieved documents not relevant
Solutions:
```python
# 1. Increase Top-K
top_k = 10  # Retrieve more documents

# 2. Rephrase query
query = "TF-IDF algorithm"  # Instead of "how does search work"

# 3. Check index quality
# Verify documents are properly indexed
```
Answer Not in Context
Problem: LLM says “I cannot answer”
Solutions:
- Verify documents contain relevant information
- Try different query phrasing
- Increase Top-K value
- Check if documents are properly indexed
Context Too Long
Problem: Exceeds model token limit
Solutions:
```python
# Reduce Top-K
top_k = 3  # Instead of 10

# Truncate documents
def truncate_docs(docs, max_per_doc=500):
    return [doc[:max_per_doc] for doc in docs]

# Use model with larger context
model = "llama3.1:8b"  # 8K context instead of 4K
```
Performance Optimization
Latency Breakdown
```python
LATENCY_COMPONENTS = {
    "retrieval": 2.0,         # TF-IDF search
    "context_building": 0.5,  # String concatenation
    "llm_inference": 7.0,     # Main bottleneck
    "total": 9.5              # seconds
}
```
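These figures are representative rather than guaranteed; to see where time goes on your own hardware, each stage can be timed separately. The sketch below reuses the Stage 1–3 functions (the model name is only an example).

```python
import time

def timed_rag_query(query: str, model: str = "llama3.2"):
    """Time each stage of the pipeline to find the bottleneck."""
    timings = {}

    start = time.perf_counter()
    documents = retrieve_documents(query)
    timings["retrieval"] = time.perf_counter() - start

    start = time.perf_counter()
    prompt = enhance_with_context(query, documents)
    timings["context_building"] = time.perf_counter() - start

    start = time.perf_counter()
    answer = generate_answer(prompt, model)
    timings["llm_inference"] = time.perf_counter() - start

    timings["total"] = sum(timings.values())
    return answer, timings
```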
Optimization Strategies
1. Cache Retrieved Documents:
```python
from functools import lru_cache

@lru_cache(maxsize=100)
def retrieve_documents_cached(query, top_k):
    """Cache retrieval results for identical queries"""
    return index_service.search(query, top_k)
```
2. Parallel Processing:
```python
import concurrent.futures

def parallel_rag(queries):
    """Process multiple queries in parallel"""
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = list(executor.map(rag_query, queries))
    return results
```
3. Optimize Retrieval (see the sketch after this list):
- Build optimized inverted index
- Use approximate nearest neighbors for large datasets
- Pre-compute document embeddings
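As a concrete example of pre-computation, the TF-IDF document matrix can be built once at startup and reused for every query. The sketch below uses scikit-learn purely for illustration; the built-in index may be implemented differently, and `doc_texts` stands in for your list of document strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit once at startup over all documents (doc_texts: list of strings)
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(doc_texts)

def fast_search(query: str, top_k: int = 5):
    """Score every document against the query using the precomputed matrix."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    top_idx = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top_idx]
```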
Advanced Topics
Hybrid Retrieval
```python
def hybrid_retrieval(query, top_k=5):
    """Combine TF-IDF with semantic search"""
    # TF-IDF results
    tfidf_results = index_service.search(query, top_k)

    # Semantic results (if available)
    semantic_results = semantic_search(query, top_k)

    # Merge and rerank
    combined = merge_results(tfidf_results, semantic_results)
    return combined[:top_k]
```
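`semantic_search` and `merge_results` are assumed helpers, not part of the snippets above. One simple way to implement the merge step is reciprocal rank fusion (RRF); a minimal sketch, assuming each result object carries a `doc_id` attribute:

```python
def merge_results(tfidf_results, semantic_results, k=60):
    """Merge two ranked result lists with reciprocal rank fusion (RRF)."""
    scores, docs = {}, {}
    for results in (tfidf_results, semantic_results):
        for rank, doc in enumerate(results):
            scores[doc.doc_id] = scores.get(doc.doc_id, 0.0) + 1.0 / (k + rank + 1)
            docs[doc.doc_id] = doc
    # Highest fused score first
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked]
```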
Contextual Compression
```python
def compress_context(context, query):
    """Extract only relevant sentences from context"""
    sentences = split_into_sentences(context)

    # Score sentences by relevance to query
    scored = [
        (sentence, relevance_score(sentence, query))
        for sentence in sentences
    ]

    # Keep top 50% most relevant
    sorted_sentences = sorted(scored, key=lambda x: x[1], reverse=True)
    compressed = [s[0] for s in sorted_sentences[:len(sentences) // 2]]
    return " ".join(compressed)
```
Next Steps
- For complex multi-step tasks → Multi-step Reasoning
- For setup instructions → Ollama Integration
- For simple Q&A → Direct LLM Chat