Context Engineering

A context engineering system built on the Model Context Protocol (MCP) architecture that combines symbolic expert systems with connectionist LLMs. It supports dynamic context orchestration, intelligent template selection, and tool invocation through a complete Think-Act-Observe cycle.

System Overview
Architecture
Three Operation Modes
Core Components
Ollama Integration
Usage Guide
Technical Implementation Details
1. RAG Processing Pipeline
2. ReAct Reasoning Flow
Performance & Reliability
1. Performance Optimization
2. Reliability Guarantees
Extensibility Design
1. Model Extension
2. Feature Extension
Security & Compliance
1. Data Security
2. Content Safety
Best Practices
Troubleshooting
Related Resources

System Overview

Core Philosophy

Context Engineering treats the LLM’s context window as a programmable logical space, combining symbolic expert systems with connectionist LLMs through the Model Context Protocol (MCP). This enables dynamic context orchestration, intelligent template selection, and tool invocation through a complete Think-Act-Observe (TAO) cycle.

Background & Goals

Business Context: Large language models possess rich general knowledge, but they are limited when a question depends on domain-specific or recent information. Context engineering provides a systematic way to assemble and manage context dynamically, enabling accurate, trustworthy Q&A over local knowledge bases while keeping data private and secure.

Design Goals:

Functionality: Implement MCP-based context engineering with RAG capabilities
Experience: Multiple interaction modes from simple dialogue to complex reasoning
Trustworthiness: Fact-based answers avoiding hallucination issues
Privacy: Local deployment with data never leaving premises
Modularity: Symbolic expert system layer managing context, LLM layer focused on reasoning

Technical Principles:

MCP Protocol: Standardized Model Context Protocol for component interaction
Symbolic + Connectionist: Expert system manages context, LLM handles reasoning
Open Source First: Choose open-source frameworks like Ollama to avoid vendor lock-in
Local Deployment: All data processing completed locally for privacy protection
Progressive Enhancement: Capability ladder from simple dialogue to complex reasoning

Key Features

🧠 MCP Architecture: Model Context Protocol for standardized component interaction
🔄 Four-Stage Cycle: Template selection → Context assembly → LLM inference → Context update
🎯 Dynamic Context Orchestration: Intelligent template selection and placeholder resolution
📚 RAG Capabilities: Local knowledge base retrieval with context enhancement
🔧 Tool Integration: MCP Tools for external capabilities (SEARCH, FINISH, etc.)
💾 Context Memory: MCP Resources for conversation history management
🔌 Ollama Integration: Support for multiple open-source LLMs (Llama, Qwen, Mistral)
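
The four-stage cycle listed above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the project's implementation: `TEMPLATES`, `select_template`, `run_cycle` and the stub `fake_llm` are hypothetical names, and the template-selection rule is deliberately trivial.

```python
from typing import Callable, Dict, List

# Hypothetical templates; placeholders are resolved with str.format
TEMPLATES: Dict[str, str] = {
    "direct": "Question: {query}\nAnswer:",
    "context": "History:\n{history}\n\nQuestion: {query}\nAnswer:",
}

def select_template(history: List[str]) -> str:
    """Stage 1: pick a template (here: use the history only if there is any)."""
    return "context" if history else "direct"

def run_cycle(query: str, history: List[str], llm: Callable[[str], str]) -> str:
    name = select_template(history)                  # 1. template selection
    prompt = TEMPLATES[name].format(                 # 2. context assembly
        query=query, history="\n".join(history)
    )
    answer = llm(prompt)                             # 3. LLM inference
    history.append(f"Q: {query}\nA: {answer}")       # 4. context update
    return answer

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"(answer based on {len(prompt)} prompt chars)"

history: List[str] = []
first = run_cycle("What is MCP?", history, fake_llm)
second = run_cycle("And RAG?", history, fake_llm)
```

The second call sees the first exchange in its prompt because stage 4 appended it to `history`.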

Architecture

System Architecture Diagram

graph TB
    subgraph "Application Layer"
        A[User Query] --> A1[Query Parsing]
        A1 --> A2{Mode Router}
        A2 -->|Mode 1| B[Direct Chat]
        A2 -->|Mode 2| C[Context Enhanced]  
        A2 -->|Mode 3| D[Multi-step Reasoning]
    end
    
    subgraph "Service Layer"
        C --> E[RAGService]
        D --> E
        E --> E1[Retrieval Orchestration]
        E --> E2[Context Management]
        E --> E3[Reasoning Control]
        
        B --> F[LLMService]
        E2 --> F
        E3 --> F
    end
    
    subgraph "Algorithm Layer"
        E1 --> G[Document Retrieval]
        G --> G1[TF-IDF Recall]
        G1 --> G2[Relevance Ranking]
        
        E2 --> H[Context Engineering]
        H --> H1[Document Assembly]
        H1 --> H2[Prompt Generation]
        
        E3 --> I[ReAct Reasoning]
        I --> I1[Thought Generation]
        I1 --> I2[Action Parsing]
        I2 --> I3[Observation Feedback]
    end
    
    subgraph "Storage Layer"
        G --> J[Local Doc Store]
        J --> J1[Document Index]
        J --> J2[Original Content]
        
        F --> K[Ollama Service]
        K --> K1[Model Inference]
        K --> K2[Streaming Generation]
    end
    
    subgraph "Agent Toolset"
        I2 --> L[SEARCH Tool]
        I2 --> M[FINISH Tool]
        L --> G
    end

Architecture Highlights

Mode Decoupling: Three operation modes independently designed for separate optimization
Service Layering: Separation of RAG and LLM service responsibilities for maintainability
Local First: Core services deployed locally with minimal external dependencies
Tool-based Design: ReAct framework supports tool registration and extension

Directory Structure

src/search_engine/
├── rag_tab/                        # RAG functionality module
│   ├── __init__.py                # Module initialization
│   ├── rag_service.py             # RAG service core implementation ⭐
│   └── rag_tab.py                 # UI and interaction logic
├── index_tab/                      # Retrieval module (RAG dependency)
│   ├── index_service.py           # Index service
│   └── offline_index.py           # TF-IDF retrieval implementation
├── data_service.py                 # Data service ⭐
├── portal.py                       # Main entry and UI orchestration
└── service_manager.py              # Service manager

data/                               # Data storage directory
├── preloaded_documents.json       # Preloaded knowledge base documents
└── openkg_triples.tsv             # Knowledge graph data

logs/                               # Log directory
└── rag_query_logs.json            # RAG query logs

Three Operation Modes

Mode 1: Direct LLM Chat

Use Cases: General knowledge Q&A, creative generation

Technical Implementation: Direct Ollama API call without retrieval

Advantages:

Fast response time
Support for open-domain Q&A
No dependency on local knowledge base

Workflow:

The user query is sent directly to the LLM
The LLM generates a response from its pre-trained knowledge
The answer is returned to the user
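
As a rough sketch of this mode, the snippet below builds a non-streaming request for Ollama's `/api/generate` endpoint and reads the `response` field of the reply. It uses only the standard library to stay self-contained (the project code itself uses `requests`); the helper names `build_generate_request` and `direct_chat` are assumptions made for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address used elsewhere in this document

def build_generate_request(prompt: str, model: str) -> urllib.request.Request:
    """Prepare a POST to /api/generate with streaming disabled."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def direct_chat(prompt: str, model: str = "llama3.1:8b", timeout: float = 30.0) -> str:
    """Send the query straight to the model; needs a running Ollama server."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `direct_chat("Explain TF-IDF in one sentence")` returns the generated text when the server is reachable and the model has been pulled.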

Mode 2: Context-Enhanced (RAG)

Use Cases: Professional Q&A based on local documents

Technical Implementation: Retrieval + Context assembly + Prompt engineering

Workflow:

Use IndexService to retrieve Top-K relevant documents
Concatenate document content into context string
Generate structured prompt using template
Call LLM to generate context-based answer

Context Prompt Template:

Based on the following context information, answer the user's question. If there is no relevant information in the context, please state that you cannot answer based on the provided information.

Context Information:
{context}

User Question: {query}

Please answer in Chinese:

Advantages:

Fact-based answers grounded in documents
Reduced hallucination
Traceable sources

Mode 3: Multi-step Reasoning (ReAct)

Use Cases: Complex query decomposition, multi-document information synthesis

Technical Implementation: Think-Act-Observe loop

Agent Actions:

SEARCH(query): Retrieve relevant documents
FINISH(answer): Output final answer

ReAct Prompt Template:

You are an AI assistant capable of using tools. You can solve problems through the following steps:

1. Thought: Analyze what information the problem needs
2. Action: Use available tools to search for information
3. Observation: View the results returned by the tools
4. Repeat above steps until you have enough information
5. Final Answer: Provide an answer based on collected information

Available Tools:
- SEARCH: Search relevant documents, usage: SEARCH["search term"]

Question: {query}

Let's begin:

Thought:

Reasoning Flow:

Initialize reasoning state, set maximum iterations
Thought Phase: LLM generates thinking content
Action Phase: Parse Action type and parameters
Observation Phase: Execute tool call, get results
Loop Check: Check if FINISH state is reached
Result Return: Output reasoning trace and final answer
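
The Action Phase depends on extracting a tool name and argument from free-form model output. One possible parser is sketched below. It assumes the bracket syntax from the prompt template (`SEARCH["search term"]`) and assumes `FINISH` is written the same way; the regular expression and the `parse_action` helper are illustrative, not taken from the project.

```python
import re
from typing import Optional, Tuple

# Matches e.g.  Action: SEARCH["vector index"]  or  FINISH[the answer]
ACTION_RE = re.compile(r'\b(SEARCH|FINISH)\s*\[\s*"?(.*?)"?\s*\]', re.DOTALL)

def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, argument), or None if no action is present."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()
```

The pattern stops at the first closing bracket, so an argument that itself contains `]` would be cut short. A stricter grammar or a JSON action format would avoid that at the cost of a more demanding prompt.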

Core Components

RAGService Class

# File: src/search_engine/rag_tab/rag_service.py
from typing import Optional

class RAGService:
    """RAG Service Engine - Three-stage processing pipeline"""
    
    def __init__(self, index_service, ollama_url: str = "http://localhost:11434"):
        # Dependency injection - reuse existing retrieval capabilities
        self.index_service = index_service
        self.ollama_url = ollama_url
        self.default_model = "llama3.1:8b"
        
        # Prompt templates - structured prompt design
        self.context_template = self._load_prompt_templates()
    
    def rag_query(self, query: str, model: Optional[str] = None, top_k: int = 5) -> RAGResponse:
        """RAG query main workflow"""
        # Stage 1: Retrieve relevant documents
        relevant_docs = self.retrieve_documents(query, top_k)
        
        # Stage 2: Context enhancement
        enhanced_prompt = self.enhance_with_context(query, relevant_docs)
        
        # Stage 3: LLM generation (fall back to the default model if none is given)
        response = self.generate_answer(enhanced_prompt, model or self.default_model)
        
        return RAGResponse(
            answer=response,
            retrieved_docs=relevant_docs,
            prompt_used=enhanced_prompt
        )

Design Highlights:

Service Reuse: Built on existing IndexService to avoid duplication
Template Management: Prompt templates managed centrally for dynamic adjustment
Observability: Intermediate results preserved at each stage for debugging
Error Handling: Independent error handling at each stage for robustness
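
The `Document` and `RAGResponse` types used by `RAGService` are not shown in this section. A minimal sketch of what they might look like follows; the field names come from the calls above, while the field types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Document:
    id: str                   # assumed to be a string identifier
    content: str
    relevance_score: float    # score returned by the index

@dataclass
class RAGResponse:
    answer: str
    retrieved_docs: List[Document] = field(default_factory=list)
    prompt_used: str = ""     # kept so the exact prompt can be inspected
```

Keeping `retrieved_docs` and `prompt_used` on the response is what makes the intermediate stages observable after the call returns.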

Document Retrieval

def retrieve_documents(self, query: str, top_k: int = 5) -> List[Document]:
    """Retrieval stage - reuse inverted index"""
    # Call existing IndexService to avoid reimplementation
    raw_results = self.index_service.search(query, top_k)
    
    # Structure results
    documents = []
    for doc_id, score, content in raw_results:
        documents.append(Document(
            id=doc_id,
            content=content,
            relevance_score=score
        ))
    
    return documents

Context Enhancement

def enhance_with_context(self, query: str, documents: List[Document]) -> str:
    """Enhancement stage - context engineering"""
    # Concatenate document content
    context = "\n\n".join([
        f"Document {i+1}: {doc.content}" 
        for i, doc in enumerate(documents)
    ])
    
    # Apply prompt template
    enhanced_prompt = self.context_template.format(
        context=context,
        query=query
    )
    
    return enhanced_prompt

Ollama Integration

Connection Management

Health Check:

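import requests
from typing import Tuple

def check_ollama_connection(self) -> Tuple[bool, str]:
    """Probe Ollama's /api/tags endpoint and report the available models."""
    try:
        response = requests.get(f"{self.ollama_url}/api/tags", timeout=5)
    except requests.exceptions.RequestException as e:
        return False, f"❌ Connection failed: {e}"
    # Treat any non-200 status as a failure so the method always returns a tuple
    if response.status_code != 200:
        return False, f"❌ Unexpected status code: {response.status_code}"
    models = response.json().get("models", [])
    return True, f"✅ Available models: {[m['name'] for m in models]}"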

Model Management:

Dynamically fetch available model list
Support model switching and configuration
Implement request timeout and retry mechanisms
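
The retry mechanism mentioned above could take the form of a small wrapper with exponential backoff. The sketch below is an assumption about how this might be done, not the project's code; `with_retries` and its parameters are hypothetical. It catches `OSError`, which is the base class of the built-in connection and timeout errors and also of `requests.exceptions.RequestException`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on network-style errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")
```

A call such as `with_retries(lambda: requests.get(url, timeout=5))` would then retry transient connection failures. The `sleep` parameter exists only so the delay can be replaced in tests.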

API Call Optimization

Streaming Generation:

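import json
import requests

def generate_stream(self, prompt: str, model: str):
    """Yield text chunks from Ollama's streaming /api/generate endpoint."""
    response = requests.post(
        f"{self.ollama_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors before reading the stream
    for line in response.iter_lines():
        if line:
            # Each non-empty line is one JSON object holding a text fragment
            yield json.loads(line).get("response", "")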

Ollama Configuration

OLLAMA_CONFIG = {
    "url": "http://localhost:11434",       # Ollama service address
    "default_model": "llama3.2",          # Default model
    "timeout": 30,                        # Request timeout
    "generation_options": {               # Generation parameters
        "temperature": 0.7,               # Randomness control
        "top_p": 0.9,                     # Nucleus sampling
        "top_k": 40,                      # Top-K sampling
        "repeat_penalty": 1.1             # Repetition penalty
    }
}

# Supported model list
SUPPORTED_MODELS = [
    "llama3.2",                           # Lightweight model
    "llama3.1:8b",                        # Medium-scale model
    "qwen2.5",                            # Chinese-optimized model
    "deepseek-coder",                     # Code-specific model
    "mistral"                             # Multilingual model
]
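
To show how these settings might reach the API, the sketch below merges `generation_options` with per-request overrides and places the result under the `options` key of an `/api/generate` payload. The `build_payload` helper is hypothetical; the configuration values are copied from above.

```python
from typing import Any, Dict, Optional

OLLAMA_CONFIG = {
    "url": "http://localhost:11434",
    "default_model": "llama3.2",
    "timeout": 30,
    "generation_options": {
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40,
        "repeat_penalty": 1.1,
    },
}

def build_payload(prompt: str, model: Optional[str] = None, **overrides: Any) -> Dict[str, Any]:
    """Combine the configured defaults with per-request option overrides."""
    # Build a new dict so the shared configuration is never mutated
    options = {**OLLAMA_CONFIG["generation_options"], **overrides}
    return {
        "model": model or OLLAMA_CONFIG["default_model"],
        "prompt": prompt,
        "stream": False,
        "options": options,
    }
```

For example, `build_payload("Summarize this", temperature=0.2)` lowers the temperature for one request while leaving the other defaults in place.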

Usage Guide

Prerequisites

Ollama service running at http://localhost:11434 (configurable in code)
Pull required models in Ollama (e.g., ollama pull llama3.1)

Operation Steps

Navigate to “🤖 Part 3: RAG Retrieval Enhancement” tab
Click “Check Ollama Connection” to verify connectivity and refresh model list
Input your question
Select options:
- Keep “Enable Context Engineering” checked for retrieval-enhanced answers
- Enable “Multi-step Reasoning” if needed for ReAct mode
- Adjust Top-K and select model
Click “🚀 Execute Query”
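
How the two checkboxes map onto the three operation modes is not spelled out here. One plausible reading, offered as an assumption, is that multi-step reasoning takes precedence over plain context enhancement. The `Mode` enum and `route_mode` function are illustrative names.

```python
from enum import Enum

class Mode(Enum):
    DIRECT = 1    # Mode 1: Direct LLM Chat
    CONTEXT = 2   # Mode 2: Context-Enhanced (RAG)
    REACT = 3     # Mode 3: Multi-step Reasoning (ReAct)

def route_mode(enable_context: bool, multi_step: bool) -> Mode:
    """Pick an operation mode from the two UI options."""
    if multi_step:
        return Mode.REACT
    return Mode.CONTEXT if enable_context else Mode.DIRECT
```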

Understanding Output

Generated Answer: Final answer from LLM
Processing Info: Time elapsed, model used, number of documents retrieved
Prompt/Reasoning Trace:
- Direct or context mode displays exact prompt sent to LLM
- Multi-step mode shows complete chain-of-thought trace (Thought/Action/Observation)
Retrieved Documents: Table with document IDs and TF-IDF scores; separate box shows assembled context

Technical Implementation Details

RAG Processing Pipeline

Query Preprocessing: Keyword extraction, query expansion
Relevant Document Retrieval: Call IndexService.search()
Context Construction: Document content concatenation, length truncation
Prompt Generation: Template filling, structured prompt
LLM Generation: Ollama API call, streaming return
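
The context construction step mentions length truncation without giving details. A simple character-budget approach is sketched below as one possibility: documents are added in descending score order until the budget is used up. The `build_context` function and the 2000-character default are assumptions; a production version would more likely count tokens.

```python
from typing import List, Tuple

def build_context(docs: List[Tuple[str, float]], max_chars: int = 2000) -> str:
    """docs are (content, score) pairs; highest scores are kept first."""
    parts: List[str] = []
    used = 0
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    for i, (content, _score) in enumerate(ranked, start=1):
        remaining = max_chars - used
        if remaining <= 0:
            break
        entry = f"Document {i}: {content}"
        if len(entry) > remaining:
            entry = entry[:remaining]  # hard cut; the last document may be partial
        parts.append(entry)
        used += len(entry) + 2         # account for the "\n\n" separator
    return "\n\n".join(parts)
```

Cutting the last document mid-sentence is crude. Dropping it entirely, or truncating at a sentence boundary, are alternatives that trade recall for cleaner context.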

ReAct Reasoning Flow

Initialization: Set reasoning state, maximum iterations
Thought Phase: LLM generates thinking content
Action Phase: Parse Action type and parameters
Observation Phase: Execute tool call, get results
Loop Check: Check if FINISH state is reached
Result Return: Output reasoning trace and final answer
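
Put together, the flow above amounts to a bounded loop. The sketch below runs offline by using a scripted stand-in for the LLM and a fake search tool. `react_loop`, `scripted_llm` and `fake_search` are hypothetical, and the bracket-style action syntax for `FINISH` is an assumption extended from the `SEARCH["..."]` form in the prompt template.

```python
import re
from typing import Callable, List, Tuple

ACTION_RE = re.compile(r'(SEARCH|FINISH)\[\s*"?(.*?)"?\s*\]', re.DOTALL)

def react_loop(
    question: str,
    llm: Callable[[str], str],
    search: Callable[[str], str],
    max_iterations: int = 5,
) -> Tuple[str, List[str]]:
    """Return (final_answer, reasoning_trace)."""
    trace = [f"Question: {question}"]
    for _ in range(max_iterations):
        step = llm("\n".join(trace))           # Thought phase
        trace.append(step)
        match = ACTION_RE.search(step)         # Action phase
        if match is None:
            trace.append("Observation: no valid action found")
            continue
        action, arg = match.group(1), match.group(2).strip()
        if action == "FINISH":                 # Loop check
            return arg, trace
        trace.append(f"Observation: {search(arg)}")  # Observation phase
    return "No answer within the iteration limit.", trace

def scripted_llm(prompt: str) -> str:
    """Stand-in model: search once, then finish."""
    if "Observation:" in prompt:
        return "Thought: I have enough.\nAction: FINISH[TF-IDF weights terms by rarity]"
    return 'Thought: I should look this up.\nAction: SEARCH["TF-IDF"]'

def fake_search(query: str) -> str:
    return f"1 document found for '{query}'"

answer, trace = react_loop("What is TF-IDF?", scripted_llm, fake_search)
```

The iteration cap is what guarantees termination when the model never emits `FINISH`.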

Performance & Reliability

Performance Optimization

Retrieval Optimization: Reuse existing inverted index to avoid redundant computation
Context Management: Intelligent truncation, retain most relevant content
Caching Strategy: Cache identical query results, reduce redundant computation
Concurrency Control: Limit simultaneous requests to avoid Ollama overload

Reliability Guarantees

Error Handling:

Degradation strategy when Ollama service unavailable
Network timeout and retry mechanisms
Fallback solutions for model inference failures
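
A degradation strategy might look like the following sketch: if generation fails with a network-style error, the user still receives the retrieved passages. The function name, the messages, and the 200-character preview length are all assumptions.

```python
from typing import Callable, List

def answer_with_fallback(
    prompt: str,
    docs: List[str],
    generate: Callable[[str], str],
) -> str:
    """Try the LLM; on failure, degrade to showing the retrieved passages."""
    try:
        return generate(prompt)
    except OSError:  # covers connection and timeout errors
        if not docs:
            return "The language model is unavailable and no documents were retrieved."
        preview = "\n".join(f"- {doc[:200]}" for doc in docs)
        return "The language model is unavailable; showing retrieved passages instead:\n" + preview
```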

Monitoring Metrics:

Query response time distribution
Ollama service availability
Retrieval recall and accuracy

Extensibility Design

Model Extension

Multi-model Support: Llama, Mistral, CodeLlama, etc.
Model Routing: Select most suitable model based on query type
Model Fine-tuning: Support for domain-specialized model integration

Feature Extension

Tool Extension: ReAct framework supports new tool registration
Context Enhancement: Support multimodal content like charts, code
Conversation Memory: Implement multi-turn dialogue context management
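
Tool registration could be handled by a small registry keyed on the tool name, as in this sketch. The `ToolRegistry` class and the `WORD_COUNT` tool are hypothetical and serve only to show how a new tool might be added next to `SEARCH`.

```python
from typing import Callable, Dict

class ToolRegistry:
    """Maps tool names (case-insensitive) to callables taking one string."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str):
        def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
            self._tools[name.upper()] = fn
            return fn
        return decorator

    def call(self, name: str, argument: str) -> str:
        tool = self._tools.get(name.upper())
        if tool is None:
            # Returned as text so it can be fed back to the model as an observation
            return f"Unknown tool: {name}"
        return tool(argument)

registry = ToolRegistry()

@registry.register("SEARCH")
def search(query: str) -> str:
    return f"results for {query}"  # placeholder for the real retrieval call

@registry.register("WORD_COUNT")  # hypothetical new tool
def word_count(text: str) -> str:
    return str(len(text.split()))
```

Returning an error string instead of raising lets the reasoning loop continue and gives the model a chance to pick a valid tool on the next step.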

Security & Compliance

Data Security

Local Deployment: All data processing completed locally
Privacy Protection: No user queries uploaded to external services
Access Control: Document permission management and user authentication

Content Safety

Output Filtering: Sensitive content detection and filtering
Prompt Injection Protection: Prevent malicious prompt attacks
Generated Content Monitoring: Output quality and safety monitoring

Best Practices