Context Engineering
Context engineering system based on Model Context Protocol (MCP) architecture, combining symbolic expert systems with connectionist LLMs. Supports dynamic context orchestration, intelligent template selection, and tool invocation through a complete Think-Act-Observe cycle.
Table of contents
- System Overview
- Architecture
- Three Operation Modes
- Core Components
- Ollama Integration
- Usage Guide
- Technical Implementation Details
- Performance & Reliability
- Extensibility Design
- Security & Compliance
- Best Practices
- Troubleshooting
- Related Resources
System Overview
Core Philosophy
Context Engineering treats the LLM’s context window as a programmable logical space, combining symbolic expert systems with connectionist LLMs through the Model Context Protocol (MCP). This enables dynamic context orchestration, intelligent template selection, and tool invocation through a complete Think-Act-Observe (TAO) cycle.
Background & Goals
Business Context: While large language models possess rich general knowledge, they fall short on domain-specific or up-to-date information. Context engineering provides a systematic approach to dynamically assembling and managing context, enabling accurate, trustworthy Q&A services over local knowledge bases while preserving data privacy and security.
Design Goals:
- Functionality: Implement MCP-based context engineering with RAG capabilities
- Experience: Multiple interaction modes from simple dialogue to complex reasoning
- Trustworthiness: Fact-based answers avoiding hallucination issues
- Privacy: Local deployment with data never leaving premises
- Modularity: Symbolic expert system layer managing context, LLM layer focused on reasoning
Technical Principles:
- MCP Protocol: Standardized Model Context Protocol for component interaction
- Symbolic + Connectionist: Expert system manages context, LLM handles reasoning
- Open Source First: Choose open-source frameworks like Ollama to avoid vendor lock-in
- Local Deployment: All data processing completed locally for privacy protection
- Progressive Enhancement: Capability ladder from simple dialogue to complex reasoning
Key Features
- 🧠 MCP Architecture: Model Context Protocol for standardized component interaction
- 🔄 Four-Stage Cycle: Template selection → Context assembly → LLM inference → Context update
- 🎯 Dynamic Context Orchestration: Intelligent template selection and placeholder resolution
- 📚 RAG Capabilities: Local knowledge base retrieval with context enhancement
- 🔧 Tool Integration: MCP Tools for external capabilities (SEARCH, FINISH, etc.)
- 💾 Context Memory: MCP Resources for conversation history management
- 🔌 Ollama Integration: Support for multiple open-source LLMs (Llama, Qwen, Mistral)
Architecture
System Architecture Diagram
graph TB
subgraph "Application Layer"
A[User Query] --> A1[Query Parsing]
A1 --> A2{Mode Router}
A2 -->|Mode 1| B[Direct Chat]
A2 -->|Mode 2| C[Context Enhanced]
A2 -->|Mode 3| D[Multi-step Reasoning]
end
subgraph "Service Layer"
C --> E[RAGService]
D --> E
E --> E1[Retrieval Orchestration]
E --> E2[Context Management]
E --> E3[Reasoning Control]
B --> F[LLMService]
E2 --> F
E3 --> F
end
subgraph "Algorithm Layer"
E1 --> G[Document Retrieval]
G --> G1[TF-IDF Recall]
G1 --> G2[Relevance Ranking]
E2 --> H[Context Engineering]
H --> H1[Document Assembly]
H1 --> H2[Prompt Generation]
E3 --> I[ReAct Reasoning]
I --> I1[Thought Generation]
I1 --> I2[Action Parsing]
I2 --> I3[Observation Feedback]
end
subgraph "Storage Layer"
G --> J[Local Doc Store]
J --> J1[Document Index]
J --> J2[Original Content]
F --> K[Ollama Service]
K --> K1[Model Inference]
K --> K2[Streaming Generation]
end
subgraph "Agent Toolset"
I2 --> L[SEARCH Tool]
I2 --> M[FINISH Tool]
L --> G
end
Architecture Highlights
- Mode Decoupling: Three operation modes independently designed for separate optimization
- Service Layering: Separation of RAG and LLM service responsibilities for maintainability
- Local First: Core services deployed locally with minimal external dependencies
- Tool-based Design: ReAct framework supports tool registration and extension
Directory Structure
src/search_engine/
├── rag_tab/ # RAG functionality module
│ ├── __init__.py # Module initialization
│ ├── rag_service.py # RAG service core implementation ⭐
│ └── rag_tab.py # UI and interaction logic
├── index_tab/ # Retrieval module (RAG dependency)
│ ├── index_service.py # Index service
│ └── offline_index.py # TF-IDF retrieval implementation
├── data_service.py # Data service ⭐
├── portal.py # Main entry and UI orchestration
└── service_manager.py # Service manager
data/ # Data storage directory
├── preloaded_documents.json # Preloaded knowledge base documents
└── openkg_triples.tsv # Knowledge graph data
logs/ # Log directory
└── rag_query_logs.json # RAG query logs
Three Operation Modes
Mode 1: Direct LLM Chat
Use Cases: General knowledge Q&A, creative generation
Technical Implementation: Direct Ollama API call without retrieval
Advantages:
- Fast response time
- Support for open-domain Q&A
- No dependency on local knowledge base
Workflow:
- User query directly sent to LLM
- LLM generates response based on pre-trained knowledge
- Return answer to user
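The three steps above amount to a single call to Ollama's `/api/generate` endpoint with no retrieval in between. A minimal sketch, assuming the default service URL and a default model (both illustrative, not fixed by the project):

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # illustrative default

def build_chat_payload(query: str, model: str = "llama3.2") -> dict:
    """Build a non-streaming generate request for a direct chat query."""
    return {"model": model, "prompt": query, "stream": False}

def direct_chat(query: str, model: str = "llama3.2") -> str:
    """Mode 1: send the query straight to the LLM, no local knowledge base."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json=build_chat_payload(query, model),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```

Because no documents are fetched or assembled, this path has the lowest latency of the three modes.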
Mode 2: Context-Enhanced (RAG)
Use Cases: Professional Q&A based on local documents
Technical Implementation: Retrieval + Context assembly + Prompt engineering
Workflow:
- Use IndexService to retrieve Top-K relevant documents
- Concatenate document content into context string
- Generate structured prompt using template
- Call LLM to generate context-based answer
Context Prompt Template:
Based on the following context information, answer the user's question. If there is no relevant information in the context, please state that you cannot answer based on the provided information.
Context Information:
{context}
User Question: {query}
Please answer in Chinese:
Advantages:
- Fact-based answers grounded in documents
- Reduced hallucination
- Traceable sources
Mode 3: Multi-step Reasoning (ReAct)
Use Cases: Complex query decomposition, multi-document information synthesis
Technical Implementation: Think-Act-Observe loop
Agent Actions:
- SEARCH(query): Retrieve relevant documents
- FINISH(answer): Output final answer
ReAct Prompt Template:
You are an AI assistant capable of using tools. You can solve problems through the following steps:
1. Thought: Analyze what information the problem needs
2. Action: Use available tools to search for information
3. Observation: View the results returned by the tools
4. Repeat above steps until you have enough information
5. Final Answer: Provide an answer based on collected information
Available Tools:
- SEARCH: Search relevant documents, usage: SEARCH["search term"]
Question: {query}
Let's begin:
Thought:
Reasoning Flow:
- Initialize reasoning state, set maximum iterations
- Thought Phase: LLM generates thinking content
- Action Phase: Parse Action type and parameters
- Observation Phase: Execute tool call, get results
- Loop Check: Check if FINISH state is reached
- Result Return: Output reasoning trace and final answer
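The Action Phase above hinges on parsing tool calls out of free-form LLM text. A minimal sketch of that step, assuming the SEARCH["..."] / FINISH[...] syntax from the prompt template (the regex is illustrative, not the project's exact parser):

```python
import re
from typing import Optional, Tuple

# Matches SEARCH["..."] or FINISH[...]; quotes around the argument are optional.
ACTION_RE = re.compile(r'(SEARCH|FINISH)\[(?:")?(.*?)(?:")?\]', re.S)

def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Extract the first tool call from an LLM turn, e.g. SEARCH["mcp"].

    Returns (action_name, argument), or None if no action was emitted,
    in which case the loop would re-prompt or terminate.
    """
    match = ACTION_RE.search(llm_output)
    return (match.group(1), match.group(2)) if match else None
```

The Loop Check then simply tests whether the parsed action name is FINISH.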
Core Components
RAGService Class
# File: src/search_engine/rag_tab/rag_service.py
class RAGService:
"""RAG Service Engine - Three-stage processing pipeline"""
def __init__(self, index_service, ollama_url: str = "http://localhost:11434"):
# Dependency injection - reuse existing retrieval capabilities
self.index_service = index_service
self.ollama_url = ollama_url
self.default_model = "llama3.1:8b"
# Prompt templates - structured prompt design
self.context_template = self._load_prompt_templates()
def rag_query(self, query: str, model: str = None, top_k: int = 5) -> RAGResponse:
"""RAG query main workflow"""
# Stage 1: Retrieve relevant documents
relevant_docs = self.retrieve_documents(query, top_k)
# Stage 2: Context enhancement
enhanced_prompt = self.enhance_with_context(query, relevant_docs)
# Stage 3: LLM generation
response = self.generate_answer(enhanced_prompt, model)
return RAGResponse(
answer=response,
retrieved_docs=relevant_docs,
prompt_used=enhanced_prompt
)
Design Highlights:
- Service Reuse: Built on existing IndexService to avoid duplication
- Template Management: Prompt templates managed centrally for dynamic adjustment
- Observability: Intermediate results preserved at each stage for debugging
- Error Handling: Independent error handling at each stage for robustness
Document Retrieval
def retrieve_documents(self, query: str, top_k: int = 5) -> List[Document]:
"""Retrieval stage - reuse inverted index"""
# Call existing IndexService to avoid reimplementation
raw_results = self.index_service.search(query, top_k)
# Structure results
documents = []
for doc_id, score, content in raw_results:
documents.append(Document(
id=doc_id,
content=content,
relevance_score=score
))
return documents
Context Enhancement
def enhance_with_context(self, query: str, documents: List[Document]) -> str:
"""Enhancement stage - context engineering"""
# Concatenate document content
context = "\n\n".join([
f"Document {i+1}: {doc.content}"
for i, doc in enumerate(documents)
])
# Apply prompt template
enhanced_prompt = self.context_template.format(
context=context,
query=query
)
return enhanced_prompt
Ollama Integration
Connection Management
Health Check:
def check_ollama_connection(self) -> Tuple[bool, str]:
    try:
        response = requests.get(f"{self.ollama_url}/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json().get("models", [])
            return True, f"✅ Available models: {[m['name'] for m in models]}"
        return False, f"❌ Unexpected status code: {response.status_code}"
    except requests.exceptions.RequestException as e:
        return False, f"❌ Connection failed: {str(e)}"
Model Management:
- Dynamically fetch available model list
- Support model switching and configuration
- Implement request timeout and retry mechanisms
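The timeout-and-retry mechanism can be sketched as a generic wrapper with exponential backoff; the retry count and delay values below are illustrative assumptions, not the project's settings:

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def backoff_delays(retries: int = 3, base: float = 0.5) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def call_with_retry(fn: Callable[[], T], retries: int = 3, base: float = 0.5) -> T:
    """Run fn(), retrying on exception with exponential backoff between attempts."""
    last_err: Exception = RuntimeError("no attempts made")
    for delay in backoff_delays(retries, base):
        try:
            return fn()
        except Exception as err:  # in practice, catch requests.RequestException
            last_err = err
            time.sleep(delay)
    raise last_err
```

Usage with a request call might look like `call_with_retry(lambda: requests.get(url, timeout=5))`.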
API Call Optimization
Streaming Generation:
def generate_stream(self, prompt: str, model: str):
    response = requests.post(
        f"{self.ollama_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=30,
    )
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            yield chunk.get("response", "")
            if chunk.get("done"):  # final chunk signals completion
                break
Ollama Configuration
OLLAMA_CONFIG = {
"url": "http://localhost:11434", # Ollama service address
"default_model": "llama3.2", # Default model
"timeout": 30, # Request timeout
"generation_options": { # Generation parameters
"temperature": 0.7, # Randomness control
"top_p": 0.9, # Nucleus sampling
"top_k": 40, # Top-K sampling
"repeat_penalty": 1.1 # Repetition penalty
}
}
# Supported model list
SUPPORTED_MODELS = [
"llama3.2", # Lightweight model
"llama3.1:8b", # Medium-scale model
"qwen2.5", # Chinese-optimized model
"deepseek-coder", # Code-specific model
"mistral" # Multilingual model
]
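The generation parameters above are passed to Ollama in the request body's `options` field. A minimal sketch of that mapping (the config dict here mirrors a subset of OLLAMA_CONFIG above):

```python
# Illustrative subset of the configuration shown above.
OLLAMA_CONFIG = {
    "default_model": "llama3.2",
    "generation_options": {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
}

def build_generate_request(prompt: str, model: str = None,
                           config: dict = OLLAMA_CONFIG) -> dict:
    """Assemble the JSON body for a non-streaming /api/generate call."""
    return {
        "model": model or config["default_model"],
        "prompt": prompt,
        "stream": False,
        "options": config["generation_options"],  # Ollama's parameter field
    }
```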
Usage Guide
Prerequisites
- Ollama service running at `http://localhost:11434` (configurable in code)
- Pull required models in Ollama (e.g., `ollama pull llama3.1`)
Operation Steps
- Navigate to “🤖 Part 3: RAG Retrieval Enhancement” tab
- Click “Check Ollama Connection” to verify connectivity and refresh model list
- Input your question
- Select options:
- Keep “Enable Context Engineering” checked for retrieval-enhanced answers
- Enable “Multi-step Reasoning” if needed for ReAct mode
- Adjust Top-K and select model
- Click “🚀 Execute Query”
Understanding Output
- Generated Answer: Final answer from LLM
- Processing Info: Time elapsed, model used, number of documents retrieved
- Prompt/Reasoning Trace:
- Direct or context mode displays exact prompt sent to LLM
- Multi-step mode shows complete chain-of-thought trace (Thought/Action/Observation)
- Retrieved Documents: Table with document IDs and TF-IDF scores; separate box shows assembled context
Technical Implementation Details
RAG Processing Pipeline
- Query Preprocessing: Keyword extraction, query expansion
- Relevant Document Retrieval: Call IndexService.search()
- Context Construction: Document content concatenation, length truncation
- Prompt Generation: Template filling, structured prompt
- LLM Generation: Ollama API call, streaming return
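The length truncation in step 3 can be sketched as a greedy budget over whole documents in relevance order; the character budget below is an illustrative stand-in for a real token limit:

```python
from typing import List

def truncate_context(docs: List[str], max_chars: int = 2000) -> str:
    """Concatenate documents until adding the next would exceed the budget.

    Documents are assumed pre-sorted by relevance, so truncation drops the
    least relevant ones. The first document is always kept, even if it
    alone exceeds the budget.
    """
    parts, used = [], 0
    for i, doc in enumerate(docs):
        entry = f"Document {i+1}: {doc}"
        if used + len(entry) > max_chars and parts:
            break
        parts.append(entry)
        used += len(entry) + 2  # account for the "\n\n" separator
    return "\n\n".join(parts)
```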
ReAct Reasoning Flow
- Initialization: Set reasoning state, maximum iterations
- Thought Phase: LLM generates thinking content
- Action Phase: Parse Action type and parameters
- Observation Phase: Execute tool call, get results
- Loop Check: Check if FINISH state is reached
- Result Return: Output reasoning trace and final answer
Performance & Reliability
Performance Optimization
- Retrieval Optimization: Reuse existing inverted index to avoid redundant computation
- Context Management: Intelligent truncation, retain most relevant content
- Caching Strategy: Cache identical query results, reduce redundant computation
- Concurrency Control: Limit simultaneous requests to avoid Ollama overload
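The caching strategy can be as simple as an LRU map keyed by the full query signature, so that identical (query, model, top_k) requests skip retrieval and generation entirely. A sketch (the cache size is an illustrative choice):

```python
from collections import OrderedDict
from typing import Optional

class QueryCache:
    """Tiny LRU cache keyed by (query, model, top_k)."""

    def __init__(self, max_size: int = 128):
        self.max_size = max_size
        self._store: OrderedDict = OrderedDict()

    def get(self, query: str, model: str, top_k: int) -> Optional[str]:
        key = (query, model, top_k)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, query: str, model: str, top_k: int, answer: str) -> None:
        key = (query, model, top_k)
        self._store[key] = answer
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```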
Reliability Guarantees
Error Handling:
- Degradation strategy when Ollama service unavailable
- Network timeout and retry mechanisms
- Fallback solutions for model inference failures
Monitoring Metrics:
- Query response time distribution
- Ollama service availability
- Retrieval recall and accuracy
Extensibility Design
Model Extension
- Multi-model Support: Llama, Mistral, CodeLlama, etc.
- Model Routing: Select most suitable model based on query type
- Model Fine-tuning: Support for domain-specialized model integration
Feature Extension
- Tool Extension: ReAct framework supports new tool registration
- Context Enhancement: Support multimodal content like charts, code
- Conversation Memory: Implement multi-turn dialogue context management
Security & Compliance
Data Security
- Local Deployment: All data processing completed locally
- Privacy Protection: No user queries uploaded to external services
- Access Control: Document permission management and user authentication
Content Safety
- Output Filtering: Sensitive content detection and filtering
- Prompt Injection Protection: Prevent malicious prompt attacks
- Generated Content Monitoring: Output quality and safety monitoring
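Prompt injection protection typically starts with two cheap measures: delimiting retrieved text so the model can tell data from instructions, and screening documents for instruction-like phrases. An illustrative, deliberately minimal sketch (the pattern list is an example, not an exhaustive defense):

```python
import re

# Example phrases that often signal an injection attempt inside a document.
SUSPICIOUS = re.compile(
    r"(ignore (all |any )?(previous|above) instructions|system prompt)",
    re.IGNORECASE,
)

def sanitize_document(text: str) -> str:
    """Redact instruction-like phrases before the text enters the context."""
    return SUSPICIOUS.sub("[REDACTED]", text)

def wrap_context(text: str) -> str:
    """Delimit retrieved content to separate data from instructions."""
    return f"<context>\n{sanitize_document(text)}\n</context>"
```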
Best Practices
Context Engineering
- Document Selection: Retrieve only most relevant documents (Top-K=3-5)
- Content Truncation: Limit context length to avoid exceeding model limits
- Template Design: Clear instructions in prompt templates
- Source Citation: Include document IDs for answer traceability
Model Selection
- General Q&A: Use `llama3.2` or `qwen2.5`
- Code-related: Use `deepseek-coder` or `codellama`
- Multilingual: Use `mistral` or `qwen2.5`
- Long Context: Use models with larger context windows
Prompt Optimization
- Clear Instructions: Explicitly state what LLM should do
- Context Boundaries: Clearly mark context and query sections
- Output Format: Specify desired answer format
- Constraint Statements: Add constraints like “answer in Chinese”, “cite sources”
Troubleshooting
Ollama Connection Issues
Problem: Cannot connect to Ollama service
Solutions:
# 1. Check if Ollama is running
curl http://localhost:11434/api/tags
# 2. Start Ollama service
ollama serve
# 3. Check firewall settings
# 4. Verify Ollama URL in configuration
Model Not Available
Problem: Selected model not found
Solutions:
# Pull model
ollama pull llama3.1:8b
# List available models
ollama list
# Check model name format (case-sensitive)
Poor Answer Quality
Problem: Answers not satisfactory
Solutions:
- Increase Top-K to retrieve more documents
- Try different models
- Adjust generation parameters (temperature, top_p)
- Improve prompt template
- Check if retrieved documents are relevant
Slow Response Time
Problem: Query takes too long
Solutions:
- Reduce Top-K value
- Use smaller models
- Enable streaming for progressive display
- Optimize document index
- Consider GPU acceleration
Related Resources
- Ollama Documentation
- RAG Paper (Lewis et al., 2020)
- ReAct Paper (Yao et al., 2022)
- Prompt Engineering Guide