Search & Recommendation System

A comprehensive search and recommendation system featuring TF-IDF-based retrieval, CTR prediction models, and intelligent ranking algorithms.

Table of contents

  1. System Overview
    1. Key Features
  2. Architecture Overview
    1. System Layers
    2. Core Components
  3. Module Documentation
    1. Core Features
    2. Model Analysis & Optimization
  4. Quick Start
    1. Perform Search
    2. View Results
    3. Train CTR Model (Optional)
  5. Technical Highlights
    1. Retrieval Stage
    2. Ranking Stage
  6. Key Files & Directories
  7. Performance Metrics
  8. Next Steps

System Overview

The search and recommendation system implements a complete information retrieval pipeline, from document indexing to intelligent ranking using machine learning models.

Key Features

  • 🔍 Full-Text Search: TF-IDF-based inverted index with Chinese word segmentation
  • 🎯 CTR Prediction: Logistic Regression and Wide & Deep neural networks
  • 📊 Model Evaluation: Cross-validation, interpretability, and fairness analysis
  • 🤖 AutoML: Hyperparameter optimization with Grid Search and Optuna
  • 🔄 Online Learning: Click feedback collection and model retraining

Architecture Overview

System Layers

graph TB
    subgraph "Application Layer"
        A[User Query] --> B[Search Interface]
    end
    
    subgraph "Service Layer"
        B --> C[Index Service]
        C --> D[Model Service]
    end
    
    subgraph "Algorithm Layer"
        C --> E[TF-IDF Retrieval]
        D --> F[CTR Ranking]
    end
    
    subgraph "Storage Layer"
        E --> G[Inverted Index]
        F --> H[Model Storage]
    end
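
The flow in the diagram (retrieve candidates from the index, then optionally re-rank them with the CTR model) can be sketched as below. This is a minimal illustration, not the project's actual service code: `search_pipeline`, `fake_retrieve`, and `fake_ctr` are hypothetical names, and the stub scores are made up for the example.

```python
def search_pipeline(query, retrieve, predict_ctr=None, top_k=10):
    """Retrieve candidates, then optionally re-rank them by predicted CTR.

    retrieve(query) -> list of (doc_id, tfidf_score), best first.
    predict_ctr(query, doc_id, tfidf_score) -> float; None means TF-IDF mode.
    """
    candidates = retrieve(query)
    if predict_ctr is None:
        return candidates[:top_k]
    rescored = [
        (doc_id, predict_ctr(query, doc_id, score))
        for doc_id, score in candidates
    ]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


# Stub services, for illustration only (assumed data, not real output).
def fake_retrieve(query):
    return [("d1", 2.0), ("d2", 1.5), ("d3", 0.4)]


def fake_ctr(query, doc_id, tfidf_score):
    return {"d1": 0.05, "d2": 0.20, "d3": 0.10}[doc_id]


tfidf_order = [d for d, _ in search_pipeline("ai", fake_retrieve)]
ctr_order = [d for d, _ in search_pipeline("ai", fake_retrieve, fake_ctr)]
```

Passing the retrieval and ranking steps as callables keeps the two stages independent, which mirrors the separation between the Index Service and the Model Service in the diagram.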

Core Components

| Component | Description | Technology |
|---|---|---|
| Inverted Index | Document indexing and retrieval | TF-IDF, jieba |
| CTR Model | Click-through rate prediction | Logistic Regression, TensorFlow |
| Feature Engineering | Extract 7-dimensional feature vectors | pandas, NumPy |
| Model Serving | Real-time prediction service | scikit-learn |

Module Documentation

Core Features

  • CTR Prediction Models: Learn about Logistic Regression and Wide & Deep models for CTR prediction
  • System Architecture: Understand the system architecture and design principles
  • Implementation Details: Dive into code implementation and algorithms

Model Analysis & Optimization

  • Model Evaluation: Cross-validation and generalization analysis
  • Interpretability Analysis: LIME and SHAP model explanations
  • Fairness Analysis: Performance analysis across different groups
  • AutoML Optimization: Hyperparameter tuning with Grid Search and Optuna


Quick Start

1. Perform Search

Navigate to the “🔍 Online Retrieval & Ranking” tab:

  • Enter query terms (e.g., “artificial intelligence”, “machine learning”)
  • Select ranking mode: TF-IDF or CTR
  • Click “🔬 Execute Search”

2. View Results

  • Results are displayed in a table with doc ID, TF-IDF score, CTR score, and summary
  • Click on rows to view full document content
  • Interactions are logged for CTR training
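
One way to turn logged interactions into CTR training data is to record every displayed result as an impression and label it by whether it was clicked. The sketch below assumes an in-memory log. `InteractionLog` and its methods are hypothetical names and do not describe the actual contents of data_service.py.

```python
from dataclasses import dataclass, field


@dataclass
class InteractionLog:
    """In-memory log of displayed results and clicks (hypothetical structure)."""

    impressions: list = field(default_factory=list)
    clicks: set = field(default_factory=set)

    def record_impressions(self, query, doc_ids):
        # Positions are 1-based, in the order the results were shown.
        for position, doc_id in enumerate(doc_ids, start=1):
            self.impressions.append((query, doc_id, position))

    def record_click(self, query, doc_id):
        self.clicks.add((query, doc_id))

    def training_samples(self):
        """Return one labeled row per impression: clicked is 1 or 0."""
        return [
            {
                "query": q,
                "doc_id": d,
                "position": p,
                "clicked": int((q, d) in self.clicks),
            }
            for q, d, p in self.impressions
        ]


log = InteractionLog()
log.record_impressions("machine learning", ["d2", "d1"])
log.record_click("machine learning", "d1")
samples = log.training_samples()
```

Unclicked impressions become negative examples, so the model sees both outcomes for each query.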

3. Train CTR Model (Optional)

Navigate to the “📊 Data Collection & Training” tab:

  • Review collected samples and statistics
  • Click “Train CTR Model”
  • Return to search tab and switch to CTR ranking to compare results

Tip: Preloaded documents in data/preloaded_documents.json are indexed automatically on startup.


Technical Highlights

Retrieval Stage

  • Algorithm: TF-IDF with inverted index
  • Tokenization: jieba Chinese word segmentation
  • Optimization: Short-circuit evaluation, LRU caching
  • Scalability: Horizontal sharding support
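
To make the retrieval stage concrete, here is a minimal sketch of TF-IDF scoring over an inverted index. It rests on several assumptions: whitespace tokenization stands in for jieba segmentation so the example runs with the standard library only, the IDF variant `log(1 + N / df)` is one common choice and may differ from the project's formula, and `build_index` and `search` are hypothetical names. Caching and sharding are left out.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace splitting stands in for jieba segmentation.
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query, top_k=10):
    """Score documents by summed TF-IDF over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term)
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


docs = {
    "d1": "machine learning for search ranking",
    "d2": "deep learning and machine learning",
    "d3": "cooking recipes",
}
index = build_index(docs)
results = search(index, len(docs), "machine learning")
```

Only the postings lists of the query terms are visited, so documents that share no term with the query (here `d3`) are never scored.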

Ranking Stage

  • Models: Logistic Regression, Wide & Deep networks
  • Features: 7-dimensional feature vector (including position, content length, match score, and historical CTR)
  • Training: Online learning with click feedback
  • Evaluation: Accuracy, Precision, Recall, AUC
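
The ranking stage can be illustrated with a logistic regression that is updated one click sample at a time, followed by an AUC check. This is a sketch under stated assumptions rather than the code in ctr_model.py: it uses plain-Python SGD instead of scikit-learn, only the four features named above (scaled to [0, 1]) instead of all seven, and made-up training samples.

```python
import math


def sigmoid(z):
    # Clamp the logit to avoid overflow in math.exp.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Logistic regression updated per sample with SGD (illustrative)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, clicked):
        # Gradient of the log loss w.r.t. the logit is (prediction - label).
        error = self.predict(x) - clicked
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


def auc(labels, scores):
    """Probability that a clicked sample outscores an unclicked one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed features: [position, content length, match score, historical CTR].
samples = [
    ([0.0, 0.5, 0.9, 0.30], 1),
    ([0.1, 0.4, 0.8, 0.25], 1),
    ([0.8, 0.6, 0.2, 0.05], 0),
    ([0.9, 0.3, 0.1, 0.02], 0),
]
model = OnlineLogisticRegression(n_features=4)
for _ in range(200):
    for x, y in samples:
        model.update(x, y)

p_top = model.predict([0.0, 0.5, 0.9, 0.30])
p_bottom = model.predict([0.9, 0.3, 0.1, 0.02])
train_auc = auc([y for _, y in samples], [model.predict(x) for x, _ in samples])
```

Because `update` takes a single sample, the same method can be called whenever new click feedback arrives, without retraining from scratch.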

Key Files & Directories

src/search_engine/
├── search_tab/          # Search functionality
├── index_tab/           # Index management
│   └── offline_index.py # Inverted index implementation ⭐
├── training_tab/        # Model training
│   ├── ctr_model.py    # CTR model implementation ⭐
│   ├── model_evaluation.py
│   ├── model_interpretability.py
│   ├── model_fairness.py
│   └── model_automl.py
├── data_service.py      # Data collection ⭐
└── model_service.py     # Model serving ⭐

Performance Metrics

| Metric | Description | Target |
|---|---|---|
| CTR | Click-through rate | > 10% |
| MRR | Mean reciprocal rank | > 0.7 |
| Latency | Average response time | < 100ms |
| QPS | Queries per second | > 1000 |
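
For reference, the first two metrics can be computed from logged interactions as sketched below. The function names and the sample numbers are assumptions made for illustration. MRR is computed here from the 1-based rank of the first clicked result per query, counting queries without a click as 0.

```python
def click_through_rate(impressions, clicks):
    """Fraction of displayed results that were clicked."""
    return clicks / impressions if impressions else 0.0


def mean_reciprocal_rank(first_click_ranks):
    """Average of 1/rank over queries; None means the query had no click."""
    if not first_click_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_click_ranks if r is not None)
    return total / len(first_click_ranks)


# Made-up numbers for illustration only.
ctr = click_through_rate(impressions=200, clicks=24)
mrr = mean_reciprocal_rank([1, 2, None, 4])

meets_ctr_target = ctr > 0.10
meets_mrr_target = mrr > 0.7
```

With these sample values the CTR is 0.12, which meets the target, while the MRR is 0.4375, which does not.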

Next Steps

