PDFs, Videos, Images.
One search to query them all.
Documents get chunked with contextual headers and embedded into 768-d vectors. Videos get frame-extracted and CLIP-embedded into 512-d vectors. Images are embedded with the same CLIP backbone. All assets land in pgvector — searchable in seconds.
768-d · Text Embed
512-d · CLIP Embed
300t · Chunk Size
1fps · Frame Rate
top-20 · LLM Rerank
pgvec · Vector Store
System Overview
End-to-End Architecture
Three parallel pipelines — document RAG, video semantic search, and image semantic search — share the same pgvector backbone and a unified CLIP embedding engine. Videos and images support LLM-powered query rewriting, result reranking, and cross-modal image-to-video search.
User Query
Natural language question via Web UI
FastAPI Gateway
Async ASGI · Pydantic validation · CORS
Conversation Manager
Load last 3 turns from Supabase. UUID-based session isolation. Persists Q&A pairs for multi-turn context.
Model Router
Deterministic rule engine with 5 rules + OOD filter. Keyword regex, word count, question marks, comparison detection.
Query Embedding
HuggingFace Inference API encodes query into 768-d dense vector. Same model as ingestion for alignment. Batch + retry.
Vector Similarity Search
Supabase pgvector RPC: match_chunks(). L2 distance → similarity score. Returns top-5 chunks sorted by relevance.
Dynamic K-Cutoff
A hard threshold (score > 0.2) removes noise first. An adaptive filter then keeps only chunks scoring ≥ 80% of the top result, which limits "lost in the middle" degradation.
Prompt Assembly
Layer 1 · System Persona
Layer 2 · Context Chunks
Layer 3 · Conversation History
Layer 4 · User Query
Assembled prompt counted via tiktoken (o200k_base) before dispatch. Avg 200–350 input tokens.
LLM Generation
Simple Path: Llama 3.1 8B
Instant inference · Low latency
Complex Path: Llama 3.3 70B
Versatile · Deep reasoning
Output Evaluator
no_context
refusal
unverified
pricing
Checks for hallucination, refusal with partial-answer detection, unverified entities via proper noun extraction, and pricing hedging language.
JSON Response
{
"answer": "...",
"metadata": { model, tokens, latency, flags },
"sources": [ doc, page, score ],
"conversation_id": "conv_..."
}
Routing Logger
JSONL format with daily rotation & 30-day retention. Logs classification, rule triggered, complexity score, token counts, latency, and evaluator flags for every query.
Phase 1
Document Ingestion
PDFs are loaded, parsed page-by-page, chunked with token-aware splitting, injected with contextual headers extracted from font metadata, embedded into 768-d vectors, and stored in Supabase pgvector.
PDF Extraction
PyMuPDF (fitz) parses documents page-by-page
- Extracts raw text with page.get_text() from each page
- Tracks filename, page numbers (1-indexed), word counts
- Handles corrupted PDFs with graceful error recovery
Contextual Header Extraction
Font-size analysis injects hierarchical document structure
- Uses get_text("dict") to access font metadata per text block
- H1 > 18pt · H2 > 14pt · H3 > 12pt — builds a header stack
- Prefixes each chunk: [Context: Section > Subsection > ...]
- Maintains header hierarchy across page boundaries
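The header stack described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the spans from get_text("dict") have already been flattened into (font_size, text) tuples in reading order, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

def header_level(font_size: float) -> Optional[int]:
    """Map a font size to a header level using the thresholds above."""
    if font_size > 18:
        return 1
    if font_size > 14:
        return 2
    if font_size > 12:
        return 3
    return None  # body text

def annotate_spans(spans: List[Tuple[float, str]]) -> List[str]:
    """Prefix each body span with the current header path.

    The stack lives outside any per-page loop, so the hierarchy
    carries over page boundaries.
    """
    stack: List[Tuple[int, str]] = []
    annotated = []
    for size, text in spans:
        level = header_level(size)
        if level is None:
            path = " > ".join(title for _, title in stack)
            prefix = f"[Context: {path}] " if path else ""
            annotated.append(prefix + text)
        else:
            # A new header replaces any header at the same or a deeper level.
            stack = [(lvl, title) for lvl, title in stack if lvl < level]
            stack.append((level, text))
    return annotated
```

A new H1 clears the whole stack, so a chunk under "Security" never inherits a subsection from "Billing".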
Token-Aware Chunking
300-token chunks with 50-token overlap via recursive splitting
- Tokenizer: all-mpnet-base-v2 AutoTokenizer for exact counts
- Recursive separators: \n\n → \n → ". " → " " → char-level
- Overlap decoded from last 50 tokens of previous chunk
- Chunk IDs: {filename}_{page}_{index} for deterministic dedup
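To make the overlap arithmetic concrete, here is a simplified fixed-window sketch. It is not the recursive splitter described above: it assumes the text is already a list of tokens (the real pipeline would get these from the all-mpnet-base-v2 tokenizer), and the function names are hypothetical.

```python
from typing import List, Sequence

def chunk_tokens(tokens: Sequence, chunk_size: int = 300, overlap: int = 50) -> List[list]:
    """Split a token sequence into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks

def chunk_id(filename: str, page: int, index: int) -> str:
    """Deterministic ID in the {filename}_{page}_{index} format."""
    return f"{filename}_{page}_{index}"
```

With the defaults, each new chunk starts 250 tokens after the previous one, so a 700-token page yields three chunks. Because the ID depends only on file, page, and position, re-ingesting the same PDF produces the same IDs and the upsert overwrites rather than duplicates.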
Embedding Generation
sentence-transformers/all-mpnet-base-v2 via HuggingFace Inference
- Output: 768-dimensional dense vectors per chunk
- Batch processing for efficient API utilization
- Exponential backoff: 5 retries, 5s → 60s max delay
- Model warmup on startup to avoid cold-start latency
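The retry schedule above can be sketched as below. The source gives only the retry count, the first delay, and the cap; the doubling factor and both function names are assumptions made for illustration.

```python
import time
from typing import Callable, List

def backoff_delays(retries: int = 5, base: float = 5.0, cap: float = 60.0) -> List[float]:
    """Delays before each retry: doubles from `base`, never above `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]

def call_with_retry(fn: Callable, retries: int = 5, base: float = 5.0,
                    cap: float = 60.0, sleep: Callable[[float], None] = time.sleep):
    """Call `fn`, retrying on any exception with exponential backoff.

    `sleep` is injectable so the schedule can be tested without waiting.
    """
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):  # one initial try plus `retries` retries
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the last error
            sleep(delays[attempt])
```

Under these assumptions the waits are 5, 10, 20, 40, and 60 seconds, so a request that never succeeds gives up after a little over two minutes of waiting.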
Vector Storage
Supabase PostgreSQL with pgvector extension
- L2 distance converted to similarity: 1 − distance
- RPC function: match_chunks(embedding, threshold, count)
- Upsert strategy prevents duplicate chunks on re-ingestion
- Stores text, metadata, page numbers alongside vectors
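The distance-to-similarity conversion is small enough to show directly. This sketch assumes unit-length embeddings, in which case the L2 distance lies between 0 and 2 and the score between 1 and −1; the function name is hypothetical.

```python
import math
from typing import Sequence

def l2_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity score as described above: 1 minus the L2 distance."""
    return 1.0 - math.dist(a, b)

# For unit vectors: identical -> 1.0, orthogonal -> about -0.41, opposite -> -1.0
```

Because the score can go negative for dissimilar vectors, the 0.2 relevance threshold applied at query time also discards everything on the negative side.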
Phase 2
Query Pipeline
When a user asks a question, the system classifies complexity, retrieves relevant chunks with adaptive filtering, builds a context-rich prompt, generates via Groq, evaluates output quality, and logs everything.
Query Classification & Model Routing
Deterministic rule-based decision tree — not ML
- OOD filter: greetings & meta-questions skip retrieval entirely
- Complex triggers: keywords (explain, compare, analyze), length >15 words, multiple "?", comparison words (vs, better, worse)
- Simple → Llama 3.1 8B Instant · Complex → Llama 3.3 70B Versatile
- Word-boundary regex prevents false positives (e.g., the "vs" inside "devs" does not trigger the comparison rule)
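A rule engine of this shape might look like the sketch below. The keyword lists are limited to the examples named above, the greeting pattern and rule names are hypothetical, and the model IDs are assumed Groq identifiers.

```python
import re

# Only the example words from the list above; a real rule set would be longer.
COMPLEX_KEYWORDS = re.compile(r"\b(explain|compare|analyze)\b", re.IGNORECASE)
COMPARISON_WORDS = re.compile(r"\b(vs|better|worse)\b", re.IGNORECASE)
# Hypothetical greeting pattern for the OOD filter.
GREETINGS = re.compile(r"^\s*(hi|hello|hey|thanks|thank you)\b[\s!.?]*$", re.IGNORECASE)

SIMPLE_MODEL = "llama-3.1-8b-instant"      # assumed model ID
COMPLEX_MODEL = "llama-3.3-70b-versatile"  # assumed model ID

def route_query(query: str) -> dict:
    """Classify a query and pick a model using fixed, ordered rules."""
    if GREETINGS.match(query):
        return {"route": "ood", "model": None, "rule": "ood_filter"}
    if COMPLEX_KEYWORDS.search(query):
        rule = "keyword"
    elif len(query.split()) > 15:
        rule = "length"
    elif query.count("?") > 1:
        rule = "multi_question"
    elif COMPARISON_WORDS.search(query):
        rule = "comparison"
    else:
        return {"route": "simple", "model": SIMPLE_MODEL, "rule": "default"}
    return {"route": "complex", "model": COMPLEX_MODEL, "rule": rule}
```

The `\b` anchors are what keep "devs" or "CSV" from matching the "vs" rule, while "Plan A vs Plan B" still routes to the complex path. Returning the triggering rule alongside the model makes each decision easy to log and audit.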
Retrieval with Dynamic K-Cutoff
Vector search + adaptive filtering to limit "lost in the middle" degradation
- Query embedded with same mpnet-v2 model → 768-d vector
- Top-5 chunks fetched from Supabase pgvector via RPC
- Hard threshold filter: score > 0.2 removes noise
- Dynamic cutoff: only keep chunks ≥ 80% of top score — adaptive k (2–5)
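The two-stage filter can be sketched in a few lines. The dict shape with a "score" key and the function name are assumptions; the 0.2 threshold and 0.8 multiplier come from the text above.

```python
from typing import Dict, List

def dynamic_k_cutoff(chunks: List[Dict], hard_threshold: float = 0.2,
                     ratio: float = 0.8) -> List[Dict]:
    """Drop noise, then keep only chunks close to the best score."""
    # Stage 1: hard threshold removes low-similarity noise.
    kept = [c for c in chunks if c["score"] > hard_threshold]
    if not kept:
        return []
    # Stage 2: adaptive floor relative to the top result.
    kept.sort(key=lambda c: c["score"], reverse=True)
    floor = kept[0]["score"] * ratio
    return [c for c in kept if c["score"] >= floor]
```

For scores of 0.82, 0.70, 0.64, 0.30, and 0.15, the hard threshold drops the last chunk and the adaptive floor (0.82 × 0.8 ≈ 0.656) drops the next two, leaving only the 0.82 and 0.70 chunks in the prompt.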
Prompt Construction
Multi-layer prompt with system instructions, context, and history
- 1. System: knowledge assistant persona
- 2. Context: retrieved chunk texts (2–5 chunks)
- 3. History: last 3 conversation turns (multi-turn memory)
- 4. Current question + instruction suffix for grounded answers
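The four layers above could be assembled roughly as follows. The persona text, the instruction suffix, and the section labels are placeholders invented for this sketch; only the layer order and the 3-turn window come from the source.

```python
from typing import List, Tuple

# Placeholder wording; the actual persona and suffix are not given in the source.
SYSTEM_PERSONA = "You are a knowledge assistant. Answer only from the provided context."
GROUNDING_SUFFIX = "Answer using only the context above. If the context is insufficient, say so."

def build_prompt(chunks: List[str], history: List[Tuple[str, str]],
                 question: str, max_turns: int = 3) -> str:
    """Assemble system persona, context, history, and question, in that order."""
    layers = [SYSTEM_PERSONA]
    layers.append("Context:\n" + "\n\n".join(chunks))
    recent = history[-max_turns:]  # keep only the most recent turns
    if recent:
        turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in recent)
        layers.append("Conversation history:\n" + turns)
    layers.append(f"Question: {question}\n{GROUNDING_SUFFIX}")
    return "\n\n".join(layers)
```

Placing the question last keeps it adjacent to the grounding instruction, and the history slice bounds prompt growth no matter how long the session runs.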
LLM Generation
Groq API with streaming SSE and token counting via tiktoken
- Temperature: 0.7 · Max tokens: 500 per response
- Token counting: tiktoken o200k_base encoding pre & post generation
- Streaming: Server-Sent Events yield tokens in real-time
- Error handling: structured retries with exponential backoff
Output Quality Evaluation
4-flag system catches hallucinations, refusals, and uncertainty
- no_context — answered without any retrieved documentation
- refusal — declined to answer (with partial-answer detection to avoid false positives)
- unverified_feature — mentions entities not found in source chunks
- pricing_uncertainty — hedging language or conflicting price info
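One way such flags could be computed is sketched below. Every heuristic here is an assumption: the phrase lists, the word-count cutoff used as partial-answer detection, and the capitalized-word approach to proper nouns are illustrative stand-ins, not the project's actual rules.

```python
import re
from typing import List, Set

# Hypothetical phrase lists and cutoff.
REFUSAL_PHRASES = ("i don't know", "i cannot answer", "i don't have enough information")
HEDGE_PHRASES = ("approximately", "may vary", "not sure", "might be")
PARTIAL_ANSWER_MIN_WORDS = 25  # longer answers are treated as partial answers, not refusals

def proper_nouns(text: str) -> Set[str]:
    """Capitalized words that are not the first word of a sentence."""
    nouns: Set[str] = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z][A-Za-z0-9]*", sentence)
        nouns.update(w for w in words[1:] if w[0].isupper())
    return nouns

def evaluate(answer: str, chunks: List[str]) -> List[str]:
    """Return the list of quality flags raised by an answer."""
    flags = []
    lower = answer.lower()
    if not chunks:
        flags.append("no_context")
    refused = any(p in lower for p in REFUSAL_PHRASES)
    if refused and len(answer.split()) < PARTIAL_ANSWER_MIN_WORDS:
        flags.append("refusal")
    source = " ".join(chunks).lower()
    if chunks and any(n.lower() not in source for n in proper_nouns(answer)):
        flags.append("unverified_feature")
    mentions_price = "$" in answer or "price" in lower or "pricing" in lower
    if mentions_price and any(p in lower for p in HEDGE_PHRASES):
        flags.append("pricing_uncertainty")
    return flags
```

Flags are returned rather than used to block the answer, which matches the response format where they travel in the metadata for the client or logger to act on.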
Phase 3
Video Semantic Search
A parallel pipeline for video content. Upload an MP4, MOV, or MKV file: frames are extracted, CLIP-embedded into 512-d vectors, and stored in pgvector. Search by text or upload a reference image. Llama 3.3 70B rewrites queries and reranks results to improve recall and ranking quality.
Video Upload & Validation
File validation, sanitization, and async job dispatch
- Accepts MP4, MOV, MKV — up to 500 MB per file
- Filename sanitized against path traversal attacks
- Returns 202 Accepted with video_id and job_id immediately
- Background processing job handles the heavy lifting
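The validation step might be sketched like this. The function names and the character whitelist are assumptions, and 500 MB is interpreted here as 500 × 1024 × 1024 bytes, which the source does not specify.

```python
import os
import re

ALLOWED_EXTENSIONS = {".mp4", ".mov", ".mkv"}
MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # assumed interpretation of "500 MB"

def sanitize_filename(name: str) -> str:
    """Reduce an uploaded name to a safe basename."""
    # Drop any directory part, whichever separator the client used.
    base = name.replace("\\", "/").split("/")[-1]
    # Replace anything outside a conservative whitelist, then strip leading dots.
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    if not base:
        raise ValueError("filename is empty after sanitization")
    return base

def validate_upload(name: str, size_bytes: int) -> str:
    """Return the sanitized filename or raise ValueError."""
    safe = sanitize_filename(name)
    ext = os.path.splitext(safe)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the upload size limit")
    return safe
```

Keeping only the final path component means a name like "../../etc/passwd.mp4" is stored as "passwd.mp4" inside the upload directory. The same pattern applies to image uploads with a different extension set and size limit.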
Frame Extraction
OpenCV extracts frames at configurable intervals
- Default: 1 frame per second across the entire video
- Each frame saved as JPEG in {video_id}/frames/ directory
- Tracks timestamp (seconds) for seek-to-frame playback
- Configurable interval via VIDEO_FRAME_INTERVAL_SEC
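The sampling logic reduces to picking frame indices and their timestamps. This sketch leaves out the OpenCV decoding itself and assumes the frame count and FPS have already been read from the video; the function name is hypothetical.

```python
from typing import List, Tuple

def frame_schedule(total_frames: int, fps: float,
                   interval_sec: float = 1.0) -> List[Tuple[int, float]]:
    """Return (frame_index, timestamp_seconds) pairs to extract."""
    if fps <= 0 or interval_sec <= 0:
        raise ValueError("fps and interval_sec must be positive")
    # Number of source frames between two sampled frames, at least 1.
    step = max(int(round(fps * interval_sec)), 1)
    return [(index, index / fps) for index in range(0, total_frames, step)]
```

A 3-second clip at 30 fps gives frames 0, 30, and 60 at 0.0, 1.0, and 2.0 seconds. Storing the timestamp with each frame is what lets a search result open the player at the matching moment.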
CLIP Embedding
Frames embedded with clip-ViT-B-32 into 512-d vectors
- Two modes: local (sentence-transformers) or server (HuggingFace API)
- Automatic fallback from server to local on API failure
- CLIP aligns image and text in the same vector space
- Enables text-to-image search: "red car on highway" finds that frame
Frame Vector Storage
Supabase pgvector with HNSW index for fast similarity search
- Separate table: video_frame_embeddings with vector(512)
- HNSW index for sub-linear search over millions of frames
- RPC function: match_video_frames() with cosine distance
- Cascade delete: removing a video drops all frames + embeddings
Natural Language Frame Search
Type a query, find the exact video moment
- Query text embedded with CLIP into the same 512-d space
- Cosine similarity ranks frames by semantic relevance
- Results include thumbnail, timestamp, similarity score
- Click a result to open the video player seeked to that frame
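The ranking step is shown here as a brute-force sketch for clarity. In the described system pgvector and its HNSW index do this work inside the database; the frame dict shape and function names below are assumptions.

```python
import math
from typing import Dict, List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_frames(query_vec: Sequence[float], frames: List[Dict],
                top_k: int = 5) -> List[Dict]:
    """Score every frame against the query and return the best matches."""
    scored = [
        {"timestamp": f["timestamp"], "score": cosine_similarity(query_vec, f["embedding"])}
        for f in frames
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]
```

Because the text query and the frames are embedded by the same CLIP model, the comparison needs no translation layer between modalities.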
LLM Query Rewriting
Llama 3.3 70B expands queries into 3 alternative formulations
- Original query rewritten via Groq into visually descriptive variants
- Each variant embedded separately with CLIP and searched in parallel
- Results merged and deduplicated across all query variants
- Catches synonyms and phrasings the original query would miss
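Merging the per-variant result lists could look like the sketch below. It assumes each hit is a dict with "frame_id" and "score" keys and keeps the best score per frame; both the shape and the keep-the-maximum policy are illustrative choices.

```python
from typing import Dict, List

def merge_results(result_lists: List[List[Dict]], limit: int = 20) -> List[Dict]:
    """Combine hits from several query variants, one entry per frame."""
    best: Dict[str, Dict] = {}
    for results in result_lists:
        for hit in results:
            key = hit["frame_id"]
            # Keep the highest-scoring occurrence of each frame.
            if key not in best or hit["score"] > best[key]["score"]:
                best[key] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:limit]
```

The default limit of 20 matches the rerank pool size, so the merged list can be handed to the reranker as is.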
LLM Result Reranking
Top-20 candidates reranked by semantic understanding
- Collects top candidates from all query variants into a pool
- LLM evaluates query intent against each frame’s context
- Reorders results by true relevance, not just cosine distance
- Configurable candidate count via VIDEO_LLM_RERANK_CANDIDATES
Image-to-Video Search
Upload a reference image to find matching video frames
- Reference image embedded with CLIP into the same 512-d space
- Optional text prompt blended with image embedding (0.3 weight)
- Formula: blended = (1−w) × image_vec + w × text_vec
- Enables "find frames that look like this photo" workflows
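The blending formula can be sketched as below. Normalizing the inputs and the result is an assumption added here so that cosine search behaves consistently; the source states only the weighted sum and the 0.3 default weight.

```python
import math
from typing import List, Sequence

def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def blend_embeddings(image_vec: Sequence[float], text_vec: Sequence[float],
                     w: float = 0.3) -> List[float]:
    """blended = (1 - w) * image_vec + w * text_vec, returned at unit length."""
    if len(image_vec) != len(text_vec):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    image_unit, text_unit = normalize(image_vec), normalize(text_vec)
    mixed = [(1 - w) * i + w * t for i, t in zip(image_unit, text_unit)]
    return normalize(mixed)
```

At w = 0.3 the image dominates and the text prompt nudges the query, for example toward "same scene, but at night". Setting w = 0 reduces to plain image-to-video search.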
Phase 4
Image Semantic Search
Upload JPG, PNG, or WebP images — each is embedded with the same CLIP ViT-B-32 backbone into 512-d vectors and stored in pgvector. Search by natural language to find visually matching images across your collection.
Image Upload & Validation
File validation, sanitization, and async processing
- Accepts JPG, PNG, WebP — up to 50 MB per file
- Filename sanitized against path traversal attacks
- Returns 202 Accepted with image_id and job_id immediately
- Background job handles embedding without blocking the API
CLIP Image Embedding
Whole-image embedding via shared clip-ViT-B-32 engine
- Same CLIP backbone as video frames — shared 512-d vector space
- Two modes: local (sentence-transformers) or server (HuggingFace API)
- Automatic fallback from server to local on API failure
- No frame extraction needed — the entire image is embedded directly
Image Vector Storage
Supabase pgvector with HNSW index and cascade deletes
- Separate table: image_embeddings with vector(512)
- HNSW index for sub-linear cosine similarity search
- RPC function: match_images() with configurable threshold
- Cascade delete: removing an image drops its embedding automatically
Natural Language Image Search
Type a query, find visually matching images
- Query text embedded with CLIP into the same 512-d space
- Cosine similarity ranks images by semantic relevance
- Results include thumbnails and similarity percentage
- Click a result to view the full-resolution image in a lightbox
Infrastructure
Technology Stack
| Technology | Role | Details |
|---|---|---|
| FastAPI | API Gateway | Async ASGI with Uvicorn |
| PyMuPDF | PDF Processing | Page extraction + font metadata |
| HuggingFace | Text Embeddings | all-mpnet-base-v2 (768-d) |
| CLIP ViT-B-32 | Visual Embeddings | Shared backbone for videos + images (512-d) |
| OpenCV | Frame Extraction | Video → JPEG frames at 1 fps |
| Supabase | Vector Store | pgvector with HNSW index |
| Groq | LLM Inference | Llama 3.1 8B & 3.3 70B + query rewrite/rerank |
| tiktoken | Token Counting | o200k_base encoding |
| Next.js | Frontend | React with Tailwind CSS |
| Pydantic | Validation | Request/response schemas |
| PIL / Pillow | Image Processing | Image loading for CLIP embedding |
| httpx | HTTP Client | Async requests with retry |
Configuration
Pipeline Parameters
| Parameter | Value | Purpose |
|---|---|---|
| Chunk Size | 300 tokens | Context granularity per chunk |
| Chunk Overlap | 50 tokens | Continuity between adjacent chunks |
| Text Embed Dim | 768 | all-mpnet-base-v2 vector output |
| CLIP Embed Dim | 512 | Shared ViT-B-32 for videos + images |
| Frame Interval | 1.0s | Extract 1 frame per second |
| Max Video Size | 500 MB | Video upload file size limit |
| Max Image Size | 50 MB | Image upload file size limit |
| Retrieval top_k | 5 | Max chunks from vector search |
| Relevance Threshold | 0.2 | Min similarity score to keep |
| Dynamic K-Cutoff | 0.8× | Adaptive filtering multiplier |
| LLM Rewrite Count | 3 | Query variants for video search |
| LLM Rerank Pool | 20 | Candidate frames for reranking |
| Image Prompt Weight | 0.3 | Text vs image blend for hybrid search |
| LLM Temperature | 0.7 | Generation randomness control |
| Max Tokens | 500 | Response length hard limit |
| History Turns | 3 | Multi-turn conversation window |
| Header Font H1/H2/H3 | 18/14/12pt | Font-size thresholds for hierarchy |
Performance
Latency Breakdown
Typical end-to-end latency for a single query across all pipeline stages.
Quality Assurance
Output Evaluator Flags
Every response passes through a 4-flag evaluation system that catches hallucinations, refusals, unverified claims, and pricing uncertainty before reaching the user.
- no_context: LLM answered without any retrieved documentation, a potential hallucination risk
- refusal: System declined to answer, with partial-answer detection to avoid false positives
- unverified_feature: Response mentions entities or integrations not found in source chunks
- pricing_uncertainty: Hedging language or conflicting price information detected in response