Vault

PDFs, Videos, Images.
One search to query them all.

Documents get chunked with contextual headers and embedded into 768-d vectors. Videos get frame-extracted and CLIP-embedded into 512-d vectors. Images are embedded with the same CLIP backbone. All assets land in pgvector — searchable in seconds.

Scrolldown to Architecture

768-d

Text Embed

512-d

CLIP Embed

300t

Chunk Size

1fps

Frame Rate

top-20

LLM Rerank

pgvec

Vector Store

System Overview

End-to-End Architecture

Three parallel pipelines — document RAG, video semantic search, and image semantic search — share the same pgvector backbone and a unified CLIP embedding engine. Videos and images support LLM-powered query rewriting, result reranking, and cross-modal image-to-video search.

User Query

Natural language question via Web UI

FastAPI Gateway

Async ASGI · Pydantic validation · CORS

POST /query

Conversation Manager

Load last 3 turns from Supabase. UUID-based session isolation. Persists Q&A pairs for multi-turn context.

3-Turn WindowPostgreSQL

Model Router

Deterministic rule engine with 5 rules + OOD filter. Keyword regex, word count, question marks, comparison detection.

8B Simple70B ComplexOOD Skip

Query Embedding

HuggingFace Inference API encodes query into 768-d dense vector. Same model as ingestion for alignment. Batch + retry.

mpnet-v2768-d

Vector Similarity Search

Supabase pgvector RPC: match_chunks(). L2 distance → similarity score. Returns top-5 chunks sorted by relevance.

top_k=5pgvector
~50ms

Dynamic K-Cutoff

Hard threshold > 0.2 removes noise. Then adaptive filter: keep only chunks scoring ≥ 80% of the top result. Prevents "lost in the middle" degradation.

>0.2 threshold0.8× cutoffk=2–5

Prompt Assembly

Layer 1

System Persona

Layer 2

Context Chunks

Layer 3

Conv. History

Layer 4

User Query

Assembled prompt counted via tiktoken (o200k_base) before dispatch. Avg 200–350 input tokens.

LLM Generation

~1800ms

Simple Path

Llama 3.1 8B

Instant inference · Low latency

Complex Path

Llama 3.3 70B

Versatile · Deep reasoning

Groq APITemp 0.7Max 500 tokensSSE Streaming

Output Evaluator

no_context

refusal

unverified

pricing

Checks for hallucination, refusal with partial-answer detection, unverified entities via proper noun extraction, and pricing hedging language.

JSON Response

{
  "answer": "...",
  "metadata": { model, tokens, latency, flags },
  "sources": [ doc, page, score ],
  "conversation_id": "conv_..."
}

Routing Logger

JSONL format with daily rotation & 30-day retention. Logs classification, rule triggered, complexity score, token counts, latency, and evaluator flags for every query.

JSONL30-Day RotationFull Audit

Phase 1

Document Ingestion

PDFs are loaded, parsed page-by-page, chunked with token-aware splitting, injected with contextual headers extracted from font metadata, embedded into 768-d vectors, and stored in Supabase pgvector.

01
PyMuPDFPage-Level

PDF Extraction

PyMuPDF (fitz) parses documents page-by-page

  • Extracts raw text with page.get_text() from each page
  • Tracks filename, page numbers (1-indexed), word counts
  • Handles corrupted PDFs with graceful error recovery
02
Font AnalysisHierarchy

Contextual Header Extraction

Font-size analysis injects hierarchical document structure

  • Uses get_text("dict") to access font metadata per text block
  • H1 > 18pt · H2 > 14pt · H3 > 12pt — builds a header stack
  • Prefixes each chunk: [Context: Section > Subsection > ...]
  • Maintains header hierarchy across page boundaries
03
300 tokens50 overlapRecursive

Token-Aware Chunking

300-token chunks with 50-token overlap via recursive splitting

  • Tokenizer: all-mpnet-base-v2 AutoTokenizer for exact counts
  • Recursive separators: \n\n → \n → ". " → " " → char-level
  • Overlap decoded from last 50 tokens of previous chunk
  • Chunk IDs: {filename}_{page}_{index} for deterministic dedup
04
768-dHuggingFaceBatch

Embedding Generation

sentence-transformers/all-mpnet-base-v2 via HuggingFace Inference

  • Output: 768-dimensional dense vectors per chunk
  • Batch processing for efficient API utilization
  • Exponential backoff: 5 retries, 5s → 60s max delay
  • Model warmup on startup to avoid cold-start latency
05
pgvectorSupabaseL2

Vector Storage

Supabase PostgreSQL with pgvector extension

  • L2 distance converted to similarity: 1 − distance
  • RPC function: match_chunks(embedding, threshold, count)
  • Upsert strategy prevents duplicate chunks on re-ingestion
  • Stores text, metadata, page numbers alongside vectors

Phase 2

Query Pipeline

When a user asks a question, the system classifies complexity, retrieves relevant chunks with adaptive filtering, builds a context-rich prompt, generates via Groq, evaluates output quality, and logs everything.

06
Rule Engine8B / 70BDeterministic

Query Classification & Model Routing

Deterministic rule-based decision tree — not ML

  • OOD filter: greetings & meta-questions skip retrieval entirely
  • Complex triggers: keywords (explain, compare, analyze), length >15 words, multiple "?", comparison words (vs, better, worse)
  • Simple → Llama 3.1 8B Instant · Complex → Llama 3.3 70B Versatile
  • Word-boundary regex prevents false positives (e.g., "CSV" ≠ "vs")
07
top-k=5Threshold 0.280% Cutoff

Retrieval with Dynamic K-Cutoff

Vector search + adaptive filtering prevents 'lost in the middle'

  • Query embedded with same mpnet-v2 model → 768-d vector
  • Top-5 chunks fetched from Supabase pgvector via RPC
  • Hard threshold filter: score > 0.2 removes noise
  • Dynamic cutoff: only keep chunks ≥ 80% of top score — adaptive k (2–5)
08
Multi-Turn3-Turn History

Prompt Construction

Multi-layer prompt with system instructions, context, and history

  • 1. System: knowledge assistant persona
  • 2. Context: retrieved chunk texts (2–5 chunks)
  • 3. History: last 3 conversation turns (multi-turn memory)
  • 4. Current question + instruction suffix for grounded answers
09
GroqSSEtiktoken

LLM Generation

Groq API with streaming SSE and token counting via tiktoken

  • Temperature: 0.7 · Max tokens: 500 per response
  • Token counting: tiktoken o200k_base encoding pre & post generation
  • Streaming: Server-Sent Events yield tokens in real-time
  • Error handling: structured retries with exponential backoff
10
4 FlagsAnti-Hallucination

Output Quality Evaluation

4-flag system catches hallucinations, refusals, and uncertainty

  • no_context — answered without any retrieved documentation
  • refusal — declined to answer (with partial-answer detection to avoid false positives)
  • unverified_feature — mentions entities not found in source chunks
  • pricing_uncertainty — hedging language or conflicting price info

Phase 3

Video Semantic Search

A parallel pipeline for video content. Upload an MP4, MOV, or MKV — frames are extracted, CLIP-embedded into 512-d vectors, and stored in pgvector. Search by text or upload a reference image. Llama 3.3 70B rewrites queries and reranks results for production-grade recall.

11
500 MBAsync202 Accepted

Video Upload & Validation

File validation, sanitization, and async job dispatch

  • Accepts MP4, MOV, MKV — up to 500 MB per file
  • Filename sanitized against path traversal attacks
  • Returns 202 Accepted with video_id and job_id immediately
  • Background processing job handles the heavy lifting
12
OpenCV1 fpsJPEG

Frame Extraction

OpenCV extracts frames at configurable intervals

  • Default: 1 frame per second across the entire video
  • Each frame saved as JPEG in {video_id}/frames/ directory
  • Tracks timestamp (seconds) for seek-to-frame playback
  • Configurable interval via VIDEO_FRAME_INTERVAL_SEC
13
CLIP512-dViT-B-32

CLIP Embedding

Frames embedded with clip-ViT-B-32 into 512-d vectors

  • Two modes: local (sentence-transformers) or server (HuggingFace API)
  • Automatic fallback from server to local on API failure
  • CLIP aligns image and text in the same vector space
  • Enables text-to-image search: "red car on highway" finds that frame
14
pgvectorHNSWCosine

Frame Vector Storage

Supabase pgvector with HNSW index for fast similarity search

  • Separate table: video_frame_embeddings with vector(512)
  • HNSW index for sub-linear search over millions of frames
  • RPC function: match_video_frames() with cosine distance
  • Cascade delete: removing a video drops all frames + embeddings
15
Text→ImageSeek-to-Frametop_k

Natural Language Frame Search

Type a query, find the exact video moment

  • Query text embedded with CLIP into the same 512-d space
  • Cosine similarity ranks frames by semantic relevance
  • Results include thumbnail, timestamp, similarity score
  • Click a result to open the video player seeked to that frame
16
Llama 70B3 RewritesGroq

LLM Query Rewriting

Llama 3.3 70B expands queries into 3 alternative formulations

  • Original query rewritten via Groq into visually descriptive variants
  • Each variant embedded separately with CLIP and searched in parallel
  • Results merged and deduplicated across all query variants
  • Catches synonyms and phrasings the original query would miss
17
top-20SemanticRerank

LLM Result Reranking

Top-20 candidates reranked by semantic understanding

  • Collects top candidates from all query variants into a pool
  • LLM evaluates query intent against each frame’s context
  • Reorders results by true relevance, not just cosine distance
  • Configurable candidate count via VIDEO_LLM_RERANK_CANDIDATES
18
Cross-Modal0.3 BlendHybrid

Image-to-Video Search

Upload a reference image to find matching video frames

  • Reference image embedded with CLIP into the same 512-d space
  • Optional text prompt blended with image embedding (0.3 weight)
  • Formula: blended = (1−w) × image_vec + w × text_vec
  • Enables "find frames that look like this photo" workflows

Phase 4

Image Semantic Search

Upload JPG, PNG, or WebP images — each is embedded with the same CLIP ViT-B-32 backbone into 512-d vectors and stored in pgvector. Search by natural language to find visually matching images across your collection.

19
50 MBAsyncJPG/PNG/WebP

Image Upload & Validation

File validation, sanitization, and async processing

  • Accepts JPG, PNG, WebP — up to 50 MB per file
  • Filename sanitized against path traversal attacks
  • Returns 202 Accepted with image_id and job_id immediately
  • Background job handles embedding without blocking the API
20
CLIP512-dShared Engine

CLIP Image Embedding

Whole-image embedding via shared clip-ViT-B-32 engine

  • Same CLIP backbone as video frames — shared 512-d vector space
  • Two modes: local (sentence-transformers) or server (HuggingFace API)
  • Automatic fallback from server to local on API failure
  • No frame extraction needed — the entire image is embedded directly
21
pgvectorHNSWCosine

Image Vector Storage

Supabase pgvector with HNSW index and cascade deletes

  • Separate table: image_embeddings with vector(512)
  • HNSW index for sub-linear cosine similarity search
  • RPC function: match_images() with configurable threshold
  • Cascade delete: removing an image drops its embedding automatically
22
Text→Imagetop_kLightbox

Natural Language Image Search

Type a query, find visually matching images

  • Query text embedded with CLIP into the same 512-d space
  • Cosine similarity ranks images by semantic relevance
  • Results include thumbnails and similarity percentage
  • Click a result to view the full-resolution image in a lightbox

Infrastructure

Technology Stack

FastAPI

API Gateway

Async ASGI with Uvicorn

PyMuPDF

PDF Processing

Page extraction + font metadata

HuggingFace

Text Embeddings

all-mpnet-base-v2 (768-d)

CLIP ViT-B-32

Visual Embeddings

Shared backbone for videos + images (512-d)

OpenCV

Frame Extraction

Video → JPEG frames at 1 fps

Supabase

Vector Store

pgvector with HNSW index

Groq

LLM Inference

Llama 3.1 8B & 3.3 70B + query rewrite/rerank

tiktoken

Token Counting

o200k_base encoding

Next.js

Frontend

React with Tailwind CSS

Pydantic

Validation

Request/response schemas

PIL / Pillow

Image Processing

Image loading for CLIP embedding

httpx

HTTP Client

Async requests with retry

Configuration

Pipeline Parameters

ParameterValuePurpose
Chunk Size300 tokensContext granularity per chunk
Chunk Overlap50 tokensContinuity between adjacent chunks
Text Embed Dim768all-mpnet-base-v2 vector output
CLIP Embed Dim512Shared ViT-B-32 for videos + images
Frame Interval1.0sExtract 1 frame per second
Max Video Size500 MBVideo upload file size limit
Max Image Size50 MBImage upload file size limit
Retrieval top_k5Max chunks from vector search
Relevance Threshold0.2Min similarity score to keep
Dynamic K-Cutoff0.8×Adaptive filtering multiplier
LLM Rewrite Count3Query variants for video search
LLM Rerank Pool20Candidate frames for reranking
Image Prompt Weight0.3Text vs image blend for hybrid search
LLM Temperature0.7Generation randomness control
Max Tokens500Response length hard limit
History Turns3Multi-turn conversation window
Header Font H1/H2/H318/14/12ptFont-size thresholds for hierarchy

Performance

Latency Breakdown

Typical end-to-end latency for a single query across all pipeline stages.

Conversation Check
5ms
Query Embedding (HF API)
800ms
Vector Search (pgvector)
50ms
Dynamic K-Cutoff
5ms
Prompt Construction
10ms
Token Counting
5ms
LLM Generation (Groq)
1800ms
Output Evaluation
20ms
Logging + Response
15ms
Total
~2710ms

Quality Assurance

Output Evaluator Flags

Every response passes through a 4-flag evaluation system that catches hallucinations, refusals, unverified claims, and pricing uncertainty before reaching the user.

no_context

LLM answered without any retrieved documentation — potential hallucination risk

refusal

System declined to answer, with partial-answer detection to avoid false positives

unverified_feature

Response mentions entities or integrations not found in source chunks

pricing_uncertainty

Hedging language or conflicting price information detected in response

See Vault in action

Upload a PDF, image, or video, ask a question, and watch the pipelines execute — text chunks and visual assets, embedded and retrieved with live telemetry on every response.