Vault

PDFs, Videos, Images.
One search to query them all.

Documents get chunked with contextual headers and embedded into 768-d vectors. Videos get frame-extracted and CLIP-embedded into 512-d vectors. Images are embedded with the same CLIP backbone. All assets land in pgvector — searchable in seconds.

Images

Videos

Scrolldown to Architecture

768-d

Text Embed

512-d

CLIP Embed

300t

Chunk Size

1fps

Frame Rate

top-20

LLM Rerank

pgvec

Vector Store

System Overview

End-to-End Architecture

Three parallel pipelines — document RAG, video semantic search, and image semantic search — share the same pgvector backbone and a unified CLIP embedding engine. Videos and images support LLM-powered query rewriting, result reranking, and cross-modal image-to-video search.

User Query

Natural language question via Web UI

FastAPI Gateway

Async ASGI · Pydantic validation · CORS

POST /query

Conversation Manager

Load last 3 turns from Supabase. UUID-based session isolation. Persists Q&A pairs for multi-turn context.

3-Turn WindowPostgreSQL

Model Router

Deterministic rule engine with 5 rules + OOD filter. Keyword regex, word count, question marks, comparison detection.

8B Simple70B ComplexOOD Skip

Query Embedding

HuggingFace Inference API encodes query into 768-d dense vector. Same model as ingestion for alignment. Batch + retry.

mpnet-v2768-d

Vector Similarity Search

Supabase pgvector RPC: match_chunks(). L2 distance → similarity score. Returns top-5 chunks sorted by relevance.

top_k=5pgvector

~50ms

Dynamic K-Cutoff

Hard threshold > 0.2 removes noise. Then adaptive filter: keep only chunks scoring ≥ 80% of the top result. Prevents "lost in the middle" degradation.

>0.2 threshold0.8× cutoffk=2–5

Prompt Assembly

Layer 1

System Persona

Layer 2

Context Chunks

Layer 3

Conv. History

Layer 4

User Query

Assembled prompt counted via tiktoken (o200k_base) before dispatch. Avg 200–350 input tokens.

LLM Generation

~1800ms

Simple Path

Llama 3.1 8B

Instant inference · Low latency

Complex Path

Llama 3.3 70B

Versatile · Deep reasoning

Groq APITemp 0.7Max 500 tokensSSE Streaming

Output Evaluator

no_context

refusal

unverified

pricing

Checks for hallucination, refusal with partial-answer detection, unverified entities via proper noun extraction, and pricing hedging language.

JSON Response

{
  "answer": "...",
  "metadata": { model, tokens, latency, flags },
  "sources": [ doc, page, score ],
  "conversation_id": "conv_..."
}

Routing Logger

JSONL format with daily rotation & 30-day retention. Logs classification, rule triggered, complexity score, token counts, latency, and evaluator flags for every query.

JSONL30-Day RotationFull Audit

Phase 1

Document Ingestion

PDFs are loaded, parsed page-by-page, chunked with token-aware splitting, injected with contextual headers extracted from font metadata, embedded into 768-d vectors, and stored in Supabase pgvector.

PyMuPDFPage-Level

PDF Extraction

PyMuPDF (fitz) parses documents page-by-page

Extracts raw text with page.get_text() from each page
Tracks filename, page numbers (1-indexed), word counts
Handles corrupted PDFs with graceful error recovery

Font AnalysisHierarchy

Contextual Header Extraction

Font-size analysis injects hierarchical document structure

Uses get_text("dict") to access font metadata per text block
H1 > 18pt · H2 > 14pt · H3 > 12pt — builds a header stack
Prefixes each chunk: [Context: Section > Subsection > ...]
Maintains header hierarchy across page boundaries

300 tokens50 overlapRecursive

Token-Aware Chunking

300-token chunks with 50-token overlap via recursive splitting

Tokenizer: all-mpnet-base-v2 AutoTokenizer for exact counts
Recursive separators: \n\n → \n → ". " → " " → char-level
Overlap decoded from last 50 tokens of previous chunk
Chunk IDs: {filename}_{page}_{index} for deterministic dedup

768-dHuggingFaceBatch

Embedding Generation

sentence-transformers/all-mpnet-base-v2 via HuggingFace Inference

Output: 768-dimensional dense vectors per chunk
Batch processing for efficient API utilization
Exponential backoff: 5 retries, 5s → 60s max delay
Model warmup on startup to avoid cold-start latency

pgvectorSupabaseL2

Vector Storage

Supabase PostgreSQL with pgvector extension

L2 distance converted to similarity: 1 − distance
RPC function: match_chunks(embedding, threshold, count)
Upsert strategy prevents duplicate chunks on re-ingestion
Stores text, metadata, page numbers alongside vectors

Phase 2

Query Pipeline

When a user asks a question, the system classifies complexity, retrieves relevant chunks with adaptive filtering, builds a context-rich prompt, generates via Groq, evaluates output quality, and logs everything.

Rule Engine8B / 70BDeterministic

Query Classification & Model Routing

Deterministic rule-based decision tree — not ML

OOD filter: greetings & meta-questions skip retrieval entirely
Complex triggers: keywords (explain, compare, analyze), length >15 words, multiple "?", comparison words (vs, better, worse)
Simple → Llama 3.1 8B Instant · Complex → Llama 3.3 70B Versatile
Word-boundary regex prevents false positives (e.g., "CSV" ≠ "vs")

top-k=5Threshold 0.280% Cutoff

Retrieval with Dynamic K-Cutoff

Vector search + adaptive filtering prevents 'lost in the middle'

Query embedded with same mpnet-v2 model → 768-d vector
Top-5 chunks fetched from Supabase pgvector via RPC
Hard threshold filter: score > 0.2 removes noise
Dynamic cutoff: only keep chunks ≥ 80% of top score — adaptive k (2–5)

Multi-Turn3-Turn History

Prompt Construction

Multi-layer prompt with system instructions, context, and history

1. System: knowledge assistant persona
2. Context: retrieved chunk texts (2–5 chunks)
3. History: last 3 conversation turns (multi-turn memory)
4. Current question + instruction suffix for grounded answers

GroqSSEtiktoken

LLM Generation

Groq API with streaming SSE and token counting via tiktoken

Temperature: 0.7 · Max tokens: 500 per response
Token counting: tiktoken o200k_base encoding pre & post generation
Streaming: Server-Sent Events yield tokens in real-time
Error handling: structured retries with exponential backoff

4 FlagsAnti-Hallucination

Output Quality Evaluation

4-flag system catches hallucinations, refusals, and uncertainty

no_context — answered without any retrieved documentation
refusal — declined to answer (with partial-answer detection to avoid false positives)
unverified_feature — mentions entities not found in source chunks
pricing_uncertainty — hedging language or conflicting price info

Phase 3

Video Semantic Search

A parallel pipeline for video content. Upload an MP4, MOV, or MKV — frames are extracted, CLIP-embedded into 512-d vectors, and stored in pgvector. Search by text or upload a reference image. Llama 3.3 70B rewrites queries and reranks results for production-grade recall.

500 MBAsync202 Accepted

Video Upload & Validation

File validation, sanitization, and async job dispatch

Accepts MP4, MOV, MKV — up to 500 MB per file
Filename sanitized against path traversal attacks
Returns 202 Accepted with video_id and job_id immediately
Background processing job handles the heavy lifting

OpenCV1 fpsJPEG

Frame Extraction

OpenCV extracts frames at configurable intervals

Default: 1 frame per second across the entire video
Each frame saved as JPEG in {video_id}/frames/ directory
Tracks timestamp (seconds) for seek-to-frame playback
Configurable interval via VIDEO_FRAME_INTERVAL_SEC

CLIP512-dViT-B-32

CLIP Embedding

Frames embedded with clip-ViT-B-32 into 512-d vectors

Two modes: local (sentence-transformers) or server (HuggingFace API)
Automatic fallback from server to local on API failure
CLIP aligns image and text in the same vector space
Enables text-to-image search: "red car on highway" finds that frame

pgvectorHNSWCosine

Frame Vector Storage

Supabase pgvector with HNSW index for fast similarity search

Separate table: video_frame_embeddings with vector(512)
HNSW index for sub-linear search over millions of frames
RPC function: match_video_frames() with cosine distance
Cascade delete: removing a video drops all frames + embeddings

Text→ImageSeek-to-Frametop_k

Natural Language Frame Search

Type a query, find the exact video moment

Query text embedded with CLIP into the same 512-d space
Cosine similarity ranks frames by semantic relevance
Results include thumbnail, timestamp, similarity score
Click a result to open the video player seeked to that frame

Llama 70B3 RewritesGroq

LLM Query Rewriting

Llama 3.3 70B expands queries into 3 alternative formulations

Original query rewritten via Groq into visually descriptive variants
Each variant embedded separately with CLIP and searched in parallel
Results merged and deduplicated across all query variants
Catches synonyms and phrasings the original query would miss

top-20SemanticRerank

LLM Result Reranking

Top-20 candidates reranked by semantic understanding

Collects top candidates from all query variants into a pool
LLM evaluates query intent against each frame’s context
Reorders results by true relevance, not just cosine distance
Configurable candidate count via VIDEO_LLM_RERANK_CANDIDATES

Cross-Modal0.3 BlendHybrid

Image-to-Video Search

Upload a reference image to find matching video frames

Reference image embedded with CLIP into the same 512-d space
Optional text prompt blended with image embedding (0.3 weight)
Formula: blended = (1−w) × image_vec + w × text_vec
Enables "find frames that look like this photo" workflows

Phase 4

Image Semantic Search

Upload JPG, PNG, or WebP images — each is embedded with the same CLIP ViT-B-32 backbone into 512-d vectors and stored in pgvector. Search by natural language to find visually matching images across your collection.

50 MBAsyncJPG/PNG/WebP

Image Upload & Validation

File validation, sanitization, and async processing

Accepts JPG, PNG, WebP — up to 50 MB per file
Filename sanitized against path traversal attacks
Returns 202 Accepted with image_id and job_id immediately
Background job handles embedding without blocking the API

CLIP512-dShared Engine

CLIP Image Embedding

Whole-image embedding via shared clip-ViT-B-32 engine

Same CLIP backbone as video frames — shared 512-d vector space
Two modes: local (sentence-transformers) or server (HuggingFace API)
Automatic fallback from server to local on API failure
No frame extraction needed — the entire image is embedded directly

pgvectorHNSWCosine

Image Vector Storage

Supabase pgvector with HNSW index and cascade deletes

Separate table: image_embeddings with vector(512)
HNSW index for sub-linear cosine similarity search
RPC function: match_images() with configurable threshold
Cascade delete: removing an image drops its embedding automatically

Text→Imagetop_kLightbox

Natural Language Image Search

Type a query, find visually matching images

Query text embedded with CLIP into the same 512-d space
Cosine similarity ranks images by semantic relevance
Results include thumbnails and similarity percentage
Click a result to view the full-resolution image in a lightbox

Infrastructure

Technology Stack

FastAPI

API Gateway

Async ASGI with Uvicorn

PyMuPDF

PDF Processing

Page extraction + font metadata

HuggingFace

Text Embeddings

all-mpnet-base-v2 (768-d)

CLIP ViT-B-32

Visual Embeddings

Shared backbone for videos + images (512-d)

OpenCV

Frame Extraction

Video → JPEG frames at 1 fps

Supabase

Vector Store

pgvector with HNSW index

Groq

LLM Inference

Llama 3.1 8B & 3.3 70B + query rewrite/rerank

tiktoken

Token Counting

o200k_base encoding

Next.js

Frontend

React with Tailwind CSS

Pydantic

Validation

Request/response schemas

PIL / Pillow

Image Processing

Image loading for CLIP embedding

httpx

HTTP Client

Async requests with retry

Configuration

Pipeline Parameters

Parameter	Value	Purpose
Chunk Size	300 tokens	Context granularity per chunk
Chunk Overlap	50 tokens	Continuity between adjacent chunks
Text Embed Dim	768	all-mpnet-base-v2 vector output
CLIP Embed Dim	512	Shared ViT-B-32 for videos + images
Frame Interval	1.0s	Extract 1 frame per second
Max Video Size	500 MB	Video upload file size limit
Max Image Size	50 MB	Image upload file size limit
Retrieval top_k	5	Max chunks from vector search
Relevance Threshold	0.2	Min similarity score to keep
Dynamic K-Cutoff	0.8×	Adaptive filtering multiplier
LLM Rewrite Count	3	Query variants for video search
LLM Rerank Pool	20	Candidate frames for reranking
Image Prompt Weight	0.3	Text vs image blend for hybrid search
LLM Temperature	0.7	Generation randomness control
Max Tokens	500	Response length hard limit
History Turns	3	Multi-turn conversation window
Header Font H1/H2/H3	18/14/12pt	Font-size thresholds for hierarchy

Performance

Latency Breakdown

Typical end-to-end latency for a single query across all pipeline stages.

Conversation Check

5ms

Query Embedding (HF API)

800ms

Vector Search (pgvector)

50ms

Dynamic K-Cutoff

5ms

Prompt Construction

10ms

Token Counting

5ms

LLM Generation (Groq)

1800ms

Output Evaluation

20ms

Logging + Response

15ms

Total

~2710ms

Quality Assurance

Output Evaluator Flags

Every response passes through a 4-flag evaluation system that catches hallucinations, refusals, unverified claims, and pricing uncertainty before reaching the user.

no_context

LLM answered without any retrieved documentation — potential hallucination risk

refusal

System declined to answer, with partial-answer detection to avoid false positives

unverified_feature

Response mentions entities or integrations not found in source chunks

pricing_uncertainty

Hedging language or conflicting price information detected in response

See Vault in action

Upload a PDF, image, or video, ask a question, and watch the pipelines execute — text chunks and visual assets, embedded and retrieved with live telemetry on every response.

Images

Videos

PDFs, Videos, Images.One search to query them all.

End-to-End Architecture

Document Ingestion

PDF Extraction

Contextual Header Extraction

Token-Aware Chunking

Embedding Generation

Vector Storage

Query Pipeline

Query Classification & Model Routing

Retrieval with Dynamic K-Cutoff

Prompt Construction

LLM Generation

Output Quality Evaluation

Video Semantic Search

Video Upload & Validation

Frame Extraction

CLIP Embedding

Frame Vector Storage

Natural Language Frame Search

LLM Query Rewriting

LLM Result Reranking

Image-to-Video Search

Image Semantic Search

Image Upload & Validation

CLIP Image Embedding

Image Vector Storage

Natural Language Image Search

Technology Stack

Pipeline Parameters

Latency Breakdown

Output Evaluator Flags

See Vault in action

PDFs, Videos, Images.
One search to query them all.