Document Ingestion
Upload & Vectorize
Drop your PDFs below. Each document is parsed page by page, chunked with contextual headers, embedded into 768-dimensional vectors with all-mpnet-base-v2, and stored in Supabase pgvector.
Drop PDFs here or click to browse
Supports multiple PDF files. Each will be processed through the full ingestion pipeline.
What happens to your files
Ingestion Pipeline
PDF Extraction
PyMuPDF parses each page, extracting raw text with word counts and page metadata.
Header Injection
Font-size analysis (H1>18pt, H2>14pt, H3>12pt) builds hierarchical context prefixes.
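A minimal sketch of that font-size heuristic, assuming spans arrive as `(text, font_size)` pairs in reading order (the helper names are hypothetical):

```python
# Thresholds from the pipeline above: H1 > 18pt, H2 > 14pt, H3 > 12pt
H_LEVELS = [(18.0, 1), (14.0, 2), (12.0, 3)]

def heading_level(font_size):
    for threshold, level in H_LEVELS:
        if font_size > threshold:
            return level
    return None  # body text

def inject_headers(spans):
    """spans: [(text, font_size), ...] in reading order.
    Returns body lines prefixed with the current H1 > H2 > H3 trail."""
    trail = {}  # heading level -> heading text
    out = []
    for text, size in spans:
        lvl = heading_level(size)
        if lvl:
            trail[lvl] = text
            for deeper in [l for l in trail if l > lvl]:
                del trail[deeper]  # a new heading resets deeper levels
        else:
            prefix = " > ".join(trail[l] for l in sorted(trail))
            out.append(f"[{prefix}] {text}" if prefix else text)
    return out
```

The prefix means a chunk pulled out of context still tells the retriever which section it came from.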
Token Chunking
300-token chunks with 50-token overlap, split recursively by paragraphs, then sentences, then words.
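One way to sketch that chunker is a greedy packer over paragraph and sentence pieces, with a word-level sliding window for oversized pieces. Token counts are approximated here by whitespace splitting; the real pipeline presumably counts with an actual tokenizer:

```python
import re

def chunk_text(text, max_tokens=300, overlap=50):
    """Pack paragraph/sentence pieces into chunks of at most max_tokens;
    a single oversized piece falls back to an overlapping word window."""
    def n(s):
        return len(s.split())  # crude token count: whitespace words

    # Split into paragraphs, then sentences (naive split after '.').
    pieces = [s for p in text.split("\n\n")
              for s in re.split(r"(?<=\.)\s+", p) if s.strip()]
    chunks, buf = [], ""
    for piece in pieces:
        if n(piece) > max_tokens:
            if buf:
                chunks.append(buf)
                buf = ""
            # Word-level fallback: sliding window with `overlap` shared tokens
            words, step = piece.split(), max_tokens - overlap
            chunks.extend(" ".join(words[i:i + max_tokens])
                          for i in range(0, len(words), step))
            continue
        cand = f"{buf} {piece}" if buf else piece
        if n(cand) <= max_tokens:
            buf = cand
        else:
            chunks.append(buf)
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk.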
Embedding
HuggingFace all-mpnet-base-v2 encodes each chunk into a 768-dimensional dense vector.
Vector Storage
Vectors are upserted into Supabase Postgres via the pgvector extension and indexed for L2-distance search.
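A possible pgvector schema and L2 index for this store; the `chunks` table and its column names are illustrative, not the app's actual schema:

```sql
-- Hypothetical chunk store (table and column names are assumptions)
create extension if not exists vector;

create table if not exists chunks (
  id        bigint generated by default as identity primary key,
  doc_id    text not null,
  content   text not null,
  embedding vector(768)  -- all-mpnet-base-v2 output dimension
);

-- L2-distance index; vector_l2_ops backs pgvector's <-> operator
create index if not exists chunks_embedding_idx
  on chunks using ivfflat (embedding vector_l2_ops) with (lists = 100);
```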
Ready to Query
Documents are searchable instantly. Ask questions and retrieve context from your knowledge base.
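Retrieval is then a nearest-neighbor query against the same index. Assuming the illustrative `chunks` schema above, with `:query_embedding` standing in for the embedded question:

```sql
-- Top-5 chunks by L2 distance (<-> is pgvector's L2 operator)
select content, embedding <-> :query_embedding as distance
from chunks
order by embedding <-> :query_embedding
limit 5;
```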
Ingestion Config
300t
Chunk Size
50t
Overlap
768-d
Embedding
mpnet-v2
Model