Document Ingestion
Upload & Vectorize
Drop your PDFs below. Each document is parsed page by page, chunked with contextual headers, embedded into 768-dimensional vectors with all-mpnet-base-v2, and stored in Supabase pgvector.
Drop PDFs here or click to browse
Supports multiple PDF files. Each will be processed through the full ingestion pipeline.
What happens to your files
Ingestion Pipeline
PDF Extraction
PyMuPDF parses each page, extracting raw text with word counts and page metadata.
Header Injection
Font-size analysis (H1>18pt, H2>14pt, H3>12pt) builds hierarchical context prefixes.
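A minimal sketch of that font-size heuristic, assuming spans arrive as `(text, font_size)` pairs in reading order (the helper names are hypothetical):

```python
# Thresholds from the pipeline above: H1 > 18pt, H2 > 14pt, H3 > 12pt
H_LEVELS = [(18.0, 1), (14.0, 2), (12.0, 3)]

def heading_level(font_size):
    for threshold, level in H_LEVELS:
        if font_size > threshold:
            return level
    return None  # body text

def inject_headers(spans):
    """spans: [(text, font_size), ...] in reading order.
    Returns body lines prefixed with the current H1 > H2 > H3 trail."""
    trail = {}  # heading level -> heading text
    out = []
    for text, size in spans:
        lvl = heading_level(size)
        if lvl:
            trail[lvl] = text
            for deeper in [l for l in trail if l > lvl]:
                del trail[deeper]  # a new heading resets deeper levels
        else:
            prefix = " > ".join(trail[l] for l in sorted(trail))
            out.append(f"[{prefix}] {text}" if prefix else text)
    return out
```

The prefix means a chunk pulled out of context still tells the retriever which section it came from.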
Token Chunking
300-token chunks with 50-token overlap, split recursively by paragraphs, then sentences, then words.
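One way to sketch that chunker is a greedy packer over paragraph and sentence pieces, with a word-level sliding window for oversized pieces. Token counts are approximated here by whitespace splitting; the real pipeline presumably counts with an actual tokenizer:

```python
import re

def chunk_text(text, max_tokens=300, overlap=50):
    """Pack paragraph/sentence pieces into chunks of at most max_tokens;
    a single oversized piece falls back to an overlapping word window."""
    def n(s):
        return len(s.split())  # crude token count: whitespace words

    # Split into paragraphs, then sentences (naive split after '.').
    pieces = [s for p in text.split("\n\n")
              for s in re.split(r"(?<=\.)\s+", p) if s.strip()]
    chunks, buf = [], ""
    for piece in pieces:
        if n(piece) > max_tokens:
            if buf:
                chunks.append(buf)
                buf = ""
            # Word-level fallback: sliding window with `overlap` shared tokens
            words, step = piece.split(), max_tokens - overlap
            chunks.extend(" ".join(words[i:i + max_tokens])
                          for i in range(0, len(words), step))
            continue
        cand = f"{buf} {piece}" if buf else piece
        if n(cand) <= max_tokens:
            buf = cand
        else:
            chunks.append(buf)
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk.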
Embedding
HuggingFace all-mpnet-base-v2 encodes each chunk into a 768-dimensional dense vector.
Vector Storage
Vectors are upserted into Supabase Postgres via the pgvector extension and indexed for L2-distance search.
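A possible pgvector schema and L2 index for this store; the `chunks` table and its column names are illustrative, not the app's actual schema:

```sql
-- Hypothetical chunk store (table and column names are assumptions)
create extension if not exists vector;

create table if not exists chunks (
  id        bigint generated by default as identity primary key,
  doc_id    text not null,
  content   text not null,
  embedding vector(768)  -- all-mpnet-base-v2 output dimension
);

-- L2-distance index; vector_l2_ops backs pgvector's <-> operator
create index if not exists chunks_embedding_idx
  on chunks using ivfflat (embedding vector_l2_ops) with (lists = 100);
```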
Ready to Query
Documents are searchable instantly. Ask questions and retrieve context from your knowledge base.
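Retrieval is then a nearest-neighbor query against the same index. Assuming the illustrative `chunks` schema above, with `:query_embedding` standing in for the embedded question:

```sql
-- Top-5 chunks by L2 distance (<-> is pgvector's L2 operator)
select content, embedding <-> :query_embedding as distance
from chunks
order by embedding <-> :query_embedding
limit 5;
```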
Ingestion Config
300t
Chunk Size
50t
Overlap
768-d
Embedding
mpnet-v2
Model