SHED — SiteHub for Extracting Data
v0.2  ·  AI-powered document extraction  ·  Danish document support

SHED is a local web tool for extracting structured data from construction documents (Delivery Notes, Invoices, Certificates, and more) using AI vision models. Upload a scanned PDF or image, run the extraction command, and get all fields populated as structured JSON — ready to review, edit, and compare side-by-side with the original document. Field schemas are configurable per doc type in the FIELDS tab; a 57-field Delivery Note schema is included as the default.

Extraction Pipeline

Each document passes through a preprocessing pipeline before any AI is involved, improving accuracy and minimising token cost.

Full pipeline — upload → smart PDF routing → extract → structured JSON out
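The smart PDF routing step can be sketched as follows. Function names and the decision logic are illustrative assumptions, not the actual SHED code; only the per-source hint limits (800 characters for Tesseract, 2000 for pdfplumber, described under Token Cost Optimisations) come from this document.

```python
def has_text_layer(path: str) -> bool:
    # Stub for this sketch; a real check might ask pdfplumber whether
    # the first page yields extractable text.
    return False

def route(path: str) -> dict:
    """Pick a preprocessing source for a document (illustrative)."""
    if path.lower().endswith(".pdf") and has_text_layer(path):
        # Digital PDF: read the text layer directly with pdfplumber.
        return {"source": "pdfplumber", "hint_chars": 2000}
    # Scanned PDF or plain image: rasterise and run Tesseract OCR.
    return {"source": "tesseract", "hint_chars": 800}
```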
AI Models & Providers

The extraction model and prescreen model are independently configurable in ⚙ Settings. Each slot can use a different provider — useful for mixing a free-tier Gemini Flash prescreen with a Claude CLI extraction, for example.

Claude CLI (default)
Extraction: sonnet-4-6 · Prescreen: haiku-4-5
Calls claude -p <prompt> --allowedTools Read as a subprocess. Requires a Claude Code CLI login. No API key needed. Images passed via the Read tool.
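A minimal sketch of that subprocess call, assuming the prompt simply names the image file for the Read tool. The wrapper functions and prompt layout are illustrative; only the flags are the ones quoted above.

```python
import subprocess

def build_command(prompt: str) -> list[str]:
    # Flags as described: claude -p <prompt> --allowedTools Read
    return ["claude", "-p", prompt, "--allowedTools", "Read"]

def extract_via_cli(prompt: str, image_path: str) -> str:
    # The image path is embedded in the prompt so the model can open it
    # with its Read tool; requires a Claude Code CLI login, no API key.
    cmd = build_command(f"{prompt}\n\nImage file: {image_path}")
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```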
Claude API
Extraction: sonnet-4-6 · Prescreen: haiku-4-5
Uses the Anthropic Python SDK with base64-encoded images. Requires a Claude API key (no free tier). Up to 20 pages per request.
OpenAI
Extraction: gpt-4o · Prescreen: gpt-4o-mini
Uses the OpenAI Python SDK with base64 image URLs. Requires an OpenAI API key (no free tier). Up to 10 pages per request.
Gemini (free tier available)
Extraction: gemini-2.5-flash · Prescreen: gemini-2.5-flash (auto)
Uses the Google GenAI SDK. PDFs are passed natively (no rasterisation needed). gemini-2.5-flash is available on the free tier. Up to 16 pages (or native PDF, unlimited).
Mistral (free tier available)
Extraction: pixtral-large · Prescreen: pixtral-12b
Uses the Mistral Python SDK with base64-encoded images. Free tier available (rate-limited). Up to 8 pages per request.
Output format
Raw JSON — inline { value, conf }
Each found field is {"value": "...", "conf": 0.97}; not-found fields are null. meta includes fill_rate and avg_confidence. No markdown, no code fences. On failure, extract() raises ExtractionError or ClassificationError instead of calling sys.exit() — making it safely callable from Python code (API routes, tests, batch scripts).
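The shape can be illustrated with a two-field toy document. Field names and values here are made up; only the {value, conf} / null / meta structure is taken from the description above.

```python
import json

raw = '''{
  "supplier_name": {"value": "Dansk Beton A/S", "conf": 0.97},
  "delivery_date": null,
  "meta": {"fill_rate": 0.5, "avg_confidence": 0.97}
}'''
doc = json.loads(raw)

# Found fields carry value + conf; not-found fields are plain null.
assert doc["supplier_name"]["conf"] == 0.97
assert doc["delivery_date"] is None
assert doc["meta"]["fill_rate"] == 0.5
```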
Token Cost Optimisations

Image tokens dominate the cost per extraction. Four targeted optimisations bring the total down by roughly 70–75% compared to a naive implementation, with no loss in extraction quality.

No upscale for Claude
~50% fewer image tokens
Claude's vision model downsamples large images internally. Sending a 2× upscaled image just pays for pixels that get discarded. The upscale is kept for Tesseract only, which genuinely needs pixel density.
1200px resolution cap
~60% fewer image tokens
Image token cost scales with pixel count. Capping at 1200px on the longest side is sufficient for Claude to read all text on a delivery note. A 200 dpi A4 scan drops from ~5,100 to ~1,350 image tokens.
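The arithmetic behind those numbers can be checked against Anthropic's published rule of thumb of roughly width × height / 750 image tokens; the helper names below are illustrative.

```python
def capped_size(w: int, h: int, max_side: int = 1200) -> tuple[int, int]:
    # Scale down so the longest side is at most max_side pixels.
    longest = max(w, h)
    if longest <= max_side:
        return w, h
    s = max_side / longest
    return round(w * s), round(h * s)

def claude_image_tokens(w: int, h: int) -> int:
    # Anthropic's documented approximation: tokens ~ (w * h) / 750.
    return round(w * h / 750)

# A 200 dpi A4 scan is roughly 1654 x 2339 px.
assert claude_image_tokens(1654, 2339) == 5158                 # ~5,100 before
assert claude_image_tokens(*capped_size(1654, 2339)) == 1358   # ~1,350 after
```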
OCR hint truncated by source
~300–600 prompt tokens
800 chars for Tesseract (scanned PDFs/images) · 2000 chars for pdfplumber (digital PDFs). A dense A4 page produces 2,000+ characters of Tesseract output; the first 800 chars cover the most useful content. Digital PDFs with clean text layers can usefully provide more context, hence the higher limit.
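As a sketch (the function name and dict are illustrative; the limits are the ones stated above):

```python
HINT_LIMITS = {"pdfplumber": 2000, "tesseract": 800}

def truncate_hint(ocr_text: str, source: str) -> str:
    # Digital PDFs (pdfplumber) get a longer hint than raw OCR output.
    return ocr_text[:HINT_LIMITS[source]]
```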
Per-provider cost estimation
Provider-specific pricing
Cost is calculated using provider-specific rates, not always Sonnet pricing. Claude (sonnet-4-6): $3.00 / $15.00 per 1M tokens. OpenAI (gpt-4o): $2.50 / $10.00. Gemini: $0.00 / $0.00 (free tier). Mistral: $2.00 / $6.00. Gemini extractions show $0 cost in the UI.
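The per-provider calculation reduces to a small lookup; the dict shape and function name are illustrative, the rates are the ones listed above.

```python
# (input, output) USD per 1M tokens, from the rates listed above.
RATES = {
    "claude": (3.00, 15.00),
    "openai": (2.50, 10.00),
    "gemini": (0.00, 0.00),
    "mistral": (2.00, 6.00),
}

def estimate_cost(provider: str, in_tokens: int, out_tokens: int) -> float:
    rate_in, rate_out = RATES[provider]
    return (in_tokens * rate_in + out_tokens * rate_out) / 1_000_000
```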
Confidence inline, not a separate block
~600 prompt tokens
A separate _confidence schema block duplicated all 57 field names a second time. Instead, one instruction sentence tells the model to return {"value": "...", "conf": 0.0–1.0} per field. Same output, half the schema tokens.
COST BEFORE: ~$0.06–0.11 / doc
COST AFTER: ~$0.02–0.04 / doc
MODEL: claude-sonnet-4-6
TRACKED: ~tokens + ~cost per doc
Review, Annotation & Workflow

After extraction, every field can be reviewed and annotated directly in the browser without touching the JSON file. Documents move through workflow phases as they are reviewed and approved.

Confidence dots
● green / amber / red
Each filled field shows a coloured dot based on the model's self-reported confidence: green ≥ 85%, amber 60–84%, red < 60%. Hover to see the exact percentage. Low confidence = worth checking against the original document.
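The mapping from confidence to dot colour is a simple threshold check; the function name is illustrative, the thresholds are the ones stated above.

```python
def dot_colour(conf: float) -> str:
    # Thresholds from the text: green >= 85%, amber 60-84%, red < 60%.
    if conf >= 0.85:
        return "green"
    if conf >= 0.60:
        return "amber"
    return "red"
```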
Fill rate & avg confidence
Shown in meta bar
fill_rate is the fraction of all schema fields that have a non-null value. avg_confidence is the mean confidence across all filled fields. Both are computed in Python after extraction and stored in meta.
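A sketch of that post-extraction computation, assuming fields arrive as the { value, conf } / null dict described under Output format (the function name is illustrative):

```python
def compute_meta(fields: dict) -> dict:
    # fill_rate counts every schema field; avg_confidence only filled ones.
    filled = [f for f in fields.values() if f is not None]
    fill_rate = len(filled) / len(fields) if fields else 0.0
    avg_conf = sum(f["conf"] for f in filled) / len(filled) if filled else 0.0
    return {"fill_rate": fill_rate, "avg_confidence": avg_conf}
```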
Error type flags
wrong value · missing · OCR error · hallucinated
Each field has a popup editor (hover → ✎) to flag extraction errors. The flag is stored as "error": "wrong_value" inside the field object. Flagged fields show a coloured badge in view mode. Material rows have a row-level flag.
Inline editing
Per-field popup editor
Any field value can be corrected via the ✎ popup. Shows the original extracted value (read-only), a corrected value input, and an error type dropdown. Saves immediately via PUT — no global edit mode, no terminal needed.
Workflow Phases
PENDING → EXTRACTED → IN REVIEW → APPROVED
Each document progresses through phases. Status is stored in meta.status. Phase tabs in the left panel filter the document list. The workflow bar at the bottom of each FIELDS view advances or resets the phase with one click.
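The one-click advance can be sketched as a linear state machine. The phase identifiers below are illustrative spellings of the labels above; the actual values stored in meta.status may differ.

```python
PHASES = ["PENDING", "EXTRACTED", "IN_REVIEW", "APPROVED"]

def advance(status: str) -> str:
    # Move meta.status one phase forward; APPROVED is terminal here.
    i = PHASES.index(status)
    return PHASES[min(i + 1, len(PHASES) - 1)]
```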
Search & Filter
Phase tabs · filename search · type dropdown
Three filters stack together in the left panel. Phase tabs narrow by workflow status. The filename search filters live by typing. The doc-type dropdown shows only extracted document types (e.g. "Delivery Note") and updates as documents are extracted.
Analytics

The ANALYTICS tab gives a live overview of extraction quality across all loaded documents, split into four sub-tabs. All metrics are computed client-side from the JSON already in memory; no extra server calls are made.

CSV exports (docs and materials) have moved to the TOOLS tab.

DOCUMENT TOOLS

The TOOLS tab merges and splits uploaded documents without re-uploading, and hosts the CSV exports.

Merge documents
Select the files to merge; PDF and image sources are both supported, and the output is a PDF if any input is a PDF. Drag entries in the merge-order list to reorder them, then set an output filename (the extension is added automatically).

Split PDF
Select a source PDF, then click the dividers between pages to mark split points (amber = split here).

Export data
Export extracted data as CSV, filtered by workflow phase before exporting.