The Problem With Raw HTML
The web is the world's largest data source. It is also one of the messiest. Every website has a different structure, different navigation, different boilerplate. Getting from a URL to something an LLM can actually use (without hallucinations, without noise, without manual cleaning) is surprisingly hard to do at scale.
Spectre Extract solves this. It is the core extraction engine inside Project Spectre: give it a URL and a crawl configuration, and it returns clean, structured data ready for any downstream use.
What It Extracts
For every page it processes, Spectre Extract returns:
- Title: the document title from the `<title>` tag and the primary `<h1>`
- Heading structure: H2 and H3 headings, preserving the semantic outline of the page
- Clean body text: main content stripped of navigation menus, footers, ads, cookie banners, and other boilerplate
- Internal links: all links pointing to other pages on the same domain, with anchor text
- External links: all links pointing to other domains, which feed automatically into the Spectre Hopper for further investigation
- Word count: useful for filtering out pages too short to be substantive
- Crawl depth: how many hops from the starting URL this page was discovered at
- Metadata: language, canonical URL, and Open Graph data where available
Output Formats
Spectre Extract returns data in two formats simultaneously:
JSON: a structured object with all fields as named properties. Ideal for programmatic processing, vector database ingestion, or piping into another API.
Markdown: a clean prose version of the page content with headings preserved. Ideal for direct LLM ingestion; paste it into a prompt and the model sees a clean document, not a wall of HTML.
Both formats are returned in the same response. You choose which one to use downstream, or you use both.
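As a rough sketch, the combined response can be modeled with a type like the one below. The property names are assumptions for illustration, not the documented schema; the fields correspond to the list in What It Extracts above.

```typescript
// Illustrative shape only: property names are assumptions, not Spectre's documented schema.
interface SpectreExtractPage {
  title: string;            // from the <title> tag and the primary <h1>
  headings: string[];       // H2 and H3 headings in document order
  text: string;             // clean body text, boilerplate stripped
  internalLinks: { href: string; anchorText: string }[];
  externalLinks: { href: string; anchorText: string }[];
  wordCount: number;
  crawlDepth: number;       // hops from the starting URL
  metadata: {
    language?: string;
    canonicalUrl?: string;
    openGraph?: Record<string, string>;
  };
  markdown: string;         // the Markdown rendering of the same content
}
```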
Crawl Configuration
Spectre Extract is not just a single-page scraper; it can crawl an entire domain or a defined section of one.
Depth control
Set a depth parameter to control how many hops from the starting URL the crawl follows. depth=1 extracts only the starting page. depth=3 follows links three levels deep. Setting no depth limit crawls the entire reachable domain.
Page budget
Set a maxPages limit to cap the total number of pages extracted. Useful for large sites where you want a representative sample rather than an exhaustive crawl, or where you have cost constraints on downstream LLM processing.
Section targeting
Start from an internal page (e.g. https://example.com/docs/) and the crawl stays within that section of the site: it follows links but respects the starting path as a boundary.
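Putting the three controls together, a crawl configuration might look like the sketch below. depth and maxPages are the parameters described above; the object shape and the startUrl name are illustrative assumptions.

```typescript
// Hypothetical configuration object combining the controls described above.
const crawlConfig = {
  startUrl: "https://example.com/docs/", // crawl stays within the /docs/ section
  depth: 3,                              // follow links up to three hops deep
  maxPages: 50,                          // stop after 50 pages regardless of depth
};
```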
The Site Map Tree
As Spectre Extract discovers pages, it builds a site map tree in real time. The tree shows the hierarchical structure of the crawl: which pages were discovered from which parent pages, and at which depth.
The site map is available in the Spectre dashboard and via the API. It gives you a structural overview of the target domain before you decide which sections to process further.
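Conceptually, the tree is a simple parent-child structure. A minimal sketch of one node might look like this; the field names are assumptions, not the API's schema.

```typescript
// Illustrative node shape for the site map tree.
interface SiteMapNode {
  url: string;
  depth: number;           // hops from the starting URL
  children: SiteMapNode[]; // pages discovered from this page
}
```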
How to Call It
Spectre Extract is accessible in three ways:
Via MCP tool
mcp_spectre_extract is available to any agent with Spectre access. Your Librarian can call it mid-research, crawl a source, and incorporate the content into its synthesis, all within a single workflow step.
```
Tool: mcp_spectre_extract
Parameters:
  url: "https://example.com"
  depth: 2
  maxPages: 20
  format: "markdown"
```
Via the API
Authenticated POST request to /sovereignty/api/spectre/extract with your API key. The response is the same structured JSON object.
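A minimal call might look like the sketch below. The endpoint path is the one documented above; the host, the Authorization header scheme, and the error handling are assumptions for illustration.

```typescript
// Sketch of an API call; host name and auth header scheme are assumptions.
async function extract(url: string, apiKey: string) {
  const res = await fetch("https://your-instance.example/sovereignty/api/spectre/extract", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // assumed auth scheme
    },
    body: JSON.stringify({ url, depth: 2, maxPages: 20, format: "markdown" }),
  });
  if (!res.ok) throw new Error(`Spectre Extract failed: ${res.status}`);
  return res.json(); // the same structured JSON object described above
}
```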
Via the dashboard
The Spectre section of the dashboard has a full UI: enter a URL, configure the crawl depth and budget, start the extraction, and watch the site map tree build in real time. Results are saved to your Spectre history.
What Teams Use It For
LLM context enrichment: crawl a documentation site and push the clean Markdown directly into an agent's context window. No copy-paste, no reformatting.
Competitor intelligence: extract product pages, pricing pages, and feature lists from competitor domains on a schedule. Track changes over time.
Knowledge base construction: crawl your own documentation and push it into a vector store for semantic search (a chunking sketch follows below). Keeps the index current without manual maintenance.
Research sourcing: the Librarian agent uses Spectre Extract to pull source material for research tasks, ensuring cited content is actually read rather than hallucinated.
Lead intelligence: extract company websites before outreach to build context-rich prospect profiles without manual research.
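For the knowledge base case, one straightforward approach is to split the Markdown output on its H2 headings before embedding. The splitting logic and the upsertChunk stub below are illustrative assumptions, not part of Spectre.

```typescript
// Placeholder for whatever embedding / vector-store client you use.
async function upsertChunk(id: string, text: string): Promise<void> {
  console.log(`would index ${id} (${text.length} chars)`);
}

// Split extracted Markdown into heading-level chunks and index them.
async function indexPage(pageUrl: string, markdown: string) {
  // Split on H2 headings, keeping each heading with its section text.
  const sections = markdown.split(/\n(?=## )/);
  for (const [i, section] of sections.entries()) {
    await upsertChunk(`${pageUrl}#${i}`, section.trim());
  }
}
```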
The Sovereign Advantage
The key difference between Spectre Extract and hosted extraction services like Firecrawl or Tavily is custody. Everything Spectre Extract processes stays in your PocketBase instance. No data leaves your environment unless you explicitly export it.
For regulated industries, client work, or any context where data residency matters, this is not a marginal benefit; it is a requirement. Spectre Extract is the only production-grade extraction engine that is fully self-hosted with no external data transmission.