Project Spectre: AI-Powered Site Intelligence at Scale

The Web Is Your Data Source

The most valuable intelligence is often sitting on a public website. Competitor pricing. News coverage. Technical documentation. Job listings that signal what a company is building next. Domain registrations that reveal expansion plans.

The problem is that getting from a URL to structured, usable data — at any scale — is genuinely hard. You need to handle pagination, rate limits, JavaScript rendering, link discovery, and the messy reality of inconsistent HTML structure. Then you need to pipe the results somewhere useful.

Project Spectre is the Sovereignty Protocol's answer to that problem: a sovereign web intelligence engine built on the same infrastructure as the rest of the platform. It is callable by your agents, auditable in your logs, and under your control.

What Spectre Does

At its core, Spectre extracts structured content from a website and returns it as clean JSON, Markdown, or both. For each page it crawls, you get:

Title and H1–H3 heading structure — the semantic skeleton of the page
Clean body text — stripped of navigation, ads, and boilerplate
Internal links — the full link graph within the site
External links — domains this site references, which feed into the Spectre Hopper for chaining
Word count and crawl depth — metadata for filtering and prioritisation
A site map tree — built automatically as pages are discovered

The output format is your choice: JSON for machine processing, Markdown for direct LLM ingestion, or both. The results are ready to pipe directly into a vector store, a Nexus Report, an LLM prompt, or any downstream system you operate.
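To make the per-page output concrete, here is a minimal sketch of what one page record could look like in Python. The class name, field names, and both render methods are assumptions made for illustration; the real Spectre schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PageRecord:
    """Hypothetical shape of one crawled page (field names are assumptions)."""
    url: str
    title: str
    headings: dict[str, list[str]]          # e.g. {"h1": [...], "h2": [...]}
    body_text: str
    internal_links: list[str] = field(default_factory=list)
    external_links: list[str] = field(default_factory=list)
    crawl_depth: int = 0

    @property
    def word_count(self) -> int:
        # Whitespace-separated tokens in the cleaned body text.
        return len(self.body_text.split())

    def to_json(self) -> str:
        # asdict() only covers dataclass fields, so word_count is added by hand.
        data = asdict(self)
        data["word_count"] = self.word_count
        return json.dumps(data, indent=2)

    def to_markdown(self) -> str:
        # Title, body, then the external links as a bullet list.
        lines = [f"# {self.title}", "", self.body_text]
        if self.external_links:
            lines += ["", "## External links"]
            lines += [f"- {link}" for link in self.external_links]
        return "\n".join(lines)


page = PageRecord(
    url="https://example.com/pricing",
    title="Pricing",
    headings={"h1": ["Pricing"], "h2": ["Plans"]},
    body_text="Three plans are available.",
    external_links=["https://partner.example.org/"],
)
```

Calling page.to_json() gives a payload suited to a vector store or another program, while page.to_markdown() gives text that can be dropped straight into an LLM prompt.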

The Four Spectre Routes

Spectre Extract

The foundation. Give it a URL and a depth budget — extract a single page, a section, or an entire domain. The output is clean, structured, and immediately useful. This is what the mcp_spectre_extract MCP tool calls under the hood when your agents use Spectre autonomously.
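A depth budget is easiest to picture as a breadth-first walk that stops expanding links once the budget is spent. The sketch below runs over an in-memory link map instead of the live web, and it assumes that depth 0 means "the start page only"; Spectre's own depth semantics and traversal order are not documented here and may differ.

```python
from collections import deque


def crawl_order(start, links, max_depth):
    """Breadth-first walk returning (url, depth) pairs within max_depth.

    `links` maps each URL to the URLs it links to (a stand-in for fetching).
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # budget spent: record the page but do not follow its links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical site structure used only for this example.
SITE = {
    "/": ["/pricing", "/docs"],
    "/pricing": ["/pricing/enterprise"],
    "/docs": ["/docs/api", "/"],
}

single_page = crawl_order("/", SITE, 0)
section = crawl_order("/", SITE, 1)
whole_site = crawl_order("/", SITE, 2)
```

The `seen` set is what keeps the back-link from /docs to / from causing a loop, which matters on real sites where nearly every page links home.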

Spectre Campaigns

Run intelligence operations against multiple targets in parallel. A campaign defines a set of URLs, a crawl depth, a schedule, and an output handler. Results accumulate over time, so you can track changes across pages — a competitor's pricing page on Monday versus Friday, for example.

Campaigns can trigger Nexus Cascades when they complete, so fresh intelligence flows automatically into your downstream workflows.
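As a rough illustration of change tracking, the sketch below models a campaign with the four parts named above (URLs, crawl depth, schedule, output handler) and compares two snapshots of the same page with Python's standard difflib module. The Campaign class, its field names, and the cron-style schedule string are all hypothetical; how Spectre stores and diffs snapshots is not specified in this article.

```python
import difflib
from dataclasses import dataclass
from typing import Callable


@dataclass
class Campaign:
    """Hypothetical campaign definition (not the real Spectre schema)."""
    urls: list[str]
    depth: int
    schedule: str                                   # assumed cron-style string
    on_result: Callable[[str, list[str]], None]     # output handler: (url, diff)


def diff_snapshots(old_text, new_text, old_label="before", new_label="after"):
    """Return unified-diff lines between two text snapshots of one page."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))


changes = {}
campaign = Campaign(
    urls=["https://competitor.example.com/pricing"],
    depth=0,
    schedule="0 6 * * 1,5",                         # Mondays and Fridays, 06:00
    on_result=lambda url, diff: changes.update({url: diff}),
)

monday = "Starter: $10/month\nPro: $40/month"
friday = "Starter: $10/month\nPro: $49/month"
campaign.on_result(campaign.urls[0], diff_snapshots(monday, friday, "monday", "friday"))
```

An empty diff means the page did not change between runs, which is a cheap signal for deciding whether a downstream workflow needs to fire at all.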

Spectre Domain Suite

Domain-level intelligence: link health scanning, external domain discovery, and outreach target identification from a single crawl. The Domain Suite builds a picture of a site's connectivity — what it links to, who links to it (via outbound reference analysis), and where broken links are degrading SEO.
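Two of the building blocks behind this kind of report are simple enough to sketch: grouping a page's links by external domain, and bucketing HTTP status codes into health categories. The function names and the category labels below are assumptions for illustration, not Spectre's actual report format, and no network requests are made.

```python
from collections import defaultdict
from urllib.parse import urlparse


def external_domains(page_url, links):
    """Group absolute links by host, skipping the page's own host."""
    own_host = urlparse(page_url).netloc.lower()
    grouped = defaultdict(list)
    for link in links:
        host = urlparse(link).netloc.lower()
        if host and host != own_host:   # relative and mailto: links have no host
            grouped[host].append(link)
    return dict(grouped)


def classify_status(status):
    """Map an HTTP status code (or None for no response) to a health label."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    return "broken"


links = [
    "/about",
    "https://example.com/blog",
    "https://partner.example.org/integrations",
    "https://partner.example.org/docs",
    "https://cdn.example.net/logo.png",
]
domains = external_domains("https://example.com/", links)
health = {code: classify_status(code) for code in (200, 301, 404, 500, None)}
```

In a real scan the status codes would come from fetching each link; keeping the classification separate from the fetching makes the rules easy to adjust.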

Spectre Email

Outreach discovery integrated with the crawl infrastructure. Spectre Email finds contact signals from target domains — email patterns, contact page structures, LinkedIn references — and enriches them with the context gathered during the crawl.

Combined with a Nexus Cascade automation trigger, this powers fully governed outreach pipelines: crawl a domain, enrich the contact, generate a personalised message, send via Sovereign Mail.
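The simplest form of contact-signal discovery is pulling email addresses out of already-crawled page text and keeping only those on the target domain. The sketch below does that with a deliberately loose regular expression; it is an illustrative assumption, not Spectre Email's actual detection logic, and the pattern will miss obfuscated addresses.

```python
import re

# Loose pattern for common address shapes; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text, domain=None):
    """Return sorted, lower-cased, de-duplicated addresses found in text.

    If `domain` is given, keep only addresses on exactly that domain.
    """
    found = {match.lower() for match in EMAIL_RE.findall(text)}
    if domain:
        suffix = "@" + domain.lower()
        found = {email for email in found if email.endswith(suffix)}
    return sorted(found)


page_text = (
    "Contact Sales@Example.com or press@example.com. "
    "Our partner can be reached at hello@other.org."
)
on_domain = find_emails(page_text, "example.com")
everything = find_emails(page_text)
```

Filtering by domain keeps partner and vendor addresses that happen to appear on the page out of the outreach queue.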

How It Compares

Spectre is in the same category as Firecrawl and Tavily — but with some important differences:

Aspect	Spectre	Firecrawl / Tavily
Hosting	Fully self-hosted, your infrastructure	Typically used as a third-party hosted service
Data custody	Stays in your PocketBase instance	Sent to external servers
Agent integration	Native MCP tool, callable by any agent	Typically integrated through an external API
Governance	Subject to Sovereignty Protocol audit logs	External, not auditable
Cascade triggers	Built-in — results can trigger workflows	Not available

If data sovereignty matters to your use case — financial research, competitive intelligence, client data — Spectre is the answer. Nothing leaves your environment unless you explicitly export it.

Calling Spectre From Your Agents

Spectre is callable in three ways:

MCP tool — mcp_spectre_extract is available to any agent with tool access. Your Librarian can crawl a source and fold the content into a research synthesis in a single workflow step.
API key — make authenticated POST requests from any external system
Nexus Cascades — use the http step type to trigger a Spectre extraction as part of a multi-step cascade

The MCP integration is the most powerful path. It means your agents can decide at runtime which URLs to crawl based on what they find in previous steps — adaptive, autonomous web intelligence rather than static extraction jobs.
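For the API-key path, an authenticated POST from an external system could look roughly like the sketch below, which builds the request with Python's standard library without sending it. The endpoint path, the Bearer header scheme, and the payload fields (url, depth, format) are all assumptions; check your own deployment for the real route and parameters.

```python
import json
import urllib.request


def build_extract_request(base_url, api_key, target_url, depth=1, output="json"):
    """Build (but do not send) a POST request for a hypothetical extract route."""
    payload = {"url": target_url, "depth": depth, "format": output}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/spectre/extract",   # hypothetical path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",            # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_extract_request(
    base_url="https://spectre.example.internal/",
    api_key="test-key",
    target_url="https://example.com/docs",
    depth=2,
    output="markdown",
)
# To send it against a real deployment: urllib.request.urlopen(request)
```

The same payload shape would also suit an http step in a cascade, since that path boils down to the same kind of request issued from inside a workflow.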

The Spectre Hopper

When Spectre crawls a page, it discovers external domains linked from that page. Those domains are automatically added to the Spectre Hopper — a queue of discovered targets waiting for further investigation.

You can review the Hopper, select domains to crawl next, and chain extractions without specifying the next URL manually. The crawl infrastructure propagates outward through the web graph, guided by your configuration and your agents' decisions.

This is how Spectre moves from a single-page extraction tool to a genuine intelligence platform.
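The Hopper behaves like a de-duplicating queue of domains, and a minimal in-memory version is sketched below. The class and method names are hypothetical, and the real Hopper presumably persists its queue and carries more metadata per domain.

```python
from collections import deque
from urllib.parse import urlparse


class Hopper:
    """Hypothetical in-memory queue of discovered external domains."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()      # every domain ever added, including selected ones

    def add_links(self, links):
        """Queue the domain of each link once; return how many were new."""
        added = 0
        for link in links:
            domain = urlparse(link).netloc.lower()
            if domain and domain not in self._seen:
                self._seen.add(domain)
                self._queue.append(domain)
                added += 1
        return added

    def pending(self):
        """Domains waiting for review, oldest first."""
        return list(self._queue)

    def select(self, domain):
        """Remove a reviewed domain from the queue so it can be crawled next."""
        self._queue.remove(domain)   # raises ValueError if it is not pending
        return domain


hopper = Hopper()
first_batch = hopper.add_links([
    "https://partner.example.org/integrations",
    "https://partner.example.org/docs",
    "https://news.example.net/story",
])
chosen = hopper.select("news.example.net")
second_batch = hopper.add_links(["https://news.example.net/other"])
```

Remembering selected domains in `_seen` means a domain that has already been chosen is not re-queued when a later crawl links to it again.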

What Teams Use Spectre For

Competitive monitoring — weekly crawls of competitor pricing and feature pages, filed as Nexus Reports with diff summaries
Market research — domain suite runs across industry verticals to map who is building what
SEO health audits — link scanner reports identifying broken outbound links across a client's domain
Lead enrichment — Spectre Email + Cascade pipelines that turn a domain list into a qualified outreach queue
Documentation ingestion — extract third-party API docs and push them into the Smart Memory System so agents have accurate context without manual copy-paste

Project Spectre is a Pro-tier feature. If you are on a Free plan and want to see it in action, start a trial — full access for 31 days, no card required.