The Change That Doesn't Show Up in a Diff: Monitoring Investor Relations Pages for New Tax Filings
Investor-relations and tax pages barely change visually from quarter to quarter, so WebDoppler tracks the set of documents it has already seen and treats a newly appeared PDF as the real signal, independent of how little the surrounding DOM moved.
The Change That Doesn't Show Up in a Diff: Monitoring IR Pages for New Tax Filings
If you've ever tried to build a "tell me when this page changes" monitor for a fund's investor-relations site, you've hit the same wall: the page barely changes. A distribution history table for a publicly traded partnership looks nearly identical from one quarter to the next — same layout, same headers, same boilerplate. The actual event you care about is quieter than a visual diff: a new PDF appears in the filings list. ACI-8937-2026.pdf shows up next to three years of prior filings, and that's the whole story. A hash-of-the-page monitor either misses it (the surrounding DOM churn from a session token or ad slot swamps the signal) or false-positives on every cache-busting query param the CMS appends to an unrelated asset.
This is the exact problem Gyrence's WebDoppler primitive is built around, and it's worth walking through because the implementation says something useful about how to build change monitoring that developers can actually trust.
Register a target, not a URL
WebDoppler's register endpoint (POST /api/v1/monitor/register) takes more than a URL. A realistic registration for an IR distribution page looks like:
{
"url": "https://bip.brookfield.com/distribution-history",
"channels": ["ir_page"],
"check_interval_hours": 24,
"relocation_threshold": 40,
"discovery_keywords": ["1099-DIV", "8937", "distribution"],
"extract_schema": "tax_8937",
"webhook_url": "https://yourapp.com/hooks/gyrence",
"alert_emails": ["ops@yourapp.com"]
}
extract_schema isn't a free-text prompt you have to hand-tune — it's one of a handful of pre-built, locked extraction templates: tax_8937 (IRS Form 8937, adjustment factor and return-of-capital fields), ptp_distribution_matrix (per-unit-class, per-period withholding line items for publicly traded partnerships), tax_1446a/tax_1446f (§1446 withholding notices), reit_1099_div (1099-DIV box percentages), tax_19a (Rule 19a-1 distribution source notices), plus auto and custom for anything outside that set. That's a deliberately narrow, tax/financial-compliance-shaped starting library — it tells you where WebDoppler was built first, even though the primitive itself works on any public URL.
Two-gate change detection
Before anything expensive runs, WebDoppler hashes the normalized page HTML and compares it to the last known hash. "Normalized" matters: the normalizer strips CSRF/session tokens, session-id query params, cache-busting params (v, cb, _, ts), and hash-like tokens before hashing, so a CMS reshuffling its session cookie doesn't register as a change. If the hash matches, the run stops there — unchanged, no webhook, no LLM call.
If the hash differs, WebDoppler re-locates its anchor element using a fingerprint-and-score approach — a TypeScript port of Scrapling's adaptive element relocation, scoring every candidate element in the tree against the last known fingerprint (tag, text, attributes, DOM path, siblings) and picking the best match. It then diffs the two fingerprints field by field. A change only counts as material if some field's similarity ratio drops below 0.95, or the element's structural path actually moved.
Running alongside that fingerprint diff is a separate, cheaper check purpose-built for exactly the IR-page problem above: WebDoppler tracks the set of PDF URLs it has ever seen linked from the page. If a new one shows up — regardless of what the fingerprint diff says — that's treated as material on its own. The code comment for this is refreshingly direct: the DOM barely changes, "the real event is a NEW dated PDF appearing alongside prior years' filings." When discovery_keywords are set, new-PDF detection prefers links whose anchor text matches one of them, so a monitor watching for "1099-DIV" doesn't fire on an unrelated marketing PDF added to the same page.
What you get on the other end
Once a change is material, WebDoppler runs the locked extraction schema against the new content and layers a materiality classification on top (high / medium / low / noise, with a recommended_action and a one-to-three-sentence summary) before it ever reaches a webhook. Delivery is HMAC-SHA256-signed, with the raw change signal included:
{
"outcome": "changed",
"channel": "ir_page",
"change_signal": [
{ "field": "attr:href", "from": null, "to": "/files/ACI-8937-2026.pdf", "ratio": 0 }
],
"pdf_links": ["https://bip.brookfield.com/files/ACI-8937-2026.pdf"],
"extracted_fields": { "found": true, "data": [ /* schema-shaped fields */ ] }
}
One outcome worth calling out: relocation_failed — the case where the page restructured enough that WebDoppler lost its anchor — is never silently suppressed. It always fires an alert, because a lost anchor is a monitoring gap, not a non-event. That's consistent with how the rest of Gyrence handles uncertainty: blocked fetches return a structured error instead of an empty markdown field, and oversized Parquet decimals get flagged with an approximationRisk breadcrumb rather than silently rounded. For a monitoring product, that bias — surface the gap instead of hiding it — is the difference between a tool you can actually build compliance workflows on and one that quietly stops working.
WebDoppler bills through the same credit pool as every other Gyrence primitive, one credit per check, so adding a monitor doesn't mean standing up a separate product with its own pricing model — it's the same API you're already calling for fetch and extract.