Fetch

Retrieve a single URL and return its parsed title, description, markdown body, lightly-cleaned HTML, and outbound links. Fetch is a two-tier fetcher: it tries plain HTTP first and escalates to a headless-browser worker when the page needs JS, returns a soft block, or rate-limits the request.

Fetch infrastructure

Intelligent two-tier routing — HTTP-first with automatic escalation to headless browser rendering on JS-shell detection, bot-protection signals (403/429/503), or post-conversion content loss. Domain health tracking fast-paths known browser-required hosts, skipping the HTTP attempt entirely.
Per-host concurrency cap — enforces PER_HOST_CAP = 3 concurrent outbound connections per target host across all concurrent Gyre walks.
On-the-fly payload trimming — strips base64 data URIs, SVG path coordinate blobs, and embedded media data URIs before markdown conversion, eliminating binary content that carries no information value.
Bounded by design — per-page (8s), first-page (12s), and worker (18s) timeout constraints. Requests exceeding the 25-second hard deadline return code: "timeout".
Iframe content surfacing — <iframe src> URLs are extracted into links[] automatically, making embedded IR platform content (tax tables and PDF links) visible to both Fetch and Gyre. Iframe src references are also preserved as [iframe: url](url) markers in markdown when they appear in the main content region.
Cookie and modal acceptance — the browser tier accepts an acceptSelectors array of CSS selectors to click before page capture. Use this to dismiss cookie banners, terms-of-use modals, and soft gates that would otherwise block content retrieval on JS-rendered pages.
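
As a rough illustration of the routing described above, the sketch below applies the escalation triggers listed under Notes to a raw HTTP response. It is not Gyrence's implementation: the function names, the plain substring matching, and the use of status 0 for a network error are assumptions made for the example.

```python
# Hypothetical sketch of the HTTP-to-browser escalation rules; not Gyrence's code.
ESCALATE_STATUSES = {403, 429, 503}
JS_SHELL_MARKERS = (
    'id="root"></div>',
    'id="__next">',
    'id="app">',
    "You need to enable JavaScript",
    "<noscript>",
)


def should_escalate(status_code: int, body: str) -> bool:
    """Pre-conversion check: network error, bot-protection status, thin body, or JS shell."""
    if status_code == 0:  # assumption: 0 stands for "the fetch threw"
        return True
    if status_code in ESCALATE_STATUSES:
        return True
    if len(body) < 500:
        return True
    return any(marker in body for marker in JS_SHELL_MARKERS)


def content_lost(html: str, markdown: str) -> bool:
    """Post-conversion check: 1000+ chars of HTML that produced under 100 chars of markdown."""
    return len(html) >= 1000 and len(markdown) < 100
```

The SEC.gov fast path and domain health tracking are deliberately left out of this sketch.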

Use this when

You need clean markdown for a single article, doc, or filing — not a whole site.
Your scraper keeps tripping JS shells or soft blocks (Cloudflare, Akamai) and you want auto-escalation to a headless browser.
You're feeding RAG and need both markdown (LLM-ready) and html (your own pipeline) from the same call.
You're indexing outbound links from a page (links[]) without crawling.

Method	POST
Path	`/api/v1/fetch`
Auth	Bearer
Credits	1 (http) or 3 (browser)

Request

Parameter	Type	Description
`url` required	`string`	Absolute http(s) URL. Private, loopback, and link-local hosts are rejected (SSRF).
`forceBrowser`	`boolean` default: `false`	Skip the HTTP tier and render via the browser worker. Also bypasses the SEC.gov fast path. Always costs 3 credits.
`blockAds`	`boolean` default: `false`	Instructs the browser-tier worker to block ad networks, tracking scripts, images, fonts, and stylesheets before page capture, which reduces data transfer on ad-heavy pages. No-op for the HTTP tier, so it only takes effect when the request is served by the browser tier.
`acceptSelectors`	`string[]`	CSS selectors to click before page capture in the browser tier (e.g. ["button:has-text(\"I Agree\")"]). Dismisses cookie banners and terms modals. No-op for the HTTP tier.

Example body

{
  "url": "https://example.com/article",
  "forceBrowser": false
}
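
A minimal sketch of building this body from Python, assuming you want to omit fields left at their defaults. The helper name `build_fetch_body` and its client-side URL check are hypothetical; the server performs its own validation regardless.

```python
import json
from urllib.parse import urlsplit


def build_fetch_body(url, force_browser=False, block_ads=False, accept_selectors=None):
    """Build the JSON body for POST /api/v1/fetch, omitting fields left at their defaults."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("url must be an absolute http(s) URL")
    body = {"url": url}
    if force_browser:
        body["forceBrowser"] = True
    if block_ads:
        body["blockAds"] = True
    if accept_selectors:
        body["acceptSelectors"] = list(accept_selectors)
    return body


# Example: force the browser tier and dismiss a terms modal before capture.
print(json.dumps(build_fetch_body(
    "https://example.com/article",
    force_browser=True,
    accept_selectors=['button:has-text("I Agree")'],
), indent=2))
```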

Response

Field	Type	Description
`url`	`string`	Echo of the requested URL (not the post-redirect URL).
`title`	`string`	Contents of <title>. Empty string when absent.
`description`	`string`	<meta name="description">, falling back to og:description. Empty string when absent.
`markdown`	`string`	Main-content markdown. See Notes for the extraction + cleaning pipeline.
`html`	`string`	Lightly-cleaned HTML: only <script>, <style>, and <noscript> are removed. nav/header/footer/form/iframe/svg are preserved here (they're stripped only inside the markdown pipeline).
`links[]`	`string[]`	Absolute http(s) URLs found in <a href> and <iframe src> within the cleaned HTML. Deduped, fragments stripped.
`statusCode`	`number`	Origin HTTP status. 0 if the underlying fetch threw (network error).
`fetchedAt`	`string`	ISO timestamp when the response was assembled.
`via`	`"http" | "browser"`	Tier that produced this result. Determines credit cost (1 vs 3). Binary documents report "docabl_passthrough" instead (see Binary documents).
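
To make the links[] description concrete, here is one way the "absolute, deduped, fragments stripped" behaviour could be reproduced over the html field. It is a sketch only: the regex, the function name `extract_links`, and the choice to resolve relative hrefs against a base URL are assumptions, and it covers <a href> only.

```python
import re
from urllib.parse import urldefrag, urljoin

# Naive pattern: the first quoted href attribute inside each <a ...> tag.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html: str, base_url: str) -> list[str]:
    """Return absolute http(s) links in document order, deduped, with fragments removed."""
    seen = {}
    for href in HREF_RE.findall(html):
        # Assumption: relative hrefs are resolved against the requested URL.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if absolute.startswith(("http://", "https://")):
            seen.setdefault(absolute, None)
    return list(seen)
```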

Example response

{
  "ok": true,
  "data": {
    "url": "https://example.com/article",
    "title": "Example Article",
    "description": "A short summary.",
    "markdown": "# Example Article\n\n...",
    "html": "<html>...</html>",
    "links": ["https://example.com/related"],
    "statusCode": 200,
    "fetchedAt": "2026-05-29T20:45:00.000Z",
    "via": "http"
  }
}
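
Because most non-2xx origin statuses still arrive as ok: true (see Notes), it can help to inspect a success envelope before trusting its markdown. The sketch below is one possible check; `result_warnings` is a hypothetical helper, and where the fallback error field sits (envelope or data) is an assumption, so it looks in both places.

```python
def result_warnings(envelope: dict) -> list[str]:
    """List reasons to distrust an ok:true envelope; an empty list means nothing stood out."""
    if not envelope.get("ok"):
        raise ValueError("not a success envelope: " + str(envelope.get("code")))
    data = envelope.get("data", {})
    warnings = []
    status = data.get("statusCode", 0)
    if status == 0:
        warnings.append("origin fetch threw before a response arrived")
    elif not 200 <= status < 400:
        warnings.append(f"origin status {status} passed through")
    # Assumption: the worker-fallback error may sit on the envelope or inside data.
    fallback = envelope.get("error") or data.get("error")
    if fallback:
        warnings.append(f"browser worker fallback: {fallback}")
    return warnings
```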

Example

curl -X POST https://www.gyrence.com/api/v1/fetch \
  -H "Authorization: Bearer $GYRENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/article"}'
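
The same call can be made from Python with only the standard library. This is a sketch under a few assumptions: the 30-second client timeout is an arbitrary value chosen to sit above the 25-second hard deadline, and error responses are assumed to carry the JSON error envelope shown under Errors.

```python
import json
import urllib.error
import urllib.request

API_URL = "https://www.gyrence.com/api/v1/fetch"


def make_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def fetch(url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded envelope, for success and error alike."""
    try:
        with urllib.request.urlopen(make_request(url, api_key), timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumption: non-2xx responses still carry a JSON error envelope.
        return json.load(err)
```

Call it as `fetch("https://example.com/article", os.environ["GYRENCE_API_KEY"])` after importing `os`.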

Errors

Code	HTTP	Meaning
`bad_request`	400	`url` missing/invalid, or any field fails validation.
`unauthorized`	401	Missing or malformed `Authorization` header, or a revoked API key.
`credits_exhausted`	402	Workspace balance below request cost.
`forbidden_url`	403	SSRF guard rejected a private, loopback, or link-local host.
`not_found`	404	Origin returned 404. Origin 404s are typed — you do not receive `{ok:true, statusCode:404, markdown:"<404 page>"}`.
`timeout`	408	Request exceeded the 25-second hard deadline.
`rate_limited`	429	Per-workspace rate limit.
`upstream_error`	502	Origin returned 5xx.
`unavailable`	503	Block-page detector tripped (see Notes), or any other unmapped error.

{ "ok": false, "error": "origin returned 404", "code": "not_found" }
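
A client will usually want to turn the error envelope into an exception. The sketch below shows one way to do that; `FetchError`, `unwrap`, and the retry set are hypothetical, and which codes are worth retrying is a client-side judgment call rather than documented behaviour.

```python
# Assumption: only transient-looking codes are retried. "unavailable" is left out
# because block pages fail closed and a plain retry is unlikely to help.
RETRYABLE = {"timeout", "rate_limited", "upstream_error"}


class FetchError(Exception):
    """Raised for {ok: false} envelopes; carries the typed error code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.retryable = code in RETRYABLE


def unwrap(envelope: dict) -> dict:
    """Return envelope['data'] on success, raise FetchError otherwise."""
    if envelope.get("ok"):
        return envelope["data"]
    raise FetchError(envelope.get("code", "unavailable"), envelope.get("error", ""))
```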

Credits

1 credit per HTTP-tier success, 3 credits per browser-tier success. Read the via field on the response to attribute cost. Errors are not charged. forceBrowser: true always costs 3 credits.
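
Cost attribution from the via field can be sketched as follows. The helper name `credits_charged` is hypothetical, and since no credit cost is stated here for the docABL passthrough path, the sketch refuses to guess one.

```python
CREDITS_BY_VIA = {"http": 1, "browser": 3}


def credits_charged(envelope: dict) -> int:
    """Return the credits a response cost, based on the documented 1-vs-3 rule."""
    if not envelope.get("ok"):
        return 0  # errors are not charged
    via = envelope["data"]["via"]
    if via not in CREDITS_BY_VIA:
        # Assumption: any other value has no documented cost, so fail loudly.
        raise ValueError(f"no documented credit cost for via={via!r}")
    return CREDITS_BY_VIA[via]
```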

Coverage & known limits

Block pages fail closed. Akamai / Cloudflare / PerimeterX / DataDome / generic "access denied" responses return code: "unavailable" — you do not get the block page as markdown.
Markdown chrome-strip is regex-based and can eat legitimate <header> blocks inside articles. Use the html field if you need to run your own conversion.
links[] is DOM-time only. Links injected by JS after page.content() (infinite scroll, modal-loaded) won't appear.
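
To see why a regex-based chrome strip can eat a legitimate <header>, consider this simplified reconstruction of the select-then-strip step described under Notes. It is an illustration, not the service's code: the patterns and function names are assumptions, and the markdown conversion itself is not reproduced.

```python
import re

CHROME_TAGS = ("script", "style", "noscript", "iframe", "svg", "nav", "footer", "header", "form")


def main_region(html: str) -> str:
    """Return the inner HTML of the first of <main>, <article>, <body> that is present."""
    for tag in ("main", "article", "body"):
        match = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html, re.IGNORECASE | re.DOTALL)
        if match:
            return match.group(1)
    return html


def strip_chrome(fragment: str) -> str:
    """Remove chrome elements with a naive regex, wherever they appear in the fragment."""
    for tag in CHROME_TAGS:
        fragment = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}>", "", fragment, flags=re.IGNORECASE | re.DOTALL)
    return fragment


page = "<body><nav>Menu</nav><article><header><h1>Title</h1></header><p>Body text.</p></article></body>"
print(strip_chrome(main_region(page)))  # prints <p>Body text.</p>; the article's own <h1> is lost
```

If the title matters to you, run your own conversion over the html field, where <header> is preserved.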

Notes

Escalation triggers. HTTP tier escalates to the browser worker on origin 403, 429, 503, on network error, when the response body is < 500 chars, when the body contains a JS-shell marker (id="root"></div>, id="__next">, id="app">, You need to enable JavaScript, <noscript>), or — post-conversion — when HTML ≥ 1000 chars produced < 100 chars of markdown (content-loss escalation). Browser tier is never retried.
forceBrowser. Skips both the HTTP tier and the SEC fast path; always costs 3 credits even if the page would have succeeded over HTTP.
SEC.gov fast path. *.sec.gov URLs (without forceBrowser) go HTTP-only and send an identifying User-Agent, Gyrence (<contact>) - financial-data retrieval, per SEC's fair-access policy. They skip JS-shell, browser, and content-loss escalation.
Block-page detection. When a response contains markers for Akamai, Cloudflare, PerimeterX, DataDome, or a generic "access denied … you don't have permission" page (scanned in the first 4 KB), the request fails with code: "unavailable" and a message like Blocked by Akamai (status 403). You do not receive a 200 with the block page as markdown.
markdown extraction. Pipeline: pick the first of <main> / <article> / <body>, strip <script>/<style>/<noscript>/<iframe>/<svg>/<nav>/<footer>/<header>/<form>, then convert via node-html-markdown. The chrome strip is regex-based and known to be fragile on malformed HTML; it may eat legitimate <header> blocks inside articles.
html vs markdown. html is the lightly-cleaned version (scripts/styles/noscript only) — use this if you want to run your own conversion. The heavy chrome strip is applied only to the input of the markdown converter and never leaks into the html field.
links[] is DOM-time. Extracted via regex from html at conversion time. Links added by JS after render (infinite scroll, modal-loaded content) appear only if they were in the DOM when the browser-tier worker called page.content(). HTTP-tier responses never see post-render links.
statusCode semantics. 2xx, 3xx, and most 4xx codes (e.g. 403, 451) pass through as {ok:true, statusCode} with whatever body the origin returned. Only 404 (→ not_found) and 5xx (→ upstream_error) are mapped to error envelopes. statusCode: 0 means the underlying fetch threw before getting a response.
Graceful worker fallback. If escalation is triggered but the browser worker is unreachable, the request falls back to the HTTP-tier result with via: "http" and an error field. The envelope still succeeds — inspect error if present.
SSRF. Private (RFC1918), loopback, link-local, and .local hosts are rejected pre-fetch with forbidden_url.
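
If you want to avoid spending a round trip on a URL that the SSRF guard will reject, a client-side pre-check along these lines may help. It is a sketch with stated limits: `is_forbidden_host` is a hypothetical helper, it treats the name localhost as loopback by assumption, it does not resolve DNS names, and Python's `is_private` covers more ranges than RFC1918 alone.

```python
import ipaddress
from urllib.parse import urlsplit


def is_forbidden_host(url: str) -> bool:
    """Approximate the documented SSRF categories for literal IPs and obvious local names."""
    host = (urlsplit(url).hostname or "").lower()
    if not host or host == "localhost" or host.endswith(".local"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; this sketch does not resolve it
    # Note: is_private is broader than RFC1918 (it also covers other reserved ranges).
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

The server-side guard remains authoritative; a URL that passes this check can still come back as forbidden_url.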

Response shapes by `kind`

/fetch returns a discriminated union keyed by kind. The envelope shown above is the kind: "html" shape. For other resources, markdown is a short human-readable summary and the kind-specific payload travels in dedicated fields — read those when you want the structured data, not just the markdown preview.

`kind`	Extra fields you get
`html`	`title`, `description`, `html`, `links[]`
`json`	`json` (parsed value)
`ndjson`	`records[]`, `recordCount`
`csv`	`headers[]`, `rows[][]`, `rowCount`, `delimiter`
`yaml`	`yaml` (parsed value)
`xml`	`xml` (DOM-shaped object), `rootElement`
`sitemap`	`urls[]`, `childSitemaps[]`, `urlCount`
`feed`	`feedType`, `feed { title, description, lastUpdated }`, `items[]`, `itemCount`
`xbrl` / `ixbrl`	`taxonomies[]`, `contexts[]`, `units[]`, `facts[]`, `factCount`, plus `raw` (verbatim source)
`parquet`	`sheets[]`, `rowCount`, `columnCount`, plus a tabular markdown preview
`arrow`	`schema`, `recordCount`, plus tabular preview
`pdf` / `docx` / `xlsx` / `pptx`	`via: "docabl_passthrough"`, `bytesLength`, extracted markdown
`text` / `llmstxt`	`text` (verbatim), `textType` ("plain"
`unsupported`	`detectedAs`, `reason`, `bytesLength`

The Playground at /app/fetch shows a Data tab whenever kind is one of the structured branches above, so you can inspect the full payload without piping the JSON through jq. If you only render markdown, you'll see a summary like "# Inline XBRL filing — Taxonomies: 15, Contexts: 326, Facts: 1167" instead of the 1,167-fact structured payload — switch to the facts[]/contexts[] fields to consume it.
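
Consuming the structured fields rather than the markdown preview might look like the sketch below. The field names are taken from the table above (with the [] suffix dropped), but their placement directly on data, the default to "html" when kind is absent, and the helper name are assumptions. Only a subset of kinds is listed.

```python
# Field names come from the kind table; their placement on `data` is an assumption.
XBRL_FIELDS = ("taxonomies", "contexts", "units", "facts", "factCount", "raw")
KIND_FIELDS = {
    "html": ("title", "description", "html", "links"),
    "json": ("json",),
    "ndjson": ("records", "recordCount"),
    "csv": ("headers", "rows", "rowCount", "delimiter"),
    "yaml": ("yaml",),
    "xml": ("xml", "rootElement"),
    "sitemap": ("urls", "childSitemaps", "urlCount"),
    "feed": ("feedType", "feed", "items", "itemCount"),
    "xbrl": XBRL_FIELDS,
    "ixbrl": XBRL_FIELDS,
}


def structured_payload(data: dict) -> dict:
    """Pick out the kind-specific fields that are present, ignoring the markdown preview."""
    kind = data.get("kind", "html")  # assumption: a missing kind means the html shape
    fields = KIND_FIELDS.get(kind)
    if fields is None:
        return {}  # a kind this sketch does not cover
    return {name: data[name] for name in fields if name in data}
```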

Binary documents (docABL passthrough)

When the target URL resolves to a recognized binary document (pdf, docx, xlsx, pptx), Gyrence forwards the raw bytes to docABL, our third-party document renderer, and returns the extracted markdown through the normal /fetch envelope.

You will see this on the response:

via: "docabl_passthrough" (instead of "http" or "browser")
source: "docabl_passthrough" on the body
kind: "pdf" | "docx" | "xlsx" | "pptx"
bytesLength — the raw upstream byte size that was forwarded

The in-product playground at /app/fetch renders a provenance chip above the markdown when this path is taken, and /app/usage shows a per-endpoint docABL call count. Each usage_events row carries an audit-grade metadata JSON blob (renderer, renderMode, dataClass, kind, bytesLength, docablOutcome).

See the full processor listing — including data sent, retention, and region — on the Subprocessors page. Gyrence does not persistently store the bytes or the rendered output.

Try it

Hit Fetch from the console at /app/fetch with one click — no curl required.