Extract

Pull structured JSON out of a single page using an LLM. Gyrence fetches the URL (HTTP first, browser escalation if needed), converts it to Markdown, and uses an LLM to return strict JSON matching your prompt or schema.

Use this when

You need a few specific fields from a page (price, title, author, contact email) without writing selectors.
The site's HTML changes often and selector-based scraping is too brittle.
You want JSON shaped to your own schema, not raw page content.
You're prototyping enrichment before committing to a per-domain parser.

Method	POST
Path	`/api/v1/extract`
Auth	Bearer
Credits	5 (7 if browser)

Request

Parameter	Type	Description
`url` required	`string`	Page URL to extract from. Must be a public http(s) origin. SSRF-blocked hosts are rejected.
`prompt` required	`string`	Natural-language instruction describing what to extract (e.g. "Return the product title, price in USD, and in-stock status").
`schema`	`string`	Optional JSON string describing the desired shape. Sent to the model as a target structure. Must be valid JSON — invalid JSON returns `bad_request`.
`forceBrowser`	`boolean` default: `false`	Skip the HTTP-first attempt and render via the headless worker immediately. Use for known JS-heavy SPAs.

Example body

{
  "url": "https://example.com/products/widget",
  "prompt": "Extract the product name, price, currency, and availability.",
  "schema": "{\"name\":\"string\",\"price\":\"number\",\"currency\":\"string\",\"inStock\":\"boolean\"}"
}
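Because `schema` is a string field, the desired shape has to be serialized to JSON text before it goes into the body, which is then serialized again as a whole. A minimal Python sketch of that double encoding follows; `build_extract_body` is a hypothetical helper name, not part of any Gyrence SDK.

```python
import json


def build_extract_body(url, prompt, schema=None, force_browser=False):
    """Build the JSON body for POST /api/v1/extract (hypothetical helper)."""
    body = {"url": url, "prompt": prompt}
    if schema is not None:
        # `schema` must be a JSON *string*, so the dict is serialized here.
        body["schema"] = json.dumps(schema, separators=(",", ":"))
    if force_browser:
        body["forceBrowser"] = True
    return body


body = build_extract_body(
    "https://example.com/products/widget",
    "Extract the product name, price, currency, and availability.",
    schema={
        "name": "string",
        "price": "number",
        "currency": "string",
        "inStock": "boolean",
    },
)

# The whole body is serialized once more when it is sent over the wire.
payload = json.dumps(body)
```

Building the schema as a dict and serializing it in one place avoids hand-escaping quotes, which is an easy way to end up with a `bad_request`.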

Response

{
  "ok": true,
  "data": {
    "url": "https://example.com/products/widget",
    "title": "Widget — Example",
    "extractedJson": "{\"name\":\"Widget\",\"price\":29.99,\"currency\":\"USD\",\"inStock\":true}",
    "via": "http"
  }
}

Field	Type	Description
`url`	`string`	The final URL fetched (after redirects).
`title`	`string`	Page title parsed from the fetched HTML.
`extractedJson`	`string`	Stringified JSON returned by the model. Always a string — parse it client-side. On malformed model output, falls back to `{"_raw": "<original text>"}`.
`via`	`"http" \| "browser"`	How the page was fetched. `http` = direct fetch succeeded. `browser` = escalated to the headless worker (costs 2 extra credits).
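Since `extractedJson` is always a string and may carry the `_raw` fallback, client code needs a second parse step plus a check for that shape. The sketch below shows one way to do it in Python; `parse_extracted` is a hypothetical name, and treating an object whose only key is `_raw` as the fallback is an assumption based on the description above.

```python
import json


def parse_extracted(response):
    """Return (fields, raw_text); exactly one of the two is None.

    `response` is the already-parsed response body (a dict).
    """
    data = json.loads(response["data"]["extractedJson"])
    # Assumption: an object whose only key is "_raw" is the malformed-output fallback.
    if isinstance(data, dict) and set(data) == {"_raw"}:
        return None, data["_raw"]
    return data, None


response = {
    "ok": True,
    "data": {
        "url": "https://example.com/products/widget",
        "title": "Widget",
        "extractedJson": '{"name":"Widget","price":29.99,"currency":"USD","inStock":true}',
        "via": "http",
    },
}

fields, raw_text = parse_extracted(response)
```

Returning the raw text separately keeps the fallback from being mistaken for extracted fields further down the pipeline.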

Example

curl -X POST https://www.gyrence.com/api/v1/extract \
  -H "Authorization: Bearer $GYRENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/widget",
    "prompt": "Extract the product name, price, currency, and availability.",
    "schema": "{\"name\":\"string\",\"price\":\"number\",\"currency\":\"string\",\"inStock\":\"boolean\"}"
  }'

Errors

Code	HTTP	Meaning
`bad_request`	400	`url` missing/invalid, `prompt` empty, or `schema` is not valid JSON.
`unauthorized`	401	Missing or malformed `Authorization` header, or a revoked API key.
`credits_exhausted`	402	Workspace balance below request cost, or the LLM provider returned 402.
`forbidden_url`	403	SSRF guard rejected a private, loopback, or link-local host.
`timeout`	408	Request exceeded the 25-second hard deadline.
`rate_limited`	429	Per-workspace rate limit, or the LLM provider rate-limited the call.
`upstream_error`	502	Page fetch or LLM call returned a 5xx.
`unavailable`	503	Any other unmapped error.

{ "ok": false, "error": "Schema must be valid JSON", "code": "bad_request" }
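One way a client might act on the `code` field is to separate errors worth retrying from those that need a change to the request. The grouping below is an assumption made for illustration; the table above does not say which errors are safe to retry, and `classify_error` is a hypothetical helper.

```python
# Assumption: these codes describe transient conditions and may succeed on retry.
RETRYABLE_CODES = {"timeout", "rate_limited", "upstream_error", "unavailable"}


def classify_error(body):
    """Return 'ok', 'retry', or 'fail' for a parsed response body (a dict)."""
    if body.get("ok"):
        return "ok"
    if body.get("code") in RETRYABLE_CODES:
        return "retry"
    # bad_request, unauthorized, credits_exhausted, forbidden_url, and anything
    # unrecognized: resending the same request unchanged is unlikely to help.
    return "fail"


outcome = classify_error(
    {"ok": False, "error": "Schema must be valid JSON", "code": "bad_request"}
)
```

If retries are added, a backoff between attempts is advisable, since `rate_limited` can originate from either the workspace limit or the LLM provider.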

Credits

5 credits when fetched over HTTP, 7 credits when Gyrence escalates to the headless browser. Charged once per call regardless of model response length. See Notes for the full list of escalation triggers.
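As a worked example of the pricing above, the cost of a batch can be tallied from the `via` value of each response. This is a small sketch with hypothetical helper names, using only the two prices stated here.

```python
CREDITS_BY_VIA = {"http": 5, "browser": 7}


def credits_for(via):
    """Credits billed for one call, keyed on the response's `via` field."""
    try:
        return CREDITS_BY_VIA[via]
    except KeyError:
        raise ValueError(f"unknown via value: {via!r}") from None


def total_credits(vias):
    """Sum the credits for a sequence of `via` values."""
    return sum(credits_for(via) for via in vias)


# Two HTTP fetches and one browser escalation: 5 + 5 + 7 = 17 credits.
batch_cost = total_credits(["http", "http", "browser"])
```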

Limits & known behavior

Content is truncated. Only the first 12,000 characters of converted Markdown are sent to the model. Long pages have their tail dropped — narrow the URL (e.g. a product page, not a catalog) for best results.
Prompt + page content compete for context. Long prompts reduce the page content the model sees. Aim for prompts under 500 characters; let the schema do the structural work and the prompt focus on what to extract, not how.
JSON guarantee is best-effort. The model is instructed to return strict JSON, but malformed output falls back to `{"_raw": "<text>"}`. Always handle that shape.
`schema` is a hint, not a validator. Gyrence does not enforce types on the model's output. Validate with your own Zod or JSON Schema layer if it matters.
One page per call. To extract across many URLs, fan out client-side (use `/map` to enumerate) and call `/extract` per URL.
Markdown fidelity. Tables and deeply nested lists may lose structure in HTML→Markdown conversion before the model sees them.
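Because the schema is only a hint, a client-side check is the place to catch missing fields or wrong types. The sketch below validates output against a flat hint of the form used in the example body (field name mapped to `"string"`, `"number"`, or `"boolean"`). That flat format and the `validate_flat` helper are assumptions for illustration; nested schemas or other type names would need a fuller validator.

```python
import json

# Type names as they appear in the example schema; other names are not handled.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so it is excluded explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def validate_flat(extracted_json, schema_hint):
    """Return a list of problems; an empty list means the output matches the hint."""
    data = json.loads(extracted_json)
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    for key, type_name in schema_hint.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif not TYPE_CHECKS[type_name](data[key]):
            problems.append(f"wrong type for {key}: expected {type_name}")
    return problems


hint = {"name": "string", "price": "number", "currency": "string", "inStock": "boolean"}
problems = validate_flat(
    '{"name":"Widget","price":"29.99","currency":"USD","inStock":true}', hint
)
```

Here `problems` flags `price`, which came back as a string rather than a number. The `_raw` fallback also fails this check, with every hinted field reported as missing.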

Notes

Extract uses Gemini 2.5 Flash via the Lovable AI Gateway, chosen for its 3× cost advantage over comparable models at equivalent extraction quality. The API contract (request/response shape) is stable regardless of the underlying model; Gyrence may update the model tier over time. The system prompt instructs the model to return strict JSON only, with no prose and no Markdown fences.
Fetch pipeline: identical to `/fetch`. HTTP first, with automatic escalation to the headless browser on any of: origin 403/429/503, network error (status 0), body < 500 chars, JS-shell markers (`id="root">`, `id="__next">`, `<noscript>`, etc.), or post-conversion content loss (HTML ≥ 1000 chars producing < 100 Markdown chars). `forceBrowser: true` skips the HTTP attempt.
`via` drives pricing, not the input. Even without `forceBrowser`, an automatic escalation bills as browser (7 credits).
SSRF re-validation runs on the final URL after redirects, not just the input.
No retries on model failure. Garbled output is returned as `_raw` for your inspection rather than re-prompted server-side.
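The escalation triggers listed above can be read as a single predicate. The sketch below is an illustrative reconstruction from that list, not Gyrence's actual implementation: `should_escalate` is a hypothetical name, and the marker tuple contains only the three markers named here, while the real list is longer ("etc.").

```python
# Only the markers named in the notes; the real list is not published here.
JS_SHELL_MARKERS = ('id="root">', 'id="__next">', "<noscript>")


def should_escalate(status, html, markdown):
    """Guess whether an HTTP fetch result would trigger browser escalation.

    status   -- origin HTTP status, with 0 standing for a network error
    html     -- fetched body text
    markdown -- the body after HTML-to-Markdown conversion
    """
    if status in (0, 403, 429, 503):
        return True
    if len(html) < 500:
        return True
    if any(marker in html for marker in JS_SHELL_MARKERS):
        return True
    # Post-conversion content loss: sizeable HTML that yields almost no Markdown.
    if len(html) >= 1000 and len(markdown) < 100:
        return True
    return False


blocked = should_escalate(403, "x" * 2000, "y" * 500)
```

A predicate like this can help estimate ahead of time which URLs are likely to bill at the browser rate, but only the `via` field in the response is authoritative.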

Try it

Test prompts and schemas live in the console at /app/extract — paste a URL, iterate on the prompt, copy the resulting JSON.