Extract

Pull structured JSON out of a single page using an LLM. Gyrence fetches the URL (HTTP first, browser escalation if needed), converts it to Markdown, and uses an LLM to return strict JSON matching your prompt or schema.

Use this when

  • You need a few specific fields from a page (price, title, author, contact email) without writing selectors.
  • The site's HTML changes often and selector-based scraping is too brittle.
  • You want JSON shaped to your own schema, not raw page content.
  • You're prototyping enrichment before committing to a per-domain parser.

Method: POST
Path: /api/v1/extract
Auth: Bearer
Credits: 5 (7 if browser)

Request

| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | required | Page URL to extract from. Must be a public http(s) origin. SSRF-blocked hosts are rejected. |
| prompt | string | required | Natural-language instruction describing what to extract (e.g. "Return the product title, price in USD, and in-stock status"). |
| schema | string | optional | JSON string describing the desired shape. Sent to the model as a target structure. Must be valid JSON; invalid JSON returns `bad_request`. |
| forceBrowser | boolean | optional (default: false) | Skip the HTTP-first attempt and render via the headless worker immediately. Use for known JS-heavy SPAs. |

Example body

{
  "url": "https://example.com/products/widget",
  "prompt": "Extract the product name, price, currency, and availability.",
  "schema": "{\"name\":\"string\",\"price\":\"number\",\"currency\":\"string\",\"inStock\":\"boolean\"}"
}
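Because `schema` is a JSON string nested inside a JSON body, it is easy to get the escaping wrong by hand. The following is a minimal sketch of assembling the body in Python; the helper name `build_extract_body` and its keyword arguments are illustrative and not part of the Gyrence API.

```python
import json


def build_extract_body(url, prompt, schema=None, force_browser=False):
    """Assemble the JSON body for POST /api/v1/extract (hypothetical helper)."""
    if not url or not prompt:
        raise ValueError("url and prompt are required")
    body = {"url": url, "prompt": prompt}
    if schema is not None:
        # `schema` must be sent as a JSON *string*, so encode the dict first.
        body["schema"] = json.dumps(schema, separators=(",", ":"))
    if force_browser:
        body["forceBrowser"] = True
    return body


body = build_extract_body(
    "https://example.com/products/widget",
    "Extract the product name, price, currency, and availability.",
    schema={"name": "string", "price": "number", "currency": "string", "inStock": "boolean"},
)
payload = json.dumps(body)  # the outer encoding escapes the inner quotes
```

Letting `json.dumps` do both levels of encoding avoids hand-written backslashes and guarantees the `schema` value is valid JSON before the request is sent.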

Response

{
  "ok": true,
  "data": {
    "url": "https://example.com/products/widget",
    "title": "Widget — Example",
    "extractedJson": "{\"name\":\"Widget\",\"price\":29.99,\"currency\":\"USD\",\"inStock\":true}",
    "via": "http"
  }
}

| Field | Type | Description |
|---|---|---|
| url | string | The final URL fetched (after redirects). |
| title | string | Page title parsed from the fetched HTML. |
| extractedJson | string | Stringified JSON returned by the model. Always a string; parse it client-side. On malformed model output, falls back to `{"_raw": "<original text>"}`. |
| via | `"http" \| "browser"` | How the page was fetched. `http` = direct fetch succeeded. `browser` = escalated to the headless worker (costs 2 extra credits). |
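Since `extractedJson` is a string, the client has to decode it a second time and check for the `_raw` fallback. A minimal sketch, assuming the response envelope shown above; the function name `parse_extract_response` is hypothetical.

```python
import json


def parse_extract_response(envelope):
    """Return (fields, via) from a successful /extract envelope (hypothetical helper)."""
    if not envelope.get("ok"):
        raise RuntimeError(f"{envelope.get('code')}: {envelope.get('error')}")
    data = envelope["data"]
    fields = json.loads(data["extractedJson"])  # second decode: the field is a string
    # Malformed model output arrives as {"_raw": "<original text>"}.
    if isinstance(fields, dict) and set(fields) == {"_raw"}:
        raise ValueError(f"model returned non-JSON output: {fields['_raw']!r}")
    return fields, data["via"]


sample = {
    "ok": True,
    "data": {
        "url": "https://example.com/products/widget",
        "title": "Widget — Example",
        "extractedJson": "{\"name\":\"Widget\",\"price\":29.99,\"currency\":\"USD\",\"inStock\":true}",
        "via": "http",
    },
}
fields, via = parse_extract_response(sample)
```

Raising on `_raw` is one possible policy; a caller that wants to inspect the raw text could return it instead.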

Example

curl -X POST https://www.gyrence.com/api/v1/extract \
  -H "Authorization: Bearer $GYRENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/widget",
    "prompt": "Extract the product name, price, currency, and availability.",
    "schema": "{\"name\":\"string\",\"price\":\"number\",\"currency\":\"string\",\"inStock\":\"boolean\"}"
  }'
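The same call can be made from Python with only the standard library. This is a sketch rather than an official client: `make_extract_request` and `extract` are illustrative names, and the 30-second client timeout is an assumption chosen to sit just above the 25-second server deadline listed under Errors.

```python
import json
import urllib.request

API_URL = "https://www.gyrence.com/api/v1/extract"


def make_extract_request(body, api_key):
    """Build (but do not send) the POST request for /extract."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract(body, api_key, timeout=30):
    """Send the request and return the decoded response envelope.

    urllib raises urllib.error.HTTPError for 4xx/5xx statuses; the JSON
    error body can then be read from the exception object.
    """
    req = make_extract_request(body, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (performs a real, billable call):
#   envelope = extract({"url": "...", "prompt": "..."}, os.environ["GYRENCE_API_KEY"])
```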

Errors

| Code | HTTP | Meaning |
|---|---|---|
| bad_request | 400 | `url` missing or invalid, `prompt` empty, or `schema` is not valid JSON. |
| unauthorized | 401 | Authorization header missing or malformed, or the API key has been revoked. |
| credits_exhausted | 402 | Workspace balance below request cost, or the LLM provider returned 402. |
| forbidden_url | 403 | SSRF guard rejected a private, loopback, or link-local host. |
| timeout | 408 | Request exceeded the 25-second hard deadline. |
| rate_limited | 429 | Per-workspace rate limit, or the LLM provider rate-limited the call. |
| upstream_error | 502 | Page fetch or LLM call returned a 5xx. |
| unavailable | 503 | Any other unmapped error. |

Example error body

{ "ok": false, "error": "Schema must be valid JSON", "code": "bad_request" }
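One way to act on these codes is to turn a failed envelope into an exception that records whether a retry is worth attempting. In the sketch below, treating `timeout`, `rate_limited`, `upstream_error`, and `unavailable` as retryable is an assumption of this example, not documented Gyrence guidance; `ExtractError` and `raise_for_envelope` are hypothetical names.

```python
# Assumption: these codes describe transient conditions worth retrying with backoff.
RETRYABLE_CODES = {"timeout", "rate_limited", "upstream_error", "unavailable"}


class ExtractError(Exception):
    """Raised for an envelope with ok == false."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.retryable = code in RETRYABLE_CODES


def raise_for_envelope(envelope):
    """Return the envelope unchanged if ok, otherwise raise ExtractError."""
    if envelope.get("ok"):
        return envelope
    raise ExtractError(envelope.get("code", "unavailable"), envelope.get("error", ""))
```

Codes such as `bad_request`, `unauthorized`, `credits_exhausted`, and `forbidden_url` are left non-retryable here because repeating the identical request would be expected to fail the same way.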

Credits

5 credits when the page is fetched over HTTP, 7 credits when Gyrence escalates to the headless browser. Each call is charged once, regardless of model response length. See Notes for the full list of escalation triggers.
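For budgeting, the cost of a batch can be tallied from the `via` field of each response. A small sketch with illustrative helper names; the worst-case figure simply assumes every call escalates to the browser.

```python
CREDITS_BY_VIA = {"http": 5, "browser": 7}


def credits_used(vias):
    """Sum the credits billed for a list of `via` values from responses."""
    return sum(CREDITS_BY_VIA[v] for v in vias)


def worst_case_credits(num_calls):
    """Upper bound for a batch: every call escalates to the browser."""
    return num_calls * CREDITS_BY_VIA["browser"]


used = credits_used(["http", "browser", "http"])  # 5 + 7 + 5
budget = worst_case_credits(3)                    # 3 * 7
```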

Limits & known behavior

  • Content is truncated. Only the first 12,000 characters of converted Markdown are sent to the model, so the tail of a long page is dropped. Point at a narrow URL (e.g. a product page, not a catalog) for best results.
  • Prompt + page content compete for context. Long prompts reduce the page content the model sees. Aim for prompts under 500 characters; let the schema do the structural work and the prompt focus on what to extract, not how.
  • JSON guarantee is best-effort. The model is instructed to return strict JSON, but malformed output falls back to { "_raw": "<text>" }. Always handle that shape.
  • schema is a hint, not a validator. Gyrence does not enforce types on the model's output. Validate with your own Zod/JSON-schema layer if it matters.
  • One page per call. To extract across many URLs, fan out client-side (use /map to enumerate) and call /extract per URL.
  • Markdown fidelity. Tables and deeply nested lists may lose structure in HTML→Markdown conversion before the model sees them.
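Because the schema is only a hint and the `_raw` fallback is always possible, a client-side check is worth having. The sketch below assumes the flat field-to-type-name hint format used in the example body (`"string"`, `"number"`, `"boolean"`); that format and the `validate_against_hint` helper are illustrative, and a real project might use a JSON Schema or Zod-style validator instead.

```python
import json

# Assumption: hints map field names to one of these three type names.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def validate_against_hint(extracted_json, hint):
    """Return a list of problems; an empty list means the output matches the hint."""
    data = json.loads(extracted_json)
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    if set(data) == {"_raw"}:
        return ["model output was not valid JSON (_raw fallback)"]
    problems = []
    for field, type_name in hint.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not TYPE_CHECKS[type_name](data[field]):
            problems.append(f"{field}: expected {type_name}, got {type(data[field]).__name__}")
    return problems


hint = {"name": "string", "price": "number", "currency": "string", "inStock": "boolean"}
good = '{"name":"Widget","price":29.99,"currency":"USD","inStock":true}'
problems = validate_against_hint(good, hint)
```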

Notes

  • Model selection and prompting are handled by Gyrence. The model is instructed to return strict JSON, with no prose and no markdown fences. Gyrence may update the underlying model over time to improve quality or reduce cost; the API contract (request/response shape) remains stable.
  • Fetch pipeline: identical to /fetch. HTTP first, with automatic escalation to the headless browser on any of: origin 403/429/503, network error (status 0), body < 500 chars, JS-shell markers (id="root">, id="__next">, <noscript>, etc.), or post-conversion content loss (HTML ≥ 1000 chars producing < 100 markdown chars). forceBrowser: true skips the HTTP attempt.
  • via drives pricing, not the input. Even without forceBrowser, an automatic escalation bills as browser (7 credits).
  • SSRF re-validation runs on the final URL after redirects, not just the input.
  • No retries on model failure. Garbled output is returned as _raw for your inspection rather than re-prompted server-side.
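The escalation triggers listed above can be written out as a decision function, which may help when estimating which pages will bill at 7 credits. This is an illustration of the documented rules, not Gyrence's implementation: the order of checks, the substring matching, and the marker list (only the three markers named above, without the "etc.") are assumptions.

```python
# Only the markers named in the notes; the real list is longer ("etc.").
JS_SHELL_MARKERS = ('id="root">', 'id="__next">', "<noscript>")


def escalation_reason(status, html, markdown):
    """Return the first documented trigger that applies, or None for no escalation."""
    if status in (403, 429, 503):
        return "origin_status"
    if status == 0:
        return "network_error"
    if len(html) < 500:
        return "short_body"
    if any(marker in html for marker in JS_SHELL_MARKERS):
        return "js_shell"
    # Post-conversion content loss: sizeable HTML that yields almost no Markdown.
    if len(html) >= 1000 and len(markdown) < 100:
        return "content_loss"
    return None
```

For example, a 1,500-character HTML body that converts to a handful of Markdown characters would be reported as `content_loss` under these assumptions.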

Try it

Test prompts and schemas live in the console at /app/extract: paste a URL, iterate on the prompt, and copy the resulting JSON.