Extract
Pull structured JSON out of a single page using an LLM. Gyrence fetches the URL (HTTP first, browser escalation if needed), converts it to Markdown, and uses an LLM to return strict JSON matching your prompt or schema.
Use this when
- You need a few specific fields from a page (price, title, author, contact email) without writing selectors.
- The site's HTML changes often and selector-based scraping is too brittle.
- You want JSON shaped to your own schema, not raw page content.
- You're prototyping enrichment before committing to a per-domain parser.
| Method | POST |
| Path | /api/v1/extract |
| Auth | Bearer |
| Credits | 5 (7 if browser) |
Request
| Parameter | Type | Description |
|---|---|---|
| `url` (required) | string | Page URL to extract from. Must be a public http(s) origin. SSRF-blocked hosts are rejected. |
| `prompt` (required) | string | Natural-language instruction describing what to extract (e.g. "Return the product title, price in USD, and in-stock status"). |
| `schema` | string | Optional JSON string describing the desired shape. Sent to the model as a target structure. Must be valid JSON; invalid JSON returns `bad_request`. |
| `forceBrowser` | boolean (default: `false`) | Skip the HTTP-first attempt and render via the headless worker immediately. Use for known JS-heavy SPAs. |
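Because `schema` travels as a JSON string rather than a nested object, it is easy to send the wrong shape. The following is a minimal sketch of building the request with Python's standard library. The function name `build_extract_request` and the `api_key` argument are illustrative, not part of any Gyrence SDK; the endpoint URL and headers are taken from the curl example on this page.

```python
import json
import urllib.request

EXTRACT_URL = "https://www.gyrence.com/api/v1/extract"


def build_extract_request(api_key, url, prompt, schema=None, force_browser=False):
    """Build (but do not send) a POST request for /api/v1/extract."""
    if not prompt.strip():
        # An empty prompt is documented to return bad_request, so fail early.
        raise ValueError("prompt must not be empty")
    body = {"url": url, "prompt": prompt}
    if schema is not None:
        # The API expects `schema` as a JSON *string*, so serialize the dict here.
        body["schema"] = json.dumps(schema)
    if force_browser:
        body["forceBrowser"] = True
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The returned request can be sent with `urllib.request.urlopen(req, timeout=30)`. The 30-second client timeout is a suggestion chosen to sit just above the documented 25-second server deadline.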
Example body
{
"url": "https://example.com/products/widget",
"prompt": "Extract the product name, price, currency, and availability.",
"schema": "{\"name\":\"string\",\"price\":\"number\",\"currency\":\"string\",\"inStock\":\"boolean\"}"
}
Response
{
"ok": true,
"data": {
"url": "https://example.com/products/widget",
"title": "Widget — Example",
"extractedJson": "{\"name\":\"Widget\",\"price\":29.99,\"currency\":\"USD\",\"inStock\":true}",
"via": "http"
}
}
| Field | Type | Description |
|---|---|---|
| `url` | string | The final URL fetched (after redirects). |
| `title` | string | Page title parsed from the fetched HTML. |
| `extractedJson` | string | Stringified JSON returned by the model. Always a string; parse it client-side. On malformed model output, falls back to `{"_raw": "<original text>"}`. |
| `via` | `"http"` or `"browser"` | How the page was fetched. `http` = direct fetch succeeded. `browser` = escalated to the headless worker (costs 2 extra credits). |
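Since `extractedJson` is always a string and may carry the `_raw` fallback, client code needs two steps: decode the string, then check which shape came back. A small sketch of that handling follows. The helper name `parse_extract_response` is hypothetical, and treating an object whose only key is `_raw` as the fallback is an assumption based on the field description above.

```python
import json


def parse_extract_response(response):
    """Return (data, raw_text) from a decoded /extract response body.

    Exactly one of the two is non-None: `data` when the model produced usable
    JSON, `raw_text` when the response used the {"_raw": ...} fallback shape.
    """
    if not response.get("ok"):
        raise ValueError(f"extract failed: {response.get('code')}")
    # extractedJson is a string, so it needs its own json.loads call.
    extracted = json.loads(response["data"]["extractedJson"])
    if isinstance(extracted, dict) and set(extracted) == {"_raw"}:
        return None, extracted["_raw"]
    return extracted, None
```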
Example
curl -X POST https://www.gyrence.com/api/v1/extract \
-H "Authorization: Bearer $GYRENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/products/widget",
"prompt": "Extract the product name, price, currency, and availability.",
"schema": "{\"name\":\"string\",\"price\":\"number\",\"currency\":\"string\",\"inStock\":\"boolean\"}"
}'
Errors
| Code | HTTP | Meaning |
|---|---|---|
| `bad_request` | 400 | `url` missing/invalid, `prompt` empty, or `schema` is not valid JSON. |
| `unauthorized` | 401 | Missing, malformed, or revoked Authorization header. |
| `credits_exhausted` | 402 | Workspace balance below request cost, or the LLM provider returned 402. |
| `forbidden_url` | 403 | SSRF guard rejected a private, loopback, or link-local host. |
| `timeout` | 408 | Request exceeded the 25-second hard deadline. |
| `rate_limited` | 429 | Per-workspace rate limit, or the LLM provider rate-limited the call. |
| `upstream_error` | 502 | Page fetch or LLM call returned a 5xx. |
| `unavailable` | 503 | Any other unmapped error. |
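One way to act on these codes client-side is to turn the error envelope (`ok`, `error`, `code`) into an exception that records whether a retry is worth attempting. The sketch below does that. The split into retryable and non-retryable codes is an assumption made for illustration, not documented Gyrence guidance, and `ExtractError` and `raise_for_error` are hypothetical names.

```python
# Assumption: transient failures are worth retrying; caller mistakes and
# billing problems are not. Adjust this set to your own policy.
RETRYABLE_CODES = {"timeout", "rate_limited", "upstream_error", "unavailable"}


class ExtractError(Exception):
    """Raised when a decoded /extract response has ok == false."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.retryable = code in RETRYABLE_CODES


def raise_for_error(response):
    """Return the response unchanged on success, raise ExtractError otherwise."""
    if response.get("ok"):
        return response
    raise ExtractError(response.get("code", "unavailable"), response.get("error", ""))
```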
{ "ok": false, "error": "Schema must be valid JSON", "code": "bad_request" }
Credits
5 credits when fetched over HTTP, 7 credits when Gyrence escalates to the headless browser. Charged once per call regardless of model response length. See Notes for the full list of escalation triggers.
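For budgeting a batch, the two prices give a simple range: every call costs at least 5 credits and at most 7. A short sketch of that arithmetic, with hypothetical helper names:

```python
HTTP_CREDITS = 5
BROWSER_CREDITS = 7


def credits_for(via):
    """Credits billed for one call, keyed on the response's `via` field."""
    return {"http": HTTP_CREDITS, "browser": BROWSER_CREDITS}[via]


def batch_cost_range(n_urls):
    """Return (best case, worst case) credits for n_urls successful calls."""
    return n_urls * HTTP_CREDITS, n_urls * BROWSER_CREDITS
```

For example, 100 URLs would cost between 500 and 700 credits depending on how many calls escalate to the browser.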
Limits & known behavior
- Content is truncated. Only the first 12,000 characters of converted Markdown are sent to the model. Long pages have their tail dropped — narrow the URL (e.g. a product page, not a catalog) for best results.
- Prompt + page content compete for context. Long prompts reduce the page content the model sees. Aim for prompts under 500 characters; let the schema do the structural work and the prompt focus on what to extract, not how.
- JSON guarantee is best-effort. The model is instructed to return strict JSON, but malformed output falls back to `{ "_raw": "<text>" }`. Always handle that shape.
- `schema` is a hint, not a validator. Gyrence does not enforce types on the model's output. Validate with your own Zod/JSON-schema layer if it matters.
- One page per call. To extract across many URLs, fan out client-side (use `/map` to enumerate) and call `/extract` per URL.
- Markdown fidelity. Tables and deeply nested lists may lose structure in HTML→Markdown conversion before the model sees them.
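Because the schema is only a hint, a client-side check is the only type guarantee. The sketch below validates `extractedJson` against a flat schema of the kind used in the example body. It assumes the type vocabulary `string`, `number`, and `boolean` seen in that example, and it is not a substitute for a full JSON Schema validator; `validate_against_hint` is a hypothetical name.

```python
import json

# Maps the type names used in the example schema to Python checks.
# This vocabulary is an assumption drawn from the example; extend as needed.
_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def validate_against_hint(extracted_json, schema):
    """Return a list of problems; an empty list means the output matches."""
    data = json.loads(extracted_json)
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    if "_raw" in data:
        return ["model output was not valid JSON (got _raw fallback)"]
    problems = []
    for field, type_name in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not _CHECKS[type_name](data[field]):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems
```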
Notes
- Model selection and prompting are handled by Gyrence. The system enforces strict JSON output — no prose, no markdown fences. Gyrence may update the underlying model over time to improve quality or reduce cost; the API contract (request/response shape) remains stable.
- Fetch pipeline: identical to `/fetch`. HTTP first, with automatic escalation to the headless browser on any of: origin 403/429/503, network error (status 0), body < 500 chars, JS-shell markers (`id="root">`, `id="__next">`, `<noscript>`, etc.), or post-conversion content loss (HTML ≥ 1000 chars producing < 100 markdown chars). `forceBrowser: true` skips the HTTP attempt.
- `via` drives pricing, not the input. Even without `forceBrowser`, an automatic escalation bills as `browser` (7 credits).
- SSRF re-validation runs on the final URL after redirects, not just the input.
- No retries on model failure. Garbled output is returned as `_raw` for your inspection rather than re-prompted server-side.
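To predict which pages are likely to bill as `browser`, the escalation triggers listed in the notes can be approximated locally. The sketch below is an illustration of those documented rules, not Gyrence's actual implementation. The reason labels are hypothetical, the order of checks is a guess, and the marker list contains only the three markers named above (the real list is longer).

```python
ESCALATE_STATUSES = {0, 403, 429, 503}  # 0 stands for a network error
# Only the markers named in the notes; the documented list ends with "etc."
JS_SHELL_MARKERS = ('id="root">', 'id="__next">', "<noscript>")


def should_escalate(status, html, markdown):
    """Return a reason label if the documented triggers apply, else None."""
    if status in ESCALATE_STATUSES:
        return "origin_status"
    if len(html) < 500:
        return "short_body"
    if any(marker in html for marker in JS_SHELL_MARKERS):
        return "js_shell"
    # Large HTML that converts to almost no Markdown suggests content loss.
    if len(html) >= 1000 and len(markdown) < 100:
        return "content_loss"
    return None
```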
Try it
Test prompts and schemas live in the console at /app/extract — paste a URL, iterate on the prompt, copy the resulting JSON.
