A source is a named connection to some external system or set of files, exposed to the query engine as one or more tables. Every table is addressed in SQL as <source>.<table> — the source name is the schema prefix. Sources are declared under sources: in your workspace pawrly.yaml (see Configuration, and the examples/pawrly.yaml worked file).
sources:
- name: data
kind: file
config:
path: ./data/*.parquetSELECT * FROM data.orders -- table `orders` in source `data`Pawrly has two foundational backends — file (local files, or object storage) and http (any REST/GraphQL API) — plus a small set of first-class database / lakehouse builtins that run through an in-process DuckDB engine.
This page is the reference for the source block (every top-level field) and the per-kind config for each kind. For the surrounding config — secrets, caching internals, safety semantics, defaults, and multi-file assembly — see Configuration.
The source block
Every entry under sources: is one source. These are the top-level fields:
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
name |
string | yes | — | SQL identifier; becomes the schema prefix (name.table). Must be unique. |
kind |
enum | yes | — | The source kind — see Source kinds. Case-insensitive; some kinds have aliases. |
description |
string | no | — | Free text; surfaced in pawrly source list. |
wiki |
string | no | — | Agent-facing usage notes for the whole source; surfaced by describe_table. See [wiki](#wiki). |
examples |
list | no | [] |
SQL statements that must run against this source; probed by pawrly check. See [examples](#examples). |
config |
mapping | no¹ | {} |
Per-kind settings (connection, auth, paths, storage, …). Shape depends on kind. |
tables |
list | no¹ | [] |
Explicit per-table declarations. Required for some kinds, optional for others (which auto-discover). |
cache |
mapping | no | mode: none |
Per-source caching. See the cache block. |
safety |
mapping | no | permissive | Per-source guard rails. See the safety block. |
raw_table |
bool | no | false |
http only: register a raw-HTTP escape-hatch table named after the source. |
raw_table_safety |
mapping | no | filter-required | Safety policy for the raw table when raw_table: true. |
¹ Whether config or tables is required depends on the kind (see each kind below).
The config-layer source block also accepts from: (load the body from another file) — see Configuration → Multi-file configs.
Strict keys. The source block rejects unknown top-level fields: a typo'd or misplaced key (e.g.
safty:, or a kind-specific key written outsideconfig:/tables:) fails the config load with an error rather than being silently ignored. The full machine-readable shape lives in the generated JSON Schema at[schemas/pawrly.schema.json](../schemas/pawrly.schema.json), which editors can use for completion and validation.
name
A valid SQL identifier (letter or _, then alphanumerics/_). It's the schema under which the source's tables are registered, so SELECT … FROM <name>.<table>. Names must be unique across the merged config.
kind
The kind selects the backend and the shape of config/tables. The list is closed (adding a kind is a code change). Matching is case-insensitive; the aliases below resolve to the same kind.
| Kind | Aliases | Backend |
|---|---|---|
file |
— | DataFusion native readers (local), or DuckDB object-store reads (with a storage: block) |
http |
— | native HTTP table provider |
mcp |
— | external MCP server's tools as tables (stdio or streamable HTTP) |
sqlite |
— | read-only attach |
postgres |
pg, postgresql |
DuckDB ATTACH (read-only) |
mysql |
— | DuckDB ATTACH (read-only) |
duckdb |
— | DuckDB ATTACH of a local .duckdb file (read-only) |
snowflake |
— | DuckDB ATTACH (community extension) |
iceberg |
— | DuckDB iceberg_scan |
delta |
deltalake |
DuckDB delta_scan |
ducklake |
— | DuckDB ATTACH 'ducklake:…' |
The DuckDB-backed kinds load a DuckDB extension on first use (e.g.
postgres,iceberg,ducklake,httpfs; Snowflake's is a community extension). The first registration in a fresh environment may need network access to fetch the extension.
description
Optional human-readable text. No effect on behavior; shown by pawrly source list.
wiki
Optional free-text usage notes aimed at the MCP/agent surface rather than execution: which filters to set, identifier quirks, how to decode a column. Declared on a source (applies to all its tables) and/or on individual table entries; describe_table returns both joined (source notes first). Richer than description, which stays a one-liner for listings.
sources:
- name: gh
kind: http
wiki: |
All endpoints need `owner` and `repo` filters. Timestamps are ISO-8601 UTC.
tables:
- name: pulls
wiki: |
`state` defaults to `open`; pass `state = 'all'` to include merged PRs.
# …examples
Optional list of SQL statements that must run successfully against this source. pawrly check executes them as live probes, so a broken endpoint or credential is caught at check time rather than first query. describe_table also returns the examples that mention the described table, giving agents known-good starting queries.
sources:
- name: gh
kind: http
examples:
- SELECT * FROM gh.pulls WHERE owner = 'pawrly' AND repo = 'pawrly' LIMIT 1
# …config
A per-kind mapping, opaque to the config layer and interpreted by the kind's builder (so each kind documents its own keys below). Strings here may use ${secret:NAME}, ${env:NAME}, and ${file:PATH} interpolation (see Configuration → Secrets).
tables
Explicit per-table declarations. Per-table fields are written flat — the kind-specific keys (path, format, endpoint, params, response, query, …) sit directly under the table entry, not under a nested config::
tables:
- name: orders # required; the SQL table name
description: Daily orders # optional
path: ./data/orders.parquet # ← kind-specific fields, flat
format: parquet
cache: { mode: ttl, ttl: 1h } # optional; overrides the source-level cache
safety: { max_rows: 100000 } # optional; overrides the source-level safetyOnly name, description, wiki, cache, and safety are common; everything else is kind-specific. Some kinds auto-discover tables when tables: is omitted (file globs, sqlite/postgres/mysql/duckdb/snowflake/ducklake enumerate). Others require tables: (iceberg, delta, and object-store file).
Whether a kind needs tables:, and whether it reads per-table fields at all:
| Kind | tables: |
Per-table fields read |
|---|---|---|
file (local) |
optional (globs auto-discover) | path, format, csv, json, schema, partition_cols |
file (object store) |
required | path/location, format |
http |
required for typed tables | full request/response spec (see the Http backend section below) |
mcp |
optional (tools auto-expose) | transport, then command/url + auth (see the MCP backend section) |
sqlite |
optional (auto-enumerates) | query (reshape/restrict) |
postgres, mysql, duckdb, snowflake, ducklake |
optional (lazy catalog enumeration) | none — entries are ignored; the live catalog is exposed as-is |
iceberg, delta |
required | path/location |
For the attach-style catalog kinds (
postgres/mysql/duckdb/snowflake/ducklake), tables are surfaced lazily straight from the remote catalog, so addingtables:entries does not restrict or rename them. Use the semantic layer if you need curated views over an attached database.
The cache block
Opt-in caching for a source (or an individual table). With no block, reads always go live.
cache:
mode: ttl # none | ttl | refresh | cron | append
ttl: 10mmode |
Extra field | Behaviour |
|---|---|---|
none |
— | No caching (default when cache: is absent). |
ttl |
ttl: <dur> |
Serve the cached result until ttl elapses, then re-fetch on the next read. |
refresh |
every: <dur> |
Always read the cache; a background loop re-fetches every every. |
cron |
cron: "<expr>" |
Like refresh, scheduled by a cron expression. |
append |
cursor_column: <col> |
Incremental: only rows newer than the cached cursor_column max are fetched on refresh. |
Durations use humantime (30s, 10m, 1h). Storage location, namespacing, and cache-management commands are covered in Configuration → Caching.
The safety block
Guard rails enforced before a scan runs. All fields are optional and default to permissive.
safety:
require_filters_on: [order_date] # error unless a filter touches each of these columns
require_at_least_one_filter: true # refuse a full-table scan
max_rows: 1000000 # hard cap on returned rows
max_pages: 50 # cap on HTTP pagination calls
timeout: 30s # per-query timeout
required_predicates: # predicates AND-ed into every scan
- "tenant_id = ${param:tenant_id}"required_predicates is most useful with the semantic layer, where ${param:NAME} placeholders are bound from query params as safe literals (row-level security). See Configuration → Safety.
raw_table / raw_table_safety
For kind: http only, raw_table: true registers an escape-hatch table named after the source for endpoints with no typed spec. You provide the request as filters; Pawrly returns the raw response as rows. Columns:
| Column | Type | Notes |
|---|---|---|
request_method |
varchar | defaults to GET if not filtered |
request_path |
varchar | filter required (= or IN (…)) |
request_query |
varchar | optional query string |
response_status |
int | HTTP status code |
response_body |
varchar | raw response body |
SELECT response_status, response_body
FROM gh -- the source itself is the raw table
WHERE request_path = '/rate_limit'raw_table_safety overrides the default policy, which requires a filter on request_path (so a bare SELECT * can't fan out arbitrarily).
Unlike typed tables, the raw table is registered unqualified (in the default schema) under the source's own name, so it is queried as
FROM <source>— notFROM <source>.<table>. This means a source withraw_table: truereserves its bare name for the raw escape hatch; its typed tables still live at<source>.<table>.
Source Kinds
File Backend (file) — local files & object storage
The file backend serves columnar and row files. Local files use DataFusion's native readers; object storage (S3/GCS/Azure) is expressed with a storage: block and read through DuckDB. A file source needs either a top-level config.path glob or at least one tables: entry.
sources:
- name: data
kind: file
config:
path: ./data/*.csv # glob; one table per file, named by file stemPer-table fields are written flat under each tables: entry: path, format, csv, json, schema, partition_cols. The three big topics — formats, globs/partitioning, and object storage — follow.
File formats
format is one of parquet, csv, json. It's inferred from the file extension when omitted (.parquet → parquet; .csv → csv; .json / .jsonl / .ndjson → json). Specify it explicitly for extensionless paths or directories.
CSV — override the dialect with a csv: block (all optional):
| Key | Default | Notes |
|---|---|---|
header |
true |
First row is a header. Set false for headerless files (pair with an explicit schema). |
delimiter |
, |
Single character. "\t" is accepted for tab. |
quote |
" |
Single quote character. |
tables:
- name: metrics
path: ./data/metrics.tsv
format: csv
csv: { header: false, delimiter: "\t" }
schema: # name + type the columns for a headerless file
- { name: host, type: varchar }
- { name: value, type: bigint }JSON/JSONL — files may be newline-delimited (NDJSON) or a single [ … ] array. The layout is auto-detected from the first non-whitespace byte; force it with a json: block:
- name: facts
path: ./data/facts.json
format: json
json: { format: array } # array | ndjson | auto (default)Explicit schema — a schema: list of { name, type } overrides inference (useful for headerless CSV or mis-inferred columns). Column type values (here and in partition_cols): bool/boolean, int/int32, bigint/int64, float/float32, double/float64, date, and varchar (the default for anything else).
File Partitions
A per-table path may be:
- a single file —
./data/orders.parquet; - a glob —
./data/orders/*.parquet(all matches unioned into one table); - a directory —
./lake/events(every file beneath it, read as one table).
For partitioned datasets, declare partition_cols so the partition keys become queryable columns. This applies to all three formats (parquet, csv, json). Two styles, one per table:
Hive Partition — key=value directories (e.g. events/dt=2026-05-31/region=us/*.parquet). The keys are exposed as columns and prune by directory (a filter on dt skips non-matching folders). Streams through the file reader.
- name: events # events/dt=…/region=…/*.parquet
path: ./lake/events
format: parquet
partition_cols:
- { name: dt, type: date }
- { name: region, type: varchar }SELECT * FROM data.events WHERE dt = '2026-05-31' -- only that dt= directory is readSegment — positional partitions for layouts that aren't key=value. Each column takes its value from the directory name at a zero-based index beneath the glob base. Segment-partitioned tables are materialized in memory, so they don't prune.
- name: sessions # projects/<project>/*.jsonl
path: ./projects/*/*.jsonl
format: json
partition_cols:
- { name: project, type: varchar, kind: segment, index: 0 }Each partition_cols entry is { name, type (default varchar), kind: hive | segment (default hive), index (required for segment) }.
Object storage (S3 / GCS / Azure)
Add a storage: block to read from a bucket. storage.type selects the provider; storage.region and the bucket URLs are the location; credentials live under a typed storage.auth block. Object-store file sources require explicit tables:, each pointing at a remote URL.
sources:
- name: lake
kind: file
config:
storage:
type: s3 # s3 | gcs | azure
region: us-east-1 # location, not a credential
auth:
type: access_key # access_key | credential_chain
access_key_id: ${secret:AWS_KEY_ID}
secret_access_key: ${secret:AWS_SECRET}
# endpoint: ${env:AWS_ENDPOINT} # S3-compatible stores (MinIO, R2, …)
tables:
- name: events
path: s3://my-bucket/events/*.parquet
format: parquet # parquet (default) | csv | jsonauth.type selects the method (default access_key); each provider supports more than one:
type |
auth.type |
Fields |
|---|---|---|
s3 |
access_key |
access_key_id, secret_access_key, session_token, endpoint, url_style |
gcs |
access_key |
access_key_id, secret_access_key (HMAC keys) |
azure |
access_key |
connection_string, account_name |
http |
header / basic |
header/basic auth attached to HTTPS file reads |
| any | credential_chain (alias chain) |
none — resolve from the ambient chain (env / instance profile / gcloud / az login); optional endpoint, account_name |
With no auth block, the ambient credential chain is used. auth.type: chain is accepted as a shorthand alias for credential_chain.
In addition to s3/gcs/azure, storage.type: http covers authenticated HTTPS file URLs — combine it with a header or basic auth block to attach credentials to plain-HTTP reads (the custom/oauth2 HTTP-source auth styles do not apply here).
Scheme auto-routing. A file source is routed through DuckDB automatically whenever it sees a storage: block or a remote scheme on any path/location — s3://, gs:///gcs://, az:///azure:///abfss://, or http(s)://. A public bucket or HTTPS file therefore works with no storage: block at all; you only need one to supply credentials, a region, or a custom endpoint.
Remote files are read by DuckDB's
read_parquet/read_csv/read_json, so the local-filecsv/json/partition_cols/schemaoptions do not apply to object-store tables — DuckDB infers the schema and reader from the URL andformat. Remotehttp(s)://paths also cannot be globbed; point each table at a single concrete URL (bucket globs likes3://…/*.parquetare fine).
Http Backend (http) — REST & GraphQL APIs
Turns an HTTP API into SQL tables: you declare each table's request and how to shape its JSON response into rows. Source-level config carries base_url (required), auth, static request headers, retries, and rate limiting; each tables: entry maps one request shape to rows.
sources:
- name: gh
kind: http
config:
base_url: https://api.github.com # joined with each table's endpoint
token: ${secret:GITHUB_TOKEN}
raw_table: true
tables:
- name: pulls
endpoint: /repos/{owner}/{repo}/pulls
params:
- { name: owner, required: true }
- { name: repo, required: true }
- { name: state, required: false, default: open }
response:
path: $
schema:
- { name: number, type: bigint }
- { name: title, type: varchar }
- { name: state, type: varchar }
pagination: { type: link_header }SELECT number, title FROM gh.pulls
WHERE owner = 'CITGuru' AND repo = 'pawrly' AND state = 'open' LIMIT 20From an OpenAPI spec (config.type: openapi)
Instead of hand-writing tables:, point an HTTP source at an OpenAPI 3.0.x spec and Pawrly
synthesizes one table per GET operation at load time — endpoint, params, columns, rows path, and
pagination are read from the document. Here base_url is the spec URL; the real API base comes
from the spec's own servers.
sources:
- name: stripe
kind: http
config:
type: openapi
base_url: https://raw.githubusercontent.com/stripe/openapi/refs/heads/master/latest/openapi.spec3.yaml
auth:
type: header
headers:
- name: Authorization
bearer: ${secret:STRIPE_API_KEY}
openapi:
include: { paths: ["/v1/charges*", "/v1/customers*"] } # optional; default = every GETSELECT id, amount, currency, status FROM stripe.get_charges LIMIT 10Only read-only GET operations are exposed; pagination is inferred generically (page, offset,
cursor, and a last-row cursor). Where inference is uncertain (e.g. a polymorphic response) the column
degrades to json and a diagnostic is logged.
Adjusting a synthesized table. A tables: entry whose name matches a synthesized table
patches it — only the fields you set are merged in, the rest of the synthesis is kept; a name
that matches nothing is a full new table definition. So fixing one field doesn't mean re-declaring
the endpoint and every column:
tables:
- name: get_charges
response: { path: "$.data" } # patch the rows-path; keep the synthesized columns
- name: get_events
pagination: null # drop the inferred paginationFields merge per key; arrays (response.schema, params) and type-tagged blocks (pagination)
replace wholesale, and a null clears a field.
Top-level config (the same plumbing as hand-declared HTTP mode):
| Key | Required | Description |
|---|---|---|
type |
yes | openapi — enables spec-driven synthesis. Absent/manual keeps the hand-declared behaviour. |
base_url |
yes | The spec location: an http(s):// URL or a file:// path. The API base comes from the spec's servers. |
auth |
no | Auth block (header / basic / custom / oauth2) — see Authentication. |
token |
no | Bearer shorthand: sent as Authorization: Bearer <token> (equivalent to a one-header auth). |
headers |
no | Static request headers attached to every call. |
retry |
no | { max_retries, base_backoff_ms, max_backoff_ms } — see Rate limiting & retries. |
rate_limit |
no | { requests_per_second, remaining_header, reset_header, extra_statuses }. |
config.openapi (synthesis-specific):
| Key | Description |
|---|---|
include |
{ tags: [...], paths: [globs], operations: [...] } — only matching GETs become tables. A * glob matches path segments. Omit include to register every GET. |
exclude |
Same shape as include; an operation matching exclude is dropped (wins over include). |
naming |
How tables are named: operationId (default) |
base_url |
Override for the effective request base, used only when the spec declares no usable servers[0].url. |
cache |
{ ttl: <duration> } — cache the fetched spec on disk and reuse it while fresh (e.g. 24h, 30m). Omit to re-fetch on every load. Applies to http(s):// specs only. |
Source-level safety.max_pages caps the pagination loop for synthesized tables (they inherit no
per-table safety). With openapi.cache set, the spec is stored under
$PAWRLY_HOME/cache/openapi/ (default ~/.pawrly) keyed by URL; without it, the document is fetched
on every load (or point base_url at a vendored file:// copy).
Authentication
Set source-level auth with the config.token shorthand or a full auth: block (the block wins if both are present). The block is tagged by type — header, basic, custom, or oauth2:
header — attach one or more headers (bearer tokens and API keys live here). headers is a list; each entry gives a name plus exactly one of bearer (sent as Bearer <value>) or value (sent verbatim). Multiple entries cover APIs that need several auth headers at once (e.g. Datadog).
config:
base_url: https://api.example.com
auth:
type: header
headers:
- { name: Authorization, bearer: "${secret:GITHUB_TOKEN}" } # → "Bearer …"
- { name: X-Api-Key, value: "${secret:API_KEY}" } # literalbasic — base64-encodes username:password into Authorization: Basic ….
auth:
type: basic
username: "${secret:API_USER}"
password: "${secret:API_PASSWORD}"custom — credentials carried outside headers. query is a list of { name, value } appended to every request as query-string params (the many ?api_key=… APIs); body is a list of { name, value } injected into the request body as a JSON object. When a table also declares its own JSON body, the body fields are merged on top of it; with no table body, they are sent as the whole JSON body. (Merging requires a JSON table body — a form body errors.)
auth:
type: custom
query:
- { name: api_key, value: "${secret:API_KEY}" }
body:
- { name: tenant, value: acme }oauth2 — client-credentials grant: a token is fetched on first use, cached, re-fetched before expiry, then sent as Authorization: Bearer <token>. Fields: token_url, client_id, client_secret, optional scope, audience.
auth:
type: oauth2
token_url: https://login.example.com/oauth/token
client_id: ${secret:CLIENT_ID}
client_secret: ${secret:CLIENT_SECRET}
scope: read:data # optionalShorthand — config.token is the dead-simple single-bearer case, equivalent to a header block with one Authorization: Bearer <token> entry:
config:
base_url: https://api.github.com
token: ${secret:GITHUB_TOKEN}Gotcha. A malformed
auth:block (wrongtype, a missing required field) does not raise a config error — it falls back to no authentication, and requests go out unauthenticated. If an API starts returning401/403, double-check theauthblock's shape first.
Request headers
config.headers is a source-level string→string map applied to every request the source issues — typed tables and the raw table alike. Use it for the constant headers an API expects on all calls (a media type, an API-version pin) instead of repeating them in each table's headers:
config:
base_url: https://api.github.com
token: ${secret:GITHUB_TOKEN}
headers:
Accept: application/vnd.github+json
X-GitHub-Api-Version: '2022-11-28'Source headers go on first; a table's own headers are merged on top and override on a key collision. An entry with an invalid header name or non-string value is skipped with a warning rather than failing the source. (Auth headers come from the auth block, not here.)
Request
Each table's request is built from these flat fields:
- endpoint (required) — path appended to
base_url. May carry a query string and{param}placeholders; a param whose name matches a{placeholder}fills the URL path. Remaining params become query parameters — except those consumed by the bodytemplate, which are sent in the body only, not duplicated onto the query string. - method — defaults to
GET. - headers — a per-table map of extra request headers. Constant headers shared by every table (e.g.
Accept, an API-version header) are better set once at the source level viaconfig.headers(see below); per-tableheadersare merged on top and win on a key collision. - body — for POST/PUT/GraphQL:
kind(json, the default, setsContent-Type: application/json; orformforapplication/x-www-form-urlencoded) andtemplate(body text with{param}placeholders; other braces — JSON/GraphQL syntax — are left untouched). In a JSON body, an unbound optional param (no filter and nodefault) doesn't leave its placeholder behind: the JSON object member holding it is dropped, and any object it empties is dropped too. This is what makes inlined optional GraphQL filters work —filter: { team: { id: { eq: "{team_id}" } } }withteam_idunfiltered collapses to noteamfilter (the variable defaults to null = match-all) instead of sending a literal"{team_id}"that matches nothing. - requests — conditional request shapes tried in order; the first whose
when_filtersare all bound replaces the defaultendpoint/method/body. Each entry is{ when_filters: [...], endpoint, method?, body? }. The classic use is a get-by-id endpoint when an id filter is present, falling back to a list endpoint otherwise.
- name: search
endpoint: /graphql
method: POST
params:
- { name: q, required: true }
body:
kind: json
template: '{"query": "{ search(q: \"{q}\") { id name } }"}'
response:
path: $.data.search
schema:
- { name: id, type: varchar }
- { name: name, type: varchar }Query parameters
params declares the columns a table accepts as filters. Each is { name, type (default varchar), required (default false), default, accepts, emit, explode, derive, filterable }:
required: true— the param must appear as a SQL filter, or the scan fails with a clear error (rather than fetching an unbounded result).default— value used when the user doesn't filter on it.filterable: true— let the param be used in aWHEREclause. Normally a param can only be filtered on if it's also aresponse.schemacolumn; this exposes it as a column without declaring one. Use it for a filter the API accepts but doesn't return.
params:
- { name: status, filterable: true } # WHERE status = 'active' → ?status=active- Equality pushes down by default:
WHERE state = 'open'→?state=open. - Comparisons — to push
>=/<=etc., list them inacceptsand map each to a query-parameter name inemit:
params:
- name: created
accepts: [">=", "<="]
emit: { ">=": since, "<=": until } # WHERE created >= X → ?since=X- explode —
explode: truepushes a SQLIN (a, b, c)filter down as repeated query pairs (?key=a&key=b&key=c) instead of post-filtering. Equality still emits a single pair; a non-explodeINis filtered in-engine as before. Only honored for the query string (not path/body params). - derive — compute the param's value when the query doesn't supply one (a dynamic default), tagged by
kind:ago—{ kind: ago, seconds: N }→ epoch secondsnow - N(a relative time window, e.g. "last hour").split—{ kind: split, from: <other param>, separator: "-", part: 0 }→ apart(0-based) of another bound param's value split byseparator. Useful to derive request fields from a composite filter (e.g. an issue keyENG-123→ teamENG+ number123). A derived param that feeds a body template stays out of the query string.
params:
- { name: from, type: bigint, derive: { kind: ago, seconds: 3600 } }
- { name: status, explode: true } # WHERE status IN ('open','closed') → ?status=open&status=closedA param can also be surfaced as an output column with source: param on a response.schema entry (see below).
Response
response describes how to turn the JSON payload into rows:
path— JSONPath to the array of rows.$(the default) means the body is the array;$.datadigs into a wrapper object.reshape— turn a payload that isn't a flat array atpathinto rows before column extraction. See Reshaping the response.schema— the columns to extract per row, each{ name, type, source?, expr? }:type∈varchar/string/text,bigint/int64,int/int32,double,float,bool/boolean,date,timestamp,timestamptz(ISO-8601 / RFC 3339 strings are parsed), andjson(a nested object/array kept as raw JSON text).source— defaults to the row's top-level field of the same name. Set$.nested.fieldto read a different path,$to capture the whole row element (usually into ajsoncolumn), orparamto inject a request parameter as a column.expr— a computed expression evaluated per row, for columns a single path can't express. Takes precedence oversourcewhen set. See Computed columns.
allow_404_empty— treat a404as an empty result set instead of an error.error— surface API failures as a clear scan error:status(a list of codes or matchers like">=400","5xx","<500") and/orpath(a JSONPath to an error message inside a200-with-error body).
response:
path: $.data
allow_404_empty: true
schema:
- { name: id, type: bigint }
- { name: author, type: varchar, source: $.user.login }
- { name: payload, type: json, source: $ }
- { name: repo, type: varchar, source: param }
error:
status: [">=400"]
path: $.messageComputed columns
When a column needs more than a single JSONPath, give it an expr instead of a source. An expr is a small tree, each node tagged by kind, evaluated against each row (plus the bound request params). A missing path, a shape mismatch, or a failed transform yields NULL — never a scan error.
kind |
Fields | Result |
|---|---|---|
path |
path: [a, b] |
Walk object keys a.b from the row. |
coalesce |
exprs: [...] |
First non-null sub-expression. |
literal |
value |
A constant value. |
replace |
expr, from, to |
Find/replace on the string produced by expr. |
current_row |
— | The whole row element (usually into a json column). |
null |
— | Always null. |
from_filter |
filter |
Inject a bound request param's value. |
join |
path |
Join the array at path into "a,b,c" (objects as compact JSON). |
map_join |
path, item_path |
Take item_path of each array element, then join. |
first_of |
path, item_path |
item_path of the first array element (keeps its type). |
lookup |
path, key, key_field?, value_field? |
In [{key, value}], the value_field whose key_field equals key. The fields default to key/value. |
lookup_join |
path, key, key_field?, value_field? |
Like lookup but joins every matching value. |
pick |
path, by, item_path |
Index the object at path by the bound param by, then take item_path. |
to_timestamp |
unit (seconds/millis), expr |
Convert the epoch number from expr to an RFC 3339 string (a string passes through). |
from_base64 |
expr |
Base64-decode the string from expr into UTF-8 text. |
if_present |
check, then_value |
then_value when check is non-null, else null. |
schema:
# title from `attributes.title`, falling back to `title`
- name: title
type: varchar
expr:
kind: coalesce
exprs:
- { kind: path, path: [attributes, title] }
- { kind: path, path: [title] }
# array of label objects -> "bug,p1"
- { name: labels, type: varchar, expr: { kind: map_join, path: [labels], item_path: [name] } }
# epoch seconds -> a real timestamp
- { name: created, type: timestamp, expr: { kind: to_timestamp, unit: seconds, expr: { kind: path, path: [created_at] } } }Reshaping the response
response.path normally points at an array of rows. response.reshape turns the value at path into rows when it isn't one — applied before column extraction. Tagged by kind:
- dict_entries — the value at
pathis an object/map; each entry becomes a row: the entry value with its key added as_key(or{_key, _value}when the value isn't an object). Read the key withsource: $._key.
- name: calendar_colors
endpoint: /colors
response:
path: $.calendar
reshape: { kind: dict_entries }
schema:
- { name: id, type: varchar, source: $._key }
- { name: background, type: varchar }- series_points — flatten
{ <series>: [ { …, <points>: [[t, v], …] } ] }into one row per point: each series object (minus its points field) plus the point's two values, written under thetimestampandvaluecolumn names. Fields:series,points,timestamp,value.
- name: metrics
endpoint: /api/v1/query
response:
path: $
reshape: { kind: series_points, series: series, points: pointlist, timestamp: timestamp, value: value }
schema:
- { name: metric, type: varchar } # carried from each series object
- { name: timestamp, type: bigint } # point[0]
- { name: value, type: double } # point[1]Pagination
Set pagination to keep fetching pages; absent means a single request. A SQL LIMIT stops pagination early once enough rows are collected, and safety.max_pages caps the loop. The strategy is tagged by type:
type |
Fields | Behaviour |
|---|---|---|
link_header |
— | Follows the RFC 5988 Link: header's rel="next" URL until absent. |
cursor |
next_path, param |
Reads an opaque cursor from the body at next_path ($.a.b) and echoes it back as the param query parameter; stops when the cursor is absent/empty. |
body_cursor |
cursor_path, next_path |
Reads the next cursor from the body at next_path and writes it into the next request's JSON body at cursor_path (GraphQL variables.after, Notion start_cursor); stops when the cursor is absent/empty or a page is empty. |
row_cursor |
param, field (default id), more_path? |
Sends param = the last row's field (e.g. starting_after=<last id>); stops when more_path (a $.a.b boolean) is false, else on an empty page. |
page |
param, start (default 1), size_param?, size? |
Increments the page number in param until a page returns zero rows; optionally sends a page size via size_param/size. |
offset |
param, size_param, size |
Increments param by size each page until a short page (fewer than size rows). |
pagination: { type: cursor, next_path: $.response_metadata.next_cursor, param: cursor }Rate limiting & retries
rate_limit keeps requests within the API's quota:
| Field | Notes |
|---|---|
requests_per_second |
Local token-bucket ceiling shared across scans. Omit/zero to disable. |
remaining_header |
Response header carrying remaining quota (e.g. x-ratelimit-remaining); when it reads 0, the next request waits until the reset time. |
reset_header |
Response header carrying the reset time as an epoch-seconds timestamp. |
extra_statuses |
Status codes besides 429/503 also treated as rate-limit signals (e.g. GitHub's secondary-limit 403). |
retry governs transient failures (transport errors, 5xx, 429, 503, and any extra_statuses): max_retries (default 3), base_backoff_ms (default 200, doubles each attempt), max_backoff_ms (default 5000). Backoff honours a Retry-After header when present.
config:
base_url: https://api.github.com
rate_limit:
remaining_header: x-ratelimit-remaining
reset_header: x-ratelimit-reset
extra_statuses: [403] # GitHub secondary limit
retry:
max_retries: 5A runnable cache-over-API walkthrough lives at examples/cache-http/.
MCP Backend (mcp) — an MCP server's tools as tables
Connects to an external Model Context Protocol server and exposes its tools as SQL tables: a SELECT runs tools/call, pushed-down WHERE filters become tool arguments, and the result rows are projected into columns. Connect over stdio (a local subprocess) or streamable HTTP (a remote server).
sources:
- name: linear
kind: mcp
config:
transport: streamable_http
url: https://mcp.linear.app/mcp
auth:
type: header
headers:
- name: Authorization
bearer: ${secret:LINEAR_API_KEY}
- name: github
kind: mcp
config:
transport: stdio
command: ["npx", "-y", "@modelcontextprotocol/server-github"]
env:
GITHUB_TOKEN: ${secret:GITHUB_TOKEN}SELECT id, title, status FROM linear.list_issues WHERE assignee = 'me@example.com' LIMIT 20Two ways to get tables, one dial. A source produces tables from introspection (tools/list) and from declaration (tables:). config.expose sets how much introspection auto-exposes:
expose |
auto-exposed | use |
|---|---|---|
read_only (default) |
tools with annotations.readOnlyHint == true |
zero-config, safe |
all |
every non-destructive tool | you accept a SELECT may call any read tool |
listed |
none — only tables: / include: |
fully declarative |
include/exclude (by tool name) narrow whatever expose admits; a destructiveHint tool is never auto-exposed.
Output columns. A tool result is exposed as a single result json column unless the tool declares an outputSchema (columns inferred) or a tables: entry declares them. Each tables: entry patches a synthesized table of the same name or defines a new one (tool: + knobs):
| field | meaning |
|---|---|
tool |
the MCP tool to call (defaults to the table name on a patch). |
columns |
{ name, type, path: [keys] } — pull a (possibly nested) field out of each row element; empty path is the whole element as JSON. |
tool_args |
static arguments always sent. |
filters |
bind a SQL filter to a differently-named argument. |
limit_binding |
{ tool_arg, max } — push SQL LIMIT into a tool argument. |
pagination |
{ cursor_arg, response_cursor_path } — cursor pagination (default: follow nextCursor). |
Security. A streamable_http url must be https (or http only for loopback), and may not embed credentials in the URL — use the auth block (header/bearer/basic, same shapes as the HTTP backend) with ${secret:…} tokens. Validated at config-load time.
A runnable, per-source-per-file example lives at examples/mcp.yaml.
Databases and Lakehouse Formats
sqlite — local SQLite databases
Attaches a SQLite file read-only and exposes its tables; equality filters push down. When tables: is omitted, every user table is auto-registered. sqlite is the one attach-style kind whose tables: entries are honored — supply a query to restrict or reshape a table.
sources:
- name: app
kind: sqlite
config:
path: ./app.db
# tables: # optional: restrict / reshape
# - name: active_users
# query: SELECT id, email FROM users WHERE active = 1config key |
Required | Notes |
|---|---|---|
path |
yes | Path to the .db file (:memory: allowed). |
| Per-table key | Notes |
|---|---|
query |
Optional SQL backing the table; defaults to SELECT * FROM "<name>". |
postgres, mysql — foreign databases
DuckDB ATTACHes the database read-only and exposes its tables lazily (<source>.<table>); equality predicates, projection, and limits push down. No tables: needed.
| Key | Notes |
|---|---|
dsn |
Full connection string. If present, used as-is. |
host |
Required if no dsn. |
database / dbname |
Database name (either spelling). |
port |
Optional; accepts an integer or a string. |
user, password |
Optional. |
sources:
- name: oltp
kind: postgres # aliases: pg, postgresql
config:
host: db.internal
database: app
user: readonly
password: ${secret:PG_PASSWORD}duckdb — local DuckDB database file
Attaches a .duckdb database file read-only and exposes its tables lazily.
sources:
- name: local_db
kind: duckdb
config:
path: ./analytics.duckdbconfig key |
Required | Notes |
|---|---|---|
path |
yes | Path to the .duckdb file; resolved against the config dir. |
snowflake
DuckDB ATTACH via the Snowflake community extension (installed on first use). Requires account, user, password; optional database, schema, warehouse, role.
sources:
- name: warehouse
kind: snowflake
config:
account: acme.us-east-1
user: ${secret:SNOWFLAKE_USER}
password: ${secret:SNOWFLAKE_PASSWORD}
database: ANALYTICS
schema: PUBLICiceberg, delta — table formats
Each declared table maps to a DuckDB scan function over a table location. tables: is required, each with a path (or location).
sources:
- name: lake
kind: iceberg # or: delta (alias deltalake)
tables:
- name: orders
path: s3://bucket/warehouse/ordersFor tables on an object store, provide credentials with a storage: block (same keys as object-store file); the httpfs extension loads automatically.
| Per-table key | Required | Notes |
|---|---|---|
path / location |
yes | Table location passed to the DuckDB scan function. |
ducklake — DuckLake lakehouse catalog
Attaches a DuckLake catalog (a metadata database plus a data path) and exposes its tables lazily.
sources:
- name: lake
kind: ducklake
config:
catalog: ./metadata.ducklake # sqlite/duckdb/postgres catalog
data_path: ./lake_data # local or s3://… (+ optional storage block)config key |
Required | Notes |
|---|---|---|
catalog |
yes | The metadata catalog (sqlite/duckdb/postgres). |
data_path |
no | Where data files live; local or remote (s3://…). Resolved vs the config dir. |
storage |
no | Object-store credentials for a remote data_path (loads httpfs). |
Column type reference
Several places take a column type — the file backend's schema: and partition_cols, and the http backend's response.schema. The accepted spellings (case-insensitive) and the contexts that support them:
| Canonical type | Accepted spellings | file schema / partitions |
http response |
|---|---|---|---|
| boolean | bool, boolean |
✓ | ✓ |
| int32 | int, int32 |
✓ | ✓ |
| int64 | bigint, int64, long |
✓ | ✓ |
| float32 | float, float32 |
✓ | ✓ |
| float64 | double, float64 |
✓ | ✓ |
| date | date |
✓ | ✓ |
| timestamp | timestamp |
— | ✓ |
| timestamptz | timestamptz |
— | ✓ (RFC 3339) |
| varchar | varchar, string, text |
✓ (default) | ✓ |
| json | json |
— | ✓ (raw JSON text) |
Any unrecognized type spelling falls back to varchar. For http, timestamp/timestamptz parse ISO-8601 / RFC 3339 strings, and json keeps a nested object/array as raw JSON text (pair it with source: $ to capture a whole row element) — see Querying JSON columns to read into it.
Querying JSON columns
A json column is raw JSON text — an object, array, or scalar kept verbatim so nested structure survives the scan. Two function families read into it; they share no names, so mix them freely.
Pawrly JSON Helpers
| Function | Returns | Purpose |
|---|---|---|
from_json(text) |
list<varchar> |
Parse a JSON array into its element texts, so unnest(from_json(col)) explodes it into one row per element. A non-array value yields a single-element list. |
json_extract_string(text, path) |
varchar |
Read a field by key, or a dotted path (a.b.c) for nested objects. Strings come back unquoted; numbers/bools/objects as their JSON text; a missing key (or JSON null) is NULL. |
DataFusion JSON Helpers
| Function / Operator | Returns | Purpose |
|---|---|---|
json_get_str(col, ...) |
varchar |
Read a value at the given key/index path as text. |
json_get_int(col, ...) |
bigint |
Read a value at the path as an integer. |
json_get_float(col, ...) |
double |
Read a value at the path as a float. |
json_get_bool(col, ...) |
boolean |
Read a value at the path as a boolean. |
json_get_json(col, ...) |
varchar |
Read a value at the path back as raw JSON text (objects, arrays, or scalars). |
json_get_array(col, ...) |
list |
Read a value at the path as an array. |
json_length(col, ...) |
bigint |
Length of the array or object at the path (or of col itself). |
json_contains(col, ...) |
boolean |
Whether a value exists at the given key/index path. |
json_object_keys(col) |
list<varchar> |
The keys of the JSON object. |
col -> key |
varchar |
Get the value at key as raw JSON (alias for json_get). |
col ->> key |
varchar |
Get the value at key as text (alias for json_get_str). |
col ? key |
boolean |
Whether key is present (alias for json_contains). |
Exploding an array of objects
unnest must sit in a projection, not a correlated lateral join, so explode the array in a subquery (a CTE) and read its elements in the outer query:
WITH elems AS (
SELECT unnest(from_json(payload)) AS e -- payload is a JSON array of objects
FROM t
)
SELECT json_extract_string(e, 'code') AS code,
CAST(json_extract_string(e, 'amount') AS DOUBLE) AS amount
FROM elems
WHERE json_extract_string(e, 'code') = 'USD'json_length(payload) gives the array length without unnesting.
Gotchas
unnestis projection-only.**FROM t, unnest(from_json(t.payload))(a correlated lateral join) is rejected by the planner — use the CTE form above.- Operator precedence.
->/->>bind looser thanINand comparisons, so parenthesize:(e ->> 'code') IN ('USD', 'EUR'). - Typed getters match the JSON type.
json_get_float(e, 'amount')returns NULL when the value is a JSON string ("12.50") rather than a number — usejson_get_stror->>, thenCAST.
Federation
Every source is a table in one DataFusion plan, so you can join across kinds in a single statement — a local Parquet file against a Postgres table against an HTTP query — with no import step:
SELECT u.email, COUNT(p.number) AS open_prs
FROM oltp.users u
JOIN gh.pulls p ON p.user = u.github_login
WHERE p.owner = 'pawrly' AND p.repo = 'pawrly' AND p.state = 'open'
GROUP BY u.emailSee examples/pawrly.yaml for a kitchen-sink configuration covering every kind.