pawrly

Sources

A source is a named connection to some external system or set of files, exposed to the query engine as one or more tables. Every table is addressed in SQL as <source>.<table> — the source name is the schema prefix. Sources are declared under sources: in your workspace pawrly.yaml (see Configuration, and the examples/pawrly.yaml worked file).

sources:
  - name: data
    kind: file
    config:
      path: ./data/*.parquet
SELECT * FROM data.orders          -- table `orders` in source `data`

Pawrly has two foundational backendsfile (local files, or object storage) and http (any REST/GraphQL API) — plus a small set of first-class database / lakehouse builtins that run through an in-process DuckDB engine.

This page is the reference for the source block (every top-level field) and the per-kind config for each kind. For the surrounding config — secrets, caching internals, safety semantics, defaults, and multi-file assembly — see Configuration.


The source block

Every entry under sources: is one source. These are the top-level fields:

Field Type Required Default Notes
name string yes SQL identifier; becomes the schema prefix (name.table). Must be unique.
kind enum yes The source kind — see Source kinds. Case-insensitive; some kinds have aliases.
description string no Free text; surfaced in pawrly source list.
wiki string no Agent-facing usage notes for the whole source; surfaced by describe_table. See [wiki](#wiki).
examples list no [] SQL statements that must run against this source; probed by pawrly check. See [examples](#examples).
config mapping no¹ {} Per-kind settings (connection, auth, paths, storage, …). Shape depends on kind.
tables list no¹ [] Explicit per-table declarations. Required for some kinds, optional for others (which auto-discover).
cache mapping no mode: none Per-source caching. See the cache block.
safety mapping no permissive Per-source guard rails. See the safety block.
raw_table bool no false http only: register a raw-HTTP escape-hatch table named after the source.
raw_table_safety mapping no filter-required Safety policy for the raw table when raw_table: true.

¹ Whether config or tables is required depends on the kind (see each kind below).

The config-layer source block also accepts from: (load the body from another file) — see Configuration → Multi-file configs.

Strict keys. The source block rejects unknown top-level fields: a typo'd or misplaced key (e.g. safty:, or a kind-specific key written outside config:/tables:) fails the config load with an error rather than being silently ignored. The full machine-readable shape lives in the generated JSON Schema at [schemas/pawrly.schema.json](../schemas/pawrly.schema.json), which editors can use for completion and validation.

name

A valid SQL identifier (letter or _, then alphanumerics/_). It's the schema under which the source's tables are registered, so SELECT … FROM <name>.<table>. Names must be unique across the merged config.

kind

The kind selects the backend and the shape of config/tables. The list is closed (adding a kind is a code change). Matching is case-insensitive; the aliases below resolve to the same kind.

Kind Aliases Backend
file DataFusion native readers (local), or DuckDB object-store reads (with a storage: block)
http native HTTP table provider
mcp external MCP server's tools as tables (stdio or streamable HTTP)
sqlite read-only attach
postgres pg, postgresql DuckDB ATTACH (read-only)
mysql DuckDB ATTACH (read-only)
duckdb DuckDB ATTACH of a local .duckdb file (read-only)
snowflake DuckDB ATTACH (community extension)
iceberg DuckDB iceberg_scan
delta deltalake DuckDB delta_scan
ducklake DuckDB ATTACH 'ducklake:…'

The DuckDB-backed kinds load a DuckDB extension on first use (e.g. postgres, iceberg, ducklake, httpfs; Snowflake's is a community extension). The first registration in a fresh environment may need network access to fetch the extension.

description

Optional human-readable text. No effect on behavior; shown by pawrly source list.

wiki

Optional free-text usage notes aimed at the MCP/agent surface rather than execution: which filters to set, identifier quirks, how to decode a column. Declared on a source (applies to all its tables) and/or on individual table entries; describe_table returns both joined (source notes first). Richer than description, which stays a one-liner for listings.

sources:
  - name: gh
    kind: http
    wiki: |
      All endpoints need `owner` and `repo` filters. Timestamps are ISO-8601 UTC.
    tables:
      - name: pulls
        wiki: |
          `state` defaults to `open`; pass `state = 'all'` to include merged PRs.
        # …

examples

Optional list of SQL statements that must run successfully against this source. pawrly check executes them as live probes, so a broken endpoint or credential is caught at check time rather than first query. describe_table also returns the examples that mention the described table, giving agents known-good starting queries.

sources:
  - name: gh
    kind: http
    examples:
      - SELECT * FROM gh.pulls WHERE owner = 'pawrly' AND repo = 'pawrly' LIMIT 1
    # …

config

A per-kind mapping, opaque to the config layer and interpreted by the kind's builder (so each kind documents its own keys below). Strings here may use ${secret:NAME}, ${env:NAME}, and ${file:PATH} interpolation (see Configuration → Secrets).

tables

Explicit per-table declarations. Per-table fields are written flat — the kind-specific keys (path, format, endpoint, params, response, query, …) sit directly under the table entry, not under a nested config::

tables:
  - name: orders                 # required; the SQL table name
    description: Daily orders     # optional
    path: ./data/orders.parquet   # ← kind-specific fields, flat
    format: parquet
    cache:  { mode: ttl, ttl: 1h }  # optional; overrides the source-level cache
    safety: { max_rows: 100000 }    # optional; overrides the source-level safety

Only name, description, wiki, cache, and safety are common; everything else is kind-specific. Some kinds auto-discover tables when tables: is omitted (file globs, sqlite/postgres/mysql/duckdb/snowflake/ducklake enumerate). Others require tables: (iceberg, delta, and object-store file).

Whether a kind needs tables:, and whether it reads per-table fields at all:

Kind tables: Per-table fields read
file (local) optional (globs auto-discover) path, format, csv, json, schema, partition_cols
file (object store) required path/location, format
http required for typed tables full request/response spec (see the Http backend section below)
mcp optional (tools auto-expose) transport, then command/url + auth (see the MCP backend section)
sqlite optional (auto-enumerates) query (reshape/restrict)
postgres, mysql, duckdb, snowflake, ducklake optional (lazy catalog enumeration) none — entries are ignored; the live catalog is exposed as-is
iceberg, delta required path/location

For the attach-style catalog kinds (postgres/mysql/duckdb/snowflake/ducklake), tables are surfaced lazily straight from the remote catalog, so adding tables: entries does not restrict or rename them. Use the semantic layer if you need curated views over an attached database.

The cache block

Opt-in caching for a source (or an individual table). With no block, reads always go live.

cache:
  mode: ttl        # none | ttl | refresh | cron | append
  ttl: 10m
mode Extra field Behaviour
none No caching (default when cache: is absent).
ttl ttl: <dur> Serve the cached result until ttl elapses, then re-fetch on the next read.
refresh every: <dur> Always read the cache; a background loop re-fetches every every.
cron cron: "<expr>" Like refresh, scheduled by a cron expression.
append cursor_column: <col> Incremental: only rows newer than the cached cursor_column max are fetched on refresh.

Durations use humantime (30s, 10m, 1h). Storage location, namespacing, and cache-management commands are covered in Configuration → Caching.

The safety block

Guard rails enforced before a scan runs. All fields are optional and default to permissive.

safety:
  require_filters_on: [order_date]   # error unless a filter touches each of these columns
  require_at_least_one_filter: true  # refuse a full-table scan
  max_rows: 1000000                  # hard cap on returned rows
  max_pages: 50                      # cap on HTTP pagination calls
  timeout: 30s                       # per-query timeout
  required_predicates:               # predicates AND-ed into every scan
    - "tenant_id = ${param:tenant_id}"

required_predicates is most useful with the semantic layer, where ${param:NAME} placeholders are bound from query params as safe literals (row-level security). See Configuration → Safety.

raw_table / raw_table_safety

For kind: http only, raw_table: true registers an escape-hatch table named after the source for endpoints with no typed spec. You provide the request as filters; Pawrly returns the raw response as rows. Columns:

Column Type Notes
request_method varchar defaults to GET if not filtered
request_path varchar filter required (= or IN (…))
request_query varchar optional query string
response_status int HTTP status code
response_body varchar raw response body
SELECT response_status, response_body
FROM gh                                   -- the source itself is the raw table
WHERE request_path = '/rate_limit'

raw_table_safety overrides the default policy, which requires a filter on request_path (so a bare SELECT * can't fan out arbitrarily).

Unlike typed tables, the raw table is registered unqualified (in the default schema) under the source's own name, so it is queried as FROM <source>not FROM <source>.<table>. This means a source with raw_table: true reserves its bare name for the raw escape hatch; its typed tables still live at <source>.<table>.


Source Kinds

File Backend (file) — local files & object storage

The file backend serves columnar and row files. Local files use DataFusion's native readers; object storage (S3/GCS/Azure) is expressed with a storage: block and read through DuckDB. A file source needs either a top-level config.path glob or at least one tables: entry.

sources:
  - name: data
    kind: file
    config:
      path: ./data/*.csv          # glob; one table per file, named by file stem

Per-table fields are written flat under each tables: entry: path, format, csv, json, schema, partition_cols. The three big topics — formats, globs/partitioning, and object storage — follow.

File formats

format is one of parquet, csv, json. It's inferred from the file extension when omitted (.parquet → parquet; .csv → csv; .json / .jsonl / .ndjson → json). Specify it explicitly for extensionless paths or directories.

CSV — override the dialect with a csv: block (all optional):

Key Default Notes
header true First row is a header. Set false for headerless files (pair with an explicit schema).
delimiter , Single character. "\t" is accepted for tab.
quote " Single quote character.
    tables:
      - name: metrics
        path: ./data/metrics.tsv
        format: csv
        csv: { header: false, delimiter: "\t" }
        schema:                       # name + type the columns for a headerless file
          - { name: host,  type: varchar }
          - { name: value, type: bigint }

JSON/JSONL — files may be newline-delimited (NDJSON) or a single [ … ] array. The layout is auto-detected from the first non-whitespace byte; force it with a json: block:

      - name: facts
        path: ./data/facts.json
        format: json
        json: { format: array }       # array | ndjson | auto (default)

Explicit schema — a schema: list of { name, type } overrides inference (useful for headerless CSV or mis-inferred columns). Column type values (here and in partition_cols): bool/boolean, int/int32, bigint/int64, float/float32, double/float64, date, and varchar (the default for anything else).

File Partitions

A per-table path may be:

  • a single file./data/orders.parquet;
  • a glob./data/orders/*.parquet (all matches unioned into one table);
  • a directory./lake/events (every file beneath it, read as one table).

For partitioned datasets, declare partition_cols so the partition keys become queryable columns. This applies to all three formats (parquet, csv, json). Two styles, one per table:

Hive Partitionkey=value directories (e.g. events/dt=2026-05-31/region=us/*.parquet). The keys are exposed as columns and prune by directory (a filter on dt skips non-matching folders). Streams through the file reader.

      - name: events                  # events/dt=…/region=…/*.parquet
        path: ./lake/events
        format: parquet
        partition_cols:
          - { name: dt,     type: date }
          - { name: region, type: varchar }
SELECT * FROM data.events WHERE dt = '2026-05-31'   -- only that dt= directory is read

Segment — positional partitions for layouts that aren't key=value. Each column takes its value from the directory name at a zero-based index beneath the glob base. Segment-partitioned tables are materialized in memory, so they don't prune.

      - name: sessions                # projects/<project>/*.jsonl
        path: ./projects/*/*.jsonl
        format: json
        partition_cols:
          - { name: project, type: varchar, kind: segment, index: 0 }

Each partition_cols entry is { name, type (default varchar), kind: hive | segment (default hive), index (required for segment) }.

Object storage (S3 / GCS / Azure)

Add a storage: block to read from a bucket. storage.type selects the provider; storage.region and the bucket URLs are the location; credentials live under a typed storage.auth block. Object-store file sources require explicit tables:, each pointing at a remote URL.

sources:
  - name: lake
    kind: file
    config:
      storage:
        type: s3                      # s3 | gcs | azure
        region: us-east-1             # location, not a credential
        auth:
          type: access_key            # access_key | credential_chain
          access_key_id: ${secret:AWS_KEY_ID}
          secret_access_key: ${secret:AWS_SECRET}
          # endpoint: ${env:AWS_ENDPOINT}   # S3-compatible stores (MinIO, R2, …)
    tables:
      - name: events
        path: s3://my-bucket/events/*.parquet
        format: parquet               # parquet (default) | csv | json

auth.type selects the method (default access_key); each provider supports more than one:

type auth.type Fields
s3 access_key access_key_id, secret_access_key, session_token, endpoint, url_style
gcs access_key access_key_id, secret_access_key (HMAC keys)
azure access_key connection_string, account_name
http header / basic header/basic auth attached to HTTPS file reads
any credential_chain (alias chain) none — resolve from the ambient chain (env / instance profile / gcloud / az login); optional endpoint, account_name

With no auth block, the ambient credential chain is used. auth.type: chain is accepted as a shorthand alias for credential_chain.

In addition to s3/gcs/azure, storage.type: http covers authenticated HTTPS file URLs — combine it with a header or basic auth block to attach credentials to plain-HTTP reads (the custom/oauth2 HTTP-source auth styles do not apply here).

Scheme auto-routing. A file source is routed through DuckDB automatically whenever it sees a storage: block or a remote scheme on any path/locations3://, gs:///gcs://, az:///azure:///abfss://, or http(s)://. A public bucket or HTTPS file therefore works with no storage: block at all; you only need one to supply credentials, a region, or a custom endpoint.

Remote files are read by DuckDB's read_parquet/read_csv/read_json, so the local-file csv/json/partition_cols/schema options do not apply to object-store tables — DuckDB infers the schema and reader from the URL and format. Remote http(s):// paths also cannot be globbed; point each table at a single concrete URL (bucket globs like s3://…/*.parquet are fine).

Http Backend (http) — REST & GraphQL APIs

Turns an HTTP API into SQL tables: you declare each table's request and how to shape its JSON response into rows. Source-level config carries base_url (required), auth, static request headers, retries, and rate limiting; each tables: entry maps one request shape to rows.

sources:
  - name: gh
    kind: http
    config:
      base_url: https://api.github.com     # joined with each table's endpoint
      token: ${secret:GITHUB_TOKEN}
    raw_table: true
    tables:
      - name: pulls
        endpoint: /repos/{owner}/{repo}/pulls
        params:
          - { name: owner, required: true }
          - { name: repo,  required: true }
          - { name: state, required: false, default: open }
        response:
          path: $
          schema:
            - { name: number, type: bigint }
            - { name: title,  type: varchar }
            - { name: state,  type: varchar }
        pagination: { type: link_header }
SELECT number, title FROM gh.pulls
WHERE owner = 'CITGuru' AND repo = 'pawrly' AND state = 'open' LIMIT 20

From an OpenAPI spec (config.type: openapi)

Instead of hand-writing tables:, point an HTTP source at an OpenAPI 3.0.x spec and Pawrly synthesizes one table per GET operation at load time — endpoint, params, columns, rows path, and pagination are read from the document. Here base_url is the spec URL; the real API base comes from the spec's own servers.

sources:
  - name: stripe
    kind: http
    config:
      type: openapi
      base_url: https://raw.githubusercontent.com/stripe/openapi/refs/heads/master/latest/openapi.spec3.yaml
      auth:
        type: header
        headers:
          - name: Authorization
            bearer: ${secret:STRIPE_API_KEY}
      openapi:
        include: { paths: ["/v1/charges*", "/v1/customers*"] }   # optional; default = every GET
SELECT id, amount, currency, status FROM stripe.get_charges LIMIT 10

Only read-only GET operations are exposed; pagination is inferred generically (page, offset, cursor, and a last-row cursor). Where inference is uncertain (e.g. a polymorphic response) the column degrades to json and a diagnostic is logged.

Adjusting a synthesized table. A tables: entry whose name matches a synthesized table patches it — only the fields you set are merged in, the rest of the synthesis is kept; a name that matches nothing is a full new table definition. So fixing one field doesn't mean re-declaring the endpoint and every column:

tables:
  - name: get_charges
    response: { path: "$.data" }          # patch the rows-path; keep the synthesized columns
  - name: get_events
    pagination: null                        # drop the inferred pagination

Fields merge per key; arrays (response.schema, params) and type-tagged blocks (pagination) replace wholesale, and a null clears a field.

Top-level config (the same plumbing as hand-declared HTTP mode):

Key Required Description
type yes openapi — enables spec-driven synthesis. Absent/manual keeps the hand-declared behaviour.
base_url yes The spec location: an http(s):// URL or a file:// path. The API base comes from the spec's servers.
auth no Auth block (header / basic / custom / oauth2) — see Authentication.
token no Bearer shorthand: sent as Authorization: Bearer <token> (equivalent to a one-header auth).
headers no Static request headers attached to every call.
retry no { max_retries, base_backoff_ms, max_backoff_ms } — see Rate limiting & retries.
rate_limit no { requests_per_second, remaining_header, reset_header, extra_statuses }.

config.openapi (synthesis-specific):

Key Description
include { tags: [...], paths: [globs], operations: [...] } — only matching GETs become tables. A * glob matches path segments. Omit include to register every GET.
exclude Same shape as include; an operation matching exclude is dropped (wins over include).
naming How tables are named: operationId (default)
base_url Override for the effective request base, used only when the spec declares no usable servers[0].url.
cache { ttl: <duration> } — cache the fetched spec on disk and reuse it while fresh (e.g. 24h, 30m). Omit to re-fetch on every load. Applies to http(s):// specs only.

Source-level safety.max_pages caps the pagination loop for synthesized tables (they inherit no per-table safety). With openapi.cache set, the spec is stored under $PAWRLY_HOME/cache/openapi/ (default ~/.pawrly) keyed by URL; without it, the document is fetched on every load (or point base_url at a vendored file:// copy).

Authentication

Set source-level auth with the config.token shorthand or a full auth: block (the block wins if both are present). The block is tagged by typeheader, basic, custom, or oauth2:

header — attach one or more headers (bearer tokens and API keys live here). headers is a list; each entry gives a name plus exactly one of bearer (sent as Bearer <value>) or value (sent verbatim). Multiple entries cover APIs that need several auth headers at once (e.g. Datadog).

    config:
      base_url: https://api.example.com
      auth:
        type: header
        headers:
          - { name: Authorization, bearer: "${secret:GITHUB_TOKEN}" }   # → "Bearer …"
          - { name: X-Api-Key,     value: "${secret:API_KEY}" }         # literal

basic — base64-encodes username:password into Authorization: Basic ….

      auth:
        type: basic
        username: "${secret:API_USER}"
        password: "${secret:API_PASSWORD}"

custom — credentials carried outside headers. query is a list of { name, value } appended to every request as query-string params (the many ?api_key=… APIs); body is a list of { name, value } injected into the request body as a JSON object. When a table also declares its own JSON body, the body fields are merged on top of it; with no table body, they are sent as the whole JSON body. (Merging requires a JSON table body — a form body errors.)

      auth:
        type: custom
        query:
          - { name: api_key, value: "${secret:API_KEY}" }
        body:
          - { name: tenant, value: acme }

oauth2 — client-credentials grant: a token is fetched on first use, cached, re-fetched before expiry, then sent as Authorization: Bearer <token>. Fields: token_url, client_id, client_secret, optional scope, audience.

      auth:
        type: oauth2
        token_url:     https://login.example.com/oauth/token
        client_id:     ${secret:CLIENT_ID}
        client_secret: ${secret:CLIENT_SECRET}
        scope:         read:data       # optional

Shorthandconfig.token is the dead-simple single-bearer case, equivalent to a header block with one Authorization: Bearer <token> entry:

    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}

Gotcha. A malformed auth: block (wrong type, a missing required field) does not raise a config error — it falls back to no authentication, and requests go out unauthenticated. If an API starts returning 401/403, double-check the auth block's shape first.

Request headers

config.headers is a source-level string→string map applied to every request the source issues — typed tables and the raw table alike. Use it for the constant headers an API expects on all calls (a media type, an API-version pin) instead of repeating them in each table's headers:

    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}
      headers:
        Accept: application/vnd.github+json
        X-GitHub-Api-Version: '2022-11-28'

Source headers go on first; a table's own headers are merged on top and override on a key collision. An entry with an invalid header name or non-string value is skipped with a warning rather than failing the source. (Auth headers come from the auth block, not here.)

Request

Each table's request is built from these flat fields:

  • endpoint (required) — path appended to base_url. May carry a query string and {param} placeholders; a param whose name matches a {placeholder} fills the URL path. Remaining params become query parameters — except those consumed by the body template, which are sent in the body only, not duplicated onto the query string.
  • method — defaults to GET.
  • headers — a per-table map of extra request headers. Constant headers shared by every table (e.g. Accept, an API-version header) are better set once at the source level via config.headers (see below); per-table headers are merged on top and win on a key collision.
  • body — for POST/PUT/GraphQL: kind (json, the default, sets Content-Type: application/json; or form for application/x-www-form-urlencoded) and template (body text with {param} placeholders; other braces — JSON/GraphQL syntax — are left untouched). In a JSON body, an unbound optional param (no filter and no default) doesn't leave its placeholder behind: the JSON object member holding it is dropped, and any object it empties is dropped too. This is what makes inlined optional GraphQL filters work — filter: { team: { id: { eq: "{team_id}" } } } with team_id unfiltered collapses to no team filter (the variable defaults to null = match-all) instead of sending a literal "{team_id}" that matches nothing.
  • requests — conditional request shapes tried in order; the first whose when_filters are all bound replaces the default endpoint/method/body. Each entry is { when_filters: [...], endpoint, method?, body? }. The classic use is a get-by-id endpoint when an id filter is present, falling back to a list endpoint otherwise.
      - name: search
        endpoint: /graphql
        method: POST
        params:
          - { name: q, required: true }
        body:
          kind: json
          template: '{"query": "{ search(q: \"{q}\") { id name } }"}'
        response:
          path: $.data.search
          schema:
            - { name: id,   type: varchar }
            - { name: name, type: varchar }

Query parameters

params declares the columns a table accepts as filters. Each is { name, type (default varchar), required (default false), default, accepts, emit, explode, derive, filterable }:

  • required: true — the param must appear as a SQL filter, or the scan fails with a clear error (rather than fetching an unbounded result).
  • default — value used when the user doesn't filter on it.
  • filterable: true — let the param be used in a WHERE clause. Normally a param can only be filtered on if it's also a response.schema column; this exposes it as a column without declaring one. Use it for a filter the API accepts but doesn't return.
        params:
          - { name: status, filterable: true }   # WHERE status = 'active' → ?status=active
  • Equality pushes down by default: WHERE state = 'open'?state=open.
  • Comparisons — to push >= / <= etc., list them in accepts and map each to a query-parameter name in emit:
        params:
          - name: created
            accepts: [">=", "<="]
            emit: { ">=": since, "<=": until }   # WHERE created >= X → ?since=X
  • explodeexplode: true pushes a SQL IN (a, b, c) filter down as repeated query pairs (?key=a&key=b&key=c) instead of post-filtering. Equality still emits a single pair; a non-explode IN is filtered in-engine as before. Only honored for the query string (not path/body params).
  • derive — compute the param's value when the query doesn't supply one (a dynamic default), tagged by kind:
    • ago{ kind: ago, seconds: N } → epoch seconds now - N (a relative time window, e.g. "last hour").
    • split{ kind: split, from: <other param>, separator: "-", part: 0 } → a part (0-based) of another bound param's value split by separator. Useful to derive request fields from a composite filter (e.g. an issue key ENG-123 → team ENG + number 123). A derived param that feeds a body template stays out of the query string.
        params:
          - { name: from,   type: bigint, derive: { kind: ago, seconds: 3600 } }
          - { name: status, explode: true }              # WHERE status IN ('open','closed') → ?status=open&status=closed

A param can also be surfaced as an output column with source: param on a response.schema entry (see below).

Response

response describes how to turn the JSON payload into rows:

  • path — JSONPath to the array of rows. $ (the default) means the body is the array; $.data digs into a wrapper object.
  • reshape — turn a payload that isn't a flat array at path into rows before column extraction. See Reshaping the response.
  • schema — the columns to extract per row, each { name, type, source?, expr? }:
    • typevarchar/string/text, bigint/int64, int/int32, double, float, bool/boolean, date, timestamp, timestamptz (ISO-8601 / RFC 3339 strings are parsed), and json (a nested object/array kept as raw JSON text).
    • source — defaults to the row's top-level field of the same name. Set $.nested.field to read a different path, $ to capture the whole row element (usually into a json column), or param to inject a request parameter as a column.
    • expr — a computed expression evaluated per row, for columns a single path can't express. Takes precedence over source when set. See Computed columns.
  • allow_404_empty — treat a 404 as an empty result set instead of an error.
  • error — surface API failures as a clear scan error: status (a list of codes or matchers like ">=400", "5xx", "<500") and/or path (a JSONPath to an error message inside a 200-with-error body).
        response:
          path: $.data
          allow_404_empty: true
          schema:
            - { name: id,       type: bigint }
            - { name: author,   type: varchar, source: $.user.login }
            - { name: payload,  type: json,    source: $ }
            - { name: repo,     type: varchar, source: param }
          error:
            status: [">=400"]
            path: $.message

Computed columns

When a column needs more than a single JSONPath, give it an expr instead of a source. An expr is a small tree, each node tagged by kind, evaluated against each row (plus the bound request params). A missing path, a shape mismatch, or a failed transform yields NULL — never a scan error.

kind Fields Result
path path: [a, b] Walk object keys a.b from the row.
coalesce exprs: [...] First non-null sub-expression.
literal value A constant value.
replace expr, from, to Find/replace on the string produced by expr.
current_row The whole row element (usually into a json column).
null Always null.
from_filter filter Inject a bound request param's value.
join path Join the array at path into "a,b,c" (objects as compact JSON).
map_join path, item_path Take item_path of each array element, then join.
first_of path, item_path item_path of the first array element (keeps its type).
lookup path, key, key_field?, value_field? In [{key, value}], the value_field whose key_field equals key. The fields default to key/value.
lookup_join path, key, key_field?, value_field? Like lookup but joins every matching value.
pick path, by, item_path Index the object at path by the bound param by, then take item_path.
to_timestamp unit (seconds/millis), expr Convert the epoch number from expr to an RFC 3339 string (a string passes through).
from_base64 expr Base64-decode the string from expr into UTF-8 text.
if_present check, then_value then_value when check is non-null, else null.
        schema:
          # title from `attributes.title`, falling back to `title`
          - name: title
            type: varchar
            expr:
              kind: coalesce
              exprs:
                - { kind: path, path: [attributes, title] }
                - { kind: path, path: [title] }
          # array of label objects -> "bug,p1"
          - { name: labels, type: varchar,   expr: { kind: map_join, path: [labels], item_path: [name] } }
          # epoch seconds -> a real timestamp
          - { name: created, type: timestamp, expr: { kind: to_timestamp, unit: seconds, expr: { kind: path, path: [created_at] } } }

Reshaping the response

response.path normally points at an array of rows. response.reshape turns the value at path into rows when it isn't one — applied before column extraction. Tagged by kind:

  • dict_entries — the value at path is an object/map; each entry becomes a row: the entry value with its key added as _key (or {_key, _value} when the value isn't an object). Read the key with source: $._key.
      - name: calendar_colors
        endpoint: /colors
        response:
          path: $.calendar
          reshape: { kind: dict_entries }
          schema:
            - { name: id,         type: varchar, source: $._key }
            - { name: background, type: varchar }
  • series_points — flatten { <series>: [ { …, <points>: [[t, v], …] } ] } into one row per point: each series object (minus its points field) plus the point's two values, written under the timestamp and value column names. Fields: series, points, timestamp, value.
      - name: metrics
        endpoint: /api/v1/query
        response:
          path: $
          reshape: { kind: series_points, series: series, points: pointlist, timestamp: timestamp, value: value }
          schema:
            - { name: metric,    type: varchar }   # carried from each series object
            - { name: timestamp, type: bigint }    # point[0]
            - { name: value,     type: double }    # point[1]

Pagination

Set pagination to keep fetching pages; absent means a single request. A SQL LIMIT stops pagination early once enough rows are collected, and safety.max_pages caps the loop. The strategy is tagged by type:

type Fields Behaviour
link_header Follows the RFC 5988 Link: header's rel="next" URL until absent.
cursor next_path, param Reads an opaque cursor from the body at next_path ($.a.b) and echoes it back as the param query parameter; stops when the cursor is absent/empty.
body_cursor cursor_path, next_path Reads the next cursor from the body at next_path and writes it into the next request's JSON body at cursor_path (GraphQL variables.after, Notion start_cursor); stops when the cursor is absent/empty or a page is empty.
row_cursor param, field (default id), more_path? Sends param = the last row's field (e.g. starting_after=<last id>); stops when more_path (a $.a.b boolean) is false, else on an empty page.
page param, start (default 1), size_param?, size? Increments the page number in param until a page returns zero rows; optionally sends a page size via size_param/size.
offset param, size_param, size Increments param by size each page until a short page (fewer than size rows).
pagination: { type: cursor, next_path: $.response_metadata.next_cursor, param: cursor }

Rate limiting & retries

rate_limit keeps requests within the API's quota:

Field Notes
requests_per_second Local token-bucket ceiling shared across scans. Omit/zero to disable.
remaining_header Response header carrying remaining quota (e.g. x-ratelimit-remaining); when it reads 0, the next request waits until the reset time.
reset_header Response header carrying the reset time as an epoch-seconds timestamp.
extra_statuses Status codes besides 429/503 also treated as rate-limit signals (e.g. GitHub's secondary-limit 403).

retry governs transient failures (transport errors, 5xx, 429, 503, and any extra_statuses): max_retries (default 3), base_backoff_ms (default 200, doubles each attempt), max_backoff_ms (default 5000). Backoff honours a Retry-After header when present.

    config:
      base_url: https://api.github.com
      rate_limit:
        remaining_header: x-ratelimit-remaining
        reset_header:     x-ratelimit-reset
        extra_statuses:   [403]            # GitHub secondary limit
      retry:
        max_retries: 5

A runnable cache-over-API walkthrough lives at examples/cache-http/.

MCP Backend (mcp) — an MCP server's tools as tables

Connects to an external Model Context Protocol server and exposes its tools as SQL tables: a SELECT runs tools/call, pushed-down WHERE filters become tool arguments, and the result rows are projected into columns. Connect over stdio (a local subprocess) or streamable HTTP (a remote server).

sources:
  - name: linear
    kind: mcp
    config:
      transport: streamable_http
      url: https://mcp.linear.app/mcp
      auth:
        type: header
        headers:
          - name: Authorization
            bearer: ${secret:LINEAR_API_KEY}

  - name: github
    kind: mcp
    config:
      transport: stdio
      command: ["npx", "-y", "@modelcontextprotocol/server-github"]
      env:
        GITHUB_TOKEN: ${secret:GITHUB_TOKEN}
SELECT id, title, status FROM linear.list_issues WHERE assignee = 'me@example.com' LIMIT 20

Two ways to get tables, one dial. A source produces tables from introspection (tools/list) and from declaration (tables:). config.expose sets how much introspection auto-exposes:

expose auto-exposed use
read_only (default) tools with annotations.readOnlyHint == true zero-config, safe
all every non-destructive tool you accept a SELECT may call any read tool
listed none — only tables: / include: fully declarative

include/exclude (by tool name) narrow whatever expose admits; a destructiveHint tool is never auto-exposed.

Output columns. A tool result is exposed as a single result json column unless the tool declares an outputSchema (columns inferred) or a tables: entry declares them. Each tables: entry patches a synthesized table of the same name or defines a new one (tool: + knobs):

field meaning
tool the MCP tool to call (defaults to the table name on a patch).
columns { name, type, path: [keys] } — pull a (possibly nested) field out of each row element; empty path is the whole element as JSON.
tool_args static arguments always sent.
filters bind a SQL filter to a differently-named argument.
limit_binding { tool_arg, max } — push SQL LIMIT into a tool argument.
pagination { cursor_arg, response_cursor_path } — cursor pagination (default: follow nextCursor).

Security. A streamable_http url must be https (or http only for loopback), and may not embed credentials in the URL — use the auth block (header/bearer/basic, same shapes as the HTTP backend) with ${secret:…} tokens. Validated at config-load time.

A runnable, per-source-per-file example lives at examples/mcp.yaml.

Databases and Lakehouse Formats

sqlite — local SQLite databases

Attaches a SQLite file read-only and exposes its tables; equality filters push down. When tables: is omitted, every user table is auto-registered. sqlite is the one attach-style kind whose tables: entries are honored — supply a query to restrict or reshape a table.

sources:
  - name: app
    kind: sqlite
    config:
      path: ./app.db
    # tables:                       # optional: restrict / reshape
    #   - name: active_users
    #     query: SELECT id, email FROM users WHERE active = 1
config key Required Notes
path yes Path to the .db file (:memory: allowed).
Per-table key Notes
query Optional SQL backing the table; defaults to SELECT * FROM "<name>".

postgres, mysql — foreign databases

DuckDB ATTACHes the database read-only and exposes its tables lazily (<source>.<table>); equality predicates, projection, and limits push down. No tables: needed.

Key Notes
dsn Full connection string. If present, used as-is.
host Required if no dsn.
database / dbname Database name (either spelling).
port Optional; accepts an integer or a string.
user, password Optional.
sources:
  - name: oltp
    kind: postgres            # aliases: pg, postgresql
    config:
      host: db.internal
      database: app
      user: readonly
      password: ${secret:PG_PASSWORD}

duckdb — local DuckDB database file

Attaches a .duckdb database file read-only and exposes its tables lazily.

sources:
  - name: local_db
    kind: duckdb
    config:
      path: ./analytics.duckdb
config key Required Notes
path yes Path to the .duckdb file; resolved against the config dir.

snowflake

DuckDB ATTACH via the Snowflake community extension (installed on first use). Requires account, user, password; optional database, schema, warehouse, role.

sources:
  - name: warehouse
    kind: snowflake
    config:
      account: acme.us-east-1
      user: ${secret:SNOWFLAKE_USER}
      password: ${secret:SNOWFLAKE_PASSWORD}
      database: ANALYTICS
      schema: PUBLIC

iceberg, delta — table formats

Each declared table maps to a DuckDB scan function over a table location. tables: is required, each with a path (or location).

sources:
  - name: lake
    kind: iceberg                 # or: delta (alias deltalake)
    tables:
      - name: orders
        path: s3://bucket/warehouse/orders

For tables on an object store, provide credentials with a storage: block (same keys as object-store file); the httpfs extension loads automatically.

Per-table key Required Notes
path / location yes Table location passed to the DuckDB scan function.

ducklake — DuckLake lakehouse catalog

Attaches a DuckLake catalog (a metadata database plus a data path) and exposes its tables lazily.

sources:
  - name: lake
    kind: ducklake
    config:
      catalog: ./metadata.ducklake     # sqlite/duckdb/postgres catalog
      data_path: ./lake_data            # local or s3://… (+ optional storage block)
config key Required Notes
catalog yes The metadata catalog (sqlite/duckdb/postgres).
data_path no Where data files live; local or remote (s3://…). Resolved vs the config dir.
storage no Object-store credentials for a remote data_path (loads httpfs).

Column type reference

Several places take a column type — the file backend's schema: and partition_cols, and the http backend's response.schema. The accepted spellings (case-insensitive) and the contexts that support them:

Canonical type Accepted spellings file schema / partitions http response
boolean bool, boolean
int32 int, int32
int64 bigint, int64, long
float32 float, float32
float64 double, float64
date date
timestamp timestamp
timestamptz timestamptz ✓ (RFC 3339)
varchar varchar, string, text ✓ (default)
json json ✓ (raw JSON text)

Any unrecognized type spelling falls back to varchar. For http, timestamp/timestamptz parse ISO-8601 / RFC 3339 strings, and json keeps a nested object/array as raw JSON text (pair it with source: $ to capture a whole row element) — see Querying JSON columns to read into it.


Querying JSON columns

A json column is raw JSON text — an object, array, or scalar kept verbatim so nested structure survives the scan. Two function families read into it; they share no names, so mix them freely.

Pawrly JSON Helpers

Function Returns Purpose
from_json(text) list<varchar> Parse a JSON array into its element texts, so unnest(from_json(col)) explodes it into one row per element. A non-array value yields a single-element list.
json_extract_string(text, path) varchar Read a field by key, or a dotted path (a.b.c) for nested objects. Strings come back unquoted; numbers/bools/objects as their JSON text; a missing key (or JSON null) is NULL.

DataFusion JSON Helpers

Function / Operator Returns Purpose
json_get_str(col, ...) varchar Read a value at the given key/index path as text.
json_get_int(col, ...) bigint Read a value at the path as an integer.
json_get_float(col, ...) double Read a value at the path as a float.
json_get_bool(col, ...) boolean Read a value at the path as a boolean.
json_get_json(col, ...) varchar Read a value at the path back as raw JSON text (objects, arrays, or scalars).
json_get_array(col, ...) list Read a value at the path as an array.
json_length(col, ...) bigint Length of the array or object at the path (or of col itself).
json_contains(col, ...) boolean Whether a value exists at the given key/index path.
json_object_keys(col) list<varchar> The keys of the JSON object.
col -> key varchar Get the value at key as raw JSON (alias for json_get).
col ->> key varchar Get the value at key as text (alias for json_get_str).
col ? key boolean Whether key is present (alias for json_contains).

Exploding an array of objects

unnest must sit in a projection, not a correlated lateral join, so explode the array in a subquery (a CTE) and read its elements in the outer query:

WITH elems AS (
  SELECT unnest(from_json(payload)) AS e   -- payload is a JSON array of objects
  FROM t
)
SELECT json_extract_string(e, 'code')                   AS code,
       CAST(json_extract_string(e, 'amount') AS DOUBLE)  AS amount
FROM elems
WHERE json_extract_string(e, 'code') = 'USD'

json_length(payload) gives the array length without unnesting.

Gotchas

  • unnest is projection-only.** FROM t, unnest(from_json(t.payload)) (a correlated lateral join) is rejected by the planner — use the CTE form above.
  • Operator precedence. ->/->> bind looser than IN and comparisons, so parenthesize: (e ->> 'code') IN ('USD', 'EUR').
  • Typed getters match the JSON type. json_get_float(e, 'amount') returns NULL when the value is a JSON string ("12.50") rather than a number — use json_get_str or ->>, then CAST.

Federation

Every source is a table in one DataFusion plan, so you can join across kinds in a single statement — a local Parquet file against a Postgres table against an HTTP query — with no import step:

SELECT u.email, COUNT(p.number) AS open_prs
FROM oltp.users u
JOIN gh.pulls p ON p.user = u.github_login
WHERE p.owner = 'pawrly' AND p.repo = 'pawrly' AND p.state = 'open'
GROUP BY u.email

See examples/pawrly.yaml for a kitchen-sink configuration covering every kind.