pawrly

Configuration

A workspace is described by a single YAML file, pawrly.yaml. The directory holding that file is the workspace: relative paths in the config (include:, from:, file-source paths) resolve against it.

Pawrly discovers the manifest in this order — first hit wins, nothing merges:

  1. --config <path>
  2. the PAWRLY_CONFIG environment variable
  3. ./pawrly.yaml in the current directory
  4. $PAWRLY_HOME/pawrly.yaml — the default workspace

PAWRLY_HOME (or --home) is Pawrly's data directory, ~/.pawrly by default. Besides the default-workspace manifest it holds the cache root (cache/, see Storage location) and the daemon socket (sockets/pawrly.sock). pawrly serve follows the same discovery, so a daemon started with no --config serves the default workspace.

A JSON Schema for the file ships in schemas/pawrly.schema.json. Wire it into your editor for completion and inline validation:

# yaml-language-server: $schema=./schemas/pawrly.schema.json

Run pawrly validate to check a config without executing anything.

Top-level shape

version: 1                 # required; only `1` is supported
name: my-workspace         # optional label

defaults:                  # optional; workspace-wide defaults
  # ...

secrets:                   # optional; where secret references resolve from
  - kind: env
  - kind: file
    path: ~/.pawrly/secrets.yaml
  - kind: keyring
    service: pawrly

include:                   # optional; splice sources from other files (see "Multi-file configs")
  - ./sources/*.yaml

sources:                   # the data sources
  - <source>

semantic:                  # optional; business models over the sources
  models:
    - <model>              # see docs/semantic.md

observability:             # optional; logging, OpenTelemetry export, activity log
  # ...                    # see "Observability"

Only version and sources are needed for a useful config; everything else is optional.

Sources

Each entry under sources: declares one source. The common shape:

sources:
  - name: data                 # required; used as the schema prefix in SQL (data.orders)
    kind: file                 # required; see docs/sources.md for kinds
    description: Local fixtures # optional
    config:                    # kind-specific settings
      path: ./data/*.csv
    tables:                    # optional; explicit per-table definitions
      - name: orders
        path: ./data/orders.csv
        format: csv
    cache: <cache>             # optional; per-source caching (see below)
    safety: <safety>           # optional; per-source guard rails (see below)

A source's name becomes the schema in SQL: a table orders under source data is queried as data.orders. Per-kind options live under config: (or per-table fields); see Sources for each kind.

Secrets

Reference secrets in any string with ${secret:NAME}; they resolve at load time from the backends listed under secrets:, tried in order (first hit wins):

secrets:
  - kind: env                              # environment variables
  - kind: file                             # a YAML or dotenv (.env) file
    path: ~/.pawrly/secrets.yaml
    format: auto                           # auto (by extension) | yaml | dotenv
  - kind: keyring                          # the OS keychain
    service: pawrly

sources:
  - name: gh
    kind: http
    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}

Backends:

  • env — process environment variables.
  • file — a file of KEY: value (YAML) or KEY=value (dotenv) pairs. format defaults to auto, which picks dotenv for a .env extension/name and YAML otherwise. A relative path resolves against the config file's directory. The file must not be world-readable (mode 0600 on Unix) or it is rejected.
  • keyring — the OS keychain under service (default pawrly).
  • auto — convenience chain: env, then keyring, then a .env file in the config directory if present. A missing or insecure .env is skipped with a warning, never fatal.

When secrets: is omitted entirely, the chain defaults to a single auto backend — so a .env beside your config is picked up automatically. Pawrly does not otherwise load .env into the process environment; it is read only through the file/auto backends.

Two more interpolations work in any config string: ${env:NAME} splices a plain environment variable (independent of the chain), and ${file:PATH} inlines a file's trimmed contents (~ expands to $HOME).

Caching

Caching is opt-in per table (or per source). Add a cache: block; with no block, reads always go live.

cache:
  mode: ttl        # none | ttl | refresh | cron
  ttl: 1h          # for mode: ttl

Modes:

Mode Behaviour
none No caching (the default when cache: is absent).
ttl Cache the result; serve it until ttl elapses, then re-fetch on the next read.
refresh Keep the cache warm with a background loop on a fixed interval (every: 1h).
cron Like refresh, but the schedule is a cron expression (cron: "0 * * * *").

Storage location

The cache writes Parquet plus a JSON manifest under defaults.cache.storage. When unset, the root derives from the Pawrly home as $PAWRLY_HOME/cache (default ~/.pawrly/cache) — anchored at the home, not the current directory or the workspace, regardless of where you run pawrly from. An explicit value may use a leading ~, which expands to $HOME.

defaults:
  cache:
    storage: ~/.pawrly/cache   # optional; default $PAWRLY_HOME/cache
    namespace: my-project      # optional; isolates this workspace (see below)

Because storage is shared across every workspace on the machine, Pawrly inserts a namespace segment so two workspaces that each define, say, a data.orders table don't collide on the same path:

<storage>/<namespace>/data/<source>/<table>/part-000000.parquet
  • Default (automatic). With namespace unset, it's derived as <dirname>-<hash> from the workspace's canonical absolute path. The same workspace always maps to the same directory; distinct workspaces never collide. Moving or renaming the workspace directory changes the id, so its cache starts cold. The default workspace (the manifest at $PAWRLY_HOME/pawrly.yaml) gets the literal namespace default instead.
  • Override. Set defaults.cache.namespace to pin a stable namespace (e.g. so the cache survives moving the directory) or to deliberately share one cache across workspaces by giving them the same value. A blank value falls back to the derived id.

The cache is restart-safe and concurrency-safe. Inspect and manage it with the pawrly cache commands.

Safety

A safety: block sets guard rails that are enforced before a scan runs:

safety:
  require_filters_on: [order_date]   # error unless a filter touches these columns
  require_at_least_one_filter: true  # refuse a full-table scan
  max_rows: 1000000                  # cap returned rows
  max_pages: 50                      # cap HTTP pagination
  timeout: 30s                       # per-query timeout
  required_predicates:               # predicates AND-ed into every scan (see semantic RLS)
    - "tenant_id = ${param:tenant_id}"

required_predicates is most useful with the semantic layer, where ${param:NAME} placeholders are bound from a query's params as safe literals for row-level security.

Multi-file configs

As a workspace grows, split pawrly.yaml so sources can live in their own files. Both primitives resolve paths relative to the declaring file and are assembled before validation, so the rest of the pipeline sees one merged tree.

  • include: (top-level) — splices other files into this one. Globs are allowed and sorted lexicographically. Each included file may be either form:

    • a fragment — a mapping carrying a sources: (and optional secrets:) list; or
    • a bare single source — the SourceDef itself, with name/kind/config at the top level and no sources: wrapper (recognised by the top-level kind:). Handy for one-source-per-file layouts behind a glob.
    include:
      - ./sources/*.yaml       # may hold fragments and/or bare single sources
      - ./team-sources.yaml

    Either form may also carry a top-level models: list — the semantic models defined over its sources — which is spliced into semantic.models. This lets one file fully describe an integration (its source and its models). Models still merge into the one global semantic layer, so a co-located model may relate to a model declared in another file, and duplicate model names across files are rejected with both filenames.

    # sources/github.yaml — a bare single source plus the models over it
    name: gh
    kind: http
    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}
    raw_table: true
    models:
      - name: gh_issues
        source: gh.issues
        dimensions: [{ name: state, expr: state, type: string }]
        measures:  [{ name: issue_count, agg: count, expr: id }]

    (include:d models: is the file-level co-location convenience; for model-only files that aren't tied to one source, use semantic.include: below.)

  • from: (on a source) — loads one source's body from a sibling file:

    sources:
      - name: warehouse
        from: ./sources/warehouse.yaml
  • semantic.include: — splices the models of other files into semantic.models. Each referenced file contains only models — either a top-level models: list or a bare sequence of model mappings — never sources, secrets, or other config. Globs are allowed and sorted lexicographically; inline models: come first, then the included files. Duplicate model names (within or across files) are rejected with both filenames.

    semantic:
      include:
        - ./models/*.yaml      # each file holds only models
      models:
        - name: inline_model   # optional inline models still allowed
          # ...
    # models/orders.yaml — a model-only file
    models:
      - name: orders
        source: data.orders
        dimensions: [{ name: status, expr: status, type: string }]
        measures:  [{ name: revenue, agg: sum, expr: total_amount }]

Includes can chain; from: is not transitive. Cycles are detected and reported. Model source: references and relationships are validated against the merged config, so a model in one file may reference a source declared in another. pawrly config show --tree prints the assembled tree, and pawrly source list annotates each source with the file it came from. A runnable layout lives under examples/multi-file/.

Observability

observability: is optional. Absent, Pawrly logs to stderr as before and exports nothing. The block has three parts; CLI flags (--log-level, --log-format, --otel-endpoint, --otel-protocol, --prometheus-listen) override the matching settings.

observability:
  tracing:
    level: info            # EnvFilter directive; RUST_LOG still wins
    format: text           # text | json
  otel:
    enabled: false         # master switch for OTLP export
    endpoint: http://localhost:4317
    protocol: grpc         # grpc | http
    service_name: pawrly
    traces: true
    metrics: true
    logs: true             # bridge tracing events to OTel logs
    sample_ratio: 1.0      # parent-based ratio sampler
    prometheus:
      enabled: false       # serve a /metrics pull endpoint (independent of OTLP push)
      listen: 127.0.0.1:9090
  activity:
    enabled: false         # master switch for the activity log
    sinks: [tracing]       # any of: tracing, table
    redact_sql: false      # false | literals | true
    ring_capacity: 10000   # in-memory rows kept for the `table` sink
    store: ~/.pawrly/activity  # persist to Parquet; omit for in-memory only
    partition_hours: 4         # hr= partition width
    flush_threshold: 1000      # records buffered before a file is written
    flush_interval: 60s        # or this, whichever first
    retention: 30d             # prune files older than this; omit to keep all
  • Traces & logs are emitted as tracing spans/events and, when otel.enabled, exported over OTLP. W3C traceparent is propagated across the gRPC and MCP boundaries, so a CLI→daemon request is a single trace.

  • Metrics (query/cache/source counters and histograms) export over OTLP push when otel.metrics is on, and/or a Prometheus pull endpoint when otel.prometheus.enabled is on.

  • Activity log records one row per operation. The tracing sink emits a structured event; the table sink exposes the recent rows as the system.activity SQL table:

    SELECT interface, status, count(*), avg(duration_ms)
    FROM system.activity
    WHERE at > now() - INTERVAL '1 hour'
    GROUP BY 1, 2;

    redact_sql controls SQL capture: false stores it verbatim, literals replaces literal values with $REDACTED (keeping shape), and true stores only the statement kind and tables. Parameter values are never stored.

    Without store, system.activity is an in-memory ring of the most recent ring_capacity rows, lost on restart. Set store to persist records as date/hour-partitioned Parquet (dt=YYYY-MM-DD/hr=HH/…); the table then unions the on-disk history with the not-yet-flushed buffer, so it survives restarts. retention prunes old files. A runnable config lives at examples/observability.yaml.

Defaults

defaults: sets workspace-wide values inherited by sources that don't override them — for example HTTP client settings and baseline safety caps. See the worked configurations in examples/pawrly.yaml for the full set of options in context.