A workspace is described by a single YAML file, pawrly.yaml. The directory holding that file is the workspace: relative paths in the config (include:, from:, file-source paths) resolve against it.
Pawrly discovers the manifest in this order; the first hit wins and nothing merges:
1. --config <path>
2. the PAWRLY_CONFIG environment variable
3. ./pawrly.yaml in the current directory
4. $PAWRLY_HOME/pawrly.yaml, the default workspace
PAWRLY_HOME (or --home) is Pawrly's data directory, ~/.pawrly by default. Besides the default-workspace manifest it holds the cache root (cache/, see Storage location) and the daemon socket (sockets/pawrly.sock). pawrly serve follows the same discovery, so a daemon started with no --config serves the default workspace.
A JSON Schema for the file ships in schemas/pawrly.schema.json. Wire it into your editor for completion and inline validation:
# yaml-language-server: $schema=./schemas/pawrly.schema.json
Run pawrly validate to check a config without executing anything.
Top-level shape
version: 1 # required; only `1` is supported
name: my-workspace # optional label
defaults: # optional; workspace-wide defaults
  # ...
secrets: # optional; where secret references resolve from
  - kind: env
  - kind: file
    path: ~/.pawrly/secrets.yaml
  - kind: keyring
    service: pawrly
include: # optional; splice sources from other files (see "Multi-file configs")
  - ./sources/*.yaml
sources: # the data sources
  - <source>
semantic: # optional; business models over the sources
  models:
    - <model> # see docs/semantic.md
observability: # optional; logging, OpenTelemetry export, activity log
  # ... # see "Observability"
Only version and sources are needed for a useful config; everything else is optional.
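As a minimal sketch of that last point, a config carrying only the two needed keys could look like the following. The source name, kind, and path are borrowed from the examples elsewhere on this page and are illustrative, not a recommended layout.

```yaml
# Minimal pawrly.yaml: only version and sources are set.
version: 1
sources:
  - name: data            # becomes the schema prefix in SQL (data.orders)
    kind: file
    config:
      path: ./data/*.csv  # relative paths resolve against the workspace directory
```

Running pawrly validate against a file like this checks it without executing anything.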
Sources
Each entry under sources: declares one source. The common shape:
sources:
  - name: data # required; used as the schema prefix in SQL (data.orders)
    kind: file # required; see docs/sources.md for kinds
    description: Local fixtures # optional
    config: # kind-specific settings
      path: ./data/*.csv
    tables: # optional; explicit per-table definitions
      - name: orders
        path: ./data/orders.csv
        format: csv
    cache: <cache> # optional; per-source caching (see below)
    safety: <safety> # optional; per-source guard rails (see below)
A source's name becomes the schema in SQL: a table orders under source data is queried as data.orders. Per-kind options live under config: (or per-table fields); see Sources for each kind.
Secrets
Reference secrets in any string with ${secret:NAME}; they resolve at load time from the backends listed under secrets:, tried in order (first hit wins):
secrets:
  - kind: env # environment variables
  - kind: file # a YAML or dotenv (.env) file
    path: ~/.pawrly/secrets.yaml
    format: auto # auto (by extension) | yaml | dotenv
  - kind: keyring # the OS keychain
    service: pawrly
sources:
  - name: gh
    kind: http
    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}
Backends:
- env: process environment variables.
- file: a file of KEY: value (YAML) or KEY=value (dotenv) pairs. format defaults to auto, which picks dotenv for a .env extension/name and YAML otherwise. A relative path resolves against the config file's directory. The file must not be world-readable (mode 0600 on Unix) or it is rejected.
- keyring: the OS keychain under service (default pawrly).
- auto: a convenience chain of env, then keyring, then a .env file in the config directory if present. A missing or insecure .env is skipped with a warning, never fatal.
When secrets: is omitted entirely, the chain defaults to a single auto backend — so a .env beside your config is picked up automatically. Pawrly does not otherwise load .env into the process environment; it is read only through the file/auto backends.
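For the file backend in YAML form, the secrets file is a flat mapping of names to values. A sketch, assuming the path used in the examples above; the value shown is an obviously fake placeholder, not a real credential.

```yaml
# ~/.pawrly/secrets.yaml
# Must not be world-readable (mode 0600 on Unix) or the backend rejects it.
GITHUB_TOKEN: placeholder-not-a-real-token
```

With this file in place, ${secret:GITHUB_TOKEN} resolves from it whenever no earlier backend in the chain supplies that name.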
Two more interpolations work in any config string: ${env:NAME} splices a plain environment variable (independent of the chain), and ${file:PATH} inlines a file's trimmed contents (~ expands to $HOME).
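The two interpolations might be combined in a source like this. It is only a sketch: the environment variable name GITHUB_API_URL and the token file path are hypothetical, chosen for illustration.

```yaml
sources:
  - name: gh
    kind: http
    config:
      # ${env:...} reads a plain environment variable, bypassing the secrets chain.
      base_url: ${env:GITHUB_API_URL}      # hypothetical variable name
      # ${file:...} inlines the file's trimmed contents; ~ expands to $HOME.
      token: ${file:~/.config/gh/token}    # hypothetical path
```

Prefer ${secret:NAME} when the value should go through the ordered secrets: backends instead.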
Caching
Caching is opt-in per table (or per source). Add a cache: block; with no block, reads always go live.
cache:
  mode: ttl # none | ttl | refresh | cron
  ttl: 1h # for mode: ttl
Modes:
| Mode | Behaviour |
|---|---|
| none | No caching (the default when cache: is absent). |
| ttl | Cache the result; serve it until ttl elapses, then re-fetch on the next read. |
| refresh | Keep the cache warm with a background loop on a fixed interval (every: 1h). |
| cron | Like refresh, but the schedule is a cron expression (cron: "0 * * * *"). |
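To make the two scheduled modes concrete, here is a sketch of per-source cache: blocks using only the keys named in the table. The sources themselves are the illustrative ones used elsewhere on this page.

```yaml
sources:
  - name: data
    kind: file
    config:
      path: ./data/*.csv
    cache:
      mode: refresh
      every: 1h           # background re-fetch on a fixed one-hour interval
  - name: gh
    kind: http
    config:
      base_url: https://api.github.com
    cache:
      mode: cron
      cron: "0 * * * *"   # in standard five-field cron syntax: minute 0 of every hour
```

Both settings refresh hourly. The fixed interval presumably counts from when the loop starts, while the cron schedule is pinned to the top of the hour.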
Storage location
The cache writes Parquet plus a JSON manifest under defaults.cache.storage. When unset, the root derives from the Pawrly home as $PAWRLY_HOME/cache (default ~/.pawrly/cache) — anchored at the home, not the current directory or the workspace, regardless of where you run pawrly from. An explicit value may use a leading ~, which expands to $HOME.
defaults:
  cache:
    storage: ~/.pawrly/cache # optional; default $PAWRLY_HOME/cache
    namespace: my-project # optional; isolates this workspace (see below)
Because storage is shared across every workspace on the machine, Pawrly inserts a namespace segment so two workspaces that each define, say, a data.orders table don't collide on the same path:
<storage>/<namespace>/data/<source>/<table>/part-000000.parquet
- Default (automatic). With namespace unset, it's derived as <dirname>-<hash> from the workspace's canonical absolute path. The same workspace always maps to the same directory; distinct workspaces never collide. Moving or renaming the workspace directory changes the id, so its cache starts cold. The default workspace (the manifest at $PAWRLY_HOME/pawrly.yaml) gets the literal namespace default instead.
- Override. Set defaults.cache.namespace to pin a stable namespace (e.g. so the cache survives moving the directory) or to deliberately share one cache across workspaces by giving them the same value. A blank value falls back to the derived id.
The cache is restart-safe and concurrency-safe. Inspect and manage it with the pawrly cache commands.
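As an illustration of the override case, two workspaces that should share one cache could each carry the same block. The namespace value shared-analytics is hypothetical, and the resulting path is simply the layout above with the values filled in.

```yaml
# Placed in the pawrly.yaml of both workspace A and workspace B.
defaults:
  cache:
    storage: ~/.pawrly/cache
    namespace: shared-analytics   # hypothetical; must be identical in both workspaces
# Following <storage>/<namespace>/data/<source>/<table>/..., a table orders
# under source data would then be stored at:
#   ~/.pawrly/cache/shared-analytics/data/data/orders/part-000000.parquet
```

Sharing is only sensible when both workspaces define the source and table identically, since they read and write the same files.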
Safety
A safety: block sets guard rails that are enforced before a scan runs:
safety:
  require_filters_on: [order_date] # error unless a filter touches these columns
  require_at_least_one_filter: true # refuse a full-table scan
  max_rows: 1000000 # cap returned rows
  max_pages: 50 # cap HTTP pagination
  timeout: 30s # per-query timeout
  required_predicates: # predicates AND-ed into every scan (see semantic RLS)
    - "tenant_id = ${param:tenant_id}"
required_predicates is most useful with the semantic layer, where ${param:NAME} placeholders are bound from a query's params as safe literals for row-level security.
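In context, a safety: block sits on the source it protects. The following sketch attaches a few of the guard rails above to the HTTP source used elsewhere on this page; the particular limits are illustrative, not recommended values.

```yaml
sources:
  - name: gh
    kind: http
    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}
    safety:
      max_pages: 50       # stop paginating after 50 pages
      max_rows: 1000000   # cap returned rows
      timeout: 30s        # per-query timeout
```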
Multi-file configs
As a workspace grows, split pawrly.yaml so sources can live in their own files. Both primitives resolve paths relative to the declaring file and are assembled before validation, so the rest of the pipeline sees one merged tree.
- include: (top-level) splices other files into this one. Globs are allowed and sorted lexicographically. Each included file may be either form:
  - a fragment: a mapping carrying a sources: (and optional secrets:) list; or
  - a bare single source: the SourceDef itself, with name/kind/config at the top level and no sources: wrapper (recognised by the top-level kind:). Handy for one-source-per-file layouts behind a glob.

    include:
      - ./sources/*.yaml # may hold fragments and/or bare single sources
      - ./team-sources.yaml

  Either form may also carry a top-level models: list (the semantic models defined over its sources), which is spliced into semantic.models. This lets one file fully describe an integration (its source and its models). Models still merge into the one global semantic layer, so a co-located model may relate to a model declared in another file, and duplicate model names across files are rejected with both filenames.

    # sources/github.yaml: a bare single source plus the models over it
    name: gh
    kind: http
    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}
      raw_table: true
    models:
      - name: gh_issues
        source: gh.issues
        dimensions: [{ name: state, expr: state, type: string }]
        measures: [{ name: issue_count, agg: count, expr: id }]

  (A models: list in an included file is the file-level co-location convenience; for model-only files that aren't tied to one source, use semantic.include: below.)
- from: (on a source) loads one source's body from a sibling file:

    sources:
      - name: warehouse
        from: ./sources/warehouse.yaml

- semantic.include: splices the models of other files into semantic.models. Each referenced file contains only models (either a top-level models: list or a bare sequence of model mappings), never sources, secrets, or other config. Globs are allowed and sorted lexicographically; inline models: come first, then the included files. Duplicate model names (within or across files) are rejected with both filenames.

    semantic:
      include:
        - ./models/*.yaml # each file holds only models
      models:
        - name: inline_model # optional inline models still allowed
          # ...

    # models/orders.yaml: a model-only file
    models:
      - name: orders
        source: data.orders
        dimensions: [{ name: status, expr: status, type: string }]
        measures: [{ name: revenue, agg: sum, expr: total_amount }]
Includes can chain; from: is not transitive. Cycles are detected and reported. Model source: references and relationships are validated against the merged config, so a model in one file may reference a source declared in another. pawrly config show --tree prints the assembled tree, and pawrly source list annotates each source with the file it came from. A runnable layout lives under examples/multi-file/.
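The fragment form of an included file is described above but not shown. A sketch of what ./team-sources.yaml might contain, assuming it declares the same illustrative sources used elsewhere on this page:

```yaml
# team-sources.yaml: a fragment, i.e. a mapping with a sources: list
# and an optional secrets: list.
secrets:
  - kind: env
sources:
  - name: data
    kind: file
    config:
      path: ./data/*.csv   # resolved relative to this file, the declaring file
  - name: gh
    kind: http
    config:
      base_url: https://api.github.com
      token: ${secret:GITHUB_TOKEN}
```

After assembly these sources appear in the merged tree alongside any declared inline, and pawrly source list shows which file each one came from.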
Observability
observability: is optional. Absent, Pawrly logs to stderr as before and exports nothing. The block has three parts; CLI flags (--log-level, --log-format, --otel-endpoint, --otel-protocol, --prometheus-listen) override the matching settings.
observability:
  tracing:
    level: info # EnvFilter directive; RUST_LOG still wins
    format: text # text | json
  otel:
    enabled: false # master switch for OTLP export
    endpoint: http://localhost:4317
    protocol: grpc # grpc | http
    service_name: pawrly
    traces: true
    metrics: true
    logs: true # bridge tracing events to OTel logs
    sample_ratio: 1.0 # parent-based ratio sampler
    prometheus:
      enabled: false # serve a /metrics pull endpoint (independent of OTLP push)
      listen: 127.0.0.1:9090
  activity:
    enabled: false # master switch for the activity log
    sinks: [tracing] # any of: tracing, table
    redact_sql: false # false | literals | true
    ring_capacity: 10000 # in-memory rows kept for the `table` sink
    store: ~/.pawrly/activity # persist to Parquet; omit for in-memory only
    partition_hours: 4 # hr= partition width
    flush_threshold: 1000 # records buffered before a file is written
    flush_interval: 60s # or this, whichever first
    retention: 30d # prune files older than this; omit to keep all
Traces & logs are emitted as tracing spans/events and, when otel.enabled, exported over OTLP. W3C traceparent is propagated across the gRPC and MCP boundaries, so a CLI→daemon request is a single trace.
Metrics (query/cache/source counters and histograms) export over OTLP push when otel.metrics is on, and/or a Prometheus pull endpoint when otel.prometheus.enabled is on.
Activity log records one row per operation. The tracing sink emits a structured event; the table sink exposes the recent rows as the system.activity SQL table:
SELECT interface, status, count(*), avg(duration_ms)
FROM system.activity
WHERE at > now() - INTERVAL '1 hour'
GROUP BY 1, 2;
redact_sql controls SQL capture: false stores it verbatim, literals replaces literal values with $REDACTED (keeping shape), and true stores only the statement kind and tables. Parameter values are never stored.
Without store, system.activity is an in-memory ring of the most recent ring_capacity rows, lost on restart. Set store to persist records as date/hour-partitioned Parquet (dt=YYYY-MM-DD/hr=HH/…); the table then unions the on-disk history with the not-yet-flushed buffer, so it survives restarts. retention prunes old files. A runnable config lives at examples/observability.yaml.
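A smaller block is often enough in practice. The sketch below turns on the Prometheus pull endpoint and a persisted, redacted activity log. It assumes omitted keys fall back to the values shown above, and that the Prometheus endpoint does not need otel.enabled, which the "independent of OTLP push" comment suggests but does not state outright.

```yaml
observability:
  otel:
    prometheus:
      enabled: true               # serve /metrics for scraping
      listen: 127.0.0.1:9090
  activity:
    enabled: true
    sinks: [tracing, table]       # structured events plus the system.activity table
    redact_sql: literals          # keep statement shape, replace literal values
    store: ~/.pawrly/activity     # persist as partitioned Parquet
    retention: 30d                # prune files older than 30 days
```

With the table sink and store set, system.activity queries should see both the on-disk history and the rows not yet flushed.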
Defaults
defaults: sets workspace-wide values inherited by sources that don't override them — for example HTTP client settings and baseline safety caps. See the worked configurations in examples/pawrly.yaml for the full set of options in context.