Architecture Decision Records

How we made the tradeoffs.

Public ADRs in the Thoughtworks pattern. Each entry shows the real alternatives we considered, why we rejected them, and the consequences we’re living with. 8 decisions documented; we add to this list as we make new ones.

This page is different from the manifesto (our beliefs) and why-now (the market thesis): it covers the engineering reasoning.

ADR 0001 | Superseded by ADR 0008 | 2026-02-01

Why we ship installable binaries instead of preview URLs

Context

Every other AI app builder (Lovable, v0, Bolt, Replit Agent) stops at a deployable web app or a preview URL. We had to decide whether to follow that pattern or commit to a heavier execution plane.

Decision

Ship real installable binaries (.exe, .dmg, .apk, .AppImage) via dedicated bridge nodes. Web output is supported, but not the primary product.

Alternatives we rejected

Web-only output via WebContainers (Bolt pattern)

Rejected: Wins on developer experience but locks us out of half the economy: bookkeepers, plumbers, lawyers, clinics. They need installable software, not URLs.

Container-based "downloadable" web app (Electron wrapping a hosted URL)

Rejected: Inherits the worst of both: still needs internet, still feels like a web app, but adds installer friction. Customers see through it.

Consequences we live with

Good

✓ Differentiation: nobody else does this autonomously
✓ Real proof packs (build + launch verified on actual Windows / Linux)
✓ Output works offline once installed

Bad

· Bridge infrastructure cost (real EC2 + provisioning logic)
· Build times are slower than browser sandboxes
· macOS and iOS require physical Mac infrastructure (currently deferred)
· In practice the Windows bridge stayed stopped and no installer ever shipped end-to-end; ADR 0008 documents the pivot

Superseded by ADR 0008 →

ADR 0002 | Accepted | 2026-02-15

Why we run a multi-model gateway instead of locking to one provider

Context

We had to choose an LLM provider for production. Frontier model quality is a moving target: every quarter, the leaderboards reshuffle. Provider outages cause customer outages.

Decision

Route every AI call through our own gateway that supports 12 models across 4 providers (Anthropic, OpenAI, Google, Mistral). Users can hot-swap per request.
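
The routing-with-failover idea can be sketched roughly as follows. This is a minimal illustration, not our gateway code: the `Gateway` class, `ProviderError`, the `route` list, and the adapter callables are all hypothetical names, and real provider SDK calls are replaced by plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


class ProviderError(Exception):
    """Raised by an adapter when a provider call fails (outage, rate limit)."""


@dataclass
class GatewayResult:
    provider: str
    model: str
    text: str


class Gateway:
    """Hypothetical gateway: one adapter per (provider, model) pair."""

    def __init__(self) -> None:
        self._adapters: Dict[Tuple[str, str], Callable[[str], str]] = {}

    def register(self, provider: str, model: str, call: Callable[[str], str]) -> None:
        self._adapters[(provider, model)] = call

    def complete(self, prompt: str, route: List[Tuple[str, str]]) -> GatewayResult:
        """Try each (provider, model) in order and return the first success."""
        errors = []
        for provider, model in route:
            call = self._adapters.get((provider, model))
            if call is None:
                errors.append(f"{provider}/{model}: not registered")
                continue
            try:
                return GatewayResult(provider, model, call(prompt))
            except ProviderError as exc:
                # Record the failure and fall through to the next entry.
                errors.append(f"{provider}/{model}: {exc}")
        raise ProviderError("all providers failed: " + "; ".join(errors))
```

Because the route is passed per call, a caller can swap models per request, and a degraded first entry simply falls through to the next one.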

Alternatives we rejected

Single-provider integration (Anthropic-only)

Rejected: Concentrates risk: their outage = our outage. Their rate limit = our rate limit. Their pricing change = our cost spike.

Use a third-party gateway (OpenRouter, Portkey)

Rejected: Adds another dependency without giving us provider-specific features (e.g., Claude prompt caching, GPT structured outputs). Also fewer levers on rate-limiting and per-tenant quotas.

Consequences we live with

Good

✓ Provider failover (Mistral as fallback when Anthropic is degraded)
✓ Per-call cost telemetry across providers
✓ BYOK option (Pro+ can bring their own provider keys)

Bad

· Maintenance: every new model = a gateway update
· Cost of running our own observability layer instead of using a hosted one

See all 12 models →

ADR 0003 | Accepted | 2026-03-10

Why we generate one entity at a time instead of one big prompt

Context

Early experiments asked the AI to generate the whole app in a single call. The output was spaghetti: inconsistent naming, broken cross-references, unfixable compile errors.

Decision

Generate one entity per AI call. Pass relationship context (foreign keys, dependent workflows) into each call as a system-prompt frame. The orchestrator stitches the results.
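
A minimal sketch of the per-entity loop, under stated assumptions: `Entity`, `build_frame`, and `generate_app` are hypothetical names, only foreign keys are modeled (dependent workflows are left out), and `call_model` stands in for whatever function actually calls the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Entity:
    name: str
    fields: List[str]
    # Names of other entities this one points at via foreign keys.
    references: List[str] = field(default_factory=list)


def build_frame(entity: Entity, all_entities: Dict[str, Entity]) -> str:
    """Build the relationship context sent with a single-entity request."""
    lines = [f"Generate code for entity '{entity.name}' only."]
    lines.append("Fields: " + ", ".join(entity.fields))
    for ref in entity.references:
        target = all_entities[ref]
        lines.append(
            f"Foreign key to '{target.name}' (fields: {', '.join(target.fields)})"
        )
    return "\n".join(lines)


def generate_app(
    entities: List[Entity], call_model: Callable[[str], str]
) -> Dict[str, str]:
    """One model call per entity; results are keyed by entity name for stitching."""
    by_name = {e.name: e for e in entities}
    return {e.name: call_model(build_frame(e, by_name)) for e in entities}
```

Keying the results by entity name is what makes it cheap to retry a single failed entity: only that entry has to be regenerated.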

Alternatives we rejected

One big prompt for the whole app

Rejected: Output quality drops sharply past ~4 entities. Compile-fix loops never converge on the messy parts.

Two-pass: outline first, then generate each entity in a follow-up

Rejected: We tried this. The first-pass outline drifted from final-output reality so often that it didn't add value. We rolled it back.

Consequences we live with

Good

✓ Compile-fix cycles converge faster (1-2 cycles vs 4-5)
✓ AI calls are cheaper per entity (smaller context per call)
✓ Easier to retry just the failed entity without regenerating the whole app

Bad

· More orchestrator complexity (stitching state across calls)
· Total AI cost per build is higher (more calls, even if cheaper each)

Read the Check Writer Pro build report →

ADR 0004 | Accepted | 2026-03-25

Why we cap compile-fix retries at 6 build attempts and 15 total repair cycles

Context

The compile-fix loop has the AI patch failed files. The initial implementation allowed unlimited retries. We found jobs occasionally spinning for 20+ minutes on the same kind of error.

Decision

Cap at 6 build attempts per job (initial + 5 repair rounds) and 15 total repair cycles across the whole job. If the loop hasn't converged by then, mark the file as needing human attention and proceed with the rest of the build.
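
The two caps can be sketched as a bounded loop. This is an illustration under one loudly stated assumption: a "repair cycle" is counted as one AI patch to one failed file, so a repair round touching four files spends four cycles. The function names and return shape are hypothetical.

```python
from typing import Callable, List, Tuple

MAX_BUILD_ATTEMPTS = 6   # initial build + 5 repair rounds (from this ADR)
MAX_REPAIR_CYCLES = 15   # total across the whole job (from this ADR)


def run_job(
    build: Callable[[], List[str]],
    repair: Callable[[str], None],
) -> Tuple[bool, List[str], int, int]:
    """Run the compile-fix loop under both caps.

    build() returns the files that failed to compile (empty list = success).
    repair(path) asks the AI to patch one file.
    Returns (converged, files_needing_attention, build_attempts, repair_cycles).
    """
    cycles = 0
    failed: List[str] = []
    for attempt in range(1, MAX_BUILD_ATTEMPTS + 1):
        failed = build()
        if not failed:
            return True, [], attempt, cycles
        if attempt == MAX_BUILD_ATTEMPTS:
            break  # no repair after the last allowed build
        for path in failed:
            if cycles >= MAX_REPAIR_CYCLES:
                # Cycle budget exhausted mid-round: stop and flag what failed.
                return False, failed, attempt, cycles
            repair(path)
            cycles += 1
    return False, failed, MAX_BUILD_ATTEMPTS, cycles
```

When the loop does not converge, the second element of the result is the list of files to mark as needing human attention while the rest of the build proceeds.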

Alternatives we rejected

Unlimited retries

Rejected: Burns AI tokens on convergence problems the AI can't actually solve (e.g., genuine logic ambiguity in the prompt).

Cap at 3 attempts (more conservative)

Rejected: Cut off too many successful builds at retry 4. We have data showing convergence usually happens by cycle 2-3, but the long tail goes to 4-5.

Consequences we live with

Good

✓ Bounded build cost per job
✓ Faster failure feedback (5 repair rounds × ~30s each = ~2.5 min ceiling per file)
✓ Error catalog grows: failed file + repair attempts become a training signal

Bad

· Some genuinely-fixable builds get cut off if errors are subtle
· Cap is empirical; it needs adjusting as models get better

ADR 0005 | Accepted | 2026-04-12

Why we built our own audit log instead of using a SIEM

Context

Enterprise procurement needs an audit log they can export and ingest into their SIEM. We could either ship our own append-only log or call a vendor (Datadog, Splunk, AuditBoard).

Decision

Build our own hash-chained append-only log in Postgres. Each event records: actor (user / agent), agent_run_id, action, target, timestamp, prev_hash. Export to CSV/JSONL.
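
A minimal in-memory sketch of the hash chain, using the event fields listed above. The assumptions are ours for illustration: SHA-256 over canonical JSON as the hash, an all-zero genesis value for the first event, and a Python list standing in for the Postgres table.

```python
import hashlib
import json
from typing import Dict, List, Optional

GENESIS = "0" * 64  # assumed prev_hash for the first event in the chain


def event_hash(event: Dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra spaces)."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_event(
    log: List[Dict],
    actor: str,
    agent_run_id: Optional[str],
    action: str,
    target: str,
    timestamp: str,
) -> Dict:
    """Append an event whose prev_hash commits to the previous event."""
    event = {
        "actor": actor,
        "agent_run_id": agent_run_id,
        "action": action,
        "target": target,
        "timestamp": timestamp,
        "prev_hash": event_hash(log[-1]) if log else GENESIS,
    }
    log.append(event)
    return event


def verify_chain(log: List[Dict]) -> Optional[int]:
    """Return the index of the first event whose prev_hash does not match,
    or None if the whole chain is consistent."""
    expected = GENESIS
    for index, event in enumerate(log):
        if event["prev_hash"] != expected:
            return index
        expected = event_hash(event)
    return None
```

In this scheme an edit to event i surfaces as a mismatch at event i + 1, so the newest event is only covered if the current head hash is also recorded somewhere outside the table.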

Alternatives we rejected

Call Datadog logs API for everything

Rejected: We pay Datadog per ingested event. Audit events would push us into the next pricing tier instantly. Also: exporting to a customer's SIEM through Datadog is awkward.

Use a dedicated audit-log SaaS (AuditBoard, Standard Notes)

Rejected: These are designed for human-actor audit. Our log needs first-class agent_run_id semantics and a way to attest hash-chain integrity. SaaS options assume humans.

Consequences we live with

Good

✓ Cheap (just another Postgres table)
✓ Export format is whatever procurement asks for
✓ Hash chain proves tamper-evidence without depending on a third party

Bad

· We carry the operational burden (retention, archival, query performance)
· No fancy dashboards out of the box; the operator builds basic views

ADR 0006 | Accepted | 2026-04-20

Why our proof pack is structured JSON instead of a video

Context

How do we prove a build worked? Two options: video of the launched app + a human comment ("ran fine") or a structured proof pack.

Decision

Structured JSON proof pack: prompt, manifest, generation trace, compile attempts, repair cycles, launch result, screenshot URL, artifact SHA-256. Stored in sf_factory_proofs.
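
To make "same shape, different values" concrete, here is a rough sketch of a proof pack as a typed record. The snake_case field names, the field types, and the helper functions are assumptions for illustration; the actual schema stored in sf_factory_proofs may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class ProofPack:
    # Field list follows the decision text; names and types are assumed.
    prompt: str
    manifest: Dict[str, Any]
    generation_trace: List[str]
    compile_attempts: int
    repair_cycles: int
    launch_result: str
    screenshot_url: str
    artifact_sha256: str


def artifact_digest(data: bytes) -> str:
    """Hex SHA-256 of the built artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def to_json(pack: ProofPack) -> str:
    """Serialize with sorted keys so two packs are easy to diff as text."""
    return json.dumps(asdict(pack), sort_keys=True)


def diff_packs(a: ProofPack, b: ProofPack) -> Dict[str, Tuple[Any, Any]]:
    """Return only the fields whose values differ between two packs."""
    left, right = asdict(a), asdict(b)
    return {key: (left[key], right[key]) for key in left if left[key] != right[key]}
```

A field-level diff like this is the kind of programmatic comparison that a video or a PDF cannot offer.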

Alternatives we rejected

Loom video per build

Rejected: Beautiful for marketing, useless for procurement. Auditors can't grep video. Customers can't bulk-export it.

PDF report per build

Rejected: Fixed format means we can't evolve the schema easily. Hard to compare two proof packs programmatically.

Consequences we live with

Good

✓ Comparable across builds (same shape, different values)
✓ Easy to extend (add a field; old proofs still readable)
✓ Auditor-friendly (CSV/JSONL export of any subset)
✓ Screenshot URL inside the pack means the pack can still be a visual artifact

Bad

· Not as compelling on first-glance demos (a video tells a story; JSON requires reading)

Read the Check Writer Pro proof pack →

ADR 0007 | Accepted | 2026-05-03

Why we publicly publish our roadmap including "considering" items

Context

Most product roadmaps are private or filtered to "things definitely shipping." We had to decide whether to follow that pattern.

Decision

Publish the full roadmap publicly. Four statuses: shipped / in-progress / planned / considering. "Considering" items explicitly invite buyer feedback.

Alternatives we rejected

Private roadmap shared under NDA

Rejected: Defeats the purpose: prospects who would tell us "I'd pay for X" never see X listed.

Public roadmap showing only shipped + in-progress

Rejected: Customers can't tell what we're thinking. They assume we won't build X and go elsewhere.

Consequences we live with

Good

✓ Signal: serious about transparency
✓ Inbound feedback on "considering" items helps prioritize
✓ No surprise pivots: buyers see strategy evolving

Bad

· Competitors see our priorities (low cost; they probably already inferred them)
· We have to commit to "considering" not meaning "shipping next quarter"

See the public roadmap →

ADR 0008 | Accepted | 2026-05-18

Web-first pipeline today; desktop and mobile as roadmap

Context

ADR 0001 (2026-02-01) committed to installable binaries (.exe / .dmg / .apk / .AppImage) as the primary product. Three months later, the practical state of the codebase forced a decision. The Windows bridge node is provisioned (EC2 i-013da1cbb85fd6168) but stopped, with no agent installed. The WPF code paths exist in factory/, but no end-to-end build has ever produced a verified installer on prod. The mobile Capacitor stub has never produced an APK. Meanwhile, the web pipeline (Linux bridge, Next.js + libSQL, functional POST + GET smoke test) ran job 222cdd42 to a verified end-to-end pass on 2026-05-18. We had to pick: keep claiming what we have not shipped, or write down what we ship today and what is roadmap.

Decision

Ship the web pipeline as the primary product. Every passing build today is a downloadable Next.js + libSQL workspace verified by a real POST + GET smoke against the running server on the bridge. Desktop (Windows / macOS / Linux) and mobile (Android / iOS) pipelines remain on the roadmap; we will not market or sell them as available until the same proof bar (verified end-to-end run on /factory/proof) is met.
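
A POST + GET smoke check of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the factory's smoke test: the `smoke` function, the endpoint path passed in by the caller, and the expectation that GET returns a JSON list of records are all hypothetical.

```python
import json
import urllib.request


def smoke(base_url: str, path: str, payload: dict, timeout: float = 5.0) -> bool:
    """POST one record, then GET the collection and check the record is there.

    Assumes the endpoint accepts JSON on POST and returns a JSON list on GET.
    """
    # Talk to the server directly; ignore any proxy settings in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener.open(request, timeout=timeout) as response:
        if response.status not in (200, 201):
            return False

    with opener.open(base_url + path, timeout=timeout) as response:
        if response.status != 200:
            return False
        items = json.loads(response.read().decode("utf-8"))

    # Pass only if some returned record carries every field we just posted.
    return any(
        all(item.get(key) == value for key, value in payload.items())
        for item in items
    )
```

The point of pairing the two requests is that a passing GET alone only shows the server is up, while a POST followed by a matching GET shows that a write actually reached storage and can be read back.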

Alternatives we rejected

Keep ADR 0001 in force and ship more marketing about installable binaries

Rejected: The factory has never shipped a verified installer. Continuing to claim it would build a credibility debt that costs more to unwind later than stripping the claim costs now.

Drop desktop and mobile from the roadmap entirely

Rejected: The architecture (bridge-node design, capability resolver, packaging modules) still makes sense. We are pausing the surface, not killing it. The roadmap and the matrix on /factory mark these as planned.

Consequences we live with

Good

✓ Every public claim about the product matches what the factory has actually produced
✓ Verified-build evidence on /factory/proof is the bar for "shipping", not roadmap intent
✓ Buyers who specifically need a web app get a clean, accurate pitch

Bad

· Loses the "only AI factory that ships installable native apps" differentiator until desktop is actually verified end-to-end
· Customers reading old marketing in archived blog posts / case studies may need to be redirected to the current ADR
· The Windows bridge node stays provisioned but unused (~$0.24/hr when started; currently $0 while stopped)

Recipe Manager Web — first verified prod pass →

Want to argue with a decision?

Each ADR is open for revision. If you think we picked the wrong alternative, email [email protected]. We respect a good counter-argument; we’ll add a new ADR if you change our minds.

See our manifesto | See decisions in production