feat: containerized control plane — Postgres persistence, dark-agent init, tenant onboarding, fleet UI #3

Merged
arento merged 19 commits from feat/containerize into main 2026-07-02 12:02:54 +00:00

Turns the pre-scaffold repo into a working, containerized multi-tenant control plane. Verified live end-to-end against real Yandex Cloud, a live dark-agent 0.19.0 guest, HashiCorp Vault, and Postgres.

What landed

  • Deploy: multi-stage Dockerfile + docker-compose.yml (orchestrator + Postgres + Vault + UI); DARK_BIND configurable listen address.
  • Persistence base (#756): FleetRegistry port + Postgres adapter (durable fleet + lifecycle status), durable Postgres vault, db module + migrations; reconcile_once loop (orphan #767 / stuck / interrupted-teardown / Ready-drift), timer in main.
  • dark-agent init (#750): corrected DarkAgent port to the real guest API + HttpDarkAgent (reqwest/rustls); the init→claude→export dance, bundle round-tripped through the vault; guest_report (live /health + /heal).
  • Tenant onboarding (#740): TenantSecretSource/Sink ports + Vault KV v2 adapter; PUT /tenants/:tenant/secrets onboarding endpoint — a secret-write surface kept off the fleet-command API (#737).
  • API + UI: GET /instances, GET /instances/:tenant/:id detail; a TypeScript backend-for-frontend dashboard (fleet table, provision/teardown/onboard, per-VM page with opencode/dark-agent/heal entry points + live guest).

Live-hardening fixes (from the real run)

  • VM image is operator config (DARK_VM_IMAGE), not a request field.
  • Unique VM names dark-vm-{uuid} (no YC 409 on re-provision; non-ASCII tenants safe).
  • Liberal decode of dark-agent responses (0.19.0 omits nks_configured).
  • RECONCILE_INTERVAL_SECS=0 disables reconcile (safety valve for manual real-cloud runs); empty env vars treated as unset.

Known follow-ups (not in this PR)

  • reconcile race: reconcile can reap a VM mid-create (in YC, not yet in registry). Needs a grace period / early pre-record before re-enabling reconcile against real cloud.
  • Orchestrator-side bundle encryption seam is still identity (#628).
  • 409 name-conflict surfaces as a generic 500.

Sandbox project — CI runs fmt + clippy + test (41 tests; PG/Vault/YC integration tests are #[ignore]).

Multi-stage Dockerfile (rustls → no OpenSSL, debian-slim runtime + ca-certs)
and a .dockerignore that keeps target/.env/secrets/*.pem out of the build
context. main.rs reads DARK_BIND (default 127.0.0.1:8080); the image sets
0.0.0.0:8080 so a deploy listens off-loopback. Verified: image builds, container
serves /health 200.

Persistent-state foundation for the control plane, mock-backed for now
(Postgres adapters land next):

- FleetRegistry port (src/fleet/) — tenant->instance->LifecycleStatus, keyed
  by provider instance_id; MockFleetRegistry in-memory adapter.
- CloudProvider::list_instances (mock + yc, yc scoped to the dark-vm- prefix)
  gives reconcile its ACTUAL state.
- Orchestrator now records lifecycle: provision writes Provisioning/Initializing/
  Ready; teardown marks TearingDown, then writes a Retired tombstone (kept in either
  teardown mode) so a failed destroy is caught by reconcile (NKS #769).
- Orchestrator::reconcile_once converges ACTUAL toward DESIRED: it destroys orphans
  (#767) and stuck-provisioning rows, resumes interrupted/failed teardowns, and reports
  vanished-Ready drift (respawn is #754 policy, not a reconcile default). The now and
  stuck values (current time, stuck threshold) are injected as arguments so the timer
  stays out of the core.
- clock::now_ms shared helper (unix-ms i64, no chrono).

11 control tests + fleet/provider unit tests green.
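
The convergence rule described for reconcile_once can be sketched as a pure planning step over the two views. This is a minimal sketch, not the PR's implementation: `Record`, `Action`, and `plan` are hypothetical names, and two details are assumptions rather than facts from the commit: that Initializing rows count as "stuck" alongside Provisioning ones, and that a Retired tombstone whose VM is still present in the cloud is treated as a failed destroy to resume.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Provisioning,
    Initializing,
    Ready,
    TearingDown,
    Retired,
}

/// Registry row, reduced to what the planner needs (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Record {
    pub status: Status,
    pub updated_ms: i64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    DestroyOrphan(String),
    DestroyStuck(String),
    ResumeTeardown(String),
    ReportDrift(String),
}

/// `actual` is the set of instance ids the provider lists; `desired` is the registry.
/// `now_ms` and `stuck_ms` are passed in so no clock or timer lives in this function.
pub fn plan(
    actual: &BTreeSet<String>,
    desired: &BTreeMap<String, Record>,
    now_ms: i64,
    stuck_ms: i64,
) -> Vec<Action> {
    let mut actions = Vec::new();
    // In the cloud but unknown to the registry: an orphan.
    for id in actual {
        if !desired.contains_key(id) {
            actions.push(Action::DestroyOrphan(id.clone()));
        }
    }
    for (id, rec) in desired {
        let in_cloud = actual.contains(id);
        match rec.status {
            // Assumption: both pre-Ready states can be "stuck".
            Status::Provisioning | Status::Initializing
                if now_ms - rec.updated_ms > stuck_ms =>
            {
                actions.push(Action::DestroyStuck(id.clone()))
            }
            Status::TearingDown => actions.push(Action::ResumeTeardown(id.clone())),
            // Assumption: tombstone plus a live VM means the destroy failed.
            Status::Retired if in_cloud => actions.push(Action::ResumeTeardown(id.clone())),
            // Drift is only reported; respawn is a separate policy decision.
            Status::Ready if !in_cloud => actions.push(Action::ReportDrift(id.clone())),
            _ => {}
        }
    }
    actions
}
```

Returning a list of actions instead of performing them keeps the decision logic testable without a provider or a registry backend.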

Split vault.rs into vault/{mod,mock}.rs (dir-module form) and add a
VaultError::Backend variant distinct from Missing. The new variant forces the
load-bearing match in control::initialize to decide explicitly: a backend read
failure now bails (return Err) instead of falling through to the init/export
path that would rotate a live tenant's durable bundle (NKS #628). Regression
test proves no dark-agent ops run on a vault backend error.
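
The Missing-vs-Backend decision can be illustrated with a small exhaustive match. This is a sketch under assumptions: `Next` and `decide` are hypothetical names, the `String` payload on `Backend` is illustrative, and the "existing bundle is imported" arm is an assumption about what control::initialize does on a successful read.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum VaultError {
    /// No bundle stored for this tenant yet.
    Missing,
    /// The vault backend itself failed (payload is illustrative).
    Backend(String),
}

/// What the initialize step should do next (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Next {
    /// A durable bundle exists: reuse it.
    Import(Vec<u8>),
    /// First boot for this tenant: run init and export a fresh bundle.
    InitAndExport,
}

pub fn decide(read: Result<Vec<u8>, VaultError>) -> Result<Next, VaultError> {
    match read {
        Ok(bundle) => Ok(Next::Import(bundle)),
        Err(VaultError::Missing) => Ok(Next::InitAndExport),
        // A read failure is NOT evidence that the bundle is absent:
        // bail out rather than rotate a live tenant's credentials.
        Err(e @ VaultError::Backend(_)) => Err(e),
    }
}
```

Because the match has no wildcard arm, adding another `VaultError` variant later fails to compile until this decision is revisited.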

- src/db.rs: PgPool connect + sqlx migrate (startup concern, typed DbError).
- migrations/0001_init.sql: fleet + secrets tables, unix-ms BIGINT, tenant index.
- PgFleetRegistry / PgVault: runtime sqlx queries only (no compile-time macros,
  so CI needs no DB); errors translated to Fleet/VaultError::Backend at the edge;
  vault keeps the load-bearing Missing-vs-Backend split.
- docker-compose.yml: postgres:16 + orchestrator (forward-looking wiring).
- #[ignore] integration tests gated on DATABASE_URL.

Correct the DarkAgent port to the real guest API and add the reqwest adapter:
- claude_login -> provision_claude(target, oauth_token): the guest does no OAuth
  flow; the orchestrator supplies an already-minted token and the guest launches
  Meridian. init returns InitOutcome (ssh_pubkey + forge registration), not ().
- Durable bundle is a typed, core-owned CredentialBundle (JSON), not opaque bytes;
  control::initialize serde-serializes it into the vault blob and back. The Claude
  token is node-local (not in the bundle), so respawn also calls provision_claude.
- HttpDarkAgent (rustls, timeouts): http://{target}[:8080]/credentials/*, boundary
  error translation (transport->Unreachable, 503->Unavailable, 400->Validation,
  else Credentials); never logs secret-bearing bodies. Not yet wired in main.
- DarkAgentError gains Unavailable/Validation; secret-bearing types omit Debug (#737).
Adapter built in an isolated worktree; control.rs dance re-applied onto the
registry/reconcile base by hand. 29 tests green.
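
The boundary error translation listed above (transport->Unreachable, 503->Unavailable, 400->Validation, else Credentials) might look like the following. It is a std-only sketch: the variants are unit variants for brevity, `translate` is a hypothetical helper, and a transport failure is modelled as `None` rather than a real reqwest error.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DarkAgentError {
    Unreachable,
    Unavailable,
    Validation,
    Credentials,
}

/// `status` is the HTTP status code, or `None` when no response arrived at all.
pub fn translate(status: Option<u16>) -> Result<(), DarkAgentError> {
    match status {
        None => Err(DarkAgentError::Unreachable),
        Some(200..=299) => Ok(()),
        Some(503) => Err(DarkAgentError::Unavailable),
        Some(400) => Err(DarkAgentError::Validation),
        // Every other status is reported as a generic credentials failure.
        Some(_) => Err(DarkAgentError::Credentials),
    }
}
```

Only the status code is inspected here; no response body is read or formatted, which fits the rule that secret-bearing bodies are never logged.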

main now assembles the real stack behind 12-factor env switches: DATABASE_URL
selects the Postgres fleet registry + vault (else mocks), DARK_AGENT=http selects
the real guest credentials client (else mock), on top of the existing DARK_PROVIDER
and DARK_BIND. Spawns the reconcile loop as a background tokio task (timer in main,
core stays scheduling-free; RECONCILE_INTERVAL_SECS / RECONCILE_STUCK_SECS).
AppState holds Arc<Orchestrator> so the loop and the HTTP handlers share it.
clock::now_ms made pub for the loop.

Verified live against a real Postgres: migrations run on boot; provision writes
Ready rows + durable bundles; teardown(Retire) leaves a Retired tombstone and drops
only that tenant's bundle.

Orchestrator::list_fleet exposes the registry's view; api.rs maps FleetRecord to
a secret-free InstanceDto (tenant, instance_id, address, status text, timestamps)
and serves it at GET /instances. The durable bundle never crosses this boundary
(#737). Also tweak the docker-compose comment, set the mock provider/agent
explicitly, and expose Postgres on host port 5433.

A small dashboard for the control plane, in ui/ (Node + TypeScript + Express):
- server.ts is a backend-for-frontend: serves one static page and proxies the
  orchestrator API server-side (GET /instances, /health, POST /provision, DELETE),
  so the browser never hits the orchestrator directly (no CORS) and never touches
  the secrets-bearing DB. 502 JSON on an unreachable upstream, no crash.
- public/index.html: vanilla HTML+CSS+fetch — auto-refreshing fleet table with
  color-coded status, health dot, provision form, per-row teardown. XSS-escaped.
- ui/Dockerfile (multi-stage node:20-slim) + ui service in docker-compose (port
  3000, ORCHESTRATOR_URL=http://orchestrator:8080). Root .dockerignore skips /ui/
  so the Rust image build context stays clean.
Verified: UI up, page 200, proxy lists the fleet, provision/teardown round-trip.

Closes the write-side of tenant onboarding so a tenant's provisioning secrets can
reach the orchestrator WITHOUT crossing the fleet-command API (NKS #737):

- tenant_secrets port + HashiCorp Vault KV v2 adapter (read: control loop fetches
  a tenant's forge tokens / nks_pat / Claude oauth to drive the init dance; write:
  onboarding stores them). One Vault client backs both; a stub refuses without a
  backend. Path segments percent-encoded (no traversal, #632). Bodies never logged.
- Orchestrator::initialize now sources secrets from the port (was a hardcoded stub).
- PUT /tenants/:tenant/secrets onboarding endpoint (AppState holds the sink) — a
  separate secret-write surface from the provision/teardown command path.
- main: one Vault instance as source+sink; empty DATABASE_URL/VAULT_ADDR now count
  as unset (env_nonempty) so a blank .env selects mock/stub instead of panicking.
- .env.example documents VAULT_*/DATABASE_URL/DARK_AGENT.

Verified live: PUT onboards a tenant, the doc lands at secret/tenants/<t> in a real
Vault, write/read schema round-trips. 39 tests green.
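
Percent-encoding the tenant path segment, as described for the Vault adapter, can be sketched with the standard library alone. The helper names are hypothetical, and the `secret` mount is taken from the `secret/tenants/<t>` location mentioned above; the `/v1/<mount>/data/<path>` shape is the KV v2 HTTP read/write path.

```rust
/// Percent-encode one path segment, keeping only RFC 3986 unreserved
/// characters. A '/' in the input is encoded, so it cannot add segments.
pub fn encode_segment(segment: &str) -> String {
    let mut out = String::with_capacity(segment.len());
    for byte in segment.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else, including non-ASCII UTF-8 bytes, becomes %XX.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// KV v2 API path for a tenant's secret document (mount name assumed).
pub fn tenant_secret_path(tenant: &str) -> String {
    format!("/v1/secret/data/tenants/{}", encode_segment(tenant))
}
```

A tenant such as `../x` stays a single segment (`..%2Fx`) instead of climbing out of `tenants/`.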

Extend the dashboard to cover the full control surface:
- server.ts: PUT /api/tenants/:tenant/secrets proxy (same forward() helper, tenant
  percent-encoded, body forwarded verbatim and never logged — it carries secrets).
- index.html: 'Onboard tenant' form (tenant + claude_oauth_token + nks_pat + one
  forge row), all secret fields type=password/autocomplete=new-password, cleared on
  success so secrets don't linger in the DOM; plus a manual Refresh button beside the
  3s auto-refresh. Provision form, per-row Teardown, and health dot unchanged.
Verified against the real yc+vault orchestrator: onboard PUT returns 204 (wrote to
the live Vault), /api/instances + /api/health proxy through.

Two fixes from the first live YC run:
- image_ref was a client-supplied /provision field (and a UI input defaulting to the
  bogus 'img-1' → 'Image not found'). The image a tenant runs is an operator decision,
  not a per-request choice: Orchestrator now holds a configured image (main reads
  DARK_VM_IMAGE, falling back to YC_TEST_IMAGE_ID); provision takes only the tenant.
  ProvisionReq and the UI provision form drop the image field.
- RECONCILE_INTERVAL_SECS=0 now disables the reconcile loop entirely. This is a safety
  valve for driving real cloud by hand, where reconcile against an empty mock registry
  would otherwise reap live VMs as orphans (on the boot tick) or as stuck rows. This is
  what turned a manually-created VM into 'retired': its health-wait to a PRIVATE-IP
  guest timed out, then reconcile reaped the stuck row.
39 tests green.
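
The two env conventions in this PR (blank values count as unset, and an interval of 0 disables reconcile) can be sketched as below. `env_nonempty` is named in an earlier commit, but its signature here is a guess; `nonempty` and `reconcile_interval` are hypothetical helpers, and falling back to the default on an unparseable value is an assumption.

```rust
use std::time::Duration;

/// Treat an empty value the same as an unset variable.
pub fn nonempty(raw: Option<String>) -> Option<String> {
    raw.filter(|value| !value.is_empty())
}

/// Read an env var, mapping "unset" and "set but empty" to `None`.
pub fn env_nonempty(key: &str) -> Option<String> {
    nonempty(std::env::var(key).ok())
}

/// `None` means "do not spawn the reconcile loop at all".
pub fn reconcile_interval(raw: Option<String>, default_secs: u64) -> Option<Duration> {
    let secs = match nonempty(raw) {
        // Assumption: an unparseable value falls back to the default.
        Some(value) => value.parse::<u64>().unwrap_or(default_secs),
        None => default_secs,
    };
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}
```

In main this would be called as `reconcile_interval(env_nonempty("RECONCILE_INTERVAL_SECS"), default)`, spawning the background task only when the result is `Some`.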

First real provision→init against a live dark-agent 0.19.0 failed at the init step
with 'error decoding response body': the guest returns {ssh_pubkey, forges} but the
InitOutcome DTO required nks_configured. Make InitOutcome/ClaudeOutcome/ImportOutcome
#[serde(default)] (Default-derived) so the adapter is liberal in what it accepts
across dark-agent versions. Regression test pins the real 0.19.0 init response.

Verified end to end after the fix: VM provisions with a public IP, /health is reached,
init decodes, and the claude step runs and correctly surfaces the guest's validation
(the test tenant's stored token wasn't a real sk-ant-oat01 Claude Max token).

Backs a per-VM operator page (NKS #755 read side):
- DarkAgent::guest_report(target) fetches the guest's live /health + /heal as opaque
  serde_json::Value (mock + http adapters); diagnostic, no secrets. Transport failure
  → Unreachable so the caller can show 'unreachable'.
- Orchestrator::get_instance(id) + guest_report(target) passthroughs.
- GET /instances/:tenant/:id → InstanceDetailDto: the registry record, entry-point URLs
  (opencode http://addr:4096 per dark-vm opencode.nix; dark-agent :8080), and a
  best-effort GuestView {reachable, health, heal, error}. Fetching /heal also drives the
  guest's self-repair. 404 when the tenant/id don't match a record.
40 tests green.
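
The detail response described above combines derived entry-point URLs with a best-effort guest view. The sketch below shows one way to build both; `EntryPoints`, `entry_points`, `guest_view`, and the `dark_agent_url` field are hypothetical, and `String` stands in for the opaque serde_json::Value the real adapter returns.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct EntryPoints {
    pub opencode_url: String,
    pub dark_agent_url: String,
}

/// Derive entry-point URLs from the instance address (ports from the commit text).
/// Note: a bare IPv6 address would need brackets; not handled in this sketch.
pub fn entry_points(address: &str) -> EntryPoints {
    EntryPoints {
        opencode_url: format!("http://{}:4096", address),
        dark_agent_url: format!("http://{}:8080", address),
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct GuestView {
    pub reachable: bool,
    pub health: Option<String>,
    pub heal: Option<String>,
    pub error: Option<String>,
}

/// Best-effort: a failed guest fetch becomes data in the view, not an API error.
pub fn guest_view(report: Result<(String, String), String>) -> GuestView {
    match report {
        Ok((health, heal)) => GuestView {
            reachable: true,
            health: Some(health),
            heal: Some(heal),
            error: None,
        },
        Err(message) => GuestView {
            reachable: false,
            health: None,
            heal: None,
            error: Some(message),
        },
    }
}
```

Folding the failure into the view lets the detail page still render the registry record and show an 'unreachable' banner when the guest is down.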

New ui/public/instance.html, opened via ?tenant=&id= from a fleet row's tenant link
or a Details button. It shows:
- the VM's non-secret identity (tenant, instance_id, address, color-coded status,
  created/updated as local time);
- entry points into working with the VM: Open opencode (opencode_url), a dark-agent
  link, a Heal/refresh button that re-drives the guest /heal, and connection data
  (address + ports 4096/8080);
- a live guest section (health version/uptime + pretty-printed /heal, or an
  'unreachable' banner), auto-refreshing every 5s.
server.ts gains a GET /api/instances/:tenant/:id proxy (percent-encoded, shared
forward()). All values HTML-escaped; no secret fields. Verified: page 200, proxy
relays 404, dashboard intact.

The tenant-derived name meant YC's per-folder name uniqueness rejected a second
provision for the same tenant with 409 (surfaced as a 500), and a leftover VM
blocked re-provision. Name each VM dark-vm-{uuid} instead: unique per provision, so
a tenant may run more than one VM and re-provision never collides. The tenant is
NOT in the name — the registry maps provider-id -> tenant. Keeps FLEET_NAME_PREFIX
for reconcile's orphan scan. Side benefit: non-ASCII tenants can no longer produce
an invalid YC resource name (the earlier 'invalid resource name' error for a Cyrillic
tenant).
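
The naming scheme and the prefix-based orphan scan fit together roughly as follows. This is a std-only sketch: the value of `FLEET_NAME_PREFIX` is assumed from the `dark-vm-` prefix mentioned earlier, the helper names are hypothetical, and the unique part is passed in as a string rather than generated with a UUID library.

```rust
/// Assumed value; the PR only states that the constant is kept.
pub const FLEET_NAME_PREFIX: &str = "dark-vm-";

/// Build a VM name from an already-generated unique id (e.g. a UUID string).
/// The tenant is deliberately absent: the registry maps provider id -> tenant.
pub fn vm_name(unique_id: &str) -> String {
    format!("{}{}", FLEET_NAME_PREFIX, unique_id)
}

/// Reconcile's orphan scan only considers VMs carrying the fleet prefix.
pub fn is_fleet_vm(name: &str) -> bool {
    name.starts_with(FLEET_NAME_PREFIX)
}
```

Since the name is built only from the prefix and the unique id, the tenant string never reaches the provider's name validation.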

refactor(deploy): builtin-only repo sync (git archive/unarchive), drop ansible.posix; lint clean
All checks were successful
ci / test (push) Successful in 1m55s
ci / test (pull_request) Successful in 44s
554db72fe1
arento merged commit f98aaa773d into main 2026-07-02 12:02:54 +00:00
Reference
projects/dark-orchestrator!3