// writing / May 5, 2026

Architecting FIVUCSAS: a multi-tenant biometric authentication platform

How a Marmara senior project grew into a production multi-tenant biometric platform — and the three production incidents the team learned the most from.

FIVUCSAS (Face and Identity Verification Using Cloud-based SaaS) started as my Marmara senior engineering project and now ships under RollingCat Software, the umbrella name I publish some of my work under. It is a production multi-tenant biometric authentication platform: Spring Boot 3 / Java 21 on PostgreSQL 16 + pgvector, a FastAPI ML sidecar, a React 18 web app, and Kotlin Multiplatform desktop and mobile clients. It is WebAuthn-first, KVKK-compliant, self-hosted on a Hetzner CX43 box behind Traefik, observed with Loki + Promtail + Grafana, and backed up with pgBackRest WAL archiving for PITR. The repo lives at github.com/Rollingcat-Software/FIVUCSAS, and the embeddable widget at verify.fivucsas.com.

This piece is not a feature tour. It is what I would tell another engineer about to build something at this scope: the three production incidents the team learned the most from, written up as the post I wish had existed when we were starting out.

The shape of the system

The platform splits into three runtime axes:

  1. Identity Core API — Spring Boot 3 / Java 21, the authoritative source of truth for tenants, users, sessions, audit logs, MFA factors (TOTP, WebAuthn, NFC, biometric).
  2. Biometric Processor — FastAPI sidecar in Python. Owns the ML stack (face mesh, embedding extraction, active-liveness puzzle scoring). Talks to the API over a private Docker network — never exposed publicly.
  3. Clients — React web app + Kotlin Multiplatform mobile + a Desktop / Admin client. Each one is a thin shell over the API; the embeddable widget is the smallest possible surface a tenant integrates against.

The core insight that shaped almost every decision: biometric data must never sit unencrypted at rest, and the embedding extraction process must never be reachable from the internet. Everything else fell out of that.

Incident 1 — The day a test wiped a real user

In the early schema, users was hard-deletable. The cascade graph (which we had not fully drawn) reached 13 tables including webauthn_credentials, nfc_keys, totp_secrets, and the biometric reference table. A test cleanup script, harmless in isolation, ran against a row that turned out to be a production account. The cascade did exactly what we told it to.

What was lost, in seconds: TOTP, WebAuthn passkey, NFC binding, biometric reference. What was not lost: the audit log, because audit lives in a separate cascade-isolated table on purpose.

The fix was small and obvious in retrospect:

  • Move users to soft-delete: add a deleted_at column and require deleted_at IS NULL everywhere we used to filter on existence.
  • Patch every findByEmail / findById to add the soft-delete predicate.
  • Guardrails around the schema: a CI check that greps the codebase for DELETE FROM users.
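The first two bullets can be sketched in a few lines. This is a minimal illustration rather than the platform's actual code: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table layout and the function names (create_schema, soft_delete_user, find_by_email) are assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timezone


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical, stripped-down users table; deleted_at is NULL for live rows.
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email TEXT NOT NULL,"
        " deleted_at TEXT)"
    )


def soft_delete_user(conn: sqlite3.Connection, user_id: int) -> None:
    # Mark the row as deleted instead of removing it, so no FK cascade fires.
    conn.execute(
        "UPDATE users SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (datetime.now(timezone.utc).isoformat(), user_id),
    )


def find_by_email(conn: sqlite3.Connection, email: str):
    # Every lookup carries the soft-delete predicate.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ? AND deleted_at IS NULL",
        (email,),
    ).fetchone()
```

After soft_delete_user runs, find_by_email returns None for that account while the row, and everything that references it, stays in place.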

The lesson we took: draw the FK graph first, before the first migration. A 13-table cascade is not a bug — it is an architecture decision you made without realizing it.
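One way to "draw the graph" before it bites is to compute what a single delete can reach. The sketch below assumes the ON DELETE CASCADE edges have already been collected as (parent_table, child_table) pairs; in PostgreSQL they could be read from information_schema.referential_constraints, where delete_rule is 'CASCADE'. The function name cascade_reach and the edge format are assumptions for illustration.

```python
from collections import deque
from typing import Iterable, Set, Tuple


def cascade_reach(edges: Iterable[Tuple[str, str]], root: str) -> Set[str]:
    """Return every table that a delete on `root` can cascade into.

    `edges` holds (parent_table, child_table) pairs, one per
    ON DELETE CASCADE foreign key.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    reached: Set[str] = set()
    queue = deque([root])
    while queue:  # breadth-first walk over the cascade edges
        table = queue.popleft()
        for child in children.get(table, ()):
            if child != root and child not in reached:
                reached.add(child)
                queue.append(child)
    return reached
```

Running this in CI and failing the build when the reach of users grows unexpectedly would turn the cascade from an accident into a reviewed decision.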

Incident 2 — Embedding encryption, the third time

The first version stored face embeddings as raw float[] columns. The second version base64-encoded them. Neither is encryption. The third version finally did it right: every embedding is encrypted with Fernet (AES-128-CBC + HMAC-SHA256) using a per-environment key, stored encrypted in pgvector, and decrypted in-process only at the moment the cosine similarity calculation runs.
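The "decrypt only at comparison time" idea can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: the decrypt argument stands in for something like Fernet(key).decrypt from the cryptography package, and the little-endian float32 serialization and all function names are choices made for the example.

```python
import math
import struct
from typing import Callable, List, Sequence


def pack_embedding(vec: Sequence[float]) -> bytes:
    # Assumed wire format: little-endian float32 values, back to back.
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob: bytes) -> List[float]:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def similarity_to_stored(
    ciphertext: bytes,
    probe: Sequence[float],
    decrypt: Callable[[bytes], bytes],
) -> float:
    # The plaintext reference exists only inside this call.
    reference = unpack_embedding(decrypt(ciphertext))
    if len(reference) != len(probe):
        raise ValueError("embedding dimensions do not match")
    return cosine_similarity(reference, probe)
```

Keeping the plaintext scoped to one function call makes it easier to audit where decrypted embeddings can live in memory.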

The migration to v3 had to be done online, against production data, so the rollout was:

  1. Add the encrypted column alongside the plaintext one.
  2. Dual-write for one release.
  3. Backfill existing rows in batches off-hours.
  4. Switch reads to the encrypted column.
  5. Drop the plaintext column in a follow-up migration.
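Step 3 of the rollout above might look roughly like this. It is a sketch under assumptions: sqlite3 stands in for PostgreSQL, the face_embeddings table and its embedding / embedding_enc columns are hypothetical names, and encrypt is whatever callable the deployment uses (for example Fernet(key).encrypt).

```python
import sqlite3
from typing import Callable


def backfill_batch(
    conn: sqlite3.Connection,
    encrypt: Callable[[bytes], bytes],
    batch_size: int = 500,
) -> int:
    """Encrypt up to batch_size rows that still lack ciphertext; return the count."""
    rows = conn.execute(
        "SELECT id, embedding FROM face_embeddings"
        " WHERE embedding_enc IS NULL AND embedding IS NOT NULL"
        " ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, plaintext in rows:
        # The extra IS NULL guard keeps a concurrent dual-write from being overwritten.
        conn.execute(
            "UPDATE face_embeddings SET embedding_enc = ?"
            " WHERE id = ? AND embedding_enc IS NULL",
            (encrypt(plaintext), row_id),
        )
    conn.commit()  # one short transaction per batch
    return len(rows)


def backfill_all(
    conn: sqlite3.Connection,
    encrypt: Callable[[bytes], bytes],
    batch_size: int = 500,
) -> int:
    total = 0
    while True:
        done = backfill_batch(conn, encrypt, batch_size)
        if done == 0:
            return total
        total += done
```

Small batches with a commit after each keep lock times short and make the job safe to stop and resume.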

The bit that almost broke it: the operator must set FIVUCSAS_EMBEDDING_KEY at boot — the application fails fast if the key is missing, which is intentional. Letting it default to a random key would silently invalidate every existing embedding.
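A fail-fast check of this kind can be very small. The sketch below is an assumption about how such a check could look, not the platform's actual startup code: only the variable name FIVUCSAS_EMBEDDING_KEY comes from the text, while load_embedding_key and MissingEmbeddingKeyError are hypothetical. It relies on the fact that a Fernet key is 32 bytes encoded as URL-safe base64.

```python
import base64
import binascii
import os
from typing import Mapping, Optional

ENV_VAR = "FIVUCSAS_EMBEDDING_KEY"


class MissingEmbeddingKeyError(RuntimeError):
    """Raised at boot when the embedding key is absent or malformed."""


def load_embedding_key(env: Optional[Mapping[str, str]] = None) -> bytes:
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        # Deliberately no fallback: a generated key would make every
        # existing ciphertext undecryptable without any visible error.
        raise MissingEmbeddingKeyError(f"{ENV_VAR} is not set; refusing to start")
    try:
        decoded = base64.urlsafe_b64decode(raw)
    except (binascii.Error, ValueError) as exc:
        raise MissingEmbeddingKeyError(f"{ENV_VAR} is not valid base64") from exc
    if len(decoded) != 32:
        raise MissingEmbeddingKeyError(f"{ENV_VAR} must decode to exactly 32 bytes")
    return raw.encode("ascii")
```

Calling this once during startup, before any request is served, means a misconfigured deploy dies immediately instead of writing embeddings nobody can read later.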

The lesson: for irreversible data transformations, fail-fast on configuration is better than fail-soft on behavior.

Incident 3 — Refresh tokens, family revocation, and the rollback

Refresh-token rotation is one of those things that looks simple in a sequence diagram and breaks in seven non-obvious ways at scale. The version I shipped first was correct in isolation but wrong under concurrent requests: a mobile client briefly online over flaky 3G could submit a refresh, get a new pair, lose the response, and retry — and my server would treat the second submission as token reuse and revoke the entire family.

The user-visible symptom: a single failed network round-trip would log them out across all devices. From the inside it looked correct — a “stolen token” detection — but from the outside it was a heisenbug that mostly hit mobile users.

The fix was a hashing-based family-revoke design (V55 in the migration log):

  • Store hashed refresh tokens, not plaintext.
  • A short reuse-grace window (a few seconds) where the same client can resubmit without triggering family-revoke.
  • Distinguish “different token in same family used after rotation” (real attack signal: revoke) from “same token submitted twice within the grace window” (network retry: accept).
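The three bullets above can be modelled in memory to make the decision rule concrete. This is a simplified sketch, not the V55 design itself: the class name RefreshTokenStore, the 5-second default, the in-memory dictionaries, and the choice to mint a fresh token on an in-window retry are all assumptions for illustration.

```python
import hashlib
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class TokenRecord:
    family_id: str
    rotated_at: Optional[float] = None  # None while the token is still current


class RefreshTokenStore:
    def __init__(self, grace_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._records: Dict[str, TokenRecord] = {}  # keyed by SHA-256 digest
        self._revoked: Set[str] = set()
        self._grace = grace_seconds
        self._clock = clock

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def _mint(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._records[self._digest(token)] = TokenRecord(family_id)
        return token  # only the digest is kept server-side

    def issue(self) -> str:
        """Start a new token family, e.g. at login."""
        return self._mint(secrets.token_hex(8))

    def refresh(self, token: str) -> Optional[str]:
        """Return a new refresh token, or None if the request is rejected."""
        record = self._records.get(self._digest(token))
        if record is None or record.family_id in self._revoked:
            return None
        now = self._clock()
        if record.rotated_at is None:
            record.rotated_at = now  # normal rotation
            return self._mint(record.family_id)
        if now - record.rotated_at <= self._grace:
            return self._mint(record.family_id)  # treated as a network retry
        self._revoked.add(record.family_id)  # stale reuse: revoke the family
        return None
```

A production version would also need persistence and atomic updates so that two concurrent refreshes of the same token cannot both take the "normal rotation" branch.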

The lesson: the network is not a sequence diagram. Every protocol that distinguishes “legitimate retry” from “attack” needs to model retries explicitly, not as an afterthought.

What the architecture is good at

  • Clean separation of concerns. The ML sidecar can be replaced or upgraded without touching the API; the API can move identity providers without touching the ML stack.
  • Schema-driven multi-tenancy. Tenant isolation is a database-level concern, not an application-level filter. There is no WHERE tenant_id = ? clause that someone can forget to add.
  • Operational primitives by default. Loki + Promtail + Grafana for logs and metrics, pgBackRest for WAL archiving, fail-loud backups with restore-verify, gitleaks CI on every commit.

What we would do differently

  • Draw the FK graph on day zero. Soft-delete by default; add hard-delete only where there is a real legal requirement.
  • Treat the ML sidecar as a separate deployable from the start. Bundling it in early was a shortcut; splitting it later was three weeks of yak-shaving.
  • Pick one MFA factor as the canonical primary. WebAuthn is the right answer; everything else (TOTP, NFC, biometric) is a fallback.

Reading list

  • Designing Data-Intensive Applications, ch. 5 (Replication) and ch. 9 (Consistency and Consensus).
  • The WebAuthn Level 3 spec — particularly user verification and client-extension handling.
  • The Fernet specification + the cryptography.io implementation notes on key rotation.

The full source is private until a third-party security review completes. Source access is available on request — email me at ahmetabdullahgultekin@gmail.com.

  • architecture
  • biometric
  • postgres
  • webauthn