FIVUCSAS — Face and Identity Verification Using Cloud-based SaaS — started as my Marmara senior engineering project and now ships under RollingCat Software, the umbrella name I publish some of my work under. It is a production multi-tenant biometric authentication platform: Spring Boot 3 / Java 21 on PostgreSQL 16 + pgvector, a FastAPI ML sidecar, a React 18 web app, and Kotlin Multiplatform desktop and mobile clients. WebAuthn-first, KVKK-compliant, self-hosted on a Hetzner CX43 box behind Traefik, observed with Loki + Promtail + Grafana, backed up with pgBackRest WAL archiving for PITR. The repo lives at github.com/Rollingcat-Software/FIVUCSAS, and the embeddable widget at verify.fivucsas.com.
This piece is not a feature tour. It is what I would tell another engineer about to build something at this scope: the three production incidents the team learned the most from, written up the way I wish someone had written them up for us before we started.
The shape of the system
The platform splits into three runtime axes:
- Identity Core API — Spring Boot 3 / Java 21, the authoritative source of truth for tenants, users, sessions, audit logs, MFA factors (TOTP, WebAuthn, NFC, biometric).
- Biometric Processor — FastAPI sidecar in Python. Owns the ML stack (face mesh, embedding extraction, active-liveness puzzle scoring). Talks to the API over a private Docker network — never exposed publicly.
- Clients — React web app + Kotlin Multiplatform mobile + a Desktop / Admin client. Each one is a thin shell over the API; the embeddable widget is the smallest possible surface a tenant integrates against.
The core insight that shaped almost every decision: biometric data must never sit unencrypted at rest, and the embedding extraction process must never be reachable from the internet. Everything else fell out of that.
Incident 1 — The day a test wiped a real user
In the early schema, rows in users could be hard-deleted. The cascade graph (which we had not fully drawn) reached 13 tables, including webauthn_credentials, nfc_keys, totp_secrets, and the biometric reference table. A test cleanup script, harmless in isolation, ran against a row that turned out to be a production account. The cascade did exactly what we told it to.
What was lost, in seconds: TOTP, WebAuthn passkey, NFC binding, biometric reference. What was not lost: the audit log, because audit lives in a separate cascade-isolated table on purpose.
The fix was small and obvious in retrospect:
- Move users to soft-delete (deleted_at IS NULL everywhere we used to filter on existence).
- Patch every findByEmail / findById to add the soft-delete predicate.
- Schema-level guardrails: a CI check that greps for DELETE FROM users in the codebase.
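As a rough illustration of those three steps, here is a minimal sketch using Python's built-in sqlite3 module rather than the real Spring Boot / PostgreSQL stack. The table layout, the function names, and the regex are assumptions made for the example; only the deleted_at column, the findByEmail idea, and the DELETE FROM users check come from the list above.

```python
import re
import sqlite3
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def make_db() -> sqlite3.Connection:
    """Tiny stand-in schema: one parent table and one cascading child."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            email TEXT NOT NULL,
            deleted_at TEXT              -- NULL means the account is live
        );
        CREATE TABLE totp_secrets (
            user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
            secret TEXT NOT NULL
        );
        """
    )
    return conn


def soft_delete_user(conn: sqlite3.Connection, user_id: int) -> bool:
    """Mark the user as deleted; no row is removed, so no cascade fires."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "UPDATE users SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, user_id),
    )
    return cur.rowcount == 1


def find_by_email(conn: sqlite3.Connection, email: str) -> Optional[Tuple[int, str]]:
    """Lookup with the soft-delete predicate baked in."""
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ? AND deleted_at IS NULL",
        (email,),
    ).fetchone()


# CI guardrail: flag any line that hard-deletes from users.
HARD_DELETE = re.compile(r"\bDELETE\s+FROM\s+users\b", re.IGNORECASE)


def find_hard_deletes(source: str) -> List[int]:
    """Return the 1-based line numbers that contain a hard delete on users."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if HARD_DELETE.search(line)
    ]
```

The point of the sketch is that soft_delete_user never issues a DELETE, so the child rows (TOTP secrets here) survive even though the foreign key is still declared ON DELETE CASCADE.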
The lesson we took: draw the FK graph first, before the first migration. A 13-table cascade is not a bug — it is an architecture decision you made without realizing it.
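One way to "draw the graph" before it surprises you is to compute which tables a delete on a given root can reach through cascading foreign keys. The sketch below is a plain breadth-first traversal over an edge list. The edge list is illustrative: the three child tables named earlier come from this post, while the entries marked hypothetical are made up. In PostgreSQL the real edges could presumably be read from the pg_constraint catalog (foreign keys whose delete action is cascade), but that query is not shown here.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

# (child_table, parent_table) pairs for foreign keys declared ON DELETE CASCADE.
CASCADE_FKS: List[Tuple[str, str]] = [
    ("webauthn_credentials", "users"),
    ("nfc_keys", "users"),
    ("totp_secrets", "users"),
    ("sessions", "users"),              # hypothetical
    ("session_events", "sessions"),     # hypothetical, reached transitively
    ("tenant_settings", "tenants"),     # hypothetical, not reachable from users
]


def cascade_reach(root: str, fks: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return every table whose rows can be deleted by deleting a row in root."""
    children: Dict[str, List[str]] = {}
    for child, parent in fks:
        children.setdefault(parent, []).append(child)

    reached: Set[str] = set()
    queue = deque([root])
    while queue:
        table = queue.popleft()
        for child in children.get(table, []):
            # Skip self-references and tables already visited.
            if child != root and child not in reached:
                reached.add(child)
                queue.append(child)
    return reached


if __name__ == "__main__":
    for name in sorted(cascade_reach("users", CASCADE_FKS)):
        print(name)
```

Running a check like this in CI and failing when the reach of a sensitive table grows would turn the cascade from an accident into a reviewed decision.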
Incident 2 — Embedding encryption, the third time
The first version stored face embeddings as raw float[] columns. The second version base64-encoded them. Neither is encryption. The third version finally did it right: every embedding is encrypted with Fernet (AES-128-CBC + HMAC-SHA256) using a per-environment key, stored encrypted in pgvector, and decrypted in-process only at the moment the cosine similarity calculation runs.
The migration to v3 had to be done online, against production data, so the rollout was:
- Add the encrypted column alongside the plaintext one.
- Dual-write for one release.
- Backfill existing rows in batches off-hours.
- Switch reads to the encrypted column.
- Drop the plaintext column in a follow-up migration.
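The backfill step is the one most likely to hurt a live database, so here is a minimal sketch of doing it in small transactions. It uses sqlite3 as a stand-in for PostgreSQL, and the table name, column names, and batch size are all assumptions for the example. The encrypt argument is any bytes-to-bytes callable; in the real system that would be the Fernet encryption described above.

```python
import sqlite3
from typing import Callable


def backfill_encrypted(
    conn: sqlite3.Connection,
    encrypt: Callable[[bytes], bytes],
    batch_size: int = 500,
) -> int:
    """Fill embedding_enc for rows that only have plaintext; return rows updated."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, embedding_plain FROM face_embeddings "
            "WHERE embedding_enc IS NULL AND embedding_plain IS NOT NULL "
            "ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total

        updates = []
        for row_id, plain in rows:
            token = encrypt(plain)
            if not token:
                # An empty result would leave the row NULL and loop forever.
                raise ValueError(f"encrypt returned nothing for row {row_id}")
            updates.append((token, row_id))

        # One short transaction per batch keeps locks brief. The extra
        # IS NULL guard avoids overwriting a row that dual-write filled
        # between the SELECT and the UPDATE.
        with conn:
            conn.executemany(
                "UPDATE face_embeddings SET embedding_enc = ? "
                "WHERE id = ? AND embedding_enc IS NULL",
                updates,
            )
        total += len(updates)
```

Because the loop only selects rows where the encrypted column is still NULL, it can be stopped and restarted at any point, which is what makes an off-hours batch run safe to interrupt.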
The bit that almost broke it: the operator must set FIVUCSAS_EMBEDDING_KEY at boot. The application fails fast if the key is missing, which is intentional. Letting it default to a random key would silently invalidate every existing embedding, because nothing encrypted under the old key could be decrypted again.
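A minimal sketch of that fail-fast check, assuming the key is read from the process environment in the Python sidecar (the post does not say which service loads it). The exception class and function name are made up for the example. The validation relies on the Fernet key format: 32 bytes, URL-safe base64-encoded.

```python
import base64
import binascii
import os
from typing import Mapping, Optional

ENV_VAR = "FIVUCSAS_EMBEDDING_KEY"


class EmbeddingKeyError(RuntimeError):
    """Raised at startup when the embedding key is missing or malformed."""


def load_embedding_key(env: Optional[Mapping[str, str]] = None) -> bytes:
    """Return the key bytes, or raise. Never generate a fallback key."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        raise EmbeddingKeyError(f"{ENV_VAR} is not set; refusing to start")

    try:
        decoded = base64.urlsafe_b64decode(raw.encode("ascii"))
    except (binascii.Error, ValueError) as exc:
        raise EmbeddingKeyError(f"{ENV_VAR} is not valid URL-safe base64") from exc

    # A Fernet key decodes to exactly 32 bytes (signing key + encryption key).
    if len(decoded) != 32:
        raise EmbeddingKeyError(
            f"{ENV_VAR} must decode to 32 bytes, got {len(decoded)}"
        )
    return raw.encode("ascii")
```

Calling load_embedding_key() once during startup, before any request is served, means a misconfigured deployment dies immediately instead of writing embeddings nobody can read later.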
The lesson: for irreversible data transformations, fail-fast on configuration is better than fail-soft on behavior.
Incident 3 — Refresh tokens, family revocation, and the rollback
Refresh-token rotation is one of those things that looks simple in a sequence diagram and breaks in seven non-obvious ways at scale. The version I shipped first was correct in isolation but wrong under retries: a mobile client briefly online over flaky 3G could submit a refresh, have the server issue a new pair, lose the response, and retry. My server would then treat the second submission as token reuse and revoke the entire family.
The user-visible symptom: a single failed network round-trip would log them out across all devices. From the inside it looked correct — a “stolen token” detection — but from the outside it was a heisenbug that mostly hit mobile users.
The fix was a hashing-based family-revoke design (V55 in the migration log):
- Store hashed refresh tokens, not plaintext.
- A short reuse-grace window (a few seconds) where the same client can resubmit without triggering family-revoke.
- Distinguish “different token in same family used after rotation” (real attack signal — revoke) from “same token submitted twice in 5 seconds” (network retry — accept).
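The three rules above can be sketched as a small in-memory service. This is one interpretation, not the V55 implementation: the class and method names are hypothetical, the store is a dict instead of a database table, and the retry rule used here is "an already-rotated token is accepted again only inside the grace window and only if its successor has never been used". Anything else counts as reuse and revokes the family. Time is passed in explicitly so the behaviour is easy to test.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Dict, Optional, Set


class TokenError(Exception):
    """Base class for refresh failures (also raised for unknown tokens)."""


class ReuseDetected(TokenError):
    """A rotated token came back outside the retry rules; family is revoked."""


class FamilyRevoked(TokenError):
    """The token belongs to a family that was revoked earlier."""


def _digest(token: str) -> str:
    # Only this hash is stored; the plaintext token exists only in the response.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


@dataclass
class _Record:
    family: str
    rotated_at: Optional[float] = None  # None while the token is still current
    successor: Optional[str] = None     # digest of the token that replaced it


class RefreshTokenService:
    def __init__(self, grace_seconds: float = 5.0) -> None:
        self.grace_seconds = grace_seconds
        self._records: Dict[str, _Record] = {}
        self._revoked: Set[str] = set()

    def _mint(self, family: str) -> str:
        token = secrets.token_urlsafe(32)
        self._records[_digest(token)] = _Record(family=family)
        return token

    def issue_family(self) -> str:
        """Start a new family (for example at login) and return its first token."""
        return self._mint(secrets.token_hex(8))

    def refresh(self, token: str, now: float) -> str:
        record = self._records.get(_digest(token))
        if record is None:
            raise TokenError("unknown refresh token")
        if record.family in self._revoked:
            raise FamilyRevoked(record.family)

        if record.rotated_at is None:
            # Normal rotation: current token in, new token out.
            new_token = self._mint(record.family)
            record.rotated_at = now
            record.successor = _digest(new_token)
            return new_token

        successor = self._records.get(record.successor or "")
        in_grace = (now - record.rotated_at) <= self.grace_seconds
        if in_grace and successor is not None and successor.rotated_at is None:
            # Likely a lost response: drop the undelivered successor and
            # hand out a replacement. The grace window is not extended.
            del self._records[record.successor]
            new_token = self._mint(record.family)
            record.successor = _digest(new_token)
            return new_token

        # Rotated token replayed late, or after its successor was used.
        self._revoked.add(record.family)
        raise ReuseDetected(record.family)
```

Since only hashes are stored, the server cannot resend the exact token it issued the first time, so the sketch replaces the undelivered successor instead. A production version would also need the lookup and rotation to happen atomically in the database, which this single-threaded example does not address.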
The lesson: the network is not a sequence diagram. Every protocol that distinguishes “legitimate retry” from “attack” needs to model retries explicitly, not as an afterthought.
What the architecture is good at
- Clean separation of concerns. The ML sidecar can be replaced or upgraded without touching the API; the API can move identity providers without touching the ML stack.
- Schema-driven multi-tenancy. Tenant isolation is a database-level concern, not an application-level filter. There is no WHERE tenant_id = ? clause that someone can forget to add.
- Operational primitives by default. Loki + Promtail + Grafana for logs and metrics, pgBackRest for WAL archiving, fail-loud backups with restore-verify, gitleaks CI on every commit.
What we would do differently
- Draw the FK graph on day zero. Soft-delete by default; add hard-delete only where there is a real legal requirement.
- Treat the ML sidecar as a separate deployable from the start. Bundling it in early was a shortcut; splitting it later was three weeks of yak-shaving.
- Pick one MFA factor as the canonical primary. WebAuthn is the right answer; everything else (TOTP, NFC, biometric) is a fallback.
Reading list
- Designing Data-Intensive Applications, ch. 5 (Replication) and ch. 9 (Consistency and Consensus).
- The WebAuthn Level 3 spec — particularly user verification and client-extension handling.
- The Fernet specification + the cryptography.io implementation notes on key rotation.
The full source is private until a third-party security review completes. Source access is available on request — email me at ahmetabdullahgultekin@gmail.com.
- architecture
- biometric
- postgres
- webauthn