moonlock/DECISIONS.md

# Decisions

Architectural and design decisions for Moonlock, in reverse chronological order.

## 2026-06-17 – Restore hardened release profile after the crash hunt (v0.6.18)

- **Who**: ClaudeCode, Dom
- **Why**: A security audit found the `[profile.release]` in `Cargo.toml` had silently drifted from the hardened profile decided on 2026-04-24 (`lto = "fat"`, `strip = true`). Git blame traced the drift to v0.6.14 (commit `85cf039`, "refactor: power-confirm via PowerAction table"): `lto` was reverted `fat`→`thin`, `strip` flipped `true`→`false`, and `debug = true` was added — all bundled into an unrelated refactor commit, with no commit-message mention and no entry here. The pattern (no strip + debug symbols + faster thin LTO) was a debug aid for the suspend/resume SIGSEGV hunt that ran v0.6.9–v0.6.17, giving symbolized coredump backtraces. That crash is fixed as of v0.6.17, so the debug profile has outlived its purpose, while shipping debug symbols on a security-critical auth binary eases reverse-engineering of the auth path and bloats the binary.
- **Tradeoffs**: Restoring `lto = "fat"` roughly doubles release build time (~30 s → ~60 s) for better cross-crate inlining; acceptable for a binary compiled once per release. Dropping `strip = false` + `debug = true` means field coredumps are no longer symbolized out of the box — a deliberate trade now that the crash is resolved; debug symbols can be re-enabled temporarily if a new crash needs hunting. Not chosen: keeping `lto = "fat"` but retaining the debug symbols — rejected because the symbols' only justification (the crash hunt) is gone.
- **How**: `Cargo.toml` `[profile.release]` restored to `lto = "fat"`, `strip = true`, with the `debug = true` line removed; `codegen-units = 1` unchanged. Verified via `file target/release/moonlock` reporting a stripped binary.

## 2026-06-02 – Real fix for the unlock SIGSEGV: quit in ::unlocked, never destroy windows ourselves (v0.6.17)

- **Who**: ClaudeCode, Dom
- **Why**: The v0.6.15 "stale per-monitor window" theory was a **misdiagnosis**. A clean manual `target/release/moonlock` lock→unlock (single monitor, no suspend, no monitor removal) still crashed, preceded by `Gdk-CRITICAL: gdk_surface_get_display: assertion 'GDK_IS_SURFACE (surface)' failed`. The v0.6.16 backtrace's top app frame is the `unlock_callback` closure, which ran `lock.unlock(); app.quit();`. Per the gtk4-session-lock header (`assign_window_to_monitor`: "the window will be unmapped and `gtk_window_destroy()` called on it when the current lock ends") the **library destroys the lock windows itself** on unlock. `app.quit()` then destroyed the same windows again — surface already gone → SIGSEGV. A double-destroy in the teardown sequence, entirely independent of monitors.
- **Tradeoffs**: Reverted the v0.6.15/v0.6.16 monitor-pruning + `LockscreenHandles.monitor` field: it targeted the wrong cause, never fired in testing, and manually called `app.remove_window()` on library-managed lock windows — exactly what the upstream example warns against ("does NOT manually destroy or close the lock windows"). Monitor-removal-during-lock is left entirely to the library (which already auto-unmaps+dereferences such windows). Three attempts total: idle_add deferral (wrong: reentrancy theory), monitor-pruning (wrong: misdiagnosis), and this one — verified against both the C header and the upstream `examples/simple.rs`.
- **How**: `unlock_callback` now calls only `lock.unlock()`. A new `lock.connect_unlocked(|_| app.quit())` quits the app only after the library finishes its teardown and fires `::unlocked`. Mirrors the canonical gtk4-session-lock usage exactly.

## 2026-06-02 – Prune per-monitor windows on monitor removal to fix resume-unlock SIGSEGV (v0.6.15)

- **Who**: ClaudeCode, Dom
- **Why**: Chronic SIGSEGV (multiple coredumps/day) on the unlock following a suspend/resume. Backtrace: `app.quit()` → `gtk_window_destroy` → gtk4-layer-shell → `g_signal_emit` → NULL deref (`0x278`, rax=0). Root cause: `connect_monitor` is add-only — it creates one window per monitor and pushes to `all_handles`, but nothing ever removes a window when a monitor powers off on suspend. gtk4-session-lock unmaps + drops *its* ref to that window on monitor removal (per `gtk4-session-lock` 0.4 docs), but the GtkApplication (`ApplicationWindow::builder().application(app)`) and `all_handles` still hold refs. The orphaned window survives until unlock, where destroying it dereferences its now-NULL monitor association. A diagnostic confirmed the windows are GTK-valid but accumulate (3 windows for 1-2 monitors) with a NULL associated object.
- **Tradeoffs**: An earlier attempt deferred `unlock()`/`quit()` via `glib::idle_add_local_once` on a reentrancy theory — proven wrong by the logs (crash happens *inside* the idle trampoline). Reverted. The library doc explicitly says monitor removal is detected via "GTK APIs"; we follow that rather than fighting the library. We release our refs (do **not** call `destroy` — the lib already unmapped+dereffed the window). Diagnostic UNLOCK logging is kept for now, to be removed once a suspend/resume validation cycle confirms the fix.
- **How**: `LockscreenHandles` gains `monitor: Option<gdk::Monitor>`, set in the `connect_monitor` handler. `activate_with_session_lock` watches `display.monitors()` via `connect_items_changed`; on any change it retains only handles whose monitor `is_valid()`, calling `app.remove_window()` on the pruned ones to drop the application's ref. With both refs released, the orphaned window is gone before unlock.

## 2026-06-02 – Align power-confirm to moonset's ActionDef pattern (v0.6.14)

- **Who**: ClaudeCode, Dom
- **Why**: A code review of moongreet's power-confirm (ported from this file) flagged the shared pattern as lower-altitude than moonset's: two near-identical reboot/shutdown handlers and a `show_power_confirm` taking loose `message`/`action_fn`/`error_message` params that can drift apart. moonset already solved this with an `ActionDef` table + button factory. Changed here in lockstep with moongreet to keep the three projects symmetric.
- **Tradeoffs**: A `PowerAction` struct + `power_actions()` table + `create_power_button` factory is slightly more machinery for two actions, but couples icon/prompt/error/action into one value (a mismatched prompt/action becomes unrepresentable) and makes a third action a one-line table entry. Did NOT touch `confirm_box: Rc<RefCell<Option<gtk::Box>>>` — moonset uses the same, it is the shared convention.
- **How**: Replaced the two hand-wired handlers with a loop over `power_actions()`; `show_power_confirm`/`execute_power_action` now take `PowerAction` (Copy) instead of three loose strings. Added an in-flight re-trigger guard via `power_box.set_sensitive(false)` (re-enabled on failure), matching moonset; also clear a stale `error_label` when showing a new prompt.

## 2026-05-04 – Drop PAM account/session stack, remove `check_account`, drop `pam_acct_mgmt` from password path (v0.6.13)

- **Who**: ClaudeCode, Dom
- **Why**: 20 SIGSEGV coredumps in 6 days. All crashes preceded by `pam_unix(moonlock:account): setuid failed: Operation not permitted`. The previous PAM config (`auth/account/session include system-auth`) pulled `pam_unix(account)`, which needs setuid root for `/etc/shadow`. moonlock runs as the user, so `pam_acct_mgmt` always failed. The failure cascaded into the FP-resume path (`gio::spawn_blocking(check_account) → false → 2s timeout → resume_async`) where a use-after-free during `gtk_window_destroy` killed the process. Each crash left a dead `ext-session-lock-v1` client and the compositor stuck on its red fallback backdrop until manual recovery. Initial fix on 2026-04-30 dropped the FP-side `check_account` and the account/session lines from the PAM config, but left the `pam_acct_mgmt` call in `auth.rs::authenticate()` for the password path. Result: a PAM stack with no `account` module fell back to `/etc/pam.d/other` (`pam_deny`) and rejected every password — Dom got locked out on 2026-05-04.
- **Tradeoffs**: Aligned with the swaylock/gtklock pattern (only `auth include login`). Lost: PAM-driven account expiry/lockout in both paths. Acceptable because (a) FP attempts are still bounded by `MAX_FP_ATTEMPTS` in `fingerprint.rs`, (b) password-path lockout still works through `pam_faillock.so preauth/authfail/authsucc` in the inherited `auth` stack, and (c) account validity was already verified by the login manager when the session was opened — a lockscreen unlocks an existing session, it does not gate access to a new one. Not chosen: a custom `account` stack with `pam_faillock.so` only — would have kept PAM-level FP lockout but adds a non-standard config that other lockers do not use, and `pam_faillock` standalone in `account` is rarely tested in the wild.
- **How**: (1) `config/moonlock-pam` reduced to a single `auth include login` line. (2) `auth.rs::check_account()` and its two unit tests removed. (3) `auth.rs::authenticate()` no longer calls `pam_acct_mgmt`; `pam_authenticate` result is returned directly. The `pam_acct_mgmt` extern declaration is removed. (4) `lockscreen.rs::start_fingerprint` simplified — the `gio::spawn_blocking(check_account)` async block, the failure-side error UI, and the 2-second `resume_async` retry path all removed; on FP success the closure now goes `label.set_text(success); fp.stop(); cb()`. (5) `fingerprint.rs::resume_async()` removed. (6) `CLAUDE.md` architecture and security sections updated to describe the new PAM stack.

## 2026-04-24 – Audit LOW fixes: docs, rustdoc, check_account scope, debug gating, lto fat (v0.6.12)

- **Who**: ClaudeCode, Dom
- **Why**: Six LOW findings cleared in a single pass. (1) Docs referenced the old `[0,100]` blur range; code clamps `[0,200]` since v0.6.8. (2) The `MAX_BLUR_DIMENSION` doc comment was split by a `// SYNC:` block, producing a truncated sentence in rustdoc. (3) `check_account` was `pub` and relied on callers only ever passing `getuid()`-derived usernames; the contract was not enforced by the type system. (4) `MOONLOCK_DEBUG` env var flipped log verbosity to Debug in release builds, letting a compromised session script escalate journal noise about fprintd / D-Bus. (5) `pam_setcred` absence was undocumented. (6) `[profile.release]` used `lto = "thin"` — fine for most crates, but for a latency-critical auth binary compiled once per release, fat LTO's extra cross-crate inlining is worth the ~1 min build hit.
- **Tradeoffs**: `lto = "fat"` roughly doubles release build time (~30 s → ~60 s) for slightly better inlining across PAM FFI wrappers and the i18n/status paths. `#[cfg(debug_assertions)]` on the debug-level selector means you have to run a debug build to raise log level — inconvenient for live troubleshooting, but aligned with the security-first posture.
- **How**: (1) `CLAUDE.md` + `README.md` updated to `[0,200]`. (2) `// SYNC:` block moved above the `///` doc so rustdoc renders one coherent paragraph. (3) `check_account` visibility narrowed to `pub(crate)` with a `Precondition` paragraph explaining the username contract. (4) Debug-level selection wrapped in `#[cfg(debug_assertions)]`; release builds always run at `LevelFilter::Info`. (5) Added a comment block in `authenticate()` documenting why `pam_setcred` is deliberately absent and where it would hook in if needed. (6) `lto = "fat"` in `Cargo.toml`.

## 2026-04-24 – Audit MEDIUM fixes: D-Bus cleanup race, TOCTOU open, FP reset, GTK entry clear (v0.6.11)

- **Who**: ClaudeCode, Dom
- **Why**: Second round after the HIGH fixes, addressing the four MEDIUM findings. (1) `cleanup_dbus` spawned VerifyStop + Release as fire-and-forget, then `resume_async` called Claim after only a 2 s timeout — shorter than the 3 s D-Bus timeout, so on a slow bus the Claim could race the Release and fprintd would reject it, leaving the FP listener permanently dead. (2) `load_background_texture` relied on the caller's `symlink_metadata` check, re-opening the path via `gdk::Texture::from_file` — a classic TOCTOU window. (3) `resume_async` unconditionally reset `failed_attempts`, allowing an attacker with sensor control to evade the 10-attempt cap by cycling verify-match → `check_account` fail → resume. (4) The GTK `PasswordEntry` buffer was only cleared on timeout or auth failure, leaving the password in GLib malloc'd memory longer than necessary.
- **Tradeoffs**: The D-Bus cleanup is now split into a synchronous helper (`take_cleanup_proxy` — signal disconnect + flag clear) and an async helper (`perform_dbus_cleanup` — VerifyStop + Release), so `resume_async` can await the release while `stop()` stays fire-and-forget. Dropping the `failed_attempts` reset means a flaky sensor could reach 10 failures faster, but the correct remedy is a new lock session (construction) rather than a reset that also helps attackers.
- **How**: (1) Split `cleanup_dbus` into `take_cleanup_proxy()` (sync) + `perform_dbus_cleanup(proxy)` (async). `resume_async` now awaits `perform_dbus_cleanup` before `begin_verification`. `stop()` still spawns the cleanup fire-and-forget. (2) `load_background_texture` opens with `O_NOFOLLOW` via `std::fs::OpenOptions::custom_flags`, reads to bytes, and builds the texture via `gdk::Texture::from_bytes`. (3) Removed `listener.borrow_mut().failed_attempts = 0` from `resume_async`. (4) `password_entry.set_text("")` now fires right after the `Zeroizing::new(entry.text().to_string())` extraction, shortening the GTK-side window.

## 2026-04-24 – Audit fixes: RefCell borrow across await, async avatar decode

- **Who**: ClaudeCode, Dom
- **Why**: Triple audit found two HIGH issues. (1) `init_fingerprint_async` held a `RefCell` immutable borrow across `is_available_async().await` — a concurrent `connect_monitor` signal (hotplug / suspend-resume) invoking `borrow_mut()` during the await would panic. (2) `set_avatar_from_file` decoded avatars synchronously via `Pixbuf::from_file_at_scale`, blocking the GTK main thread inside the `connect_monitor` handler. With `MAX_AVATAR_FILE_SIZE` at 10 MB the worst-case stall was 200–500 ms on monitor hotplug.
- **Tradeoffs**: Avatar is shown as the symbolic default icon for a brief window while decoding completes. Wallpaper stays synchronous because `connect_monitor` fires during `lock()` and needs the texture already present (see 2026-04-09).
- **How**: (1) Extract `username` into a local `String` in `init_fingerprint_async`, drop the borrow before the await, re-borrow in a new scope after — no awaits inside the second borrow, so hotplug during signal setup is safe. (2) `set_avatar_from_file` now uses `gio::File::read_future` + `Pixbuf::from_stream_at_scale_future` for async I/O and decode. The default icon is shown immediately; the decoded texture replaces it when ready. `Pixbuf` itself is `!Send`, so `gio::spawn_blocking` does not apply — the GIO async stream loader keeps the `Pixbuf` on the main thread while the kernel does the I/O asynchronously.

## 2026-04-09 – Monitor hotplug via connect_monitor signal

- **Who**: ClaudeCode, Dom
- **Why**: moonlock crashed with segfault in libgtk-4.so after suspend/resume — HDMI monitor disconnect/reconnect invalidated GDK monitor objects, and the statically created windows referenced destroyed surfaces. Crash at consistent GTK4 offset (0x278 NULL dereference), 3x reproduced.
- **Tradeoffs**: Wallpaper texture now loaded before `lock()` instead of after (connect_monitor fires during lock() and needs the texture). Local JPEG loading is fast enough that the delay is negligible. Shared state moved to Rc's for the signal closure — slightly more indirection but necessary for dynamic window creation.
- **How**: (1) Bump gtk4-session-lock feature from `v1_1` to `v1_2` to enable `Instance::connect_monitor`. (2) Replace manual monitor iteration with `lock.connect_monitor()` signal handler that creates windows on demand. (3) Signal fires once per existing monitor at `lock()` and again on hotplug. (4) Windows auto-unmap when their monitor disappears (ext-session-lock-v1 guarantee). (5) Fingerprint listener published to shared Rc so hotplugged monitors get FP labels.

## 2026-03-31 – Fourth audit: peek icon, blur limit, GResource compression, sync markers

- **Who**: ClaudeCode, Dom
- **Why**: Fourth triple audit found blur limit inconsistency (moonlock 0–100 vs moongreet/moonset 0–200), missing GResource compression, peek icon inconsistency, and duplicated code without sync markers.
- **Tradeoffs**: Peek icon enabled in lockscreen — user decision favoring UX consistency over shoulder-surfing protection. Acceptable for single-user desktop. Blur limit raised to 200 for ecosystem consistency.
- **How**: (1) `show_peek_icon(true)` in lockscreen password entry. (2) `clamp(0.0, 200.0)` for blur in config.rs. (3) `compressed="true"` on CSS/SVG GResource entries. (4) SYNC comments on duplicated blur/background functions pointing to moongreet and moonset.

## 2026-03-30 – Third audit: blur offset, lock-before-IO, FP signal lifecycle, TOCTOU

- **Who**: ClaudeCode, Dom
- **Why**: Third triple audit (quality, performance, security) found: blur padding offset rendering texture at (0,0) instead of (-pad,-pad) causing edge darkening on left/top (BUG), wallpaper disk I/O blocking before lock() extending the unsecured window (PERF/SEC), signal handler duplication on resume_async (SEC), failed_attempts not reset on FP resume (SEC), unknown VerifyStatus with done=false hanging FP listener (SEC), TOCTOU in is_file+is_symlink checks (SEC), dead code in faillock_warning (QUALITY), unbounded blur sigma (SEC).
- **Tradeoffs**: Wallpaper loads after lock() — screen briefly shows without wallpaper until texture is ready. Acceptable: security > aesthetics. Blur sigma clamped to [0.0, 100.0] — arbitrary upper bound but prevents GPU memory exhaustion.
- **How**: (1) Texture offset to (-pad, -pad) in render_blurred_texture. (2) lock.lock() before resolve_background_path. (3) begin_verification disconnects old signal_id before registering new. (4) resume_async resets failed_attempts. (5) Unknown VerifyStatus with done=true triggers restart. (6) symlink_metadata() for atomic file+symlink check. (7) faillock_warning dead code removed, saturating_sub. (8) background_blur clamped. (9) Redundant Zeroizing<Vec<u8>> removed. (10) Default impl for FingerprintListener. (11) on_verify_status restricted to pub(crate). (12) Warn logging for non-UTF-8 GECOS and avatar paths.

## 2026-03-30 – Second audit: zeroize CString, FP account check, PAM timeout, blur downscale

- **Who**: ClaudeCode, Dom
- **Why**: Second triple audit (quality, performance, security) found: CString password copy not zeroized (HIGH), fingerprint unlock bypassing pam_acct_mgmt (MEDIUM), no PAM timeout leaving user locked out on hanging modules (MEDIUM), GPU blur on full wallpaper resolution (MEDIUM), no-monitor edge case doing `return` instead of `exit(1)` (MEDIUM).
- **Tradeoffs**: PAM timeout (30s) uses a generation counter to avoid stale result interference — adds complexity but prevents parallel PAM sessions. FP restart after failed account check re-claims the device, adding a D-Bus round-trip, but prevents permanent FP death on transient failures. Blur downscale to 1920px cap trades negligible quality for ~4x less GPU work on 4K wallpapers.
- **How**: (1) `Zeroizing<CString>` wraps password in auth.rs, `zeroize/std` feature enabled. (2) `check_account()` calls pam_acct_mgmt after FP match; `resume_async()` restarts FP on transient failure. (3) `auth_generation` counter invalidates stale PAM results; 30s timeout re-enables UI. (4) `MAX_BLUR_DIMENSION` caps blur input at 1920px, sigma scaled proportionally. (5) `exit(1)` on no-monitor after `lock.lock()`.

## 2026-03-28 – Remove embedded wallpaper from binary

- **Who**: ClaudeCode, Dom
- **Why**: Wallpaper is installed by moonarch to /usr/share/moonarch/wallpaper.jpg. Embedding a 374K JPEG in the binary is redundant. GTK background color (Catppuccin Mocha base) is a clean fallback.
- **Tradeoffs**: Without moonarch installed AND without config, lockscreen shows plain dark background instead of wallpaper. Acceptable — that's the expected minimal state.
- **How**: Remove wallpaper.jpg from GResources, return None from resolve_background_path when no file found, skip background picture creation when no texture available.

## 2026-03-28 – Audit-driven security and lifecycle fixes (v0.6.0)

- **Who**: ClaudeCode, Dom
- **Why**: Triple audit (quality, performance, security) revealed a critical D-Bus signal spoofing vector, fingerprint lifecycle bugs, and multi-monitor performance issues.
- **Tradeoffs**: `cleanup_dbus()` extraction adds a method but clarifies the stop/match ownership; `running_flag: Rc<Cell<bool>>` adds a field but prevents race between async restart and stop; sender validation adds a check per signal but closes the only known auth bypass.
- **How**: (1) Validate D-Bus VerifyStatus sender against fprintd's unique bus name. (2) Extract `cleanup_dbus()` from `stop()`, call it on verify-match. (3) `Rc<Cell<bool>>` running flag checked after await in `restart_verify_async`. (4) Consistent 3s D-Bus timeouts. (5) Panic hook before logging. (6) Blur and avatar caches shared across monitors. (7) Peek icon disabled. (8) Symlink rejection for background_path. (9) TOML parse errors logged.

## 2026-03-28 – GPU blur via GskBlurNode replaces CPU blur

- **Who**: ClaudeCode, Dom
- **Why**: CPU-side Gaussian blur (`image` crate) blocked the GTK main thread for 500ms–2s on 4K wallpapers at cold cache. Disk cache mitigated repeat starts but added ~100 lines of complexity.
- **Tradeoffs**: GPU blur quality is slightly different (box-blur approximation vs true Gaussian), acceptable for wallpaper. Removes `image` and `dirs` dependencies entirely. No disk cache needed.
- **How**: `Snapshot::push_blur()` + `GskRenderer::render_texture()` on `connect_realize`. Blur happens once on the GPU when the widget gets its renderer, producing a concrete `gdk::Texture`. Zero startup latency.

## 2026-03-28 – Optional background blur via `image` crate (superseded)

- **Who**: ClaudeCode, Dom
- **Why**: Consistent with moonset/moongreet — blurred wallpaper as lockscreen background is a common UX pattern
- **Tradeoffs**: Adds `image` crate dependency (~15 transitive crates); CPU-side Gaussian blur at load time adds startup latency proportional to image size and sigma. Acceptable because blur runs once and the texture is shared across monitors.
- **How**: `load_background_texture(bg_path, blur_radius)` loads texture, optionally applies `imageops::blur()`, returns `gdk::Texture`. Config option `background_blur: Option<f32>` in TOML.

## 2026-03-28 – Shared wallpaper texture pattern (aligned with moonset/moongreet)

- **Who**: ClaudeCode, Dom
- **Why**: Previously loaded wallpaper per-window via `Picture::for_filename()`. Multi-monitor setups decoded the JPEG redundantly. Blur feature requires texture pixel access anyway.
- **Tradeoffs**: Slightly more code in main.rs (texture loaded before window creation), but avoids redundant decoding and enables the blur feature.
- **How**: `load_background_texture()` in lockscreen.rs decodes once, `create_background_picture()` wraps shared `gdk::Texture` in `gtk::Picture`. Same pattern as moonset/moongreet.