Files
moonlock/DECISIONS.md
T
nevaforget d292eaa4c8 fix: harden release profile, drop dead struct fields (v0.6.18)
Security-audit follow-up. The release profile had silently drifted from
the hardened profile (v0.6.12): v0.6.14 bundled lto fat->thin, strip
true->false, and debug=true into an unrelated refactor — a debug aid for
the suspend/resume SIGSEGV hunt. That crash is fixed (v0.6.17), so
restore lto=fat + strip=true and drop the debug symbols, which on a
security-critical auth binary only ease reverse-engineering of the auth
path and bloat the binary.

Also remove two vestigial struct fields the audit surfaced: never read,
no behavior change.
- LockscreenHandles.password_entry: the entry is fully wired via internal
  closures before the handles return; no caller read the field.
- User.uid: superseded by getuid() (root check) and username lookups.
2026-06-17 10:46:14 +02:00

123 lines
23 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Decisions
Architectural and design decisions for Moonlock, in reverse chronological order.
## 2026-06-17 Restore hardened release profile after the crash hunt (v0.6.18)
- **Who**: ClaudeCode, Dom
- **Why**: A security audit found the `[profile.release]` in `Cargo.toml` had silently drifted from the hardened profile decided on 2026-04-24 (`lto = "fat"`, `strip = true`). Git blame traced the drift to v0.6.14 (commit `85cf039`, "refactor: power-confirm via PowerAction table"): `lto` was reverted `fat``thin`, `strip` flipped `true``false`, and `debug = true` was added — all bundled into an unrelated refactor commit, with no commit-message mention and no entry here. The pattern (no strip + debug symbols + faster thin LTO) was a debug aid for the suspend/resume SIGSEGV hunt that ran v0.6.9v0.6.17, giving symbolized coredump backtraces. That crash is fixed as of v0.6.17, so the debug profile has outlived its purpose, while shipping debug symbols on a security-critical auth binary eases reverse-engineering of the auth path and bloats the binary.
- **Tradeoffs**: Restoring `lto = "fat"` roughly doubles release build time (~30 s → ~60 s) for better cross-crate inlining; acceptable for a binary compiled once per release. Dropping `strip = false` + `debug = true` means field coredumps are no longer symbolized out of the box — a deliberate trade now that the crash is resolved; debug symbols can be re-enabled temporarily if a new crash needs hunting. Not chosen: keeping `lto = "fat"` but retaining the debug symbols — rejected because the symbols' only justification (the crash hunt) is gone.
- **How**: `Cargo.toml` `[profile.release]` restored to `lto = "fat"`, `strip = true`, with the `debug = true` line removed; `codegen-units = 1` unchanged. Verified via `file target/release/moonlock` reporting a stripped binary.
## 2026-06-02 Real fix for the unlock SIGSEGV: quit in ::unlocked, never destroy windows ourselves (v0.6.17)
- **Who**: ClaudeCode, Dom
- **Why**: The v0.6.15 "stale per-monitor window" theory was a **misdiagnosis**. A clean manual `target/release/moonlock` lock→unlock (single monitor, no suspend, no monitor removal) still crashed, preceded by `Gdk-CRITICAL: gdk_surface_get_display: assertion 'GDK_IS_SURFACE (surface)' failed`. The v0.6.16 backtrace's top app frame is the `unlock_callback` closure, which ran `lock.unlock(); app.quit();`. Per the gtk4-session-lock header (`assign_window_to_monitor`: "the window will be unmapped and `gtk_window_destroy()` called on it when the current lock ends") the **library destroys the lock windows itself** on unlock. `app.quit()` then destroyed the same windows again — surface already gone → SIGSEGV. A double-destroy in the teardown sequence, entirely independent of monitors.
- **Tradeoffs**: Reverted the v0.6.15/v0.6.16 monitor-pruning + `LockscreenHandles.monitor` field: it targeted the wrong cause, never fired in testing, and manually called `app.remove_window()` on library-managed lock windows — exactly what the upstream example warns against ("does NOT manually destroy or close the lock windows"). Monitor-removal-during-lock is left entirely to the library (which already auto-unmaps+dereferences such windows). Three attempts total: idle_add deferral (wrong: reentrancy theory), monitor-pruning (wrong: misdiagnosis), and this one — verified against both the C header and the upstream `examples/simple.rs`.
- **How**: `unlock_callback` now calls only `lock.unlock()`. A new `lock.connect_unlocked(|_| app.quit())` quits the app only after the library finishes its teardown and fires `::unlocked`. Mirrors the canonical gtk4-session-lock usage exactly.
## 2026-06-02 Prune per-monitor windows on monitor removal to fix resume-unlock SIGSEGV (v0.6.15)
- **Who**: ClaudeCode, Dom
- **Why**: Chronic SIGSEGV (multiple coredumps/day) on the unlock following a suspend/resume. Backtrace: `app.quit()``gtk_window_destroy` → gtk4-layer-shell → `g_signal_emit` → NULL deref (`0x278`, rax=0). Root cause: `connect_monitor` is add-only — it creates one window per monitor and pushes to `all_handles`, but nothing ever removes a window when a monitor powers off on suspend. gtk4-session-lock unmaps + drops *its* ref to that window on monitor removal (per `gtk4-session-lock` 0.4 docs), but the GtkApplication (`ApplicationWindow::builder().application(app)`) and `all_handles` still hold refs. The orphaned window survives until unlock, where destroying it dereferences its now-NULL monitor association. A diagnostic confirmed the windows are GTK-valid but accumulate (3 windows for 1-2 monitors) with a NULL associated object.
- **Tradeoffs**: An earlier attempt deferred `unlock()`/`quit()` via `glib::idle_add_local_once` on a reentrancy theory — proven wrong by the logs (crash happens *inside* the idle trampoline). Reverted. The library doc explicitly says monitor removal is detected via "GTK APIs"; we follow that rather than fighting the library. We release our refs (do **not** call `destroy` — the lib already unmapped+dereffed the window). Diagnostic UNLOCK logging is kept for now, to be removed once a suspend/resume validation cycle confirms the fix.
- **How**: `LockscreenHandles` gains `monitor: Option<gdk::Monitor>`, set in the `connect_monitor` handler. `activate_with_session_lock` watches `display.monitors()` via `connect_items_changed`; on any change it retains only handles whose monitor `is_valid()`, calling `app.remove_window()` on the pruned ones to drop the application's ref. With both refs released, the orphaned window is gone before unlock.
## 2026-06-02 Align power-confirm to moonset's ActionDef pattern (v0.6.14)
- **Who**: ClaudeCode, Dom
- **Why**: A code review of moongreet's power-confirm (ported from this file) flagged the shared pattern as lower-altitude than moonset's: two near-identical reboot/shutdown handlers and a `show_power_confirm` taking loose `message`/`action_fn`/`error_message` params that can drift apart. moonset already solved this with an `ActionDef` table + button factory. Changed here in lockstep with moongreet to keep the three projects symmetric.
- **Tradeoffs**: A `PowerAction` struct + `power_actions()` table + `create_power_button` factory is slightly more machinery for two actions, but couples icon/prompt/error/action into one value (a mismatched prompt/action becomes unrepresentable) and makes a third action a one-line table entry. Did NOT touch `confirm_box: Rc<RefCell<Option<gtk::Box>>>` — moonset uses the same, it is the shared convention.
- **How**: Replaced the two hand-wired handlers with a loop over `power_actions()`; `show_power_confirm`/`execute_power_action` now take `PowerAction` (Copy) instead of three loose strings. Added an in-flight re-trigger guard via `power_box.set_sensitive(false)` (re-enabled on failure), matching moonset; also clear a stale `error_label` when showing a new prompt.
## 2026-05-04 Drop PAM account/session stack, remove `check_account`, drop `pam_acct_mgmt` from password path (v0.6.13)
- **Who**: ClaudeCode, Dom
- **Why**: 20 SIGSEGV coredumps in 6 days. All crashes preceded by `pam_unix(moonlock:account): setuid failed: Operation not permitted`. The previous PAM config (`auth/account/session include system-auth`) pulled `pam_unix(account)`, which needs setuid root for `/etc/shadow`. moonlock runs as the user, so `pam_acct_mgmt` always failed. The failure cascaded into the FP-resume path (`gio::spawn_blocking(check_account) → false → 2s timeout → resume_async`) where a use-after-free during `gtk_window_destroy` killed the process. Each crash left a dead `ext-session-lock-v1` client and the compositor stuck on its red fallback backdrop until manual recovery. Initial fix on 2026-04-30 dropped the FP-side `check_account` and the account/session lines from the PAM config, but left the `pam_acct_mgmt` call in `auth.rs::authenticate()` for the password path. Result: a PAM stack with no `account` module fell back to `/etc/pam.d/other` (`pam_deny`) and rejected every password — Dom got locked out on 2026-05-04.
- **Tradeoffs**: Aligned with the swaylock/gtklock pattern (only `auth include login`). Lost: PAM-driven account expiry/lockout in both paths. Acceptable because (a) FP attempts are still bounded by `MAX_FP_ATTEMPTS` in `fingerprint.rs`, (b) password-path lockout still works through `pam_faillock.so preauth/authfail/authsucc` in the inherited `auth` stack, and (c) account validity was already verified by the login manager when the session was opened — a lockscreen unlocks an existing session, it does not gate access to a new one. Not chosen: a custom `account` stack with `pam_faillock.so` only — would have kept PAM-level FP lockout but adds a non-standard config that other lockers do not use, and `pam_faillock` standalone in `account` is rarely tested in the wild.
- **How**: (1) `config/moonlock-pam` reduced to a single `auth include login` line. (2) `auth.rs::check_account()` and its two unit tests removed. (3) `auth.rs::authenticate()` no longer calls `pam_acct_mgmt`; `pam_authenticate` result is returned directly. The `pam_acct_mgmt` extern declaration is removed. (4) `lockscreen.rs::start_fingerprint` simplified — the `gio::spawn_blocking(check_account)` async block, the failure-side error UI, and the 2-second `resume_async` retry path all removed; on FP success the closure now goes `label.set_text(success); fp.stop(); cb()`. (5) `fingerprint.rs::resume_async()` removed. (6) `CLAUDE.md` architecture and security sections updated to describe the new PAM stack.
## 2026-04-24 Audit LOW fixes: docs, rustdoc, check_account scope, debug gating, lto fat (v0.6.12)
- **Who**: ClaudeCode, Dom
- **Why**: Six LOW findings cleared in a single pass. (1) Docs referenced the old `[0,100]` blur range; code clamps `[0,200]` since v0.6.8. (2) The `MAX_BLUR_DIMENSION` doc comment was split by a `// SYNC:` block, producing a truncated sentence in rustdoc. (3) `check_account` was `pub` and relied on callers only ever passing `getuid()`-derived usernames; the contract was not enforced by the type system. (4) `MOONLOCK_DEBUG` env var flipped log verbosity to Debug in release builds, letting a compromised session script escalate journal noise about fprintd / D-Bus. (5) `pam_setcred` absence was undocumented. (6) `[profile.release]` used `lto = "thin"` — fine for most crates, but for a latency-critical auth binary compiled once per release, fat LTO's extra cross-crate inlining is worth the ~1 min build hit.
- **Tradeoffs**: `lto = "fat"` roughly doubles release build time (~30 s → ~60 s) for slightly better inlining across PAM FFI wrappers and the i18n/status paths. `#[cfg(debug_assertions)]` on the debug-level selector means you have to run a debug build to raise log level — inconvenient for live troubleshooting, but aligned with the security-first posture.
- **How**: (1) `CLAUDE.md` + `README.md` updated to `[0,200]`. (2) `// SYNC:` block moved above the `///` doc so rustdoc renders one coherent paragraph. (3) `check_account` visibility narrowed to `pub(crate)` with a `Precondition` paragraph explaining the username contract. (4) Debug-level selection wrapped in `#[cfg(debug_assertions)]`; release builds always run at `LevelFilter::Info`. (5) Added a comment block in `authenticate()` documenting why `pam_setcred` is deliberately absent and where it would hook in if needed. (6) `lto = "fat"` in `Cargo.toml`.
## 2026-04-24 Audit MEDIUM fixes: D-Bus cleanup race, TOCTOU open, FP reset, GTK entry clear (v0.6.11)
- **Who**: ClaudeCode, Dom
- **Why**: Second round after the HIGH fixes, addressing the four MEDIUM findings. (1) `cleanup_dbus` spawned VerifyStop + Release as fire-and-forget, then `resume_async` called Claim after only a 2 s timeout — shorter than the 3 s D-Bus timeout, so on a slow bus the Claim could race the Release and fprintd would reject it, leaving the FP listener permanently dead. (2) `load_background_texture` relied on the caller's `symlink_metadata` check, re-opening the path via `gdk::Texture::from_file` — a classic TOCTOU window. (3) `resume_async` unconditionally reset `failed_attempts`, allowing an attacker with sensor control to evade the 10-attempt cap by cycling verify-match → `check_account` fail → resume. (4) The GTK `PasswordEntry` buffer was only cleared on timeout or auth failure, leaving the password in GLib malloc'd memory longer than necessary.
- **Tradeoffs**: The D-Bus cleanup is now split into a synchronous helper (`take_cleanup_proxy` — signal disconnect + flag clear) and an async helper (`perform_dbus_cleanup` — VerifyStop + Release), so `resume_async` can await the release while `stop()` stays fire-and-forget. Dropping the `failed_attempts` reset means a flaky sensor could reach 10 failures faster, but the correct remedy is a new lock session (construction) rather than a reset that also helps attackers.
- **How**: (1) Split `cleanup_dbus` into `take_cleanup_proxy()` (sync) + `perform_dbus_cleanup(proxy)` (async). `resume_async` now awaits `perform_dbus_cleanup` before `begin_verification`. `stop()` still spawns the cleanup fire-and-forget. (2) `load_background_texture` opens with `O_NOFOLLOW` via `std::fs::OpenOptions::custom_flags`, reads to bytes, and builds the texture via `gdk::Texture::from_bytes`. (3) Removed `listener.borrow_mut().failed_attempts = 0` from `resume_async`. (4) `password_entry.set_text("")` now fires right after the `Zeroizing::new(entry.text().to_string())` extraction, shortening the GTK-side window.
## 2026-04-24 Audit fixes: RefCell borrow across await, async avatar decode
- **Who**: ClaudeCode, Dom
- **Why**: Triple audit found two HIGH issues. (1) `init_fingerprint_async` held a `RefCell` immutable borrow across `is_available_async().await` — a concurrent `connect_monitor` signal (hotplug / suspend-resume) invoking `borrow_mut()` during the await would panic. (2) `set_avatar_from_file` decoded avatars synchronously via `Pixbuf::from_file_at_scale`, blocking the GTK main thread inside the `connect_monitor` handler. With `MAX_AVATAR_FILE_SIZE` at 10 MB the worst-case stall was 200500 ms on monitor hotplug.
- **Tradeoffs**: Avatar is shown as the symbolic default icon for a brief window while decoding completes. Wallpaper stays synchronous because `connect_monitor` fires during `lock()` and needs the texture already present (see 2026-04-09).
- **How**: (1) Extract `username` into a local `String` in `init_fingerprint_async`, drop the borrow before the await, re-borrow in a new scope after — no awaits inside the second borrow, so hotplug during signal setup is safe. (2) `set_avatar_from_file` now uses `gio::File::read_future` + `Pixbuf::from_stream_at_scale_future` for async I/O and decode. The default icon is shown immediately; the decoded texture replaces it when ready. `Pixbuf` itself is `!Send`, so `gio::spawn_blocking` does not apply — the GIO async stream loader keeps the `Pixbuf` on the main thread while the kernel does the I/O asynchronously.
## 2026-04-09 Monitor hotplug via connect_monitor signal
- **Who**: ClaudeCode, Dom
- **Why**: moonlock crashed with segfault in libgtk-4.so after suspend/resume — HDMI monitor disconnect/reconnect invalidated GDK monitor objects, and the statically created windows referenced destroyed surfaces. Crash at consistent GTK4 offset (0x278 NULL dereference), 3x reproduced.
- **Tradeoffs**: Wallpaper texture now loaded before `lock()` instead of after (connect_monitor fires during lock() and needs the texture). Local JPEG loading is fast enough that the delay is negligible. Shared state moved to Rc's for the signal closure — slightly more indirection but necessary for dynamic window creation.
- **How**: (1) Bump gtk4-session-lock feature from `v1_1` to `v1_2` to enable `Instance::connect_monitor`. (2) Replace manual monitor iteration with `lock.connect_monitor()` signal handler that creates windows on demand. (3) Signal fires once per existing monitor at `lock()` and again on hotplug. (4) Windows auto-unmap when their monitor disappears (ext-session-lock-v1 guarantee). (5) Fingerprint listener published to shared Rc so hotplugged monitors get FP labels.
## 2026-03-31 Fourth audit: peek icon, blur limit, GResource compression, sync markers
- **Who**: ClaudeCode, Dom
- **Why**: Fourth triple audit found blur limit inconsistency (moonlock 0100 vs moongreet/moonset 0200), missing GResource compression, peek icon inconsistency, and duplicated code without sync markers.
- **Tradeoffs**: Peek icon enabled in lockscreen — user decision favoring UX consistency over shoulder-surfing protection. Acceptable for single-user desktop. Blur limit raised to 200 for ecosystem consistency.
- **How**: (1) `show_peek_icon(true)` in lockscreen password entry. (2) `clamp(0.0, 200.0)` for blur in config.rs. (3) `compressed="true"` on CSS/SVG GResource entries. (4) SYNC comments on duplicated blur/background functions pointing to moongreet and moonset.
## 2026-03-30 Third audit: blur offset, lock-before-IO, FP signal lifecycle, TOCTOU
- **Who**: ClaudeCode, Dom
- **Why**: Third triple audit (quality, performance, security) found: blur padding offset rendering texture at (0,0) instead of (-pad,-pad) causing edge darkening on left/top (BUG), wallpaper disk I/O blocking before lock() extending the unsecured window (PERF/SEC), signal handler duplication on resume_async (SEC), failed_attempts not reset on FP resume (SEC), unknown VerifyStatus with done=false hanging FP listener (SEC), TOCTOU in is_file+is_symlink checks (SEC), dead code in faillock_warning (QUALITY), unbounded blur sigma (SEC).
- **Tradeoffs**: Wallpaper loads after lock() — screen briefly shows without wallpaper until texture is ready. Acceptable: security > aesthetics. Blur sigma clamped to [0.0, 100.0] — arbitrary upper bound but prevents GPU memory exhaustion.
- **How**: (1) Texture offset to (-pad, -pad) in render_blurred_texture. (2) lock.lock() before resolve_background_path. (3) begin_verification disconnects old signal_id before registering new. (4) resume_async resets failed_attempts. (5) Unknown VerifyStatus with done=true triggers restart. (6) symlink_metadata() for atomic file+symlink check. (7) faillock_warning dead code removed, saturating_sub. (8) background_blur clamped. (9) Redundant Zeroizing<Vec<u8>> removed. (10) Default impl for FingerprintListener. (11) on_verify_status restricted to pub(crate). (12) Warn logging for non-UTF-8 GECOS and avatar paths.
## 2026-03-30 Second audit: zeroize CString, FP account check, PAM timeout, blur downscale
- **Who**: ClaudeCode, Dom
- **Why**: Second triple audit (quality, performance, security) found: CString password copy not zeroized (HIGH), fingerprint unlock bypassing pam_acct_mgmt (MEDIUM), no PAM timeout leaving user locked out on hanging modules (MEDIUM), GPU blur on full wallpaper resolution (MEDIUM), no-monitor edge case doing `return` instead of `exit(1)` (MEDIUM).
- **Tradeoffs**: PAM timeout (30s) uses a generation counter to avoid stale result interference — adds complexity but prevents parallel PAM sessions. FP restart after failed account check re-claims the device, adding a D-Bus round-trip, but prevents permanent FP death on transient failures. Blur downscale to 1920px cap trades negligible quality for ~4x less GPU work on 4K wallpapers.
- **How**: (1) `Zeroizing<CString>` wraps password in auth.rs, `zeroize/std` feature enabled. (2) `check_account()` calls pam_acct_mgmt after FP match; `resume_async()` restarts FP on transient failure. (3) `auth_generation` counter invalidates stale PAM results; 30s timeout re-enables UI. (4) `MAX_BLUR_DIMENSION` caps blur input at 1920px, sigma scaled proportionally. (5) `exit(1)` on no-monitor after `lock.lock()`.
## 2026-03-28 Remove embedded wallpaper from binary
- **Who**: ClaudeCode, Dom
- **Why**: Wallpaper is installed by moonarch to /usr/share/moonarch/wallpaper.jpg. Embedding a 374K JPEG in the binary is redundant. GTK background color (Catppuccin Mocha base) is a clean fallback.
- **Tradeoffs**: Without moonarch installed AND without config, lockscreen shows plain dark background instead of wallpaper. Acceptable — that's the expected minimal state.
- **How**: Remove wallpaper.jpg from GResources, return None from resolve_background_path when no file found, skip background picture creation when no texture available.
## 2026-03-28 Audit-driven security and lifecycle fixes (v0.6.0)
- **Who**: ClaudeCode, Dom
- **Why**: Triple audit (quality, performance, security) revealed a critical D-Bus signal spoofing vector, fingerprint lifecycle bugs, and multi-monitor performance issues.
- **Tradeoffs**: `cleanup_dbus()` extraction adds a method but clarifies the stop/match ownership; `running_flag: Rc<Cell<bool>>` adds a field but prevents race between async restart and stop; sender validation adds a check per signal but closes the only known auth bypass.
- **How**: (1) Validate D-Bus VerifyStatus sender against fprintd's unique bus name. (2) Extract `cleanup_dbus()` from `stop()`, call it on verify-match. (3) `Rc<Cell<bool>>` running flag checked after await in `restart_verify_async`. (4) Consistent 3s D-Bus timeouts. (5) Panic hook before logging. (6) Blur and avatar caches shared across monitors. (7) Peek icon disabled. (8) Symlink rejection for background_path. (9) TOML parse errors logged.
## 2026-03-28 GPU blur via GskBlurNode replaces CPU blur
- **Who**: ClaudeCode, Dom
- **Why**: CPU-side Gaussian blur (`image` crate) blocked the GTK main thread for 500ms2s on 4K wallpapers at cold cache. Disk cache mitigated repeat starts but added ~100 lines of complexity.
- **Tradeoffs**: GPU blur quality is slightly different (box-blur approximation vs true Gaussian), acceptable for wallpaper. Removes `image` and `dirs` dependencies entirely. No disk cache needed.
- **How**: `Snapshot::push_blur()` + `GskRenderer::render_texture()` on `connect_realize`. Blur happens once on the GPU when the widget gets its renderer, producing a concrete `gdk::Texture`. Zero startup latency.
## 2026-03-28 Optional background blur via `image` crate (superseded)
- **Who**: ClaudeCode, Dom
- **Why**: Consistent with moonset/moongreet — blurred wallpaper as lockscreen background is a common UX pattern
- **Tradeoffs**: Adds `image` crate dependency (~15 transitive crates); CPU-side Gaussian blur at load time adds startup latency proportional to image size and sigma. Acceptable because blur runs once and the texture is shared across monitors.
- **How**: `load_background_texture(bg_path, blur_radius)` loads texture, optionally applies `imageops::blur()`, returns `gdk::Texture`. Config option `background_blur: Option<f32>` in TOML.
## 2026-03-28 Shared wallpaper texture pattern (aligned with moonset/moongreet)
- **Who**: ClaudeCode, Dom
- **Why**: Previously loaded wallpaper per-window via `Picture::for_filename()`. Multi-monitor setups decoded the JPEG redundantly. Blur feature requires texture pixel access anyway.
- **Tradeoffs**: Slightly more code in main.rs (texture loaded before window creation), but avoids redundant decoding and enables the blur feature.
- **How**: `load_background_texture()` in lockscreen.rs decodes once, `create_background_picture()` wraps shared `gdk::Texture` in `gtk::Picture`. Same pattern as moonset/moongreet.