56a8634a58
Hygiene audit found deps clean but CI ran no vulnerability scan, so a future advisory against a locked crate would go undetected. Add an Audit workflow running cargo audit on push/PR — parses Cargo.lock, needs no GTK4 build env. Remove pkg/PKGBUILD: orphaned moonlock-git VCS recipe from the pre-tag-build era, two minors behind; canonical packaging lives in moonarch-pkgbuilds and is auto-bumped by update-pkgver.yaml. Drop the now-dead pkg/* makepkg ignore lines and add .pytest_cache/. No version bump — no change to the binary.
25 KiB
25 KiB
Decisions
Architectural and design decisions for Moonlock, in reverse chronological order.
2026-06-17 – Add cargo-audit CI gate, remove orphaned -git PKGBUILD
- Who: ClaudeCode, Dom
- Why: A hygiene audit found supply chain and dependency health clean (
cargo audit0.22.2: 0 vulnerabilities across 116 crates), but flagged repo-hygiene gaps: CI (.gitea/workflows/) only auto-bumped pkgver and ran no vulnerability scan, so a future advisory against a locked dependency would go undetected;pkg/PKGBUILDwas an orphanedmoonlock-gitVCS PKGBUILD (pkgver=0.4.1.r1..., two minor versions behind v0.6.18) left over from the pre-tag-build era — the canonical packaging now lives inmoonarch-pkgbuilds/moonlock/PKGBUILDand is auto-bumped byupdate-pkgver.yaml; a stray.pytest_cache/sat in the repo root of this pure-Rust project, only suppressed by a user-global gitignore. - Tradeoffs: CI gate scoped to
cargo auditonly, notcargo test/clippy/build—cargo auditparsesCargo.lockand needs no GTK4/PAM build environment, whereas the broader checks depend on themoonarchrunner having the full dev toolchain, which is unverified and beyond the hygiene scope.cargo install cargo-auditrecompiles per run (~2-3 min); accepted, caching deferred. Open risk: the runner must provide a Rust toolchain (cargo);update-pkgver.yamlonly usesgit, so this needs validation on first run. Deletingpkg/PKGBUILDrather than updating it: it is not in the build pipeline and the-gitvariant was abandoned at the tag-build switch (seemoonarch-pkgbuilds/DECISIONS.md2026-06-10). - How: New
.gitea/workflows/ci.yaml(Auditworkflow, triggers on push tomain,v*tags, and PRs tomain) clones the repo and runscargo audit. Removedpkg/PKGBUILDand the emptypkg/dir; dropped the now-deadpkg/*makepkg-artifact lines from.gitignoreand added.pytest_cache/. No version bump — no code or behavior change to the binary.
2026-06-17 – Restore hardened release profile after the crash hunt (v0.6.18)
- Who: ClaudeCode, Dom
- Why: A security audit found the
[profile.release]inCargo.tomlhad silently drifted from the hardened profile decided on 2026-04-24 (lto = "fat",strip = true). Git blame traced the drift to v0.6.14 (commit85cf039, "refactor: power-confirm via PowerAction table"):ltowas revertedfat→thin,stripflippedtrue→false, anddebug = truewas added — all bundled into an unrelated refactor commit, with no commit-message mention and no entry here. The pattern (no strip + debug symbols + faster thin LTO) was a debug aid for the suspend/resume SIGSEGV hunt that ran v0.6.9–v0.6.17, giving symbolized coredump backtraces. That crash is fixed as of v0.6.17, so the debug profile has outlived its purpose, while shipping debug symbols on a security-critical auth binary eases reverse-engineering of the auth path and bloats the binary. - Tradeoffs: Restoring
lto = "fat"roughly doubles release build time (~30 s → ~60 s) for better cross-crate inlining; acceptable for a binary compiled once per release. Droppingstrip = false+debug = truemeans field coredumps are no longer symbolized out of the box — a deliberate trade now that the crash is resolved; debug symbols can be re-enabled temporarily if a new crash needs hunting. Not chosen: keepinglto = "fat"but retaining the debug symbols — rejected because the symbols' only justification (the crash hunt) is gone. - How:
Cargo.toml[profile.release]restored tolto = "fat",strip = true, with thedebug = trueline removed;codegen-units = 1unchanged. Verified viafile target/release/moonlockreporting a stripped binary.
2026-06-02 – Real fix for the unlock SIGSEGV: quit in ::unlocked, never destroy windows ourselves (v0.6.17)
- Who: ClaudeCode, Dom
- Why: The v0.6.15 "stale per-monitor window" theory was a misdiagnosis. A clean manual
target/release/moonlocklock→unlock (single monitor, no suspend, no monitor removal) still crashed, preceded byGdk-CRITICAL: gdk_surface_get_display: assertion 'GDK_IS_SURFACE (surface)' failed. The v0.6.16 backtrace's top app frame is theunlock_callbackclosure, which ranlock.unlock(); app.quit();. Per the gtk4-session-lock header (assign_window_to_monitor: "the window will be unmapped andgtk_window_destroy()called on it when the current lock ends") the library destroys the lock windows itself on unlock.app.quit()then destroyed the same windows again — surface already gone → SIGSEGV. A double-destroy in the teardown sequence, entirely independent of monitors. - Tradeoffs: Reverted the v0.6.15/v0.6.16 monitor-pruning +
LockscreenHandles.monitorfield: it targeted the wrong cause, never fired in testing, and manually calledapp.remove_window()on library-managed lock windows — exactly what the upstream example warns against ("does NOT manually destroy or close the lock windows"). Monitor-removal-during-lock is left entirely to the library (which already auto-unmaps+dereferences such windows). Three attempts total: idle_add deferral (wrong: reentrancy theory), monitor-pruning (wrong: misdiagnosis), and this one — verified against both the C header and the upstreamexamples/simple.rs. - How:
unlock_callbacknow calls onlylock.unlock(). A newlock.connect_unlocked(|_| app.quit())quits the app only after the library finishes its teardown and fires::unlocked. Mirrors the canonical gtk4-session-lock usage exactly.
2026-06-02 – Prune per-monitor windows on monitor removal to fix resume-unlock SIGSEGV (v0.6.15)
- Who: ClaudeCode, Dom
- Why: Chronic SIGSEGV (multiple coredumps/day) on the unlock following a suspend/resume. Backtrace:
app.quit()→gtk_window_destroy→ gtk4-layer-shell →g_signal_emit→ NULL deref (0x278, rax=0). Root cause:connect_monitoris add-only — it creates one window per monitor and pushes toall_handles, but nothing ever removes a window when a monitor powers off on suspend. gtk4-session-lock unmaps + drops its ref to that window on monitor removal (pergtk4-session-lock0.4 docs), but the GtkApplication (ApplicationWindow::builder().application(app)) andall_handlesstill hold refs. The orphaned window survives until unlock, where destroying it dereferences its now-NULL monitor association. A diagnostic confirmed the windows are GTK-valid but accumulate (3 windows for 1-2 monitors) with a NULL associated object. - Tradeoffs: An earlier attempt deferred
unlock()/quit()viaglib::idle_add_local_onceon a reentrancy theory — proven wrong by the logs (crash happens inside the idle trampoline). Reverted. The library doc explicitly says monitor removal is detected via "GTK APIs"; we follow that rather than fighting the library. We release our refs (do not calldestroy— the lib already unmapped+dereffed the window). Diagnostic UNLOCK logging is kept for now, to be removed once a suspend/resume validation cycle confirms the fix. - How:
LockscreenHandlesgainsmonitor: Option<gdk::Monitor>, set in theconnect_monitorhandler.activate_with_session_lockwatchesdisplay.monitors()viaconnect_items_changed; on any change it retains only handles whose monitoris_valid(), callingapp.remove_window()on the pruned ones to drop the application's ref. With both refs released, the orphaned window is gone before unlock.
2026-06-02 – Align power-confirm to moonset's ActionDef pattern (v0.6.14)
- Who: ClaudeCode, Dom
- Why: A code review of moongreet's power-confirm (ported from this file) flagged the shared pattern as lower-altitude than moonset's: two near-identical reboot/shutdown handlers and a
show_power_confirmtaking loosemessage/action_fn/error_messageparams that can drift apart. moonset already solved this with anActionDeftable + button factory. Changed here in lockstep with moongreet to keep the three projects symmetric. - Tradeoffs: A
PowerActionstruct +power_actions()table +create_power_buttonfactory is slightly more machinery for two actions, but couples icon/prompt/error/action into one value (a mismatched prompt/action becomes unrepresentable) and makes a third action a one-line table entry. Did NOT touchconfirm_box: Rc<RefCell<Option<gtk::Box>>>— moonset uses the same, it is the shared convention. - How: Replaced the two hand-wired handlers with a loop over
power_actions();show_power_confirm/execute_power_actionnow takePowerAction(Copy) instead of three loose strings. Added an in-flight re-trigger guard viapower_box.set_sensitive(false)(re-enabled on failure), matching moonset; also clear a staleerror_labelwhen showing a new prompt.
2026-05-04 – Drop PAM account/session stack, remove check_account, drop pam_acct_mgmt from password path (v0.6.13)
- Who: ClaudeCode, Dom
- Why: 20 SIGSEGV coredumps in 6 days. All crashes preceded by
pam_unix(moonlock:account): setuid failed: Operation not permitted. The previous PAM config (auth/account/session include system-auth) pulledpam_unix(account), which needs setuid root for/etc/shadow. moonlock runs as the user, sopam_acct_mgmtalways failed. The failure cascaded into the FP-resume path (gio::spawn_blocking(check_account) → false → 2s timeout → resume_async) where a use-after-free duringgtk_window_destroykilled the process. Each crash left a deadext-session-lock-v1client and the compositor stuck on its red fallback backdrop until manual recovery. Initial fix on 2026-04-30 dropped the FP-sidecheck_accountand the account/session lines from the PAM config, but left thepam_acct_mgmtcall inauth.rs::authenticate()for the password path. Result: a PAM stack with noaccountmodule fell back to/etc/pam.d/other(pam_deny) and rejected every password — Dom got locked out on 2026-05-04. - Tradeoffs: Aligned with the swaylock/gtklock pattern (only
auth include login). Lost: PAM-driven account expiry/lockout in both paths. Acceptable because (a) FP attempts are still bounded byMAX_FP_ATTEMPTSinfingerprint.rs, (b) password-path lockout still works throughpam_faillock.so preauth/authfail/authsuccin the inheritedauthstack, and (c) account validity was already verified by the login manager when the session was opened — a lockscreen unlocks an existing session, it does not gate access to a new one. Not chosen: a customaccountstack withpam_faillock.soonly — would have kept PAM-level FP lockout but adds a non-standard config that other lockers do not use, andpam_faillockstandalone inaccountis rarely tested in the wild. - How: (1)
config/moonlock-pamreduced to a singleauth include loginline. (2)auth.rs::check_account()and its two unit tests removed. (3)auth.rs::authenticate()no longer callspam_acct_mgmt;pam_authenticateresult is returned directly. Thepam_acct_mgmtextern declaration is removed. (4)lockscreen.rs::start_fingerprintsimplified — thegio::spawn_blocking(check_account)async block, the failure-side error UI, and the 2-secondresume_asyncretry path all removed; on FP success the closure now goeslabel.set_text(success); fp.stop(); cb(). (5)fingerprint.rs::resume_async()removed. (6)CLAUDE.mdarchitecture and security sections updated to describe the new PAM stack.
2026-04-24 – Audit LOW fixes: docs, rustdoc, check_account scope, debug gating, lto fat (v0.6.12)
- Who: ClaudeCode, Dom
- Why: Six LOW findings cleared in a single pass. (1) Docs referenced the old
[0,100]blur range; code clamps[0,200]since v0.6.8. (2) TheMAX_BLUR_DIMENSIONdoc comment was split by a// SYNC:block, producing a truncated sentence in rustdoc. (3)check_accountwaspuband relied on callers only ever passinggetuid()-derived usernames; the contract was not enforced by the type system. (4)MOONLOCK_DEBUGenv var flipped log verbosity to Debug in release builds, letting a compromised session script escalate journal noise about fprintd / D-Bus. (5)pam_setcredabsence was undocumented. (6)[profile.release]usedlto = "thin"— fine for most crates, but for a latency-critical auth binary compiled once per release, fat LTO's extra cross-crate inlining is worth the ~1 min build hit. - Tradeoffs:
lto = "fat"roughly doubles release build time (~30 s → ~60 s) for slightly better inlining across PAM FFI wrappers and the i18n/status paths.#[cfg(debug_assertions)]on the debug-level selector means you have to run a debug build to raise log level — inconvenient for live troubleshooting, but aligned with the security-first posture. - How: (1)
CLAUDE.md+README.mdupdated to[0,200]. (2)// SYNC:block moved above the///doc so rustdoc renders one coherent paragraph. (3)check_accountvisibility narrowed topub(crate)with aPreconditionparagraph explaining the username contract. (4) Debug-level selection wrapped in#[cfg(debug_assertions)]; release builds always run atLevelFilter::Info. (5) Added a comment block inauthenticate()documenting whypam_setcredis deliberately absent and where it would hook in if needed. (6)lto = "fat"inCargo.toml.
2026-04-24 – Audit MEDIUM fixes: D-Bus cleanup race, TOCTOU open, FP reset, GTK entry clear (v0.6.11)
- Who: ClaudeCode, Dom
- Why: Second round after the HIGH fixes, addressing the four MEDIUM findings. (1)
cleanup_dbusspawned VerifyStop + Release as fire-and-forget, thenresume_asynccalled Claim after only a 2 s timeout — shorter than the 3 s D-Bus timeout, so on a slow bus the Claim could race the Release and fprintd would reject it, leaving the FP listener permanently dead. (2)load_background_texturerelied on the caller'ssymlink_metadatacheck, re-opening the path viagdk::Texture::from_file— a classic TOCTOU window. (3)resume_asyncunconditionally resetfailed_attempts, allowing an attacker with sensor control to evade the 10-attempt cap by cycling verify-match →check_accountfail → resume. (4) The GTKPasswordEntrybuffer was only cleared on timeout or auth failure, leaving the password in GLib malloc'd memory longer than necessary. - Tradeoffs: The D-Bus cleanup is now split into a synchronous helper (
take_cleanup_proxy— signal disconnect + flag clear) and an async helper (perform_dbus_cleanup— VerifyStop + Release), soresume_asynccan await the release whilestop()stays fire-and-forget. Dropping thefailed_attemptsreset means a flaky sensor could reach 10 failures faster, but the correct remedy is a new lock session (construction) rather than a reset that also helps attackers. - How: (1) Split
cleanup_dbusintotake_cleanup_proxy()(sync) +perform_dbus_cleanup(proxy)(async).resume_asyncnow awaitsperform_dbus_cleanupbeforebegin_verification.stop()still spawns the cleanup fire-and-forget. (2)load_background_textureopens withO_NOFOLLOWviastd::fs::OpenOptions::custom_flags, reads to bytes, and builds the texture viagdk::Texture::from_bytes. (3) Removedlistener.borrow_mut().failed_attempts = 0fromresume_async. (4)password_entry.set_text("")now fires right after theZeroizing::new(entry.text().to_string())extraction, shortening the GTK-side window.
2026-04-24 – Audit fixes: RefCell borrow across await, async avatar decode
- Who: ClaudeCode, Dom
- Why: Triple audit found two HIGH issues. (1)
init_fingerprint_asyncheld aRefCellimmutable borrow acrossis_available_async().await— a concurrentconnect_monitorsignal (hotplug / suspend-resume) invokingborrow_mut()during the await would panic. (2)set_avatar_from_filedecoded avatars synchronously viaPixbuf::from_file_at_scale, blocking the GTK main thread inside theconnect_monitorhandler. WithMAX_AVATAR_FILE_SIZEat 10 MB the worst-case stall was 200–500 ms on monitor hotplug. - Tradeoffs: Avatar is shown as the symbolic default icon for a brief window while decoding completes. Wallpaper stays synchronous because
connect_monitorfires duringlock()and needs the texture already present (see 2026-04-09). - How: (1) Extract
usernameinto a localStringininit_fingerprint_async, drop the borrow before the await, re-borrow in a new scope after — no awaits inside the second borrow, so hotplug during signal setup is safe. (2)set_avatar_from_filenow usesgio::File::read_future+Pixbuf::from_stream_at_scale_futurefor async I/O and decode. The default icon is shown immediately; the decoded texture replaces it when ready.Pixbufitself is!Send, sogio::spawn_blockingdoes not apply — the GIO async stream loader keeps thePixbufon the main thread while the kernel does the I/O asynchronously.
2026-04-09 – Monitor hotplug via connect_monitor signal
- Who: ClaudeCode, Dom
- Why: moonlock crashed with segfault in libgtk-4.so after suspend/resume — HDMI monitor disconnect/reconnect invalidated GDK monitor objects, and the statically created windows referenced destroyed surfaces. Crash at consistent GTK4 offset (0x278 NULL dereference), 3x reproduced.
- Tradeoffs: Wallpaper texture now loaded before
lock()instead of after (connect_monitor fires during lock() and needs the texture). Local JPEG loading is fast enough that the delay is negligible. Shared state moved to Rc's for the signal closure — slightly more indirection but necessary for dynamic window creation. - How: (1) Bump gtk4-session-lock feature from
v1_1tov1_2to enableInstance::connect_monitor. (2) Replace manual monitor iteration withlock.connect_monitor()signal handler that creates windows on demand. (3) Signal fires once per existing monitor atlock()and again on hotplug. (4) Windows auto-unmap when their monitor disappears (ext-session-lock-v1 guarantee). (5) Fingerprint listener published to shared Rc so hotplugged monitors get FP labels.
2026-03-31 – Fourth audit: peek icon, blur limit, GResource compression, sync markers
- Who: ClaudeCode, Dom
- Why: Fourth triple audit found blur limit inconsistency (moonlock 0–100 vs moongreet/moonset 0–200), missing GResource compression, peek icon inconsistency, and duplicated code without sync markers.
- Tradeoffs: Peek icon enabled in lockscreen — user decision favoring UX consistency over shoulder-surfing protection. Acceptable for single-user desktop. Blur limit raised to 200 for ecosystem consistency.
- How: (1)
show_peek_icon(true)in lockscreen password entry. (2)clamp(0.0, 200.0)for blur in config.rs. (3)compressed="true"on CSS/SVG GResource entries. (4) SYNC comments on duplicated blur/background functions pointing to moongreet and moonset.
2026-03-30 – Third audit: blur offset, lock-before-IO, FP signal lifecycle, TOCTOU
- Who: ClaudeCode, Dom
- Why: Third triple audit (quality, performance, security) found: blur padding offset rendering texture at (0,0) instead of (-pad,-pad) causing edge darkening on left/top (BUG), wallpaper disk I/O blocking before lock() extending the unsecured window (PERF/SEC), signal handler duplication on resume_async (SEC), failed_attempts not reset on FP resume (SEC), unknown VerifyStatus with done=false hanging FP listener (SEC), TOCTOU in is_file+is_symlink checks (SEC), dead code in faillock_warning (QUALITY), unbounded blur sigma (SEC).
- Tradeoffs: Wallpaper loads after lock() — screen briefly shows without wallpaper until texture is ready. Acceptable: security > aesthetics. Blur sigma clamped to [0.0, 100.0] — arbitrary upper bound but prevents GPU memory exhaustion.
- How: (1) Texture offset to (-pad, -pad) in render_blurred_texture. (2) lock.lock() before resolve_background_path. (3) begin_verification disconnects old signal_id before registering new. (4) resume_async resets failed_attempts. (5) Unknown VerifyStatus with done=true triggers restart. (6) symlink_metadata() for atomic file+symlink check. (7) faillock_warning dead code removed, saturating_sub. (8) background_blur clamped. (9) Redundant Zeroizing<Vec> removed. (10) Default impl for FingerprintListener. (11) on_verify_status restricted to pub(crate). (12) Warn logging for non-UTF-8 GECOS and avatar paths.
2026-03-30 – Second audit: zeroize CString, FP account check, PAM timeout, blur downscale
- Who: ClaudeCode, Dom
- Why: Second triple audit (quality, performance, security) found: CString password copy not zeroized (HIGH), fingerprint unlock bypassing pam_acct_mgmt (MEDIUM), no PAM timeout leaving user locked out on hanging modules (MEDIUM), GPU blur on full wallpaper resolution (MEDIUM), no-monitor edge case doing
returninstead ofexit(1)(MEDIUM). - Tradeoffs: PAM timeout (30s) uses a generation counter to avoid stale result interference — adds complexity but prevents parallel PAM sessions. FP restart after failed account check re-claims the device, adding a D-Bus round-trip, but prevents permanent FP death on transient failures. Blur downscale to 1920px cap trades negligible quality for ~4x less GPU work on 4K wallpapers.
- How: (1)
Zeroizing<CString>wraps password in auth.rs,zeroize/stdfeature enabled. (2)check_account()calls pam_acct_mgmt after FP match;resume_async()restarts FP on transient failure. (3)auth_generationcounter invalidates stale PAM results; 30s timeout re-enables UI. (4)MAX_BLUR_DIMENSIONcaps blur input at 1920px, sigma scaled proportionally. (5)exit(1)on no-monitor afterlock.lock().
2026-03-28 – Remove embedded wallpaper from binary
- Who: ClaudeCode, Dom
- Why: Wallpaper is installed by moonarch to /usr/share/moonarch/wallpaper.jpg. Embedding a 374K JPEG in the binary is redundant. GTK background color (Catppuccin Mocha base) is a clean fallback.
- Tradeoffs: Without moonarch installed AND without config, lockscreen shows plain dark background instead of wallpaper. Acceptable — that's the expected minimal state.
- How: Remove wallpaper.jpg from GResources, return None from resolve_background_path when no file found, skip background picture creation when no texture available.
2026-03-28 – Audit-driven security and lifecycle fixes (v0.6.0)
- Who: ClaudeCode, Dom
- Why: Triple audit (quality, performance, security) revealed a critical D-Bus signal spoofing vector, fingerprint lifecycle bugs, and multi-monitor performance issues.
- Tradeoffs:
cleanup_dbus()extraction adds a method but clarifies the stop/match ownership;running_flag: Rc<Cell<bool>>adds a field but prevents race between async restart and stop; sender validation adds a check per signal but closes the only known auth bypass. - How: (1) Validate D-Bus VerifyStatus sender against fprintd's unique bus name. (2) Extract
cleanup_dbus()fromstop(), call it on verify-match. (3)Rc<Cell<bool>>running flag checked after await inrestart_verify_async. (4) Consistent 3s D-Bus timeouts. (5) Panic hook before logging. (6) Blur and avatar caches shared across monitors. (7) Peek icon disabled. (8) Symlink rejection for background_path. (9) TOML parse errors logged.
2026-03-28 – GPU blur via GskBlurNode replaces CPU blur
- Who: ClaudeCode, Dom
- Why: CPU-side Gaussian blur (
imagecrate) blocked the GTK main thread for 500ms–2s on 4K wallpapers at cold cache. Disk cache mitigated repeat starts but added ~100 lines of complexity. - Tradeoffs: GPU blur quality is slightly different (box-blur approximation vs true Gaussian), acceptable for wallpaper. Removes
imageanddirsdependencies entirely. No disk cache needed. - How:
Snapshot::push_blur()+GskRenderer::render_texture()onconnect_realize. Blur happens once on the GPU when the widget gets its renderer, producing a concretegdk::Texture. Zero startup latency.
2026-03-28 – Optional background blur via image crate (superseded)
- Who: ClaudeCode, Dom
- Why: Consistent with moonset/moongreet — blurred wallpaper as lockscreen background is a common UX pattern
- Tradeoffs: Adds
imagecrate dependency (~15 transitive crates); CPU-side Gaussian blur at load time adds startup latency proportional to image size and sigma. Acceptable because blur runs once and the texture is shared across monitors. - How:
load_background_texture(bg_path, blur_radius)loads texture, optionally appliesimageops::blur(), returnsgdk::Texture. Config optionbackground_blur: Option<f32>in TOML.
2026-03-28 – Shared wallpaper texture pattern (aligned with moonset/moongreet)
- Who: ClaudeCode, Dom
- Why: Previously loaded wallpaper per-window via
Picture::for_filename(). Multi-monitor setups decoded the JPEG redundantly. Blur feature requires texture pixel access anyway. - Tradeoffs: Slightly more code in main.rs (texture loaded before window creation), but avoids redundant decoding and enables the blur feature.
- How:
load_background_texture()in lockscreen.rs decodes once,create_background_picture()wraps sharedgdk::Textureingtk::Picture. Same pattern as moonset/moongreet.