e51a847d48
The real cause of the unlock SIGSEGV: gtk4-session-lock destroys the lock windows itself when the lock ends (per its header), so unlock_callback's `unlock(); app.quit();` destroyed them a second time — surface already gone → gdk_surface_get_display NULL → crash in gtk_window_destroy. Reproduces on a plain single-monitor lock/unlock, no suspend needed. unlock_callback now calls only unlock(); a new connect_unlocked handler quits the app after the library finishes teardown. Mirrors the upstream examples/simple.rs exactly. Reverts the v0.6.15/v0.6.16 monitor-pruning + LockscreenHandles.monitor: it was a misdiagnosis (the crash is monitor-independent) and manually manipulated library-managed lock windows, which the upstream example explicitly warns against. Monitor removal is left to the library.
21 KiB
21 KiB
Decisions
Architectural and design decisions for Moonlock, in reverse chronological order.
2026-06-02 – Real fix for the unlock SIGSEGV: quit in ::unlocked, never destroy windows ourselves (v0.6.17)
- Who: ClaudeCode, Dom
- Why: The v0.6.15 "stale per-monitor window" theory was a misdiagnosis. A clean manual
target/release/moonlocklock→unlock (single monitor, no suspend, no monitor removal) still crashed, preceded byGdk-CRITICAL: gdk_surface_get_display: assertion 'GDK_IS_SURFACE (surface)' failed. The v0.6.16 backtrace's top app frame is theunlock_callbackclosure, which ranlock.unlock(); app.quit();. Per the gtk4-session-lock header (assign_window_to_monitor: "the window will be unmapped andgtk_window_destroy()called on it when the current lock ends") the library destroys the lock windows itself on unlock.app.quit()then destroyed the same windows again — surface already gone → SIGSEGV. A double-destroy in the teardown sequence, entirely independent of monitors. - Tradeoffs: Reverted the v0.6.15/v0.6.16 monitor-pruning +
LockscreenHandles.monitorfield: it targeted the wrong cause, never fired in testing, and manually calledapp.remove_window()on library-managed lock windows — exactly what the upstream example warns against ("does NOT manually destroy or close the lock windows"). Monitor-removal-during-lock is left entirely to the library (which already auto-unmaps+dereferences such windows). Three attempts total: idle_add deferral (wrong: reentrancy theory), monitor-pruning (wrong: misdiagnosis), and this one — verified against both the C header and the upstreamexamples/simple.rs. - How:
unlock_callbacknow calls onlylock.unlock(). A newlock.connect_unlocked(|_| app.quit())quits the app only after the library finishes its teardown and fires::unlocked. Mirrors the canonical gtk4-session-lock usage exactly.
2026-06-02 – Prune per-monitor windows on monitor removal to fix resume-unlock SIGSEGV (v0.6.15)
- Who: ClaudeCode, Dom
- Why: Chronic SIGSEGV (multiple coredumps/day) on the unlock following a suspend/resume. Backtrace:
app.quit()→gtk_window_destroy→ gtk4-layer-shell →g_signal_emit→ NULL deref (0x278, rax=0). Root cause:connect_monitoris add-only — it creates one window per monitor and pushes toall_handles, but nothing ever removes a window when a monitor powers off on suspend. gtk4-session-lock unmaps + drops its ref to that window on monitor removal (pergtk4-session-lock0.4 docs), but the GtkApplication (ApplicationWindow::builder().application(app)) andall_handlesstill hold refs. The orphaned window survives until unlock, where destroying it dereferences its now-NULL monitor association. A diagnostic confirmed the windows are GTK-valid but accumulate (3 windows for 1-2 monitors) with a NULL associated object. - Tradeoffs: An earlier attempt deferred
unlock()/quit()viaglib::idle_add_local_onceon a reentrancy theory — proven wrong by the logs (crash happens inside the idle trampoline). Reverted. The library doc explicitly says monitor removal is detected via "GTK APIs"; we follow that rather than fighting the library. We release our refs (do not calldestroy— the lib already unmapped+dereffed the window). Diagnostic UNLOCK logging is kept for now, to be removed once a suspend/resume validation cycle confirms the fix. - How:
LockscreenHandlesgainsmonitor: Option<gdk::Monitor>, set in theconnect_monitorhandler.activate_with_session_lockwatchesdisplay.monitors()viaconnect_items_changed; on any change it retains only handles whose monitoris_valid(), callingapp.remove_window()on the pruned ones to drop the application's ref. With both refs released, the orphaned window is gone before unlock.
2026-06-02 – Align power-confirm to moonset's ActionDef pattern (v0.6.14)
- Who: ClaudeCode, Dom
- Why: A code review of moongreet's power-confirm (ported from this file) flagged the shared pattern as lower-altitude than moonset's: two near-identical reboot/shutdown handlers and a
show_power_confirmtaking loosemessage/action_fn/error_messageparams that can drift apart. moonset already solved this with anActionDeftable + button factory. Changed here in lockstep with moongreet to keep the three projects symmetric. - Tradeoffs: A
PowerActionstruct +power_actions()table +create_power_buttonfactory is slightly more machinery for two actions, but couples icon/prompt/error/action into one value (a mismatched prompt/action becomes unrepresentable) and makes a third action a one-line table entry. Did NOT touchconfirm_box: Rc<RefCell<Option<gtk::Box>>>— moonset uses the same, it is the shared convention. - How: Replaced the two hand-wired handlers with a loop over
power_actions();show_power_confirm/execute_power_actionnow takePowerAction(Copy) instead of three loose strings. Added an in-flight re-trigger guard viapower_box.set_sensitive(false)(re-enabled on failure), matching moonset; also clear a staleerror_labelwhen showing a new prompt.
2026-05-04 – Drop PAM account/session stack, remove check_account, drop pam_acct_mgmt from password path (v0.6.13)
- Who: ClaudeCode, Dom
- Why: 20 SIGSEGV coredumps in 6 days. All crashes preceded by
pam_unix(moonlock:account): setuid failed: Operation not permitted. The previous PAM config (auth/account/session include system-auth) pulledpam_unix(account), which needs setuid root for/etc/shadow. moonlock runs as the user, sopam_acct_mgmtalways failed. The failure cascaded into the FP-resume path (gio::spawn_blocking(check_account) → false → 2s timeout → resume_async) where a use-after-free duringgtk_window_destroykilled the process. Each crash left a deadext-session-lock-v1client and the compositor stuck on its red fallback backdrop until manual recovery. Initial fix on 2026-04-30 dropped the FP-sidecheck_accountand the account/session lines from the PAM config, but left thepam_acct_mgmtcall inauth.rs::authenticate()for the password path. Result: a PAM stack with noaccountmodule fell back to/etc/pam.d/other(pam_deny) and rejected every password — Dom got locked out on 2026-05-04. - Tradeoffs: Aligned with the swaylock/gtklock pattern (only
auth include login). Lost: PAM-driven account expiry/lockout in both paths. Acceptable because (a) FP attempts are still bounded byMAX_FP_ATTEMPTSinfingerprint.rs, (b) password-path lockout still works throughpam_faillock.so preauth/authfail/authsuccin the inheritedauthstack, and (c) account validity was already verified by the login manager when the session was opened — a lockscreen unlocks an existing session, it does not gate access to a new one. Not chosen: a customaccountstack withpam_faillock.soonly — would have kept PAM-level FP lockout but adds a non-standard config that other lockers do not use, andpam_faillockstandalone inaccountis rarely tested in the wild. - How: (1)
config/moonlock-pamreduced to a singleauth include loginline. (2)auth.rs::check_account()and its two unit tests removed. (3)auth.rs::authenticate()no longer callspam_acct_mgmt;pam_authenticateresult is returned directly. Thepam_acct_mgmtextern declaration is removed. (4)lockscreen.rs::start_fingerprintsimplified — thegio::spawn_blocking(check_account)async block, the failure-side error UI, and the 2-secondresume_asyncretry path all removed; on FP success the closure now goeslabel.set_text(success); fp.stop(); cb(). (5)fingerprint.rs::resume_async()removed. (6)CLAUDE.mdarchitecture and security sections updated to describe the new PAM stack.
2026-04-24 – Audit LOW fixes: docs, rustdoc, check_account scope, debug gating, lto fat (v0.6.12)
- Who: ClaudeCode, Dom
- Why: Six LOW findings cleared in a single pass. (1) Docs referenced the old
[0,100]blur range; code clamps[0,200]since v0.6.8. (2) TheMAX_BLUR_DIMENSIONdoc comment was split by a// SYNC:block, producing a truncated sentence in rustdoc. (3)check_accountwaspuband relied on callers only ever passinggetuid()-derived usernames; the contract was not enforced by the type system. (4)MOONLOCK_DEBUGenv var flipped log verbosity to Debug in release builds, letting a compromised session script escalate journal noise about fprintd / D-Bus. (5)pam_setcredabsence was undocumented. (6)[profile.release]usedlto = "thin"— fine for most crates, but for a latency-critical auth binary compiled once per release, fat LTO's extra cross-crate inlining is worth the ~1 min build hit. - Tradeoffs:
lto = "fat"roughly doubles release build time (~30 s → ~60 s) for slightly better inlining across PAM FFI wrappers and the i18n/status paths.#[cfg(debug_assertions)]on the debug-level selector means you have to run a debug build to raise log level — inconvenient for live troubleshooting, but aligned with the security-first posture. - How: (1)
CLAUDE.md+README.mdupdated to[0,200]. (2)// SYNC:block moved above the///doc so rustdoc renders one coherent paragraph. (3)check_accountvisibility narrowed topub(crate)with aPreconditionparagraph explaining the username contract. (4) Debug-level selection wrapped in#[cfg(debug_assertions)]; release builds always run atLevelFilter::Info. (5) Added a comment block inauthenticate()documenting whypam_setcredis deliberately absent and where it would hook in if needed. (6)lto = "fat"inCargo.toml.
2026-04-24 – Audit MEDIUM fixes: D-Bus cleanup race, TOCTOU open, FP reset, GTK entry clear (v0.6.11)
- Who: ClaudeCode, Dom
- Why: Second round after the HIGH fixes, addressing the four MEDIUM findings. (1)
cleanup_dbusspawned VerifyStop + Release as fire-and-forget, thenresume_asynccalled Claim after only a 2 s timeout — shorter than the 3 s D-Bus timeout, so on a slow bus the Claim could race the Release and fprintd would reject it, leaving the FP listener permanently dead. (2)load_background_texturerelied on the caller'ssymlink_metadatacheck, re-opening the path viagdk::Texture::from_file— a classic TOCTOU window. (3)resume_asyncunconditionally resetfailed_attempts, allowing an attacker with sensor control to evade the 10-attempt cap by cycling verify-match →check_accountfail → resume. (4) The GTKPasswordEntrybuffer was only cleared on timeout or auth failure, leaving the password in GLib malloc'd memory longer than necessary. - Tradeoffs: The D-Bus cleanup is now split into a synchronous helper (
take_cleanup_proxy— signal disconnect + flag clear) and an async helper (perform_dbus_cleanup— VerifyStop + Release), soresume_asynccan await the release whilestop()stays fire-and-forget. Dropping thefailed_attemptsreset means a flaky sensor could reach 10 failures faster, but the correct remedy is a new lock session (construction) rather than a reset that also helps attackers. - How: (1) Split
cleanup_dbusintotake_cleanup_proxy()(sync) +perform_dbus_cleanup(proxy)(async).resume_asyncnow awaitsperform_dbus_cleanupbeforebegin_verification.stop()still spawns the cleanup fire-and-forget. (2)load_background_textureopens withO_NOFOLLOWviastd::fs::OpenOptions::custom_flags, reads to bytes, and builds the texture viagdk::Texture::from_bytes. (3) Removedlistener.borrow_mut().failed_attempts = 0fromresume_async. (4)password_entry.set_text("")now fires right after theZeroizing::new(entry.text().to_string())extraction, shortening the GTK-side window.
2026-04-24 – Audit fixes: RefCell borrow across await, async avatar decode
- Who: ClaudeCode, Dom
- Why: Triple audit found two HIGH issues. (1)
init_fingerprint_asyncheld aRefCellimmutable borrow acrossis_available_async().await— a concurrentconnect_monitorsignal (hotplug / suspend-resume) invokingborrow_mut()during the await would panic. (2)set_avatar_from_filedecoded avatars synchronously viaPixbuf::from_file_at_scale, blocking the GTK main thread inside theconnect_monitorhandler. WithMAX_AVATAR_FILE_SIZEat 10 MB the worst-case stall was 200–500 ms on monitor hotplug. - Tradeoffs: Avatar is shown as the symbolic default icon for a brief window while decoding completes. Wallpaper stays synchronous because
connect_monitorfires duringlock()and needs the texture already present (see 2026-04-09). - How: (1) Extract
usernameinto a localStringininit_fingerprint_async, drop the borrow before the await, re-borrow in a new scope after — no awaits inside the second borrow, so hotplug during signal setup is safe. (2)set_avatar_from_filenow usesgio::File::read_future+Pixbuf::from_stream_at_scale_futurefor async I/O and decode. The default icon is shown immediately; the decoded texture replaces it when ready.Pixbufitself is!Send, sogio::spawn_blockingdoes not apply — the GIO async stream loader keeps thePixbufon the main thread while the kernel does the I/O asynchronously.
2026-04-09 – Monitor hotplug via connect_monitor signal
- Who: ClaudeCode, Dom
- Why: moonlock crashed with segfault in libgtk-4.so after suspend/resume — HDMI monitor disconnect/reconnect invalidated GDK monitor objects, and the statically created windows referenced destroyed surfaces. Crash at consistent GTK4 offset (0x278 NULL dereference), 3x reproduced.
- Tradeoffs: Wallpaper texture now loaded before
lock()instead of after (connect_monitor fires during lock() and needs the texture). Local JPEG loading is fast enough that the delay is negligible. Shared state moved to Rc's for the signal closure — slightly more indirection but necessary for dynamic window creation. - How: (1) Bump gtk4-session-lock feature from
v1_1tov1_2to enableInstance::connect_monitor. (2) Replace manual monitor iteration withlock.connect_monitor()signal handler that creates windows on demand. (3) Signal fires once per existing monitor atlock()and again on hotplug. (4) Windows auto-unmap when their monitor disappears (ext-session-lock-v1 guarantee). (5) Fingerprint listener published to shared Rc so hotplugged monitors get FP labels.
2026-03-31 – Fourth audit: peek icon, blur limit, GResource compression, sync markers
- Who: ClaudeCode, Dom
- Why: Fourth triple audit found blur limit inconsistency (moonlock 0–100 vs moongreet/moonset 0–200), missing GResource compression, peek icon inconsistency, and duplicated code without sync markers.
- Tradeoffs: Peek icon enabled in lockscreen — user decision favoring UX consistency over shoulder-surfing protection. Acceptable for single-user desktop. Blur limit raised to 200 for ecosystem consistency.
- How: (1)
show_peek_icon(true)in lockscreen password entry. (2)clamp(0.0, 200.0)for blur in config.rs. (3)compressed="true"on CSS/SVG GResource entries. (4) SYNC comments on duplicated blur/background functions pointing to moongreet and moonset.
2026-03-30 – Third audit: blur offset, lock-before-IO, FP signal lifecycle, TOCTOU
- Who: ClaudeCode, Dom
- Why: Third triple audit (quality, performance, security) found: blur padding offset rendering texture at (0,0) instead of (-pad,-pad) causing edge darkening on left/top (BUG), wallpaper disk I/O blocking before lock() extending the unsecured window (PERF/SEC), signal handler duplication on resume_async (SEC), failed_attempts not reset on FP resume (SEC), unknown VerifyStatus with done=false hanging FP listener (SEC), TOCTOU in is_file+is_symlink checks (SEC), dead code in faillock_warning (QUALITY), unbounded blur sigma (SEC).
- Tradeoffs: Wallpaper loads after lock() — screen briefly shows without wallpaper until texture is ready. Acceptable: security > aesthetics. Blur sigma clamped to [0.0, 100.0] — arbitrary upper bound but prevents GPU memory exhaustion.
- How: (1) Texture offset to (-pad, -pad) in render_blurred_texture. (2) lock.lock() before resolve_background_path. (3) begin_verification disconnects old signal_id before registering new. (4) resume_async resets failed_attempts. (5) Unknown VerifyStatus with done=true triggers restart. (6) symlink_metadata() for atomic file+symlink check. (7) faillock_warning dead code removed, saturating_sub. (8) background_blur clamped. (9) Redundant Zeroizing<Vec> removed. (10) Default impl for FingerprintListener. (11) on_verify_status restricted to pub(crate). (12) Warn logging for non-UTF-8 GECOS and avatar paths.
2026-03-30 – Second audit: zeroize CString, FP account check, PAM timeout, blur downscale
- Who: ClaudeCode, Dom
- Why: Second triple audit (quality, performance, security) found: CString password copy not zeroized (HIGH), fingerprint unlock bypassing pam_acct_mgmt (MEDIUM), no PAM timeout leaving user locked out on hanging modules (MEDIUM), GPU blur on full wallpaper resolution (MEDIUM), no-monitor edge case doing
returninstead ofexit(1)(MEDIUM). - Tradeoffs: PAM timeout (30s) uses a generation counter to avoid stale result interference — adds complexity but prevents parallel PAM sessions. FP restart after failed account check re-claims the device, adding a D-Bus round-trip, but prevents permanent FP death on transient failures. Blur downscale to 1920px cap trades negligible quality for ~4x less GPU work on 4K wallpapers.
- How: (1)
Zeroizing<CString>wraps password in auth.rs,zeroize/stdfeature enabled. (2)check_account()calls pam_acct_mgmt after FP match;resume_async()restarts FP on transient failure. (3)auth_generationcounter invalidates stale PAM results; 30s timeout re-enables UI. (4)MAX_BLUR_DIMENSIONcaps blur input at 1920px, sigma scaled proportionally. (5)exit(1)on no-monitor afterlock.lock().
2026-03-28 – Remove embedded wallpaper from binary
- Who: ClaudeCode, Dom
- Why: Wallpaper is installed by moonarch to /usr/share/moonarch/wallpaper.jpg. Embedding a 374K JPEG in the binary is redundant. GTK background color (Catppuccin Mocha base) is a clean fallback.
- Tradeoffs: Without moonarch installed AND without config, lockscreen shows plain dark background instead of wallpaper. Acceptable — that's the expected minimal state.
- How: Remove wallpaper.jpg from GResources, return None from resolve_background_path when no file found, skip background picture creation when no texture available.
2026-03-28 – Audit-driven security and lifecycle fixes (v0.6.0)
- Who: ClaudeCode, Dom
- Why: Triple audit (quality, performance, security) revealed a critical D-Bus signal spoofing vector, fingerprint lifecycle bugs, and multi-monitor performance issues.
- Tradeoffs:
cleanup_dbus()extraction adds a method but clarifies the stop/match ownership;running_flag: Rc<Cell<bool>>adds a field but prevents race between async restart and stop; sender validation adds a check per signal but closes the only known auth bypass. - How: (1) Validate D-Bus VerifyStatus sender against fprintd's unique bus name. (2) Extract
cleanup_dbus()fromstop(), call it on verify-match. (3)Rc<Cell<bool>>running flag checked after await inrestart_verify_async. (4) Consistent 3s D-Bus timeouts. (5) Panic hook before logging. (6) Blur and avatar caches shared across monitors. (7) Peek icon disabled. (8) Symlink rejection for background_path. (9) TOML parse errors logged.
2026-03-28 – GPU blur via GskBlurNode replaces CPU blur
- Who: ClaudeCode, Dom
- Why: CPU-side Gaussian blur (
imagecrate) blocked the GTK main thread for 500ms–2s on 4K wallpapers at cold cache. Disk cache mitigated repeat starts but added ~100 lines of complexity. - Tradeoffs: GPU blur quality is slightly different (box-blur approximation vs true Gaussian), acceptable for wallpaper. Removes
imageanddirsdependencies entirely. No disk cache needed. - How:
Snapshot::push_blur()+GskRenderer::render_texture()onconnect_realize. Blur happens once on the GPU when the widget gets its renderer, producing a concretegdk::Texture. Zero startup latency.
2026-03-28 – Optional background blur via image crate (superseded)
- Who: ClaudeCode, Dom
- Why: Consistent with moonset/moongreet — blurred wallpaper as lockscreen background is a common UX pattern
- Tradeoffs: Adds
imagecrate dependency (~15 transitive crates); CPU-side Gaussian blur at load time adds startup latency proportional to image size and sigma. Acceptable because blur runs once and the texture is shared across monitors. - How:
load_background_texture(bg_path, blur_radius)loads texture, optionally appliesimageops::blur(), returnsgdk::Texture. Config optionbackground_blur: Option<f32>in TOML.
2026-03-28 – Shared wallpaper texture pattern (aligned with moonset/moongreet)
- Who: ClaudeCode, Dom
- Why: Previously loaded wallpaper per-window via
Picture::for_filename(). Multi-monitor setups decoded the JPEG redundantly. Blur feature requires texture pixel access anyway. - Tradeoffs: Slightly more code in main.rs (texture loaded before window creation), but avoids redundant decoding and enables the blur feature.
- How:
load_background_texture()in lockscreen.rs decodes once,create_background_picture()wraps sharedgdk::Textureingtk::Picture. Same pattern as moonset/moongreet.