lahfir/agent-desktop

AGENT DESKTOP

OBSERVE. DECIDE. ACT.

agent-desktop is a native desktop automation CLI designed for AI agents, built with Rust. It gives structured access to any application through OS accessibility trees — no screenshots, no pixel matching, no browser required.

Architecture

agent-desktop architecture diagram

agent-desktop real-world example — Slack accessibility tree with 97% token savings

Key Features

  • Native Rust CLI: Fast, single binary, no runtime dependencies
  • C-ABI cdylib (libagent_desktop_ffi): Load once from Python / Swift / Go / Ruby / Node / C instead of forking the CLI per call
  • 54 commands: Observation, interaction, keyboard, mouse, notifications, clipboard, window management, plus a bundled skills doc loader
  • Progressive skeleton traversal: 78–96% token reduction on dense apps via shallow overview + targeted drill-down
  • Snapshot & refs: AI-optimized workflow using compact snapshot IDs and deterministic element references (@e1, @e2)
  • Headless-by-default interactions: Ref actions use accessibility APIs and block silent focus, cursor, keyboard, or pasteboard side effects
  • Structured JSON output: Machine-readable responses with error codes and recovery hints
  • Works with any app: Finder, Safari, System Settings, Xcode, Slack — anything with an accessibility tree

Installation

npm (recommended)

npm install -g agent-desktop        # downloads prebuilt binary automatically

Or without installing:

npx agent-desktop snapshot --app Finder -i

From source

git clone https://github.com/lahfir/agent-desktop
cd agent-desktop
cargo build --release
cp target/release/agent-desktop /usr/local/bin/

Requires Rust 1.78+ and macOS 13.0+.

Permissions

macOS requires Accessibility permission. Screenshots also require Screen Recording permission. Grant both in System Settings > Privacy & Security by adding the app that launches agent-desktop, or trigger the request from the CLI:

agent-desktop permissions --request   # trigger platform permission request path

Permission fields are explicit objects, for example:

{
  "accessibility": { "state": "granted" },
  "screen_recording": { "state": "denied", "suggestion": "Grant Screen Recording permission" },
  "automation": { "state": "not_required" }
}
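
A caller can use these objects to decide whether to proceed. The following is a minimal Python sketch, not part of agent-desktop: the helper name `missing_permissions` is hypothetical, and it assumes the permission object has already been extracted from the command output. Only the three states shown above (`granted`, `denied`, `not_required`) are taken from this README; other states may exist.

```python
import json


def missing_permissions(report: dict) -> dict:
    """Return {permission_name: suggestion} for every entry whose state is 'denied'."""
    missing = {}
    for name, info in report.items():
        if info.get("state") == "denied":
            # 'suggestion' is only shown on denied entries in the README sample.
            missing[name] = info.get("suggestion", "")
    return missing


# Sample copied from the README example above.
sample = json.loads("""
{
  "accessibility": { "state": "granted" },
  "screen_recording": { "state": "denied", "suggestion": "Grant Screen Recording permission" },
  "automation": { "state": "not_required" }
}
""")

for name, hint in missing_permissions(sample).items():
    print(f"{name}: {hint}")
```

Treating only `denied` as blocking keeps `not_required` from being reported as a problem.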

Language bindings (FFI)

Every GitHub Release ships a prebuilt C-ABI cdylib alongside the CLI tarballs. Hosts that need in-process calls (Python agents, Swift apps, Go services, Node tools, Ruby scripts, C/C++ code) dlopen the dylib and call the functions declared in agent_desktop.h — no fork-exec per command.

Platform               Artifact
macOS arm64            agent-desktop-ffi-v<ver>-aarch64-apple-darwin.tar.gz
macOS x86_64           agent-desktop-ffi-v<ver>-x86_64-apple-darwin.tar.gz
Linux x86_64 (glibc)   agent-desktop-ffi-v<ver>-x86_64-unknown-linux-gnu.tar.gz
Linux arm64 (glibc)    agent-desktop-ffi-v<ver>-aarch64-unknown-linux-gnu.tar.gz
Windows x86_64 (MSVC)  agent-desktop-ffi-v<ver>-x86_64-pc-windows-msvc.zip
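
An installer script can derive the artifact name from the host platform. The following Python sketch mirrors the table above; the function name `ffi_artifact_name` is hypothetical, the version string in the usage lines is a placeholder, and the `platform.machine()` values used as keys are the ones commonly reported on these systems (an assumption worth checking on your hosts). Linux builds are glibc only, so musl-based systems are not covered.

```python
import platform

# (platform.system(), platform.machine()) -> (target triple, archive extension)
_TARGETS = {
    ("Darwin", "arm64"): ("aarch64-apple-darwin", "tar.gz"),
    ("Darwin", "x86_64"): ("x86_64-apple-darwin", "tar.gz"),
    ("Linux", "x86_64"): ("x86_64-unknown-linux-gnu", "tar.gz"),
    ("Linux", "aarch64"): ("aarch64-unknown-linux-gnu", "tar.gz"),
    ("Windows", "AMD64"): ("x86_64-pc-windows-msvc", "zip"),
}


def ffi_artifact_name(version, system=None, machine=None):
    """Build the release archive name for a platform (defaults to the current host)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        triple, ext = _TARGETS[(system, machine)]
    except KeyError:
        raise ValueError(f"no prebuilt FFI artifact for {system}/{machine}") from None
    return f"agent-desktop-ffi-v{version}-{triple}.{ext}"


# "1.2.3" is a placeholder version, not a real release number.
print(ffi_artifact_name("1.2.3", "Darwin", "arm64"))
print(ffi_artifact_name("1.2.3", "Windows", "AMD64"))
```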

Each archive contains lib/libagent_desktop_ffi.{dylib,so,dll}, include/agent_desktop.h, LICENSE, and a short README. Verify the download with the release's checksums.txt:

shasum -a 256 -c checksums.txt
gh attestation verify agent-desktop-ffi-v*.tar.gz --repo lahfir/agent-desktop   # Sigstore provenance

Minimal Python round-trip:

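import ctypes

# Load the shared library from the unpacked release archive (macOS name shown).
lib = ctypes.CDLL("./lib/libagent_desktop_ffi.dylib")
# Declare pointer types explicitly: by default ctypes passes Python ints as
# C int, which would truncate a 64-bit adapter handle.
lib.ad_adapter_create.restype = ctypes.c_void_p
lib.ad_adapter_destroy.argtypes = [ctypes.c_void_p]
adapter = lib.ad_adapter_create()
# ... call ad_list_apps / ad_get_tree / ad_execute_action, see docs below
lib.ad_adapter_destroy(adapter)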

The full consumer guide is in skills/agent-desktop-ffi/. It covers the error-handling contract, ownership rules, threading constraints, and every entrypoint with its Safety docs.

Core Workflow for AI

For dense apps (Slack, VS Code, Notion), use progressive skeleton traversal to minimize token usage:

# 1. Shallow overview — depth-3 map, truncated containers show children_count
agent-desktop snapshot --skeleton --app Slack -i --compact
# Keep snapshot_id, for example s8f3k2p9

# 2. Drill into a region of interest (named containers get refs as drill targets)
agent-desktop snapshot --root @e3 --snapshot s8f3k2p9 -i --compact

# 3. Act on an element found in the drill-down
agent-desktop click @e12 --snapshot s8f3k2p9

# 4. Re-drill the same region to verify the state change
agent-desktop snapshot --root @e3 --snapshot s8f3k2p9 -i --compact

For simple apps, a full snapshot is fine:

agent-desktop snapshot --app Finder -i   # get interactive elements with refs and snapshot_id
agent-desktop click @e3 --snapshot s8f3k2p9  # click a button by ref
agent-desktop type @e5 --snapshot s8f3k2p9 "quarterly report"  # insert text into a field
agent-desktop press cmd+s               # keyboard shortcut
agent-desktop snapshot -i               # re-observe after UI changes

Agent loop: snapshot → decide → act → snapshot → decide → act → ...
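
The loop can be driven from a host program. Below is a minimal Python sketch, assuming the CLI is on PATH and prints one JSON document to stdout. The names `run_cli`, `agent_loop`, and `decide` are hypothetical, and the shape of the snapshot payload is not specified in this README, so `decide` simply receives the whole parsed response.

```python
import json
import subprocess


def run_cli(*args):
    """Run agent-desktop with the given arguments and parse its JSON output."""
    proc = subprocess.run(["agent-desktop", *args], capture_output=True, text=True)
    return json.loads(proc.stdout)


def agent_loop(decide, max_steps=10, run=run_cli):
    """snapshot -> decide -> act, repeated until decide() returns None.

    decide(observation) returns a tuple of CLI arguments such as
    ("click", "@e3"), or None to stop. Returns the list of action results.
    """
    history = []
    for _ in range(max_steps):
        observation = run("snapshot", "-i")  # observe
        action = decide(observation)         # decide
        if action is None:
            break
        history.append(run(*action))         # act
    return history
```

Injecting `run` keeps the loop testable without a desktop session, and `max_steps` bounds a policy that never decides to stop.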

Commands

Observation

agent-desktop snapshot --app Safari -i           # accessibility tree with refs
agent-desktop snapshot --surface menu            # capture open menu
agent-desktop screenshot --app Finder            # PNG screenshot
agent-desktop find --role button --app TextEdit  # search by role, name, value, text
agent-desktop get @e3 --snapshot s8f3k2p9 --property value  # read element property
agent-desktop is @e7 --snapshot s8f3k2p9 --property checked # check boolean state
agent-desktop list-surfaces --app Notes          # list menus, sheets, popovers, alerts

Interaction

agent-desktop click @e3                  # semantic AX-first click
agent-desktop double-click @e3           # AXOpen; physical double-click uses mouse-click --count 2
agent-desktop triple-click @e3           # POLICY_DENIED if physical input is disabled
agent-desktop right-click @e3            # open verified context menu
agent-desktop type @e5 "hello world"     # insert text into element
agent-desktop set-value @e5 "new value"  # set value directly via AX
agent-desktop clear @e5                  # clear element value
agent-desktop focus @e5                  # set keyboard focus
agent-desktop select @e9 "Option B"      # select verified dropdown/list option
agent-desktop toggle @e12                # flip checkbox or switch
agent-desktop check @e12                 # idempotent check
agent-desktop uncheck @e12               # idempotent uncheck
agent-desktop expand @e15                # expand disclosure/tree item
agent-desktop collapse @e15              # collapse disclosure/tree item
agent-desktop scroll @e1 --direction down --amount 3  # scroll (AX-first)
agent-desktop scroll-to @e20             # scroll element into view

Keyboard

agent-desktop press cmd+s               # key combo
agent-desktop press cmd+shift+z          # multi-modifier
agent-desktop press escape               # single key
agent-desktop key-down shift             # hold key
agent-desktop key-up shift               # release key

Mouse

agent-desktop hover @e3                  # move cursor to element
agent-desktop hover --xy 500,300         # move cursor to coordinates
agent-desktop drag @e3 --to @e8          # drag between elements
agent-desktop drag --xy 100,200 --to-xy 400,200  # drag between coordinates
agent-desktop mouse-click --xy 500,300   # click at coordinates
agent-desktop mouse-down --xy 500,300    # press at coordinates
agent-desktop mouse-up --xy 500,300      # release at coordinates

App & Window Management

agent-desktop launch Safari              # launch app by name
agent-desktop launch com.apple.Safari    # launch by bundle ID
agent-desktop close-app Safari           # quit app
agent-desktop close-app Safari --force   # force quit (SIGKILL)
agent-desktop list-apps                  # list running GUI apps
agent-desktop list-windows               # list visible windows
agent-desktop list-windows --app Finder  # windows for specific app
agent-desktop focus-window w-4521        # bring window to front
agent-desktop resize-window w-4521 800 600  # resize
agent-desktop move-window w-4521 100 100    # move
agent-desktop minimize w-4521            # minimize
agent-desktop maximize w-4521            # maximize
agent-desktop restore w-4521             # restore

Notifications (macOS only)

agent-desktop list-notifications                       # list all notifications
agent-desktop list-notifications --app "Slack"         # filter by app
agent-desktop list-notifications --text "deploy" --limit 5  # filter by text
agent-desktop dismiss-notification 1                   # dismiss by index
agent-desktop dismiss-all-notifications                # dismiss all
agent-desktop dismiss-all-notifications --app "Slack"  # dismiss all from app
agent-desktop notification-action 1 --action "Reply"   # click action button

Clipboard

agent-desktop clipboard-get              # read clipboard text
agent-desktop clipboard-set "copied"     # write to clipboard
agent-desktop clipboard-clear            # clear clipboard

Wait

agent-desktop wait 500                                       # sleep 500ms
agent-desktop wait --element @e3 --timeout 5000              # wait for element
agent-desktop wait --window "Save" --timeout 10000           # wait for window
agent-desktop wait --text "Loading complete" --app Safari    # wait for text
agent-desktop wait --menu --timeout 3000                     # wait for menu

Batch

agent-desktop batch '[
  {"command": "click", "args": {"ref_id": "@e2", "snapshot": "<snapshot_id>"}},
  {"command": "type", "args": {"ref_id": "@e5", "snapshot": "<snapshot_id>", "text": "hello"}},
  {"command": "press", "args": {"combo": "return"}}
]' --stop-on-error
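
When the batch is generated by a program, building the JSON with a serializer avoids shell-quoting mistakes. This Python sketch reproduces the payload above; `batch_step` is a hypothetical helper, and the snapshot ID is the example value used elsewhere in this README.

```python
import json


def batch_step(command, **args):
    """One batch entry in the {"command": ..., "args": {...}} shape shown above."""
    return {"command": command, "args": args}


snapshot_id = "s8f3k2p9"  # example ID; use the one returned by your snapshot
steps = [
    batch_step("click", ref_id="@e2", snapshot=snapshot_id),
    batch_step("type", ref_id="@e5", snapshot=snapshot_id, text="hello"),
    batch_step("press", combo="return"),
]

# Passing an argument list (not a shell string) means the JSON needs no quoting.
argv = ["agent-desktop", "batch", json.dumps(steps), "--stop-on-error"]
print(argv[2])
```

The resulting `argv` can be handed to `subprocess.run(argv, capture_output=True, text=True)`.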

System

agent-desktop status                     # platform, permission report, latest snapshot
agent-desktop permissions                # check accessibility/screen-recording/automation
agent-desktop permissions --request      # invoke platform request path
agent-desktop version                    # version string

Snapshot Options

agent-desktop snapshot [OPTIONS]

Flag                      Default      Description
--app <NAME>              focused app  Filter to a specific application
--window-id <ID>          -            Filter to a specific window
-i / --interactive-only   off          Only include interactive elements
--compact                 off          Omit empty structural nodes
--include-bounds          off          Include pixel bounds (x, y, width, height)
--max-depth <N>           10           Maximum tree depth
--skeleton                off          Shallow 3-level overview; truncated containers show children_count and get refs as drill targets
--root <REF>              -            Start traversal from this ref; merges into existing refmap with scoped invalidation
--snapshot <snapshot_id>  latest       Snapshot ID to use when resolving --root
--surface <TYPE>          window       window, focused, menu, menubar, sheet, popover, alert
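
A wrapper can map these options to an argument list. The Python sketch below uses only the flags listed above; the function name `snapshot_args` is hypothetical, and it assumes the CLI accepts flags in any order.

```python
def snapshot_args(app=None, window_id=None, interactive_only=False, compact=False,
                  include_bounds=False, max_depth=None, skeleton=False,
                  root=None, snapshot=None, surface=None):
    """Translate keyword options into an agent-desktop snapshot argument list."""
    args = ["agent-desktop", "snapshot"]
    if app is not None:
        args += ["--app", app]
    if window_id is not None:
        args += ["--window-id", str(window_id)]
    if interactive_only:
        args.append("-i")
    if compact:
        args.append("--compact")
    if include_bounds:
        args.append("--include-bounds")
    if max_depth is not None:
        args += ["--max-depth", str(max_depth)]
    if skeleton:
        args.append("--skeleton")
    if root is not None:
        args += ["--root", root]
    if snapshot is not None:
        args += ["--snapshot", snapshot]
    if surface is not None:
        args += ["--surface", surface]
    return args


# Skeleton overview, then a drill-down into @e3 of the same snapshot.
overview = snapshot_args(app="Slack", skeleton=True, interactive_only=True, compact=True)
drill = snapshot_args(root="@e3", snapshot="s8f3k2p9", interactive_only=True, compact=True)
print(" ".join(overview))
print(" ".join(drill))
```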

JSON Output

Every command returns structured JSON:

{
  "version": "2.0",
  "ok": true,
  "command": "click",
  "data": { "action": "click" }
}

Errors include machine-readable codes and recovery hints:

{
  "version": "2.0",
  "ok": false,
  "command": "click",
  "error": {
    "code": "STALE_REF",
    "message": "Element at @e7 no longer matches the last snapshot",
    "suggestion": "Run 'snapshot' to refresh refs, then retry"
  }
}

Error Codes

Code                    Meaning
PERM_DENIED             Accessibility permission not granted
ELEMENT_NOT_FOUND       No element matched the ref or query
APP_NOT_FOUND           Application not running or no windows
STALE_REF               Ref is from a previous snapshot
SNAPSHOT_NOT_FOUND      Snapshot ID is missing or expired
POLICY_DENIED           Physical/headed path blocked by policy
ACTION_FAILED           The OS rejected the action
PLATFORM_NOT_SUPPORTED  Adapter method not implemented on this platform
TIMEOUT                 Wait condition expired
INVALID_ARGS            Invalid argument values

Exit Codes

0 success, 1 structured error (JSON on stdout), 2 argument parse error.
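
A host program can combine the exit code and the JSON envelope into one result-or-exception call. This is a minimal Python sketch; `AgentDesktopError` and `parse_result` are hypothetical names, and it assumes stdout is not guaranteed to hold JSON when the exit code is 2, since only exit code 1 is documented as writing JSON to stdout.

```python
import json


class AgentDesktopError(Exception):
    """Structured error carrying the machine-readable code and recovery hint."""

    def __init__(self, code, message, suggestion=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.suggestion = suggestion


def parse_result(exit_code, stdout):
    """Return the 'data' payload on success, raise on any failure."""
    if exit_code == 2:
        # Argument parse error: do not try to decode stdout as an envelope.
        raise ValueError("agent-desktop rejected the arguments (exit code 2)")
    envelope = json.loads(stdout)
    if exit_code == 0 and envelope.get("ok"):
        return envelope.get("data")
    err = envelope.get("error", {})
    raise AgentDesktopError(err.get("code", "UNKNOWN"),
                            err.get("message", ""),
                            err.get("suggestion"))


# Success envelope from the JSON Output section.
ok_out = '{"version": "2.0", "ok": true, "command": "click", "data": {"action": "click"}}'
print(parse_result(0, ok_out))
```

Callers can then branch on `err.code` (for example `STALE_REF` or `TIMEOUT`) instead of matching message text.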

Ref System

snapshot assigns refs to interactive elements in depth-first order: @e1, @e2, @e3, etc. Refs are scoped to a compact snapshot_id such as s8f3k2p9. Commands can omit --snapshot to use the latest snapshot pointer, but passing the ID is more deterministic in multi-step flows.

Interactive roles that receive refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, incrementor, menubutton, switch, colorwell, dockitem.

Static elements (labels, groups, containers) appear in the tree for context but have no ref.
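
The numbering rule can be illustrated with a toy model. The Python sketch below is not the tool's implementation: the node shape (`role`, `children`) and the function `assign_refs` are hypothetical, it assumes pre-order depth-first traversal, and it does not model skeleton mode, where named containers also receive refs.

```python
# Roles listed above as receiving refs.
INTERACTIVE_ROLES = {
    "button", "textfield", "checkbox", "link", "menuitem", "tab", "slider",
    "combobox", "treeitem", "cell", "radiobutton", "incrementor",
    "menubutton", "switch", "colorwell", "dockitem",
}


def assign_refs(root):
    """Walk the tree depth-first and give each interactive node the next @eN ref."""
    refs = {}

    def visit(node):
        if node.get("role") in INTERACTIVE_ROLES:
            ref = f"@e{len(refs) + 1}"
            node["ref"] = ref
            refs[ref] = node
        for child in node.get("children", []):
            visit(child)

    visit(root)
    return refs


tree = {"role": "window", "children": [
    {"role": "button", "name": "Save"},
    {"role": "group", "children": [
        {"role": "textfield", "name": "Title"},
        {"role": "label", "name": "Hint"},   # static: stays in the tree, no ref
    ]},
    {"role": "checkbox", "name": "Draft"},
]}

refs = assign_refs(tree)
print({ref: node["name"] for ref, node in refs.items()})
```

In this model the same tree always yields the same refs, which is what makes them deterministic; any change to the tree can shift the numbering, which is why refs are tied to a snapshot ID.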

Stale ref recovery:

snapshot → act → STALE_REF? → snapshot again → retry
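
The recovery flow can be wrapped in a small retry helper. This Python sketch is an illustration under assumptions: `act_with_stale_retry`, `take_snapshot`, and `act` are hypothetical names, and results are assumed to be parsed JSON envelopes as shown under JSON Output. Because refs may be renumbered by a new snapshot, `act` receives the fresh snapshot and is expected to look up its target again.

```python
def is_stale(result):
    """True when the envelope reports a STALE_REF error."""
    return (not result.get("ok")) and result.get("error", {}).get("code") == "STALE_REF"


def act_with_stale_retry(take_snapshot, act, max_retries=1):
    """Take a snapshot, act on it, and re-snapshot and retry while the ref is stale."""
    snapshot = take_snapshot()
    result = act(snapshot)
    retries = 0
    while is_stale(result) and retries < max_retries:
        snapshot = take_snapshot()   # refresh refs
        result = act(snapshot)       # retry against the fresh snapshot
        retries += 1
    return result
```

Errors other than `STALE_REF` are returned unchanged, so the caller still sees codes such as `ELEMENT_NOT_FOUND` or `ACTION_FAILED`.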

Platform Support

Feature                  macOS  Windows  Linux
Accessibility tree       Yes    Planned  Planned
Click / type / keyboard  Yes    Planned  Planned
Mouse input              Yes    Planned  Planned
Screenshot               Yes    Planned  Planned
Clipboard                Yes    Planned  Planned
App & window management  Yes    Planned  Planned
Notifications            Yes    Planned  Planned

Development

cargo build                               # debug build
cargo build --release                     # optimized (<15MB)
cargo test --lib --workspace              # run tests
cargo clippy --all-targets -- -D warnings # lint (must pass with zero warnings)

License

Apache-2.0
