Skip to content

Commit 508e45b

Browse files
committed
Document accessibility, VLM, run-history, OCR, and related features
1 parent e166298 commit 508e45b

6 files changed

Lines changed: 1006 additions & 71 deletions

File tree

README.md

Lines changed: 257 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -22,11 +22,20 @@
2222
- [Mouse Control](#mouse-control)
2323
- [Keyboard Control](#keyboard-control)
2424
- [Image Recognition](#image-recognition)
25+
- [Accessibility Element Finder](#accessibility-element-finder)
26+
- [AI Element Locator (VLM)](#ai-element-locator-vlm)
27+
- [OCR (Text on Screen)](#ocr-text-on-screen)
28+
- [Clipboard](#clipboard)
2529
- [Screen Operations](#screen-operations)
2630
- [Action Recording & Playback](#action-recording--playback)
2731
- [Action Scripting (JSON Executor)](#action-scripting-json-executor)
32+
- [Scheduler (Interval & Cron)](#scheduler-interval--cron)
33+
- [Global Hotkey Daemon](#global-hotkey-daemon)
34+
- [Event Triggers](#event-triggers)
35+
- [Run History](#run-history)
2836
- [Report Generation](#report-generation)
29-
- [Remote Automation (Socket Server)](#remote-automation-socket-server)
37+
- [Remote Automation (Socket / REST)](#remote-automation-socket--rest)
38+
- [Plugin Loader](#plugin-loader)
3039
- [Shell Command Execution](#shell-command-execution)
3140
- [Screen Recording](#screen-recording)
3241
- [Callback Executor](#callback-executor)
@@ -46,17 +55,27 @@
4655
- **Mouse Automation** — move, click, press, release, drag, and scroll with precise coordinate control
4756
- **Keyboard Automation** — press/release individual keys, type strings, hotkey combinations, key state detection
4857
- **Image Recognition** — locate UI elements on screen using OpenCV template matching with configurable threshold
58+
- **Accessibility Element Finder** — query the OS accessibility tree (Windows UIA / macOS AX) to locate buttons, menus, and controls by name/role
59+
- **AI Element Locator (VLM)** — describe a UI element in plain language and let a vision-language model (Anthropic / OpenAI) find its screen coordinates
60+
- **OCR** — extract text from screen regions using Tesseract; wait for, click, or locate rendered text
61+
- **Clipboard** — read/write system clipboard text on Windows, macOS, and Linux
4962
- **Screenshot & Screen Recording** — capture full screen or regions as images, record screen to video (AVI/MP4)
5063
- **Action Recording & Playback** — record mouse/keyboard events and replay them
51-
- **JSON-Based Action Scripting** — define and execute automation flows using JSON action files
64+
- **JSON-Based Action Scripting** — define and execute automation flows using JSON action files (dry-run + step debug)
65+
- **Scheduler** — run scripts on an interval or cron expression; jobs persist across restarts
66+
- **Global Hotkey Daemon** — bind OS-level hotkeys to action scripts (Windows today; macOS/Linux stubs in place)
67+
- **Event Triggers** — fire scripts when an image appears, a window opens, a pixel changes, or a file is modified
68+
- **Run History** — SQLite-backed run log across scheduler / triggers / hotkeys / REST with auto error-screenshot artifacts
5269
- **Report Generation** — export test records as HTML, JSON, or XML reports with success/failure status
53-
- **Remote Automation** — start a TCP socket server to receive and execute automation commands from remote clients
70+
- **Remote Automation** — TCP socket server **and** REST API server to receive automation commands
71+
- **Plugin Loader** — drop `.py` files exposing `AC_*` callables into a directory and register them as executor commands at runtime
5472
- **Shell Integration** — execute shell commands within automation workflows with async output capture
5573
- **Callback Executor** — trigger automation functions with callback hooks for chaining operations
5674
- **Dynamic Package Loading** — extend the executor at runtime by importing external Python packages
5775
- **Project & Template Management** — scaffold automation projects with keyword/executor directory structure
5876
- **Window Management** — send keyboard/mouse events directly to specific windows (Windows/Linux)
59-
- **GUI Application** — built-in PySide6 graphical interface for interactive automation
77+
- **GUI Application** — built-in PySide6 graphical interface with live language switching (English / 繁體中文 / 简体中文 / 日本語)
78+
- **CLI Runner**`python -m je_auto_control.cli run|list-jobs|start-server|start-rest`
6079
- **Cross-Platform** — unified API across Windows, macOS, and Linux (X11)
6180

6281
---
@@ -80,10 +99,23 @@ je_auto_control/
8099
├── executor/ # JSON action executor engine
81100
├── callback/ # Callback function executor
82101
├── cv2_utils/ # OpenCV screenshot, template matching, video recording
102+
├── accessibility/ # UIA (Windows) / AX (macOS) element finder
103+
├── vision/ # VLM-based locator (Anthropic / OpenAI backends)
104+
├── ocr/ # Tesseract-backed text locator
105+
├── clipboard/ # Cross-platform clipboard
106+
├── scheduler/ # Interval + cron scheduler
107+
├── hotkey/ # Global hotkey daemon
108+
├── triggers/ # Image/window/pixel/file triggers
109+
├── run_history/ # SQLite run log + error-screenshot artifacts
110+
├── rest_api/ # Stdlib HTTP/REST server
111+
├── plugin_loader/ # Dynamic AC_* plugin discovery
83112
├── socket_server/ # TCP socket server for remote automation
84113
├── shell_process/ # Shell command manager
85114
├── generate_report/ # HTML / JSON / XML report generators
86115
├── test_record/ # Test action recording
116+
├── script_vars/ # Script variable interpolation
117+
├── watcher/ # Mouse / pixel / log watchers (Live HUD)
118+
├── recording_edit/ # Trim, filter, re-scale recorded actions
87119
├── json/ # JSON action file read/write
88120
├── project/ # Project scaffolding & templates
89121
├── package_manager/ # Dynamic package loading
@@ -135,6 +167,13 @@ sudo apt-get install cmake libssl-dev
135167
| `python-Xlib` | Linux X11 backend (auto-installed on Linux) |
136168
| `PySide6` | GUI application (optional, install with `[gui]`) |
137169
| `qt-material` | GUI theme (optional, install with `[gui]`) |
170+
| `uiautomation` | Windows accessibility backend (optional, loaded on demand) |
171+
| `pytesseract` + Tesseract | OCR engine (optional, loaded on demand) |
172+
| `anthropic` | VLM locator — Anthropic backend (optional, loaded on demand) |
173+
| `openai` | VLM locator — OpenAI backend (optional, loaded on demand) |
174+
175+
See [Third_Party_License.md](Third_Party_License.md) for a full list of
176+
third-party components and their licenses.
138177

139178
---
140179

@@ -197,6 +236,95 @@ print(f"Found at: ({cx}, {cy})")
197236
je_auto_control.locate_and_click("submit_button.png", mouse_keycode="mouse_left")
198237
```
199238

239+
### Accessibility Element Finder
240+
241+
Query the OS accessibility tree to locate controls by name, role, or app.
242+
Works on Windows (UIA, via `uiautomation`) and macOS (AX).
243+
244+
```python
245+
import je_auto_control
246+
247+
# List all visible buttons in the Calculator app
248+
elements = je_auto_control.list_accessibility_elements(app_name="Calculator")
249+
250+
# Find a specific element
251+
ok = je_auto_control.find_accessibility_element(name="OK", role="Button")
252+
if ok is not None:
253+
print(ok.bounds, ok.center)
254+
255+
# Click it directly
256+
je_auto_control.click_accessibility_element(name="OK", app_name="Calculator")
257+
```
258+
259+
Raises `AccessibilityNotAvailableError` if no accessibility backend is
260+
installed for the current platform.
261+
262+
### AI Element Locator (VLM)
263+
264+
When template matching and accessibility both fail, describe the element
265+
in plain language and let a vision-language model find its coordinates.
266+
267+
```python
268+
import je_auto_control
269+
270+
# Uses Anthropic by default if ANTHROPIC_API_KEY is set, else OpenAI.
271+
x, y = je_auto_control.locate_by_description("the green Submit button")
272+
273+
# Or click it in one shot
274+
je_auto_control.click_by_description(
275+
"the cookie-banner 'Accept all' button",
276+
screen_region=[0, 800, 1920, 1080], # optional crop
277+
)
278+
```
279+
280+
Configuration (environment variables only — keys are never persisted or
281+
logged):
282+
283+
| Variable | Effect |
284+
|---|---|
285+
| `ANTHROPIC_API_KEY` | Enables the Anthropic backend |
286+
| `OPENAI_API_KEY` | Enables the OpenAI backend |
287+
| `AUTOCONTROL_VLM_BACKEND` | `anthropic` or `openai` to force a backend |
288+
| `AUTOCONTROL_VLM_MODEL` | Override the default model (e.g. `claude-opus-4-7`, `gpt-4o-mini`) |
289+
290+
Raises `VLMNotAvailableError` if neither SDK is installed or no API key
291+
is set.
292+
293+
### OCR (Text on Screen)
294+
295+
```python
296+
import je_auto_control as ac
297+
298+
# Locate all matches of a piece of text
299+
matches = ac.find_text_matches("Submit")
300+
301+
# Center of the first match, or None
302+
cx, cy = ac.locate_text_center("Submit")
303+
304+
# Click text in one call
305+
ac.click_text("Submit")
306+
307+
# Block until text appears (or timeout)
308+
ac.wait_for_text("Loading complete", timeout=15.0)
309+
```
310+
311+
If Tesseract is not on `PATH`, point at it explicitly:
312+
313+
```python
314+
ac.set_tesseract_cmd(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
315+
```
316+
317+
### Clipboard
318+
319+
```python
320+
import je_auto_control as ac
321+
ac.set_clipboard("hello")
322+
text = ac.get_clipboard()
323+
```
324+
325+
Backends: Windows (Win32 via `ctypes`), macOS (`pbcopy`/`pbpaste`),
326+
Linux (`xclip` or `xsel`).
327+
200328
### Screenshot
201329

202330
```python
@@ -270,13 +398,84 @@ je_auto_control.execute_action([
270398
| Keyboard | `AC_type_keyboard`, `AC_press_keyboard_key`, `AC_release_keyboard_key`, `AC_write`, `AC_hotkey`, `AC_check_key_is_press` |
271399
| Image | `AC_locate_all_image`, `AC_locate_image_center`, `AC_locate_and_click` |
272400
| Screen | `AC_screen_size`, `AC_screenshot` |
401+
| Accessibility | `AC_a11y_list`, `AC_a11y_find`, `AC_a11y_click` |
402+
| VLM (AI Locator) | `AC_vlm_locate`, `AC_vlm_click` |
403+
| OCR | `AC_locate_text`, `AC_click_text`, `AC_wait_text` |
404+
| Clipboard | `AC_clipboard_get`, `AC_clipboard_set` |
273405
| Record | `AC_record`, `AC_stop_record` |
274406
| Report | `AC_generate_html`, `AC_generate_json`, `AC_generate_xml`, `AC_generate_html_report`, `AC_generate_json_report`, `AC_generate_xml_report` |
275407
| Project | `AC_create_project` |
276408
| Shell | `AC_shell_command` |
277409
| Process | `AC_execute_process` |
278410
| Executor | `AC_execute_action`, `AC_execute_files` |
279411

412+
### Scheduler (Interval & Cron)
413+
414+
```python
415+
import je_auto_control as ac
416+
417+
# Interval job — run every 30 seconds
418+
job = ac.default_scheduler.add_job(
419+
script_path="scripts/poll.json", interval_seconds=30, repeat=True,
420+
)
421+
422+
# Cron job — 09:00 on weekdays (minute hour dom month dow)
423+
cron_job = ac.default_scheduler.add_cron_job(
424+
script_path="scripts/daily.json", cron_expression="0 9 * * 1-5",
425+
)
426+
427+
ac.default_scheduler.start()
428+
```
429+
430+
Both flavours coexist; `job.is_cron` tells them apart.
431+
432+
### Global Hotkey Daemon
433+
434+
Bind OS-level hotkeys to action JSON scripts (Windows backend today;
435+
macOS / Linux raise `NotImplementedError` on `start()` with Strategy-
436+
pattern seams in place).
437+
438+
```python
439+
from je_auto_control import default_hotkey_daemon
440+
441+
default_hotkey_daemon.bind("ctrl+alt+1", "scripts/greet.json")
442+
default_hotkey_daemon.start()
443+
```
444+
445+
### Event Triggers
446+
447+
Poll-based triggers that fire a script when a condition becomes true:
448+
449+
```python
450+
from je_auto_control import (
451+
default_trigger_engine, ImageAppearsTrigger,
452+
WindowAppearsTrigger, PixelColorTrigger, FilePathTrigger,
453+
)
454+
455+
default_trigger_engine.add(ImageAppearsTrigger(
456+
trigger_id="", script_path="scripts/click_ok.json",
457+
image_path="templates/ok_button.png", threshold=0.85, repeat=True,
458+
))
459+
default_trigger_engine.start()
460+
```
461+
462+
### Run History
463+
464+
Every run from the scheduler, trigger engine, hotkey daemon, REST API,
465+
and manual GUI replay is recorded to `~/.je_auto_control/history.db`.
466+
Errors automatically attach a screenshot under
467+
`~/.je_auto_control/artifacts/run_{id}_{ms}.png` for post-mortem.
468+
469+
```python
470+
from je_auto_control import default_history_store
471+
472+
for run in default_history_store.list_runs(limit=20):
473+
print(run.id, run.source, run.status, run.artifact_path)
474+
```
475+
476+
The GUI **Run History** tab exposes filter/refresh/clear and
477+
double-click-to-open on the artifact column.
478+
280479
### Report Generation
281480

282481
```python
@@ -302,19 +501,24 @@ xml_string = je_auto_control.generate_xml()
302501

303502
Reports include: function name, parameters, timestamp, and exception info (if any) for each recorded action. HTML reports display successful actions in cyan and failed actions in red.
304503

305-
### Remote Automation (Socket Server)
504+
### Remote Automation (Socket / REST)
306505

307-
Start a TCP server to receive JSON automation commands from remote clients:
506+
Two servers are available — a raw TCP socket and a stdlib HTTP/REST
507+
server. Both default to `127.0.0.1`; binding to `0.0.0.0` is an explicit,
508+
documented opt-in.
308509

309510
```python
310-
import je_auto_control
511+
import je_auto_control as ac
311512

312-
# Start the server (default: localhost:9938)
313-
server = je_auto_control.start_autocontrol_socket_server(host="localhost", port=9938)
513+
# TCP socket server (default: 127.0.0.1:9938)
514+
ac.start_autocontrol_socket_server(host="127.0.0.1", port=9938)
314515

315-
# The server runs in a background thread
316-
# Send JSON action commands via TCP to execute remotely
317-
# Send "quit_server" to shut down
516+
# REST API server (default: 127.0.0.1:9939)
517+
ac.start_rest_api_server(host="127.0.0.1", port=9939)
518+
# Endpoints:
519+
# GET /health liveness probe
520+
# GET /jobs scheduler job list
521+
# POST /execute body: {"actions": [...]}
318522
```
319523

320524
Client example:
@@ -339,6 +543,26 @@ print(response)
339543
sock.close()
340544
```
341545

546+
### Plugin Loader
547+
548+
Drop `.py` files defining top-level `AC_*` callables into a directory,
549+
then register them as executor commands at runtime:
550+
551+
```python
552+
from je_auto_control import (
553+
load_plugin_directory, register_plugin_commands,
554+
)
555+
556+
commands = load_plugin_directory("./my_plugins")
557+
register_plugin_commands(commands)
558+
559+
# Now usable from any JSON action script:
560+
# [["AC_greet", {"name": "world"}]]
561+
```
562+
563+
> **Warning:** Plugin files execute arbitrary Python on load. Only load
564+
> from directories you control.
565+
342566
### Shell Command Execution
343567

344568
```python
@@ -497,6 +721,24 @@ python -m je_auto_control --execute_str '[["AC_screenshot", {"file_path": "test.
497721
python -m je_auto_control -c ./my_project
498722
```
499723

724+
A richer subcommand CLI built on the headless APIs:
725+
726+
```bash
727+
# Run a script, optionally with variables, and/or a dry-run
728+
python -m je_auto_control.cli run script.json
729+
python -m je_auto_control.cli run script.json --var name=alice --dry-run
730+
731+
# List scheduler jobs
732+
python -m je_auto_control.cli list-jobs
733+
734+
# Start the socket or REST server
735+
python -m je_auto_control.cli start-server --port 9938
736+
python -m je_auto_control.cli start-rest --port 9939
737+
```
738+
739+
`--var name=value` is parsed as JSON when possible (so `count=10` becomes
740+
an int), otherwise treated as a string.
741+
500742
---
501743

502744
## Platform Support
@@ -541,4 +783,6 @@ python -m pytest test/integrated_test/
541783

542784
## License
543785

544-
[MIT License](LICENSE) © JE-Chen
786+
[MIT License](LICENSE) © JE-Chen.
787+
See [Third_Party_License.md](Third_Party_License.md) for the licenses of
788+
bundled and optional third-party dependencies.

0 commit comments

Comments
 (0)