Commit 8bee474

Merge pull request #181 from Integration-Automation/dev
Add OCR ext, runtime vars, LLM planner, remote desktop, plus matching GUI
2 parents 4162327 + 33240af commit 8bee474

58 files changed

Lines changed: 8746 additions & 50 deletions

.idea/workspace.xml

Lines changed: 1 addition & 23 deletions

README.md

Lines changed: 207 additions & 3 deletions
@@ -24,6 +24,9 @@
2424
- [Accessibility Element Finder](#accessibility-element-finder)
2525
- [AI Element Locator (VLM)](#ai-element-locator-vlm)
2626
- [OCR (Text on Screen)](#ocr-text-on-screen)
27+
- [LLM Action Planner](#llm-action-planner)
28+
- [Runtime Variables & Control Flow](#runtime-variables--control-flow)
29+
- [Remote Desktop](#remote-desktop)
2730
- [Clipboard](#clipboard)
2831
- [Screenshot](#screenshot)
2932
- [Action Recording & Playback](#action-recording--playback)
@@ -57,7 +60,10 @@
5760
- **Image Recognition** — locate UI elements on screen using OpenCV template matching with configurable threshold
5861
- **Accessibility Element Finder** — query the OS accessibility tree (Windows UIA / macOS AX) to locate buttons, menus, and controls by name/role
5962
- **AI Element Locator (VLM)** — describe a UI element in plain language and let a vision-language model (Anthropic / OpenAI) find its screen coordinates
60-
- **OCR** — extract text from screen regions using Tesseract; wait for, click, or locate rendered text
63+
- **OCR** — extract text from screen regions using Tesseract; wait for, click, or locate rendered text; regex search and full-region dump
64+
- **LLM Action Planner** — translate a plain-language description into a validated `AC_*` action list using Claude
65+
- **Runtime Variables & Control Flow** — `${var}` substitution at execution time, plus `AC_set_var` / `AC_inc_var` / `AC_if_var` / `AC_for_each` / `AC_loop` / `AC_retry` for data-driven scripts
66+
- **Remote Desktop** — stream this machine's screen and accept remote input over a token-authenticated TCP protocol, *or* connect to another machine and view + control it (host + viewer GUIs included). Optional TLS (HTTPS-grade encryption), WebSocket transport (ws:// + wss:// for browser / firewall-friendly clients), persistent 9-digit Host ID, host→viewer audio streaming, bidirectional clipboard sync (text + image), and chunked file transfer (drag-drop + progress bar; arbitrary destination path; no size cap)
6167
- **Clipboard** — read/write system clipboard text on Windows, macOS, and Linux
6268
- **Screenshot & Screen Recording** — capture full screen or regions as images, record screen to video (AVI/MP4)
6369
- **Action Recording & Playback** — record mouse/keyboard events and replay them
@@ -408,6 +414,201 @@ If Tesseract is not on `PATH`, point at it explicitly:
408414
ac.set_tesseract_cmd(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
409415
```
410416

417+
Dump every recognised text record in a region (or full screen), or
418+
search by regex when the text varies:
419+
420+
```python
421+
import je_auto_control as ac
422+
423+
# Every hit in a region as TextMatch records (text, bounding box, confidence)
424+
for match in ac.read_text_in_region(region=[0, 0, 800, 600]):
425+
print(match.text, match.center, match.confidence)
426+
427+
# Regex — accepts a pattern string or a compiled re.Pattern
428+
for match in ac.find_text_regex(r"Order#\d+"):
429+
print(match.text, match.center)
430+
```
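
The regex search can be pictured as a filter over the records that the region dump returns. The following is a minimal sketch under stated assumptions: `Record` is a hypothetical stand-in for the library's `TextMatch` (whose real fields may differ), and `filter_by_regex` is an illustrative helper, not part of `je_auto_control`.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Pattern, Tuple, Union


@dataclass
class Record:
    """Hypothetical stand-in for a TextMatch record (not the library's class)."""
    text: str
    center: Tuple[int, int]
    confidence: float


def filter_by_regex(records: Iterable[Record],
                    pattern: Union[str, Pattern[str]]) -> List[Record]:
    """Keep records whose text contains a match; accepts str or compiled pattern."""
    compiled = re.compile(pattern) if isinstance(pattern, str) else pattern
    return [record for record in records if compiled.search(record.text)]


# Made-up OCR output, used only to exercise the helper.
records = [
    Record("Order#1042", (120, 40), 0.93),
    Record("Total: 19.99", (120, 80), 0.88),
]
hits = filter_by_regex(records, r"Order#\d+")
```

Accepting both a string and a compiled pattern lets callers reuse one precompiled pattern across many screen captures.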
431+
432+
GUI: **OCR Reader** tab.
433+
434+
### LLM Action Planner
435+
436+
Translate plain-language descriptions into validated `AC_*` action lists
437+
using an LLM (Anthropic Claude by default). Output is leniently parsed
438+
(strips code fences, extracts the first JSON array from prose) and then
439+
validated by the same schema the executor uses, so the result can be
440+
piped straight into `execute_action`:
441+
442+
```python
443+
import je_auto_control as ac
444+
from je_auto_control.utils.executor.action_executor import executor
445+
446+
actions = ac.plan_actions(
447+
"click the Submit button, then type 'done' and save",
448+
known_commands=executor.known_commands(),
449+
)
450+
executor.execute_action(actions)
451+
452+
# Or in a single call:
453+
ac.run_from_description("open Notepad and type hello", executor=executor)
454+
```
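
The lenient parsing step described above (strip code fences, then take the first JSON array found in the prose) could look roughly like this. It is a sketch, not the planner's actual code: `extract_action_list` is a hypothetical name, the sample reply is invented, and the `text` argument shown for `AC_click_text` is an assumed parameter name.

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this block readable


def extract_action_list(raw: str) -> List[Any]:
    """Strip Markdown fences, then decode the first JSON array in the text."""
    text = re.sub(re.escape(FENCE) + r"[\w-]*", "", raw)
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char != "[":
            continue
        try:
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # a stray "[" in the prose; keep scanning
        if isinstance(value, list):
            return value
    raise ValueError("no JSON array found in model output")


# Invented model reply: prose, a fenced JSON block, more prose.
reply = (
    "Sure, here is the plan:\n"
    + FENCE + "json\n"
    + '[["AC_click_text", {"text": "Submit"}]]\n'
    + FENCE + "\nLet me know if it works."
)
plan = extract_action_list(reply)
```

Schema validation would still run on `plan` afterwards; this step only recovers a candidate list from free-form model output.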
455+
456+
| Variable | Effect |
457+
|---|---|
458+
| `ANTHROPIC_API_KEY` | Enables the Anthropic backend |
459+
| `AUTOCONTROL_LLM_BACKEND` | Set to `anthropic` to force that backend |
460+
| `AUTOCONTROL_LLM_MODEL` | Override the default model (e.g. `claude-opus-4-7`) |
461+
462+
GUI: **LLM Planner** tab — description box, `QThread`-backed *Plan*
463+
button, action-list preview, and a *Run plan* button.
464+
465+
### Runtime Variables & Control Flow
466+
467+
The executor resolves `${var}` placeholders **per command call** rather
468+
than pre-flattening, so nested `body` / `then` / `else` lists keep their
469+
placeholders and re-bind on every iteration. Combined with new mutation
470+
commands, scripts can drive themselves from data without Python glue:
471+
472+
```json
473+
[
474+
["AC_set_var", {"name": "items", "value": ["alpha", "beta"]}],
475+
["AC_set_var", {"name": "i", "value": 0}],
476+
["AC_for_each", {
477+
"items": "${items}", "as": "name",
478+
"body": [
479+
["AC_inc_var", {"name": "i"}],
480+
["AC_if_var", {
481+
"name": "i", "op": "ge", "value": 2,
482+
"then": [["AC_break"]], "else": []
483+
}]
484+
]
485+
}]
486+
]
487+
```
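
To illustrate why per-call resolution matters, here is a small sketch of an executor loop that substitutes `${var}` each time a command runs. All names (`resolve`, `for_each`, the `demo_echo` command) are hypothetical, and the rule that a whole-string placeholder keeps the variable's type is an assumption inferred from `"items": "${items}"` in the script above.

```python
import re
from typing import Any, Callable, Dict, List

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve(value: Any, variables: Dict[str, Any]) -> Any:
    """Substitute ${var} in one value; a whole-string placeholder keeps its type."""
    if not isinstance(value, str):
        return value
    whole = _PLACEHOLDER.fullmatch(value)
    if whole:
        return variables[whole.group(1)]
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), value)


def for_each(items: Any, as_name: str, body: List[list],
             variables: Dict[str, Any],
             run: Callable[[str, Dict[str, Any]], None]) -> None:
    """Resolve the body's arguments on every iteration, not once up front."""
    for item in resolve(items, variables):
        variables[as_name] = item
        for command, args in body:
            run(command, {key: resolve(val, variables) for key, val in args.items()})


variables: Dict[str, Any] = {"items": ["alpha", "beta"]}
seen: List[str] = []
for_each(
    "${items}", "name",
    [["demo_echo", {"text": "hello ${name}"}]],  # demo_echo is a made-up command
    variables,
    lambda command, args: seen.append(args["text"]),
)
```

Had the body been flattened before the loop started, `${name}` would have been resolved once (or failed as undefined) instead of re-binding per item.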
488+
489+
`AC_if_var` operators: `eq`, `ne`, `lt`, `le`, `gt`, `ge`, `contains`,
490+
`startswith`, `endswith`. GUI: **Variables** tab — live view of
491+
`executor.variables` with single-set, JSON seed, and clear-all controls.
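
The operator names listed for `AC_if_var` map naturally onto a dispatch table. The sketch below is illustrative only: `OPERATORS` and `compare` are hypothetical names, and the direction of `contains` (the variable contains the given value) is an assumption.

```python
import operator
from typing import Any, Callable, Dict

OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    # Assumed direction: the variable (left) contains the value (right).
    "contains": lambda left, right: right in left,
    "startswith": lambda left, right: str(left).startswith(str(right)),
    "endswith": lambda left, right: str(left).endswith(str(right)),
}


def compare(variables: Dict[str, Any], name: str, op: str, value: Any) -> bool:
    """Evaluate `variables[name] <op> value`; unknown operators raise KeyError."""
    return bool(OPERATORS[op](variables[name], value))
```

With this table, the condition in the script above corresponds to `compare({"i": 2}, "i", "ge", 2)`.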
492+
493+
### Remote Desktop
494+
495+
Stream this machine's screen and accept remote input, **or** view and
496+
control another machine. The wire format is a length-prefixed framing
497+
on raw TCP (no extra deps), starting with an HMAC-SHA256
498+
challenge / response handshake; viewers that fail auth are dropped
499+
before they can see a frame. JPEG frames are produced at the configured
500+
FPS / quality and broadcast to authenticated viewers via a shared
501+
latest-frame slot, so a slow viewer drops frames instead of blocking
502+
the rest. Viewer input is JSON, validated against an allowlist, and
503+
applied through the existing wrappers.
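
The framing and handshake described above can be sketched with the standard library alone. The details here are assumptions, not the project's wire format: a 4-byte big-endian length prefix, a 32-byte random challenge, and the helper names `pack_frame`, `unpack_frame`, `sign`, and `verify`.

```python
import hashlib
import hmac
import os
import struct
from typing import Optional, Tuple


def pack_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length (assumed: 4 bytes, big-endian)."""
    return struct.pack(">I", len(payload)) + payload


def unpack_frame(buffer: bytes) -> Tuple[Optional[bytes], bytes]:
    """Return (payload, rest), or (None, buffer) while the frame is incomplete."""
    if len(buffer) < 4:
        return None, buffer
    (length,) = struct.unpack(">I", buffer[:4])
    if len(buffer) < 4 + length:
        return None, buffer
    return buffer[4:4 + length], buffer[4 + length:]


def make_challenge() -> bytes:
    """Host side: a fresh random nonce for each connection."""
    return os.urandom(32)


def sign(token: str, challenge: bytes) -> bytes:
    """Viewer side: prove knowledge of the token without sending it."""
    return hmac.new(token.encode("utf-8"), challenge, hashlib.sha256).digest()


def verify(token: str, challenge: bytes, response: bytes) -> bool:
    """Host side: constant-time comparison against the expected digest."""
    return hmac.compare_digest(sign(token, challenge), response)
```

Because the challenge is random per connection, a captured response cannot be replayed against a later handshake.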
504+
505+
```python
506+
# Be remoted — start a host and hand the token + port to whoever views you
507+
from je_auto_control import RemoteDesktopHost
508+
host = RemoteDesktopHost(token="hunter2", bind="127.0.0.1",
509+
port=0, fps=10, quality=70)
510+
host.start()
511+
print("listening on", host.port, "viewers:", host.connected_clients)
512+
```
513+
514+
```python
515+
# Control another machine — connect a viewer and send input
516+
from je_auto_control import RemoteDesktopViewer
517+
viewer = RemoteDesktopViewer(host="10.0.0.5", port=51234, token="hunter2",
518+
on_frame=lambda jpeg: ...)
519+
viewer.connect()
520+
viewer.send_input({"action": "mouse_move", "x": 100, "y": 200})
521+
viewer.send_input({"action": "type", "text": "hello"})
522+
viewer.disconnect()
523+
```
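
The "shared latest-frame slot" mentioned earlier is what lets a slow viewer skip frames rather than stall the others. One way to build such a slot is sketched below; `LatestFrameSlot` and its methods are hypothetical names, not the host's internals.

```python
import threading
from typing import Optional, Tuple


class LatestFrameSlot:
    """Holds only the newest frame; readers that fall behind skip the rest."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._frame: Optional[bytes] = None
        self._seq = 0

    def publish(self, frame: bytes) -> None:
        """Capture thread: overwrite the slot and wake every waiting sender."""
        with self._cond:
            self._frame = frame
            self._seq += 1
            self._cond.notify_all()

    def wait_newer(self, last_seq: int,
                   timeout: Optional[float] = None) -> Tuple[int, Optional[bytes]]:
        """Sender thread: block until a frame newer than last_seq exists."""
        with self._cond:
            self._cond.wait_for(lambda: self._seq > last_seq, timeout)
            return self._seq, self._frame


slot = LatestFrameSlot()
slot.publish(b"frame-1")
slot.publish(b"frame-2")  # frame-1 is overwritten, never queued
seq, frame = slot.wait_newer(0, timeout=0.1)
```

Each viewer's sender thread remembers the last sequence number it sent, so the capture thread never waits on any socket.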
524+
525+
GUI: **Remote Desktop** tab with two sub-tabs.
526+
527+
- **Host** — token field with a *Generate* button, security warning
528+
about the bind address, start / stop controls, refreshing port +
529+
viewer-count status, and a 4 fps preview pane below the controls so
530+
the user being remoted sees what viewers see.
531+
- **Viewer** — address / port / token form, *Connect* / *Disconnect*,
532+
and a custom frame-display widget that paints incoming JPEG frames
533+
scaled with `KeepAspectRatio`. Mouse / wheel / key events on the
534+
display are remapped from widget coordinates back to the remote
535+
screen's pixel space using the latest frame's dimensions, then
536+
forwarded as `INPUT` messages.
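
The coordinate remapping in the Viewer tab amounts to undoing the aspect-preserving scale. A minimal sketch, assuming the scaled frame is centred in the widget (letterboxed); `widget_to_remote` is a hypothetical helper, not the GUI's actual method.

```python
from typing import Optional, Tuple


def widget_to_remote(wx: float, wy: float,
                     widget_w: int, widget_h: int,
                     frame_w: int, frame_h: int) -> Optional[Tuple[int, int]]:
    """Map a widget-space point to remote pixels; None if it hits the border."""
    scale = min(widget_w / frame_w, widget_h / frame_h)  # aspect-preserving fit
    shown_w, shown_h = frame_w * scale, frame_h * scale
    offset_x = (widget_w - shown_w) / 2  # assumed: image centred in the widget
    offset_y = (widget_h - shown_h) / 2
    fx = (wx - offset_x) / scale
    fy = (wy - offset_y) / scale
    if not (0 <= fx < frame_w and 0 <= fy < frame_h):
        return None
    return int(fx), int(fy)
```

For a 1920x1080 frame shown in a 960x600 widget, the scale is 0.5 and a 30-pixel band is left above and below the image, so a click at the widget centre `(480, 300)` maps to `(960, 540)`.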
537+
538+
> ⚠️ Anyone with the host:port and token gets full mouse / keyboard
539+
> control of the host machine. Default bind is `127.0.0.1`; expose
540+
> externally only via SSH tunnel or TLS front-end. The token is the
541+
> only line of defence — treat it like a password.
542+
543+
**Encrypted transports + alternate protocols.** Pass an `ssl_context`
544+
to either `RemoteDesktopHost` or `RemoteDesktopViewer` to wrap every
545+
connection in TLS. For firewall-friendly access, use the in-tree
546+
WebSocket variants (no extra deps) — same protocol, RFC 6455 framing,
547+
and `wss://` if you also pass `ssl_context`:
548+
549+
```python
550+
from je_auto_control import (
551+
WebSocketDesktopHost, WebSocketDesktopViewer,
552+
)
553+
host = WebSocketDesktopHost(token="hunter2", ssl_context=server_ctx)
554+
viewer = WebSocketDesktopViewer(
555+
host="example.com", port=443, token="hunter2",
556+
ssl_context=client_ctx, expected_host_id="123456789",
557+
)
558+
```
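
The `server_ctx` and `client_ctx` objects in the snippet above are ordinary `ssl.SSLContext` instances. One plausible way to build them with the standard library is sketched here; the helper names and the certificate paths passed to them are hypothetical.

```python
import ssl
from typing import Optional


def make_server_context(certfile: str, keyfile: str) -> ssl.SSLContext:
    """Host side: present a certificate; require TLS 1.2 or newer."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx


def make_client_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Viewer side: verify the host certificate and its hostname."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


client_ctx = make_client_context()  # uses the system trust store
```

For a self-signed host certificate, passing that certificate as `cafile` keeps verification enabled, which is safer than switching `verify_mode` off.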
559+
560+
**Persistent Host ID.** Every host owns a stable 9-digit numeric ID
561+
(persisted at `~/.je_auto_control/remote_host_id`), announced in
562+
`AUTH_OK` and verifiable via the viewer's `expected_host_id`:
563+
564+
```python
565+
print(host.host_id) # e.g. "123456789"
566+
viewer = RemoteDesktopViewer(
567+
host=..., port=..., token=...,
568+
expected_host_id="123456789", # AuthenticationError on mismatch
569+
)
570+
```
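
A persistent ID of this kind is typically generated once and then re-read on every start. The sketch below shows that pattern under assumptions: `load_or_create_host_id` is a hypothetical helper, and the real host may generate or validate its ID differently.

```python
import secrets
import tempfile
from pathlib import Path


def load_or_create_host_id(path: Path) -> str:
    """Return the stored 9-digit ID, creating and saving one on first use."""
    if path.exists():
        stored = path.read_text(encoding="utf-8").strip()
        if len(stored) == 9 and stored.isascii() and stored.isdigit():
            return stored
    host_id = f"{secrets.randbelow(10**9):09d}"  # zero-padded to 9 digits
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(host_id, encoding="utf-8")
    return host_id


# Demonstration in a throwaway directory rather than the real home folder.
with tempfile.TemporaryDirectory() as tmp:
    id_file = Path(tmp) / ".je_auto_control" / "remote_host_id"
    first = load_or_create_host_id(id_file)
    second = load_or_create_host_id(id_file)
```

The second call returns the same value as the first, which is what lets a viewer pin `expected_host_id` across host restarts.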
571+
572+
**Audio streaming (host → viewer).** Optional `sounddevice` dep; opt
573+
in with `enable_audio=True` on the host, attach an `AudioPlayer` (or
574+
your own callback) on the viewer:
575+
576+
```python
577+
host = RemoteDesktopHost(token="tok", enable_audio=True)
578+
579+
from je_auto_control.utils.remote_desktop import AudioPlayer
580+
player = AudioPlayer(); player.start()
581+
viewer = RemoteDesktopViewer(host=..., on_audio=player.play)
582+
```
583+
584+
**Clipboard sync (text + image, bidirectional).** Explicit per-call —
585+
no auto-poll loops. Image clipboard works on Windows (CF_DIB via
586+
ctypes) and Linux (`xclip -t image/png`); macOS get is supported via
587+
Pillow ImageGrab, set requires PyObjC.
588+
589+
```python
590+
viewer.send_clipboard_text("hello")
591+
viewer.send_clipboard_image(open("logo.png", "rb").read())
592+
host.broadcast_clipboard_text("greetings")
593+
```
594+
595+
**File transfer with progress.** Bidirectional, chunked, arbitrary
596+
destination path, no size cap; the GUI viewer also accepts drag-drop:
597+
598+
```python
599+
viewer.send_file(
600+
"local.bin", "/tmp/uploaded.bin",
601+
on_progress=lambda tid, done, total: print(done, total),
602+
)
603+
host.send_file_to_viewers("local.bin", "/tmp/from_host.bin")
604+
```
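
Chunked transfer with progress reporting reduces to reading fixed-size blocks and reporting a running total. This is a sketch only: `iter_chunks`, the 64 KiB chunk size, and the two-argument callback are assumptions (the library's `on_progress` shown above also receives a transfer ID).

```python
import io
from typing import BinaryIO, Callable, Iterator, List, Optional, Tuple


def iter_chunks(stream: BinaryIO, total: int, chunk_size: int = 64 * 1024,
                on_progress: Optional[Callable[[int, int], None]] = None
                ) -> Iterator[bytes]:
    """Yield the stream in chunks, reporting (bytes_done, total) after each."""
    done = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        done += len(chunk)
        if on_progress is not None:
            on_progress(done, total)
        yield chunk


payload = bytes(150_000)  # stand-in for a file's contents
progress: List[Tuple[int, int]] = []
received = b"".join(
    iter_chunks(io.BytesIO(payload), len(payload),
                on_progress=lambda done, total: progress.append((done, total)))
)
```

Because only one chunk is held in memory at a time, the file size does not bound memory use, which is consistent with the "no size cap" behaviour described above.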
605+
606+
> ⚠️ Path is unrestricted and there is no aggregate size limit.
607+
> Anyone with the token can write any file to any location and can
608+
> fill the disk — keep "trusted token holders == trusted users" in
609+
> mind, or wrap with your own `FileReceiver` subclass that vets
610+
> destination paths.
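
A destination check of the kind the warning recommends might confine writes to one directory. The sketch below is standalone because the `FileReceiver` interface is not shown here; `is_allowed_destination` and the `/srv/inbox` root are hypothetical.

```python
from pathlib import Path


def is_allowed_destination(root: Path, requested: str) -> bool:
    """True only if the requested path resolves to a location inside root."""
    resolved_root = root.resolve()
    # Joining with an absolute path discards root, so absolute requests
    # resolve outside the root and are rejected below.
    target = (resolved_root / requested).resolve()
    return resolved_root in target.parents


inbox = Path("/srv/inbox")  # hypothetical upload directory
```

Resolving before comparing is what defeats `..` segments; a subclass that vets paths could call a check like this before opening the destination for writing.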
611+
411612
### Clipboard
412613

413614
```python
@@ -494,10 +695,13 @@ je_auto_control.execute_action([
494695
| Screen | `AC_screen_size`, `AC_screenshot` |
495696
| Accessibility | `AC_a11y_list`, `AC_a11y_find`, `AC_a11y_click` |
496697
| VLM (AI Locator) | `AC_vlm_locate`, `AC_vlm_click` |
497-
| OCR | `AC_locate_text`, `AC_click_text`, `AC_wait_text` |
698+
| OCR | `AC_locate_text`, `AC_click_text`, `AC_wait_text`, `AC_read_text_in_region`, `AC_find_text_regex` |
699+
| LLM planner | `AC_llm_plan`, `AC_llm_run` |
498700
| Clipboard | `AC_clipboard_get`, `AC_clipboard_set` |
499701
| Window | `AC_list_windows`, `AC_focus_window`, `AC_wait_window`, `AC_close_window` |
500-
| Flow control | `AC_loop`, `AC_break`, `AC_continue`, `AC_if_image_found`, `AC_if_pixel`, `AC_while_image`, `AC_wait_image`, `AC_wait_pixel`, `AC_sleep`, `AC_retry` |
702+
| Flow control | `AC_loop`, `AC_break`, `AC_continue`, `AC_if_image_found`, `AC_if_pixel`, `AC_if_var`, `AC_while_image`, `AC_for_each`, `AC_wait_image`, `AC_wait_pixel`, `AC_sleep`, `AC_retry` |
703+
| Variables | `AC_set_var`, `AC_get_var`, `AC_inc_var` |
704+
| Remote desktop | `AC_start_remote_host`, `AC_stop_remote_host`, `AC_remote_host_status`, `AC_remote_connect`, `AC_remote_disconnect`, `AC_remote_viewer_status`, `AC_remote_send_input` |
501705
| Record | `AC_record`, `AC_stop_record`, `AC_set_record_enable` |
502706
| Report | `AC_generate_html`, `AC_generate_json`, `AC_generate_xml`, `AC_generate_html_report`, `AC_generate_json_report`, `AC_generate_xml_report` |
503707
| Run history | `AC_history_list`, `AC_history_clear` |
