Commit a0f62bd

Document OCR / variables / LLM planner / remote desktop additions
Bring README.md, README_zh-TW.md, README_zh-CN.md, and the en/zh new_features doc pages in line with the recent commits:

- README feature lists, ToC, Quick Start sections, and AC_* command tables now cover OCR region-dump and regex search, the runtime VariableScope and the AC_set_var / AC_inc_var / AC_if_var / AC_for_each commands, the LLM action planner, and the remote desktop host + viewer (with security warnings about token-only auth and the 127.0.0.1 default).
- new_features_doc.rst gains four new sections in both English and Traditional Chinese covering the same features with code samples, GUI affordances, and configuration env vars.
1 parent b91689b commit a0f62bd

5 files changed

Lines changed: 736 additions & 9 deletions

README.md

Lines changed: 138 additions & 3 deletions
@@ -24,6 +24,9 @@
2424
- [Accessibility Element Finder](#accessibility-element-finder)
2525
- [AI Element Locator (VLM)](#ai-element-locator-vlm)
2626
- [OCR (Text on Screen)](#ocr-text-on-screen)
27+
- [LLM Action Planner](#llm-action-planner)
28+
- [Runtime Variables & Control Flow](#runtime-variables--control-flow)
29+
- [Remote Desktop](#remote-desktop)
2730
- [Clipboard](#clipboard)
2831
- [Screenshot](#screenshot)
2932
- [Action Recording & Playback](#action-recording--playback)
@@ -57,7 +60,10 @@
5760
- **Image Recognition** — locate UI elements on screen using OpenCV template matching with configurable threshold
5861
- **Accessibility Element Finder** — query the OS accessibility tree (Windows UIA / macOS AX) to locate buttons, menus, and controls by name/role
5962
- **AI Element Locator (VLM)** — describe a UI element in plain language and let a vision-language model (Anthropic / OpenAI) find its screen coordinates
60-
- **OCR** — extract text from screen regions using Tesseract; wait for, click, or locate rendered text
63+
- **OCR** — extract text from screen regions using Tesseract; wait for, click, or locate rendered text; regex search and full-region dump
64+
- **LLM Action Planner** — translate a plain-language description into a validated `AC_*` action list using Claude
65+
- **Runtime Variables & Control Flow** — `${var}` substitution at execution time, plus `AC_set_var` / `AC_inc_var` / `AC_if_var` / `AC_for_each` / `AC_loop` / `AC_retry` for data-driven scripts
66+
- **Remote Desktop** — stream this machine's screen and accept remote input over a token-authenticated TCP protocol, *or* connect to another machine and view + control it (host + viewer GUIs included)
6167
- **Clipboard** — read/write system clipboard text on Windows, macOS, and Linux
6268
- **Screenshot & Screen Recording** — capture full screen or regions as images, record screen to video (AVI/MP4)
6369
- **Action Recording & Playback** — record mouse/keyboard events and replay them
@@ -408,6 +414,132 @@ If Tesseract is not on `PATH`, point at it explicitly:
408414
ac.set_tesseract_cmd(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
409415
```
410416

417+
Dump every recognised text record in a region (or full screen), or
418+
search by regex when the text varies:
419+
420+
```python
421+
import je_auto_control as ac
422+
423+
# Every hit in a region as TextMatch records (text, bounding box, confidence)
424+
for match in ac.read_text_in_region(region=[0, 0, 800, 600]):
425+
    print(match.text, match.center, match.confidence)
426+
427+
# Regex — accepts a pattern string or a compiled re.Pattern
428+
for match in ac.find_text_regex(r"Order#\d+"):
429+
    print(match.text, match.center)
430+
```
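As a rough illustration of how a regex search can accept either a pattern string or a compiled `re.Pattern`, here is a minimal, library-independent sketch. The `Record` class and `filter_by_regex` helper are hypothetical stand-ins for illustration only; they are not the library's `TextMatch` or `find_text_regex`.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Union


@dataclass
class Record:
    """Hypothetical OCR hit used only for this sketch."""
    text: str
    x: int
    y: int


def filter_by_regex(records: Iterable[Record],
                    pattern: Union[str, "re.Pattern[str]"]) -> List[Record]:
    # re.compile() hands back an already-compiled pattern unchanged when no
    # flags are passed, so one call covers both accepted input types.
    compiled = re.compile(pattern)
    return [record for record in records if compiled.search(record.text)]


records = [Record("Order#1042", 10, 20), Record("Total", 10, 40)]
hits = filter_by_regex(records, r"Order#\d+")
print([hit.text for hit in hits])
```

Normalising the pattern once up front keeps the matching loop identical for both call styles.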
431+
432+
GUI: **OCR Reader** tab.
433+
434+
### LLM Action Planner
435+
436+
Translate plain-language descriptions into validated `AC_*` action lists
437+
using an LLM (Anthropic Claude by default). Output is leniently parsed
438+
(strips code fences, extracts the first JSON array from prose) and then
439+
validated by the same schema the executor uses, so the result can be
440+
piped straight into `execute_action`:
441+
442+
```python
443+
import je_auto_control as ac
444+
from je_auto_control.utils.executor.action_executor import executor
445+
446+
actions = ac.plan_actions(
447+
    "click the Submit button, then type 'done' and save",
447+
    known_commands=executor.known_commands(),
449+
)
450+
executor.execute_action(actions)
451+
452+
# Or in a single call:
453+
ac.run_from_description("open Notepad and type hello", executor=executor)
454+
```
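The lenient parsing described above (strip code fences, then take the first JSON array found in the surrounding prose) could look roughly like the sketch below. This is an assumption about one reasonable way to do it, not the planner's actual code; `extract_first_json_array` is a hypothetical name.

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # built at runtime so this example can live inside a fenced block
_FENCE_RE = re.compile(re.escape(FENCE) + r"[A-Za-z0-9_-]*")


def extract_first_json_array(reply: str) -> List[Any]:
    """Return the first JSON array embedded in an LLM reply (hypothetical helper)."""
    text = _FENCE_RE.sub("", reply)  # drop fence markers and any language tag
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # a "[" that was only prose, keep scanning
        if isinstance(value, list):
            return value
    raise ValueError("no JSON array found in reply")


reply = ("Here is the plan [draft]:\n" + FENCE + "json\n"
         + '[["AC_set_var", {"name": "i", "value": 0}], ["AC_break"]]\n'
         + FENCE + "\nLet me know if it works.")
plan = extract_first_json_array(reply)
print(plan)
```

Parsing only extracts the list; schema validation against the executor's known commands would still be a separate step.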
455+
456+
| Variable | Effect |
457+
|---|---|
458+
| `ANTHROPIC_API_KEY` | Enables the Anthropic backend |
459+
| `AUTOCONTROL_LLM_BACKEND` | Set to `anthropic` to force that backend |
460+
| `AUTOCONTROL_LLM_MODEL` | Override the default model (e.g. `claude-opus-4-7`) |
461+
462+
GUI: **LLM Planner** tab — description box, `QThread`-backed *Plan*
463+
button, action-list preview, and a *Run plan* button.
464+
465+
### Runtime Variables & Control Flow
466+
467+
The executor resolves `${var}` placeholders **per command call** rather
468+
than pre-flattening, so nested `body` / `then` / `else` lists keep their
469+
placeholders and re-bind on every iteration. Combined with new mutation
470+
commands, scripts can drive themselves from data without Python glue:
471+
472+
```json
473+
[
474+
  ["AC_set_var", {"name": "items", "value": ["alpha", "beta"]}],
475+
  ["AC_set_var", {"name": "i", "value": 0}],
476+
  ["AC_for_each", {
477+
    "items": "${items}", "as": "name",
478+
    "body": [
479+
      ["AC_inc_var", {"name": "i"}],
480+
      ["AC_if_var", {
481+
        "name": "i", "op": "ge", "value": 2,
482+
        "then": [["AC_break"]], "else": []
483+
      }]
484+
    ]
485+
  }]
486+
]
487+
```
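To make the "per command call" resolution concrete, the sketch below substitutes `${var}` in a command's own parameters while leaving nested `body` / `then` / `else` lists untouched, so they can be resolved later on each iteration. It is a simplified illustration under that assumption; `resolve_value`, `resolve_params`, and `DEFERRED_KEYS` are hypothetical names, not the executor's internals.

```python
import re
from typing import Any, Dict

_WHOLE = re.compile(r"^\$\{(\w+)\}$")
_EMBEDDED = re.compile(r"\$\{(\w+)\}")

# Keys assumed to hold nested action lists; they are resolved later, when the
# nested commands actually run.
DEFERRED_KEYS = {"body", "then", "else"}


def resolve_value(value: Any, variables: Dict[str, Any]) -> Any:
    if isinstance(value, str):
        whole = _WHOLE.match(value)
        if whole:
            # A bare "${name}" keeps the variable's original type (list, int, ...).
            return variables[whole.group(1)]
        return _EMBEDDED.sub(lambda m: str(variables[m.group(1)]), value)
    if isinstance(value, list):
        return [resolve_value(item, variables) for item in value]
    if isinstance(value, dict):
        return {key: resolve_value(item, variables) for key, item in value.items()}
    return value


def resolve_params(params: Dict[str, Any], variables: Dict[str, Any]) -> Dict[str, Any]:
    return {key: (item if key in DEFERRED_KEYS else resolve_value(item, variables))
            for key, item in params.items()}


variables = {"items": ["alpha", "beta"], "name": "alpha"}
params = {"items": "${items}", "as": "name",
          "body": [["AC_set_var", {"name": "last", "value": "${name}"}]]}
resolved = resolve_params(params, variables)
print(resolved["items"], resolved["body"])
```

Because `body` still contains `${name}` after this step, a loop can re-run the same list with a different binding each time.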
488+
489+
`AC_if_var` operators: `eq`, `ne`, `lt`, `le`, `gt`, `ge`, `contains`,
490+
`startswith`, `endswith`. GUI: **Variables** tab — live view of
491+
`executor.variables` with single-set, JSON seed, and clear-all controls.
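One plausible way to implement the operator names listed above is a small dispatch table. The sketch below is only an illustration; `OPS` and `compare` are hypothetical names, and the exact coercion rules of the real `AC_if_var` are not specified in this document.

```python
import operator
from typing import Any, Callable, Dict

OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    # Assumed semantics: membership for lists, substring for strings.
    "contains": lambda left, right: right in left,
    "startswith": lambda left, right: str(left).startswith(str(right)),
    "endswith": lambda left, right: str(left).endswith(str(right)),
}


def compare(left: Any, op: str, right: Any) -> bool:
    if op not in OPS:
        raise ValueError(f"unknown operator: {op!r}")
    return bool(OPS[op](left, right))


print(compare(2, "ge", 2), compare("alpha", "startswith", "al"))
```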
492+
493+
### Remote Desktop
494+
495+
Stream this machine's screen and accept remote input, **or** view and
496+
control another machine. The wire format is length-prefixed framing
496+
over raw TCP (no extra deps), starting with an HMAC-SHA256
498+
challenge / response handshake; viewers that fail auth are dropped
499+
before they can see a frame. JPEG frames are produced at the configured
500+
FPS / quality and broadcast to authenticated viewers via a shared
501+
latest-frame slot, so a slow viewer drops frames instead of blocking
502+
the rest. Viewer input is JSON, validated against an allowlist, and
503+
applied through the existing wrappers.
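The framing and handshake described above can be sketched with the standard library alone. The 4-byte big-endian length prefix, the 32-byte challenge, and all function names here are assumptions for illustration; the project's actual wire layout may differ.

```python
import hashlib
import hmac
import secrets
import struct
from typing import Tuple

HEADER = struct.Struct(">I")  # assumed: 4-byte big-endian payload length


def pack_frame(payload: bytes) -> bytes:
    return HEADER.pack(len(payload)) + payload


def unpack_frame(buffer: bytes) -> Tuple[bytes, bytes]:
    """Return (payload, remaining bytes); raise ValueError if incomplete."""
    if len(buffer) < HEADER.size:
        raise ValueError("incomplete header")
    (length,) = HEADER.unpack_from(buffer)
    end = HEADER.size + length
    if len(buffer) < end:
        raise ValueError("incomplete payload")
    return buffer[HEADER.size:end], buffer[end:]


def make_challenge() -> bytes:
    return secrets.token_bytes(32)  # fresh random nonce per connection


def sign(token: str, challenge: bytes) -> bytes:
    return hmac.new(token.encode("utf-8"), challenge, hashlib.sha256).digest()


def verify(token: str, challenge: bytes, response: bytes) -> bool:
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(sign(token, challenge), response)


challenge = make_challenge()                      # host -> viewer
wire = pack_frame(sign("hunter2", challenge))     # viewer -> host
response, rest = unpack_frame(wire + pack_frame(b"next"))
print(verify("hunter2", challenge, response))
```

With this shape the token itself never crosses the wire; only the challenge and its keyed digest do.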
504+
505+
```python
506+
# Be remoted — start a host and hand the token + port to whoever views you
507+
from je_auto_control import RemoteDesktopHost
508+
host = RemoteDesktopHost(token="hunter2", bind="127.0.0.1",
509+
                         port=0, fps=10, quality=70)
510+
host.start()
511+
print("listening on", host.port, "viewers:", host.connected_clients)
512+
```
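The "shared latest-frame slot" mentioned earlier can be pictured as a single overwritable cell guarded by a condition variable: the capture loop overwrites it, and each viewer thread only ever picks up the newest frame. The class below is a hypothetical sketch of that idea, not the host's real implementation.

```python
import threading
from typing import Optional, Tuple


class LatestFrameSlot:
    """Holds only the most recent frame; older frames are overwritten."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._frame: Optional[bytes] = None
        self._seq = 0

    def publish(self, frame: bytes) -> None:
        with self._cond:
            self._frame = frame
            self._seq += 1
            self._cond.notify_all()

    def wait_newer(self, last_seq: int,
                   timeout: Optional[float] = None) -> Optional[Tuple[int, bytes]]:
        """Block until a frame newer than last_seq exists, or return None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._seq > last_seq, timeout):
                return None
            return self._seq, self._frame


slot = LatestFrameSlot()
for jpeg in (b"frame-1", b"frame-2", b"frame-3"):
    slot.publish(jpeg)

# A viewer that was busy during all three publishes sees only the newest one.
print(slot.wait_newer(0, timeout=0.1))
```

Since nothing queues up per viewer, a slow connection costs that viewer some frames but never delays the producer or other viewers.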
513+
514+
```python
515+
# Control another machine — connect a viewer and send input
516+
from je_auto_control import RemoteDesktopViewer
517+
viewer = RemoteDesktopViewer(host="10.0.0.5", port=51234, token="hunter2",
518+
                             on_frame=lambda jpeg: ...)
519+
viewer.connect()
520+
viewer.send_input({"action": "mouse_move", "x": 100, "y": 200})
521+
viewer.send_input({"action": "type", "text": "hello"})
522+
viewer.disconnect()
523+
```
524+
525+
GUI: **Remote Desktop** tab with two sub-tabs.
526+
527+
- **Host** — token field with a *Generate* button, security warning
528+
about the bind address, start / stop controls, refreshing port +
529+
viewer-count status, and a 4 fps preview pane below the controls so
530+
the user being remoted sees what viewers see.
531+
- **Viewer** — address / port / token form, *Connect* / *Disconnect*,
532+
and a custom frame-display widget that paints incoming JPEG frames
533+
scaled with `KeepAspectRatio`. Mouse / wheel / key events on the
534+
display are remapped from widget coordinates back to the remote
535+
screen's pixel space using the latest frame's dimensions, then
536+
forwarded as `INPUT` messages.
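The coordinate remapping in the Viewer bullet can be worked through as plain arithmetic. The sketch below assumes the frame is drawn centred in the widget with its aspect ratio preserved (letterboxed); `widget_to_remote` is a hypothetical helper, not the GUI's actual method.

```python
from typing import Optional, Tuple


def widget_to_remote(wx: float, wy: float,
                     widget_w: int, widget_h: int,
                     frame_w: int, frame_h: int) -> Optional[Tuple[int, int]]:
    """Map a widget-space point to remote pixel space, or None if outside the image."""
    if min(widget_w, widget_h, frame_w, frame_h) <= 0:
        return None
    # Aspect-preserving scale: the tighter of the two axes wins.
    scale = min(widget_w / frame_w, widget_h / frame_h)
    drawn_w, drawn_h = frame_w * scale, frame_h * scale
    # Assumed: the scaled image is centred, leaving equal bars on each side.
    offset_x = (widget_w - drawn_w) / 2
    offset_y = (widget_h - drawn_h) / 2
    fx = (wx - offset_x) / scale
    fy = (wy - offset_y) / scale
    if not (0 <= fx < frame_w and 0 <= fy < frame_h):
        return None  # click landed on the letterbox bars
    return int(fx), int(fy)


# 1920x1080 frame shown in a 960x600 widget: scale 0.5, 30 px bars top and bottom.
print(widget_to_remote(480, 300, 960, 600, 1920, 1080))
```

Returning `None` for points on the bars lets the caller skip forwarding input that has no corresponding remote pixel.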
537+
538+
> ⚠️ Anyone with the host:port and token gets full mouse / keyboard
539+
> control of the host machine. Default bind is `127.0.0.1`; expose
540+
> externally only via SSH tunnel or TLS front-end. The token is the
541+
> only line of defence — treat it like a password.
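Since the token is the only credential, it is worth generating it from a cryptographic source instead of typing a memorable word. The helpers below are a hypothetical sketch using only the standard library; the GUI's *Generate* button may work differently.

```python
import ipaddress
import secrets


def generate_token(num_bytes: int = 32) -> str:
    # token_urlsafe draws from the OS CSPRNG; 32 bytes gives about 256 bits.
    return secrets.token_urlsafe(num_bytes)


def is_loopback_bind(bind: str) -> bool:
    """True when the bind address is only reachable from this machine."""
    return ipaddress.ip_address(bind).is_loopback


token = generate_token()
print(len(token), is_loopback_bind("127.0.0.1"), is_loopback_bind("0.0.0.0"))
```

A check like `is_loopback_bind` could gate a warning before the host starts listening on a non-loopback address.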
542+
411543
### Clipboard
412544

413545
```python
@@ -494,10 +626,13 @@ je_auto_control.execute_action([
494626
| Screen | `AC_screen_size`, `AC_screenshot` |
495627
| Accessibility | `AC_a11y_list`, `AC_a11y_find`, `AC_a11y_click` |
496628
| VLM (AI Locator) | `AC_vlm_locate`, `AC_vlm_click` |
497-
| OCR | `AC_locate_text`, `AC_click_text`, `AC_wait_text` |
629+
| OCR | `AC_locate_text`, `AC_click_text`, `AC_wait_text`, `AC_read_text_in_region`, `AC_find_text_regex` |
630+
| LLM planner | `AC_llm_plan`, `AC_llm_run` |
498631
| Clipboard | `AC_clipboard_get`, `AC_clipboard_set` |
499632
| Window | `AC_list_windows`, `AC_focus_window`, `AC_wait_window`, `AC_close_window` |
500-
| Flow control | `AC_loop`, `AC_break`, `AC_continue`, `AC_if_image_found`, `AC_if_pixel`, `AC_while_image`, `AC_wait_image`, `AC_wait_pixel`, `AC_sleep`, `AC_retry` |
633+
| Flow control | `AC_loop`, `AC_break`, `AC_continue`, `AC_if_image_found`, `AC_if_pixel`, `AC_if_var`, `AC_while_image`, `AC_for_each`, `AC_wait_image`, `AC_wait_pixel`, `AC_sleep`, `AC_retry` |
634+
| Variables | `AC_set_var`, `AC_get_var`, `AC_inc_var` |
635+
| Remote desktop | `AC_start_remote_host`, `AC_stop_remote_host`, `AC_remote_host_status`, `AC_remote_connect`, `AC_remote_disconnect`, `AC_remote_viewer_status`, `AC_remote_send_input` |
501636
| Record | `AC_record`, `AC_stop_record`, `AC_set_record_enable` |
502637
| Report | `AC_generate_html`, `AC_generate_json`, `AC_generate_xml`, `AC_generate_html_report`, `AC_generate_json_report`, `AC_generate_xml_report` |
503638
| Run history | `AC_history_list`, `AC_history_clear` |

README/README_zh-CN.md

Lines changed: 108 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,9 @@
2323
- [Accessibility Element Finder](#accessibility-元件搜索)
2424
- [AI Element Locator (VLM)](#ai-元件定位vlm)
2525
- [OCR (Text on Screen)](#ocr-屏幕文字识别)
26+
- [LLM Action Planner](#llm-动作规划器)
27+
- [Runtime Variables & Control Flow](#运行期变量与流程控制)
28+
- [Remote Desktop](#远程桌面)
2629
- [Clipboard](#剪贴板)
2730
- [Screenshot](#截图)
2831
- [Action Recording & Playback](#动作录制与回放)
@@ -56,7 +59,10 @@
5659
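- **Image Recognition** — locate UI elements on screen using OpenCV template matching, with a configurable detection threshold
5760
- **Accessibility Element Finder** — query the OS accessibility tree (Windows UIA / macOS AX) to locate buttons, menus, and controls by name/role
5861
- **AI Element Locator (VLM)** — describe a UI element in natural language and have a vision-language model (Anthropic / OpenAI) return its screen coordinates
59-
- **OCR** — extract text from the screen using Tesseract; search for, click, or wait for text to appear
62+
- **OCR** — extract text from the screen using Tesseract; search for, click, or wait for text to appear; supports regex search and full-region dump
63+
- **LLM Action Planner** — use Claude to translate a natural-language description into a validated `AC_*` action list
64+
- **Runtime Variables & Control Flow** — `${var}` substitution at execution time, plus `AC_set_var` / `AC_inc_var` / `AC_if_var` / `AC_for_each` / `AC_loop` / `AC_retry` to make scripts data-driven
65+
- **Remote Desktop** — stream this machine's screen and accept input over a token-authenticated TCP protocol, **or** connect to another machine to view and control it (host + viewer GUIs built in)
6066
- **Clipboard** — read/write system clipboard text on Windows / macOS / Linux
6167
- **Screenshot & Screen Recording** — capture the full screen or a specified region as an image, record the screen to video (AVI/MP4)
6268
- **Action Recording & Playback** — record mouse/keyboard events and replay them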
@@ -402,6 +408,102 @@ ac.wait_for_text("加载完成", timeout=15.0)
402408
ac.set_tesseract_cmd(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
403409
```
404410

411+
Dump all recognised text in a region (or the full screen), or use a regex to search for content that varies:
412+
413+
```python
414+
import je_auto_control as ac
415+
416+
# List of TextMatch records: text, bounding box, confidence
417+
for match in ac.read_text_in_region(region=[0, 0, 800, 600]):
418+
    print(match.text, match.center, match.confidence)
419+
420+
# Regex (accepts a string or a compiled re.Pattern)
421+
for match in ac.find_text_regex(r"Order#\d+"):
422+
    print(match.text, match.center)
423+
```
424+
425+
GUI: **OCR Reader** tab.
426+
427+
### LLM Action Planner
428+
429+
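Hand a natural-language description to an LLM (Anthropic Claude by default) and get back a validated `AC_*` action list. The output is parsed leniently (code fences are stripped and the first JSON array is extracted from any surrounding prose), then validated against the same schema the executor uses, so the result can be fed straight into `execute_action`: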
430+
431+
```python
432+
import je_auto_control as ac
433+
from je_auto_control.utils.executor.action_executor import executor
434+
435+
actions = ac.plan_actions(
436+
    "click the Submit button, then type 'done' and save",
436+
    known_commands=executor.known_commands(),
438+
)
439+
executor.execute_action(actions)
440+
441+
# Or do it all in one call:
441+
ac.run_from_description("open Notepad and type hello", executor=executor)
443+
```
444+
445+
| Variable | Effect |
446+
|---|---|
447+
| `ANTHROPIC_API_KEY` | Enables the Anthropic backend |
448+
| `AUTOCONTROL_LLM_BACKEND` | Forces the `anthropic` backend |
449+
| `AUTOCONTROL_LLM_MODEL` | Overrides the default model (e.g. `claude-opus-4-7`) |
450+
451+
GUI: **LLM Planner** tab — description input box, a *Plan* button that runs in a background `QThread`, a preview of the action list, and a *Run plan* button.
452+
453+
### Runtime Variables & Control Flow
454+
455+
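The executor now resolves `${var}` placeholders on each call (rather than flattening them in advance), so nested `body` / `then` / `else` lists keep their placeholders and re-bind them on every iteration. Combined with the new variable-mutation commands, scripts can be data-driven without Python glue: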
456+
457+
```json
458+
[
459+
  ["AC_set_var", {"name": "items", "value": ["alpha", "beta"]}],
460+
  ["AC_set_var", {"name": "i", "value": 0}],
461+
  ["AC_for_each", {
462+
    "items": "${items}", "as": "name",
463+
    "body": [
464+
      ["AC_inc_var", {"name": "i"}],
465+
      ["AC_if_var", {
466+
        "name": "i", "op": "ge", "value": 2,
467+
        "then": [["AC_break"]], "else": []
468+
      }]
469+
    ]
470+
  }]
471+
]
472+
```
473+
474+
`AC_if_var` comparison operators: `eq`, `ne`, `lt`, `le`, `gt`, `ge`, `contains`, `startswith`, `endswith`. GUI: **Variables** tab — live view of `executor.variables`, with single-entry set, bulk JSON seed, and clear-all controls.
475+
476+
### Remote Desktop
477+
478+
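Stream this machine's screen for others to view / control, **or** view and control someone else's machine. The protocol is length-prefixed framing over raw TCP (no extra dependencies), starting with an HMAC-SHA256 challenge / response authentication round; viewers that fail authentication are dropped before they see any frame. JPEG frames are produced at the configured FPS / quality and broadcast to authenticated viewers through a shared latest-frame slot, so a slow viewer only drops frames and never stalls the others. Viewer input messages are JSON; the host validates them against an allowlist before dispatching them through the existing wrappers.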
479+
480+
```python
481+
# Be remoted — start a host and give the token + port to the other party
482+
from je_auto_control import RemoteDesktopHost
483+
host = RemoteDesktopHost(token="hunter2", bind="127.0.0.1",
484+
                         port=0, fps=10, quality=70)
485+
host.start()
486+
print("listening on", host.port, "viewers:", host.connected_clients)
487+
```
488+
489+
```python
490+
# Control another machine — connect a viewer and send input
491+
from je_auto_control import RemoteDesktopViewer
492+
viewer = RemoteDesktopViewer(host="10.0.0.5", port=51234, token="hunter2",
493+
                             on_frame=lambda jpeg: ...)
494+
viewer.connect()
495+
viewer.send_input({"action": "mouse_move", "x": 100, "y": 200})
496+
viewer.send_input({"action": "type", "text": "hello"})
497+
viewer.disconnect()
498+
```
499+
500+
GUI: **Remote Desktop** tab, containing two sub-tabs.
501+
502+
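- **Host** (the local machine being remoted) — token field with a *Generate* button, bind-address security notice, start / stop controls, a live-refreshing port + viewer-count status bar, and a 4 fps preview pane so the person being remoted sees what viewers see.
503+
- **Viewer** (controlling another machine) — address / port / token form, *Connect* / *Disconnect*, and a custom-painted frame display widget that draws each JPEG scaled with its aspect ratio preserved. Mouse / wheel / keyboard events on the display are mapped back to the remote screen's original pixel coordinates using the latest frame's dimensions, then sent back as `INPUT` messages.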
504+
505+
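> ⚠️ Anyone who obtains the host:port and token has full mouse / keyboard control of this machine. The default bind is `127.0.0.1` only; if you expose it externally, always put it behind an SSH tunnel or TLS front-end. The token is the only line of defence — treat it like a password.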
506+
405507
### Clipboard
406508

407509
```python
@@ -488,10 +590,13 @@ je_auto_control.execute_action([
488590
| Screen | `AC_screen_size`, `AC_screenshot` |
489591
| Accessibility | `AC_a11y_list`, `AC_a11y_find`, `AC_a11y_click` |
490592
| VLM (AI Locator) | `AC_vlm_locate`, `AC_vlm_click` |
491-
| OCR | `AC_locate_text`, `AC_click_text`, `AC_wait_text` |
593+
| OCR | `AC_locate_text`, `AC_click_text`, `AC_wait_text`, `AC_read_text_in_region`, `AC_find_text_regex` |
594+
| LLM planner | `AC_llm_plan`, `AC_llm_run` |
492595
| Clipboard | `AC_clipboard_get`, `AC_clipboard_set` |
493596
| Window | `AC_list_windows`, `AC_focus_window`, `AC_wait_window`, `AC_close_window` |
494-
| Flow control | `AC_loop`, `AC_break`, `AC_continue`, `AC_if_image_found`, `AC_if_pixel`, `AC_while_image`, `AC_wait_image`, `AC_wait_pixel`, `AC_sleep`, `AC_retry` |
597+
| Flow control | `AC_loop`, `AC_break`, `AC_continue`, `AC_if_image_found`, `AC_if_pixel`, `AC_if_var`, `AC_while_image`, `AC_for_each`, `AC_wait_image`, `AC_wait_pixel`, `AC_sleep`, `AC_retry` |
598+
| Variables | `AC_set_var`, `AC_get_var`, `AC_inc_var` |
599+
| Remote desktop | `AC_start_remote_host`, `AC_stop_remote_host`, `AC_remote_host_status`, `AC_remote_connect`, `AC_remote_disconnect`, `AC_remote_viewer_status`, `AC_remote_send_input` |
495600
| Record | `AC_record`, `AC_stop_record`, `AC_set_record_enable` |
496601
| Report | `AC_generate_html`, `AC_generate_json`, `AC_generate_xml`, `AC_generate_html_report`, `AC_generate_json_report`, `AC_generate_xml_report` |
497602
| Run history | `AC_history_list`, `AC_history_clear` |
