fix: add SQLite busy_timeout and WAL autocheckpoint to prevent CPU 100% epoll busy wait (Closes #8056)#8111

Open
Pleiades1726 wants to merge 1 commit into
AstrBotDevs:masterfrom
Pleiades1726:fix/sqlite-wal-busy-cpu

Contributor

@Pleiades1726 Pleiades1726 commented May 9, 2026

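Problem

See Issue #8056 for details. After AstrBot has been running for a while, a single CPU core is pinned at 100%, and strace shows epoll_wait spinning in non-blocking calls. The cause is lock contention triggered when the SQLite WAL file grows too large (for example, 6MB), which sends aiosqlite into a busy retry loop.

Root cause

None of the three SQLite databases sets the busy_timeout PRAGMA. When the WAL file needs a checkpoint, lock contention makes SQLite return SQLITE_BUSY immediately, and aiosqlite's repeated retries show up as epoll busy-waiting.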
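
The immediate SQLITE_BUSY behavior described above can be reproduced outside AstrBot. The following is a minimal sketch using the standard-library sqlite3 module rather than the project's aiosqlite/SQLAlchemy stack; the function name `time_blocked_write` and the 300 ms value are illustrative assumptions, not part of the PR.

```python
import os
import sqlite3
import tempfile
import time


def time_blocked_write(busy_timeout_ms: int) -> tuple[bool, float]:
    """Return (was_locked, seconds_waited) for a writer blocked by another writer."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: the sqlite3 module issues no implicit BEGIN.
        holder = sqlite3.connect(path, isolation_level=None)
        waiter = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            holder.execute("CREATE TABLE t (x INTEGER)")
            holder.execute("BEGIN IMMEDIATE")  # take and hold the write lock
            waiter.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
            start = time.monotonic()
            try:
                waiter.execute("BEGIN IMMEDIATE")  # second writer wants the same lock
                locked = False
            except sqlite3.OperationalError:
                locked = True  # SQLITE_BUSY surfaces as "database is locked"
            return locked, time.monotonic() - start
        finally:
            waiter.close()
            holder.close()
```

With `busy_timeout_ms=0` the second writer fails right away, which is the situation a caller might then retry in a tight loop. With a non-zero value, SQLite itself waits before giving up.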

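Changes

1. Main database (astrbot/core/db/sqlite.py)

  • Add PRAGMA busy_timeout=5000: wait up to 5 seconds for a lock instead of failing immediately
  • Add PRAGMA wal_autocheckpoint=500: checkpoint automatically once the WAL reaches about 2MB (500 pages, assuming the default 4096-byte page size)

2. Knowledge base database (astrbot/core/knowledge_base/kb_db_sqlite.py)

  • Add connect_args={'timeout': 30} to the engine
  • Set the full group of PRAGMAs at initialization (WAL, synchronous, busy_timeout, wal_autocheckpoint, and others)
  • Remove the duplicated old PRAGMA block

3. Document storage (astrbot/core/db/vec_db/faiss_impl/document_storage.py)

  • Add connect_args={'timeout': 30} to the engine
  • Add an _apply_pragma() method, called during initialization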

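About busy_timeout

PRAGMA busy_timeout=5000 tells SQLite to wait up to 5000 milliseconds (5 seconds) when it hits a lock, sleeping and retrying at increasing intervals during that window, instead of returning SQLITE_BUSY immediately and leaving the caller to busy-wait. Combined with wal_autocheckpoint=500, this keeps the WAL size bounded and lock contention predictable.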
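
To make the two settings concrete, here is a small sketch that applies them and reads them back, again with the standard-library sqlite3 module instead of the project's async engine. The helper names `apply_and_inspect` and `demo` are hypothetical. The WAL threshold is computed from the database's actual page size rather than assumed.

```python
import os
import sqlite3
import tempfile


def apply_and_inspect(path: str) -> dict:
    """Apply the PRAGMAs from this PR and report the values SQLite actually holds."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA busy_timeout=5000")       # milliseconds
        conn.execute("PRAGMA wal_autocheckpoint=500")  # pages, not bytes
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        pages = conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0]
        return {
            "journal_mode": conn.execute("PRAGMA journal_mode").fetchone()[0],
            "busy_timeout_ms": conn.execute("PRAGMA busy_timeout").fetchone()[0],
            "wal_autocheckpoint_pages": pages,
            # Approximate WAL size at which an automatic checkpoint is attempted.
            "wal_threshold_bytes": pages * page_size,
        }
    finally:
        conn.close()


def demo() -> dict:
    # WAL mode needs a real file, so use a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        return apply_and_inspect(os.path.join(tmp, "demo.db"))
```

Note that busy_timeout and wal_autocheckpoint are per-connection settings, so in this sketch they apply only to the connection that executed them.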

Summary by Sourcery

Configure SQLite databases to better handle concurrency and prevent busy-wait CPU spikes by tuning PRAGMA settings and connection timeouts.

Bug Fixes:

  • Prevent high CPU usage caused by SQLite busy-wait loops under WAL checkpoint lock contention by adding busy timeouts and WAL autocheckpoint settings across SQLite databases.

Enhancements:

  • Add standardized SQLite PRAGMAs (WAL mode, synchronous level, busy timeout, WAL autocheckpoint) to the document storage and knowledge base databases to improve performance and concurrency behavior.
  • Set explicit SQLite connection timeouts for the knowledge base and document storage engines to avoid immediate failures under contention.

…0% epoll busy wait

When SQLite WAL grows large (e.g., 6MB), checkpoint operations cause
lock contention. Without busy_timeout and wal_autocheckpoint PRAGMAs,
aiosqlite enters a busy retry loop that manifests as epoll_wait 100% CPU.

Changes across all three SQLite databases in AstrBot:

1. Main DB (astrbot/core/db/sqlite.py):
   - Add PRAGMA busy_timeout=5000 (5s lock wait)
   - Add PRAGMA wal_autocheckpoint=500 (checkpoint every ~2MB)

2. KB DB (astrbot/core/knowledge_base/kb_db_sqlite.py):
   - Add connect_args={'timeout': 30} for engine
   - Add all PRAGMA settings (WAL, synchronous, busy_timeout,
     wal_autocheckpoint, cache_size, temp_store)
   - Remove duplicate/redundant old PRAGMA block

3. Document storage (astrbot/core/db/vec_db/faiss_impl/document_storage.py):
   - Add connect_args={'timeout': 30} for engine
   - Add _apply_pragma() method with PRAGMA settings
   - Call _apply_pragma() during initialize()

Closes AstrBotDevs#8056
@auto-assign auto-assign Bot requested review from LIghtJUNction and Soulter May 9, 2026 13:59
@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. area:core The bug / feature is about astrbot's core, backend labels May 9, 2026
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In document_storage._apply_pragma, you are calling await conn.commit() inside an engine.begin() context; since begin() manages the transaction lifecycle itself, this explicit commit is unnecessary and may not behave as expected on an AsyncConnection—it would be safer to rely on the context manager’s implied commit/rollback.
  • The SQLite PRAGMA configuration (WAL, synchronous, busy_timeout, wal_autocheckpoint, etc.) is now duplicated across the main DB, knowledge base DB, and document storage; consider factoring this into a shared helper to keep the settings consistent and easier to adjust in one place.
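
One possible shape for such a shared helper, as a minimal sketch only: the class name `SqlitePragmaSettings`, its fields, and its defaults are hypothetical and do not come from the PR. It keeps the lock timeout in one place and derives the millisecond value from it, so the connection timeout and busy_timeout cannot drift apart. Each database module would still execute the returned statements through its own connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlitePragmaSettings:
    """Hypothetical shared settings object; names are illustrative only."""

    lock_timeout_seconds: float = 30.0  # would also feed connect_args["timeout"]
    wal_autocheckpoint_pages: int = 500
    synchronous: str = "NORMAL"

    @property
    def busy_timeout_ms(self) -> int:
        # PRAGMA busy_timeout takes milliseconds; the connect timeout is in seconds.
        return int(self.lock_timeout_seconds * 1000)

    def statements(self) -> list[str]:
        """PRAGMA statements in the order they should be executed."""
        return [
            "PRAGMA journal_mode=WAL",
            f"PRAGMA synchronous={self.synchronous}",
            f"PRAGMA busy_timeout={self.busy_timeout_ms}",
            f"PRAGMA wal_autocheckpoint={self.wal_autocheckpoint_pages}",
        ]
```

A caller could loop over `SqlitePragmaSettings().statements()` and pass each string to its own execute call, which keeps the values identical across the three databases.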


Comment thread astrbot/core/db/vec_db/faiss_impl/document_storage.py
                self.DATABASE_URL,
                echo=False,
                future=True,
                connect_args={"timeout": 30},

suggestion: Clarify whether both connect_args["timeout"] and PRAGMA busy_timeout are needed.

You’re now setting the lock wait in two places: connect_args={"timeout": 30} (seconds) and PRAGMA busy_timeout=5000 (milliseconds). Consider choosing one mechanism or clearly aligning/documenting the values so the effective timeout is unambiguous.

Suggested implementation:

                self.DATABASE_URL,
                echo=False,
                future=True,
                # timeout is in seconds; aligned with PRAGMA busy_timeout (30_000 ms)
                connect_args={"timeout": 30},
            )

            await conn.execute(text("PRAGMA journal_mode=WAL"))
            await conn.execute(text("PRAGMA synchronous=NORMAL"))
            # 30_000 ms = 30 seconds; aligned with connect_args["timeout"]
            await conn.execute(text("PRAGMA busy_timeout=30000"))
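
The interaction between the two settings can be checked directly. This sketch assumes that connect_args["timeout"] is ultimately forwarded to sqlite3.connect, which is how the standard-library driver receives it; the function name `effective_busy_timeout_ms` is illustrative.

```python
import sqlite3


def effective_busy_timeout_ms(connect_timeout_s: float, pragma_ms=None) -> int:
    """Return the busy timeout SQLite reports after the given configuration."""
    # sqlite3.connect takes its timeout in seconds.
    conn = sqlite3.connect(":memory:", timeout=connect_timeout_s)
    try:
        if pragma_ms is not None:
            # PRAGMA busy_timeout takes milliseconds and replaces the earlier value.
            conn.execute(f"PRAGMA busy_timeout={int(pragma_ms)}")
        return conn.execute("PRAGMA busy_timeout").fetchone()[0]
    finally:
        conn.close()
```

On the connection that runs the PRAGMA, the later value wins, so a 30-second connect timeout followed by busy_timeout=5000 leaves that connection with a 5-second wait.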

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces SQLite performance and concurrency optimizations by configuring PRAGMAs like WAL mode, busy timeouts, and autocheckpoints across several database modules. The review feedback points out inconsistencies between connection-level timeouts and PRAGMA settings, potential transaction conflicts when enabling WAL mode within an active transaction, and the unintended removal of performance-enhancing PRAGMAs in the knowledge base. It is also suggested to refactor the duplicated PRAGMA logic into a shared helper function and consolidate connection calls to reduce overhead.

Comment on lines +209 to +216
    async def _apply_pragma(self) -> None:
        """Apply SQLite PRAGMAs for performance and concurrency."""
        async with self.engine.begin() as conn:
            await conn.execute(text("PRAGMA journal_mode=WAL"))
            await conn.execute(text("PRAGMA synchronous=NORMAL"))
            await conn.execute(text("PRAGMA busy_timeout=5000"))
            await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
            await conn.commit()

high

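This new method has the following problems:

  1. Transaction conflict: PRAGMA journal_mode=WAL cannot be changed inside a transaction. self.engine.begin() opens a transaction automatically, so the command may fail or be ignored. Consider using self.engine.connect() instead, or running it outside a transaction.
  2. Inconsistency: busy_timeout=5000 conflicts with the timeout=30 set on line 201. Consider using 30000 for both.
  3. Redundant code: calling await conn.commit() inside the engine.begin() block is unnecessary, because the context manager commits automatically on exit.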
Suggested change
-    async def _apply_pragma(self) -> None:
-        """Apply SQLite PRAGMAs for performance and concurrency."""
-        async with self.engine.begin() as conn:
-            await conn.execute(text("PRAGMA journal_mode=WAL"))
-            await conn.execute(text("PRAGMA synchronous=NORMAL"))
-            await conn.execute(text("PRAGMA busy_timeout=5000"))
-            await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
-            await conn.commit()
+    async def _apply_pragma(self) -> None:
+        """Apply SQLite PRAGMAs for performance and concurrency."""
+        async with self.engine.connect() as conn:
+            await conn.execute(text("PRAGMA journal_mode=WAL"))
+            await conn.execute(text("PRAGMA synchronous=NORMAL"))
+            await conn.execute(text("PRAGMA busy_timeout=30000"))
+            await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
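
The transaction concern can be illustrated with the standard-library sqlite3 module. This is a sketch under stated assumptions: it issues BEGIN explicitly, whereas whether engine.begin() actually sends BEGIN before a PRAGMA depends on the driver's transaction handling, which this example does not model. The function name `journal_mode_after` is hypothetical.

```python
import os
import sqlite3
import tempfile


def journal_mode_after(set_inside_transaction: bool) -> str:
    """Try to switch to WAL, inside or outside a transaction, and report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: transactions start only where BEGIN is written.
        conn = sqlite3.connect(path, isolation_level=None)
        try:
            conn.execute("CREATE TABLE t (x INTEGER)")
            if set_inside_transaction:
                conn.execute("BEGIN")
                try:
                    conn.execute("PRAGMA journal_mode=WAL")
                except sqlite3.OperationalError:
                    pass  # SQLite refuses the switch while a transaction is open
                if conn.in_transaction:
                    conn.execute("ROLLBACK")
            else:
                conn.execute("PRAGMA journal_mode=WAL")
            return conn.execute("PRAGMA journal_mode").fetchone()[0]
        finally:
            conn.close()
```

Outside a transaction the database ends up in WAL mode. Inside an explicit transaction the switch does not take effect, which is the failure mode the comment above warns about.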

Comment thread astrbot/core/db/sqlite.py
Comment on lines +59 to +60
await conn.execute(text("PRAGMA busy_timeout=5000"))
await conn.execute(text("PRAGMA wal_autocheckpoint=500"))

medium

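PRAGMA busy_timeout=5000 (5 seconds) is inconsistent with the timeout=30 (30 seconds) set in the BaseDatabase constructor. Executing this PRAGMA overrides the connection-level setting. Consider using 30000 ms throughout for consistency, so that there is enough wait time under heavy load to avoid SQLITE_BUSY errors.

In addition, this set of SQLite tuning parameters is repeated across several files (sqlite.py, document_storage.py, kb_db_sqlite.py). Per the general rule, consider refactoring it into a shared helper function to improve reuse and maintainability.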

Suggested change
- await conn.execute(text("PRAGMA busy_timeout=5000"))
- await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
+ await conn.execute(text("PRAGMA busy_timeout=30000"))
+ await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
References
  1. When implementing similar functionality for different cases, refactor the logic into a shared helper function to avoid code duplication.

Comment on lines +62 to 63
await self._apply_pragma()
async with self.engine.begin() as conn: # type: ignore

medium

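Calling _apply_pragma() separately here opens and closes one extra database connection before the main initialization block. Consider merging the PRAGMA setup into the engine.begin() block below to avoid the unnecessary connection overhead.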

Comment on lines +76 to 78
await conn.execute(text("PRAGMA busy_timeout=5000"))
await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
await conn.execute(text("PRAGMA temp_store=MEMORY"))

medium

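The busy_timeout value (5000 ms) is inconsistent with the timeout=30 set on line 45. This change also removes the existing PRAGMA mmap_size and PRAGMA optimize; these settings noticeably help SQLite performance (especially for large-data workloads such as the knowledge base), so consider keeping them.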

Suggested change
- await conn.execute(text("PRAGMA busy_timeout=5000"))
- await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
- await conn.execute(text("PRAGMA temp_store=MEMORY"))
+ await conn.execute(text("PRAGMA busy_timeout=30000"))
+ await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
+ await conn.execute(text("PRAGMA temp_store=MEMORY"))
+ await conn.execute(text("PRAGMA mmap_size=134217728"))
+ await conn.execute(text("PRAGMA optimize"))
