fix: add SQLite busy_timeout and WAL autocheckpoint to prevent CPU 100% epoll busy wait (Closes #8056) #8111
…0% epoll busy wait
When SQLite WAL grows large (e.g., 6MB), checkpoint operations cause
lock contention. Without busy_timeout and wal_autocheckpoint PRAGMAs,
aiosqlite enters a busy retry loop that manifests as epoll_wait 100% CPU.
Changes across all three SQLite databases in AstrBot:
1. Main DB (astrbot/core/db/sqlite.py):
- Add PRAGMA busy_timeout=5000 (5s lock wait)
- Add PRAGMA wal_autocheckpoint=500 (checkpoint every ~2MB)
2. KB DB (astrbot/core/knowledge_base/kb_db_sqlite.py):
- Add connect_args={'timeout': 30} for engine
- Add all PRAGMA settings (WAL, synchronous, busy_timeout,
wal_autocheckpoint, cache_size, temp_store)
- Remove duplicate/redundant old PRAGMA block
3. Document storage (astrbot/core/db/vec_db/faiss_impl/document_storage.py):
- Add connect_args={'timeout': 30} for engine
- Add _apply_pragma() method with PRAGMA settings
- Call _apply_pragma() during initialize()
Closes AstrBotDevs#8056
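The effect of a busy timeout on lock contention can be illustrated with a minimal sketch. It uses the standard-library `sqlite3` module rather than the aiosqlite/SQLAlchemy stack in AstrBot, and `try_write_lock` and `demo` are hypothetical names introduced only for this illustration. One connection holds the write lock while a second one tries to take it, first with no timeout and then with a short one.

```python
import os
import sqlite3
import tempfile
import time


def try_write_lock(path: str, timeout_s: float) -> tuple[bool, float]:
    """Try to take the write lock; return (acquired, seconds spent trying)."""
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so BEGIN/ROLLBACK are issued only when we ask for them.
    conn = sqlite3.connect(path, timeout=timeout_s, isolation_level=None)
    start = time.monotonic()
    try:
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("ROLLBACK")
        return True, time.monotonic() - start
    except sqlite3.OperationalError:  # "database is locked" (SQLITE_BUSY)
        return False, time.monotonic() - start
    finally:
        conn.close()


def demo() -> dict[str, tuple[bool, float]]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        holder = sqlite3.connect(path, isolation_level=None)
        holder.execute("PRAGMA journal_mode=WAL")
        holder.execute("CREATE TABLE t (x INTEGER)")
        holder.execute("BEGIN IMMEDIATE")  # hold the write lock
        try:
            return {
                # timeout 0: SQLite reports SQLITE_BUSY without waiting
                "no_timeout": try_write_lock(path, 0.0),
                # non-zero timeout: SQLite waits inside the call before giving up
                "short_timeout": try_write_lock(path, 0.3),
            }
        finally:
            holder.execute("ROLLBACK")
            holder.close()


if __name__ == "__main__":
    for label, (acquired, waited) in demo().items():
        print(f"{label}: acquired={acquired} waited={waited:.3f}s")
```

Both attempts fail here because the lock is never released, but with a non-zero timeout the waiting happens inside SQLite instead of in a caller-side retry loop.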
Hey - I've found 1 issue, and left some high level feedback:
- In `document_storage._apply_pragma`, you are calling `await conn.commit()` inside an `engine.begin()` context; since `begin()` manages the transaction lifecycle itself, this explicit commit is unnecessary and may not behave as expected on an `AsyncConnection`. It would be safer to rely on the context manager's implied commit/rollback.
- The SQLite PRAGMA configuration (WAL, synchronous, busy_timeout, wal_autocheckpoint, etc.) is now duplicated across the main DB, knowledge base DB, and document storage; consider factoring this into a shared helper to keep the settings consistent and easier to adjust in one place.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `document_storage._apply_pragma`, you are calling `await conn.commit()` inside an `engine.begin()` context; since `begin()` manages the transaction lifecycle itself, this explicit commit is unnecessary and may not behave as expected on an `AsyncConnection`—it would be safer to rely on the context manager’s implied commit/rollback.
- The SQLite PRAGMA configuration (WAL, synchronous, busy_timeout, wal_autocheckpoint, etc.) is now duplicated across the main DB, knowledge base DB, and document storage; consider factoring this into a shared helper to keep the settings consistent and easier to adjust in one place.
## Individual Comments
### Comment 1
<location path="astrbot/core/db/vec_db/faiss_impl/document_storage.py" line_range="201" />
<code_context>
self.DATABASE_URL,
echo=False,
future=True,
+ connect_args={"timeout": 30},
)
self.async_session_maker = sessionmaker(
</code_context>
<issue_to_address>
**suggestion:** Clarify whether both `connect_args["timeout"]` and `PRAGMA busy_timeout` are needed.
You’re now setting the lock wait in two places: `connect_args={"timeout": 30}` (seconds) and `PRAGMA busy_timeout=5000` (milliseconds). Consider choosing one mechanism or clearly aligning/documenting the values so the effective timeout is unambiguous.
Suggested implementation:
```python
self.DATABASE_URL,
echo=False,
future=True,
# timeout is in seconds; aligned with PRAGMA busy_timeout (30_000 ms)
connect_args={"timeout": 30},
)
```
```python
await conn.execute(text("PRAGMA journal_mode=WAL"))
await conn.execute(text("PRAGMA synchronous=NORMAL"))
# 30_000 ms = 30 seconds; aligned with connect_args["timeout"]
await conn.execute(text("PRAGMA busy_timeout=30000"))
```
</issue_to_address>
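How the two settings interact can be checked with a small sketch. It assumes the standard-library `sqlite3` module, whose `timeout` argument is what `connect_args={"timeout": 30}` is expected to reach via aiosqlite; that pass-through is an assumption here, and `effective_busy_timeout_ms` and `demo` are hypothetical helper names.

```python
import sqlite3


def effective_busy_timeout_ms(conn: sqlite3.Connection) -> int:
    """Return the busy timeout currently in force on this connection, in ms."""
    return conn.execute("PRAGMA busy_timeout").fetchone()[0]


def demo() -> tuple[int, int]:
    # The connect-level timeout is given in seconds ...
    conn = sqlite3.connect(":memory:", timeout=30)
    try:
        from_connect = effective_busy_timeout_ms(conn)
        # ... while the PRAGMA takes milliseconds and replaces the earlier value.
        conn.execute("PRAGMA busy_timeout=5000")
        after_pragma = effective_busy_timeout_ms(conn)
        return from_connect, after_pragma
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo())
```

On a stock CPython build this should report the connect-level value first and the PRAGMA value second, which suggests the two are one setting written twice rather than two independent timeouts.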
Code Review
This pull request introduces SQLite performance and concurrency optimizations by configuring PRAGMAs like WAL mode, busy timeouts, and autocheckpoints across several database modules. The review feedback points out inconsistencies between connection-level timeouts and PRAGMA settings, potential transaction conflicts when enabling WAL mode within an active transaction, and the unintended removal of performance-enhancing PRAGMAs in the knowledge base. It is also suggested to refactor the duplicated PRAGMA logic into a shared helper function and consolidate connection calls to reduce overhead.
```python
async def _apply_pragma(self) -> None:
    """Apply SQLite PRAGMAs for performance and concurrency."""
    async with self.engine.begin() as conn:
        await conn.execute(text("PRAGMA journal_mode=WAL"))
        await conn.execute(text("PRAGMA synchronous=NORMAL"))
        await conn.execute(text("PRAGMA busy_timeout=5000"))
        await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
        await conn.commit()
```
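This new method has the following issues:
- Transaction conflict: `PRAGMA journal_mode=WAL` cannot be changed inside a transaction. `self.engine.begin()` opens a transaction automatically, which may cause the command to fail or be ignored. Consider using `self.engine.connect()` or running it outside a transaction.
- Inconsistency: `busy_timeout=5000` conflicts with the `timeout=30` set on line 201. Consider using 30000 in both places.
- Redundant code: calling `await conn.commit()` inside the `engine.begin()` block is unnecessary, because the context manager commits automatically on exit.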
Suggested change:
```diff
-async def _apply_pragma(self) -> None:
-    """Apply SQLite PRAGMAs for performance and concurrency."""
-    async with self.engine.begin() as conn:
-        await conn.execute(text("PRAGMA journal_mode=WAL"))
-        await conn.execute(text("PRAGMA synchronous=NORMAL"))
-        await conn.execute(text("PRAGMA busy_timeout=5000"))
-        await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
-        await conn.commit()
+async def _apply_pragma(self) -> None:
+    """Apply SQLite PRAGMAs for performance and concurrency."""
+    async with self.engine.connect() as conn:
+        await conn.execute(text("PRAGMA journal_mode=WAL"))
+        await conn.execute(text("PRAGMA synchronous=NORMAL"))
+        await conn.execute(text("PRAGMA busy_timeout=30000"))
+        await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
```
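Whether the switch to WAL actually took effect can be verified rather than assumed, because `PRAGMA journal_mode=WAL` returns the journal mode that is in force afterwards. The following is a minimal sketch using the standard-library `sqlite3` module in autocommit mode, not the SQLAlchemy code under review; `enable_wal` and `demo` are hypothetical names.

```python
import os
import sqlite3
import tempfile


def enable_wal(path: str) -> str:
    """Switch the database at `path` to WAL and return the mode SQLite reports."""
    # isolation_level=None means autocommit: no transaction is open
    # when the PRAGMA runs.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        # The PRAGMA returns one row holding the journal mode now in effect.
        return conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    finally:
        conn.close()


def demo() -> str:
    # WAL needs a file-backed database, so use a temporary file, not :memory:.
    with tempfile.TemporaryDirectory() as tmp:
        return enable_wal(os.path.join(tmp, "demo.db"))


if __name__ == "__main__":
    print(demo())
```

Logging or asserting on the returned mode in `_apply_pragma()` would make a silently ignored PRAGMA visible, whichever connection style is chosen.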
```python
await conn.execute(text("PRAGMA busy_timeout=5000"))
await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
```
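`PRAGMA busy_timeout=5000` (5 seconds) is inconsistent with the `timeout=30` (30 seconds) set in the `BaseDatabase` constructor. Executing this PRAGMA overrides the connection-level setting. Consider using 30000 ms throughout for consistency, so that there is enough wait time under high load to avoid `SQLITE_BUSY` errors.
In addition, this set of SQLite tuning parameters is repeated across several files (sqlite.py, document_storage.py, kb_db_sqlite.py). Per the general rule, consider refactoring it into a shared helper function to improve reuse and maintainability.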
Suggested change:
```diff
-await conn.execute(text("PRAGMA busy_timeout=5000"))
-await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
+await conn.execute(text("PRAGMA busy_timeout=30000"))
+await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
```
References
- When implementing similar functionality for different cases, refactor the logic into a shared helper function to avoid code duplication.
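One possible shape for such a shared helper is sketched below. It is written against the standard-library `sqlite3` module so that it runs on its own; AstrBot would need an async equivalent for its SQLAlchemy engines. `DEFAULT_PRAGMAS` and `apply_pragmas` are hypothetical names, and the default values are simply the ones suggested in this review.

```python
import sqlite3
from typing import Mapping, Optional

# Hypothetical shared defaults; values follow the reviewers' suggestions.
DEFAULT_PRAGMAS: Mapping[str, object] = {
    "journal_mode": "WAL",
    "synchronous": "NORMAL",
    "busy_timeout": 30000,  # milliseconds
    "wal_autocheckpoint": 500,  # pages
}


def apply_pragmas(
    conn: sqlite3.Connection,
    overrides: Optional[Mapping[str, object]] = None,
) -> dict:
    """Apply the shared PRAGMAs and return the values SQLite reports back."""
    settings = {**DEFAULT_PRAGMAS, **(overrides or {})}
    applied = {}
    for name, value in settings.items():
        # PRAGMA arguments cannot be bound as parameters, so only simple
        # identifiers and alphanumeric values are accepted in this sketch
        # (negative values such as cache_size=-64000 would need extra handling).
        if not name.isidentifier() or not str(value).isalnum():
            raise ValueError(f"unsupported PRAGMA setting: {name}={value!r}")
        conn.execute(f"PRAGMA {name}={value}")
        row = conn.execute(f"PRAGMA {name}").fetchone()
        applied[name] = row[0] if row else None
    return applied


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(apply_pragmas(connection, {"busy_timeout": 5000}))
    connection.close()
```

Returning the values read back from SQLite lets each call site log or assert what is really in force, for example that `journal_mode` is `wal` on a file-backed database.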
```python
await self._apply_pragma()
async with self.engine.begin() as conn:  # type: ignore
```
```python
await conn.execute(text("PRAGMA busy_timeout=5000"))
await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
await conn.execute(text("PRAGMA temp_store=MEMORY"))
```
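The `busy_timeout` value (5000 ms) is inconsistent with the `timeout=30` set on line 45. In addition, this change removes the existing `PRAGMA mmap_size` and `PRAGMA optimize`; these settings help SQLite performance significantly, especially for large-data scenarios such as the knowledge base, so consider keeping them.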
Suggested change:
```diff
-await conn.execute(text("PRAGMA busy_timeout=5000"))
-await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
-await conn.execute(text("PRAGMA temp_store=MEMORY"))
+await conn.execute(text("PRAGMA busy_timeout=30000"))
+await conn.execute(text("PRAGMA wal_autocheckpoint=500"))
+await conn.execute(text("PRAGMA temp_store=MEMORY"))
+await conn.execute(text("PRAGMA mmap_size=134217728"))
+await conn.execute(text("PRAGMA optimize"))
```
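The `wal_autocheckpoint=500` value used throughout this PR is counted in pages, so the size it corresponds to depends on the database page size. A rough sketch of the arithmetic, using the standard-library `sqlite3` module and a hypothetical helper name, is:

```python
import sqlite3


def autocheckpoint_threshold_bytes(conn: sqlite3.Connection, pages: int = 500) -> int:
    """Approximate WAL payload size at which an automatic checkpoint is attempted."""
    # PRAGMA page_size reports the page size of this database in bytes.
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    # Ignores the small per-frame and file headers that the WAL file also carries.
    return pages * page_size


def demo() -> tuple[int, float]:
    conn = sqlite3.connect(":memory:")
    try:
        threshold = autocheckpoint_threshold_bytes(conn)
        return threshold, threshold / (1024 * 1024)
    finally:
        conn.close()


if __name__ == "__main__":
    size_bytes, size_mib = demo()
    print(f"{size_bytes} bytes (about {size_mib:.2f} MiB)")
```

With the 4096-byte page size that recent SQLite versions use by default, 500 pages is roughly 2 MB, which matches the "~2MB" figure quoted in the commit message.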
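Problem
See Issue #8056 for details. After AstrBot has been running for a while, a single CPU core is pegged at 100%, and strace shows epoll_wait busy-waiting without blocking. The cause is that the SQLite WAL file grows too large (e.g., 6MB), which leads to lock contention, and aiosqlite enters a busy retry loop.
Root cause
None of the three SQLite databases set the `busy_timeout` PRAGMA. When the WAL file needs a checkpoint, lock contention makes SQLite return `SQLITE_BUSY` immediately, and aiosqlite retries repeatedly, which causes the epoll busy wait.
Changes
1. Main database (`astrbot/core/db/sqlite.py`)
- `PRAGMA busy_timeout=5000`: wait up to 5 seconds for a lock instead of failing immediately
- `PRAGMA wal_autocheckpoint=500`: checkpoint automatically when the WAL reaches about 2MB
2. Knowledge base database (`astrbot/core/knowledge_base/kb_db_sqlite.py`)
- `connect_args={'timeout': 30}`
3. Document storage (`astrbot/core/db/vec_db/faiss_impl/document_storage.py`)
- `connect_args={'timeout': 30}`
- `_apply_pragma()` method, called during initialization
busy_timeout notes
`PRAGMA busy_timeout=5000` tells SQLite to wait up to 5000 milliseconds (5 seconds) when it encounters a lock, sleeping and retrying with backoff during that time, instead of returning `SQLITE_BUSY` immediately and leaving the upper layer to busy-wait. Combined with `wal_autocheckpoint=500`, this keeps the WAL size under control and lock contention predictable.
Summary by Sourcery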
Configure SQLite databases to better handle concurrency and prevent busy-wait CPU spikes by tuning PRAGMA settings and connection timeouts.
Bug Fixes:
Enhancements: