CPU Optimization #825
Merged
…atch

Replace periodic timer polling and synchronous refresh with async TG_SCHEDULE dispatch via ObInternalTableChangeNotifier.

- Notifier: simplified ModuleEntry to a single callback (removed the array); pure dispatch with no locks or retry logic in the framework.
- SRS: removed TenantSrsUpdatePeriodicTask/TenantSrsUpdateTask; added an async RetryTimerTask that self-reschedules on failure and stops on success; removed dead notify_srs_changed.
- TIMEZONE: moved bootstrap refresh from start to the notifier callback path; UpdateTenantTZTask self-reschedules on failure.

All modules register callbacks in init. Import executor and role-change-driven switch_to_leader both trigger async dispatch.
**Issue**

obperf sampling shows that KVCache Wash consumes 61.4% of the process CPU while seekdb is idle, making it the top hotspot. Each time the wash timer (200ms) fires, it executes:

1. `ObKVCacheStore::wash`
   - `refresh_score`: iterates through all mb_handles, performing hazptr protect/release for each.
   - Heap construction loop: iterates through all mb_handles again, performing hazptr protect/release for each.
   - → Two-pass scan: each handle performs protect/release twice.
2. `ObKVCacheMap::clean_garbage_node`
   - Iterates through 200K buckets, each acquiring a write lock via `ObBucketWLockGuard`.
   - During idle, the vast majority of buckets are empty, so acquiring and immediately releasing locks is pure waste.
   - → Accounts for 88% of total wash CPU (~1280/1407 samples in `BucketLock::wrlock/unlock`).

**Optimization 1: Merged refresh_score into the heap construction loop (ob_kvcache_store.cpp)**

Inlined the per-handle score decay logic from `refresh_score` into the existing heap construction loop. Each handle now performs hazptr protect/release only once, down from twice. Retained the O(1) global calculation for `base_mb_score_`.

**Optimization 2: Fast skip for empty buckets (ob_kvcache_map.cpp)**

Before acquiring locks, `clean_garbage_node` and `replace_fragment_node` now check whether `get_bucket_node(i)` is NULL.

- NULL → Skip, no lock acquired.
- Non-NULL → Acquire lock, then double-check before processing.

Safety: pointer reads are hardware-atomic on aarch64/x86_64; the worst case falls back to the old behavior.

**Verification**

- Before optimization (PID 100137, CPU 10.3%): Wash CPU 61.4%, clean_garbage 58.7%
- Intermediate (PID 43062, only Optimization 1): Wash CPU 64.0%, clean_garbage 61.9% (Optimization 1 did not address the main bottleneck)
- After optimization (PID 98980, CPU 10.7%): Wash CPU 11.5%, clean_garbage 7.3% ↓81%
- Functional verification: after forcing cache eviction, `clean_node_count=2~5`; garbage nodes are correctly cleaned.

**Flame Graphs**

- Before optimization: http://obperf.oceanbase-dev.com/files/profile_20260517004356__work.flame.svg
- After optimization: http://obperf.oceanbase-dev.com/files/profile_20260517100310__work.flame.svg
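The empty-bucket fast skip from Optimization 2 can be sketched as follows. This is a minimal illustration, not the code in ob_kvcache_map.cpp: `Node`, `Bucket`, and `clean_garbage` are hypothetical stand-ins, and a `std::mutex` plays the role of the per-bucket write lock.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-ins for the cache map's node and bucket types.
struct Node {
  bool garbage;
  Node *next;
};

struct Bucket {
  std::atomic<Node *> head{nullptr};
  std::mutex lock;  // stand-in for the per-bucket write lock
};

// Frees garbage nodes. Returns the number of buckets that were locked and
// reports the number of freed nodes through `cleaned`.
template <std::size_t N>
std::size_t clean_garbage(std::array<Bucket, N> &buckets, std::size_t &cleaned) {
  std::size_t locked = 0;
  cleaned = 0;
  for (Bucket &b : buckets) {
    // Fast path: an empty bucket is skipped without touching its lock.
    if (b.head.load(std::memory_order_acquire) == nullptr) continue;

    std::lock_guard<std::mutex> guard(b.lock);
    ++locked;
    // Double-check: re-read the head under the lock, since the bucket may
    // have been emptied between the unlocked check and the lock acquisition.
    Node *prev = nullptr;
    Node *cur = b.head.load(std::memory_order_relaxed);
    while (cur != nullptr) {
      Node *next = cur->next;
      if (cur->garbage) {
        if (prev == nullptr) {
          b.head.store(next, std::memory_order_release);
        } else {
          prev->next = next;
        }
        delete cur;
        ++cleaned;
      } else {
        prev = cur;
      }
      cur = next;
    }
  }
  return locked;
}
```

With one populated bucket out of four, only that bucket's lock is taken. A bucket that becomes non-empty right after the unlocked check is simply picked up on the next pass, which matches the "worst case falls back to the old behavior" argument above.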
Background: obperf CPU sampling shows that in an idle seekdb instance, ObBKGDSessInActiveGuard accounts for 1.7% of samples (687/41161). This object is constructed/destructed each sleep cycle, each time involving:

- thread_local diagnostic info access
- is_ash_enabled global variable read
- set_sess_inactive: rdls + CACHE_ALIGNED bool write
- set_sess_active: rdls + idle time accumulation + trace_id read + ASH buffer binding

A full lifecycle is triggered 100 times per second on the TimeWheel deadlock detector (10ms precision).

Changes: two instances of ObBKGDSessInActiveGuard removed:

1. ob_clock_generator.h: ObClockGenerator::usleep
   - Removed the inactive_guard construction before each nanosleep.
   - Primary beneficiary: the TimeWheel stack guard in TimeWheelBase::scan reduced from 303 samples to zero.
2. utility.h: ob_usleep(v, is_idle_sleep=true)
   - Removed the guard in the is_idle_sleep branch, merging both branches into a direct ob_usleep(v) call.
   - The is_idle_sleep parameter is kept but commented as unused for signature compatibility.
   - Beneficiaries: ObMultiTenant::run1, ObTimerService::run1, ObBaseLogWriter::do_flush_log, ObDDLTransController::run1, etc.

Results:

| Metric | Before Optimization | After Optimization |
|---|---|---|
| BKGDSessGuard | 687 (1.7%) | 320 (0.7%) |
| TimeWheel+Guard | 303 samples | 0 samples |

The remaining Guard samples come from ~35 modules constructing it directly, not via generic sleep functions.

Risk Assessment:

- The ASH sampler will record fewer inactive state switches during sleep periods.
- No functional impact on user-visible diagnostic information.
- The thread's ASH inactive flag can still be used via direct Guard construction.

Next Steps:

- The remaining 320 Guard samples come from direct construction in various modules (ob_timer_service, ob_base_log_writer, etc.). These can be optimized module-by-module as needed.
…e RTTI overhead. On the server side, run_wrapper always points to an ObTenantBase instance, making the type check performed by dynamic_cast provably redundant. Changing it to static_cast avoids the virtual table traversal that occurs each time a timer task switches tenants.
…Mgr Timer Task.

ObPxTargetMgr executes `refresh_statistics` every 500ms. Within this, it determines whether the current node is a Leader or Follower via the call chain: `get_dummy_leader` -> `check_dummy_location_credible` -> `get_role`. Internally, `get_role` performs `MTL_SWITCH` + `check_palf_exist` + `open_palf` + `get_role` to query the Paxos role of SYS_LS. This is a CPU hotspot per tick (flame graph shows `runTimerTask` at 0.57%, with `lib_mtl_switch` accounting for 10.04% of that).

SeekDB is a single-replica observer-lite mode. `ElectionImpl::get_role` always returns LEADER, and the Leader never switches. This entire Leader/Follower coordination mechanism is essentially dead code.

Removed the following dead code (-237 lines):

1. `get_dummy_leader`: `MTL_SWITCH` + `nonblock_get_leader` to locate the Leader.
2. `check_dummy_location_credible`: calls `get_role` to verify if the cache is credible.
3. `get_role`: `MTL_SWITCH` + `open_palf` to get the Paxos role.
4. `refresh_dummy_location`: empty function, only returns `OB_SUCCESS`.
5. `query_statistics`: RPC logic for Followers to report resource increments to the Leader.
6. `reset_follower_statistics`: logic to reset statistics on the Follower side.

Removed member variables that are no longer needed:

- `cluster_id_`, `dummy_cache_leader_`, `rpc_proxy_`, `need_send_refresh_all_`

Simplified `refresh_statistics`:

- Removed the call to `get_dummy_leader` and the Follower branch.
- On the first tick, executes `reset_leader_statistics` once; subsequent ticks return directly.

Applicable only to single-replica (observer-lite) deployments. This change cannot be applied to multi-machine OceanBase clusters, which genuinely need Leader/Follower switching.

Verification:

- Compilation passes.
- OBD-deployed single instance starts normally.
- MySQL connection is normal.
…n-memory atomic CAS
## Problem
The Fetcher main loop (ObCSFetcher::run1) calls try_advance_refresh_scn_ every 200ms
unconditionally, regardless of IDLE or ACTIVE mode. Under IDLE mode (no async vector
index tables), the loop iterates every 10ms, and get_refresh_scn returns GTS each time.
This results in a SQL UPDATE against __all_global_stat every 200ms even when the value
has not changed, causing unnecessary database writes.
Additionally, the Worker path (do_finish_batch_) also writes refresh_scn via SQL
within a transaction, and the virtual table query reads it via SQL — all incurring
database round-trips for a monotonically-advancing counter that only needs to be
queryable within the same process.
## Solution
Migrate refresh_scn from __all_global_stat (SQL-persisted) to in-memory management
in ObCSDispatcher using atomic variables (ATOMIC_LOAD/ATOMIC_STORE/ATOMIC_BCAS):
1. ObCSDispatcher::refresh_scn_ becomes an atomic variable with CAS-based
update_refresh_scn that advances the value only when the new value is larger.
2. Fetcher path (try_advance_refresh_scn_)
- Was: ObGlobalStatProxy::advance_change_stream_refresh_scn → SQL UPDATE
- Now: dispatcher_->update_refresh_scn → in-memory CAS, zero SQL
3. Worker path (do_finish_batch_)
- refresh_scn advancement moves from release_batch to after batch commit
- Uses dispatcher_->update_refresh_scn instead of SQL
4. Read paths
- wait_refresh_scn in ObChangeStreamMgr reads from dispatcher_->get_refresh_scn
- Virtual table ob_all_virtual_change_stream_refresh_stat reads in-memory state
- init_refresh_scn_ still loads from __all_global_stat once as recovery baseline
and allows rollback on reload (recovery semantics differ from runtime advancement)
5. Cleanup
- Removes refresh_scn_inited flag and init_refresh_scn public method
- Refactors release_batch to no longer manage refresh_scn directly
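The CAS-based "advance only when larger" update from step 1 can be sketched as below. This is an illustration under stated assumptions: it uses `std::atomic` rather than the ATOMIC_LOAD/ATOMIC_BCAS macros, and `RefreshScn` and `advance` are hypothetical names, not the ObCSDispatcher interface.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the dispatcher's in-memory refresh_scn.
class RefreshScn {
 public:
  explicit RefreshScn(uint64_t initial) : value_(initial) {}

  uint64_t get() const { return value_.load(std::memory_order_acquire); }

  // Stores `candidate` only if it is larger than the current value.
  // Returns true if this call advanced the value.
  bool advance(uint64_t candidate) {
    uint64_t current = value_.load(std::memory_order_acquire);
    while (candidate > current) {
      // On failure, compare_exchange_weak reloads `current`, so the loop
      // condition is re-evaluated against the latest value.
      if (value_.compare_exchange_weak(current, candidate,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire)) {
        return true;
      }
    }
    return false;
  }

 private:
  std::atomic<uint64_t> value_;
};
```

Because a smaller or equal candidate never enters the CAS loop, concurrent Fetcher and Worker callers cannot move the value backwards at runtime; a deliberate rollback on reload, as described for init_refresh_scn_, would need a separate plain store.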
## Files Changed
- ob_change_stream_dispatcher.{h,cpp}: atomic refresh_scn_, update_refresh_scn, cleanup
- ob_change_stream_fetcher.cpp: Fetcher uses dispatcher->update_refresh_scn
- ob_change_stream_worker.cpp: Worker commits refresh_scn after batch success
- ob_change_stream_mgr.cpp: wait_refresh_scn reads from in-memory dispatcher
- ob_all_virtual_change_stream_refresh_stat.cpp: virtual table reads from memory
- wait_cs_sync.inc: mysqltest adjustment
…anual concatenation using MEMCPY.

**Issue**

set_ext_tname is called on every timer task execution (a hot path in the handle). The original implementation used databuff_printf to format the thread name as "%s_%s", incurring runtime overhead for format string parsing and va_list variable arguments.

**Change**

Replaced databuff_printf with manual concatenation using STRLEN + MEMCPY: first get the lengths of the two strings, check that the total length does not exceed OB_EXTENED_THREAD_NAME_BUF_LEN (32), then sequentially memcpy the tname, '_', timer_name_, and '\0'.

**Safety**

- tname (OB_THREAD_NAME_BUF_LEN=16) and timer_name_ (16) each have ≤15 valid characters. After concatenation, the total is ≤31, which always satisfies the boundary check condition of <32.
- The MEMCPY / STRLEN macros are indirectly introduced via ob_define.h, adding no new dependencies.

**Verification**

- Full compilation passes with 0 errors.
- After starting a new instance, compared the ext_tname output in logs; the format is completely consistent with the old implementation (e.g., TimerWK0_KVCacheWash, TimerWK0_AdvanceCKPT).
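A minimal sketch of the length-checked concatenation described above, using the standard `strlen`/`memcpy` rather than the STRLEN/MEMCPY macros. The function name `concat_thread_name` and its signature are hypothetical and only illustrate the technique.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Writes "<tname>_<timer_name>" into buf. Returns false, leaving buf
// untouched, if the result plus its terminating '\0' does not fit.
bool concat_thread_name(char *buf, std::size_t buf_len,
                        const char *tname, const char *timer_name) {
  const std::size_t tname_len = std::strlen(tname);
  const std::size_t timer_len = std::strlen(timer_name);
  // Required bytes: tname + '_' + timer_name + '\0'.
  if (tname_len + 1 + timer_len + 1 > buf_len) {
    return false;
  }
  std::memcpy(buf, tname, tname_len);
  buf[tname_len] = '_';
  std::memcpy(buf + tname_len + 1, timer_name, timer_len);
  buf[tname_len + 1 + timer_len] = '\0';
  return true;
}
```

With a 32-byte buffer, two 15-character names produce 31 characters plus the terminator, which is exactly the boundary case covered by the safety argument.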
The obperf flame graph shows that BlockGCTimerTask::runTimerTask consumes 3.31% of CPU. Within that, the two calls to palf_handle_impl_map_.for_each (for `get_total_used_disk_space_` and `recycle_blocks_`) each go through the ObLinkHashMap Iterator/HandleOn/revert path. In this environment, there is only 1 sys LS, and the palf_id is fixed as ObLSID::SYS_LS_ID (=1), making the for_each traversal completely wasteful. Changes: 1. `get_total_used_disk_space_` — Directly call `get_palf_handle_impl(SYS_LS_ID, guard)` to replace `for_each(GetTotalUsedDiskSpace functor)`. 2. `recycle_blocks_` — Directly call `get_palf_handle_impl(SYS_LS_ID, guard)` and inline the recycling logic (base_lsn check, block GC condition check, delete_block) to replace `for_each(LogGetRecycableFileCandidate functor)`. 3. Remove the now-unused `GetTotalUsedDiskSpace` and `LogGetRecycableFileCandidate` functors (both declarations and implementations). 4. Add `#include "share/ob_ls_id.h"`.
OB_ERR_EMPTY_QUERY means the SRS table has not been fully imported (srs_cnt < 5152), so retrying is pointless — wait for the import notifier to trigger a fresh refresh instead of busy-looping every 1s.
The MemoryDump thread previously triggered a STAT_LABEL scan every 10 seconds, iterating all tenants * ctx_ids * chunks * blocks * objects to generate per-label memory statistics. This periodic full-memory walk caused significant CPU overhead on production machines.

This commit sets STAT_LABEL_INTERVAL to INT64_MAX, effectively disabling periodic auto-scanning. The original code path is fully preserved.

To manually trigger a memory stat scan, run: `ALTER SYSTEM REFRESH MEMORY STAT;`

Other on-demand dump triggers (kill -62 signal, etc/dump.config) are unaffected and continue to work normally.

The STAT_LABEL_INTERVAL constant remains as a single point for future standardization: when a configurable interval parameter is introduced, this constant should be replaced with that parameter.
…e/update_queue_size, and relax TIME_SLICE_PERIOD from 10ms to 1s.

Flame graph analysis (obperf, 155,396 total samples) of the 10ms scheduling loop in ObMultiTenant::run1 shows three sources of unnecessary periodic overhead around ObTenant::timeup:

1. update_token_usage: every 1 second, it traverses the worker list, atomically clears idle_us_, and calculates token_usage_. However, token_usage_ and worker_us_ are only read by virtual tables (ob_all_virtual_sys_stat / ob_all_virtual_res_mgr_sys_stat) for display and do not affect any scheduling decisions.
2. update_queue_size: every 10ms, it calls ObServerConfig::get_instance to read the tenant_task_queue_size configuration item. Under normal conditions the configuration does not change, making this an ineffective polling operation.
3. TIME_SLICE_PERIOD = 10000 (10ms): the flame graph shows that ObTenant::timeup itself accounts for 3.49% of total CPU usage, with lock operations (~47%), retry queue processing (32%), and worker recycling (19%) triggered every 10ms, which is too frequent. Relaxing this to 1s (1000000us) can significantly reduce lock contention overhead, as the semantics of tenant inspection (worker start/stop, retry replay) are not sensitive to millisecond-level latency.

Changes:

- Remove the body of the update_token_usage method and its call in timeup.
- Remove the body of the update_queue_size method and its call in timeup.
- Change get_token_usage and get_worker_time to directly return 0 (preserving interface compatibility for virtual tables).
- Remove member variables: token_usage_, token_usage_check_ts_, worker_us_.
- Replace update_queue_size with an inline set_queue_limit(int64_t) that directly operates on req_queue_.
- Remove the corresponding members from the constructor's initialization list.
- TIME_SLICE_PERIOD: 10000 → 1000000 (10ms → 1s).
- Add reload_tenant_task_queue_size: hold a read lock, read GCONF.tenant_task_queue_size, and call tenant_->set_queue_limit.
- Change the interval for periodic tenant info dumps: 10s → 1s (to improve monitoring granularity).
- Add a call to reload_tenant_task_queue_size at the end of ObServerReloadConfig::operator, piggybacking on the server-level configuration hot reload path.

Behavior after the change:

- tenant_task_queue_size will no longer be polled every 10ms; instead, it will be updated passively when triggered by ALTER SYSTEM SET. Configuration change path: RPC → config_mgr_->reload_config → ObServerReloadConfig::operator → ObReloadConfig::operator updates GCONF → reload_tenant_task_queue_size → set_queue_limit(GCONF.tenant_task_queue_size).
- The output of update_token_usage is only used for display. After removal, virtual table queries will return 0, which does not affect system behavior.
- Relaxing TIME_SLICE_PERIOD from 10ms to 1s reduces the timeup call frequency from 100Hz to 1Hz, directly cutting down the cumulative overhead from lock contention, retry drain, and worker list traversal.
## Background

ObBGThreadMonitor is a thread-level function execution timeout watchdog designed to monitor background thread function execution time. It was fully initialized and running (consuming ~0.52% CPU per flame graph analysis), but **zero** business code was instrumented with MonitorGuard: no threads or functions were ever registered for monitoring.

Flame graph data (155,396 total samples): ObBGThreadMonitorTimerTask::runTimerTask: 805 samples (0.52%), all self-samples, spent scanning 500 empty MonitorEntryStack arrays every second (ObClockGenerator::getClock + 2500 timestamp checks + spinlock acquire/release per cycle).

## Changes

### Deleted

- src/share/ob_bg_thread_monitor.h: all classes (MonitorGuard, ObBGThreadMonitor, ObBGThreadMonitorTimerTask, MonitorEntryStack, MonitorEntry, BGDummyCallback, IBGCallback, MonitorCallbackWrapper, ObTSIBGMonitorMemory, ObTSIBGMonitorSlotInfo) and macros (BG_MONITOR_GUARD(_DEFAULT), BG_NEW_CALLBACK, BG_DELETE_CALLBACK)
- src/share/ob_bg_thread_monitor.cpp: all implementations

### Modified

- src/observer/ob_server.cpp: removed #include, init/start/stop/wait/destroy lifecycle calls
- src/share/ob_thread_define.h: removed TG_DEF(BGThreadMonitor, BGThreadMonitor, TIMER)
- src/share/CMakeLists.txt: removed ob_bg_thread_monitor.cpp from build

## Risk Assessment

- Zero-risk: no business code references MonitorGuard/BG_MONITOR_GUARD. A search across the entire src/ confirmed no includes of ob_bg_thread_monitor.h outside the deleted files and ob_server.cpp lifecycle management.
- No test files reference the framework.

## Verification

- Debug build: passed (make -j80, [100%] Built target observer)
- Instance deployment: deployed to ~/ob1, startup clean, SQL connectivity OK
Problem

ObCSDispatcher::run1 uses dispatch_cond_.wait(100) in its idle loop. Each wakeup triggers futex kernel operations + ObWaitEventGuard + ObDiagnosticInfo::end_wait_event accounting, creating ~10 scheduling events per second even when there is no work. Flame graph (alloc view) shows this accounts for 0.95% of system-wide allocation events.

Analysis

The condition variable is correctly used with the standard pattern:

1. Lock mutex (ObThreadCondGuard)
2. Check condition under lock
3. cond.wait(timeout)

Signal is sent under the same mutex in push, so POSIX guarantees that a waiting thread is woken immediately regardless of timeout. The timeout is purely a fallback; longer timeouts do not risk losing signals.

Solution

1. Increase dispatch_cond_.wait(100) to dispatch_cond_.wait(10000): idle wakeups drop from 10/sec to 0.1/sec, reducing alloc overhead by ~100x in the idle path.
2. Add dispatch_cond_.signal in stop: prevents the 10s timeout from delaying shutdown when the thread is blocked in wait.

Verification

- Deployed binary, confirmed idle wakeup intervals are exactly 10s (wait_duration_us = 10000067, 10000155, 10000087, 10000078).
- Measured shutdown latency: 168ms (signal in stop works correctly).
- Push→signal still wakes dispatcher immediately per condvar semantics (requires change stream activity to exercise directly).
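The lock / check / wait-with-timeout pattern, together with the signal in stop, can be sketched with the standard library as follows. This is only an illustration: `Dispatcher`, `push`, `stop`, and `wait_pop` are hypothetical names built on `std::condition_variable`, not the ObThreadCond API.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical dispatcher queue; the timeout is only a fallback wakeup.
class Dispatcher {
 public:
  // Producer: enqueue and signal under the same mutex the consumer waits on.
  void push(int item) {
    std::lock_guard<std::mutex> guard(mutex_);
    queue_.push_back(item);
    cond_.notify_one();
  }

  // Signal on stop so a long timeout cannot delay shutdown.
  void stop() {
    std::lock_guard<std::mutex> guard(mutex_);
    stopped_ = true;
    cond_.notify_all();
  }

  // Consumer: returns true and fills `item` if work is available; returns
  // false if the queue is still empty after a stop or a timeout.
  bool wait_pop(int &item, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate is checked under the lock before and after each wakeup.
    cond_.wait_for(lock, timeout,
                   [this] { return stopped_ || !queue_.empty(); });
    if (queue_.empty()) {
      return false;
    }
    item = queue_.front();
    queue_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::deque<int> queue_;
  bool stopped_ = false;
};
```

Since the predicate is evaluated under the lock, an item pushed before the consumer starts waiting is returned without blocking, so raising the timeout from 100ms to 10s changes only how often an idle consumer wakes up.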
…o a raw pointer.

In seekdb (lite deployment), there is always only one LS (SYS_PALF_ID=1), which is never deleted after creation. The original ObLinkHashMap<LSKey, IPalfHandleImpl> was designed for multi-LS scenarios. Every access had to go through a hash map lookup, value allocation, and reference counting, which is pure overhead in the single-LS case.

Profiling revealed that the LogLoopThread consumes a significant number of CPU cycles in the lambda within for_each, when in reality it is just making a simple method call to the single, unique PalfHandleImpl.

## Changes Made

### Data Structures (palf_env_impl.h/.cpp)

- Removed the PalfHandleImplAlloc class and its 4 methods.
- Removed the typedef ObLinkHashMap<LSKey, IPalfHandleImpl, PalfHandleImplAlloc>.
- Removed `#include "lib/hash/ob_link_hashmap.h"`.
- Replaced PalfHandleImplMap + sys_ls_handle_ with `IPalfHandleImpl *single_palf_handle_`.

### Removed Reference Counting

- `revert_palf_handle_impl`: now a no-op (handle lifecycle == process lifecycle).
- `wait_until_reference_count_to_zero_`: now directly returns OB_SUCCESS.
- `create_palf_handle_impl_`: now a simple raw pointer assignment.
- `remove_palf_handle_impl_from_map_not_guarded_by_lock_`: now directly frees memory and sets the pointer to null.

### Simplified `get_palf_handle_impl`

- Old: hash map lookup → increment ref count → check_can_be_used → revert on failure.
- New: null check → check_can_be_used → directly returns the pointer.

### Simplified `for_each` (both overloads)

- Old: ObLinkHashMap::for_each + lambda wrapper.
- New: get_palf_handle_impl(SYS_PALF_ID, handle) → func → revert.

### Simplified Callers

- LogLoopThread: 4 `for_each` calls → 4 direct method calls, all lambdas removed.
- LogUpdater: `for_each` → direct call to `update_palf_stat`.
- `get_total_used_disk_space_`: map for_each → functor(ls_key, single_palf_handle_).
- `recycle_blocks_`: map for_each → direct call to `single_palf_handle_->delete_block`.
- `check_can_create_palf_handle_impl_`: map.count → null check.
- `check_can_update_log_disk_options_`: map.count → null check.

### Memory Safety Invariant

`single_palf_handle_` is set once at startup and only set to null during shutdown. It is never freed during operation. Threads reading the pointer do not require synchronization. The `palf_meta_lock_` (RWLock) ensures writes during create/reload are visible to all subsequent reads.
…mand

## Problem

ObDiskUsageReportTask::runTimerTask runs periodically to refresh an in-memory cache (result_map_) that maps (file_type, tenant_id) -> disk usage. Flame graph analysis (profile_20260520115549) shows this consumes ~1.11% CPU. Call chain, outermost frame first:

| Frame | Samples | Self |
|---|---|---|
| runTimerTask | 314 (1.11%) | |
| count_tenant | 314 | 3 |
| count_tenant_data | 311 | 1 |
| get_next_tablet_ptr | 310 | 0 |
| fetch_tablet_item | 310 | 0 |
| for_each_value | 310 | 0 |
| try_lock_all | 310 | 18 |
| try_rdlock (hotspot) | 292 (1.03%) | 292 |

94% of the cost is ObLatch::try_rdlock atomic CAS operations: try_lock_all iterates ~1281 latches (10243 buckets / 8) and atomically try_rdlock's each one. This runs on every timer tick even when nobody reads the result.

## Analysis of consumers

The result_map_ cache has exactly two consumers:

1. get_data_disk_used_size: serves the __all_virtual_unit virtual table
2. delete_tenant_usage_stat: called on tenant drop (N/A for seekdb)

There is NO SQL reporting path. The comment in report_tenant_disk_usage says "update the usage table" but copies result_map_ to a local array and discards it; the reporting code was never implemented.

For seekdb (single tenant, never deleted), the timer is pure overhead: the only reader of result_map_ is the virtual table query.

## Solution

Stop the DiskUseReport timer entirely, and make get_data_disk_used_size compute disk usage synchronously on each query. All timer-related code is deleted (not commented out).

### ob_disk_usage_reporter.h

- Remove ObTimerTask inheritance (class no longer acts as a timer)
- Remove runTimerTask, report_tenant_disk_usage, refresh_tenant_disk_usage, count_tenant, count_tenant_slog, count_tenant_clog, count_server_slog, count_server_clog, count_server_meta, count_tenant_tmp declarations
- Remove ObReportResultGetter helper class
- Remove const from get_data_disk_used_size

### ob_disk_usage_reporter.cpp

- Delete runTimerTask, report_tenant_disk_usage, count_tenant, count_tenant_slog, count_tenant_clog, count_server_slog, count_server_clog, count_server_meta, count_tenant_tmp implementations
- Delete commented-out set_tenant_data_usage
- get_data_disk_used_size now:
  1. MTL_SWITCH to the target tenant
  2. Calls count_tenant_data(tenant_id) to populate TENANT_DATA + TENANT_META_DATA
  3. Fetches tmp file usage via ObTenantTmpFileManager
  4. Reads the three needed types from result_map_ (same as before)

### ob_server.cpp

- Delete TG_SCHEDULE(lib::TGDefIDs::DiskUseReport, ...)
- Delete TG_START(lib::TGDefIDs::DiskUseReport)
- Delete TG_STOP(lib::TGDefIDs::DiskUseReport)
- Delete TG_WAIT(lib::TGDefIDs::DiskUseReport)
- Delete TG_DESTROY(lib::TGDefIDs::DiskUseReport)

## Verification

- Built and deployed to test instance (port 2915)
- SELECT * FROM __all_virtual_unit returned data_disk_in_use = 20480000
- Observer log confirmed: no DiskUseReport messages
- No errors or warnings related to ObDiskUsageReportTask in logs

## Scope & Limitations

- Targeted at seekdb (single-tenant, no tenant drop).
- If __all_virtual_unit is queried heavily, the on-demand computation cost will be incurred per query. This is acceptable because (a) the virtual table is typically low-QPS monitoring, and (b) single-tenant tablet count is bounded.
- The result_map_ is still allocated at init and serves as the temp buffer during on-demand computation.

## Related flame graph

http://obperf.oceanbase-dev.com/files/profile_20260520115549__root.flame.svg
The periodic ObCTASCleanUpTask::runTimerTask spends 100% of its CPU time in schema fetching: get_table_schema triggers a full SQL-based schema fetch (fetch_table_schema via ObSchemaServiceSQLImpl) for every table in the tenant. Flame graph analysis shows this accounts for ~0.12% of total CPU samples (106/85128), all consumed by the schema fetch path.

Since seekdb is single-node, two checks are unnecessary and were removed:

- create_host_str comparison: always matches on a single node
- TEMP_TAB_PROXY_RULE branch: OBProxy never exists in single-node mode, so is_obproxy_create_tmp_tab always returns false

After removing these, all remaining schema fields used by the cleanup logic (session_id, table_type, schema_version, database_id, table_name, name_case_mode, is_tmp_table, is_in_recyclebin) are available in ObSimpleTableSchemaV2. The per-table get_table_schema call is replaced with get_simple_table_schema, which reads directly from the ObSchemaMgr in-memory hash table without any SQL query.

This eliminates the per-table SQL schema fetch overhead entirely, reducing the timer task's CPU consumption from ~106 samples to near zero.
Seekdb runs only a single sys tenant, so the Linux tc-style multi-tenant WRR scheduling tree and token-bucket rate limiter are entirely dead weight (flamegraph showed 1.31% CPU in QDiscRoot::do_thread_work).

## What changed

### Core: IO path bypasses TC queues

- `ObTenantIOSchedulerV2::schedule_request` now submits IO directly to the device channel: prepare → get_device_channel → device_channel->submit
- Previously: TC queue inc_ref → QSchedCallback::handle → dec_ref, with WRR scheduling and token-bucket checks in between

### Deleted: entire TC library (deps/oblib/src/lib/tc/)

- 40 files, ~2900 lines: ob_tc.cpp/h, ob_tc_interface.cpp, ob_tc_limit.cpp, ob_tc_stat.cpp, ob_tc_wrapper.cpp, deps/*, test/*, Makefile, README
- All qdisc_*/tclimit_*/TCRequest references removed from the codebase

### Removed: ObIOManagerV2 and its lifecycle

- ObIOManager::init/start/stop/destroy no longer call OB_IO_MANAGER_V2
- QSchedCallback (ITCHandler subclass) deleted

### Cleaned: ObTrafficControl and ObSharedDeviceControlV2

- Removed register_bucket, add_shared_device_limits, limit_ids_
- Removed qdisc_add_limit/qdisc_set_limit calls from add_group/inner_calc_
- Simplified add_group signature (no longer needs qid/limit_ids params)
- Removed TCRequest qsched_req_ member from ObIORequest
- Removed #include "lib/tc/ob_tc.h" from ob_io_define.h

### Preserved: tenant IO group tracking (independent of TC)

- group_id_index_map_ / get_group_index / io_usage_ are NOT removed
- They provide per-group IO statistics and have no code dependency on TC

## Verification

- Compile: make -j80 in build_release, no link errors
- Deploy: seekdb starts and accepts connections
- Sysbench: read/write workload runs without errors
- 47 files changed, +47 / -3681 lines
… mode

Eliminate thread_local ObTenantBase and all tenant-switching overhead by migrating to a single global ObTenantBase pointer (g_tenant_ptr). MTL reads directly from the real ObTenant heap object: no copies, no dual objects.

- ob_tenant_base.h: Replace thread_local with inline globals g_tenant_ptr and g_tenant_ctx. Both get_tenant and get_tenant_local return g_tenant_ptr, a single unified pointer.
- ob_tenant_base.cpp: set_tenant and all ObTenantSwitchGuard methods gutted to no-ops. switch_to(uint64_t, bool) retains a zero-cost readiness check: g_tenant_ptr != &g_tenant_ctx, returning OB_TENANT_NOT_IN_SERVER until the real tenant is available. This allows MTL_SWITCH to safely skip its body during early init.
- ob_define.h: ob_get_tenant_id returns constant OB_SYS_TENANT_ID.
- ob_tenant.cpp: g_tenant_ptr = this set after create_mtl_module and before init_mtl_module. All operator= copies removed; MTL reads the real ObTenant directly via the global pointer.
- ob_table_service_client.cpp: init_tenant_env gutted to no-op.
- ob_dynamic_thread_pool.h: ObResetThreadTenantIdGuard gutted to no-op.
- ob_all_virtual_thread.cpp: &ob_get_tenant_id → OB_SYS_TENANT_ID.

g_tenant_ptr starts as &g_tenant_ctx (a dummy, no MTL services). After create_mtl_module registers all MTL services on the real ObTenant, g_tenant_ptr = this. MTL_SWITCH's switch_to checks g_tenant_ptr != &g_tenant_ctx (a single pointer compare) to gate MTL access until the tenant is ready. Background threads that arrive early skip safely.

Verification:

- Deploy: seekdb instance starts and accepts connections
- Sysbench: oltp_point_select 4-thread 30s (702K txns, 0 errors); oltp_read_write 4-thread 30s (14K txns, 2 deadlock retries)
- obtest: t/stanby/basic.test passes (3/3: switchover, switchback, failover)
…k_in_set
Problem
Obperf flame graph shows find_task_in_set consumes 1680 samples (1.41%)
but the binary search comparator (CompareForSet) only compares timer_ and
task_ pointers. The 1150-sample (0.97%) TaskToken temporary object
construction — typeid(*task).name + strncpy / strnlen — is entirely
wasted work. This overhead is doubled because has_running_task calls
find_task_in_set twice (running_task_set_ + uncanceled_task_set_) for
every pop_task event.
Fix
1. Replace compare_for_set free function with transparent functor
CompareForSet (is_transparent = void, C++14 heterogeneous lookup).
Add (TaskToken*, pair) and (pair, TaskToken*) overloads so
std::lower_bound can compare directly against {timer, task} pointers.
2. In find_task_in_set, pass std::pair<const ObTimer*, const ObTimerTask*>
as the search key instead of constructing a temporary TaskToken.
3. All existing call sites (insert_unique, find) remain compatible via
the (TaskToken*, TaskToken*) overload.
Expected benefit: ~1150 samples (0.97% total CPU) eliminated from
find_task_in_set, reducing it from 1.41% to ~0.45%.
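The heterogeneous comparison described in the fix can be sketched as below. The types here are simplified, hypothetical stand-ins (`Timer`, `Task`, a `TaskToken` with only the compared fields plus a name buffer), and `find_task` is an illustrative helper, not the actual find_task_in_set.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for ObTimer, ObTimerTask, and TaskToken.
struct Timer {};
struct Task {};
struct TaskToken {
  const Timer *timer;
  const Task *task;
  char name[64];  // stands in for the type-name field that is costly to fill
};

using Key = std::pair<const Timer *, const Task *>;

// Orders by (timer, task) pointers and accepts either tokens or plain keys.
struct CompareForSet {
  using is_transparent = void;

  static bool less(const Timer *t1, const Task *k1,
                   const Timer *t2, const Task *k2) {
    // std::less gives a total order even for unrelated pointers.
    std::less<const Timer *> timer_less;
    if (timer_less(t1, t2)) return true;
    if (timer_less(t2, t1)) return false;
    return std::less<const Task *>()(k1, k2);
  }
  bool operator()(const TaskToken *a, const TaskToken *b) const {
    return less(a->timer, a->task, b->timer, b->task);
  }
  bool operator()(const TaskToken *a, const Key &b) const {
    return less(a->timer, a->task, b.first, b.second);
  }
  bool operator()(const Key &a, const TaskToken *b) const {
    return less(a.first, a.second, b->timer, b->task);
  }
};

// Looks up a token in a sorted vector without building a temporary TaskToken.
const TaskToken *find_task(const std::vector<const TaskToken *> &sorted,
                           const Timer *timer, const Task *task) {
  const Key key(timer, task);
  CompareForSet cmp;
  auto it = std::lower_bound(sorted.begin(), sorted.end(), key, cmp);
  if (it != sorted.end() && !cmp(key, *it)) {
    return *it;
  }
  return nullptr;
}
```

`std::lower_bound` only ever calls the comparator as (element, value), so the (TaskToken*, pair) overload handles the search, while the (pair, TaskToken*) overload is used for the equality check afterwards.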
## Background

Flame graph analysis (profile_20260520225116, obperf) on ObSimpleThreadPoolDynamicMgr::run1 shows the manager thread wakes up every 200ms to iterate all registered simple thread pools and call reap_workers. The overhead breakdown:

- run1 total: 0.04% CPU
  - Self (SpinRLock + loop + usleep): 59%
  - reap_workers callees: 41%
    - ObLatchMutex::lock (workers_lock_): 21%
    - Threads::wait: 21%
    - ObTimeUtility::current_time: 6%
    - usleep: 4%
    - Threads::destroy: 2%

## Change

- ObSimpleThreadPoolDynamicMgr::CHECK_INTERVAL_US: 200ms -> 3s
- File: deps/oblib/src/lib/thread/ob_dynamic_thread_pool.h:129

## Rationale

reap_workers only needs to clean up stopped workers. Worker lifecycle (thread exit) is on the order of seconds, not milliseconds. A 3s interval is sufficient to reclaim stopped workers without meaningful delay, while reducing the wake-up frequency by 15x (200ms -> 3s), cutting CPU overhead from ~0.04% to ~0.003%.

## Trade-off

Stopped worker threads may linger up to 3s before being reaped. This is acceptable because worker stop events are rare (only during pool shrink) and the memory held by a stopped worker (thread stack + small control struct) is negligible.
…cture

## Motivation

sys_hook_impl/SYS_HOOK was originally designed to wrap blocking syscalls with WaitGuard, which set Thread::blocking_ts_ and Thread::wait_event_ so that __all_virtual_thread could display STATUS/WAIT_EVENT/LOOP_TS. This supported dynamic th_worker scaling by tracking when workers were blocked.

The th_worker scaling mechanism has since been completely refactored. The hook layer, WaitGuard/JoinGuard/RpcGuard classes, and the 6 TLS variables they maintained became dead code with no remaining consumers.

## What was removed

### Thread class (thread.h/thread.cpp)

- 4 Guard classes: BaseWaitGuard, WaitGuard, JoinGuard, RpcGuard
- 5 WAIT_* constants: WAIT, WAIT_IN_TENANT_QUEUE, WAIT_FOR_IO_EVENT, WAIT_FOR_LOCAL_RETRY, WAIT_FOR_PX_MSG
- 6 TLS variables: wait_event_, blocking_ts_, loop_ts_, rpc_dest_addr_, pcode_, thread_joined_
- update_loop_ts simplified to just clear_lock

### ob_tenant_hook.cpp

- Removed: sys_hook_impl, SYS_HOOK macro, in_sys_hook, ob_pthread_cond_wait, ob_pthread_cond_timedwait, Linux futex_hook
- Linux ob_epoll_wait: simplified to BKGDSessInActiveGuard + direct epoll_wait
- ob_pthread_cond_timedwait_us: kept, Linux path now calls pthread_cond_timedwait directly

### Hook function callers

- ob_pthread_cond_wait -> pthread_cond_wait (3 call sites)
- ob_pthread_cond_timedwait -> pthread_cond_timedwait (2 call sites)
- Removed extern declarations from ob_define.h

### futex layer

- ob_futex.h: Linux now uses inline syscall(SYS_futex, ...) directly instead of futex_hook routing. Win/macOS keep futex_hook forwarding for platform emulation.
- ob_futex.cpp: removed Linux weak symbol (redundant, strong symbol removed from ob_tenant_hook.cpp)
- win32_unwind_stubs.c: removed ob_epoll_wait and futex_hook stubs

### ~26 WaitGuard call sites

Deleted WaitGuard guard(...) constructions in 14 files without replacement: ob_lock_memtable.cpp (5), ob_table_lock_service.cpp (7), ob_signal_handle.cpp, ob_tenant.cpp, ob_storage_rpc.cpp, ob_local_device.cpp, obmp_query.cpp, obmp_stmt_execute.cpp, obmp_stmt_prexecute.cpp, ob_data_access_service.cpp, ob_lob_handler.cpp, ob_io_struct.cpp, ob_dtl_basic_channel.cpp, ob_px_sqc_async_proxy.cpp, ob_rpc_proxy.ipp

### Virtual table columns

- __all_virtual_thread: removed STATUS, WAIT_EVENT, LOOP_TS columns. Remaining: tid, tname, latch_wait, latch_hold, trace_id, cgroup_path, numa_node (7 cols)
- Regenerated ob_inner_table_schema.*.cpp via generate_inner_table_schema.py
- Removed STATUS from GV$OB_THREAD view SELECT

### ObThWorker

- Removed blocking_ts accessor and blocking_ts_ member

### lua diagnostic API (ob_lua_api.cpp)

- Removed loop_ts, blocking_ts, join_addr, sleep_us, rpc_dest_addr, pcode, event reads
- Simplified STATUS to "Run"/"Sleep" based on sleep_us_ only
- Simplified WAIT_EVENT to latch wait info only

## What was preserved

- Windows/macOS ob_epoll_wait (pure platform emulation, no hook)
- Windows futex_hook (WaitOnAddress/WakeByAddressSingle emulation, WaitGuard line removed)
- macOS futex_hook (__ulock_wait/__ulock_wake emulation, WaitGuard line removed)
- ob_pthread_cond_timedwait_us (cross-platform, with platform emulation paths)

## Verification

- Build: clean compile of seekdb
- Virtual table: SELECT * FROM __all_virtual_thread returns 7 columns correctly
- Sysbench: oltp_point_select 30s run, ~34K TPS, 0 errors, p99 latency 0.99ms
…ond+pending bitmask

ObMemoryDump used an 8-slot ObLightyQueue + pre-allocated task pool + mutex bitmask (~100 lines) to serialize multi-producer requests. Signal 62 runs in a dedicated SignalHandle thread via sigtimedwait, not a raw signal handler, so all producers are ordinary thread contexts that can safely use mutex+cond. The queue/pool machinery is unnecessary.

Replace queue+pool with a standard mutex+cond pattern:

Producer (RPC/Signal/stop)
- lock → set pending bit → signal → unlock

Consumer (MemoryDump thread)
- lock → if pending_==0: wait(10s)
- snapshot pending_, copy task, clear
- unlock
- DUMP(b1) → handle DUMP task
- STAT_LABEL(b0) → handle STAT_LABEL

Key decisions
- pending_ is a plain int (bitmask: STAT_LABEL=1, DUMP=2), protected by mutex
- DUMP has priority over STAT_LABEL to prevent debug-DUMP from being starved
- cond_.wait(10s) serves dual purpose: event-driven wakeup AND timer heartbeat for periodic STAT_LABEL (every ~10s)
- stop signals cond to avoid shutdown delay with large timeout
- Signal 62 producer uses a stack-allocated task instead of pool allocation

- deps/oblib/src/lib/alloc/memory_dump.h
  - Removed: ObLightyQueue, task_mutex_, tasks_[8], avaliable_task_set_, TASK_NUM, push, alloc_task, free_task
  - Added: ObThreadCond cond_, int pending_, ObMemoryDumpTask pending_dump_task_, PENDING_STAT_LABEL=1, PENDING_DUMP=2, request_dump
- deps/oblib/src/lib/alloc/memory_dump.cpp
  - init: cond_.init replaces queue_.init
  - stop: cond_.signal before TG_STOP for immediate wakeup
  - destroy: cond_.destroy replaces queue_.destroy
  - Deleted: push, alloc_task, free_task
  - Added: request_dump: copy task under mutex, signal cond
  - Simplified: generate_mod_stat_task: lock→set bit→signal→unlock
  - Rewritten: run1: ObThreadCondGuard RAII, snapshot+clear pending_ under lock, DUMP-first priority, timer-based STAT_LABEL heartbeat
  - Fixed: handle no longer calls free_task (task is stack variable)
- src/observer/ob_dump_task_generator.cpp
  - Stack-allocated task replaces alloc_task/free_task/push
  - ObMemoryDumpTask task; → fill fields → mem_dump.request_dump(task)

| Path      | Method                           | Result |
|-----------|----------------------------------|--------|
| RPC       | ALTER SYSTEM REFRESH MEMORY STAT | Pass   |
| Signal 62 | kill -62 + etc/dump.config       | Pass   |
| Timer     | 10s heartbeat via cond_.wait     | Pass   |
| Stop      | kill <pid> (0.1s shutdown)       | Pass   |

3 files, +87/-108 lines
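The mutex + cond + pending-bitmask pattern described in this commit can be sketched roughly as below. This is an illustrative sketch only: `PendingMailbox`, `DumpTask`, `request_stat_label` and `wait_and_take` are hypothetical names, and `std::mutex` / `std::condition_variable` stand in for the `ObThreadCond` the real code uses.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Bit values taken from the commit message (STAT_LABEL=1, DUMP=2).
enum PendingBit : int { PENDING_STAT_LABEL = 1, PENDING_DUMP = 2 };

// Hypothetical placeholder for the real dump task payload.
struct DumpTask { int type = 0; };

// Hypothetical stand-in for the producer/consumer handshake.
class PendingMailbox {
 public:
  // Producer: copy the task under the mutex, set the DUMP bit, wake the consumer.
  void request_dump(const DumpTask &task) {
    std::lock_guard<std::mutex> guard(mutex_);
    pending_dump_task_ = task;
    pending_ |= PENDING_DUMP;
    cond_.notify_one();
  }

  // Producer: only a bit is needed for STAT_LABEL requests.
  void request_stat_label() {
    std::lock_guard<std::mutex> guard(mutex_);
    pending_ |= PENDING_STAT_LABEL;
    cond_.notify_one();
  }

  // Consumer: wait until some bit is set or the timeout elapses, then
  // snapshot and clear the bitmask while still holding the lock.
  // A return value of 0 means the wait timed out (the heartbeat case).
  int wait_and_take(std::chrono::milliseconds timeout, DumpTask &task_out) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait_for(lock, timeout, [this] { return pending_ != 0; });
    const int snapshot = pending_;
    if (snapshot & PENDING_DUMP) {
      task_out = pending_dump_task_;
    }
    pending_ = 0;
    return snapshot;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  int pending_ = 0;
  DumpTask pending_dump_task_;
};
```

A consumer loop built on this would call `wait_and_take` with a 10s timeout, handle the DUMP bit before the STAT_LABEL bit, and treat a zero return as the periodic heartbeat.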
…and hash map

## Context

ObPxTargetMgr runs a 500ms periodic timer that coordinates PX worker usage across a distributed cluster: followers RPC-sync local usage to the leader, the leader aggregates into a per-server hash map. In single-server single-tenant SeekDB, all of this is wasted work.

Flame graph: TimerTask::runTimerTask = 959 samples (0.81% CPU), all in DCHash traversal and map rebuild.

## What was removed (549 lines across 4 files)

### Timer & distributed coordination (306 lines)

- TimerTask inner class, timer_task_ member, run_timer_task
- ObPxResRefreshFunctor (always-true need_refresh_all_ dead code)
- TG_START/TG_SCHEDULE/TG_STOP/TG_WAIT timer lifecycle
- refresh_statistics and its entire call chain: get_dummy_leader / check_dummy_location_credible / get_role / refresh_dummy_location / query_statistics (RPC, 62 lines)
- rpc_proxy_, cluster_id_, dummy_cache_leader_, debug flags
- PX_REFRESH_TARGET_INTERVEL_US / PX_REFRESH_CHECK_ALIVE_INTERVAL_US
- Role init: FOLLOWER -> LEADER

### Data structure (243 lines)

- ServerTargetUsage struct (peer/local/report, 38 lines) and its OB_SERIALIZE_MEMBER: 3-field distributed accounting protocol
- hash::ObHashMap<ObAddr, ServerTargetUsage> global_target_usage_, which always held exactly 1 entry (self); replaced by int64_t px_target_used_
- PX_SERVER_TARGET_BUCKET_NUM macro
- get_global_target_usage -> inline get_px_target_used
- ObPxGlobalResGather::operator (dead code after map removal)

### Simplified methods

- apply_target: sums worker_map, single compare, SpinWLockGuard
- release_target: single subtraction, no per-server map updates
- get_all_target_info: hardcoded single self-entry
- reset_*_statistics: px_target_used_ = 0
- update_peer_target_used: no-op stub
- gather_global_target_usage: direct push, no foreach
- init/reset: removed hash map create/clear

## What was preserved

- apply_target / release_target: PX admission control gates concurrent queries against SET GLOBAL parallel_servers_target
- reset_leader_statistics / update_peer_target_used: RPC handler stubs kept for compilation; multi-server restore points
- All ObPxTargetMgr public API unchanged

## Verification (SeekDB port 12881, debug build)

- Instance boots, PX queries execute normally
- __all_virtual_px_target_monitor shows is_leader=1, correct target
- SET GLOBAL parallel_servers_target = N syncs to monitor correctly
- Zero PX timer activity in logs, zero PX errors

## How to restore multi-server PX coordination

1. Re-add TimerTask + TG_SCHEDULE in ob_px_target_mgr
2. Re-add refresh_statistics + role-detection call chain
3. Re-add rpc_proxy_ / cluster_id_ / dummy_cache_leader_
4. Re-add ServerTargetUsage + global_target_usage_ hash map
5. Change init role back to FOLLOWER

RPC handlers and ObPxTargetMgr API are ready.

(cherry picked from commit ffc4f71b2386)
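As a rough illustration of what the simplified single-counter admission path might look like, the sketch below replaces the per-server map with one `int64_t`. `PxTargetGate` is a hypothetical class, `std::mutex` stands in for the spin lock, and the admission rule (admit only while `used + requested <= target`) is an assumption made for the example, not the policy confirmed by the source.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical single-server admission gate: one counter instead of a
// per-server hash map of ServerTargetUsage entries.
class PxTargetGate {
 public:
  explicit PxTargetGate(int64_t target) : target_(target) {}

  // Single compare under the lock. The "<= target" rule is an assumption.
  bool apply_target(int64_t requested) {
    std::lock_guard<std::mutex> guard(lock_);
    if (px_target_used_ + requested > target_) {
      return false;
    }
    px_target_used_ += requested;
    return true;
  }

  // Single subtraction, no per-server bookkeeping.
  void release_target(int64_t released) {
    std::lock_guard<std::mutex> guard(lock_);
    px_target_used_ -= released;
  }

  int64_t get_px_target_used() const {
    std::lock_guard<std::mutex> guard(lock_);
    return px_target_used_;
  }

 private:
  mutable std::mutex lock_;
  int64_t target_;
  int64_t px_target_used_ = 0;
};
```

With a target of 10, a request for 6 workers is admitted, a following request for 5 is rejected, and it succeeds once the first 6 are released.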
…ObVector with token-resident design

## Motivation

ObSortedVector's insert_unique and remove operations trigger O(n) memmove which showed as ~1.66% CPU (wcscat/memmove) in flame graphs under load. The sorted invariant is only needed for O(1) front-peek in pop_task; an unsorted vector with full scan eliminates the memmove cost entirely.

## Design: Token-Resident Queue

Instead of removing tokens from priority_task_queue_ on dispatch and re-inserting them on repeat reschedule, tokens stay in the queue for their entire lifetime. Dispatch state is encoded as scheduled_time_ = TOKEN_DISPATCHED (INT64_MAX).

### Key behavioral changes

1. schedule_task (new task)
   - push_back + notify (was insert_unique + conditional notify)
   - Always notifies because there is no sorted order to optimize against
2. schedule_task (repeat task after runTimerTask completion)
   - Updates scheduled_time_ in-place, token stays in queue
   - notify_all unconditional (was insert_unique + conditional notify on head)
3. schedule_task (non-repeat task after completion)
   - Scans queue to find token, swap-with-last + pop_back (O(1) removal)
   - Then delete_token
4. pop_task (scan loop in run1)
   - Full O(n) scan every tick instead of O(1) front-peek
   - Skips TOKEN_DISPATCHED tokens
   - Handles self-conflict: if running_task_token == token, skips silently
   - On dispatch: sets scheduled_time_ = TOKEN_DISPATCHED, token stays in queue
5. cancel_task
   - swap-with-last + pop_back instead of ObSortedVector::remove
   - Skips delete_token for TOKEN_DISPATCHED tokens (they're in running_set_)
6. stop
   - STEP2 skips dispatched tokens to avoid double-delete with STEP3

### Removed

- min_scheduled_time_: eliminated to avoid maintenance burden across 3 mutation paths (cancel, schedule new, schedule repeat)
- compare_for_queue (free function): replaced by the iterating scan
- check_clock and CLOCK_SKEW_DELTA/CLOCK_ERROR_DELTA: per-tick clock sanity check removed to reduce scheduling path overhead
- ObDIActionGuard / ObBKGDSessInActiveGuard: removed from scheduling hot path
- Temporary TaskToken construction in find_task_in_set: CompareForSet now supports std::pair<ObTimer*, ObTimerTask*> as transparent lookup key (C++14 is_transparent)

### Trade-off

At idle, pop_task's O(n) full scan is slightly more expensive than the original O(1) front-peek (~0.25% vs ~0.09% self samples). The benefit manifests under load: zero memmove on insert/remove, zero temporary allocations for set lookups, and no sorted-order maintenance.
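A minimal sketch of the token-resident idea, under stated assumptions: `Token`, `TokenQueue`, `pop_due`, `reschedule` and `remove` are hypothetical names, `std::vector` stands in for ObVector, and picking the earliest due token during the scan is an assumption about the selection rule rather than something the commit message spells out.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Sentinel from the commit message: a dispatched token keeps its queue slot.
constexpr int64_t TOKEN_DISPATCHED = std::numeric_limits<int64_t>::max();

// Hypothetical, simplified token.
struct Token {
  int id;
  int64_t scheduled_time;
};

class TokenQueue {
 public:
  // New task: plain push_back, no sorted insert and therefore no memmove.
  void schedule(Token *t) { queue_.push_back(t); }

  // Full O(n) scan: skip dispatched tokens, pick the earliest due one
  // (assumed rule), and mark it dispatched in place.
  Token *pop_due(int64_t now) {
    Token *best = nullptr;
    for (Token *t : queue_) {
      if (t->scheduled_time == TOKEN_DISPATCHED) continue;
      if (t->scheduled_time <= now &&
          (best == nullptr || t->scheduled_time < best->scheduled_time)) {
        best = t;
      }
    }
    if (best != nullptr) best->scheduled_time = TOKEN_DISPATCHED;
    return best;
  }

  // Repeat task: update the time in place, the token never left the queue.
  void reschedule(Token *t, int64_t next_time) { t->scheduled_time = next_time; }

  // Cancel / non-repeat completion: linear find, then swap-with-last and
  // pop_back so the removal itself shifts no elements.
  bool remove(Token *t) {
    for (std::size_t i = 0; i < queue_.size(); ++i) {
      if (queue_[i] == t) {
        std::swap(queue_[i], queue_.back());
        queue_.pop_back();
        return true;
      }
    }
    return false;
  }

  std::size_t size() const { return queue_.size(); }

 private:
  std::vector<Token *> queue_;
};
```

The queue size stays constant across dispatch and reschedule, which is the point of the design: only creation and final removal touch the vector's length.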
Problem

KVCache wash timer task (KVStoreWashTask) runs every ~800ms and consumes ~20% CPU when the system is idle. The hotspot is `clean_garbage_node`, which traverses 200K hash bucket slots per call.

Root cause

`ObKVCacheStore::wash` unconditionally returns `true`, so `ObKVGlobalCache::wash` always calls `map_.clean_garbage_node` even when no memory was reclaimed.

Change

`ObKVCacheStore::wash` returns `reclaimed_size > 0` instead of `true`. When wash reclaims nothing (idle), `clean_garbage_node` is skipped entirely. When cache pressure exists, wash reclaims memory and `clean_garbage_node` runs normally to clean garbage hash nodes.

Verification (A/B test on 8C 61GB, remote deploy + local sysbench)

- obperf: clean_garbage_node 20.0% -> 0.0%, wash overhead 20.6% -> 0.9%
- sysbench point_select: QPS unchanged (113K vs 114K, within noise)
- Cache pressure: wash reclaims 2-35MB/cycle, memory comparable
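The control-flow change can be pictured with the toy model below. `Store`, `BucketMap` and `GlobalCache` are hypothetical stand-ins for `ObKVCacheStore`, `ObKVCacheMap` and `ObKVGlobalCache`; only the `reclaimed_size > 0` return value and the conditional call come from the commit description.

```cpp
#include <cstdint>

// Hypothetical store: "reclaimable" models memory that a wash can free.
struct Store {
  int64_t reclaimable = 0;

  // Report whether anything was actually reclaimed instead of always true.
  bool wash() {
    const int64_t reclaimed_size = reclaimable;
    reclaimable = 0;
    return reclaimed_size > 0;
  }
};

// Hypothetical map: counts how often the expensive bucket scan runs.
struct BucketMap {
  int clean_calls = 0;
  void clean_garbage_node() { ++clean_calls; }
};

struct GlobalCache {
  Store store;
  BucketMap map;

  // The bucket scan only runs when the wash reclaimed memory.
  void wash() {
    if (store.wash()) {
      map.clean_garbage_node();
    }
  }
};
```

An idle wash cycle leaves `clean_calls` untouched, while a cycle that reclaims memory triggers exactly one scan.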
…n cond_timedwait

ObCSFetcher previously consumed ~4% idle CPU by usleep(10ms)-polling schema version in IDLE mode (no async vector index tables). Replace this with a cond_timedwait(10s) that wakes immediately on schema publish.

Notification chain:

publish_schema
→ ObTenantSwitchGuard(tenant_id)
→ MTL_CTX
→ ObTenantBase::on_schema_publish [virtual]
→ ObTenant::on_schema_publish [override]
→ ObChangeStreamMgr::get_fetcher.notify_schema_changed
→ idle_cond_.signal

Design decision: use a virtual function on ObTenantBase (overridden by ObTenant) instead of callback arrays. This keeps publish_schema unaware of business modules, ensures zero runtime overhead, and makes the call chain trivially traceable. Additional modules that need schema publish notification simply add their code to ObTenant::on_schema_publish.

Double-checked locking in IDLE branch prevents TOCTOU races: re-read schema version under the cond guard before entering cond_timedwait.

stop signals idle_cond_ for fast shutdown.
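The wait-with-recheck pattern can be sketched with the standard library as follows. This is a simplification under assumptions: `SchemaIdleWaiter` and `wait_for_change` are hypothetical names, the published version is stored inside the waiter rather than read from a schema service, and `std::condition_variable` replaces the cond type used in the real code.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical idle waiter for a fetcher thread.
class SchemaIdleWaiter {
 public:
  // Publisher side: record the new version, then wake the idle thread.
  void notify_schema_changed(int64_t new_version) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      published_version_ = new_version;
    }
    cond_.notify_one();
  }

  // Fetcher side: the predicate is evaluated under the lock before the
  // first sleep, so a publish that landed just before this call is not
  // missed (the re-read under the guard). Returns true if a version newer
  // than seen_version is available, false if the timeout elapsed.
  bool wait_for_change(int64_t seen_version, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cond_.wait_for(lock, timeout,
                          [&] { return published_version_ > seen_version; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  int64_t published_version_ = 0;
};
```

Compared with polling every 10ms, the thread wakes only on a publish or when the long timeout expires.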
ObLogger previously used ObBaseLogWriter, which requires per-log-item heap allocation (VSliceAlloc) and a fixed-size pointer array queue (log_items_[1024]). This had two problems:

1. Per-item malloc/free overhead on the hot logging path
2. The pointer array is unnecessary for seekdb lite's single-consumer flush model

Introduce ObRingBufLogWriter, a new 2 MB variable-length ring buffer with an MPSC (multi-producer, single-consumer) design modeled after the Linux BPF ringbuf:

- 2 MB external buffer with 8-byte RingBufEntry headers
- Each entry header packs type (2b), total_len (30b), busy (1b) and reserved (31b) into a single 64-bit word for atomic read
- Producers: alloc reserves space under a TTAS spinlock (RingSpinLock), writes the header with busy=1, advances push_
- After writing log data, producer calls commit to set type=COMMIT and clear busy=0 (or rollback for TYPE_ROLLBACK)
- Padding entries handle wrap-around transparently
- Consumer: single-threaded do_flush scans from pop_ to push_, stops at busy entries, processes committed entries in batches of up to 64, advances pop_

- Producer: WEAK_BARRIER (DMB on ARM, compiler barrier on x86 TSO) before clearing busy_, ensuring all data and type writes are visible
- Consumer: ATOMIC_LOAD_ACQ on the 64-bit header ensures correct visibility on weakly-ordered architectures

test-and-test-and-set spinlock: waiters do a plain read (shared cache line) before attempting CAS. This avoids cache-line bouncing from repeated lock cmpxchg under contention.

- Async flush thread with SimpleCond-based idle wait
- do_flush drains all available entries in one call, returns whether work was done to drive the wait logic
- Subclass implements process_batch(char **entries, int64_t *lens, int64_t count)

- ObLogger migrates from ObBaseLogWriter to ObRingBufLogWriter
- ObPLogItem: commit_start_ renamed to ring_offset_ for clarity; ring_offset_ stored for direct commit/rollback on the ring buffer
- BASE_LOG_SIZE reduced from 4 KB to 1 KB to increase ring buffer entry capacity
- V$SYSSTAT counters: RINGBUF_ALLOC_WAIT_COUNT, RINGBUF_ALLOC_DROP_COUNT, RINGBUF_ALLOC_SUCCESS_COUNT, RINGBUF_QUEUE_DEPTH

- Fixed to_string buffer overflows in LogSimpleMemberList, LogAckList, ObMemtableCtx::databuff_print_key_obj that passed full buf_len instead of remaining length (buf_len - pos)
- Removed high-frequency TRANS_LOG(INFO, "check replica readable fail") that flooded the ring buffer on single-node point_select benchmarks

Point-select benchmark (200 threads, 16 instances, same machine)

- Before: 194K TPS (80% CPU in ObRingBuf::alloc due to log flood)
- After: 2.5M TPS (ringbuf alloc not visible in perf top-N)

The 80% CPU was an artifact of per-query INFO-level TRANS_LOG spam saturating the ring buffer, not a ringbuf design limitation. With normal log volume the MPSC ring buffer has no measurable overhead.
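To make the header layout and the commit handshake concrete, here is a small sketch. The field widths (2/30/1/31 bits) come from the description above; the bit positions, the numeric values of the type constants, and the names `Header`, `pack`, `unpack`, `commit` and `try_consume` are assumptions for illustration. A `std::atomic` release store and acquire load are swapped in for the WEAK_BARRIER / ATOMIC_LOAD_ACQ primitives named in the text.

```cpp
#include <atomic>
#include <cstdint>

// Assumed type encodings; the source only names COMMIT, ROLLBACK and padding.
constexpr uint32_t TYPE_PENDING = 0;
constexpr uint32_t TYPE_COMMIT = 1;
constexpr uint32_t TYPE_ROLLBACK = 2;
constexpr uint32_t TYPE_PADDING = 3;

// Unpacked view of one 64-bit entry header (reserved bits omitted).
struct Header {
  uint32_t type;       // 2 bits
  uint32_t total_len;  // 30 bits
  bool busy;           // 1 bit
};

// Assumed layout: type in bits 0-1, total_len in bits 2-31, busy in bit 32,
// bits 33-63 reserved (left as zero).
inline uint64_t pack(const Header &h) {
  return (static_cast<uint64_t>(h.type) & 0x3u) |
         ((static_cast<uint64_t>(h.total_len) & 0x3FFFFFFFu) << 2) |
         (static_cast<uint64_t>(h.busy ? 1 : 0) << 32);
}

inline Header unpack(uint64_t word) {
  Header h;
  h.type = static_cast<uint32_t>(word & 0x3u);
  h.total_len = static_cast<uint32_t>((word >> 2) & 0x3FFFFFFFu);
  h.busy = ((word >> 32) & 0x1u) != 0;
  return h;
}

// Producer: after the payload is written, publish the final type and clear
// busy with a release store so earlier payload writes become visible first.
inline void commit(std::atomic<uint64_t> &header, uint32_t type) {
  Header h = unpack(header.load(std::memory_order_relaxed));
  h.type = type;
  h.busy = false;
  header.store(pack(h), std::memory_order_release);
}

// Consumer: acquire-load the header; a busy entry means "stop scanning here".
inline bool try_consume(const std::atomic<uint64_t> &header, Header &out) {
  out = unpack(header.load(std::memory_order_acquire));
  return !out.busy;
}
```

Because the whole header fits in one 64-bit word, the consumer sees either the busy reservation or the fully committed entry, never a mix of the two.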
4aa3f44 to
730177c
The corresponding Dima issue is the CPU overhead optimization.
hnwyllmm
approved these changes
Jun 12, 2026
Task Description
This MR implements a series of optimizations to reduce CPU overhead across various components of the system.
Solution Description
Timer/Task Scheduling
- …ObSortedVector with ObVector + token-resident design.
- …TaskToken construction overhead in find_task_in_set.
- …databuff_printf string concatenation with MEMCPY in set_ext_tname.
- …cond_timedwait timeout upper limit from 1s to 30s (wakes immediately on event, waits full duration only if no event).
- …cond_timedwait, wakes on event).

Tenant/OMT
- …update_token_usage/update_queue_size, relaxed TIME_SLICE_PERIOD from 10ms to 1s.
- …dynamic_cast with static_cast to eliminate RTTI overhead.

Memory/Cache
- …ObLightyQueue + task pool to ObThreadCond + pending bitmask.

Hash Tables
- …PalfHandleImplMap, simplified to raw pointer.
- …ObLinkHashMap::for_each traversal.
- …check_timeout hash scan when there are no waiting nodes.

DDL/CTAS

IO/Scheduling

PxTargetMgr

CS/Fetcher
- …cond_timedwait.

SSTable/GC

Log/Threads
- …ObBaseLogWriter for asynchronous logging.
- …ObSleepEventGuard to reduce CPU overhead.

Other
- …OB_ERR_EMPTY_QUERY.

Internal Table View Change Record
https://yuque.antfin.com/ob/product_functionality_review/tc9plgnfg76aunqp?singleDoc# "[Request] [View/Virtual Table] Remove __all_virtual_thread and related columns"
Performance Comparison
| Version | Build Time | Commit Hash | Metric 1 | Metric 2 | Metric 3 | Metric 4 | Metric 5 | Metric 6 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ori | Jun 4 2026 20:21:14 | 1-7eb7b166fb5e53e3d6401caf49b7eff90fb22c4f | 85207.06(-24%) | 3399.83(-27%) | 1992.69(-30%) | 8115.73(-32%) | 29224.61(-32%) | 29110.33(-27%) |
| new | Jun 4 2026 01:55:47 | 1-d3c5a836069bbd60e546f2170e778f933dac0d75 | 87186.8(-23%) | 3484.13(-25%) | 2052.48(-28%) | 8445.25(-29%) | 30284.35(-30%) | 30389.93(-24%) |
Optimization Effect
RK3568 aarch64: CPU usage reduced from 30% to 10%.
Passed Regressions
Upgrade Compatibility
Other Information
Release Note
A series of CPU optimizations have been implemented across timer scheduling, tenant management, memory/cache handling, hash tables, DDL operations, IO scheduling, parallel execution targets, change stream fetching, SSTable garbage collection, logging, and other subsystems to significantly reduce system CPU overhead.