GFPGAN Speed UP PR by wangzijian1010 · Pull Request #473 · xlite-dev/lite.ai.toolkit

wangzijian1010 · 2026-05-30T14:55:04Z

This pull request introduces a comprehensive GPU benchmarking framework and significant optimizations for the face restoration pipeline (GFPGAN) in the lite.ai.toolkit project. It adds a unified, backend-agnostic benchmarking utility, a dedicated face restoration benchmark executable, and refactors the core restoration code to enable detailed, per-stage timing and substantial speedups via CUDA kernel fusion and buffer reuse.

Benchmarking and Profiling Infrastructure:

- Added a header-only, backend-agnostic benchmarking utility (profiler.h) that supports CPU and GPU timing, aggregates statistics (mean, percentiles, min/max), and can export results as aligned tables or CSV files. It provides convenient macros for instrumenting code and is designed for minimal overhead when disabled.
- Introduced a new benchmark executable (lite_face_restoration_bench) for end-to-end profiling of the GFPGAN face restoration stage. This tool checks CPU vs GPU correctness for the paste-back operation, runs warmup and timed iterations, and reports per-stage and total timings, including CSV export.
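To make the "collect samples, then aggregate" idea concrete, here is a minimal C++ sketch of a scoped CPU timer feeding a per-stage statistics table. Every name in it (`SketchProfiler`, `ScopedCpuTimer`, `StageStats`, `percentile`) is hypothetical and chosen for illustration; it is not the API of `lite/bench/profiler.h`, and it covers only the `std::chrono` side. GPU stages would need CUDA-event timing, which is omitted here.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <map>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is NOT the API of lite/bench/profiler.h.
struct StageStats {
    std::size_t count = 0;
    double mean_ms = 0.0, p50_ms = 0.0, p90_ms = 0.0, p99_ms = 0.0;
    double min_ms = 0.0, max_ms = 0.0;
};

// Percentile by rounded index into already-sorted samples (q in [0, 100]).
inline double percentile(const std::vector<double>& sorted, double q) {
    if (sorted.empty()) return 0.0;
    const double pos = q / 100.0 * static_cast<double>(sorted.size() - 1);
    return sorted[static_cast<std::size_t>(pos + 0.5)];
}

class SketchProfiler {
public:
    // Record one latency sample (milliseconds) for a named stage.
    void add(const std::string& stage, double ms) { samples_[stage].push_back(ms); }

    // Aggregate all samples recorded so far for one stage.
    StageStats stats(const std::string& stage) const {
        StageStats s;
        const auto it = samples_.find(stage);
        if (it == samples_.end() || it->second.empty()) return s;
        std::vector<double> v = it->second;  // copy so the raw samples stay in order
        std::sort(v.begin(), v.end());
        s.count = v.size();
        s.mean_ms = std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
        s.p50_ms = percentile(v, 50.0);
        s.p90_ms = percentile(v, 90.0);
        s.p99_ms = percentile(v, 99.0);
        s.min_ms = v.front();
        s.max_ms = v.back();
        return s;
    }

private:
    std::map<std::string, std::vector<double>> samples_;
};

// RAII timer: measures wall-clock time from construction to destruction.
class ScopedCpuTimer {
public:
    ScopedCpuTimer(SketchProfiler& profiler, std::string stage)
        : profiler_(profiler), stage_(std::move(stage)),
          start_(std::chrono::steady_clock::now()) {}
    ScopedCpuTimer(const ScopedCpuTimer&) = delete;
    ScopedCpuTimer& operator=(const ScopedCpuTimer&) = delete;
    ~ScopedCpuTimer() {
        const std::chrono::duration<double, std::milli> dt =
            std::chrono::steady_clock::now() - start_;
        profiler_.add(stage_, dt.count());
    }

private:
    SketchProfiler& profiler_;
    std::string stage_;
    std::chrono::steady_clock::time_point start_;
};
```

In this sketch a stage is timed by wrapping it in a block, e.g. `{ ScopedCpuTimer t(profiler, "preprocess"); /* work */ }`, and `profiler.stats("preprocess")` is read after the timed iterations finish.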

Face Restoration Pipeline Optimizations:

Refactored the core TRTFaceFusionFaceRestoration class to support detailed per-stage profiling and major performance improvements:
- Fused CPU pre/post-processing steps (affine warp, color conversion, normalization, tensor layout, paste-back) into CUDA kernels with buffer reuse and asynchronous memory copies.
- Cached the static box mask to avoid redundant computation.
- Exposed a new restore method that returns the restored frame (without disk I/O) and accepts an optional profiler for instrumentation.
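The buffer-reuse point in the list above can be illustrated with a grow-only allocation policy: allocate once, and reallocate only when a request exceeds the current capacity. The sketch below is an assumption about the general pattern, not the PR's code. `ReusableBuffer` is a hypothetical name, and `std::malloc`/`std::free` stand in for the device and pinned allocators so the example runs without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical grow-only buffer. std::malloc/std::free are stand-ins for the
// device/pinned allocators a GPU implementation would use.
class ReusableBuffer {
public:
    ReusableBuffer() = default;
    ReusableBuffer(const ReusableBuffer&) = delete;
    ReusableBuffer& operator=(const ReusableBuffer&) = delete;
    ~ReusableBuffer() { std::free(ptr_); }

    // Make sure at least `bytes` are available.
    // Returns true only when a new allocation was actually performed.
    bool ensure_capacity(std::size_t bytes) {
        if (bytes <= capacity_) return false;  // steady state: no allocator call
        std::free(ptr_);
        ptr_ = std::malloc(bytes);
        capacity_ = (ptr_ != nullptr) ? bytes : 0;
        return true;
    }

    void* data() const { return ptr_; }
    std::size_t capacity() const { return capacity_; }

private:
    void* ptr_ = nullptr;
    std::size_t capacity_ = 0;
};
```

With a fixed frame size, only the first call allocates; every later frame reuses the same memory, which is what removes the per-call allocation cost from the hot path.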

Documentation and Results:

Added a detailed "Benchmark" section to the README.md, summarizing the optimization strategies, measured speedups (up to 4.4× end-to-end), and per-stage breakdowns, with a reproducible methodology and environment.

Summary of Most Important Changes:

Benchmarking Infrastructure:

- Added a unified, header-only benchmarking utility (lite/bench/profiler.h) for CPU and GPU timing, stats aggregation, and CSV export.
- Introduced a dedicated face restoration benchmark executable (lite_face_restoration_bench) and its CMake target.

Face Restoration Pipeline Optimization:

Refactored TRTFaceFusionFaceRestoration to support detailed per-stage profiling, GPU kernel fusion, buffer reuse, and a new restore method for compute-only evaluation.

Documentation:

Added a comprehensive "Benchmark" section to the README.md, including performance results and optimization details.

…mark harness

Phase 1 first cut on the TensorRT face-restoration (GFPGAN) stage. Profiling showed paste_back was ~50% of the pipeline: two full-frame cv::warpAffine on the CPU plus per-call cudaMalloc/free and synchronous copies. Rewrote it as a single inverse-mapping CUDA kernel with reused device buffers and pinned/async copies.

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512: paste_back 39.07ms -> 2.37ms (16.5x); end-to-end 78.2ms -> 30.4ms; 12.8 -> 32.9 FPS. CPU vs GPU output is numerically equivalent (max|diff| = 2/255).

Per-file changes:

- lite/bench/profiler.h: new header-only, backend-agnostic profiler. CPU chrono + CUDA-event timers, mean/p50/p90/p99 + FPS aggregation, CSV export, scoped-timer macros.
- lite/trt/kernel/paste_back.cu, paste_back.cuh: add paste_back_fused_kernel: per-output-pixel inverse mapping (uses the original->crop affine directly, no inversion), bilinear crop/mask sampling with border-0, blend, writes uint8. Old paste_back_kernel kept.
- lite/trt/kernel/paste_back_manager.cpp, paste_back_manager.h: add PasteBackGPU, which owns reusable device + pinned buffers (ensure_capacity) and drives the fused kernel. Old CPU launch_paste_back kept for A/B reference. Comments translated to English.
- lite/trt/cv/trt_face_restoration.cpp, trt_face_restoration.h: factor detect() into restore() (returns the frame, no disk write, optional per-stage Profiler); detect() now calls restore() + imwrite; restore() uses the GPU PasteBackGPU member instead of CPU paste_back.
- examples/lite/cv/test_lite_face_restoration_bench.cpp: new benchmark: CPU-vs-GPU paste_back equivalence check + compute-only per-stage latency/FPS over N iterations.
- examples/lite/CMakeLists.txt: register the lite_face_restoration_bench executable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
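As a rough illustration of the per-output-pixel inverse mapping described in this commit, the following plain C++ loop does on the CPU what one GPU thread per frame pixel would do: map the frame pixel into crop space with the original->crop affine, bilinearly sample the crop and the mask with out-of-range taps treated as 0, blend, and write uint8. It is a sketch under assumptions, not the PR's kernel: the function names are hypothetical, and the data layout (interleaved BGR uint8 crop, float mask in [0, 1]) and rounding are guesses made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Bilinear sample of channel c at (x, y); taps outside the image contribute 0
// (the "border-0" behaviour named in the commit message).
template <typename T>
float sample_bilinear(const T* img, int w, int h, int channels, int c, float x, float y) {
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const float fx = x - static_cast<float>(x0);
    const float fy = y - static_cast<float>(y0);
    auto at = [&](int xi, int yi) -> float {
        if (xi < 0 || yi < 0 || xi >= w || yi >= h) return 0.0f;
        return static_cast<float>(img[(static_cast<std::size_t>(yi) * w + xi) * channels + c]);
    };
    return (1.0f - fx) * (1.0f - fy) * at(x0, y0) + fx * (1.0f - fy) * at(x0 + 1, y0) +
           (1.0f - fx) * fy * at(x0, y0 + 1) + fx * fy * at(x0 + 1, y0 + 1);
}

// Hypothetical CPU reference. m is the 2x3 original->crop affine in row-major
// order: cx = m[0]*x + m[1]*y + m[2], cy = m[3]*x + m[4]*y + m[5].
// Assumed layouts: frame/crop/out are interleaved BGR uint8, mask is float in [0, 1].
inline void paste_back_reference(const std::uint8_t* frame, int fw, int fh,
                                 const std::uint8_t* crop, const float* mask,
                                 int cw, int ch, const float m[6], std::uint8_t* out) {
    for (int y = 0; y < fh; ++y) {
        for (int x = 0; x < fw; ++x) {  // on the GPU: one thread per (x, y)
            const float cx = m[0] * x + m[1] * y + m[2];
            const float cy = m[3] * x + m[4] * y + m[5];
            float a = sample_bilinear(mask, cw, ch, 1, 0, cx, cy);
            a = std::min(1.0f, std::max(0.0f, a));
            const std::size_t idx = (static_cast<std::size_t>(y) * fw + x) * 3;
            for (int c = 0; c < 3; ++c) {
                const float v = a * sample_bilinear(crop, cw, ch, 3, c, cx, cy) +
                                (1.0f - a) * static_cast<float>(frame[idx + c]);
                const float r = std::floor(v + 0.5f);  // round to nearest
                out[idx + c] = static_cast<std::uint8_t>(std::min(255.0f, std::max(0.0f, r)));
            }
        }
    }
}
```

Because each output pixel pulls its own samples, the full-frame warped crop and warped mask never need to be materialized, which is where the two cv::warpAffine passes go away.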

…20.9ms->17.7ms)

Phase 1 continued on the face-restoration stage. Per-stage profiling of preprocess showed two wins:

(1) create_static_box_mask was rebuilt every frame (a large-kernel GaussianBlur, ~10ms) although it only depends on the fixed 512 crop size;
(2) bgr2rgb + normalize + HWC->CHW ran on the CPU (~3.2ms) and then a separate H2D copied the tensor to the device.

Mask is now built once and cached. bgr2rgb/normalize/CHW are fused into a single CUDA kernel that writes the normalized RGB CHW tensor straight into the inference input buffer, removing the per-frame mask rebuild, the CPU tensor build, and the separate H2D.

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512 (cumulative from the 78.2ms baseline): preprocess 14.4ms -> 1.3ms; end-to-end 30.8ms -> 17.7ms; 32.4 -> 56.6 FPS. Output unchanged (PSNR 59.5 dB vs the pre-change result).

Per-file changes:

- lite/trt/kernel/face_restoration_preprocess.cu, face_restoration_preprocess.cuh: new face_restoration_preprocess_kernel: one thread per crop pixel, reads interleaved BGR uint8, writes planar RGB float (CHW) normalized to [-1,1] (v/127.5 - 1).
- lite/trt/kernel/face_restoration_preprocess_manager.cpp, face_restoration_preprocess_manager.h: new FaceRestorePreprocessGPU, owns reusable device + pinned staging buffers and launches the kernel writing directly into the inference input buffer.
- lite/trt/cv/trt_face_restoration.h: add FaceRestorePreprocessGPU member, a cached box mask member, and the new preprocess-manager include.
- lite/trt/cv/trt_face_restoration.cpp: cache the static box mask; replace CPU bgr2rgb/normalize/create_tensor with the fused GPU preprocess into buffers[0]; drop the now redundant H2D in the inference step; add per-substage profiler scopes under preprocess.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
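The fused preprocess described here (BGR->RGB, normalize with v/127.5 - 1, HWC->CHW) amounts to a small amount of arithmetic per pixel. The C++ function below is a CPU reference sketch of that mapping, with each loop iteration playing the role of one GPU thread. `preprocess_reference` is a hypothetical name, and returning a `std::vector` is a convenience for the example, whereas the commit describes writing straight into the inference input buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the fused preprocess step.
// Input:  interleaved BGR uint8, height x width x 3 (HWC).
// Output: planar RGB float, 3 x height x width (CHW), normalized to [-1, 1].
inline std::vector<float> preprocess_reference(const std::uint8_t* bgr, int width, int height) {
    const std::size_t plane = static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    std::vector<float> chw(3 * plane);
    for (std::size_t i = 0; i < plane; ++i) {  // on the GPU: one thread per pixel
        const float b = static_cast<float>(bgr[3 * i + 0]);
        const float g = static_cast<float>(bgr[3 * i + 1]);
        const float r = static_cast<float>(bgr[3 * i + 2]);
        chw[0 * plane + i] = r / 127.5f - 1.0f;  // R plane
        chw[1 * plane + i] = g / 127.5f - 1.0f;  // G plane
        chw[2 * plane + i] = b / 127.5f - 1.0f;  // B plane
    }
    return chw;
}
```

Doing the channel swap, scaling, and layout change in one pass means each input byte is read once and each output float is written once, with no intermediate RGB or float HWC image.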

…ation stage

Document the FaceFusion face-restoration (GFPGAN 1.4) TensorRT optimization with a stage-by-stage baseline vs optimized table (RTX 4090, TRT 10.1, FP32, 512x512): end-to-end 78.2ms -> 17.7ms (4.4x), 12.8 -> 56.6 FPS. Includes setup/methodology (warmup, iterations, p50, compute-only) and how to reproduce via lite_face_restoration_bench.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ffort

Position the work as a repeatable playbook (built-in profiler + moving CPU pre/post into fused CUDA kernels) applied across the whole FaceFusion pipeline, with a per-stage optimization-status table (detect/landmarks/recognize/swap = WIP, restoration done, FP16 next). Keep the GFPGAN face-restoration result as a worked deep-dive rather than the whole story.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

One row per algorithm (before / after / speedup / what changed) so the log reads as a growing list of optimized algorithms rather than a single case study; FaceFusion face restoration is the first entry, further algorithms are placeholders. Move the GFPGAN per-stage breakdown into a collapsible details block.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds a unified benchmarking/profiling utility and a dedicated GFPGAN face-restoration benchmark binary, and refactors the TensorRT GFPGAN restoration path to enable per-stage timing and accelerate pre/post-processing via new CUDA kernels (preprocess fusion + fused paste-back with buffer reuse).

Changes:

Introduces a header-only lite/bench/profiler.h to collect CPU/GPU timing samples, aggregate stats, and export CSV.
Adds lite_face_restoration_bench example executable for end-to-end GFPGAN benchmarking and CPU-vs-GPU paste-back equivalence checking.
Refactors TRTFaceFusionFaceRestoration to expose a compute-only restore() API and integrate fused CUDA preprocess + fused paste-back with buffer reuse and mask caching.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| README.md | Documents benchmarking methodology and reported speedups/breakdowns. |
| lite/trt/kernel/paste_back.cuh | Declares fused paste-back kernel API. |
| lite/trt/kernel/paste_back.cu | Implements fused inverse-mapping paste-back CUDA kernel. |
| lite/trt/kernel/paste_back_manager.h | Adds `PasteBackGPU` reusable GPU paste-back wrapper. |
| lite/trt/kernel/paste_back_manager.cpp | Implements `PasteBackGPU` buffer management and kernel launch. |
| lite/trt/kernel/face_restoration_preprocess.cuh | Declares fused preprocess CUDA kernel (BGR->RGB, normalize, HWC->CHW). |
| lite/trt/kernel/face_restoration_preprocess.cu | Implements fused preprocess CUDA kernel. |
| lite/trt/kernel/face_restoration_preprocess_manager.h | Adds reusable GPU preprocess wrapper writing directly to TRT input buffer. |
| lite/trt/kernel/face_restoration_preprocess_manager.cpp | Implements preprocess buffer reuse + kernel launch. |
| lite/trt/cv/trt_face_restoration.h | Adds `restore()` API and profiling hook; wires in new GPU helpers + mask cache. |
| lite/trt/cv/trt_face_restoration.cpp | Refactors restore pipeline into timed stages and uses new GPU preprocess/paste-back. |
| lite/bench/profiler.h | Adds benchmarking/profiling utility and instrumentation macros. |
| examples/lite/cv/test_lite_face_restoration_bench.cpp | Adds GFPGAN restoration benchmark executable and equivalence check. |
| examples/lite/CMakeLists.txt | Registers `lite_face_restoration_bench` build target. |

+    cv::Mat crop = crop_vision_frame.isContinuous() ? crop_vision_frame : crop_vision_frame.clone();
+    cv::Mat mask = crop_mask.isContinuous() ? crop_mask : crop_mask.clone();
+


+        launch_face_restoration_postprocess(
+                static_cast<float *>(buffers[1]), transposed_data.data(), 3, 512, 512);
+        std::vector<float> transposed_data_float(transposed_data.begin(), transposed_data.end());
+        cudaStreamSynchronize(stream);
+


+// GPU fused version: inverse-mapping sampling + blend entirely in the kernel, reused device
+// buffers, pinned + async copies. Numerically equivalent to launch_paste_back; returns full-frame BGR uint8.
+class PasteBackGPU {


DefTruth · 2026-05-31T06:51:55Z

+## ⚡ Benchmark 🔥
+<div id="benchmark"></div>
+
+GPU-inference optimization log. For each algorithm we profile it with a built-in, backend-agnostic harness ([`lite/bench/profiler.h`](https://github.com/xlite-dev/lite.ai.toolkit/blob/main/lite/bench/profiler.h)), then move the CPU pre/post-processing (affine warp, color convert, normalize, tensor layout, paste-back, NMS …) into **fused CUDA kernels** with reused device buffers and pinned + async copies, so the algorithm spends its time on real inference instead of host glue and `cudaMalloc`/sync round-trips. All numbers are **RTX 4090 · TensorRT 10.1 · CUDA 12.4**, median (p50), compute-only, reproducible via the `lite_*_bench` binaries.


For the benchmark section, I suggest using the latest TensorRT and CUDA 13+.

wangzijian1010 and others added 5 commits May 30, 2026 22:37

wangzijian1010 requested a review from Copilot May 31, 2026 01:39

Copilot started reviewing on behalf of wangzijian1010 May 31, 2026 01:40

wangzijian1010 requested a review from DefTruth May 31, 2026 01:40

Copilot AI reviewed May 31, 2026

Update README.md

68c2cd5

DefTruth reviewed May 31, 2026

DefTruth approved these changes Jun 1, 2026

wangzijian1010 wants to merge 6 commits into xlite-dev:main from wangzijian1010:start_0601

3 participants
