
GFPGAN Speed UP PR #473

Open
wangzijian1010 wants to merge 6 commits into xlite-dev:main from wangzijian1010:start_0601

Conversation

@wangzijian1010
Member

This pull request introduces a comprehensive GPU benchmarking framework and significant optimizations for the face restoration pipeline (GFPGAN) in the lite.ai.toolkit project. It adds a unified, backend-agnostic benchmarking utility, a dedicated face restoration benchmark executable, and refactors the core restoration code to enable detailed, per-stage timing and substantial speedups via CUDA kernel fusion and buffer reuse.

Benchmarking and Profiling Infrastructure:

  • Added a header-only, backend-agnostic benchmarking utility (profiler.h) that supports CPU and GPU timing, aggregates statistics (mean, percentiles, min/max), and can export results as aligned tables or CSV files. It provides convenient macros for instrumenting code and is designed for minimal overhead when disabled.

  • Introduced a new benchmark executable (lite_face_restoration_bench) for end-to-end profiling of the GFPGAN face restoration stage. This tool checks CPU vs GPU correctness for the paste-back operation, runs warmup and timed iterations, and reports per-stage and total timings, including CSV export.
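
The profiler described above can be pictured with a minimal CPU-only sketch. Everything here (`MiniProfiler`, `StageStats`, `ScopedCpuTimer`, the nearest-rank percentile rule) is a hypothetical stand-in chosen for illustration, not the API of `lite/bench/profiler.h`; the CUDA-event timers and CSV export that the PR mentions are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aggregate for one stage; field names are illustrative only.
struct StageStats {
  double mean_ms, p50_ms, p90_ms, p99_ms, min_ms, max_ms;
};

// Hypothetical stand-in for a profiler: stores raw samples per stage name.
class MiniProfiler {
 public:
  void record(const std::string &stage, double ms) { samples_[stage].push_back(ms); }

  // Aggregates one stage. Percentiles use the nearest-rank rule (an assumption).
  StageStats stats(const std::string &stage) const {
    std::vector<double> v = samples_.at(stage);  // throws if the stage is unknown
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    double sum = 0.0;
    for (double x : v) sum += x;
    auto pct = [&v](double p) {
      std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * v.size()));
      return v[rank == 0 ? 0 : rank - 1];
    };
    return {sum / v.size(), pct(50), pct(90), pct(99), v.front(), v.back()};
  }

 private:
  std::map<std::string, std::vector<double>> samples_;
};

// RAII timer: records the elapsed wall-clock time when it leaves scope.
// Passing a null profiler disables it, so call sites need no extra branching.
class ScopedCpuTimer {
 public:
  ScopedCpuTimer(MiniProfiler *profiler, std::string stage)
      : profiler_(profiler), stage_(std::move(stage)),
        start_(std::chrono::steady_clock::now()) {}
  ~ScopedCpuTimer() {
    if (!profiler_) return;
    std::chrono::duration<double, std::milli> dt =
        std::chrono::steady_clock::now() - start_;
    profiler_->record(stage_, dt.count());
  }

 private:
  MiniProfiler *profiler_;
  std::string stage_;
  std::chrono::steady_clock::time_point start_;
};
```

Wrapping a stage in `{ ScopedCpuTimer t(&profiler, "paste_back"); ... }` records one sample per iteration. Note that a wall-clock timer only measures GPU work correctly if the stream is synchronized before the scope ends, which is one reason a real harness would time GPU stages with CUDA events instead.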

Face Restoration Pipeline Optimizations:

  • Refactored the core TRTFaceFusionFaceRestoration class to support detailed per-stage profiling and major performance improvements:
    • Fused CPU pre/post-processing steps (affine warp, color conversion, normalization, tensor layout, paste-back) into CUDA kernels with buffer reuse and asynchronous memory copies.
    • Cached the static box mask to avoid redundant computation.
    • Exposed a new restore method that returns the restored frame (without disk I/O) and accepts an optional profiler for instrumentation.

Documentation and Results:

  • Added a detailed "Benchmark" section to the README.md, summarizing the optimization strategies, measured speedups (up to 4.4× end-to-end), and per-stage breakdowns, with a reproducible methodology and environment.

Summary of Most Important Changes:

Benchmarking Infrastructure:

  • Added a unified, header-only benchmarking utility (lite/bench/profiler.h) for CPU and GPU timing, stats aggregation, and CSV export.
  • Introduced a dedicated face restoration benchmark executable (lite_face_restoration_bench) and its CMake target.

Face Restoration Pipeline Optimization:

  • Refactored TRTFaceFusionFaceRestoration to support detailed per-stage profiling, GPU kernel fusion, buffer reuse, and a new restore method for compute-only evaluation.

Documentation:

  • Added a comprehensive "Benchmark" section to the README.md, including performance results and optimization details.

wangzijian1010 and others added 5 commits May 30, 2026 22:37
…mark harness

Phase 1 first cut on the TensorRT face-restoration (GFPGAN) stage. Profiling showed
paste_back was ~50% of the pipeline: two full-frame cv::warpAffine on the CPU plus
per-call cudaMalloc/free and synchronous copies. Rewrote it as a single inverse-mapping
CUDA kernel with reused device buffers and pinned/async copies.

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512:
  paste_back 39.07ms -> 2.37ms (16.5x); end-to-end 78.2ms -> 30.4ms; 12.8 -> 32.9 FPS.
CPU vs GPU output is numerically equivalent (max|diff| = 2/255).

Per-file changes:
- lite/bench/profiler.h: new header-only, backend-agnostic profiler. CPU chrono +
  CUDA-event timers, mean/p50/p90/p99 + FPS aggregation, CSV export, scoped-timer macros.
- lite/trt/kernel/paste_back.cu, paste_back.cuh: add paste_back_fused_kernel —
  per-output-pixel inverse mapping (uses the original->crop affine directly, no inversion),
  bilinear crop/mask sampling with border-0, blend, writes uint8. Old paste_back_kernel kept.
- lite/trt/kernel/paste_back_manager.cpp, paste_back_manager.h: add PasteBackGPU, which owns
  reusable device + pinned buffers (ensure_capacity) and drives the fused kernel. Old CPU
  launch_paste_back kept for A/B reference. Comments translated to English.
- lite/trt/cv/trt_face_restoration.cpp, trt_face_restoration.h: factor detect() into restore()
  (returns the frame, no disk write, optional per-stage Profiler); detect() now calls
  restore() + imwrite; restore() uses the GPU PasteBackGPU member instead of CPU paste_back.
- examples/lite/cv/test_lite_face_restoration_bench.cpp: new benchmark — CPU-vs-GPU
  paste_back equivalence check + compute-only per-stage latency/FPS over N iterations.
- examples/lite/CMakeLists.txt: register the lite_face_restoration_bench executable.
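
To make the inverse-mapping idea behind the fused paste-back concrete, here is a CPU reference sketch in plain C++. It is not the code of `paste_back_fused_kernel`: the function names, the uint8 crop, the float mask in [0, 1], and the blend formula `a * crop + (1 - a) * frame` are assumptions made for illustration. Each iteration of the double loop corresponds to what one GPU thread would do for one output pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bilinear sample of channel c from an interleaved image.
// Taps that fall outside the image contribute 0 (border-0).
template <typename T>
float sample_bilinear(const std::vector<T> &img, int w, int h, int channels, int c,
                      float x, float y) {
  int x0 = static_cast<int>(std::floor(x));
  int y0 = static_cast<int>(std::floor(y));
  float fx = x - x0, fy = y - y0;
  auto at = [&](int xi, int yi) -> float {
    if (xi < 0 || yi < 0 || xi >= w || yi >= h) return 0.0f;
    return static_cast<float>(img[(static_cast<std::size_t>(yi) * w + xi) * channels + c]);
  };
  return (1 - fy) * ((1 - fx) * at(x0, y0) + fx * at(x0 + 1, y0)) +
         fy * ((1 - fx) * at(x0, y0 + 1) + fx * at(x0 + 1, y0 + 1));
}

// Hypothetical CPU reference. m is a row-major 2x3 affine that maps frame
// coordinates (x, y) to crop coordinates (u, v), so no matrix inversion is needed.
// frame: interleaved 3-channel uint8, modified in place.
// crop: interleaved 3-channel uint8; mask: single-channel float in [0, 1] (assumed).
void paste_back_reference(std::vector<std::uint8_t> &frame, int fw, int fh,
                          const std::vector<std::uint8_t> &crop,
                          const std::vector<float> &mask, int cw, int ch,
                          const float m[6]) {
  for (int y = 0; y < fh; ++y) {
    for (int x = 0; x < fw; ++x) {
      float u = m[0] * x + m[1] * y + m[2];
      float v = m[3] * x + m[4] * y + m[5];
      float a = sample_bilinear(mask, cw, ch, 1, 0, u, v);
      a = std::min(1.0f, std::max(0.0f, a));
      if (a <= 0.0f) continue;  // pixel lies outside the pasted region
      for (int c = 0; c < 3; ++c) {
        std::size_t idx = (static_cast<std::size_t>(y) * fw + x) * 3 + c;
        float src = sample_bilinear(crop, cw, ch, 3, c, u, v);
        float out = a * src + (1.0f - a) * frame[idx];
        out = std::min(255.0f, std::max(0.0f, out));
        frame[idx] = static_cast<std::uint8_t>(std::lround(out));
      }
    }
  }
}
```

Because every output pixel is computed independently from read-only inputs, the loop body maps directly onto a one-thread-per-pixel kernel, which is what lets the warp, mask sampling, and blend collapse into a single pass over the frame.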

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…20.9ms->17.7ms)

Phase 1 continued on the face-restoration stage. Per-stage profiling of preprocess showed
two wins: (1) create_static_box_mask was rebuilt every frame (a large-kernel GaussianBlur,
~10ms) although it only depends on the fixed 512 crop size; (2) bgr2rgb + normalize +
HWC->CHW ran on the CPU (~3.2ms) and then a separate H2D copied the tensor to the device.

Mask is now built once and cached. bgr2rgb/normalize/CHW are fused into a single CUDA kernel
that writes the normalized RGB CHW tensor straight into the inference input buffer, removing
the per-frame mask rebuild, the CPU tensor build, and the separate H2D.
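
The build-once-and-cache step can be sketched as a small memoizing wrapper. `CachedMask` and its `Builder` callback are hypothetical names; the builder stands in for whatever function actually produces the box mask, and the cache key (crop width and height) follows the commit's remark that the mask depends only on the fixed crop size.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical cache: rebuilds the mask only when the requested size changes.
class CachedMask {
 public:
  using Builder = std::function<std::vector<float>(int, int)>;

  explicit CachedMask(Builder build) : build_(std::move(build)) {}

  // Returns the cached mask, invoking the (expensive) builder on a size change only.
  const std::vector<float> &get(int width, int height) {
    if (width != width_ || height != height_) {
      mask_ = build_(width, height);
      width_ = width;
      height_ = height;
      ++builds_;
    }
    return mask_;
  }

  int builds() const { return builds_; }  // number of real rebuilds, for checking

 private:
  Builder build_;
  std::vector<float> mask_;
  int width_ = -1;
  int height_ = -1;
  int builds_ = 0;
};
```

With a fixed 512 crop, `get(512, 512)` runs the builder on the first frame and returns the stored mask on every later frame.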

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512 (cumulative from the 78.2ms baseline):
  preprocess 14.4ms -> 1.3ms; end-to-end 30.8ms -> 17.7ms; 32.4 -> 56.6 FPS.
Output unchanged (PSNR 59.5 dB vs the pre-change result).
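
The two equivalence figures quoted in these commits (a maximum absolute difference and a PSNR) can be computed as in the following sketch. The function names are hypothetical and the benchmark's actual check may differ; the PSNR uses the standard 8-bit definition, 10 * log10(255^2 / MSE).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Largest per-element absolute difference between two uint8 buffers.
int max_abs_diff(const std::vector<std::uint8_t> &a, const std::vector<std::uint8_t> &b) {
  assert(a.size() == b.size());
  int worst = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    worst = std::max(worst, std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i])));
  }
  return worst;
}

// PSNR in dB for 8-bit data; returns +infinity when the buffers are identical.
double psnr_db(const std::vector<std::uint8_t> &a, const std::vector<std::uint8_t> &b) {
  assert(a.size() == b.size() && !a.empty());
  double sse = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
    sse += d * d;
  }
  if (sse == 0.0) return std::numeric_limits<double>::infinity();
  double mse = sse / static_cast<double>(a.size());
  return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```

The maximum difference bounds the worst single pixel, while PSNR summarizes the average error, so reporting both guards against a change that is small on average but wrong in a few places.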

Per-file changes:
- lite/trt/kernel/face_restoration_preprocess.cu, face_restoration_preprocess.cuh: new
  face_restoration_preprocess_kernel — one thread per crop pixel, reads interleaved BGR uint8,
  writes planar RGB float (CHW) normalized to [-1,1] (v/127.5 - 1).
- lite/trt/kernel/face_restoration_preprocess_manager.cpp, face_restoration_preprocess_manager.h:
  new FaceRestorePreprocessGPU, owns reusable device + pinned staging buffers and launches the
  kernel writing directly into the inference input buffer.
- lite/trt/cv/trt_face_restoration.h: add FaceRestorePreprocessGPU member, a cached box mask
  member, and the new preprocess-manager include.
- lite/trt/cv/trt_face_restoration.cpp: cache the static box mask; replace CPU
  bgr2rgb/normalize/create_tensor with the fused GPU preprocess into buffers[0]; drop the now
  redundant H2D in the inference step; add per-substage profiler scopes under preprocess.
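
As a CPU reference for what the fused preprocess computes per pixel, the sketch below performs the same three steps in one pass: BGR to RGB, normalization with `v / 127.5 - 1`, and HWC to CHW. The function name and signature are hypothetical; the real kernel runs one thread per crop pixel and writes into the inference input buffer rather than a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the fused preprocess step.
// Input: interleaved BGR uint8 (HWC). Output: planar RGB float (CHW) in [-1, 1].
void preprocess_reference(const std::vector<std::uint8_t> &bgr_hwc, int width, int height,
                          std::vector<float> &rgb_chw) {
  const std::size_t plane = static_cast<std::size_t>(width) * height;
  assert(bgr_hwc.size() == plane * 3);
  rgb_chw.resize(plane * 3);
  for (std::size_t p = 0; p < plane; ++p) {  // one iteration == one GPU thread
    for (int c = 0; c < 3; ++c) {
      // Output channel c (R, G, B) reads input channel 2 - c (stored as B, G, R).
      rgb_chw[c * plane + p] = bgr_hwc[p * 3 + (2 - c)] / 127.5f - 1.0f;
    }
  }
}
```

Doing all three steps in a single pass means each pixel is read once and written once, instead of producing an intermediate image after every step.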

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ation stage

Document the FaceFusion face-restoration (GFPGAN 1.4) TensorRT optimization with a
stage-by-stage baseline vs optimized table (RTX 4090, TRT 10.1, FP32, 512x512):
end-to-end 78.2ms -> 17.7ms (4.4x), 12.8 -> 56.6 FPS. Includes setup/methodology
(warmup, iterations, p50, compute-only) and how to reproduce via lite_face_restoration_bench.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ffort

Position the work as a repeatable playbook (built-in profiler + moving CPU pre/post
into fused CUDA kernels) applied across the whole FaceFusion pipeline, with a per-stage
optimization-status table (detect/landmarks/recognize/swap = WIP, restoration done,
FP16 next). Keep the GFPGAN face-restoration result as a worked deep-dive rather than
the whole story.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
One row per algorithm (before / after / speedup / what changed) so the log reads as a
growing list of optimized algorithms rather than a single case study; FaceFusion face
restoration is the first entry, further algorithms are placeholders. Move the GFPGAN
per-stage breakdown into a collapsible details block.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR adds a unified benchmarking/profiling utility and a dedicated GFPGAN face-restoration benchmark binary, and refactors the TensorRT GFPGAN restoration path to enable per-stage timing and accelerate pre/post-processing via new CUDA kernels (preprocess fusion + fused paste-back with buffer reuse).

Changes:

  • Introduces a header-only lite/bench/profiler.h to collect CPU/GPU timing samples, aggregate stats, and export CSV.
  • Adds lite_face_restoration_bench example executable for end-to-end GFPGAN benchmarking and CPU-vs-GPU paste-back equivalence checking.
  • Refactors TRTFaceFusionFaceRestoration to expose a compute-only restore() API and integrate fused CUDA preprocess + fused paste-back with buffer reuse and mask caching.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| README.md | Documents benchmarking methodology and reported speedups/breakdowns. |
| lite/trt/kernel/paste_back.cuh | Declares fused paste-back kernel API. |
| lite/trt/kernel/paste_back.cu | Implements fused inverse-mapping paste-back CUDA kernel. |
| lite/trt/kernel/paste_back_manager.h | Adds PasteBackGPU reusable GPU paste-back wrapper. |
| lite/trt/kernel/paste_back_manager.cpp | Implements PasteBackGPU buffer management and kernel launch. |
| lite/trt/kernel/face_restoration_preprocess.cuh | Declares fused preprocess CUDA kernel (BGR->RGB, normalize, HWC->CHW). |
| lite/trt/kernel/face_restoration_preprocess.cu | Implements fused preprocess CUDA kernel. |
| lite/trt/kernel/face_restoration_preprocess_manager.h | Adds reusable GPU preprocess wrapper writing directly to TRT input buffer. |
| lite/trt/kernel/face_restoration_preprocess_manager.cpp | Implements preprocess buffer reuse + kernel launch. |
| lite/trt/cv/trt_face_restoration.h | Adds restore() API and profiling hook; wires in new GPU helpers + mask cache. |
| lite/trt/cv/trt_face_restoration.cpp | Refactors restore pipeline into timed stages and uses new GPU preprocess/paste-back. |
| lite/bench/profiler.h | Adds benchmarking/profiling utility and instrumentation macros. |
| examples/lite/cv/test_lite_face_restoration_bench.cpp | Adds GFPGAN restoration benchmark executable and equivalence check. |
| examples/lite/CMakeLists.txt | Registers lite_face_restoration_bench build target. |


Comment on lines +133 to +135
cv::Mat crop = crop_vision_frame.isContinuous() ? crop_vision_frame : crop_vision_frame.clone();
cv::Mat mask = crop_mask.isContinuous() ? crop_mask : crop_mask.clone();

Comment on lines +62 to +66
launch_face_restoration_postprocess(
static_cast<float *>(buffers[1]), transposed_data.data(), 3, 512, 512);
std::vector<float> transposed_data_float(transposed_data.begin(), transposed_data.end());
cudaStreamSynchronize(stream);

Comment on lines +13 to +15
// GPU fused version: inverse-mapping sampling + blend entirely in the kernel, reused device
// buffers, pinned + async copies. Numerically equivalent to launch_paste_back; returns full-frame BGR uint8.
class PasteBackGPU {
Comment thread lite/bench/profiler.h
Comment thread README.md
## ⚡ Benchmark 🔥
<div id="benchmark"></div>

GPU-inference optimization log. For each algorithm we profile it with a built-in, backend-agnostic harness ([`lite/bench/profiler.h`](https://github.com/xlite-dev/lite.ai.toolkit/blob/main/lite/bench/profiler.h)), then move the CPU pre/post-processing (affine warp, color convert, normalize, tensor layout, paste-back, NMS …) into **fused CUDA kernels** with reused device buffers and pinned + async copies, so the algorithm spends its time on real inference instead of host glue and `cudaMalloc`/sync round-trips. All numbers are **RTX 4090 · TensorRT 10.1 · CUDA 12.4**, median (p50), compute-only, reproducible via the `lite_*_bench` binaries.
Member


For the benchmark section, I suggest using the latest TensorRT and CUDA 13+.
