GFPGAN Speed Up (PR #473)
Open
wangzijian1010 wants to merge 6 commits into
…mark harness

Phase 1 first cut on the TensorRT face-restoration (GFPGAN) stage. Profiling showed paste_back was ~50% of the pipeline: two full-frame cv::warpAffine calls on the CPU plus per-call cudaMalloc/free and synchronous copies. Rewrote it as a single inverse-mapping CUDA kernel with reused device buffers and pinned/async copies.

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512: paste_back 39.07ms -> 2.37ms (16.5x); end-to-end 78.2ms -> 30.4ms; 12.8 -> 32.9 FPS. CPU vs GPU output is numerically equivalent (max|diff| = 2/255).

Per-file changes:
- lite/bench/profiler.h: new header-only, backend-agnostic profiler. CPU chrono + CUDA-event timers, mean/p50/p90/p99 + FPS aggregation, CSV export, scoped-timer macros.
- lite/trt/kernel/paste_back.cu, paste_back.cuh: add paste_back_fused_kernel: per-output-pixel inverse mapping (uses the original->crop affine directly, no inversion), bilinear crop/mask sampling with border-0, blend, writes uint8. Old paste_back_kernel kept.
- lite/trt/kernel/paste_back_manager.cpp, paste_back_manager.h: add PasteBackGPU, which owns reusable device + pinned buffers (ensure_capacity) and drives the fused kernel. Old CPU launch_paste_back kept for A/B reference. Comments translated to English.
- lite/trt/cv/trt_face_restoration.cpp, trt_face_restoration.h: factor detect() into restore() (returns the frame, no disk write, optional per-stage Profiler); detect() now calls restore() + imwrite; restore() uses the GPU PasteBackGPU member instead of CPU paste_back.
- examples/lite/cv/test_lite_face_restoration_bench.cpp: new benchmark: CPU-vs-GPU paste_back equivalence check + compute-only per-stage latency/FPS over N iterations.
- examples/lite/CMakeLists.txt: register the lite_face_restoration_bench executable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
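The per-output-pixel inverse mapping described in this commit can be illustrated with a CPU reference. This is a minimal sketch, not the PR's kernel: the function names `sample_border0` and `paste_back_reference`, the row-major 2x3 affine layout, and the float mask in [0, 1] are all assumptions made for illustration. Each iteration of the inner loop corresponds to what one CUDA thread would do.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bilinear sample of channel `c` from an interleaved image; taps that fall
// outside the image read as 0 (the "border-0" rule from the commit message).
template <typename T>
float sample_border0(const std::vector<T>& img, int w, int h, int channels,
                     int c, float u, float v) {
    const int x0 = static_cast<int>(std::floor(u));
    const int y0 = static_cast<int>(std::floor(v));
    const float fx = u - static_cast<float>(x0);
    const float fy = v - static_cast<float>(y0);
    auto at = [&](int x, int y) -> float {
        if (x < 0 || y < 0 || x >= w || y >= h) return 0.0f;
        return static_cast<float>(
            img[(static_cast<std::size_t>(y) * w + x) * channels + c]);
    };
    const float top = (1.0f - fx) * at(x0, y0) + fx * at(x0 + 1, y0);
    const float bottom = (1.0f - fx) * at(x0, y0 + 1) + fx * at(x0 + 1, y0 + 1);
    return (1.0f - fy) * top + fy * bottom;
}

// Hypothetical CPU reference. `m` is the 2x3 original->crop affine, row-major:
//   u = m[0]*x + m[1]*y + m[2],  v = m[3]*x + m[4]*y + m[5]
// `frame` and `crop` are interleaved BGR uint8; `mask` is float in [0, 1].
void paste_back_reference(std::vector<std::uint8_t>& frame, int fw, int fh,
                          const std::vector<std::uint8_t>& crop,
                          const std::vector<float>& mask, int cw, int ch,
                          const float m[6]) {
    assert(frame.size() == static_cast<std::size_t>(fw) * fh * 3);
    assert(crop.size() == static_cast<std::size_t>(cw) * ch * 3);
    assert(mask.size() == static_cast<std::size_t>(cw) * ch);
    for (int y = 0; y < fh; ++y) {
        for (int x = 0; x < fw; ++x) {  // one thread per (x, y) in a kernel version
            const float u = m[0] * x + m[1] * y + m[2];
            const float v = m[3] * x + m[4] * y + m[5];
            const float a = sample_border0(mask, cw, ch, 1, 0, u, v);
            if (a <= 0.0f) continue;    // outside the crop: keep the original pixel
            for (int c = 0; c < 3; ++c) {
                const std::size_t idx =
                    (static_cast<std::size_t>(y) * fw + x) * 3 + c;
                const float s = sample_border0(crop, cw, ch, 3, c, u, v);
                const float blended =
                    a * s + (1.0f - a) * static_cast<float>(frame[idx]);
                frame[idx] = static_cast<std::uint8_t>(
                    std::min(255.0f, std::max(0.0f, std::round(blended))));
            }
        }
    }
}
```

Because the loop walks output pixels and maps each one into crop space, no inverse of the affine is needed and every frame pixel is written at most once, which is what makes a single fused kernel possible.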
…20.9ms->17.7ms)

Phase 1 continued on the face-restoration stage. Per-stage profiling of preprocess showed two wins: (1) create_static_box_mask was rebuilt every frame (a large-kernel GaussianBlur, ~10ms) although it only depends on the fixed 512 crop size; (2) bgr2rgb + normalize + HWC->CHW ran on the CPU (~3.2ms) and then a separate H2D copied the tensor to the device.

Mask is now built once and cached. bgr2rgb/normalize/CHW are fused into a single CUDA kernel that writes the normalized RGB CHW tensor straight into the inference input buffer, removing the per-frame mask rebuild, the CPU tensor build, and the separate H2D.

Measured on RTX 4090 / TRT 10.1 fp32, gfpgan 512 (cumulative from the 78.2ms baseline): preprocess 14.4ms -> 1.3ms; end-to-end 30.8ms -> 17.7ms; 32.4 -> 56.6 FPS. Output unchanged (PSNR 59.5 dB vs the pre-change result).

Per-file changes:
- lite/trt/kernel/face_restoration_preprocess.cu, face_restoration_preprocess.cuh: new face_restoration_preprocess_kernel: one thread per crop pixel, reads interleaved BGR uint8, writes planar RGB float (CHW) normalized to [-1,1] (v/127.5 - 1).
- lite/trt/kernel/face_restoration_preprocess_manager.cpp, face_restoration_preprocess_manager.h: new FaceRestorePreprocessGPU, owns reusable device + pinned staging buffers and launches the kernel writing directly into the inference input buffer.
- lite/trt/cv/trt_face_restoration.h: add FaceRestorePreprocessGPU member, a cached box mask member, and the new preprocess-manager include.
- lite/trt/cv/trt_face_restoration.cpp: cache the static box mask; replace CPU bgr2rgb/normalize/create_tensor with the fused GPU preprocess into buffers[0]; drop the now redundant H2D in the inference step; add per-substage profiler scopes under preprocess.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
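The fused preprocess step (BGR->RGB, normalize, HWC->CHW) reduces to one pass over the crop. The following is a hedged CPU reference of that transform, assuming a tightly packed input; the name `preprocess_reference` is hypothetical and does not come from the PR. Each loop iteration is the work one thread would do in a kernel version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference: interleaved BGR uint8 (HWC) in, planar RGB float
// (CHW) out, each value normalized to [-1, 1] via v / 127.5 - 1.
std::vector<float> preprocess_reference(const std::vector<std::uint8_t>& bgr,
                                        int width, int height) {
    const std::size_t plane = static_cast<std::size_t>(width) * height;
    assert(bgr.size() == plane * 3);
    std::vector<float> chw(plane * 3);
    for (std::size_t i = 0; i < plane; ++i) {  // one thread per pixel in a kernel
        const float b = bgr[i * 3 + 0];
        const float g = bgr[i * 3 + 1];
        const float r = bgr[i * 3 + 2];
        chw[0 * plane + i] = r / 127.5f - 1.0f;  // R plane
        chw[1 * plane + i] = g / 127.5f - 1.0f;  // G plane
        chw[2 * plane + i] = b / 127.5f - 1.0f;  // B plane
    }
    return chw;
}
```

A reference like this is mainly useful as the ground truth in an equivalence test against the GPU path, since the channel swap and the layout change are easy to get subtly wrong.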
…ation stage

Document the FaceFusion face-restoration (GFPGAN 1.4) TensorRT optimization with a stage-by-stage baseline vs optimized table (RTX 4090, TRT 10.1, FP32, 512x512): end-to-end 78.2ms -> 17.7ms (4.4x), 12.8 -> 56.6 FPS. Includes setup/methodology (warmup, iterations, p50, compute-only) and how to reproduce via lite_face_restoration_bench.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ffort

Position the work as a repeatable playbook (built-in profiler + moving CPU pre/post into fused CUDA kernels) applied across the whole FaceFusion pipeline, with a per-stage optimization-status table (detect/landmarks/recognize/swap = WIP, restoration done, FP16 next). Keep the GFPGAN face-restoration result as a worked deep-dive rather than the whole story.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

One row per algorithm (before / after / speedup / what changed) so the log reads as a growing list of optimized algorithms rather than a single case study; FaceFusion face restoration is the first entry, further algorithms are placeholders. Move the GFPGAN per-stage breakdown into a collapsible details block.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
This PR adds a unified benchmarking/profiling utility and a dedicated GFPGAN face-restoration benchmark binary, and refactors the TensorRT GFPGAN restoration path to enable per-stage timing and accelerate pre/post-processing via new CUDA kernels (preprocess fusion + fused paste-back with buffer reuse).
Changes:
- Introduces a header-only `lite/bench/profiler.h` to collect CPU/GPU timing samples, aggregate stats, and export CSV.
- Adds a `lite_face_restoration_bench` example executable for end-to-end GFPGAN benchmarking and CPU-vs-GPU paste-back equivalence checking.
- Refactors `TRTFaceFusionFaceRestoration` to expose a compute-only `restore()` API and integrate fused CUDA preprocess + fused paste-back with buffer reuse and mask caching.
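To make the "collect timing samples, aggregate stats" idea concrete, here is a minimal sketch of a scoped CPU timer plus mean/percentile/FPS aggregation. It is not the API of `lite/bench/profiler.h`: the names `Stats`, `percentile_sorted`, `aggregate` and `ScopedCpuTimer` are hypothetical, the nearest-rank percentile rule is an assumption (the real profiler may interpolate), and the CUDA-event timers mentioned in the PR are left out so the sketch runs without a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

struct Stats {
    double mean_ms, p50_ms, p90_ms, p99_ms, fps;
};

// Nearest-rank percentile over an ascending-sorted sample vector.
inline double percentile_sorted(const std::vector<double>& sorted, int p) {
    assert(!sorted.empty() && p >= 1 && p <= 100);
    const std::size_t n = sorted.size();
    const std::size_t rank = (static_cast<std::size_t>(p) * n + 99) / 100;  // ceil(p*n/100)
    return sorted[rank - 1];
}

// Aggregates per-iteration latencies (milliseconds) into summary statistics.
inline Stats aggregate(std::vector<double> samples_ms) {
    assert(!samples_ms.empty());
    std::sort(samples_ms.begin(), samples_ms.end());
    const double mean =
        std::accumulate(samples_ms.begin(), samples_ms.end(), 0.0) /
        static_cast<double>(samples_ms.size());
    return {mean, percentile_sorted(samples_ms, 50),
            percentile_sorted(samples_ms, 90),
            percentile_sorted(samples_ms, 99), 1000.0 / mean};
}

// Scoped CPU timer: appends the elapsed wall time in milliseconds to `sink`
// when it goes out of scope.
class ScopedCpuTimer {
public:
    explicit ScopedCpuTimer(std::vector<double>& sink)
        : sink_(sink), start_(std::chrono::steady_clock::now()) {}
    ~ScopedCpuTimer() {
        const std::chrono::duration<double, std::milli> dt =
            std::chrono::steady_clock::now() - start_;
        sink_.push_back(dt.count());
    }
    ScopedCpuTimer(const ScopedCpuTimer&) = delete;
    ScopedCpuTimer& operator=(const ScopedCpuTimer&) = delete;

private:
    std::vector<double>& sink_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping a stage in `{ ScopedCpuTimer t(samples); run_stage(); }` (with `run_stage` standing in for any stage) records one sample per iteration; calling `aggregate(samples)` after the timed loop yields the summary row. Reporting the median rather than the mean keeps occasional slow iterations from skewing the headline number.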
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| README.md | Documents benchmarking methodology and reported speedups/breakdowns. |
| lite/trt/kernel/paste_back.cuh | Declares fused paste-back kernel API. |
| lite/trt/kernel/paste_back.cu | Implements fused inverse-mapping paste-back CUDA kernel. |
| lite/trt/kernel/paste_back_manager.h | Adds PasteBackGPU reusable GPU paste-back wrapper. |
| lite/trt/kernel/paste_back_manager.cpp | Implements PasteBackGPU buffer management and kernel launch. |
| lite/trt/kernel/face_restoration_preprocess.cuh | Declares fused preprocess CUDA kernel (BGR->RGB, normalize, HWC->CHW). |
| lite/trt/kernel/face_restoration_preprocess.cu | Implements fused preprocess CUDA kernel. |
| lite/trt/kernel/face_restoration_preprocess_manager.h | Adds reusable GPU preprocess wrapper writing directly to TRT input buffer. |
| lite/trt/kernel/face_restoration_preprocess_manager.cpp | Implements preprocess buffer reuse + kernel launch. |
| lite/trt/cv/trt_face_restoration.h | Adds restore() API and profiling hook; wires in new GPU helpers + mask cache. |
| lite/trt/cv/trt_face_restoration.cpp | Refactors restore pipeline into timed stages and uses new GPU preprocess/paste-back. |
| lite/bench/profiler.h | Adds benchmarking/profiling utility and instrumentation macros. |
| examples/lite/cv/test_lite_face_restoration_bench.cpp | Adds GFPGAN restoration benchmark executable and equivalence check. |
| examples/lite/CMakeLists.txt | Registers lite_face_restoration_bench build target. |
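The equivalence check listed for the benchmark file compares a CPU result against a GPU result. A hedged sketch of the two metrics quoted in the commit messages (max absolute difference and PSNR) follows; the function names are hypothetical and the actual benchmark may compute them differently, for example through OpenCV.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Largest per-element absolute difference between two uint8 buffers.
int max_abs_diff(const std::vector<std::uint8_t>& a,
                 const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size());
    int worst = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i])));
    }
    return worst;
}

// PSNR in dB for 8-bit data: 10 * log10(255^2 / MSE); infinite when identical.
double psnr_db(const std::vector<std::uint8_t>& a,
               const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sse = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sse += d * d;
    }
    if (sse == 0.0) return std::numeric_limits<double>::infinity();
    const double mse = sse / static_cast<double>(a.size());
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```

The two metrics answer different questions: the maximum difference bounds the worst single pixel, while PSNR summarizes the average error over the whole frame.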
Comment on lines +133 to +135

    cv::Mat crop = crop_vision_frame.isContinuous() ? crop_vision_frame : crop_vision_frame.clone();
    cv::Mat mask = crop_mask.isContinuous() ? crop_mask : crop_mask.clone();
Comment on lines +62 to +66

    launch_face_restoration_postprocess(
    static_cast<float *>(buffers[1]), transposed_data.data(), 3, 512, 512);
    std::vector<float> transposed_data_float(transposed_data.begin(), transposed_data.end());
    cudaStreamSynchronize(stream);
|
Comment on lines
+13
to
+15
| // GPU fused version: inverse-mapping sampling + blend entirely in the kernel, reused device | ||
| // buffers, pinned + async copies. Numerically equivalent to launch_paste_back; returns full-frame BGR uint8. | ||
| class PasteBackGPU { |
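The "reused device buffers" idea in this comment comes down to a grow-only allocation policy. The sketch below shows that policy only; `ReusableBuffer` is a hypothetical class, not `PasteBackGPU`, and host memory stands in for device or pinned memory so the example runs without a GPU. A CUDA version would presumably swap the allocation for `cudaMalloc` / `cudaMallocHost` and release with `cudaFree` / `cudaFreeHost`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical grow-only buffer illustrating an ensure_capacity policy.
class ReusableBuffer {
public:
    // Reallocates only when `bytes` exceeds the current capacity; otherwise the
    // existing allocation is returned. Contents are not preserved across a grow.
    std::uint8_t* ensure_capacity(std::size_t bytes) {
        if (bytes > capacity_) {
            data_ = std::make_unique<std::uint8_t[]>(bytes);
            capacity_ = bytes;
            ++allocations_;
        }
        assert(capacity_ >= bytes);  // postcondition: the request always fits
        return data_.get();
    }
    std::size_t capacity() const { return capacity_; }
    int allocations() const { return allocations_; }  // instrumentation only

private:
    std::unique_ptr<std::uint8_t[]> data_;
    std::size_t capacity_ = 0;
    int allocations_ = 0;
};
```

With a fixed frame size the first call allocates and every later call is a pointer return, which removes the per-call allocate/free cost the first commit message identifies as part of the paste_back overhead.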
DefTruth reviewed May 31, 2026
    ## ⚡ Benchmark 🔥
    <div id="benchmark"></div>

    GPU-inference optimization log. For each algorithm we profile it with a built-in, backend-agnostic harness ([`lite/bench/profiler.h`](https://github.com/xlite-dev/lite.ai.toolkit/blob/main/lite/bench/profiler.h)), then move the CPU pre/post-processing (affine warp, color convert, normalize, tensor layout, paste-back, NMS …) into **fused CUDA kernels** with reused device buffers and pinned + async copies, so the algorithm spends its time on real inference instead of host glue and `cudaMalloc`/sync round-trips. All numbers are **RTX 4090 · TensorRT 10.1 · CUDA 12.4**, median (p50), compute-only, reproducible via the `lite_*_bench` binaries.
For the benchmark section, I recommend using the latest TensorRT and CUDA 13+.
DefTruth approved these changes Jun 1, 2026
This pull request introduces a GPU benchmarking framework and optimizes the face restoration pipeline (GFPGAN) in the lite.ai.toolkit project. It adds a unified, backend-agnostic benchmarking utility and a dedicated face restoration benchmark executable, and refactors the core restoration code to enable per-stage timing and speedups via CUDA kernel fusion and buffer reuse.

Benchmarking and Profiling Infrastructure:
- Added a header-only, backend-agnostic benchmarking utility (`profiler.h`) that supports CPU and GPU timing, aggregates statistics (mean, percentiles, min/max), and can export results as aligned tables or CSV files. It provides macros for instrumenting code and is designed for minimal overhead when disabled.
- Introduced a new benchmark executable (`lite_face_restoration_bench`) for end-to-end profiling of the GFPGAN face restoration stage. This tool checks CPU vs GPU correctness for the paste-back operation, runs warmup and timed iterations, and reports per-stage and total timings, including CSV export.

Face Restoration Pipeline Optimizations:
- Refactored the `TRTFaceFusionFaceRestoration` class to support detailed per-stage profiling and major performance improvements:
  - Added a `restore` method that returns the restored frame (without disk I/O) and accepts an optional profiler for instrumentation.

Documentation and Results:
- Documented the work in `README.md`, summarizing the optimization strategies, measured speedups (up to 4.4× end-to-end), and per-stage breakdowns, with a reproducible methodology and environment.

Summary of Most Important Changes:

Benchmarking Infrastructure:
- Added a header-only profiler (`lite/bench/profiler.h`) for CPU and GPU timing, stats aggregation, and CSV export.
- Added a benchmark executable (`lite_face_restoration_bench`) and its CMake target.

Face Restoration Pipeline Optimization:
- Refactored `TRTFaceFusionFaceRestoration` to support detailed per-stage profiling, GPU kernel fusion, buffer reuse, and a new `restore` method for compute-only evaluation.

Documentation:
- Updated `README.md`, including performance results and optimization details.