[feature] Non-zero gamma support in ChoppedTransferCompound #764
Zhaoxian-Wu wants to merge 5 commits into
4f9519d to 29c7e11
Hey @Zhaoxian-Wu, can you please address the lint errors and update the PR with a new commit so we can check that everything is right? Thanks! Check out the errors here: https://github.com/IBM/aihwkit/actions/runs/23523015289/job/69153947890?pr=764
Hello @maljoras @maljoras-sony, can you take a look and help us here?
29c7e11 to b64a075
Thanks for the reminder! I have already fixed the lint and style errors in the two PRs. Feel free to let me know of any other improvements.
@Zhaoxian-Wu looks like a great addition, many thanks. @PabloCarmona, let me find some time over the weekend to take a closer look.
Hi @PabloCarmona, I noticed that the CI lint check is failing with the following mypy errors in src/aihwkit/simulator/tiles/periphery.py:
  src/aihwkit/simulator/tiles/periphery.py:983: error: Expected iterable as variadic argument [misc]
  src/aihwkit/simulator/tiles/periphery.py:1009: error: Expected iterable as variadic argument [misc]
  src/aihwkit/simulator/tiles/periphery.py:1011: error: Expected iterable as variadic argument [misc]
Weirdly, these errors seem to exist on master as well; they are not introduced by this PR. Could you run the following on your end to confirm?
  mypy --show-error-codes src/
Thanks @Zhaoxian-Wu, I will take a closer look and fix it in master. I'll let you know when I finish. In the meantime, let's also give @maljoras-sony time to look at the PR and review. Thanks again to both!
Hello @Zhaoxian-Wu! Please update this branch with the latest commits on master so we can check that everything runs OK on the CI/CD side, since I fixed the problem with the linting. Thanks!
Hey @maljoras-sony, did you have a chance to look at it? Thanks again!
maljoras left a comment
Hi @Zhaoxian-Wu,
many thanks. Looks like a great addition.
@Zhaoxian-Wu, can you sync this branch with master so we can retrigger the actions and check that everything passes? The lint errors are already addressed on master, which is why this branch needs to be synced.
b64a075 to d16dcdb
Sorry about that, @Zhaoxian-Wu, but since I saw these linting errors coming up, I addressed them and merged the fix on master. Could you sync with master one more time? Thanks, and sorry for the inconvenience.
d16dcdb to 448ed51
Signed-off-by: Zhaoxian Wu <wuzhaoxian97@gmail.com>
Bug fixes:
- CPU TransferRPUDevice::getPulseCountLearningRate now honours
scale_fast_lr (was always returning raw fast_lr)
- CPU and CUDA ChoppedTransferRPUDevice::getPulseCountLearningRate
now applies scale_fast_lr in the auto_scale branch (was missing)
- Remove duplicate auto_momentum line in printToStream
Feature: reduceToWeights for non-zero gamma
- Add ChoppedTransferRPUDevice::reduceToWeights (CPU) and
ChoppedTransferRPUDeviceCuda::reduceToWeights (CUDA) that apply
per-element chopper correction when gamma != 0:
W[i,j] += gamma * (c_d[i]*c_x[j] - 1) * A_stored[i,j]
Enables residual-learning configurations with ChoppedTransfer.
- New CUDA kernel: kernelApplyChopperCorrectionToWeights
Cleanup:
- Remove partial buffer_as_momentum field and its CUDA kernel
- Expose scale_fast_lr to Python TransferCompound (default True;
ChoppedTransferCompound keeps its existing default False)
Docs:
- Rewrite ChoppedTransferCompound docstring: corrected
base_buffer_granularity / final_fast_lr / final_transfer_lr
formulas and full numbered recursion pseudocode
- analog_update.rst: add TTv2 / TTv3 / TTv4 / RL-v2 sections,
residual-learning and bit-slicing discussion
- using_simulator.rst: add per-algorithm subsections (TTv2, TTv3,
TTv4) and gamma residual-learning explanation with code examples
- paper_references.rst: add references [15]-[19]
Signed-off-by: Zhaoxian Wu <wuzhaoxian97@gmail.com>
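The per-element correction given in the commit message above can be illustrated with a small pure-Python sketch. This is not the C++/CUDA implementation from the PR: the function name reduce_to_weights_sketch, the list-of-lists layout, and the example values are assumptions made for illustration, and the chopper entries are assumed to be +1 or -1.

```python
def reduce_to_weights_sketch(a_stored, c, c_d, c_x, gamma):
    """Return gamma * A_stored + C with the chopper correction applied.

    a_stored, c: row-major lists of lists (rows indexed by i, columns by j).
    c_d, c_x: chopper signs (+1 or -1) per row and per column (assumed).
    """
    rows, cols = len(c_d), len(c_x)
    # Step 1: the uncorrected reduction, W = gamma * A_stored + C.
    w = [[gamma * a_stored[i][j] + c[i][j] for j in range(cols)]
         for i in range(rows)]
    if gamma == 0.0:
        return w  # nothing to correct for the default gamma
    # Step 2: add only the missing part,
    # gamma * (c_d[i] * c_x[j] - 1) * A_stored[i][j].
    for i in range(rows):
        for j in range(cols):
            w[i][j] += gamma * (c_d[i] * c_x[j] - 1) * a_stored[i][j]
    return w


# Illustrative values (assumed, not taken from the PR).
a_true = [[0.5, -0.25], [0.125, 1.0]]
c = [[1.0, 2.0], [3.0, 4.0]]
c_d, c_x = [1, -1], [-1, 1]
# "Chopped" storage: A_stored[i][j] = c_d[i] * c_x[j] * A_true[i][j].
a_stored = [[c_d[i] * c_x[j] * a_true[i][j] for j in range(2)]
            for i in range(2)]

w = reduce_to_weights_sketch(a_stored, c, c_d, c_x, gamma=0.5)
expected = [[0.5 * a_true[i][j] + c[i][j] for j in range(2)]
            for i in range(2)]
print(w == expected)
```

Under the +1/-1 assumption the chopper product squares to 1, so the corrected result coincides with gamma * A_true + C in this example.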
448ed51 to ad7a737
Non-zero gamma support, scale_fast_lr, and documentation

Overview
This PR makes two improvements to ChoppedTransferRPUDevice / ChoppedTransferCompound:
- Non-zero gamma support: ChoppedTransferCompound can now be used as a residual-learning device where the fast array A contributes directly to the effective weight, with correct chopper de-correlation applied during weight reduction.
- scale_fast_lr parameter: a new parameter, analogous to the existing scale_transfer_lr, that controls whether the fast-device LR tracks the current optimizer LR.
An explanation of gamma is added to the documentation. To provide sufficient context, this PR also expands the algorithm documentation to cover TTv1 through TTv4.

1. Non-zero gamma support in ChoppedTransferRPUDevice
Previously, checkSupported() enforced fullyHidden(), which hard-blocked any configuration where the fast array A contributes to the visible weight (gamma != 0). This restriction is lifted, and correct behaviour is implemented via a reduceToWeights override (CPU + CUDA).
Background: A is updated with per-element chopper sign flips and is therefore stored in "chopped" form: A_stored[i,j] ≈ c_d[i]·c_x[j]·A_true[i,j]. The base-class weight-reduction GEMV computes W = gamma·A_stored + C, which is incorrect when gamma != 0 because the chopper factors are not cancelled. The new override applies a correction after the GEMV:
  W[i,j] += gamma * (c_d[i]*c_x[j] - 1) * A_stored[i,j]
The - 1 term accounts for the fact that the base GEMV already contributed gamma * A_stored; the correction adds only the remaining gamma * (c_d[i]*c_x[j] - 1) * A_stored to reach the correct gamma * (c_d[i]*c_x[j]) * A_stored. On CUDA this is implemented as the new kernelApplyChopperCorrectionToWeights kernel; on CPU it is a simple loop. Both paths are no-ops when gamma == 0 (the default).
Motivation: Non-zero gamma implements the residual learning mechanism described in Wu et al. (2025) [15] and Li et al. [19]: A acts as a residual correction on top of C, compensating for C's quantisation errors and device non-idealities cycle-by-cycle, while C accumulates the long-term gradient signal via discrete transfer pulses. The two-array decomposition W = gamma·A + C also enables bit-slicing (precision enhancement): A can represent finer-grained updates than C's native conductance step, reducing the effective weight granularity without modifying the underlying analog device.
Files: src/rpucuda/rpu_chopped_transfer_device.{cpp,h}, src/rpucuda/cuda/rpucuda_chopped_transfer_device.{cu,h}

2. scale_fast_lr parameter
scale_fast_lr is introduced as the analogue of the existing scale_transfer_lr: just as scale_transfer_lr controls whether transfer_lr is multiplied by the current optimizer LR, scale_fast_lr controls the same behaviour for fast_lr.
The parameter is added to TransferRPUDeviceMetaParameter (C++ base struct, default True) and exposed to the Python bindings and to the TransferCompound dataclass. ChoppedTransferCompound overrides the default to False, consistent with the existing convention for that device class.
The corresponding logic is implemented in:
- TransferRPUDevice<T>::getPulseCountLearningRate (CPU)
- TransferRPUDeviceCuda<T>::getPulseCountLearningRate (CUDA)
- ChoppedTransferRPUDevice[Cuda]<T>::getPulseCountLearningRate, auto_scale branch (CPU + CUDA)
Files: src/rpucuda/rpu_transfer_device.{cpp,h}, src/rpucuda/cuda/rpucuda_transfer_device.cu, src/rpucuda/rpu_chopped_transfer_device.cpp, src/rpucuda/cuda/rpucuda_chopped_transfer_device.cu, src/aihwkit/simulator/rpu_base_src/rpu_base_devices.cpp, src/aihwkit/simulator/configs/compounds.py

3. Documentation
compounds.py: ChoppedTransferCompound docstring
A detailed pseudocode block is added to the ChoppedTransferCompound docstring to improve readability and serve as the authoritative reference for the internal LR-scaling logic. The block covers:
- base_buffer_granularity: threshold calculation from buffer_granularity, dw_min_A, and the optional auto_granularity period scaling
- final_fast_lr: derivation from fast_lr, scale_fast_lr, the fast_lr=0 fallback, and the auto_scale formula (base_fast_lr * desired_BL * dw_min_A / (x_max * d_max))
- final_transfer_lr: both the default and correct_gradient_magnitudes branches
- Full numbered recursion: the visible weight (W = gamma·A + C), chopper application, H accumulation, threshold test, pulse dispatch (C += n_steps·dw_min_C), and the forget_buffer / momentum interaction
docs/source/analog_update.rst
The algorithm overview is extended from three methods (Plain SGD, Mixed Precision, Tiki-taka) to the full TTv1-TTv4 family. New sections:
- TTv2, TTv3, TTv4, and RL-v2
- Residual learning and bit-slicing (gamma, Wu et al. [15], Li et al. [18])
docs/source/using_simulator.rst
- Per-algorithm subsections (TTv2, TTv3, TTv4) for the BufferedTransferCompound, ChoppedTransferCompound, and DynamicTransferCompound entries
- Residual-learning explanation of gamma, with code examples
docs/source/paper_references.rst
Five new references added: [15]-[19].
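The final_fast_lr derivation listed for the docstring can be sketched roughly as follows. This is only an illustration of the description in this PR, not the code in getPulseCountLearningRate. The function name final_fast_lr_sketch is hypothetical, treating fast_lr=0 as "fall back to the optimizer LR" is an assumption, and x_max and d_max are assumed to be positive scale values supplied by the caller.

```python
def final_fast_lr_sketch(fast_lr, optimizer_lr, scale_fast_lr,
                         auto_scale=False, desired_bl=1.0, dw_min_a=1.0,
                         x_max=1.0, d_max=1.0):
    """Illustrative fast-device LR, following the PR description."""
    if fast_lr == 0.0:
        # Assumption: fast_lr == 0 falls back to the optimizer LR.
        base_fast_lr = optimizer_lr
    elif scale_fast_lr:
        # fast_lr is multiplied by the current optimizer LR.
        base_fast_lr = fast_lr * optimizer_lr
    else:
        # fast_lr is used as-is and ignores the optimizer LR.
        base_fast_lr = fast_lr

    if not auto_scale:
        return base_fast_lr

    denom = x_max * d_max
    if denom <= 0.0:
        raise ValueError("x_max * d_max must be positive for auto_scale")
    # auto_scale formula quoted in the PR description.
    return base_fast_lr * desired_bl * dw_min_a / denom


# With scale_fast_lr=True the result follows the optimizer LR;
# with scale_fast_lr=False it stays at fast_lr.
for optimizer_lr in (0.1, 0.05):
    print(optimizer_lr,
          final_fast_lr_sketch(0.5, optimizer_lr, True),
          final_fast_lr_sketch(0.5, optimizer_lr, False))
```

The sketch applies scale_fast_lr before the auto_scale formula, which matches the statement that the auto_scale branch now honours scale_fast_lr; the exact order of operations in the C++ code is not shown in this PR text.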
Testing
- Non-zero gamma: set gamma=0.1 in a ChoppedTransferCompound config; verify that the tile's visible weight equals gamma·chop_corrected_A + C rather than gamma·A_stored + C.
- scale_fast_lr: train with fast_lr > 0, scale_fast_lr=True, and an LR scheduler; verify that the effective pulse-count LR tracks the optimizer LR on both CPU and CUDA, including with auto_scale=True.
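For the scale_fast_lr check, one possible way to phrase "tracks the optimizer LR" is that the ratio of effective LR to optimizer LR stays constant while the scheduler changes the optimizer LR. The helper below is a hypothetical sketch of that comparison only. How the effective pulse-count LR is read out of a tile is left open, and the sample values are made up for illustration.

```python
import math


def tracks_optimizer_lr(samples, rel_tol=1e-6):
    """Check that effective_lr / optimizer_lr is constant across samples.

    samples: iterable of (optimizer_lr, effective_lr) pairs recorded during
    training; optimizer_lr must be non-zero and should actually change
    between samples, otherwise the check is vacuous.
    """
    samples = list(samples)
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ref_optimizer, ref_effective = samples[0]
    ref_ratio = ref_effective / ref_optimizer
    return all(
        math.isclose(effective / optimizer, ref_ratio, rel_tol=rel_tol)
        for optimizer, effective in samples[1:]
    )


# Made-up schedule: the optimizer LR is halved every two epochs.
fast_lr = 0.2
schedule = [0.1 * 0.5 ** (epoch // 2) for epoch in range(6)]

scaled = [(lr, fast_lr * lr) for lr in schedule]   # scale_fast_lr=True case
constant = [(lr, fast_lr) for lr in schedule]      # scale_fast_lr=False case

print(tracks_optimizer_lr(scaled), tracks_optimizer_lr(constant))
```

Comparing ratios rather than absolute values keeps the check usable when an additional constant factor is applied, but it would need adjusting if that factor changes from sample to sample.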