Merge branch core of git@code.alipay.com:pia/linghe.git into main
https://code.alipay.com/pia/linghe/pull_requests/5
* add mxfp8 quant
* revise for colwise
* opt mxfp8 quant
* compatible with triton 3.2.0
* add arg silu for rope
* fix norm
* add fused block permute & unpermute
* add fused permute/unpermute mxfp8 quantize
* refine testcase
* add batch scale & batch clip kernels
* finetune params
* add mla_rope kernel
* add rope interface & use float32 update in gemm
* add quantization interface
* add varlen rope
* add cu_seqlens_kv for mla rope
* add is_contiguous assert for rope
* refine gemm testcase
* use async h2d
* add rope with cp
* refine testcases
* add support for mini v3
* add cp rope in facade
* remove megatron code in test_rope
* use two kernels for block_rms_norm
* refine save_for_backward
* use raw input instead of view in saved_tensor
* fix bench norm bug
* remove ctx.rms in norm
* use parameter instead of parameter.data in op
* add silu arg in rope
* opt inplace impl in ce loss
* refine arg names
* add more barrier in ce loss
* add assert in ce loss
* use meaningful shape for batch quantization
* support transpose in mla
* fix arg num bug in mla rope
* linghe 0.1.0
* add approvers in aci
* support more num_experts in fp32_gemm
* update aci
* update
* refine testcases
* remove is_contiguous assertion in batch norm
* use fp32 dw in multiple kernels
* use 0.1.2 version
* recover test image
* use int64 in clip and scale and fp32 in rope
* add is_contiguous assertion and use rsqrt instead of 1/sqrt
* mxfp8 deepep permute
* mxfp8 unpermute func
* support v3 dim
* assert
* scatter add for v3 funny hidden
* use numerically stable impl for ce
* use float64 in ce
* refine testcases
* use fp32 in silu and split batch silu backward
* fix bug in topk
* revise & refine test
* tune params
* add arg reuse in rope
* adapt for triton 3.2.0
* support tp in rope
* use int64 in batch ops
* use mask instead of max to accelerate clip
* mv mxfp8 silu to linghe
* mv mxfp8 rms to linghe
* gradient fix
* refine args in mxfp8
* use int64 in smooth/mxfp8 batch kernels
* fix transpose in mla
* fix mla bug with bs=1 and strided k_pos_emb
* use padding batch instead of 128 in varlen rope
* do not transpose mla rope output with thd layout
* support more shapes
* refine test.sh
* remove rms none assertion in rms norm
* add embedding and mla
* refine mla impl
* add for hybridep
* add varlen mla
* use 0.2.5
* return zero when grad tensors is empty
* define inf/inf=1
* add numel>0 assert in batch ops
* add experimental op & reduce confusion
* WIP dist ce loss
* refine testcases
* add debug log
* support ignore_index in ce
* add parallel ce
* version 0.2.7
* remove inf_or_nan
* add log for labels
* add barrier in ce backward
* refine testcase
* add tp_group in ce
* use rsqrt instead of 1/sqrt
* use 0.2.8
* refine condition of ce parallel
* support stride for grad of embedding
* use 0.2.9
* return grad for dummy tensor in embedding
* support bf16 in batch ops
* use fast impl for embedding
* add ptx util & revise count zeros
* fix typo in triton_batch_count_zero
* use fast and accurate impl for embedding
* refactor embedding ops
* revise per liuyu's request
* PullRequest: 2 merge fused compute-communication operators
* PullRequest: 3 reformat
* refine for public