
Conversation

@Bernard-Liu

Bwd performance optimization for ROCm.
Caution:
Please note that this PR depends on #4925. #4925 should be merged first; after that, this PR can be merged.

avbokovoy and others added 30 commits October 27, 2025 19:06
warp per row wg change
Aya-ZIbra and others added 29 commits November 11, 2025 06:45
Summary:
Pull Request resolved: pytorch#5072

X-link: https://github.com/facebookresearch/FBGEMM/pull/2078

Changing the QtileSize to 64. I see a good improvement of > 20%.
For correctness this includes changing the TMEM atoms and introducing warp sync for row stats.

Perf:
```
(Batch, SeqLenQ, SeqLenKV, MaxLenKV, HeadQ, HeadKV, HeadD)	cutlass_blackwell_fmha_decode-gbps			Improvement with Qtile = 64
(16, 1, 256, 256, 8, 1, 128)	238.2206209			1.31463193
(16, 1, 512, 512, 8, 1, 128)	410.8838061			1.315872068
(16, 1, 1024, 1024, 8, 1, 128)	660.5696208			1.335567769
(16, 1, 2048, 2048, 8, 1, 128)	916.5460174			1.310093116
(16, 1, 4096, 4096, 8, 1, 128)	1133.690174			1.258896694
(16, 1, 8192, 8192, 8, 1, 128)	1271.341515			1.229311967
(32, 1, 256, 256, 8, 1, 128)	468.9034945			1.295635241
(32, 1, 512, 512, 8, 1, 128)	799.2689835			1.280831124
(32, 1, 1024, 1024, 8, 1, 128)	1285.452285			1.293538886
(32, 1, 2048, 2048, 8, 1, 128)	1797.074701			1.269787171
(32, 1, 4096, 4096, 8, 1, 128)	2210.946865			1.229703361
(32, 1, 8192, 8192, 8, 1, 128)	2498.665399			1.212166122
(64, 1, 256, 256, 8, 1, 128)	893.9747894			1.302172409
(64, 1, 512, 512, 8, 1, 128)	1493.150844			1.274679551
(64, 1, 1024, 1024, 8, 1, 128)	2309.825211			1.220419935
(64, 1, 2048, 2048, 8, 1, 128)	3012.271892			1.159444905
(64, 1, 4096, 4096, 8, 1, 128)	3552.001019			1.089389445
(64, 1, 8192, 8192, 8, 1, 128)	4348.016208			1.131298153
(128, 1, 256, 256, 8, 1, 128)	1549.388365			1.233405251
(128, 1, 512, 512, 8, 1, 128)	2480.52007			1.210676964
(128, 1, 1024, 1024, 8, 1, 128)	3360.125922			1.145674899
(128, 1, 2048, 2048, 8, 1, 128)	4103.461192			1.093136854
(128, 1, 4096, 4096, 8, 1, 128)	4783.429328			1.095583284

```

Reviewed By: jianyuh, v0i0

Differential Revision: D85155388

fbshipit-source-id: ec3e43e2c7b0ce68c8eebc3fac74db6c9b66de07
Summary:
Pull Request resolved: pytorch#5073

X-link: https://github.com/facebookresearch/FBGEMM/pull/2079

Compile-time static/const mapping utilities for:
1. constexpr value -> constexpr value
2. constexpr value -> type

Useful when developing template-heavy cutlass code.

Reviewed By: jianyuh

Differential Revision: D85893168

fbshipit-source-id: 691dbb90e17c88dfc384432908e8ffdb8c0b2a04
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2081

Pull Request resolved: pytorch#5076

D85603930 removed AVX from aarch64 compilation and broke the Sigrid build.

The proposed changes fix the build break.

There is an fbgemm routine without a reference implementation, so we will need to implement a NEON port in a later diff.

For now, four AVX2 files are compiled into the NEON package.

Reviewed By: YifanYuan3

Differential Revision: D85918535

fbshipit-source-id: 3a9892535046a0edc05c8c1fccbb0dac8ca8de35
Summary:
Pull Request resolved: pytorch#5062

X-link: meta-pytorch/torchrec#3490

X-link: https://github.com/facebookresearch/FBGEMM/pull/2070

Previously, KVZCH used the ID_COUNT and MEM_UTIL eviction trigger modes. Both are tricky, and it is hard for model engineers to decide what number to use for the ID count or memory-utilization threshold. Besides that, the eviction start time drifts out of sync across ranks after some time in training, which can cause a large QPS drop during eviction.

This diff adds support for a free-memory eviction trigger. Every N batches, each rank checks how much free memory is left; if free memory is below the threshold, eviction is triggered in all TBEs of all ranks using an all-reduce. In this way, we can align the eviction start time across all ranks.
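
A minimal sketch of this trigger pattern, assuming psutil for the free-memory check and a hypothetical trigger_eviction() entry point on each TBE; the threshold, interval, and names are illustrative rather than the actual FBGEMM API:

```
import psutil
import torch
import torch.distributed as dist

# Illustrative values; the real threshold and interval are configured elsewhere.
FREE_MEM_THRESHOLD_BYTES = 32 * 1024**3
CHECK_EVERY_N_BATCHES = 100


def maybe_trigger_eviction(batch_idx: int, tbes) -> None:
    """Every N batches, check free host memory on this rank; if any rank is
    below the threshold, an all-reduce turns the local flag into a global one,
    so every rank starts eviction at the same time."""
    if batch_idx % CHECK_EVERY_N_BATCHES != 0:
        return
    low_mem = psutil.virtual_memory().available < FREE_MEM_THRESHOLD_BYTES
    flag = torch.tensor([1 if low_mem else 0], dtype=torch.int32)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)  # any rank low on memory -> evict on all ranks
    if flag.item() == 1:
        for tbe in tbes:
            tbe.trigger_eviction()  # hypothetical eviction entry point on each TBE
```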

Reviewed By: emlin

Differential Revision: D85604160

fbshipit-source-id: 177ec779960a4ac9bfc3d41f38beeb7e56665db8
Summary:
Pull Request resolved: pytorch#5075

X-link: https://github.com/facebookresearch/FBGEMM/pull/2080

This diff generalizes the work in D85155388, based on Gefei's diff D85631781.

Compared to D85631781, we avoid register warp shuffling by using 32b TMEM atoms.

This diff supports:
1. Different dtypes (fp8, bf16)
2. Different mtiles (128, 64)

Reviewed By: v0i0

Differential Revision: D85893883

fbshipit-source-id: 25e93e627c573a120ab46336d3f234064c5ae066
…ch#5077)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2082

Pull Request resolved: pytorch#5077

This change selects the `hash_zch_identities` that correspond to the unique indices during TBE prefetch. This is specifically required for MPZCH tables, which need both the slot index and the corresponding identities for correct lookup behavior. Without the identities, the inference side cannot verify that it is using the correct slot, leading to potential lookup errors.
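
A hedged PyTorch illustration of the selection (not the actual prefetch code); `indices` and `hash_zch_identities` are assumed to be position-aligned tensors, and the example values are made up:

```
import torch

# Illustration only; not the actual TBE prefetch code.
indices = torch.tensor([7, 3, 7, 9, 3])
hash_zch_identities = torch.tensor([70, 30, 70, 90, 30])  # aligned 1:1 with `indices`

unique_indices, inverse = torch.unique(indices, return_inverse=True)

# For each unique slot, record the position of one of its occurrences
# (identities for the same index are identical, so any occurrence works).
slot_to_pos = torch.empty(unique_indices.numel(), dtype=torch.long)
slot_to_pos.scatter_(0, inverse, torch.arange(indices.numel()))
unique_identities = hash_zch_identities[slot_to_pos]

print(unique_indices)     # tensor([3, 7, 9])
print(unique_identities)  # tensor([30, 70, 90])
```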

Reviewed By: chouxi

Differential Revision: D85999577

fbshipit-source-id: 3c8a4add1dd112e9a746b334e7046bb442ea977b
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2086

While `not` is a valid C++ keyword, MSVC issues the following warning when it appears in a preprocessor directive:
```
C:\actions-runner\_work\pytorch\pytorch\third_party\fbgemm\include\fbgemm\./FloatConversion.h(292): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```

Pull Request resolved: pytorch#5025

Reviewed By: spcyppt

Differential Revision: D86135907

Pulled By: q10

fbshipit-source-id: 3d55410aa1f6f4f1a4511d2881d1b0ba05ea5c5a
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2088

- Remove Python 3.9 support, following PyTorch nightlies

Pull Request resolved: pytorch#5081

Reviewed By: spcyppt

Differential Revision: D86168579

Pulled By: q10

fbshipit-source-id: f15a5107ab9f86c7c07e704f510faab312bac858
…h#5080)

Summary:
Pull Request resolved: pytorch#5080

X-link: https://github.com/facebookresearch/FBGEMM/pull/2087

This PR introduces an optimization for the `group_index_select_or_add_2d_kernel` (`USE_INDEX_SELECT==true`) kernel, with a primary focus on the `float` type and relatively small embedding dimensions. Two things are implemented:
1) Extract the common variables out of the loop to omit unnecessary synchronizations on memory loads (the compiler won't do that automatically).
2) Switch to a logical wave size of 32 threads to reduce granularity losses.

Pull Request resolved: pytorch#5078

Reviewed By: spcyppt, haoyuz

Differential Revision: D86135611

Pulled By: q10

fbshipit-source-id: f4fb9966f5f5180c4dde2aed92ca726c260b7743
…ytorch#5083)

Summary:
Pull Request resolved: pytorch#5083

X-link: https://github.com/facebookresearch/FBGEMM/pull/2089

When running benchmarks with a large number of copies, the process may raise:
 OSError: [Errno 24] Too many open files.

Example command:
(fbgemm_gpu_env)$ ulimit -n 1048576
(fbgemm_gpu_env)$ python ./bench/tbe/tbe_inference_benchmark.py nbit-cpu \
    --num-embeddings=40000000 --bag-size=2 --embedding-dim=96 \
    --batch-size=162 --num-tables=8 --weights-precision=int4 \
    --output-dtype=fp32 --copies=96 --iters=30000

PyTorch multiprocessing provides two shared-memory strategies:
1. file_descriptor (default)
2. file_system

The default file_descriptor strategy uses file descriptors as shared-memory handles, which can result in a large number of open FDs when many tensors are shared.
If the total number of open FDs exceeds the system limit and the limit cannot be raised, the file_system strategy should be used instead.

This patch allows switching to the file_system strategy by setting:
  export PYTORCH_SHARE_STRATEGY='file_system'

Reference:
https://pytorch.org/docs/stable/multiprocessing.html#sharing-strategies
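
A minimal sketch of such a switch, using the env var name from this summary and the public torch.multiprocessing API:

```
import os

import torch.multiprocessing as mp

# Switch to the file_system strategy only when requested via the environment.
if os.environ.get("PYTORCH_SHARE_STRATEGY") == "file_system":
    # file_system uses file names instead of file descriptors as shared-memory
    # handles, so it avoids the per-process FD limit (at the cost of leaving
    # files in /dev/shm until they are unlinked).
    mp.set_sharing_strategy("file_system")

print(mp.get_sharing_strategy())
```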

Pull Request resolved: pytorch#5037

Reviewed By: spcyppt

Differential Revision: D86135817

Pulled By: q10

fbshipit-source-id: 15f6fe7e1de5e9fef828f5a1496dc1cf9b41c293
Summary:
Pull Request resolved: pytorch#5085

X-link: https://github.com/facebookresearch/FBGEMM/pull/2093

As the title says, in silvertorch bulk eval, eval() is not called on the module; the run is wrapped in torch.no_grad() instead. https://www.internalfb.com/code/fbsource/[324dbccd0ab0]/fbcode/dper_lib/silvertorch/core/publish/data_processing/bulk_eval_dmp_gpu.py?lines=1057 So we add an eval mode that turns self.training to False in TBE for bulk eval.
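
A toy sketch of the issue and the kind of switch described, using a stand-in module rather than the actual TBE class:

```
import torch
import torch.nn as nn


class ToyTBE(nn.Module):
    """Stand-in for the TBE module; training-only work is gated on self.training."""

    def forward(self, x):
        if self.training:
            pass  # e.g. optimizer-state / cache bookkeeping that should not run in bulk eval
        return x


tbe = ToyTBE()

# Bulk eval wraps the run in no_grad() but never calls .eval(), so
# self.training stays True and the training-only branch still executes:
with torch.no_grad():
    tbe(torch.randn(2, 3))
    print(tbe.training)  # True

# The change described above adds an explicit eval-mode switch that flips
# self.training to False for bulk eval (shown here with the standard .eval()):
tbe.eval()
print(tbe.training)  # False
```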

Reviewed By: emlin

Differential Revision: D86220286

fbshipit-source-id: 9a48c7b4dc09767c99a545d1f25e53bf4265079f
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2097

- Fix test reliability with table order

Pull Request resolved: pytorch#5087

Reviewed By: spcyppt

Differential Revision: D86242426

Pulled By: q10

fbshipit-source-id: 4ec307ff8fd9151bddb6bf7354bfe06f67a1fa0b
…#5089)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2098

Pull Request resolved: pytorch#5089

Adding a NEON translation of FloatOrHalfToFused8BitRowwiseQuantizedSBFloat, which is used by Ads.

Performance improves by an order of magnitude:

Before:

    bit_rate,   rows,  cols,  elems_per_usec,  GB/Sec
       8,   100,     16,          378.68,       1.51
       8,   100,     64,          286.91,       1.15
       8,   100,    128,          262.06,       1.05
       8,   100,    256,          251.34,       1.01
       8,   100,    512,          244.92,       0.98
       8,   100,   1024,          237.35,       0.95
       8,   100,   2048,          230.83,       0.92
       8,   120,     16,          378.70,       1.51
       8,   120,     64,          286.72,       1.15
       8,   120,    128,          263.40,       1.05
       8,   120,    256,          251.58,       1.01
       8,   120,    512,          245.30,       0.98
       8,   120,   1024,          238.17,       0.95
       8,   120,   2048,          230.69,       0.92
       8,  1000,     16,          392.85,       1.57
       8,  1000,     64,          294.35,       1.18
       8,  1000,    128,          264.35,       1.06
       8,  1000,    256,          252.13,       1.01
       8,  1000,    512,          245.50,       0.98
       8,  1000,   1024,          241.61,       0.97
       8,  1000,   2048,          231.39,       0.93

After:

    bit_rate,   rows,  cols,  elems_per_usec,  GB/Sec
       8,   100,     16,         1855.59,       7.42
       8,   100,     64,         2615.43,      10.46
       8,   100,    128,         3134.34,      12.54
       8,   100,    256,         2610.72,      10.44
       8,   100,    512,         3065.20,      12.26
       8,   100,   1024,         3535.29,      14.14
       8,   100,   2048,         3757.66,      15.03
       8,   120,     16,         1991.94,       7.97
       8,   120,     64,         2971.25,      11.89
       8,   120,    128,         3403.37,      13.61
       8,   120,    256,         2750.87,      11.00
       8,   120,    512,         3272.63,      13.09
       8,   120,   1024,         3618.98,      14.48
       8,   120,   2048,         3848.59,      15.39
       8,  1000,     16,         2329.11,       9.32
       8,  1000,     64,         3068.76,      12.28
       8,  1000,    128,         3678.86,      14.72
       8,  1000,    256,         4440.37,      17.76
       8,  1000,    512,         4558.70,      18.23
       8,  1000,   1024,         4620.94,      18.48
       8,  1000,   2048,         3898.84,      15.60

Reviewed By: mcfi

Differential Revision: D86236406

fbshipit-source-id: 12c20cbdbbc9b0674ccca8e1aa598b7de144dea9
Summary:
Pull Request resolved: pytorch#5091

X-link: https://github.com/facebookresearch/FBGEMM/pull/2099

In this test, we run the following steps:
1. Create a DramKVInferenceEmbedding with TTL eviction of 1 minute
2. Insert 1 embedding with a timestamp of the current Unix time minus 2 minutes (i.e., it is already expired)
3. Read it back and check correctness
4. Read it multiple times
5. Evict it
6. Read it again; this time the result should be inconsistent

Reviewed By: emlin

Differential Revision: D86268606

fbshipit-source-id: edc2dc24e5327399421d20229a0b1af2ca29ea7a
Summary:
Pull Request resolved: pytorch#5093

X-link: https://github.com/facebookresearch/FBGEMM/pull/2100

----

# Context on the changes:

Currently, Torchrec merges the outputs of individual VBE TBE ops so they are ordered by rank, using [_merge_variable_batch_embeddings](https://www.internalfb.com/code/fbsource/[3bd69d7fa3534144dcb0162ca59803a6c3ff6e70]/fbcode/torchrec/distributed/embedding_lookup.py?lines=593-604). This function seems to cause a ~30% QPS regression compared to the baseline (HBM+UVM) for the Jupiter V1 model with VBE enabled.

To get rid of the _merge_variable_batch_embeddings() function, we pre-allocate the `vbe_output` tensor, which holds the outputs from all VBE ops, and calculate `vbe_output_offsets` to allow each individual VBE op to write to the correct location in the `vbe_output` tensor.

By default, `vbe_output` and `vbe_output_offsets` are `None`, which means the VBE ops return individual tensors the way they currently do. The feature is enabled when `vbe_output` and `vbe_output_offsets` are not `None`.
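
A hedged illustration of the pre-allocation idea; the sizes, offsets, and variable names below are hypothetical, not the actual TBE/Torchrec call signature:

```
import torch

# Hypothetical sizes/offsets for three VBE ops.
vbe_output_size = 1024
vbe_output = torch.empty(vbe_output_size, dtype=torch.float32)

# Each op writes its slice into the shared buffer at its precomputed offset,
# so no post-hoc merge (re-ordering by rank) is needed.
vbe_output_offsets = [0, 256, 640]
op_outputs = [torch.randn(256), torch.randn(384), torch.randn(384)]

for offset, out in zip(vbe_output_offsets, op_outputs):
    vbe_output[offset:offset + out.numel()].copy_(out.flatten())
```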

 ---
**NOTE**
1. This feature is currently supported for Sparse TBE.
2. The support is limited to CUDA.
3. For backward compatibility, we append the newly introduced `vbe_output` to the existing API. Hence, we need to make the `vbe_output` tensor `optional` with a default value of `None` (there is no default value for Tensor).
4. We *cannot* annotate `vbe_output` because PyTorch registration does not support annotation of optional tensors. Adding an annotation incurs the error below. This may cause issues supporting this on MTIA, if MTIA relies on tensor annotations.
```
E0903 09:50:32.966235 2850885 ExceptionTracer.cpp:227] exception stack complete
terminate called after throwing an instance of 'std::runtime_error'
  what():  expected ident but found '(' here:
split_embedding_codegen_lookup_adagrad_function_pt2(    Tensor placeholder_autograd_tensor,     Tensor[](a!) weights,     Tensor D_offsets,     SymInt total_D,     SymInt max_D,     Tensor hash_size_cumsum,     int total_hash_size_bits,     Tensor indices,     Tensor offsets,     int pooling_mode,     Tensor? indice_weights,     Tensor? feature_requires_grad,     int output_dtype,     Tensor?[](e!) aux_tensor,     int[] aux_int,     float[] aux_float,     bool[] aux_bool,     Tensor[](g!) momentum1, Tensor learning_rate_tensor, float[] optim_float,     SymInt max_B=-1,     SymInt max_B_feature_rank=-1,     SymInt vbe_output_size=-1,     Tensor?(t!) vbe_output=None ) -> Tensor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ~ <--- HERE
```

See https://docs.google.com/document/d/1h5YyeCjYmmN-CIFB98CrBf1uMksidPbNvM1rl8yZeds/edit?tab=t.0#heading=h.tdfkkc6ujdyl

----

This diff is a reland of D79704318 in which all issues have been addressed.

## 1) pyper validation test

D79704318 was reverted as it broke the pyper validation test (a frontend/backend package compatibility issue), which blocks pyper releases. The issue is addressed in this diff.

Context: In pyper, python changes are included in the frontend package (e.g., ads_dper3) and C++ changes in the backend package (e.g., training_platform). If a diff contains both python and C++, there is a chance that some model will use mismatching packages. In other words, the frontend package does not include the diff but the backend does, or vice versa.

D83881544 only enables backend support (i.e., no one can actually use this feature yet, so TBE VBE will work as usual). Due to the new Unified API changes, we need to pipeline the optional tensor from the frontend, which requires a python change.

Denote
- #0 as no D83881544 included
- #1 as D83881544 included

There are 4 scenarios:

(1) frontend #0 + old backend #0 - no issue
(2) frontend #1 + backend #1 - no issue
(3) frontend #0 + backend #1 - handled; TBE VBE will work normally.
(4) frontend #1 + backend #0 - no issue; the diff added warning that backend is old

There's another diff, D79869613, in the stack that will enable frontend support (i.e., allow users to use this feature), which will go into the __frontend package only__.
Now, scenarios (1)-(4) remain the same, but new scenarios occur.

Denote
- #2 as D79869613 included

(5) frontend #2 + backend #1 - no issue, same as (2).
(6) frontend #2 (no feature enabled) + backend #0 - same as (4).
(7) frontend #2 (feature enabled) + backend #0 - **assertion error due to no backend support**, to prevent silent wrong behavior.

**To use the feature, this diff stack (D83881544 and D79869613) needs to be included in both the frontend and backend packages.**

## 2) SEV

D79704318 caused a SEV due to a TBE v1/v2 interface compatibility issue on lex_ig_o3_package. Unit tests to ensure v1 compatibility were added in D83020965.

D83881544 passes the v1 compatibility test.

Details on the root cause and fix:
https://docs.google.com/document/d/1XcYNfyiAn4aRFvjV0QG5aLiWKuuOWtJdLOMKNszZRpI/edit?tab=t.0#heading=h.psr4a2qn0mdk

------

Reviewed By: q10, renganxu

Differential Revision: D83881544

fbshipit-source-id: 5d63841bbf79a72219903e9d0f77ee3b998bc105
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2095

Optimizations to the embedding forward pass for MI350:
1. Apply vec4 in the embedding VBE forward kernel instead of vec2.
2. Since a wavefront has 64 threads on ROCm, optimize the subwarp in the embedding forward v2 kernel when the embedding dim is between 32 and 64.

Pull Request resolved: pytorch#5064

Reviewed By: q10

Differential Revision: D85701691

Pulled By: spcyppt

fbshipit-source-id: 72f491414f50e53038a4b02f3d555967d34740a7
Summary:
Pull Request resolved: pytorch#5086

X-link: https://github.com/facebookresearch/FBGEMM/pull/2094

For lengths per shard exceeding 2^31, we avoid an overflow that would result in undefined behavior.
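
A minimal PyTorch illustration of the overflow pattern being guarded against (not the FBGEMM kernel itself): prefix-summing int32 lengths can exceed 2^31, so the accumulation is widened to int64 first:

```
import torch

# Illustration only: per-shard lengths whose running sum exceeds 2^31.
lengths = torch.full((3,), 1_000_000_000, dtype=torch.int32)

# Accumulating in int32 wraps past 2**31 (and is undefined behavior in the C++ kernel):
bad_offsets = torch.cumsum(lengths, dim=0, dtype=torch.int32)

# Widening to int64 before the prefix sum keeps the offsets correct:
good_offsets = torch.cumsum(lengths.to(torch.int64), dim=0)

print(bad_offsets)   # third entry has wrapped around
print(good_offsets)  # tensor([1000000000, 2000000000, 3000000000])
```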

Reviewed By: spcyppt

Differential Revision: D86209662

fbshipit-source-id: 6d51290f3436629571677091c42b76b6f98e5790
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2102

Pull Request resolved: pytorch#5094

see D86119952

Reviewed By: htyu

Differential Revision: D86319606

fbshipit-source-id: bdf841f0936f1be53b7a07e66b6a64e9e2aaef12
Summary:
Pull Request resolved: pytorch#5031

X-link: meta-pytorch/torchrec#3475

X-link: https://github.com/facebookresearch/FBGEMM/pull/2044

Enable feature score auto collection for EBC in a similar way to EC. The configuration is no different in the embedding table config:

              virtual_table_eviction_policy=FeatureScoreBasedEvictionPolicy(
                  training_id_eviction_trigger_count=260_000_000,  # 260M
                  training_id_keep_count=160_000_000,  # 160M
                  enable_auto_feature_score_collection=True,
                  feature_score_mapping={
                      "sparse_public_original_content_creator": 1.0,
                  },
                  feature_score_default_value=0.5,
              ),

Reviewed By: EddyLXJ

Differential Revision: D85017179

fbshipit-source-id: 3d62f8adbe201d6e30c445aaed88710bbbcd6557
@Bernard-Liu force-pushed the aiter/mi350_kernel_opt branch from bca8c27 to 1f82f3b on November 11, 2025 06:56