
Conversation

@Bernard-Liu

Bwd performance optimization for ROCm.
Caution:
Please note that this PR depends on #4925. #4925 should be merged first; after that, this PR can be merged.

avbokovoy and others added 30 commits October 27, 2025 19:06
warp per row wg change
Aya-ZIbra and others added 29 commits November 11, 2025 06:45
Summary:
Pull Request resolved: pytorch#5072

X-link: https://github.com/facebookresearch/FBGEMM/pull/2078

Changing the QtileSize to 64. I see a good improvement of > 20%.
For correctness this includes changing the TMEM atoms and introducing warp sync for row stats.

Perf:
```
(Batch, SeqLenQ, SeqLenKV, MaxLenKV, HeadQ, HeadKV, HeadD)	cutlass_blackwell_fmha_decode-gbps			Improvement with Qtile = 64
(16, 1, 256, 256, 8, 1, 128)	238.2206209			1.31463193
(16, 1, 512, 512, 8, 1, 128)	410.8838061			1.315872068
(16, 1, 1024, 1024, 8, 1, 128)	660.5696208			1.335567769
(16, 1, 2048, 2048, 8, 1, 128)	916.5460174			1.310093116
(16, 1, 4096, 4096, 8, 1, 128)	1133.690174			1.258896694
(16, 1, 8192, 8192, 8, 1, 128)	1271.341515			1.229311967
(32, 1, 256, 256, 8, 1, 128)	468.9034945			1.295635241
(32, 1, 512, 512, 8, 1, 128)	799.2689835			1.280831124
(32, 1, 1024, 1024, 8, 1, 128)	1285.452285			1.293538886
(32, 1, 2048, 2048, 8, 1, 128)	1797.074701			1.269787171
(32, 1, 4096, 4096, 8, 1, 128)	2210.946865			1.229703361
(32, 1, 8192, 8192, 8, 1, 128)	2498.665399			1.212166122
(64, 1, 256, 256, 8, 1, 128)	893.9747894			1.302172409
(64, 1, 512, 512, 8, 1, 128)	1493.150844			1.274679551
(64, 1, 1024, 1024, 8, 1, 128)	2309.825211			1.220419935
(64, 1, 2048, 2048, 8, 1, 128)	3012.271892			1.159444905
(64, 1, 4096, 4096, 8, 1, 128)	3552.001019			1.089389445
(64, 1, 8192, 8192, 8, 1, 128)	4348.016208			1.131298153
(128, 1, 256, 256, 8, 1, 128)	1549.388365			1.233405251
(128, 1, 512, 512, 8, 1, 128)	2480.52007			1.210676964
(128, 1, 1024, 1024, 8, 1, 128)	3360.125922			1.145674899
(128, 1, 2048, 2048, 8, 1, 128)	4103.461192			1.093136854
(128, 1, 4096, 4096, 8, 1, 128)	4783.429328			1.095583284

```

Reviewed By: jianyuh, v0i0

Differential Revision: D85155388

fbshipit-source-id: ec3e43e2c7b0ce68c8eebc3fac74db6c9b66de07
Summary:
Pull Request resolved: pytorch#5073

X-link: https://github.com/facebookresearch/FBGEMM/pull/2079

Compile-time static/const mapping utilities for:
1. constexpr value -> constexpr value
2. constexpr value -> type

Useful when developing template-heavy cutlass code.

Reviewed By: jianyuh

Differential Revision: D85893168

fbshipit-source-id: 691dbb90e17c88dfc384432908e8ffdb8c0b2a04
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2081

Pull Request resolved: pytorch#5076

D85603930 removed AVX from aarch64 compilation and broke the Sigrid build.

The proposed changes fix the build break.

There is an fbgemm routine without a reference implementation, so we will need to implement a NEON port in a later diff.

For now, four AVX2 files are compiled into the NEON package.

Reviewed By: YifanYuan3

Differential Revision: D85918535

fbshipit-source-id: 3a9892535046a0edc05c8c1fccbb0dac8ca8de35
Summary:
Pull Request resolved: pytorch#5062

X-link: meta-pytorch/torchrec#3490

X-link: https://github.com/facebookresearch/FBGEMM/pull/2070

Previously, KVZCH used the ID_COUNT and MEM_UTIL eviction trigger modes. Both are tricky, and it is hard for model engineers to decide what number to use for the ID count or memory-utilization threshold. Besides that, the eviction start time drifts out of sync across ranks after some time in training, which can cause a large QPS drop during eviction.

This diff adds support for a free-memory eviction trigger. Every N batches, each rank checks how much free memory is left; if free memory is below the threshold, eviction is triggered in all TBEs of all ranks using an all-reduce. In this way, we can align the eviction start time across all ranks.
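
A minimal sketch of this trigger pattern, assuming psutil for the free-memory check and a hypothetical trigger_eviction() entry point on each TBE; the threshold, interval, and names are illustrative rather than the actual FBGEMM API:

```
import psutil
import torch
import torch.distributed as dist

# Illustrative values; the real threshold and interval are configured elsewhere.
FREE_MEM_THRESHOLD_BYTES = 32 * 1024**3
CHECK_EVERY_N_BATCHES = 100


def maybe_trigger_eviction(batch_idx: int, tbes) -> None:
    """Every N batches, check free host memory on this rank; if any rank is
    below the threshold, an all-reduce turns the local flag into a global one,
    so every rank starts eviction at the same time."""
    if batch_idx % CHECK_EVERY_N_BATCHES != 0:
        return
    low_mem = psutil.virtual_memory().available < FREE_MEM_THRESHOLD_BYTES
    flag = torch.tensor([1 if low_mem else 0], dtype=torch.int32)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)  # any rank low on memory -> evict on all ranks
    if flag.item() == 1:
        for tbe in tbes:
            tbe.trigger_eviction()  # hypothetical eviction entry point on each TBE
```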

Reviewed By: emlin

Differential Revision: D85604160

fbshipit-source-id: 177ec779960a4ac9bfc3d41f38beeb7e56665db8
Summary:
Pull Request resolved: pytorch#5075

X-link: https://github.com/facebookresearch/FBGEMM/pull/2080

This diff generalizes the work in D85155388, based on Gefei's diff D85631781.

Compared to D85631781, we avoid register warp shuffling by using 32b TMEM atoms.

This diff supports:
1. Different dtypes (fp8, bf16)
2. Different mtiles (128, 64)

Reviewed By: v0i0

Differential Revision: D85893883

fbshipit-source-id: 25e93e627c573a120ab46336d3f234064c5ae066
…ch#5077)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2082

Pull Request resolved: pytorch#5077

This change selects the `hash_zch_identities` that correspond to the unique indices during TBE prefetch. This is specifically required for MPZCH tables, which need both the slot index and the corresponding identities for correct lookup behavior. Without the identities, the inference side cannot verify that it is using the correct slot, leading to potential lookup errors.
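
A hedged PyTorch illustration of the selection (not the actual prefetch code); `indices` and `hash_zch_identities` are assumed to be position-aligned tensors, and the example values are made up:

```
import torch

# Illustration only; not the actual TBE prefetch code.
indices = torch.tensor([7, 3, 7, 9, 3])
hash_zch_identities = torch.tensor([70, 30, 70, 90, 30])  # aligned 1:1 with `indices`

unique_indices, inverse = torch.unique(indices, return_inverse=True)

# For each unique slot, record the position of one of its occurrences
# (identities for the same index are identical, so any occurrence works).
slot_to_pos = torch.empty(unique_indices.numel(), dtype=torch.long)
slot_to_pos.scatter_(0, inverse, torch.arange(indices.numel()))
unique_identities = hash_zch_identities[slot_to_pos]

print(unique_indices)     # tensor([3, 7, 9])
print(unique_identities)  # tensor([30, 70, 90])
```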

Reviewed By: chouxi

Differential Revision: D85999577

fbshipit-source-id: 3c8a4add1dd112e9a746b334e7046bb442ea977b
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2086

While `not` is a valid C++ keyword, MSVC issues the following warning when it appears in a preprocessor directive:
```
C:\actions-runner\_work\pytorch\pytorch\third_party\fbgemm\include\fbgemm\./FloatConversion.h(292): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```

Pull Request resolved: pytorch#5025

Reviewed By: spcyppt

Differential Revision: D86135907

Pulled By: q10

fbshipit-source-id: 3d55410aa1f6f4f1a4511d2881d1b0ba05ea5c5a
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2088

- Remove Python 3.9 support, following PyTorch nightlies

Pull Request resolved: pytorch#5081

Reviewed By: spcyppt

Differential Revision: D86168579

Pulled By: q10

fbshipit-source-id: f15a5107ab9f86c7c07e704f510faab312bac858
…h#5080)

Summary:
Pull Request resolved: pytorch#5080

X-link: https://github.com/facebookresearch/FBGEMM/pull/2087

This PR introduces an optimization for the `group_index_select_or_add_2d_kernel` (`USE_INDEX_SELECT==true`) kernel, with a primary focus on the `float` type and relatively small embedding dimensions. Two things are implemented:
1) Extract the common variables out of the loop to omit unnecessary synchronizations on memory loads (the compiler won't do that automatically).
2) Switch to a logical wave size of 32 threads to reduce granularity losses.

Pull Request resolved: pytorch#5078

Reviewed By: spcyppt, haoyuz

Differential Revision: D86135611

Pulled By: q10

fbshipit-source-id: f4fb9966f5f5180c4dde2aed92ca726c260b7743
…ytorch#5083)

Summary:
Pull Request resolved: pytorch#5083

X-link: https://github.com/facebookresearch/FBGEMM/pull/2089

When running benchmarks with a large number of copies, the process may raise:
 OSError: [Errno 24] Too many open files.

Example command:
(fbgemm_gpu_env)$ ulimit -n 1048576
(fbgemm_gpu_env)$ python ./bench/tbe/tbe_inference_benchmark.py nbit-cpu \
    --num-embeddings=40000000 --bag-size=2 --embedding-dim=96 \
    --batch-size=162 --num-tables=8 --weights-precision=int4 \
    --output-dtype=fp32 --copies=96 --iters=30000

PyTorch multiprocessing provides two shared-memory strategies:
1. file_descriptor (default)
2. file_system

The default file_descriptor strategy uses file descriptors as shared-memory handles, which can result in a large number of open FDs when many tensors are shared.
If the total number of open FDs exceeds the system limit and the limit cannot be raised, the file_system strategy should be used instead.

This patch allows switching to the file_system strategy by setting:
  export PYTORCH_SHARE_STRATEGY='file_system'

Reference:
https://pytorch.org/docs/stable/multiprocessing.html#sharing-strategies
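
A minimal sketch of such a switch, using the env var name from this summary and the public torch.multiprocessing API:

```
import os

import torch.multiprocessing as mp

# Switch to the file_system strategy only when requested via the environment.
if os.environ.get("PYTORCH_SHARE_STRATEGY") == "file_system":
    # file_system uses file names instead of file descriptors as shared-memory
    # handles, so it avoids the per-process FD limit (at the cost of leaving
    # files in /dev/shm until they are unlinked).
    mp.set_sharing_strategy("file_system")

print(mp.get_sharing_strategy())
```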

Pull Request resolved: pytorch#5037

Reviewed By: spcyppt

Differential Revision: D86135817

Pulled By: q10

fbshipit-source-id: 15f6fe7e1de5e9fef828f5a1496dc1cf9b41c293
Summary:
Pull Request resolved: pytorch#5085

X-link: https://github.com/facebookresearch/FBGEMM/pull/2093

As the title says, in silvertorch bulk eval, eval() is not called on the module; the run is wrapped in torch.no_grad() instead. https://www.internalfb.com/code/fbsource/[324dbccd0ab0]/fbcode/dper_lib/silvertorch/core/publish/data_processing/bulk_eval_dmp_gpu.py?lines=1057 So we add an eval mode that turns self.training to False in TBE for bulk eval.
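
A toy sketch of the issue and the kind of switch described, using a stand-in module rather than the actual TBE class:

```
import torch
import torch.nn as nn


class ToyTBE(nn.Module):
    """Stand-in for the TBE module; training-only work is gated on self.training."""

    def forward(self, x):
        if self.training:
            pass  # e.g. optimizer-state / cache bookkeeping that should not run in bulk eval
        return x


tbe = ToyTBE()

# Bulk eval wraps the run in no_grad() but never calls .eval(), so
# self.training stays True and the training-only branch still executes:
with torch.no_grad():
    tbe(torch.randn(2, 3))
    print(tbe.training)  # True

# The change described above adds an explicit eval-mode switch that flips
# self.training to False for bulk eval (shown here with the standard .eval()):
tbe.eval()
print(tbe.training)  # False
```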

Reviewed By: emlin

Differential Revision: D86220286

fbshipit-source-id: 9a48c7b4dc09767c99a545d1f25e53bf4265079f
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2097

- Fix test reliability with table order

Pull Request resolved: pytorch#5087

Reviewed By: spcyppt

Differential Revision: D86242426

Pulled By: q10

fbshipit-source-id: 4ec307ff8fd9151bddb6bf7354bfe06f67a1fa0b
…#5089)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2098

Pull Request resolved: pytorch#5089

Adding a NEON translation of FloatOrHalfToFused8BitRowwiseQuantizedSBFloat, which is used by Ads.

Performance improves by an order of magnitude:

Before:

    bit_rate,   rows,  cols,  elems_per_usec,  GB/Sec
       8,   100,     16,          378.68,       1.51
       8,   100,     64,          286.91,       1.15
       8,   100,    128,          262.06,       1.05
       8,   100,    256,          251.34,       1.01
       8,   100,    512,          244.92,       0.98
       8,   100,   1024,          237.35,       0.95
       8,   100,   2048,          230.83,       0.92
       8,   120,     16,          378.70,       1.51
       8,   120,     64,          286.72,       1.15
       8,   120,    128,          263.40,       1.05
       8,   120,    256,          251.58,       1.01
       8,   120,    512,          245.30,       0.98
       8,   120,   1024,          238.17,       0.95
       8,   120,   2048,          230.69,       0.92
       8,  1000,     16,          392.85,       1.57
       8,  1000,     64,          294.35,       1.18
       8,  1000,    128,          264.35,       1.06
       8,  1000,    256,          252.13,       1.01
       8,  1000,    512,          245.50,       0.98
       8,  1000,   1024,          241.61,       0.97
       8,  1000,   2048,          231.39,       0.93

After:

    bit_rate,   rows,  cols,  elems_per_usec,  GB/Sec
       8,   100,     16,         1855.59,       7.42
       8,   100,     64,         2615.43,      10.46
       8,   100,    128,         3134.34,      12.54
       8,   100,    256,         2610.72,      10.44
       8,   100,    512,         3065.20,      12.26
       8,   100,   1024,         3535.29,      14.14
       8,   100,   2048,         3757.66,      15.03
       8,   120,     16,         1991.94,       7.97
       8,   120,     64,         2971.25,      11.89
       8,   120,    128,         3403.37,      13.61
       8,   120,    256,         2750.87,      11.00
       8,   120,    512,         3272.63,      13.09
       8,   120,   1024,         3618.98,      14.48
       8,   120,   2048,         3848.59,      15.39
       8,  1000,     16,         2329.11,       9.32
       8,  1000,     64,         3068.76,      12.28
       8,  1000,    128,         3678.86,      14.72
       8,  1000,    256,         4440.37,      17.76
       8,  1000,    512,         4558.70,      18.23
       8,  1000,   1024,         4620.94,      18.48
       8,  1000,   2048,         3898.84,      15.60

Reviewed By: mcfi

Differential Revision: D86236406

fbshipit-source-id: 12c20cbdbbc9b0674ccca8e1aa598b7de144dea9
Summary:
Pull Request resolved: pytorch#5091

X-link: https://github.com/facebookresearch/FBGEMM/pull/2099

In this test, we run the following steps:
1. Create a DramKVInferenceEmbedding with TTL eviction of 1 minute
2. Insert 1 embedding with a timestamp of the current Unix time minus 2 minutes (i.e., it is already expired)
3. Read it back and check correctness
4. Read it multiple times
5. Evict it
6. Read it again; this time the result should be inconsistent

Reviewed By: emlin

Differential Revision: D86268606

fbshipit-source-id: edc2dc24e5327399421d20229a0b1af2ca29ea7a
Summary:
Pull Request resolved: pytorch#5093

X-link: https://github.com/facebookresearch/FBGEMM/pull/2100

----

# Context on the changes:

Currently, Torchrec merges the outputs of individual VBE TBE ops so they are ordered by rank, using [_merge_variable_batch_embeddings](https://www.internalfb.com/code/fbsource/[3bd69d7fa3534144dcb0162ca59803a6c3ff6e70]/fbcode/torchrec/distributed/embedding_lookup.py?lines=593-604). This function seems to cause a ~30% QPS regression compared to the baseline (HBM+UVM) for the Jupiter V1 model with VBE enabled.

To get rid of the _merge_variable_batch_embeddings() function, we pre-allocate the `vbe_output` tensor, which holds the outputs from all VBE ops, and calculate `vbe_output_offsets` to allow each individual VBE op to write to the correct location in the `vbe_output` tensor.

By default, `vbe_output` and `vbe_output_offsets` are `None`, which means the VBE ops return individual tensors the way they currently do. The feature is enabled when `vbe_output` and `vbe_output_offsets` are not `None`.
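
A hedged illustration of the pre-allocation idea; the sizes, offsets, and variable names below are hypothetical, not the actual TBE/Torchrec call signature:

```
import torch

# Hypothetical sizes/offsets for three VBE ops.
vbe_output_size = 1024
vbe_output = torch.empty(vbe_output_size, dtype=torch.float32)

# Each op writes its slice into the shared buffer at its precomputed offset,
# so no post-hoc merge (re-ordering by rank) is needed.
vbe_output_offsets = [0, 256, 640]
op_outputs = [torch.randn(256), torch.randn(384), torch.randn(384)]

for offset, out in zip(vbe_output_offsets, op_outputs):
    vbe_output[offset:offset + out.numel()].copy_(out.flatten())
```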

 ---
**NOTE**
1. This feature is currently supported for Sparse TBE.
2. The support is limited to CUDA.
3. For backward compatibility, we append the newly introduced `vbe_output` to the existing API. Hence, we need to make the `vbe_output` tensor `optional` with a default value of `None` (there is no default value for Tensor).
4. We *cannot* annotate `vbe_output` because PyTorch registration does not support annotation of optional tensors. Adding an annotation incurs the error below. This may cause issues supporting this on MTIA, if MTIA relies on tensor annotations.
```
E0903 09:50:32.966235 2850885 ExceptionTracer.cpp:227] exception stack complete
terminate called after throwing an instance of 'std::runtime_error'
  what():  expected ident but found '(' here:
split_embedding_codegen_lookup_adagrad_function_pt2(    Tensor placeholder_autograd_tensor,     Tensor[](a!) weights,     Tensor D_offsets,     SymInt total_D,     SymInt max_D,     Tensor hash_size_cumsum,     int total_hash_size_bits,     Tensor indices,     Tensor offsets,     int pooling_mode,     Tensor? indice_weights,     Tensor? feature_requires_grad,     int output_dtype,     Tensor?[](e!) aux_tensor,     int[] aux_int,     float[] aux_float,     bool[] aux_bool,     Tensor[](g!) momentum1, Tensor learning_rate_tensor, float[] optim_float,     SymInt max_B=-1,     SymInt max_B_feature_rank=-1,     SymInt vbe_output_size=-1,     Tensor?(t!) vbe_output=None ) -> Tensor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ~ <--- HERE
```

See https://docs.google.com/document/d/1h5YyeCjYmmN-CIFB98CrBf1uMksidPbNvM1rl8yZeds/edit?tab=t.0#heading=h.tdfkkc6ujdyl

----

This diff is a reland of D79704318 in which all issues have been addressed.

## 1) pyper validation test

D79704318 was reverted as it broke the pyper validation test (a frontend/backend package compatibility issue), which blocks pyper releases. The issue is addressed in this diff.

Context: In pyper, python changes are included in the frontend package (e.g., ads_dper3) and C++ changes in the backend package (e.g., training_platform). If a diff contains both python and C++, there is a chance that some model will use mismatching packages. In other words, the frontend package does not include the diff but the backend does, or vice versa.

D83881544 only enables backend support (i.e., no one can actually use this feature yet, so TBE VBE will work as usual). Due to the new Unified API changes, we need to pipeline the optional tensor from the frontend, which requires a python change.

Denote
- #0 as no D83881544 included
- #1 as D83881544 included

There are 4 scenarios:

(1) frontend #0 + old backend #0 - no issue
(2) frontend #1 + backend #1 - no issue
(3) frontend #0 + backend #1 - handled; TBE VBE will work normally.
(4) frontend #1 + backend #0 - no issue; the diff added warning that backend is old

There's another diff, D79869613, in the stack that will enable frontend support (i.e., allow users to use this feature), which will go into the __frontend package only__.
Now, scenarios (1)-(4) remain the same, but new scenarios occur.

Denote
- #2 as D79869613 included

(5) frontend #2 + backend #1 - no issue, same as (2).
(6) frontend #2 (no feature enabled) + backend #0 - same as (4).
(7) frontend #2 (feature enabled) + backend #0 - **assertion error due to no backend support**, to prevent silent wrong behavior.

**To use the feature, this diff stack (D83881544 and D79869613) needs to be included in both the frontend and backend packages.**

## 2) SEV

D79704318 caused a SEV due to a TBE v1/v2 interface compatibility issue on lex_ig_o3_package. Unit tests to ensure v1 compatibility were added in D83020965.

D83881544 passes the v1 compatibility test.

Details on the root cause and fix:
https://docs.google.com/document/d/1XcYNfyiAn4aRFvjV0QG5aLiWKuuOWtJdLOMKNszZRpI/edit?tab=t.0#heading=h.psr4a2qn0mdk

------

Reviewed By: q10, renganxu

Differential Revision: D83881544

fbshipit-source-id: 5d63841bbf79a72219903e9d0f77ee3b998bc105
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2095

Optimizations to the embedding forward pass for MI350:
1. Apply vec4 in the embedding VBE forward kernel instead of vec2.
2. Since a wavefront has 64 threads on ROCm, optimize the subwarp in the embedding forward v2 kernel when the embedding dim is between 32 and 64.

Pull Request resolved: pytorch#5064

Reviewed By: q10

Differential Revision: D85701691

Pulled By: spcyppt

fbshipit-source-id: 72f491414f50e53038a4b02f3d555967d34740a7
Summary:
Pull Request resolved: pytorch#5086

X-link: https://github.com/facebookresearch/FBGEMM/pull/2094

For lengths per shard exceeding 2^31, we avoid an overflow that would result in undefined behavior.
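
A minimal PyTorch illustration of the overflow pattern being guarded against (not the FBGEMM kernel itself): prefix-summing int32 lengths can exceed 2^31, so the accumulation is widened to int64 first:

```
import torch

# Illustration only: per-shard lengths whose running sum exceeds 2^31.
lengths = torch.full((3,), 1_000_000_000, dtype=torch.int32)

# Accumulating in int32 wraps past 2**31 (and is undefined behavior in the C++ kernel):
bad_offsets = torch.cumsum(lengths, dim=0, dtype=torch.int32)

# Widening to int64 before the prefix sum keeps the offsets correct:
good_offsets = torch.cumsum(lengths.to(torch.int64), dim=0)

print(bad_offsets)   # third entry has wrapped around
print(good_offsets)  # tensor([1000000000, 2000000000, 3000000000])
```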

Reviewed By: spcyppt

Differential Revision: D86209662

fbshipit-source-id: 6d51290f3436629571677091c42b76b6f98e5790
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2102

Pull Request resolved: pytorch#5094

see D86119952

Reviewed By: htyu

Differential Revision: D86319606

fbshipit-source-id: bdf841f0936f1be53b7a07e66b6a64e9e2aaef12
Summary:
Pull Request resolved: pytorch#5031

X-link: meta-pytorch/torchrec#3475

X-link: https://github.com/facebookresearch/FBGEMM/pull/2044

Enable feature score auto collection for EBC in a similar way to EC. The configuration is no different in the embedding table config:

              virtual_table_eviction_policy=FeatureScoreBasedEvictionPolicy(
                  training_id_eviction_trigger_count=260_000_000,  # 260M
                  training_id_keep_count=160_000_000,  # 160M
                  enable_auto_feature_score_collection=True,
                  feature_score_mapping={
                      "sparse_public_original_content_creator": 1.0,
                  },
                  feature_score_default_value=0.5,
              ),

Reviewed By: EddyLXJ

Differential Revision: D85017179

fbshipit-source-id: 3d62f8adbe201d6e30c445aaed88710bbbcd6557
@Bernard-Liu force-pushed the aiter/mi350_kernel_opt branch from bca8c27 to 1f82f3b on November 11, 2025 06:56