Several kernel optimizations from aiter team #5074
Draft
Bernard-Liu wants to merge 91 commits into pytorch:main from ROCm:aiter/mi350_kernel_opt
warp per row wg change
Summary: Pull Request resolved: pytorch#5072 X-link: https://github.com/facebookresearch/FBGEMM/pull/2078 Changing the QtileSize to 64 yields a good improvement (> 20%). For correctness this includes changing the TMEM atoms and introducing a warp sync for row stats. Perf:
```
(Batch, SeqLenQ, SeqLenKV, MaxLenKV, HeadQ, HeadKV, HeadD)  cutlass_blackwell_fmha_decode-gbps  Improvement with Qtile = 64
(16, 1, 256, 256, 8, 1, 128)      238.2206209   1.31463193
(16, 1, 512, 512, 8, 1, 128)      410.8838061   1.315872068
(16, 1, 1024, 1024, 8, 1, 128)    660.5696208   1.335567769
(16, 1, 2048, 2048, 8, 1, 128)    916.5460174   1.310093116
(16, 1, 4096, 4096, 8, 1, 128)    1133.690174   1.258896694
(16, 1, 8192, 8192, 8, 1, 128)    1271.341515   1.229311967
(32, 1, 256, 256, 8, 1, 128)      468.9034945   1.295635241
(32, 1, 512, 512, 8, 1, 128)      799.2689835   1.280831124
(32, 1, 1024, 1024, 8, 1, 128)    1285.452285   1.293538886
(32, 1, 2048, 2048, 8, 1, 128)    1797.074701   1.269787171
(32, 1, 4096, 4096, 8, 1, 128)    2210.946865   1.229703361
(32, 1, 8192, 8192, 8, 1, 128)    2498.665399   1.212166122
(64, 1, 256, 256, 8, 1, 128)      893.9747894   1.302172409
(64, 1, 512, 512, 8, 1, 128)      1493.150844   1.274679551
(64, 1, 1024, 1024, 8, 1, 128)    2309.825211   1.220419935
(64, 1, 2048, 2048, 8, 1, 128)    3012.271892   1.159444905
(64, 1, 4096, 4096, 8, 1, 128)    3552.001019   1.089389445
(64, 1, 8192, 8192, 8, 1, 128)    4348.016208   1.131298153
(128, 1, 256, 256, 8, 1, 128)     1549.388365   1.233405251
(128, 1, 512, 512, 8, 1, 128)     2480.52007    1.210676964
(128, 1, 1024, 1024, 8, 1, 128)   3360.125922   1.145674899
(128, 1, 2048, 2048, 8, 1, 128)   4103.461192   1.093136854
(128, 1, 4096, 4096, 8, 1, 128)   4783.429328   1.095583284
```
Reviewed By: jianyuh, v0i0 Differential Revision: D85155388 fbshipit-source-id: ec3e43e2c7b0ce68c8eebc3fac74db6c9b66de07
Summary: Pull Request resolved: pytorch#5073 X-link: https://github.com/facebookresearch/FBGEMM/pull/2079 Compile-time static/const mapping utilities for:
1. constexpr value -> constexpr value
2. constexpr value -> type
Useful when developing template-heavy cutlass code. Reviewed By: jianyuh Differential Revision: D85893168 fbshipit-source-id: 691dbb90e17c88dfc384432908e8ffdb8c0b2a04
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2081 Pull Request resolved: pytorch#5076 D85603930 removed AVX from aarch64 compilation and broke the Sigrid build. The proposed changes fix the build break. There is an fbgemm routine without a reference implementation, so we will need to implement a NEON port in a later diff. For now, four AVX2 files are compiled into the NEON package. Reviewed By: YifanYuan3 Differential Revision: D85918535 fbshipit-source-id: 3a9892535046a0edc05c8c1fccbb0dac8ca8de35
Summary: Pull Request resolved: pytorch#5062 X-link: meta-pytorch/torchrec#3490 X-link: https://github.com/facebookresearch/FBGEMM/pull/2070 Previously, KVZCH used the ID_COUNT and MEM_UTIL eviction trigger modes; both are tricky, and it is hard for model engineers to decide what number to use for the id-count or mem-util threshold. Besides that, the eviction start times drift out of sync after some time in training, which can cause a large QPS drop during eviction. This diff adds support for free-memory-triggered eviction. Every N batches, each rank checks how much free memory is left; if free memory falls below the threshold, eviction is triggered in all TBEs on all ranks via an all-reduce. In this way, we can force the eviction start time to be synchronized across all ranks. Reviewed By: emlin Differential Revision: D85604160 fbshipit-source-id: 177ec779960a4ac9bfc3d41f38beeb7e56665db8
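As a rough illustration of the trigger protocol described above (not the actual TBE implementation): each rank checks its local free memory every N batches, and an all-reduce ensures that if any rank is below the threshold, every rank starts eviction at the same step. The helper name, the psutil dependency, and the thresholds here are assumptions for the sketch.

```python
import psutil  # assumption: psutil is available for reading free host memory
import torch
import torch.distributed as dist

def should_evict(step: int, check_interval: int, free_mem_threshold_bytes: int) -> bool:
    """Hypothetical helper: decide, in sync across all ranks, whether to start eviction."""
    if step % check_interval != 0:
        return False
    # 1 if this rank is low on free memory, 0 otherwise
    low_mem = torch.tensor(
        [1 if psutil.virtual_memory().available < free_mem_threshold_bytes else 0]
    )
    # MAX all-reduce: if any rank is low on memory, every rank sees 1 and evicts together
    dist.all_reduce(low_mem, op=dist.ReduceOp.MAX)
    return bool(low_mem.item())
```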
Summary: Pull Request resolved: pytorch#5075 X-link: https://github.com/facebookresearch/FBGEMM/pull/2080 This diff generalizes the work in D85155388, based on Gefei's diff D85631781. Compared to D85631781, we avoid register warp shuffling by using 32b TMEM atoms. This diff supports:
1. Different dtypes (fp8, bf16)
2. Different mtiles (128, 64)
Reviewed By: v0i0 Differential Revision: D85893883 fbshipit-source-id: 25e93e627c573a120ab46336d3f234064c5ae066
…ch#5077) Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2082 Pull Request resolved: pytorch#5077 This change selects the `hash_zch_identities` that corresponds with unique indices during TBE prefetch. This is specifically required for MPZCH tables, which need both the slot index and the corresponding identities for correct lookup behavior. Without the identities, the inference side cannot correctly verify if it's using the correct slot, leading to potential lookup errors. Reviewed By: chouxi Differential Revision: D85999577 fbshipit-source-id: 3c8a4add1dd112e9a746b334e7046bb442ea977b
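A minimal sketch of the selection idea (tensor names are hypothetical, not the FBGEMM internals): the identities are gathered with the same unique-index mapping used for the prefetched rows, so each unique slot carries its identity alongside its index.

```python
import torch

indices = torch.tensor([7, 3, 7, 9, 3])           # raw lookup indices for a batch
hash_zch_identities = torch.arange(100, 200)       # hypothetical per-row identity table

# Deduplicate the indices, then gather the identities for the unique rows only,
# so the prefetch path can hand MPZCH lookups (unique index, identity) pairs.
unique_indices = torch.unique(indices)
unique_identities = hash_zch_identities[unique_indices]
```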
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2086 While `not` is a valid C++ keyword, MSVC issues the following warning:
```
C:\actions-runner\_work\pytorch\pytorch\third_party\fbgemm\include\fbgemm\./FloatConversion.h(292): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```
Pull Request resolved: pytorch#5025 Reviewed By: spcyppt Differential Revision: D86135907 Pulled By: q10 fbshipit-source-id: 3d55410aa1f6f4f1a4511d2881d1b0ba05ea5c5a
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2088 - Remove Python 3.9 support, following PyTorch nightlies Pull Request resolved: pytorch#5081 Reviewed By: spcyppt Differential Revision: D86168579 Pulled By: q10 fbshipit-source-id: f15a5107ab9f86c7c07e704f510faab312bac858
…h#5080) Summary: Pull Request resolved: pytorch#5080 X-link: https://github.com/facebookresearch/FBGEMM/pull/2087 This PR introduces an optimization for the `group_index_select_or_add_2d_kernel` (`USE_INDEX_SELECT==true`) kernel, with a primary focus on the `float` type and relatively small embedding dimensions. Two things are implemented:
1) Common variables are hoisted out of the loop to avoid unnecessary synchronization on memory loads (the compiler won't do that automatically).
2) Switch to a logical wave size of 32 threads to reduce granularity losses.
Pull Request resolved: pytorch#5078 Reviewed By: spcyppt, haoyuz Differential Revision: D86135611 Pulled By: q10 fbshipit-source-id: f4fb9966f5f5180c4dde2aed92ca726c260b7743
…ytorch#5083) Summary: Pull Request resolved: pytorch#5083 X-link: https://github.com/facebookresearch/FBGEMM/pull/2089 When running benchmarks with a large number of copies, the process may raise: OSError: [Errno 24] Too many open files. Example command:
(fbgemm_gpu_env)$ ulimit -n 1048576
(fbgemm_gpu_env)$ python ./bench/tbe/tbe_inference_benchmark.py nbit-cpu \
  --num-embeddings=40000000 --bag-size=2 --embedding-dim=96 \
  --batch-size=162 --num-tables=8 --weights-precision=int4 \
  --output-dtype=fp32 --copies=96 --iters=30000
PyTorch multiprocessing provides two shared-memory strategies:
1. file_descriptor (default)
2. file_system
The default file_descriptor strategy uses file descriptors as shared-memory handles, which can result in a large number of open FDs when many tensors are shared. If the total number of open FDs exceeds the system limit and the limit cannot be raised, the file_system strategy should be used instead. This patch allows switching to the file_system strategy by setting: export PYTORCH_SHARE_STRATEGY='file_system' Reference: https://pytorch.org/docs/stable/multiprocessing.html#sharing-strategies Pull Request resolved: pytorch#5037 Reviewed By: spcyppt Differential Revision: D86135817 Pulled By: q10 fbshipit-source-id: 15f6fe7e1de5e9fef828f5a1496dc1cf9b41c293
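A small sketch of how the switch might be wired up, assuming the PYTORCH_SHARE_STRATEGY environment variable named in the patch; torch.multiprocessing.set_sharing_strategy is the standard PyTorch API for choosing a sharing strategy.

```python
import os
import torch.multiprocessing as mp

# If the benchmark is launched with PYTORCH_SHARE_STRATEGY='file_system',
# switch from the default file_descriptor strategy (one FD per shared tensor)
# to file_system, which names shared-memory segments via the filesystem instead.
strategy = os.environ.get("PYTORCH_SHARE_STRATEGY")
if strategy in mp.get_all_sharing_strategies():
    mp.set_sharing_strategy(strategy)

print("sharing strategy in use:", mp.get_sharing_strategy())
```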
Summary: Pull Request resolved: pytorch#5085 X-link: https://github.com/facebookresearch/FBGEMM/pull/2093 As titled: in silvertorch bulk eval, eval() is not called on the module; instead the run is wrapped in torch.no_grad(). https://www.internalfb.com/code/fbsource/[324dbccd0ab0]/fbcode/dper_lib/silvertorch/core/publish/data_processing/bulk_eval_dmp_gpu.py?lines=1057 So add an eval mode that sets self.training to False in TBE for bulk eval. Reviewed By: emlin Differential Revision: D86220286 fbshipit-source-id: 9a48c7b4dc09767c99a545d1f25e53bf4265079f
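A hedged sketch of the behavioral gap being fixed (the module is a hypothetical stand-in, not the real TBE class): torch.no_grad() does not flip self.training, so a module that branches on self.training still takes its training path unless an explicit eval mode is set.

```python
import torch

class TinyTBE(torch.nn.Module):  # hypothetical stand-in for the real TBE module
    def forward(self, x):
        # Branch on self.training the way TBE-internal bookkeeping might
        return x * 2 if self.training else x

m = TinyTBE()
with torch.no_grad():
    print(m.training)   # True: no_grad() alone does not change training mode
    m.eval()            # explicit "eval mode" flips self.training to False
    print(m.training)   # False
```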
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2097 - Fix test reliability with table order Pull Request resolved: pytorch#5087 Reviewed By: spcyppt Differential Revision: D86242426 Pulled By: q10 fbshipit-source-id: 4ec307ff8fd9151bddb6bf7354bfe06f67a1fa0b
…#5089) Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2098 Pull Request resolved: pytorch#5089 Adding NEON translation of FloatOrHalfToFused8BitRowwiseQuantizedSBFloat, used by Ads. Performance improves by an order of magnitude:
Before:
bit_rate, rows, cols, elems_per_usec, GB/Sec
8, 100, 16, 378.68, 1.51
8, 100, 64, 286.91, 1.15
8, 100, 128, 262.06, 1.05
8, 100, 256, 251.34, 1.01
8, 100, 512, 244.92, 0.98
8, 100, 1024, 237.35, 0.95
8, 100, 2048, 230.83, 0.92
8, 120, 16, 378.70, 1.51
8, 120, 64, 286.72, 1.15
8, 120, 128, 263.40, 1.05
8, 120, 256, 251.58, 1.01
8, 120, 512, 245.30, 0.98
8, 120, 1024, 238.17, 0.95
8, 120, 2048, 230.69, 0.92
8, 1000, 16, 392.85, 1.57
8, 1000, 64, 294.35, 1.18
8, 1000, 128, 264.35, 1.06
8, 1000, 256, 252.13, 1.01
8, 1000, 512, 245.50, 0.98
8, 1000, 1024, 241.61, 0.97
8, 1000, 2048, 231.39, 0.93
After:
bit_rate, rows, cols, elems_per_usec, GB/Sec
8, 100, 16, 1855.59, 7.42
8, 100, 64, 2615.43, 10.46
8, 100, 128, 3134.34, 12.54
8, 100, 256, 2610.72, 10.44
8, 100, 512, 3065.20, 12.26
8, 100, 1024, 3535.29, 14.14
8, 100, 2048, 3757.66, 15.03
8, 120, 16, 1991.94, 7.97
8, 120, 64, 2971.25, 11.89
8, 120, 128, 3403.37, 13.61
8, 120, 256, 2750.87, 11.00
8, 120, 512, 3272.63, 13.09
8, 120, 1024, 3618.98, 14.48
8, 120, 2048, 3848.59, 15.39
8, 1000, 16, 2329.11, 9.32
8, 1000, 64, 3068.76, 12.28
8, 1000, 128, 3678.86, 14.72
8, 1000, 256, 4440.37, 17.76
8, 1000, 512, 4558.70, 18.23
8, 1000, 1024, 4620.94, 18.48
8, 1000, 2048, 3898.84, 15.60
Reviewed By: mcfi Differential Revision: D86236406 fbshipit-source-id: 12c20cbdbbc9b0674ccca8e1aa598b7de144dea9
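For context on what the routine computes, here is a scalar reference sketch of fused 8-bit rowwise quantization with float scale/bias ("SBFloat"), as I understand the format: each output row stores the uint8 codes followed by a float32 scale and a float32 bias (the row minimum). This is a hedged illustration, not the FBGEMM source; the NEON/AVX2 implementations vectorize this per-row loop.

```python
import numpy as np

def fused_8bit_rowwise_quantize(x: np.ndarray) -> np.ndarray:
    """Scalar reference sketch: quantize each row to uint8 codes plus a trailing
    float32 scale and float32 bias appended to the same row of bytes."""
    rows, cols = x.shape
    out = np.empty((rows, cols + 8), dtype=np.uint8)  # 8 extra bytes: scale + bias
    for r in range(rows):
        row = x[r].astype(np.float32)
        bias = float(row.min())
        span = float(row.max()) - bias
        scale = span / 255.0 if span > 0 else 1.0
        codes = np.clip(np.rint((row - bias) / scale), 0, 255).astype(np.uint8)
        out[r, :cols] = codes
        out[r, cols:cols + 4] = np.frombuffer(np.float32(scale).tobytes(), dtype=np.uint8)
        out[r, cols + 4:] = np.frombuffer(np.float32(bias).tobytes(), dtype=np.uint8)
    return out

quantized = fused_8bit_rowwise_quantize(np.random.rand(4, 16).astype(np.float32))
```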
Summary: Pull Request resolved: pytorch#5091 X-link: https://github.com/facebookresearch/FBGEMM/pull/2099 In this test, we run the following steps:
1. Create a DramKVInferenceEmbedding with TTL eviction set to 1 min
2. Insert 1 embedding with the current Unix time - 2 mins as its timestamp (it is already expired)
3. Read from it and check correctness
4. Read multiple times
5. Evict it
6. Read it --- this time the result should be inconsistent
Reviewed By: emlin Differential Revision: D86268606 fbshipit-source-id: edc2dc24e5327399421d20229a0b1af2ca29ea7a
Summary: Pull Request resolved: pytorch#5093 X-link: https://github.com/facebookresearch/FBGEMM/pull/2100
----
# Context on the changes:
Currently, Torchrec merges the outputs of individual VBE TBE ops to be ordered by ranks using [_merge_variable_batch_embeddings](https://www.internalfb.com/code/fbsource/[3bd69d7fa3534144dcb0162ca59803a6c3ff6e70]/fbcode/torchrec/distributed/embedding_lookup.py?lines=593-604). This function appears to cause a ~30% QPS regression compared to the baseline (HBM+UVM) for the Jupiter V1 model with VBE enabled. To get rid of the _merge_variable_batch_embeddings() function, we pre-allocate the `vbe_output` tensor, which holds the outputs from all VBE ops, and calculate `vbe_output_offsets` to allow each individual VBE op to write to the correct location in the `vbe_output` tensor (a rough sketch of this layout follows this commit message). By default, `vbe_output` and `vbe_output_offsets` are `None`, which means VBE ops return individual tensors the way they currently do. The feature is enabled when `vbe_output` and `vbe_output_offsets` are not `None`.
---
**NOTE**
1. This feature is currently supported for Sparse TBE.
2. The support is limited to CUDA.
3. For backward compatibility, we append the newly introduced `vbe_output` to the existing API. Hence, we need to make the `vbe_output` tensor `optional` with a default value of `None` (there is no default value for Tensor).
4. We *cannot* annotate `vbe_output` because PyTorch registration does not support annotation of an optional tensor. Adding an annotation incurs the error below. This may cause some issues for supporting this on MTIA, if MTIA relies on tensor annotation.
```
E0903 09:50:32.966235 2850885 ExceptionTracer.cpp:227] exception stack complete terminate called after throwing an instance of 'std::runtime_error' what(): expected ident but found '(' here: split_embedding_codegen_lookup_adagrad_function_pt2( Tensor placeholder_autograd_tensor, Tensor[](a!) weights, Tensor D_offsets, SymInt total_D, SymInt max_D, Tensor hash_size_cumsum, int total_hash_size_bits, Tensor indices, Tensor offsets, int pooling_mode, Tensor? indice_weights, Tensor? feature_requires_grad, int output_dtype, Tensor?[](e!) aux_tensor, int[] aux_int, float[] aux_float, bool[] aux_bool, Tensor[](g!) momentum1, Tensor learning_rate_tensor, float[] optim_float, SymInt max_B=-1, SymInt max_B_feature_rank=-1, SymInt vbe_output_size=-1, Tensor?(t!) vbe_output=None ) -> Tensor ~ <--- HERE
```
See https://docs.google.com/document/d/1h5YyeCjYmmN-CIFB98CrBf1uMksidPbNvM1rl8yZeds/edit?tab=t.0#heading=h.tdfkkc6ujdyl
----
This diff is a reland of D79704318 in which all issues have been addressed.
## 1) pyper validation test
D79704318 was reverted as it broke the pyper validation test (a frontend/backend package compatibility issue), which blocks pyper releases. The issue is addressed in this diff. Context: in pyper, changes in python are included in the frontend package (e.g., ads_dper3) and C++ changes in the backend package (e.g., training_platform). If a diff contains both python and C++, there is a chance that some model will use mismatching packages; in other words, the frontend package may not include the diff while the backend does, and vice versa. D83881544 only enables backend support (i.e., no one can actually use this feature yet, so TBE VBE works as usual). Due to the new Unified API changes, we need to pipe the optional tensor from the frontend, which requires a python change.
Denote:
- #0 as no D83881544 included
- #1 as D83881544 included
There are 4 scenarios:
(1) frontend #0 + old backend #0 - no issue
(2) frontend #1 + backend #1 - no issue
(3) frontend #0 + backend #1 - handled; TBE VBE will work normally.
(4) frontend #1 + backend #0 - no issue; the diff adds a warning that the backend is old
There is another diff, D79869613, in the stack that will enable frontend support (i.e., allow users to use this feature), which will go into the __frontend package only__. Scenarios (1)-(4) remain the same, but new scenarios occur. Denote:
- #2 as D79869613 included
(5) frontend #2 + backend #1 - no issue, same as (2).
(6) frontend #2 (feature not enabled) + backend #0 - same as (4).
(7) frontend #2 (feature enabled) + backend #0 - **assertion error due to no backend support**, to prevent silently wrong behavior.
**To use the feature, this diff stack (D83881544 and D79869613) needs to be included in both the frontend and backend packages.**
## 2) SEV
D79704318 caused a SEV due to a TBE v1 and v2 interface compatibility issue on lex_ig_o3_package. Unit tests to ensure v1 compatibility were added in D83020965. D83881544 passes the v1 compatibility test. Detail on the root cause and fix: https://docs.google.com/document/d/1XcYNfyiAn4aRFvjV0QG5aLiWKuuOWtJdLOMKNszZRpI/edit?tab=t.0#heading=h.psr4a2qn0mdk
------
Reviewed By: q10, renganxu Differential Revision: D83881544 fbshipit-source-id: 5d63841bbf79a72219903e9d0f77ee3b998bc105
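A rough sketch of the pre-allocation layout described in the commit above (sizes, names, and offsets here are hypothetical): one flat vbe_output buffer is allocated up front, and vbe_output_offsets tells each VBE op which slice of that buffer to write, so no post-hoc merge by rank order is needed.

```python
import torch

# Hypothetical sizes: three VBE TBE ops whose flattened outputs are 384, 256,
# and 384 elements long, laid out back to back in one pre-allocated buffer.
op_sizes = [384, 256, 384]
vbe_output_offsets = [0]
for s in op_sizes:
    vbe_output_offsets.append(vbe_output_offsets[-1] + s)

vbe_output = torch.empty(vbe_output_offsets[-1], dtype=torch.float32)

# Each op writes directly into its slice instead of returning a separate tensor
# that would later have to be merged.
for i, size in enumerate(op_sizes):
    start, end = vbe_output_offsets[i], vbe_output_offsets[i + 1]
    op_result = torch.full((size,), float(i))  # stand-in for the i-th op's flattened output
    vbe_output[start:end].copy_(op_result)
```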
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2095 Optimizations of the embedding forward pass for MI350:
1. Apply vec4 instead of vec2 in the embedding VBE forward kernel.
2. Since a ROCm wavefront has 64 threads, optimize the subwarp handling in the embedding forward v2 kernel when the embedding dim is between 32 and 64.
Pull Request resolved: pytorch#5064 Reviewed By: q10 Differential Revision: D85701691 Pulled By: spcyppt fbshipit-source-id: 72f491414f50e53038a4b02f3d555967d34740a7
Summary: Pull Request resolved: pytorch#5086 X-link: https://github.com/facebookresearch/FBGEMM/pull/2094 For lengths per shard exceeding 2^31, we avoid the overflow that would result in undefined behavior. Reviewed By: spcyppt Differential Revision: D86209662 fbshipit-source-id: 6d51290f3436629571677091c42b76b6f98e5790
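A small illustration of the overflow hazard and the safe pattern (the shapes are made up for the example): once the running total of per-shard lengths passes 2^31 - 1, a 32-bit prefix sum wraps around, so the accumulation is done in int64.

```python
import torch

# 70,000 segments of length 40,000 sum to 2.8e9, which exceeds 2**31 - 1.
lengths = torch.full((70_000,), 40_000, dtype=torch.int32)

# Accumulate in int64 so the offsets / prefix sums cannot wrap around.
offsets = torch.cumsum(lengths.to(torch.int64), dim=0)
assert offsets[-1].item() == 70_000 * 40_000
```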
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2102 Pull Request resolved: pytorch#5094 see D86119952 Reviewed By: htyu Differential Revision: D86319606 fbshipit-source-id: bdf841f0936f1be53b7a07e66b6a64e9e2aaef12
Summary: Pull Request resolved: pytorch#5031 X-link: meta-pytorch/torchrec#3475 X-link: https://github.com/facebookresearch/FBGEMM/pull/2044 Enable feature score auto collection for EBC in a similar way to EC. The configuration is no different in the embedding table config:
virtual_table_eviction_policy=FeatureScoreBasedEvictionPolicy(
    training_id_eviction_trigger_count=260_000_000,  # 260M
    training_id_keep_count=160_000_000,  # 160M
    enable_auto_feature_score_collection=True,
    feature_score_mapping={
        "sparse_public_original_content_creator": 1.0,
    },
    feature_score_default_value=0.5,
),
Reviewed By: EddyLXJ Differential Revision: D85017179 fbshipit-source-id: 3d62f8adbe201d6e30c445aaed88710bbbcd6557
Bwd performance optimization for ROCm.
Caution:
Please note that this PR depends on #4925. One should merge #4925 first; after that, this PR can be merged.