2 changes: 1 addition & 1 deletion VERSION_NUMBER
@@ -1 +1 @@
1.23.1
1.23.2
2 changes: 1 addition & 1 deletion cmake/CMakeLists.txt
@@ -99,7 +99,7 @@ option(onnxruntime_USE_VSINPU "Build with VSINPU support" OFF)
cmake_dependent_option(onnxruntime_USE_FLASH_ATTENTION "Build flash attention kernel for scaled dot product attention" ON "onnxruntime_USE_CUDA" OFF)
option(onnxruntime_USE_LEAN_ATTENTION "Build lean attention kernel for scaled dot product attention" OFF)
cmake_dependent_option(onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION "Build memory efficient attention kernel for scaled dot product attention" ON "onnxruntime_USE_CUDA" OFF)
cmake_dependent_option(onnxruntime_USE_FPA_INTB_GEMM "Build FpA IntB gemm cuda kernels" ON "onnxruntime_USE_CUDA" OFF)
option(onnxruntime_USE_FPA_INTB_GEMM "Build FpA IntB gemm cuda kernels" OFF)

option(onnxruntime_BUILD_FOR_NATIVE_MACHINE "Enable this option for turning on optimization specific to this machine" OFF)
option(onnxruntime_USE_AVX "Use AVX instructions" OFF)
54 changes: 36 additions & 18 deletions docs/ContribOperators.md
@@ -3121,13 +3121,13 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

<dl>
<dt><tt>input</tt> : T</dt>
<dd>2D input tensor with shape (num_rows, hidden_size) or 3D input tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dd>2D input tensor with shape (num_tokens, hidden_size) or 3D input tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dt><tt>router_probs</tt> : T</dt>
<dd>2D input tensor with shape (num_rows, num_experts)</dd>
<dd>2D input tensor with shape (num_tokens, num_experts)</dd>
<dt><tt>fc1_experts_weights</tt> : T</dt>
<dd>3D input tensor with shape (num_experts, inter_size, hidden_size), or (num_experts, 2 * inter_size, hidden_size) for swiglu</dd>
<dd>3D input tensor with shape (num_experts, fusion_size * inter_size, hidden_size), where fusion_size is 2 for fused swiglu, and 1 otherwise</dd>
<dt><tt>fc1_experts_bias</tt> (optional) : T</dt>
<dd>2D optional input tensor with shape (num_experts, inter_size), or (num_experts, 2 * inter_size) for swiglu</dd>
<dd>2D optional input tensor with shape (num_experts, fusion_size * inter_size)</dd>
<dt><tt>fc2_experts_weights</tt> : T</dt>
<dd>3D input tensor with shape (num_experts, hidden_size, inter_size)</dd>
<dt><tt>fc2_experts_bias</tt> (optional) : T</dt>
@@ -3142,7 +3142,7 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

<dl>
<dt><tt>output</tt> : T</dt>
<dd>2D input tensor with shape (num_rows, hidden_size) or 3D input tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dd>2D output tensor with shape (num_tokens, hidden_size) or 3D output tensor with shape (batch_size, sequence_length, hidden_size)</dd>
</dl>
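
As a reading aid, the fusion_size convention above can be made concrete with a small sketch. The helper below is hypothetical (not part of the operator specification), and the sizes in the example call are arbitrary:

```python
# Hypothetical helper illustrating the fusion_size shape convention above.
def moe_input_shapes(num_experts, hidden_size, inter_size, fused_swiglu):
    fusion_size = 2 if fused_swiglu else 1  # 2 for fused swiglu, 1 otherwise
    return {
        "fc1_experts_weights": (num_experts, fusion_size * inter_size, hidden_size),
        "fc1_experts_bias":    (num_experts, fusion_size * inter_size),
        "fc2_experts_weights": (num_experts, hidden_size, inter_size),
    }

# Example: 8 experts, hidden_size=2048, inter_size=5632, fused swiglu
print(moe_input_shapes(8, 2048, 5632, fused_swiglu=True))
# {'fc1_experts_weights': (8, 11264, 2048), 'fc1_experts_bias': (8, 11264),
#  'fc2_experts_weights': (8, 2048, 5632)}
```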

#### Type Constraints
@@ -4532,7 +4532,7 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

### <a name="com.microsoft.QMoE"></a><a name="com.microsoft.qmoe">**com.microsoft.QMoE**</a>

Quantized MoE
Quantized mixture of experts (MoE).

Only the weights are quantized, using symmetric quantization.
The quantized weights are stored in column-major order per expert.
The quantization block size can be specified. If it is not provided, column-wise quantization is used.
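
For intuition, here is a minimal NumPy sketch of the symmetric per-column / per-block quantization described above, applied to one expert's weight matrix of shape (n, k), where k is the input-feature dimension. The helper name and the zero-guard are assumptions for illustration, not the actual kernel code:

```python
import numpy as np

# Hypothetical sketch of the symmetric weight quantization scheme above.
def quantize_symmetric(w, bits=4, block_size=None):
    """Quantize one expert's weight matrix w of shape (n, k)."""
    n, k = w.shape
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit, 127 for 8-bit
    bs = k if block_size is None else block_size  # None: one scale per whole column
    assert k % bs == 0, "k must be divisible by the block size"
    blocks = w.reshape(n, k // bs, bs)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-12)            # guard all-zero blocks (assumption)
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q.reshape(n, k), scales.squeeze(-1)    # scales: (n, k // bs)
```

For 4-bit weights, two quantized values are then packed into each stored byte, which is why the packed weight shapes below divide the input-feature dimension by pack_size = 8 / expert_weight_bits.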

The SwiGLU (Swish-Gated Linear Unit) activation function is computed as:
g = xW + b
l = xV + c
G = clamp(g, max=limit)
L = clamp(l, min=-limit, max=limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)
where x is the input, W and V are weight matrices, b and c are bias vectors, and alpha, beta and limit are constant float parameters.
When swiglu_fusion=0, the two GEMMs are not fused; they appear as FC1 and FC3 in the inputs.
When swiglu_fusion=1, the two GEMMs are fused so that g and l are computed in a single GEMM (FC1), with g and l interleaved on each row of size 2 * inter_size.
When swiglu_fusion=2, the two GEMMs are fused, and g and l are concatenated on each row.
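
The activation and the two fused layouts can be sketched in a few lines of NumPy. This is an illustrative sketch, not the ONNX Runtime kernel; it assumes g occupies the even positions in the interleaved layout and the first half in the concatenated one, and the parameter defaults in the signature are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sketch of the SwiGLU formulas above. fc1_out holds g and l
# together, with 2 * inter_size values per row. With swiglu_fusion=0, g and l
# come from the separate FC1 and FC3 outputs and the same formula applies.
def swiglu(fc1_out, alpha=1.702, beta=1.0, limit=7.0, swiglu_fusion=1):
    if swiglu_fusion == 1:             # interleaved: g even, l odd (assumed order)
        g, l = fc1_out[..., 0::2], fc1_out[..., 1::2]
    else:                              # swiglu_fusion == 2: concatenated halves
        g, l = np.split(fc1_out, 2, axis=-1)
    G = np.minimum(g, limit)           # clamp(g, max=limit)
    L = np.clip(l, -limit, limit)      # clamp(l, min=-limit, max=limit)
    return G * sigmoid(alpha * G) * (L + beta)   # shape (..., inter_size)
```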


#### Version

@@ -4547,6 +4563,8 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
<dd>Beta parameter used in activation function.</dd>
<dt><tt>activation_type</tt> : string</dt>
<dd>Activation function to use. Choose from relu, gelu, silu, swiglu and identity. Default is relu.</dd>
<dt><tt>block_size</tt> : int</dt>
<dd>Size of each quantization block along the K (input feature) dimension. Must be a power of two and ≥ 16 (e.g., 16, 32, 64, 128). If provided, both hidden_size and inter_size must be divisible by the block size. Otherwise, there is no blocking and a whole column shares one scaling factor.</dd>
<dt><tt>expert_weight_bits</tt> : int</dt>
<dd>Number of bits used in quantized weights. Default is 4 bits.</dd>
<dt><tt>k</tt> : int</dt>
@@ -4565,34 +4583,34 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

<dl>
<dt><tt>input</tt> : T</dt>
<dd>2D input tensor with shape (num_rows, hidden_size) or 3D input tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dd>2D tensor with shape (num_tokens, hidden_size), or 3D tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dt><tt>router_probs</tt> : T</dt>
<dd>2D input tensor with shape (num_rows, num_experts)</dd>
<dd>2D tensor with shape (num_tokens, num_experts)</dd>
<dt><tt>fc1_experts_weights</tt> : T1</dt>
<dd>3D input tensor with shape (num_experts, inter_size, hidden_size), or (num_experts, inter_size, hidden_size / 2) for 4 bits. For swiglu, shape can be (num_experts, 2 * inter_size, hidden_size), or (num_experts, 2 * inter_size, hidden_size / 2) for 4 bits.</dd>
<dd>3D tensor with shape (num_experts, fusion_size * inter_size, hidden_size / pack_size). The fusion_size is 2 for fused swiglu, and 1 otherwise. The pack_size is 8 / expert_weight_bits.</dd>
<dt><tt>fc1_scales</tt> : T2</dt>
<dd>2D input tensor with shape (num_experts, inter_size), or (num_experts, 2 * inter_size) for swiglu</dd>
<dd>2D tensor with shape (num_experts, fusion_size * inter_size), or 3D tensor with shape (num_experts, fusion_size * inter_size, hidden_size / block_size) when block_size is provided.</dd>
<dt><tt>fc1_experts_bias</tt> (optional) : T</dt>
<dd>2D optional input tensor with shape (num_experts, inter_size), or (num_experts, 2 * inter_size) for swiglu</dd>
<dd>2D optional tensor with shape (num_experts, fusion_size * inter_size)</dd>
<dt><tt>fc2_experts_weights</tt> : T1</dt>
<dd>3D input tensor with shape (num_experts, hidden_size, inter_size) or (num_experts, hidden_size, inter_size / 2) for 4 bits</dd>
<dd>3D tensor with shape (num_experts, hidden_size, inter_size / pack_size)</dd>
<dt><tt>fc2_scales</tt> : T2</dt>
<dd>2D input tensor with shape (num_experts, hidden_size)</dd>
<dd>2D tensor with shape (num_experts, hidden_size), or 3D tensor with shape (num_experts, hidden_size, inter_size / block_size) when block_size is provided.</dd>
<dt><tt>fc2_experts_bias</tt> (optional) : T</dt>
<dd>2D optional input tensor with shape (num_experts, hidden_size)</dd>
<dd>2D optional tensor with shape (num_experts, hidden_size)</dd>
<dt><tt>fc3_experts_weights</tt> (optional) : T1</dt>
<dd>3D optional input tensor with shape (num_experts, inter_size, hidden_size) or (num_experts, inter_size, hidden_size / 2)</dd>
<dd>3D optional tensor with shape (num_experts, inter_size, hidden_size / pack_size)</dd>
<dt><tt>fc3_scales</tt> (optional) : T2</dt>
<dd>2D optional input tensor with shape (num_experts, inter_size)</dd>
<dd>2D optional tensor with shape (num_experts, inter_size), or 3D optional tensor with shape (num_experts, inter_size, hidden_size / block_size) when block_size is provided.</dd>
<dt><tt>fc3_experts_bias</tt> (optional) : T</dt>
<dd>2D optional input tensor with shape (num_experts, inter_size)</dd>
<dd>2D optional tensor with shape (num_experts, inter_size)</dd>
</dl>

#### Outputs

<dl>
<dt><tt>output</tt> : T</dt>
<dd>2D input tensor with shape (num_rows, hidden_size) or 3D input tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dd>Output tensor with the same shape as the input</dd>
</dl>

#### Type Constraints
5 changes: 5 additions & 0 deletions docs/python/README.rst
@@ -7,6 +7,11 @@ For more information on ONNX Runtime, please see `aka.ms/onnxruntime <https://aka.ms/onnxruntime>`_.

Changes
-------
1.23.2
^^^^^^

Release Notes : https://github.com/Microsoft/onnxruntime/releases/tag/v1.23.2


1.23.1
^^^^^^
2 changes: 1 addition & 1 deletion js/common/lib/version.ts
@@ -4,4 +4,4 @@
// This file is generated by /js/scripts/update-version.ts
// Do not modify file content manually.

export const version = '1.23.1';
export const version = '1.23.2';
4 changes: 2 additions & 2 deletions js/common/package-lock.json

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion js/common/package.json
@@ -2,7 +2,7 @@
"license": "MIT",
"type": "module",
"name": "onnxruntime-common",
"version": "1.23.1",
"version": "1.23.2",
"repository": {
"url": "https://github.com/Microsoft/onnxruntime.git",
"type": "git"
2 changes: 1 addition & 1 deletion js/node/lib/version.ts
@@ -4,4 +4,4 @@
// This file is generated by /js/scripts/update-version.ts
// Do not modify file content manually.

export const version = '1.23.1';
export const version = '1.23.2';
6 changes: 3 additions & 3 deletions js/node/package-lock.json

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion js/node/package.json
@@ -11,7 +11,7 @@
6
]
},
"version": "1.23.1",
"version": "1.23.2",
"dependencies": {
"adm-zip": "^0.5.16",
"global-agent": "^3.0.0",
2 changes: 1 addition & 1 deletion js/react_native/lib/version.ts
@@ -4,4 +4,4 @@
// This file is generated by /js/scripts/update-version.ts
// Do not modify file content manually.

export const version = '1.23.1';
export const version = '1.23.2';
6 changes: 3 additions & 3 deletions js/react_native/package-lock.json

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion js/react_native/package.json
@@ -37,7 +37,7 @@
"registry": "https://registry.npmjs.org/"
},
"source": "lib/index",
"version": "1.23.1",
"version": "1.23.2",
"main": "dist/commonjs/index",
"homepage": "https://github.com/microsoft/onnxruntime/blob/main/js/react_native/README.md",
"files": [
2 changes: 1 addition & 1 deletion js/web/lib/version.ts
@@ -4,4 +4,4 @@
// This file is generated by /js/scripts/update-version.ts
// Do not modify file content manually.

export const version = '1.23.1';
export const version = '1.23.2';
6 changes: 3 additions & 3 deletions js/web/package-lock.json

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion js/web/package.json
@@ -7,7 +7,7 @@
"type": "git"
},
"author": "fs-eire",
"version": "1.23.1",
"version": "1.23.2",
"jsdelivr": "dist/ort.min.js",
"dependencies": {
"flatbuffers": "^25.1.24",
2 changes: 1 addition & 1 deletion onnxruntime/__init__.py
@@ -8,7 +8,7 @@
or the `Github project <https://github.com/microsoft/onnxruntime/>`_.
"""

__version__ = "1.23.1"
__version__ = "1.23.2"
__author__ = "Microsoft"

# we need to do device version validation (for example to check Cuda version for an onnxruntime-training package).
@@ -60,7 +60,7 @@ namespace cutlass_kernels {
template <typename ActivationType, typename WeightType, typename ScaleZeroType, typename BiasType, typename OutputType,
cutlass::WeightOnlyQuantOp QuantOp, typename EpilogueTag, typename CTAShape, typename ClusterShape,
typename MainloopScheduleType, typename EpilogueScheduleType>
#ifdef COMPILE_HOPPER_TMA_GEMMS
#if defined(COMPILE_HOPPER_TMA_GEMMS) && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ == 900) && defined(__NV_SASS_VERSION__)
void sm90_generic_mixed_gemm_kernelLauncher(
ActivationType const* A, WeightType const* B,
ScaleZeroType const* weight_scales, ScaleZeroType const* weight_zero_points, BiasType const* biases,
@@ -269,6 +269,7 @@ void sm90_generic_mixed_gemm_kernelLauncher(
}
}
#else // COMPILE_HOPPER_TMA_GEMMS
// This stub is now used for ALL non-SASS or non-SM90A compilation passes, including the 90-virtual (PTX) pass.
void sm90_generic_mixed_gemm_kernelLauncher(ActivationType const*, WeightType const*,
ScaleZeroType const*, ScaleZeroType const*, BiasType const*,
float const, OutputType*, int, int, int, int const, tkc::CutlassGemmConfig,
2 changes: 2 additions & 0 deletions onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_profiler.cc
@@ -14,6 +14,7 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#if USE_FPA_INTB_GEMM
#include "contrib_ops/cuda/llm/fpA_intB_gemm_profiler.h"
#include "contrib_ops/cuda/llm/common/workspace.h"

@@ -97,3 +98,4 @@ bool WeightOnlyGroupwiseQuantGemmPluginProfiler::checkTactic(int m, int /*n*/, i
}

} // namespace onnxruntime::llm::kernels::weight_only
#endif