Refs #237: SVE based lookup table #261

GoWind · 2025-10-19T20:27:20Z

Initial SVE benchmarks, run on a Google C4A (Axion) instance. The C4A appears to be a 128-bit SVE implementation. In this first run, `sz_lookup_sve` is about 28% faster than the serial implementation and about 4% faster than the NEON implementation (2.45 vs. 2.36 GiB/s).

I will run further tests on the Graviton series of processors and update the benchmarks.

govind@instance-20251019-161544:~/StringZilla/build$ ./stringzilla_bench_memory_cpp20 
Welcome to StringZilla!
Building up the environment...
Environment built with the following settings:
 - Dataset path: leipzig1M.txt
 - Time limit: 10 seconds per benchmark (10 per stress-test)
 - Tokenization mode: line
 - Seed: 0 (will avoid shuffling)
 - Stress-testing: yes
 - Loaded dataset size: 67108864 bytes
 - Number of tokens: 524288
 - Mean token length: 122.25 bytes
Compile-time capabilities:
- Uses Westmere: no 
- Uses Haswell: no 
- Uses Skylake: no 
- Uses Ice Lake: no 
- Uses NEON: yes 
- Uses SVE: yes 
- Uses SVE2: no 
Starting low-level memory-operation benchmarks...

Benchmarking `sz_copy_serial(align)`:
> Throughput: 4.32 GiB/s @ 26.38 ns/call

Benchmarking `sz_copy_serial(shift)`:
> Throughput: 4.36 GiB/s @ 26.13 ns/call
> + 0.9 % against `sz_copy_serial(align)`

Benchmarking `sz_copy_neon(align)`:
> Throughput: 4.07 GiB/s @ 27.95 ns/call
> - 5.6 % against `sz_copy_serial(align)`

Benchmarking `sz_copy_neon(shift)`:
> Throughput: 4.37 GiB/s @ 26.05 ns/call
> + 1.2 % against `sz_copy_serial(align)`
> + 0.3 % against `sz_copy_serial(shift)`

Benchmarking `sz_copy_sve(align)`:
> Throughput: 5.33 GiB/s @ 21.34 ns/call
> + 23.6 % against `sz_copy_serial(align)`

Benchmarking `sz_copy_sve(shift)`:
> Throughput: 5.49 GiB/s @ 20.75 ns/call
> + 27.1 % against `sz_copy_serial(align)`
> + 26.0 % against `sz_copy_serial(shift)`

Benchmarking `std::memcpy(align)`:
> Throughput: 5.76 GiB/s @ 19.78 ns/call
> + 33.4 % against `sz_copy_serial(align)`

Benchmarking `std::memcpy(shift)`:
> Throughput: 6.06 GiB/s @ 18.78 ns/call
> + 40.4 % against `sz_copy_serial(align)`
> + 39.1 % against `sz_copy_serial(shift)`

Benchmarking `sz_move_serial(by1)`:
> Throughput: 5.86 GiB/s @ 38.83 ns/call

Benchmarking `sz_move_serial(by64)`:
> Throughput: 5.83 GiB/s @ 39.06 ns/call
> - 0.6 % against `sz_move_serial(by1)`

Benchmarking `sz_move_neon(by1)`:
> Throughput: 6.61 GiB/s @ 34.46 ns/call
> + 12.7 % against `sz_move_serial(by1)`

Benchmarking `sz_move_neon(by64)`:
> Throughput: 7.16 GiB/s @ 31.80 ns/call
> + 22.1 % against `sz_move_serial(by1)`
> + 22.8 % against `sz_move_serial(by64)`

Benchmarking `sz_move_sve(by1)`:
> Throughput: 8.36 GiB/s @ 27.24 ns/call
> + 42.6 % against `sz_move_serial(by1)`

Benchmarking `sz_move_sve(by64)`:
> Throughput: 8.45 GiB/s @ 26.95 ns/call
> + 44.1 % against `sz_move_serial(by1)`
> + 44.9 % against `sz_move_serial(by64)`

Benchmarking `std::memmove(by1)`:
> Throughput: 10.88 GiB/s @ 20.94 ns/call
> + 85.5 % against `sz_move_serial(by1)`

Benchmarking `std::memmove(by64)`:
> Throughput: 11.01 GiB/s @ 20.67 ns/call
> + 87.8 % against `sz_move_serial(by1)`
> + 88.9 % against `sz_move_serial(by64)`

Benchmarking `sz_fill_serial`:
> Throughput: 4.22 GiB/s @ 26.98 ns/call

Benchmarking `sz_fill_random_serial`:
> Throughput: 635.80 MiB/s @ 184.77 ns/call
> - 6.8 x against `sz_fill_serial`

Benchmarking `sz_fill_neon`:
> Throughput: 5.07 GiB/s @ 22.46 ns/call
> + 20.1 % against `sz_fill_serial`

Benchmarking `sz_fill_sve`:
> Throughput: 4.49 GiB/s @ 25.35 ns/call
> + 6.4 % against `sz_fill_serial`

Benchmarking `fill<std::memset>`:
> Throughput: 6.10 GiB/s @ 18.67 ns/call
> + 44.5 % against `sz_fill_serial`

Benchmarking `fill<std::random_device>`:
> Throughput: 245.41 MiB/s @ 478.95 ns/call
> - 17.6 x against `sz_fill_serial`
> - 2.6 x against `sz_fill_random_serial`

Benchmarking `sz_lookup_serial`:
> Throughput: 1.91 GiB/s @ 59.48 ns/call

Benchmarking `sz_lookup_neon`:
> Throughput: 2.36 GiB/s @ 48.22 ns/call
> + 23.4 % against `sz_lookup_serial`

Benchmarking `sz_lookup_sve`:
> Throughput: 2.45 GiB/s @ 46.49 ns/call
> + 28.0 % against `sz_lookup_serial`

Benchmarking `lookup<std::transform>`:
> Throughput: 1.93 GiB/s @ 59.02 ns/call
> + 0.8 % against `sz_lookup_serial`
All benchmarks passed.

ashvardanian · 2025-10-21T16:15:15Z

I'm afraid the benchmarks aren't very representative, because the mean token length is lower than the cutoff value for the SVE path:

Mean token length: 122.25 bytes

@GoWind, any chance you can run either without this condition:

    if (length <= 128) {
        sz_lookup_serial(target, length, source, lut);
        return;
    }

Or by using a dataset with longer lines, such as xlsum.csv with 4 KB lines?

Refs ashvardanian#237: SVE based lookup table

78ea845

GoWind force-pushed the sve_lut branch from 417514e to 78ea845 Compare October 19, 2025 20:28

ashvardanian changed the base branch from main to main-dev October 21, 2025 16:10
