GoWind
Contributor

@GoWind GoWind commented Oct 19, 2025

Running SVE benchmarks initially on a Google C4A Axion processor. The C4A seems to be a 128-bit SVE implementation. In the initial run, `sz_lookup_sve` shows a 28% improvement over the serial implementation (about 5 percentage points more than the NEON implementation's gain).

I will run further tests on the Graviton series of processors and update the benchmarks.

govind@instance-20251019-161544:~/StringZilla/build$ ./stringzilla_bench_memory_cpp20 
Welcome to StringZilla!
Building up the environment...
Environment built with the following settings:
 - Dataset path: leipzig1M.txt
 - Time limit: 10 seconds per benchmark (10 per stress-test)
 - Tokenization mode: line
 - Seed: 0 (will avoid shuffling)
 - Stress-testing: yes
 - Loaded dataset size: 67108864 bytes
 - Number of tokens: 524288
 - Mean token length: 122.25 bytes
Compile-time capabilities:
- Uses Westmere: no 
- Uses Haswell: no 
- Uses Skylake: no 
- Uses Ice Lake: no 
- Uses NEON: yes 
- Uses SVE: yes 
- Uses SVE2: no 
Starting low-level memory-operation benchmarks...

Benchmarking `sz_copy_serial(align)`:
> Throughput: 4.32 GiB/s @ 26.38 ns/call

Benchmarking `sz_copy_serial(shift)`:
> Throughput: 4.36 GiB/s @ 26.13 ns/call
> + 0.9 % against `sz_copy_serial(align)`

Benchmarking `sz_copy_neon(align)`:
> Throughput: 4.07 GiB/s @ 27.95 ns/call
> - 5.6 % against `sz_copy_serial(align)`

Benchmarking `sz_copy_neon(shift)`:
> Throughput: 4.37 GiB/s @ 26.05 ns/call
> + 1.2 % against `sz_copy_serial(align)`
> + 0.3 % against `sz_copy_serial(shift)`

Benchmarking `sz_copy_sve(align)`:
> Throughput: 5.33 GiB/s @ 21.34 ns/call
> + 23.6 % against `sz_copy_serial(align)`

Benchmarking `sz_copy_sve(shift)`:
> Throughput: 5.49 GiB/s @ 20.75 ns/call
> + 27.1 % against `sz_copy_serial(align)`
> + 26.0 % against `sz_copy_serial(shift)`

Benchmarking `std::memcpy(align)`:
> Throughput: 5.76 GiB/s @ 19.78 ns/call
> + 33.4 % against `sz_copy_serial(align)`

Benchmarking `std::memcpy(shift)`:
> Throughput: 6.06 GiB/s @ 18.78 ns/call
> + 40.4 % against `sz_copy_serial(align)`
> + 39.1 % against `sz_copy_serial(shift)`

Benchmarking `sz_move_serial(by1)`:
> Throughput: 5.86 GiB/s @ 38.83 ns/call

Benchmarking `sz_move_serial(by64)`:
> Throughput: 5.83 GiB/s @ 39.06 ns/call
> - 0.6 % against `sz_move_serial(by1)`

Benchmarking `sz_move_neon(by1)`:
> Throughput: 6.61 GiB/s @ 34.46 ns/call
> + 12.7 % against `sz_move_serial(by1)`

Benchmarking `sz_move_neon(by64)`:
> Throughput: 7.16 GiB/s @ 31.80 ns/call
> + 22.1 % against `sz_move_serial(by1)`
> + 22.8 % against `sz_move_serial(by64)`

Benchmarking `sz_move_sve(by1)`:
> Throughput: 8.36 GiB/s @ 27.24 ns/call
> + 42.6 % against `sz_move_serial(by1)`

Benchmarking `sz_move_sve(by64)`:
> Throughput: 8.45 GiB/s @ 26.95 ns/call
> + 44.1 % against `sz_move_serial(by1)`
> + 44.9 % against `sz_move_serial(by64)`

Benchmarking `std::memmove(by1)`:
> Throughput: 10.88 GiB/s @ 20.94 ns/call
> + 85.5 % against `sz_move_serial(by1)`

Benchmarking `std::memmove(by64)`:
> Throughput: 11.01 GiB/s @ 20.67 ns/call
> + 87.8 % against `sz_move_serial(by1)`
> + 88.9 % against `sz_move_serial(by64)`

Benchmarking `sz_fill_serial`:
> Throughput: 4.22 GiB/s @ 26.98 ns/call

Benchmarking `sz_fill_random_serial`:
> Throughput: 635.80 MiB/s @ 184.77 ns/call
> - 6.8 x against `sz_fill_serial`

Benchmarking `sz_fill_neon`:
> Throughput: 5.07 GiB/s @ 22.46 ns/call
> + 20.1 % against `sz_fill_serial`

Benchmarking `sz_fill_sve`:
> Throughput: 4.49 GiB/s @ 25.35 ns/call
> + 6.4 % against `sz_fill_serial`

Benchmarking `fill<std::memset>`:
> Throughput: 6.10 GiB/s @ 18.67 ns/call
> + 44.5 % against `sz_fill_serial`

Benchmarking `fill<std::random_device>`:
> Throughput: 245.41 MiB/s @ 478.95 ns/call
> - 17.6 x against `sz_fill_serial`
> - 2.6 x against `sz_fill_random_serial`

Benchmarking `sz_lookup_serial`:
> Throughput: 1.91 GiB/s @ 59.48 ns/call

Benchmarking `sz_lookup_neon`:
> Throughput: 2.36 GiB/s @ 48.22 ns/call
> + 23.4 % against `sz_lookup_serial`

Benchmarking `sz_lookup_sve`:
> Throughput: 2.45 GiB/s @ 46.49 ns/call
> + 28.0 % against `sz_lookup_serial`

Benchmarking `lookup<std::transform>`:
> Throughput: 1.93 GiB/s @ 59.02 ns/call
> + 0.8 % against `sz_lookup_serial`
All benchmarks passed.
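
For readers comparing rows in the log above, here is a minimal sketch of how a relative gain can be derived from the reported per-call latencies. The helper name `relative_gain_percent` is hypothetical and not part of StringZilla, and the benchmark harness may compute its percentages from unrounded measurements, so results can differ from the log by a tenth of a percent or so.

```c
#include <assert.h>

/* Hypothetical helper, not a StringZilla API.
 * Returns the relative throughput gain of a candidate over a baseline,
 * in percent, computed from per-call latencies in nanoseconds.
 * A lower latency means a higher throughput, hence baseline / candidate. */
static double relative_gain_percent(double baseline_ns, double candidate_ns) {
    assert(baseline_ns > 0.0 && candidate_ns > 0.0);
    return (baseline_ns / candidate_ns - 1.0) * 100.0;
}
```

With the `sz_lookup` figures above, `relative_gain_percent(59.48, 46.49)` lands close to the logged 28% for SVE against serial, while the direct SVE-against-NEON comparison, `relative_gain_percent(48.22, 46.49)`, comes out a little under 4%.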

@ashvardanian ashvardanian changed the base branch from main to main-dev October 21, 2025 16:10
@ashvardanian
Owner

I'm afraid the benchmarks aren't very representative, because the mean token length is lower than the cutoff below which the SVE kernels fall back to the serial implementation:

Mean token length: 122.25 bytes

@GoWind, any chance you can run either without this condition:

    if (length <= 128) {
        sz_lookup_serial(target, length, source, lut);
        return;
    }

Or with a dataset that has longer lines, like xlsum.csv with its 4 KB lines?
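
To illustrate why the cutoff matters for this dataset, here is a portable sketch of the dispatch pattern. Everything in it is an assumption made for illustration: the `_sketch` names, the `path_counts_t` counters, and the `SKETCH_CUTOFF` constant are hypothetical, and the "vector" path simply reuses the byte-wise loop so the sketch compiles without SVE. Only the `length <= 128` condition and the `(target, length, source, lut)` argument order come from the snippet quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the `length <= 128` condition quoted above. */
#define SKETCH_CUTOFF 128

/* Hypothetical counters recording which path each call takes. */
typedef struct {
    size_t serial_calls; /* inputs routed to the serial fallback */
    size_t vector_calls; /* inputs that reach the vectorized body */
} path_counts_t;

/* Byte-wise table lookup: target[i] = lut[source[i]]. */
static void lookup_serial_sketch(unsigned char *target, size_t length,
                                 unsigned char const *source,
                                 unsigned char const *lut) {
    for (size_t i = 0; i < length; ++i) target[i] = lut[source[i]];
}

/* Stand-in for a vectorized kernel. It produces the same result with a
 * plain loop, so the sketch stays portable; it is NOT an SVE implementation. */
static void lookup_vector_sketch(unsigned char *target, size_t length,
                                 unsigned char const *source,
                                 unsigned char const *lut) {
    lookup_serial_sketch(target, length, source, lut);
}

/* Dispatcher with a length cutoff: short inputs never reach the vector path. */
static void lookup_dispatch_sketch(unsigned char *target, size_t length,
                                   unsigned char const *source,
                                   unsigned char const *lut,
                                   path_counts_t *counts) {
    assert(counts != NULL);
    if (length <= SKETCH_CUTOFF) {
        lookup_serial_sketch(target, length, source, lut);
        counts->serial_calls++;
        return;
    }
    lookup_vector_sketch(target, length, source, lut);
    counts->vector_calls++;
}
```

Feeding this dispatcher a 122-byte input increments only `serial_calls`, while a 4096-byte input increments `vector_calls`. With a mean token length of 122.25 bytes, a large share of the benchmark calls likely takes the fallback branch, so the measured "SVE" throughput would partly reflect the serial code.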
