
Contributing to the project

whisprer edited this page Aug 4, 2025 · 1 revision

# Contributing to Universal RNG Library

## 🎯 Welcome Performance Engineers!

The Universal RNG Library thrives on contributions that push the boundaries of random number generation performance. Whether you're optimizing SIMD kernels, implementing new algorithms, or improving cross-platform compatibility, your expertise makes this library faster for everyone.

## 🚀 Quick Start for Contributors

### Prerequisites and Setup

```bash
# Install development dependencies
sudo apt install build-essential cmake ninja-build perf

# Clone and set up the development environment
git clone [repository-url] universal-rng
cd universal-rng
git checkout develop   # Development branch

# Create a development build
mkdir build-dev && cd build-dev
cmake -S .. -B . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_TESTING=ON
cmake --build . -j$(nproc)

# Verify everything works
./rng_selftest && ./rng_benchmark --quick
```

### Your First Contribution

1. Fork the repository and create a feature branch
2. Make changes with comprehensive tests
3. Run benchmarks to measure the performance impact
4. Submit a PR with a performance analysis
5. Collaborate on code review and optimization

## 📊 Performance Standards

### Benchmark Requirements

Every performance-related contribution must include before/after benchmarks:

```bash
# Generate baseline performance report
./rng_benchmark --algorithm=all --output=baseline.json

# After your changes
./rng_benchmark --algorithm=all --output=optimized.json

# Compare results
./benchmark_compare baseline.json optimized.json
```

### Performance Targets by Component

#### Single-Value Generation

Target: Match reference implementation speed

- Xoroshiro128+: ≥800 M ops/sec (64-bit)
- WyRand: ≥900 M ops/sec (64-bit)
- No regression tolerance: 0%

#### Batch Generation (AVX2)

Target: minimum speedup ratios over the scalar path

- 64-bit: ≥4.0x vs scalar
- 128-bit: ≥4.5x vs scalar
- 256-bit: ≥3.5x vs scalar
- Memory bandwidth efficiency: ≥80%

#### Cross-Platform Consistency

Performance variance tolerance: ≤15% across platforms

- Windows MSVC vs Linux GCC
- Intel vs AMD architectures
- Different AVX2 implementations

## 🏗️ Development Workflow

### Branch Strategy

```
main         ← Stable releases only
develop      ← Integration branch for features
feature/xxx  ← Individual feature development
perf/xxx     ← Performance optimization branches
hotfix/xxx   ← Critical bug fixes
platform/xxx ← Platform-specific improvements
```

### Commit Message Format

```
type(scope): brief description

Detailed explanation including:
- Performance impact measurement
- Platform compatibility notes
- Breaking changes (if any)

Performance: [specific metrics]
Platforms: [tested on which platforms]
```

Types: `feat`, `fix`, `perf`, `docs`, `style`, `refactor`, `test`, `build`

Example: `perf(avx2): optimize xoroshiro128++ batch generation`

```
Implement direct AVX2 intrinsics replacing the scalar wrapper. Eliminates
function call overhead and improves register usage. Uses _mm256_set_epi64x
for state initialization instead of loads.

Performance: 4.2x speedup in batch mode (was 2.8x)
Platforms: Linux GCC 11, Windows MSVC 2022
```

## 🎯 Code Standards

### C++ Style Guidelines

#### Modern C++17 Patterns

```cpp
// GOOD: template metaprogramming for zero-overhead generator selection
template<Algorithm A, typename T>
constexpr auto create_optimized_generator() noexcept {
    if constexpr (A == Algorithm::Xoroshiro128Plus) {
        return XoroshiroGenerator{};
    } else if constexpr (A == Algorithm::WyRand) {
        return WyRandGenerator{};
    }
}
```

```cpp
// GOOD: RAII and smart pointers
class SIMDGenerator {
    std::unique_ptr<uint64_t[], AlignedDeleter> state_;
public:
    explicit SIMDGenerator(size_t state_size)
        : state_{aligned_alloc<uint64_t>(state_size, 32)} {}
};

// BAD: raw pointers and manual memory management
uint64_t* state = new uint64_t[4];  // DON'T DO THIS
```

#### SIMD Implementation Standards

```cpp
// Template for new SIMD implementations
class AlgorithmNameAVX2 {
    alignas(32) __m256i state[2];  // Properly aligned state

public:
    explicit AlgorithmNameAVX2(uint64_t seed) noexcept;

    // Hot path: aggressive optimization attributes
    [[gnu::hot, gnu::always_inline]]
    __m256i generate_batch() noexcept {
        // Minimize register pressure
        const __m256i s0 = state[0];
        const __m256i s1 = state[1];

        // Algorithm-specific AVX2 operations
        const __m256i result = _mm256_add_epi64(s0, s1);

        // Update state in-place
        state[0] = /* ... */;
        state[1] = /* ... */;

        return result;
    }

    void generate_batch(uint64_t* dest, size_t count) noexcept;
    std::string get_implementation_name() const noexcept;
};
```

### Performance Coding Principles

#### Memory Layout Optimization

```cpp
// GOOD: cache-line aligned, minimal padding
struct alignas(64) OptimizedState {
    __m256i simd_state[2];     // 64 bytes, exactly one cache line
    uint64_t scalar_fallback;  // 8 bytes
    uint32_t counter;          // 4 bytes
    // 52 bytes of padding remain in the second cache line
};

// BAD: poor alignment, cache-line splits
struct PoorState {
    uint64_t scalar_state;   // Might split cache lines
    __m256i simd_state[2];   // Unaligned AVX2 operations
};
```

#### Branch Elimination

```cpp
// GOOD: compile-time dispatch
template <bool UseAVX2>
void generate_impl(uint64_t* dest, size_t count) {
    if constexpr (UseAVX2) {
        generate_avx2(dest, count);
    } else {
        generate_scalar(dest, count);
    }
}

// BAD: runtime branching in the hot loop
for (size_t i = 0; i < count; ++i) {
    if (has_avx2) {  // Branch in the hot path
        dest[i] = generate_avx2_single();
    } else {
        dest[i] = generate_scalar_single();
    }
}
```

## 🧪 Testing Requirements

### Test Categories and Requirements

#### 1. Correctness Tests

```cpp
// Algorithm implementation correctness
TEST(XoroshiroTest, MatchesReferenceImplementation) {
    auto our_impl = UniversalRNG::Generator<uint64_t, Algorithm::Xoroshiro128Plus>{12345};
    auto reference = ReferenceXoroshiro128Plus{12345};

    for (int i = 0; i < 10000; ++i) {
        EXPECT_EQ(our_impl.next(), reference.next())
            << "Mismatch at iteration " << i;
    }
}

// SIMD correctness vs scalar
TEST(SIMDTest, AVX2MatchesScalar) {
    const uint64_t seed = 42;
    auto avx2_gen = create_avx2_generator(seed);
    auto scalar_gen = create_scalar_generator(seed);

    std::array<uint64_t, 1000> avx2_results, scalar_results;

    avx2_gen.generate_batch(avx2_results.data(), avx2_results.size());
    for (auto& val : scalar_results) {
        val = scalar_gen.next();
    }

    EXPECT_EQ(avx2_results, scalar_results);
}
```

#### 2. Performance Tests

```cpp
// Benchmark integration in tests
BENCHMARK(SingleGeneration_Xoroshiro128Plus) {
    auto rng = UniversalRNG::create_generator<64>(Algorithm::Xoroshiro128Plus);

    for (auto _ : state) {
        benchmark::DoNotOptimize(rng.next());
    }

    state.SetItemsProcessed(state.iterations());
}

BENCHMARK(BatchGeneration_AVX2_1K) {
    auto rng = UniversalRNG::create_batch_generator<64>();
    std::array<uint64_t, 1000> buffer;

    for (auto _ : state) {
        rng.generate_batch(buffer.data(), buffer.size());
        benchmark::DoNotOptimize(buffer.data());
    }

    state.SetItemsProcessed(state.iterations() * buffer.size());
}
```

#### 3. Statistical Quality Tests

```cpp
// Basic randomness validation
TEST(StatisticalTest, UniformDistribution) {
    auto rng = UniversalRNG::create_generator<32>();
    std::vector<uint32_t> samples(1000000);

    for (auto& sample : samples) {
        sample = rng.next();
    }

    // Chi-square test for uniformity
    EXPECT_TRUE(passes_chi_square_test(samples));
}
```

```cpp
// Correlation testing
TEST(StatisticalTest, NoSerialCorrelation) {
    auto rng = UniversalRNG::create_generator<64>();

    // Generate pairs and test for correlation
    std::vector<std::pair<uint64_t, uint64_t>> pairs(100000);
    for (auto& [first, second] : pairs) {
        first = rng.next();
        second = rng.next();
    }

    EXPECT_LT(calculate_correlation(pairs), 0.01);
}
```

#### 4. Platform Compatibility Tests

```cpp
// Cross-platform consistency
TEST(PlatformTest, ConsistentResults) {
    const uint64_t seed = 123456789;

    // Test that the same seed produces the same sequence across platforms
    auto rng = UniversalRNG::create_generator<64>(Algorithm::Xoroshiro128Plus, seed);

    // Known-good sequence from the reference platform
    const std::array<uint64_t, 10> expected = {
        0x1234567890ABCDEF, 0xFEDCBA0987654321, /* ... */
    };

    for (size_t i = 0; i < expected.size(); ++i) {
        EXPECT_EQ(rng.next(), expected[i])
            << "Platform difference at index " << i;
    }
}
```

### Continuous Integration Requirements

#### Performance Regression Detection

```yaml
# .github/workflows/performance.yml
- name: Run Performance Benchmarks
  run: |
    ./build/rng_benchmark --benchmark_format=json > current_perf.json
    python scripts/compare_performance.py baseline_perf.json current_perf.json

- name: Check Performance Regression
  run: |
    # Fail if any benchmark shows >5% regression
    python scripts/performance_gate.py --threshold=0.05
```

## 🎨 Documentation Standards

### Code Documentation Requirements

#### Header Comments for Algorithms

```cpp
/**
 * @brief AVX2-optimized implementation of Xoroshiro128++
 *
 * This implementation uses 256-bit AVX2 registers to process 4 parallel
 * streams of the Xoroshiro128++ algorithm. Each register lane maintains
 * independent state for maximum parallelism.
 *
 * @performance 4.2x speedup over scalar implementation on Intel Skylake+
 * @memory 64-byte aligned state for optimal SIMD access
 * @reference Blackman & Vigna (2018), "Scrambled Linear Pseudorandom Generators"
 * @warning Requires AVX2 support. Use CPU::has_avx2() to verify before use.
 */
class Xoroshiro128PlusAVX2 {
    // Implementation...
};
```

#### Performance-Critical Function Documentation

```cpp
/**
 * @brief Generate a batch of random numbers using AVX2 SIMD
 *
 * Generates exactly `count` random numbers using parallel SIMD lanes.
 * The output buffer must be 32-byte aligned for optimal performance.
 *
 * @param dest  Output buffer (must be 32-byte aligned)
 * @param count Number of values to generate
 * @performance ~1200 M numbers/sec on Intel Core i7-9700K
 * @complexity O(count/4) due to 4-way SIMD parallelism
 * @pre dest != nullptr
 * @pre count > 0
 * @pre dest has space for at least count * sizeof(uint64_t) bytes
 */
[[gnu::hot]]
void generate_batch(uint64_t* dest, size_t count) noexcept;
```

### Required Documentation for New Features

#### Algorithm Implementation Guide

```markdown
# Algorithm: [Name]

## Overview
Brief description of the algorithm, its properties, and use cases.

## Performance Characteristics
- **Period**: [full period length]
- **Speed**: [operations per second on reference hardware]
- **Quality**: [statistical test results, known limitations]
- **Memory**: [state size and alignment requirements]

## Implementation Notes
- Platform-specific optimizations
- SIMD considerations
- Numerical precision requirements

## References
- Original paper citations
- Implementation references
- Performance comparison studies

## Usage Example
// Complete working example
```

#### Performance Optimization Guide
```markdown
# Optimization: [Description]

## Problem Statement
What performance issue this addresses.

## Solution Approach
Technical approach and implementation strategy.

## Performance Impact
- **Before**: [baseline measurements]
- **After**: [optimized measurements]
- **Improvement**: [speedup ratio and absolute gains]

## Platform Impact
Results across different platforms and architectures.

## Code Changes
Key implementation changes with explanations.
```

## 🐛 Issue Reporting

### Bug Report Template

```markdown
## Bug Description
Clear description of the incorrect behavior.

## Environment
- **OS**: [Windows 10/Ubuntu 20.04/macOS 12]
- **Compiler**: [GCC 11.2/MSVC 2022/Clang 13]
- **CPU**: [Intel i7-9700K/AMD Ryzen 5900X]
- **Build Type**: [Release/Debug/RelWithDebInfo]

## Reproduction Steps
1. Minimal code to reproduce the issue
2. Expected behavior
3. Actual behavior

## Performance Impact
If applicable, performance measurements showing the issue.

## Additional Context
Stack traces, compiler warnings, or other relevant information.
```

### Performance Issue Template

```markdown
## Performance Issue
Description of the suboptimal performance.

## Benchmark Results
# Command used
./rng_benchmark --algorithm=Xoroshiro128Plus --bitwidth=64

# Results showing the issue
Algorithm: Xoroshiro128Plus
Expected: >800 M ops/sec
Actual: 650 M ops/sec (19% below target)

## Hardware Details
- CPU Model: [exact model and generation]
- SIMD Support: [AVX2/AVX-512/NEON available]
- Memory: [speed and configuration]
- Compiler Flags: [optimization settings used]

## Proposed Solution
If you have ideas for optimization approaches.
```

## 🏆 Areas Where Help is Needed

### High-Impact Contribution Opportunities

#### 1. SIMD Optimization
- **Skills needed**: AVX2/AVX-512 intrinsics, performance analysis
- **Current gap**: Single-mode AVX2 underperforming by 30-70%
- **Target**: Match reference implementation speed
- **Impact**: Core library performance foundation

#### 2. GPU Acceleration
- **Skills needed**: OpenCL/CUDA, GPU architecture knowledge
- **Current status**: Framework exists, needs implementation
- **Target**: 10-100x speedup for large batch generation
- **Impact**: High-throughput applications

#### 3. ARM/NEON Implementation
- **Skills needed**: ARM assembly, NEON intrinsics
- **Current status**: Placeholder implementations
- **Target**: Competitive performance on ARM platforms
- **Impact**: Mobile and embedded device support

#### 4. Cryptographically Secure Algorithms
- **Skills needed**: Cryptography, security analysis
- **Current status**: Planned feature
- **Target**: ChaCha20, AES-CTR implementations
- **Impact**: Security-sensitive applications

#### 5. Cross-Platform Testing
- **Skills needed**: CI/CD, cross-compilation
- **Current gap**: Limited macOS and ARM testing
- **Target**: Comprehensive platform coverage
- **Impact**: Reliability and compatibility

### Getting Started Contributions

#### Good First Issues
- **Documentation improvements**: Add usage examples
- **Test coverage**: Expand statistical quality tests
- **Build system**: Improve CMake cross-platform support
- **Benchmarking**: Add new benchmark scenarios

#### Medium Complexity
- **Algorithm implementation**: Add new PRNG algorithms
- **Performance analysis**: Identify optimization opportunities
- **API design**: Improve usability without sacrificing performance

## 🤝 Code Review Process

### Pull Request Requirements
- [ ] **All tests pass** (unit, integration, performance)
- [ ] **Performance benchmarks included** with analysis
- [ ] **Documentation updated** (API docs, usage guides)
- [ ] **Cross-platform compatibility verified**
- [ ] **Code review approval** from maintainer

### Review Criteria

#### Technical Excellence
- Correctness of algorithm implementation
- Performance impact measurement and analysis
- Code quality and maintainability
- Memory safety and resource management

#### Performance Focus
- Benchmark results demonstrate improvement or no regression
- SIMD implementations properly optimized
- Memory access patterns optimized for cache efficiency
- Compiler optimization compatibility

#### Maintainability
- Clear, self-documenting code
- Comprehensive test coverage
- Platform abstraction where appropriate
- Future extensibility considerations

## 🎯 Recognition

### Contributor Recognition
- **Performance Hall of Fame**: Major optimization achievements
- **Algorithm Contributors**: Implementation of new algorithms
- **Platform Enablers**: Cross-platform compatibility improvements
- **Release Notes**: Significant contributions highlighted

### Optimization Achievements
Contributors who achieve significant performance improvements:
- **10%+ improvement**: Mentioned in release notes
- **2x+ improvement**: Featured in performance hall of fame
- **New platform support**: Platform enabler recognition
- **New algorithm**: Algorithm contributor status

## 📞 Getting Help

### Communication Channels
- **GitHub Issues**: Technical problems and bug reports
- **GitHub Discussions**: Design decisions and optimization strategies
- **Code Review**: Collaborative improvement through PR reviews

### Mentorship Available
Experienced contributors can help with:
- SIMD optimization techniques
- Performance analysis methodology
- Cross-platform development challenges
- Algorithm implementation guidance

---

**Ready to make Universal RNG faster?** Start with our [good first issues](link) or reach out in discussions to learn about high-impact optimization opportunities!

*This guide evolves with the project - your suggestions for improvements are always welcome!*

> **Please bear in mind, above all else, the current state of development:** for single-value generation on machines without SIMD acceleration, the C++ standard library's Mersenne Twister (`std::mt19937_64`) still outperforms this library. These implementations require at least AVX2 to beat the standard generators on single-number generation tasks.
