
Contributing to the project

whisprer edited this page Aug 4, 2025 · 1 revision

# Contributing to Universal RNG Library

## 🎯 Welcome Performance Engineers!

The Universal RNG Library thrives on contributions that push the boundaries of random number generation performance. Whether you're optimizing SIMD kernels, implementing new algorithms, or improving cross-platform compatibility, your expertise makes this library faster for everyone.

## 🚀 Quick Start for Contributors

### Prerequisites and Setup

```bash
# Install development dependencies
sudo apt install build-essential cmake ninja-build perf

# Clone and set up the development environment
git clone [repository-url] universal-rng
cd universal-rng
git checkout develop   # Development branch

# Create a development build
mkdir build-dev && cd build-dev
cmake -S .. -B . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_TESTING=ON
cmake --build . -j$(nproc)

# Verify everything works
./rng_selftest && ./rng_benchmark --quick
```

### Your First Contribution

1. Fork the repository and create a feature branch
2. Make changes with comprehensive tests
3. Run benchmarks to measure the performance impact
4. Submit a PR with a performance analysis
5. Collaborate on code review and optimization

## 📊 Performance Standards

### Benchmark Requirements

Every performance-related contribution must include before/after benchmarks:

```bash
# Generate baseline performance report
./rng_benchmark --algorithm=all --output=baseline.json

# After your changes
./rng_benchmark --algorithm=all --output=optimized.json

# Compare results
./benchmark_compare baseline.json optimized.json
```

### Performance Targets by Component

#### Single-Value Generation

Target: Match reference implementation speed

- Xoroshiro128+: ≥800 M ops/sec (64-bit)
- WyRand: ≥900 M ops/sec (64-bit)
- No regression tolerance: 0%

#### Batch Generation (AVX2)

Target: minimum speedup ratios over the scalar path

- 64-bit: ≥4.0x vs scalar
- 128-bit: ≥4.5x vs scalar
- 256-bit: ≥3.5x vs scalar
- Memory bandwidth efficiency: ≥80%

#### Cross-Platform Consistency

Performance variance tolerance: ≤15% across platforms

- Windows MSVC vs Linux GCC
- Intel vs AMD architectures
- Different AVX2 implementations

## 🏗️ Development Workflow

### Branch Strategy

```
main         ← Stable releases only
develop      ← Integration branch for features
feature/xxx  ← Individual feature development
perf/xxx     ← Performance optimization branches
hotfix/xxx   ← Critical bug fixes
platform/xxx ← Platform-specific improvements
```

### Commit Message Format

```
type(scope): brief description

Detailed explanation including:
- Performance impact measurement
- Platform compatibility notes
- Breaking changes (if any)

Performance: [specific metrics]
Platforms: [tested on which platforms]
```

Types: `feat`, `fix`, `perf`, `docs`, `style`, `refactor`, `test`, `build`

Example: `perf(avx2): optimize xoroshiro128++ batch generation`

```
Implement direct AVX2 intrinsics replacing the scalar wrapper. Eliminates
function call overhead and improves register usage. Uses _mm256_set_epi64x
for state initialization instead of loads.

Performance: 4.2x speedup in batch mode (was 2.8x)
Platforms: Linux GCC 11, Windows MSVC 2022
```

## 🎯 Code Standards

### C++ Style Guidelines

#### Modern C++17 Patterns

```cpp
// GOOD: template metaprogramming for zero-overhead generator selection
template<Algorithm A, typename T>
constexpr auto create_optimized_generator() noexcept {
    if constexpr (A == Algorithm::Xoroshiro128Plus) {
        return XoroshiroGenerator{};
    } else if constexpr (A == Algorithm::WyRand) {
        return WyRandGenerator{};
    }
}
```

```cpp
// GOOD: RAII and smart pointers
class SIMDGenerator {
    std::unique_ptr<uint64_t[], AlignedDeleter> state_;
public:
    explicit SIMDGenerator(size_t state_size)
        : state_{aligned_alloc<uint64_t>(state_size, 32)} {}
};

// BAD: raw pointers and manual memory management
uint64_t* state = new uint64_t[4];  // DON'T DO THIS
```

#### SIMD Implementation Standards

```cpp
// Template for new SIMD implementations
class AlgorithmNameAVX2 {
    alignas(32) __m256i state[2];  // Properly aligned state

public:
    explicit AlgorithmNameAVX2(uint64_t seed) noexcept;

    // Hot path: aggressive optimization attributes
    [[gnu::hot, gnu::always_inline]]
    __m256i generate_batch() noexcept {
        // Minimize register pressure
        const __m256i s0 = state[0];
        const __m256i s1 = state[1];

        // Algorithm-specific AVX2 operations
        const __m256i result = _mm256_add_epi64(s0, s1);

        // Update state in-place
        state[0] = /* ... */;
        state[1] = /* ... */;

        return result;
    }

    void generate_batch(uint64_t* dest, size_t count) noexcept;
    std::string get_implementation_name() const noexcept;
};
```

### Performance Coding Principles

#### Memory Layout Optimization

```cpp
// GOOD: cache-line aligned, minimal padding
struct alignas(64) OptimizedState {
    __m256i simd_state[2];     // 64 bytes, exactly one cache line
    uint64_t scalar_fallback;  // 8 bytes
    uint32_t counter;          // 4 bytes
    // 52 bytes of padding remain in the second cache line
};

// BAD: poor alignment, cache-line splits
struct PoorState {
    uint64_t scalar_state;   // Might split cache lines
    __m256i simd_state[2];   // Unaligned AVX2 operations
};
```

#### Branch Elimination

```cpp
// GOOD: compile-time dispatch
template <bool UseAVX2>
void generate_impl(uint64_t* dest, size_t count) {
    if constexpr (UseAVX2) {
        generate_avx2(dest, count);
    } else {
        generate_scalar(dest, count);
    }
}

// BAD: runtime branching in the hot loop
for (size_t i = 0; i < count; ++i) {
    if (has_avx2) {  // Branch in the hot path
        dest[i] = generate_avx2_single();
    } else {
        dest[i] = generate_scalar_single();
    }
}
```

## 🧪 Testing Requirements

### Test Categories and Requirements

#### 1. Correctness Tests

```cpp
// Algorithm implementation correctness
TEST(XoroshiroTest, MatchesReferenceImplementation) {
    auto our_impl = UniversalRNG::Generator<uint64_t, Algorithm::Xoroshiro128Plus>{12345};
    auto reference = ReferenceXoroshiro128Plus{12345};

    for (int i = 0; i < 10000; ++i) {
        EXPECT_EQ(our_impl.next(), reference.next())
            << "Mismatch at iteration " << i;
    }
}

// SIMD correctness vs scalar
TEST(SIMDTest, AVX2MatchesScalar) {
    const uint64_t seed = 42;
    auto avx2_gen = create_avx2_generator(seed);
    auto scalar_gen = create_scalar_generator(seed);

    std::array<uint64_t, 1000> avx2_results, scalar_results;

    avx2_gen.generate_batch(avx2_results.data(), avx2_results.size());
    for (auto& val : scalar_results) {
        val = scalar_gen.next();
    }

    EXPECT_EQ(avx2_results, scalar_results);
}
```

#### 2. Performance Tests

```cpp
// Benchmark integration in tests
BENCHMARK(SingleGeneration_Xoroshiro128Plus) {
    auto rng = UniversalRNG::create_generator<64>(Algorithm::Xoroshiro128Plus);

    for (auto _ : state) {
        benchmark::DoNotOptimize(rng.next());
    }

    state.SetItemsProcessed(state.iterations());
}

BENCHMARK(BatchGeneration_AVX2_1K) {
    auto rng = UniversalRNG::create_batch_generator<64>();
    std::array<uint64_t, 1000> buffer;

    for (auto _ : state) {
        rng.generate_batch(buffer.data(), buffer.size());
        benchmark::DoNotOptimize(buffer.data());
    }

    state.SetItemsProcessed(state.iterations() * buffer.size());
}
```

#### 3. Statistical Quality Tests

```cpp
// Basic randomness validation
TEST(StatisticalTest, UniformDistribution) {
    auto rng = UniversalRNG::create_generator<32>();
    std::vector<uint32_t> samples(1000000);

    for (auto& sample : samples) {
        sample = rng.next();
    }

    // Chi-square test for uniformity
    EXPECT_TRUE(passes_chi_square_test(samples));
}
```

```cpp
// Correlation testing
TEST(StatisticalTest, NoSerialCorrelation) {
    auto rng = UniversalRNG::create_generator<64>();

    // Generate pairs and test for correlation
    std::vector<std::pair<uint64_t, uint64_t>> pairs(100000);
    for (auto& [first, second] : pairs) {
        first = rng.next();
        second = rng.next();
    }

    EXPECT_LT(calculate_correlation(pairs), 0.01);
}
```

#### 4. Platform Compatibility Tests

```cpp
// Cross-platform consistency
TEST(PlatformTest, ConsistentResults) {
    const uint64_t seed = 123456789;

    // Test that the same seed produces the same sequence across platforms
    auto rng = UniversalRNG::create_generator<64>(Algorithm::Xoroshiro128Plus, seed);

    // Known-good sequence from the reference platform
    const std::array<uint64_t, 10> expected = {
        0x1234567890ABCDEF, 0xFEDCBA0987654321, /* ... */
    };

    for (size_t i = 0; i < expected.size(); ++i) {
        EXPECT_EQ(rng.next(), expected[i])
            << "Platform difference at index " << i;
    }
}
```

### Continuous Integration Requirements

#### Performance Regression Detection

```yaml
# .github/workflows/performance.yml
- name: Run Performance Benchmarks
  run: |
    ./build/rng_benchmark --benchmark_format=json > current_perf.json
    python scripts/compare_performance.py baseline_perf.json current_perf.json

- name: Check Performance Regression
  run: |
    # Fail if any benchmark shows >5% regression
    python scripts/performance_gate.py --threshold=0.05
```

## 🎨 Documentation Standards

### Code Documentation Requirements

#### Header Comments for Algorithms

```cpp
/**
 * @brief AVX2-optimized implementation of Xoroshiro128++
 *
 * This implementation uses 256-bit AVX2 registers to process 4 parallel
 * streams of the Xoroshiro128++ algorithm. Each register lane maintains
 * independent state for maximum parallelism.
 *
 * @performance 4.2x speedup over scalar implementation on Intel Skylake+
 * @memory 64-byte aligned state for optimal SIMD access
 * @reference Blackman & Vigna (2018), "Scrambled Linear Pseudorandom Generators"
 * @warning Requires AVX2 support. Use CPU::has_avx2() to verify before use.
 */
class Xoroshiro128PlusAVX2 {
    // Implementation...
};
```

#### Performance-Critical Function Documentation

```cpp
/**
 * @brief Generate a batch of random numbers using AVX2 SIMD
 *
 * Generates exactly `count` random numbers using parallel SIMD lanes.
 * The output buffer must be 32-byte aligned for optimal performance.
 *
 * @param dest  Output buffer (must be 32-byte aligned)
 * @param count Number of values to generate
 * @performance ~1200 M numbers/sec on Intel Core i7-9700K
 * @complexity O(count/4) due to 4-way SIMD parallelism
 * @pre dest != nullptr
 * @pre count > 0
 * @pre dest has space for at least count * sizeof(uint64_t) bytes
 */
[[gnu::hot]]
void generate_batch(uint64_t* dest, size_t count) noexcept;
```

### Required Documentation for New Features

#### Algorithm Implementation Guide

```markdown
# Algorithm: [Name]

## Overview
Brief description of the algorithm, its properties, and use cases.

## Performance Characteristics
- **Period**: [full period length]
- **Speed**: [operations per second on reference hardware]
- **Quality**: [statistical test results, known limitations]
- **Memory**: [state size and alignment requirements]

## Implementation Notes
- Platform-specific optimizations
- SIMD considerations
- Numerical precision requirements

## References
- Original paper citations
- Implementation references
- Performance comparison studies

## Usage Example
// Complete working example
```

#### Performance Optimization Guide
```markdown
# Optimization: [Description]

## Problem Statement
What performance issue this addresses.

## Solution Approach
Technical approach and implementation strategy.

## Performance Impact
- **Before**: [baseline measurements]
- **After**: [optimized measurements]
- **Improvement**: [speedup ratio and absolute gains]

## Platform Impact
Results across different platforms and architectures.

## Code Changes
Key implementation changes with explanations.
```

## 🐛 Issue Reporting

### Bug Report Template

```markdown
## Bug Description
Clear description of the incorrect behavior.

## Environment
- **OS**: [Windows 10/Ubuntu 20.04/macOS 12]
- **Compiler**: [GCC 11.2/MSVC 2022/Clang 13]
- **CPU**: [Intel i7-9700K/AMD Ryzen 5900X]
- **Build Type**: [Release/Debug/RelWithDebInfo]

## Reproduction Steps
1. Minimal code to reproduce the issue
2. Expected behavior
3. Actual behavior

## Performance Impact
If applicable, performance measurements showing the issue.

## Additional Context
Stack traces, compiler warnings, or other relevant information.
```

### Performance Issue Template

```markdown
## Performance Issue
Description of the suboptimal performance.

## Benchmark Results
# Command used
./rng_benchmark --algorithm=Xoroshiro128Plus --bitwidth=64

# Results showing the issue
Algorithm: Xoroshiro128Plus
Expected: >800 M ops/sec
Actual: 650 M ops/sec (19% below target)

## Hardware Details
- CPU Model: [exact model and generation]
- SIMD Support: [AVX2/AVX-512/NEON available]
- Memory: [speed and configuration]
- Compiler Flags: [optimization settings used]

## Proposed Solution
If you have ideas for optimization approaches.
```

## 🏆 Areas Where Help is Needed

### High-Impact Contribution Opportunities

#### 1. SIMD Optimization
- **Skills needed**: AVX2/AVX-512 intrinsics, performance analysis
- **Current gap**: Single-mode AVX2 underperforming by 30-70%
- **Target**: Match reference implementation speed
- **Impact**: Core library performance foundation

#### 2. GPU Acceleration
- **Skills needed**: OpenCL/CUDA, GPU architecture knowledge
- **Current status**: Framework exists, needs implementation
- **Target**: 10-100x speedup for large batch generation
- **Impact**: High-throughput applications

#### 3. ARM/NEON Implementation
- **Skills needed**: ARM assembly, NEON intrinsics
- **Current status**: Placeholder implementations
- **Target**: Competitive performance on ARM platforms
- **Impact**: Mobile and embedded device support

#### 4. Cryptographically Secure Algorithms
- **Skills needed**: Cryptography, security analysis
- **Current status**: Planned feature
- **Target**: ChaCha20, AES-CTR implementations
- **Impact**: Security-sensitive applications

#### 5. Cross-Platform Testing
- **Skills needed**: CI/CD, cross-compilation
- **Current gap**: Limited macOS and ARM testing
- **Target**: Comprehensive platform coverage
- **Impact**: Reliability and compatibility

### Getting Started Contributions

#### Good First Issues
- **Documentation improvements**: Add usage examples
- **Test coverage**: Expand statistical quality tests
- **Build system**: Improve CMake cross-platform support
- **Benchmarking**: Add new benchmark scenarios

#### Medium Complexity
- **Algorithm implementation**: Add new PRNG algorithms
- **Performance analysis**: Identify optimization opportunities
- **API design**: Improve usability without sacrificing performance

## 🤝 Code Review Process

### Pull Request Requirements
- [ ] **All tests pass** (unit, integration, performance)
- [ ] **Performance benchmarks included** with analysis
- [ ] **Documentation updated** (API docs, usage guides)
- [ ] **Cross-platform compatibility verified**
- [ ] **Code review approval** from maintainer

### Review Criteria

#### Technical Excellence
- Correctness of algorithm implementation
- Performance impact measurement and analysis
- Code quality and maintainability
- Memory safety and resource management

#### Performance Focus
- Benchmark results demonstrate improvement or no regression
- SIMD implementations properly optimized
- Memory access patterns optimized for cache efficiency
- Compiler optimization compatibility

#### Maintainability
- Clear, self-documenting code
- Comprehensive test coverage
- Platform abstraction where appropriate
- Future extensibility considerations

## 🎯 Recognition

### Contributor Recognition
- **Performance Hall of Fame**: Major optimization achievements
- **Algorithm Contributors**: Implementation of new algorithms
- **Platform Enablers**: Cross-platform compatibility improvements
- **Release Notes**: Significant contributions highlighted

### Optimization Achievements
Contributors who achieve significant performance improvements:
- **10%+ improvement**: Mentioned in release notes
- **2x+ improvement**: Featured in performance hall of fame
- **New platform support**: Platform enabler recognition
- **New algorithm**: Algorithm contributor status

## 📞 Getting Help

### Communication Channels
- **GitHub Issues**: Technical problems and bug reports
- **GitHub Discussions**: Design decisions and optimization strategies
- **Code Review**: Collaborative improvement through PR reviews

### Mentorship Available
Experienced contributors can help with:
- SIMD optimization techniques
- Performance analysis methodology
- Cross-platform development challenges
- Algorithm implementation guidance

---

**Ready to make Universal RNG faster?** Start with our [good first issues](link) or reach out in discussions to learn about high-impact optimization opportunities!

*This guide evolves with the project - your suggestions for improvements are always welcome!*

> **Please bear in mind, above all else, the current state of development:** for single-value generation on machines without SIMD acceleration, the C++ standard library's Mersenne Twister (`std::mt19937_64`) still outperforms this library. These implementations require at least AVX2 to beat the standard generators on single-number generation tasks.
