RISC-V RVV: Boost Chameleon to 494 MB/s, Fix Tests WIP

PR: RISC-V RVV Optimization for density-rs - Initial Results and Request for Review

Hi @g1mv and community members, thanks for the previous feedback! 🙌

I've completed the initial optimization work on density-rs, focusing on RISC-V with RVV vector extensions. Below, I'll first summarize what I've done, then explain the current issues, and invite everyone to review the results and provide feedback! 😄

What I've Done

Based on your suggestions and discussions, I prioritized optimizing the Chameleon algorithm's core loops (e.g., encode_quad and encode_batch), and extended similar improvements to Cheetah and Lion. The optimizations include:

  1. Manual RVV Vectorization:

    • In encode_batch, used RVV intrinsics (e.g., vle32_v_u32m1, vmul_vx_u32m1, vsrl_vx_u32m1, vluxei32_v_u32m1, and vmseq_vv_m_b32) for hash calculations, dictionary accesses, and conflict detection.
    • Handled hash uniqueness and sequencing: when hashes within a batch conflict, the code falls back to the scalar path so that dictionary updates are applied in the correct order (referencing your "case a and b" analysis).
    • Used conditional compilation #[cfg(all(target_arch = "riscv64", target_feature = "v"))] for compatibility, and handled VLEN variability with vsetvli.
  2. Algorithm Improvements:

    • Reduced branch overhead and memory accesses (e.g., optimized hash multiplication and shifts).
    • Attempted dynamic mode switching (enabling non-updating batches when the update rate drops below 0.1); this is still preliminary.
    • Benchmarked with dickens.txt (10.19 MB), comparing the default build (before) against the optimized build (after).
  3. Performance Comparison: Using median throughput (MB/s), compression ratios unchanged:

    Algorithm  Operation            Before (MB/s)  After (MB/s)  Change  Ratio
    Chameleon  Compress (raw)       380.2          494.0         +30%    1.749x
               Decompress (raw)     494.4          503.1         +2%
    Cheetah    Compress (raw)       220.8          264.5         +20%    1.860x
               Decompress (raw)     291.4          287.2         -1%
    Lion       Compress (raw)       135.3          150.7         +11%    1.966x
               Decompress (raw)     144.9          143.5         -1%
    LZ4        Compress (raw)       82.15          79.26         -3%     1.585x
               Decompress (raw)     174.2          190.5         +9%
    Snappy     Compress (stream)    83.69          83.46         -0.3%   1.607x
               Decompress (stream)  141            141.7         +0.5%

    Key Achievements: Chameleon compression reaches 494 MB/s, close to the 500 MB/s goal! 🎯 Compression throughput improved by 11% to 30% for Chameleon, Cheetah, and Lion, while decompression stayed within about 2% of the baseline (with drops of about 1% in Cheetah and Lion).

  4. Code Cleanup:

    • Stuck to stable Rust, no external crates.
    • Added runtime fallbacks for non-RVV hardware.
    • Fixed some of the compiler warnings; the unused BYTE_SIZE_U128 and the unused std::arch::riscv64::* import remain (to be fixed in this PR).
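
To make the batch logic from item 1 easier to review, here is a minimal scalar sketch of the idea: hash every lane, check whether any two lanes target the same dictionary slot, and fall back to in-order scalar encoding when they do. This is not the RVV code in the PR. The constants HASH_MULTIPLIER, HASH_BITS, and BATCH, the hit-bitmask return value, and the function signatures are assumptions made for illustration only.

```rust
// Placeholder values for illustration; the real multiplier, hash width,
// and batch size used by the PR may differ.
const HASH_BITS: u32 = 16;
const HASH_MULTIPLIER: u32 = 0x9D6E_F916;
const BATCH: usize = 8;

/// Multiplicative hash: multiply, then keep the top HASH_BITS bits.
#[inline]
fn hash(quad: u32) -> usize {
    (quad.wrapping_mul(HASH_MULTIPLIER) >> (32 - HASH_BITS)) as usize
}

/// Returns true if any two lanes of the batch map to the same slot.
/// (In a vector version this is where a lane-wise equality compare would go.)
fn has_conflict(hashes: &[usize; BATCH]) -> bool {
    for i in 0..BATCH {
        for j in (i + 1)..BATCH {
            if hashes[i] == hashes[j] {
                return true;
            }
        }
    }
    false
}

/// Scalar path: returns true on a dictionary hit, stores the quad on a miss.
fn encode_quad(dict: &mut [u32], quad: u32) -> bool {
    let slot = &mut dict[hash(quad)];
    if *slot == quad {
        true
    } else {
        *slot = quad;
        false
    }
}

/// Encodes BATCH quads and returns a bitmask with bit i set if quad i hit.
fn encode_batch(dict: &mut [u32], quads: &[u32; BATCH]) -> u8 {
    let mut hashes = [0usize; BATCH];
    for (h, &q) in hashes.iter_mut().zip(quads) {
        *h = hash(q);
    }

    let mut hits = 0u8;
    if has_conflict(&hashes) {
        // Conflict: later lanes depend on earlier updates, so go in order.
        for (i, &q) in quads.iter().enumerate() {
            if encode_quad(dict, q) {
                hits |= 1u8 << i;
            }
        }
    } else {
        // No conflict: lanes are independent, so gather first, then
        // compare and write back. This is the part that can be vectorized.
        let mut old = [0u32; BATCH];
        for i in 0..BATCH {
            old[i] = dict[hashes[i]];
        }
        for i in 0..BATCH {
            if old[i] == quads[i] {
                hits |= 1u8 << i;
            } else {
                dict[hashes[i]] = quads[i];
            }
        }
    }
    hits
}
```

With a dictionary of `1 << HASH_BITS` zeroed entries, both branches produce the same result as encoding the quads one by one, which is the property the fallback is meant to preserve.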

These changes build on your feedback (e.g., dynamic vectorization ideas and architectural preferences) and ensure cross-platform adaptability.
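
The dynamic mode switching mentioned in item 2 could look roughly like the sketch below: count dictionary lookups and updates, and enable non-updating batches once the update rate falls below 0.1. The UpdateStats type, its methods, and the choice to treat an empty history as "always updating" are hypothetical; only the 0.1 threshold comes from the description above.

```rust
/// Threshold from the PR description: switch modes below a 10% update rate.
const UPDATE_RATE_THRESHOLD: f64 = 0.1;

/// Hypothetical running counters for dictionary activity.
#[derive(Default)]
struct UpdateStats {
    lookups: u64,
    updates: u64,
}

impl UpdateStats {
    /// Records the outcome of one batch.
    fn record(&mut self, lookups: u64, updates: u64) {
        self.lookups += lookups;
        self.updates += updates;
    }

    /// Fraction of lookups that caused a dictionary update.
    /// With no history yet, report 1.0 so the updating mode stays on.
    fn update_rate(&self) -> f64 {
        if self.lookups == 0 {
            1.0
        } else {
            self.updates as f64 / self.lookups as f64
        }
    }

    /// True when non-updating batches should be enabled.
    fn use_non_updating_batches(&self) -> bool {
        self.update_rate() < UPDATE_RATE_THRESHOLD
    }
}
```

A real implementation would probably use a sliding window rather than lifetime totals, so the mode can switch back when the input changes character.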
