RISC-V RVV: Boost Chameleon to 494 MB/s, Fix Tests WIP
PR: RISC-V RVV Optimization for density-rs - Initial Results and Request for Review
Hi @g1mv and community members, thanks for the previous feedback!
I've completed the initial optimization work on `density-rs`, focusing on RISC-V with the vector extension (RVV). Below, I'll first summarize what I've done, then explain the current issues, and invite everyone to review the results and provide feedback!
What I've Done
Based on your suggestions and discussions, I prioritized optimizing the Chameleon algorithm's core loops (e.g., `encode_quad` and `encode_batch`), and extended similar improvements to Cheetah and Lion. The optimizations include:
- Manual RVV Vectorization:
  - In `encode_batch`, used RVV intrinsics (e.g., `vle32_v_u32m1`, `vmul_vx_u32m1`, `vsrl_vx_u32m1`, `vluxei32_v_u32m1`, and `vmseq_vv_m_b32`) for hash calculations, dictionary accesses, and conflict detection.
  - Handled hash uniqueness and sequencing: fall back to scalar paths on conflicts to ensure correct dictionary updates (referencing your "case a and b" analysis).
  - Used conditional compilation (`#[cfg(all(target_arch = "riscv64", target_feature = "v"))]`) for compatibility, and handled VLEN variability with `vsetvli`.
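To make the conflict handling easier to review, here is a minimal scalar model of one batch step. It is a sketch, not the code in this PR: the names (`hash`, `has_conflict`, `encode_batch`), the multiplier, the 16-bit hash width, and the fixed lane count of 4 are all assumptions chosen for illustration. It only shows why a batch whose lanes hash to the same slot has to be processed lane by lane, while a conflict-free batch can do all lookups before any update, which is the pattern a gather/compare/scatter vector path depends on.

```rust
// Illustrative constants (assumptions, not taken from density-rs).
const HASH_BITS: u32 = 16; // assumed dictionary of 2^16 entries
const HASH_MULTIPLIER: u32 = 0x9D6E_F916; // assumed multiplicative-hash constant
const LANES: usize = 4; // stand-in for the lane count a vector unit would report

/// Multiply-shift hash of one 32-bit quad (the multiply and shift-right steps).
fn hash(quad: u32) -> usize {
    (quad.wrapping_mul(HASH_MULTIPLIER) >> (32 - HASH_BITS)) as usize
}

/// True when at least two lanes target the same dictionary slot.
fn has_conflict(hashes: &[usize; LANES]) -> bool {
    for i in 0..LANES {
        for j in (i + 1)..LANES {
            if hashes[i] == hashes[j] {
                return true;
            }
        }
    }
    false
}

/// Processes one batch and returns a per-lane hit flag
/// (true = the quad was already stored in its dictionary slot).
/// `dict` must hold at least 2^HASH_BITS entries.
fn encode_batch(dict: &mut [u32], quads: &[u32; LANES]) -> [bool; LANES] {
    let mut hashes = [0usize; LANES];
    for i in 0..LANES {
        hashes[i] = hash(quads[i]);
    }

    let mut hits = [false; LANES];
    if has_conflict(&hashes) {
        // Scalar fallback: look up and update one lane at a time, so a later
        // lane observes the dictionary update made by an earlier lane.
        for i in 0..LANES {
            hits[i] = dict[hashes[i]] == quads[i];
            dict[hashes[i]] = quads[i];
        }
    } else {
        // Conflict-free: every lookup can happen before any update
        // (gather, compare, then scatter) without changing the result.
        let mut looked_up = [0u32; LANES];
        for i in 0..LANES {
            looked_up[i] = dict[hashes[i]];
        }
        for i in 0..LANES {
            hits[i] = looked_up[i] == quads[i];
            dict[hashes[i]] = quads[i];
        }
    }
    hits
}
```

With a zeroed dictionary, a batch such as `[7, 7, 8, 9]` takes the scalar path: the first `7` misses and is stored, and the second `7` then hits. A gather-first approach would report two misses for the same input, which is the ordering problem the fallback avoids.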
- Algorithm Improvements:
  - Reduced branch overhead and memory accesses (e.g., optimized hash multiplication and shifts).
  - Attempted dynamic mode switching (enabling non-updating batches when the update rate drops below 0.1); this is still preliminary.
  - Benchmarked with `dickens.txt` (10.19 MB), comparing performance before and after (default vs optimized).
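For discussion, the dynamic mode switching idea can be sketched roughly as follows. This is a hypothetical model rather than the code in this PR: `UpdateRateTracker`, `BatchMode`, and the definition of "update rate" as the fraction of lookups that miss (and would therefore write to the dictionary) are assumptions. Only the 0.1 threshold comes from the description above.

```rust
/// Threshold from the PR description: below this, batches stop updating.
const UPDATE_RATE_THRESHOLD: f64 = 0.1;

#[derive(Debug, Clone, Copy, PartialEq)]
enum BatchMode {
    /// Misses write the new quad into the dictionary.
    Updating,
    /// The dictionary is treated as read-only for the batch.
    NonUpdating,
}

/// Hypothetical running counter of dictionary lookups and misses.
struct UpdateRateTracker {
    lookups: u64,
    misses: u64,
}

impl UpdateRateTracker {
    fn new() -> Self {
        UpdateRateTracker { lookups: 0, misses: 0 }
    }

    /// Records the outcome of one processed batch.
    fn record_batch(&mut self, lookups: u64, misses: u64) {
        self.lookups += lookups;
        self.misses += misses;
    }

    /// Fraction of lookups that missed. Reports 1.0 before any data is
    /// recorded, so the encoder starts in updating mode.
    fn update_rate(&self) -> f64 {
        if self.lookups == 0 {
            1.0
        } else {
            self.misses as f64 / self.lookups as f64
        }
    }

    /// Mode to use for the next batch.
    fn next_mode(&self) -> BatchMode {
        if self.update_rate() < UPDATE_RATE_THRESHOLD {
            BatchMode::NonUpdating
        } else {
            BatchMode::Updating
        }
    }
}
```

One open design question with any scheme like this is that the decoder has to reach the same mode decision from the data it sees, otherwise the two dictionaries diverge. The sketch does not address that.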
- Performance Comparison: median throughput in MB/s; compression ratios are unchanged.

  | Algorithm | Operation | Before (MB/s) | After (MB/s) | Change | Compression ratio |
  |---|---|---|---|---|---|
  | Chameleon | Compress (raw) | 380.2 | 494.0 | +30% | 1.749x |
  | Chameleon | Decompress (raw) | 494.4 | 503.1 | +2% | |
  | Cheetah | Compress (raw) | 220.8 | 264.5 | +20% | 1.860x |
  | Cheetah | Decompress (raw) | 291.4 | 287.2 | -1% | |
  | Lion | Compress (raw) | 135.3 | 150.7 | +11% | 1.966x |
  | Lion | Decompress (raw) | 144.9 | 143.5 | -1% | |
  | LZ4 | Compress (raw) | 82.15 | 79.26 | -3% | 1.585x |
  | LZ4 | Decompress (raw) | 174.2 | 190.5 | +9% | |
  | Snappy | Compress (stream) | 83.69 | 83.46 | -0.3% | 1.607x |
  | Snappy | Decompress (stream) | 141 | 141.7 | +0.5% | |

  Key Achievements: Chameleon compression is nearing the 500 MB/s goal! 🎯 Compression throughput improved for all three density algorithms (Chameleon +30%, Cheetah +20%, Lion +11%), while their decompression stayed within about 2% of the baseline (1% drops in Cheetah and Lion).
- Code Cleanup:
  - Stuck to stable Rust, no external crates.
  - Added runtime fallbacks for non-RVV hardware.
  - Partially fixed warnings, but the unused `BYTE_SIZE_U128` and `std::arch::riscv64::*` remain (to be fixed in the PR).
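As a reference for reviewers on other platforms, the gating pattern looks roughly like the sketch below. The function names (`hash_batch`, `hash_batch_scalar`, `selected_path`) and the hash constants are placeholders for illustration, and the RVV branch is deliberately left as a stub that defers to the scalar code. The sketch shows compile-time selection with `cfg` only; it does not cover runtime feature detection.

```rust
// Illustrative constants (assumptions, not taken from density-rs).
const HASH_BITS: u32 = 16;
const HASH_MULTIPLIER: u32 = 0x9D6E_F916;

/// Portable scalar path, compiled on every target.
/// Writes one hash per input quad; extra elements on either side are ignored.
fn hash_batch_scalar(quads: &[u32], out: &mut [u32]) {
    for (o, q) in out.iter_mut().zip(quads.iter()) {
        *o = q.wrapping_mul(HASH_MULTIPLIER) >> (32 - HASH_BITS);
    }
}

/// Compiled only when the target is riscv64 with the "v" feature enabled.
#[cfg(all(target_arch = "riscv64", target_feature = "v"))]
fn hash_batch(quads: &[u32], out: &mut [u32]) {
    // Placeholder: a vectorized implementation would live here.
    // The stub defers to the scalar path so the sketch stays portable.
    hash_batch_scalar(quads, out)
}

/// Compiled on every other target (and on riscv64 without "v").
#[cfg(not(all(target_arch = "riscv64", target_feature = "v")))]
fn hash_batch(quads: &[u32], out: &mut [u32]) {
    hash_batch_scalar(quads, out)
}

/// Reports which path was selected at compile time.
fn selected_path() -> &'static str {
    if cfg!(all(target_arch = "riscv64", target_feature = "v")) {
        "rvv"
    } else {
        "scalar"
    }
}
```

Because both `cfg` variants share one signature, callers use `hash_batch` unconditionally and the scalar function doubles as a test oracle for the vector path.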
These changes build on your feedback (e.g., dynamic vectorization ideas and architectural preferences) and ensure cross-platform adaptability.
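For anyone who wants to reproduce the "Change" column of the performance comparison from their own runs, the arithmetic is sketched below. The helper names `median` and `percent_change` are hypothetical and not part of the benchmark harness. As a check against the reported numbers, `percent_change(380.2, 494.0)` is about 29.9, which rounds to the +30% shown for Chameleon compression.

```rust
/// Median of a set of throughput samples (MB/s); None for an empty set.
/// Sorts the slice in place.
fn median(samples: &mut [f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.total_cmp(b));
    let mid = samples.len() / 2;
    if samples.len() % 2 == 1 {
        Some(samples[mid])
    } else {
        // Even count: average the two middle samples.
        Some((samples[mid - 1] + samples[mid]) / 2.0)
    }
}

/// Relative change from `before` to `after`, in percent.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```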