15:01:48
sech1:
Implemented vectorized dataset init for RISC-V: before https://p2pool.io/u/ceef12e3b3b4e8ea/Screenshot%20from%202025-11-30%2015-57-46.png after https://p2pool.io/u/3ef318b4f6ea6660/Screenshot%20from%202025-11-30%2016-00-21.png
15:02:11
sech1:
dataset init time reduced from 28.294 s to 21.728 s
15:02:50
sech1:
30% speedup, but that also includes the cache init part, which didn't change
15:03:58
sech1:
cache init is 5-6 seconds, approximately
15:04:18
sech1:
so dataset init time reduced from ~22 to ~15 seconds
15:05:00
sech1:
In theory it would be down from 22 to 11 seconds if RAM access wasn't a bottleneck
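The arithmetic behind those estimates can be sanity-checked like this (a quick sketch; the ~6 s cache figure is the approximation from the messages above, not a measured constant):

```cpp
#include <cassert>
#include <cmath>

// Timings from the messages above, in seconds.
constexpr double kTotalBefore = 28.294;  // full init, scalar code
constexpr double kTotalAfter  = 21.728;  // full init, vectorized code
constexpr double kCacheInit   = 6.0;     // cache init, unchanged (approximate)

// Speedup of the whole init, cache included: ~1.30x ("30% speedup").
double total_speedup() { return kTotalBefore / kTotalAfter; }

// Speedup of the dataset part alone, with the unchanged cache init
// subtracted: ~22.3 s -> ~15.7 s, i.e. ~1.42x.
double dataset_speedup() {
    return (kTotalBefore - kCacheInit) / (kTotalAfter - kCacheInit);
}
```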
17:59:27
sech1:
Got even faster after fixing all the bugs: https://p2pool.io/u/8c593a881a361d8b/Screenshot%20from%202025-11-30%2018-58-54.png
17:59:43
sech1:
from 28.294 down to 17.018 seconds to init dataset
18:22:36
sech1:
https://github.com/xmrig/xmrig/pull/3736
18:25:08
plowsof:
👏
18:25:50
sech1:
Next on my list is writing vectorized soft AES for the hash/fill AES step to speed that part up too. After that I'll be comfortable enough with RISC-V assembly and vector instructions to add them to the actual RandomX JIT
18:26:13
sech1:
(Because this CPU doesn't have hardware AES, sad)
19:17:57
sech1:
`RxDataset::init` timings: 21865 ms before, 10639 ms after
19:18:06
sech1:
That's more than 2x speedup, lol
19:18:15
sech1:
I expected max 2x from vector code
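For reference, the ratio behind the "more than 2x" remark, computed from the `RxDataset::init` numbers above:

```cpp
#include <cassert>

// RxDataset::init timings from the messages above, in milliseconds.
constexpr double kInitBefore = 21865.0;  // scalar code
constexpr double kInitAfter  = 10639.0;  // vectorized code

// 21865 / 10639 ~ 2.06x, indeed slightly more than 2x.
double init_speedup() { return kInitBefore / kInitAfter; }
```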
19:21:07
DataHoarder:
they might have way more vector registers than scalar ones nowadays :D
19:21:59
sech1:
I guess vector instructions are overall more efficient
19:23:18
sech1:
Or it's the fact that I do prefetch instructions in vector code, and scalar code doesn't
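Software prefetch of this kind can be sketched with the portable `__builtin_prefetch` intrinsic (GCC/Clang). This toy loop is purely an illustration of the idea, not xmrig's code; the look-ahead distance is a made-up tuning parameter:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>
#include <cassert>

// Toy illustration (not xmrig code): issue a software prefetch a few
// elements ahead of the current load, the way vectorized init code can
// prefetch the next dataset item while computing the current one.
uint64_t sum_with_prefetch(const std::vector<uint64_t>& v) {
    constexpr size_t kAhead = 8;  // look-ahead distance; CPU-specific tuning
    uint64_t sum = 0;
    for (size_t i = 0; i < v.size(); ++i) {
        if (i + kAhead < v.size())
            __builtin_prefetch(&v[i + kAhead], /*rw=*/0, /*locality=*/0);
        sum += v[i];
    }
    return sum;
}
```

The result is identical with or without the prefetch; only memory latency hiding changes, which is why it can matter so much on a RAM-bound workload.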
19:23:18
sech1:
It's the same number of registers (32)
19:25:21
DataHoarder:
physical vs virtual I mean
19:25:21
sech1:
ah, maybe it's also that register width is 256 bit (4x scalar)
19:25:21
sech1:
but execution units are 128 bit
19:25:21
sech1:
so it is 2x faster, but executes 4x fewer instructions, and some additional speedup comes from this
19:25:21
sech1:
so it saves time on instruction decoding
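A back-of-the-envelope model of that accounting (illustrative numbers only, nothing measured):

```cpp
#include <cassert>

// 256-bit vector registers hold 4 x 64-bit lanes, but 128-bit execution
// units split each vector op into 2 micro-ops ("double pumping").
constexpr int kLanesPerVector = 4;  // 256-bit register / 64-bit scalar
constexpr int kPumps          = 2;  // 256-bit op on 128-bit units

// Instructions the front end must fetch and decode: 4x fewer than scalar.
int decoded_instructions(int scalar_ops) {
    return scalar_ops / kLanesPerVector;
}

// Passes through the 128-bit execution units: 2x fewer than scalar, which
// is where "2x faster, but 4x fewer instructions" comes from -- the extra
// gap between 2x and 4x is decode/front-end savings.
int execution_passes(int scalar_ops) {
    return decoded_instructions(scalar_ops) * kPumps;
}
```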
19:25:21
DataHoarder:
oh, double pump!
19:28:22
DataHoarder:
I'm so sad Intel fucked up AVX512 so bad on new cpus
19:28:39
DataHoarder:
E cores didn't have 512 so they dropped it entirely
19:28:56
DataHoarder:
AMD: here's double-pumped AVX512, next gen, full pump
19:29:11
sech1:
yes, it's 512 physical in Zen 5
19:29:13
DataHoarder:
Intel is trying to define AVX10 to take into account bitwidth
19:29:25
DataHoarder:
though they could have just double pumped it :')
19:29:32
sech1:
I had an idea to write AVX512 dataset init, but it makes little sense. It's already less than a second on a 9950X with 256-bit AVX2
19:29:41
DataHoarder:
all the useful stuff in AVX512 that is not explicitly 512-bitwidth is so great
19:30:27
DataHoarder:
and what was that op that was slower than implementing it yourself, not just with vector instructions but even with scalar code :D
19:30:38
DataHoarder:
made people not use that at all due to slowdown
19:30:47
DataHoarder:
AMD: 1 cycle execution time
19:31:08
DataHoarder:
no one uses it, so just about a flex