15:01:48
sech1:
Implemented vectorized dataset init for RISC-V: before https://p2pool.io/u/ceef12e3b3b4e8ea/Screenshot%20from%202025-11-30%2015-57-46.png after https://p2pool.io/u/3ef318b4f6ea6660/Screenshot%20from%202025-11-30%2016-00-21.png
15:02:11
sech1:
dataset init time reduced from 28.294 s to 21.728 s
15:02:50
sech1:
30% speedup, but that also includes the cache init part, which didn't change
15:03:58
sech1:
cache init is 5-6 seconds, approximately
15:04:18
sech1:
so dataset init time reduced from ~22 to ~15 seconds
15:05:00
sech1:
In theory it would be down from 22 to 11 seconds if RAM access wasn't a bottleneck
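The arithmetic behind those estimates can be sanity-checked like this (a quick sketch; the ~6 s cache figure is the approximation from the messages above, not a measured constant):

```cpp
#include <cassert>
#include <cmath>

// Timings from the messages above, in seconds.
constexpr double kTotalBefore = 28.294;  // full init, scalar code
constexpr double kTotalAfter  = 21.728;  // full init, vectorized code
constexpr double kCacheInit   = 6.0;     // cache init, unchanged (approximate)

// Speedup of the whole init, cache included: ~1.30x ("30% speedup").
double total_speedup() { return kTotalBefore / kTotalAfter; }

// Speedup of the dataset part alone, with the unchanged cache init
// subtracted: ~22.3 s -> ~15.7 s, i.e. ~1.42x.
double dataset_speedup() {
    return (kTotalBefore - kCacheInit) / (kTotalAfter - kCacheInit);
}
```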
17:59:27
sech1:
Got even faster after fixing all the bugs: https://p2pool.io/u/8c593a881a361d8b/Screenshot%20from%202025-11-30%2018-58-54.png
17:59:43
sech1:
from 28.294 down to 17.018 seconds to init dataset
18:22:36
sech1:
https://github.com/xmrig/xmrig/pull/3736
18:25:08
plowsof:
👏
18:25:50
sech1:
Next on my list is writing vectorized soft AES for the hash/fill AES step to speed that part up too. After that I'll be comfortable enough with RISC-V assembly and vector instructions to add them to the actual RandomX JIT
18:26:13
sech1:
(Because this CPU doesn't have hardware AES, sad)
19:17:57
sech1:
`RxDataset::init` timings: 21865 ms before, 10639 ms after
19:18:06
sech1:
That's more than 2x speedup, lol
19:18:15
sech1:
I expected max 2x from vector code
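For reference, the ratio behind the "more than 2x" remark, computed from the `RxDataset::init` numbers above:

```cpp
#include <cassert>

// RxDataset::init timings from the messages above, in milliseconds.
constexpr double kInitBefore = 21865.0;  // scalar code
constexpr double kInitAfter  = 10639.0;  // vectorized code

// 21865 / 10639 ~ 2.06x, indeed slightly more than 2x.
double init_speedup() { return kInitBefore / kInitAfter; }
```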
19:21:07
DataHoarder:
they might have way more vector registers than scalar ones nowadays :D
19:21:59
sech1:
I guess vector instructions are overall more efficient
19:23:18
sech1:
Or it's the fact that I do prefetch instructions in vector code, and scalar code doesn't
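Software prefetch of this kind can be sketched with the portable `__builtin_prefetch` intrinsic (GCC/Clang). This toy loop is purely an illustration of the idea, not xmrig's code; the look-ahead distance is a made-up tuning parameter:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>
#include <cassert>

// Toy illustration (not xmrig code): issue a software prefetch a few
// elements ahead of the current load, the way vectorized init code can
// prefetch the next dataset item while computing the current one.
uint64_t sum_with_prefetch(const std::vector<uint64_t>& v) {
    constexpr size_t kAhead = 8;  // look-ahead distance; CPU-specific tuning
    uint64_t sum = 0;
    for (size_t i = 0; i < v.size(); ++i) {
        if (i + kAhead < v.size())
            __builtin_prefetch(&v[i + kAhead], /*rw=*/0, /*locality=*/0);
        sum += v[i];
    }
    return sum;
}
```

The result is identical with or without the prefetch; only memory latency hiding changes, which is why it can matter so much on a RAM-bound workload.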
19:23:18
sech1:
It's the same number of registers (32)
19:25:21
DataHoarder:
physical vs virtual I mean
19:25:21
sech1:
ah, maybe it's also that register width is 256 bit (4x scalar)
19:25:21
sech1:
but execution units are 128 bit
19:25:21
sech1:
so it is 2x faster, but executes 4x fewer instructions, and some additional speedup comes from this
19:25:21
sech1:
so it saves time on instruction decoding
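A back-of-the-envelope model of that accounting (illustrative numbers only, nothing measured):

```cpp
#include <cassert>

// 256-bit vector registers hold 4 x 64-bit lanes, but 128-bit execution
// units split each vector op into 2 micro-ops ("double pumping").
constexpr int kLanesPerVector = 4;  // 256-bit register / 64-bit scalar
constexpr int kPumps          = 2;  // 256-bit op on 128-bit units

// Instructions the front end must fetch and decode: 4x fewer than scalar.
int decoded_instructions(int scalar_ops) {
    return scalar_ops / kLanesPerVector;
}

// Passes through the 128-bit execution units: 2x fewer than scalar, which
// is where "2x faster, but 4x fewer instructions" comes from -- the extra
// gap between 2x and 4x is decode/front-end savings.
int execution_passes(int scalar_ops) {
    return decoded_instructions(scalar_ops) * kPumps;
}
```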
19:25:21
DataHoarder:
oh, double pump!
19:28:22
DataHoarder:
I'm so sad Intel fucked up AVX512 so bad on new cpus
19:28:39
DataHoarder:
E cores didn't have 512 so they dropped it entirely
19:28:56
DataHoarder:
AMD: here's double-pumped AVX512, next gen, full pump
19:29:11
sech1:
yes, it's 512 physical in Zen 5
19:29:13
DataHoarder:
Intel is trying to define AVX10 to take into account bitwidth
19:29:25
DataHoarder:
though they could have just double pumped it :')
19:29:32
sech1:
I had an idea to write AVX512 dataset init, but it makes little sense. It's already less than a second on a 9950X with 256-bit AVX2
19:29:41
DataHoarder:
all the useful stuff in AVX512 that is not explicitly 512-bitwidth is so great
19:30:27
DataHoarder:
and what was that op that was slower than implementing it yourself, not just with vector instructions but even with scalar code :D
19:30:38
DataHoarder:
made people not use that at all due to slowdown
19:30:47
DataHoarder:
AMD: 1 cycle execution time
19:31:08
DataHoarder:
no one uses it, so just about a flex