18:53:17
DataHoarder:
I have been experimenting with making the ma, mx prefetch registers go from 2 to 3/4 entries (effectively converting them into a ring buffer). what this means is that we can prefetch several VM iterations ahead instead of just the next one. for v2 (program size 384 and ring size 2->3), looking at pipeline performance counters on Zen5 makes it go from
18:53:19
DataHoarder:
stalled 27% bound by memory to 19%, while increasing hashrate (on my test setup) from 10 kH/s to 12 kH/s. With ring size 4, that is 13 kH/s. So it effectively does way more work, uses more memory bandwidth, and stalls less
18:53:49
DataHoarder:
Here's a table of some of the tests and hashrates across Zen5, Zen3, and a random i7-7700K I have around
18:53:51
DataHoarder:
https://paste.debian.net/hidden/1677baa9
18:54:22
DataHoarder:
talking with sech1 suggests doing program size 384 (as is currently in v2) and ringSize(n)=3
18:54:59
DataHoarder:
the hashrates are a bit noisy, especially for the i7 (no huge pages there), but the perf stats tend to be on point for each run when measured across different runs
18:56:47
DataHoarder:
effectively, if you have A, B, C with N=3: at the end of the loop you write the dataset prefetch at C, and then read from the location at A; the next iteration it'd be A, B; then B, C; then back to C, A (a ring)
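The rotation described above can be sketched like this (a minimal sketch with made-up names, not the actual go-randomx code): the prefetch (write) slot advances each iteration, and the read slot is the one that was prefetched N-1 iterations earlier.

```go
package main

import "fmt"

// ringSize matches the N=3 case described above.
const ringSize = 3

// ringSlots maps an iteration number to its prefetch (write) and read slots.
// The read slot is one ahead of the write slot modulo ringSize, i.e. the slot
// whose prefetch was issued ringSize-1 iterations ago and so is due to land.
func ringSlots(iter int) (write, read int) {
	write = iter % ringSize
	read = (write + 1) % ringSize
	return
}

func main() {
	// Prints the (C,A) -> (A,B) -> (B,C) -> (C,A) pattern from the chat,
	// with slots A, B, C numbered 0, 1, 2.
	for iter := 0; iter < 6; iter++ {
		w, r := ringSlots(iter)
		fmt.Printf("iter %d: prefetch slot %d, read slot %d\n", iter, w, r)
	}
}
```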
19:21:49
DataHoarder:
oh, perf counters on zen5 as well:
19:21:51
DataHoarder:
size 384, n2 https://paste.debian.net/hidden/358363f8 10kH/s
19:21:53
DataHoarder:
size 384, n3 https://paste.debian.net/hidden/0cd2347c 12kH/s
19:22:23
DataHoarder:
this is using perf stat --metrics PipelineL1,PipelineL2
19:22:25
DataHoarder:
Equivalent to counters listed on https://docs.amd.com/r/en-US/57368-uProf-user-guide/Pipeline-Utilization
19:23:23
DataHoarder:
the % in parentheses is just how much time was spent measuring that metric (counter multiplexing), not the actual %
19:42:40
sech1:
TL;DR: n=3 effectively makes v2 hashrate higher than v1, while still doing more than 1.5x the work
19:45:08
sech1:
The other side of the coin is that it allows calculating two superscalar hashes at the same time, but it should be fine because they will still burn the same amount of energy per hash
19:47:39
sech1:
This will speed up the light mode though...
19:48:19
sech1:
Light mode is about burning more energy, not hashrate per se, right?
19:56:38
eureka:
less stalling is good, keep the memory busy
19:57:08
DataHoarder:
n=4 was giving an extra +1 kH/s on Zen5
19:57:22
DataHoarder:
I tested n=8 too, but that was just another 800 H/s over n=4
19:58:39
DataHoarder:
(for context, 9900X3D with 2x CP64G56C46U5.M16B1 running in low power mode)
19:59:41
DataHoarder:
previous benchmark with just large pages on v1 https://xmrig.com/benchmark/361m7w
20:02:24
DataHoarder:
Zen3 gained almost nothing going from n=2 to n=3, and the i7 is mostly the same across all
20:03:49
sech1:
Zen 3 is probably memory bound, but on the RAM side, not CPU side
20:04:08
sech1:
They just can't give more hashrate
20:04:22
DataHoarder:
yeah, it's also a cursed suboptimal setup with only 2x LRDIMM out of 8x populated
20:04:25
DataHoarder:
that'd be something for any future tweaks, anyhow.
20:04:25
sech1:
I mean your specific Zen 3 build
20:04:34
DataHoarder:
I don't doubt so :)
20:05:42
sech1:
24 core Zen 6 will def hit 40+ kH/s with this tweak, and will get RAM limited too...
20:06:14
sech1:
Or more like it will be right on the edge of being RAM bound
20:06:22
DataHoarder:
I have the testing changes that support setting what I called ProgramPrefetchRingSize on my go-randomx auto-test branch https://git.gammaspectra.live/P2Pool/go-randomx/src/branch/auto-test
20:06:26
sech1:
Assuming well tweaked timings of course
20:07:28
DataHoarder:
one could hope for a less memory-bound I/O chiplet on Zen6 :)
20:10:10
DataHoarder:
would this make it less sensitive to memory latency? if prefetch went far enough ahead I'd guess so; n=3 seems to be at that limit for Zen5
20:10:34
DataHoarder:
(to the point where it becomes bandwidth limited, not latency)
20:13:41
sech1:
It's less sensitive to latency, but it can easily max out the bandwidth of random 64-byte accesses, because it doubles the amount of in-flight accesses
20:14:09
sech1:
So it's still sensitive to latency because latency defines this ceiling
20:14:23
sech1:
Latency and memory controller
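The way latency defines the ceiling can be shown with a back-of-envelope Little's-law estimate: sustainable bandwidth for random accesses is in-flight bytes divided by average latency. The numbers below are illustrative assumptions for the sketch, not measurements from the chat.

```go
package main

import "fmt"

// ceilingGBs returns the sustainable random-access bandwidth for a given
// number of in-flight cache-line requests and average memory latency.
// Bytes per nanosecond is numerically equal to GB/s.
func ceilingGBs(inFlightLines, lineBytes int, latencyNs float64) float64 {
	return float64(inFlightLines*lineBytes) / latencyNs
}

func main() {
	// Assumed: 32 in-flight 64-byte dataset reads at 80 ns average latency.
	fmt.Printf("%.1f GB/s\n", ceilingGBs(32, 64, 80))
	// Doubling the in-flight accesses (the ring's effect) doubles the ceiling,
	// which is why the ceiling is still set by latency and the memory controller.
	fmt.Printf("%.1f GB/s\n", ceilingGBs(64, 64, 80))
}
```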
20:15:01
sech1:
That said, only 24 core Zen 6 can max it out
20:15:15
sech1:
All other AM5 CPUs are fine
20:22:08
DataHoarder:
would be nice to see new benchmarks not done with just my library :)
20:33:19
sech1:
I will try to implement it by tomorrow evening, for x64 JIT
20:46:40
DataHoarder:
if you are specifically doing just n=3 you already have an optimal setup for it (which can probably reuse the existing register instead of a temporary stack slot like I do), but I support n=2 to n=4 (or n=8 with changes)
21:15:53
sech1:
Yes, I can shift 96 bits through the 64-bit register that holds mx/ma. 32 bits go out on one side, 32 bits go in on the other side
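That 96-bit shift can be sketched in plain Go (a hypothetical model, not the JIT code): the 64-bit register holds two 32-bit offsets, a third 32-bit value extends it to 96 bits, and each iteration one offset rotates out of the low end while the spare enters the high end.

```go
package main

import "fmt"

// rotate96 models one iteration of the n=3 scheme: the low 32 bits of the
// 64-bit mx/ma register rotate out (the offset being consumed), the register
// shifts down, and the spare 32-bit value enters at the top.
func rotate96(reg uint64, spare uint32) (newReg uint64, out uint32) {
	out = uint32(reg)                    // low 32 bits rotate out
	newReg = reg>>32 | uint64(spare)<<32 // shift down, spare enters the top
	return
}

func main() {
	reg := uint64(0xBBBBBBBB_AAAAAAAA) // two in-flight offsets: B (high), A (low)
	spare := uint32(0xCCCCCCCC)        // third offset, C
	for i := 0; i < 3; i++ {
		var out uint32
		reg, out = rotate96(reg, spare)
		fmt.Printf("step %d: out=%08x reg=%016x\n", i, out, reg)
		spare = out // in the real loop this would be the freshly computed offset
	}
}
```

After three rotations the 96-bit state returns to where it started, which is exactly the C -> A -> B -> C ring behavior.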
21:37:14
eureka:
24 core zen 6 ... is that dual 12 core CCD?
22:41:18
sech1:
Yes, zen 6 will have 12 core CCD
22:41:31
sech1:
zen 6c will have 32 core CCD, but it's only for servers