
WIP: Sketch primops-based atomic counter and distribution implementation #42

Open

wants to merge 7 commits into master
Conversation

TravisWhitaker

After finding #41 I was curious whether a primops-based replacement for the cbits could achieve competitive performance. It turns out that it can, with some caveats.

I took four measurements to compare the implementation presented here with the current tip of master:

  • counter benchmark with -N1
  • counter benchmark with -N
  • distribution benchmark with -N1
  • distribution benchmark with -N

The results were recorded on my 2018 i9 MacBook Pro, which has 12 logical cores (6 physical cores with two hyperthreads each), so -N means -N12 on this machine. I added -O2 to ghc-options for both the current tip of master and the code here. These are the results for the current tip of master:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 80.14 ms   (78.19 ms .. 82.57 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 79.68 ms   (78.43 ms .. 80.69 ms)
std dev              1.949 ms   (1.070 ms .. 3.085 ms)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 206.8 ms   (187.4 ms .. 235.2 ms)
                     0.992 R²   (0.988 R² .. 1.000 R²)
mean                 197.8 ms   (191.7 ms .. 204.6 ms)
std dev              9.020 ms   (5.511 ms .. 12.86 ms)
variance introduced by outliers: 14% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 117.3 ms   (113.1 ms .. 123.7 ms)
                     0.997 R²   (0.993 R² .. 1.000 R²)
mean                 115.9 ms   (113.9 ms .. 118.1 ms)
std dev              3.398 ms   (2.475 ms .. 4.696 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 242.1 ms   (233.4 ms .. 252.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 233.4 ms   (229.9 ms .. 238.0 ms)
std dev              4.882 ms   (2.161 ms .. 7.047 ms)
variance introduced by outliers: 16% (moderately inflated)

And here are the results for the code presented here:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 64.14 ms   (61.11 ms .. 66.64 ms)
                     0.995 R²   (0.985 R² .. 0.999 R²)
mean                 64.64 ms   (62.84 ms .. 66.19 ms)
std dev              3.086 ms   (2.273 ms .. 4.978 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 187.6 ms   (156.1 ms .. 218.5 ms)
                     0.989 R²   (0.982 R² .. 1.000 R²)
mean                 193.1 ms   (182.9 ms .. 202.1 ms)
std dev              13.66 ms   (8.776 ms .. 21.37 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 346.9 ms   (340.3 ms .. 351.4 ms)
                     1.000 R²   (1.000 R² .. NaN R²)
mean                 339.8 ms   (336.3 ms .. 342.5 ms)
std dev              3.661 ms   (1.435 ms .. 4.770 ms)
variance introduced by outliers: 19% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 251.8 ms   (241.7 ms .. 266.4 ms)
                     0.997 R²   (0.987 R² .. 1.000 R²)
mean                 250.8 ms   (245.9 ms .. 256.0 ms)
std dev              6.666 ms   (5.042 ms .. 8.009 ms)
variance introduced by outliers: 16% (moderately inflated)

The new implementation yields slightly faster atomic counter performance in both the single-capability and capability-per-core cases. This could be because the counter no longer lives in pinned memory, or because the new implementation avoids withForeignPtr, which isn't free.
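For reference, the primops-based counter amounts to something like the following. This is a minimal self-contained sketch, not the exact code in this PR; the 8# allocation assumes a 64-bit word size, and the names are mine.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main (main) where

import GHC.Exts
import GHC.IO (IO (..))

-- A counter in a single word of an unpinned MutableByteArray#.
-- NOTE: 8# assumes a 64-bit word; real code should derive the size
-- from the platform word size rather than hard-coding it.
data Counter = Counter (MutableByteArray# RealWorld)

newCounter :: IO Counter
newCounter = IO $ \s0 ->
  case newByteArray# 8# s0 of
    (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
      s2 -> (# s2, Counter mba #)

-- fetchAddIntArray# is an atomic read-modify-write with a full barrier.
incCounter :: Counter -> IO ()
incCounter (Counter mba) = IO $ \s0 ->
  case fetchAddIntArray# mba 0# 1# s0 of
    (# s1, _old #) -> (# s1, () #)

readCounter :: Counter -> IO Int
readCounter (Counter mba) = IO $ \s0 ->
  case atomicReadIntArray# mba 0# s0 of
    (# s1, n #) -> (# s1, I# n #)

main :: IO ()
main = do
  c <- newCounter
  mapM_ (\_ -> incCounter c) [1 .. 1000 :: Int]
  readCounter c >>= print  -- prints 1000
```

Because the byte array is unpinned and the increment is a single primop, there is no FFI call and no ForeignPtr bookkeeping on the hot path.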

The distribution results are quite interesting. In the single-capability case this implementation is about three times slower than the existing implementation, but the gap almost entirely disappears in the capability-per-core case. I'm not quite sure what's going on here. I thought the fact that an unsafe FFI call can't be interrupted by GC could have something to do with it, but running with -S shows that in both the single-capability and capability-per-core cases only a single major GC takes place, at the end of benchmark execution. I suspect the answer is somewhere in the Core emitted for spinLock, but I haven't dug into it yet.

The most important difference between the two implementations is correctness. Because of #41, ekg sometimes reports stale or gibberish metrics on newer aarch64 chips, while this implementation produces correct results (at least the results look right to me, but a second pair of eyes on the distribution update code would be much appreciated).

I'm not too happy about having to unpack I# from a CAF to get the field array offsets in these functions, but since hsc2hs seems to always wrap parentheses around the results of its directives, I couldn't figure out how to generate unboxed literals from them. One solution might be to export TH splices from a separate hsc file, then splice the unboxed literals into the System.Metrics.Distribution module. That would also allow us to skip hsc2hs on that module entirely, which would roughly halve the number of # directives required.
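The TH idea might look something like the sketch below. The module name and struct/field names are entirely hypothetical; the point is that TH's IntPrimL literal constructor yields an unboxed Int# at the splice site, so the consuming module needs neither hsc2hs nor I#-unpacking from a CAF.

```haskell
-- DistribOffsets.hsc: a hypothetical helper module, still processed by
-- hsc2hs, whose only job is to export struct offsets as TH splices.
{-# LANGUAGE MagicHash #-}
module DistribOffsets (countOffset, sumOffset) where

import Language.Haskell.TH (Exp (LitE), Lit (IntPrimL), Q)

-- $(countOffset) in System.Metrics.Distribution splices an unboxed
-- Int# literal, with no parentheses-wrapped boxed value in between.
countOffset :: Q Exp
countOffset = pure (LitE (IntPrimL (#offset struct distrib, count)))

sumOffset :: Q Exp
sumOffset = pure (LitE (IntPrimL (#offset struct distrib, sum)))
```

This fragment only compiles after an hsc2hs pass against the real C struct definition, so take it as a shape rather than working code.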

Another minor difference is the change from Int64 to Int. This is necessary because GHC doesn't provide any 64-bit-wide atomic primops on 32-bit platforms.

@TravisWhitaker TravisWhitaker changed the title Sketch primops-based atomic counter and distribution implementation WIP: Sketch primops-based atomic counter and distribution implementation Jul 19, 2020
@TravisWhitaker
Author

It turns out I mistakenly used an atomic write to release the distribution spinlock, when a weak (plain) write is sufficient, and much faster. With that fixed, the primops implementation's performance is comparable to the existing implementation's (even a bit faster in the multi-capability case).
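The acquire/release pattern described above can be sketched as follows. This is my own minimal reconstruction, not the PR's code: acquire spins on a full-barrier CAS, and release is a plain (non-atomic) store. Whether a plain store is a safe release on weakly ordered hardware depends on the barriers already implied by the surrounding atomic ops, so treat the release as a sketch of the idea.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main (main) where

import GHC.Exts
import GHC.IO (IO (..))

-- Lock word at index 0: 0 = unlocked, 1 = locked.
-- (8# assumes a 64-bit word size.)
data Lock = Lock (MutableByteArray# RealWorld)

newLock :: IO Lock
newLock = IO $ \s0 ->
  case newByteArray# 8# s0 of
    (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
      s2 -> (# s2, Lock mba #)

-- Acquire: CAS 0 -> 1 until we observe 0. casIntArray# is a
-- full-barrier compare-and-swap that returns the old value.
acquire :: Lock -> IO ()
acquire (Lock mba) = IO go
  where
    go s0 = case casIntArray# mba 0# 0# 1# s0 of
      (# s1, old #) -> case old of
        0# -> (# s1, () #)  -- we took the lock
        _  -> go s1         -- contended; spin

-- Release: a weak (plain) store, per the comment above.
release :: Lock -> IO ()
release (Lock mba) = IO $ \s0 ->
  case writeIntArray# mba 0# 0# s0 of
    s1 -> (# s1, () #)

withLock :: Lock -> IO a -> IO a
withLock l act = do
  acquire l
  r <- act
  release l
  pure r

main :: IO ()
main = do
  l <- newLock
  withLock l (putStrLn "in critical section")
```

Swapping the release from an atomic store to a plain store removes a redundant barrier from every distribution update, which matches the speedup reported here.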

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 120.9 ms   (117.6 ms .. 125.8 ms)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 117.8 ms   (116.1 ms .. 119.9 ms)
std dev              2.907 ms   (2.099 ms .. 4.170 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 210.1 ms   (199.2 ms .. 219.5 ms)
                     0.998 R²   (0.992 R² .. 1.000 R²)
mean                 225.5 ms   (219.2 ms .. 236.0 ms)
std dev              10.80 ms   (5.505 ms .. 15.14 ms)
variance introduced by outliers: 14% (moderately inflated)

@TravisWhitaker
Author

One way or another, #41 needs to be fixed for EKG to work well on newer ARM boards. If there's no interest in this patch, we should find another way to achieve memory safety on weakly ordered machines.

awjchen added a commit to hasura/ekg-core that referenced this pull request May 29, 2021
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by
replacing the C code for the distribution metric with GHC prim ops.

The performance of this implementation is about half that of the
existing C code in a single-threaded benchmark; without masking the
performance is comparable.

This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of
tibbe/ekg-core.
awjchen added a commit to hasura/ekg-core that referenced this pull request Jun 23, 2021
@TravisWhitaker
Author

@23Skidoo What more can I do to help get this package fixed on aarch64?


@AndreasPK AndreasPK left a comment

The Int overflowing on 32-bit platforms might be a concern, and I wonder whether reads should take the lock. But it looks reasonable to me on the whole.

, cSum :: !Double
, cMin :: !Double
, cMax :: !Double
, cLock :: !Int64 -- ^ 0 - unlocked, 1 - locked


Seems like a good idea to keep this comment somehow.
