
WIP: Sketch primops-based atomic counter and distribution implementation #42

Open

wants to merge 7 commits into master
Conversation

TravisWhitaker

After finding #41 I was curious whether a primops-based replacement for the cbits could achieve competitive performance. It turns out that it can, with some caveats.

I took four measurements to compare the implementation presented here with the current tip of master:

  • counter benchmark with -N1
  • counter benchmark with -N
  • distribution benchmark with -N1
  • distribution benchmark with -N

The results were recorded on my 2018 i9 MacBook Pro, which has 12 logical cores (6 physical cores with two hyperthreads each), so -N means -N12 on this machine. I added -O2 to ghc-options for both the current tip of master and the code here. These are the results for the current tip of master:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 80.14 ms   (78.19 ms .. 82.57 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 79.68 ms   (78.43 ms .. 80.69 ms)
std dev              1.949 ms   (1.070 ms .. 3.085 ms)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 206.8 ms   (187.4 ms .. 235.2 ms)
                     0.992 R²   (0.988 R² .. 1.000 R²)
mean                 197.8 ms   (191.7 ms .. 204.6 ms)
std dev              9.020 ms   (5.511 ms .. 12.86 ms)
variance introduced by outliers: 14% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 117.3 ms   (113.1 ms .. 123.7 ms)
                     0.997 R²   (0.993 R² .. 1.000 R²)
mean                 115.9 ms   (113.9 ms .. 118.1 ms)
std dev              3.398 ms   (2.475 ms .. 4.696 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 242.1 ms   (233.4 ms .. 252.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 233.4 ms   (229.9 ms .. 238.0 ms)
std dev              4.882 ms   (2.161 ms .. 7.047 ms)
variance introduced by outliers: 16% (moderately inflated)

And here are the results for the code presented here:

$ bench ./counter-N1.sh
benchmarking ./counter-N1.sh
time                 64.14 ms   (61.11 ms .. 66.64 ms)
                     0.995 R²   (0.985 R² .. 0.999 R²)
mean                 64.64 ms   (62.84 ms .. 66.19 ms)
std dev              3.086 ms   (2.273 ms .. 4.978 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./counter-N.sh
benchmarking ./counter-N.sh
time                 187.6 ms   (156.1 ms .. 218.5 ms)
                     0.989 R²   (0.982 R² .. 1.000 R²)
mean                 193.1 ms   (182.9 ms .. 202.1 ms)
std dev              13.66 ms   (8.776 ms .. 21.37 ms)
variance introduced by outliers: 15% (moderately inflated)

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 346.9 ms   (340.3 ms .. 351.4 ms)
                     1.000 R²   (1.000 R² .. NaN R²)
mean                 339.8 ms   (336.3 ms .. 342.5 ms)
std dev              3.661 ms   (1.435 ms .. 4.770 ms)
variance introduced by outliers: 19% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 251.8 ms   (241.7 ms .. 266.4 ms)
                     0.997 R²   (0.987 R² .. 1.000 R²)
mean                 250.8 ms   (245.9 ms .. 256.0 ms)
std dev              6.666 ms   (5.042 ms .. 8.009 ms)
variance introduced by outliers: 16% (moderately inflated)

The new implementation yields slightly faster atomic counter performance in both the single-capability and capability-per-core cases. This could be because the counter no longer lives in pinned memory, or because the new implementation avoids withForeignPtr, which isn't free.
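For reference, the primops-based counter amounts to something like the following. This is a minimal self-contained sketch, not the exact code in this PR; the 8# allocation assumes a 64-bit word size, and the names are mine.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main (main) where

import GHC.Exts
import GHC.IO (IO (..))

-- A counter in a single word of an unpinned MutableByteArray#.
-- NOTE: 8# assumes a 64-bit word; real code should derive the size
-- from the platform word size rather than hard-coding it.
data Counter = Counter (MutableByteArray# RealWorld)

newCounter :: IO Counter
newCounter = IO $ \s0 ->
  case newByteArray# 8# s0 of
    (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
      s2 -> (# s2, Counter mba #)

-- fetchAddIntArray# is an atomic read-modify-write with a full barrier.
incCounter :: Counter -> IO ()
incCounter (Counter mba) = IO $ \s0 ->
  case fetchAddIntArray# mba 0# 1# s0 of
    (# s1, _old #) -> (# s1, () #)

readCounter :: Counter -> IO Int
readCounter (Counter mba) = IO $ \s0 ->
  case atomicReadIntArray# mba 0# s0 of
    (# s1, n #) -> (# s1, I# n #)

main :: IO ()
main = do
  c <- newCounter
  mapM_ (\_ -> incCounter c) [1 .. 1000 :: Int]
  readCounter c >>= print  -- prints 1000
```

Because the byte array is unpinned and the increment is a single primop, there is no FFI call and no ForeignPtr bookkeeping on the hot path.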

The distribution results are quite interesting. In the single-capability case this implementation is about three times slower than the existing implementation, but the gap almost entirely disappears in the capability-per-core case. I'm not quite sure what's going on here. I thought the fact that an unsafe FFI call can't be interrupted by GC could have something to do with it, but running with -S shows that in both the single-capability and capability-per-core cases only a single major GC takes place, at the end of benchmark execution. I suspect the answer is somewhere in the Core emitted for spinLock, but I haven't dug into it yet.

The most important difference between the two implementations is correctness. Because of #41, ekg sometimes reports stale or gibberish metrics on newer aarch64 chips, while this implementation produces correct results (at least the results look right to me, but a second pair of eyes on the distribution update code would be much appreciated).

I'm not too happy about having to unpack I# from a CAF to get the field array offsets in these functions, but since hsc2hs seems to always wrap parentheses around the results of its directives, I couldn't figure out how to generate unboxed literals from them. One solution might be to export TH splices from a separate hsc file, then splice the unboxed literals into the System.Metrics.Distribution module. That would also allow us to skip hsc2hs on that module entirely, which would roughly halve the number of # directives required.
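The TH idea might look something like the sketch below. The module name and struct/field names are entirely hypothetical; the point is that TH's IntPrimL literal constructor yields an unboxed Int# at the splice site, so the consuming module needs neither hsc2hs nor I#-unpacking from a CAF.

```haskell
-- DistribOffsets.hsc: a hypothetical helper module, still processed by
-- hsc2hs, whose only job is to export struct offsets as TH splices.
{-# LANGUAGE MagicHash #-}
module DistribOffsets (countOffset, sumOffset) where

import Language.Haskell.TH (Exp (LitE), Lit (IntPrimL), Q)

-- $(countOffset) in System.Metrics.Distribution splices an unboxed
-- Int# literal, with no parentheses-wrapped boxed value in between.
countOffset :: Q Exp
countOffset = pure (LitE (IntPrimL (#offset struct distrib, count)))

sumOffset :: Q Exp
sumOffset = pure (LitE (IntPrimL (#offset struct distrib, sum)))
```

This fragment only compiles after an hsc2hs pass against the real C struct definition, so take it as a shape rather than working code.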

Another minor difference is the change from Int64 to Int. This is necessary because GHC doesn't provide any 64-bit-wide atomic primops on 32-bit platforms.

@TravisWhitaker TravisWhitaker changed the title Sketch primops-based atomic counter and distribution implementation WIP: Sketch primops-based atomic counter and distribution implementation Jul 19, 2020
@TravisWhitaker
Author

It turns out I mistakenly used an atomic write to release the distribution spinlock, when a weak (plain) write is sufficient, and much faster. With that fixed, the primops implementation's performance is comparable to the existing implementation's (even a bit faster in the multi-capability case).
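The acquire/release pattern described above can be sketched as follows. This is my own minimal reconstruction, not the PR's code: acquire spins on a full-barrier CAS, and release is a plain (non-atomic) store. Whether a plain store is a safe release on weakly ordered hardware depends on the barriers already implied by the surrounding atomic ops, so treat the release as a sketch of the idea.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main (main) where

import GHC.Exts
import GHC.IO (IO (..))

-- Lock word at index 0: 0 = unlocked, 1 = locked.
-- (8# assumes a 64-bit word size.)
data Lock = Lock (MutableByteArray# RealWorld)

newLock :: IO Lock
newLock = IO $ \s0 ->
  case newByteArray# 8# s0 of
    (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
      s2 -> (# s2, Lock mba #)

-- Acquire: CAS 0 -> 1 until we observe 0. casIntArray# is a
-- full-barrier compare-and-swap that returns the old value.
acquire :: Lock -> IO ()
acquire (Lock mba) = IO go
  where
    go s0 = case casIntArray# mba 0# 0# 1# s0 of
      (# s1, old #) -> case old of
        0# -> (# s1, () #)  -- we took the lock
        _  -> go s1         -- contended; spin

-- Release: a weak (plain) store, per the comment above.
release :: Lock -> IO ()
release (Lock mba) = IO $ \s0 ->
  case writeIntArray# mba 0# 0# s0 of
    s1 -> (# s1, () #)

withLock :: Lock -> IO a -> IO a
withLock l act = do
  acquire l
  r <- act
  release l
  pure r

main :: IO ()
main = do
  l <- newLock
  withLock l (putStrLn "in critical section")
```

Swapping the release from an atomic store to a plain store removes a redundant barrier from every distribution update, which matches the speedup reported here.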

$ bench ./distribution-N1.sh
benchmarking ./distribution-N1.sh
time                 120.9 ms   (117.6 ms .. 125.8 ms)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 117.8 ms   (116.1 ms .. 119.9 ms)
std dev              2.907 ms   (2.099 ms .. 4.170 ms)
variance introduced by outliers: 11% (moderately inflated)

$ bench ./distribution-N.sh
benchmarking ./distribution-N.sh
time                 210.1 ms   (199.2 ms .. 219.5 ms)
                     0.998 R²   (0.992 R² .. 1.000 R²)
mean                 225.5 ms   (219.2 ms .. 236.0 ms)
std dev              10.80 ms   (5.505 ms .. 15.14 ms)
variance introduced by outliers: 14% (moderately inflated)

@TravisWhitaker
Author

One way or another, #41 needs to be fixed for EKG to work well on newer ARM boards. If there's no interest in this patch, we should find another way to achieve memory safety on weakly ordered machines.

awjchen added a commit to hasura/ekg-core that referenced this pull request May 29, 2021
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by
replacing the C code for the distribution metric with GHC prim ops.

The performance of this implementation is about half that of the
existing C code in a single-threaded benchmark; without masking the
performance is comparable.

This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of
tibbe/ekg-core.
awjchen added a commit to hasura/ekg-core that referenced this pull request Jun 23, 2021
@TravisWhitaker
Author

@23Skidoo What more can I do to help get this package fixed on aarch64?


@AndreasPK AndreasPK left a comment

The Int overflowing on 32-bit platforms might be a concern, and I wonder whether reads should take the lock. But it looks reasonable to me on the whole.

, cSum :: !Double
, cMin :: !Double
, cMax :: !Double
, cLock :: !Int64 -- ^ 0 - unlocked, 1 - locked


Seems like a good idea to keep this comment somehow.
