-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Sketch primops-based atomic counter and distribution implementation #42
base: master
Are you sure you want to change the base?
WIP: Sketch primops-based atomic counter and distribution implementation #42
Conversation
8288724
to
4273d93
Compare
4273d93
to
420f559
Compare
Turns out I mistakenly used an atomic write to release the distribution spinlock, when a weak write is sufficient (and much faster). Now the primops implementation is comparable to the existing implementation's performance (even a bit faster in the multicapability case).
|
One way or another, #41 needs to be fixed for EKG to work well on newer ARM boards. If there's no interest in this patch, we should find another way to achieve memory safety on weakly ordered machines. |
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by replacing the C code for the distribution metric with GHC prim ops. The performance of this implementation is about half that of the existing C code in a single-threaded benchmark; without masking the performance is comparable. This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of tibbe/ekg-core.
This commit attempts to address issue haskell-github-trust#41 of tibbe/ekg-core by replacing the C code for the distribution metric with GHC prim ops. The performance of this implementation is about half that of the existing C code in a single-threaded benchmark; without masking the performance is comparable. This commit is based on the work of Travis Whitaker in PR haskell-github-trust#42 of tibbe/ekg-core.
@23Skidoo What more can I do to help get this package fixed on aarch64? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Int
overflowing on 32bit might be a concern and I wonder if read should be locking. But looks reasonable to me on the whole.
, cSum :: !Double | ||
, cMin :: !Double | ||
, cMax :: !Double | ||
, cLock :: !Int64 -- ^ 0 - unlocked, 1 - locked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea to keep this comment somehow.
After finding #41 I was curious whether or not a primops-based implementation of cbits could achieve competitive performance. Turns out that it can, with some caveats.
I took four measurements to compare the implementation presented here with the current tip of master:
counter
benchmark with-N1
counter
benchmark with-N
distribution
benchmark with-N1
distribution
benchmark with-N
The results were recorded on my 2018 i9 MacBook Pro, which has 12 logical cores (6 physical cores, two hyperthreads each), so
-N
means-N12
on this machine. I added-O2
toghc-options
for the current tip of master and the code here. These are the results for the current tip of master:And here are the results for the code presented here:
The new implementation yields slightly faster atomic counter performance in both the single-capability and capability-per-core cases. This could be due to the fact that in the new implementation the counter is not in pinned memory, or the lack of
withForeignPtr
, which isn't free.The distribution results are quite interesting. In the single-capability case, this implementation is about three times slower than the existing implementation, but the gap between them almost entirely disappears in the capability-per-core case. I'm not quite sure what's going on here. I thought that perhaps the fact that the unsafe FFI call can't be interrupted by GC could have something to do with it, but using
-S
shows that only a single major GC at the end of benchmark execution takes place in both the single-capability case and the capability-per-core case. I suspect the answer is somewhere in the core emitted forspinLock
, but I haven't dug into it yet.The most important difference between the two implementations is correctness. Because of #41, ekg sometimes reports stale or gibberish metrics on newer aarch64 chips, while this implementation produces correct results (at least the results look right to me, but a second pair of eyes on the distribution update code would be much appreciated).
I'm not too happy about having to unpack
I#
from a CAF to get the field array offsets in these functions, but sincehsc2hs
seems to always wrap parens around the results of its directives, I couldn't figure out how to generate an unboxed literal from them. One solution might be to export TH splices from a separate hsc file, then splice in the unboxed literals in the System.Metrics.Distribution module. That would also allow us to skip hsc2hs on that module entirely, which would roughly halve the number of#
required.Another minor difference is the change from
Int64
toInt
. This is necessary because GHC doesn't make available any 64 bit wide atomic primops on 32 bit chips.