How to select the leading (max) and subleading (2nd max) values of an ak array in a memory-efficient way? #3055
Hello, I have a question about my specific use case of obtaining the leading and subleading values from each column of an ak array. Let's assume that we have an array of jet pt values from a
We can notice that the original
Now, I need to get the top two leading jet pts separately, and I do this by padding the first two columns with None and then using `ak.where`:
This gives me
Which is exactly what I want. However, I realize that padding ak arrays with None values can bloat the memory usage considerably, so I would like to know if there's a way to do this operation without
And this prints
I get this error message:
This is the point that I am stuck on. If anybody knows a way around this issue or a completely different way to go about this, I would be obliged! Thank you!
Replies: 2 comments 10 replies
My understanding of what we want to do here is something like `np.argpartition`, to extract the top-N largest values (in this case, 2), followed by a simple slice. If you always want both collections, it would be easier to perform a ragged-index followed by slice:

```python
import awkward as ak

ix_jet_pt_2_largest = ak.argpartition(good_jet_pts, axis=1)
ordered_2_jets = good_jet_pts[ix_jet_pt_2_largest]
leading_jet = ak.firsts(ordered_2_jets[:, :1])
subleading_jet = ak.firsts(ordered_2_jets[:, 1:])
```

Although a `pad_none` might be faster at this point:

```python
import awkward as ak

ix_jet_pt_2_largest = ak.argpartition(good_jet_pts, axis=1)
ordered_2_jets = ak.pad_none(
    good_jet_pts[ix_jet_pt_2_largest],
    target=2,
    clip=True
)
leading_jet = ordered_2_jets[:, 0]
subleading_jet = ordered_2_jets[:, 1]
```

As of the time of writing, we don't have an implementation of `ak.argpartition`.

It should go without saying, all of this depends upon how many jets you have per event. If the answer is "not many", then this is all probably noise in the wind vs a simple sort-and-slice.

I started playing around with this in a notebook. I can't offer any strong insights yet, as I ran out of time. I assume @jpivarski might weigh in here too, so it might be useful to save him some time. https://gist.github.com/agoose77/cd77e3b230b2132cbedff05c162eb222

What I can say from the findings above is that for small numbers of jets, you should just sort the array and slice it. For larger jet counts, you'll perhaps want to perform some kind of partition.
Sorry about bringing up this thread from the grave, but I was wondering about Even if it is used at a record array level, wouldn't it be more memory efficient to have Since I am processing my samples with millions of events (at least before any selection), even padding just one or two each row with Thanks in advance!