Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add diffing algorithm to maps #199

Merged
merged 4 commits into from
Feb 7, 2022

Conversation

mvtec-richter
Copy link
Contributor

This is the implementation of a diff algorithm for immer::map. It compares two immer::map instances and calls respective lambda functions when keys were added or removed or when the value of a key changed. When data is shared among the two maps, the algorithm will not need to compare shared subtrees of the map.
We plan to use this for an undo feature.

The diff algorithm has following signature:
̂map::diff(map const& other, AddedFn&& added, ChangedFn&& changed, RemovedFn&& Removed)
where the lambdas have the signature:

void AddedFn(std::pair<Key, Value> const& added_entry)
void ChangedFn(std::pair<Key, Value> const& old_item, std::pair<Key, Value> const& new_data)
void RemovedFn(std::pair<Key, Value> const& removed_entry)

Usage:

immer::map<int, int> old_snapshot; // initialize somehow
immer::map<int, int> new_snapshot = old_snapshot.set(1,2); // change somehow
old_snapshot.diff(
  new_snapshot,
  [](auto const& added){ std::cout << "added key " << added.first << std::endl; },
  [](auto const& old, auto const& new){
    std::cout << "changed key " << old.first << " from " << old.second << " to " << new.second << std::endl; },
  [](auto const& removed){ std::cout << "removed key " << removed.first << std::endl; }
  );

This signature works for our usecase, however we are happy for further input and would like to help to upstream this feature.

This partially addresses: #120

@arximboldi
Copy link
Owner

Oh wow, just wow! Is it Christmas?! This is the most exciting PR this repository has received from a 3rd party so far! 🥳 Danke so sehr Florian!

Before merging, I'd like to discuss a little bit the API. Note that these are not requests to change the code by itself, I'm just curious about what you think here.

1. Now I'm also thinking about the API. Should it maybe be a free standing function?

diff(a, b, on_added, on_updated, on_removed);

2. Also, alternatively, should we group the handler functions into an object?

template <class AddedFn, class ChangedFn, class RemovedFn>
struct differ {
    AddedFn added;
    ChangedFn changed;
    RemovedFn removed;
}; 

template <class AddedFn, class ChangedFn, class RemovedFn>
auto make_differ(AddedFn&& added, ChangedFn&& changed, RemovedFn&& removed) 
  -> differ<std::decay_t<AddedFn>, std::decay_t<ChangedFn>, std::decay_t<RemoveFn>> { 
  return {std::forward<AddedFn>(added), std::forward<ChangedFn>(changed std::forward<RemovedFn>(removed)};
 }

...
diff(a, b, differ);
diff_with(a, b, added, removed, changed) { diff(a, b, make_differ(...))}

3. Have you looked into diffing vectors as well? API wise, I think 3 functions like that could be provided, excepting the "value" should maybe be a pair<std::size_t, const T&> with the index of the addition/removal/update. Implementation, particularly on flex_vector, can be significantly more tricky.

4. One thing that would be amazing would be to add special integration to Lager such that when you have cursors that descend from a map or set, this difing algorithm is used to detect updates in children in sub-O(1) time. This requires adding some kind of special hook API to detect containers that support diffing and hooking via the diff when lenses traverse down from container. I have some ideas, I'll think about this and hopefully find some time to implement that once this is merged. Are you using Lager yourself?

5. Also, this is very much outside of the scope of this PR, but one could imagine in the future leveraging a similar traversal structure to produce: operator+ and operator- that do the set union and diff of the HAMTs. Which in turn could be a way to implement diffing (but more inefficient than this, because it causes allocations that may not be needed in the general case for diffing.)

Copy link
Owner

@arximboldi arximboldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the questions above, I just saw one issue with the implementation:

Inside champ, you are accessing .first and .second on the values of the map. However, champ itself is agnostic of whether the tree represents a Map or a Set. You can see how immer::detail::champ is instantiated differently in immer::set<> and immer::map<> in order to achieve the two structures. I believe that you can achieve the desired genericity by using Equal{} instead of a.first == b.first and passing an additional function type to compare values (like we do for champ::equals).

Extra points if you also add tests for immer::set 😄


But then again, thank you so much for this brilliant PR!

@mvtec-richter
Copy link
Contributor Author

Thanks for the comments!

  1. I agree a free function might be better. We had this in mind too. I'll implement this.
  2. Grouping the function in an object is very sensible. I just wasn't sure about how to correctly do this with templates. I'll try your suggestion.
      1. As we are not using vectors and lager, we haven't looked into this, but it sounds interesting.
        I was aware of the issue with the std::pair in immer::detail::champ (sorry, I forgot to mention it in the PR). I will add another template parameter for comparing values to champ and see if I can get immer::set to work.

@arximboldi
Copy link
Owner

Regarding the last point @mvtec-richter, you do not need to add another template parameter to champ, you can add it to the implementation of the diff function there. See equals as an example.

@mvtec-richter
Copy link
Contributor Author

I implemented the usage of a template variable for Equals and EqualsValue and also added sets.
Concerning changing the interface to a free function and using a struct for the function, I will have a look into it maybe tomorrow.
I noticed, that sets don't have a changed function, so these differ structs might be different for map/set/...

@arximboldi
Copy link
Owner

@mvtec-richter thank you a lot for the latest changes!

I've noticed that you have added a EqualValue parameter to the diff() method of the public interface itself (immer::map::diff). I think we could for now not parametrize it and just use the default one; this makes it consistent with immer::map::operator==, which also just uses the default one. This means that there is no way to customize equality for values for now, but I am happy to live with that rigidity for simplicity.

Wrt. the change callback making set for sets, I wonder if it may make sense, to detect "changes" in values of two sets that compare equal when using Equal, but are different to operator==. What do you think?

@mvtec-richter
Copy link
Contributor Author

Of course I can remove the Equal template parameter from the interface. That will simplify the interface.

Wrt. the change callback, I don't think it makes sense to detect such "changes" in sets. If there are multiple layers of equal, you probably wanted a map in the first place.

However, if we have a free standing diff function, we could provide two variants, one with changed callback and one without. It might also be useful for maps to have a diff function without the changed callback. Even if it's not useful, I think its not completely wrong to allow a diff function with changed callback for sets.

I have another question concerning the use of std::forward for lambdas. I wonder, if our usage in champ::diff is correct. Should the callback be passed as & instead of &&, because they are forwarded multiple times in the function?

@arximboldi
Copy link
Owner

Hi!

Of course I can remove the Equal template parameter from the interface. That will simplify the interface.

Cool 👍

Wrt. the change callback, I don't think it makes sense to detect such "changes" in sets. If there are multiple layers of equal, you probably wanted a map in the first place.

I agree 👍

However, if we have a free standing diff function, we could provide two variants, one with changed callback and one without. It might also be useful for maps to have a diff function without the changed callback. Even if it's not useful, I think its not completely wrong to allow a diff function with changed callback for sets.

Yes, I agree that the interface should be homogeneous, so one can write generic code that uses diff regardless of whether the underlying data-structure is a a map or a set (and maybe in the future, even vector). I am happy with only having the 3-lambda version (or differ version) and having the change callback simply never do anything for set. Having the 2-lambda version additionally is also fine. I am happy with whatever you prefer in this case; with maybe a slight preference for having the 3-lambda or differ version only, since it's easier to add things than to remove things.

I have another question concerning the use of std::forward for lambdas. I wonder, if our usage in champ::diff is correct. Should the callback be passed as & instead of &&, because they are forwarded multiple times in the function?

You are correct in that it's incorrect :) Taking the lambdas in as && is fine, but the std::forward call is what makes it incorrect. (Or more pedantically, it's not fully incorrect, it's only dangerous, because you are not really consuming the forwarded value by either forwading into the operator() nor forwading it into a by-value nor any r-value consuming function of any sort). SImply removing the std::forward<> calls in the tree traversal would fix the code. You don't need to remove it in the code that passes the lambdas down only once.

@codecov-commenter
Copy link

Codecov Report

Merging #199 (f91922e) into master (42e6bea) will increase coverage by 0.19%.
The diff coverage is 99.54%.

❗ Current head f91922e differs from pull request most recent head f23dc6c. Consider uploading reports for the commit f23dc6c to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #199      +/-   ##
==========================================
+ Coverage   91.27%   91.47%   +0.19%     
==========================================
  Files         104      104              
  Lines        9574     9791     +217     
==========================================
+ Hits         8739     8956     +217     
  Misses        835      835              
Impacted Files Coverage Δ
immer/set.hpp 96.42% <80.00%> (-3.58%) ⬇️
immer/detail/hamts/bits.hpp 100.00% <100.00%> (ø)
immer/detail/hamts/champ.hpp 93.96% <100.00%> (+3.40%) ⬆️
immer/detail/hamts/node.hpp 88.50% <100.00%> (+0.02%) ⬆️
immer/map.hpp 98.07% <100.00%> (+0.16%) ⬆️
test/map/generic.ipp 98.72% <100.00%> (+0.30%) ⬆️
test/set/generic.ipp 99.04% <100.00%> (+0.10%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42e6bea...f23dc6c. Read the comment docs.

@mvtec-richter
Copy link
Contributor Author

Thanks for your comments!
I created a struct for the differ as you suggested. It greatly simplifies some function signatures.
I wonder where to place these free standing functions. Should we add a header or is algorithm.hpp fine?
I'll go with the 3-lambda differ as it simplifies the code.
I also removed std::forward and used normal references as the diff function is recursive.

@arximboldi
Copy link
Owner

I'm merging this and will work on polishing the details myself. Thank you very much for this work Florian!

@arximboldi arximboldi merged commit d372683 into arximboldi:master Feb 7, 2022
@khiner
Copy link

khiner commented Feb 23, 2024

Hi @arximboldi ! I've been holding onto some pretty nasty architecture in my application, waiting and hoping for someone to extend the diff mechanism to immer::vector, and immer::flex_vector 😅

I'm hitting the point in my app where it's getting silly to work around this, so I've decided to block further development on having the ability to diff immer vectors. Before jumping into implementation to contribute these diffs (which I've put off since it seems a bit intimidating...), I wonder if you have any thoughts/tips/suggestions, or on the outside chance you or someone else has already started to look into this?

It sounds like the current map diff implementation would be a good starting point?

... the interface should be homogeneous, so one can write generic code that uses diff regardless of whether the underlying data-structure is a a map or a set (and maybe in the future, even vector).

Thank you!

@arximboldi
Copy link
Owner

Hi @khiner!

Thank you for looking into this... that would be an incredibly useful contribution. In fact... I do miss this feature all the time!

One of the reasons it hasn't been tackled though is that it's actually not easy work. It is definitely harder than HAMT diffing. In fact, I suspect that it would require some degree of original algorithmic work (potentially worth an academic paper though, in case that's an encouragement for you!).

As a starting point, I would recommend getting some familiarity with how Relaxed Radix Balanced Tries work. You can check the first parts of the Immer paper. I also find L'Orange's master's thesis one of the clearest descriptions of this data-structure.

Then we'd need to find some way to adapt some of the existing sequence diffing algorithms to leverage structural sharing, by operating somewhat recursively in the RRB-trie. You can look into this Rust library for a list of existing algorithms (and their implementation). I have this discussion in my notes also as it contains interesting pointers in that field.

Let me know if there is some way I can further help!

@khiner
Copy link

khiner commented Feb 26, 2024

Hi @arximboldi , thanks for getting back on this and for the detailed notes and pointers! I see, it sounds like this is far from trivial, but in theory possible to achieve the same type of O(|diff|) complexity currently provided by the HAMT differ, in your estimation? This does sound like a rewarding project, but I may be too busy with other commitments to contribute this level of research... but I am motivated to have this, so maybe ;)

@arximboldi
Copy link
Owner

Yes, my intuition says we we should be able to achieve that! Big big reward indeed! Game changer for lots of applications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants