Skip to content

Write sparse index #563

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Nov 4, 2022
Merged

Write sparse index #563

merged 31 commits into from
Nov 4, 2022

Conversation

SidneyDouw
Copy link
Contributor

@SidneyDouw SidneyDouw commented Oct 20, 2022

Tracked in #562

Tasks

  • roundtrip tests
    raw byte comparison is failing because of an order mismatch in the tree extension and that should be investigated / fixed
    for now though this can be can be worked around by
    • comparing the in memory states
    • comparing only the relevant portions of bytes instead of all all bytes are relevant
    • creating the baseline index without tree extension not sure that is a possibility with git
  • check git-config docs for options that influence sparse index writing
    • update gix progress to reflect those findings
  • sparse index can be written by providing the corresponding extension flag in the write::Options
  • index decides automatically if it needs to be sparse or not based on what kind of options are being passed in
  • writes sparse / regular index depending on if DIR entries exist in State or not
    • is_sparse flag tracks / caches this for us
      do we trust other functions to modify this flag correctly or do we need a way of verifying this, a.k.a. checking the DIR entries our selves

Notes
Just for myself to remember and to hammer the point home:
write_to just looks at the current State and writes whatever it finds to out. There are some optional extensions that can be configured directly, but everything else (like git-config stuff) it does not care about. This is handled entirely by different functions that mutate the State to the desired state before writing it out.

@SidneyDouw SidneyDouw mentioned this pull request Oct 20, 2022
17 tasks
@SidneyDouw
Copy link
Contributor Author

SidneyDouw commented Oct 27, 2022

I think I am a little stuck with this at the moment because I am not entirely sure how much functionality should belong to the write implementation. For example:
I want to setup a test that reads in a sparse index into State, which then calls write_to() with options to write out a regular / non sparse index. That means that the entries in the sparse index marked as Mode::DIR would have to be transformed / expanded into a list of Mode::File entries. Now, is that transformation part of the work that State::write_to should do, or does that belong to a different set of functions that modify the State before writing? Or in other words: does write_to just take the State and "stupidly" writes it to a file or is it also responsible for doing these kinds of transformations?

@SidneyDouw
Copy link
Contributor Author

Just pushing what I have so far for you to maybe get a better idea, but keep in mind this is still veeery much WIP, so don't expect too much…yet :D

Byron added 5 commits October 27, 2022 19:06
Previously it would have been possible to write the sparse extension
without any indication that it is actually needed.

For now we rely on a flag which is technically a cache, after all
Mode::DIR indicates whether an index entry is sparse or not, and
with it the index itself. At least so I think.
…if `State` requires it.

Speaking of, I do wonder how the `link` extension plays into all of
this. Can there be sparse indices without `link`? I think so, after
all it's just an optimization to avoid having to read from trees
all the time.
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to setup a test that reads in a sparse index into State, which then calls write_to() with options to write out a regular / non sparse index. That means that the entries in the sparse index marked as Mode::DIR would have to be transformed / expanded into a list of Mode::File entries. Now, is that transformation part of the work that State::write_to should do, or does that belong to a different set of functions that modify the State before writing? Or in other words: does write_to just take the State and "stupidly" writes it to a file or is it also responsible for doing these kinds of transformations?

I took a quick look and am happy to share my thoughts on this.
My first question was: is State::write_to() allowed to mutate? It probably shouldn't and currently isn't, which indicates that on-the-fly changes to State as it's written aren't advisable. If this is taken as indication, there would be no other way than to have the caller perform an additional step to write a non-sparse index.

From what I could tell, my expectation would be that write_to serializes exactly what's currently in memory, and it should fail if for whatever reason the task can't be performed without mutation.

I have also made a few minor modifications which help to make that approach clearer.

One thing to be careful about is the is_sparse flag. It's a cache, and I think fn is_sparse() could make clear what a sparse index is. Is it the presence of the sdir extension, or that itself always present if there are DIR mode entries? I'd expect it to be the latter, so the flag is really just a shortcut telling us these entries are present.

However, with the State being mutable, this might not be the case anymore and I suggest to not rely on this flag at all, and rather check all entries to see of the is-sparse extension is necessary. Furthermore, lower-case extensions are mandatory, hence they can't be affected by options.

On another note, is the resolution of entries, i.e. DIR -> FILE, not dependent on the presence of the link extension? I thought that this is a performance improvement to avoid having to recurse trees (which as we now know, is slower in git than it has to be). If that was the case, it's certainly something to make sure we don't forget and somehow incorporate into the resolution API. It's certainly OK to do it in steps though and focus on the non-link version first.

@Byron
Copy link
Member

Byron commented Oct 27, 2022

Just pushing what I have so far for you to maybe get a better idea, but keep in mind this is still veeery much WIP, so don't expect too much…yet :D

Thanks a lot! I think keeping this PR up-to-date generally helps as I am likely to make some changes while I am looking at it.

@SidneyDouw
Copy link
Contributor Author

Great, thank you for your response, that definitely clears things up!

One thing to be careful about is the is_sparse flag. It's a cache, and I think fn is_sparse() could make clear what a sparse index is. Is it the presence of the sdir extension, or that itself always present if there are DIR mode entries? I'd expect it to be the latter, so the flag is really just a shortcut telling us these entries are present.

I guess I was a little bit confused about this while looking into how is_sparse is being set, because it is true if either Mode::DIR entries OR the sdir extension are detected.
At first I overlooked the comment about needing a marker when no directores are present, but I tested it out and it is possible to create such an index with git (even though it doesn't make much sense I think).

@Byron
Copy link
Member

Byron commented Nov 1, 2022

At first I overlooked the comment about needing a marker when no directores are present, but I tested it out and it is possible to create such an index with git (even though it doesn't make much sense I think).

Oh, right! That's good to keep in mind and maybe adjust as we know more. After all, is_sparse doesn't have to affect anything if we don't want it to, and it's probably OK to drop 'is_sparse' extension while writing if the index isn't actually sparse (and there is no link either). It will probably clear up in time.

@SidneyDouw
Copy link
Contributor Author

SidneyDouw commented Nov 2, 2022

By not using the cached is_sparse flag and computing the value dynamically during writing, the roundtrip test fails in the "sparse index without directories" case. That's because git still writes the sdir extension in this case, while gitoxide assumes the index should be non-sparse by just looking at the entries. (I assume git is looking at the config to figure that one out)
In any case, it shouldn't matter too much because both versions are valid index files, we just have to decide how we want to handle this :)

I think this is ready to be reviewed, we can clear up the remaining questions and find out if you deem it worthy to be merged :D

Some remaining open questions / tasks:

  • Tree extension order in gitoxide is different than in git, currently prevents raw byte comparison in roundtrip tests
  • git allows configuring the index version, should gitoxide do the same?
  • gitoxide writes sdir extension only when necessary (based on DIR entries) while git does not

@SidneyDouw SidneyDouw marked this pull request as ready for review November 2, 2022 11:00
@Byron
Copy link
Member

Byron commented Nov 3, 2022

Thanks a lot for the summary! Despite open questions, I think it's OK to merge and potentially address them in a follow-up PR depending on your preference.

is_sparse is a tricky one and I am sorry if I caused additional work by what I said (e.g. breaking round-trip tests), and I don't mean to be flip-flopping. But before getting ahead of myself, let's address the questions one by one.

Tree extension order in gitoxide is different than in git, currently prevents raw byte comparison in roundtrip tests

Can we have a checkbox to revisit this in the tracking ticket? Eventually we better understand whether enforcing a sort order like gitoxide does is an improvement over git or not to see what we want to go with.

git allows configuring the index version, should gitoxide do the same?

Yes, we need to. There is an index.version config option that could be fed into a desired_version: Option<Version> field of the write options, as we still have to potentially upgrade the actually written version depending on features present in the data. This requirement can be kept in the tracking ticket for follow ups if you like.

gitoxide writes sdir extension only when necessary (based on DIR entries) while git does not

After going back and forth with this, my opinion really depends on what you find best. git seems to never remove the extension once set which I feel is probably correct - don't try to be smarter than you have to while writing. gitoxide now may do something surprising by cleaning itself up and removing the extension if it's not required anymore which also has its benefits - after all it allows for more compatible indices as it removes an unnecessary mandatory extension. This is just as desirable. What's your preference ?

I will hopefully get to the review soon. (Please note that changing a comment also doesn't trigger a new notification, so I saw this one by chance while following up on the first email I got that didn't have the review note yet)

Conflicts:
	git-index/tests/index/file/read.rs
Byron added 4 commits November 4, 2022 10:07
It's something we want to read, but not write.
Minimal support, that's all we need. Sparse indices seem to be the
future.
@Byron
Copy link
Member

Byron commented Nov 4, 2022

Thanks for the session today!

Let me sum up what we came up with.

  • we realigned work from the perspective of what would help starship most, and came up with supporting all mandatory index extensions so that such indices can be worked with in gitoxide
  • For sdir, this probably means… nothing more is needed. ensure_full_index() is required to 'unsparse' and index, but I can't imagine an operation (in starship) that actually needs it.
    - This is going against what we discussed initially though, thus far I took the need for ensure_full_index() for granted which I now revised due to better focus on what starhip would need.
  • The link extensions needs to be analysed with a focus on understanding how we can create a full index from shared indices (so a complete index is available in memory) and how such an index can be written back to disk without writing entries that are present in the shared indices. This system should be able to deal with changes, as changed entries should indeed be written back to the base index. The latter requires some API changes as some code currently accesses &mut [Entries], so it must either learn to set bitflags accordingly, or it has to go through a set_entry() API of sorts that does it automatically.
    • ensure_full_index() probably still isn't needed for the above as in theory, the shared indices could be read on the fly while the base index is read.

Once link is usable, starship shouldn't be able to encounter unreadable repositories anymore and implementations for gix status should work out of the box.

Did I miss anything major is is there anything else I could help with? Please let me know.

@Byron
Copy link
Member

Byron commented Nov 4, 2022

I actually missed something above: it's the need for supporting extensions.worktreeConfig. Correctly interpreting the configuration would probably be needed to do a gix status correctly.

Byron added 3 commits November 4, 2022 18:00
We try really hard to describe these in detail and also condense
them into an enum that represents all valid states.

At this point it's not clear where these are going to be used as these
flags as related features aren't yet implemneted, like checkout
and status.
…ies anymore.

Let's prefer to be round-trippable to avoid degenerating any
information.

At a later stage, I imagine the methods that operate on an index, for
instance by adding DIR entries or removing them, to detect this case
and remove the extension automatically.
@Byron
Copy link
Member

Byron commented Nov 4, 2022

The latter requires some API changes as some code currently accesses &mut [Entries], so it must either learn to set bitflags accordingly, or it has to go through a set_entry() API of sorts that does it automatically.

More thoughts related to the comment above. I think this can be simpler and completely contained within the write_to method by looking up every entry to be written and seeing if it's contained in a shared index. If so, if it's the same, don't write it. This slows writes of split index, but keeps the API simple, which seems like the right trade-off given that split indices are basically a thing of the past.

Byron added 4 commits November 4, 2022 18:39
Note that it's likely that we need something like `desired_version`
as option before V4 indices are supported even when we start writing
indices for commits, for example. By then we also want to support
all extensions, so a lot of work needed for proper support.
@Byron
Copy link
Member

Byron commented Nov 4, 2022

I just read in code that ensure_full_index() can be reverted at write time, which might explain why git happily keeps the sdir extension. It might rather serve as a flag to try to recreate/sparsify the index when writing. The latter I have added to the tracking issue for completeness.

@Byron Byron merged commit ba17db0 into GitoxideLabs:main Nov 4, 2022
@Byron
Copy link
Member

Byron commented Nov 4, 2022

Thanks a lot for your help!

I used this review to fully understand sparse index (at least more 'fully' then before :D) and now have a much better understand how it all comes together with upcoming features. Very useful.

I see the following next steps:

  • Add support for extensions.worktreeConfig so that sparse index configuration is properly read
  • setup a tracking issue for the link extension, with the goal to read shared indices when reading the base index and merge them right away, while also gaining the ability to detect changed entries and write only those back to the base index on write, or if they are not yet in a shared index. That would be minimal support, but an analysis could yield more insights.

Does this make sense at all, did I miss something? Please let me know.

@SidneyDouw SidneyDouw deleted the write-sparse-index branch November 16, 2022 07:56
@Byron
Copy link
Member

Byron commented Nov 23, 2022

I just noticed that there is already a 'hook' in the current git-repository code which might help to add worktree support there: https://github.com/Byron/gitoxide/blob/d6ef2ce9d0a7883d7d3b5ddb0010c8ab3d26bb78/git-repository/src/repository/permissions.rs#L32-L34

@SidneyDouw
Copy link
Contributor Author

Thanks for the tip, I will take a closer look at that!
I am currently in the process of working through the git-repository code trying to figure out where and how the config is being loaded in the hope to understand how worktree configs can best fit in.

@Byron
Copy link
Member

Byron commented Nov 25, 2022

When I fetched git I noticed that a lot of sparse index documentation was added: https://github.com/git/git/blob/c000d916380bb59db69c78546928eadd076b9c7d/Documentation/technical/sparse-checkout.txt . Will give it a read now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants