-
Notifications
You must be signed in to change notification settings - Fork 144
sparse-checkout: add 'clean' command #1941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
sparse-checkout: add 'clean' command #1941
Conversation
The logic for the 'git sparse-checkout' builtin uses the_repository all over the place, despite some use of a repository struct in different method parameters. Complete this removal of the_repository by using 'repo' when possible. In one place, there was already a local variable 'r' that was set to the_repository, so move that to a method parameter. We cannot remove the USE_THE_REPOSITORY_VARIABLE declaration as we are still using global constants for the state of the sparse-checkout. Signed-off-by: Derrick Stolee <[email protected]>
When users change their sparse-checkout definitions to add new directories and remove old ones, there may be a few reasons why directories no longer in scope remain (ignored or excluded files still exist, Windows handles are still open, etc.). When these files still exist, the sparse index feature notices that a tracked, but sparse, directory still exists on disk and thus the index expands. This causes a performance hit _and_ the advice printed isn't very helpful. Using 'git clean' isn't enough (generally '-dfx' may be needed) but also this may not be sufficient. Add a new subcommand to 'git sparse-checkout' that removes these tracked-but-sparse directories, including any excluded or ignored files underneath. This is the most extreme method for doing this, but it works when the sparse-checkout is in cone mode and is expected to rescope based on directories, not files. Be sure to add a --dry-run option so users can predict what will be deleted. In general, output the directories that are being removed so users can know what was removed. Note that untracked directories remain. Further, directories that contain staged changes are not deleted. This is a detail that is partly hidden by the implementation which relies on collapsing the index to a sparse index in-memory and only deleting directories that are listed as sparse in the index. If a staged change exists, then that entry is not stored as a sparse tree entry and thus remains on-disk until committed or reset. Signed-off-by: Derrick Stolee <[email protected]>
In my experience, the most-common reason that the sparse index must expand to a full one is because there is some leftover file in a tracked directory that is now outside of the sparse-checkout. The new 'git sparse-checkout clean' command will find and delete these directories, so point users to it when they hit the sparse index expansion advice. Signed-off-by: Derrick Stolee <[email protected]>
/submit |
Submitted as [email protected] To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Patrick Steinhardt wrote (reply to this): On Tue, Jul 08, 2025 at 11:19:50AM +0000, Derrick Stolee via GitGitGadget wrote:
> When using cone-mode sparse-checkout, users specify which tracked
> directories they want (recursively) and any directory not part of the parent
> paths for those directories are considered "out of scope". When changing
> sparse-checkouts, there are a variety of reasons why these "out of scope"
> directories could remain, including:
>
> * The user has .gitignore or .git/info/exclude files that tell Git to not
> remove files of a certain type.
> * Some filesystem blocker prevented the removal of a tracked file. This is
> usually more of an issue on Windows where a read handle will block file
> deletion.
>
> Typically, this would not mean too much for the user experience. A few extra
> filesystem checks might be required to satisfy git status commands, but the
> scope of the performance hit is relative to how many cruft files are left
> over in this situation.
>
> However, when using the sparse index, these tracked sparse directories cause
> significant performance issues. When noticing that the index contains a
> sparse directory but that directory exists on disk, Git needs to expand that
> sparse directory to determine which files are tracked or untracked. The
> current mechanism expands the entire index to a full one, an expensive
> operation that scales with the total number of paths at HEAD and not just
> the number of cruft files left over.
>
> Advice was added in 9479a31d603 (advice: warn when sparse index expands,
> 2024-07-08) to help users determine that they were in this state. However,
> the advice doesn't actually recommend helpful ways to get out of this state.
> Recommending "git clean" on its own is incomplete, as typically users
> actually need 'git clean -dfx' to clear out the ignored or excluded files.
> Even then, they may need 'git sparse-checkout reapply' afterwards to clear
> the sparse directories.
>
> The advice was successful in helping to alert users to the problem, which is
> how I got wind of many of these cases for how users get into this state.
> It's now time to give them a tool that helps them out of this state.
As usual for you, this is a nicely-written summary of how we got here
and why the current mechanisms are insufficient for mere mortals.
> This series adds a new 'git sparse-checkout clean' command that currently
> only works for cone-mode sparse-checkouts. The only thing it does is
> collapse the index to a sparse index (as much as possible) and make sure
> that any sparse directories are removed. These directories are listed to
> stdout.
>
> A --dry-run option is available to list the directories that would be
> removed without actually deleting the directories.
>
> This option would be preferred to something like 'git clean -dfx' since it
> does not clear the excluded files that are still within the sparse-checkout.
> Instead, it performs the exact filesystem operations required to refresh the
> sparse index performance back to what is expected.
>
> I spent a few weeks debating with myself about whether or not this was the
> right interface, so please suggest alternatives if you have better ideas.
> Among my rejected ideas include:
>
> * 'git sparse-checkout reapply -f -x' or similar augmentations of
> 'reapply'.
> * 'git clean --sparse' to focus the clean operation on things outside of
> the sparse-checkout.
>
> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.
One of the benefits of your new command is that we can extend it in the
future as necessary if we ever notice that there are other things that
we need to do to bring the sparse checkout up to date again. So without
yet having had a look at the implementation I think this direction is
quite sensible.
Ideally it would of course be great if we could automatically fix the
issue for our users. But as we have to prune potentially-ignored data it
is very much a no-go to do that in the automatically.
Patrick |
User |
@@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files | |||
SYNOPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Patrick Steinhardt wrote (reply to this):
On Tue, Jul 08, 2025 at 11:19:52AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-sparse-checkout.adoc b/Documentation/git-sparse-checkout.adoc
> index 529a8edd9c1e..21ba6f759905 100644
> --- a/Documentation/git-sparse-checkout.adoc
> +++ b/Documentation/git-sparse-checkout.adoc
> @@ -111,6 +111,17 @@ flags, with the same meaning as the flags from the `set` command, in order
> to change which sparsity mode you are using without needing to also respecify
> all sparsity paths.
>
> +'clean'::
> + Remove all files in tracked directories that are outside of the
> + sparse-checkout definition. This subcommand requires cone-mode
> + sparse-checkout to be sure that we know which directories are
> + both tracked and all contained paths are not in the sparse-checkout.
> + This command can be used to be sure the sparse index works
> + efficiently.
> ++
> +The `clean` command can also take the `--dry-run` (`-n`) option to list
> +the directories it would remove without performing any filesystem changes.
> +
Hm. This is somewhat different from `git clean`, where you have to pass
`-f` to make it delete any data. I'm not particularly a fan of that
mode, but should we maybe retain it regardless to ensure that things are
at least a tiny bit more consistent?
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 8b70d0c6a441..6d2843827367 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -924,6 +924,76 @@ static int sparse_checkout_reapply(int argc, const char **argv,
> return update_working_directory(repo, NULL);
> }
>
> +static char const * const builtin_sparse_checkout_clean_usage[] = {
> + "git sparse-checkout clean [-n|--dry-run]",
> + NULL
> +};
> +
> +static struct sparse_checkout_clean_opts {
> + int dry_run;
> +} clean_opts;
> +
> +static int sparse_checkout_clean(int argc, const char **argv,
> + const char *prefix,
> + struct repository *repo)
> +{
> + struct strbuf full_path = STRBUF_INIT;
> + size_t worktree_len;
> + static struct option builtin_sparse_checkout_clean_options[] = {
> + OPT_BOOL('n', "dry-run", &clean_opts.dry_run,
> + N_("list the directories that would be removed without making filesystem changes")),
> + OPT_END(),
> + };
> +
> + setup_work_tree();
> + if (!core_apply_sparse_checkout)
> + die(_("must be in a sparse-checkout to clean directories"));
> + if (!core_sparse_checkout_cone)
> + die(_("must be in a cone-mode sparse-checkout to clean directories"));
> +
> + argc = parse_options(argc, argv, prefix,
> + builtin_sparse_checkout_clean_options,
> + builtin_sparse_checkout_clean_usage, 0);
> +
> + if (repo_read_index(repo) < 0)
> + die(_("failed to read index"));
> +
> + if (convert_to_sparse(repo->index, SPARSE_INDEX_MEMORY_ONLY))
> + die(_("failed to convert index to a sparse index"));
I noticed that there are several cases in `convert_to_sparse()` where we
simply do nothing. Should we check whether `repo->index->sparse_index`
matches `INDEX_COLLAPSED` after the operation?
> + strbuf_addstr(&full_path, repo->worktree);
> + strbuf_addch(&full_path, '/');
> + worktree_len = full_path.len;
> +
> + for (size_t i = 0; i < repo->index->cache_nr; i++) {
> + DIR* dir;
Nit: the `*` goes with the variable, not the type.
> + struct cache_entry *ce = repo->index->cache[i];
> + if (!S_ISSPARSEDIR(ce->ce_mode))
> + continue;
Okay, we only need to handle sparse directories.
> + strbuf_setlen(&full_path, worktree_len);
> + strbuf_add(&full_path, ce->name, ce->ce_namelen);
> +
> + dir = opendir(full_path.buf);
Shouldn't it be sufficient to use `is_directory()`?
> + if (!dir)
> + continue;
This is the good and expected case, right? The entry is sparse, so
ideally it doesn't exist. If it does we have to recurse into to end up
with the full index.
> + else if (ENOENT != errno) {
Nit: style. If one branches requires curly braces, all branches should
use them.
> + warning_errno(_("failed to check for existence of '%s'"), ce->name);
> + continue;
> + }
> +
> + closedir(dir);
> +
> + printf("%s\n", ce->name);
git-clean(1) says "Removing %s\n". Should we do the same here?
> + if (!clean_opts.dry_run) {
> + if (remove_dir_recursively(&full_path, 0))
> + warning_errno(_("failed to remove '%s'"), ce->name);
> + }
> + }
> +
> + strbuf_release(&full_path);
> + return 0;
> +}
> +
> static char const * const builtin_sparse_checkout_disable_usage[] = {
> "git sparse-checkout disable",
> NULL
Patrick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
Patrick Steinhardt <[email protected]> writes:
> On Tue, Jul 08, 2025 at 11:19:52AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/Documentation/git-sparse-checkout.adoc b/Documentation/git-sparse-checkout.adoc
>> index 529a8edd9c1e..21ba6f759905 100644
>> --- a/Documentation/git-sparse-checkout.adoc
>> +++ b/Documentation/git-sparse-checkout.adoc
>> @@ -111,6 +111,17 @@ flags, with the same meaning as the flags from the `set` command, in order
>> to change which sparsity mode you are using without needing to also respecify
>> all sparsity paths.
>>
>> +'clean'::
>> + Remove all files in tracked directories that are outside of the
>> + sparse-checkout definition. This subcommand requires cone-mode
>> + sparse-checkout to be sure that we know which directories are
>> + both tracked and all contained paths are not in the sparse-checkout.
>> + This command can be used to be sure the sparse index works
>> + efficiently.
>> ++
>> +The `clean` command can also take the `--dry-run` (`-n`) option to list
>> +the directories it would remove without performing any filesystem changes.
>> +
>
> Hm. This is somewhat different from `git clean`, where you have to pass
> `-f` to make it delete any data. I'm not particularly a fan of that
> mode, but should we maybe retain it regardless to ensure that things are
> at least a tiny bit more consistent?
Ah, it reminds me of my favorite "regret". We may want to consider
making the --force/--dry-run used in "git clean" saner at a major
version boundary. I am not particulary a fan of that mode, and
would oppose a patch made as a part of regular "let's change this,
as I do not like it" exercise, but as a known-breaking change, I do
not mind it at all. Essentially the change to propose would be to
deprecate clean.requireForce and internally make it a constant
false.
But that is a tangent ;-)
On the Git mailing list, Elijah Newren wrote (reply to this): On Tue, Jul 8, 2025 at 4:19 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> When using cone-mode sparse-checkout, users specify which tracked
> directories they want (recursively) and any directory not part of the parent
> paths for those directories are considered "out of scope". When changing
> sparse-checkouts, there are a variety of reasons why these "out of scope"
> directories could remain, including:
>
> * The user has .gitignore or .git/info/exclude files that tell Git to not
> remove files of a certain type.
> * Some filesystem blocker prevented the removal of a tracked file. This is
> usually more of an issue on Windows where a read handle will block file
> deletion.
>
> Typically, this would not mean too much for the user experience. A few extra
> filesystem checks might be required to satisfy git status commands, but the
> scope of the performance hit is relative to how many cruft files are left
> over in this situation.
>
> However, when using the sparse index, these tracked sparse directories cause
> significant performance issues. When noticing that the index contains a
> sparse directory but that directory exists on disk, Git needs to expand that
> sparse directory to determine which files are tracked or untracked. The
> current mechanism expands the entire index to a full one, an expensive
> operation that scales with the total number of paths at HEAD and not just
> the number of cruft files left over.
>
> Advice was added in 9479a31d603 (advice: warn when sparse index expands,
> 2024-07-08) to help users determine that they were in this state. However,
> the advice doesn't actually recommend helpful ways to get out of this state.
> Recommending "git clean" on its own is incomplete, as typically users
> actually need 'git clean -dfx' to clear out the ignored or excluded files.
> Even then, they may need 'git sparse-checkout reapply' afterwards to clear
> the sparse directories.
>
> The advice was successful in helping to alert users to the problem, which is
> how I got wind of many of these cases for how users get into this state.
> It's now time to give them a tool that helps them out of this state.
>
> This series adds a new 'git sparse-checkout clean' command that currently
> only works for cone-mode sparse-checkouts. The only thing it does is
> collapse the index to a sparse index (as much as possible) and make sure
> that any sparse directories are removed. These directories are listed to
> stdout.
But what does it clean up?
- untracked files?
- ignored files?
- tracked-but-unmodified files?
- tracked-and-modified files?
- tracked-and-conflicted files? (which is probably a subset of
tracked-and-modified, but thought I'd call it out)
Note: "tracked" probably has a slightly ambiguous connotation here
since we sometimes mean "is it in the index", and there's a difference
between "would it be in the sparse index" and "would it be in the
fully expanded index". Here, by "tracked" I mean the latter -- "is it
in the fully expanded index".
> A --dry-run option is available to list the directories that would be
> removed without actually deleting the directories.
>
> This option would be preferred to something like 'git clean -dfx' since it
> does not clear the excluded files that are still within the sparse-checkout.
This seems to suggest you are only interested in untracked and ignored
files. I'm sure that's by far the most common case, but I'm curious
about the others. Are you expecting users to sometimes need to run
both 'git sparse-checkout clean' and 'git sparse-checkout reapply'?
> Instead, it performs the exact filesystem operations required to refresh the
> sparse index performance back to what is expected.
But what operations are those and what is expected?
As you mentioned above, for untracked or ignored files, the
expectation is that those would be removed.
I think if there are tracked-but-unmodified files, I'd expect those to
be removed as well.
If only the above filetypes exist, then we'd expect the directory to
be nuked and sparse index performance to be improved back to "normal".
However, if there are tracked-and-modified files, I'd expect an error
and for the sparse index performance to continue to suffer until those
paths are resolved. (Or, pie-in-sky spitballing:maybe we could
attempt to do something smarter like make sibling directories to the
tracked-and-modified path be treated as sparse directories, so that
performance only suffers a little).
> I spent a few weeks debating with myself about whether or not this was the
> right interface, so please suggest alternatives if you have better ideas.
> Among my rejected ideas include:
>
> * 'git sparse-checkout reapply -f -x' or similar augmentations of
> 'reapply'.
The connection to sparse-checkout reapply at least would make it
clearer what you are doing with tracked files, since its explanation
explicitly mentions those. However, reapply doesn't say anything
about untracked or ignored files, which we'd need to start explaining
and perhaps isn't as clean a fit, especially since your new usecase is
predominantly about untracked and ignored files. I don't have a
strong opinion here, but I think I also like your choice of a separate
'clean' subcommand better.
> * 'git clean --sparse' to focus the clean operation on things outside of
> the sparse-checkout.
Yeah, this choice would have likely prevented you from cleaning up
tracked files, and required users to run both 'clean --sparse' and
'sparse-checkout reapply'. And this command feels more tightly
connected to sparse-checkouts to me, so I wouldn't have liked this
choice either.
> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.
I'm also curious what happens when (1) you are in cone mode and there
is no sparse index, or (2) when you are not in cone mode. I suspect
those and the questions above will be answered as I read the
individual patches, so I'll keep going...
> Thanks, -Stolee
>
> Derrick Stolee (3):
> sparse-checkout: remove use of the_repository
> sparse-checkout: add 'clean' command
> sparse-index: point users to new 'clean' action
>
> Documentation/git-sparse-checkout.adoc | 13 +-
> builtin/sparse-checkout.c | 192 +++++++++++++++++--------
> sparse-index.c | 3 +-
> t/t1091-sparse-checkout-builtin.sh | 48 +++++++
> 4 files changed, 197 insertions(+), 59 deletions(-)
>
>
> base-commit: 8b6f19ccfc3aefbd0f22f6b7d56ad6a3fc5e4f37
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1941%2Fderrickstolee%2Fgit-sparse-checkout-clean-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1941/derrickstolee/git-sparse-checkout-clean-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1941
> --
> gitgitgadget |
@@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r) | |||
ensure_full_index(r->index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> From: Derrick Stolee <[email protected]>
>
> The logic for the 'git sparse-checkout' builtin uses the_repository all
> over the place, despite some use of a repository struct in different
> method parameters. Complete this removal of the_repository by using
> 'repo' when possible.
>
> In one place, there was already a local variable 'r' that was set to
> the_repository, so move that to a method parameter.
Always nice to see these cleanups.
> We cannot remove the USE_THE_REPOSITORY_VARIABLE declaration as we are
> still using global constants for the state of the sparse-checkout.
Thanks for calling this out and explaining it.
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> builtin/sparse-checkout.c | 119 ++++++++++++++++++++------------------
> 1 file changed, 63 insertions(+), 56 deletions(-)
>
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 1bf01591b275..8b70d0c6a441 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r)
> ensure_full_index(r->index);
> }
>
> -static int update_working_directory(struct pattern_list *pl)
> +static int update_working_directory(struct repository *r,
> + struct pattern_list *pl)
> {
> enum update_sparsity_result result;
> struct unpack_trees_options o;
> struct lock_file lock_file = LOCK_INIT;
> - struct repository *r = the_repository;
> struct pattern_list *old_pl;
>
> /* If no branch has been checked out, there are no updates to make. */
> @@ -327,7 +327,8 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
> string_list_clear(&sl, 0);
> }
>
> -static int write_patterns_and_update(struct pattern_list *pl)
> +static int write_patterns_and_update(struct repository *repo,
> + struct pattern_list *pl)
> {
> char *sparse_filename;
> FILE *fp;
> @@ -336,15 +337,15 @@ static int write_patterns_and_update(struct pattern_list *pl)
>
> sparse_filename = get_sparse_checkout_filename();
>
> - if (safe_create_leading_directories(the_repository, sparse_filename))
> + if (safe_create_leading_directories(repo, sparse_filename))
> die(_("failed to create directory for sparse-checkout file"));
>
> hold_lock_file_for_update(&lk, sparse_filename, LOCK_DIE_ON_ERROR);
>
> - result = update_working_directory(pl);
> + result = update_working_directory(repo, pl);
> if (result) {
> rollback_lock_file(&lk);
> - update_working_directory(NULL);
> + update_working_directory(repo, NULL);
> goto out;
> }
>
> @@ -372,25 +373,26 @@ enum sparse_checkout_mode {
> MODE_CONE_PATTERNS = 2,
> };
>
> -static int set_config(enum sparse_checkout_mode mode)
> +static int set_config(struct repository *repo,
> + enum sparse_checkout_mode mode)
> {
> /* Update to use worktree config, if not already. */
> - if (init_worktree_config(the_repository)) {
> + if (init_worktree_config(repo)) {
> error(_("failed to initialize worktree config"));
> return 1;
> }
>
> - if (repo_config_set_worktree_gently(the_repository,
> + if (repo_config_set_worktree_gently(repo,
> "core.sparseCheckout",
> mode ? "true" : "false") ||
> - repo_config_set_worktree_gently(the_repository,
> + repo_config_set_worktree_gently(repo,
> "core.sparseCheckoutCone",
> mode == MODE_CONE_PATTERNS ?
> "true" : "false"))
> return 1;
>
> if (mode == MODE_NO_PATTERNS)
> - return set_sparse_index_config(the_repository, 0);
> + return set_sparse_index_config(repo, 0);
>
> return 0;
> }
> @@ -410,7 +412,7 @@ static enum sparse_checkout_mode update_cone_mode(int *cone_mode) {
> return MODE_ALL_PATTERNS;
> }
>
> -static int update_modes(int *cone_mode, int *sparse_index)
> +static int update_modes(struct repository *repo, int *cone_mode, int *sparse_index)
> {
> int mode, record_mode;
>
> @@ -418,20 +420,20 @@ static int update_modes(int *cone_mode, int *sparse_index)
> record_mode = (*cone_mode != -1) || !core_apply_sparse_checkout;
>
> mode = update_cone_mode(cone_mode);
> - if (record_mode && set_config(mode))
> + if (record_mode && set_config(repo, mode))
> return 1;
>
> /* Set sparse-index/non-sparse-index mode if specified */
> if (*sparse_index >= 0) {
> - if (set_sparse_index_config(the_repository, *sparse_index) < 0)
> + if (set_sparse_index_config(repo, *sparse_index) < 0)
> die(_("failed to modify sparse-index config"));
>
> /* force an index rewrite */
> - repo_read_index(the_repository);
> - the_repository->index->updated_workdir = 1;
> + repo_read_index(repo);
> + repo->index->updated_workdir = 1;
>
> if (!*sparse_index)
> - ensure_full_index(the_repository->index);
> + ensure_full_index(repo->index);
> }
>
> return 0;
> @@ -448,7 +450,7 @@ static struct sparse_checkout_init_opts {
> } init_opts;
>
> static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> struct pattern_list pl;
> char *sparse_filename;
> @@ -464,7 +466,7 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> };
>
> setup_work_tree();
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> init_opts.cone_mode = -1;
> init_opts.sparse_index = -1;
> @@ -473,7 +475,7 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> builtin_sparse_checkout_init_options,
> builtin_sparse_checkout_init_usage, 0);
>
> - if (update_modes(&init_opts.cone_mode, &init_opts.sparse_index))
> + if (update_modes(repo, &init_opts.cone_mode, &init_opts.sparse_index))
> return 1;
>
> memset(&pl, 0, sizeof(pl));
> @@ -485,14 +487,14 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> if (res >= 0) {
> free(sparse_filename);
> clear_pattern_list(&pl);
> - return update_working_directory(NULL);
> + return update_working_directory(repo, NULL);
> }
>
> - if (repo_get_oid(the_repository, "HEAD", &oid)) {
> + if (repo_get_oid(repo, "HEAD", &oid)) {
> FILE *fp;
>
> /* assume we are in a fresh repo, but update the sparse-checkout file */
> - if (safe_create_leading_directories(the_repository, sparse_filename))
> + if (safe_create_leading_directories(repo, sparse_filename))
> die(_("unable to create leading directories of %s"),
> sparse_filename);
> fp = xfopen(sparse_filename, "w");
> @@ -511,7 +513,7 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> add_pattern("!/*/", empty_base, 0, &pl, 0);
> pl.use_cone_patterns = init_opts.cone_mode;
>
> - return write_patterns_and_update(&pl);
> + return write_patterns_and_update(repo, &pl);
> }
>
> static void insert_recursive_pattern(struct pattern_list *pl, struct strbuf *path)
> @@ -674,7 +676,8 @@ static void add_patterns_literal(int argc, const char **argv,
> add_patterns_from_input(pl, argc, argv, use_stdin ? stdin : NULL);
> }
>
> -static int modify_pattern_list(struct strvec *args, int use_stdin,
> +static int modify_pattern_list(struct repository *repo,
> + struct strvec *args, int use_stdin,
> enum modify_type m)
> {
> int result;
> @@ -696,22 +699,23 @@ static int modify_pattern_list(struct strvec *args, int use_stdin,
> }
>
> if (!core_apply_sparse_checkout) {
> - set_config(MODE_ALL_PATTERNS);
> + set_config(repo, MODE_ALL_PATTERNS);
> core_apply_sparse_checkout = 1;
> changed_config = 1;
> }
>
> - result = write_patterns_and_update(pl);
> + result = write_patterns_and_update(repo, pl);
>
> if (result && changed_config)
> - set_config(MODE_NO_PATTERNS);
> + set_config(repo, MODE_NO_PATTERNS);
>
> clear_pattern_list(pl);
> free(pl);
> return result;
> }
>
> -static void sanitize_paths(struct strvec *args,
> +static void sanitize_paths(struct repository *repo,
> + struct strvec *args,
> const char *prefix, int skip_checks)
> {
> int i;
> @@ -752,7 +756,7 @@ static void sanitize_paths(struct strvec *args,
>
> for (i = 0; i < args->nr; i++) {
> struct cache_entry *ce;
> - struct index_state *index = the_repository->index;
> + struct index_state *index = repo->index;
> int pos = index_name_pos(index, args->v[i], strlen(args->v[i]));
>
> if (pos < 0)
> @@ -779,7 +783,7 @@ static struct sparse_checkout_add_opts {
> } add_opts;
>
> static int sparse_checkout_add(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_add_options[] = {
> OPT_BOOL_F(0, "skip-checks", &add_opts.skip_checks,
> @@ -796,7 +800,7 @@ static int sparse_checkout_add(int argc, const char **argv, const char *prefix,
> if (!core_apply_sparse_checkout)
> die(_("no sparse-checkout to add to"));
>
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> argc = parse_options(argc, argv, prefix,
> builtin_sparse_checkout_add_options,
> @@ -804,9 +808,9 @@ static int sparse_checkout_add(int argc, const char **argv, const char *prefix,
>
> for (int i = 0; i < argc; i++)
> strvec_push(&patterns, argv[i]);
> - sanitize_paths(&patterns, prefix, add_opts.skip_checks);
> + sanitize_paths(repo, &patterns, prefix, add_opts.skip_checks);
>
> - ret = modify_pattern_list(&patterns, add_opts.use_stdin, ADD);
> + ret = modify_pattern_list(repo, &patterns, add_opts.use_stdin, ADD);
>
> strvec_clear(&patterns);
> return ret;
> @@ -825,7 +829,7 @@ static struct sparse_checkout_set_opts {
> } set_opts;
>
> static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> int default_patterns_nr = 2;
> const char *default_patterns[] = {"/*", "!/*/", NULL};
> @@ -847,7 +851,7 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> int ret;
>
> setup_work_tree();
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> set_opts.cone_mode = -1;
> set_opts.sparse_index = -1;
> @@ -856,7 +860,7 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> builtin_sparse_checkout_set_options,
> builtin_sparse_checkout_set_usage, 0);
>
> - if (update_modes(&set_opts.cone_mode, &set_opts.sparse_index))
> + if (update_modes(repo, &set_opts.cone_mode, &set_opts.sparse_index))
> return 1;
>
> /*
> @@ -870,10 +874,10 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> } else {
> for (int i = 0; i < argc; i++)
> strvec_push(&patterns, argv[i]);
> - sanitize_paths(&patterns, prefix, set_opts.skip_checks);
> + sanitize_paths(repo, &patterns, prefix, set_opts.skip_checks);
> }
>
> - ret = modify_pattern_list(&patterns, set_opts.use_stdin, REPLACE);
> + ret = modify_pattern_list(repo, &patterns, set_opts.use_stdin, REPLACE);
>
> strvec_clear(&patterns);
> return ret;
> @@ -891,7 +895,7 @@ static struct sparse_checkout_reapply_opts {
>
> static int sparse_checkout_reapply(int argc, const char **argv,
> const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_reapply_options[] = {
> OPT_BOOL(0, "cone", &reapply_opts.cone_mode,
> @@ -912,12 +916,12 @@ static int sparse_checkout_reapply(int argc, const char **argv,
> builtin_sparse_checkout_reapply_options,
> builtin_sparse_checkout_reapply_usage, 0);
>
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> - if (update_modes(&reapply_opts.cone_mode, &reapply_opts.sparse_index))
> + if (update_modes(repo, &reapply_opts.cone_mode, &reapply_opts.sparse_index))
> return 1;
>
> - return update_working_directory(NULL);
> + return update_working_directory(repo, NULL);
> }
>
> static char const * const builtin_sparse_checkout_disable_usage[] = {
> @@ -927,7 +931,7 @@ static char const * const builtin_sparse_checkout_disable_usage[] = {
>
> static int sparse_checkout_disable(int argc, const char **argv,
> const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_disable_options[] = {
> OPT_END(),
> @@ -955,7 +959,7 @@ static int sparse_checkout_disable(int argc, const char **argv,
> * are expecting to do that when disabling sparse-checkout.
> */
> give_advice_on_expansion = 0;
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> memset(&pl, 0, sizeof(pl));
> hashmap_init(&pl.recursive_hashmap, pl_hashmap_cmp, NULL, 0);
> @@ -965,14 +969,14 @@ static int sparse_checkout_disable(int argc, const char **argv,
>
> add_pattern("/*", empty_base, 0, &pl, 0);
>
> - prepare_repo_settings(the_repository);
> - the_repository->settings.sparse_index = 0;
> + prepare_repo_settings(repo);
> + repo->settings.sparse_index = 0;
>
> - if (update_working_directory(&pl))
> + if (update_working_directory(repo, &pl))
> die(_("error while refreshing working directory"));
>
> clear_pattern_list(&pl);
> - return set_config(MODE_NO_PATTERNS);
> + return set_config(repo, MODE_NO_PATTERNS);
> }
>
> static char const * const builtin_sparse_checkout_check_rules_usage[] = {
> @@ -987,14 +991,17 @@ static struct sparse_checkout_check_rules_opts {
> char *rules_file;
> } check_rules_opts;
>
> -static int check_rules(struct pattern_list *pl, int null_terminated) {
> +static int check_rules(struct repository *repo,
> + struct pattern_list *pl,
> + int null_terminated)
> +{
> struct strbuf line = STRBUF_INIT;
> struct strbuf unquoted = STRBUF_INIT;
> char *path;
> int line_terminator = null_terminated ? 0 : '\n';
> strbuf_getline_fn getline_fn = null_terminated ? strbuf_getline_nul
> : strbuf_getline;
> - the_repository->index->sparse_checkout_patterns = pl;
> + repo->index->sparse_checkout_patterns = pl;
> while (!getline_fn(&line, stdin)) {
> path = line.buf;
> if (!null_terminated && line.buf[0] == '"') {
> @@ -1006,7 +1013,7 @@ static int check_rules(struct pattern_list *pl, int null_terminated) {
> path = unquoted.buf;
> }
>
> - if (path_in_sparse_checkout(path, the_repository->index))
> + if (path_in_sparse_checkout(path, repo->index))
> write_name_quoted(path, stdout, line_terminator);
> }
> strbuf_release(&line);
> @@ -1016,7 +1023,7 @@ static int check_rules(struct pattern_list *pl, int null_terminated) {
> }
>
> static int sparse_checkout_check_rules(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_check_rules_options[] = {
> OPT_BOOL('z', NULL, &check_rules_opts.null_termination,
> @@ -1055,7 +1062,7 @@ static int sparse_checkout_check_rules(int argc, const char **argv, const char *
> free(sparse_filename);
> }
>
> - ret = check_rules(&pl, check_rules_opts.null_termination);
> + ret = check_rules(repo, &pl, check_rules_opts.null_termination);
> clear_pattern_list(&pl);
> free(check_rules_opts.rules_file);
> return ret;
> @@ -1084,8 +1091,8 @@ int cmd_sparse_checkout(int argc,
>
> git_config(git_default_config, NULL);
>
> - prepare_repo_settings(the_repository);
> - the_repository->settings.command_requires_full_index = 0;
> + prepare_repo_settings(repo);
> + repo->settings.command_requires_full_index = 0;
>
> return fn(argc, argv, prefix, repo);
> }
> --
> gitgitgadget
Patch looks good to me.
@@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r) | |||
ensure_full_index(r->index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <[email protected]> writes:
> From: Derrick Stolee <[email protected]>
>
> The logic for the 'git sparse-checkout' builtin uses the_repository all
> over the place, despite some use of a repository struct in different
> method parameters. Complete this removal of the_repository by using
> 'repo' when possible.
>
> In one place, there was already a local variable 'r' that was set to
> the_repository, so move that to a method parameter.
>
> We cannot remove the USE_THE_REPOSITORY_VARIABLE declaration as we are
> still using global constants for the state of the sparse-checkout.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> builtin/sparse-checkout.c | 119 ++++++++++++++++++++------------------
> 1 file changed, 63 insertions(+), 56 deletions(-)
OK. The damage is not too bad for a partial update ;-).
As the file-scope static functions in builtin/sparse-checkout.c are
not going to be called by anybody else, it does not really matter if
they internally pass an extra parameter around or use the_repository
since the end result is the same. But doing this may hopefully help
those that may want to move some of these functions to a more
library-ish part of the system outside builtin/ hierarchy.
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 1bf01591b275..8b70d0c6a441 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r)
> ensure_full_index(r->index);
> }
>
> -static int update_working_directory(struct pattern_list *pl)
> +static int update_working_directory(struct repository *r,
> + struct pattern_list *pl)
> {
> enum update_sparsity_result result;
> struct unpack_trees_options o;
> struct lock_file lock_file = LOCK_INIT;
> - struct repository *r = the_repository;
> struct pattern_list *old_pl;
As this already used short-and-sweet 'r', we just follow suit to
minimize the damage, which is fine.
> @@ -327,7 +327,8 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
> string_list_clear(&sl, 0);
> }
>
> -static int write_patterns_and_update(struct pattern_list *pl)
> +static int write_patterns_and_update(struct repository *repo,
> + struct pattern_list *pl)
> {
> char *sparse_filename;
> FILE *fp;
> @@ -336,15 +337,15 @@ static int write_patterns_and_update(struct pattern_list *pl)
>
> sparse_filename = get_sparse_checkout_filename();
>
> - if (safe_create_leading_directories(the_repository, sparse_filename))
> + if (safe_create_leading_directories(repo, sparse_filename))
> die(_("failed to create directory for sparse-checkout file"));
>
> hold_lock_file_for_update(&lk, sparse_filename, LOCK_DIE_ON_ERROR);
>
> - result = update_working_directory(pl);
> + result = update_working_directory(repo, pl);
> if (result) {
> rollback_lock_file(&lk);
> - update_working_directory(NULL);
> + update_working_directory(repo, NULL);
> goto out;
> }
But this introduces a new parameter. Both of two instances of
repository struct used in the existing code in this function, other
than references to struct repository *UNUSED, use "r", and with this
patch, the name "repo" becomes more prevanent.
We would probably want to rename "r" to "repo" for consistency in
clean_tracked_sparse_repositories() and update_working_directory(),
but that is better done later after the dust settles and the code
around here becomes quiescent again, not as part of this topic as an
extra churn.
Looking good. Thanks. Will queue.
@@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files | |||
SYNOPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <[email protected]> writes:
> From: Derrick Stolee <[email protected]>
>
> When users change their sparse-checkout definitions to add new
> directories and remove old ones, there may be a few reasons why
> directories no longer in scope remain (ignored or excluded files still
> exist, Windows handles are still open, etc.). When these files still
> exist, the sparse index feature notices that a tracked, but sparse,
> directory still exists on disk and thus the index expands. This causes a
> performance hit _and_ the advice printed isn't very helpful. Using 'git
> clean' isn't enough (generally '-dfx' may be needed) but also this may
> not be sufficient.
>
> Add a new subcommand to 'git sparse-checkout' that removes these
> tracked-but-sparse directories, including any excluded or ignored files
Are excluded files and ignored files form two separate sets, or are
they one and the same? Do files that users forgot to add (e.g. new
source file that would not match any patterns listed in .gitignore)
and object files left over from the previous compilation (most
likely match *.o in .gitignore) treated the same way for the purpose
of determining if the directory that is no longer in the cone can be
removed?
> underneath. This is the most extreme method for doing this, but it works
> when the sparse-checkout is in cone mode and is expected to rescope
> based on directories, not files.
>
> Be sure to add a --dry-run option so users can predict what will be
> deleted. In general, output the directories that are being removed so
> users can know what was removed.
Hmph. It would be safer to show not just the directories but which
excluded files are about to be lost, wouldn't it, especially when
the user is trying to play safe and see what potential damage they
are looking at?
Also even though ignored files are "ignored and expendable", nobody
marks their temporary file as "ignored but precious" (yet), so "it
is listed in .gitignore so we can safely remove it" may not be a
safe assumption for us to be making (yet). Shouldn't we at least be
listing these ignored files in --dry-run output, next to those files
that the user may have forgotten to add?
> Note that untracked directories remain. Further, directories that
> contain staged changes are not deleted. This is a detail that is partly
> hidden by the implementation which relies on collapsing the index to a
> sparse index in-memory and only deleting directories that are listed as
> sparse in the index. If a staged change exists, then that entry is not
> stored as a sparse tree entry and thus remains on-disk until committed
> or reset.
Removing untracked directories is a job for "clean -d", so it makes
sense for this new command not to touch them. Not losing changes
that have already been added is just a bad as losing new files that
the user forgot to add, so it does make sense not to remove them.
I wonder if we need "-x" and/or "-X" options "clean" has (and
perhaps "-d" that is a no-op, as the whole point of this subcommand
is about removing directories from the working tree) to control its
operation a bit finer-grained way.
> + for (size_t i = 0; i < repo->index->cache_nr; i++) {
> + DIR* dir;
The asterisk sticks to the variable, not the type, i.e.
DIR *dir;
Thanks.
@@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files | |||
SYNOPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> From: Derrick Stolee <[email protected]>
>
> When users change their sparse-checkout definitions to add new
> directories and remove old ones, there may be a few reasons why
> directories no longer in scope remain (ignored or excluded files still
> exist, Windows handles are still open, etc.).
Good background; I am still particularly interested in the "etc." part...
> When these files still
> exist, the sparse index feature notices that a tracked, but sparse,
> directory still exists on disk and thus the index expands. This causes a
> performance hit _and_ the advice printed isn't very helpful. Using 'git
> clean' isn't enough (generally '-dfx' may be needed) but also this may
> not be sufficient.
Very well motivated.
> Add a new subcommand to 'git sparse-checkout' that removes these
> tracked-but-sparse directories, including any excluded or ignored files
> underneath.
"including"?
> This is the most extreme method for doing this, but it works
> when the sparse-checkout is in cone mode and is expected to rescope
> based on directories, not files.
So is this also meant for cone mode without sparse index turned on?
What about non-cone mode?
> Be sure to add a --dry-run option so users can predict what will be
> deleted. In general, output the directories that are being removed so
> users can know what was removed.
Is greater fidelity of interest when there are multiple different
types of files contained? For example, "git status" lists individual
files within a directory, unless it find an ignored directory and then
it simply lists the directory. That means we get more fidelity when
it's warranted, and less when it's not. I'm not sure if that's a
perfect analogy, though; it may well be that we don't need the same
kind of fidelity that `git status` provides. (And I'm kind of
guessing it isn't needed, except in error cases, but I'm just asking.)
> Note that untracked directories remain.
What does this mean? If the sparse directory had an untracked
directory within it then it'll be left on disk, you will only clean up
untracked files at a depth of 1 within the sparse directory?
Or that untracked directories not contained within a sparse directory
will be left alone?
> Further, directories that
> contain staged changes are not deleted.
Shouldn't those be safe to delete? When a sparse directory has files
underneath it with staged changes, those roll-up into a staged
sparse-directory tree value, and so we should be able to delete the
file.
In contrast, the files under the sparse directory with unstaged
changes would be problematic to simply remove.
> This is a detail that is partly
> hidden by the implementation which relies on collapsing the index to a
> sparse index in-memory and only deleting directories that are listed as
> sparse in the index. If a staged change exists, then that entry is not
> stored as a sparse tree entry and thus remains on-disk until committed
> or reset.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> Documentation/git-sparse-checkout.adoc | 13 ++++-
> builtin/sparse-checkout.c | 73 +++++++++++++++++++++++++-
> t/t1091-sparse-checkout-builtin.sh | 48 +++++++++++++++++
> 3 files changed, 132 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-sparse-checkout.adoc b/Documentation/git-sparse-checkout.adoc
> index 529a8edd9c1e..21ba6f759905 100644
> --- a/Documentation/git-sparse-checkout.adoc
> +++ b/Documentation/git-sparse-checkout.adoc
> @@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files
> SYNOPSIS
> --------
> [verse]
> -'git sparse-checkout' (init | list | set | add | reapply | disable | check-rules) [<options>]
> +'git sparse-checkout' (init | list | set | add | reapply | disable | check-rules | clean) [<options>]
>
>
> DESCRIPTION
> @@ -111,6 +111,17 @@ flags, with the same meaning as the flags from the `set` command, in order
> to change which sparsity mode you are using without needing to also respecify
> all sparsity paths.
>
> +'clean'::
> + Remove all files in tracked directories that are outside of the
> + sparse-checkout definition.
If literal, this sounds unsafe, particularly if run while resolving
merge or rebase conflicts (since those conflicts may occur in paths
outside the sparse checkout definition).
> + This subcommand requires cone-mode
> + sparse-checkout to be sure that we know which directories are
> + both tracked and all contained paths are not in the sparse-checkout.
> + This command can be used to be sure the sparse index works
> + efficiently.
So...what does it do when in cone mode and the sparse index is not enabled?
> ++
> +The `clean` command can also take the `--dry-run` (`-n`) option to list
> +the directories it would remove without performing any filesystem changes.
> +
> 'disable'::
> Disable the `core.sparseCheckout` config setting, and restore the
> working directory to include all files.
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 8b70d0c6a441..6d2843827367 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -23,7 +23,7 @@
> static const char *empty_base = "";
>
> static char const * const builtin_sparse_checkout_usage[] = {
> - N_("git sparse-checkout (init | list | set | add | reapply | disable | check-rules) [<options>]"),
> + N_("git sparse-checkout (init | list | set | add | reapply | disable | check-rules | clean) [<options>]"),
> NULL
> };
>
> @@ -924,6 +924,76 @@ static int sparse_checkout_reapply(int argc, const char **argv,
> return update_working_directory(repo, NULL);
> }
>
> +static char const * const builtin_sparse_checkout_clean_usage[] = {
> + "git sparse-checkout clean [-n|--dry-run]",
> + NULL
> +};
> +
> +static struct sparse_checkout_clean_opts {
> + int dry_run;
> +} clean_opts;
> +
> +static int sparse_checkout_clean(int argc, const char **argv,
> + const char *prefix,
> + struct repository *repo)
> +{
> + struct strbuf full_path = STRBUF_INIT;
> + size_t worktree_len;
> + static struct option builtin_sparse_checkout_clean_options[] = {
> + OPT_BOOL('n', "dry-run", &clean_opts.dry_run,
> + N_("list the directories that would be removed without making filesystem changes")),
> + OPT_END(),
> + };
> +
> + setup_work_tree();
> + if (!core_apply_sparse_checkout)
> + die(_("must be in a sparse-checkout to clean directories"));
> + if (!core_sparse_checkout_cone)
> + die(_("must be in a cone-mode sparse-checkout to clean directories"));
> +
> + argc = parse_options(argc, argv, prefix,
> + builtin_sparse_checkout_clean_options,
> + builtin_sparse_checkout_clean_usage, 0);
> +
> + if (repo_read_index(repo) < 0)
> + die(_("failed to read index"));
> +
> + if (convert_to_sparse(repo->index, SPARSE_INDEX_MEMORY_ONLY))
> + die(_("failed to convert index to a sparse index"));
So, you make the in-memory index sparse; I don't remember the details
on this function so it might invalidate some things I say below...but
after this point you then...
> +
> + strbuf_addstr(&full_path, repo->worktree);
> + strbuf_addch(&full_path, '/');
> + worktree_len = full_path.len;
> +
> + for (size_t i = 0; i < repo->index->cache_nr; i++) {
> + DIR* dir;
> + struct cache_entry *ce = repo->index->cache[i];
> + if (!S_ISSPARSEDIR(ce->ce_mode))
> + continue;
...skip the entries that aren't sparse directories.
> + strbuf_setlen(&full_path, worktree_len);
> + strbuf_add(&full_path, ce->name, ce->ce_namelen);
> +
> + dir = opendir(full_path.buf);
> + if (!dir)
> + continue;
...skip the sparse directories that, as expected, don't exist on disk.
> + else if (ENOENT != errno) {
> + warning_errno(_("failed to check for existence of '%s'"), ce->name);
> + continue;
> + }
> +
> + closedir(dir);
> +
> + printf("%s\n", ce->name);
> + if (!clean_opts.dry_run) {
> + if (remove_dir_recursively(&full_path, 0))
> + warning_errno(_("failed to remove '%s'"), ce->name);
> + }
...and then unconditionally remove the directory, as you stated in the
documentation for this clean option.
I'm worried whether this is safe; if someone does a merge or rebase,
there could be tracked-and-modified/conflicted files outside the
sparse specification in the working tree.
Even after resolving such a merge and committing, the paths may remain
around until the user does a 'git sparse-checkout reapply' (I don't
remember details here, but our documentation for reapply certainly
says so), and since the file might stick around, the user may make
further modifications to such a file.
...or will the convert_to_sparse() call above fail in all these cases?
If it does, should it give a better and more useful error message
than "failed to convert index to a sparse index" and rather e.g. "path
%s has modifications; please stage or revert first"?
> + }
> +
> + strbuf_release(&full_path);
> + return 0;
> +}
> +
> static char const * const builtin_sparse_checkout_disable_usage[] = {
> "git sparse-checkout disable",
> NULL
> @@ -1080,6 +1150,7 @@ int cmd_sparse_checkout(int argc,
> OPT_SUBCOMMAND("set", &fn, sparse_checkout_set),
> OPT_SUBCOMMAND("add", &fn, sparse_checkout_add),
> OPT_SUBCOMMAND("reapply", &fn, sparse_checkout_reapply),
> + OPT_SUBCOMMAND("clean", &fn, sparse_checkout_clean),
> OPT_SUBCOMMAND("disable", &fn, sparse_checkout_disable),
> OPT_SUBCOMMAND("check-rules", &fn, sparse_checkout_check_rules),
> OPT_END(),
> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index ab3a105ffff2..7f8a444541f7 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -1050,5 +1050,53 @@ test_expect_success 'check-rules null termination' '
> test_cmp expect actual
> '
>
> +test_expect_success 'clean' '
> + git -C repo sparse-checkout set --cone deep/deeper1 &&
> + mkdir repo/deep/deeper2 repo/folder1 &&
> + touch repo/deep/deeper2/file &&
> + touch repo/folder1/file &&
> +
> + cat >expect <<-\EOF &&
> + deep/deeper2/
> + folder1/
> + EOF
> +
> + git -C repo sparse-checkout clean --dry-run >out &&
> + test_cmp expect out &&
> +
> + test_path_exists repo/deep/deeper2 &&
> + test_path_exists repo/folder1 &&
> +
> + git -C repo sparse-checkout clean >out &&
> + test_cmp expect out &&
> +
> + ! test_path_exists repo/deep/deeper2 &&
> + ! test_path_exists repo/folder1
> +'
> +
> +test_expect_success 'clean with staged sparse change' '
> + git -C repo sparse-checkout set --cone deep/deeper1 &&
> + mkdir repo/deep/deeper2 repo/folder1 &&
> + touch repo/deep/deeper2/file &&
> + touch repo/folder1/file &&
> +
> + git -C repo add --sparse folder1/file &&
> +
> + cat >expect <<-\EOF &&
> + deep/deeper2/
> + EOF
> +
> + git -C repo sparse-checkout clean --dry-run >out &&
> + test_cmp expect out &&
> +
> + test_path_exists repo/deep/deeper2 &&
> + test_path_exists repo/folder1 &&
> +
> + git -C repo sparse-checkout clean >out &&
> + test_cmp expect out &&
> +
> + ! test_path_exists repo/deep/deeper2 &&
> + test_path_exists repo/folder1
> +'
>
> test_done
> --
> gitgitgadget
@@ -32,7 +32,8 @@ int give_advice_on_expansion = 1; | |||
"Your working directory likely has contents that are outside of\n" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> From: Derrick Stolee <[email protected]>
>
> In my experience, the most-common reason that the sparse index must
> expand to a full one is because there is some leftover file in a tracked
> directory that is now outside of the sparse-checkout. The new 'git
> sparse-checkout clean' command will find and delete these directories,
> so point users to it when they hit the sparse index expansion advice.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> sparse-index.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index 5634abafaa07..5d14795063b5 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -32,7 +32,8 @@ int give_advice_on_expansion = 1;
> "Your working directory likely has contents that are outside of\n" \
> "your sparse-checkout patterns. Use 'git sparse-checkout list' to\n" \
> "see your sparse-checkout definition and compare it to your working\n" \
> - "directory contents. Running 'git clean' may assist in this cleanup."
> + "directory contents. Running 'git sparse-checkout clean' may assist\n" \
> + "in this cleanup."
>
> struct modify_index_context {
> struct index_state *write;
> --
> gitgitgadget
Makes sense, once we work out any wrinkles with `git sparse-checkout clean`.
On the Git mailing list, Elijah Newren wrote (reply to this): On Tue, Jul 8, 2025 at 1:36 PM Elijah Newren <[email protected]> wrote:
>
> On Tue, Jul 8, 2025 at 4:19 AM Derrick Stolee via GitGitGadget
> <[email protected]> wrote:
> >
[...]
> I'm also curious what happens when (1) you are in cone mode and there
> is no sparse index, or (2) when you are not in cone mode. I suspect
> those and the questions above will be answered as I read the
> individual patches, so I'll keep going...
After reading the series, I know the answer to (2). I think the
answer to (1) is that it effectively turns into a silent (but not
instantaneous) no-op, which may be confusing for users. We might want
to provide them with an alternative implementation, or at least a
warning or error that the mode doesn't (currently?) do anything when
sparse index isn't in use.
Anyway, I think the series is a good direction and you've explained
the motivation very well, but I'm a bit worried the current
implementation might be using too coarse of a hammer. |
On the Git mailing list, Junio C Hamano wrote (reply to this): "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.
I liked the overall idea but this has some interactions with a topic
in flight. 2c5b5565 (environment: remove the global variable
'sparse_expect_files_outside_of_patterns', 2025-07-01). I may have
botched (semantic) conflict resolution but with both merged to
'seen', a few steps in the sparse test seem to fail.
For tonight's integration, I'll leave the topic out of 'seen' so
that we can pass other new topics that we acquired through the CI.
I may re-attempt merging this topic later, or I may eject the other
topic from 'seen' and queue this one first, asking the other topic
to be redone on top. We'll see.
Thanks. |
When using cone-mode sparse-checkout, users specify which tracked directories they want (recursively) and any directory not part of the parent paths for those directories are considered "out of scope". When changing sparse-checkouts, there are a variety of reasons why these "out of scope" directories could remain, including:
.gitignore
or.git/info/exclude
files that tell Git to not remove files of a certain type.Typically, this would not mean too much for the user experience. A few extra filesystem checks might be required to satisfy
git status
commands, but the scope of the performance hit is relative to how many cruft files are left over in this situation.However, when using the sparse index, these tracked sparse directories cause significant performance issues. When noticing that the index contains a sparse directory but that directory exists on disk, Git needs to expand that sparse directory to determine which files are tracked or untracked. The current mechanism expands the entire index to a full one, an expensive operation that scales with the total number of paths at HEAD and not just the number of cruft files left over.
Advice was added in 9479a31 (advice: warn when sparse index expands, 2024-07-08) to help users determine that they were in this state. However, the advice doesn't actually recommend helpful ways to get out of this state. Recommending "git clean" on its own is incomplete, as typically users actually need 'git clean -dfx' to clear out the ignored or excluded files. Even then, they may need 'git sparse-checkout reapply' afterwards to clear the sparse directories.
The advice was successful in helping to alert users to the problem, which is how I got wind of many of these cases for how users get into this state. It's now time to give them a tool that helps them out of this state.
This series adds a new 'git sparse-checkout clean' command that currently only works for cone-mode sparse-checkouts. The only thing it does is collapse the index to a sparse index (as much as possible) and make sure that any sparse directories are removed. These directories are listed to stdout.
A --dry-run option is available to list the directories that would be removed without actually deleting the directories.
This option would be preferred to something like 'git clean -dfx' since it does not clear the excluded files that are still within the sparse-checkout. Instead, it performs the exact filesystem operations required to refresh the sparse index performance back to what is expected.
I spent a few weeks debating with myself about whether or not this was the right interface, so please suggest alternatives if you have better ideas. Among my rejected ideas include:
The implementation is rather simple with the current CLI. Future augmentations could include a
--quiet
option to silence the output and a--verbose
option to list the files that exist within each directory and would/will be removed.Thanks,
-Stolee
cc: [email protected]
cc: [email protected]
cc: Patrick Steinhardt [email protected]