fast forwarding #1

hdu-hh · 2019-11-20T21:04:28Z

No description provided.

Add a docstring for each function that manipulates json_writers. Helped-by: Junio C Hamano <[email protected]> Helped-by: Patrick Steinhardt <[email protected]> Helped-by: Karthik Nayak <[email protected]> Signed-off-by: Lucas Seiki Oshiro <[email protected]> Acked-by: Karthik Nayak <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Provide an overview of the set of functions used for manipulating `json_writer`s, by describing what functions should be used for each JSON-related task. Helped-by: Junio C Hamano <[email protected]> Helped-by: Patrick Steinhardt <[email protected]> Helped-by: Karthik Nayak <[email protected]> Signed-off-by: Lucas Seiki Oshiro <[email protected]> Acked-by: Karthik Nayak <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

This should be "compat", not "comapt". Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The cat-file command has some minor support for handling objects with "unknown" types, i.e., strings that are not "blob", "commit", "tree", or "tag". In theory this could be used for debugging or experimenting with extensions to Git. But in practice this support is not very useful:
  1. You can get the type and size of such objects, but nothing else. Not even the contents!
  2. Only loose objects are supported, since packfiles use numeric ids for the types, rather than strings.
  3. Likewise you cannot ever transfer objects between repositories, because they cannot be represented in the packfiles used for the on-the-wire protocol.
The support for these unknown types complicates the object-parsing code, and has led to bugs such as b748ddb (unpack_loose_header(): fix infinite loop on broken zlib input, 2025-02-25). So let's drop it. The first step is to remove the user-facing parts, which are accessible only via cat-file. This is technically backwards-incompatible, but given the limitations listed above, these objects couldn't possibly be useful in any workflow. However, we can't just rip out the option entirely. That would hurt a caller who ran:
    git cat-file -t --allow-unknown-type <oid>
and fed it normal, well-formed objects. There --allow-unknown-type was doing nothing, but we wouldn't want to start bailing with an error. So to protect any such callers, we'll retain --allow-unknown-type as a noop. The code change is fairly small (but we'll be able to clean up more code in follow-on patches). The test updates drop any use of the option. We still retain tests that feed the broken objects to cat-file without --allow-unknown-type, as we should continue to confirm that those objects are rejected. Note that in one spot we can drop a layer of loop, re-indenting the body; viewing the diff with "-w" helps there.
Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Since cat-file dropped its "--allow-unknown-type" option in the previous commit, there are no more uses of the internal flag that implemented it. Let's drop it. That in turn lets us drop the strbuf parameter of unpack_loose_header(), which now is always NULL. And without that, we can drop all of the additional code to inflate larger headers into the strbuf. Arguably we could drop ULHR_TOO_LONG, as no callers really care about the distinction from ULHR_BAD. But it's easy enough to retain, and it does let us produce a slightly more specific message in one instance. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Now that we no longer support OBJECT_INFO_ALLOW_UNKNOWN_TYPE, there is no need to pass a strbuf into oid_object_info_extended() to record the type. The regular object_type enum is sufficient to capture all of the types we will allow. This simplifies the code a bit, and will eventually let us drop object_info's type_name strbuf support. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

In oid_object_info_convert(), we convert objects between their sha1 and sha256 variants. To do this, we naturally need to know the type, which we get from oid_object_info_extended() using its type_name strbuf option. But getting the value as a string (versus an object_type enum) is not helpful. Since we do not allow unknown types, the regular enum is sufficient. And the resulting code is a bit simpler, as we no longer have to manage the extra allocation nor convert the string to an enum ourselves. Note that at first glance, it might seem like we should retain the error check for "type == -1" to catch bogus types found by the underlying parser. But we don't need it, as an unknown type would have yielded an error from the call to oid_object_info_extended(), which would already have caused us to return an error. In fact, I suspect this was always impossible to trigger. Even when we were converting the string to a type enum ourselves, an invalid type would never have escaped oid_object_info_extended(), since we never passed the (now removed) OBJECT_INFO_ALLOW_UNKNOWN_TYPE option. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

When fsck-ing a loose object, we use object_info's type_name strbuf to record the parsed object type as a string. For most objects this is redundant with the object_type enum, but it does let us report the string when we encounter an object with an unknown type (for which there is no matching enum value). There are a few downsides, though:
  1. The code to report these cases is not actually robust. Since we did not pass a strbuf to unpack_loose_header(), we only retrieved types from headers up to 32 bytes. In longer cases, we'd simply say "object corrupt or missing".
  2. This is the last caller that uses object_info's type_name strbuf support. It would be nice to refactor it so that we can simplify that code.
  3. Likewise, we'll check the hash of the object using its unknown type (again, as long as that type is short enough). That depends on the hash_object_file_literally() code, which we'd eventually like to get rid of.
So we can simplify things by bailing immediately in read_loose_object() when we encounter an unknown type. This has a few user-visible effects:
  a. Instead of producing a single line of error output like this:
       error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object is of unknown type 'bogus': .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
     we'll now issue two lines (the first from read_loose_object() when we see the unparsable header, and the second from the fsck code, since we couldn't read the object):
       error: unable to parse type from header 'bogus 4' of .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
       error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object corrupt or missing: .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
     This is a little more verbose, but this sort of error should be rare (such objects are almost impossible to work with, and cannot be transferred between repositories as they are not representable in packfiles). And as a bonus, reporting the broken header in full could help with debugging other cases (e.g., a header like "blob xyzzy\0" would fail in parsing the size, but previously we'd not have shown the offending bytes).
  b. An object with an unknown type will be reported as corrupt, without actually doing a hash check. Again, I think this is unlikely to matter in practice since such objects are totally unusable.
We'll update one fsck test to match the new error strings. And we can remove another test that covered the case of an object with an unknown type _and_ a hash corruption. Since we'll skip the hash check now in this case, the test is no longer interesting.
Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
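The header handling described in this commit can be illustrated with a minimal Python sketch. This is not Git's C code: `parse_loose_header`, `HeaderStatus`, and `MAX_HEADER_LEN` are hypothetical names, and the status values are only loosely modeled on the ULHR_BAD and ULHR_TOO_LONG results mentioned in this series. It assumes the already-inflated bytes of a loose object begin with `<type> <size>\0`.

```python
from enum import Enum

KNOWN_TYPES = ("blob", "commit", "tree", "tag")
MAX_HEADER_LEN = 32  # fixed header window, per the 32-byte limit noted above


class HeaderStatus(Enum):
    OK = "ok"
    BAD = "bad"            # loosely modeled on ULHR_BAD
    TOO_LONG = "too long"  # loosely modeled on ULHR_TOO_LONG


def parse_loose_header(inflated: bytes):
    """Return (status, type_name, size) for data starting with b"<type> <size>\\0"."""
    window = inflated[:MAX_HEADER_LEN]
    nul = window.find(b"\0")
    if nul < 0:
        # No NUL inside the window: a full window means the header is too
        # long, a short one means the input was truncated.
        full = len(window) == MAX_HEADER_LEN
        return (HeaderStatus.TOO_LONG if full else HeaderStatus.BAD), None, None
    type_name, sep, size_text = window[:nul].partition(b" ")
    if not sep or type_name.decode("ascii", "replace") not in KNOWN_TYPES:
        return HeaderStatus.BAD, None, None  # unknown type: bail immediately
    if not size_text.isdigit():
        return HeaderStatus.BAD, None, None  # e.g. b"blob xyzzy\0"
    return HeaderStatus.OK, type_name.decode("ascii"), int(size_text)
```

With only the four known types accepted, the enum-like result is enough and no separate string buffer for the type name is needed.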

We provide a mechanism for callers to get the object type as a raw string, rather than an object_type enum. This was in theory useful for returning types that are not representable in the enum, but we consider any such type to be an error, and there are no callers that use the strbuf anymore. Let's drop support to simplify the code a bit. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

It's occasionally useful when testing or debugging to be able to do raw zlib inflate/deflate operations (e.g., to check the bytes of a specific loose or packed object). Even though zlib's deflate algorithm is used by many other programs, this is surprisingly hard to do in a portable way. E.g., gzip can do this if you manually munge some header bytes. But the result is somewhat arcane, and we don't assume gzip is available anyway. Likewise, pigz will handle raw zlib, but we can't assume it is available. So let's introduce a short test helper for just doing zlib operations. We'll use it in subsequent patches to add some new tests, but it would also have come in handy a few times in the past: - The hard-coded pack data from 3b910d0 (add tests for indexing packs with delta cycles, 2013-08-23) could probably be generated on the fly. - Likewise we could avoid the hard-coded data from 0b1493c (git_inflate(): skip zlib_post_call() sanity check on Z_NEED_DICT, 2025-02-25). Though note this would require support for more zlib options. - It would have helped with the debugging documented in 41dfbb2 (howto: add article on recovering a corrupted object, 2013-10-25). I'll leave refactoring existing tests for another day, but I hope the examples above show the general utility. I aimed for simplicity in the code. In particular, it will read all input into a memory buffer, rather than streaming. That makes the zlib loops harder to get wrong (which has been a source of subtle bugs in the past). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
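For readers who want to experiment outside the test suite, the same idea can be sketched in a few lines of Python using the standard `zlib` module. This is an illustration only, not the test helper added by this commit; the function names are made up here. Like the helper described above, it works on whole in-memory buffers rather than streaming.

```python
import zlib


def deflate(data: bytes) -> bytes:
    """Compress a whole buffer into a zlib stream (header and checksum included)."""
    return zlib.compress(data)


def inflate(data: bytes) -> bytes:
    """Decompress a whole zlib stream held in memory."""
    return zlib.decompress(data)


# A loose object is the zlib-compressed form of "<type> <size>\0<content>".
raw = b"blob 6\0hello\n"
packed = deflate(raw)
```

Inflating the bytes of a file under `.git/objects/` the same way shows the header and content of that loose object.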

This commit adds a shell library for writing raw loose objects into the object database. Normally this is done with hash-object, but the specific intent here is to allow broken objects that hash-object may not support. We'll convert several cases that use "hash-object --literally" to write objects with invalid types. That works currently, but dropping this dependency will allow us to remove that feature and simplify the object-writing code. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
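What such a library has to do can be sketched in Python. This is not the shell library from this commit: `write_raw_loose_object` is a hypothetical helper, and it assumes a SHA-1 repository with the usual `objects/<2 hex>/<38 hex>` layout. The point is that nothing validates the type string, so a deliberately broken object can be written.

```python
import hashlib
import os
import tempfile
import zlib


def write_raw_loose_object(objects_dir: str, type_name: str, content: bytes) -> str:
    """Write "<type> <size>\\0<content>" as a loose object and return its hex id."""
    store = f"{type_name} {len(content)}".encode() + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()  # assumption: SHA-1 object format
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return oid


# Write an object with an invalid type into a scratch directory.
demo_dir = tempfile.mkdtemp()
oid = write_raw_loose_object(demo_dir, "bogus", b"1234")
```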

When passed the "--literally" option, hash-object will allow any arbitrary string for its "-t" type option. Such objects are only useful for testing or debugging, as they cannot be used in the normal way (e.g., you cannot fetch their contents!). Let's drop this feature, which will eventually let us simplify the object-writing code. This is technically backwards incompatible, but since such objects were never really functional, it seems unlikely that anybody will notice. We will retain the --literally flag, as it also instructs hash-object not to worry about other format issues (e.g., type-specific things that fsck would complain about). The documentation does not need to be updated, as it was always vague about which checks we're loosening (it uses only the phrase "any garbage"). The code change is a bit hard to verify from just the patch text. We can drop our local hash_literally() helper, but it was really just wrapping write_object_file_literally(). We now replace that with calling index_fd(), as we do for the non-literal code path, but dropping the INDEX_FORMAT_CHECK flag. This ends up being the same semantically as what the _literally() code path was doing (modulo handling unknown types, which is our goal). We'll be able to clean up these code paths a bit more in subsequent patches. The existing test is flipped to show that we now reject the unknown type. The additional "extra-long type" test is now redundant, as we bail early upon seeing a bogus type. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The hash-object command has its own custom flag bits that it sets based on command-line options. But since we dropped hash_literally() in the previous commit, the only thing we do with those flag bits is convert them directly into "index_flags" to pass to index_fd(). This extra layer of indirection makes the code harder to read and reason about. Let's just use the INDEX_* flags directly. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Since we recently removed the hash_literally() function, the hash-object --literally option has been simplified to just removing the INDEX_FORMAT_CHECK flag. Rather than pass it around as a separate bool, we can just have the option parser remove the bit from the set of flags directly. This simplifies the helper functions. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
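The shape of this simplification can be shown with a small Python sketch. It is not the hash-object code: the bit values and `parse_flags` are illustrative assumptions; only the idea that `--literally` clears the INDEX_FORMAT_CHECK bit comes from the commit message.

```python
# Illustrative bit values; the real INDEX_* constants live in Git's C headers.
INDEX_WRITE_OBJECT = 1 << 0
INDEX_FORMAT_CHECK = 1 << 1


def parse_flags(argv):
    """Build the flag set directly while parsing options, with no separate bool."""
    flags = INDEX_FORMAT_CHECK  # format checks are on by default
    for arg in argv:
        if arg == "-w":
            flags |= INDEX_WRITE_OBJECT
        elif arg == "--literally":
            flags &= ~INDEX_FORMAT_CHECK  # the option just clears one bit
    return flags
```

The helpers downstream then take a single flags value instead of flags plus a "literally" boolean.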

Since "hash-object --literally" no longer supports objects with unknown types, there are now no callers of write_object_file_literally() and its helpers. Let's drop them to simplify the code. In particular, this gets rid of some ugly copy-and-paste code from write_object_file_literally(), which is a parallel implementation of write_object_file(). When the split was originally made, the two weren't that long, but commits like 63a6745 (object-file: update the loose object map when writing loose objects, 2023-10-01) ended up having to duplicate some tricky code. This patch drops all of that duplication and should make things less error-prone going forward. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The previous function regex required explicit matching of function bodies using `{`, `(`, `((`, or `[[`, which caused several issues:
  - It failed to capture valid functions where `{` was on the next line due to line continuation (`\`).
  - It did not recognize functions with a single-command body, such as `x () echo hello`.
Replacing the function body matching logic with `.*$` ensures that everything on the function definition line is captured. Additionally, the word regex is refined to better recognize shell syntax, including additional parameter expansion operators and command-line options.
Signed-off-by: Moumita Dhar <[email protected]> Acked-by: Johannes Sixt <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
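The difference between the two approaches can be demonstrated with a pair of simplified patterns in Python. These are hypothetical stand-ins written for this illustration, not the actual userdiff patterns: `OLD_STYLE` insists on a body opener after the name, while `NEW_STYLE` accepts anything up to the end of the line.

```python
import re

# Hypothetical "old style": the name must be followed by {, (, ((, or [[.
OLD_STYLE = re.compile(r"^\s*(?:function\s+)?\w+\s*\(\)\s*(?:\{|\(\(?|\[\[)")
# Hypothetical "new style": capture the rest of the definition line.
NEW_STYLE = re.compile(r"^\s*(?:function\s+)?\w+\s*\(\).*$")

samples = [
    "foo() {",           # brace on the same line
    "foo() \\",          # body continues on the next line
    "x () echo hello",   # single-command body
]

results = {
    line: (bool(OLD_STYLE.match(line)), bool(NEW_STYLE.match(line)))
    for line in samples
}
```

In this sketch only the first sample matches the stricter pattern, while all three match the relaxed one.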

The sparse index allows storing directory entries in the index, marked with the skip-worktree bit and pointing to a tree object. This may be an unexpected data shape for some implementation areas, so we are rolling it out incrementally on a builtin-per-builtin basis. This change enables the sparse index for 'git apply'. The main motivation for this change is that 'git apply' is used as a child process of 'git add -p' and expanding the sparse index for each of those child processes can lead to significant performance issues. The good news is that the actual index manipulation code used by 'git apply' is already integrated with the sparse index, so the only product change is to mark the builtin as allowing the sparse index so it isn't inflated on read. The more involved part of this change is around adding tests that verify how 'git apply' behaves in a sparse-checkout environment and whether or not the index expands in certain operations. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

It is slow to expand a sparse index in-memory due to parsing of trees. We aim to minimize that performance cost when possible. 'git add -p' uses 'git apply' child processes to modify the index, but still there are some expansions that occur. It turns out that control flows out of cmd_add() in the interactive cases before the lines that confirm that the builtin is integrated with the sparse index. Moving that integration point earlier in cmd_add() allows 'git add -i' and 'git add -p' to operate without expanding a sparse index to a full one. Add test cases that confirm that these interactive add options work with the sparse index. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Similar to the previous change for 'git add -p', the reset builtin checked for integration with the sparse index after possibly redirecting its logic toward the interactive logic. This means that the builtin would expand the sparse index to a full one upon read. Move this check earlier within cmd_reset() to improve performance here. Add tests to guarantee that we are not universally expanding the index. Add behavior tests to check that we are doing the same operations as a full index. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The previous three changes contributed performance improvements to 'git apply', 'git add -p', and 'git reset -p' when using a sparse index. The improvement to 'git apply' also improved 'git checkout -p'. Add performance tests to demonstrate this (and to help validate that performance remains good in the future). In the truncated test output below, we see that the full checkout performance changes within noise expectations, but the sparse index cases improve 33% and then 96% for 'git add -p' and 41% and then 95% for 'git reset -p'. 'git checkout -p' improves immediately by 91% because it does not need any change to its builtin.
    Test                                       HEAD~4  HEAD~3       HEAD~2       HEAD~1
    ----------------------------------------------------------------------------------------
    2000.118: ... git add -p (full-v3)         0.79    0.79  +0.0%  0.82  +3.8%  0.82  +3.8%
    2000.119: ... git add -p (full-v4)         0.74    0.76  +2.7%  0.74  +0.0%  0.76  +2.7%
    2000.120: ... git add -p (sparse-v3)       1.94    1.28 -34.0%  0.07 -96.4%  0.07 -96.4%
    2000.121: ... git add -p (sparse-v4)       1.93    1.28 -33.7%  0.06 -96.9%  0.06 -96.9%
    2000.122: ... git checkout -p (full-v3)    1.18    1.18  +0.0%  1.18  +0.0%  1.19  +0.8%
    2000.123: ... git checkout -p (full-v4)    1.10    1.12  +1.8%  1.11  +0.9%  1.11  +0.9%
    2000.124: ... git checkout -p (sparse-v3)  1.31    0.11 -91.6%  0.11 -91.6%  0.11 -91.6%
    2000.125: ... git checkout -p (sparse-v4)  1.29    0.11 -91.5%  0.11 -91.5%  0.11 -91.5%
    2000.126: ... git reset -p (full-v3)       0.81    0.80  -1.2%  0.83  +2.5%  0.83  +2.5%
    2000.127: ... git reset -p (full-v4)       0.78    0.77  -1.3%  0.77  -1.3%  0.78  +0.0%
    2000.128: ... git reset -p (sparse-v3)     1.58    0.92 -41.8%  0.91 -42.4%  0.07 -95.6%
    2000.129: ... git reset -p (sparse-v4)     1.58    0.92 -41.8%  0.92 -41.8%  0.07 -95.6%
It is worth noting that if our test was more involved and had multiple hunks to evaluate, then the time spent in 'git apply' would dominate due to multiple index loads and writes. As it stands, we need the sparse index improvement in 'git add -p' itself to confirm this performance improvement. Since the change for 'git add -i' is identical, we avoid a second test case for that similar operation.
Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
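The percentage columns in the table are relative changes against the HEAD~4 baseline. A short Python check (the helper name is made up for this illustration) reproduces a few of the sparse-index figures from the timings quoted above:

```python
def percent_change(baseline: float, measured: float) -> str:
    """Relative change versus the baseline, formatted like the perf output."""
    return f"{(measured - baseline) / baseline * 100.0:+.1f}%"


add_p_first = percent_change(1.94, 1.28)   # git add -p (sparse-v3), HEAD~3
add_p_final = percent_change(1.94, 0.07)   # git add -p (sparse-v3), HEAD~1
checkout_p = percent_change(1.31, 0.11)    # git checkout -p (sparse-v3)
reset_p = percent_change(1.58, 0.07)       # git reset -p (sparse-v3), HEAD~1
```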

This will be helpful in a future change, which will reuse this logic. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

In order to more easily compute delta bases among objects that appear at the exact same path, add a --path-walk option to 'git pack-objects'. This option will use the path-walk API instead of the object walk given by the revision machinery. Since objects will be provided in batches representing a common path, those objects can be tested for delta bases immediately instead of waiting for a sort of the full object list by name-hash. This has multiple benefits, including avoiding collisions by name-hash. The objects marked as UNINTERESTING are included in these batches, so we are guaranteeing some locality to find good delta bases. After the individual passes are done on a per-path basis, the default name-hash is used to find other opportunistic delta bases that did not match exactly by the full path name. The current implementation performs delta calculations while walking objects, which is not ideal for a few reasons. First, this will cause the "Enumerating objects" phase to be much longer than usual. Second, it does not take advantage of threading during the path-scoped delta calculations. Even with this lack of threading, the path-walk option is sometimes faster than the usual approach. Future changes will refactor this code to allow for threading, but that complexity is deferred until later to keep this patch as simple as possible. This new walk is incompatible with some features and is ignored by others:
  * Object filters are not currently integrated with the path-walk API, such as sparse-checkout or tree depth. A blobless packfile could be integrated easily, but that is deferred for later.
  * Server-focused features such as delta islands, shallow packs, and using a bitmap index are incompatible with the path-walk API.
  * The path walk API is only compatible with the --revs option, not taking object lists or pack lists over stdin. These alternative ways to specify the objects currently ignore the --path-walk option without even a warning.
Future changes will create performance tests that demonstrate the power of this approach.
Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
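The batching idea can be sketched in Python. This is a toy model under stated assumptions, not the path-walk API: `batches_by_path` is a hypothetical helper and objects are represented as plain `(path, oid)` pairs. It only shows why grouping by full path makes name-hash collisions irrelevant for the first pass: candidates for delta bases are limited to objects that appeared at the same path.

```python
from collections import defaultdict


def batches_by_path(objects):
    """Group (path, oid) pairs so each batch holds versions of one exact path."""
    groups = defaultdict(list)
    for path, oid in objects:
        groups[path].append(oid)
    return dict(groups)


# Hypothetical object ids for two versions of one file and one other file.
walked = [("src/a.c", "oid1"), ("docs/a.c", "oid2"), ("src/a.c", "oid3")]
batches = batches_by_path(walked)
```

Here `src/a.c` and `docs/a.c` land in separate batches even though they share a file name, which is the kind of pair a name-based hash could mix together.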

The t0450 test script verifies that builtin usage matches the synopsis in the documentation. Adjust the builtin to match and then remove 'git pack-objects' from the exception list. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The previous change added a --path-walk option to 'git pack-objects'. Create a performance test that demonstrates the time and space benefits of the feature. In order to get an appropriate comparison, we need to avoid reusing deltas and recompute them from scratch. Compare the creation of a thin pack representing a small push and the creation of a relatively large non-thin pack. Running on my copy of the Git repository results in this data (removing the repack tests for --name-hash-version):
    Test                                                    this tree
    ------------------------------------------------------------------------
    5313.2: thin pack with --name-hash-version=1            0.02(0.01+0.01)
    5313.3: thin pack size with --name-hash-version=1       1.6K
    5313.4: big pack with --name-hash-version=1             2.55(4.20+0.26)
    5313.5: big pack size with --name-hash-version=1        16.4M
    5313.6: shallow fetch pack with --name-hash-version=1   1.24(2.03+0.08)
    5313.7: shallow pack size with --name-hash-version=1    12.2M
    5313.10: thin pack with --name-hash-version=2           0.03(0.01+0.01)
    5313.11: thin pack size with --name-hash-version=2      1.6K
    5313.12: big pack with --name-hash-version=2            1.91(3.23+0.20)
    5313.13: big pack size with --name-hash-version=2       16.4M
    5313.14: shallow fetch pack with --name-hash-version=2  1.06(1.57+0.10)
    5313.15: shallow pack size with --name-hash-version=2   12.5M
    5313.18: thin pack with --path-walk                     0.03(0.01+0.01)
    5313.19: thin pack size with --path-walk                1.6K
    5313.20: big pack with --path-walk                      2.05(3.24+0.27)
    5313.21: big pack size with --path-walk                 16.3M
    5313.22: shallow fetch pack with --path-walk            1.08(1.66+0.07)
    5313.23: shallow pack size with --path-walk             12.4M
This can be reformatted as follows:
    Pack Type              Hash v1   Hash v2 Path Walk
    ---------------------------------------------------
    thin pack (time)         0.02s     0.03s     0.03s
              (size)          1.6K      1.6K      1.6K
    big pack (time)          2.55s     1.91s     2.05s
             (size)          16.4M     16.4M     16.3M
    shallow pack (time)      1.24s     1.06s     1.08s
                 (size)      12.2M     12.5M     12.4M
Note that the timing is slower because there is no threading in the --path-walk case (yet). Also, the shallow pack cases are really not using the --path-walk logic right now because it is disabled until some additions are made to the path walk API. The --path-walk option really shines when the default name-hash is overwhelmed with unhelpful collisions. An open source example can be found in the microsoft/fluentui repo [1] at a certain commit [2].
    [1] https://github.com/microsoft/fluentui
    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
Running the tests on this repo results in the following comparison table:
    Pack Type              Hash v1   Hash v2 Path Walk
    ---------------------------------------------------
    thin pack (time)         0.36s     0.12s     0.08s
              (size)          1.2M     22.0K     18.4K
    big pack (time)          2.00s     2.90s     2.21s
             (size)          20.4M     25.9M     19.5M
    shallow pack (time)      1.41s     1.80s     1.65s
                 (size)      34.4M     33.7M     33.6M
Notice in particular that in the small thin pack, the time performance has improved from 0.36s for --name-hash-version=1 to 0.08s and this is likely due to the improved size of the resulting pack: 18.4K instead of 1.2M. The relatively new --name-hash-version=2 is competitive with --path-walk (0.12s and 22.0K) but not quite as successful. Finally, running this on a copy of the Linux kernel repository results in these data points:
    Pack Type              Hash v1   Hash v2 Path Walk
    ---------------------------------------------------
    thin pack (time)         0.03s     0.13s     0.03s
              (size)          4.6K      4.6K      4.6K
    big pack (time)         15.29s    12.32s    13.92s
             (size)         201.1M    159.1M    158.5M
    shallow pack (time)     10.88s    22.93s    22.74s
                 (size)     269.2M    273.8M    267.7M
Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

There are many tests that validate whether 'git pack-objects' works as expected. Instead of duplicating these tests, add a new test environment variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default when specified. This was useful in testing the --path-walk implementation, helping to find tests that are overly specific to the default object walk. These include:
  - t0411-clone-from-partial.sh : One test fetches from a repo that does not have the boundary objects. This causes the path-based walk to fail. Disable the variable for this test.
  - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo without a boundary object.
  - t5310-pack-bitmaps.sh : One test compares the case when packing with bitmaps to the case when packing without them. Since we disable the test variable when writing bitmaps, this causes a difference in the object list (the --path-walk option adds an extra object). Specify --no-path-walk in both processes for the comparison. Another test checks for a specific delta base, but when computing dynamically without using bitmaps, the base object is too small to be considered in the delta calculations so no base is used.
  - t5316-pack-delta-depth.sh : This script cares about certain delta choices and their chain lengths. The --path-walk option changes how these chains are selected, and thus changes the results of this test.
  - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of the --sparse option and how it combines with --path-walk.
  - t5332-multi-pack-reuse.sh : This test verifies that the preferred pack is used for delta reuse when possible. The --path-walk option is not currently aware of the preferred pack at all, so finds a different delta base.
  - t7406-submodule-update.sh : When using the variable, the --depth option collides with the --path-walk feature, resulting in a warning message. Disable the variable so this warning does not appear.
I want to call out one specific test change that is only temporary:
  - t5530-upload-pack-error.sh : One test cares specifically about an "unable to read" error message. Since the current implementation performs delta calculations within the path-walk API callback, a different "unable to get size" error message appears. When this is changed in a future refactoring, this test change can be reverted.
Similar to GIT_TEST_NAME_HASH_VERSION, we do not add this option to the linux-TEST-vars CI build as that's already an overloaded build.
Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

It can be notoriously difficult to detect if delta bases are being computed properly during 'git push'. Construct an example where it will make a kilobyte worth of difference when a delta base is not found. We can then use the progress indicators to distinguish between bytes and KiB depending on whether the delta base is found and used. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Since 'git pack-objects' supports a --path-walk option, allow passing it through in 'git repack'. This presents interesting testing opportunities for comparing the different repacking strategies against each other. Add the --path-walk option to the performance tests in p5313. For the microsoft/fluentui repo [1] checked out at a specific commit [2], the --path-walk tests in p5313 look like this:
    Test                                                    this tree
    -------------------------------------------------------------------------
    5313.18: thin pack with --path-walk                     0.08(0.06+0.02)
    5313.19: thin pack size with --path-walk                18.4K
    5313.20: big pack with --path-walk                      2.10(7.80+0.26)
    5313.21: big pack size with --path-walk                 19.8M
    5313.22: shallow fetch pack with --path-walk            1.62(3.38+0.17)
    5313.23: shallow pack size with --path-walk             33.6M
    5313.24: repack with --path-walk                        81.29(96.08+0.71)
    5313.25: repack size with --path-walk                   142.5M
    [1] https://github.com/microsoft/fluentui
    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
Along with the earlier tests in p5313, I'll instead reformat the comparison as follows:
    Repack Method    Pack Size      Time
    ---------------------------------------
    Hash v1             439.4M    87.24s
    Hash v2             161.7M    21.51s
    Path Walk           142.5M    81.29s
There are a few things to notice here:
  1. The benefits of --name-hash-version=2 over --name-hash-version=1 are significant, but --path-walk still compresses better than that option.
  2. The --path-walk command is still using --name-hash-version=1 for the second pass of delta computation, using the increased name hash collisions as a potential method for opportunistic compression on top of the path-focused compression.
  3. The --path-walk algorithm is currently sequential and does not use multiple threads for delta compression. Threading will be implemented in a future change so the computation time will improve to better compete in this metric.
There are small benefits in size for my copy of the Git repository:
    Repack Method    Pack Size      Time
    ---------------------------------------
    Hash v1             248.8M    30.44s
    Hash v2             249.0M    30.15s
    Path Walk           213.2M   142.50s
As well as in the nodejs/node repository [3]:
    Repack Method    Pack Size      Time
    ---------------------------------------
    Hash v1             739.9M    71.18s
    Hash v2             764.6M    67.82s
    Path Walk           698.1M   208.10s
    [3] https://github.com/nodejs/node
This benefit also repeats in my copy of the Linux kernel repository:
    Repack Method    Pack Size      Time
    ---------------------------------------
    Hash v1               2.5G   554.41s
    Hash v2               2.5G   549.62s
    Path Walk             2.2G  1562.36s
It is important to see that even when the repository shape does not have many name-hash collisions, there is a slight space boost to be found using this method. As this repacking strategy was released in Git for Windows 2.47.0, some users have reported cases where the --path-walk compression is slightly worse than the --name-hash-version=2 option. In those cases, it may be beneficial to combine the two options. However, there has not been a released version of Git that has both options and I don't have access to these repos for testing.
Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Users may want to enable the --path-walk option for 'git pack-objects' by default, especially underneath commands like 'git push' or 'git repack'. This should be limited to client repositories, since the --path-walk option disables bitmap walks, so would be bad to include in Git servers when serving fetches and clones. There is potential that it may be helpful to consider when repacking the repository, to take advantage of improved deltas across historical versions of the same files. Much like how "pack.useSparse" was introduced and included in "feature.experimental" before being enabled by default, use the repository settings infrastructure to make the new "pack.usePathWalk" config enabled by "feature.experimental" and "feature.manyFiles". In order to test that this config works, add a new trace2 region around the path walk code that can be checked by a 'git push' command. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
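The way the default is derived can be sketched in Python. This is a simplified model, not the repository settings code: `use_path_walk` is a hypothetical function, the config is modeled as a plain dict of booleans, and the rule that an explicit "pack.usePathWalk" value takes precedence over the feature.* settings is an assumption of this sketch rather than something stated above.

```python
def use_path_walk(config: dict) -> bool:
    """Decide whether --path-walk is on by default for pack-objects."""
    explicit = config.get("pack.usePathWalk")
    if explicit is not None:
        return explicit  # assumption: an explicit setting wins
    # Otherwise either umbrella setting enables the feature.
    return bool(
        config.get("feature.experimental") or config.get("feature.manyFiles")
    )
```

For example, `use_path_walk({"feature.manyFiles": True})` evaluates to `True` in this model, while an empty config leaves the option off.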

Repositories registered with Scalar are expected to be client-only repositories that are rather large. This means that they are more likely to be good candidates for using the --path-walk option when running 'git pack-objects', especially under the hood of 'git push'. Enable this config in Scalar repositories. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Previously, the --path-walk option to 'git pack-objects' would compute deltas inline with the path-walk logic. This would make the progress indicator look like it was taking a long time to enumerate objects and then computing deltas very quickly. Instead of computing deltas on each region of objects organized by tree, store a list of regions corresponding to these groups. These can later be pulled from the list for delta compression before doing the "global" delta search. This presents a new progress indicator that can be used in tests to verify that this stage is happening. The current implementation is not integrated with threads, but we are setting it up to arrive in the next change. Since we do not attempt to sort objects by size until after exploring all trees, we can remove the previous change to t5530 due to a different error message appearing first. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
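The refactoring described above, recording regions during enumeration and compressing them in a later pass, can be sketched as follows. This is an illustrative model only: the `Region` type, the function names, and the callback-based delta search are hypothetical and do not mirror the C code in pack-objects.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int  # index of the first object of this group in the packing list
    count: int  # number of objects in the group


def enumerate_objects(path_batches):
    """Phase 1: append objects group by group and only record region bounds."""
    to_pack = []
    regions = []
    for _path, oids in path_batches:
        regions.append(Region(start=len(to_pack), count=len(oids)))
        to_pack.extend(oids)
    return to_pack, regions


def compress_regions(to_pack, regions, find_deltas, progress=None):
    """Phase 2: run the delta search inside each recorded region.

    A later "global" delta search over the whole list is not modeled here.
    """
    for done, region in enumerate(regions, start=1):
        find_deltas(to_pack[region.start:region.start + region.count])
        if progress is not None:
            progress(done, len(regions))  # separate progress for this stage


# Usage with made-up object names grouped by path.
batches = [
    ("Makefile", ["a1", "a2"]),
    ("README", ["b1"]),
    ("src/main.c", ["c1", "c2", "c3"]),
]
to_pack, regions = enumerate_objects(batches)
searched = []
compress_regions(to_pack, regions, searched.append)
```

Keeping the regions as plain (start, count) pairs is what makes the second phase easy to report on separately and, presumably, easy to hand out to worker threads in a follow-up change.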

Doc mark-up fix for a topic that has graduated to 'master'. * kh/maintenance-missing-tasks-docfix: doc: maintenance: fix linkgit syntax

l10n-2.50.0-v2 * tag 'l10n-2.50.0-v2' of https://github.com/git-l10n/git-po: l10n: zh_TW: update translation for Git 2.50

Build fix. * jc/sed-build-fixes: build: sed portability fixes

Revert a botched bswap.h change that broke ntohll() functions on big-endian systems with __builtin_bswap32/64(). * ss/revert-builtin-bswap-stuff: Revert "bswap.h: add support for built-in bswap functions"

Fixes for GitHub Actions Coverity job. * js/github-ci-win-coverity-fix: ci(coverity): output the build log upon error ci(coverity): fix building on Windows

Signed-off-by: Junio C Hamano <[email protected]>

Doc update to the more recent world order. * lo/my-first-ow-doc-update: MyFirstContribution: add walken.c to meson.build MyFirstContribution: use struct repository in examples

"git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API.

* ds/path-walk-2:
  pack-objects: allow --shallow and --path-walk
  path-walk: add new 'edge_aggressive' option
  pack-objects: thread the path-based compression
  pack-objects: refactor path-walk delta phase
  scalar: enable path-walk during push via config
  pack-objects: enable --path-walk via config
  repack: add --path-walk option
  t5538: add tests to confirm deltas in shallow pushes
  pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  p5313: add performance tests for --path-walk
  pack-objects: update usage to match docs
  pack-objects: add --path-walk option
  pack-objects: extract should_attempt_deltas()

Userdiff patterns for the R language. * rc/userdiff-r: userdiff: add support for R programming language

Documentation for "git send-email" has been updated with a bit more credential helper and OAuth information.

* ag/send-email-docs:
  docs: make the purpose of using app password for Gmail more clear in send-email
  docs: remove credential helper links for emails from gitcredentials
  docs: improve formatting in git-send-email documentation
  docs: add credential helper for yahoo and link Google's sendgmail tool

"git cat-file --batch" learns to understand %(objectmode) atom to allow the caller to tell missing objects (due to repository corruption) and submodules (whose commit objects are OK to be missing) apart.

* vd/cat-file-objectmode-update:
  cat-file.c: add batch handling for submodules
  cat-file: add %(objectmode) atom
  t1006: update 'run_tests' to test generic object specifiers

Code clean-up. * ly/sequencer-update-squash-is-fixup-only: sequencer: replace error() with BUG() in update_squash_messages ()

Code clean-up. * ly/do-not-localize-bug-messages: BUG(): remove leading underscore of the format string

A memory-leak in an error code path has been plugged. * ly/commit-graph-graph-write-leakfix: commit-graph: fix start_delayed_progress() leak

A memory-leak in an error code path has been plugged. * ly/fetch-pack-leakfix: builtin/fetch-pack: cleanup before return error

"git diff --no-index dirA dirB" can limit the comparison with pathspec at the end of the command line, just like normal "git diff".

* jk/diff-no-index-with-pathspec:
  diff --no-index: support limiting by pathspec
  pathspec: add flag to indicate operation without repository
  pathspec: add match_leading_pathspec variant

Meson-based build/test framework now understands TAP output generated by our tests.

* ps/meson-tap-parse:
  meson: parse TAP output generated by our tests
  meson: introduce kwargs variable for tests
  test-lib: fail on unexpectedly passing tests
  t7815: fix unexpectedly passing test on macOS
  t/test-lib: fix TAP format for BASH_XTRACEFD warning
  t/test-lib: don't print shell traces to stdout
  t983*: use prereq to check for Python-specific git-p4(1) support
  t9822: use prereq to check for ISO-8859-1 support
  t: silence output from `test_create_repo()`
  t: stop announcing prereqs

Signed-off-by: Junio C Hamano <[email protected]>

Document that related `git config` variables should be placed one-per-line instead of separated by commas. Suggested-by: Junio C Hamano <[email protected]> Signed-off-by: Collin Funk <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

CodingGuidelines update. * cf/guideline-documenting-config-vars: CodingGuidelines: document formatting of similar config variables.

Leakfix. * ly/commit-buffer-reencode-leakfix: repo_logmsg_reencode: fix memory leak when use repo_logmsg_reencode ()

Memleak fix on an error code path. * ly/pack-bitmap-root-leakfix: pack-bitmap: remove checks before bitmap_free

Doc mark-up update. * ma/doc-diff-cc-headers: diff-generate-patch.adoc: drop spurious backticks

Some leftover references to documentation source files that no longer exist, due to recent ".txt" -> ".adoc" renaming, have been corrected. * jw/doc-txt-to-adoc-refs: doc: update references to renamed AsciiDoc files

Add settings for Solaris 10 & 11. * bs/solaris-10-and-11: config.mak.uname: update settings for Solaris 10 and 11

Code clean-up. * jm/bundle-uri-debug-output-to-fp: bundle-uri: send debug output to given FILE * stream

A memory leak on an error code path has been plugged. * ly/submodule-update-failure-leakfix: builtin/submodule--helper: fix leak when remote_submodule_branch() failed

An earlier test update incorrectly lost three prerequisites on macOS, which has been corrected. * rj/meson-tap-parse-fixup: test-lib: add missing prerequisites for Darwin

Signed-off-by: Junio C Hamano <[email protected]>

lucasoshiro and others added 30 commits May 16, 2025 09:33

object-file.h: fix typo in variable declaration

53eeed0

This should be "compat", not "comapt". Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

pack-objects: extract should_attempt_deltas()

4bc0ba0

This will be helpful in a future change, which will reuse this logic. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

gitster and others added 30 commits June 12, 2025 14:19

Merge branch 'kh/maintenance-missing-tasks-docfix'

9edff09

Doc mark-up fix for a topic that has graduated to 'master'. * kh/maintenance-missing-tasks-docfix: doc: maintenance: fix linkgit syntax

Merge tag 'l10n-2.50.0-v2' of https://github.com/git-l10n/git-po

b2bfd31

l10n-2.50.0-v2 * tag 'l10n-2.50.0-v2' of https://github.com/git-l10n/git-po: l10n: zh_TW: update translation for Git 2.50

Merge branch 'jc/sed-build-fixes'

f2a6a1e

Build fix. * jc/sed-build-fixes: build: sed portability fixes

Merge branch 'ss/revert-builtin-bswap-stuff'

e1f335f

Revert a botched bswap.h change that broke ntohll() functions on big-endian systems with __builtin_bswap32/64(). * ss/revert-builtin-bswap-stuff: Revert "bswap.h: add support for built-in bswap functions"

Merge branch 'js/github-ci-win-coverity-fix'

7bd3e53

Fixes for GitHub Actions Coverity job. * js/github-ci-win-coverity-fix: ci(coverity): output the build log upon error ci(coverity): fix building on Windows

Hopefully final bits before 2.50

f1ca98f

Signed-off-by: Junio C Hamano <[email protected]>

Git 2.50

16bd9f2

Signed-off-by: Junio C Hamano <[email protected]>

Merge branch 'lo/my-first-ow-doc-update'

60f9bc3

Doc update to the more recent world order. * lo/my-first-ow-doc-update: MyFirstContribution: add walken.c to meson.build MyFirstContribution: use struct repository in examples

Merge branch 'rc/userdiff-r'

01148ca

Userdiff patterns for the R language. * rc/userdiff-r: userdiff: add support for R programming language

Merge branch 'ly/sequencer-update-squash-is-fixup-only'

870a042

Code clean-up. * ly/sequencer-update-squash-is-fixup-only: sequencer: replace error() with BUG() in update_squash_messages ()

Merge branch 'ly/do-not-localize-bug-messages'

1f622bb

Code clean-up. * ly/do-not-localize-bug-messages: BUG(): remove leading underscore of the format string

Merge branch 'ly/commit-graph-graph-write-leakfix'

b5a135b

A memory-leak in an error code path has been plugged. * ly/commit-graph-graph-write-leakfix: commit-graph: fix start_delayed_progress() leak

Merge branch 'ly/fetch-pack-leakfix'

5e22d03

A memory-leak in an error code path has been plugged. * ly/fetch-pack-leakfix: builtin/fetch-pack: cleanup before return error

Start 2.51 cycle, the first batch

f9aa0ee

Signed-off-by: Junio C Hamano <[email protected]>

Merge branch 'cf/guideline-documenting-config-vars'

f1a1d79

CodingGuidelines update. * cf/guideline-documenting-config-vars: CodingGuidelines: document formatting of similar config variables.

Merge branch 'ly/commit-buffer-reencode-leakfix'

0d0d56b

Leakfix. * ly/commit-buffer-reencode-leakfix: repo_logmsg_reencode: fix memory leak when use repo_logmsg_reencode ()

Merge branch 'ly/pack-bitmap-root-leakfix'

f1af195

Memleak fix on an error code path. * ly/pack-bitmap-root-leakfix: pack-bitmap: remove checks before bitmap_free

Merge branch 'ma/doc-diff-cc-headers'

617318c

Doc mark-up update. * ma/doc-diff-cc-headers: diff-generate-patch.adoc: drop spurious backticks

Merge branch 'jw/doc-txt-to-adoc-refs'

19612d0

Some leftover references to documentation source files that no longer exist, due to recent ".txt" -> ".adoc" renaming, have been corrected. * jw/doc-txt-to-adoc-refs: doc: update references to renamed AsciiDoc files

Merge branch 'bs/solaris-10-and-11'

aa6ab03

Add settings for Solaris 10 & 11. * bs/solaris-10-and-11: config.mak.uname: update settings for Solaris 10 and 11

Merge branch 'jm/bundle-uri-debug-output-to-fp'

a6cdbc8

Code clean-up. * jm/bundle-uri-debug-output-to-fp: bundle-uri: send debug output to given FILE * stream

Merge branch 'ly/submodule-update-failure-leakfix'

92daf08

A memory leak on an error code path has been plugged. * ly/submodule-update-failure-leakfix: builtin/submodule--helper: fix leak when remote_submodule_branch() failed

Merge branch 'rj/meson-tap-parse-fixup'

e363d5f

An earlier test update incorrectly lost three prerequisites on macOS, which has been corrected. * rj/meson-tap-parse-fixup: test-lib: add missing prerequisites for Darwin

The second batch

cb3b403

Signed-off-by: Junio C Hamano <[email protected]>
