Use packfile reader when possible #295

Closed · ajnavarro opened this issue Jun 4, 2018 · 6 comments
Labels: feature proposal (proposal for new additions or changes)

@ajnavarro (Contributor) commented Jun 4, 2018:

Maybe we can improve object read performance by accessing the packfile directly.

To be able to do that, we need the offset of the object in the packfile, which will be provided by the index.

First of all, we need to check the performance impact of that change on a big repository with a quick benchmark, using the Repository type and the packfile Decoder type directly, in the following cases:

When random access is needed, use a slice of hashes with Repository and a list of offsets with Decoder:

  • Read the first blob.
  • Read the first 100 blobs.
  • Read 100 blobs, skipping 5 at each step.
  • Read the last blob.

This benchmark should be done using only go-git; gitbase is not necessary.

If the results using Decoder directly are better than using Repository, we can proceed to use Decoder directly in some cases.
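
A minimal sketch of what this benchmark could look like, assuming go-git v4 (gopkg.in/src-d/go-git.v4) and its packfile Decoder API at the time (NewDecoder, DecodeObjectAt); the repository path, the packfile name, and the hash/offset lists are placeholders, and the offsets would come from the idx file:

```go
package packbench

import (
	"os"
	"testing"

	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/plumbing/format/packfile"
	"gopkg.in/src-d/go-git.v4/storage/memory"
)

// Placeholders to fill in before running: blob hashes for the
// Repository side, packfile offsets (from the idx file) for the
// Decoder side.
var (
	hashes  []plumbing.Hash
	offsets []int64
)

// "Read the first blob" through the high-level Repository API.
func BenchmarkRepositoryFirst(b *testing.B) {
	for i := 0; i < b.N; i++ {
		repo, err := git.PlainOpen("/path/to/repo") // placeholder
		if err != nil {
			b.Fatal(err)
		}
		if _, err := repo.BlobObject(hashes[0]); err != nil {
			b.Fatal(err)
		}
	}
}

// "Read the first blob" by seeking straight to its offset in the packfile.
func BenchmarkDecoderFirst(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f, err := os.Open("/path/to/repo/.git/objects/pack/pack-xxx.pack") // placeholder
		if err != nil {
			b.Fatal(err)
		}
		d, err := packfile.NewDecoder(packfile.NewScanner(f), memory.NewStorage())
		if err != nil {
			b.Fatal(err)
		}
		// Random access: jump to the object using the offset from the index.
		if _, err := d.DecodeObjectAt(offsets[0]); err != nil {
			b.Fatal(err)
		}
		f.Close()
	}
}
```

The "first 100" and "skip 5" cases would just loop over the hashes/offsets the same way.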

@ajnavarro added the feature proposal label Jun 4, 2018
@erizocosmico self-assigned this Jun 4, 2018

@erizocosmico (Contributor) commented:

Here's the benchmark: https://github.com/erizocosmico/packfile-read-benchmark

Using the decoder is orders of magnitude faster. Note that this is without using a cache; I tried reusing the decoder and repository instances instead of opening new ones on each run, and the results are fairly similar (except for the last case).

@erizocosmico (Contributor) commented:

These are the results if we also open the repository in each iteration of the decoder benchmark:

go test -bench=. -benchmem
goos: darwin
goarch: amd64
pkg: github.com/erizocosmico/packfile-read-benchmark
BenchmarkDecoder/first-4         	    2000	   1216390 ns/op	   95273 B/op	     226 allocs/op
BenchmarkDecoder/first_100-4     	     100	  14238049 ns/op	 9215988 B/op	    3537 allocs/op
BenchmarkDecoder/first_100_skip_5-4         	     100	  23771055 ns/op	13205025 B/op	    8680 allocs/op
BenchmarkDecoder/last-4                     	    3000	    871919 ns/op	  286663 B/op	     273 allocs/op
BenchmarkRepository/first-4                 	      10	 145066813 ns/op	60297573 B/op	 1238044 allocs/op
BenchmarkRepository/first_100-4             	      10	 161075898 ns/op	74482648 B/op	 1246176 allocs/op
BenchmarkRepository/first_100_skip_5-4      	      10	 221480822 ns/op	78426575 B/op	 1251132 allocs/op
BenchmarkRepository/last-4                  	      10	 151260601 ns/op	60488844 B/op	 1238090 allocs/op
PASS
ok  	github.com/erizocosmico/packfile-read-benchmark	49.224s

As we can see, it's almost the same.

@erizocosmico (Contributor) commented:

Updated the benchmark results in the README after decoding the objects in the decoder benchmarks.

@ajnavarro (Contributor, Author) commented:

So, to implement the usage of Decoder when an index is provided, we need to change the following things:

  • The index value will now contain the repository path, the packfile hash, and the object offset (see the sketch after this list).
  • RepositoryPool will now be able to provide a Repository object or a Decoder object. Per repository we can have more than one Decoder instance, one per packfile.
  • Indexable tables that are created from objects should be able to create iterators from Decoder, not only from Repository.
  • The squash iterator is not able to work with indexes yet, so keep it as it is for now.
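
A rough sketch of how those pieces could fit together; all names here (IndexValue, RepositoryPool, GetDecoder) are illustrative, not the actual gitbase API:

```go
package pool

import (
	"fmt"

	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/plumbing/format/packfile"
)

// IndexValue is what the index would store per row: enough to jump
// straight to the object inside a packfile.
type IndexValue struct {
	RepositoryPath string        // repository location on disk
	PackfileHash   plumbing.Hash // which packfile holds the object
	Offset         int64         // object offset within that packfile
}

// RepositoryPool hands out repositories and, now, decoders too:
// possibly several per repository, one per packfile.
type RepositoryPool struct {
	decoders map[string]map[plumbing.Hash]*packfile.Decoder
}

// GetRepo keeps the current behavior: a full go-git repository.
func (p *RepositoryPool) GetRepo(path string) (*git.Repository, error) {
	return git.PlainOpen(path)
}

// GetDecoder returns the decoder for a (repository, packfile) pair,
// if one has been registered.
func (p *RepositoryPool) GetDecoder(path string, pack plumbing.Hash) (*packfile.Decoder, error) {
	if d, ok := p.decoders[path][pack]; ok {
		return d, nil
	}
	return nil, fmt.Errorf("no decoder for packfile %s in %s", pack, path)
}
```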

@erizocosmico (Contributor) commented Jun 4, 2018:

The files table will not be indexable using this method. For files you need the full path from the root directory, which can only be obtained by traversing the whole tree from a commit.

So, I see two options for this table:

  1. not make it indexable (which dooms any query to a life of heavy full scans)
  2. have a slightly different index for this table that also stores the full path of the file, the offset of the tree object, the index of the tree entry, and the offset of the blob (sketched below).

Right now I can't think of any downsides to the second option other than having an implementation only for this table (and the increased size of the index).
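
Under the second option, the entry stored in the files index could look roughly like this (a sketch; field names are made up):

```go
package index

import "gopkg.in/src-d/go-git.v4/plumbing"

// FilesIndexValue sketches option 2's entry for the files table. The
// extra fields let a row be rebuilt without traversing the whole tree
// from the commit.
type FilesIndexValue struct {
	RepositoryPath string        // repository location on disk
	PackfileHash   plumbing.Hash // packfile holding the objects
	FilePath       string        // full path of the file from the root directory
	TreeOffset     int64         // offset of the tree object in the packfile
	TreeEntryIndex int           // index of the entry inside that tree
	BlobOffset     int64         // offset of the file's blob in the packfile
}
```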

Just like the files table, ref_commits, commit_trees and commit_blobs cannot fit the single standard model we defined for the rest of the tables. In general, any table that has no direct mapping to a git object cannot be indexed that way.

These would only require a few more things:

  • ref_commits: would store the hash of the commit, the index, and the ref name (WARNING: this may be huge and expensive)
  • commit_trees: would store the hash of the commit and the hash of the tree
  • commit_blobs: would store the hash of the commit and the hash of the blob

As we can see, those three tables basically require being dumped wholesale into an index for indexes to work with them. As a perk, there would be no calls to the repository required at all, because everything is already stored in the index.
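
Sketching the entries for those three tables (field names are hypothetical, and "index" in ref_commits is read here as the commit's position in the reference's history):

```go
package index

import "gopkg.in/src-d/go-git.v4/plumbing"

// ref_commits: one entry per (ref, commit) pair, which is why this
// index may get huge and expensive.
type RefCommitsIndexValue struct {
	CommitHash plumbing.Hash // hash of the commit
	Index      int           // position of the commit in the ref's history
	RefName    string        // name of the reference
}

// commit_trees: one entry per (commit, tree) pair.
type CommitTreesIndexValue struct {
	CommitHash plumbing.Hash
	TreeHash   plumbing.Hash
}

// commit_blobs: one entry per (commit, blob) pair.
type CommitBlobsIndexValue struct {
	CommitHash plumbing.Hash
	BlobHash   plumbing.Hash
}
```

With these, an indexed lookup can answer the query entirely from the index, without touching the repository.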

@erizocosmico (Contributor) commented Jun 5, 2018:

Progress:

UPDATE: This has changed a bit, since the implementation is not going to be the same as it was initially thought, so this checklist is outdated.
