Use packfile reader when possible #295

Closed · ajnavarro opened this issue Jun 4, 2018 · 6 comments
Labels: feature proposal (proposal for new additions or changes)

@ajnavarro (Contributor) commented Jun 4, 2018:

Maybe we can improve object read performance by accessing the packfile directly.

To be able to do that, we need the offset of the object in the packfile, which will be provided by the index.

First of all, we need to check the performance impact of that change on a big repository with a quick benchmark, using the Repository type and the packfile Decoder type directly, in the following cases:

When random access is needed, use a slice of hashes with Repository and a list of offsets with Decoder:

  • Read the first blob.
  • Read the first 100 blobs.
  • Read 100 blobs, skipping 5 at each step.
  • Read the last blob.

This benchmark should be done using only go-git; gitbase is not necessary.

If the results using Decoder directly are better than using Repository, we can proceed to use Decoder directly in some cases.
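
A minimal sketch of what this benchmark could look like, assuming go-git v4 (gopkg.in/src-d/go-git.v4) and its packfile Decoder API at the time (NewDecoder, DecodeObjectAt); the repository path, the packfile name, and the hash/offset lists are placeholders, and the offsets would come from the idx file:

```go
package packbench

import (
	"os"
	"testing"

	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/plumbing/format/packfile"
	"gopkg.in/src-d/go-git.v4/storage/memory"
)

// Placeholders to fill in before running: blob hashes for the
// Repository side, packfile offsets (from the idx file) for the
// Decoder side.
var (
	hashes  []plumbing.Hash
	offsets []int64
)

// "Read the first blob" through the high-level Repository API.
func BenchmarkRepositoryFirst(b *testing.B) {
	for i := 0; i < b.N; i++ {
		repo, err := git.PlainOpen("/path/to/repo") // placeholder
		if err != nil {
			b.Fatal(err)
		}
		if _, err := repo.BlobObject(hashes[0]); err != nil {
			b.Fatal(err)
		}
	}
}

// "Read the first blob" by seeking straight to its offset in the packfile.
func BenchmarkDecoderFirst(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f, err := os.Open("/path/to/repo/.git/objects/pack/pack-xxx.pack") // placeholder
		if err != nil {
			b.Fatal(err)
		}
		d, err := packfile.NewDecoder(packfile.NewScanner(f), memory.NewStorage())
		if err != nil {
			b.Fatal(err)
		}
		// Random access: jump to the object using the offset from the index.
		if _, err := d.DecodeObjectAt(offsets[0]); err != nil {
			b.Fatal(err)
		}
		f.Close()
	}
}
```

The "first 100" and "skip 5" cases would just loop over the hashes/offsets the same way.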

@ajnavarro added the feature proposal label Jun 4, 2018
@erizocosmico self-assigned this Jun 4, 2018

@erizocosmico (Contributor) commented:

Here's the benchmark: https://github.com/erizocosmico/packfile-read-benchmark

Using the decoder is orders of magnitude faster. Note that this is without using a cache; I tried reusing the decoder and repository instances instead of opening new ones on each run, and the results are fairly similar (except for the last case).

@erizocosmico (Contributor) commented:

These are the results if we also open the repository in each iteration of the decoder benchmark:

go test -bench=. -benchmem
goos: darwin
goarch: amd64
pkg: github.com/erizocosmico/packfile-read-benchmark
BenchmarkDecoder/first-4         	    2000	   1216390 ns/op	   95273 B/op	     226 allocs/op
BenchmarkDecoder/first_100-4     	     100	  14238049 ns/op	 9215988 B/op	    3537 allocs/op
BenchmarkDecoder/first_100_skip_5-4         	     100	  23771055 ns/op	13205025 B/op	    8680 allocs/op
BenchmarkDecoder/last-4                     	    3000	    871919 ns/op	  286663 B/op	     273 allocs/op
BenchmarkRepository/first-4                 	      10	 145066813 ns/op	60297573 B/op	 1238044 allocs/op
BenchmarkRepository/first_100-4             	      10	 161075898 ns/op	74482648 B/op	 1246176 allocs/op
BenchmarkRepository/first_100_skip_5-4      	      10	 221480822 ns/op	78426575 B/op	 1251132 allocs/op
BenchmarkRepository/last-4                  	      10	 151260601 ns/op	60488844 B/op	 1238090 allocs/op
PASS
ok  	github.com/erizocosmico/packfile-read-benchmark	49.224s

As we can see, it's almost the same.

@erizocosmico (Contributor) commented:

Updated the benchmark results in the README after decoding the objects in the decoder benchmarks.

@ajnavarro (Contributor, Author) commented:

So, to implement the usage of Decoder when an index is provided, we need to change the following things:

  • The index value will now contain the repository path, the packfile hash, and the object offset (see the sketch after this list).
  • RepositoryPool will now be able to provide a Repository object or a Decoder object. Per repository we can have more than one Decoder instance, one per packfile.
  • Indexable tables that are created from objects should be able to create iterators from Decoder, not only from Repository.
  • The squash iterator is not able to work with indexes yet, so keep it as it is for now.
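
A rough sketch of how those pieces could fit together; all names here (IndexValue, RepositoryPool, GetDecoder) are illustrative, not the actual gitbase API:

```go
package pool

import (
	"fmt"

	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/plumbing/format/packfile"
)

// IndexValue is what the index would store per row: enough to jump
// straight to the object inside a packfile.
type IndexValue struct {
	RepositoryPath string        // repository location on disk
	PackfileHash   plumbing.Hash // which packfile holds the object
	Offset         int64         // object offset within that packfile
}

// RepositoryPool hands out repositories and, now, decoders too:
// possibly several per repository, one per packfile.
type RepositoryPool struct {
	decoders map[string]map[plumbing.Hash]*packfile.Decoder
}

// GetRepo keeps the current behavior: a full go-git repository.
func (p *RepositoryPool) GetRepo(path string) (*git.Repository, error) {
	return git.PlainOpen(path)
}

// GetDecoder returns the decoder for a (repository, packfile) pair,
// if one has been registered.
func (p *RepositoryPool) GetDecoder(path string, pack plumbing.Hash) (*packfile.Decoder, error) {
	if d, ok := p.decoders[path][pack]; ok {
		return d, nil
	}
	return nil, fmt.Errorf("no decoder for packfile %s in %s", pack, path)
}
```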

@erizocosmico (Contributor) commented Jun 4, 2018:

The files table will not be indexable using this method. For files you need the full path from the root directory, which can only be obtained by traversing the whole tree from a commit.

So, I see two options for this table:

  1. not make it indexable (which dooms any query to a life of heavy full scans)
  2. have a slightly different index for this table that also stores the full path of the file, the offset of the tree object, the index of the tree entry, and the offset of the blob (sketched below).

Right now I can't think of any downsides to the second option other than having an implementation only for this table (and the increased size of the index).
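
Under the second option, the entry stored in the files index could look roughly like this (a sketch; field names are made up):

```go
package index

import "gopkg.in/src-d/go-git.v4/plumbing"

// FilesIndexValue sketches option 2's entry for the files table. The
// extra fields let a row be rebuilt without traversing the whole tree
// from the commit.
type FilesIndexValue struct {
	RepositoryPath string        // repository location on disk
	PackfileHash   plumbing.Hash // packfile holding the objects
	FilePath       string        // full path of the file from the root directory
	TreeOffset     int64         // offset of the tree object in the packfile
	TreeEntryIndex int           // index of the entry inside that tree
	BlobOffset     int64         // offset of the file's blob in the packfile
}
```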

Just like the files table, ref_commits, commit_trees and commit_blobs cannot fit the single standard model we defined for the rest of the tables. In general, any table that has no direct mapping to a git object cannot be indexed that way.

These would only require a few more things:

  • ref_commits: would store the hash of the commit, the index, and the ref name (WARNING: this may be huge and expensive)
  • commit_trees: would store the hash of the commit and the hash of the tree
  • commit_blobs: would store the hash of the commit and the hash of the blob

As we can see, those three tables basically require being dumped wholesale into an index for indexes to work with them. As a perk, there would be no calls to the repository required at all, because everything is already stored in the index.
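
Sketching the entries for those three tables (field names are hypothetical, and "index" in ref_commits is read here as the commit's position in the reference's history):

```go
package index

import "gopkg.in/src-d/go-git.v4/plumbing"

// ref_commits: one entry per (ref, commit) pair, which is why this
// index may get huge and expensive.
type RefCommitsIndexValue struct {
	CommitHash plumbing.Hash // hash of the commit
	Index      int           // position of the commit in the ref's history
	RefName    string        // name of the reference
}

// commit_trees: one entry per (commit, tree) pair.
type CommitTreesIndexValue struct {
	CommitHash plumbing.Hash
	TreeHash   plumbing.Hash
}

// commit_blobs: one entry per (commit, blob) pair.
type CommitBlobsIndexValue struct {
	CommitHash plumbing.Hash
	BlobHash   plumbing.Hash
}
```

With these, an indexed lookup can answer the query entirely from the index, without touching the repository.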

@erizocosmico (Contributor) commented Jun 5, 2018:

Progress:

UPDATE: This has changed a bit, since the implementation is not going to be the same as it was initially thought, so this checklist is outdated.
