Check how git interferes with a git repository that we are reading #634

Closed
ajnavarro opened this issue Nov 28, 2018 · 4 comments

@ajnavarro
Contributor

Can we still read the content of a repository when git is making changes? Is go-git prepared for that? Is go-git using the git locks?

@ajnavarro ajnavarro added the research Something that requires research label Nov 28, 2018
@kuba-- kuba-- self-assigned this Nov 29, 2018
@kuba--
Contributor

kuba-- commented Nov 30, 2018

- Repo:  [GitHub - golang/go: The Go programming language](https://github.com/golang/go)
- gitbase/master with oniguruma

Case 1

Query:

SELECT files.repository_id, files.file_path
FROM files
NATURAL JOIN refs
NATURAL JOIN commits
WHERE SUBSTRING(ref_name,1,15) = 'refs/heads/HEAD'

Git:

$ git checkout dev.debug
$ git gc

Error:

unknown error: open /private/tmp/repos/go/.git/objects/pack/pack-565b7a9733a5fc7ff8978845f49477ee4a6252b5.pack: no such file or directory

Case 2 (not related to interop git-gitbase)

Query:

SELECT 
    file_path
    FROM (
    SELECT
        file_path,
        uast_extract(
            uast(
                blob_content,
                LANGUAGE(
                    file_path,
                    blob_content
                ),
                "//FuncLit"
            ),
            "internalRole"
        ) AS uast
    FROM files
    WHERE LANGUAGE(file_path, blob_content) = 'Go'
) AS q1
LIMIT 1000;

Git:

$ git checkout -b kuba
$ cp final-noclosure.go kuba.go # add a new .go file
$ git add .
$ git commit -a

Error:
I got an error because bblfsh started parsing a lot of Go files (with LIMIT 10 it worked fine). All Go drivers were busy, so subsequent requests timed out (regardless of file size).

bblfshd     | time="2018-11-30T13:13:35Z" level=warning msg="unable to allocate a driver instance: timeout, all drivers are busy" language=go
bblfshd     | time="2018-11-30T13:13:35Z" level=error msg="request processed content 873 bytes error: timeout, all drivers are busy" elapsed=72.816µs language=go
bblfshd     | time="2018-11-30T13:13:59Z" level=error msg="request processed content 873 bytes error: rpc error: code = Canceled desc = context canceled" elapsed=152.649µs language=go
bblfshd     | time="2018-11-30T13:14:55Z" level=error msg="request processed content 330743 bytes error: rpc error: code = DeadlineExceeded desc = context deadline exceeded" elapsed=5.000369555s language=go

At some point bblfshd threw an exception:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "site-packages/urllib3/connectionpool.py", line 384, in _make_request
  File "<string>", line 2, in raise_from
  File "site-packages/urllib3/connectionpool.py", line 380, in _make_request
  File "http/client.py", line 1331, in getresponse
  File "http/client.py", line 297, in begin
  File "http/client.py", line 258, in _read_status
  File "socket.py", line 586, in readinto
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "site-packages/requests/adapters.py", line 449, in send
  File "site-packages/urllib3/connectionpool.py", line 638, in urlopen
  File "site-packages/urllib3/util/retry.py", line 367, in increment
  File "site-packages/urllib3/packages/six.py", line 686, in reraise
  File "site-packages/urllib3/connectionpool.py", line 600, in urlopen
  File "site-packages/urllib3/connectionpool.py", line 386, in _make_request
  File "site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

Case 3

Git:

# revert changes (kuba.go file as a copy of final-noclosure.go)
$ git revert 771c2dcfd4a7daea53056af419fb83d7a52fa638
$ git commit -a
$ git gc --auto
$ git checkout master
$ git branch -D kuba
$ git reset --hard

Query:

SELECT 
    file_path, blob_hash, SUBSTRING(`file_path`, 19, 10) AS NAME
    FROM (
    SELECT
        blob_hash,
        file_path,
        uast_extract(
            uast(
                blob_content,
                LANGUAGE(
                    file_path,
                    blob_content
                ),
                "//FuncLit"
            ),
            "internalRole"
        ) AS uast
    FROM files
    WHERE LANGUAGE(file_path, blob_content) = 'Go' 
) AS q1
LIMIT 10;

I still saw kuba.go (changes were reverted, branch deleted). Then I ran git gc, and the query no longer listed the kuba.go file.

Case 4

Generally git gc breaks running queries:

unknown error: open /private/tmp/repos/go/.git/objects/pack/pack-5ac0ff75115f8763016c2f775bd221c49afb67a8.pack: no such file or directory

Case 5

Created a new kuba.go file, so the query:

SELECT 
    file_path, blob_hash, SUBSTRING(`file_path`, 19, 10) AS fname
    FROM (
    SELECT
        blob_hash,
        file_path,
        uast_extract(
            uast(
                blob_content,
                LANGUAGE(
                    file_path,
                    blob_content
                ),
                "//FuncLit"
            ),
            "internalRole"
        ) AS uast
    FROM files
    WHERE LANGUAGE(file_path, blob_content) = 'Go' 
) AS q1
WHERE fname='kuba.go'
LIMIT 1;

Git:

$ rm kuba.go
$ git add .
$ git commit -a

Ran the query again and it could not find the file (but after reverting and deleting a branch the file could still be found).

By the way, if we parse 100+ files, bblfshd becomes unresponsive and times out.

Case 6

Query:

CREATE INDEX email_idx ON commits USING pilosa (commit_author_email);

Git:

$ git gc

Query:

SELECT * FROM commits WHERE commit_author_email!='....' LIMIT 100;

Error: unknown error: object not found

Case 7

Because we assume that gitbase is read-only, we don't invalidate our cache (I assume).
So I created a new branch, made a couple of commits, reverted my changes, and deleted the branch. I also ran git prune. But both queries:

SELECT * FROM commits WHERE commit_author_name='kuba--' 
SELECT count(*) FROM commits;

still show my commits. I assumed that the query:
SELECT count(commit_author_name) FROM commits // 40229
should return the same number as:
$ git log --all --pretty=%cn | wc -l // 40224
The difference is exactly my 5 commits.

@kuba--
Contributor

kuba-- commented Nov 30, 2018

@ajnavarro - PTAL

@ajnavarro
Contributor Author

Some annotations:

  • Even if you delete an object, go-git will still list it. The object is not deleted from the packfile; it is just no longer pointed to by any reference. Related issue: "Proposal: Make CommitObjects(), BlobObjects() and so on methods do not return unreachable objects due to performance problems" (go-git#1023).
  • I don't think it is necessary to invalidate the objects cache. The sources of truth for that cache are the .idx files. If an object is not in the .idx file, it won't be requested from the cache. So the problem is the same as before: the object is still there but unreachable from the references. Also, the LRU cache will eventually discard such objects once they are no longer used for retrieval or for resolving deltas.

@ajnavarro
Contributor Author

ajnavarro commented Dec 3, 2018

The main problem I can see here is that gitbase or go-git keeps references to outdated packfiles, and if new objects or packfiles are created, we are not able to read them either.

Also, go-git is not using git locks at all.
