Skip to content
This repository was archived by the owner on Dec 15, 2022. It is now read-only.

Use git ls-files for loading paths when available. #301

Closed
wants to merge 3 commits into from

Conversation

eeejay
Copy link

@eeejay eeejay commented Jun 1, 2017

Description of the Change

Leverage git to index a git-controlled project quickly by using git-ls-files where appropriate

Background

Walking though a large codebase with node fs is very slow. Gecko has almost 200k files, and it takes 17 seconds on a fast machine. The project is constantly being reindexed after each editor focus. There needs to be a better way.

Solution

Allow a native binary to walk the tree. A simple find or git ls-files is 40x faster than the current nodejs task. I used dugite here to wrap the git commands instead of using child_process because it is more portable and Atom already pulls it in as a dependency elsewhere.

EDIT: This is faster because the non-git solution needs to query each file if it is ignored or not. This means reading, at least, .gitignore + .git/info/exclude on each traversed node.

Alternate Designs

  1. find on POSIX systems performs just as well, I decided to go with git-ls-files because it is more portable, and should be available on Windows, and because it has a simple flag for excluding VCS ignored files.
  2. Use a node, libgit2 based, library.git-utils doesn't expose an API that would allow this. I tried replicating git-ls-files with nodegit to see if it is possible, and it is 4x slower. So I think the git command line solution is the best.
  3. Use child_process.spawn. This is twice as fast as this patch, because it would allow streaming the stdout. dugite currently does not allow that and needs a fixed buffer. I went with dugite because it seemed the most portable, even though it is pretty certain Atom will have access to git in its PATH.

Benefits

A 40x speedup.
Time to index Gecko before patch: 16.5s.
After patch: 0.4s

Possible Drawbacks

Following symlink subdirs is broken in this solution.

Applicable Issues

What does this mean??

@eeejay
Copy link
Author

eeejay commented Jun 1, 2017

I guess this could be a fix for issue #271

@winstliu
Copy link
Contributor

winstliu commented Jun 2, 2017

This needs tests before it can be considered.

Following symlink subdirs is broken in this solution.

Do any of the alternatives handle symlinks correctly?

What does this mean??

You figured it out in your next comment 😉

even though it is pretty certain Atom will have access to git in its PATH.

At least on Windows, it is not required to have Git installed when using Atom.

@eeejay
Copy link
Author

eeejay commented Jun 2, 2017

I can't figure out what kind of test would be useful here. Ultimately, this change is great because all the regression tests pass, which proves that this change is functionally identical where it matters.

Do you have any ideas as to what kind of test would be good here?

As for the symlink issue, we can potentially use find for that. It wouldn't work on Windows, which is probably fine because symlinks don't work on Windows anyway, afaik. The strategy would look like this:

  1. Run something like find -type l -xtype d to get all the directory symbolic links.
  2. Use the output from the previous step to exclude all those matches from git ls-files (-x)
  3. Run another find to traverse the symbolic linked directories and append output to results from previous step. This step would only be needed if there are actual folders.

I don't think that is ideal, but I also see a conflict with core.followSymlinks and core.excludeVcsIgnoredPaths. Symlinks are never followed in git, it sees them as a leaf node, so their contents are implicitly ignored. So, if you want to exclude VCS ignored paths, you would never follow a version controlled symlink. The only time you would follow a symlink in a VCS folder is if it were untracked. With that in mind, I would suggest the equivalent of these steps:

  1. Capture files in git cache: git ls-files -c --exclude-standard
  2. Capture untracked (other files) git ls-files -o --exclude-standard pipe output to find -L
  3. Append both lists from previous steps as result.

@winstliu
Copy link
Contributor

winstliu commented Jun 5, 2017

Do you have any ideas as to what kind of test would be good here?

Disclaimer: I don't know what the code coverage for fuzzy-finder looks like. It could be that the test I'm suggesting already exists.

I'd like to see a test asserting that we don't attempt to use the Git method when loading paths for a non-Git project.

@eeejay
Copy link
Author

eeejay commented Jun 16, 2017

I'd like to see a test asserting that we don't attempt to use the Git method when loading paths for a non-Git project.

This is already tested for!

done()
repo = GitRepository.open(@rootPath, refreshOnWindowFocus: false)
if repo?.relativize(path.join(@rootPath, 'test')) is 'test'
args = ['ls-files', '-c', '-o', '-z']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind using the extended command args for clarity? Eg --cached, --others, etc.

if @ignoreVcsIgnores
args.push('--exclude-standard')
for ignoredName in @ignoredNames
args.push("-x")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. (Also it looks like these should be single quotes to match the other arg quotes?)

@eeejay
Copy link
Author

eeejay commented Jun 21, 2017

Updated! (don't know if you get notified when another commit is pushed)

@winstliu
Copy link
Contributor

👍 Sorry for the late reply; was on vacation. I've brought this PR up to the other team members for consideration.

@ungb
Copy link

ungb commented Aug 28, 2017

sorry for the late arrival to this PR. Here are the following test scenario I plan to run through:

  • git projects vs non-git projects
  • Symlink on windows/mac
  • Network/SSHFS mounted drive.

I'll be spending time this week to look into these, please let me know if you feel like anything is missing from this list. I'll be testing first before the changes and reporting back if anything changed or is the same.

One thing I see is:

Following symlink subdirs is broken in this solution.

@eeejay do you know if this has been fixed with a more recent commit?

@ungb
Copy link

ungb commented Sep 26, 2017

Thanks for the work you've put into this PR.

Unfortunately, we won't be able to accept this as it is currently written because it causes a change in behavior with regard to symlinks. We are also looking at making indexing better by incorporating some of the work we've invested in our file watcher infrastructure. If you can update this PR to address the symlinks issue, we'll reevaluate it after the indexing changes.

Thanks again for your help!

@eeejay
Copy link
Author

eeejay commented Mar 14, 2019

Pull request #367 does this now.

@eeejay eeejay closed this Mar 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants