Description
Hello (first time contributing here)
There are a few offline spell checkers among VSCode extensions, but they are based on seriously flawed JavaScript implementations of the Hunspell spell checker.
Hunspell is nowadays probably the most widespread standard for the spell-checking layer. It is used on macOS, on Linux, and in some software (e.g. LibreOffice) on Windows. It is also used by both Atom and Sublime Text. There is an enormous collection of polished dictionaries for Hunspell.
There exist some JavaScript implementations that use Hunspell's name but do not in fact implement its critical piece of functionality: the lexical parser. I have checked these three:
hunspell-spellchecker
Typo.js
nspell
All three work more or less on the same simple idea: load the dictionary into memory (into an associative table, a.k.a. a dictionary; an object, to be precise). They use Hunspell's affixes (the .aff file) to generate ALL variants of the words found in the dictionary (the .dic file) and then keep them in memory. Checking the spelling of a word then amounts to asking the table whether the word exists or not. Simple, but it has these implications:
- Loading takes a lot of time;
- It takes a lot of memory too;
- Memory consumption causes them to crash on dictionaries with a richer affix system (two of the three mentioned crash; the third does not apply all of the affixes).
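To illustrate why this approach grows so quickly, here is a minimal sketch of eager expansion. It is not code from any of the three libraries: the `SuffixRule` and `DicEntry` shapes, the `expandAll` function and the sample data are all hypothetical, and real .aff rules also carry conditions, prefixes and cross-products that are ignored here.

```typescript
// Hypothetical, heavily simplified suffix rule (real .aff rules are richer).
interface SuffixRule {
  flag: string;  // flag letter a .dic entry uses to opt into this rule
  strip: string; // characters removed from the end of the stem ("" for none)
  add: string;   // characters appended after stripping
}

// Hypothetical .dic entry: a stem plus the flags attached to it.
interface DicEntry {
  stem: string;
  flags: string;
}

// Eagerly generate every surface form and keep all of them in memory.
function expandAll(entries: DicEntry[], rules: SuffixRule[]): Set<string> {
  const words = new Set<string>();
  for (const { stem, flags } of entries) {
    words.add(stem);
    for (const rule of rules) {
      if (!flags.includes(rule.flag)) continue; // entry did not opt in
      if (!stem.endsWith(rule.strip)) continue; // rule does not apply to this stem
      words.add(stem.slice(0, stem.length - rule.strip.length) + rule.add);
    }
  }
  return words;
}

// Made-up sample data, only for illustration.
const sampleRules: SuffixRule[] = [
  { flag: "S", strip: "", add: "s" },
  { flag: "D", strip: "", add: "ed" },
  { flag: "G", strip: "", add: "ing" },
];
const sampleEntries: DicEntry[] = [
  { stem: "walk", flags: "SDG" },
  { stem: "talk", flags: "SDG" },
  { stem: "cat", flags: "S" },
];

// 3 stems become 10 stored words; the table grows with stems times applicable rules.
const expanded = expandAll(sampleEntries, sampleRules);
```

With only three rules the table is already more than three times the size of the word list; a language with hundreds of applicable affixes per stem multiplies accordingly.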
For example, when running hunspell-spellchecker (there is a SpellChecker extension based on it) with the English dictionary ("en_US", 62K+ words in the dictionary), memory consumption peaks at about 500 MB and stays constantly above 250 MB. It crashes on the Polish dictionary ("pl_PL", 300K+ words in the dictionary) after memory consumption reaches about 1.5 GB (there are reports of other dictionaries doing the same), with a "JavaScript heap out of memory" message hidden well under the hood. Hunspell has a lexical parser which allows it to use these two sets (dictionary and affixes) "on the fly", without the need to merge them and thus without exploding memory consumption and load time.
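The "on the fly" idea can be sketched roughly as follows. This is only an illustration under strong assumptions, not Hunspell's actual algorithm: the `SuffixRule` shape, the `isKnown` function and the sample data are hypothetical, and conditions, prefixes and compounding are left out. The point is that only the stems and the rules are stored, and a word is checked by undoing a suffix and looking up the resulting stem.

```typescript
// Hypothetical, heavily simplified suffix rule (real .aff rules are richer).
interface SuffixRule {
  flag: string;  // flag letter a .dic entry uses to opt into this rule
  strip: string; // characters removed from the stem before `add` is appended
  add: string;   // suffix that appears on the surface form
}

// Check a word against stems + rules without expanding the dictionary.
// `stems` maps each stem to its flag string, as read from a .dic file.
function isKnown(
  word: string,
  stems: Map<string, string>,
  rules: SuffixRule[],
): boolean {
  if (stems.has(word)) return true; // the word is itself a stem
  for (const rule of rules) {
    if (!word.endsWith(rule.add)) continue;
    // Undo the rule: drop the added suffix, restore the stripped characters.
    const stem = word.slice(0, word.length - rule.add.length) + rule.strip;
    const flags = stems.get(stem);
    if (flags !== undefined && flags.includes(rule.flag)) return true;
  }
  return false;
}

// Made-up sample data, only for illustration.
const suffixRules: SuffixRule[] = [
  { flag: "S", strip: "", add: "s" },
  { flag: "G", strip: "", add: "ing" },
  { flag: "Y", strip: "y", add: "ies" },
];
const stems = new Map<string, string>([
  ["walk", "SG"],
  ["cat", "S"],
  ["try", "Y"],
]);

const walkingOk = isKnown("walking", stems, suffixRules); // true: "walk" carries flag G
const triesOk = isKnown("tries", stems, suffixRules);     // true: "try" carries flag Y
const catingOk = isKnown("cating", stems, suffixRules);   // false: "cat" lacks flag G
```

Memory use here stays proportional to the size of the .dic and .aff files, at the cost of a few extra lookups per checked word.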
There is a good spell checker component for Node.js, which is actually a set of bindings to the native spell checkers on macOS (NSSpellChecker), Linux (Hunspell) and Windows (the Spell Check API on Windows 8+, Hunspell on earlier versions):
https://github.com/atom/node-spellchecker
Alas, it is a native module.
I have built a spell checker extension using this module. I would rather not publish it, because doing so would be quite pointless:
- The extension will (silently) stop working every time Electron or Node gets a version bump, and I cannot guarantee I will always be around to rebuild the binary dependencies quickly;
- Rebuilding binary dependencies is quite a hassle;
- I am unable to reasonably maintain binary dependencies for all three platforms (macOS, Linux & Windows) - there already is an extension which uses this module, but it provides binary dependencies for macOS only;
- Even if I produced the node-spellchecker module using node-pre-gyp with binaries for various platforms, an extension, if I understand things correctly, cannot (easily?) install dependent modules using npm (which could also require a proper C++ toolchain to be present in case the binaries packaged by node-pre-gyp are not sufficient). Binaries are packaged with the extension and simply get downloaded along with it.
So I would like you to consider doing something about it.
There are a few paths I can imagine; two of them are the most obvious:
- Build the node-spellchecker module along with VSCode and make it available among the "standard" modules that extension developers can count on (this could result in more than one spell checker extension, e.g. for checking plain text or LaTeX documents, comments in code, etc.);
- Provide a way to use native modules among extensions' dependencies.
I am most probably not the right person to weigh the pros and cons of these alternatives, and there may be other alternatives that I cannot see, but I think the evidence provided makes it clear that, unless something changes, the answer to the question in the title is MOST PROBABLY NOT!