Offline spell checker for VSCode #20266

@bartosz-antosik

Hello (first time contributing here)

There are a few offline spell checkers among VSCode extensions, but they are based on seriously faulty JavaScript implementations of the Hunspell spell checker.

Hunspell is nowadays probably the most widespread standard for the spell-checking layer. It is used on MacOS, on Linux and in some software (e.g. LibreOffice) on Windows. It is also used by both Atom and Sublime Text. There is an enormous collection of polished dictionaries for Hunspell.

There exist some JavaScript implementations that use Hunspell's name but in fact do not implement its critical functionality: the lexical parser. I have verified these three:

hunspell-spellchecker
Typo.js
nspell

All three work more or less on the simple idea of loading the dictionary into memory (into an associative table, a.k.a. dictionary; an object, to be precise). They use Hunspell's affixes (.aff file) to create ALL variants of the words found in the dictionary (.dic file) and then store them in memory. When checking spelling, the dictionary is simply asked whether the word exists or not. Simple, but it has these implications:

  1. Loading takes a lot of time;
  2. It takes a lot of memory too;
  3. Memory consumption causes them to crash on dictionaries with a more elaborate affix system (two of the three mentioned; the third does not process all of the affixes).
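
To make the idea concrete, here is a minimal sketch of the "expand everything" strategy described above. It is not code from any of the three libraries: the `SuffixRule` and `DicEntry` shapes, the flag letters and the toy word list are all hypothetical simplifications of what a .aff/.dic pair contains.

```typescript
// Toy model of the "expand everything" strategy. The rule format below is a
// simplified, hypothetical stand-in for Hunspell's .aff suffix rules.
interface SuffixRule {
  flag: string;  // flag letter that enables the rule on a stem
  strip: string; // characters removed from the end of the stem
  add: string;   // characters appended after stripping
}

interface DicEntry {
  stem: string;
  flags: string; // e.g. "SG" means rules S and G apply to this stem
}

// Build every surface form up front and keep all of them in a plain object.
function expandAll(entries: DicEntry[], rules: SuffixRule[]): { [word: string]: true } {
  const forms: { [word: string]: true } = {};
  for (const entry of entries) {
    forms[entry.stem] = true;
    for (const rule of rules) {
      if (entry.flags.indexOf(rule.flag) === -1) continue;
      const tail = entry.stem.slice(entry.stem.length - rule.strip.length);
      if (tail !== rule.strip) continue; // rule does not fit this stem
      const base = entry.stem.slice(0, entry.stem.length - rule.strip.length);
      forms[base + rule.add] = true;
    }
  }
  return forms;
}

const rules: SuffixRule[] = [
  { flag: "S", strip: "", add: "s" },
  { flag: "G", strip: "e", add: "ing" },
  { flag: "D", strip: "e", add: "ed" },
];
const entries: DicEntry[] = [
  { stem: "parse", flags: "SGD" },
  { stem: "word", flags: "S" },
];

const allForms = expandAll(entries, rules);

// Checking is now a single table lookup, but the table holds every variant.
const isKnown = (word: string): boolean => allForms[word] === true;
```

Two stems already produce six stored forms here. With hundreds of thousands of stems and a rich affix system, the table grows by the same multiplication, which is where the load time and memory go.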

For example, when running hunspell-spellchecker (there is a SpellChecker extension based on it) with the English dictionary ("en_US", 62K+ words), memory consumption peaks at 500 MB and stays constantly above 250 MB. It crashes with the Polish dictionary ("pl_PL", 300K+ words) after consuming about 1.5 GB of memory (there are reports of other dictionaries doing the same), with a "JavaScript heap out of memory" message hidden well under the hood. Hunspell has a lexical parser that lets it use these two sets (dictionary and affixes) "on the fly", without merging them and thereby exploding memory consumption and load time.
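
For contrast, here is a rough sketch of what an "on the fly" check could look like: only stems and rules are kept, and a word is tested by undoing a suffix rule and looking up the resulting stem. This is an illustration of the general idea only, not Hunspell's actual algorithm (which also handles prefixes, rule conditions, compounding and more). The rule shape and the toy data are hypothetical.

```typescript
// Hypothetical, simplified suffix rule (not the real .aff syntax).
interface SuffixRule {
  flag: string;  // flag letter that enables the rule on a stem
  strip: string; // characters the rule removes from the stem
  add: string;   // characters the rule appends
}

// Only stems and their flags are stored; no derived forms are materialized.
const stems: { [stem: string]: string } = { parse: "SGD", word: "S" };
const rules: SuffixRule[] = [
  { flag: "S", strip: "", add: "s" },
  { flag: "G", strip: "e", add: "ing" },
  { flag: "D", strip: "e", add: "ed" },
];

// Own-property lookup, so names like "constructor" are not treated as stems.
function flagsOf(stem: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(stems, stem) ? stems[stem] : undefined;
}

function endsWith(text: string, suffix: string): boolean {
  return suffix.length <= text.length &&
    text.slice(text.length - suffix.length) === suffix;
}

// Accept a word if it is a stem, or if undoing some suffix rule yields a
// stem that carries that rule's flag.
function checkOnTheFly(word: string): boolean {
  if (flagsOf(word) !== undefined) return true;
  for (const rule of rules) {
    if (!endsWith(word, rule.add)) continue;
    const candidate = word.slice(0, word.length - rule.add.length) + rule.strip;
    const flags = flagsOf(candidate);
    if (flags !== undefined && flags.indexOf(rule.flag) !== -1) return true;
  }
  return false;
}
```

Memory here scales with the number of stems plus the number of rules, rather than with their product. The cost moves to a small amount of work per checked word.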

There is a good spell checker component for node.js, which is actually a set of bindings to the native spell checkers on MacOS (NSSpellChecker), Linux (Hunspell) and Windows (Spell Check API on Windows 8+, Hunspell on earlier versions):

https://github.com/atom/node-spellchecker

Alas, it is a native module.
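
As a rough illustration of how an extension could sit on top of such a module, the sketch below scans a text for misspelled words through a small backend interface. The two method names, `isMisspelled` and `getCorrectionsForMisspelling`, mirror what I understand node-spellchecker to expose, but treat the exact API as an assumption and check the module's README. `SpellBackend`, `Finding`, `findMisspellings` and the fake backend are hypothetical names introduced here so the example runs without the native module.

```typescript
// Assumed shape of the backend; the real native module may differ.
interface SpellBackend {
  isMisspelled(word: string): boolean;
  getCorrectionsForMisspelling(word: string): string[];
}

interface Finding {
  word: string;
  index: number;         // offset of the word within the text
  suggestions: string[];
}

// Walk the text word by word and collect everything the backend rejects.
// The pattern only covers ASCII letters; a real checker would need a
// Unicode-aware tokenizer (e.g. for Polish).
function findMisspellings(text: string, backend: SpellBackend): Finding[] {
  const findings: Finding[] = [];
  const wordPattern = /[A-Za-z']+/g;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    const word = match[0];
    if (backend.isMisspelled(word)) {
      findings.push({
        word: word,
        index: match.index,
        suggestions: backend.getCorrectionsForMisspelling(word),
      });
    }
  }
  return findings;
}

// Stand-in backend that knows a single typo, for demonstration only.
const fakeBackend: SpellBackend = {
  isMisspelled: (word: string): boolean => word.toLowerCase() === "teh",
  getCorrectionsForMisspelling: (word: string): string[] =>
    word.toLowerCase() === "teh" ? ["the"] : [],
};

const findings = findMisspellings("Fix teh typo", fakeBackend);
```

Keeping the backend behind an interface like this would also let an extension swap between a native checker and a pure JavaScript fallback without touching the scanning code.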

I have built a spell checker using this module. I would rather not publish it, because it is quite pointless:

  • The extension will (silently) stop working every time Electron or Node gets a version bump, and I cannot guarantee I will always be around to rebuild the binary dependencies quickly;
  • Rebuilding binary dependencies is quite a hassle;
  • I am unable to reasonably maintain binary dependencies for all three platforms (MacOS, Linux & Windows). There already is an extension which uses this module, but it provides binary dependencies for MacOS only;
  • Even if I produced the node-spellchecker module using node-pre-gyp with binaries for various platforms, if I understand things correctly an extension cannot (easily?) install dependent modules using npm (which could also require having a proper C++ toolchain around in case the node-pre-gyp packaged binaries are not sufficient). Binaries are packaged and simply get downloaded along with the extension.

So I would like you to consider doing something about it.

There are a few paths I can imagine; two of them are the most obvious:

  1. Build the node-spellchecker module along with VSCode and make it available among the "standard" modules that extension developers can count on (this could result in more than one spell checker extension, e.g. for spelling plain text or LaTeX documents, comments in code etc.);
  2. Provide a way to use native modules among extensions' dependencies.

I am most probably not the one to discuss the pros and cons of these alternatives, and there may be other alternatives that I cannot see, but I think that with the evidence provided it is clear that, unless something changes, the answer to the question in the title is MOST PROBABLY NOT!

Labels: feature-request (Request for new features or functionality), languages-basic (Basic language support issues)
