What steps will reproduce the problem?
1. Read in a large file of varied UTF characters
2. Run guessLanguage on it
3. It takes forever
What is the expected output? What do you see instead?
What version of the product are you using? On what operating system?
Please provide any additional information below.
The library is designed to deal with small chunks of data, which is fine.
However, if you feed it a large amount of data, it slows to a crawl.
This appears to be caused by the nonAlphaRe.sub() call in normalize(): the
compiled regex is thousands of characters long and is matched against every
character of the input.
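The effect can be sketched with a micro-benchmark. The pattern below is a small stand-in for the library's multi-thousand-character nonAlphaRe (the real pattern is not reproduced here), so absolute timings will differ, but it shows the two approaches being compared:

```python
# Micro-benchmark sketch: regex substitution vs. per-character isalpha().
# nonAlphaRe here is a placeholder pattern, NOT the library's real one.
import re
import timeit

# Placeholder: matches anything outside a few letter ranges.
nonAlphaRe = re.compile('[^a-zA-Z\u00c0-\u024f\u0370-\u03ff\u0400-\u04ff]')

text = "Hello, world! 123 " * 5000

def with_regex():
    # Replace every non-letter with a space via the compiled regex.
    return nonAlphaRe.sub(' ', text)

def with_isalpha():
    # Same replacement using str.isalpha() on each character.
    return ''.join(c if c.isalpha() else ' ' for c in text)

print('regex:  ', timeit.timeit(with_regex, number=10))
print('isalpha:', timeit.timeit(with_isalpha, number=10))
```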
A substantial speedup (100x or more) can be obtained by replacing the following
call in normalize():

u = nonAlphaRe.sub(' ', u)

with

u = ''.join(c if c.isalpha() else ' ' for c in u)

which I believe has the same effect.
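A minimal sketch of the proposed replacement, checking that alphabetic characters pass through and everything else becomes a space. str.isalpha() covers Unicode letters, which is why it should match the regex's intent; whether it is byte-for-byte identical to the library's nonAlphaRe is the reporter's belief, not verified here:

```python
def normalize_chars(u):
    # Keep letters (including non-ASCII ones), replace all else with a space.
    return ''.join(c if c.isalpha() else ' ' for c in u)

print(normalize_chars("Héllo, wörld! 123"))  # letters kept, rest spaced out
```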
Original issue reported on code.google.com by
ajshan...@gmail.com on 13 Jul 2011 at 1:18