
Extremely slow on large files #7

@GoogleCodeExporter

Description

What steps will reproduce the problem?
1. Read in a large file of varied UTF characters
2. Run guessLanguage on it
3. It takes forever

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.
The library is designed to deal with small chunks of data, which is fine.
However, if you feed it a large amount of data, it slows to a crawl.

This appears to be due to the nonAlphaRe.sub() call in normalize(): the regex
pattern is thousands of characters long and is applied across the entire input.

A substantial speedup (100x or more) can be obtained by replacing the following 
call in normalize():
    u = nonAlphaRe.sub(' ', u)
with
    u = ''.join(c if c.isalpha() else ' ' for c in u)
which I believe has the same effect.
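For illustration, here is a minimal, self-contained sketch comparing the two normalization strategies. The nonAlphaRe below is a toy ASCII-only stand-in, not the library's actual pattern (which enumerates thousands of character ranges, and is precisely what makes it slow), so with this short pattern the regex path will not reproduce the reported slowdown; the sketch mainly demonstrates that the two calls produce the same output on plain-ASCII input:

```python
import re
import timeit

# Toy stand-in for the library's nonAlphaRe: the real pattern enumerates
# thousands of alphabetic ranges. This ASCII-only version is an assumption
# used purely for illustration.
nonAlphaRe = re.compile('[^a-zA-Z]')

def normalize_regex(u):
    # Original approach: substitute every non-letter run target with a space.
    return nonAlphaRe.sub(' ', u)

def normalize_isalpha(u):
    # Proposed approach: per-character str.isalpha() test, no regex engine.
    return ''.join(c if c.isalpha() else ' ' for c in u)

if __name__ == '__main__':
    text = 'Hello, world! 1234 ' * 50000
    # Both produce identical output on this ASCII input.
    assert normalize_regex(text) == normalize_isalpha(text)
    for fn in (normalize_regex, normalize_isalpha):
        t = timeit.timeit(lambda: fn(text), number=5)
        print(f'{fn.__name__}: {t:.3f}s')
```

One caveat on equivalence: str.isalpha() is Unicode-aware and accepts any character in a letter category, so on non-ASCII input it may keep characters the library's hand-built pattern would replace; the outputs match exactly only where the two definitions of "alphabetic" agree.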

Original issue reported on code.google.com by ajshan...@gmail.com on 13 Jul 2011 at 1:18
