This program takes two input files: a file to parse and a file containing a set of words to exclude. It reads the
input file line by line, strips out any special or numeric characters, stems each word to its base form using the
Porter Stemmer algorithm, and prints the top n most frequent words found.
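The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in word_parser: the names `top_words` and `stem` are hypothetical, the `stem` placeholder returns the word unchanged where the real program applies the Porter Stemmer, and lowercasing plus filtering excluded words before stemming are assumptions.

```python
import re
from collections import Counter

# Anything that is not an ASCII letter is treated as non-alphabetical.
NON_ALPHA = re.compile(r"[^A-Za-z]")


def stem(word):
    # Placeholder (assumption): the real program applies the Porter Stemmer here.
    return word


def top_words(lines, stop_words, n=20, stem=stem):
    """Return the n most frequent stemmed words as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Delete non-letters from each token and lowercase it (assumption).
            word = NON_ALPHA.sub("", token).lower()
            # Skip tokens that became empty (e.g. "3") and excluded words.
            if not word or word in stop_words:
                continue
            counts[stem(word)] += 1
    return counts.most_common(n)


sample = [
    "The Queen said 'Off with her head!' 3 times.",
    "The queen looked at Alice.",
]
print(top_words(sample, {"the", "with", "her", "at"}, n=1))  # [('queen', 2)]
```

Passing the stemmer in as a parameter keeps the counting logic testable without the stemmer itself.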
- The file with the set of words to exclude is expected to have one word per line.
- Python >=3.6. This was developed and tested on 3.8.5, but should work on 3.6 and later. A CI job is set up to test all three versions.
python -m venv .venv
source .venv/bin/activate
pip install dist/word_parser-0.1.0-py3-none-any.whl
Usage: parse_file [OPTIONS]

Options:
  -i, --input-file TEXT            The full path to the source file to parse [required]
  -r, --rules-file TEXT            The full path to the file with words to exclude [required]
  -n, --number-of-results INTEGER  The number of words to return. Defaults to 20
  -c, --use-collections-counter    Use the Counter class from the collections module instead of custom method
  --help                           Show this message and exit.
# From the root of the project
parse_file -i $(pwd)/Text1.txt -r $(pwd)/stopwords.txt
- Word frequency is based on the root (stemmed) form of each word
- "Non-alphabetical" means anything not covered by the regex pattern '[A-Za-z]'
- "Apply stemmer algorithm using Porter Stemmer" means using the provided Porter Stemmer code in the language of choice
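To make the non-alphabetical assumption concrete, here is a small sketch of what the negated pattern `[^A-Za-z]` removes. The helper name `strip_non_alpha` is hypothetical, and deleting the matched characters (rather than replacing them with a space) is an assumption about how the program handles them.

```python
import re

# Negation of the '[A-Za-z]' pattern: matches every non-ASCII-letter character.
NON_ALPHA = re.compile(r"[^A-Za-z]")


def strip_non_alpha(text):
    # Delete matched characters outright (assumption; a space would also be plausible).
    return NON_ALPHA.sub("", text)


print(strip_non_alpha("isn't"))      # isnt
print(strip_non_alpha("2nd-rate"))   # ndrate
print(strip_non_alpha("caf\u00e9"))  # caf (accented letters fall outside A-Za-z)
```

Note that under this definition accented and other non-ASCII letters are dropped along with digits and punctuation.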
To verify that my implementation of finding the most-used stemmed words is comparable to the method available in the
collections module, I timed both runs with the bash time command.
# Using my implementation
time parse_file -i $(pwd)/Text2.txt -r $(pwd)/stopwords.txt
1.62s user 0.18s system 99% cpu 1.809 total
# Using Counter.most_common
time parse_file -i $(pwd)/Text2.txt -r $(pwd)/stopwords.txt -c
1.68s user 0.16s system 99% cpu 1.843 total
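The timings above cover the whole process, including interpreter start-up and file reading. A narrower comparison of just the counting step could look like the sketch below. `top_n_custom` is a hypothetical stand-in, not the custom method in word_parser, and the word list is made-up sample data.

```python
import timeit
from collections import Counter


def top_n_custom(words, n):
    # Hand-rolled counting with a plain dict (hypothetical stand-in).
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # Highest count first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def top_n_counter(words, n):
    # Standard-library equivalent.
    return Counter(words).most_common(n)


words = ["alic", "said", "queen", "said", "alic", "said"] * 10_000

# Both approaches should agree before their timings are compared.
assert top_n_custom(words, 2) == top_n_counter(words, 2)

for fn in (top_n_custom, top_n_counter):
    seconds = timeit.timeit(lambda: fn(words, 2), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s")
```

The two functions can order tied counts differently, so the equality check is only meaningful on data without ties among the top n.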
Word | Occurrences |
---|---|
he | 18 |
us | 11 |
govern | 10 |
peopl | 10 |
right | 10 |
for | 9 |
law | 9 |
state | 9 |
power | 8 |
we | 8 |
time | 6 |
among | 5 |
declar | 5 |
establish | 5 |
refus | 5 |
abolish | 4 |
assent | 4 |
coloni | 4 |
form | 4 |
free | 4 |
Word | Occurrences |
---|---|
said | 462 |
alic | 401 |
i | 400 |
it | 208 |
on | 158 |
and | 130 |
littl | 128 |
the | 124 |
you | 110 |
look | 104 |
like | 97 |
know | 90 |
that | 83 |
went | 83 |
go | 77 |
thing | 77 |
queen | 76 |
thought | 76 |
time | 74 |
sai | 70 |
# Install python dependencies
poetry install
# Building the package
poetry build
----------- coverage: platform linux, python 3.8.5-final-0 -----------
Name                    Stmts   Miss  Cover
-------------------------------------------
word_parser/parser.py      73      5    93%