Overview

This program takes two input files: a file to parse and a file containing a set of words to exclude. It reads the input file line by line, strips out any special or numerical characters, stems each word to its base form using the Porter Stemmer algorithm, and prints the top n words found.
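The steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in word_parser/parser.py: the function name `top_words`, the lowercasing step, filtering excluded words before stemming, and the alphabetical tie-break are all assumptions, and the default `stem` argument is an identity placeholder standing in for a real Porter Stemmer.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple

# Any run of characters outside A-Z / a-z is treated as a separator (assumption).
NON_ALPHA = re.compile(r"[^A-Za-z]+")


def top_words(
    lines: Iterable[str],
    excluded: Set[str],
    n: int = 20,
    stem: Callable[[str], str] = lambda word: word,  # placeholder, NOT a Porter stemmer
) -> List[Tuple[str, int]]:
    """Return the n most frequent stems as (stem, count) pairs (hypothetical helper)."""
    counts: Counter = Counter()
    for line in lines:
        # Replace special and numerical characters with spaces, then lowercase.
        cleaned = NON_ALPHA.sub(" ", line).lower()
        for word in cleaned.split():
            if word in excluded:
                continue  # skip words listed in the exclusion file
            counts[stem(word)] += 1
    # Highest count first; ties broken alphabetically (assumption).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    sample = ["The people, the laws!", "Laws and rights in 1776."]
    print(top_words(sample, excluded={"the", "and", "in"}, n=3))
```

Passing a real stemming function as `stem` would make the counts reflect root forms rather than surface words.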

Requirements

The file with the set of words to exclude must contain one word per line.
Python >= 3.6. This was developed and tested on 3.8.5, but should work on 3.6 and later. A CI job is set up to test all three versions.
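As an illustration of the one-word-per-line format, an exclusion file could be loaded into a set along these lines. The function name `load_excluded_words` is hypothetical, and skipping blank lines and lowercasing each entry are assumptions rather than documented behavior of the program.

```python
from pathlib import Path
from typing import Set


def load_excluded_words(path: str) -> Set[str]:
    """Read one word per line into a set (hypothetical helper).

    Blank lines are skipped and entries are lowercased; both are assumptions.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}
```

A set gives constant-time membership checks on average, which matters when every word of a large input file is tested against the exclusion list.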

Development Requirements

Poetry

Installation

python -m venv .venv
source .venv/bin/activate
pip install dist/word_parser-0.1.0-py3-none-any.whl

How to run

Usage: parse_file [OPTIONS]

Options:
  -i, --input-file TEXT           The full path to the source file to parse
                                  [required]

  -r, --rules-file TEXT           The full path to the file with words to
                                  exclude  [required]

  -n, --number-of-results INTEGER
                                  The number of words to return. Defaults to
                                  20

  -c, --use-collections-counter   Use the Counter class from the collections
                                  module instead of custom method

  --help                          Show this message and exit.
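For readers who want to reproduce the same option set in their own script, here is a sketch using argparse from the standard library. It is not how parse_file itself is wired (the project may well use a different CLI library); the function name `build_parser` is hypothetical, and only the flags, their required status, and the default of 20 are taken from the help text above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="parse_file")
    parser.add_argument(
        "-i", "--input-file", required=True,
        help="The full path to the source file to parse",
    )
    parser.add_argument(
        "-r", "--rules-file", required=True,
        help="The full path to the file with words to exclude",
    )
    parser.add_argument(
        "-n", "--number-of-results", type=int, default=20,
        help="The number of words to return. Defaults to 20",
    )
    parser.add_argument(
        "-c", "--use-collections-counter", action="store_true",
        help="Use the Counter class from the collections module instead of custom method",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "Text1.txt", "-r", "stopwords.txt"])
    print(args.input_file, args.rules_file, args.number_of_results)
```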

Example

# From the root of the project
parse_file -i $(pwd)/Text1.txt -r $(pwd)/stopwords.txt

Assumptions

Word frequency is based on the root (stemmed) form of each word.
"Non-alphabetical" means any character not matched by the regex pattern '[A-Za-z]'.
"Apply stemmer algorithm using Porter Stemmer" means using the provided Porter Stemmer code in the language of choice.
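To make the second assumption concrete, the snippet below keeps only the characters matched by '[A-Za-z]'. The helper name `keep_alpha` is hypothetical, and whether the program drops the other characters or replaces them with a separator is not stated here, so dropping them is an assumption of this sketch.

```python
import re

# The pattern quoted in the assumptions: plain ASCII letters only.
ALPHA = re.compile(r"[A-Za-z]")


def keep_alpha(word: str) -> str:
    """Keep only characters matched by [A-Za-z] (hypothetical helper)."""
    return "".join(ALPHA.findall(word))


if __name__ == "__main__":
    for token in ["it's", "1776", "café", "Law,"]:
        print(repr(token), "->", repr(keep_alpha(token)))
```

Note that under this pattern digits, punctuation, and accented letters such as "é" all count as non-alphabetical.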

Comparing Counter most_common method vs my implementation

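To verify that my implementation for finding the most-used stem words performs comparably to the most_common method of the Counter class in the collections module, I timed both with the shell's time command. The total run times below differ by about 0.03 seconds.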

# Using my implementation
time parse_file -i $(pwd)/Text2.txt -r $(pwd)/stopwords.txt
1.62s user 0.18s system 99% cpu 1.809 total

# Using Counter.most_common
time parse_file -i $(pwd)/Text2.txt -r $(pwd)/stopwords.txt -c
1.68s user 0.16s system 99% cpu 1.843 total
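The comparison above contrasts two ways of picking the top n entries from a table of counts. The custom method is not shown in this README, so the sort-based `top_n_custom` below is only a stand-in for it; `top_n_counter` uses the real `Counter.most_common` call. Both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_n_custom(counts: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Stand-in for a hand-written top-n: sort by count, then alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


def top_n_counter(counts: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Top-n using collections.Counter.most_common."""
    return Counter(counts).most_common(n)


if __name__ == "__main__":
    stems = ["said", "alic", "said", "littl", "alic", "said"]
    counts = Counter(stems)
    print(top_n_custom(counts, 2))
    print(top_n_counter(counts, 2))
```

The two approaches can order tied counts differently, since `most_common` does not sort ties alphabetically. Also, timing the whole command includes interpreter startup, file reading, and stemming, so any difference in the top-n step is likely a small share of the totals shown.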

Results

Text1.txt

Word	Occurrences
he	18
us	11
govern	10
peopl	10
right	10
for	9
law	9
state	9
power	8
we	8
time	6
among	5
declar	5
establish	5
refus	5
abolish	4
assent	4
coloni	4
form	4
free	4

Text2.txt

Word	Occurrences
said	462
alic	401
i	400
it	208
on	158
and	130
littl	128
the	124
you	110
look	104
like	97
know	90
that	83
went	83
go	77
thing	77
queen	76
thought	76
time	74
sai	70

Developing

# Install python dependencies
poetry install

# Building the package
poetry build

Test Coverage

----------- coverage: platform linux, python 3.8.5-final-0 -----------
Name                    Stmts   Miss  Cover
-------------------------------------------
word_parser/parser.py      73      5    93%
