Develop a highly performant and resource-friendly program that identifies all unique words across a collection of text files located in a specified directory (`./test_cases`). The identified unique words must then be sorted alphabetically and written to a single output file named `result.txt`.
- Process all text files in the `./test_cases` directory, efficiently collecting the unique words across all files.
- Optimize execution time, leveraging strategies that allow multiple large files to be processed concurrently rather than sequentially.
- Maintain deterministic output, meaning the result should remain consistent across different runs regardless of variations in processing order.
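Below is a minimal sketch of such a pipeline, written in Go purely as an assumption (no solution language is prescribed). Each file is streamed by its own goroutine into a per-file set, the sets are merged under a mutex, and one final sort makes the output independent of scheduling order. Note that keeping the full set in memory does not by itself satisfy the larger-than-RAM constraint described below; a competitive solution would likely spill to disk or use an external sort.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
	"sync"
)

func main() {
	// Error handling elided for brevity.
	files, _ := filepath.Glob("test_cases/*.txt")

	var (
		mu     sync.Mutex
		unique = make(map[string]struct{})
		wg     sync.WaitGroup
	)

	for _, path := range files {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			f, err := os.Open(path)
			if err != nil {
				return
			}
			defer f.Close()

			// Stream the file token by token; the scanner keeps only a
			// small buffer in memory (the unique-word set itself can
			// still grow large, which this sketch does not address).
			sc := bufio.NewScanner(f)
			sc.Split(bufio.ScanWords) // split on Unicode whitespace

			local := make(map[string]struct{})
			for sc.Scan() {
				local[strings.ToLower(sc.Text())] = struct{}{}
			}

			mu.Lock()
			for w := range local {
				unique[w] = struct{}{}
			}
			mu.Unlock()
		}(path)
	}
	wg.Wait()

	// Sorting the merged set makes the output deterministic,
	// no matter how the goroutines were scheduled.
	words := make([]string, 0, len(unique))
	for w := range unique {
		words = append(words, w)
	}
	sort.Strings(words)

	out, _ := os.Create("result.txt")
	defer out.Close()
	bw := bufio.NewWriter(out)
	for _, w := range words {
		fmt.Fprintln(bw, w)
	}
	bw.Flush()
}
```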
- Submissions are made via a Google Form.
- There is a 10-minute wait time before making a new submission.
- Test files are located in the `./test_cases` directory.
- Each file is a `.txt` file containing public-domain text with varying word distributions.
- Individual input files can be very large, potentially exceeding the available physical RAM and even the combined size of RAM and swap space on the processing system.
- It is guaranteed that the total size of all files in `test_cases` will not exceed 32 GiB.
- For this challenge, a word is any contiguous sequence of non-whitespace Unicode characters, normalized for comparison.
- The maximum length of any word can safely be assumed to be 32 bytes.
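To make this definition concrete, here is a small tokenizer sketch (Go again, as an assumption): tokens are maximal runs of non-whitespace Unicode characters, and normalization is plain lowercasing per the output rules below. The 32-byte bound means a small fixed scanner buffer is safe.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "Hello,\tWORLD  hello naïve\nNAÏVE"
	sc := bufio.NewScanner(strings.NewReader(input))
	sc.Buffer(make([]byte, 128), 128) // safe: no word exceeds 32 bytes
	sc.Split(bufio.ScanWords)         // maximal runs of non-whitespace
	for sc.Scan() {
		fmt.Println(strings.ToLower(sc.Text())) // normalize by lowercasing
	}
	// Prints: hello, world hello naïve naïve (one per line). Note that
	// "hello," keeps its comma: the definition splits only on whitespace.
}
```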
The output file `result.txt` should contain all unique words found across the input text files, sorted in alphabetical order, with each word on its own line. Words should be lowercased to ensure case insensitivity.
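For example, hypothetically, if the input files together contained only the words "Cherry", "apple", "Banana", and "APPLE", `result.txt` would read:

```
apple
banana
cherry
```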
Your program will be run on a few edge cases before the official tests.
The following parameters will be used for judging:
- Execution Time – Faster solutions will receive higher scores.
- Memory Efficiency (tie breaker) – Solutions that minimize memory usage will be prioritized.
Judging will be performed on a machine with the following specs:
- RAM: 16 GiB (plus 16 GiB of swap)
- CPU: AMD Ryzen 5 5600H
- GPU: Nvidia RTX 3050Ti
The `file_generator` tool creates a file containing X MB worth of words. It is suggested that you keep the generated file size in the range of 100-200 MB for initial tests. For later runs, test with files of 2-5 GiB; in the final legs, files larger than 16 GiB are recommended.
To run:

```sh
./file_generator test_cases/out.txt 2000
```

This command creates a file of around 2 GB under the `test_cases` directory. (Expect this tool to be buggy and slow.)
Alternatively, you can use `produce_x` to generate files:

```sh
./produce_x 200   # generates a 200 MiB file in the test_cases directory
```
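Assuming `produce_x` keeps the same MiB-sized argument at larger scales (not verified here), the later-stage recommendations translate to:

```sh
./produce_x 5120    # ~5 GiB test file
./produce_x 16384   # ~16 GiB file for the final legs
```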