Skip to content

Conversation

gzm55
Copy link

@gzm55 gzm55 commented Aug 21, 2023

No description provided.

@gzm55 gzm55 changed the title fix generate_compressor_model for finding best encodings fix generate_compressor_model.py for finding best encodings Aug 21, 2023
@perronet
Copy link

perronet commented Sep 8, 2023

Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?

Unrelated: I opened a PR to fix some type errors I had when running the script (#47). Did you have any of these type errors?

@perronet
Copy link

perronet commented Sep 8, 2023

I tried running make on your branch and I'm getting the following error:

python generate_compressor_model.py --split=whitespace --strip=punctuation training_data/dorian_gray.txt training_data/metamorphosis.txt training_data/pride_and_prejudice.txt -o models/words_en.h
finding bigrams ... Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 292, in main
    chunks = list(chunkinator(args.file, args.split, args.strip))
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 250, in chunkinator
    for chunk in chunks:
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 247, in <genexpr>
    chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in)
TypeError: can't concat str to bytes
make: *** [Makefile:33: models/words_en.h] Error 1

And also:

Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 310, in main
    max_chr = ord(max(successors.keys())) + 1
TypeError: ord() expected string of length 1, but int found

@gzm55
Copy link
Author

gzm55 commented Sep 9, 2023

I tried running make on your branch and I'm getting the following error:

python generate_compressor_model.py --split=whitespace --strip=punctuation training_data/dorian_gray.txt training_data/metamorphosis.txt training_data/pride_and_prejudice.txt -o models/words_en.h
finding bigrams ... Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 292, in main
    chunks = list(chunkinator(args.file, args.split, args.strip))
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 250, in chunkinator
    for chunk in chunks:
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 247, in <genexpr>
    chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in)
TypeError: can't concat str to bytes
make: *** [Makefile:33: models/words_en.h] Error 1

And also:

Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 310, in main
    max_chr = ord(max(successors.keys())) + 1
TypeError: ord() expected string of length 1, but int found

@perronet I have fix the previous errors for python3, the testing command is

python3 ./generate_compressor_model.py --optimize-encoding --split whitespace -o >(cat) generate_compressor_model.py

@gzm55
Copy link
Author

gzm55 commented Sep 9, 2023

Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?

fix some logic issue when searching best encodings, such as line 159 last_char = part[0] does not update last_char. And also improve the performance of searching, move the encoding loop to outer, which can filter encodings once.

@perronet
Copy link

perronet commented Sep 9, 2023

https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119

Please change this line with

self.packed = sum(bitlist) // 8

You can see here that the generated bytes_packed is accidentally a float

image

@gzm55
Copy link
Author

gzm55 commented Sep 9, 2023

https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119

Please change this line with

self.packed = sum(bitlist) // 8

You can see here that the generated bytes_packed is accidentally a float

image

this is already done in pr #18 , can we merge this first and this pr will rebase to the latest

@perronet
Copy link

perronet commented Sep 9, 2023

Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.

In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck

@gzm55
Copy link
Author

gzm55 commented Sep 9, 2023

Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.

In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck

the float packed issue is fixed.

@perronet
Copy link

Nice! I'm just using your fork right now since it's unclear if this PR will ever get merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants