fix generate_compressor_model.py for finding best encodings #46

gzm55 · 2023-08-21T09:40:28Z

No description provided.

perronet · 2023-09-08T15:31:36Z

Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?

Unrelated: I opened a PR to fix some type errors I had when running the script (#47). Did you have any of these type errors?

perronet · 2023-09-08T17:03:52Z

I tried running make on your branch and I'm getting the following error:

python generate_compressor_model.py --split=whitespace --strip=punctuation training_data/dorian_gray.txt training_data/metamorphosis.txt training_data/pride_and_prejudice.txt -o models/words_en.h
finding bigrams ... Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 292, in main
    chunks = list(chunkinator(args.file, args.split, args.strip))
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 250, in chunkinator
    for chunk in chunks:
  File "/home/perro/code/test_shoco/shoco/generate_compressor_model.py", line 247, in <genexpr>
    chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in)
TypeError: can't concat str to bytes
make: *** [Makefile:33: models/words_en.h] Error 1

And also:

Traceback (most recent call last):
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 396, in <module>
    main()
  File "/home/perro/code/test_shoco/shoco/./generate_compressor_model.py", line 310, in main
    max_chr = ord(max(successors.keys())) + 1
TypeError: ord() expected string of length 1, but int found
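
Both tracebacks look like the usual Python 2 to Python 3 `bytes`/`str` mismatches. The following is a minimal sketch of the two patterns, not the script's actual code: the value of `WHITESPACE` and the helper names `split_chunks` and `max_chr_from_keys` are assumptions made for illustration.

```python
import re

# Assumption: the real script defines its own WHITESPACE constant;
# this value is only illustrative. The point is that it is bytes.
WHITESPACE = b" \t\n\r"


def split_chunks(data):
    """Split a bytes buffer on any single whitespace byte."""
    # Every operand is a bytes literal, so the concatenation no longer
    # mixes str and bytes (the cause of "can't concat str to bytes").
    pattern = b"[" + WHITESPACE + b"]"
    return re.split(pattern, data)


def max_chr_from_keys(successors):
    """Return one past the largest key of a dict keyed by byte values."""
    # In Python 3, iterating over bytes yields ints, so the keys are
    # already integers and ord() must not be applied to them.
    return max(successors.keys()) + 1
```

With bytes input, `split_chunks(b"to be\tor\nnot")` yields four bytes chunks, and a dictionary keyed by byte values such as `{97: 1, 122: 2}` gives `123`.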

gzm55 · 2023-09-09T02:31:43Z

@perronet I have fixed the previous errors for Python 3. The testing command is:

python3 ./generate_compressor_model.py --optimize-encoding --split whitespace -o >(cat) generate_compressor_model.py

gzm55 · 2023-09-09T02:35:42Z

> Can you describe what the fix is for exactly? Was there something fundamentally broken with the model generation or is this PR an improvement?

It fixes some logic issues in the search for the best encodings. For example, at line 159, `last_char = part[0]` does not update `last_char`. It also improves search performance by moving the encoding loop to the outer level, so the encodings can be filtered once.

perronet · 2023-09-09T10:05:34Z

https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119

Please change this line to

self.packed = sum(bitlist) // 8

You can see here that the generated bytes_packed is accidentally a float
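
For context, in Python 3 the `/` operator always returns a float, while `//` performs floor division and keeps integers as integers. A minimal sketch of the difference follows; the `bitlist` values are hypothetical and not taken from the script.

```python
# Hypothetical bit widths for one encoding; they sum to 16 bits.
bitlist = [2, 4, 2, 8]

# True division returns a float in Python 3, even when the result is whole.
packed_float = sum(bitlist) / 8

# Floor division keeps the result an int, which is what a byte count needs.
packed_int = sum(bitlist) // 8

# Formatting shows why the float leaks into generated output.
as_float_text = "{}".format(packed_float)
as_int_text = "{}".format(packed_int)
```

Here `as_float_text` is `"2.0"` while `as_int_text` is `"2"`, which is why a float can end up in generated text that expects an integer byte count.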

gzm55 · 2023-09-09T10:45:20Z

> https://github.com/Ed-von-Schleck/shoco/pull/46/files#diff-9de89837200c6ecb97eea8c58103f21c6e5dc7f221d3c4d2fcedfd5209fbd182R119
>
> Please change this line with
> self.packed = sum(bitlist) // 8
> You can see here that the generated bytes_packed is accidentally a float

This is already done in PR #18. Can we merge that one first? Then this PR will rebase onto the latest.

perronet · 2023-09-09T11:08:28Z

Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.

In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck

gzm55 · 2023-09-09T15:37:23Z

> Good point, however I suggest adding this fix regardless since that PR hasn't been merged since 2015.
>
> In the meantime, let's see if we can get the attention of the maintainer @Ed-von-Schleck

The float `packed` issue is fixed.

perronet · 2023-09-21T18:30:59Z

Nice! I'm just using your fork right now since it's unclear if this PR will ever get merged

fix generate_compressor_model for finding best encodings

e64c426

gzm55 changed the title ~~fix generate_compressor_model for finding best encodings~~ fix generate_compressor_model.py for finding best encodings Aug 21, 2023

make sure the left length of chunk is enough for the tested encoding

a6e4b4e

fix errors on python3

3a74df7

perronet mentioned this pull request Sep 9, 2023

Fix type errors in generate_compressor_model.py #47

Closed

fix "packed", use int div

bbd651a
