Skip to content

Conversation

Prakhar-Shankar
Copy link
Contributor

@Prakhar-Shankar Prakhar-Shankar commented Apr 14, 2025

Fixes #193

@benoit74
Copy link
Collaborator

Please:

  • fix your first comment so that PR is properly linked to issue
  • explain how you checked the fix is not introducing a regression

@Prakhar-Shankar
Copy link
Contributor Author

What I have done is -

  • I have changed the variable name from s to sibling.
  • I have removed the useless expression.

I have also used ruff to check if the export.py is working fine and this is the result -

ruff check src/gutenberg2zim/export.py
warning: The top-level linter settings are deprecated in favour of their counterparts in the `lint` section. Please update the following options in `pyproject.toml`:
  - 'ignore' -> 'lint.ignore'
  - 'select' -> 'lint.select'
  - 'unfixable' -> 'lint.unfixable'
  - 'flake8-tidy-imports' -> 'lint.flake8-tidy-imports'
  - 'isort' -> 'lint.isort'
  - 'per-file-ignores' -> 'lint.per-file-ignores'
All checks passed!

Hence I think there is no issue with the code.

@benoit74
Copy link
Collaborator

This does not validate at all that the code is doing the same thing, I'm not sure at all what you considered a useless expression is really useless. And I'm not sure at all renaming the variable comes without side-effects.

The only thing you've really validated is the ruff part, i.e. ruff issues are gone. But you could get the same results by deleting the whole codebase, there will be no more code and no more issue.

You need to find a test case (manual is fine) where the whole code block is executed, and validate that resulting content is identical before and after your modification.

@kelson42 kelson42 requested a review from benoit74 April 16, 2025 08:54
@Prakhar-Shankar
Copy link
Contributor Author

I have written this manual test with the exact same code blocks -

import os
import tempfile
from bs4 import BeautifulSoup
import shutil
import difflib

def create_sample_epub():
    
    test_dir = tempfile.mkdtemp()
    
   
    content = """<?xml version="1.0" encoding="utf-8"?>
<ncx>
    <navPoint id="chapter1">
        <navLabel><text>Chapter 1</text></navLabel>
        <content src="chapter1.xhtml"/>
    </navPoint>
    <navPoint id="navpoint-2">
        <navLabel><text>*** START: FULL LICENSE ***</text></navLabel>
        <content src="license.xhtml"/>
    </navPoint>
    <navPoint id="extra">
        <navLabel><text>Extra Info</text></navLabel>
        <content src="extra.xhtml"/>
    </navPoint>
</ncx>"""
    
  
    with open(os.path.join(test_dir, "toc.ncx"), "w", encoding="utf-8") as f:
        f.write(content)
    
    return test_dir

def process_with_old_code(epub_dir):
    original_toc = os.path.join(epub_dir, "toc.ncx.old")
    shutil.copy2(os.path.join(epub_dir, "toc.ncx"), original_toc)
    with open(original_toc, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")
    pattern = "*** START: FULL LICENSE ***"
    for tag in soup.findAll("text"):
        if pattern in tag.text:
            s = tag.parent.parent
            # Original code
            s.decompose()
            for s in s.next_siblings:  # noqa: B020
                s.decompose()
            s.next_sibling # noqa: B018
    

    output_file = os.path.join(epub_dir, "result_old.xml")
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(str(soup))
    
    return output_file

def process_with_new_code(epub_dir):
    new_toc = os.path.join(epub_dir, "toc.ncx.new")
    shutil.copy2(os.path.join(epub_dir, "toc.ncx"), new_toc)
    with open(new_toc, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")
    
   
    pattern = "*** START: FULL LICENSE ***"
    for tag in soup.findAll("text"):
        if pattern in tag.text:
            s = tag.parent.parent
            # New code
            s.decompose()
            for sibling in s.next_siblings:
                sibling.decompose()
    
   
    output_file = os.path.join(epub_dir, "result_new.xml")
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(str(soup))
    
    return output_file

def compare_results(file1, file2):
   
    with open(file1, "r", encoding="utf-8") as f1, open(file2, "r", encoding="utf-8") as f2:
        content1 = f1.read()
        content2 = f2.read()
    
    if content1 == content2:
        print("test passed")
        return True
    else:
        print("test failed")
        print("\nDifferences found:")
        diff = difflib.unified_diff(
            content1.splitlines(),
            content2.splitlines(),
            fromfile="Old version",
            tofile="New version"
        )
        for line in diff:
            print(line)
        return False

def main():
   
    test_dir = create_sample_epub()
    
    try:
        print("\nrunnong old code")
        old_result = process_with_old_code(test_dir)
        
        print("running new code")
        new_result = process_with_new_code(test_dir)
        
        print("\nresult comparison")
        compare_results(old_result, new_result)
        
    finally:
    
        shutil.rmtree(test_dir)

if __name__ == "__main__":
    main()

On running the test I am gettin this -
image

@benoit74
Copy link
Collaborator

Thank you for this test file.

I'm concerned by the fact that the following code never gets executed in your test case:

for sibling in s.next_siblings:
    sibling.decompose()

It is not executed both in previous and new code, so it either indicate that your test case is incomplete (there are other situations you did not covered) or that this code is simply useless (and should be removed).

Can you please have a look into this and advise? I feel like either there is something we still did not properly understood or there is code to clean up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix weird noqa that has been set in optimize_epub in export.py
2 participants