umbrella ticket to resolve iteration / read size / chunked encoding questions #844

Closed
@slingamn

Description

This ticket is intended to aggregate previous discussion from #539, #589, and #597 about the default value of chunk_size used by iter_content and iter_lines.

cc @mponton @gwrtheyrn @shazow

Issues:

  1. The default read size of iter_content is 1 byte, i.e., one read per byte of body; this is almost certainly inefficient (a usage sketch follows this list).
  2. Requests does not expose the ability to read chunked encoding streams in the "correct" way, i.e., using the provided octet counts to tell how much to read.
  3. However, this would not be suitable as the default implementation of iter_content anyway; not all websites are standards-compliant, and when this was tried it caused more problems than it solved.
  4. The current default read size for iter_lines is 10 kB. This is high enough that iteration over lines can feel unresponsive: no lines are returned until the full 10 kB has been read.
  5. There is no "correct" way to implement iter_lines over blocking I/O; we just have to bite the bullet and guess how much data to read at a time.
  6. There's apparently some nondeterminism in iter_lines, I think because of the edge case where a read ends between a \r and a \n (a short demonstration follows this list).
  7. iter_lines is backed by iter_content, which operates on raw byte strings and splits at byte boundaries. I think there may be edge cases where we could split the body in the middle of a multi-byte encoding of a Unicode character.
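
As a concrete reference for issue 1, here is a minimal sketch of how a caller works around the 1-byte default today by passing chunk_size explicitly. The URL, and the stream=True flag used to defer the download, are illustrative assumptions rather than part of the original report:

```python
import requests

# Illustrative only: hypothetical URL; stream=True (assumed here) defers the
# body download so that iter_content controls how much is read per iteration.
response = requests.get("http://example.com/big-file", stream=True)

with open("big-file", "wb") as handle:
    # With the current 1-byte default this loop would issue one read per byte
    # of body; an explicit chunk_size amortizes that cost over 1024-byte reads.
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # skip keep-alive chunks that decode to nothing
            handle.write(chunk)
```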
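
For issue 6, here is a standalone demonstration of the mechanism (not the actual iter_lines code): splitting the same body at a chunk boundary that happens to fall between the \r and the \n produces a spurious empty line that a different boundary would not.

```python
body = b"first\r\nsecond\r\n"

# Split as a single chunk: two lines, as expected.
print(body.splitlines())                            # [b'first', b'second']

# The same bytes delivered in two chunks whose boundary falls inside the
# \r\n pair: splitting each chunk independently treats the leading \n of the
# second chunk as a fresh line break and yields an extra empty line.
chunk_a, chunk_b = b"first\r", b"\nsecond\r\n"
print(chunk_a.splitlines() + chunk_b.splitlines())  # [b'first', b'', b'second']
```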

My guess at a solution:

  1. Set the default chunk_size to 1024 bytes, for both iter_content and iter_lines.
  2. Provide a separate interface (possibly iter_chunks) for iterating over chunks of pages that are known to implement chunked encoding correctly, e.g., Twitter's firehose APIs (a rough sketch follows this list).
  3. We may need our own implementation of splitlines that is deterministic with respect to our chunking boundaries, i.e., one that remembers whether the last-read character was \r and suppresses a subsequent \n. We may also need to build in Unicode awareness at this level, i.e., decode as much of the body as is valid, then save any leftover incomplete bytes to be prepended to the next chunk. (Sketches of both pieces follow this list.)
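
For point 2, a rough sketch of what an iter_chunks-style reader might do if it were handed the raw, still-chunked byte stream (which urllib3 normally de-chunks before user code ever sees it). The function name and the raw file-like argument are assumptions for illustration only:

```python
def iter_raw_chunks(raw):
    # Hypothetical sketch: walk a Transfer-Encoding: chunked body using the
    # octet counts the server provides instead of guessing a read size.
    # `raw` is assumed to be a blocking file-like object positioned at the
    # start of the still-chunked body.
    while True:
        size_line = raw.readline().strip()
        if not size_line:
            continue                                 # tolerate stray blank lines
        size = int(size_line.split(b";", 1)[0], 16)  # drop chunk extensions
        if size == 0:
            break                                    # zero-size chunk ends the body
        data = raw.read(size)
        raw.readline()                               # consume the CRLF after the data
        yield data
```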
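
For the first half of point 3, a sketch of what a chunking-aware splitlines could look like; the helper name is hypothetical, and the only idea being demonstrated is that remembering a trailing \r makes line splitting deterministic with respect to chunk boundaries:

```python
def iter_lines_deterministic(chunks):
    # Hypothetical helper: split an iterable of byte chunks into lines on
    # \r, \n, or \r\n, yielding the same lines no matter where the chunk
    # boundaries fall.
    pending = b""
    saw_cr = False
    for chunk in chunks:
        if not chunk:
            continue
        if saw_cr and chunk.startswith(b"\n"):
            # The previous chunk ended in \r; this \n is the second half of
            # that \r\n pair and must not open an empty line.
            chunk = chunk[1:]
            if not chunk:
                saw_cr = False
                continue
        saw_cr = chunk.endswith(b"\r")
        lines = (pending + chunk).splitlines()
        if chunk.endswith((b"\r", b"\n")):
            pending = b""            # the last line in this chunk is terminated
        else:
            pending = lines.pop()    # hold the unterminated tail for the next chunk
        for line in lines:
            yield line
    if pending:
        yield pending
```

Fed from something like response.iter_content(chunk_size=1024), this yields identical lines regardless of where the network happens to break the stream.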
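
For the Unicode half of point 3, the standard library's incremental decoders already implement the "decode what is valid, carry the leftover bytes into the next chunk" behaviour, so one option is a thin wrapper along these lines (a sketch, not a proposal for the exact API):

```python
import codecs

def iter_decoded(chunks, encoding="utf-8"):
    # Hypothetical helper: decode an iterable of byte chunks to text while
    # tolerating multi-byte characters split across chunk boundaries; the
    # incremental decoder buffers any trailing partial sequence itself.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# A 3-byte UTF-8 character (U+20AC, the euro sign) split across two chunks
# decodes cleanly instead of producing replacement characters.
euro = "\u20ac".encode("utf-8")                     # b'\xe2\x82\xac'
chunks = [b"price: " + euro[:1], euro[1:] + b"\n"]
print("".join(iter_decoded(chunks)))                # price: (euro sign)
```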

Comments and thoughts are much appreciated. Thanks for your time!
