Comment 4 for bug 2025748

Revision history for this message
Robie Basak (racb) wrote :

> I don't see a way of getting the requests library to handle transfer encoding but not content encoding without using undocumented API. As far as I can tell, the raw attribute would give me neither.

Digging into this much deeper, the handling of transfer encoding doesn't seem to be documented anywhere, but requests uses urllib3 which uses http.client on Python 3, and http.client's automatic handling of transfer encoding is passed all the way through. The requests library's response object's raw attribute is a urllib3 response object, and testing urllib3 against a deliberately chunked HTTP response does result in it decoding correctly. For example, the following code worked as expected:

import urllib3
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost", preload_content=False)
print(resp.read())

In that test I arranged a fake chunked response with the following:

while true; do nc -q1 -l -p 80 < result;done

With the file result having the following contents saved with ff=dos in vim:

--BEGIN--
HTTP/1.1 200
Transfer-Encoding: chunked
Content-Type: text/plain

4
Wiki
7
pedia i
B
n
chunks.
0

--END--

Similarly, with requests directly, the following also worked fine against the same fake chunked response:

import requests
with requests.get('http://localhost', stream=True) as resp:
    resp.raw.read(decode_content=False)

In conclusion I don't need to worry about transfer encoding not being handled correctly when using raw.read() with requests.

Going back to the original issue, it is fixed with a partial revert of 1e20363 to restore the use of raw.read().

The commit message against 1e20363 talks of the changes file ending up compressed on disk, and indeed I re-introduced this behaviour after my revert.

To fix that without regressing the fetch of a diff.gz by "accidentally" decompressing it, I think the correct fix would be to set "Accept-Encoding: identity". Indeed, on testing with wget (using -d), I see that wget sets this by default, so I'm confident it's correct. Setting this using headers={'accept-encoding': 'identity'} fixes that issue, so that for example "pull-pkg -D ubuntu --upload-queue --pull source r8125 jammy" downloads the changes file and writes it uncompressed again (r8125 is currently in the unapproved queue but won't work later as an example).

I think the final fix should therefore look something like:

--- a/ubuntutools/misc.py 2023-05-30 16:47:58.000000000 +0000
+++ b/ubuntutools/misc.py 2023-07-04 18:41:54.919108821 +0000
@@ -348,7 +348,7 @@
     with tempfile.TemporaryDirectory() as tmpdir:
         tmpdst = Path(tmpdir) / "dst"
         try:
- with requests.get(src, stream=True, timeout=60, auth=auth) as fsrc:
+ with requests.get(src, stream=True, timeout=60, auth=auth, headers={'accept-encoding': 'identity'}) as fsrc:
                 with tmpdst.open("wb") as fdst:
                     fsrc.raise_for_status()
                     _download(fsrc, fdst, size, blocksize=blocksize)
@@ -433,7 +433,10 @@

     downloaded = 0
     try:
- for block in fsrc.iter_content(blocksize):
+ while True:
+ block = fsrc.raw.read(blocksize)
+ if not block:
+ break
             fdst.write(block)
             downloaded += len(block)
             progress_bar.update(downloaded, size)