Pip finds hash mismatch for package during installation

Bug #1708707 reported by Clark Boylan on 2017-08-04
Affects: OpenStack-Gate
Importance: Undecided
Assigned to: Unassigned

Bug Description

Occasionally a pip install will error because the hash reported by PyPI does not match the hash calculated locally after download. This bug is for tracking purposes, so that we can follow these problems more easily and hopefully eventually track them down.
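For context, the check pip performs here is simple: it hashes the downloaded bytes and compares the digest against the one advertised in the index URL's fragment (e.g. `#md5=...`). A minimal sketch of the same comparison (the payload and function name are illustrative, not pip's actual code):

```python
import hashlib

def md5_matches(data: bytes, expected_md5: str) -> bool:
    """Return True when the MD5 of the downloaded bytes matches the
    digest advertised by the index (the #md5=... URL fragment)."""
    return hashlib.md5(data).hexdigest() == expected_md5

# Illustrative payload standing in for a downloaded wheel.
payload = b"example wheel contents"
digest = hashlib.md5(payload).hexdigest()

md5_matches(payload, digest)        # full download: digests agree
md5_matches(payload[:-1], digest)   # truncated download: mismatch
```

A truncated transfer therefore surfaces as exactly the "Expected ... / Got ..." error below, with no direct hint that the network was the culprit.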

Ian Wienand (iwienand) wrote :

Here's an example [1]

---
2017-08-09 03:31:00.821 | protobuf===3.3.0 from http://mirror.ord.rax.openstack.org/pypi/packages/d0/60/35ed2bb392a7845ba316a66ea1f6cc2f3acaf95e7905deca17d46430452b/protobuf-3.3.0-cp27-cp27mu-manylinux1_x86_64.whl#md5=fa09b90d290bb63aa075e4a5a7965001 (from -c /opt/stack/new/requirements/upper-constraints.txt (line 330)):
2017-08-09 03:31:00.821 | Expected md5 fa09b90d290bb63aa075e4a5a7965001
2017-08-09 03:31:00.821 | Got c7a6afd624e1c302ad0cd793447be780
---

The download happened before that:

---
2017-08-09 03:30:58.365 | Collecting protobuf===3.3.0 (from -c /opt/stack/new/requirements/upper-constraints.txt (line 330))
2017-08-09 03:30:58.524 | Downloading http://mirror.ord.rax.openstack.org/pypi/packages/d0/60/35ed2bb392a7845ba316a66ea1f6cc2f3acaf95e7905deca17d46430452b/protobuf-3.3.0-cp27-cp27mu-manylinux1_x86_64.whl (5.7MB)
---

Going to the mirror logs, we have the following (I removed everything after the pip UA string, which is very long). This test is 2001:4801:7828:104:be76:4eff:fe10:22e3:

---
2001:4801:7828:104:be76:4eff:fe10:284c - - [09/Aug/2017:03:30:58 +0000] "GET /pypi/simple/protobuf/ HTTP/1.1" 200 5872 "-" "pip/9.0.1 ...
2001:4801:7828:104:be76:4eff:fe10:22e3 - - [09/Aug/2017:03:30:58 +0000] "GET /pypi/simple/protobuf/ HTTP/1.1" 200 5872 "-" "pip/9.0.1 ...
2001:4801:7828:104:be76:4eff:fe10:284c - - [09/Aug/2017:03:30:58 +0000] "GET /wheel/ubuntu-16.04-x86_64/protobuf/ HTTP/1.1" 200 763 "-" "pip/9.0.1 ...
2001:4801:7828:104:be76:4eff:fe10:22e3 - - [09/Aug/2017:03:30:58 +0000] "GET /wheel/ubuntu-16.04-x86_64/protobuf/ HTTP/1.1" 200 763 "-" "pip/9.0.1 ...
2001:4801:7828:104:be76:4eff:fe10:284c - - [09/Aug/2017:03:30:58 +0000] "GET /pypi/packages/d0/60/35ed2bb392a7845ba316a66ea1f6cc2f3acaf95e7905deca17d46430452b/protobuf-3.3.0-cp27-cp27mu-manylinux1_x86_64.whl HTTP/1.1" 200 5724653 "-" "pip/9.0.1 ...
---

Weird ... two addresses at the same time, and only one seems to actually download. Here are all the logs available for that IP:

---
root@mirror:/var/log/apache2# cat mirror.ord.rax.openstack.org_access.log.1 | grep 'fe10:22e3' | grep protobuf
2001:4801:7828:104:be76:4eff:fe10:22e3 - - [09/Aug/2017:03:30:58 +0000] "GET /pypi/simple/protobuf/ HTTP/1.1" 200 5872 "-" "pip/9.0.1 ...
2001:4801:7828:104:be76:4eff:fe10:22e3 - - [09/Aug/2017:03:30:58 +0000] "GET /wheel/ubuntu-16.04-x86_64/protobuf/ HTTP/1.1" 200 763 "-" "pip/9.0.1 ...
2001:4801:7828:101:be76:4eff:fe10:22e3 - - [09/Aug/2017:04:11:34 +0000] "GET /pypi/simple/protobuf/ HTTP/1.1" 200 5872 "-" "pip/9.0.1 ...
2001:4801:7828:101:be76:4eff:fe10:22e3 - - [09/Aug/2017:04:11:35 +0000] "GET /wheel/ubuntu-16.04-x86_64/protobuf/ HTTP/1.1" 200 763 "-" "pip/9.0.1 ...
2001:4801:7828:101:be76:4eff:fe10:22e3 - - [09/Aug/2017:04:11:36 +0000] "GET /pypi/packages/d0/60/35ed2bb392a7845ba316a66ea1f6cc2f3acaf95e7905deca17d46430452b/protobuf-3.3.0-cp27-cp27mu-manylinux1_x86_64.whl HTTP/1.1" 200 5724653 "-" "pip/9.0.1 ...
2001:4801:7828:104:be76:4eff:fe10:22e3 - - [09/Aug/2017:06:28:44 +0000] "GET /pypi/simple/protobuf/ HTTP/1.1" 200 5872 "-" "pip/9.0.1 ...
2001:4801:7828:104:be76:4eff:fe10:22e3 - - [09/Aug/2017:06:28:44 +0000] "GE...


Ian Wienand (iwienand) wrote :

Hmm, so :284c, the other job in the logs that appears to download the whl file correctly, is actually from the peer job [1], which makes sense:

---
2017-08-09 03:16:24.852882 | "ip_address": "2001:4801:7828:104:be76:4eff:fe10:284c",
---

The only conclusion here seems to be that :22e3 did *not* download protobuf-3.3.0-cp27-cp27mu-manylinux1_x86_64.whl from the mirror, but actually got it from somewhere else (despite thinking it got it from the mirror ...)

[1] http://logs.openstack.org/71/491971/1/check/gate-dsvm-nodepool-redhat-src/d8bca38/console.html#_2017-08-09_03_16_24_852882

Ian Wienand (iwienand) wrote :

Looking further: there were segfaults right around the time this was happening, which would explain why the message is missing in the logs and why a transfer would have been killed halfway through:

---
error.log.1:[Wed Aug 09 03:07:41.315536 2017] [core:notice] [pid 25521:tid 139749546280832] AH00051: child pid 28341 exit signal Segmentation fault (11), possible coredump in /etc/apache2
error.log.1:[Wed Aug 09 03:23:39.285214 2017] [core:notice] [pid 25521:tid 139749546280832] AH00051: child pid 28592 exit signal Segmentation fault (11), possible coredump in /etc/apache2
error.log.1:[Wed Aug 09 03:28:30.599820 2017] [core:notice] [pid 25521:tid 139749546280832] AH00051: child pid 28457 exit signal Segmentation fault (11), possible coredump in /etc/apache2
error.log.1:[Wed Aug 09 03:29:47.681434 2017] [core:notice] [pid 25521:tid 139749546280832] AH00051: child pid 29573 exit signal Segmentation fault (11), possible coredump in /etc/apache2
error.log.1:[Wed Aug 09 03:31:00.755337 2017] [core:notice] [pid 25521:tid 139749546280832] AH00051: child pid 29768 exit signal Segmentation fault (11), possible coredump in /etc/apache2
error.log.1:[Wed Aug 09 03:31:54.813903 2017] [core:notice] [pid 25521:tid 139749546280832] AH00051: child pid 29843 exit signal Segmentation fault (11), possible coredump in /etc/apache2
error.log.1:[Wed Aug 09 03:39:29.263365 2017] [core:notice] [pid 25521:tid 139749546280832] AH00051: child pid 27712 exit signal Segmentation fault (11), possible coredump in /etc/apache2
---

we don't see the "scoreboard full" messages, but seems suspiciously like https://review.openstack.org/#/c/426639/

Ian Wienand (iwienand) wrote :

I think my recommendation here is, essentially, let's turn it off and on again. We could spend unlimited time investigating, but since this seems to be pretty isolated to this mirror, I think we should start with the simple approach and take it from there.

Ian Wienand (iwienand) wrote :

I have restarted the rax-ord mirror and will monitor for further issues.

James E. Blair (corvus) wrote :

I tracked the following error:

http://logs.openstack.org/69/690ccfd43fb141c26652e119f1d702b65414a194/post/octavia-releasenotes/238e6ab/console.html#_2017-08-11_17_50_10_439343

---
mirror.ord.rax.openstack.org_access.log:2001:4801:7828:101:be76:4eff:fe10:6e6a - - [11/Aug/2017:12:29:35 +0000] "GET /pypi/packages/1c/17/653639940e2a4f20e42af1738ec3350e972a11d0209ecd65c6034876cf60/blockdiag-1.5.3-py2.py3-none-any.whl HTTP/1.1" 200 2722535 "-" "pip/9.0.1 {\"cpu\":\"x86_64\",\"distro\":{\"id\":\"Xenial Xerus\",\"libc\":{\"lib\":\"glibc\",\"version\":\"2.23\"},\"name\":\"Ubuntu\",\"version\":\"16.04\"},\"implementation\":{\"name\":\"CPython\",\"version\":\"2.7.12\"},\"installer\":{\"name\":\"pip\",\"version\":\"9.0.1\"},\"openssl_version\":\"OpenSSL 1.0.2g 1 Mar 2016\",\"python\":\"2.7.12\",\"system\":{\"name\":\"Linux\",\"release\":\"4.4.0-89-generic\"}}"
---

There was no segfault immediately proximate (though there were some before and after).

That transfer is larger than the file on disk, as expected:

---
-rw-r--r-- 1 10001 root 2722270 Jul 27 2016 /afs/openstack.org/mirror/pypi/web/packages/1c/17/653639940e2a4f20e42af1738ec3350e972a11d0209ecd65c6034876cf60/blockdiag-1.5.3-py2.py3-none-any.whl

 md5sum /afs/openstack.org/mirror/pypi/web/packages/1c/17/653639940e2a4f20e42af1738ec3350e972a11d0209ecd65c6034876cf60/blockdiag-1.5.3-py2.py3-none-any.whl
e4ca7dcb239e22ea0d249c5403e02620 /afs/openstack.org/mirror/pypi/web/packages/1c/17/653639940e2a4f20e42af1738ec3350e972a11d0209ecd65c6034876cf60/blockdiag-1.5.3-py2.py3-none-any.whl
---

This suggests a network or client side error. When looking into this previously, we found that pip can mask network errors and simply report hash mismatches.
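One way to corroborate a truncated transfer from the server side is to compare the byte count Apache logged against the file's size on disk, as done by hand above. A rough sketch of automating that over an access log (the regex, paths, and sizes are assumptions for illustration, not existing infra tooling):

```python
import re

# Request path and response size from a combined-format access log line.
LOG_RE = re.compile(r'"GET (?P<path>\S+) HTTP/1\.[01]" 200 (?P<nbytes>\d+)')

def short_transfers(log_lines, disk_sizes):
    """Yield (path, logged_bytes, disk_bytes) for 200 responses that
    sent fewer bytes than the file occupies on disk."""
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        path, logged = m.group("path"), int(m.group("nbytes"))
        expected = disk_sizes.get(path)
        if expected is not None and logged < expected:
            yield path, logged, expected

# Hypothetical truncated transfer of the protobuf wheel (5724653 bytes on disk).
log = ['"GET /pypi/packages/x/protobuf.whl HTTP/1.1" 200 4000000 "-" "pip/9.0.1"']
list(short_transfers(log, {"/pypi/packages/x/protobuf.whl": 5724653}))
```

One caveat: whether the logged size includes response headers depends on the LogFormat (%b excludes them, %O includes them), which would account for a logged size slightly above the on-disk size, as seen above.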

James E. Blair (corvus) wrote :

Our outbound traffic from that server has increased significantly since July, and Jeremy notes that it appears to be maxing out at 200 Mbps. We may be hitting the bandwidth limit for the flavor and seeing errors related to upstream traffic shaping. Jeremy will replace the server with a larger flavor.

Matt Riedemann (mriedem) wrote :

Appears to be fixed, not seeing any hits in logstash anymore.

Changed in openstack-gate:
status: New → Fix Released