Mirror snapshots are broken when `apt-get update` Astute task is running.

Bug #1533682 reported by Alexandr Kostrikov
This bug affects 3 people
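Affects            | Status       | Importance | Assigned to       | Milestone
-------------------|--------------|------------|-------------------|----------
Fuel for OpenStack | Fix Released | Critical   | Sergii Golovatiuk |
8.0.x              | Fix Released | Critical   | Sergii Golovatiuk |
Mitaka             | Fix Released | Critical   | Sergii Golovatiuk |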

Bug Description

The job https://product-ci.infra.mirantis.net/job/8.0.ubuntu.bvt_2/383/console fails with the following error:

AssertionError: Task 'deploy' has incorrect status. error != ready, 'Deployment has failed. Method granular_deploy. Failed to execute hook 'shell' Failed to run command cd / && apt-get update

---
priority: 2200
type: shell
id:
parameters:
  retries: 3
  cmd: apt-get update
  cwd: /
  timeout: 1800
  interval: 1
uids:
- '1'
- '3'
- '2'
- '5'
- '4'
- '6'
.
Inspect Astute logs for the details'
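
To make the task parameters above concrete, here is a minimal sketch of a retry loop driven by `retries`, `interval`, `timeout` and `cwd`. It is not Astute's code. The function name `run_with_retries`, the injectable `runner` argument, and the reading of `retries: 3` as three attempts in total are all assumptions made for illustration.

```python
import subprocess
import time


def run_with_retries(cmd, cwd="/", retries=3, interval=1, timeout=1800,
                     runner=subprocess.run):
    """Run `cmd` up to `retries` times, sleeping `interval` seconds between tries.

    Returns (exit_code, attempts_used). exit_code is None if the last
    attempt hit the timeout. `runner` is injectable so the loop can be
    exercised without running a real command (hypothetical helper).
    """
    exit_code = None
    for attempt in range(1, retries + 1):
        try:
            result = runner(cmd, shell=True, cwd=cwd, timeout=timeout)
            exit_code = result.returncode
        except subprocess.TimeoutExpired:
            exit_code = None
        if exit_code == 0:
            return 0, attempt
        if attempt < retries:
            # Assumption: a fixed pause between attempts, no backoff.
            time.sleep(interval)
    return exit_code, retries
```

Under these assumptions the pauses add up to only two seconds, so a mirror that stays inconsistent for longer than a few quick attempts would still fail the whole task.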

While debugging, I found that `apt-get update` fails only during short periods of time.
Most runs of apt-get pass, but within those short windows errors like the following appear:
First:
'Fetched 864 kB in 2min 0s (7,162 B/s)
W: GPG error: http://mirror.fuel-infra.org mos8.0-holdback Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: GPG error: http://mirror.fuel-infra.org mos8.0-security Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: GPG error: http://mirror.fuel-infra.org mos8.0-updates Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty-proposed/universe/i18n/Translation-en Hash Sum mismatch

E: Some index files failed to download. They have been ignored, or old ones used instead.'
Exit code: 100

Second:
'
Fetched 820 kB in 1s (651 kB/s)
W: GPG error: http://mirror.fuel-infra.org mos8.0-holdback Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: GPG error: http://mirror.fuel-infra.org mos8.0-security Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: GPG error: http://mirror.fuel-infra.org mos8.0-updates Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty/main/i18n/Translation-en_US Bad header line

W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty/main/i18n/Translation-en Bad header line

W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty/multiverse/i18n/Translation-en_US Bad header line

W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty/multiverse/i18n/Translation-en Bad header line

W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty/universe/i18n/Translation-en_US Bad header line

W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty/universe/i18n/Translation-en Bad header line

W: Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty-proposed/universe/i18n/Translation-en Hash Sum mismatch

E: Some index files failed to download. They have been ignored, or old ones used instead.
'
Exit code: 100
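
The two outputs above mix GPG warnings with fetch failures. A small sketch like the following can group them, so the fetch failures ('Hash Sum mismatch', 'Bad header line') are easy to tell apart from the GPG warnings that appear in both outputs. The function name `summarize_apt_output` and the regular expressions are assumptions based only on the lines quoted above, not on a complete list of apt messages.

```python
import re

# Patterns derived from the warnings quoted above (assumed, not exhaustive).
FETCH_FAILURE = re.compile(r"^W: Failed to fetch (?P<url>\S+)\s+(?P<reason>.+)$")
GPG_WARNING = re.compile(r"^W: GPG error: (?P<url>\S+) (?P<suite>\S+) Release:")


def summarize_apt_output(output):
    """Group `apt-get update` warnings by kind.

    Returns a dict with:
      'gpg'   -> list of suite names that produced a GPG warning
      'fetch' -> dict mapping failure reason to the list of failed URLs
    """
    summary = {"gpg": [], "fetch": {}}
    for line in output.splitlines():
        line = line.strip()
        match = GPG_WARNING.match(line)
        if match:
            summary["gpg"].append(match.group("suite"))
            continue
        match = FETCH_FAILURE.match(line)
        if match:
            reason = match.group("reason").strip()
            summary["fetch"].setdefault(reason, []).append(match.group("url"))
    return summary
```

Running it over the captured output of each failed attempt gives a per-reason list of URLs, which can then be compared against the mirror contents.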

HTTP is the only way I can access the mirrors. As far as the web server shows,
the mirror snapshot did not change during that time:
http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/
Index of /pkgs/ubuntu-2016-01-12-170104/

../
dists/ 08-Mar-2015 19:35 -
indices/ 12-Jan-2016 15:29 -
pool/ 27-Feb-2010 06:30 -
project/ 28-Jun-2013 11:52 -
ubuntu/ 12-Jan-2016 17:17 -

The errors in the job are 'Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty-updates/universe/binary-amd64/Packages 404 Not Found'
But these files do not need to be present at http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty-updates/universe/binary-amd64/ because apt-get normally fetches the .gz or .bz2 files. Only when those fail does it fall back to the
uncompressed version, 'Packages'.
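
The fallback described above can be sketched roughly as follows. This is a simplified illustration, not apt's actual acquire logic: the function name `fetch_index`, the caller-supplied `fetch` callable, and the fixed order of suffixes are assumptions.

```python
def fetch_index(base_url, fetch, suffixes=(".gz", ".bz2", "")):
    """Try each index variant in order; return (url, data) for the first success.

    `fetch` is a caller-supplied callable that returns bytes or raises IOError.
    If every variant fails, raise IOError carrying the error of the LAST
    variant tried, which is the uncompressed file with the default suffixes.
    """
    last_error = None
    for suffix in suffixes:
        url = base_url + suffix
        try:
            return url, fetch(url)
        except IOError as exc:
            # Remember the failure and move on to the next variant.
            last_error = exc
    raise IOError("all variants failed, last error: %s" % last_error)
```

With this ordering, the error that surfaces is the one for the uncompressed 'Packages' file, so a '404 Not Found' on it may only be a symptom of an earlier failure on the compressed files.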

The Build team has the timings and access to the mirror state, so I am reassigning this bug to them.

Problematic logs are at:
http://paste.openstack.org/show/483770/

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Also, while logging is being implemented, I am going to simulate a mirror file update to see whether it reproduces the same 'Bad header line' problem.
Once I have verified that this is the root cause, I will assign the bug to the build team so they can improve how the mirrors behave during updates.

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Logs from Astute are needed because of similar bugs, such as https://bugs.launchpad.net/fuel/+bug/1530161
That BVT problem could be an internet issue (which cannot be fixed in code) or a 504 Gateway Time-out (which is a different problem).

description: updated
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Found the needed information; I will reassign the bug to the build team.

Changed in fuel:
status: New → Invalid
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

The errors are 'Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty-updates/universe/binary-amd64/Packages 404 Not Found'
But these files do not need to be present at http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty-updates/universe/binary-amd64/ because apt-get normally fetches the .gz or .bz2 files. Only when those fail does it fall back to the
uncompressed version, 'Packages'.

The Build team has the timings and access to the mirror state, so I am reassigning this bug to them.

Problematic logs are at:
http://paste.openstack.org/show/483770/

Changed in fuel:
status: Invalid → New
assignee: Fuel Python Team (fuel-python) → Fuel build team (fuel-build)
description: updated
tags: added: area-build
removed: area-python
Revision history for this message
Sergey Kulanov (skulanov) wrote :
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel build team (fuel-build) → Max Rasskazov (mrasskazov)
status: Confirmed → In Progress
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Raised to Critical due to new failures on BVT: https://product-ci.infra.mirantis.net/job/8.0.ubuntu.bvt_2/391/console

Changed in fuel:
importance: High → Critical
tags: added: swarm-blocker
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Renamed the bug so that its title is clear and does not mislead.

summary: - Output information is needed during `apt-get update` Astute task.
+ Mirror snapshots are broken when `apt-get update` Astute task is
+ running.
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

I reverted the environments from the last 2 failures and successfully executed apt-get update on the slave. I also re-ran the deployment through the Fuel CLI (deploy changes), and the deployment succeeded:
[root@nailgun /]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|--------|---------------------------|---------|--------------|-------------------|-------------------|---------------|--------|---------
6 | ready | slave-01_controller | 1 | 10.109.30.8 | 64:25:13:49:37:19 | controller | | True | 1
2 | ready | slave-02_controller | 1 | 10.109.30.10 | 64:54:49:96:11:b2 | controller | | True | 1
5 | ready | slave-03_controller | 1 | 10.109.30.11 | 64:ce:fa:0c:32:82 | controller | | True | 1
3 | ready | slave-06_compute_ceph-osd | 1 | 10.109.30.14 | 64:61:88:d8:6f:ac | ceph-osd, compute | | True | 1
1 | ready | slave-04_compute_ceph-osd | 1 | 10.109.30.12 | 64:3d:13:29:8d:72 | ceph-osd, compute | | True | 1
4 | ready | slave-05_compute_ceph-osd | 1 | 10.109.30.13 | 64:6c:ed:5e:14:0b | ceph-osd, compute | | True | 1
[root@nailgun /]#

[root@nailgun /]# fuel health --env 1 --check ha
[ 1 of 7] [success] 'Check state of haproxy backends on controllers' (1.137 s)
[ 2 of 7] [success] 'Check data replication over mysql' (3.252 s)
[ 3 of 7] [success] 'Check if amount of tables in databases is the same on each node' (2.908 s)
[ 4 of 7] [success] 'Check galera environment state' (1.077 s)
[ 5 of 7] [success] 'Check pacemaker status' (1.403 s)
[ 6 of 7] [success] 'RabbitMQ availability' (16.15 s)
[ 7 of 7] [success] 'RabbitMQ replication' (22.71 s)
[root@nailgun /]#

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Regarding the successful manual apt-get update after reverting: yes. I have been running 'apt-get update' on such environments and got rare, random failures.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The same happens with https://product-ci.infra.mirantis.net/job/8.0.ubuntu.bvt_2/393/console
After reverting the environment, the deployment starts, the task does not fail, and there are no problems with apt-get update.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
assignee: Max Rasskazov (mrasskazov) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

Still reproducible: https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.filling_root/51
The logs indicate timeouts, but I ran "apt-get update" manually:
W: GPG error: http://mirror.fuel-infra.org mos8.0-holdback Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: GPG error: http://mirror.fuel-infra.org mos8.0-security Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08
W: GPG error: http://mirror.fuel-infra.org mos8.0-updates Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY BCE5CC461FA22B08

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/269012

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

RCA:

Nailgun runs 'apt-get update' on all nodes. However, node-6 ends up corrupted because the run times out, which happens because our mirror is created in the wrong way.

We need to cherry-pick https://review.openstack.org/#/c/269012/ because it uses a slightly different mechanism, which does not cause any issues on master.

However, the main issue remains: our mirrors are created in the wrong way.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/269012
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=f0c286624901bb2fc2dbef23fdc236b4a82b231b
Submitter: Jenkins
Branch: stable/8.0

commit f0c286624901bb2fc2dbef23fdc236b4a82b231b
Author: Bartłomiej Piotrowski <email address hidden>
Date: Fri Nov 20 13:47:08 2015 +0100

    Replace upload_core_repos with Puppet-only task

    Currently the code responsible for configuring repositories on the
    managed nodes is kept directly it Nailgun. This commit introduces a new
    task setup_repositories that replaces upload_core_repos.

    Closes-Bug: 1508486
    Closes-Bug: 1533682

    Change-Id: I1b83e3bfaebecdb8455d5697e320f24fb4941536
    (cherry picked from commit 4a6c031e79bbb98b0dc0a51a9963dc170cbdaa34)

Revision history for this message
Pawel Brzozowski (pbrzozowski) wrote :

A few words about what I checked after a chat with Sergii.

1. I would not blame the http://mirror.seed-cz1.fuel-infra.org/pkgs/ mirror for this failure (@sergk). First of all, the mirror (this particular snapshot) has been untouched since this happened. Someone said it was working most of the time, and it works now if I set it in the sources of a trusty instance I have.

2. I checked network usage on both nodes: it was very low at that time, with nothing loading them. It seems really weird that some download speeds were so horribly slow: "Fetched 864 kB in 2min 0s (7,162 B/s)". Too bad we cannot tell what the real capacity between these two hosts was at that time. Currently it is really huge.

3. Sergii asked me to check a few other details about the seed-cz1 host: iptables limits, Nginx configuration, host performance. All seem to be fine, and no limits of any kind were reached.

4. There is another thing which may be relevant to this issue. The mirrors mentioned here:

'mos-updates': 'deb http://mirror.fuel-infra.org/mos-repos/ubuntu/{cluster.release.environment_version}/ mos8.0-updates main restricted', priority:1050
'mos-security': 'deb http://mirror.fuel-infra.org/mos-repos/ubuntu/{cluster.release.environment_version}/ mos8.0-security main restricted', priority:1050
'mos-holdback': 'deb http://mirror.fuel-infra.org/mos-repos/ubuntu/{cluster.release.environment_version}/ mos8.0-holdback main restricted', priority:1100

can, AFAIK, quite often have broken integrity, with big gaps. @Max could probably say more about this.

Could this be related?

Revision history for this message
Pawel Brzozowski (pbrzozowski) wrote :

After a short talk with Max, he convinced me that we should not use any generic link to a repository. Instead we should use one specific snapshot for the mos-repos sources.

Why is it better? For example:

1. apt-get update is run on one node
2. in the meantime /mos-repos/ is updated (every 15 minutes)
3. apt-get install fails because some packages do not exist anymore

Obviously there are also other scenarios that may fail.

So the file http://mirror.fuel-infra.org/mos-repos/ubuntu/8.0.target.txt (example for version 8.0) should be read first, and jobs should use only this snapshot.

The same goes for the /pkgs/ mirror (updated every day). The possibility of failure is much, much smaller, but it still exists.

The /pkgs/ mirror also has a snapshot file (suggested by the build team), http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-latest.htm, which is formatted a bit differently but serves the same purpose. It would be much more reliable to use it as well instead of the generic path.

I believe these two things may improve stability of these failing jobs.
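
As a rough illustration of the pinning idea, the sketch below rewrites deb lines so they point at one fixed snapshot URL instead of the generic path. The function name `pin_to_snapshot` is hypothetical. How the snapshot URL is obtained from 8.0.target.txt or ubuntu-latest.htm is deliberately left out, because the format of those files is not described here; the snapshot URL is simply passed in by the caller.

```python
def pin_to_snapshot(deb_lines, generic_url, snapshot_url):
    """Replace the generic repository URL with a fixed snapshot URL.

    `deb_lines` are sources.list style lines ("deb URL suite components...").
    `snapshot_url` must already be resolved by the caller (assumption: it is
    read once per job and reused for every node in that job).
    """
    generic = generic_url.rstrip("/") + "/"
    snapshot = snapshot_url.rstrip("/") + "/"
    pinned = []
    for line in deb_lines:
        parts = []
        for part in line.split():
            if part.startswith(generic):
                # Keep any path below the generic URL, swap only the prefix.
                part = snapshot + part[len(generic):]
            parts.append(part)
        pinned.append(" ".join(parts))
    return pinned
```

Since every node in a job would then see the same snapshot, an update of /mos-repos/ between apt-get update and apt-get install should no longer change what the job sees.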

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Verified on the 8.0-446 ISO.

Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

@pbrzozowski
The errors in the job are 'Failed to fetch http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2016-01-12-170104/dists/trusty-updates/universe/binary-amd64/Packages 404 Not Found'
So does that mean the snapshot mirror has some errors too, Pawel?

Revision history for this message
Max Rasskazov (mrasskazov) wrote :

Related fix proposed to branch: master
Review: https://review.fuel-infra.org/16514

Revision history for this message
Aleksei Stepanov (penguinolog) wrote :

Not reproduced for a long time
