Content provider and other tripleo jobs often time out on upstream vexxhost ca-ymq-1 region

Bug #1912663 reported by Sagi (Sergey) Shnaidman
This bug affects 2 people
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

Content provider jobs build containers for all other jobs. On the upstream vexxhost cloud (ca-ymq-1 region) these jobs often time out because container builds are very slow there.
The overall pass rate in this region is about 65%:

http://dashboard-ci.tripleo.org/d/3-DYSmOGk/jobs-exploration?var-influxdb_filter=job_name%7C%3D~%7C%2Fcontent-provider%2F&var-influxdb_filter=cloud%7C%3D%7Cvexxhost&var-influxdb_filter=region%7C%3D%7Cca-ymq-1

The overall pass rate excluding this region is about 95%:

http://dashboard-ci.tripleo.org/d/3-DYSmOGk/jobs-exploration?var-influxdb_filter=job_name%7C%3D~%7C%2Fcontent-provider%2F&var-influxdb_filter=cloud%7C!%3D%7Cvexxhost&var-influxdb_filter=region%7C!%3D%7Cca-ymq-1
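
The two dashboard links differ only in their URL-encoded filters. As a rough sketch (the `decode_filters` helper and the dictionary labels are illustrative names, not part of the dashboard), the filters can be decoded with the Python standard library to see exactly what each link selects:

```python
from urllib.parse import urlsplit, parse_qs

# The two dashboard URLs quoted in the bug description.
URLS = {
    "ca-ymq-1 only": (
        "http://dashboard-ci.tripleo.org/d/3-DYSmOGk/jobs-exploration"
        "?var-influxdb_filter=job_name%7C%3D~%7C%2Fcontent-provider%2F"
        "&var-influxdb_filter=cloud%7C%3D%7Cvexxhost"
        "&var-influxdb_filter=region%7C%3D%7Cca-ymq-1"
    ),
    "excluding ca-ymq-1": (
        "http://dashboard-ci.tripleo.org/d/3-DYSmOGk/jobs-exploration"
        "?var-influxdb_filter=job_name%7C%3D~%7C%2Fcontent-provider%2F"
        "&var-influxdb_filter=cloud%7C!%3D%7Cvexxhost"
        "&var-influxdb_filter=region%7C!%3D%7Cca-ymq-1"
    ),
}


def decode_filters(url):
    """Return the percent-decoded var-influxdb_filter values of a URL."""
    query = urlsplit(url).query
    return parse_qs(query).get("var-influxdb_filter", [])


for label, url in URLS.items():
    print(label, decode_filters(url))
```

The first link filters on `cloud|=|vexxhost` and `region|=|ca-ymq-1`; the second uses `!=` for both, with the same `job_name` pattern.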

We have already increased the timeout for this job [1], but it is still so slow that it cannot finish the container builds in time.

Example of a job that timed out:

https://8cc2269bd81626fd401a-4760b556f625ab5e6fe00ab100fd24ac.ssl.cf5.rackcdn.com/770634/11/check/tripleo-ci-centos-8-content-provider/8e56646/logs/

It built only 20 containers in 2h 15min: https://8cc2269bd81626fd401a-4760b556f625ab5e6fe00ab100fd24ac.ssl.cf5.rackcdn.com/770634/11/check/tripleo-ci-centos-8-content-provider/8e56646/logs/report.html

Example of a regular job from the RAX cloud DFW region:
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_325/753407/2/check/tripleo-ci-centos-8-content-provider/3255e27/job-output.txt

It built all ~120 containers in 35 minutes.
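
To put the two examples side by side, here is a small back-of-the-envelope calculation. It uses only the figures quoted in this report (20 containers in 2h 15min on ca-ymq-1, about 120 containers in 35 minutes on RAX DFW); the helper name is illustrative, and since 120 is approximate the ratio is an estimate rather than a measurement:

```python
def containers_per_minute(containers, minutes):
    """Average build rate over the whole run."""
    return containers / minutes


# vexxhost ca-ymq-1 example: 20 containers in 2h 15min (job timed out)
slow = containers_per_minute(20, 2 * 60 + 15)
# RAX DFW example: ~120 containers in 35 min (approximate count)
fast = containers_per_minute(120, 35)

print(f"slow: {slow:.2f}/min, fast: {fast:.2f}/min, ratio: {fast / slow:.1f}x")
```

With these numbers the RAX job builds containers roughly 23 times faster than the ca-ymq-1 job.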

[1] https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/769777

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The problem also exists in scenario jobs, for example in failed gate jobs:

https://92ed9dc19b3bf94f48d5-598e1d61c0aab85aa3b67b337ca2c556.ssl.cf5.rackcdn.com/772598/2/gate/tripleo-ci-centos-8-undercloud-containers/9d272cd/logs/undercloud/home/zuul/undercloud_install.log
 [ERROR]: Container(s) which failed to be created by podman_container module:['neutron_db_sync']

 The db sync took 23 min instead of the usual 1 min.
 https://92ed9dc19b3bf94f48d5-598e1d61c0aab85aa3b67b337ca2c556.ssl.cf5.rackcdn.com/772598/2/gate/tripleo-ci-centos-8-undercloud-containers/9d272cd/logs/undercloud/var/log/containers/stdouts/neutron_db_sync.log

Every operation takes an unusually long time, and jobs fail on the vexxhost upstream cloud.

tags: added: alert
summary: - Content provider jobs are often timeouting on upstream vexxhost ca-ymq-1
- region
+ Content provider and other tripleo jobs are often timeouting on upstream
+ vexxhost ca-ymq-1 region
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Alan Pevec (apevec) wrote :

List of failing hosts collected by Sagi:

ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022799299 host_id: 28e469d800162f1390ebc4c96534e9bec57f25409dfcd68f14e100f9
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022788831 host_id: e46140ab5ff21639014742be93c84719b4f39f9c044498b79064dcb5
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022790877 host_id: 322bfced1bb62c11480e9c8cf97e066ab7cb1067b23b38e8004c6f69
ansible_hostname: centos-7-vexxhost-ca-ymq-1-0022792215 host_id: 08e945bd00ba53f11f723cb5fca02cc0f5edb6b8056b2b73ad0449cd
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022798829 host_id: 6d4cb5d013461b5ae3cdfbec9c124753323f2b734d533c8797a9148e
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022787431 host_id: 08e945bd00ba53f11f723cb5fca02cc0f5edb6b8056b2b73ad0449cd
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022787407 host_id: 08e945bd00ba53f11f723cb5fca02cc0f5edb6b8056b2b73ad0449cd

Alan Pevec (apevec) wrote :

New hostnames that failed today:
centos-8-vexxhost-ca-ymq-1-0022828441 host_id: 6d4cb5d013461b5ae3cdfbec9c124753323f2b734d533c8797a9148e
centos-8-vexxhost-ca-ymq-1-0022828413 host_id: 8fff2b99b52d9369c460de43685c4f39059dc99f7b3eef95ec34e8a5
centos-8-vexxhost-ca-ymq-1-0022824207 host_id: d503dcbe511a6cdfc9810642cca8c6c66efbb29e1e14fc98860e36b5
centos-8-vexxhost-ca-ymq-1-0022806170 host_id: 23320365a917d59498b30bb8308684708801795984a6ca42b1cd06b7
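
Several of the hosts listed in this and the previous comment share a host_id. A minimal sketch for grouping them, assuming that an identical host_id means the nodes ran on the same underlying compute host (the function and variable names here are illustrative):

```python
import re
from collections import defaultdict

# Hostname / host_id pairs copied from the two lists in this bug.
REPORTED = """\
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022799299 host_id: 28e469d800162f1390ebc4c96534e9bec57f25409dfcd68f14e100f9
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022788831 host_id: e46140ab5ff21639014742be93c84719b4f39f9c044498b79064dcb5
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022790877 host_id: 322bfced1bb62c11480e9c8cf97e066ab7cb1067b23b38e8004c6f69
ansible_hostname: centos-7-vexxhost-ca-ymq-1-0022792215 host_id: 08e945bd00ba53f11f723cb5fca02cc0f5edb6b8056b2b73ad0449cd
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022798829 host_id: 6d4cb5d013461b5ae3cdfbec9c124753323f2b734d533c8797a9148e
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022787431 host_id: 08e945bd00ba53f11f723cb5fca02cc0f5edb6b8056b2b73ad0449cd
ansible_hostname: centos-8-vexxhost-ca-ymq-1-0022787407 host_id: 08e945bd00ba53f11f723cb5fca02cc0f5edb6b8056b2b73ad0449cd
centos-8-vexxhost-ca-ymq-1-0022828441 host_id: 6d4cb5d013461b5ae3cdfbec9c124753323f2b734d533c8797a9148e
centos-8-vexxhost-ca-ymq-1-0022828413 host_id: 8fff2b99b52d9369c460de43685c4f39059dc99f7b3eef95ec34e8a5
centos-8-vexxhost-ca-ymq-1-0022824207 host_id: d503dcbe511a6cdfc9810642cca8c6c66efbb29e1e14fc98860e36b5
centos-8-vexxhost-ca-ymq-1-0022806170 host_id: 23320365a917d59498b30bb8308684708801795984a6ca42b1cd06b7
"""

# Matches both list styles (with or without the "ansible_hostname:" prefix).
LINE_RE = re.compile(r"(centos-\d+-vexxhost-\S+)\s+host_id:\s+([0-9a-f]+)")


def group_by_host_id(text):
    """Map each host_id to the list of node hostnames reported on it."""
    groups = defaultdict(list)
    for hostname, host_id in LINE_RE.findall(text):
        groups[host_id].append(hostname)
    return dict(groups)


groups = group_by_host_id(REPORTED)
# Print the most frequently reported host_ids first.
for host_id, names in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(len(names), host_id[:12], ", ".join(names))
```

On the data above, host_id 08e945bd... accounts for three of the eleven failing nodes and 6d4cb5d0... for two; the remaining six host_ids appear once each.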

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

One job failed in the gate: https://969a09e50621f00ba282-7a435d8b6671949964eefa8586127100.ssl.cf5.rackcdn.com/772968/6/gate/tripleo-ci-centos-8-content-provider/8e6a686/
host_id: d503dcbe511a6cdfc9810642cca8c6c66efbb29e1e14fc98860e36b5

Another job timed out in check: https://a2cdacfa8cbf343b7284-460112c8cdf3a162e27bb5a0ba3266a8.ssl.cf2.rackcdn.com/773999/3/check/tripleo-ci-centos-8-content-provider-victoria/a4027c9/
host_id: 08e945bd00ba53f11f723cb5fca02cc0f5edb6b8056b2b73ad0449cd

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

https://013592400a0e3b08671a-e9b8396308aaaef05937a4d3fced2e50.ssl.cf2.rackcdn.com/774500/1/check/tripleo-ci-centos-8-scenario004-standalone/e8f9efc/

host_id: d503dcbe511a6cdfc9810642cca8c6c66efbb29e1e14fc98860e36b5

Initializing the AIDE database took about 1 hour:

2021-02-08 17:45:50.573004 | fa163e60-92fa-45c8-0b70-000000000617 | TASK | Initialize aide database
2021-02-08 18:44:17.930248 | fa163e60-92fa-45c8-0b70-000000000617 | OK | Initialize aide database | standalone

https://013592400a0e3b08671a-e9b8396308aaaef05937a4d3fced2e50.ssl.cf2.rackcdn.com/774500/1/check/tripleo-ci-centos-8-scenario004-standalone/e8f9efc/logs/undercloud/home/zuul/standalone_deploy.log

https://013592400a0e3b08671a-e9b8396308aaaef05937a4d3fced2e50.ssl.cf2.rackcdn.com/774500/1/check/tripleo-ci-centos-8-scenario004-standalone/e8f9efc/logs/undercloud/var/log/aide/aide.log

Start timestamp: 2021-02-08 17:45:50 +0000 (AIDE 0.16)
AIDE initialized database at /var/lib/aide/aide.db.new.gz

Number of entries: 569388

---------------------------------------------------
The attributes of the (uncompressed) database(s):
---------------------------------------------------

/var/lib/aide/aide.db.new.gz
  MD5 : R+QeRC/we5/ki0DvMZsdMw==
  SHA1 : UnYCxqfyWSojK9nLQFtLvpd0RYs=
  RMD160 : a401RREc0UtnB2ddvoIUgxd672E=
  TIGER : Q+iJ49PXA3gfkcaURCLwYQhazEka1E5L
  SHA256 : QzbsDa2CObDypeRApCLekrgBgLbtQOoh
             e1r2bt4Dp90=
  SHA512 : JkoSH0BokyrXqjfUfuyOJugn0xiRRhHK
             ChjjACcSrC86KC6Tpj/kg4tIjhdjul/F
             SL6hgKHaAL3YOQ4UmpASXg==

End timestamp: 2021-02-08 18:44:17 +0000 (run time: 58m 27s)
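
As a cross-check, the duration can be recomputed from the two task timestamps in standalone_deploy.log quoted above. This is only a sketch; the `elapsed` helper and the format string are illustrative choices, not something taken from the CI tooling:

```python
from datetime import datetime

# Timestamp layout used in the standalone_deploy.log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed(start, end):
    """Return the timedelta between two log timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# TASK and OK lines for "Initialize aide database"
delta = elapsed("2021-02-08 17:45:50.573004", "2021-02-08 18:44:17.930248")
minutes, seconds = divmod(int(delta.total_seconds()), 60)
print(f"{minutes}m {seconds}s")
```

The result, 58m 27s, agrees with the run time that AIDE itself reports.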

Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
status: Triaged → Fix Released