Slow down in image build and in accessing/updating containers - did not get container create message from subprocess

Bug #1963702 reported by Ananya Banerjee
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

We have been seeing a slowdown in jobs running on vexxhost since 03/03: slower image builds, slow container access/modification, and tempest timeouts.

For example, the standalone deploy fails with:

FATAL | Capture the update repos and installed rpms | localhost | error={"changed": true, "cmd": "buildah run 192.168.24.1-working-container-3 yum list installed > /var/log/container_info.log\n", "delta": "0:00:30.079656", "end": "2022-03-04 15:57:03.241852", "msg": "non-zero return code", "rc": 1, "start": "2022-03-04 15:56:33.162196", "stderr": "time=\"2022-03-04T15:56:56Z\" level=error msg=\"did not get container create message from subprocess: read |0: i/o timeout\"\nerror running container: write containercreatepipe: broken pipe\nerror while running runtime: exit status 1", "stderr_lines": ["time=\"2022-03-04T15:56:56Z\" level=error msg=\"did not get container create message from subprocess: read |0: i/o timeout\"", "error running container: write containercreatepipe: broken pipe", "error while running runtime: exit status 1"], "stdout": "", "stdout_lines": []}
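To make the failure above easier to read, here is a minimal sketch that pulls the elapsed time and the stderr lines out of an Ansible error payload of this shape. The function name `summarize_failure` is hypothetical, and the sample is a trimmed copy of the payload above, not the full log entry.

```python
import json
from datetime import timedelta


def summarize_failure(error_json):
    """Return (elapsed, stderr_lines) from an Ansible command error payload.

    Assumes the payload carries a "delta" of the form H:MM:SS.ffffff and a
    "stderr_lines" list, as in the error quoted in this report.
    """
    err = json.loads(error_json)
    hours, minutes, seconds = err["delta"].split(":")
    elapsed = timedelta(
        hours=int(hours), minutes=int(minutes), seconds=float(seconds)
    )
    return elapsed, err["stderr_lines"]


# Trimmed copy of the payload reported above (illustrative only).
sample = json.dumps({
    "delta": "0:00:30.079656",
    "rc": 1,
    "stderr_lines": [
        'time="2022-03-04T15:56:56Z" level=error msg="did not get container '
        'create message from subprocess: read |0: i/o timeout"',
        "error running container: write containercreatepipe: broken pipe",
        "error while running runtime: exit status 1",
    ],
})

elapsed, lines = summarize_failure(sample)
print(f"command ran for {elapsed.total_seconds():.1f}s before failing")
for line in lines:
    print("  ", line)
```

In this payload the command ran for about 30 seconds before buildah reported the i/o timeout.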

logs:
https://logserver.rdoproject.org/82/40082/2/check/periodic-tripleo-ci-centos-9-standalone-full-tempest-scenario-master/79eae19/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz

This shows up in some of the c9 standalone jobs, c8 standalone jobs, and component jobs. Examples:

periodic-tripleo-ci-centos-9-standalone-full-tempest-scenario-master
periodic-tripleo-ci-centos-9-scenario003-standalone-common-master
periodic-tripleo-ci-centos-9-standalone-tempest-master

Ananya Banerjee (frenzyfriday) wrote :

In centos9 jobs we see a difference in podman versions; not sure if that is relevant:

Podman version in a passing log:
podman-3.4.5-0.7.el9.x86_64
podman-catatonit-3.4.5-0.7.el9.x86_64

Podman version in failing log:
podman-4.0.0-6.el9.x86_64
podman-catatonit-4.0.0-6.el9.x86_64
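A minimal sketch of how the installed-rpm lists from a passing and a failing job could be compared automatically. The helper names `parse_rpms` and `diff_rpms` are hypothetical, and the regex assumes package lines in the `name-version-release.arch` form shown above.

```python
import re

# Matches name-version-release.arch, e.g. podman-3.4.5-0.7.el9.x86_64
NVRA = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$"
)


def parse_rpms(lines):
    """Map package name -> 'version-release' for every parseable line."""
    pkgs = {}
    for line in lines:
        match = NVRA.match(line.strip())
        if match:
            pkgs[match["name"]] = f'{match["version"]}-{match["release"]}'
    return pkgs


def diff_rpms(passing, failing):
    """Return {name: (passing_version, failing_version)} where they differ."""
    old, new = parse_rpms(passing), parse_rpms(failing)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


passing = [
    "podman-3.4.5-0.7.el9.x86_64",
    "podman-catatonit-3.4.5-0.7.el9.x86_64",
]
failing = [
    "podman-4.0.0-6.el9.x86_64",
    "podman-catatonit-4.0.0-6.el9.x86_64",
]

for name, (before, after) in diff_rpms(passing, failing).items():
    print(f"{name}: {before} -> {after}")
```

A package present in only one list shows up with `None` on the other side, so added or removed packages are reported as well.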

Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
milestone: none → yoga-2
milestone: yoga-2 → yoga-3
tags: added: promotion-blocker
Ronelle Landy (rlandy)
description: updated
Ananya Banerjee (frenzyfriday) wrote :

But in the failing centos8 jobs we have:

podman.x86_64 3.0.1-6.module_el8.5.0+736+58cc1a5a
podman-catatonit.x86_64 3.0.1-6.module_el8.5.0+736+58cc1a5a

So it might not be related after all

description: updated
Ronelle Landy (rlandy) wrote :

We are also seeing a slow down in image build jobs:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-centos-8-buildimage-overcloud-full-wallaby

Jobs are now timing out at an hour and a half; before, they ran in about 50 minutes.

summary: - did not get container create message from subprocess
+ Slow down in accessing/updating containers - did not get container
+ create message from subprocess
summary: - Slow down in accessing/updating containers - did not get container
- create message from subprocess
+ Slow down in image build and in accessing/updating containers - did not
+ get container create message from subprocess
Ronelle Landy (rlandy)
description: updated
description: updated
Ananya Banerjee (frenzyfriday) wrote :

We are also seeing timeouts/slowdowns in periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-ussuri and periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-ussuri:

SK | Check containers status
2022-03-04 09:51:28 | [ERROR]: Container(s) which failed to be created by podman_container module:
2022-03-04 09:51:28 | ['nova_db_sync']
2022-03-04 09:51:28 | [ERROR]: Container(s) which did not finish after 300 minutes: ['nova_db_sync']
2022-03-04 09:51:28 | 2022-03-04 09:51:28.999785 | fa163e1f-30ca-4089-242a-00000000a05b | FATAL | Check containers status | overcloud-controller-0 | error={"changed": false, "msg": "Failed container(s): ['nova_db_sync'], check logs in /var/log/containers/stdouts/"}
2022-03-04 09:51:28 | 2022-03-04 09:51:29.000873 | fa163e1f-30ca-4089-242a-00000000a05b | TIMING | tripleo_container_manage : Check containers status | overcloud-controller-0 | 1:09:05.953041 | 461.26s

https://logserver.rdoproject.org/70/40070/1/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-ussuri/d69b7dd/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Ronelle Landy (rlandy) wrote :

We are looking at the individual tempest bugs now, so closing this general one out.

Changed in tripleo:
status: Triaged → Fix Released