standalone_ffu job is failing upgrading at "podman exec ovn_cluster_north_db_server" with error: can only create exec sessions on running containers: container state improper

Bug #1997362 reported by Juan Badia Payno
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Undecided    Unassigned

Bug Description

The job tripleo-ci-centos-8-standalone-ffu-wallaby fails while the standalone node is being upgraded.

The error is:

2022-11-21 21:48:52 | 2022-11-21 21:48:52.584124 | bc764e10-0e1a-faaf-8d43-00000000220c | FATAL | Set connection | standalone | error={"changed": true, "cmd": "podman exec ovn_cluster_north_db_server bash -c \"ovn-nbctl --no-leader-only --inactivity-probe=60000 set-connection ptcp:6641:0.0.0.0\"\npodman exec ovn_cluster_south_db_server bash -c \"ovn-sbctl --no-leader-only --inactivity-probe=60000 set-connection ptcp:6642:0.0.0.0\"\n", "delta": "0:00:00.330020", "end": "2022-11-21 19:25:13.053914", "msg": "non-zero return code", "rc": 255, "start": "2022-11-21 19:25:12.723894", "stderr": "Error: can only create exec sessions on running containers: container state improper\nError: can only create exec sessions on running containers: container state improper", "stderr_lines": ["Error: can only create exec sessions on running containers: container state improper", "Error: can only create exec sessions on running containers: container state improper"], "stdout": "", "stdout_lines": []}

It can be seen at:

https://zuul.opendev.org/t/openstack/build/5fbff124520c41ff9e27ae1a3756cb34/log/logs/undercloud/home/zuul/standalone_upgrade.log#4568
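
The "can only create exec sessions on running containers" message means podman refused the exec because the container was not in the running state at that moment. As an illustration only (this is not the fix that was merged), a guard could poll the state reported by `podman inspect` before calling `podman exec`. The helper names `container_state` and `exec_when_running` and the `PODMAN` override variable are hypothetical, not part of TripleO:

```shell
#!/bin/sh
# Hypothetical helpers: only run "podman exec" once the container is running.
# PODMAN may be overridden (for example with a stub) and defaults to the real CLI.
PODMAN="${PODMAN:-podman}"

container_state() {
    # Print the container status ("running", "exited", ...), or "unknown"
    # when the inspect call itself fails.
    state=$("$PODMAN" inspect --format '{{.State.Status}}' "$1" 2>/dev/null) || state=unknown
    printf '%s\n' "$state"
}

exec_when_running() {
    # Usage: exec_when_running <container> <retries> <delay_seconds> <command...>
    # Note: plain sh has no local variables, so these names are global.
    name=$1; retries=$2; delay=$3; shift 3
    i=0
    while [ "$i" -lt "$retries" ]; do
        if [ "$(container_state "$name")" = "running" ]; then
            "$PODMAN" exec "$name" "$@"
            return $?
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "container $name did not reach the running state" >&2
    return 1
}
```

A call might look like `exec_when_running ovn_cluster_north_db_server 30 2 bash -c "ovn-nbctl --no-leader-only --inactivity-probe=60000 set-connection ptcp:6641:0.0.0.0"`. If the container never starts, as the logs below suggest happened here, the guard would only turn the failure into a clearer timeout message. It would not repair the container.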

Other logs:

var/log/containers/openvswitch/ovsdb-server-nb.log

2022-11-21T21:31:35.017Z|00013|reconnect|INFO|tcp:192.0.2.254:6641: connecting...
2022-11-21T21:31:39.020Z|00014|reconnect|INFO|tcp:192.0.2.254:6641: connection attempt timed out
2022-11-21T21:31:39.020Z|00015|reconnect|INFO|tcp:192.0.2.254:6641: continuing to reconnect in the background but suppressing further logging
2022-11-21T21:48:10.570Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2022-11-21T21:48:10.571Z|00002|daemon_unix|WARN|/var/run/ovn/ovnnb_db.pid: stale pidfile for pid 136
 being deleted by pid 0
2022-11-21T21:48:10.571Z|00003|daemon_unix|EMER|/var/run/ovn/ovnnb_db.pid: pidfile check failed (No such process), aborting
2022-11-21T21:48:15.219Z|00016|jsonrpc|WARN|unix#107: send error: Broken pipe
2022-11-21T21:48:15.220Z|00017|reconnect|WARN|unix#107: connection dropped (Broken pipe)
2022-11-21T21:48:15.226Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2022-11-21T21:48:15.226Z|00002|daemon_unix|WARN|/var/run/ovn/ovnnb_db.pid: stale pidfile for pid 136
 being deleted by pid 0
2022-11-21T21:48:15.226Z|00003|daemon_unix|EMER|/var/run/ovn/ovnnb_db.pid: pidfile check failed (No such process), aborting
2022-11-21T21:48:16.658Z|00018|jsonrpc|WARN|unix#110: send error: Broken pipe
2022-11-21T21:48:16.658Z|00019|reconnect|WARN|unix#110: connection dropped (Broken pipe)
2022-11-21T21:48:16.663Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2022-11-21T21:48:16.663Z|00002|daemon_unix|WARN|/var/run/ovn/ovnnb_db.pid: stale pidfile for pid 136
 being deleted by pid 0

var/log/containers/openvswitch/ovn-northd.log

2022-11-21T21:31:40.739Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
2022-11-21T21:31:40.739Z|00005|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2022-11-21T21:31:40.739Z|00006|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-11-21T21:48:46.797Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-northd.log
2022-11-21T21:48:46.801Z|00002|daemon_unix|WARN|/run/openvswitch/ovn-northd.pid: stale pidfile for pid 244
 being deleted by pid 0
2022-11-21T21:48:46.801Z|00003|daemon_unix|EMER|/run/openvswitch/ovn-northd.pid: pidfile check failed (No such process), aborting
2022-11-21T21:48:51.875Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-northd.log
2022-11-21T21:48:51.875Z|00002|daemon_unix|WARN|/run/openvswitch/ovn-northd.pid: stale pidfile for pid 244
 being deleted by pid 0
2022-11-21T21:48:51.875Z|00003|daemon_unix|EMER|/run/openvswitch/ovn-northd.pid: pidfile check failed (No such process), aborting
2022-11-21T21:48:53.050Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-northd.log
2022-11-21T21:48:53.050Z|00002|daemon_unix|WARN|/run/openvswitch/ovn-northd.pid: stale pidfile for pid 244
 being deleted by pid 0
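
Both excerpts repeat the same pattern: the daemon opens its log, warns about a stale pidfile, then aborts with an EMER-level message. A minimal sketch for pulling just those abort lines out of a log, assuming the `timestamp|sequence|module|LEVEL|message` layout seen above (the function names `emer_lines` and `count_emer` are made up for this example):

```shell
#!/bin/sh
# Print "timestamp module: message" for every EMER-level line in the given
# OVS/OVN log files. Only the first "|"-separated message field is printed,
# so a message that itself contains "|" would be cut short.
emer_lines() {
    awk -F'|' '$4 == "EMER" { print $1 " " $3 ": " $5 }' "$@"
}

# Count the EMER-level lines, e.g. to see how often a daemon aborted.
count_emer() {
    emer_lines "$@" | wc -l | tr -d ' '
}
```

For example, `count_emer var/log/containers/openvswitch/ovn-northd.log` would give the number of aborts recorded in that file.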

Slawek Kaplonski (slaweq) wrote :

I took a quick look at this, and it looks like it may be an issue with the container's configuration. Here's what I see in var/log/extra/podman/containers/ovn_cluster_north_db_server/stdout.log on the undercloud:

Running command: 'bash -c $* -- eval source /etc/sysconfig/ovn_cluster; exec /usr/local/bin/start-nb-db-server ${OVN_NB_DB_OPTS}'
+ umask 0022
+ exec bash -c '$*' -- eval source '/etc/sysconfig/ovn_cluster;' exec /usr/local/bin/start-nb-db-server '${OVN_NB_DB_OPTS}'
Creating cluster database /etc/ovn/ovnnb_db.db ovsdb-tool: I/O error: /etc/ovn/ovnnb_db.db: failed to lock lockfile (Resource temporarily unavailable)
[FAILED]
ovsdb-server: /var/run/ovn/ovnnb_db.pid: pidfile check failed (No such process), aborting
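
The "pidfile check failed (No such process)" message indicates that the pidfile names a process that no longer exists. A rough sketch of that check done by hand follows. The function name `pidfile_is_stale` is hypothetical, and the sketch ignores the file locking that the OVS daemons also apply to their pidfiles:

```shell
#!/bin/sh
# Return 0 if the pidfile names a process that does not exist (stale),
# 1 if the process is alive or the pidfile is missing, empty, or malformed.
pidfile_is_stale() {
    [ -r "$1" ] || return 1
    # Use the first line only and strip surrounding whitespace.
    pid=$(head -n 1 "$1" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    # kill -0 sends no signal; it only tests whether the pid can be signalled.
    # Caveat: it also fails for a live process owned by another user, so run
    # this as root or as the daemon's user.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

The check only makes sense when run in the same PID namespace as the daemon. The PID written by a containerized daemon will generally not match any PID on the host.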

Alan Pevec (apevec)
tags: added: promotion-blocker
Alan Pevec (apevec)
tags: added: ci
Alan Pevec (apevec) wrote :

@Juan please provide a reproducer and set Status = Triaged to start the CI escalation process!

Changed in tripleo:
milestone: none → antelope-1
status: New → Incomplete
Changed in tripleo:
status: Incomplete → New
status: New → Incomplete
status: Incomplete → Confirmed
Juan Badia Payno (jbadiapa) wrote :

- Deploy standalone/train with tripleo-quickstart
- Modify the container, repos and params
- Upgrade the standalone

######################################################
The following process is not perfect, but it reproduces the issue.
######################################################
- Download tripleo-quickstart and apply this patch [1]
- Deploy the standalone job
 - Create personal configuration
cat >personal_configuration.yml <<EOF
control_memory: 2048
compute_memory: 2048
undercloud_memory: 32768

undercloud_vcpu: 8
default_vcpu: 1
undercloud_disk: 250

standalone_interface: ens4
standalone_ip: 192.168.24.1
standalone_role: Standalone.yaml
standalone_libvirt_type: "{{ standalone_virt_type|default('qemu') }}"
standalone_container_cli: podman
standalone_control_virtual_ip: 192.168.24.3
standalone_custom_env_files:
  - /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml
  - /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml
  - /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml
overcloud_templates_path: /usr/share/openstack-tripleo-heat-templates
ctlplane_dns_nameservers:
  - 192.168.24.1
modify_image_vc_root_password: temporal
enable_vbmc: false
create_instackenv_json: false
inject_instackenv: false
create_eth0: false
EOF

   - Provision the vm
./quickstart.sh -n -X -R tripleo-ci/CentOS-8/train --tags all -T none -E /root/tripleo-quickstart/personal_configuration.yml --nodes config/nodes/baremetal.yml -p quickstart.yml 127.0.0.2 2>&1 | tee 0_provisioning.log

   - Enable the ssh
cat /root/.quickstart/id_rsa_virt_host.pub >> ~/.ssh/authorized_keys

   - Deploy the standalone
./quickstart.sh -R tripleo-ci/CentOS-8/train --no-clone --tags all -I -T none -E /root/src/tripleo-quickstart/personal_configuration.yml --nodes config/nodes/baremetal.yml -p quickstart-extras-standalone.yml 127.0.0.2 2>&1 | tee 1_standalone_installation.log

- Do the upgrade [will fail]
./quickstart.sh -R tripleo-ci/CentOS-8/wallaby --no-clone --tags all -I -T none -E /root/tripleo-quickstart/personal_configuration.yml --nodes config/nodes/baremetal.yml -p multinode-standalone-upgrade.yml 127.0.0.2 2>&1 | tee 2_standalone_upgrade.log

- Modified the params
cat >standalone_parameters_upgrade.yaml <<EOF
# TODO: This is workaround for downstream osp17 and osp17.1 job,
# TripleO now defaults to libvirt's "modular daemons" but this
# required libvirt > 7.4, this is a temporary workaround to move
# back to monolithic modular libvirt daemon till RHOS17.1 on RHEL8 releases.
# Details in https://bugzilla.redhat.com/show_bug.cgi?id=2021536
resource_registry:
  OS::TripleO::Services::NovaLibvirt: /usr/share/openstack-tripleo-heat-templates/deployment/deprecated/nova/nova-libvirt-container-puppet.yaml

parameter_defaults:
  StandaloneParameters:
  CertmongerCA: local
  CloudName: 192.168.24.1
  ContainerCli: podman
  ContainerImagePrepareDebug: true
  ControlPlaneStaticRoutes: []
  Debug: true
  DeploymentUser: stack
  DnsServers: ["8.8.8.8"]
  DockerInsecureRegistryAddress:
  - 192.168.24.1:8787
  - quay.io
  MasqueradeNetworks:
    192.168.24.0/24:
    - 192.168.24.0/24
# Set machine type if defau...


Juan Badia Payno (jbadiapa) wrote :

@Alan, sorry, but I'm not able to set Status = Triaged.
I'm not sure whether I need more permissions, but the Triaged option is greyed out and I cannot click on it.

Alan Pevec (apevec)
Changed in tripleo:
status: Confirmed → Triaged
Juan Badia Payno (jbadiapa) wrote :

The RDO failures:

https://review.opendev.org/c/openstack/tripleo-heat-templates/+/861154
* tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039 OK
* tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001 POST_FAILURE

https://review.opendev.org/c/openstack/tripleo-heat-templates/+/861363
* tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039
   Connection issue on ipa installation:
      https://review.rdoproject.org/zuul/build/9a01d320534b41758a48e3bfde6545ce/log/logs/supplemental/home/cloud-user/deploy_freeipa.log.txt.gz#592
* tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001
    Package download issue:
       https://review.rdoproject.org/zuul/build/20a62d6fabc3494183dd4e41e48b1d5a/log/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#1359

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/866441

Marios Andreou (marios-b) wrote :

The fix is in the gate https://review.opendev.org/c/openstack/tripleo-heat-templates/+/861154

Test result looks OK @ https://zuul.opendev.org/t/openstack/build/673d44b1c4c54f51931882dcde458dbd

However, we should mark this job non voting for now.

We need to get it green, and we also need to start running it in the periodic line as soon as possible, ideally at the same time as we mark it voting in the upstream gate. This may take a few days, so we should stop running it as non-voting in the gate.

posted https://review.opendev.org/c/openstack/tripleo-ci/+/866441 Remove standalone-upgrade-ffu from gate until it is stable to vote

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ci (master)

Change abandoned by "Marios Andreou <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/866441
Reason: job is green this is no longer needed https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-standalone-ffu-wallaby

Changed in tripleo:
status: Triaged → Fix Released