overcloud deploy with podman fails on step4 with systemctl failed error

Bug #1814860 reported by Rabi Mishra
This bug affects 1 person
Affects  Status        Importance  Assigned to  Milestone
tripleo  Fix Released  High        Unassigned

Bug Description

In change I43c4291caf3c8ec07529ee264a5960d84854d648 we started raising errors for systemd failures. Since then, I've noticed that overcloud deploy fails intermittently on step 4, on a different container each time. I guess this was not noticed at the gate because we run with clean hosts there (no existing running containers).

TASK [Debug output for task: Start containers for step 4] **********************
task path: /var/lib/mistral/overcloud/common_deploy_steps_tasks.yaml:494
Tuesday 05 February 2019 11:04:17 +0000 (0:00:21.796) 0:10:08.022 ******
fatal: [reprosubnode-1]: FAILED! => {
    "failed_when_result": true,
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "Error stopping container: neutron_ovs_agent",
        "Error stopping container: neutron_metadata_agent",
        "Error stopping container: neutron_dhcp",
        "systemctl failed",
        "Traceback (most recent call last):",
        " File \"/usr/lib/python2.7/site-packages/paunch/utils/systemd.py\", line 114, in service_delete",
        " subprocess.check_call(['systemctl', 'stop', sysd_f])",
        " File \"/usr/lib64/python2.7/subprocess.py\", line 542, in check_call",
        " raise CalledProcessError(retcode, cmd)",
        "CalledProcessError: Command '['systemctl', 'stop', u'tripleo_neutron_l3_agent.service']' returned non-zero exit status 1",
        "Command '['systemctl', 'stop', u'tripleo_neutron_l3_agent.service']' returned non-zero exit status 1",
        "Removed symlink /etc/systemd/system/multi-user.target.wants/tripleo_neutron_ovs_agent.service.",
        "Warning: Stopping tripleo_neutron_ovs_agent_healthcheck.service, but it can still be activated by:",
        " tripleo_neutron_ovs_agent_healthcheck.timer",
        "Removed symlink /etc/systemd/system/timers.target.wants/tripleo_neutron_ovs_agent_healthcheck.timer.",
        "Removed symlink /etc/systemd/system/multi-user.target.wants/tripleo_neutron_metadata_agent.service.",
        "Warning: Stopping tripleo_neutron_metadata_agent_healthcheck.service, but it can still be activated by:",
        " tripleo_neutron_metadata_agent_healthcheck.timer",
        "Removed symlink /etc/systemd/system/timers.target.wants/tripleo_neutron_metadata_agent_healthcheck.timer.",
        "Removed symlink /etc/systemd/system/multi-user.target.wants/tripleo_neutron_dhcp.service.",
        "Warning: Stopping tripleo_neutron_dhcp_healthcheck.service, but it can still be activated by:",
        " tripleo_neutron_dhcp_healthcheck.timer",
        "Removed symlink /etc/systemd/system/timers.target.wants/tripleo_neutron_dhcp_healthcheck.timer.",
        "Job for tripleo_neutron_l3_agent.service canceled."

So the podman stop raises an error, and the systemctl stop command returns exit code 1.
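
To illustrate why a non-zero exit from systemctl now aborts the step, here is a minimal sketch of how subprocess.check_call turns that exit status into a CalledProcessError, as in the traceback above. The function name stop_unit and the runner parameter are hypothetical, introduced only for illustration; this is not paunch's actual code or its fix.

```python
import subprocess


def stop_unit(unit, runner=subprocess.check_call):
    """Try to stop a systemd unit.

    Returns True on success and False if systemctl exits non-zero.
    `runner` is a hypothetical injection point so the function can be
    exercised without a real systemd.
    """
    cmd = ['systemctl', 'stop', unit]
    try:
        runner(cmd)
    except subprocess.CalledProcessError as exc:
        # check_call raises whenever the exit status is non-zero.
        print('systemctl failed: %s returned non-zero exit status %d'
              % (exc.cmd, exc.returncode))
        return False
    return True
```

Whether such a failure should be swallowed or propagated is a design choice; the sketch only shows where the exception originates.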

From the journal logs for the service:

Feb 06 07:25:36 reprosubnode-1.rdocloud podman[31432]: time="2019-02-06T07:25:36Z" level=error msg="Error forwarding signal 15 to container d1ed9ecfb65f6329e22f62ecbaed78314cf1cb1631e223ef1602e68e7dc7d8fb: can only kill running containers: container state improper"
Feb 06 07:25:36 reprosubnode-1.rdocloud podman[31432]: time="2019-02-06T07:25:36Z" level=error msg="Error forwarding signal 18 to container d1ed9ecfb65f6329e22f62ecbaed78314cf1cb1631e223ef1602e68e7dc7d8fb: can only kill running containers: container state improper"
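
When checking other nodes for the same symptom, a small filter over the journal output can help. The following is a rough sketch that assumes the message format shown in the two lines above; the function name find_signal_errors is hypothetical.

```python
import re

# Matches the podman error text seen in the journal lines above.
SIGNAL_ERROR = re.compile(
    r'Error forwarding signal (?P<signal>\d+) to container '
    r'(?P<cid>[0-9a-f]+): can only kill running containers'
)


def find_signal_errors(lines):
    """Return (signal, container_id) pairs for each matching journal line."""
    hits = []
    for line in lines:
        match = SIGNAL_ERROR.search(line)
        if match:
            hits.append((int(match.group('signal')), match.group('cid')))
    return hits
```

The input could be, for example, the lines of journal output for the affected unit.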

Looking at podman issues:

https://github.com/containers/libpod/issues/2168 looks similar and has been fixed with https://github.com/containers/libpod/pull/2169/commits/33889c642deaaf3d6977ea6463f5937f549fb52b.

I'm hoping this will go away once we bump podman, as podman-1.0.0-1.git82e8011.el7.x86_64 does not seem to have this fix.

Tags: containers
Changed in tripleo:
milestone: none → stein-3
status: New → Triaged
importance: Undecided → High
tags: added: containers
Rabi Mishra (rabi) wrote :

Tested with podman-1.0.0-3.git921f98f.el7, which contains the above change, and the issue seems resolved now.
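
For anyone checking which build a host carries, here is a small sketch that classifies a podman package string using only the two builds mentioned in this report. The function name classify_podman_build and the lookup table are hypothetical; any build not listed here is reported as unknown rather than guessed, and this is not a general RPM version comparison.

```python
# Builds observed in this report: release 1 lacks the fix, release 3 has it.
OBSERVED_1_0_0_RELEASES = {1: 'affected', 3: 'fixed'}


def classify_podman_build(nvr):
    """Classify a 'name-version-release' string such as
    'podman-1.0.0-3.git921f98f.el7' as 'affected', 'fixed' or 'unknown'.
    """
    name, version, release = nvr.rsplit('-', 2)
    if name != 'podman' or version != '1.0.0':
        return 'unknown'
    # The release field starts with a build number, e.g. '3.git921f98f.el7'.
    release_number = int(release.split('.', 1)[0])
    return OBSERVED_1_0_0_RELEASES.get(release_number, 'unknown')
```

The package string itself can be obtained with `rpm -q podman`.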

Changed in tripleo:
status: Triaged → Fix Released