various jobs failing for overcloud deploy with missing cluster error: "Could not connect to cluster (is it running?)"

Bug #1821744 reported by Marios Andreou
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

Various jobs [1][2][3][4], such as tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001, periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset021-master, periodic-tripleo-ci-centos-7-bm_envD-3ctlr_1comp-featureset001-master and openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp_1supp-featureset039, are failing during overcloud deploy because the pacemaker cluster is missing. The first error looks like:

    2019-03-25 21:06:50 | "error: Could not connect to cluster (is it running?)",

The deployment then fails (e.g. no rabbit, etc.).
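
To locate that first error in a downloaded overcloud_deploy.log, a scan along the following lines may help. This is only a minimal sketch: the function name `first_cluster_error` is hypothetical, and the first line of the sample text is a made-up placeholder rather than real job output.

```python
import re

# The pacemaker error string quoted in this report.
CLUSTER_ERROR = re.compile(r"Could not connect to cluster \(is it running\?\)")


def first_cluster_error(log_text):
    """Return (line_number, line) for the first matching line, or None."""
    for number, line in enumerate(log_text.splitlines(), start=1):
        if CLUSTER_ERROR.search(line):
            return number, line.strip()
    return None


# Sample input: the first line is a placeholder, the second is the error
# line quoted in the description.
sample = (
    '2019-03-25 21:06:49 | "placeholder for earlier deploy output",\n'
    '2019-03-25 21:06:50 | "error: Could not connect to cluster (is it running?)",\n'
)
result = first_cluster_error(sample)
print(result)
```

The line number gives a starting point for reading backwards in the log, since the cluster error is a symptom of something that failed earlier.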

[1] https://logs.rdoproject.org/62/644562/7/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/82b8093/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-03-25_21_06_50
[2] http://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset021-master/cb44a5b/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[3] https://sf.hosted.upshift.rdu2.redhat.com/logs/15/165815/14/check/periodic-tripleo-ci-centos-7-bm_envD-3ctlr_1comp-featureset001-master/e7111a3/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-03-26_04_28_29
[4] http://logs.rdoproject.org/03/644903/5/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp_1supp-featureset039/31d1914/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Tags: ci
Changed in tripleo:
assignee: nobody → Marios Andreou (marios-b)
tags: added: promotion-blocker
Changed in tripleo:
assignee: Marios Andreou (marios-b) → nobody
importance: Undecided → Critical
Michele Baldessari (michele) wrote :

From https://logs.rdoproject.org/26/645626/13/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035/1fef793/logs/overcloud-controller-0/var/log/cluster/corosync.log.txt.gz we see the following:
Mar 27 11:41:41 [29680] overcloud-controller-0 cib: info: cib_perform_op: + /cib: @num_updates=12
Mar 27 11:41:41 [29680] overcloud-controller-0 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']: <nvpair id="status-2-last-failure-rabbitmq-bundle-docker-1.start_0" name="last-failure-rabbitmq-bundle-docker-1#start_0" value="1553686901"/>
Mar 27 11:41:41 [29680] overcloud-controller-0 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=overcloud-controller-1/attrd/5, version=0.14.12)
Mar 27 11:41:41 docker(rabbitmq-bundle-docker-0)[52773]: ERROR: Newly created docker container exited after start
Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ Error: No such object: rabbitmq-bundle-docker-0 ]
Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ Error: No such object: rabbitmq-bundle-docker-0 ]
Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ ocf-exit-reason:monitor cmd failed (rc=126), output: rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped ]
Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [
 ]
Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ ocf-exit-reason:waiting on monitor_cmd to pass after start ]
Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ ocf-exit-reason:Newly created docker container exited after start ]
Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: info: log_finished: finished - rsc:rabbitmq-bundle-docker-0 action:start call_id:18 pid:52773 exit-code:1 exec-time:1705ms queue-time:0ms
Mar 27 11:41:41 [29685] overcloud-controller-0 crmd: notice: process_lrm_event: Result of start operation for rabbitmq-bundle-docker-0 on overcloud-controller-0: 1 (unknown error) | call=18 key=rabbitmq-bundle-docker-0_start_0 confirmed=true cib-update=29
Mar 27 11:41:41 [29685] overcloud-controller-0 crmd: notice: process_lrm_event: overcloud-controller-0-rabbitmq-bundle-docker-0_start_0:18 [ Error: No such object: rabbitmq-bundle-docker-0\nError: No such object: rabbitmq-bundle-docker-0\nocf-exit-reason:monitor cmd failed (rc=126), output: rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped\n\r\nocf-exit-reason:waiting on monitor_cmd to pass after start\nocf-exit-reason:Newly created docker c
Mar 27 11:41:41 [29680] overcl...
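
The useful signal in a log like this is the set of ocf-exit-reason messages that the resource agent writes to stderr. A rough sketch for pulling them out of corosync.log follows. The helper name `exit_reasons` is hypothetical, and the regular expression only assumes the `stderr [ ocf-exit-reason:... ]` shape visible in the lines above.

```python
import re

# Matches lrmd lines of the form "...:stderr [ ocf-exit-reason:<text> ]".
EXIT_REASON = re.compile(r"stderr \[ ocf-exit-reason:(.+?) \]\s*$")


def exit_reasons(log_text):
    """Collect distinct ocf-exit-reason messages in order of appearance."""
    reasons = []
    for line in log_text.splitlines():
        match = EXIT_REASON.search(line)
        if match and match.group(1) not in reasons:
            reasons.append(match.group(1))
    return reasons


# Three lines copied from the log excerpt above.
sample = "\n".join([
    "Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ Error: No such object: rabbitmq-bundle-docker-0 ]",
    "Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ ocf-exit-reason:waiting on monitor_cmd to pass after start ]",
    "Mar 27 11:41:41 [29682] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq-bundle-docker-0_start_0:52773:stderr [ ocf-exit-reason:Newly created docker container exited after start ]",
])
reasons = exit_reasons(sample)
for reason in reasons:
    print(reason)
```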

Michele Baldessari (michele) wrote :

The problem is that a selinux change must have happened recently. rlandy gave me and Damien an environment and here are our findings:
1) the HA containers fail due to:
++ cat /run_command
+ CMD=/usr/sbin/pacemaker_remoted
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/mariadb ]]
++ mkdir -p /var/log/kolla/mariadb
mkdir: cannot create directory '/var/log/kolla': Permission denied

It seems the containers previously ran with spc_t and now run with container_t. Because we automatically bind mount the following mount point:
"/var/log/pacemaker/bundles/galera-bundle-2:/var/log",

the selinux policy denies the write:
type=AVC msg=audit(1553699383.582:62715): avc: denied { write } for pid=505544 comm="mkdir" name="galera-bundle-0" dev="sda2" ino=239099332 scontext=system_u:system_r:container_t:s0:c194,c678 tcontext=system_u:object_r:cluster_var_log_t:s0 tclass=dir permissive=0
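
To make the denial easier to read, the AVC record can be split into its fields. The snippet below is an illustrative sketch only. The function name `parse_avc` is hypothetical, the input is the denial above joined onto a single line, and it assumes the usual `user:role:type:level` layout of an SELinux context.

```python
import re


def parse_avc(line):
    """Extract the denied permissions and the source/target types from an AVC line."""
    perms = re.search(r"denied\s+\{\s*([^}]*?)\s*\}", line)
    # key=value pairs; values are either quoted strings or bare tokens.
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    return {
        "permissions": perms.group(1).split() if perms else [],
        # Third colon-separated element of a context is the SELinux type.
        "source_type": fields["scontext"].split(":")[2],
        "target_type": fields["tcontext"].split(":")[2],
        "tclass": fields["tclass"],
        "comm": fields["comm"].strip('"'),
    }


avc_line = (
    'type=AVC msg=audit(1553699383.582:62715): avc: denied { write } for '
    'pid=505544 comm="mkdir" name="galera-bundle-0" dev="sda2" ino=239099332 '
    'scontext=system_u:system_r:container_t:s0:c194,c678 '
    'tcontext=system_u:object_r:cluster_var_log_t:s0 tclass=dir permissive=0'
)
denial = parse_avc(avc_line)
print(denial)
```

Read this way, the record says that `mkdir`, running as container_t, was denied `write` on a directory labeled cluster_var_log_t, which matches the "Permission denied" in the kolla_extend_start trace.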

Two ways to unblock things here:
A) We actually just fixed this issue for podman/rhel8 here:
https://github.com/redhat-openstack/openstack-selinux/pull/31 (https://github.com/redhat-openstack/openstack-selinux/commit/9d5f9f02baa6c10c14301e8d55269216f4107e6a)

I.e. if we get a new openstack-selinux with the above change it should just all work without needing any other change.

B) We revert whatever change moved the HA containers from spc_t to container_t context.

I vote for A), although I am not sure how to make that happen.

Alex Schultz (alex-schultz) wrote :

https://logs.rdoproject.org/62/644562/7/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/82b8093/logs/overcloud-controller-0/etc/selinux/ reports we're running enforcing. Upstream is not supposed to be running enforcing in any blocking context.
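
For checking the mode from a collected /etc/selinux/config, something like the following could be used. It is a sketch under assumptions: the helper name `selinux_mode` is hypothetical, and the sample text is an illustrative config in the standard `SELINUX=<mode>` format, not the actual file from the job logs.

```python
def selinux_mode(config_text):
    """Return the value of SELINUX= from an /etc/selinux/config style file, or None."""
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip comments and anything that is not a key=value assignment.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "SELINUX":
            return value.strip()
    return None


# Illustrative content only; not copied from the job's logs.
sample_config = """\
# This file controls the state of SELinux on the system.
SELINUX=enforcing
SELINUXTYPE=targeted
"""
mode = selinux_mode(sample_config)
print(mode)
```

The exact key comparison matters here, since a prefix match on `SELINUX` would also pick up the `SELINUXTYPE` line.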

Marios Andreou (marios-b) wrote :

Thanks for checking, bandini and dciabrin, and for all the info.

Via mwhahaha just now in #tripleo: apparently we shouldn't even be running enforcing in these jobs, so this may also be bad config in the job. Will have a closer look tomorrow and add info if I find it!

Michele Baldessari (michele) wrote :
Marios Andreou (marios-b) wrote :

@bandini - in order to get a newer selinux we'd need a promotion, so we have a chicken/egg problem since this is blocking promotion.

We may need to temporarily set selinux to permissive or disabled so we can promote, then remove that once the proper fix that you merged is released.

Looking at one of those jobs from the description I see openstack-selinux-0.8.18-0.20190312025835 (at [1] via [2]).

Trying to decide where to put the temp disable for now

[1] https://trunk.rdoproject.org/centos7/a4/a0/a4a0f16a376033f468236844eed756be35fd536f_9c2c4c8f
[2] https://sf.hosted.upshift.rdu2.redhat.com/logs/15/165815/14/check/periodic-tripleo-ci-centos-7-bm_envD-3ctlr_1comp-featureset001-master/e7111a3/logs/undercloud/etc/yum.repos.d/delorean.repo.txt.gz
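
A quick way to judge whether that package could contain a recent fix is to read the date out of its release field. The sketch below assumes the 14-digit suffix is a YYYYMMDDHHMMSS build timestamp, which is how it appears to be structured but is not confirmed in this report. The helper name `build_timestamp` is hypothetical.

```python
from datetime import datetime
import re


def build_timestamp(nvr):
    """Return the 14-digit timestamp embedded in a package name-version-release, or None."""
    match = re.search(r"\.(\d{14})(?:\D|$)", nvr)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


built = build_timestamp("openstack-selinux-0.8.18-0.20190312025835")
# Timestamp of the first error line quoted in the bug description.
first_failure = datetime(2019, 3, 25, 21, 6, 50)
print(built, built < first_failure)
```

Under that assumption the package in the job dates from 12 March 2019, about two weeks before the first failure in the description.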

Marios Andreou (marios-b) wrote :

Just posted https://review.openstack.org/#/c/648348/1/test-environments/worker-config.yaml . Not sure if it's the right place; discussing now with sagi on #tripleo.

Marios Andreou (marios-b) wrote :

Sorry, this is NOT blocking promotion jobs as I thought it was.

In which case we don't need a temp fix, assuming there has already been a release with the selinux fix.

If there has, then promotion should help us; otherwise we still have to wait for it or revisit landing a temp fix.

tags: removed: promotion-blocker
Marios Andreou (marios-b) wrote :

11:14 < ykarel> marios_|ruck, openstack-selinux is consumed from master
11:14 < ykarel> rdopkg info openstack-selinux
11:15 < marios_|ruck> ykarel: sshnaidm|rover ok great then we should be good with promotion in this case
11:15 < ykarel> so that pull request is merged 3 days ago

Marios Andreou (marios-b) wrote :

Sagi posted this https://review.openstack.org/#/c/648353/ ... so regardless of the fix we'll want to run centos jobs with permissive

Marios Andreou (marios-b) wrote :

So this should be fixed by https://github.com/redhat-openstack/openstack-selinux/pull/31 , so promotion will help, but we will also disable selinux for CI with https://review.openstack.org/#/c/648353/

Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/648353
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=1e01fc200398f873de0b7c3c5847a6f1262ba18b
Submitter: Zuul
Branch: master

commit 1e01fc200398f873de0b7c3c5847a6f1262ba18b
Author: Sagi Shnaidman <email address hidden>
Date: Thu Mar 28 11:25:42 2019 +0200

    Add selinux configuration for OVB jobs

    Add template with selinux config and disable it for CentOS in CI.
    Co-Author: Ronelle Landy <email address hidden>
    Closes-Bug: #1821744
    Change-Id: I9b1143152e4e120c1c1aff8f4a7882a4799eb776

Changed in tripleo:
status: In Progress → Fix Released
Marios Andreou (marios-b) wrote :

https://review.openstack.org/#/c/648353/ merged a couple of hours ago, but we also had a promotion yesterday, so the pull request from bandini should be there.

I can see at least one green run among the latest right now in https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-quickstart-extras 2.1.1

This issue was fixed in the openstack/tripleo-quickstart-extras 2.1.1 release.
