periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master - which includes idempotency check - is timing out during introspection

Bug #1810690 reported by Ronelle Landy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Bogdan Dobrelya

Bug Description

periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master has been failing in master promotion runs. The full log is available at:

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/d77f650/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz#_2019-01-05_12_32_23

fs020 should run with the options as fs001 with the following differences:
 - nodeset 1ctrl_1comp
 - runs the undercloud idempotent check
 - runs full tempest

The current fs020 file has fallen behind in updates and https://review.openstack.org/#/c/628052/ was added to sync fs020 up with changes to fs001. However, even with these changes, fs020 continued to stall during introspection.

https://review.rdoproject.org/r/#/c/18071 run fs001 with nodeset of 1ctl_1compt and runs the undercloud idempotent check . In this test job, we can reproduce the fs020 failure:

https://logs.rdoproject.org/71/18071/2/check/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset001-master/3e94eba/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz#_2019-01-06_20_11_39.

fs020 passes introspection in the Rocky promotion pipeline

Ronelle Landy (rlandy)
tags: added: alert promotion-blocker
Revision history for this message
Ronelle Landy (rlandy) wrote :

The error reported is:

u'failed_introspection': [u'446a0047-6853-45c4-920b-92622c20e114', u'49affcd3-4cad-4015-89f3-325add97601c'], u'result': u'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=62909211-fd12-45f1-bf3e-269f29211f6a] -> Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 2 nodes still failing introspection\n

Looking at the ironic-inspector logs:

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/d77f650/logs/undercloud/var/log/containers/ironic-inspector/ironic-inspector.log.txt.gz?level=ERROR

2019-01-05 12:09:02.761 9 ERROR oslo_db.sqlalchemy.engines [-] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)

2019-01-05 12:09:05.771 9 ERROR ironic_inspector.conductor.manager [-] Periodic clean up of node cache failed: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)

and

2019-01-05 16:26:11.299 8 ERROR ironic.drivers.modules.ipmitool [req-9699d061-d9e8-4a48-9dfa-6f7b92406771 - - - - -] IPMI Error while attempting "ipmitool -I lanplus -H 192.168.100.216 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmpDVbPcz power status" for node 446a0047-6853-45c4-920b-92622c20e114. Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 192.168.100.216 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmpDVbPcz power status

in

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/d77f650/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz?level=ERROR

This seems similar to a lot of what was discussed in https://bugs.launchpad.net/tripleo/+bug/1784847 but this error is reproducible with:
 - master
 - 1cont_1comp
 - idempotent check

Revision history for this message
Ronelle Landy (rlandy) wrote :

Also journal shows:

Exception during message handling: RuntimeError: Can not schedule new futures after being shutdown

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/d77f650/logs/undercloud/var/log/journal.txt.gz#_Jan_05_12_32_30

https://bugzilla.redhat.com/show_bug.cgi?id=1622719 may have some relevant info. Note the piece about restarting networks.

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → stein-2
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Ronelle Landy (rlandy) wrote :

No - this is completely reproducible in the promotion - with the conditions listed above.
Suspicion is the order in which we re-install and restart undercloud containers.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Problems with DB start at
           2019-01-05T11:49:56 ./undercloud/var/log/containers/mysql/mariadb.log.txt.gz 213 [Warning] Aborted connection 213 to db: 'heat' user: 'heat' host: 'undercloud.localdomain' (Got an error reading communication packets)

While the 2nd deployment starts to do its idempotency check at 2019-01-05 11:58:16.952
So this does not look like the result of overlapping deployments.

Revision history for this message
Ronelle Landy (rlandy) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Right, as Ronelle has found out, that message appears in the passing fs01 jobs and happened to be a red herring :-(

Perhaps the *real* issue starts later and is about AMQP and not DB, which is logged firstly as

Jan 05 12:32:30 ./undercloud/var/log/journal.txt.gz undercloud.localdomain podman[78695]: 2019-01-05T12:32:30.203 9 ERROR oslo_messaging.rpc.server [-] Exception during message handling: RuntimeError: Can not schedule new futures after being shutdown

and that time does overlap with the 2nd re-deployment being executed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

but that message is much later than the 2nd deployment ends at 2019-01-05 12:14:51.197, so nothing overlaps here again. We need to dig it reproduced live.

Revision history for this message
Ronelle Landy (rlandy) wrote :
Download full text (19.9 KiB)

Maybe this is relevant: https://bugs.launchpad.net/tripleo/+bug/1802203?

I have a reproducer environment:

(undercloud) [zuul@undercloud ironic-inspector]$ sudo podman container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
94462af66f6b 192.168.24.1:8787/tripleomaster/centos-binary-neutron-dhcp-agent:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago neutron-dnsmasq-qdhcp-6875d33b-abe0-4740-af2a-7c2173e0cbd5
86d90bad34d4 192.168.24.1:8787/tripleomaster/centos-binary-nova-api:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Exited (0) About an hour ago nova_api_discover_hosts
4161d5a237bd 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago mysql
f45da414e374 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Exited (0) About an hour ago mysql_bootstrap
5c62e46665f7 192.168.24.1:8787/tripleomaster/centos-binary-mistral-api:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Exited (0) About an hour ago mistral_db_populate
f81492036467 192.168.24.1:8787/tripleomaster/centos-binary-nova-compute-ironic:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago nova_compute
276949e6be1f 192.168.24.1:8787/tripleomaster/centos-binary-ironic-inspector:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago ironic_inspector_dnsmasq
98b950558c0e 192.168.24.1:8787/tripleomaster/centos-binary-ironic-inspector:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago ironic_inspector
4f4a0da03f68 192.168.24.1:8787/tripleomaster/centos-binary-ironic-pxe:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago ironic_pxe_http
7a0094adbd2d 192.168.24.1:8787/tripleomaster/centos-binary-ironic-pxe:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago ironic_pxe_tftp
b499be996373 192.168.24.1:8787/tripleomaster/centos-binary-ironic-conductor:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago ironic_conductor
ec9bf9f82ebf 192.168.24.1:8787/tripleomaster/centos-binary-ironic-neutron-agent:current-tripleo-updated-20190107205122 dumb-init --singl... About an hour ago Up About an hour ago ironic_neutron_agent
155dcb2df0a3 192.168.24.1:8787/tripleomaster/centos-binary-ironic-conductor:current-tripleo-updated-201901...

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

For the reproduced env, I cannot see it relates to https://bugs.launchpad.net/tripleo/+bug/1802203. There is no avc events like in that bug.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

paunch logs (not collected for CI jobs, let's fix that!) show that it recreates mysql container, which breaks idempotency. That may be affecting the future deployment and failing introspection:

mariadb log snippet:
2019-01-07 21:50:51 0 [ERROR] mysqld: Table './mysql/user' is marked as crashed and should be repaired

/var/log/paunch.log:
2019-01-07 21:50:31.365 160239 DEBUG paunch [ ] $ podman inspect --type container --format {{index .Config.Labels "config_data"}} mysql
2019-01-07 21:50:31.500 160239 DEBUG paunch [ ] {"environment": ["KOLLA_CONFIG_STRATEGY=COPY_ALWAYS", "TRIPLEO_CONFIG_HASH=3192ff4937a32c37
022f976976648db6"], "start_order": 2, "volumes": ["/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json", "/var/li
b/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro", "/etc/localtime:/etc/localtime:ro", "/etc/hosts:/etc/hosts:ro", "
/var/lib/mysql:/var/lib/mysql", "/var/log/containers/mysql:/var/log/mariadb"], "healthcheck": {"test": "/openstack/healthcheck"}, "image": "
192.168.24.1:8787/tripleomaster/centos-binary-mariadb:current-tripleo-updated-20190107205122", "net": "host", "restart": "always"}
2019-01-07 21:50:31.500 160239 DEBUG paunch [ ]
2019-01-07 21:50:31.501 160239 DEBUG paunch [ ] Deleting container (changed config_data): mysql
2019-01-07 21:50:31.501 160239 DEBUG paunch [ ] Stopping and disabling systemd service for tripleo_mysql
2019-01-07 21:50:41.982 160239 DEBUG paunch [ ] Removing systemd unit file tripleo_mysql.service
2019-01-07 21:50:42.120 160239 DEBUG paunch [ ] Stopping and disabling systemd service for tripleo_mysql
2019-01-07 21:50:42.260 160239 DEBUG paunch [ ] Removing systemd unit file tripleo_mysql_healthcheck.service
2019-01-07 21:50:42.384 160239 DEBUG paunch [ ] Stopping and disabling systemd service for tripleo_mysql
2019-01-07 21:50:42.523 160239 DEBUG paunch [ ] Removing systemd unit file tripleo_mysql_healthcheck.timer
2019-01-07 21:50:42.644 160239 DEBUG paunch [ ] $ podman rm -f mysql

...
2019-01-07 21:50:49.202 160239 DEBUG paunch [ ] Running container: mysql
2019-01-07 21:50:49.337 160239 DEBUG paunch [ ] $ podman create --name mysql --label config_id=tripleo_step2

The issues are:
1) Nothing should be recreated here
2) It takes a 7 seconds of downtime
3) Graceful stop is not used, which is not good and corrupts tables so it has to be recovered (see mariadb logs snippet)

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Ronelle Landy (rlandy) wrote :

Test jobs with debug running in RDO Cloud: https://review.rdoproject.org/r/#/c/18058/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to paunch (master)

Fix proposed to branch: master
Review: https://review.openstack.org/629408

Changed in tripleo:
assignee: nobody → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
tags: added: pike-backport-potential queens-backport-potential rocky-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to paunch (master)

Reviewed: https://review.openstack.org/629408
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=5b925ef7bb3549a97ae8ee4ef5d66a08fced7154
Submitter: Zuul
Branch: master

commit 5b925ef7bb3549a97ae8ee4ef5d66a08fced7154
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Jan 8 17:44:10 2019 +0200

    Fix comparing dictionaries for changes detection

    Non-unicode and unicode dicts cannot be compared.
    That makes paunch failing to detect config changes and
    rebuilding containers w/o a need.

    Change-Id: Ie2db16b982bf1851ab37c3514c1bc82cfc64d0b3
    Closes-Bug: #1810690
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in tripleo:
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to paunch (master)

Fix proposed to branch: master
Review: https://review.openstack.org/630165

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to paunch (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/630219

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/paunch 4.2.0

This issue was fixed in the openstack/paunch 4.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to paunch (master)

Reviewed: https://review.openstack.org/630165
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=5558e521abc48f5cf1bd3fcfdaec1149a05dcfaa
Submitter: Zuul
Branch: master

commit 5558e521abc48f5cf1bd3fcfdaec1149a05dcfaa
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Jan 11 12:35:36 2019 +0100

    Fix comparing dictionaries for changes detection

    Non-unicode and unicode dicts cannot be compared.
    That makes paunch failing to detect config changes and
    rebuilding containers w/o a need.

    Use yaml safe_load in order to produce a non-unicode dict
    for the right comparison of config data.

    Change-Id: I507f4b07af2c515fd8548af7dbe6b35ad2a269a9
    Closes-Bug: #1810690
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to paunch (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/630750

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to paunch (stable/rocky)

Reviewed: https://review.openstack.org/630750
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=e52fdc085412b7631364b1c93c98a5e5a25d4320
Submitter: Zuul
Branch: stable/rocky

commit e52fdc085412b7631364b1c93c98a5e5a25d4320
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Jan 11 12:35:36 2019 +0100

    Fix comparing dictionaries for changes detection

    Non-unicode and unicode dicts cannot be compared.
    That makes paunch failing to detect config changes and
    rebuilding containers w/o a need.

    Use yaml safe_load in order to produce a non-unicode dict
    for the right comparison of config data.

    NOTE: This change is different for <= Rocky as we refactored in Stein.
    Conflicts:
     paunch/builder/base.py
    Change-Id: I507f4b07af2c515fd8548af7dbe6b35ad2a269a9
    Closes-Bug: #1810690
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 5558e521abc48f5cf1bd3fcfdaec1149a05dcfaa)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to paunch (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/631005

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to paunch (stable/queens)

Reviewed: https://review.openstack.org/631005
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=9d38fc83443961166553d15705439d2d9066ea2c
Submitter: Zuul
Branch: stable/queens

commit 9d38fc83443961166553d15705439d2d9066ea2c
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Jan 11 12:35:36 2019 +0100

    Fix comparing dictionaries for changes detection

    Non-unicode and unicode dicts cannot be compared.
    That makes paunch failing to detect config changes and
    rebuilding containers w/o a need.

    Use yaml safe_load in order to produce a non-unicode dict
    for the right comparison of config data.

    NOTE: This change is different for <= Rocky as we refactored in Stein.
    Conflicts:
     paunch/builder/base.py
    Change-Id: I507f4b07af2c515fd8548af7dbe6b35ad2a269a9
    Closes-Bug: #1810690
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 5558e521abc48f5cf1bd3fcfdaec1149a05dcfaa)
    (cherry picked from commit e52fdc085412b7631364b1c93c98a5e5a25d4320)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to paunch (master)

Reviewed: https://review.openstack.org/630219
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=b5a14c6cd0be76156852afdc8c664e37e88bc7b3
Submitter: Zuul
Branch: master

commit b5a14c6cd0be76156852afdc8c664e37e88bc7b3
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jan 9 12:51:45 2019 +0200

    Always attempt a graceful container stop/remove

    When paunch removes a container, like applying a container's config
    changes, it does that not gracefully. That poses an issue for some
    cases, like stateful services. To mitigate that, attempt stopping the
    container prior to removing it. And use the defined stop signal and
    timeout settings for such a container, when stopping it. If there is no
    signal/timeout in the config, fallback to deaults of SIGTERM and a 10s.

    Related-bug: #1810690
    Change-Id: I7c72ea055984bb5ff3ea7cea019693d61245212f
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

https://review.openstack.org/630219 is reverted in https://review.openstack.org/635156. Let's address graceful stopping of containers via another bug and another implementation

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Basically paunch already supports --stop-signal / --stop-timeout for creating containers.
The only change would be nice to add is using --timeout arg for the action stop as well.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/635161

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/635161
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=6f262c8051143f22d7be77fc3a359b63caae6cfe
Submitter: Zuul
Branch: master

commit 6f262c8051143f22d7be77fc3a359b63caae6cfe
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Feb 6 15:00:33 2019 +0100

    Apply stop period for paunch-managed stateful svcs

    Apply stop_grace_period of a 60s for stateful services managed via
    paunch. That gives it time to attempt stopping it gracefully, while
    applying configuration updates.

    Also switch non-HA mysql template to run mysqld directly.
    This is because myslqd_safe does not process the graceful stop
    signals well, so we need to signal mysqld directly.
    Note: mysqld_safe helps for better coverging HA clusters so
    we leave it as is for the HA template.

    Change-Id: I329f871b6f2509c0133823ca4c22238e3e19c76d
    Related-bug: #1810690
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/paunch 4.3.0

This issue was fixed in the openstack/paunch 4.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/paunch 3.2.1

This issue was fixed in the openstack/paunch 3.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/paunch 2.5.1

This issue was fixed in the openstack/paunch 2.5.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.