Minor updates cause control-plane downtime

Bug #1664650 reported by Steven Hardy
This bug affects 1 person
Affects: tripleo
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Originally reported at https://bugzilla.redhat.com/show_bug.cgi?id=1421883

It seems we've got service interruptions/restarts impacting control-plane uptime, even with no-op updates, so we need to investigate and confirm whether this is happening in upstream builds, and if so, why.

Revision history for this message
Steven Hardy (shardy) wrote :

The suggested reproducer is to run a second openstack overcloud deploy without changing anything while polling the overcloud APIs, e.g. openstack server list in a while-true loop; according to Graeme's report, this results in errors.
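
A minimal sketch of that polling loop (the overcloudrc path and the one-second interval are assumptions, not from the report):

source ~/overcloudrc          # credentials for the overcloud APIs (assumed path)
while true; do
    # any control-plane API call works; the report used "server list"
    openstack server list >/dev/null || echo "$(date -u +%FT%TZ) API error"
    sleep 1
done

Re-running the original deploy command unchanged in another terminal while this loop runs should then surface the errors.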

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → pike-1
tags: added: ocata-backport-potential
Revision history for this message
Steven Hardy (shardy) wrote :

So in terms of history, https://review.openstack.org/#/c/358511/ landed, which was supposed to ensure minor updates didn't restart everything, only the things that need it (which, for a no-op update, should really be nothing?)

It sounds like we need to reproduce and check the logs to see what actually gets restarted (e.g. do pacemaker-managed services still get restarted, causing errors in the API services that don't?)

Also, we should look at adding some liveness checks to the update CI job, so we can ensure the APIs really do stay up during minor updates.
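
A hedged sketch of what such a liveness check could look like (not an existing CI script; the backgrounded deploy and the zero-failure threshold are illustrative assumptions):

source ~/overcloudrc
openstack overcloud deploy --templates "$@" &   # the minor update under test
deploy_pid=$!
failures=0
while kill -0 "$deploy_pid" 2>/dev/null; do     # poll while the update runs
    openstack server list >/dev/null 2>&1 || failures=$((failures + 1))
    sleep 5
done
echo "API failures during update: $failures"
test "$failures" -eq 0    # fail the job if the API was ever unreachable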

Revision history for this message
Michele Baldessari (michele) wrote :

So I tried this on a fairly recent ocata env.
I started the overcloud deploy against an existing overcloud at the following time:
2017-02-14 18:25:45Z [Networks]: UPDATE_IN_PROGRESS state changed

The attached "services-being-restarted" file contains every service restart I grepped out of the logs on overcloud-controller-0.

So looking at httpd, I see the following in the puppet debug logs:
2017-02-14 18:32:26 +0000 Puppet (info): Computing checksum on file /etc/httpd/conf/ports.conf
2017-02-14 18:32:26 +0000 Puppet (info): FileBucket got a duplicate file {md5}8c87b9f11b8696570777c0f491819160
2017-02-14 18:32:26 +0000 /Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf] (info): Filebucketed /etc/httpd/conf/ports.conf to puppet with sum 8c87b9f11b8696570777c0f491819160
2017-02-14 18:32:26 +0000 /Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf]/content (notice): content changed '{md5}8c87b9f11b8696570777c0f491819160' to '{md5}737e2fe64473f2781e3c99022a27b6ff'
2017-02-14 18:32:26 +0000 /Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf] (debug): The container Concat[/etc/httpd/conf/ports.conf] will propagate my refresh event
2017-02-14 18:32:26 +0000 /Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf] (debug): The container /etc/httpd/conf/ports.conf will propagate my refresh event
2017-02-14 18:32:26 +0000 /etc/httpd/conf/ports.conf (debug): The container Concat[/etc/httpd/conf/ports.conf] will propagate my refresh event
2017-02-14 18:32:26 +0000 Concat[/etc/httpd/conf/ports.conf] (debug): The container Class[Apache] will propagate my refresh event
2017-02-14 18:32:26 +0000 Concat[/etc/httpd/conf/ports.conf] (info): Scheduling refresh of Class[Apache::Service]
2017-02-14 18:32:26 +0000 Class[Apache] (debug): The container Stage[main] will propagate my refresh event
2017-02-14 18:32:26 +0000 Class[Apache::Service] (info): Scheduling refresh of Service[httpd]
2017-02-14 18:32:26 +0000 Puppet (debug): Executing: '/usr/bin/systemctl is-active httpd'
2017-02-14 18:32:26 +0000 Puppet (debug): Executing: '/usr/bin/systemctl is-enabled httpd'
2017-02-14 18:32:26 +0000 Puppet (debug): Executing: '/usr/bin/systemctl is-active httpd'
2017-02-14 18:32:26 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart httpd'
2017-02-14 18:32:46 +0000 /Stage[main]/Apache::Service/Service[httpd] (notice): Triggered 'refresh' from 1 events
2017-02-14 18:32:46 +0000 /Stage[main]/Apache::Service/Service[httpd] (debug): The container Class[Apache::Service] will propagate my refresh event
2017-02-14 18:32:46 +0000 Class[Apache::Service] (debug): The container Stage[main] will propagate my refresh event
2017-02-14 18:32:46 +0000 /Stage[main]/Keystone::Deps/Anchor[keystone::service::end] (notice): Triggered 'refresh' from 7 events
2017-02-14 18:32:46 +0000 /Stage[main]/Keystone::Deps/Anchor[keystone::service::end] (debug): The container Class[Keystone::Deps] will propagate my refresh event
2017-02-14 18:32:46 +0000 Class[Keystone::Deps] (debug): The container Stage[main] will propagate my refresh event
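
An editorial aside, not part of the original comment: one way to confirm on the node that httpd really was restarted is to check when the unit last entered the active state:

# systemd records the time the unit last became active
systemctl show httpd --property=ActiveEnterTimestamp
# e.g. ActiveEnterTimestamp=Tue 2017-02-14 18:32:46 UTC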

At least for httpd it seems w...


Revision history for this message
Michele Baldessari (michele) wrote :

So here is the full list of restarts I have been able to observe (I grepped for restarts with timestamps after 2017-02-14 18:25:45Z, which is when I started the deploy command; a sketch of one such pipeline follows the list):
5 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart ntpd'
3 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-nova-api'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-proxy'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-object-updater'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-object-replicator'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-object-expirer'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-object-auditor'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-object'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-container-updater'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-container-replicator'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-container-auditor'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-container'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-account-replicator'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-account-reaper'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-account-auditor'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-swift-account'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-nova-scheduler'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-nova-novncproxy'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-nova-consoleauth'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart openstack-nova-conductor'
2 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart httpd'
1 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart neutron-server'
1 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart neutron-metadata-agent'
1 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart neutron-l3-agent'
1 +0000 Puppet (debug): Executing: '/usr/bin/systemctl restart neutron-dhcp-agent'
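
A sketch of a pipeline that produces counts in this shape (the exact command was not posted; the log filename is an assumption):

zcat update.log.gz \
  | awk '/systemctl restart/ && ($1 " " $2) >= "2017-02-14 18:25:45"' \
  | cut -d' ' -f3- | sort | uniq -c | sort -rn
# prints a count, then the log line with the date and time fields stripped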

Revision history for this message
Michele Baldessari (michele) wrote :

So at http://acksyn.org/files/tripleo/update.log.gz there is the full puppet log (42 MB uncompressed), where we can observe the restarts being triggered by puppet.

Revision history for this message
Michele Baldessari (michele) wrote :

So Alex's change here https://review.openstack.org/#/c/435011 will fix the unneeded restart for the swift services.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Just for visibility, Bug 1665443 possibly addresses the nova service restarts and Bug 1665405 possibly addresses the swift restarts.

Revision history for this message
Michele Baldessari (michele) wrote :

ntp restarts are tracked here: https://bugs.launchpad.net/bugs/1665426

Revision history for this message
Michele Baldessari (michele) wrote :

So with Alex's patches for the norpm provider and the nova filter patch, we have a definite improvement in the number of restarts:
- nova-api went from 3 to 1
- nova-* went from 2 to 1
- swift has no restarts any longer
- neutron-* and httpd stayed at 1 and 2 respectively

Here are the restarts divided by steps:
* Step1
restart ntpd
* Step2
restart ntpd
* Step3
restart ntpd
restart httpd
* Step4
restart ntpd
restart openstack-nova-conductor
restart openstack-nova-scheduler
restart openstack-nova-consoleauth
restart openstack-nova-novncproxy
restart httpd
restart openstack-nova-api
restart neutron-dhcp-agent
restart neutron-server
restart neutron-l3-agent
restart neutron-metadata-agent
* Step5
restart ntpd

tags: added: idempotency
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Changed in tripleo:
milestone: pike-3 → pike-rc1
Revision history for this message
Ben Nemec (bnemec) wrote :

What's the current situation on this? I know we merged a bunch of patches a while back to address the problem.

Revision history for this message
Steven Hardy (shardy) wrote :

I think the baremetal updates were fixed by a number of puppet-related patches, but we still have work remaining to enable zero-downtime minor updates in the new container architecture.

I'm not sure if it's reasonable to track the patches for the container-related stuff here (they are mostly posted but not assigned to any bug AFAIK), or if we should close this and raise a new one to track the progress towards fully working container minor updates.

Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Revision history for this message
Emilien Macchi (emilienm) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (FUTURE, PIKE, QUEENS, ROCKY, STEIN).
  Valid example: CONFIRMED FOR: FUTURE

Changed in tripleo:
importance: High → Undecided
status: Triaged → Expired