Pacemaker 'vip__public' can be stopped for a while when deploy finishes.

Bug #1455910 reported by Dennis Dmitriev
Affects                   Status        Importance  Assigned to      Milestone
Fuel for OpenStack        Fix Released  High        Vladimir Kuklin  —
Fuel for OpenStack 6.0.x  Invalid       High        Unassigned       —

Bug Description

Reproduced on different CI jobs: [1], [2]

When the deploy finishes (the cluster is marked 'ready'), system tests fail on their first access to the cluster because the slave nodes are inaccessible:

---------
Authorization Failed: Unable to establish connection to http://10.109.6.2:5000/v2.0/tokens
---------

Also, Nailgun marks the slaves as 'offline':
http://paste.openstack.org/show/225950/

But several seconds after the revert the nodes are 'online' again, and they were not rebooted:
http://paste.openstack.org/show/225951/

The hosts where the tests were running were not under load.
This issue needs to be investigated.

[1] smoke test 'deploy_ha_one_controller_flat', http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.centos.smoke_nova/357/

[2] system test 'deploy_ha_one_controller_zabbix', http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.centos.known_issues/123/

Andrey Maximov (maximov)
Changed in fuel:
importance: Undecided → High
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

It looks like the issue was caused by pacemaker restarting 'vip__public':

=== /node-1.test.domain.local/pengine.log :
2015-05-17T12:14:30.186762+00:00 notice: notice: LogActions: Stop vip__public (node-1.test.domain.local)

=== CI 6.1/job/6.1.centos.smoke_nova/357/console :
2015-05-17 12:15:59,477 - INFO fuel_web_client.py:765 -- Get ID of a last created cluster
...
Authorization Failed: Unable to establish connection to http://10.109.6.2:5000/v2.0/tokens

=== /node-1.test.domain.local/pengine.log :
2015-05-17T12:16:16.363780+00:00 notice: notice: LogActions: Start vip__public (node-1.test.domain.local)

We should cover this in system tests.

Changed in fuel:
assignee: nobody → MOS QA Team (mos-qa)
Changed in fuel:
assignee: MOS QA Team (mos-qa) → Dennis Dmitriev (ddmitriev)
status: New → In Progress
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Online/offline status is set by Nailgun based on the latest timestamp of a nailgun-agent run against the Nailgun REST API. The nailgun agent can use only the admin network and doesn't need routing, so the issue should not be related in any way to the public network or any other VIP we use.
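
For illustration only, that check is essentially a timestamp comparison; a minimal Python sketch, with a hypothetical threshold (Nailgun's actual keep-alive timeout may differ):

    from datetime import datetime, timedelta

    # Hypothetical threshold; the real keep-alive timeout may differ.
    KEEPALIVE_TIMEOUT = timedelta(seconds=30)

    def is_node_online(last_agent_checkin, now=None):
        """A node counts as 'online' while the last nailgun-agent
        check-in over the admin network is recent enough."""
        now = now or datetime.utcnow()
        return now - last_agent_checkin <= KEEPALIVE_TIMEOUT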

Revision history for this message
Mike Scherbakov (mihgen) wrote :

This seems to be the issue @nurla was investigating recently: you may have offline nodes after a revert, so you should wait > 30s (~1 min should be safe) before the nailgun agent makes a call. We assume that the REST API is ready to receive calls, but it takes some time for nailgun to start listening after a master node revert.
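
In test code this means adding a grace period or retry loop before the first API call after a revert; a minimal sketch, with the readiness check left as an abstract callable:

    import time

    def wait_for_api(check, timeout=60, interval=5):
        """Poll `check` (any callable that returns True once the REST
        API answers) instead of assuming the API is up right after a
        master node revert."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if check():
                return
            time.sleep(interval)
        raise AssertionError('REST API did not come up in %ss' % timeout)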

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

This is a different case, because we don't do any revert after the deploy.

The smoke test failed between L44 (waiting until the cluster becomes 'ready') and L45 (the first access to the cluster): https://github.com/stackforge/fuel-qa/blob/master/fuelweb_test/tests/test_ha_one_controller_base.py#L44-L47

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

The issue is in pacemaker's behaviour while the manifest configure_default_route.pp is executed.

1) The manifest 'configure_default_route.pp' reconfigures the default route (through br-ex). This breaks connectivity to the public network for a moment, but once the manifest finishes, connectivity is fine.

2) Pacemaker, which is already installed and managing cluster resources, detects the broken connectivity to the public network and tries to relocate the 'vip__public' resource. This is the standard action for connectivity issues.

3) Nailgun marks the cluster as 'ready' when all granular deployment tasks are completed. 'configure_default_route.pp' is one of the last manifests in the deployment process, so the cluster becomes 'ready' at the very moment pacemaker sees the restored connectivity to the public network and is about to start 'vip__public'.

4) Since system tests poll for the cluster status 'ready' at a certain interval, there are cases when a system test tries to access the OpenStack cluster before pacemaker has finished starting 'vip__public'. A guard like the sketch below avoids that race.
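
A sketch of such a guard: before the first cluster access, poll a controller until pacemaker reports 'vip__public' as running. The SSH helper is hypothetical; 'crm_resource --resource vip__public --locate' is a standard pacemaker CLI call, and the check of its output wording is an assumption:

    import time

    def wait_vip_started(run_on_controller, timeout=180, interval=10):
        """Block until pacemaker reports vip__public as running.
        `run_on_controller` is a hypothetical helper that executes a
        shell command on a controller via SSH and returns its stdout."""
        cmd = 'crm_resource --resource vip__public --locate'
        deadline = time.time() + timeout
        while time.time() < deadline:
            # Assumption about crm_resource output wording.
            if 'is running on' in run_on_controller(cmd):
                return
            time.sleep(interval)
        raise AssertionError('vip__public did not start within the timeout')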

summary: - Slaves suddenly went 'offline' by the end of deployment
+ Pacemaker 'vip__public' can be stopped for a while when deploy finishes.
tags: added: non-release system-tests
summary: - Pacemaker 'vip__public' can be stopped for a while when deploy finishes.
+ [system-tests] Pacemaker 'vip__public' can be stopped for a while when
+ deploy finishes.
Changed in fuel:
assignee: Dennis Dmitriev (ddmitriev) → Fuel Library Team (fuel-library)
tags: removed: non-release
tags: removed: fuel-ci system-tests
summary: - [system-tests] Pacemaker 'vip__public' can be stopped for a while when
- deploy finishes.
+ Pacemaker 'vip__public' can be stopped for a while when deploy finishes.
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Dennis, configure_default_route is not run on the controllers, so it cannot be the issue here.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The actual issue is that the public vip ping resource started while the public vip itself was starting too slowly. I will write a fix for it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184168

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Vladimir Kuklin (vkuklin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/184168
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=855e1a16c95bf7ee484fdd53e504dd32da733efb
Submitter: Jenkins
Branch: master

commit 855e1a16c95bf7ee484fdd53e504dd32da733efb
Author: Vladimir Kuklin <email address hidden>
Date: Tue May 19 04:21:10 2015 +0300

    Wait for virtual ip to start

    This fix makes puppet wait for virtual ip start
    after the corresponding ping location is added.

    Change-Id: I7b6cb2ea7caad31ffaf690f4b5ffb72262a96815
    Closes-bug: #1455910
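
The actual change is in Puppet (see the review above); conceptually it reorders the deployment so that puppet blocks until the VIP starts instead of returning right after the ping location constraint is added. A rough Python sketch of that ordering, with both pacemaker operations left as hypothetical callables:

    import time

    def deploy_public_vip(add_ping_location, vip_is_running, timeout=300):
        """Add the ping location constraint first, then wait for the
        VIP to actually start before declaring the step done."""
        add_ping_location()
        deadline = time.time() + timeout
        while time.time() < deadline:
            if vip_is_running():
                return
            time.sleep(5)
        raise RuntimeError('vip__public did not start in time')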

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/178966
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=aa50833aaebc598de37fcc5d617d77f894b569e7
Submitter: Jenkins
Branch: master

commit aa50833aaebc598de37fcc5d617d77f894b569e7
Author: Dennis Dmitriev <email address hidden>
Date: Thu May 14 17:02:13 2015 +0300

    Add two methods to wait for cluster HA and OS services ready

    assert_ha_services_ready():
     Uses the OSTF 'HA' test group to validate that a cluster is in
     the operational state. There are rabbitmq and mysql checks;
     haproxy and pacemaker checks will be added.

     Without these services the cluster can fail requests from tests.

    assert_os_services_ready():
     Uses the OSTF 'Sanity' test group to wait until the OpenStack
     services are ready.

    Change-Id: Ie1bddc965719ca59a143f8f43c53546a4553b1b9
    Closes-Bug: #1383247
    Closes-Bug: #1455910
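
A usage sketch for those methods: gate the first OpenStack call on the OSTF checks instead of trusting the 'ready' status alone. The method names come from the commit above; the `fuel_web` client object and its other details are simplified assumptions:

    def first_cluster_access(fuel_web, cluster_id):
        # Run the OSTF 'HA' checks (rabbitmq, mysql, ...) first, then
        # the 'Sanity' checks for the OpenStack services themselves.
        fuel_web.assert_ha_services_ready(cluster_id)
        fuel_web.assert_os_services_ready(cluster_id)
        # Only now is it safe to request a keystone token, etc.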

tags: added: on-verification
Revision history for this message
Sergey Novikov (snovikov) wrote :

Verified on fuel-6.1-469-2015-05-26_16-19-56.iso.

Steps to verify:
 1. Create a cluster in HA mode with 1 controller
 2. Add 1 node with the controller role
 3. Add 1 node with the compute role
 4. Deploy the cluster
 5. Validate that the cluster was set up correctly: no dead services and no errors in the logs
 6. Verify networks
 7. Verify the network configuration on the controller
 8. Run OSTF

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released