Pacemaker 'vip__public' can be stopped for a while when deploy finishes.

Bug #1455910 reported by Dennis Dmitriev
Affects                   Status        Importance  Assigned to      Milestone
Fuel for OpenStack        Fix Released  High        Vladimir Kuklin  —
Fuel for OpenStack 6.0.x  Invalid       High        Unassigned       —

Bug Description

Reproduced on different CI jobs: [1], [2]

When the deploy finishes (the cluster is marked 'ready'), system tests fail on their first access to the cluster because the slave nodes are inaccessible:

---------
Authorization Failed: Unable to establish connection to http://10.109.6.2:5000/v2.0/tokens
---------

Also, Nailgun marks the slaves as 'offline':
http://paste.openstack.org/show/225950/

But several seconds after the revert the nodes are 'online' again, and they were not rebooted:
http://paste.openstack.org/show/225951/

The hosts where the tests were running were not under load.
This issue needs to be investigated.

[1] smoke test 'deploy_ha_one_controller_flat', http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.centos.smoke_nova/357/

[2] system test 'deploy_ha_one_controller_zabbix', http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.centos.known_issues/123/

Andrey Maximov (maximov)
Changed in fuel:
importance: Undecided → High
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

It looks like the issue was caused by pacemaker restarting 'vip__public':

=== /node-1.test.domain.local/pengine.log :
2015-05-17T12:14:30.186762+00:00 notice: notice: LogActions: Stop vip__public (node-1.test.domain.local)

=== CI 6.1/job/6.1.centos.smoke_nova/357/console :
2015-05-17 12:15:59,477 - INFO fuel_web_client.py:765 -- Get ID of a last created cluster
...
Authorization Failed: Unable to establish connection to http://10.109.6.2:5000/v2.0/tokens

=== /node-1.test.domain.local/pengine.log :
2015-05-17T12:16:16.363780+00:00 notice: notice: LogActions: Start vip__public (node-1.test.domain.local)

We should cover this in system tests.

Changed in fuel:
assignee: nobody → MOS QA Team (mos-qa)
Changed in fuel:
assignee: MOS QA Team (mos-qa) → Dennis Dmitriev (ddmitriev)
status: New → In Progress
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Online/offline status is set by Nailgun based on the latest timestamp of a nailgun-agent run against the Nailgun REST API. The nailgun agent can use only the admin network and doesn't need routing, so the issue should not be related in any way to the public network or any other VIP we use.
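
For illustration only, that check is essentially a timestamp comparison; a minimal Python sketch, with a hypothetical threshold (Nailgun's actual keep-alive timeout may differ):

    from datetime import datetime, timedelta

    # Hypothetical threshold; the real keep-alive timeout may differ.
    KEEPALIVE_TIMEOUT = timedelta(seconds=30)

    def is_node_online(last_agent_checkin, now=None):
        """A node counts as 'online' while the last nailgun-agent
        check-in over the admin network is recent enough."""
        now = now or datetime.utcnow()
        return now - last_agent_checkin <= KEEPALIVE_TIMEOUT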

Revision history for this message
Mike Scherbakov (mihgen) wrote :

This seems to be the issue @nurla was investigating recently: you may have offline nodes after a revert, so you should wait > 30s (~1 min should be safe) before the nailgun agent makes a call. We assume that the REST API is ready to receive calls, but it takes some time for nailgun to start listening after a master node revert.
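
In test code this means adding a grace period or retry loop before the first API call after a revert; a minimal sketch, with the readiness check left as an abstract callable:

    import time

    def wait_for_api(check, timeout=60, interval=5):
        """Poll `check` (any callable that returns True once the REST
        API answers) instead of assuming the API is up right after a
        master node revert."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if check():
                return
            time.sleep(interval)
        raise AssertionError('REST API did not come up in %ss' % timeout)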

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

This is a different case, because we don't do any revert after the deploy.

The smoke test failed between L44 (waiting until the cluster becomes 'ready') and L45 (the first access to the cluster): https://github.com/stackforge/fuel-qa/blob/master/fuelweb_test/tests/test_ha_one_controller_base.py#L44-L47

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

The issue is in pacemaker's behaviour while the manifest configure_default_route.pp is executed.

1) The manifest 'configure_default_route.pp' reconfigures the default route (through br-ex). This breaks connectivity to the public network for a moment, but once the manifest finishes, connectivity is fine.

2) Pacemaker, which is already installed and managing cluster resources, detects the broken connectivity to the public network and tries to relocate the 'vip__public' resource. This is the standard action for connectivity issues.

3) Nailgun marks the cluster as 'ready' when all granular deployment tasks are completed. 'configure_default_route.pp' is one of the last manifests in the deployment process, so the cluster becomes 'ready' at the very moment pacemaker sees the restored connectivity to the public network and is about to start 'vip__public'.

4) Since system tests poll for the cluster status 'ready' at a certain interval, there are cases when a system test tries to access the OpenStack cluster before pacemaker has finished starting 'vip__public'. A guard like the sketch below avoids that race.
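
A sketch of such a guard: before the first cluster access, poll a controller until pacemaker reports 'vip__public' as running. The SSH helper is hypothetical; 'crm_resource --resource vip__public --locate' is a standard pacemaker CLI call, and the check of its output wording is an assumption:

    import time

    def wait_vip_started(run_on_controller, timeout=180, interval=10):
        """Block until pacemaker reports vip__public as running.
        `run_on_controller` is a hypothetical helper that executes a
        shell command on a controller via SSH and returns its stdout."""
        cmd = 'crm_resource --resource vip__public --locate'
        deadline = time.time() + timeout
        while time.time() < deadline:
            # Assumption about crm_resource output wording.
            if 'is running on' in run_on_controller(cmd):
                return
            time.sleep(interval)
        raise AssertionError('vip__public did not start within the timeout')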

summary: - Slaves suddenly went 'offline' by the end of deployment
+ Pacemaker 'vip__public' can be stopped for a while when deploy finishes.
tags: added: non-release system-tests
summary: - Pacemaker 'vip__public' can be stopped for a while when deploy finishes.
+ [system-tests] Pacemaker 'vip__public' can be stopped for a while when
+ deploy finishes.
Changed in fuel:
assignee: Dennis Dmitriev (ddmitriev) → Fuel Library Team (fuel-library)
tags: removed: non-release
tags: removed: fuel-ci system-tests
summary: - [system-tests] Pacemaker 'vip__public' can be stopped for a while when
- deploy finishes.
+ Pacemaker 'vip__public' can be stopped for a while when deploy finishes.
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Dennis, configure_default_route is not run on the controllers, so it cannot be the issue here.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The actual issue is that the public vip ping resource started while the public vip itself was starting too slowly. I will write a fix for it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184168

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Vladimir Kuklin (vkuklin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/184168
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=855e1a16c95bf7ee484fdd53e504dd32da733efb
Submitter: Jenkins
Branch: master

commit 855e1a16c95bf7ee484fdd53e504dd32da733efb
Author: Vladimir Kuklin <email address hidden>
Date: Tue May 19 04:21:10 2015 +0300

    Wait for virtual ip to start

    This fix makes puppet wait for virtual ip start
    after the corresponding ping location is added.

    Change-Id: I7b6cb2ea7caad31ffaf690f4b5ffb72262a96815
    Closes-bug: #1455910
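
The actual change is in Puppet (see the review above); conceptually it reorders the deployment so that puppet blocks until the VIP starts instead of returning right after the ping location constraint is added. A rough Python sketch of that ordering, with both pacemaker operations left as hypothetical callables:

    import time

    def deploy_public_vip(add_ping_location, vip_is_running, timeout=300):
        """Add the ping location constraint first, then wait for the
        VIP to actually start before declaring the step done."""
        add_ping_location()
        deadline = time.time() + timeout
        while time.time() < deadline:
            if vip_is_running():
                return
            time.sleep(5)
        raise RuntimeError('vip__public did not start in time')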

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/178966
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=aa50833aaebc598de37fcc5d617d77f894b569e7
Submitter: Jenkins
Branch: master

commit aa50833aaebc598de37fcc5d617d77f894b569e7
Author: Dennis Dmitriev <email address hidden>
Date: Thu May 14 17:02:13 2015 +0300

    Add two methods to wait for cluster HA and OS services ready

    assert_ha_services_ready():
     Uses the OSTF 'HA' test group to validate that a cluster is in
     the operational state. There are rabbitmq and mysql checks;
     haproxy and pacemaker checks will be added.

     Without these services the cluster can fail requests from tests.

    assert_os_services_ready():
     Uses the OSTF 'Sanity' test group to wait until the OpenStack
     services are ready.

    Change-Id: Ie1bddc965719ca59a143f8f43c53546a4553b1b9
    Closes-Bug: #1383247
    Closes-Bug: #1455910
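
A usage sketch for those methods: gate the first OpenStack call on the OSTF checks instead of trusting the 'ready' status alone. The method names come from the commit above; the `fuel_web` client object and its other details are simplified assumptions:

    def first_cluster_access(fuel_web, cluster_id):
        # Run the OSTF 'HA' checks (rabbitmq, mysql, ...) first, then
        # the 'Sanity' checks for the OpenStack services themselves.
        fuel_web.assert_ha_services_ready(cluster_id)
        fuel_web.assert_os_services_ready(cluster_id)
        # Only now is it safe to request a keystone token, etc.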

tags: added: on-verification
Revision history for this message
Sergey Novikov (snovikov) wrote :

Verified on fuel-6.1-469-2015-05-26_16-19-56.iso.

Steps to verify:
 1. Create a cluster in HA mode with 1 controller
 2. Add 1 node with the controller role
 3. Add 1 node with the compute role
 4. Deploy the cluster
 5. Validate that the cluster was set up correctly: no dead services and no errors in the logs
 6. Verify networks
 7. Verify the network configuration on the controller
 8. Run OSTF

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released