Timeout exceeded during network verification for environment with multirole nodes

Bug #1613246 reported by Tatyana Kuterina
This bug affects 1 person
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Georgy Kibardin
Mitaka              Fix Released   High        Georgy Kibardin

Bug Description

Detailed bug description:
Found on CI:
    https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.plugins.thread_2_separate_services/29/consoleFull

Test: Deploy cluster with 3 nodes with db, keystone, rabbit, horizon
Test Group: separate_all_service

Steps to reproduce:
        1. Create cluster
        2. Add 3 nodes with controller role
        3. Add 3 nodes with database, keystone, rabbit,
           horizon
        4. Add 1 compute and cinder
        5. Verify networks

Expected results:
    Networks verification finished without errors
Actual result:
    Error appears: TimeoutError: Waiting task u'verify_networks' timeout 300 sec was exceeded

Description of the environment:
    9.1 snapshot #136

https://drive.google.com/a/mirantis.com/file/d/0Bz15vbpS5ZPNczRCWHhUYWxZWm8/view?usp=sharing

Tags: area-python
tags: added: area-python
tags: added: swarm-blocker
Nikita Zubkov (zubchick) wrote :

Is bug still valid and swarm-blocker?

Artem Roma (aroma-x) wrote :

I have reviewed the swarm bug reports for the last three days (as of 25.08) and this failure has not been reproduced. Why is this a swarm blocker with high priority?

Artem Roma (aroma-x) wrote :

Anyway, further investigation may be needed, so please provide access to an environment on which the issue shows itself, if it ever does again.

Nikita Zubkov (zubchick) wrote :
ElenaRossokhina (esolomina) wrote :
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/newton
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
status: Incomplete → Confirmed
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Georgy Kibardin (gkibardin) wrote :

Initially mcollective on node-4 successfully connected:

I, [2016-09-05T02:33:08.431555 #1376] INFO -- : rabbitmq.rb:10:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective@10.109.10.2:61613
I, [2016-09-05T02:33:08.449448 #1376] INFO -- : rabbitmq.rb:15:in `on_connected' Conncted to stomp://mcollective@10.109.10.2:61613

However, later something broke:

D, [2016-09-05T04:37:14.662179 #1687] DEBUG -- : rabbitmq.rb:66:in `on_hbfire' Publishing heartbeat to stomp://mcollective@10.109.10.2:61613: send_fire, {:curt=>1473050234.661921, :last_sleep=>30.49950408935547}
D, [2016-09-05T04:37:27.170574 #1687] DEBUG -- : rabbitmq.rb:64:in `on_hbfire' Received heartbeat from stomp://mcollective@10.109.10.2:61613: receive_fire, {:curt=>1473050247.1704204}
E, [2016-09-05T04:37:27.170770 #1687] ERROR -- : rabbitmq.rb:50:in `on_hbread_fail' Heartbeat read failed from 'stomp://mcollective@10.109.10.2:61613': {"ticker_interval"=>29.5, "read_fail_count"=>0, "lock_fail"=>true, "lock_fail_count"=>2}
E, [2016-09-05T04:37:27.171284 #1687] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.109.10.2:61613: es_oldrecv: receive failed: stream closed
I, [2016-09-05T04:37:27.171477 #1687] INFO -- : rabbitmq.rb:10:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective@10.109.10.2:61613
I, [2016-09-05T04:37:27.185050 #1687] INFO -- : rabbitmq.rb:15:in `on_connected' Conncted to stomp://mcollective@10.109.10.2:61613
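
To check other node logs for the same pattern, one option is to look for an `on_hbread_fail` line followed by a new `on_connected` line from the same process. This is a minimal sketch that assumes the log format shown above. The function name `find_heartbeat_reconnects` and the shortened sample lines are illustrative and not part of mcollective:

```python
import re

# Matches the logger prefix and callback name seen in the excerpt above, e.g.
# "E, [2016-09-05T04:37:27.170770 #1687] ERROR -- : rabbitmq.rb:50:in `on_hbread_fail' ..."
LINE_RE = re.compile(
    r"^[A-Z], \[(?P<ts>\S+) #(?P<pid>\d+)\]\s+\w+ -- : "
    r"rabbitmq\.rb:\d+:in `(?P<callback>\w+)'"
)


def find_heartbeat_reconnects(lines):
    """Return (failure_ts, reconnect_ts) pairs: a heartbeat read failure
    followed by the next successful connection from the same process."""
    pending = {}  # pid -> timestamp of the last unmatched heartbeat failure
    pairs = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        pid = match.group("pid")
        ts = match.group("ts")
        callback = match.group("callback")
        if callback == "on_hbread_fail":
            pending[pid] = ts
        elif callback == "on_connected" and pid in pending:
            pairs.append((pending.pop(pid), ts))
    return pairs


# Sample lines adapted from the excerpt above (message tails shortened).
SAMPLE = [
    "I, [2016-09-05T02:33:08.449448 #1376] INFO -- : rabbitmq.rb:15:in `on_connected' Conncted to stomp://mcollective@10.109.10.2:61613",
    "E, [2016-09-05T04:37:27.170770 #1687] ERROR -- : rabbitmq.rb:50:in `on_hbread_fail' Heartbeat read failed",
    "I, [2016-09-05T04:37:27.185050 #1687] INFO -- : rabbitmq.rb:15:in `on_connected' Conncted to stomp://mcollective@10.109.10.2:61613",
]

if __name__ == "__main__":
    for failed_at, reconnected_at in find_heartbeat_reconnects(SAMPLE):
        print(f"heartbeat read failed at {failed_at}, reconnected at {reconnected_at}")
```

The initial connection from process 1376 is ignored because no heartbeat failure precedes it. Only the failure and reconnect from process 1687 are reported.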

Georgy Kibardin (gkibardin) wrote :

We had a similar issue, https://bugs.launchpad.net/fuel/+bug/1298262, a long time ago.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/369344

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/369546

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Maksim Malchuk (mmalchuk)
Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Georgy Kibardin (gkibardin)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/370015

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-agent (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/370025

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-agent (master)

Reviewed: https://review.openstack.org/369344
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=b50241a7b243f553cc35e521ab99bb7f94d8b54a
Submitter: Jenkins
Branch: master

commit b50241a7b243f553cc35e521ab99bb7f94d8b54a
Author: Georgy Kibardin <email address hidden>
Date: Tue Sep 13 14:08:34 2016 +0300

    Ignore heartbeats lock fails

    Stomp heartbeat handling is quite poorly designed. It happens in a
    separate thread which sleeps, then tries to read a heartbeat. If the
    read mutex is held by the message-receiving thread, the read fails and
    the lock failure count is incremented. Upon reaching the limit (2 by
    default in our packages), it forcibly closes the connection, causing a
    reconnect. Setting the value to 0 turns the feature off.

    Change-Id: I2187ce69508c530073582c542c963014acc5123a
    Closes-Bug: #1613246
    Closes-Bug: #1298262
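
The mechanism described in the commit message can be modelled in a few lines. This is a toy sketch of the description only, not the Stomp gem's code. The class `HeartbeatChecker`, its attribute names, and the reset of the counter after a successful read are all assumptions made for illustration:

```python
import threading


class HeartbeatChecker:
    """Toy model of the heartbeat check described in the commit message."""

    def __init__(self, max_lock_fails=2):
        self.read_lock = threading.Lock()  # shared with the message receiver
        self.max_lock_fails = max_lock_fails  # 0 disables the check
        self.lock_fail_count = 0
        self.connected = True

    def tick(self):
        """One wake-up of the heartbeat thread; returns True if still connected."""
        if self.read_lock.acquire(blocking=False):
            try:
                # Assumption: a successful heartbeat read resets the counter.
                self.lock_fail_count = 0
            finally:
                self.read_lock.release()
        else:
            # The receiver holds the lock, so the heartbeat cannot be read.
            self.lock_fail_count += 1
            if 0 < self.max_lock_fails <= self.lock_fail_count:
                self.connected = False  # forced close, which triggers a reconnect
        return self.connected


def run(max_lock_fails, ticks=3):
    """Tick the checker while a simulated receiver holds the read lock."""
    checker = HeartbeatChecker(max_lock_fails)
    with checker.read_lock:
        return [checker.tick() for _ in range(ticks)]


if __name__ == "__main__":
    print(run(max_lock_fails=2))  # connection is closed on the second failed tick
    print(run(max_lock_fails=0))  # check disabled, connection stays up
```

With a limit of 2, a receiver that stays busy for two heartbeat intervals is enough to drop the connection. With a limit of 0, lock failures are still counted but never acted on.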

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/370015
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=8318d7056556337f17f596edad9d7eed48ec3ca5
Submitter: Jenkins
Branch: stable/mitaka

commit 8318d7056556337f17f596edad9d7eed48ec3ca5
Author: Georgy Kibardin <email address hidden>
Date: Tue Sep 13 18:43:45 2016 +0300

    Ignore heartbeats lock fails

    Stomp heartbeat handling is quite poorly designed. It happens in a
    separate thread which sleeps, then tries to read a heartbeat. If the
    read mutex is held by the message-receiving thread, the read fails and
    the lock failure count is incremented. Upon reaching the limit (2 by
    default in our packages), it forcibly closes the connection, causing a
    reconnect. Setting the value to 0 turns the feature off.

    Change-Id: Ieec889828d1dd2654ee760e7d5676efd14c7c348
    Closes-Bug: #1613246
    Closes-Bug: #1298262

Changed in fuel:
status: Fix Committed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-agent (stable/mitaka)

Reviewed: https://review.openstack.org/370025
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=898bcca75224ad82fa98a85b77651faaf554e2b6
Submitter: Jenkins
Branch: stable/mitaka

commit 898bcca75224ad82fa98a85b77651faaf554e2b6
Author: Georgy Kibardin <email address hidden>
Date: Tue Sep 13 14:08:34 2016 +0300

    Ignore heartbeats lock fails

    Stomp heartbeat handling is quite poorly designed. It happens in a
    separate thread which sleeps, then tries to read a heartbeat. If the
    read mutex is held by the message-receiving thread, the read fails and
    the lock failure count is incremented. Upon reaching the limit (2 by
    default in our packages), it forcibly closes the connection, causing a
    reconnect. Setting the value to 0 turns the feature off.

    Change-Id: I2187ce69508c530073582c542c963014acc5123a
    Closes-Bug: #1613246
    Closes-Bug: #1298262
    (cherry picked from commit b50241a7b243f553cc35e521ab99bb7f94d8b54a)

tags: added: on-verification
Georgy Kibardin (gkibardin) wrote :

BTW, we cannot verify this completely until the fuel-agent package is in the repos from which the bootstrap image is built. Until the release, the package can only be found in the os-latest and proposed-latest repos.

Tatyana Kuterina (tkuterina) wrote :

Not reproduced during the last week.
9.1 snapshot #294

tags: removed: on-verification
tags: removed: swarm-blocker
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/369546
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=22116add36b830e1418d2dd8345d633f194e2f27
Submitter: Jenkins
Branch: master

commit 22116add36b830e1418d2dd8345d633f194e2f27
Author: Georgy Kibardin <email address hidden>
Date: Tue Sep 13 18:43:45 2016 +0300

    Ignore heartbeats lock fails

    Stomp heartbeat handling is quite poorly designed. It happens in a
    separate thread which sleeps, then tries to read a heartbeat. If the
    read mutex is held by the message-receiving thread, the read fails and
    the lock failure count is incremented. Upon reaching the limit (2 by
    default in our packages), it forcibly closes the connection, causing a
    reconnect. Setting the value to 0 turns the feature off.

    Change-Id: Ieec889828d1dd2654ee760e7d5676efd14c7c348
    Closes-Bug: #1613246
    Closes-Bug: #1298262

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-agent 10.0.0rc1

This issue was fixed in the openstack/fuel-agent 10.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0rc1

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-agent 10.0.0

This issue was fixed in the openstack/fuel-agent 10.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0

This issue was fixed in the openstack/fuel-library 10.0.0 release.

Georgy Kibardin (gkibardin) wrote :

In the end, the commit was reverted: it turned out to be based on false assumptions and didn't fix anything. The problem is now fixed by using puppet to execute shell tasks, which is not prone to this type of problem.
