Network verification stopped responding.

Bug #1487397 reported by Vasily Gorin
This bug affects 1 person
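Affects            | Status       | Importance | Assigned to              | Milestone
-------------------|--------------|------------|--------------------------|----------
Fuel for OpenStack | Fix Released | High       | Vladimir Sharshov        |
7.0.x              | Won't Fix    | High       | Fuel Python (Deprecated) |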

Bug Description

Network verification stopped responding after deploying cluster.

Contact me if you need an environment to investigate the problem.
Build # 187

Scenario:
1. Create new environment
2. Choose Neutron, VLAN
3. Choose Ceph for images
4. Choose Sahara
5. Choose Ceilometer
6. Add 1 controller+ceph
7. Add 1 compute+ceph
8. Add 1 cinder+ceph
9. Add 2 mongo
10. Change disk configuration for both Mongo nodes. Change 'MongoDB' volume for vdc
11. Deploy the environment
12. Verify networks

Expected result:
At step 12 the network check succeeds.

Actual result:
Network verification does not respond.
In the UI, network verification has been shown as running for more than 30 minutes.

The task table shows the following:
[root@nailgun log]# fuel task
id | status  | name                    | cluster | progress | uuid
---|---------|-------------------------|---------|----------|-------------------------------------
36 | ready   | provision               | 2       | 100      | 5b0c8bc1-c9ad-47e4-af1c-05769b7887a4
38 | running | verify_networks         | 2       | 0        | 4f740d59-3961-411e-8adf-84cf8b90d831
37 | ready   | deployment              | 2       | 100      | 7ff4dc08-76ab-4a8c-bed6-e7d752c052b1
32 | ready   | deploy                  | 2       | 100      | 84368f8d-b370-4578-8d93-c8184181d25c
39 | running | check_dhcp              | 2       | 0        | 2082cd6a-2b82-4e92-9e10-04a9d19533b3
40 | running | check_repo_availability | 2       | 0        | fb173701-8384-4818-94ba-e0cd5c02cabc
41 | running | dump                    | None    | 0        | 69b4d207-7119-4298-a547-02cd90caeeb8
[root@nailgun log]#
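
For anyone triaging a similar report, here is a minimal sketch that picks out tasks sitting in `running` with zero progress from `fuel task` output. The function names `parse_task_table` and `stalled_task_names` are hypothetical helpers, not part of the Fuel client, and a task at 0% is only a candidate for being stuck, since it may simply have just started.

```python
# Rows copied from the report above; whitespace between columns does not matter.
TASK_TABLE = """\
id | status | name | cluster | progress | uuid
---|---------|-------------------------|---------|----------|-------------------------------------
36 | ready | provision | 2 | 100 | 5b0c8bc1-c9ad-47e4-af1c-05769b7887a4
38 | running | verify_networks | 2 | 0 | 4f740d59-3961-411e-8adf-84cf8b90d831
37 | ready | deployment | 2 | 100 | 7ff4dc08-76ab-4a8c-bed6-e7d752c052b1
32 | ready | deploy | 2 | 100 | 84368f8d-b370-4578-8d93-c8184181d25c
39 | running | check_dhcp | 2 | 0 | 2082cd6a-2b82-4e92-9e10-04a9d19533b3
40 | running | check_repo_availability | 2 | 0 | fb173701-8384-4818-94ba-e0cd5c02cabc
41 | running | dump | None | 0 | 69b4d207-7119-4298-a547-02cd90caeeb8
"""


def parse_task_table(text):
    """Turn the pipe-separated table into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if "|" in line]
    header = [cell.strip() for cell in lines[0].split("|")]
    tasks = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        if set(cells[0]) <= {"-"}:  # skip the ---|--- separator row
            continue
        tasks.append(dict(zip(header, cells)))
    return tasks


def stalled_task_names(tasks):
    """Names of tasks that are running but report no progress."""
    return [t["name"] for t in tasks
            if t["status"] == "running" and t["progress"] == "0"]


tasks = parse_task_table(TASK_TABLE)
stalled = stalled_task_names(tasks)
print(stalled)
```

Applied to the table above, the filter selects every task that was started after the deployment finished.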

Revision history for this message
Vasily Gorin (vgorin) wrote :
description: updated
Evgeniy L (rustyrobot)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Evgeniy L (rustyrobot)
Revision history for this message
Evgeniy L (rustyrobot) wrote :

I looked at the environment; here is what I found.
Nailgun sent the network check message to Astute, but the Astute logs contain nothing about it. The message was still in the RabbitMQ naily queue: it had been delivered to an Astute worker, the worker never acknowledged it, and so RabbitMQ held it without redelivering it to other workers.

So EventMachine received the message but got stuck either before trying to log it [1] or during the logging attempt itself.
We have probably seen a similar issue before, where logging simply hung [2].

After the worker that had received the message was killed, the message was rescheduled and received by another worker.

We had a snapshot of the environment, but after reverting to it Astute instantly reconnected and the message was rescheduled, which makes the issue harder to debug.

[1] https://github.com/stackforge/fuel-astute/blob/53c86cba593ddbac776ce5a3360240274c20738c/lib/astute/server/server.rb#L62
[2] https://github.com/stackforge/fuel-astute/commit/3ce8643c2d8447256561f0eafb71a258b6f74f17#diff-e58148f7ac9ffd88d4681162773da473
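
The acknowledgement behaviour described in this comment can be illustrated with a toy in-memory model. This is only a sketch of the general manual-ack pattern, not RabbitMQ or Astute code; `ToyBroker` and the worker names are made up for the illustration.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a queue with manual acknowledgements."""

    def __init__(self):
        self.ready = deque()  # messages waiting for a consumer
        self.unacked = {}     # consumer name -> messages delivered but not acked

    def publish(self, message):
        self.ready.append(message)

    def deliver(self, consumer):
        """Hand the next ready message to a consumer, or None if none is ready."""
        if not self.ready:
            return None
        message = self.ready.popleft()
        self.unacked.setdefault(consumer, []).append(message)
        return message

    def ack(self, consumer, message):
        """The consumer confirms it has handled the message."""
        self.unacked[consumer].remove(message)

    def disconnect(self, consumer):
        """A dead consumer's unacked messages go back to the front of the queue."""
        for message in reversed(self.unacked.pop(consumer, [])):
            self.ready.appendleft(message)


broker = ToyBroker()
broker.publish("verify_networks")

# worker-1 receives the message, then hangs before acknowledging it.
first = broker.deliver("worker-1")
# While worker-1 stays connected, worker-2 gets nothing.
while_hung = broker.deliver("worker-2")
# Killing worker-1 requeues the message, and worker-2 picks it up.
broker.disconnect("worker-1")
after_kill = broker.deliver("worker-2")
print(first, while_hung, after_kill)
```

The point of the model is the middle step: a hung but still connected worker keeps the message invisible to every other worker, which matches the task staying at 0% until the worker was killed.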

Changed in fuel:
status: New → Confirmed
assignee: Evgeniy L (rustyrobot) → Fuel Python Team (fuel-python)
Revision history for this message
Evgeniy L (rustyrobot) wrote :

Unassigning myself because I currently don't have time for further attempts to reproduce the issue.

tags: added: tricky
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/217810

Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Evgeniy L (rustyrobot)
Changed in fuel:
assignee: Evgeniy L (rustyrobot) → Dima Shulyak (dshulyak)
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Dima Shulyak (dshulyak) → Igor Kalnitsky (ikalnitsky)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/217810
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=e63709d16bd4c1949bef820ac336c9393c040d25
Submitter: Jenkins
Branch: master

commit e63709d16bd4c1949bef820ac336c9393c040d25
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Thu Aug 27 20:08:15 2015 +0300

    Implement asynchronous logger for event callbacks

    Excludes any lock operations in EventMachine event loop
    and its callbacks.

    Now in EventMachine callbacks all writings to log are
    performed in asynchronous fashion. This way we try to
    exclude possibility of blocking and getting deadlock
    using classic logging mechanism, which has plenty of
    locks internally.

    Also similar problem was seen in the next ticket #1453573.

    Co-Authored-By: Evgeniy L <email address hidden>

    Closes-Bug: #1487397
    Change-Id: I72d0eb01ef8c87e10003338bd3bb70e42a4b7dd6
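
The idea in this commit message can be sketched as follows. The actual change is Ruby code in fuel-astute; the Python class below is a hypothetical illustration of the pattern (callers enqueue a record, a single worker thread does the writing) and not a port of that code. Note that `queue.Queue.put` still takes a short internal lock, so the sketch only keeps slow log I/O out of the caller, it does not remove locking entirely.

```python
import io
import queue
import threading


class AsyncLogger:
    """Hypothetical sketch: callers enqueue, one worker thread writes."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, stream):
        self._stream = stream
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def info(self, text):
        # put() on an unbounded queue does not wait for I/O; the write
        # itself happens later on the worker thread.
        self._queue.put("INFO " + text)

    def close(self):
        """Flush everything queued so far and stop the worker."""
        self._queue.put(self._STOP)
        self._worker.join()

    def _drain(self):
        while True:
            record = self._queue.get()  # blocks while the queue is empty
            if record is self._STOP:
                return
            self._stream.write(record + "\n")


sink = io.StringIO()
log = AsyncLogger(sink)
log.info("got message with payload")
log.info("processing finished")
log.close()
logged_lines = sink.getvalue().splitlines()
print(logged_lines)
```

Because the worker uses a blocking `get()`, it sleeps while there is nothing to write instead of polling the queue.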

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/220069
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=e790f6609104fffd33bca2ba20175d0d67024e51
Submitter: Jenkins
Branch: master

commit e790f6609104fffd33bca2ba20175d0d67024e51
Author: Igor Kalnitsky <email address hidden>
Date: Thu Sep 3 10:15:34 2015 +0000

    Revert "Implement asynchronous logger for event callbacks"

    This reverts commit e63709d16bd4c1949bef820ac336c9393c040d25.

    AsyncLogger's worker thread takes almost 100% CPU because
    it infinitely iterates over a logger queue. A simple
    "sleep 0.2" will solve this problem, but I have doubts
    about that AsyncLogger helps us anyway. It's better to
    revert this fix and investigate a real cause of our problems.

    Closes-Bug: 1491794
    Related-Bug: 1487397

    Change-Id: I95d57de9ef92c717b8b07f7832ab258c0b35b6a5
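
The CPU problem named in this revert is the usual cost of polling an empty queue in a tight loop. The sketch below is an assumption about the shape of that loop, not the reverted Ruby code: it counts how often a worker finds the queue empty with and without a pause between polls. `polling_worker` and `run_idle` are hypothetical names.

```python
import queue
import threading
import time


def polling_worker(q, stop, sleep_for):
    """Drain q without blocking; return how many times q was found empty."""
    empty_polls = 0
    while not stop.is_set():
        try:
            q.get_nowait()
        except queue.Empty:
            empty_polls += 1
            if sleep_for:
                time.sleep(sleep_for)  # pause before polling again
    return empty_polls


def run_idle(sleep_for, duration=0.5):
    """Run polling_worker against an empty queue for `duration` seconds."""
    q, stop, result = queue.Queue(), threading.Event(), []
    worker = threading.Thread(
        target=lambda: result.append(polling_worker(q, stop, sleep_for)))
    worker.start()
    time.sleep(duration)
    stop.set()
    worker.join()
    return result[0]


busy_polls = run_idle(sleep_for=0)     # tight loop, spins the whole time
paced_polls = run_idle(sleep_for=0.2)  # the "sleep 0.2" workaround
print(busy_polls, paced_polls)
```

The paced worker wakes only a handful of times in half a second, while the tight loop spins continuously. A blocking `q.get()` would avoid polling altogether, at the price of needing a sentinel or timeout to shut the worker down.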

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
assignee: Igor Kalnitsky (ikalnitsky) → Fuel Python Team (fuel-python)
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The attempt to fix this bug introduced a really awful, hard-to-catch bug. I propose not backporting the fix to 7.0.x.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

In progress. Will be closed by https://review.openstack.org/#/c/234665/

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 7.0-updates per comment #7

Dmitry Pyzhov (dpyzhov)
tags: added: team-enhancements
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/234665
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=b60624ee2c5f1d6d805619b6c27965a973508da1
Submitter: Jenkins
Branch: master

commit b60624ee2c5f1d6d805619b6c27965a973508da1
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Oct 12 19:25:00 2015 +0300

    Move from amqp-gem to bunny

    Differences:

    - separate independent channel for outgoing reports;
    - solid way to redeclare already existed queues;
    - auto recovery mode in case of network problem by default;
    - more solid, modern and simple library for AMQP.

    Also:

    - implement asynchronous logger for event callbacks.

    Short words from both gems authors:

    amqp gem brings in a fair share of EventMachine complexity which
    cannot be fully eliminated. Event loop blocking, writes that
    happen at the end of loop tick, uncaught exceptions in event
    loop silently killing it: it's not worth the pain unless
    you've already deeply invested in EventMachine and
    understand how it works.

    Closes-Bug: #1498847
    Closes-Bug: #1487397
    Closes-Bug: #1461562
    Related-Bug: #1485895
    Related-Bug: #1483182

    Change-Id: I52d005498ccb978ada158bfa64b1c7de1a24e9b0

Changed in fuel:
status: In Progress → Fix Committed
Vladimir (vushakov)
tags: added: on-verification
Revision history for this message
Vladimir (vushakov) wrote :
tags: removed: on-verification
Revision history for this message
Vladimir (vushakov) wrote :

Verified on:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "427"
  build_id: "427"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "fae42170a54b98d8e8c8db99b0fbb312633c693c"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "85de57080a18fda18e5325f06eaf654b1b931592"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "e8e36cff332644576d7853c80b8a53d5b955420a"

Changed in fuel:
status: Fix Committed → Fix Released