Deployment hangs for more than 2 days without timeout errors

Bug #1483182 reported by Tatyanka
This bug affects 1 person
Affects                  | Status       | Importance | Assigned to              | Milestone
-------------------------|--------------|------------|--------------------------|----------
Fuel for OpenStack       | Fix Released | Critical   | Fuel Python (Deprecated) |
Fuel for OpenStack 6.1.x | Invalid      | High       | MOS Maintenance          |

Bug Description

There are no 100%-defined steps to reproduce, but my case was as follows:

1. Deploy a cluster with the 6.1 release (525 ISO): nova flat, 1 controller, 2 computes, 1 cinder, on CentOS
2. When the cluster is ready, run OSTF
3. Upgrade the master node to 7.0 (I used tarball 7.0-140)
4. Add 1 controller to the 6.1 cluster and delete 1 compute
5. Run "Deploy changes"

Actual result:
Deployment hangs at 57% and stays in this state for 2 days. I ran stop deployment and it seems to hang too:
[root@nailgun rabbitmq]# fuel task
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
id | status | name | cluster | progress | uuid
---|---------|-----------------|---------|----------|-------------------------------------
28 | running | dump | None | 0 | 132236a9-8fbf-41b0-acc0-e66c7117e7bb
22 | ready | check_networks | 2 | 100 | b47ff38f-1290-47d9-8f50-2b3397a8c92e
26 | running | deployment | 2 | 0 | a9b19ddd-d293-4fed-8c99-0f34cc7aa3ca
25 | running | provision | 2 | 92 | effb8ade-a4ef-4f6a-9ad4-91ea8b133872
21 | ready | deploy | 2 | 57 | de0d8324-0661-46d4-8a83-f43af0ddf5c9
27 | running | stop_deployment | 2 | 0 | 35c116ba-71cf-4afa-bf53-5f6921da7932

Expected:
At least the deployment should fail with a timeout error.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

I experienced this issue in the past but thought it was a misconfiguration of my environment. Confirming.

Changed in fuel:
status: New → Confirmed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

One thing that looks wrong here is the 2 controllers (we recommend using 1, 3, or 5), but at the same time I would love to see some interruption by timeout in such a case.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Adding a timeout is a good way to fix the symptoms of this and similar problems. However, I think this issue has to be re-assigned to fuel-library for two reasons: a) something caused the deployment to hang; b) there are only two folks in fuel-astute, so the chance that this issue gets looked into is too small.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel Library Team (fuel-library)
tags: added: feature-upgrade module-astute
tags: added: tricky
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If Puppet hangs, only the orchestration layer can resolve this issue. The Fuel library team cannot fix it.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
status: Confirmed → New
Changed in fuel:
status: New → Confirmed
Evgeniy L (rustyrobot)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Evgeniy L (rustyrobot)
Revision history for this message
Evgeniy L (rustyrobot) wrote :

I've looked at the logs, and see a really weird situation.

2015-08-07T14:04:56 - deployment is starting (after upgrade)
2015-08-07T14:56:48 - deployment is finished, message sent to nailgun
2015-08-07T14:56:48 warning: [630] Trying to reconnect to message broker. Retry 5 sec later...
2015-08-10T09:02:13 - stop comes, nothing to stop, so message is skipped by all the workers

Basically, the problem is that Nailgun never received the message; there is nothing about it in the receiver log.

Revision history for this message
Evgeniy L (rustyrobot) wrote :

Found the following message in the RabbitMQ logs:

=ERROR REPORT==== 7-Aug-2015::14:27:09 ===
closing AMQP connection <0.318.0> (10.109.0.2:40094 -> 10.109.0.2:5672):
{heartbeat_timeout,running}

Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

Let me guess. Memory leak in astute/ruby?

I've seen something similar only once, in the scale lab: the Ruby process started to eat gigabytes of RAM inside the astute container. No responses, no new entries in the astute logs.

Is it reproducible?

It seems like astute hangs completely. Does it have some sort of watchdog that would restart the service in such cases?

Revision history for this message
Evgeniy L (rustyrobot) wrote :

Aleksandr, I don't think this problem is related to a memory leak: there are new entries in the log, and there is a message about a successfully deployed env, which Astute sent to Nailgun, but Nailgun never received any messages. So this problem is related to Nailgun <-> RabbitMQ <-> Astute connectivity.

Revision history for this message
Evgeniy L (rustyrobot) wrote :

We've got an env with the same problem, but we were not able to debug it, because right after the env is deployed a reconnect happens and everything works just fine.

Revision history for this message
Evgeniy L (rustyrobot) wrote :

Our current strategy is to make the problem reproducible by dropping packets between Astute and RabbitMQ.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/212675

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Evgeniy L (rustyrobot) wrote :

So, the problem has been seen since the 6.1 release. There were no changes in packages between 6.0 and 6.1; what changed is that instead of a bridge, we now use a single interface to connect the services (in containers).

Most likely the problem is related to a temporary connectivity issue; how that can happen when all services are connected over localhost has not been determined.

Similar symptoms can be seen by dropping packets to RabbitMQ from Astute: there is no attempt to reconnect, and Astute "successfully" sends the data to RabbitMQ because the kernel doesn't return an error on the writev into the socket. Actually, there is an attempt to reconnect after about an hour, as can be seen in the attached logs; to be precise, after 50 minutes, which is the standard heartbeat timeout in RabbitMQ. The heartbeat was reduced, so now if there is a connectivity problem we will notice it much earlier and try to reconnect.
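
For illustration only, here is a minimal sketch of specifying a client-side heartbeat with the Ruby amqp gem (the AMQP client Astute used at the time); the host, queue name, and 30-second interval are assumptions for the example, not the actual fuel-astute change:

require 'amqp'

EventMachine.run do
  # heartbeat: 30 makes client and broker exchange heartbeat frames every
  # 30 seconds, so a dead TCP connection is noticed within about a minute
  # instead of after the broker-side default mentioned above.
  AMQP.connect(host: '127.0.0.1', port: 5672, heartbeat: 30) do |connection|
    channel  = AMQP::Channel.new(connection)
    queue    = channel.queue('nailgun', durable: true) # queue name is illustrative
    exchange = channel.default_exchange

    # If the link to RabbitMQ silently dies, the missed heartbeats now close the
    # connection instead of letting this publish vanish into a dead socket.
    exchange.publish('deployment finished', routing_key: queue.name)

    EventMachine.add_timer(1) { connection.close { EventMachine.stop } }
  end
end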

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/212675
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=e24ca066bf6160bc1e419aaa5d486cad1aaa937d
Submitter: Jenkins
Branch: master

commit e24ca066bf6160bc1e419aaa5d486cad1aaa937d
Author: Evgeniy L <email address hidden>
Date: Thu Aug 13 20:36:10 2015 +0300

    Specify heartbeat for connection to RabbitMQ

    There was a default heartbeat which was set in RabbitMQ to 580 seconds.
    Change it to a smaller number of seconds in order to detect connectivity
    problems earlier.

    Change-Id: I3674829d525006c7bfad96e725b7ab4d559144e2
    Closes-bug: #1483182

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Confirmed
importance: High → Critical
assignee: Evgeniy L (rustyrobot) → Fuel Python Team (fuel-python)
status: Confirmed → Fix Committed
tags: added: verification
tags: removed: verification
Maksym Strukov (unbelll)
tags: added: on-verification
Revision history for this message
Maksym Strukov (unbelll) wrote :

Verified as fixed in 7.0-288

{"build_id": "288", "build_number": "288", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "288", "build_number": "288", "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}}, "2014.2.2-6.1": {"VERSION": {"build_id": "2015-06-19_13-02-31", "build_number": "525", "api": "1.0", "fuel-library_sha": "2e7a08ad9792c700ebf08ce87f4867df36aa9fab", "nailgun_sha": "dbd54158812033dd8cfd7e60c3f6650f18013a37", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "4fc55db0265bbf39c369df398b9dc7d6469ba13b", "astute_sha": "1ea8017fe8889413706d543a5b9f557f5414beae", "fuel-ostf_sha": "8fefcf7c4649370f00847cc309c24f0b62de718d", "release": "6.1", "fuelmain_sha": "a3998372183468f56019c8ce21aa8bb81fee0c2f"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/234665
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=b60624ee2c5f1d6d805619b6c27965a973508da1
Submitter: Jenkins
Branch: master

commit b60624ee2c5f1d6d805619b6c27965a973508da1
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Oct 12 19:25:00 2015 +0300

    Move from amqp-gem to bunny

    Differences:

    - a separate, independent channel for outgoing reports;
    - a solid way to redeclare already existing queues;
    - auto-recovery mode in case of network problems, by default;
    - a more solid, modern and simple library for AMQP.

    Also:

    - implement an asynchronous logger for event callbacks.

    Short words from both gems authors:

    amqp gem brings in a fair share of EventMachine complexity which
    cannot be fully eliminated. Event loop blocking, writes that
    happen at the end of loop tick, uncaught exceptions in event
    loop silently killing it: it's not worth the pain unless
    you've already deeply invested in EventMachine and
    understand how it works.

    Closes-Bug: #1498847
    Closes-Bug: #1487397
    Closes-Bug: #1461562
    Related-Bug: #1485895
    Related-Bug: #1483182

    Change-Id: I52d005498ccb978ada158bfa64b1c7de1a24e9b0
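
As a rough, hypothetical sketch of the points listed above (a dedicated channel for outgoing reports, automatic recovery, heartbeats) using Bunny's public API, not the actual fuel-astute code:

require 'bunny'

# Automatic connection recovery and heartbeats are regular Bunny options;
# the host, queue name and intervals below are made up for the example.
conn = Bunny.new(host: '127.0.0.1',
                 heartbeat: 30,
                 automatically_recover: true,
                 network_recovery_interval: 5)
conn.start

# A separate, independent channel dedicated to outgoing reports, so publishing
# status back to Nailgun is not blocked by the consuming channel.
report_channel = conn.create_channel
# Redeclaring a queue that already exists with the same options is a no-op.
report_queue = report_channel.queue('nailgun', durable: true)
report_channel.default_exchange.publish('{"status": "ready"}',
                                        routing_key: report_queue.name)

conn.close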

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing as Invalid for 6.1; as per the comments, it is caused by connectivity failures.
