Deployment hangs more than 2 days without timeout errors

Bug #1483182 reported by Tatyanka on 2015-08-10
This bug affects 1 person
Affects | Importance | Assigned to
Fuel for OpenStack | Critical | Fuel Python (Deprecated)
Fuel for OpenStack 6.1.x | High | MOS Maintenance

Bug Description

There are no fully defined steps to reproduce, but my case was as follows:

1. Deploy a cluster with the 6.1 release (build 525 ISO) with nova flat networking / 1 controller / 2 computes / 1 cinder, on CentOS
2. When the cluster is ready, run OSTF
3. Upgrade the master node to 7.0 (I used tarball 7.0-140)
4. Add 1 controller to the 6.1 cluster and delete 1 compute
5. Run deploy changes

Actual result:
Deployment hangs at 57% and stayed in this state for 2 days. I ran stop deployment, and it seems to hang too:
[root@nailgun rabbitmq]# fuel task
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
id | status | name | cluster | progress | uuid
---|---------|-----------------|---------|----------|-------------------------------------
28 | running | dump | None | 0 | 132236a9-8fbf-41b0-acc0-e66c7117e7bb
22 | ready | check_networks | 2 | 100 | b47ff38f-1290-47d9-8f50-2b3397a8c92e
26 | running | deployment | 2 | 0 | a9b19ddd-d293-4fed-8c99-0f34cc7aa3ca
25 | running | provision | 2 | 92 | effb8ade-a4ef-4f6a-9ad4-91ea8b133872
21 | ready | deploy | 2 | 57 | de0d8324-0661-46d4-8a83-f43af0ddf5c9
27 | running | stop_deployment | 2 | 0 | 35c116ba-71cf-4afa-bf53-5f6921da7932

Expected:
At least the deployment should fail with a timeout error.

Roman Prykhodchenko (romcheg) wrote :

I experienced this issue in the past but thought it was a misconfiguration of my environment. Confirming.

Changed in fuel:
status: New → Confirmed
Tatyanka (tatyana-leontovich) wrote :

One wrong thing I see here is the 2 controllers (we recommend using 1, 3, or 5), but at the same time I would love to see some interruption by timeout in such a case.

Roman Prykhodchenko (romcheg) wrote :

Adding a timeout is a good way to fix the symptoms of this particular problem and similar ones. However, I think this issue has to be re-assigned to fuel-library for two reasons: a) something caused the deployment to hang; b) there are only two folks on fuel-astute, so the chance this issue gets revisited is too small.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel Library Team (fuel-library)
tags: added: feature-upgrade module-astute
tags: added: tricky
Bogdan Dobrelya (bogdando) wrote :

If Puppet hangs, only orchestration can resolve this. The Fuel library team cannot fix this issue.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
status: Confirmed → New
Changed in fuel:
status: New → Confirmed
Evgeniy L (rustyrobot) on 2015-08-11
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Evgeniy L (rustyrobot)
Evgeniy L (rustyrobot) wrote :

I've looked at the logs, and see a really weird situation.

2015-08-07T14:04:56 - deployment is starting (after upgrade)
2015-08-07T14:56:48 - deployment is finished, message sent to nailgun
2015-08-07T14:56:48 warning: [630] Trying to reconnect to message broker. Retry 5 sec later...
2015-08-10T09:02:13 - stop comes, nothing to stop, so message is skipped by all the workers

Basically, the problem is that Nailgun hasn't received the message; there is nothing in the receiver log about it.

Evgeniy L (rustyrobot) wrote :

Found the following message in the RabbitMQ logs:

=ERROR REPORT==== 7-Aug-2015::14:27:09 ===
closing AMQP connection <0.318.0> (10.109.0.2:40094 -> 10.109.0.2:5672):
{heartbeat_timeout,running}

Alexander Gordeev (a-gordeev) wrote :

Let me guess. Memory leak in astute/ruby?

I've seen something similar only once, on a scale lab: a ruby process started to eat gigabytes of RAM inside the astute container. No responses, no new entries in the astute logs.

Is it reproducible?

It seems like Astute completely hangs. Does it have some sort of watchdog that will restart the service in such cases?

Evgeniy L (rustyrobot) wrote :

Aleksandr, I don't think this problem is related to a memory leak: there are new entries in the log, and there is a message about the successfully deployed env which was sent to Nailgun, but Nailgun hasn't received any messages. So this problem is related to Nailgun <-> RabbitMQ <-> Astute connectivity.

Evgeniy L (rustyrobot) wrote :

We've got an env with the same problem, but we were not able to debug it, because right after the env is deployed a reconnect happens and everything works just fine.

Evgeniy L (rustyrobot) wrote :

Our current strategy is to reproduce the issue by dropping packets between Astute and RabbitMQ.

Fix proposed to branch: master
Review: https://review.openstack.org/212675

Changed in fuel:
status: Confirmed → In Progress
Evgeniy L (rustyrobot) wrote :

The problem has been seen since the 6.1 release. There were no changes in packages between 6.0 and 6.1; what changed is that, instead of a bridge, we use a single interface to connect the services (in containers).

Most likely the problem is related to a temporary connectivity issue; how that can happen when all services are connected to localhost has not been determined.

Similar symptoms can be seen by dropping packets to RabbitMQ from Astute: there is no attempt to reconnect, and Astute "successfully" sends the data to RabbitMQ because the kernel doesn't return an error on writev into the socket. Actually, there is an attempt to reconnect after about an hour; it can be seen in the attached logs, to be precise after 50 minutes, which is the standard heartbeat timeout in RabbitMQ. The heartbeat was reduced, so now if there is a connectivity problem we will see it much earlier and try to reconnect.
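The timing effect of the heartbeat setting can be illustrated with a small sketch (plain Ruby, not Astute code; the assumption that a peer is declared dead after two missed heartbeat intervals follows RabbitMQ's documented behavior):

```ruby
# Rough sketch: RabbitMQ declares a connection dead after roughly two
# missed heartbeat intervals, so a lower heartbeat value means a dead
# TCP link is noticed, and a reconnect attempted, much sooner.
def detection_time(heartbeat_seconds)
  heartbeat_seconds * 2
end

puts detection_time(580) # => 1160 (almost 20 minutes with the old default)
puts detection_time(30)  # => 60  (with a reduced heartbeat)
```

This is why simply lowering the heartbeat surfaces connectivity failures much earlier, even though it does not explain the underlying loss of connectivity.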

Reviewed: https://review.openstack.org/212675
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=e24ca066bf6160bc1e419aaa5d486cad1aaa937d
Submitter: Jenkins
Branch: master

commit e24ca066bf6160bc1e419aaa5d486cad1aaa937d
Author: Evgeniy L <email address hidden>
Date: Thu Aug 13 20:36:10 2015 +0300

    Specify heartbeat for connection to RabbitMQ

    There was a default heartbeat set in RabbitMQ to 580 seconds.
    Change it to a smaller value in order to detect connectivity
    problems earlier.

    Change-Id: I3674829d525006c7bfad96e725b7ab4d559144e2
    Closes-bug: #1483182

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Confirmed
importance: High → Critical
assignee: Evgeniy L (rustyrobot) → Fuel Python Team (fuel-python)
status: Confirmed → Fix Committed
Ivan Kliuk (ivankliuk) on 2015-09-04
tags: added: verification
Ivan Kliuk (ivankliuk) on 2015-09-07
tags: removed: verification
Maksym Strukov (unbelll) on 2015-09-09
tags: added: on-verification
Maksym Strukov (unbelll) wrote :

Verified as fixed in 7.0-288

{"build_id": "288", "build_number": "288", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "288", "build_number": "288", "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}}, "2014.2.2-6.1": {"VERSION": {"build_id": "2015-06-19_13-02-31", "build_number": "525", "api": "1.0", "fuel-library_sha": "2e7a08ad9792c700ebf08ce87f4867df36aa9fab", "nailgun_sha": "dbd54158812033dd8cfd7e60c3f6650f18013a37", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "4fc55db0265bbf39c369df398b9dc7d6469ba13b", "astute_sha": "1ea8017fe8889413706d543a5b9f557f5414beae", "fuel-ostf_sha": "8fefcf7c4649370f00847cc309c24f0b62de718d", "release": "6.1", "fuelmain_sha": "a3998372183468f56019c8ce21aa8bb81fee0c2f"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": 
"6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/234665
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=b60624ee2c5f1d6d805619b6c27965a973508da1
Submitter: Jenkins
Branch: master

commit b60624ee2c5f1d6d805619b6c27965a973508da1
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Oct 12 19:25:00 2015 +0300

    Move from amqp-gem to bunny

    Differences:

    - separate independent channel for outgoing reports;
    - solid way to redeclare already existing queues;
    - auto-recovery mode in case of network problems by default;
    - a more solid, modern and simple library for AMQP.

    Also:

    - implement asynchronous logger for event callbacks.

    A short note from the gem authors:

    amqp gem brings in a fair share of EventMachine complexity which
    cannot be fully eliminated. Event loop blocking, writes that
    happen at the end of loop tick, uncaught exceptions in event
    loop silently killing it: it's not worth the pain unless
    you've already deeply invested in EventMachine and
    understand how it works.

    Closes-Bug: #1498847
    Closes-Bug: #1487397
    Closes-Bug: #1461562
    Related-Bug: #1485895
    Related-Bug: #1483182

    Change-Id: I52d005498ccb978ada158bfa64b1c7de1a24e9b0
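For reference, the auto-recovery behavior mentioned in the commit above maps onto bunny's connection options roughly as follows. This is a configuration sketch, not the actual Astute code; the host, heartbeat, and recovery interval values are illustrative placeholders, and running it requires a live RabbitMQ broker:

```ruby
require "bunny"

# Hypothetical connection setup; all values are illustrative.
conn = Bunny.new(
  host: "localhost",
  heartbeat: 30,                 # explicit heartbeat to detect dead links early
  automatically_recover: true,   # bunny reconnects after network failures (on by default)
  network_recovery_interval: 5   # seconds between recovery attempts
)
conn.start

# e.g. a separate, independent channel for outgoing reports
report_channel = conn.create_channel
```

Because recovery is enabled by default in bunny, the silent stall seen with the EventMachine-based amqp gem (a dead socket with no reconnect attempt) turns into an automatic reconnect after the heartbeat timeout fires.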

Denis Meltsaykin (dmeltsaykin) wrote :

Closing as Invalid for 6.1; as per the comments, it is caused by connectivity failures.
