Deployment hangs for more than 2 days without timeout errors

Bug #1483182 reported by Tatyanka
This bug affects 1 person
Affects                  | Status       | Importance | Assigned to              | Milestone
-------------------------|--------------|------------|--------------------------|----------
Fuel for OpenStack       | Fix Released | Critical   | Fuel Python (Deprecated) |
Fuel for OpenStack 6.1.x | Invalid      | High       | MOS Maintenance          |

Bug Description

There are no 100%-defined steps to reproduce, but my case was as follows:

1. Deploy a cluster with the 6.1 release (525 ISO): nova flat, 1 controller, 2 computes, 1 cinder, on CentOS
2. When the cluster is ready, run OSTF
3. Upgrade the master node to 7.0 (I used tarball 7.0-140)
4. Add 1 controller to the 6.1 cluster and delete 1 compute
5. Run "Deploy changes"

Actual result:
Deployment hangs at 57% and stays in this state for 2 days. I ran stop deployment and it seems to hang too:
[root@nailgun rabbitmq]# fuel task
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
id | status | name | cluster | progress | uuid
---|---------|-----------------|---------|----------|-------------------------------------
28 | running | dump | None | 0 | 132236a9-8fbf-41b0-acc0-e66c7117e7bb
22 | ready | check_networks | 2 | 100 | b47ff38f-1290-47d9-8f50-2b3397a8c92e
26 | running | deployment | 2 | 0 | a9b19ddd-d293-4fed-8c99-0f34cc7aa3ca
25 | running | provision | 2 | 92 | effb8ade-a4ef-4f6a-9ad4-91ea8b133872
21 | ready | deploy | 2 | 57 | de0d8324-0661-46d4-8a83-f43af0ddf5c9
27 | running | stop_deployment | 2 | 0 | 35c116ba-71cf-4afa-bf53-5f6921da7932

Expected:
At least the deployment should fail with a timeout error.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

I experienced this issue in the past but thought it was a misconfiguration of my environment. Confirming.

Changed in fuel:
status: New → Confirmed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

One thing that looks wrong here is the 2 controllers (we recommend using 1, 3, or 5), but at the same time I would love to see some interruption by timeout in such a case.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Adding a timeout is a good way to fix the symptoms of this and similar problems. However, I think this issue has to be re-assigned to fuel-library for two reasons: a) something caused the deployment to hang; b) there are only two folks in fuel-astute, so the chance that this issue gets looked into is too small.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel Library Team (fuel-library)
tags: added: feature-upgrade module-astute
tags: added: tricky
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If Puppet hangs, only the orchestration layer can resolve this issue. The Fuel library team cannot fix it.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
status: Confirmed → New
Changed in fuel:
status: New → Confirmed
Evgeniy L (rustyrobot)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Evgeniy L (rustyrobot)
Revision history for this message
Evgeniy L (rustyrobot) wrote :

I've looked at the logs, and see a really weird situation.

2015-08-07T14:04:56 - deployment is starting (after upgrade)
2015-08-07T14:56:48 - deployment is finished, message sent to nailgun
2015-08-07T14:56:48 warning: [630] Trying to reconnect to message broker. Retry 5 sec later...
2015-08-10T09:02:13 - stop comes, nothing to stop, so message is skipped by all the workers

Basically, the problem is that Nailgun never received the message; there is nothing about it in the receiver log.

Revision history for this message
Evgeniy L (rustyrobot) wrote :

Found the following message in the RabbitMQ logs:

=ERROR REPORT==== 7-Aug-2015::14:27:09 ===
closing AMQP connection <0.318.0> (10.109.0.2:40094 -> 10.109.0.2:5672):
{heartbeat_timeout,running}

Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

Let me guess. Memory leak in astute/ruby?

I've seen something similar only once, in the scale lab: the Ruby process started to eat gigabytes of RAM inside the astute container. No responses, no new entries in the astute logs.

Is it reproducible?

It seems like astute hangs completely. Does it have some sort of watchdog that would restart the service in such cases?

Revision history for this message
Evgeniy L (rustyrobot) wrote :

Aleksandr, I don't think this problem is related to a memory leak: there are new entries in the log, and there is a message about a successfully deployed env, which Astute sent to Nailgun, but Nailgun never received any messages. So this problem is related to Nailgun <-> RabbitMQ <-> Astute connectivity.

Revision history for this message
Evgeniy L (rustyrobot) wrote :

We've got an env with the same problem, but we were not able to debug it, because right after the env is deployed a reconnect happens and everything works just fine.

Revision history for this message
Evgeniy L (rustyrobot) wrote :

Our current strategy is to make the problem reproducible by dropping packets between Astute and RabbitMQ.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/212675

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Evgeniy L (rustyrobot) wrote :

So, the problem has been seen since the 6.1 release. There were no changes in packages between 6.0 and 6.1; what changed is that instead of a bridge, we now use a single interface to connect the services (in containers).

Most likely the problem is related to a temporary connectivity issue; how that can happen when all services are connected over localhost has not been determined.

Similar symptoms can be seen by dropping packets to RabbitMQ from Astute: there is no attempt to reconnect, and Astute "successfully" sends the data to RabbitMQ because the kernel doesn't return an error on the writev into the socket. Actually, there is an attempt to reconnect after about an hour, as can be seen in the attached logs; to be precise, after 50 minutes, which is the standard heartbeat timeout in RabbitMQ. The heartbeat was reduced, so now if there is a connectivity problem we will notice it much earlier and try to reconnect.
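
For illustration only, here is a minimal sketch of specifying a client-side heartbeat with the Ruby amqp gem (the AMQP client Astute used at the time); the host, queue name, and 30-second interval are assumptions for the example, not the actual fuel-astute change:

require 'amqp'

EventMachine.run do
  # heartbeat: 30 makes client and broker exchange heartbeat frames every
  # 30 seconds, so a dead TCP connection is noticed within about a minute
  # instead of after the broker-side default mentioned above.
  AMQP.connect(host: '127.0.0.1', port: 5672, heartbeat: 30) do |connection|
    channel  = AMQP::Channel.new(connection)
    queue    = channel.queue('nailgun', durable: true) # queue name is illustrative
    exchange = channel.default_exchange

    # If the link to RabbitMQ silently dies, the missed heartbeats now close the
    # connection instead of letting this publish vanish into a dead socket.
    exchange.publish('deployment finished', routing_key: queue.name)

    EventMachine.add_timer(1) { connection.close { EventMachine.stop } }
  end
end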

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/212675
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=e24ca066bf6160bc1e419aaa5d486cad1aaa937d
Submitter: Jenkins
Branch: master

commit e24ca066bf6160bc1e419aaa5d486cad1aaa937d
Author: Evgeniy L <email address hidden>
Date: Thu Aug 13 20:36:10 2015 +0300

    Specify heartbeat for connection to RabbitMQ

    There was a default heartbeat which was set in RabbitMQ to 580 seconds.
    Change it to a smaller number of seconds in order to detect connectivity
    problems earlier.

    Change-Id: I3674829d525006c7bfad96e725b7ab4d559144e2
    Closes-bug: #1483182

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Confirmed
importance: High → Critical
assignee: Evgeniy L (rustyrobot) → Fuel Python Team (fuel-python)
status: Confirmed → Fix Committed
tags: added: verification
tags: removed: verification
Maksym Strukov (unbelll)
tags: added: on-verification
Revision history for this message
Maksym Strukov (unbelll) wrote :

Verified as fixed in 7.0-288

{"build_id": "288", "build_number": "288", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "288", "build_number": "288", "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}}, "2014.2.2-6.1": {"VERSION": {"build_id": "2015-06-19_13-02-31", "build_number": "525", "api": "1.0", "fuel-library_sha": "2e7a08ad9792c700ebf08ce87f4867df36aa9fab", "nailgun_sha": "dbd54158812033dd8cfd7e60c3f6650f18013a37", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "4fc55db0265bbf39c369df398b9dc7d6469ba13b", "astute_sha": "1ea8017fe8889413706d543a5b9f557f5414beae", "fuel-ostf_sha": "8fefcf7c4649370f00847cc309c24f0b62de718d", "release": "6.1", "fuelmain_sha": "a3998372183468f56019c8ce21aa8bb81fee0c2f"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/234665
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=b60624ee2c5f1d6d805619b6c27965a973508da1
Submitter: Jenkins
Branch: master

commit b60624ee2c5f1d6d805619b6c27965a973508da1
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Oct 12 19:25:00 2015 +0300

    Move from amqp-gem to bunny

    Differences:

    - a separate, independent channel for outgoing reports;
    - a solid way to redeclare already existing queues;
    - auto-recovery mode in case of network problems, by default;
    - a more solid, modern and simple library for AMQP.

    Also:

    - implement an asynchronous logger for event callbacks.

    Short words from both gems authors:

    amqp gem brings in a fair share of EventMachine complexity which
    cannot be fully eliminated. Event loop blocking, writes that
    happen at the end of loop tick, uncaught exceptions in event
    loop silently killing it: it's not worth the pain unless
    you've already deeply invested in EventMachine and
    understand how it works.

    Closes-Bug: #1498847
    Closes-Bug: #1487397
    Closes-Bug: #1461562
    Related-Bug: #1485895
    Related-Bug: #1483182

    Change-Id: I52d005498ccb978ada158bfa64b1c7de1a24e9b0
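
As a rough, hypothetical sketch of the points listed above (a dedicated channel for outgoing reports, automatic recovery, heartbeats) using Bunny's public API, not the actual fuel-astute code:

require 'bunny'

# Automatic connection recovery and heartbeats are regular Bunny options;
# the host, queue name and intervals below are made up for the example.
conn = Bunny.new(host: '127.0.0.1',
                 heartbeat: 30,
                 automatically_recover: true,
                 network_recovery_interval: 5)
conn.start

# A separate, independent channel dedicated to outgoing reports, so publishing
# status back to Nailgun is not blocked by the consuming channel.
report_channel = conn.create_channel
# Redeclaring a queue that already exists with the same options is a no-op.
report_queue = report_channel.queue('nailgun', durable: true)
report_channel.default_exchange.publish('{"status": "ready"}',
                                        routing_key: report_queue.name)

conn.close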

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing as Invalid for 6.1; as per the comments, it is caused by connectivity failures.
