Failed to cast message to Nailgun. RabbitMQ was dead

Bug #1461562 reported by Alexander Kurenyshev on 2015-06-03
This bug affects 4 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Medium
Assigned to: Vladimir Sharshov
Milestone: 8.0

Bug Description

Steps to reproduce:
Scenario:
1. Create a cluster in HA mode with 1 controller
2. Add 1 node with the controller role
3. Add 1 node with the compute role
4. Deploy the cluster

Expected result:
Deploy passes

Actual result:
Deploy fails

In the Astute log:
2015-06-03T07:33:03 info: [663] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"a20646c6-40e0-4d8d-b3a5-c123ce4cda1d", "status"=>"ready", "progress"=>100}}
2015-06-03T07:33:03 warning: [663] Trying to reconnect to message broker. Retry 5 sec later...

In the RabbitMQ log:
=ERROR REPORT==== 3-Jun-2015::07:31:33 ===
closing AMQP connection <0.350.0> (10.109.15.2:39783 -> 10.109.15.2:5672):
{heartbeat_timeout,running}

=INFO REPORT==== 3-Jun-2015::07:33:08 ===
accepting AMQP connection <0.1203.0> (10.109.15.2:40972 -> 10.109.15.2:5672)

Fuel used:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "497"
  build_id: "2015-06-02_16-28-25"
  nailgun_sha: "3830bdcb28ec050eed399fe782cc3dd5fbf31bde"
  python-fuelclient_sha: "4fc55db0265bbf39c369df398b9dc7d6469ba13b"
  astute_sha: "cbae24e9904be2ff8d1d49c0c48d1bdc33574228"
  fuel-library_sha: "d757cd41e4f8273d36ef85b8207e554e5422c5b0"
  fuel-ostf_sha: "f899e16c4ce9a60f94e7128ecde1324ea41d09d4"
  fuelmain_sha: "bcc909ffc5dd5156ba54cae348b6a07c1b607b24"

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Matthew Mosesohn (raytrac3r)
importance: Undecided → High
status: New → Confirmed
Matthew Mosesohn (raytrac3r) wrote :

According to the logs, Astute keeps casting messages to the deploy_resp topic in RabbitMQ, but they never get collected by receiverd.
One suggestion could be to kill receiverd if it can't connect (so that supervisor restarts it).
Note that only a couple of messages were received on provision_resp, and no deploy_resp messages at all.

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Fuel Python Team (fuel-python)
tags: added: scale
Dmitry Pyzhov (dpyzhov) on 2015-06-03
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dmitry Ilyin (idv1985)
Dmitry Pyzhov (dpyzhov) wrote :

This bug is not about scale. Nastya will create a new bug report for the scale story. Hiding scale comments.

Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Fuel Python Team (fuel-python)
Dmitry Pyzhov (dpyzhov) on 2015-06-04
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Igor Kalnitsky (ikalnitsky)
Ihor Kalnytskyi (ikalnytskyi) wrote :

I have no idea what's really going on. Nailgun's receiver has re-connection code. Moreover, in case of a re-connection or some error it should print an error message to the logs. Still, the logs are clean, and it looks like the receiver is waiting for messages as usual.

From Astute's point of view, we see some AMQP connection problems, but then everything goes fine.

If we take a look at Astute's log more carefully, we find unusual messages: a long run of consecutive "casting message" calls. Normally there should be no consecutive calls; we execute some task and then cast a message to Nailgun with the updated progress.
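
For context, "casting a message to Nailgun" amounts to publishing a JSON report over AMQP using the broker settings quoted later in this thread. A minimal sketch, assuming a bunny-style client (not Astute's actual publishing code) and with the password omitted:

require "bunny"
require "json"

# Sketch only -- not Astute's actual code. Broker host, user and queue name
# are taken from the astute.conf dump quoted below; the password is a placeholder.
conn = Bunny.new(host: "10.20.0.2", port: 5672, user: "naily", pass: "<password>")
conn.start

report = {
  "method" => "deploy_resp",
  "args"   => {
    "task_uuid" => "a20646c6-40e0-4d8d-b3a5-c123ce4cda1d",
    "status"    => "ready",
    "progress"  => 100
  }
}

channel = conn.create_channel
# Publish the report towards the 'nailgun' queue, where receiverd should pick it up.
channel.default_exchange.publish(JSON.dump(report), routing_key: "nailgun")
conn.close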

Ihor Kalnytskyi (ikalnytskyi) wrote :

Actually, I saw a similarly unusual Astute log in another bug.

https://bugs.launchpad.net/fuel/+bug/1459289

Mike Scherbakov (mihgen) wrote :

Do we know how to reproduce it?

Ihor Kalnytskyi (ikalnytskyi) wrote :

No, we don't. This is a floating bug, and it seems impossible to fix without a "bad" env where you can try to debug the issue.

Ihor Kalnytskyi (ikalnytskyi) wrote :

Since I can't say anything more, I'm moving it to Incomplete.

P.S.: If someone catches this issue, PLEASE do not destroy the env; give it to me so I can debug it with strace/gdb/etc.

Changed in fuel:
status: Confirmed → Incomplete
Ihor Kalnytskyi (ikalnytskyi) wrote :

Moving it to 7.0, because it's incredibly rare and we don't know the repro steps.

Changed in fuel:
milestone: 6.1 → 7.0
vikas (vd44690) wrote :

I got the same error; how can I resolve it? Attaching the Astute log below.

/var/log/docker-logs/astute/astute.log

# Logfile created on 2015-07-05 13:12:42 +0000 by logger.rb/44203
2015-07-05T13:12:42 info: [514] Starting with settings
--- !ruby/hash:SymbolTable
:puppet_timeout: 5400
:puppet_deploy_interval: 2
:puppet_fade_timeout: 120
:mc_retries: 10
:mc_retry_interval: 1
:puppet_fade_interval: 30
:provisioning_timeout: 5400
:reboot_timeout: 240
:dump_timeout: 3600
:keys_src_dir: "/var/lib/fuel/keys"
:puppet_ssh_keys:
- neutron
- nova
- ceph
- mysql
:puppet_keys:
- mongodb
:keys_dst_dir: "/var/lib/astute"
:max_nodes_per_call: 50
:max_nodes_to_provision: 50
:ssh_retries: 5
:ssh_retry_timeout: 30
:max_nodes_per_remove_call: 10
:nodes_remove_interval: 10
:max_nodes_net_validation: 10
:dhcp_repeat: 3
:iops: 54.25
:splay_factor: 180
:agent_nodiscover_file: "/etc/nailgun-agent/nodiscover"
:broker_host: 10.20.0.2
:broker_port: 5672
:broker_rest_api_port: 15672
:broker_username: naily
:broker_password: aBukMmzx
:broker_service_exchange: naily_service
:broker_queue: naily
:broker_publisher_queue: nailgun
:broker_exchange: nailgun

2015-07-05T13:12:42 info: [514] => Starting astute with 7 worker(s)
2015-07-05T13:12:42 info: [514] master process ready
2015-07-05T13:12:42 info: [516] => Starting worker (PID: 516, ID: 0)
2015-07-05T13:12:42 info: [516] Worker initialization
2015-07-05T13:12:42 info: [520] => Starting worker (PID: 520, ID: 1)
2015-07-05T13:12:42 info: [520] Worker initialization
2015-07-05T13:12:42 info: [523] => Starting worker (PID: 523, ID: 2)
2015-07-05T13:12:42 info: [523] Worker initialization
2015-07-05T13:12:42 info: [527] => Starting worker (PID: 527, ID: 3)
2015-07-05T13:12:42 info: [527] Worker initialization
2015-07-05T13:12:42 info: [534] => Starting worker (PID: 534, ID: 5)
2015-07-05T13:12:42 info: [534] Worker initialization
2015-07-05T13:12:42 info: [531] => Starting worker (PID: 531, ID: 4)
2015-07-05T13:12:42 info: [531] Worker initialization
2015-07-05T13:12:42 info: [538] => Starting worker (PID: 538, ID: 6)
2015-07-05T13:12:42 info: [538] Worker initialization
2015-07-05T20:59:49 info: [493] Starting with settings
--- !ruby/hash:SymbolTable
:puppet_timeout: 5400
:puppet_deploy_interval: 2
:puppet_fade_timeout: 120
:mc_retries: 10
:mc_retry_interval: 1
:puppet_fade_interval: 30
:provisioning_timeout: 5400
:reboot_timeout: 240
:dump_timeout: 3600
:keys_src_dir: "/var/lib/fuel/keys"
:puppet_ssh_keys:
- neutron
- nova
- ceph
- mysql
:puppet_keys:
- mongodb
:keys_dst_dir: "/var/lib/astute"
:max_nodes_per_call: 50
:max_nodes_to_provision: 50
:ssh_retries: 5
:ssh_retry_timeout: 30
:max_nodes_per_remove_call: 10
:nodes_remove_interval: 10
:max_nodes_net_validation: 10
:dhcp_repeat: 3
:iops: 222.28
:splay_factor: 180
:agent_nodiscover_file: "/etc/nailgun-agent/nodiscover"
:broker_host: 10.20.0.2
:broker_port: 5672
:broker_rest_api_port: 15672
:broker_username: naily
:broker_password: aBukMmzx
:broker_service_exchange: naily_service
:broker_queue: naily
:broker_publisher_queue: nailgun
:broker_exchange: nailgun

2015-07-05T20:59:49 info: [493] => Starting astute with 7 worker(s)
2015-07-05T20:59:49 info: [493] master process...

Vladimir Sharshov (vsharshov) wrote :

Looks like the same problem was reported here: https://bugs.launchpad.net/fuel/6.1.x/+bug/1467320

I have reproduced this bug on:

[root@nailgun ~]# curl -sS http://10.108.0.2:8000/api/version/ | python -m json.tool
{
    "api": "1.0",
    "astute_sha": "e63709d16bd4c1949bef820ac336c9393c040d25",
    "auth_required": true,
    "build_id": "263",
    "build_number": "263",
    "feature_groups": [
        "mirantis"
    ],
    "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
    "fuel-library_sha": "1556601b9b7503285714d7d1e02cc0807b1c68f0",
    "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
    "fuel-ostf_sha": "582a81ccaa1e439a3aec4b8b8f6994735de840f4",
    "fuelmain_sha": "4dc6799370da4cddf06c04e4ecb7646102298535",
    "nailgun_sha": "b564ae20116297750bf6402b3a017e219bf4b468",
    "openstack_version": "2015.1.0-7.0",
    "production": "docker",
    "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
    "release": "7.0",
    "release_versions": {
        "2015.1.0-7.0": {
            "VERSION": {
                "api": "1.0",
                "astute_sha": "e63709d16bd4c1949bef820ac336c9393c040d25",
                "build_id": "263",
                "build_number": "263",
                "feature_groups": [
                    "mirantis"
                ],
                "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
                "fuel-library_sha": "1556601b9b7503285714d7d1e02cc0807b1c68f0",
                "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
                "fuel-ostf_sha": "582a81ccaa1e439a3aec4b8b8f6994735de840f4",
                "fuelmain_sha": "4dc6799370da4cddf06c04e4ecb7646102298535",
                "nailgun_sha": "b564ae20116297750bf6402b3a017e219bf4b468",
                "openstack_version": "2015.1.0-7.0",
                "production": "docker",
                "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
                "release": "7.0"
            }
        }
    }
}

Changed in fuel:
status: Incomplete → Confirmed
importance: High → Critical
tags: added: module-astute rabbitmq
removed: scale
Changed in fuel:
status: Confirmed → In Progress
Dmitry Pyzhov (dpyzhov) wrote :

Also reproduced in this report: https://bugs.launchpad.net/fuel/+bug/1491375

Dennis Dmitriev (ddmitriev) wrote :

Bug https://bugs.launchpad.net/fuel/+bug/1491375 has been reproduced many times, on almost half of the 7.0 swarm jobs.

For example, CI job:
https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.ha_destructive_ceph_neutron/45/console

##### rabbitmq log:

=ERROR REPORT==== 3-Sep-2015::04:25:24 ===
closing AMQP connection <0.345.0> (10.109.0.2:40620 -> 10.109.0.2:5672):
{heartbeat_timeout,running}

=INFO REPORT==== 3-Sep-2015::04:25:31 ===
accepting AMQP connection <0.2438.0> (10.109.0.2:44064 -> 10.109.0.2:5672)
################

This particular issue was most likely caused by the huge size of a message that was cast from Astute to Nailgun.

Scenario of the issue:

1) Deploy cluster

2) When the waiting timeout was reached in the system test (cluster deployment was still in progress!), the system test started a task to make a diagnostic snapshot (time: 04:20)

3) The diagnostic snapshot was successfully created, and then a message was cast to Nailgun (>4 MB in size, see the attachment) (time: 04:25:20)

4) RabbitMQ closed the connection (time: 04:25:24)

5) Astute, while processing the deployment that was still in progress, reconnected to RabbitMQ and continued the deployment:
## astute.log
2015-09-03T04:25:26 warning: [679] Trying to reconnect to message broker. Retry 5 sec later...

Dennis Dmitriev (ddmitriev) wrote :

For details, the /var/log folder from the job with the failed diagnostic snapshot is attached here.

Ihor Kalnytskyi (ikalnytskyi) wrote :

Well, today I tried to test our Astute connection error handling, and it looks good. I tried to:

* close the Astute->RabbitMQ socket using gdb while Astute is processing a message from Nailgun (before sending back the result)
* drop outgoing traffic

In both cases Astute tries to reconnect to the message broker almost immediately.

So, I've prepared a patch that introduces more verbose logging, https://review.openstack.org/#/c/220146/1 , and I'm going to build an ISO and ask QA to help me reproduce this issue.

I also want to note that perhaps we started facing this issue because Astute started using almost 100% CPU. For instance, Dennis's case could be (see the sketch after this list):

* Astute sends a message to RabbitMQ (4 MB, huge enough)
* The CPU is busy with other workers
* It takes some time to send the message
* There are no heartbeats from Astute to RabbitMQ (because it's busy sending the message)
* RabbitMQ closes the connection
* Astute reconnects, but afterwards fails to send the original message. And there are no retries, apparently.
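
To make the hypothesis concrete, a minimal sketch (bunny API for illustration only; Astute used the amqp gem at the time, and the astute.conf credentials are omitted) of how a very short heartbeat interval interacts with a large, slow publish:

require "bunny"

# Illustration only -- not Astute's actual code. With a 2-second heartbeat
# (the value the revert below removes), a client that cannot send heartbeat
# frames in time -- e.g. because the process is pegged at 100% CPU while
# pushing a multi-megabyte payload -- gets its connection closed by RabbitMQ
# with {heartbeat_timeout,running}, exactly as in the rabbit log above.
conn = Bunny.new(host: "10.20.0.2", heartbeat: 2)
conn.start

exchange = conn.create_channel.default_exchange
payload  = "x" * (4 * 1024 * 1024)   # ~4 MB, like the diagnostic snapshot report

# If heartbeats are starved for longer than the timeout while this is being
# written out, the broker drops the socket and the report is lost unless the
# caller retries after reconnecting.
exchange.publish(payload, routing_key: "nailgun")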

Related fix proposed to branch: master
Review: https://review.openstack.org/220196

Ihor Kalnytskyi (ikalnytskyi) wrote :

OK, Dennis's case is the following:

* Astute is at 100% CPU usage (due to bug #1491794)
* Astute is sending a huge message, and it takes quite some time due to the heavy load.

As a result, Astute is unable to respond to RabbitMQ's heartbeats, so the connection was dropped and the message was lost.

That means we need to remove heartbeats and implement message resending in such cases (a sketch of one possible retry approach follows).
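
As an illustration of the "resend on reconnect" idea, a minimal retry wrapper. It assumes a bunny-style channel and exchange; the helper name and parameters are hypothetical, not Astute's actual implementation:

require "bunny"

# Hypothetical helper, not the actual Astute code: if the publish fails
# because the connection was dropped (e.g. a heartbeat timeout), wait for
# the client to recover and try the same report again.
def cast_with_retry(exchange, payload, routing_key:, attempts: 3, delay: 5)
  attempts.times do
    begin
      exchange.publish(payload, routing_key: routing_key, persistent: true)
      return true
    rescue Bunny::Exception, IOError
      # Connection lost mid-publish; give automatic recovery a chance to
      # re-establish the channel before retrying.
      sleep delay
    end
  end
  false
end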

Reviewed: https://review.openstack.org/220196
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=c9b54daf1e60fdf528dc02818bc113f1cf78c11a
Submitter: Jenkins
Branch: master

commit c9b54daf1e60fdf528dc02818bc113f1cf78c11a
Author: Igor Kalnitsky <email address hidden>
Date: Thu Sep 3 15:22:03 2015 +0000

    Revert "Specify heartbeat for connection to RabbitMQ"

    Unfortunately, such a small timeout for heartbeats causes an incredible
    regression for Astute <-> RabbitMQ connectivity.

    Basically, if Astute is sending a huge message to RabbitMQ and for some
    reason we aren't able to do it within 2 seconds (the heartbeat timeout),
    the connection will be dropped by the RabbitMQ server.

    In fact, that's happened on SWARM while

    * Astute takes almost 100% CPU usage (bug #1491794)
    * Astute's sending diagnostic snapshot result message to Nailgun
      (more than 4MB of data).

    Since heartbeats don't solve the problem we have tried to fix, let's
    revert the original commit e24ca066bf6160bc1e419aaa5d486cad1aaa937d.

    Change-Id: I8734541c026b86506a6f735b1eb55371b393921c
    Related-Bug: #1461562

Dennis Dmitriev (ddmitriev) wrote :

Using patch https://review.openstack.org/#/c/220146/ on ISO #265, I tried to make a diagnostic snapshot from the CLI: 'fuel snapshot'.

Here is the result (lines with a lot of data are truncated for a more readable view): http://paste.openstack.org/show/444537/

Ihor Kalnytskyi (ikalnytskyi) wrote :

OK, according to Sergii G. and Dennis D., the two reverts helped us with this issue. If there are no more reproductions, I'd close this issue or move it to 8.0 (because we faced it before the two reverted patches).

Dmitry Pyzhov (dpyzhov) on 2015-09-04
Changed in fuel:
importance: Critical → Medium
status: In Progress → Confirmed
assignee: Igor Kalnitsky (ikalnitsky) → Fuel Python Team (fuel-python)
milestone: 7.0 → 8.0
Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-python
Changed in fuel:
importance: Medium → High
Mike Scherbakov (mihgen) wrote :

If the issue occurs once in a thousand installations (and we should be doing no fewer than that in a month's timeframe), why do we think it's High? Arthur, can you please justify the importance of this bug?

Vladimir Sharshov (vsharshov) wrote :

This bug is connected with this fix: https://review.openstack.org/#/c/234665/

Dmitry Pyzhov (dpyzhov) on 2015-11-20
tags: added: team-enh
tags: added: team-enhancements
removed: team-enh

It seems the statistics I gave about the bug frequency were incorrect. I'm moving it back to Medium due to the rare reproduction.

Changed in fuel:
importance: High → Medium
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/234665
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=b60624ee2c5f1d6d805619b6c27965a973508da1
Submitter: Jenkins
Branch: master

commit b60624ee2c5f1d6d805619b6c27965a973508da1
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Mon Oct 12 19:25:00 2015 +0300

    Move from amqp-gem to bunny

    Differences:

    - a separate independent channel for outgoing reports;
    - a solid way to redeclare already existing queues;
    - auto-recovery mode in case of network problems, by default;
    - a more solid, modern and simple library for AMQP.

    Also:

    - implement an asynchronous logger for event callbacks.

    A few words from the gems' authors:

    amqp gem brings in a fair share of EventMachine complexity which
    cannot be fully eliminated. Event loop blocking, writes that
    happen at the end of loop tick, uncaught exceptions in event
    loop silently killing it: it's not worth the pain unless
    you've already deeply invested in EventMachine and
    understand how it works.

    Closes-Bug: #1498847
    Closes-Bug: #1487397
    Closes-Bug: #1461562
    Related-Bug: #1485895
    Related-Bug: #1483182

    Change-Id: I52d005498ccb978ada158bfa64b1c7de1a24e9b0
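
For illustration, a minimal sketch of the bunny behaviour this migration relies on: automatic connection recovery (on by default) plus a dedicated channel for outgoing reports. The option names are bunny's; how Astute actually wires this up may differ, and the credentials are placeholders:

require "bunny"

# Sketch only -- not the actual fuel-astute code. After a network failure or
# a heartbeat-induced connection close, bunny reconnects and re-opens its
# channels by itself, so Astute does not silently lose the ability to report
# back to Nailgun.
conn = Bunny.new(
  host: "10.20.0.2",
  port: 5672,
  user: "naily",
  pass: "<password>",               # placeholder
  automatically_recover: true,      # bunny's default
  network_recovery_interval: 5      # seconds between recovery attempts
)
conn.start

report_channel = conn.create_channel  # separate channel dedicated to outgoing reports
report_channel.default_exchange.publish(
  '{"method":"deploy_resp","args":{"status":"ready","progress":100}}',
  routing_key: "nailgun"
)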

Changed in fuel:
status: In Progress → Fix Committed

Change abandoned by Vladimir Sharshov (<email address hidden>) on branch: master
Review: https://review.openstack.org/220146
Reason: We moved from ruby-amqp to bunny, so we need to refactor this change. I've created a new bug for that purpose: https://bugs.launchpad.net/fuel/+bug/1521816

tags: added: on-verification
Sergey Novikov (snovikov) wrote :

Verified on MOS 8.0 RC#1

I executed the steps to reproduce described above several times, and the bug was not reproduced. I also collected information about the CPU workload, and there are no issues there.

Changed in fuel:
status: Fix Committed → Fix Released