ceilometer notification agent doesnt publish samples when workload_partitioning is enabled

Bug #1512001 reported by Rohit Jaiswal
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ceilometer
Undecided
Unassigned
tooz
Status tracked in Kilo
Kilo
Medium
Rohit Jaiswal
Trunk
Medium
Joshua Harlow

Bug Description

This affects stable/kilo.

ceilometer notification agent is running on 3 nodes with workload_partitioning as enabled, this is the configuration used:

[oslo_messaging_rabbit]
rabbit_hosts = ccp-c1-m1-mgmt:5672,ccp-c1-m2-mgmt:5672,ccp-c1-m3-mgmt:5672
rabbit_userid = ceilometer
rabbit_password = password
rabbit_use_ssl = False

[coordination]
backend_url = kazoo://ccp-c1-m1-mgmt:2181
check_watchers = 10.0
heartbeat = 1.0

[notification]
workload_partitioning=True
messaging_urls=rabbit://guest:password@ccp-c1-m1-mgmt:5672,ccp-c1-m2-mgmt:5672,ccp-c1-m3-mgmt:5672

notification agent log files from 3 nodes attached

generated notifiications from cinder service using cinder create command,
and verified rabbit connectivity as ok.

stack@ccp-c1-m2-mgmt:~$ sudo rabbitmqctl list_queues | grep notification
notifications.error 0
notifications.info 0

stack@ccp-c1-m2-mgmt:~$ sudo rabbitmqctl list_queues | grep ceilometer-pipe*
ceilometer-pipe-image_source:meter_sink.sample 6
ceilometer-pipe-meter_source:meter_sink.sample 6
ceilometer-pipe-volume_source:meter_sink.sample 6

stack@ccp-c1-m2-mgmt:~$ ceilometer meter-list
+------+------+------+-------------+---------+------------+
| Name | Type | Unit | Resource ID | User ID | Project ID |
+------+------+------+-------------+---------+------------+
+------+------+------+-------------+---------+------------+

Pipeline configuration for noti agents:

---
sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "ip.floating"
          - "network"
          - "network.create"
          - "network.update"
      sinks:
          - meter_sink
    - name: image_source
      interval: 30
      meters:
          - "image"
          - "image.size"
          - "image.upload"
          - "image.delete"
      sinks:
          - meter_sink
    - name: volume_source
      interval: 30
      meters:
          - "volume"
          - "volume.size"
          - "snapshot"
          - "snapshot.size"
      sinks:
          - meter_sink

sinks:
    - name: meter_sink
      transformers:

      publishers:
         - notifier://

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :
Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :
Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :
description: updated
Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Also observed that when ceilometer notification agent on node 1 is restarted, data is consumed from the pipeline queues and available to query from ceilometer api

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Disabling workload_partitioning works, despite the rabbit connection errors in the logs.

Revision history for this message
gordon chung (chungg) wrote :

it doesn't seem like it's registering to tooz correctly.

what happens when you run 'sudo rabbitmqctl list_consumers' i assume nothing is consuming those queues?

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

My logs show that the kazoo connections get dropped but connect shortly afterwards,

Not sure why i see this: IOError: Socket closed
(oslo_messaging._drivers.impl_rabbit): 2015-10-31 18:33:19,285 ERROR impl_rabbit on_error AMQP server padawan-ccp-c1-m2-mgmt:5672 closed the connection. Check login credentials: Socket closed

i already have this in rabbitmq.conf: {loopback_users, []}

sudo rabbitmqctl list_consumers | grep ceilometer-pipe* returns nothing..

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Is this a potential problem - https://github.com/openstack/ceilometer/blob/stable/kilo/ceilometer/notification.py#L211

I think pipeline_listeners will use the last cached transport and if there are connection issues with that rabbit node, the consumers will not start correctly?

Changed in ceilometer:
assignee: nobody → Rohit Jaiswal (rohit-jaiswal-3)
Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

On restarting the notification agents, i do see tooz behave better:

(kazoo.client): 2015-11-03 02:28:23,333 DEBUG connection _submit Sending request(xid=4): GetChildren(path=u'/tooz/ceilometer.notification', watcher=None)
(kazoo.client): 2015-11-03 02:28:23,334 DEBUG connection _read_response Received response(xid=4): [u'b78a1db9-217d-42cd-94b2-e6884a353ece', u'5ff2fe90-c251-4378-94ac-055c3410db42', u'0e679310-39fa-4bce-bee6-12681581c217']
(kazoo.client): 2015-11-03 02:28:23,334 DEBUG connection _submit Sending request(xid=5): GetChildren(path=u'/tooz/ceilometer.notification', watcher=<bound method ChildrenWatch._watcher of <kazoo.recipe.watchers.ChildrenWatch object at 0x7f6418675210>>)
(kazoo.client): 2015-11-03 02:28:23,336 DEBUG connection _read_response Received response(xid=5): [u'b78a1db9-217d-42cd-94b2-e6884a353ece', u'5ff2fe90-c251-4378-94ac-055c3410db42', u'0e679310-39fa-4bce-bee6-12681581c217']
(ceilometer.notification): 2015-11-03 02:28:23,336 WARNING notification start Non-metric meters may be collected. It is highly advisable to disable these meters using ceilometer.conf or the pipeline.yaml

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

, and i also see the pipeline queue consumers:

stack@padawan-ccp-c1-m1-mgmt:/var/log/ceilometer$ sudo rabbitmqctl list_consumers | grep ceilometer-pipe*
ceilometer-pipe-image_source:meter_sink.sample <email address hidden> 1 true 0 []
ceilometer-pipe-meter_source:meter_sink.sample <email address hidden> 1 true 0 []
ceilometer-pipe-volume_source:meter_sink.sample <email address hidden> 1 true 0 []

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :
Download full text (4.0 KiB)

Not sure why kazoo fails to report participating members: (From the logs snippet below, connection using kazoo is evantually established, but fails to get information on members..

(kazoo.client): 2015-10-31 18:37:42,754 DEBUG connection _read_response Received response(xid=1): [u'076cb3de-cf34-4c46-857c-4eeebea04a8b', u'6287b8d0-5769-4f3c-8061-a8e8da4d476e', u'7dc079a8-afb4-4da1-9205-481dd65e1f71']
(kazoo.client): 2015-10-31 18:38:12,585 WARNING connection _connect_attempt Connection dropped: outstanding heartbeat ping not received
(kazoo.client): 2015-10-31 18:38:12,585 WARNING connection _connect_attempt Transition to CONNECTING
(kazoo.client): 2015-10-31 18:38:12,585 INFO client _session_callback Zookeeper connection lost
(kazoo.client): 2015-10-31 18:38:12,776 INFO connection _connect Connecting to padawan-ccp-c1-m1-mgmt:2181
(kazoo.client): 2015-10-31 18:38:12,777 DEBUG connection _submit Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=4294978717, time_out=10000, session_id=22728108451692562, passwd='8a\x84\xbc\x9e!:_\xb0\x8c\xc6\xdc\r\x90\xc5L', read_only=None)
(kazoo.client): 2015-10-31 18:38:22,777 WARNING connection _connect_attempt Connection time-out: socket time-out during read
(kazoo.client): 2015-10-31 18:38:24,402 INFO connection _connect Connecting to padawan-ccp-c1-m1-mgmt:2181
(kazoo.client): 2015-10-31 18:38:24,403 DEBUG connection _submit Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=4294978717, time_out=10000, session_id=22728108451692562, passwd='8a\x84\xbc\x9e!:_\xb0\x8c\xc6\xdc\r\x90\xc5L', read_only=None)
(kazoo.client): 2015-10-31 18:38:24,760 WARNING connection _connect_attempt Connection dropped: socket connection broken
(kazoo.client): 2015-10-31 18:38:25,811 INFO connection _connect Connecting to padawan-ccp-c1-m1-mgmt:2181
(kazoo.client): 2015-10-31 18:38:25,812 DEBUG connection _submit Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=4294978717, time_out=10000, session_id=22728108451692562, passwd='8a\x84\xbc\x9e!:_\xb0\x8c\xc6\xdc\r\x90\xc5L', read_only=None)
(kazoo.client): 2015-10-31 18:38:25,820 WARNING connection _connect_attempt Connection dropped: socket connection broken
(kazoo.client): 2015-10-31 18:38:26,763 INFO connection _connect Connecting to padawan-ccp-c1-m1-mgmt:2181
(kazoo.client): 2015-10-31 18:38:26,765 DEBUG connection _submit Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=4294978717, time_out=10000, session_id=22728108451692562, passwd='8a\x84\xbc\x9e!:_\xb0\x8c\xc6\xdc\r\x90\xc5L', read_only=None)
(kazoo.client): 2015-10-31 18:38:26,766 WARNING connection _connect_attempt Connection dropped: socket connection broken
(kazoo.client): 2015-10-31 18:38:29,132 INFO connection _connect Connecting to padawan-ccp-c1-m1-mgmt:2181
(kazoo.client): 2015-10-31 18:38:29,133 DEBUG connection _submit Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=4294978717, time_out=10000, session_id=22728108451692562, passwd='8a\x84\xbc\x9e!:_\xb0\x8c\xc6\xdc\r\x90\xc5L', read_only=None)
(kazoo.client): 2015-10-31 18:38:29,138 WARNING connection _connect_attempt Session has expired
(kazoo.client): 2015-10-3...

Read more...

Changed in ceilometer:
assignee: Rohit Jaiswal (rohit-jaiswal-3) → nobody
Revision history for this message
gordon chung (chungg) wrote :

should we add tooz as affected?

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

See a lot of these in the zookeeper logs:

2015-11-03 18:37:37,304 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x50ce640c7f006e, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2015-11-03 18:37:37,380 - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 3633ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2015-11-03 18:37:37,392 - ERROR [CommitProcessor:0:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)

Changed in ceilometer:
assignee: nobody → Rohit Jaiswal (rohit-jaiswal-3)
Revision history for this message
gordon chung (chungg) wrote :
Changed in python-tooz:
assignee: nobody → Joshua Harlow (harlowja)
status: New → In Progress
Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

gordc: I dont think we need an additional fix in ceilometer for bug/1495663 (beyond https://review.openstack.org/#/c/224919/). I think harlowja's patch (https://review.openstack.org/#/c/241335/) fixes the issues described above, although retries should be a nice to have thing, but since we already have periodic heartbeats/watchers in ceilometer, it should be ok.

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

For the ceilometer fix for this bug, we may want to update docs to include a handler argument to the backend_url param value with value as eventlet

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (master)

Reviewed: https://review.openstack.org/241335
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=9c5cb6fd0ba6b08fd163130a8862137277e48020
Submitter: Jenkins
Branch: master

commit 9c5cb6fd0ba6b08fd163130a8862137277e48020
Author: Joshua Harlow <email address hidden>
Date: Tue Nov 3 11:08:54 2015 -0800

    Allow specifying a kazoo async handler 'kind'

    In situations where the built-in (and default) kazoo
    threading async handler does not work (which sometimes
    appears to happen under eventlet) allow for specifying
    a different handler (ie the 'eventlet' one) that should
    work better under those scenarios.

    Closes-bug: #1512001

    Change-Id: Iec5e39928b223a3ffca0b9b5b4d0fd61abaa0f2b

Changed in python-tooz:
status: In Progress → Fix Committed
Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

harlowja: Can this tooz fix be backported to stable/kilo and stable/liberty?

Revision history for this message
Joshua Harlow (harlowja) wrote :

Likely that should be possible, although please ask jd (julien danjou) on #openstack-oslo

Julien Danjou (jdanjou)
Changed in ceilometer:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/242619

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (stable/kilo)

Reviewed: https://review.openstack.org/242619
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=4251e1ea72bf4574629aaae701166da5ea2c6413
Submitter: Jenkins
Branch: stable/kilo

commit 4251e1ea72bf4574629aaae701166da5ea2c6413
Author: Joshua Harlow <email address hidden>
Date: Tue Nov 3 11:08:54 2015 -0800

    Allow specifying a kazoo async handler 'kind'

    In situations where the built-in (and default) kazoo
    threading async handler does not work (which sometimes
    appears to happen under eventlet) allow for specifying
    a different handler (ie the 'eventlet' one) that should
    work better under those scenarios.

    Closes-bug: #1512001
    Change-Id: Iec5e39928b223a3ffca0b9b5b4d0fd61abaa0f2b
    (cherry-picked from commit 9c5cb6fd0ba6b08fd163130a8862137277e48020)

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Even with the tooz fix, i see the hearbeat messages in the logs for a while after which they die out, logs at http://paste.openstack.org/show/478674/.

From the logs, it looks like its off to a good start as it reports the node information but then the heartbeat pings are not received at all, so eventually it dies out, this could be a problem with pings not being received correctly..

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :
Changed in python-tooz:
milestone: none → 1.27.0
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tooz (master)

Reviewed: https://review.openstack.org/268351
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=379e606028c1b244a06249563dc2f5d25ec15eb4
Submitter: Jenkins
Branch: master

commit 379e606028c1b244a06249563dc2f5d25ec15eb4
Author: Joshua Harlow <email address hidden>
Date: Fri Jan 15 14:22:42 2016 -0800

    Have zookeeper heartbeat perform basic get

    Under situations where greenthreads are used sometimes
    the kazoo threads can get starved, which can cause bad
    things to happen, so in order to help avoid this have
    the zookeeper drivers heartbeat methods fetch the basic
    namespace (which will cause heartbeats to also be
    flushed and connection verification to happen as a
    side-effect).

    Related-Bug: #1512001

    Change-Id: I244d6c433a41c4b5cd6e0e5d6f01013ccd5db348

Revision history for this message
ZhiQiang Fan (aji-zqfan) wrote :

I don't think it still exists, plz provide more info if you don't agree

Changed in ceilometer:
status: Triaged → Incomplete
assignee: Rohit Jaiswal (rohit-jaiswal-3) → nobody
gordon chung (chungg)
Changed in ceilometer:
status: Incomplete → Invalid
no longer affects: python-tooz/liberty
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers