amqp_changed is invoked by update_clients on every run of update-status hook.

Bug #1717579 reported by Jorge Niedbalski
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack RabbitMQ Server Charm
Fix Released
High
Jorge Niedbalski

Bug Description

[Environment]

Xenial
Charms 17.02 (upgraded from 16.07)

[Description]

After chasing down a problem with a long-running update-status hook (for reference
the update-status hook process was spawned at 21:42:08 wait4(194023, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 194023 and terminated nearly 1 hour after 22:40:33 exit_group(0)).

I traced the process using this script [0] and identified a continuous loop over the following system calls:

21:53:28 fcntl(3, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=1073741825, l_len=1}) = 0
21:53:28 stat("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db", {st_mode=S_IFREG|0644, st_size=82284544, ...}) = 0
21:53:28 stat("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db", {st_mode=S_IFREG|0644, st_size=82284544, ...}) = 0
21:53:28 open("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db-journal", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 4
21:53:28 fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
21:53:28 geteuid() = 0
21:53:28 fchown(4, 0, 0) = 0
21:53:28 fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
21:53:28 lseek(4, 0, SEEK_SET) = 0
--
21:53:28 unlink("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db-journal") = 0
21:53:28 fcntl(3, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
21:53:28 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=2}) = 0
21:53:28 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
21:53:28 stat("/usr/sbin/rabbitmqctl", {st_mode=S_IFREG|0755, st_size=1568, ...}) = 0
21:53:28 pipe([4, 5]) = 0
--

The only part of the code making extensive use of the peer store/retrieve semantics
is located here [1].

Also, the following log entry are displayed every 5 secs or so:

unit-rabbitmq-0.log:942573:2017-09-13 17:33:15 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942574:2017-09-13 17:33:20 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942575:2017-09-13 17:33:24 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942576:2017-09-13 17:33:28 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942577:2017-09-13 17:33:32 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942578:2017-09-13 17:33:36 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942579:2017-09-13 17:33:40 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942580:2017-09-13 17:33:44 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942581:2017-09-13 17:33:48 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942582:2017-09-13 17:33:52 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942583:2017-09-13 17:33:56 DEBUG juju-log Propagating peer settings to all amqp relations
unit-rabbitmq-0.log:942584:2017-09-13 17:34:00 DEBUG juju-log Propagating peer settings to all amqp relations

Which are being printed on this line of code [2]

[Observations and possible fix]

The commit [3] added a explicit call to update_clients which is executed
on every hook run (including update-status) , the update_clients function iterates
over all the units related through the amqp relation and makes a call to the amqp_changed
function, then it iterates over the peer units setting the relation_settings for
each relation_id.

As the update_status hook interval is set to run every 5 minutes , isn't
convenient to do this iteration on this hook, particularly in large deployments
when the amqp relation is shared among multiple units and this behavior might
cause an unwanted workload overhead on the cloud.

For reference during the execution of this update-status hook 504 relation-set
executions were recorded:
$ ack-grep 'relation-set' strace.194010 | wc -l
504

[0] https://gist.github.com/niedbalski/0cfa4a3c91839ea9d124a3659befff41
[1] https://github.com/openstack/charm-rabbitmq-server/blob/b47623805a12e0f335b318f570de69924c979eb2/hooks/rabbitmq_server_relations.py#L141
[2] https://github.com/openstack/charm-rabbitmq-server/blob/b47623805a12e0f335b318f570de69924c979eb2/hooks/rabbitmq_server_relations.py#L291
[3] https://github.com/openstack/charm-rabbitmq-server/commit/2472e1ca9fbd68cdd50640eadd071177ddc5968e

tags: added: sts
Changed in charm-rabbitmq-server:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Jorge Niedbalski (niedbalski)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/504511

Revision history for this message
Edward Hope-Morley (hopem) wrote : Re: [Bug 1717579] Re: amqp_changed is invoked by update_clients on every run of update-status hook.
Download full text (5.9 KiB)

Hi @niedbalski please see bug 1698340 where this issue was partly resolved
for 17.08 release. The wider topic was discussed this week at the OpenStack
ptg. Unfortunately due to the history of this charm there is currently no
way to completely remove this per-hook execution but I think we can
certainly do more to improve it.

On 15 Sep 2017 8:21 pm, "Jorge Niedbalski" <email address hidden>
wrote:

> ** Changed in: charm-rabbitmq-server
> Status: New => In Progress
>
> ** Changed in: charm-rabbitmq-server
> Importance: Undecided => High
>
> ** Changed in: charm-rabbitmq-server
> Assignee: (unassigned) => Jorge Niedbalski (niedbalski)
>
> --
> You received this bug notification because you are a member of OpenStack
> Charmers, which is subscribed to OpenStack Charms.
> Matching subscriptions: openstack-charms
> https://bugs.launchpad.net/bugs/1717579
>
> Title:
> amqp_changed is invoked by update_clients on every run of update-
> status hook.
>
> Status in OpenStack rabbitmq-server charm:
> In Progress
>
> Bug description:
> [Environment]
>
> Xenial
> Charms 17.02 (upgraded from 16.07)
>
> [Description]
>
> After chasing down a problem with a long-running update-status hook (for
> reference
> the update-status hook process was spawned at 21:42:08 wait4(194023,
> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 194023 and terminated
> nearly 1 hour after 22:40:33 exit_group(0)).
>
> I traced the process using this script [0] and identified a continuous
> loop over the following system calls:
>
> 21:53:28 fcntl(3, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET,
> l_start=1073741825, l_len=1}) = 0
> 21:53:28 stat("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db",
> {st_mode=S_IFREG|0644, st_size=82284544, ...}) = 0
> 21:53:28 stat("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db",
> {st_mode=S_IFREG|0644, st_size=82284544, ...}) = 0
> 21:53:28 open("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db-journal",
> O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 4
> 21:53:28 fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> 21:53:28 geteuid() = 0
> 21:53:28 fchown(4, 0, 0) = 0
> 21:53:28 fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> 21:53:28 lseek(4, 0, SEEK_SET) = 0
> --
> 21:53:28 unlink("/var/lib/juju/agents/unit-rabbitmq-1/charm/.unit-state.db-journal")
> = 0
> 21:53:28 fcntl(3, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET,
> l_start=1073741826, l_len=510}) = 0
> 21:53:28 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET,
> l_start=1073741824, l_len=2}) = 0
> 21:53:28 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET,
> l_start=0, l_len=0}) = 0
> 21:53:28 stat("/usr/sbin/rabbitmqctl", {st_mode=S_IFREG|0755,
> st_size=1568, ...}) = 0
> 21:53:28 pipe([4, 5]) = 0
> --
>
> The only part of the code making extensive use of the peer
> store/retrieve semantics
> is located here [1].
>
> Also, the following log entry are displayed every 5 secs or so:
>
> unit-rabbitmq-0.log:942573:2017-09-13 17:33:15 DEBUG juju-log
> Propagating peer settings to all amqp relations
> unit-rabbi...

Read more...

Felipe Reyes (freyes)
tags: added: backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/504511
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=f9b2eb310e78cdc5a8ae9461a5bf7124c5bb95d7
Submitter: Jenkins
Branch: master

commit f9b2eb310e78cdc5a8ae9461a5bf7124c5bb95d7
Author: Jorge Niedbalski <email address hidden>
Date: Fri Sep 15 17:17:58 2017 -0300

    Remove update_clients() call from update-status.

    The commit [0] added a explicit call to update_clients
    which is executed on every hook run (including update-status),
    the update_clients function iterates over all the units
    related through the amqp relation and makes a call
    to the amqp_changed function.

    As the update_status hook interval is set to run every 5 minutes,
    isn't convenient to do a call to amqp_changed for every related unit
    particularly in large scale deployments, as this might cause
    an unwanted workload overhead on the cloud at every tick
    of the update-status hook.

    Further work needs to be done to improve the overall
    handling performed via the update_clients function but
    this commit alleviates the problem of leaving the update-status
    hook stuck for ~1 hour every time.

    [0] 2472e1ca9fbd68cdd50640eadd071177ddc5968e

    Change-Id: If9d6ed03d0b8a79fa0210fba57a3d54c2646214d
    Closes-Bug: #1717579
    Signed-off-by: Jorge Niedbalski <email address hidden>

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/17.08)

Fix proposed to branch: stable/17.08
Review: https://review.openstack.org/506763

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/506781

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/17.08)

Reviewed: https://review.openstack.org/506763
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=18ced509825bcdd249322d574a90d431ab9d088f
Submitter: Jenkins
Branch: stable/17.08

commit 18ced509825bcdd249322d574a90d431ab9d088f
Author: Jorge Niedbalski <email address hidden>
Date: Fri Sep 15 17:17:58 2017 -0300

    Remove update_clients() call from update-status.

    The commit [0] added a explicit call to update_clients
    which is executed on every hook run (including update-status),
    the update_clients function iterates over all the units
    related through the amqp relation and makes a call
    to the amqp_changed function.

    As the update_status hook interval is set to run every 5 minutes,
    isn't convenient to do a call to amqp_changed for every related unit
    particularly in large scale deployments, as this might cause
    an unwanted workload overhead on the cloud at every tick
    of the update-status hook.

    Further work needs to be done to improve the overall
    handling performed via the update_clients function but
    this commit alleviates the problem of leaving the update-status
    hook stuck for ~1 hour every time.

    [0] 2472e1ca9fbd68cdd50640eadd071177ddc5968e

    Change-Id: If9d6ed03d0b8a79fa0210fba57a3d54c2646214d
    Closes-Bug: #1717579
    Signed-off-by: Jorge Niedbalski <email address hidden>
    (cherry picked from commit f9b2eb310e78cdc5a8ae9461a5bf7124c5bb95d7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/506781
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=9f9646bc356f269f14b2112f96f2112e8b647f23
Submitter: Jenkins
Branch: master

commit 9f9646bc356f269f14b2112f96f2112e8b647f23
Author: David Ames <email address hidden>
Date: Fri Sep 22 14:25:03 2017 -0700

    Update all clients once and only once

    This change solves two problmes. First, one off problems waiting
    for the cluster to complete. Second, running update_clients too
    often. Update clients will get executed only once as soon as
    leader_node_is_ready() or client_node_is_ready() returns True.
    Subsequent client requests will be handled by normal
    amqp-relation-changed hooks.

    Co-Authored-By: Liam Young <email address hidden>
    Partial-Bug: #1717579

    Change-Id: I02c648cccc72d816beeec5546b6c7914d57c607a

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/17.08)

Fix proposed to branch: stable/17.08
Review: https://review.openstack.org/511511

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/17.08)

Reviewed: https://review.openstack.org/511511
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=5b6399327dedc4ffc06f1e0b53394a60656ba84e
Submitter: Zuul
Branch: stable/17.08

commit 5b6399327dedc4ffc06f1e0b53394a60656ba84e
Author: David Ames <email address hidden>
Date: Fri Sep 22 14:25:03 2017 -0700

    Update all clients once and only once

    This change solves two problmes. First, one off problems waiting
    for the cluster to complete. Second, running update_clients too
    often. Update clients will get executed only once as soon as
    leader_node_is_ready() or client_node_is_ready() returns True.
    Subsequent client requests will be handled by normal
    amqp-relation-changed hooks.

    Co-Authored-By: Liam Young <email address hidden>
    Partial-Bug: #1717579

    Change-Id: I02c648cccc72d816beeec5546b6c7914d57c607a
    (cherry picked from commit 9f9646bc356f269f14b2112f96f2112e8b647f23)

James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: none → 17.11
James Page (james-page)
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.