Juju agent is not communicating with the server.

Bug #1720155 reported by Burton Swan
This bug affects 1 person
Affects         Status        Importance  Assigned to     Milestone
Canonical Juju  Fix Released  Critical    John A Meinel
2.2             Fix Released  Critical    John A Meinel

Bug Description

Juju version: 2.3-alpha1 and 2.2 series.

PR 7868 is very likely the cause of this issue:
https://github.com/juju/juju/pull/7868

On the Juju 2.2 series, the corresponding change is PR 7863:
https://github.com/juju/juju/pull/7863

Using the charm-storage test as an example:
http://ci.jujucharms.com/job/charm-storage/

Re-running the last successful job, revision build 5743 (commit a906cf9), still works:
http://ci.jujucharms.com/job/charm-storage/2027/

Then re-running the first failed job, revision build 5745 (commit 774f2f0), fails again:
http://ci.jujucharms.com/job/charm-storage/2028/
2017-09-28 14:11:49 ERROR ('0', 'agent is not communicating with the server', 1149.777357)

Commit 774f2f0 also triggered other failures, e.g.:
http://ci.jujucharms.com/job/multi-series-charm/

This issue can also be reproduced manually on the latest Juju 2.3-alpha1 by running:
$ ./assess_storage.py $ENV $JUJU_BIN $JUJU_DATA

description: updated
description: updated
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1720155] Re: Juju agent is not communicating with the server.

Note that you have to be logged in to see the 'charm-storage' job.
Otherwise those links come back as 404.

I'm investigating this now.

John

Revision history for this message
John A Meinel (jameinel) wrote :

Given the need to run against $ENV, I'm guessing this expects to have a
controller already bootstrapped? There shouldn't be anything in the patch
that changes agents (vs possibly changing how controllers operate).

I'm missing something in how to run the tests manually. If I try:

./assess_storage.py lxd ~/dev/go/bin/juju ~/dev/tmp/juju_test

Then it warns me that:
/home/jameinel/dev/go/src/github.com/juju/juju/acceptancetests/utility.py:88: UserWarning: Not a directory /home/jameinel/dev/tmp/jujutest
  warnings.warn('Not a directory {}'.format(maybe_dir))
and fails with:
  File "/home/jameinel/dev/go/src/github.com/juju/juju/acceptancetests/jujupy/configuration.py", line 37, in get_selected_environment
    'Environment "{}" does not exist.'.format(selected))
jujupy.configuration.NoSuchEnvironment: Environment "lxd" does not exist.

However, if I first do "juju bootstrap lxd" I then get:
/home/jameinel/dev/go/src/github.com/juju/juju/acceptancetests/utility.py:94: UserWarning: Directory '/home/jameinel/dev/tmp/jujutest' has existing contents.
  'Directory {!r} has existing contents.'.format(maybe_dir))

and a final failure of:
jujupy.configuration.NoSuchEnvironment: Environment "lxd" does not exist.

Now, looking at the code, I see it doing:

def get_juju_home():
    home = os.environ.get('JUJU_HOME')
    if home is None:
        home = os.path.join(os.environ.get('HOME'), '.juju')
    return home
...
def get_environments_path(juju_home):
    return os.path.join(juju_home, 'environments.yaml')
...
def get_environments():
    """Return the environments for juju."""
    home = get_juju_home()
    with open(get_environments_path(home)) as env:
        return yaml.safe_load(env)['environments']

So it seems that this code assumes you have a leftover Juju 1.x setup lying
around with a matching environments.yaml file?
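
If it helps anyone else trying to run this by hand, here is a rough Python sketch of what that legacy lookup appears to want: JUJU_HOME pointing at a directory that contains an environments.yaml with a top-level 'environments' mapping. The file name and the 'environments' key come straight from the code above; the per-environment contents (e.g. 'type: lxd') are my guess, and the rest of the harness may well need more.

import os
import yaml

# Hypothetical JUJU_HOME layout for the legacy jujupy lookup quoted above.
juju_home = os.path.expanduser('~/dev/tmp/juju_home')
os.makedirs(juju_home, exist_ok=True)

config = {
    'environments': {
        # 'lxd' matches the $ENV name passed to assess_storage.py; the keys
        # inside are assumptions, not a verified schema.
        'lxd': {'type': 'lxd'},
    },
}

with open(os.path.join(juju_home, 'environments.yaml'), 'w') as f:
    yaml.safe_dump(config, f)

# get_juju_home() checks JUJU_HOME first, so export it before running the test.
os.environ['JUJU_HOME'] = juju_home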

Revision history for this message
John A Meinel (jameinel) wrote :

I don't see any successes for multi-series-charm, so I'm not sure that failure
is actually caused by this patch.

The failure around 'agent-lost' or
'agent-is-not-communicating-with-the-server' could indeed be caused by this
patch, as it specifically changes how we detect whether agents are alive and
communicating.

I haven't managed to actually run the test yet, but with just a "bootstrap
&& deploy" I don't see the problem. I've been trying with Xenial and LXD; I
may try with Trusty to see if it makes a difference. (It's possible that
Mongo 2.4 on Trusty doesn't support the $bit operation the way we think it does,
even though the test suite was passing.)
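
A quick way to test that hypothesis by hand would be to issue the same kind of upsert directly against a Mongo 2.4 mongod. A rough pymongo sketch (the connection target and the database/collection names are placeholders rather than Juju's actual presence collection, and it assumes a pymongo version that still talks to 2.4):

import pymongo

# Placeholder target: a locally running Mongo 2.4 mongod, not the controller DB.
client = pymongo.MongoClient('localhost', 27017)
coll = client['scratch']['bit_upsert_check']
coll.delete_many({})

# Upsert using $set + $bit, the pattern under suspicion.
result = coll.update_one(
    {'_id': 'blah:1245'},
    {'$set': {'slot': 1245}, '$bit': {'alive.0': {'or': 1}}},
    upsert=True,
)
print('upserted_id:', result.upserted_id)
print('stored doc:', coll.find_one({'_id': 'blah:1245'}))
# If 2.4 mishandles $bit on upsert, the document is simply missing (None).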

Revision history for this message
John A Meinel (jameinel) wrote :

Also, looking at http://ci.jujucharms.com/job/charm-storage/ I see only a
single success, at a906cf9; all earlier revisions are marked as failures, so
I'm not sure whether that is relevant.

There does seem to be something fishy going on with Mongo on Trusty. I see
the Ping Batcher operating:
2017-10-02 09:11:30 TRACE juju.state.presence pingbatcher.go:327 0xc42008b950 [cef56d] recorded 1 pings for 1 ping slot(s) and 1 fields in 0.010s

However, in the actual database there is no 'presence.pings' collection:
juju:PRIMARY> show collections
presence.beings
presence.seqs
system.indexes

While in the shell I did some testing, and it seems that in Mongo 2.4,
update(..., {upsert: true}) will *not* create a document that doesn't already
exist if one of the operators is $bit. Specifically, doing:

juju:PRIMARY> db.presence.test.find()
juju:PRIMARY> db.presence.test.update({_id: "blah:1245"}, {$set: {"slot": 1245}, $bit: {"alive.0": {"or": 1}}}, {"upsert": true})
juju:PRIMARY> db.presence.test.find()
# nothing returned, nothing created
juju:PRIMARY> db.presence.test.update({_id: "blah:1245"}, {$set: {"slot": 1245}, $inc: {"alive.0": NumberInt(1)}}, {"upsert": true})
juju:PRIMARY> db.presence.test.find()
{ "_id" : "blah:1245", "alive" : { "0" : 1 }, "slot" : 1245 }
# $inc *did* create the document
juju:PRIMARY> db.presence.test.update({_id: "blah:1245"}, {$set: {"slot": 1245}, $bit: {"alive.0": {"or": 2}}}, {"upsert": true})
juju:PRIMARY> db.presence.test.find()
{ "_id" : "blah:1245", "alive" : { "0" : 3 }, "slot" : 1245 }
# And now the $bit operation is able to modify it.

I'll play with it a bit more, but I would have thought we had a Trusty-based
run of the full test suite, and that should also have failed, rather than just
a CI test failing.

Revision history for this message
John A Meinel (jameinel) wrote :

I can confirm that Mongo 2.4 differs from Mongo 3.2 in this case:
3.2:
> db.foo.update({_id: "a"}, {$bit: {"a": {"or": NumberLong(1)}}}, {upsert: 1})
WriteResult({ "nMatched" : 0, "nUpserted" : 1, "nModified" : 0, "_id" : "a" })
> db.foo.find()
{ "_id" : "a", "a" : NumberLong(1) }
>
2.4:
> db.foo.update({_id: "a"}, {$bit: {"a": {"or": NumberLong(1)}}}, {upsert: 1})
Cannot use commands write mode, degrading to compatibility mode
WriteResult({ "nMatched" : 0, "nUpserted" : 0 })
> db.foo.find()
I can also confirm that if I only have Mongo 2.4 installed on my machine
and I run:
cd $GOPATH/src/github.com/juju/juju/state/presence

I get a lot of failures of the form:
----------------------------------------------------------------------
FAIL: pingbatcher_test.go:251: PingBatcherSuite.TestPingBatcherFlushesOnShutdown

[LOG] 0:00.000 INFO juju.state.presence faking presence to time slot 0
pingbatcher_test.go:266:
    // And now we find it
    c.Assert(s.pings.FindId(docId).One(&res), jc.ErrorIsNil)
... value *errors.errorString = &errors.errorString{s:"not found"} ("not found")

[LOG] 0:00.003 INFO juju.state.presence not faking presence time. Real time slot in use.

So this should have been caught just by running the unit test suite on
Trusty.
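
If the $bit-on-upsert limitation really is the cause, one workaround that follows from the shell experiments above is to fall back to seeding the document with $inc when the upsert did not create it. A minimal pymongo sketch of that idea (the function and field names are placeholders, it ignores races between concurrent writers, and it is not necessarily how the actual Go fix in Juju works):

from pymongo.collection import Collection

def record_ping(pings: Collection, doc_id: str, slot: int, field: str, bit: int) -> None:
    """Set `bit` in `field` of the ping document, tolerating Mongo 2.4."""
    result = pings.update_one(
        {'_id': doc_id},
        {'$set': {'slot': slot}, '$bit': {field: {'or': bit}}},
        upsert=True,
    )
    if result.matched_count == 0 and result.upserted_id is None:
        # Mongo 2.4 silently refused to create the document because of $bit.
        # Seed it with $inc instead (2.4 does honour $inc on upsert); since the
        # document does not exist yet, incrementing by `bit` equals OR-ing it into 0.
        pings.update_one(
            {'_id': doc_id},
            {'$set': {'slot': slot}, '$inc': {field: bit}},
            upsert=True,
        )

Another option would be to pre-create each slot document once (e.g. with a plain insert, ignoring duplicate-key errors) and only ever apply $bit to documents that are known to exist.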

Revision history for this message
John A Meinel (jameinel) wrote :

Fundamentally, this breaks Juju on Trusty, so it needs to be treated as a critical fix.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
importance: Undecided → Critical
status: New → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

I believe this was actually Fix Released for beta1.

Changed in juju:
milestone: none → 2.3-beta2
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
Changed in juju:
milestone: 2.3-beta2 → 2.3-beta1