charm with storage gets stuck initialising on aws

Bug #1809478 reported by Casey Marshall
Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Fix Released  High        Ian Booth    2.6-beta1
2.4              Fix Released  High        Ian Booth
2.5              Fix Released  High        Ian Booth

Bug Description

I'm having several issues working with a model on a Juju 2.4.7 controller in our OpenStack (prodstack-is). I have an RT open on this issue: https://portal.admin.canonical.com/116010.

All files referenced below are in the attached tarball.

First off, the machine & unit agents are regularly losing connectivity to the controller, resulting in a status like the one shown in [omnibugs/000-juju-status-agents-lost].

Looking at a particular unit agent's log (I chose prometheus/0, which has been a part of this model for a while), I see regular connection errors to the controller; see [omnibugs/001-prometheus-unit-agent-errors-are-typical].

We're trying to deploy several new workloads to this model in a mojo spec. These unit agents keep getting stuck at "agent initializing":

unit agent log: [omnibugs/002-kafka-unit-stuck-agent-initializing]
status log: [omnibugs/003-kafka-status-log]
unit's cloud-init.log: [omnibugs/003-kafka-unit-cloud-init]
cloud-init-output.log [omnibugs/003-kafka-unit-cloud-init-output]

I tried `juju deploy ubuntu` directly from the shell on wendigo. It is taking a very long time to deploy; the command has been running for about 10 minutes and shows 'Deploying charm "cs:ubuntu-12".' but hasn't completed yet. I'll update this bug with the outcome once it completes or errors out.

To summarize the state of things:
- Juju is incredibly slow to respond
- Juju agents are constantly losing connectivity
- Juju cannot deploy any new workloads into this model; the agents get stuck at "initializing"

You can reach me via IRC (cmars) or Telegram (@cmarsss). To be clear, this is not currently impacting customers and the workloads are running fine on OpenStack, but it is concerning that we would be unable to roll out changes to support this service if we needed to.

Revision history for this message
Casey Marshall (cmars) wrote :

The ubuntu deploy continues; it's been running for over 30 minutes now, currently at "agent installing". https://paste.ubuntu.com/p/NSrM75tz83/

Revision history for this message
Casey Marshall (cmars) wrote :

I got an update-status hook error on a postgresql unit; it seems to be caused by API connection issues? https://paste.ubuntu.com/p/vfz2pZHwjH/

Revision history for this message
Casey Marshall (cmars) wrote :

The ubuntu deploy completed successfully and went to a "ready" state. So perhaps if I keep retrying deploys, they may eventually succeed.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1809478] Re: Juju model on 2.4.7 controller is unresponsive, agents are getting lost and stuck

"unit-postgresql-5" cannot open api: unable to connect to API: dial
tcp 10.25.2.109:17070: i/o timeout

Typically when we see an i/o timeout, that is when we fail to contact Mongo. Do you have logs from the controller?
The actual log lines do look like we are just failing to contact the controller at all; the dial attempts are not being actively rejected, they're just not going through.

Given "controller: prodstack-is" I'm guessing something was going on there?


Revision history for this message
Casey Marshall (cmars) wrote : Re: Juju model on 2.4.7 controller is unresponsive, agents are getting lost and stuck

I'm able to deploy other charms just fine; only kafka seems to be getting stuck. It's not even getting to the install hook, so I don't think it's anything to do with the charm. I forced the log level on one of the stuck units to DEBUG (by editing agent.conf), and it looks like the resolver is stuck waiting for storage; see the log snippet below.

My deployment is using mojo, which uses juju-deployer. In my deployer bundle, I specify the storage volume up front. I imagine (though I am not familiar with the internals of juju-deployer) that the storage is being added immediately after the deploy API call or command is executed.

I commented out the storage declaration in the bundle for now, and instead I'm running a script to attach storage after the charm is successfully deployed. That seems to work around this issue.
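
Roughly, the workaround looks like this (the charm and storage names are illustrative placeholders - the real deploy goes through mojo/juju-deployer, and the storage label has to match the charm's metadata):

  # deploy without declaring storage in the bundle
  juju deploy cs:kafka
  # once the unit has settled, attach the volume explicitly
  juju add-storage kafka/0 data=32G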

2019-01-07 18:25:51 DEBUG juju.worker.logger logger.go:70 reconfiguring logging from "<root>=WARNING;juju=DEBUG" to "juju=DEBUG"
2019-01-07 18:25:51 DEBUG juju.worker.leadership tracker.go:130 kafka/6 making initial claim for kafka leadership
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "metric-spool" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "logging-config-updater" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "metric-sender" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "hook-retry-strategy" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.logger logger.go:58 overriding logging config with override from agent.conf "juju=DEBUG"
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "uniter" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.proxyupdater proxyupdater.go:168 applying in-process legacy proxy settings proxy.Settings{Http:"", Https:"", Ftp:"", NoProxy:"10.25.2.109,10.25.2.110,10.25.2.111", AutoNoProxy:""}
2019-01-07 18:25:51 DEBUG juju.worker.proxyupdater proxyupdater.go:188 saving new legacy proxy settings proxy.Settings{Http:"", Https:"", Ftp:"", NoProxy:"10.25.2.109,10.25.2.110,10.25.2.111", AutoNoProxy:""}
2019-01-07 18:25:51 DEBUG juju.worker.proxyupdater proxyupdater.go:252 new apt proxy settings proxy.Settings{Http:"", Https:"", Ftp:"", NoProxy:"", AutoNoProxy:""}
2019-01-07 18:25:51 DEBUG juju.worker.meterstatus connected.go:88 got meter status change signal from watcher
2019-01-07 18:25:51 DEBUG juju.network network.go:507 no lxc bridge addresses to filter for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:543 cannot get "lxdbr0" addresses: route ip+net: no such network interface (ignoring)
2019-01-07 18:25:51 DEBUG juju.network network.go:543 cannot get "virbr0" addresses: route ip+net: no such network interface (ignoring)
2019-01-07 18:25:51 DEBUG juju.network network.go:492 including address local-cloud:10.25.2.111 for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:492 including address local-machine:127.0.0.1 for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:492 including address local-machine:::1 for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:561 addresses after filtering: [local-clou...


Revision history for this message
Ian Booth (wallyworld) wrote :

The charm hook execution will block until all requested storage is attached. For block storage, this means that the machine agent on the worker node has to poll block devices using lsblk and report back what it sees to Juju - the storage provisioner will attempt to match the /dev/foo block device name with the storage request and thus mark the storage as attached. Juju pauses charm hook execution until this happens.
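
The block device listing the machine agent works from can be approximated by hand on the worker node with something like:

  lsblk -d -o NAME,SIZE,TYPE,SERIAL,WWN

where the SERIAL/WWN columns are the kind of identifying info the provisioner tries to match against the volume attachment (exact columns available depend on the util-linux version).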

There was a bug on AWS recently where the block device WWN could not be used to match the storage request: https://bugs.launchpad.net/juju/+bug/1778033. This was fixed in 2.4.7. Perhaps there's a similar issue here. You can turn on TRACE debugging for "juju.apiserver.storagecommon" to see the attempts to pair the block device info with the storage request, and why the matching might be failing.
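
With the usual logging-config mechanism that would be something along the lines of (the apiserver side may need this set on the controller model rather than the hosted model):

  juju model-config logging-config="<root>=WARNING;juju.apiserver.storagecommon=TRACE"
  juju debug-log -m controller --include-module juju.apiserver.storagecommon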

Revision history for this message
Ian Booth (wallyworld) wrote :

So it looks like AWS is broken. Juju runs the udevadm command for each of the block devices it finds using lsblk. The udevadm command prints various attributes of the device which Juju needs to know, including the WWN, device links etc. When I ssh'ed into an AWS instance running a deployed postgresql with storage, the /dev/xvdf device showed up when the EBS volume was created, but the udevadm output was virtually empty. Without this info, Juju cannot complete the storage setup. There was no issue when I tested on GCE - the udevadm output was fully populated.

I have no idea what's changed on AWS to break udevadm and with it Juju storage.
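
For anyone who wants to check an instance by hand, the query Juju relies on can be reproduced directly (the device name is just the example from above):

  udevadm info --query=property --name=/dev/xvdf

On the affected AWS instances this returns almost nothing useful to match on (no ID_SERIAL/ID_WWN/DEVLINKS), whereas on GCE the properties are fully populated.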

Revision history for this message
Ian Booth (wallyworld) wrote :

Note - the above was with 2.5 but there's no difference between 2.4.7 and 2.5 in this area.

Revision history for this message
Ian Booth (wallyworld) wrote :

Maybe it's related to the newish NVMe EBS instances?

Revision history for this message
Ian Booth (wallyworld) wrote :

So it turns out that AWS Xen PV block devices do not have identifying hardware info from udev that we can use to match the volume request from Juju, so we need to fall back to matching on device name. It's not ideal, but it's all we can do for now.

https://github.com/juju/juju/pull/9620
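
In short: when udev reports no WWN/serial for a device, the fallback is to match on the device name the volume was attached with, allowing for the Xen renaming (an EBS volume attached as /dev/sdf typically shows up in the guest as /dev/xvdf). A rough way to compare both sides by hand, assuming the AWS CLI is available (the volume id is a placeholder):

  aws ec2 describe-volumes --volume-ids <vol-id> --query 'Volumes[].Attachments[].Device'
  lsblk -d -o NAME,WWN,SERIAL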

Changed in juju:
milestone: none → 2.6-beta1
status: New → In Progress
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → High
Ian Booth (wallyworld)
summary: - Juju model on 2.4.7 controller is unresponsive, agents are getting lost
- and stuck
+ charm with storage get stuck initialising on aws
summary: - charm with storage get stuck initialising on aws
+ charm with storage gets stuck initialising on aws
Revision history for this message
Casey Marshall (cmars) wrote :

Machine log, unit log and output of `juju storage --format yaml`

Revision history for this message
Casey Marshall (cmars) wrote :

Reproduced my issue again on OpenStack; see attached logs/storage info. The issue is actually "quota exceeded", but I didn't figure this out until looking at the yaml output of `juju storage`. There was nothing in the logs to indicate this was the underlying cause.
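
For reference, the provider error only surfaced in the storage status output, i.e. via:

  juju storage --format yaml

and the block-storage quota itself can be confirmed on the OpenStack side with something like `openstack quota show` (assuming the openstack client is available).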

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released