charm with storage gets stuck initialising on aws

Bug #1809478 reported by Casey Marshall
Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Fix Released  High        Ian Booth    2.6-beta1
2.4              Fix Released  High        Ian Booth
2.5              Fix Released  High        Ian Booth

Bug Description

I'm having several issues working with a model on a Juju 2.4.7 controller in our OpenStack (prodstack-is). I have an RT open on this issue: https://portal.admin.canonical.com/116010.

All files referenced below are in the attached tarball.

First off, the machine & unit agents are regularly losing connectivity to the controller, resulting in a status like the one shown in [omnibugs/000-juju-status-agents-lost].

Looking at a particular unit agent's log (I chose prometheus/0, which has been a part of this model for a while), I see regular connection errors to the controller; see [omnibugs/001-prometheus-unit-agent-errors-are-typical].

We're trying to deploy several new workloads to this model in a mojo spec. These unit agents keep getting stuck at "agent initializing":

unit agent log: [omnibugs/002-kafka-unit-stuck-agent-initializing]
status log: [omnibugs/003-kafka-status-log]
unit's cloud-init.log: [omnibugs/003-kafka-unit-cloud-init]
cloud-init-output.log [omnibugs/003-kafka-unit-cloud-init-output]

I tried `juju deploy ubuntu` directly from the shell on wendigo. It is taking a very long time to deploy; the command has been running for about 10 minutes and shows 'Deploying charm "cs:ubuntu-12".' but hasn't completed yet. I'll update this bug with the outcome once it completes or errors out.

To summarize the state of things:
- Juju is incredibly slow to respond
- Juju agents are constantly losing connectivity
- Juju cannot deploy any new workloads into this model; the agents get stuck at "initializing"

You can reach me via IRC (cmars) or Telegram (@cmarsss). To be clear, this is not currently impacting customers and the workloads are running fine on OpenStack, but it is concerning that we would be unable to roll out changes to support this service if we needed to.

Revision history for this message
Casey Marshall (cmars) wrote :

The ubuntu deploy continues; it's been running for over 30 minutes now, currently at "agent installing". https://paste.ubuntu.com/p/NSrM75tz83/

Revision history for this message
Casey Marshall (cmars) wrote :

I got an update-status hook error on a postgresql unit; it seems to be caused by API connection issues? https://paste.ubuntu.com/p/vfz2pZHwjH/

Revision history for this message
Casey Marshall (cmars) wrote :

The ubuntu deploy completed successfully and went to a "ready" state. So perhaps if I keep retrying deploys, they may eventually succeed.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1809478] Re: Juju model on 2.4.7 controller is unresponsive, agents are getting lost and stuck

"unit-postgresql-5" cannot open api: unable to connect to API: dial
tcp 10.25.2.109:17070: i/o timeout

Typically when we see an i/o timeout, that is when we fail to contact Mongo. Do you have logs from the controller?
The actual log lines do look like we are just failing to contact the controller at all; the dial attempts are not being actively rejected, they're just not going through.

Given "controller: prodstack-is" I'm guessing something was going on there?


Revision history for this message
Casey Marshall (cmars) wrote : Re: Juju model on 2.4.7 controller is unresponsive, agents are getting lost and stuck

I'm able to deploy other charms just fine; only kafka seems to be getting stuck. It's not even getting to the install hook, so I don't think it's anything to do with the charm. I forced the log level on one of the stuck units to DEBUG (by editing agent.conf), and it looks like the resolver is stuck waiting for storage; see the log snippet below.

My deployment is using mojo, which uses juju-deployer. In my deployer bundle, I specify the storage volume up front. I imagine (though I am not familiar with the internals of juju-deployer) that the storage is being added immediately after the deploy API call or command is executed.

I commented out the storage declaration in the bundle for now, and instead I'm running a script to attach storage after the charm is successfully deployed. That seems to work around this issue.
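
Roughly, the workaround looks like this (the charm and storage names are illustrative placeholders - the real deploy goes through mojo/juju-deployer, and the storage label has to match the charm's metadata):

  # deploy without declaring storage in the bundle
  juju deploy cs:kafka
  # once the unit has settled, attach the volume explicitly
  juju add-storage kafka/0 data=32G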

2019-01-07 18:25:51 DEBUG juju.worker.logger logger.go:70 reconfiguring logging from "<root>=WARNING;juju=DEBUG" to "juju=DEBUG"
2019-01-07 18:25:51 DEBUG juju.worker.leadership tracker.go:130 kafka/6 making initial claim for kafka leadership
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "metric-spool" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "logging-config-updater" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "metric-sender" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "hook-retry-strategy" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.logger logger.go:58 overriding logging config with override from agent.conf "juju=DEBUG"
2019-01-07 18:25:51 DEBUG juju.worker.dependency engine.go:545 "uniter" manifold worker started
2019-01-07 18:25:51 DEBUG juju.worker.proxyupdater proxyupdater.go:168 applying in-process legacy proxy settings proxy.Settings{Http:"", Https:"", Ftp:"", NoProxy:"10.25.2.109,10.25.2.110,10.25.2.111", AutoNoProxy:""}
2019-01-07 18:25:51 DEBUG juju.worker.proxyupdater proxyupdater.go:188 saving new legacy proxy settings proxy.Settings{Http:"", Https:"", Ftp:"", NoProxy:"10.25.2.109,10.25.2.110,10.25.2.111", AutoNoProxy:""}
2019-01-07 18:25:51 DEBUG juju.worker.proxyupdater proxyupdater.go:252 new apt proxy settings proxy.Settings{Http:"", Https:"", Ftp:"", NoProxy:"", AutoNoProxy:""}
2019-01-07 18:25:51 DEBUG juju.worker.meterstatus connected.go:88 got meter status change signal from watcher
2019-01-07 18:25:51 DEBUG juju.network network.go:507 no lxc bridge addresses to filter for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:543 cannot get "lxdbr0" addresses: route ip+net: no such network interface (ignoring)
2019-01-07 18:25:51 DEBUG juju.network network.go:543 cannot get "virbr0" addresses: route ip+net: no such network interface (ignoring)
2019-01-07 18:25:51 DEBUG juju.network network.go:492 including address local-cloud:10.25.2.111 for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:492 including address local-machine:127.0.0.1 for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:492 including address local-machine:::1 for machine
2019-01-07 18:25:51 DEBUG juju.network network.go:561 addresses after filtering: [local-clou...


Revision history for this message
Ian Booth (wallyworld) wrote :

The charm hook execution will block until all requested storage is attached. For block storage, this means that the machine agent on the worker node has to poll block devices using lsblk and report back what it sees to Juju - the storage provisioner will attempt to match the /dev/foo block device name with the storage request and thus mark the storage as attached. Juju pauses charm hook execution until this happens.
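
The block device listing the machine agent works from can be approximated by hand on the worker node with something like:

  lsblk -d -o NAME,SIZE,TYPE,SERIAL,WWN

where the SERIAL/WWN columns are the kind of identifying info the provisioner tries to match against the volume attachment (exact columns available depend on the util-linux version).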

There was a bug on AWS recently where the block device WWN could not be used to match the storage request: https://bugs.launchpad.net/juju/+bug/1778033. This was fixed in 2.4.7. Perhaps there's a similar issue here. You can turn on TRACE debugging for "juju.apiserver.storagecommon" to see the attempts to pair the block device info with the storage request, and why the matching might be failing.
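
With the usual logging-config mechanism that would be something along the lines of (the apiserver side may need this set on the controller model rather than the hosted model):

  juju model-config logging-config="<root>=WARNING;juju.apiserver.storagecommon=TRACE"
  juju debug-log -m controller --include-module juju.apiserver.storagecommon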

Revision history for this message
Ian Booth (wallyworld) wrote :

So it looks like AWS is broken. Juju runs the udevadm command for each of the block devices it finds using lsblk. The udevadm command prints various attributes of the device which Juju needs to know, including the WWN, device links etc. When I ssh'ed into an AWS instance running a deployed postgresql with storage, the /dev/xvdf device showed up when the EBS volume was created, but the udevadm output was virtually empty. Without this info, Juju cannot complete the storage setup. There was no issue when I tested on GCE - the udevadm output was fully populated.

I have no idea what's changed on AWS to break udevadm and with it Juju storage.
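
For anyone who wants to check an instance by hand, the query Juju relies on can be reproduced directly (the device name is just the example from above):

  udevadm info --query=property --name=/dev/xvdf

On the affected AWS instances this returns almost nothing useful to match on (no ID_SERIAL/ID_WWN/DEVLINKS), whereas on GCE the properties are fully populated.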

Revision history for this message
Ian Booth (wallyworld) wrote :

Note - the above was with 2.5 but there's no difference between 2.4.7 and 2.5 in this area.

Revision history for this message
Ian Booth (wallyworld) wrote :

Maybe it's related to the newish NVMe EBS instances?

Revision history for this message
Ian Booth (wallyworld) wrote :

So it turns out that AWS Xen PV block devices do not have identifying hardware info from udev that we can use to match the volume request from Juju, so we need to fall back to matching on device name. It's not ideal, but it's all we can do for now.

https://github.com/juju/juju/pull/9620
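
In short: when udev reports no WWN/serial for a device, the fallback is to match on the device name the volume was attached with, allowing for the Xen renaming (an EBS volume attached as /dev/sdf typically shows up in the guest as /dev/xvdf). A rough way to compare both sides by hand, assuming the AWS CLI is available (the volume id is a placeholder):

  aws ec2 describe-volumes --volume-ids <vol-id> --query 'Volumes[].Attachments[].Device'
  lsblk -d -o NAME,WWN,SERIAL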

Changed in juju:
milestone: none → 2.6-beta1
status: New → In Progress
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → High
Ian Booth (wallyworld)
summary: - Juju model on 2.4.7 controller is unresponsive, agents are getting lost
- and stuck
+ charm with storage get stuck initialising on aws
summary: - charm with storage get stuck initialising on aws
+ charm with storage gets stuck initialising on aws
Revision history for this message
Casey Marshall (cmars) wrote :

Machine log, unit log and output of `juju storage --format yaml`

Revision history for this message
Casey Marshall (cmars) wrote :

Reproduced my issue again on OpenStack; see attached logs/storage info. The issue is actually "quota exceeded", but I didn't figure this out until looking at the yaml output of `juju storage`. There was nothing in the logs to indicate this was the underlying cause.
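
For reference, the provider error only surfaced in the storage status output, i.e. via:

  juju storage --format yaml

and the block-storage quota itself can be confirmed on the OpenStack side with something like `openstack quota show` (assuming the openstack client is available).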

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released