[AWS] t3 instance types fail deployment when storage is attached

Bug #1798001 reported by james beedy
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Ian Booth
2.3             Fix Released  Critical    Ian Booth
2.4             Fix Released  Critical    Ian Booth

Bug Description

t3 instance types fail to deploy when storage is attached; see [0].

[0] https://paste.ubuntu.com/p/tgDQsmtmHb/

Revision history for this message
Ian Booth (wallyworld) wrote :

Can we get any errors reported via the AWS console? What does juju debug-log indicate? Maybe the instance type has a limit on how many volumes can be attached?

Revision history for this message
james beedy (jamesbeedy) wrote :

@wallyworld `juju debug-log` does not contain any log messages; it seems the agent never starts.

AWS states that the volume attachment limit for this instance type is 28; see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html

I'm looking for errors in AWS, still digging...
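
When debug-log is empty, the agent logs on the machine itself are the next place to look (a sketch; assumes the machine is still reachable over SSH and uses the standard Juju log paths on Ubuntu, with machine 1 as an illustrative target):

    # Check whether the machine agent started at all:
    juju ssh 1 sudo tail /var/log/juju/machine-1.log
    # Check whether provisioning itself completed:
    juju ssh 1 sudo tail /var/log/cloud-init-output.log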

Revision history for this message
james beedy (jamesbeedy) wrote :

To replicate, using juju 2.4.4:

`juju deploy postgresql --storage pgdata=ebs,10G`
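
After that command, these should show where things stall (a sketch; exact status strings vary slightly across Juju versions):

    # The unit hangs before install because the agent is waiting on storage:
    juju status postgresql
    # The volume is created in EC2, but Juju never sees it become attached:
    juju storage --format yaml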

Revision history for this message
Ian Booth (wallyworld) wrote :

Digging into the storage processing code, it looks like the problem is that the EBS volumes are being exposed as NVMe block devices. This is relatively new behaviour that was previously confined to c5 and m5 instance types but now appears to be more widely implemented. The issue is that the block device names become unpredictable, which breaks how Juju determines whether a volume has become attached to a machine. This in turn blocks the initialisation of the unit agent, since it waits for storage to become attached before installing the charm.
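
For illustration, this is roughly what the renaming looks like on a Nitro-based instance (illustrative volume IDs; the nvme numbering depends on kernel enumeration order, not on the requested device name):

    # The volume was attached via the EC2 API as "/dev/xvdf", but the kernel
    # exposes it as an NVMe device; the EBS volume ID survives in the serial:
    lsblk -o NAME,SERIAL
    # NAME     SERIAL
    # nvme0n1  vol0123456789abcdef0    <- root volume
    # nvme1n1  vol0fedcba9876543210    <- the volume requested as xvdf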

I need to look further to find a tasteful solution that works in all cases.

Changed in juju:
milestone: none → 2.4.5
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Ian Booth (wallyworld) wrote :

The only real way to fix this, regardless of how EC2 behaviour might change underneath us, is to punt and assume an NVMe block device link is valid for a newly attached machine volume, even though it may not be. This is because we have no real way of querying how an attached volume will be exposed on a machine instance. This has no practical effect other than potentially printing an incorrect device link when printing storage volume information in YAML. The device name (eg xvdf) is accurate. Updating the Juju model after the volume is recorded as attached would be messy, because we don't want to encode AWS-specific behaviour at that layer.
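
Concretely, the asymmetry is visible from the CLI (a sketch with an illustrative volume ID; assumes the aws CLI is configured):

    # EC2 records the attachment under the requested device name...
    aws ec2 describe-volumes --volume-ids vol-0fedcba9876543210 \
        --query 'Volumes[0].Attachments[0].Device'
    # "/dev/xvdf"
    # ...but no API call reports which /dev/nvme*n1 node that became on
    # the instance, so the device link has to be taken on trust.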

I deployed a 2.4.4 controller, deployed postgresql, and observed the storage issue. I then upgraded the controller with the above fix and observed the storage come good; postgresql became active.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Ian Booth (wallyworld)
Revision history for this message
Ian Booth (wallyworld) wrote :
Changed in juju:
milestone: 2.4.5 → 2.5-beta1
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released