'Deploying' timed out after 40 minutes / Failedbcache: register_bcache() error

Bug #1772490 reported by Ashley Lai
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Undecided
Unassigned
curtin
Invalid
Undecided
Unassigned
linux (Ubuntu)
Incomplete
Undecided
Unassigned

Bug Description

We have a few runs over the weekend failed to deploy with maas 2.3.3.

May 21 11:33:50 swoobat maas.node: [info] geodude: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 21 11:33:50 swoobat maas.node: [error] geodude: Marking node failed: Node operation 'Deploying' timed out after 40 minutes.

https://solutions.qa.canonical.com/#/qa/testRun/67dae845-b22e-4de1-9b30-0ecb28eb3c35

Revision history for this message
Ashley Lai (alai) wrote :
Changed in maas:
status: New → Incomplete
status: Incomplete → Invalid
summary: - deployment timed out after 40 minutes
+ bcache: register_bcache() error
Changed in maas:
status: Invalid → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: bcache: register_bcache() error

Hi Ashley,

Marking this incomplete for MAAS (although I think it is invalid). Opening a task for curtin and for the kernel team. The error in curtin implies is the same issue as [1]. Judging from [2], it seems that it should already be fixed:

May 21 10:57:03 geodude cloud-init[1643]: Processing triggers for libc-bin (2.27-3ubuntu1) ...
May 21 10:57:04 geodude cloud-init[1643]: curtin: Installation started. (18.1-623-gae48e86-0ubuntu1~ubuntu16.04.1)
May 21 10:57:04 geodude cloud-init[1643]: third party drivers not installed or necessary.
May 21 10:57:05 geodude kernel: [ 49.126408] bcache: register_bcache() error /dev/sda3: device already registered (emitting change event)
May 21 10:57:05 geodude kernel: [ 49.166935] bcache: register_bcache() error /dev/sda3: device already registered (emitting change event)
May 21 10:57:05 geodude kernel: [ 49.209233] bcache: register_bcache() error /dev/sda3: device already registered (emitting change event)
May 21 10:57:05 geodude kernel: [ 49.254763] bcache: register_bcache() error /dev/sda3: device already registered (emitting change event)
May 21 10:57:05 geodude kernel: [ 49.319986] bcache: register_bcache() error /dev/sda3: device already registered (emitting change event)

The ephemeral environment kernel seems to be:

May 21 10:56:43 geodude kernel: [ 0.000000] Linux version 4.15.0-20-generic (buildd@lgw01-amd64-039) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 (Ubuntu 4.15.0-20.21-generic 4.15.17)

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729145
[2]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729145/comments/54

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1772490

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Ryan Harper (raharper) wrote : Re: bcache: register_bcache() error

ay 21 10:57:05 geodude kernel: [ 49.126408] bcache: register_bcache() error /dev/sda3: device already registered (emitting change event)

These are not curtin or kernel errors but expected output.

I looked at the qa link but I didn't find the install.log debug output.

Changed in curtin:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Attached the rsyslog showing the error. It indeed doesn't seem like there were any curtin failures, but I wonder that, while curtin successfully process, the machine actually doesn't actually boot onto the filesystem due to the kernel issue?

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1772490] Re: bcache: register_bcache() error

Thanks for the log. Curtin installed without error, I'll mark invalid.

AFAICT, it booted fine and was instructed to power off.

May 21 13:18:51 geodude cloud-init[1676]: Powering node off.
May 21 13:18:51 geodude ec2:
May 21 13:18:51 geodude ec2:
#############################################################
May 21 13:18:51 geodude ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
May 21 13:18:51 geodude ec2: 1024
SHA256:O2WkpYqVPV8G7pumPrb/sAi8F8pBY3ay3jF+Ymfko1Q root@geodude (DSA)
May 21 13:18:51 geodude ec2: 256
SHA256:KILySo9Cbqs70KPsyV16HZpWueeHqiBzOzPFSGxXl1M root@geodude
(ECDSA)
May 21 13:18:51 geodude ec2: 256
SHA256:C4clHtaNL6GpwIdlJwyZXq23NfbqK0s3YWzof0Eu7CY root@geodude
(ED25519)
May 21 13:18:51 geodude ec2: 2048
SHA256:LFGGivHhyNdrN5AXu5mj5eBENjk2tWNDj41K1VsP6Z0 root@geodude (RSA)
May 21 13:18:51 geodude ec2: -----END SSH HOST KEY FINGERPRINTS-----
May 21 13:18:51 geodude ec2:
#############################################################
May 21 13:18:51 geodude cloud-init[1676]: Cloud-init v. 18.2 running
'modules:final' at Mon, 21 May 2018 13:18:50 +0000. Up 27.71 seconds.
May 21 13:18:51 geodude cloud-init[1676]: Cloud-init v. 18.2 finished
at Mon, 21 May 2018 13:18:51 +0000. Datasource DataSourceMAAS
[http://10.244.40.33/MAAS/metadata/]. Up 28.40 seconds
May 21 13:18:51 geodude systemd[1]: Started Execute cloud user/final scripts.
May 21 13:18:51 geodude systemd[1]: Reached target Cloud-init target.
May 21 13:18:51 geodude systemd[1]: Startup finished in 15.521s
(kernel) + 13.137s (userspace) = 28.659s.

On Mon, May 21, 2018 at 3:15 PM, Andres Rodriguez
<email address hidden> wrote:
> Attached the rsyslog showing the error. It indeed doesn't seem like
> there were any curtin failures, but I wonder that, while curtin
> successfully process, the machine actually doesn't actually boot onto
> the filesystem due to the kernel issue?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1772490
>
> Title:
> bcache: register_bcache() error
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1772490/+subscriptions

Revision history for this message
Ryan Harper (raharper) wrote :

May 21 11:00:42 geodude cloud-init[1643]: curtin: Installation finished.

From the rsyslog, curtin finished the install without error.

Changed in curtin:
status: Incomplete → Invalid
Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: bcache: register_bcache() error

@Ryan,

I'm marking this as incomplete for curtin provided that after further debugging, I can see that the late command that's supposed to send the "netboot_off" operation is not being sent.

This could be because curtin failed but we are lacking logs to determine this.

Changed in curtin:
status: Invalid → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Ashley,

Could you please start gathering the logs from HAProxy running for the MAAS servers?

Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Ryan,

Marking this as invalid for curtin again. I looked closely to the log and saw this:

May 21 11:00:35 geodude cloud-init[1643]: --2018-05-21 11:00:35-- http://10.244.40.33/MAAS/metadata/latest/by-id/gnbttp/
May 21 11:00:35 geodude cloud-init[1643]: Connecting to 10.244.40.33:80... connected.
May 21 11:00:35 geodude cloud-init[1643]: HTTP request sent, awaiting response... 200 OK
May 21 11:00:35 geodude cloud-init[1643]: Length: unspecified [text/plain]
May 21 11:00:35 geodude cloud-init[1643]: Saving to: ‘/dev/null’
May 21 11:00:35 geodude cloud-init[1643]: 0K 138K=0s
May 21 11:00:35 geodude cloud-init[1643]: 2018-05-21 11:00:35 (138 KB/s) - ‘/dev/null’ saved [2]

That means curtin run the correct netboot_off command, which should have told MAAS that the machine is to localboot on next reboot.

As such, I need the HAProxy logs to continue to be able to debug as it was done against: 10.244.40.33:80

Changed in curtin:
status: Incomplete → Invalid
summary: - bcache: register_bcache() error
+ 'Deploying' timed out after 40 minutes / Failedbcache: register_bcache()
+ error
Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1772490] Re: bcache: register_bcache() error

On Mon, May 21, 2018 at 3:59 PM, Andres Rodriguez
<email address hidden> wrote:
> @Ryan,
>
> I'm marking this as incomplete for curtin provided that after further
> debugging, I can see that the late command that's supposed to send the
> "netboot_off" operation is not being sent.
>
> This could be because curtin failed but we are lacking logs to determine
> this.

What?

Late commands run before we report curtin installation success.

Do you have the actual curtin config sent?

Also, generally it would be good if the qa runs set curtin install to
verbose so more info is dumped into the rsyslog output.
In debug mode we dump the merge curtin config that's sent to curtin in syslog.

>
> ** Changed in: curtin
> Status: Invalid => Incomplete
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1772490
>
> Title:
> bcache: register_bcache() error
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1772490/+subscriptions

Revision history for this message
Andres Rodriguez (andreserl) wrote :

After further looking into this, I can determine that:

1. Curtin installed successfully despite the kernel issues
1. Curtin told MAAS the installation completed successfully.
3. cloud-init told the machine to reboot.

At this point, we lose all logs, but from what we can see is that:

1. The machine never attempted to PXE boot post-deployment
2. There's no indication that the machine actually booted into the installed environment.

This leads me to believe that due to the kernel error above, the kernel could have crashed right before the machine rebooted, which prevented the machine from rebooting into the installed system. This is not something available in the rsyslog because at the time, networking is down. This would only show on a console log.

That said, the logs definitely show that the machine never came up again.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

As such, I don't believe this is a MAAS issue.

Changed in maas:
status: Incomplete → Invalid
Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1772490] Re: 'Deploying' timed out after 40 minutes / Failedbcache: register_bcache() error

The bcache message is *NOT* a failure in the kernel;

That message is emitted when a bcache device is reregistered, the
kernel will emit the changed event to ensure
that udev will create /dev/bcache/by-{uuid, label} links.

There may be some other issue preventing shutting down, but it's
unrelated to those bcache messages.

On Wed, May 23, 2018 at 10:35 AM, Jason Hobbs <email address hidden> wrote:
> *** This bug is a duplicate of bug 1768893 ***
> https://bugs.launchpad.net/bugs/1768893
>
> ** This bug has been marked a duplicate of bug 1768893
> installation on several nodes failed with errors relating to dmsetup remove of ceph devices.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1772490
>
> Title:
> 'Deploying' timed out after 40 minutes / Failedbcache:
> register_bcache() error
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1772490/+subscriptions

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.