Comment 88 for bug 1743249

Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

OK, and what about the log messages showing the region controller losing
contact with the rack controller? What is that about?

On Tue, Feb 6, 2018 at 11:37 AM, Andres Rodriguez
<email address hidden> wrote:
> fwiw, the deadlock issue is regiond trying to determine which process
> should send updates to which racks for *dhcp* changes, so this is not at
> all related to the RPC boot requests for pxe.
>
> On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs <email address hidden>
> wrote:
>
>> Can you please comment on the "deadlock detected" error from the db log
>> posted in #36?
>>
>> http://paste.ubuntu.com/26530761/
>>
>> That is not expected behavior, is it? Also, the fact that MAAS thinks it's
>> losing rack/region connections seems like it could be related to this
>> behavior.
>>
>> --
>> You received this bug notification because you are subscribed to MAAS.
>> https://bugs.launchpad.net/bugs/1743249
>>
>> Title:
>> Failed Deployment after timeout trying to retrieve grub cfg
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>>
>> Launchpad-Notification-Type: bug
>> Launchpad-Bug: product=maas; milestone=2.4.x; status=New;
>> importance=Undecided; assignee=None;
>> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
>> status=In Progress; importance=Medium; <email address hidden>;
>> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
>> Launchpad-Bug-Information-Type: Public
>> Launchpad-Bug-Private: no
>> Launchpad-Bug-Security-Vulnerability: no
>> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs
>> mpontillo vorlon
>> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
>> Launchpad-Bug-Modifier: Jason Hobbs (jason-hobbs)
>> Launchpad-Message-Rationale: Subscriber (MAAS)
>> Launchpad-Message-For: andreserl
>>
>
>
> --
> Andres Rodriguez (RoAkSoAx)
> Ubuntu Server Developer
> MSc. Telecom & Networking
> Systems Engineer
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
> Failed Deployment after timeout trying to retrieve grub cfg
>
> Status in MAAS:
> New
> Status in grub2 package in Ubuntu:
> In Progress
>
> Bug description:
> A node failed to deploy after it failed to retrieve a grub.cfg from
> MAAS due to a timeout. In the logs, it's clear that the server tried
> to retrieve the grub cfg many times, over about 30 seconds:
>
> http://paste.ubuntu.com/26387256/
>
> We see the same thing for other hosts around the same time:
>
> http://paste.ubuntu.com/26387262/
>
> It seems like MAAS is taking way too long to respond to these
> requests.
>
> This is very similar to bug 1724677, which was happening before
> Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
> back-end failed" in the logs anymore.
>
> I connected to the console on this system and it showed errors about
> timing out retrieving the grub.cfg, followed by an error message along
> the lines of "error not an ip" and then "double free". After I
> connected, but before I could get a screenshot, the system rebooted and
> was directed by MAAS to power off, which it did successfully after
> booting to Linux.
>
> Full logs are available here:
> https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
> This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.