Comment 99 for bug 1743249

Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

BTW, to be clear: I don't think the path forward on this issue is to
reason about how MAAS works and throw out patches that might improve
performance here and there. The path forward is to instrument MAAS on a
system with slow I/O and figure out exactly where it's getting hung up.
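
For illustration, a minimal Python sketch of that kind of instrumentation,
assuming the suspect step is a plain file read. The names `timed` and
`read_template` and the one-second threshold are hypothetical and are not
part of MAAS:

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("timing-sketch")


@contextmanager
def timed(step, slow_after=1.0):
    """Log how long the wrapped block took.

    Logs at WARNING when the block takes at least `slow_after` seconds,
    otherwise at DEBUG.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        level = logging.WARNING if elapsed >= slow_after else logging.DEBUG
        log.log(level, "%s took %.3fs", step, elapsed)


def read_template(path):
    """Hypothetical stand-in for one step of answering a grub.cfg request."""
    with timed("read template %s" % path):
        with open(path, "rb") as f:
            return f.read()
```

Wrapping each stage of a request this way on a machine with slow I/O would
show which stage the time actually goes to, rather than guessing.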

Jason

On Tue, Feb 6, 2018 at 5:09 PM, Jason Hobbs <email address hidden> wrote:
> dm-delay looks very interesting along those lines.
>
> https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html
>
> https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
>
> On Tue, Feb 6, 2018 at 5:06 PM, Jason Hobbs <email address hidden> wrote:
>> On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez
>> <email address hidden> wrote:
>>> I don't have logs anymore as I have since rebuilt my environment, but I can
>>> confirm seeing improvements on a maas server running with high IO (note it
>>> was a single region/rack).
>>>
>>> See inline:
>>>
>>>
>>> On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs <email address hidden>
>>> wrote:
>>>
>>>> Andres, it was a single test in both cases, and in both cases there was
>>>> almost no delay from MAAS. It's not significant enough to call it
>>>> positive results.
>>>>
>>>>
>>> Comment #93 shows there are /some/ improvements when comparing those two
>>> samples only, but as I have already said, we need data over time in both
>>> scenarios to properly compare and determine whether the changes do make any
>>> material performance improvements with the current conditions of the
>>> samples (both samples are with a fixed io starvation on the environment).
>>>
>>>
>>>> Since neither of you answered yes, I'll assume the answer was no to my
>>>> question of whether there was anything in my logs or data that showed
>>>> reading the template from disk on the rack controller was the culprit,
>>>> and that this fix just represents a guess at what might be causing the
>>>> delay.
>>>>
>>>
>>> To be fair, your logs do not provide anything concrete to determine what
>>> the culprit of the issue is on the MAAS side. They provide a lot of clues, and
>>> we have since determined that those issues were a result of IO
>>> starvation (from the VMs writing to disk). As such, the only way we can
>>> *really* see if the patch brings any significant performance improvements
>>> is to run tests in the environment where you were seeing the issues in the
>>> first place.
>>
>> I didn't think my logs provided anything concrete! That's because the
>> logging built into MAAS is not sufficient to do so.
>>
>> I can't break that environment to test anymore - we got it working
>> thanks to you guys' help and it's a production environment that needs
>> to keep running other tests.
>>
>> It might be possible to recreate this on another maas server, using
>> 'stress' or a similar tool to cause disk contention.
>>
>> Jason
>>
>>> As such, if you are willing to test if these make any material difference,
>>> I would unfix your environment and do two runs (one without the fix, and
>>> one with the fix). That's the only way we can really compare and be certain
>>> in *your* environment.
>>>
>>>>
>>>> --
>>>> You received this bug notification because you are subscribed to MAAS.
>>>> https://bugs.launchpad.net/bugs/1743249
>>>>
>>>> Title:
>>>> Failed Deployment after timeout trying to retrieve grub cfg
>>>>
>>>> To manage notifications about this bug go to:
>>>> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>>>>
>>>
>>>
>>> --
>>> Andres Rodriguez (RoAkSoAx)
>>> Ubuntu Server Developer
>>> MSc. Telecom & Networking
>>> Systems Engineer
>>>
>>> --
>>> You received this bug notification because you are subscribed to the bug
>>> report.
>>> https://bugs.launchpad.net/bugs/1743249
>>>
>>> Title:
>>> Failed Deployment after timeout trying to retrieve grub cfg
>>>
>>> Status in MAAS:
>>> New
>>> Status in grub2 package in Ubuntu:
>>> Fix Released
>>>
>>> Bug description:
>>> A node failed to deploy after it failed to retrieve a grub.cfg from
>>> MAAS due to a timeout. In the logs, it's clear that the server tried
>>> to retrieve the grub cfg many times, over about 30 seconds:
>>>
>>> http://paste.ubuntu.com/26387256/
>>>
>>> We see the same thing for other hosts around the same time:
>>>
>>> http://paste.ubuntu.com/26387262/
>>>
>>> It seems like MAAS is taking way too long to respond to these
>>> requests.
>>>
>>> This is very similar to bug 1724677, which was happening
>>> pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
>>> back-end failed" in the logs anymore.
>>>
>>> I connected to the console on this system and it had errors about
>>> timing out retrieving the grub-cfg, then it had an error message along
>>> the lines of "error not an ip" and then "double free". After I
>>> connected but before I could get a screenshot the system rebooted and
>>> was directed by maas to power off, which it did successfully after
>>> booting to linux.
>>>
>>> Full logs are available here:
>>> https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>>>
>>> This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.