maas provider, hwclock out of sync means juju will not work

Bug #1511589 reported by David Britton
This bug affects 3 people
Affects     Status        Importance  Assigned to    Milestone
MAAS        Fix Released  Critical    Gavin Panella
cloud-init  Expired       Medium      Unassigned
curtin      Triaged       Undecided   Unassigned
falkor      Fix Released  High        Chris Glass
juju-core   Invalid       Undecided   Unassigned

Bug Description

MAAS provides no means to ensure the hardware clock is set, and juju relies on accurate clocks.

This leads to errors like the following when you bootstrap on machines that otherwise work fine:

"ERROR juju.cmd supercommand.go:430 gomaasapi: got error back from server:
401 OK (Authorization Error: \'Expired timestamp: given 1446087606 and now
1446094822 has a greater difference than threshold 300\')\nERROR failed to
bootstrap environment: subprocess encountered error code 1\n\')'), 1),
(u'waiting', 179), (u'succeeded', 10)]"
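
For reference, the two timestamps in the error above are roughly two hours apart, far beyond the 300-second limit. A minimal Python sketch of that arithmetic follows. The helper name `exceeds_threshold` is hypothetical, and treating the check as an absolute difference is an assumption; this is not MAAS's actual OAuth code.

```python
# Illustrative only: reproduces the arithmetic behind the "Expired timestamp"
# error. exceeds_threshold is a hypothetical helper, not MAAS's OAuth code.

def exceeds_threshold(given: int, now: int, threshold: int = 300) -> bool:
    """Return True when two Unix timestamps differ by more than `threshold` seconds.

    Using the absolute difference is an assumption made for this sketch.
    """
    return abs(now - given) > threshold


given, now = 1446087606, 1446094822  # values taken from the error message
skew = now - given
print(skew)                           # 7216 seconds
print(divmod(skew, 3600))             # (2, 16): 2 hours and 16 seconds
print(exceeds_threshold(given, now))  # True
```

A skew this close to a whole number of hours is what a clock set for the wrong timezone would produce, although the error message alone does not prove that.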

The only thing a user can do is touch each machine, sometimes booting them into an OS to fix their hwclock (which can still drift from that point, of course).

This error path is exposed when the stock 'ntpdate' from Ubuntu does not work, for instance if your lab is behind a proxy.

David Britton (dpb)
description: updated
tags: removed: kanban-cross-team
David Britton (dpb)
description: updated
Andres Rodriguez (andreserl) wrote :

Hi David,

I don't fully understand how MAAS is involved in Juju failing here. When MAAS deploys a machine, it ensures that the clock is the same among all machines. Otherwise, machines wouldn't be able to access the metadata server during the deployment process (this also affects enlistment and commissioning).

Now, my question is whether the Juju client is being run on a different machine whose clock differs from the MAAS server's, hence causing the maas provider to fail?

Andres Rodriguez (andreserl) wrote :

Please also see [1], as this was the issue on the MAAS/cloud-init side.

[1] https://bugs.launchpad.net/maas/+bug/978127

Andreas Hasenack (ahasenack) wrote :

MAAS should inject the proper ntpdate, date and hwclock commands into the commissioning script. It knows about the ntp server already.

As I understand it, the fix for bug #978127 doesn't adjust the clocks; it just works around the problem to get the machine booting and installing the OS. "set oauth clockskew to 602" doesn't adjust the clock.

Cheryl Jennings (cherylj) wrote :

Seems to be a MAAS issue as juju is just surfacing the MAAS error, so setting juju-core to invalid.

Changed in juju-core:
status: New → Invalid
David Britton (dpb)
Changed in falkor:
status: New → Triaged
importance: Undecided → High
Andres Rodriguez (andreserl) wrote :

Cheryl,

Can you expand on what MAAS error is being surfaced? It is my understanding that MAAS *can* deploy machines successfully, and if it can deploy them successfully, then in reality there's no MAAS error.

That being said, if MAAS / cloud-init sees that the clock is different and adjusts for the skew so that things work, could it be that Juju is incorrectly capturing the error before it is resolved and failing altogether?

Cheryl Jennings (cherylj) wrote :

I think the error is actually coming from OAuth when juju is trying to connect to the MAAS server. It appears that an incorrect timezone could cause this error, or just a time skew greater than the acceptable margin.

Andres Rodriguez (andreserl) wrote :

Right, so from what I've read over the weekend, it seems that if the Juju client is in a different timezone than the MAAS server (or potentially the bootstrap node), then this issue would happen.

David Britton (dpb)
Changed in falkor:
assignee: nobody → Chris Glass (tribaal)
Chris Glass (tribaal)
Changed in falkor:
status: Triaged → In Progress
Chris Glass (tribaal)
Changed in falkor:
status: In Progress → Fix Committed
Changed in falkor:
milestone: none → 0.15
status: Fix Committed → Fix Released
Gavin Panella (allenap) wrote :

Chris, Adam, what did the fix in falkor entail?

Changed in maas:
status: New → Incomplete
Adam Collard (adam-collard) wrote :

Gavin,

We added a ntpdate and hwclock -w to a commissioning script
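
A rough sketch of what such a commissioning step might look like, written in Python for illustration. The actual falkor script is not shown in this thread, so the function names and the `ntp.example.com` server below are placeholders, not the real implementation.

```python
import subprocess
from typing import Callable, List, Sequence


def clock_sync_commands(ntp_server: str) -> List[List[str]]:
    """Commands to set the system clock from NTP, then copy it to the hardware clock."""
    return [
        ["ntpdate", ntp_server],  # step the system clock from the NTP server
        ["hwclock", "-w"],        # write the system time to the hardware clock
    ]


def sync_clock(ntp_server: str,
               run: Callable[[Sequence[str]], object] = subprocess.check_call) -> None:
    """Run each command in order; check_call raises if one fails (root is required)."""
    for cmd in clock_sync_commands(ntp_server):
        run(cmd)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # ntp.example.com is a placeholder server name.
    sync_clock("ntp.example.com", run=lambda cmd: print(" ".join(cmd)))
```

Passing the runner as a parameter keeps the ordering (NTP first, then the hardware clock) checkable without root access or a reachable NTP server.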

David Britton (dpb)
Changed in maas:
status: Incomplete → Confirmed
Gavin Panella (allenap) wrote :

> We added a ntpdate and hwclock -w to a commissioning script

Interesting. I assumed curtin would be doing this already during installation, but something is obviously amiss. I'm going to add a curtin bug task because I don't think this is a bug in MAAS exactly, though it would be possible to fix this in MAAS.

Changed in maas:
status: Confirmed → Incomplete
Ryan Harper (raharper)
Changed in curtin:
status: New → Triaged
Scott Moser (smoser) wrote :

The way that I'd like to have this fixed in MAAS is
 a.) for maas to add vendor-data into the meta-data service
 b.) cloud-init to add support for configuring ntp
 c.) maas to declare in vendor-data what the ntp service is

The reason I've suggested this in vendor-data is that this is a maas setting, which would be relevant to all environments (commissioning, enlistment and user-deployed). The user-deployed environment has the *user's* provided user-data, but maas would still be able to provide vendor-data there.

Changed in cloud-init:
status: New → Confirmed
importance: Undecided → Medium
Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1511589] Re: maas provider, hwclock out of sync means juju will not work

NTP is fine, but can we also have MAAS provide a "min-time" field which
cloud-init will use if it's very different from the hwclock, so that we
have a macro "get into the ballpark" hammer on time before ntp is even up?

Mark

Mike Pontillo (mpontillo) wrote :

If we just want a ballpark, cloud-init already contacts MAAS via HTTP, and there is already a "Date:" header we could scrape. To what extent do we trust that? (Enough to always apply it on commissioning/enlistment/deployment, or just as a fallback if NTP isn't working?)
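
To illustrate the idea, here is a minimal Python sketch that turns a `Date:` header value into a clock offset. The header value and local time are made up, and `offset_from_date_header` is a hypothetical helper, not part of cloud-init or MAAS. The header only has one-second resolution, so this gives a ballpark at best.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def offset_from_date_header(date_header: str, local_now: datetime) -> float:
    """Seconds to add to the local clock to match the server's Date header.

    Positive means the local clock is behind the server. `local_now` must be
    timezone-aware, because a header ending in "GMT" parses to an aware datetime.
    """
    server_now = parsedate_to_datetime(date_header)
    return (server_now - local_now).total_seconds()


# Made-up values for illustration: the local clock is 2 h 16 s behind.
header = "Wed, 22 Jun 2016 17:18:00 GMT"
local = datetime(2016, 6, 22, 15, 17, 44, tzinfo=timezone.utc)
print(offset_from_date_header(header, local))  # 7216.0
```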

David A. Desrosiers (setuid) wrote :

On 6/22/16 1:18 PM, Mike Pontillo wrote:
> If we just want a ballpark, cloud-init already contacts MAAS via HTTP,
> and there is already a "Date:" header we could scrape. To what extent do
> we trust that? (Enough to always apply it on
> commissioning/enlistment/deployment, or just as a fallback if NTP isn't
> working?)

Wouldn't diagnosing why NTP isn't working be the better long-term,
strategic approach?

Or something like 'hwclock --systohc', and then making sure your NTP
servers are correctly configured, correct stratum servers, ports open,
etc.?

Just a thought. TMTOWTDI.


Mark Shuttleworth (sabdfl) wrote :

Right, the point is not to avoid NTP, the point is to jerk the system
into (rough) conformance as early as possible in the boot process. We
have a LOT of things that go wonky when you tell them it's the 70's. NTP
is great for the running system.

Mark

David Britton (dpb) wrote :

On Wed, Jun 22, 2016 at 11:18 AM, Mike Pontillo wrote:

> If we just want a ballpark, cloud-init already contacts MAAS via HTTP,
> and there is already a "Date:" header we could scrape. To what extent do
> we trust that? (Enough to always apply it on
> commissioning/enlistment/deployment, or just as a fallback if NTP isn't
> working?)
>
>
That reminds me, Mike: enlistment and commissioning even use this Date
header (I think) to adjust for any local skew, but only when using the OAuth
token they have been provided. That is why enlisting/commissioning the
node works fine with a very old clock. What doesn't happen is the next
logical step: setting the system clock and then the hwclock.

See: bug #978127


Lee Trager (ltrager)
Changed in maas:
status: Incomplete → Confirmed
milestone: none → 2.1.0
Mark Shuttleworth (sabdfl) wrote :

Can we confirm the agreed path here is that cloud-init will set a
hardware clock based on some rough indication from MAAS, if the
difference in time is more than a (low) threshold (call it 15 minutes)?

Mark
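
One way to read the proposal above, as a small Python sketch: only step the clock when it differs from the MAAS-provided reference by more than the threshold. The 15-minute value is the figure floated above; `should_step_clock` and its argument names are hypothetical and do not describe anything cloud-init actually implements.

```python
from datetime import datetime, timedelta, timezone

THRESHOLD = timedelta(minutes=15)  # the "call it 15 minutes" figure from the thread


def should_step_clock(local: datetime, reference: datetime,
                      threshold: timedelta = THRESHOLD) -> bool:
    """True when the local clock is off from the reference by more than `threshold`."""
    return abs(reference - local) > threshold


# Made-up reference time standing in for a rough indication from MAAS.
reference = datetime(2016, 6, 22, 17, 18, tzinfo=timezone.utc)
print(should_step_clock(datetime(1970, 1, 1, tzinfo=timezone.utc), reference))  # True
print(should_step_clock(reference - timedelta(minutes=5), reference))           # False
```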

Sean Feole (sfeole)
tags: added: hs-arm64
Jason Hobbs (jason-hobbs) wrote :

We've hit this in OIL on MAAS 2.0rc2.

tags: added: oil
Changed in maas:
milestone: 2.0.1 → 2.1.0
importance: Undecided → Critical
status: Confirmed → In Progress
assignee: nobody → Gavin Panella (allenap)
Andres Rodriguez (andreserl) wrote :

MAAS now provides NTP and keeps the MAAS servers as well as the machines in sync. As such, we are closing this one.

Thanks.

Changed in maas:
status: In Progress → Fix Released
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Confirmed → Expired