MAAS fails w/ bad hwclock & no NTP access

Bug #1581553 reported by Sean Feole
This bug affects 1 person
Affects     Status    Importance  Assigned to  Milestone
MAAS        Invalid   Undecided   Unassigned
cloud-init  New       Undecided   Unassigned
curtin      New       Undecided   Unassigned

Bug Description

mmm, let me try this again :)

We have been testing MAAS 2.0 Beta 4 on some enablement hardware and have been seeing a problem where, during enlistment, a clock skew is reported on the host console:

[ 288.753993] cloud-init[1464]: Success
[ 290.910846] cloud-init[1464]: updated clock skew to 7853431
[ 290.911437] cloud-init[1464]: request to http://10.246.48.112/MAAS/metadata//2012-03-01/ failed. sleeping 1.: HTTP Error 401: OK
[ 290.911929] cloud-init[1464]: Success
[ 292.752177] cloud-init[1464]: updated clock skew to 7853431
[ 292.752746] cloud-init[1464]: request to http://10.246.48.112/MAAS/metadata//2012-03-01/ failed. sleeping 1.: HTTP Error 401: OK
[ 292.753234] cloud-init[1464]: Success
[ 337.916546] cloud-init[1464]: updated clock skew to 7853431
[ 337.917122] cloud-init[1464]: request to http://10.246.48.112/

This happens a number of times. And as you mentioned, this is cloud-init fixing the clock skew.

The enlistment will complete and we are able to successfully finish commissioning the Host. The host will appear in a ready state via the MAAS UI.

Now, before I go further: as mentioned earlier, I always ensure that both host and client dates and times match in UEFI prior to starting enlistment. If anything, the times on all of the hosts are offset by +/- 2 minutes.

Moving onto deployment:

Deploying Xenial on these hosts is where I get stuck. Due to the clock skew (which cloud-init compensates for), tar reports time stamps approximately 1 month in the future while copying the disk image, and this eventually causes the deployment to fail due to a timeout (tar is still extracting the root image).

Does setting the NTP host in MAAS settings have any effect on this (for example, if ntp.ubuntu.com were unavailable)?

We have been triaging this on our end, but would like some insight from the MAAS team.

[ 19.353487] cloud-init[1207]: Cloud-init v. 0.7.7 running 'init' at Thu, 11 Feb 2016 16:28:06 +0000. Up 19.03 seconds.
[ 19.368566] cloud-init[1207]: ci-info: ++++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++
[ 19.388533] cloud-init[1207]: ci-info: +------------+-------+------------------------------+-------------+-------+-------------------+
[ 19.408484] cloud-init[1207]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 19.428484] cloud-init[1207]: ci-info: +------------+-------+------------------------------+-------------+-------+-------------------+
[ 19.444480] cloud-init[1207]: ci-info: | enP2p1s0f1 | False | . | . | . | 40:8d:5c:ba:b9:10 |
[ 19.464490] cloud-init[1207]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . | . |
[ 19.484480] cloud-init[1207]: ci-info: | lo | True | ::1/128 | . | host | . |
[ 19.500475] cloud-init[1207]: ci-info: | enP2p1s0f3 | False | . | . | . | 40:8d:5c:ba:b9:12 |
[ 19.520503] cloud-init[1207]: ci-info: | enP2p1s0f2 | True | 10.246.48.3 | 255.255.0.0 | . | 40:8d:5c:ba:b9:11 |
[ 19.536497] cloud-init[1207]: ci-info: | enP2p1s0f2 | True | fe80::428d:5cff:feba:b911/64 | . | link | 40:8d:5c:ba:b9:11 |
[ 19.556478] cloud-init[1207]: ci-info: +------------+-------+------------------------------+-------------+-------+-------------------+
[ 19.576514] cloud-init[1207]: ci-info: ++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
[ 19.592494] cloud-init[1207]: ci-info: +-------+-------------+-------------+-------------+------------+-------+
[ 19.608504] cloud-init[1207]: ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
[ 19.624535] cloud-init[1207]: ci-info: +-------+-------------+-------------+-------------+------------+-------+
[ 19.640528] cloud-init[1207]: ci-info: | 0 | 0.0.0.0 | 10.246.48.1 | 0.0.0.0 | enP2p1s0f2 | UG |
[ 19.656495] cloud-init[1207]: ci-info: | 1 | 10.246.0.0 | 0.0.0.0 | 255.255.0.0 | enP2p1s0f2 | U |
[ 19.672513] cloud-init[1207]: ci-info: +-------+-------------+-------------+-------------+------------+-------+
[ 19.688941] cloud-init[1207]: 2016-02-11 16:28:07,035 - url_helper.py[WARNING]: Setting oauth clockskew for 10.246.48.112 to 7938974
[ 19.704762] cloud-init[1207]: 2016-02-11 16:28:07,036 - handlers.py[WARNING]: failed posting event: start: init-network/check-cache: attempting to read from cache [trust]
[ OK ] Started Seed the pseudo random number generator on first boot.
[ OK ] Started System Logging Service.
[ 30.077613] cloud-init[1207]: 2016-02-11 16:28:17,758 - cc_resizefs.py[WARNING]: Device 'overlayroot' did not exist. cannot resize: dev=overlayroot mnt_point=/ path=/
         Stopping System Logging Service...
[ OK ] Stopped System Logging Service.
         Starting System Logging Service...
[ OK ] Started System Logging Service.
[ 45.015651] cloud-init[1207]: Generating public/private rsa key pair.
[ 45.024537] cloud-init[1207]: Your identification has been saved in /etc/ssh/ssh_host_rsa_key.
[ 45.036528] cloud-init[1207]: Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub.
[ 45.048531] cloud-init[1207]: The key fingerprint is:
[ 45.056494] cloud-init[1207]: SHA256:I+KDEsp0Fe54XuAYVaP2sqXyZqt6CM++/tgX7NQat18 root@cvm15
[ 45.068940] cloud-init[1207]: The key's randomart image is:
[ 45.076665] cloud-init[1207]: +---[RSA 2048]----+
[ 45.088500] cloud-init[1207]: | o.o |
[ 45.096499] cloud-init[1207]: | o o . |
[ 45.104568] cloud-init[1207]: | . * |
[ 45.112575] cloud-init[1207]: | O o |
[ 45.120535] cloud-init[1207]: |.. +.*.=S |
[ 45.128452] cloud-init[1207]: |=..oo.@.o. |
[ OK ] Started Initial cloud-init job (metadata service crawler).
[ 45.136445] cloud-init[1207]: |++o.o* = . E |
[ OK ] Reached target Network is Online.
[ 45.156253] cloud-init[1207]: | oo+o++ . . |
[ 45.172228] cloud-init[1207]: | +B+*+. .. |
[ 45.180217] cloud-init[1207]: +----[SHA256]-----+
         Starting /etc/rc.local Compatibility...
[ 45.188245] cloud-init[1207]: Generating public/private dsa key pair.
         Starting iSCSI initiator daemon (iscsid)...
[ 45.208228] cloud-init[1207]: Your identification has been saved in /etc/ssh/ssh_host_dsa_key.
         Starting OpenBSD Secure Shell server...
[ 45.228241] cloud-init[1207]: [Your public key has been saved in /etc/ssh/ssh_host_dsa_key.pub. OK
] Reached target Cloud-config availability.
[ 45.240327] cloud-init[1207]: The key fingerprint is:
[ 45.272221] cloud-init[1207]: SHA256:WE+uOqGNjKmwyhiQEzYaSEZ/0aRSyPTkrpjyPQApBpE root@cvm15
[ 45.284217] cloud-init[1207]: The key's randomart image is:
[ 45.292209] cloud-init[1207]: +---[DSA 1024]----+
         Starting Apply the settings specified in cloud-config...
[ 45.300245] cloud-init[1207]: |o*o..+o. |
[ OK ] Started /etc/rc.local Compatibility.
[ 45.320227] cloud-init[1207]: |E .o= o. |
[ 45.336316] cloud-init[1207]: |=+ o = . . |
[ OK ] Started iSCSI initiator daemon (iscsid).
[ 45.344249] cloud-init[1207]: |*=. + o + |
[ 45.360228] cloud-init[1207]: |B. .. S o |
[ 45.368226] cloud-init[1207]: |..+ . . . |
[ 45.376226] cloud-init[1207]: |+o * + .. |
[ 45.384260] cloud-init[1207]: |=+o.= o. |
[ 45.392228] cloud-init[1207]: |*o. .... |
[ 45.400209] cloud-init[1207]: +----[SHA256]-----+
[ 45.408233] cloud-init[1207]: Generating public/private ecdsa key pair.
         Starting Login to default iSCSI targets...
[ 45.420209] cloud-init[1207]: Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key.
[ 45.440316] cloud-init[1207]: Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub.
[ 45.452238] cloud-init[1207]: The key fingerprint is:
[ 45.460232] cloud-init[1207]: SHA256:DddLJKqvTZHzKPJ65EN9aqYavYD/wd+VuJz1eqnbgoI root@cvm15
[ 45.472220] cloud-init[1207]: The key's randomart image is:
[ 45.480211] cloud-init[1207]: +---[ECDSA 256]---+
[ 45.488276] cloud-init[1207]: | . . |
[ 45.496207] cloud-init[1207]: | . + |
[ 45.504208] cloud-init[1207]: | o . o |
[ 45.512211] cloud-init[1207]: | . = . . |
[ 45.520203] cloud-init[1207]: | ..S . . |
[ 45.528233] cloud-init[1207]: | . oo..=.. . |
[ 45.536201] cloud-init[1207]: | . ++=.+oo.+ . |
[ 45.544224] cloud-init[1207]: | . =EO=o.=.oo |
[ 45.552204] cloud-init[1207]: | +=*=o.= +*o |
[ 45.560215] cloud-init[1207]: +----[SHA256]-----+
[ 45.568201] cloud-init[1207]: Generating public/private ed25519 key pair.
[ 45.580205] cloud-init[1207]: Your identification has been saved in /etc/ssh/ssh_host_ed25519_key.
[ 45.592205] cloud-init[1207]: Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub.
[ 45.604207] cloud-init[1207]: The key fingerprint is:
[ 45.612201] cloud-init[1207]: SHA256:E5SSzahN/DmxED4slp4ynrvH052zQXFFkok/MQ8r5yw root@cvm15
[ 45.624243] cloud-init[1207]: The key's randomart image is:
[ 45.632201] cloud-init[1207]: +--[ED25519 256]--+
[ 45.640211] cloud-init[1207]: | ..*.o.+o |
[ 45.648209] cloud-init[1207]: | +B.* *o |
[ 45.656205] cloud-init[1207]: | ++++o=.* |
[ 45.664206] cloud-init[1207]: | o.o..*+= . |
[ 45.672204] cloud-init[1207]: | o o S* . |
[ 45.680223] cloud-init[1207]: | . + .E.o |
[ 45.688208] cloud-init[1207]: | o. . ..o |
[ 45.696226] cloud-init[1207]: | .+ . +. |
[ 45.704225] cloud-init[1207]: | oo . .o |
[ 45.712233] cloud-init[1207]: +----[SHA256]-----+
[ OK ] Started OpenBSD Secure Shell server.
[ OK ] Started Login to default iSCSI targets.
[ OK ] Reached target Remote File Systems (Pre).
[ OK ] Reached target Remote File Systems.
         Starting LSB: daemon to balance interrupts for SMP systems...
         Starting Permit User Sessions...
         Starting LSB: automatic crash report generation...
         Starting LSB: Set the CPU Frequency Scaling governor to "ondemand"...
[ OK ] Started Permit User Sessions.
         Starting Terminate Plymouth Boot Screen...
         Starting Hold until boot process finishes up...
[ OK ] Started LSB: automatic crash report generation.
[ OK ] Started LSB: Set the CPU Frequency Scaling governor to "ondemand".
[ OK ] Started Terminate Plymouth Boot Screen.
[ OK ] Started Hold until boot process finishes up.
[ OK ] Started Serial Getty on ttyAMA0.
[ OK ] Started Getty on tty1.
[ OK ] Reached target Login Prompts.
[ OK ] Started LSB: daemon to balance interrupts for SMP systems.
[ 50.198320] cloud-init[1352]: Generating locales (this might take a while)...
[ 52.582921] cloud-init[1352]: en_US.UTF-8... done
[ 52.592665] cloud-init[1352]: Generation complete.

Ubuntu 16.04 LTS cvm15 ttyAMA0

cvm15 login: [ 131.235429] cloud-init[1352]: Cloud-init v. 0.7.7 running 'modules:config' at Thu, 11 Feb 2016 16:28:34 +0000. Up 46.61 seconds.
[ 135.455232] cloud-init[1459]: tar: ./curtin/deps/__init__.py: time stamp 2016-03-18 14:16:44 is 3102400.8621135 s in the future
[ 135.455801] cloud-init[1459]: tar: ./curtin/deps/check.py: time stamp 2016-03-18 14:16:44 is 3102400.86187222 s in the future
[ 135.456327] cloud-init[1459]: tar: ./curtin/deps/install.py: time stamp 2016-03-18 14:16:44 is 3102400.86172929 s in the future
[ 135.456824] cloud-init[1459]: tar: ./curtin/deps: time stamp 2016-05-12 12:42:12 is 7848728.86163933 s in the future
[ 135.457315] cloud-init[1459]: tar: ./curtin/__init__.py: time stamp 2016-03-18 14:16:44 is 3102400.86152015 s in the future
[ 135.457805] cloud-init[1459]: tar: ./curtin/udev.py: time stamp 2016-03-18 14:16:44 is 3102400.86139546 s in the future
[ 135.458353] cloud-init[1459]: tar: ./curtin/config.py: time stamp 2016-03-18 14:16:44 is 3102400.86125462 s in the future
[ 135.458847] cloud-init[1459]: tar: ./curtin/commands/block_meta.py: time stamp 2016-03-18 14:16:44 is 3102400.86097206 s in the future
[ 135.459338] cloud-init[1459]: tar: ./curtin/commands/apply_net.py: time stamp 2016-03-18 14:16:44 is 3102400.86084094 s in the future
[ 135.459835] cloud-init[1459]: tar: ./curtin/commands/extract.py: time stamp 2016-03-18 14:16:44 is 3102400.86071796 s in the future
[ 135.460352] cloud-init[1459]: tar: ./curtin/commands/__init__.py: time stamp 2016-03-18 14:16:44 is 3102400.8605888 s in the future
[ 135.460848] cloud-init[1459]: tar: ./curtin/commands/hook.py: time stamp 2016-03-18 14:16:44 is 3102400.86046616 s in the future
[ 135.461479] cloud-init[1459]: tar: ./curtin/commands/mkfs.py: time stamp 2016-03-18 14:16:44 is 3102400.8603415 s in the future
[ 135.461969] cloud-init[1459]: tar: ./curtin/commands/swap.py: time stamp 2016-03-18 14:16:44 is 3102400.86021895 s in the future
[ 135.462461] cloud-init[1459]: tar: ./curtin/commands/curthooks.py: time stamp 2016-03-18 14:16:44 is 3102400.86004385 s in the future
[ 135.462980] cloud-init[1459]: tar: ./curtin/commands/net_meta.py: time stamp 2016-03-18 14:16:44 is 3102400.85990484 s in the future
[ 135.463591] cloud-init[1459]: tar: ./curtin/commands/in_target.py: time stamp 2016-03-18 14:16:44 is 3102400.85977304 s in the future
[ 135.464143] cloud-init[1459]: tar: ./curtin/commands/system_upgrade.py: time stamp 2016-03-18 14:16:44 is 3102400.85964961 s in the future
[ 135.464718] cloud-init[1459]: tar: ./curtin/commands/install.py: time stamp 2016-03-18 14:16:44 is 3102400.85948571 s in the future
[ 135.465299] cloud-init[1459]: tar: ./curtin/commands/pack.py: time stamp 2016-03-18 14:16:44 is 3102400.85935982 s in the future
[ 135.465940] cloud-init[1459]: tar: ./curtin/commands/main.py: time stamp 2016-03-18 14:16:44 is 3102400.85921463 s in the future
[ 135.466576] cloud-init[1459]: tar: ./curtin/commands/system_install.py: time stamp 2016-03-18 14:16:44 is 3102400.85909071 s in the future
[ 135.467188] cloud-init[1459]: tar: ./curtin/commands: time stamp 2016-05-12 12:42:12 is 7848728.85900553 s in the future
[ 135.467857] cloud-init[1459]: tar: ./curtin/futil.py: time stamp 2016-03-18 14:16:44 is 3102400.85888899 s in the future
[ 135.468575] cloud-init[1459]: tar: ./curtin/log.py: time stamp 2016-03-18 14:16:44 is 3102400.85876914 s in the future
[ 135.469142] cloud-init[1459]: tar: ./curtin/swap.py: time stamp 2016-03-18 14:16:44 is 3102400.85864809 s in the future
[ 135.469725] cloud-init[1459]: tar: ./curtin/net/__init__.py: time stamp 2016-03-18 14:16:44 is 3102400.85846213 s in the future
[ 135.470302] cloud-init[1459]: tar: ./curtin/net/network_state.py: time stamp 2016-03-18 14:16:44 is 3102400.85830302 s in the future
[ 135.470891] cloud-init[1459]: tar: ./curtin/net: time stamp 2016-05-12 12:42:12 is 7848728.8582177 s in the future
[ 135.471392] cloud-init[1459]: tar: ./curtin/url_helper.py: time stamp 2016-03-18 14:16:44 is 3102400.85809278 s in the future

Revision history for this message
Sean Feole (sfeole) wrote :

Let's reference bug #1581515, which was marked as invalid.

summary: - Clock Skew Occurs during Enlistments and Deployments with maas 2.0 ,
- Xenial
+ Maas 2.0 Deployment Failing on arm64 , Xenial
Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: Maas 2.0 Deployment Failing on arm64 , Xenial

Hi Sean,

Again, this is not a problem in MAAS. So why would cloud-init report the time is in the future? What happens if you just don't try to ensure the clocks are the same? The error in the deployment above is telling me that the clocks are not synced.

Cloud-init only fixes the clock-skew so there are no authentication issues when sending information back to MAAS. Now, I don't see your installation log (you can grab it from the WebUI).

Also, MAAS doesn't do NTP as of now, but may do in the future.

I'm marking this as "invalid" for MAAS, and will open a task for curtin and cloud-init. However, please do attach your installation log.

Changed in maas:
status: New → Invalid
Revision history for this message
Sean Feole (sfeole) wrote :

In the future, please remember that from a customer's point of view this very much is a problem with MAAS. I don't understand how not ensuring the clocks are the same helps me in this situation.

For cloud-init and curtin, here is the log reported from the deployment: http://pastebin.ubuntu.com/16393095/

In this particular deployment the system successfully deployed, however in the logs we can still see the time stamp being a problem.

Revision history for this message
Scott Moser (smoser) wrote :

Nothing in a normal MAAS enlistment, commissioning or deployment will automatically set the hardware clock. My opinion is that MAAS "owns" the hardware, so anything setting the clock should be done under its direction. Cloud-init only has heuristic data on what the right time is.

There is work underway to support maas configuring ntp for the systems and that is the right path forward in all cases. We can add support for cloud-init to configure a system for ntp or even to one-time set the system clock via a reference ntp service, but doing so will require maas to *tell* it to do that.

cloud-init does not change the hardware or system clock at all. The messages you see are cloud-init setting the offset that it uses for a timestamp when doing oauth to MAAS. cloud-init tries to read an oauth resource, gets a 403, and then tries again with the same credentials but with a timestamp similar to that of the server. That allows cloud-init to continue on and have access to the user-data, but does not update the system clock. (There is an improvement we need to put in place here, described under bug 1531233; that issue would come up if systemd sets the clock during boot after cloud-init has read and set its offset.)

The dance above is necessary as oauth requires similar timestamps. Basically cloud-init says "well, whatever the server said is the right timestamp is the right timestamp, so go with that". That is cloud-init working around a broken system clock, but it does not do anything to fix the clock.
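The skew-compensation mechanism described above can be sketched roughly as follows. This is an illustrative class with made-up names, not cloud-init's actual code (the real logic lives in cloud-init's url_helper.py): record how far the server's clock is from ours when a request is rejected, then stamp subsequent oauth requests with server-relative time instead of trusting the local clock.

```python
import email.utils
import time

class OauthClock:
    """Hypothetical sketch of per-host oauth clock-skew handling.
    Nothing here touches the hardware or system clock."""

    def __init__(self):
        self.skew = {}  # host -> (server time - local time), in seconds

    def note_server_date(self, host, date_header):
        """Record the skew from the Date header of a rejected (401/403) reply."""
        server = email.utils.mktime_tz(email.utils.parsedate_tz(date_header))
        self.skew[host] = int(server - time.time())

    def oauth_timestamp(self, host):
        # Local time plus the recorded skew: close enough to the server's
        # clock that the oauth signature is accepted, even with a bad RTC.
        return int(time.time()) + self.skew.get(host, 0)
```

This is why the console shows "updated clock skew to ..." repeatedly: the offset is applied per request, and the clock itself is never corrected.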

The errors you see with tar are really just warnings. They would be fixed with a correct system clock, but they do not affect deployment. MAAS should probably direct the system clock: just set the system clock on every deployment.

There are really 2 classes of "bad clock" errors.
a.) system clock has no battery, and resets to Jan 1 1970 on reboot. Thus every boot is bad. I've only ever seen this on arm hardware, and possibly it only actually occurred on pre-release hardware that we got from vendors.

b.) system clock is set wrong but is generally working, possibly losing minutes or even hours in a year. Either the timezone is set to local time or the clock has never been set. Setting it in enlistment and/or commissioning would be sufficient.

cloud-init's oauth dance is kind of necessary for 'a', and would still be needed after NTP server support is in place, as getting information about the NTP server probably requires oauth.
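The "one-time set the system clock via a reference ntp service" idea mentioned above could be sketched with a minimal SNTP query. This is an assumption-laden illustration, not anything MAAS or cloud-init actually ships; the server name and the shell-out at the end are placeholders:

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01)
NTP_EPOCH_OFFSET = 2208988800

def parse_transmit_ts(packet):
    """Extract the Transmit Timestamp seconds field from a 48-byte
    SNTP response and convert it to Unix time."""
    secs = struct.unpack("!I", packet[40:44])[0]
    return secs - NTP_EPOCH_OFFSET

def sntp_unix_time(server, timeout=5.0):
    """Ask `server` for the current time via a minimal SNTPv3 exchange.
    Raises socket.timeout when the server is unreachable, which is the
    'no NTP access' case this bug is about."""
    request = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(request, (server, 123))
        response, _ = s.recvfrom(48)
    return parse_transmit_ts(response)

# A one-time clock set would then be roughly (illustrative only):
#   subprocess.check_call(
#       ["date", "-u", "-s", "@%d" % sntp_unix_time("ntp.ubuntu.com")])
```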

Revision history for this message
Sean Feole (sfeole) wrote :

Thanks for the feedback,

So, If I read this correctly,

1.) MAAS is planning on supporting NTP in the future. (And that is probably my only fix at this point?)
2.) Cloud-init setting the offset for Oauth requests is normal (this I understand)
3.) The tar errors are warnings, that should be fixed with a correct system clock.

What defines a correct system clock? arm64 EFI platforms have no RTC support, so it's the job of rtc-efi to pass the date/time set in UEFI to the OS; so far, from my tests and experiments, there have been no problems with this handshake on Xenial.

("MAAS should probably direct the system clock: just set the system clock on every deployment") -- Was this just a thought, or is this something we should file as a bug?

3a.) The tar warnings do not affect deployment; however, the warnings can take upwards of 5-15 minutes to print on the console, causing the deployment to 'timeout' and thus 'fail' in the MAAS UI. In addition to these warnings, the hardware is experiencing its own complex slew of bugs (which we are working on) that delay the deployment another 5-10 minutes or so. That is just enough to trigger the MAAS UI timeout and flag the node under deployment as "Failed Deployment". I'm not sure what the default wait time is for a deployment or whether it can be modified.

4.) I see the (a) version of bad clock errors quite often due to the nature of enablement work.
    (b) I see no evidence that rtc-efi is misbehaving here.

So what are my options here?
  I am still continuing to work this on my side. I appreciate the assistance thus far.

Revision history for this message
Sean Feole (sfeole) wrote :

Adding to my last comment,

3.) The tar warnings do not affect deployment, meaning yes, I can access the host after X minutes, once I notice that cloud-init has finished on the console. However, the MAAS UI reports the deployment as failed.

That is somewhat useless to me when it comes to using Juju and other tools, unless the timeout can be adjusted beyond the defaults.

Revision history for this message
dann frazier (dannf) wrote :

Given that the serial spew itself can slow down deployment, would it make sense to add --warning=no-timestamp to the tar extract command?

$ tar -xf noname.tar
tar: noname: time stamp 2016-05-17 12:55:38 is 7140.360712301 s in the future
$ tar --warning=no-timestamp -xf noname.tar
$

Revision history for this message
Sean Feole (sfeole) wrote :

Attached is the full deployment log from the host console.

Revision history for this message
Sean Feole (sfeole) wrote :

Regardless of whether the system clock is wrong or not, we should really suppress these tar messages. Here we go.

I've attached 3 files..

First, the MAAS console showing the two systems deploying (cvm13 and cvm15) at timestamp 13:11:

May 20 13:11:43 cvm12 maas.node: [INFO] cvm13: Status transition from ALLOCATED to DEPLOYING
May 20 13:11:43 cvm12 maas.power: [INFO] Changing power state (on) of node: cvm13 (4y3h7t)
May 20 13:11:43 cvm12 maas.node: [INFO] cvm15: Status transition from READY to ALLOCATED
May 20 13:11:43 cvm12 maas.node: [INFO] cvm15: allocated to user ubuntu
May 20 13:11:43 cvm12 maas.node: [INFO] cvm15: Status transition from ALLOCATED to DEPLOYING
May 20 13:11:43 cvm12 maas.power: [INFO] Changing power state (on) of node: cvm15 (4y3h7s)

Tar messages took 35 minutes to scroll over the console!

The systems did not restart to boot from disk until 13:46.

(35 minutes to initially deploy!)

==> rackd.log <==
2016-05-20 13:46:48-0400 [TFTP (UDP)] Datagram received from ('10.246.48.115', 1051): <RRQDatagram(filename=b'grubaa64.efi', mode=b'octet', options=OrderedDict([(b'tsize', b'0'), (b'blksize', b'2016')]))>
2016-05-20 13:46:48-0400 [-] RemoteOriginReadSession starting on 60862

The deployment finished for cvm13 at 13:48.

(37 Minutes)

==> maas.log <==
May 20 13:48:22 cvm12 maas.node: [INFO] cvm13: Status transition from DEPLOYING to DEPLOYED

cvm15, on the other hand, as seen in the logs, encountered a system problem (soft lockup) due to an enablement bug we are working on, which delayed the deployment another 10+ minutes. The NMI recovered and the deployment continued to completion.

As a result, MAAS fails the deployment due to the timeout:

(+40 Minutes)

==> maas.log <==
May 20 13:52:40 cvm12 maas.node: [ERROR] cvm15: Marking node failed: None
May 20 13:52:40 cvm12 maas.node: [INFO] cvm15: Status transition from DEPLOYING to FAILED_DEPLOYMENT

However, as seen in the console logs for cvm15, the deployment does finish and we can access the host as expected.

In a real-world scenario, I think we can all agree that system clocks are not always going to be correct. How curtin handles this should be more system-friendly.

I'm proposing
A: Curtin to suppress the tar warning using the flag from Comment #7. (I was going to try this @

Revision history for this message
Sean Feole (sfeole) wrote :

Attached are the log files for comment #9.

Revision history for this message
Sean Feole (sfeole) wrote :
Revision history for this message
Sean Feole (sfeole) wrote :
Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1581553] Re: Maas 2.0 Deployment Failing on arm64 , Xenial

On Fri, May 20, 2016 at 1:04 PM, Sean Feole <email address hidden>
wrote:

> Regardless if the system clock is wrong or not, we should really
> suppress these tar messages. Here we go.
>

If the clock is set properly, no such messages from tar exist. The message is a clear warning that *something*, namely the clock, is not correct. Hiding these messages would paper over the underlying issue of not getting the clock set correctly.

Hiding the messages as a debug option? Sure, but suppressing the warning from tar is something we should not do by default.

>
> In a real world scenario, I think we can all come to the agreement that
> system clocks are not always going to be correct. How Curtin handles
> this should be more or less system friendly.
>
> I'm proposing
> A: Curtin to surpress the tar warning using the flag from Comment #7. (I
> was going to try this @
>

If the clock is wrong, how can we trust other things that happen on the host? It's certainly possible that further errors occur at higher levels that will be harder to debug: "Well, the node deployed just fine with no errors in MAAS."

Revision history for this message
dann frazier (dannf) wrote : Re: Maas 2.0 Deployment Failing on arm64 , Xenial

Is it a MAAS requirement that the hwclock be correct? Requiring that seems like it will only lead to silent failures when e.g. batteries die. MAAS has code to deal with this in other areas - e.g. cert checking. If we don't trust hosts with bad hwclocks, wouldn't it be better to just abort with a clear message about this requirement? (Just playing devil's advocate there.)

I do agree that hiding warnings isn't ideal - but do we really need a warning for *every file*? That is what is causing us to bottleneck on the console output and exceed the deployment timeout. Perhaps we could pipe output through a warn-once filter or something.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Dann,

No. It is not a requirement for MAAS to have the hwclock in sync. cloud-init handles the situation by fixing the clock skew for oauth, but again, that doesn't require the hwclocks to be in sync. What's been happening in the field is that customers set up their own internal NTP server; otherwise Ubuntu should just automatically reach one if it is facing the internet.

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1581553] Re: Maas 2.0 Deployment Failing on arm64 , Xenial

On Mon, May 23, 2016 at 1:42 PM, dann frazier <email address hidden>
wrote:

> Is it a MAAS requirement that the hwclock be correct? Requiring that
> seems like it will only lead to silent failures when e.g. batteries die.
> MAAS has code to deal with this in other areas - e.g. cert checking. If
> we don't trust hosts with bad hwclocks, wouldn't it be better to just
> abort with a clear message about this requirement? (Just devil's
> advocating there).
>

I would certainly agree with this. That probably ignores the fact that some folks are willing to take the risk given hardware that partly works versus none. This sorta feels like a 'taints support' flag, such that in MAAS the machine is tagged as unstable-clock and thus bugs on it are to be recreated on hardware without such issues.

>
> I do agree that hiding warnings isn't ideal - but do we really need a
> warning for *every file*? That is what is causing us to bottleneck on
> the console output and surpass the deployment timeout. Perhaps we could
> pipe output to a warn-once filter or something.
>

I think it'd be simpler to expose the "silence warning" sort of config option that MAAS could pass to curtin, as long as MAAS itself is tracking which nodes are using this warning suppression (i.e. tagged as unstable; support might require reproducing on more stable devices).

>
> --
> You received this bug notification because you are subscribed to curtin.
> Matching subscriptions: curtin-bugs-all
> https://bugs.launchpad.net/bugs/1581553
>
> Title:
> Maas 2.0 Deployment Failing on arm64 , Xenial
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/cloud-init/+bug/1581553/+subscriptions
>

Revision history for this message
Sean Feole (sfeole) wrote :

On Mon, May 23, 2016 at 4:03 PM, Ryan Harper <email address hidden>
wrote:

>
>
> I think it'd be simpler to expose the "silence warning" sort of config
> option that maas could pass to curtin as long as MAAS itself is tracking
> which nodes
> are using this warning suppression (ie, tagged as unstable, support might
> require reproducing on more stable devices).
>

Some good points brought up here. Any idea whose feathers we could ruffle to possibly get something rolling? I would be more than happy to offer testing of this feature.

Revision history for this message
dann frazier (dannf) wrote : Re: Maas 2.0 Deployment Failing on arm64 , Xenial

(Been out on vacation, sorry for the delayed response)

From Comment #15, it sounds like it is a bug if a deployment fails because of a bad hwclock. In those cases, is NTP server access a requirement? (IIRC, feedback from the field is that MAAS works fine without NTP, though clock sync is needed for Juju.)

Assuming that out-of-sync clock configs (bad hwclock + no NTP access) are supported, I'd prefer not to require users to select magic options (e.g. a debug mode) for MAAS to work. One option here is to increase the default timeout, giving the deploy enough time to write all the tar timestamp warnings to the console. From Sean's output, this looks like ~35 minutes for his setup.

However, the user would still have to suffer the extra (~25m?) delay and, further, I don't think any warnings after the first one add useful information. So another option would be to run tar's stderr through a filter like the one attached.

$ tar xf ../foobar.tar
tar: foo: time stamp 2016-06-01 23:16:02 is 45189.384620221 s in the future
tar: bar: time stamp 2016-06-01 23:16:02 is 45189.384196257 s in the future
$ tar xf ../foobar.tar 2>&1 | /tmp/warn-once
tar: foo: time stamp 2016-06-01 23:16:02 is 45168.679087452 s in the future
/tmp/warn-once: Suppressing further timestamp warnings to avoid console spew
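The attachment itself isn't reproduced here, but the warn-once idea can be sketched as a small Python filter (a hypothetical script; the attached version may differ): pass everything through unchanged, except that only the first tar timestamp warning survives.

```python
import re
import sys

# Matches tar's per-file "time stamp ... in the future" warning lines
TS_WARN = re.compile(r"^tar: .+: time stamp .+ in the future$")

def filter_lines(lines):
    """Yield input lines unchanged, except that only the first tar
    timestamp warning is kept; later ones are dropped to keep the
    serial console quiet."""
    warned = False
    for line in lines:
        if TS_WARN.match(line.rstrip("\n")):
            if warned:
                continue
            warned = True
            yield line
            yield "warn-once: suppressing further timestamp warnings to avoid console spew\n"
        else:
            yield line

def main():
    # Wired up as the script entry point, e.g.:
    #   tar xf image.tar 2>&1 | warn-once
    sys.stdout.writelines(filter_lines(sys.stdin))
```

Since the warnings go to tar's stderr, curtin would need to redirect stderr into the filter (`2>&1 |` as in the transcript above) for this to take effect.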

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1581553] Re: Maas 2.0 Deployment Failing on arm64 , Xenial

On Wed, Jun 1, 2016 at 11:56 AM, dann frazier <email address hidden>
wrote:

> (Been out on vacation, sorry for the delayed response)
>
> From Comment #15, it sounds like it is a bug if a deployment fails
> because of a bad hwclock. In those cases, is NTP server access a
> requirement? (IIRC, feedback from the field is that MAAS works fine w/o
> NTP - though clock sync is needed for Juju).
>
> Assuming that out-of-sync clock configs (bad hwclock + no NTP access)
> are supported, I'd prefer not to require users to have to select magic
>

I think this is the _real_ debate. Just because it works doesn't mean it should be supported.

If MAAS agrees that such platforms should be supported, and that includes dealing with issues beyond deployment (such as the tar error) that may occur due to the lack of a stable clock, then I would expect we'll have a more formal plan to determine how best to stabilize the clock prior to installation of the tarball, and we'll avoid the whole mess in the first place.

> options (e.g. a debug mode) for MAAS to work. One option here is to
> increase the default timeout, giving the deploy enough time to write all
> the tar timestamp warnings to the console. From Sean's output, this
> looks like ~35 minutes for his setup.
>

I don't expect that a hidden/debug option is reasonable for users at all. That's purely a development option that'd help with working on such devices while MAAS investigates how to stabilize/fix issues with the clock.

Another alternative in the meanwhile would be to package up a modified version of curtin with the filter into a PPA and use that when working with these platforms until they're stable.

Ryan

Revision history for this message
dann frazier (dannf) wrote :

On Wed, Jun 1, 2016 at 2:46 PM, Ryan Harper <email address hidden> wrote:
> On Wed, Jun 1, 2016 at 11:56 AM, dann frazier <email address hidden>
> wrote:
>
>> (Been out on vacation, sorry for the delayed response)
>>
>> From Comment #15, it sounds like it is a bug if a deployment fails
>> because of a bad hwclock. In those cases, is NTP server access a
>> requirement? (IIRC, feedback from the field is that MAAS works fine w/o
>> NTP - though clock sync is needed for Juju).
>>
>> Assuming that out-of-sync clock configs (bad hwclock + no NTP access)
>> are supported, I'd prefer not to require users to have to select magic
>>
>
> I think this is the _real_ debate. Just because it works doesn't mean it
> should be supported.

Ryan: Agreed - that's where I'm seeking clarity.

> If MAAS agrees that such platforms should be supported, and that includes
> determining what issues beyond deployment (such as the tar error) may
> occur due to the lack of a stable clock, then I would expect we'll have a
> more formal plan for how best to stabilize the clock prior to
> installation of the tarball, and we'll avoid the whole mess in the first place.
>
>
>> options (e.g. a debug mode) for MAAS to work. One option here is to
>> increase the default timeout, giving the deploy enough time to write all
>> the tar timestamp warnings to the console. From Sean's output, this
>> looks like ~35 minutes for his setup.
>>
>
> I don't expect that a hidden/debug option is reasonable for users at all.
> That's purely a development option that'd help when working with such devices
> while MAAS investigates how to stabilize/fix the clock issues.
>
> Another alternative in the meantime would be to package up a modified
> version of curtin with the filter in a PPA, and use that when working with
> these platforms until they're stable.

I don't see how platform stability comes into play here. The issue is
simply that any hardware clock may not be valid, and NTP may not be
accessible. It isn't specific to this platform or architecture.

For our testing purposes, we can hack around this by setting an accurate
hwclock or obtaining NTP access - but my goal here is to prevent users
from hitting this and not knowing how to proceed.

I'd prefer to either resolve this by dealing with this use case, or by
clearly documenting that it is not supported, and that you must have
either valid hwclocks or access to an NTP server.

[ We do have a blurb in the docs about how bad hwclocks may cause oauth to fail:
  http://maas.ubuntu.com/docs/troubleshooting.html?highlight=clock#nodes-hang-on-commissioning
but that's a different problem with different symptoms that appears to
be resolved. ]
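For reference, the skew figure in the enlistment log ("updated clock skew to 7853431", roughly 91 days) is just the difference between the server's notion of time, taken from the Date header of the failed HTTP request, and the node's local clock. A rough illustration follows; the helper name is made up, and GNU date(1) is assumed for `-d` parsing:

```shell
# skew_seconds SERVER_DATE NODE_DATE: how many seconds the server's
# clock is ahead of the node's, given two RFC 2822 date strings.
# Hypothetical illustration only; assumes GNU date(1).
skew_seconds() {
    echo $(( $(date -ud "$1" +%s) - $(date -ud "$2" +%s) ))
}

# A node whose clock reads two hours behind the server:
skew_seconds "Wed, 01 Jun 2016 12:00:00 GMT" "Wed, 01 Jun 2016 10:00:00 GMT"  # prints 7200
```

Once the skew is known, signed OAuth timestamps can be adjusted by that offset, which is how the 401s above eventually turn into "Success".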

Revision history for this message
Ryan Harper (raharper) wrote :

On Wed, Jun 1, 2016 at 4:39 PM, dann frazier <email address hidden>
wrote:

> On Wed, Jun 1, 2016 at 2:46 PM, Ryan Harper <email address hidden>
> wrote:
> > On Wed, Jun 1, 2016 at 11:56 AM, dann frazier <
> <email address hidden>>
> > wrote:
> >
> >> (Been out on vacation, sorry for the delayed response)
> >>
> >> From Comment #15, it sounds like it is a bug if a deployment fails
> >> because of a bad hwclock. In those cases, is NTP server access a
> >> requirement? (IIRC, feedback from the field is that MAAS works fine w/o
> >> NTP - though clock sync is needed for Juju).
> >>
> >> Assuming that out-of-sync clock configs (bad hwclock + no NTP access)
> >> are supported, I'd prefer not to require users to have to select magic
> >>
> >
> > I think this is the _real_ debate. Just because it works doesn't mean it
> > should be supported.
>
> Ryan: Agreed - that's where I'm seeking clarity.
>

Cool

>
> > If MAAS agrees that such platforms should be supported, and that includes
> > determining what issues beyond deployment (such as the tar error) may
> > occur due to the lack of a stable clock, then I would expect we'll have a
> > more formal plan for how best to stabilize the clock prior to
> > installation of the tarball, and we'll avoid the whole mess in the first place.
> >
> >
> >> options (e.g. a debug mode) for MAAS to work. One option here is to
> >> increase the default timeout, giving the deploy enough time to write all
> >> the tar timestamp warnings to the console. From Sean's output, this
> >> looks like ~35 minutes for his setup.
> >>
> >
> > I don't expect that a hidden/debug option is reasonable for users at all.
> > That's purely a development option that'd help when working with such devices
> > while MAAS investigates how to stabilize/fix the clock issues.
> >
> > Another alternative in the meantime would be to package up a modified
> > version of curtin with the filter in a PPA, and use that when working
> > with these platforms until they're stable.
>
> I don't see how platform stability comes into play here. The issue is
> simply that any hardware clock may not be valid, and NTP may not be
> accessible. It isn't specific to this platform or architecture.
>

IMO, having a valid hwclock is part and parcel of having a stable platform.
If the platform had a valid hwclock we wouldn't have this thread at all;
one would not see the tar warnings and the slow deployment.

>
> For our testing purposes, we can hack around this by setting an accurate
> hwclock or obtaining NTP access - but my goal here is to prevent users
> from hitting this and not knowing how to proceed.
>

> I'd prefer to either resolve this by dealing with this use case, or by
> clearly documenting
> that it is not supported, and that you must have either valid hwclocks
> or access to an
> NTP server.
>

Full agreement here. IMO, without a clear path to handle these situations,
we need documentation stating that working hwclocks across reboots and
access to NTP are requirements, plus a section in Troubleshooting or the
like that specifically references the tar warning as a common indicator
of missing requirements (st...


dann frazier (dannf)
summary: - Maas 2.0 Deployment Failing on arm64 , Xenial
+ MAAS fails w/ bad hwclock & no NTP access