replica set EMPTYCONFIG MAAS bootstrap

Bug #1412621 reported by Paul Gear on 2015-01-20
This bug affects 23 people
Affects          Importance  Assigned to       Milestone
juju-core        High        Andrew McDermott
juju-core 1.24   High        Unassigned
juju-core 1.25   High        Andrew McDermott

Bug Description

When attempting to bootstrap juju during a new stack deploy, it failed with the following error message:

2015-01-19 06:16:39 WARNING juju.replicaset replicaset.go:87 Initiate: fetching replication status failed: cannot get replica set status: can't get local.system.replset config from self or any seed (EMPTYCONFIG)

Complete log of the deploy attempt is attached.

When the juju bootstrap was attempted manually using the same command, it succeeded; this suggests there might be a race condition in the bootstrap process.

This may be a duplicate of https://bugs.launchpad.net/juju-core/+bug/1384549, but the symptoms are different.

Paul Gear (paulgear) wrote :
Curtis Hovey (sinzui) on 2015-01-20
tags: added: mongodb
tags: added: maas-provider
Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
Curtis Hovey (sinzui) on 2015-01-29
tags: added: bootstrap
summary: - Possible race condition in MAAS bootstrap
+ replica set EMPTYCONFIG MAAS bootstrap
Changed in juju-core:
importance: Medium → High
milestone: none → 1.24-alpha1
Nate Finch (natefinch) wrote :

You said: "When the juju bootstrap was attempted manually using the same command, it succeeded; this suggests there might be a race condition in the bootstrap process."

manually vs. what?

mahmoh (mahmoh) wrote :

This symptom was seen when juju bootstrap consistently failed because the server cannot resolve its own hostname; we hit this in an environment with a challenging DNS setup. The fix was to specify the local domain in MAAS, replacing the default "maas" with my.local.domain.com or whatever your domain is. You could fix this temporarily by adding "127.0.1.1 <my-hostname>" to /etc/hosts and see if that fixes this particular problem.
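
A minimal sketch (Python, not the actual juju Go code) of the self-resolution check this workaround addresses; the /etc/hosts entry applies when a node's own hostname fails to resolve:

```python
# Check whether this host can resolve its own hostname -- the failure
# mode described in this comment. Purely illustrative.
import socket

def resolves(name):
    """Return True if `name` resolves via the system resolver."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

if not resolves(socket.gethostname()):
    print("cannot resolve own hostname; consider adding "
          "'127.0.1.1 <my-hostname>' to /etc/hosts")
```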

These commands helped debug the juju pieces:

juju bootstrap --debug --keep-broken

/var/lib/juju/tools/1\.2\.3-precise-amd64/jujud bootstrap-state --data-dir '/var/lib/juju' --env-config '[^']*' --instance-id 'i-bootstrap' --constraints 'mem=2048M' --debug

mahmoh (mahmoh) wrote :

It would be nice if there were a better error message here, like "Error: unable to resolve hostname ...".

Michael Partridge (m-part) wrote :

I have seen this manifest as M.Morana mentioned when there is a DNS issue routing the master MAAS node. Another thing I have noticed is that sometimes I have to toggle the PXE interface to "unmanaged" and then back to "DHCP and DNS"; this causes DNS to start working again. I have tried using "sudo dpkg-reconfigure maas-dns", but this never seems to resolve the issue and restart bind9 correctly.

Curtis Hovey (sinzui) on 2015-04-27
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Nate Finch (natefinch) wrote :

I agree that the error message could be better, but unfortunately it's the error message produced by Mongo, not by Juju, so we have little way to fix it. There are many reasons why this could fail - the name could be correct, but the network could be having problems, etc. I don't think there's really much we can do for this one.

Understood. How about propagating up an error that explained that mongo init failed plus any other interesting details about it? Or what about checking if we can resolve ourselves (hostname) as a prerequisite? I understand we expect a sane setup, maybe this is a check that MAAS could do instead. Just suggestions, whatever you think is best, thank you.

It would be useful if someone who can reproduce this issue could attach MongoDB's logs for the period when this error is seen. They're usually in /var/log/syslog.

Also note that the fixes for bug 1441913 (just landed) make improvements to Juju's logging around replicaset initialisation. This might help a user debug this issue when it occurs.

Charles Butler (lazypower) wrote :

While my problem isn't related to maas - i'm running into this when bootstrapping with the manual provider in AWS under a VPC.

http://paste.ubuntu.com/11098986/

Charles Butler (lazypower) wrote :

It appears to me that the public IP address of the unit was placed in the replset config, which expects the port to be exposed in the security group. I'm not sure why I thought it would have inserted the private address on eth0 - but it makes sense now that I have identified it was a connectivity issue.

To add to this - a workaround was to open port 37017 in the security group for the machine. I don't like the idea that it's traversing all the way out to the network bridge just to come back and connect to itself, but there is a clear path forward here to make it work in this particular instance.
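
A hedged sketch of the connectivity check this workaround implies: confirm the address mongo wrote into the replset config is actually reachable on the mongo port (37017 in this thread). Illustrative only, not juju code:

```python
# Probe a TCP host:port, the way one might sanity-check the address
# placed in the replica set config before blaming mongo itself.
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```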

Curtis Hovey (sinzui) wrote :

Several race and networking conditions are fixed in 1.24. This issue might be fixed with the release of 1.24.0.

Changed in juju-core:
status: Triaged → Incomplete
Trent Lloyd (lathiat) wrote :

I was consistently having this issue on 1.23, moving to 1.24 (devel) fixed it.
I also seemed to have maas not updating DNS as expected.

Curtis Hovey (sinzui) on 2015-06-22
Changed in juju-core:
milestone: 1.25.0 → 1.24.1
Curtis Hovey (sinzui) on 2015-06-23
Changed in juju-core:
milestone: 1.24.1 → 1.25.0
Curtis Hovey (sinzui) on 2015-06-25
Changed in juju-core:
milestone: 1.25.0 → 1.25.1
hallblazzar (pigpigmanbill) wrote :

I've encountered the bug in 1.24.0.

( 2015-07-07 09:08:28 WARNING juju.replicaset replicaset.go:98 Initiate: fetching replication status failed: cannot get replica set status: can't get local.system.replset config from self or any seed (EMPTYCONFIG) )

But adding "127.0.1.1 <my-hostname>" to /etc/hosts didn't work for me.

What can I do?

Jason Hobbs (jason-hobbs) wrote :

We've hit this a couple of times on 1.24.4. Output of this failing for bootstrap:

https://pastebin.canonical.com/137721/

Changed in juju-core:
status: Incomplete → New
tags: added: oil
Matt Rae (mattrae) wrote :

I'm seeing this error multiple times during bootstrap using juju 1.24.4 and MAAS 1.8.1. It appears that during bootstrap the host is not always added to /etc/bind/maas/zone.maas. I've found that clicking save in the cluster controller config in the MAAS webui will cause the zone file to be populated with the host.

If I save the cluster controller after the node has started but before mongodb is installed, the dns record will be populated and bootstrap will complete successfully.
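
An illustrative polling helper (hostnames and timings are examples, not part of juju) for watching whether the node's DNS record appears during bootstrap, matching the observation above that the zone is sometimes not populated in time:

```python
# Poll the resolver for a node's record, mirroring the race described
# here: bootstrap succeeds only if the record appears before mongo's
# replica set initiation needs it.
import socket
import time

def wait_for_record(name, attempts=30, delay=2.0):
    """Poll until `name` resolves; return the address, or None
    if it never appears within the attempt budget."""
    for _ in range(attempts):
        try:
            return socket.gethostbyname(name)
        except socket.gaierror:
            time.sleep(delay)
    return None
```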

Matt Rae (mattrae) wrote :

I tested setting the default domain name to something other than 'maas', but bootstrap still fails without saving the cluster controller to update the zone file.

It appears that by the time bootstrap is finished, the zone is missing the record for the host.

Zone prior to bootstrap:

root@maas:~# cat /etc/bind/maas/zone.matt
; Zone file modified: 2015-08-18 05:58:44.763249.
; Note that the modification time of this file doesn't reflect
; the actual modification time. MAAS controls the modification time
; of this file to be able to force the zone to be reloaded by BIND.
$TTL 300
@ IN SOA matt. nobody.example.com. (
              0000000056 ; serial
              600 ; Refresh
              1800 ; Retry
              604800 ; Expire
              300 ; TTL
              )

    IN NS matt.
$GENERATE 11-200 10-20-0-$ IN A 10.20.0.$
matt. IN A 10.20.0.10
limping-crow IN A 10.20.0.11

Zone right after bootstrap fails

cat /etc/bind/maas/zone.matt
; Zone file modified: 2015-08-18 06:16:38.099978.
; Note that the modification time of this file doesn't reflect
; the actual modification time. MAAS controls the modification time
; of this file to be able to force the zone to be reloaded by BIND.
$TTL 300
@ IN SOA matt. nobody.example.com. (
              0000000057 ; serial
              600 ; Refresh
              1800 ; Retry
              604800 ; Expire
              300 ; TTL
              )

    IN NS matt.
$GENERATE 11-200 10-20-0-$ IN A 10.20.0.$
matt. IN A 10.20.0.10
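
A small sketch comparing the two zone dumps above: collect the A records from each and report names that disappeared. The inputs below are abbreviated copies of the listings in this comment:

```python
# Extract simple 'NAME IN A ADDR' records and diff the two zones,
# showing the node record lost after bootstrap.
def a_records(zone_text):
    """Map owner name -> address for simple 'NAME IN A ADDR' lines."""
    recs = {}
    for line in zone_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[1:3] == ["IN", "A"]:
            recs[parts[0]] = parts[3]
    return recs

before = a_records("matt. IN A 10.20.0.10\nlimping-crow IN A 10.20.0.11")
after = a_records("matt. IN A 10.20.0.10")
missing = sorted(set(before) - set(after))
print(missing)  # -> ['limping-crow']
```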

Matt Rae (mattrae) on 2015-08-18
tags: added: cpec
Curtis Hovey (sinzui) on 2015-08-18
Changed in juju-core:
status: New → Triaged
Mark Ramm (mark-ramm) on 2015-08-20
Changed in juju-core:
importance: High → Critical
Curtis Hovey (sinzui) on 2015-08-20
Changed in juju-core:
status: Triaged → Incomplete
Sean Feole (sfeole) wrote :

fwiw, i thought I would add my findings as well in regards to this particular issue,

I just ran into this while doing various tests and bring-up on new hardware. I'm in a private VLAN with my own DNS/DHCP server; everything is on arm64.

JuJu-core 1.24.5
MaaS 1.8 (recently enabled)

I had found a small mistake that caused DHCP to hand out the remote DNS server first instead of using my local DNS server as the primary. This caused the issue to manifest during juju bootstrap attempts, since the local client could not be resolved.

Since fixing that DHCP issue I have not seen this problem. So if the end user is maintaining their own network, the sanity of that environment can very much come into play here.
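
A sketch of a resolver-order check matching the misconfiguration described here; most resolvers try the first nameserver in /etc/resolv.conf first, so the local (MAAS) server should come first. The sample text below is an example, not a real config:

```python
# Parse nameserver entries in resolv.conf order, so a misordered
# DHCP-provided resolver list can be spotted.
def nameservers(resolv_conf_text):
    """Extract nameserver addresses in the order listed."""
    out = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            out.append(parts[1])
    return out

# Example text only; the real file is /etc/resolv.conf.
conf = "nameserver 10.20.0.10\nnameserver 8.8.8.8\n"
print(nameservers(conf))  # -> ['10.20.0.10', '8.8.8.8']
```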

Dimiter Naydenov (dimitern) wrote :

According to the last few comments, this looks like more of a MAAS DNS/DHCP issue than a juju-core issue.

Looks like MAAS doesn't update the DNS records for the node correctly (or fails midway through).

Curtis Hovey (sinzui) wrote :

I get this same error when I attempt to manually bootstrap to a t2.medium in ec2. The instance type is unusable, and might be helpful in finding a fix for this bug.

Curtis Hovey (sinzui) on 2015-09-29
Changed in juju-core:
importance: Critical → High
Changed in juju-core:
milestone: 1.25.1 → 1.26-alpha1
Changed in juju-core:
status: Incomplete → Triaged
tags: added: adoption charmers cpp
Cheryl Jennings (cherylj) wrote :

This problem seems to show up the first time a particular node in a MAAS is selected for the bootstrap node. I hit this consistently with the following steps:

1 - juju bootstrap (chosen MAAS node is node0, for example)
2 - bootstrap fails with the above EMPTYCONFIG error
3 - juju destroy-environment --force (won't work without --force)
4 - juju bootstrap (will succeed if chosen MAAS node is node0)

If I then destroy the environment and then force a different node to be chosen as the bootstrap node, I see the same error again.

Here is the log from the failed bootstrap: http://paste.ubuntu.com/12887244/

Absolutely correct
Thanks


Cheryl Jennings (cherylj) wrote :

BTW - for my scenario in comment #20, I was using:
MAAS 1.7.6
juju 1.24.6

Cheryl Jennings (cherylj) wrote :

Menno has asked for the mongo logs in /var/log/syslog. I will try to recreate tomorrow and get those logs. But, if anyone else who has encountered this can provide them, that would help in the debugging efforts.

Cheryl Jennings (cherylj) wrote :

Here is the output for /var/log/syslog in another recreate.

Cheryl Jennings (cherylj) wrote :

This is not the first time we've seen this issue. Bug 1340663 was also opened for this problem, but never quite resolved. It could be an underlying problem with MAAS. There is some analysis to look at in bug 1340663 as well.

tags: added: bug-squad
Cheryl Jennings (cherylj) wrote :

I've also left the environment up from this last recreate, in case any additional information is needed.

Changed in juju-core:
assignee: nobody → Andrew McDermott (frobware)
Andrew McDermott (frobware) wrote :

This is a speculative comment: it seems to happen when MAAS does not register the node's name in its DNS system. I've had 3 failures this morning, and on the last failure I noticed that I cannot 'dig maas-node3.maas' on either the node or the maas controller. And when you see the replica initiation start, like:

2015-11-02 14:04:04 INFO juju.replicaset replicaset.go:78 Initiating replicaset with config replicaset.Config{Name:"juju", Version:1, Members:[]replicaset.Member{replicaset.Member{Id:1, Address:"maas-node3.maas:37017", Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil), Priority:(*float64)(nil), Tags:map[string]string{"juju-machine-id":"0"}, SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)}}}
2015-11-02 14:04:04 WARNING juju.replicaset replicaset.go:98 Initiate: fetching replication status failed: cannot get replica set status: can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2015-11-02 14:04:04 WARNING juju.replicaset replicaset.go:98 Initiate: fetching replication status failed: cannot get replica set status: can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2015-11-02 14:04:05 WARNING juju.replicaset replicaset.go:98 Initiate: fetching replication status failed: cannot get replica set status: can't get local.system.replset config from self or any seed (EMPTYCONFIG)

it's not clear right now whether the failure is due to it not resolving 'maas-node3.maas'.

Still investigating.

Cheryl Jennings (cherylj) wrote :

I am running a virtual MAAS and I hit this issue again once I rebooted my host machine. I think there's some merit with your suspicion around this being DNS related. I see that in the failed cases, we get this message during bootstrap: "sudo: unable to resolve host node2"

$ juju bootstrap --upload-tools
Bootstrapping environment "maas"
Starting new instance for initial state server
Launching instance
WARNING no architecture was specified, acquiring an arbitrary node
 - /MAAS/api/1.0/nodes/node-bddd96e6-686a-11e5-8b12-525400c6c9e6/
Building tools to upload (1.25.0.1-trusty-amd64)
Installing Juju agent on bootstrap instance
Waiting for address
Attempting to connect to node2.maas:22
Attempting to connect to node2.maas:22
Attempting to connect to 192.168.100.103:22
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
12:a5:c2:22:44:98:f3:7e:bf:f3:33:22:7c:11:3b:c1.
Please contact your system administrator.
Add correct host key in /home/cherylj/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/cherylj/.ssh/known_hosts:306
  remove with: ssh-keygen -f "/home/cherylj/.ssh/known_hosts" -R 192.168.100.103
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
sudo: unable to resolve host node2
Logging to /var/log/cloud-init-output.log on remote host
Running apt-get update
Running apt-get upgrade

After doing a destroy-environment --force and re-bootstrapping, I do not see that particular error.

Andrew McDermott (frobware) wrote :

Given that this does not happen every time, I modified the MAAS code to never write the DNS zone entries when a node is deployed. Note: you need to run:

  sudo service bind9 restart && sudo service maas-dhcpd restart && sudo service maas-regiond restart

on the maas controller after changing the code.

With this change I see this bug every time, which confirms my suspicions from comment #27.

aim@0-maas-controller0:/usr/lib/python2.7/dist-packages$ git diff

diff --git a/maasserver/dns/config.py b/maasserver/dns/config.py
index c896fe4..78a705f 100644
--- a/maasserver/dns/config.py
+++ b/maasserver/dns/config.py
@@ -150,9 +150,9 @@ def dns_update_zones_now(clusters):

     serial = next_zone_serial()
     for zone in ZoneGenerator(clusters, serial):
-        maaslog.info("Generating new DNS zone file for %s", zone.zone_name)
-        bind_write_zones([zone])
-        bind_reload_zone(zone.zone_name)
+        maaslog.info("GENERATING NEW DNS ZONE FILE FOR %s", zone.zone_name)
+        # bind_write_zones([zone])
+        # bind_reload_zone(zone.zone_name)

 def dns_update_zones(clusters):

Andrew McDermott (frobware) wrote :

If you switch the MAAS setup to have a static IP range declared, then the deployed node always has an entry in /etc/bind/maas/zone.maas and is resolvable. In my setup, my MAAS did NOT have a static IP range and I ran into this problem frequently. Having declared a static IP range, I have not run into this problem since. The static IP range can be set up in the cluster controller's interfaces section (see attached screenshot). Once you have set up a static range you will see permanent node entries in /etc/bind/maas/zone.maas.

Andrew McDermott (frobware) wrote :

In the MAAS provider the code prefers DNS names. The attached patch dispenses with preferring a name, and I can confirm that if the DNS entry is not updated in MAAS then bootstrap works OK, as the replica set/mongo initiation is done via IP addresses, so no resolution is required. Not necessarily recommending this as a solution, but here for completeness.
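
A hedged Python sketch of the idea in that patch (the actual provider is Go; names and the fallback policy here are illustrative): prefer the MAAS hostname, but fall back to the raw IP when the name does not resolve, so replica set init needs no DNS:

```python
# Address selection with an IP fallback when DNS fails, mirroring the
# gist of the patch described in this comment. Not juju's real code.
import socket

def replset_address(hostname, ip, port=37017):
    """Prefer hostname:port; fall back to ip:port if the hostname
    does not resolve (the failure that triggers EMPTYCONFIG here)."""
    try:
        socket.gethostbyname(hostname)
        return "%s:%d" % (hostname, port)
    except socket.gaierror:
        return "%s:%d" % (ip, port)
```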

Andrew McDermott (frobware) wrote :

Missing screenshot from comment #30 added.

Curtis Hovey (sinzui) on 2015-11-03
Changed in juju-core:
milestone: 1.26-alpha1 → 1.26-alpha2
Chris Gregan (cgregan) wrote :

Just encountered this with juju 1.25 stable and the latest MAAS 1.9Beta2 release

2015-11-03 22:45:59 WARNING juju.replicaset replicaset.go:98 Initiate: fetching replication status failed: cannot get replica set status: can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2015-11-03 22:46:00 INFO juju.worker.peergrouper initiate.go:78 finished MaybeInitiateMongoServer
2015-11-03 22:46:00 ERROR juju.cmd supercommand.go:429 cannot initiate replica set: cannot get replica set status: can't get local.system.replset config from self or any seed (EMPTYCONFIG)
ERROR failed to bootstrap environment: subprocess encountered error code 1

Cheryl Jennings (cherylj) wrote :

cgregan - if you set up the MAAS to have a static IP range as frobware mentions in comment #30, does it enable you to bootstrap?

Andrew McDermott (frobware) wrote :

I've pushed a WIP fix here:

  https://github.com/frobware/juju/tree/master-lp1412621-wip

It helps fix/mitigate/work around this issue, but it breaks quite a few unit tests which I'm now working through.

Andrew McDermott (frobware) wrote :
Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
Cheryl Jennings (cherylj) wrote :

Verified that I don't run into this with the latest master (which contains the fix from comment #36)

Andres Rodriguez (andreserl) wrote :

From the MAAS perspective:

1. Have a static range defined under the Managed Cluster Interface.
2. Deploy a machine (without Juju) and ensure that it is DNS routable (note that MAAS' DNS server is running on the Region Controller and /etc/resolv.conf needs to be pointed to that).

If MAAS can deploy a machine that's reachable via DNS, then from the MAAS perspective, the node has been successfully deployed. Keep in mind that juju changes /etc/network/interfaces, and that might be causing the issues.

Andrew McDermott (frobware) wrote :

My apologies - I thought I had added the following to the bug:

When a node is deployed, the DNS zone tables are not always updated for the new node, which means its name is unresolvable. If you repeatedly bootstrap/destroy and also watch the zone tables, you will sometimes see that the node entry is not added. When that happens, juju bootstrap fails because the juju/mongo replica set initiation will try to use 'maas-node3' (the hostname returned by MAAS), which ultimately fails to resolve.

Matt Rae (mattrae) wrote :

Confirming the issue I've seen was the same as Andrew's description in #40. On bootstrap, sometimes the DNS record for the node was not added to the MAAS zone file, which caused bootstrapping to fail with this error.

I found that if I saved the cluster controller configuration, the record would then instantly be added to the zone file.

So far in testing MAAS 1.9 rc1+bzr4496 I haven't seen this issue on bootstrapping.

Curtis Hovey (sinzui) on 2015-11-25
Changed in juju-core:
status: Fix Committed → Fix Released
Cheryl Jennings (cherylj) wrote :

I was hitting this problem today when using the manual provider to bootstrap an EC2 instance. To resolve the issue, I had to open up ports 17070 and 37070 in the security group for the instance, and was then able to bootstrap normally.
