[2.6.4] Manual Provider: Workers seems to crash on both controller and model machines

Bug #1833282 reported by Pedro Guimarães
This bug affects 1 person

Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Xenial deployment with the manual provider. The enable-ha feature is not used.
Juju 2.6.4 (2.6/candidate)
Running on top of VMware

Each VM is connected to multiple networks, and the VMs can reach each other on more than one of them.

Machines crash and go to the "down" state just after being added to the model.

I can still reach the machines using the "juju ssh" command, though.

Also, deploying 2.6.3/stable on this same environment worked fine last week.

Juju crashdumps for both the "kubernetes" and "controller" models, plus /var/log from the Juju controller, are here: https://drive.google.com/drive/folders/1ufEPRAMZvWzm9dvKUcWN8dNcdbnbZfsW?usp=sharing

description: updated
Revision history for this message
Richard Harding (rharding) wrote :

So it looks like the controller comes up, but when it goes to start the peergrouper worker, things get into error loops. The machines have multiple networks, but the manual provider does not support spaces, so things end up not coming up as expected. Here's a repeating section of the log: the peergrouper tries to start and causes a series of errors, raft then starts to error, and eventually we get a restart.

https://pastebin.canonical.com/p/bdt2fJR7D4/

Revision history for this message
Tim Penhey (thumper) wrote :

Can you clarify what you mean when you say it is running on top of VMware?

Did Juju provision the machine using VMware, or is this a machine that was previously provisioned and then manually bootstrapped?

If the machine is created by Juju then this is a much more urgent issue.

If the machine has been manually bootstrapped into, then it is an issue for the manual provider. Perhaps we should not support bootstrapping into a manual machine that has multiple NICs.

Tim Penhey (thumper)
Changed in juju:
status: New → Incomplete
Revision history for this message
Joseph Phillips (manadart) wrote :

I can't see why this would work on 2.6.3 and not on the candidate.

In any case, a fix for the logged warnings was added some time back:
https://github.com/juju/juju/pull/9964

It definitely should not error out for the single controller case. I will look into it.

Changed in juju:
status: Incomplete → In Progress
Changed in juju:
status: In Progress → Incomplete
Revision history for this message
Joseph Phillips (manadart) wrote :

Is this controller still up?

Can you get the machines collection from Mongo? You can connect to Juju's Mongo via one of the methods here:
https://discourse.jujucharms.com/t/login-into-mongodb/309

Then run:
"db.machines.find().pretty()"

What command are you using to bootstrap? Specifically, are you using a host-name or IP?

I did some testing locally with a 2-interface machine, and what I found when using an IP to bootstrap was:
1) The local machine returns its NIC addresses as scope "local-cloud".
2) The manual provider returns the address used to bootstrap as a "public" address.
3) When machine addresses are retrieved, any provider-sourced listings override machine-sourced ones.
4) The peer-grouper then has only a single "local-cloud" address and proceeds happily.

What can happen if bootstrapping with a host-name instead of an IP is that the host is not resolvable and no provider addresses are returned. This means the peer-grouper sees 2 "local-cloud" addresses and throws the error.
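The four observations above can be sketched as the following selection logic. This is a deliberately simplified illustration of the behaviour described in this bug, not Juju's actual peer-grouper code:

```python
# Sketch of the address-selection behaviour described above (simplified,
# not Juju's real implementation). Provider-sourced addresses override
# machine-sourced ones with the same value, and the peer grouper refuses
# to choose when more than one "local-cloud" address remains.

def effective_addresses(machine_addrs, provider_addrs):
    """Provider listings override machine listings for the same value."""
    merged = {a["value"]: a for a in machine_addrs}
    merged.update({a["value"]: a for a in provider_addrs})
    return list(merged.values())

def pick_peer_address(addresses):
    """Return the single local-cloud address, or None if ambiguous."""
    candidates = [a for a in addresses if a["scope"] == "local-cloud"]
    if len(candidates) != 1:
        return None  # ambiguous: this is where the peer grouper errors out
    return candidates[0]["value"]

# Two NICs, bootstrapped by IP: the bootstrap address is overridden as
# "public", one local-cloud address remains, and selection succeeds.
machine = [{"value": "10.0.0.5", "scope": "local-cloud"},
           {"value": "192.168.1.5", "scope": "local-cloud"}]
provider = [{"value": "10.0.0.5", "scope": "public"}]
print(pick_peer_address(effective_addresses(machine, provider)))

# Bootstrapping by an unresolvable host-name yields no provider
# addresses, so both local-cloud addresses survive and the result
# is ambiguous (None).
print(pick_peer_address(effective_addresses(machine, [])))
```

The same sketch also covers the host-name case in point 4: with no provider addresses to apply the override, two local-cloud addresses remain and the selection fails.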

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

I am using IPs only — say, an address from network A (which is the correct one).

However, when I ping other machines' hostnames, they resolve to addresses on another network, call it network B.

In /etc/hosts, on the other hand, each hostname is configured with a network A address.

My default gateway is on network B.

Changed in juju:
status: Incomplete → New
Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

@manadart, can you retry with your environment set up as above? ^^

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi,

I've rebuilt the environment and run the db command as suggested; here is the output: https://drive.google.com/open?id=11PEMBc7tOcqr0cr57_QYfZbmh4sk76gP

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

I've restarted jujud-machine-0.service at 06:49; here are the logs: https://pastebin.canonical.com/p/N9ntPGGXMW/

This node, for some reason, had Docker installed and a docker0 bridge configured.
Note this line:
2019-06-20 06:50:06 DEBUG juju.worker.certupdater certupdater.go:196 new addresses [localhost juju-apiserver juju-mongodb anything MGMT_ADDRESS DOCKER0_GW]

Deleting docker0 seemed to stabilize the env.
I will run some retrials to confirm this was the issue.
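The docker0 symptom above suggests one possible workaround besides deleting the bridge: filtering out addresses that belong to container-bridge interfaces before they reach controller workers. A hypothetical helper, where the bridge-name list and the (interface, address) tuples are assumptions for illustration, not Juju behaviour:

```python
# Sketch: drop addresses belonging to well-known container-bridge
# interfaces (docker0, virbr0, lxdbr0, ...) before handing the list
# to workers like the certupdater. Hypothetical helper, not Juju code.

CONTAINER_BRIDGE_PREFIXES = ("docker", "virbr", "lxdbr", "cni")

def filter_bridge_addresses(addrs):
    """addrs: iterable of (interface_name, address) pairs."""
    return [(iface, addr) for iface, addr in addrs
            if not iface.startswith(CONTAINER_BRIDGE_PREFIXES)]

# Mirroring the certupdater log line above: the management address is
# kept, the docker0 gateway is dropped.
addrs = [("ens3", "MGMT_ADDRESS"),
         ("docker0", "DOCKER0_GW")]
print(filter_bridge_addresses(addrs))
```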

However, I can still see that enable-ha gets messed up:
https://pastebin.canonical.com/p/qKtCZgChmW/

MongoDB seems to change its preferred address from the data network to the management network after I run juju enable-ha --to=1,2.
The management network, however, has very strict firewall rules, and port 37070 probably won't be available.

Full log since 06:50 with enable-ha (couple of hours later): https://pastebin.canonical.com/p/zWHgBhvG89/

Revision history for this message
Joseph Phillips (manadart) wrote :

Yes, so the machine has three local-cloud IPv4 addresses, one of which is overridden as public by the one used to provision it.

That leaves two local-cloud addresses, and without space designation the peer-grouper will not proceed if there is more than one.

So my testing was not a correct replication. If I add a third NIC, I will get the same issue.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1833282] Re: [2.6.4] Manual Provider: Workers seems to crash on both controller and model machines

I don't know if we need a quick fix, but the real fix here (IMO) would be
to allow declaring spaces/subnets with manually provisioned machines,
rather than trying to hack around the lack of spaces.


Revision history for this message
Tim Penhey (thumper) wrote :

I think we all agree that this would be the correct approach.

However, until we have mutable spaces and networks, this is very hard to get right, because you cannot undo a mistake.

Changed in juju:
status: New → Triaged
milestone: none → 2.7-beta1
Changed in juju:
milestone: 2.7-beta1 → 2.7-rc1
Changed in juju:
milestone: 2.7-rc1 → none
importance: Undecided → Wishlist
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Wishlist → Low
tags: added: expirebugs-bot