gce: bootstrap instance has no network rule for API

Bug #1436191 reported by Andrew Wilkins
This bug affects 1 person
Affects     Status         Importance   Assigned to        Milestone
juju-core   Fix Released   High         Dimiter Naydenov
1.23        Fix Released   High         Dimiter Naydenov

Bug Description

I've just bootstrapped the GCE provider on master (0f0986bb9771f974bcb52f1cf9543e5f9f521026). Bootstrap succeeds, but then "juju status" hangs because the API server port cannot be reached.

I investigated in the GCE console and found that the created instance was associated with the "default" network, and that this network had no rule allowing port 17070.

Andrew Wilkins (axwalk) wrote :

While destroying the environment, I got:

ERROR while removing instance "juju-9eddd2c8-55dc-4e48-89a0-99971c9bf984-machine-0": googleapi: Error 404: The resource 'projects/sunlit-inquiry-89505/global/firewalls/juju-9eddd2c8-55dc-4e48-89a0-99971c9bf984-machine-0' was not found, notFound
ERROR some instance removals failed: [juju-9eddd2c8-55dc-4e48-89a0-99971c9bf984-machine-0]

may well be related.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.24-alpha1
tags: added: gce-provider
removed: gce
Changed in juju-core:
assignee: nobody → Dimiter Naydenov (dimitern)
status: Triaged → In Progress
Dimiter Naydenov (dimitern) wrote :

I signed up for a GCE free trial account.

After some trial and error, I managed to bootstrap an environment with 1.23 successfully.

I was able to reproduce the issue (bootstrap succeeds, then juju status hangs).

Looking at the GCE web console, I can see a firewall rule added for the instance: source 0.0.0.0/0, tcp:17017

So the provider does add a rule, but for some reason it's not taking effect.
I'm still investigating.

Dimiter Naydenov (dimitern) wrote :

I found the problem and proposed a fix for it - http://reviews.vapour.ws/r/1282/

In summary, the issue is that the firewall rule for the api port (17070) is getting created as an instance-specific rule, unlike the default global rules GCE creates for each project (default-allow-icmp, default-allow-ssh, default-allow-internal).

Because the rule opening 17070/tcp is an instance rule, rather than an environment-level rule, it interacts badly with the firewaller when the default FwInstance mode is used.

I can see in the machine-0.log that once the firewaller starts, it almost immediately closes the 17070/tcp port on machine 0. While analyzing why this happens and why it's not affecting other providers, I realized it's fallout from the port ranges work, which I'll file a separate bug about.

Basically, port 17070/tcp is not recorded as open on machine 0 in state. The firewaller first works out which ports a machine needs (based on which units are assigned to it and whether any of them called OpenPorts). It then computes the difference between what it can see in state and what it can get from the provider, and opens or closes ports to make the two match. More details and logs will be added to the related bug I'll file.
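
The reconciliation described above can be illustrated with a minimal Go sketch. It is not taken from the juju source: the PortRange type and the diffPorts and sortRanges helpers are hypothetical names chosen for illustration, and the real firewaller is more involved. It only shows why a port that the provider reports as open, but that state does not list, ends up being closed.

```go
package main

import (
	"fmt"
	"sort"
)

// PortRange is a hypothetical, simplified stand-in for a port range;
// it is not the type used by juju itself.
type PortRange struct {
	Protocol string
	From, To int
}

// diffPorts compares the ports wanted according to state with the
// ports the provider reports as currently open. It returns the ranges
// to open (wanted but not open) and to close (open but not wanted).
func diffPorts(wanted, current []PortRange) (toOpen, toClose []PortRange) {
	want := make(map[PortRange]bool, len(wanted))
	for _, p := range wanted {
		want[p] = true
	}
	have := make(map[PortRange]bool, len(current))
	for _, p := range current {
		have[p] = true
	}
	for p := range want {
		if !have[p] {
			toOpen = append(toOpen, p)
		}
	}
	for p := range have {
		if !want[p] {
			toClose = append(toClose, p)
		}
	}
	sortRanges(toOpen)
	sortRanges(toClose)
	return toOpen, toClose
}

// sortRanges orders ranges by protocol, then by port, so that results
// are deterministic despite map iteration order.
func sortRanges(ps []PortRange) {
	sort.Slice(ps, func(i, j int) bool {
		if ps[i].Protocol != ps[j].Protocol {
			return ps[i].Protocol < ps[j].Protocol
		}
		if ps[i].From != ps[j].From {
			return ps[i].From < ps[j].From
		}
		return ps[i].To < ps[j].To
	})
}

func main() {
	// State records no open ports for machine 0, while the provider
	// reports the instance-level rule for the API port.
	inState := []PortRange{}
	fromProvider := []PortRange{{"tcp", 17070, 17070}}

	toOpen, toClose := diffPorts(inState, fromProvider)
	fmt.Println("to open:", toOpen)
	fmt.Println("to close:", toClose)
}
```

With these inputs the API port lands in the "to close" set, which matches the behaviour seen in machine-0.log.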

So my fix for GCE is to just always create a global firewall rule (named after the environment UUID + "juju-" prefix) for port 17070/tcp, rather than a rule just for machine 0. Live tested to work.
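
As a rough illustration of the naming difference between the two kinds of rule, here is a small Go sketch. The function names are hypothetical and not from the GCE provider code. The per-machine format is inferred from the firewall name in the 404 error earlier in this report, and the global format from the description above ("juju-" prefix plus environment UUID).

```go
package main

import "fmt"

// globalFirewallName returns the assumed name of the environment-wide
// rule: the "juju-" prefix followed by the environment UUID.
func globalFirewallName(envUUID string) string {
	return "juju-" + envUUID
}

// machineFirewallName returns the assumed name of an instance-specific
// rule, following the pattern seen in the error message in this report.
func machineFirewallName(envUUID, machineID string) string {
	return fmt.Sprintf("juju-%s-machine-%s", envUUID, machineID)
}

func main() {
	uuid := "9eddd2c8-55dc-4e48-89a0-99971c9bf984"
	fmt.Println("global rule:  ", globalFirewallName(uuid))
	fmt.Println("instance rule:", machineFirewallName(uuid, "0"))
}
```

A rule under the global name is not tied to machine 0, so the per-instance reconciliation done in FwInstance mode has no reason to touch it.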

Dimiter Naydenov (dimitern) wrote :

Fix ported to master as well - http://reviews.vapour.ws/r/1283/

Eric Snow (ericsnowcurrently) wrote :

Why can't the firewaller know not to close the API port on a state server? If it goes strictly by what ports are indicated for the machine in the state DB then why isn't the API port added already?

I would think that the firewaller is making an Instance.Ports call on each of the instances returned by Environ.StateServerInstances for a given provider. If the firewaller isn't the right place for that then I'd expect it to happen as part of bootstrap and the result stored in state for the machine, where the firewaller would later see it.

This implies that the provider's bootstrap handling is responsible for opening the state ports it needs (i.e. the status quo). That means each provider must build in its own logic for something they all have to do. The better approach would be to have this handled in the provider/common code. Better yet, at bootstrap the state-server-related ports would be recorded in the DB for the new state server machine, providers wouldn't have to worry about it, and the firewaller can just keep doing what it already does. :)

Of course the same considerations would apply for ensure-availability as for bootstrap.

Andrew Wilkins (axwalk) wrote :

The way the Azure provider does this is by just not reporting the API port in its Instance.Ports method. That way, the firewaller never thinks it needs to close the port, and API ports are opened only for state server instances.
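
A minimal Go sketch of that filtering idea follows. It is not the Azure provider's actual code: the Port type and reportedPorts function are hypothetical names used only to show how leaving the API port out of the reported list hides it from the firewaller.

```go
package main

import "fmt"

// Port is a hypothetical, simplified open-port record; it is not the
// type used by juju itself.
type Port struct {
	Protocol string
	Number   int
}

// reportedPorts returns the open ports with the API port filtered out,
// so a caller comparing this list against state never sees the API
// port and therefore never tries to close it.
func reportedPorts(open []Port, apiPort int) []Port {
	out := make([]Port, 0, len(open))
	for _, p := range open {
		if p.Protocol == "tcp" && p.Number == apiPort {
			continue // hide the API port from the firewaller
		}
		out = append(out, p)
	}
	return out
}

func main() {
	open := []Port{{"tcp", 22}, {"tcp", 17070}, {"tcp", 80}}
	fmt.Println(reportedPorts(open, 17070))
}
```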

Dimiter Naydenov (dimitern) wrote :

Yes, the provider is responsible for making sure the API server port is open. That hasn't changed, but I agree it can be improved.

How ports management and firewalling work will need to change and there are steps to get there, which I'll try to summarize in the bug I'm about to file about it.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released