Default secgroup reset periodically to allow 0.0.0.0/0 for 22, 17070, 37017

Bug #1420996 reported by Michael Nelson on 2015-02-11
This bug affects 15 people
Affects     Importance   Assigned to
juju        Medium       Unassigned
juju-core   Medium       Unassigned
1.25        Medium       Unassigned

Bug Description

Steps to reproduce:
 1) bootstrap an ec2 environment *
 2) Edit the secgroup which juju creates for the specific environment, changing the inbound rules for 22, 17070, 37017 from "Anywhere" (0.0.0.0/0) to "Custom IP" (123.456.789.123/32)
 3) Verify shortly after that the ports are only accessible to the custom ip address

Expected result next day:
 4) View the secgroup and verify it's still set to the custom ip.

Actual result next day (or some time thereafter - haven't timed it):
 4) secgroup has been reset to 0.0.0.0/0, and so ports are accessible from anywhere.

* I've only seen this on ec2, but there are reports on the mailing list of seeing this with openstack too: https://lists.ubuntu.com/archives/juju/2015-February/004957.html
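Step 2 of the reproduction can also be done through the EC2 API rather than the console. The sketch below is a minimal, hypothetical illustration of what that edit amounts to: revoke the wide-open rules Juju created and re-authorize the same ports restricted to a single /32. The port list matches the bug title; the 203.0.113.5 address is a documentation placeholder, and the parameter shape assumed here is the boto3/EC2 IpPermissions format.

```python
# The three ports Juju opens by default on the bootstrap node's secgroup.
PORTS = (22, 17070, 37017)

def ec2_rule(port, cidr):
    """Build one EC2 IpPermissions entry for a single TCP port."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }

def narrow_rules(custom_cidr):
    """Return (revoke, authorize) parameter lists: drop the 0.0.0.0/0
    rules and re-add the same ports restricted to custom_cidr."""
    revoke = [ec2_rule(p, "0.0.0.0/0") for p in PORTS]
    authorize = [ec2_rule(p, custom_cidr) for p in PORTS]
    return revoke, authorize

revoke, authorize = narrow_rules("203.0.113.5/32")
# In practice these lists would be passed to boto3's
# revoke_security_group_ingress / authorize_security_group_ingress
# for the environment's Juju-created security group.
```

Per the bug, the narrowed rules survive only until Juju's next reconciliation pass, at which point they revert to 0.0.0.0/0.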

Dimiter Naydenov (dimitern) wrote :

It would be useful to see some logs (machine-0.log with logging-config: <root>=DEBUG) to better understand the reason for this behaviour.

For one, I know the firewaller is eager to close ports that it thinks shouldn't be open. Depending on the firewall-mode setting, the firewaller diffs the current set of ports against the changed ports coming from the environment (FwGlobal mode) or the instance (FwInstance mode), and then opens or closes ports as needed.

Another thing I noticed in both the EC2 and OpenStack providers is that we ignore the CIDRs when fetching security group rules from the cloud API (i.e. we assume all of them are 0.0.0.0/0), and also set CIDRs to 0.0.0.0/0 unconditionally when opening ports (adding rules). Combined with the equality checks inside the OpenStack provider (which ignore CIDRs) and the "revoke-non-existing-rule-is-ok" AWS behaviour the EC2 provider relies on, this definitely needs more investigation.

As for why secgroup rules are changed after some time, this is because the firewaller attempts to reconcile opened/closed ports on *every* machine, unit, or openedPorts change, as well as service exposing. If you manually change secgroup rules to open 22/tcp, juju will most likely treat this as "oops, I see an opened port which is not marked as opened, better fix that!"
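The interaction Dimiter describes can be captured in a toy model: because the firewaller compares ports only (CIDRs are ignored on read and hardcoded to 0.0.0.0/0 on write), one reconciliation pass both widens any manually narrowed rule and revokes any manually added port. This is a hypothetical sketch, not Juju's actual code; all names are made up.

```python
def reconcile(desired_ports, current_rules):
    """One firewaller-style pass over an instance's security group.

    desired_ports: ports Juju believes should be open.
    current_rules: dict of port -> CIDR currently in the security group.

    Rules for ports Juju doesn't know about are revoked, and every
    desired port is (re)written with a hardcoded 0.0.0.0/0 CIDR --
    the existing CIDR is never consulted.
    """
    return {port: "0.0.0.0/0" for port in desired_ports}

# The operator has narrowed SSH and added an extra monitoring port...
current = {
    22: "203.0.113.5/32",    # manually narrowed by the operator
    5666: "203.0.113.5/32",  # manually added (e.g. an NRPE check)
}
# ...but after the next machine/unit/openedPorts event:
after = reconcile({22, 17070, 37017}, current)
# every desired port is back to 0.0.0.0/0, and 5666 is gone entirely
```

This matches both symptoms reported in the thread: the overnight reset to 0.0.0.0/0 and, later, the Nagios rules being wiped on add-unit.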

Caio Begotti (caio1982) wrote :

Sysadmins from Canonical confirmed that the Juju + Openstack environments where I am testing this problem have no special differences between them. I don't see this problem occurring with Juju + Havana (Canonistack), though that doesn't mean it won't happen there; perhaps the way I tried to reproduce it there just didn't make a difference. I don't really know the release running on the other environment (Stagingstack).

I have tried pinning the SSH secgroup rule to IP range 0.0.0.0/0 just to see whether it was the IP range I was using. It didn't matter, as you can see below.

Using open-port (https://jujucharms.com/u/caio1982/open-port/) does not matter either: I had port 22 closed on Canonistack for the involved units, and their secgroups did not get reset; they were fine the next day. My open ports (all 22) on the other environment got their rules reset overnight, so I would rule this out.

Below you can see the result of my monitoring (running every 15min), showing the time the reset occurred, approximately 04:30 UTC I think. I really wonder why it happens at that specific time...

Thu Feb 12 04:24:48 UTC 2015
+-------------+-----------+---------+------------------+-------------------------+
| IP Protocol | From Port | To Port | IP Range         | Source Group            |
+-------------+-----------+---------+------------------+-------------------------+
| tcp         | 37017     | 37017   | 0.0.0.0/0        |                         |
| tcp         | 17070     | 17070   | 0.0.0.0/0        |                         |
| tcp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| udp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| icmp        | -1        | -1      |                  | juju-stg-pes-capomastro |
| tcp         | 22        | 22      | 91...removed/32  |                         |
| tcp         | 873       | 873     | 91...removed/32  |                         |
| tcp         | 5666      | 5666    | 91...removed/32  |                         |
| icmp        | -1        | -1      | 91...removed/32  |                         |
| tcp         | 22        | 22      | 162...removed/32 | <- my rule              |
| tcp         | 22        | 22      | 162...removed/32 | <- my rule              |
| tcp         | 22        | 22      | 0.0.0.0/0        | <- my rule              |
+-------------+-----------+---------+------------------+-------------------------+
Thu Feb 12 04:39:49 UTC 2015
+-------------+-----------+---------+------------------+-------------------------+
| IP Protocol | From Port | To Port | IP Range         | Source Group            |
+-------------+-----------+---------+------------------+-------------------------+
| tcp         | 37017     | 37017   | 0.0.0.0/0        |                         |
| tcp         | 17070     | 17070   | 0.0.0.0/0        |                         |
| tcp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| udp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| icmp        | -1        | -1      |                  | juju-stg-pes-capomastro |
| tcp         | 22 ...


Dimiter Naydenov (dimitern) wrote :

Thanks for the update!

However, without any logs I can't really analyze the problem. I understand the deployment is private and there will be sensitive data in the logs, but they can be scrubbed clean of those bits. The suspiciously regular interval (~10-15m) after which the unexpected changes happen leads me to think it might be related to the instance status poller somehow. But to know for sure I'd need some logs with 'logging-config: juju=DEBUG' set in environments.yaml.

Caio Begotti (caio1982) wrote :

Just got more info on the Openstack environment that shows this bug. It is running Icehouse, so something regarding secgroups changed between Havana and Icehouse and is now broken, though I understand that may span a huge load of changes in that part of the code, of course :-)

Dimiter, I can't really change the environment because I don't own it; it is maintained by Canonical's admins. What I can see in the logs is the following (they have some DEBUG entries in the past, but I don't really know if DEBUG was eventually turned off or what):

all-machines.log:machine-0: 2015-02-12 03:36:10 INFO juju.state.apiserver.charmrevisionupdater updater.go:107 retrieving revision information for 0 charms
all-machines.log:machine-0: 2015-02-12 03:36:10 ERROR juju.state.apiserver.charmrevisionupdater updater.go:111 finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused
all-machines.log:machine-0: 2015-02-12 03:36:10 ERROR juju revisionupdater.go:73 worker/charm revision lookup: cannot process charms: finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused
machine-0.log:2015-02-12 03:36:10 INFO juju.state.apiserver.charmrevisionupdater updater.go:107 retrieving revision information for 0 charms
machine-0.log:2015-02-12 03:36:10 ERROR juju.state.apiserver.charmrevisionupdater updater.go:111 finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused
machine-0.log:2015-02-12 03:36:10 ERROR juju revisionupdater.go:73 worker/charm revision lookup: cannot process charms: finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused

I will ask some of the sysadmins involved to step in and provide more info, but I can't promise they will do that ASAP.

Dimiter Naydenov (dimitern) wrote :

Ok, in the meantime I can suggest a few things to try:
1. Try adding more rules with different ports, protocols, and IP ranges
2. Paste the output of these commands (using paste.c.c or paste.ubuntu.com links please, not directly here, as that is very hard to follow):
nova list-extensions
nova version-list
nova secgroup-list
nova secgroup-list-rules default
nova secgroup-list-rules juju-stg-pes-capomastro
nova secgroup-list-default-rules
3. If you have juju access to that environment, you can try juju set-env logging-config='juju=DEBUG' to enable debug logs.
Of course, drop any sensitive data before pasting.

Caio Begotti (caio1982) wrote :

Okay, so, I don't want to pollute this report with more private stuff, but we narrowed it down inside Canonical and this does not actually seem to be a Juju bug for us on Stagingstack (running Juju with Icehouse). Only one guy on the sysadmin team knew there were some sanity checks being run overnight in the problematic environment, and he was in a distant timezone, so we didn't know about that until now. This is embarrassing, but please consider my reports somewhat invalid (though I understand it might still be a problem for EC2 users).

Aaron Bentley (abentley) on 2015-02-12
Changed in juju-core:
importance: Undecided → Medium
status: New → Triaged
Haw Loeung (hloeung) wrote :

Confirmed that this also affects environments using the Azure juju provider where 'input endpoints' are removed on changes to the environment.

Greg Mason (gmason) wrote :

A fairly reliable way to reproduce this on Azure is to restart the jujud-machine-0 process. Within a couple minutes, most juju agents should show as lost.

Canonical sysadmins have a rudimentary script which resets the lost configuration on Azure endpoints. Ideally, juju would fix this the same way it is getting broken.

David Lawson (deej) on 2016-07-28
tags: added: canonical-is
Anastasia (anastasia-macmood) wrote :

This will be resolved by a solution to bug #1287658. Marking it as duplicate.

Junien Fridrick (axino) wrote :

Hi,

I don't agree that this is a duplicate of bug #1287658. #1287658 is a feature improvement, and is classified as Wishlist, as it should be.

However, this bug is an actual bug that makes monitoring Azure environments with juju very painful, because the rules that allow said monitoring are removed every once in a while.

Changed in juju-core:
status: Triaged → New
Anastasia (anastasia-macmood) wrote :

@Junien Fridrick

Please clarify what version of Juju you are using. This bug was originally filed in Feb 2015.

Changed in juju-core:
status: New → Incomplete
importance: Medium → Undecided
Changed in juju-core:
status: Incomplete → Triaged
importance: Undecided → High
importance: High → Medium
status: Triaged → New
Barry Price (barryprice) wrote :

Speaking for Junien, this is a Juju 1.25.6 environment.

Changed in juju-core:
status: New → Triaged
Changed in juju-core:
status: Triaged → Won't Fix
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 2.1.0
Changed in juju:
assignee: nobody → Roufique hossain (rtatours)
Changed in juju-core:
assignee: nobody → Roufique hossain (rtatours)
Junien Fridrick (axino) on 2017-01-20
Changed in juju:
assignee: Roufique hossain (rtatours) → nobody
Changed in juju-core:
assignee: Roufique hossain (rtatours) → nobody
Changed in juju:
milestone: 2.1.0 → none
Nick Moffitt (nick-moffitt) wrote :

This bites us regularly on our public cloud instances. We just had an alert storm caused by a simple juju add-unit. When can we see a fix for this?

Tim Penhey (thumper) wrote :

How does this bite you exactly? What is setting off the alert storm?

Junien Fridrick (axino) wrote :

@thumper : we manually create restrictive secgroup rules allowing our Nagios master to run checks on the public cloud instances. We can't use the "open-port" feature of juju because it can only give unrestricted access to a port (i.e. no filter on source IP).

When we juju add-unit, juju wipes out these restrictive rules, so the Nagios master checks start failing, which creates an alert storm.

Tim Penhey (thumper) wrote :

Thanks, I'll add it to our lead chat next week.

Arguably the "edit the security group that juju manages, and then assume
that juju will never manage it again" is a bit of a misfeature.
There are several times that Juju reevaluates if the security group matches
the rules that you have told Juju to support. (what things are exposed,
adding units, etc.)

We are certainly missing the ability to inform Juju about more involved
rules that you would like us to use. A couple of options would be:
1) Allow a separate security group that can be user-managed, distinct from
the one that Juju manages. It's almost never good to have 2 'things'
(people/agents) managing the same object.
2) Allow for something more expressive than just 'expose': exposing to a
CIDR, controlling CIDRs on a per-endpoint/port basis, etc. There is a fair
bit of design work needed to make sure we're capturing abstractions that
let sysadmins express exactly what they're hoping for, while still forming
it as a set of promises rather than just arbitrary configuration that
leaves admins doing all the work to keep everything lined up correctly
all the time.
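Option 1) can be sketched as a small model: each instance carries several security groups, and reconciliation rewrites only the Juju-managed one, so operator rules in a separate group survive unchanged. This is purely illustrative of the proposal, not anything Juju implements; the group names and CIDRs are made up.

```python
def reconcile_instance(groups, juju_group, desired_ports):
    """Rewrite only the Juju-managed group; leave all other groups
    attached to the instance untouched."""
    groups = dict(groups)  # copy so callers keep their original state
    # Juju's group is rebuilt from its own model, wide open as today.
    groups[juju_group] = {p: "0.0.0.0/0" for p in desired_ports}
    return groups

groups = {
    "juju-env-default": {22: "203.0.113.5/32"},  # operator edit here: lost
    "ops-nagios": {5666: "198.51.100.7/32"},     # user-managed group: safe
}
after = reconcile_instance(groups, "juju-env-default", {22, 17070})
# "ops-nagios" is untouched; only the Juju group is rewritten
```

The separation works precisely because each object has exactly one manager, which is the point made above about two 'things' managing the same object.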


Paul Gear (paulgear) wrote :

Option 1) would be a welcome interim step whilst we wait for a solution to 2) (covered in lp:1287658).
