Default secgroup reset periodically to allow 0.0.0.0/0 for 22, 17070, 37017

Bug #1420996 reported by Michael Nelson on 2015-02-11
This bug affects 15 people
Affects     Importance   Assigned to
juju        Medium       Unassigned
juju-core   Medium       Unassigned
1.25        Medium       Unassigned

Bug Description

Steps to reproduce:
 1) bootstrap an ec2 environment *
 2) Edit the secgroup which juju creates for the specific environment, changing the inbound rules for 22, 17070, 37017 from "Anywhere" (0.0.0.0/0) to "Custom IP" (123.456.789.123/32)
 3) Verify shortly after that the ports are only accessible to the custom ip address

Expected result next day:
 4) View the secgroup and verify it's still set to the custom ip.

Actual result next day (or some time thereafter - haven't timed it):
 4) secgroup has been reset to 0.0.0.0/0, and so ports are accessible from anywhere.

* I've only seen this on ec2, but there are reports on the mailing list of seeing this with openstack too: https://lists.ubuntu.com/archives/juju/2015-February/004957.html
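Step 2 of the reproduction can also be done through the EC2 API rather than the console. The sketch below is a minimal, hypothetical illustration of what that edit amounts to: revoke the wide-open rules Juju created and re-authorize the same ports restricted to a single /32. The port list matches the bug title; the 203.0.113.5 address is a documentation placeholder, and the parameter shape assumed here is the boto3/EC2 IpPermissions format.

```python
# The three ports Juju opens by default on the bootstrap node's secgroup.
PORTS = (22, 17070, 37017)

def ec2_rule(port, cidr):
    """Build one EC2 IpPermissions entry for a single TCP port."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }

def narrow_rules(custom_cidr):
    """Return (revoke, authorize) parameter lists: drop the 0.0.0.0/0
    rules and re-add the same ports restricted to custom_cidr."""
    revoke = [ec2_rule(p, "0.0.0.0/0") for p in PORTS]
    authorize = [ec2_rule(p, custom_cidr) for p in PORTS]
    return revoke, authorize

revoke, authorize = narrow_rules("203.0.113.5/32")
# In practice these lists would be passed to boto3's
# revoke_security_group_ingress / authorize_security_group_ingress
# for the environment's Juju-created security group.
```

Per the bug, the narrowed rules survive only until Juju's next reconciliation pass, at which point they revert to 0.0.0.0/0.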

Dimiter Naydenov (dimitern) wrote :

It would be useful to see some logs (machine-0.log with logging-config: <root>=DEBUG) to better understand the reason for this behaviour.

For one, I know the firewaller is eager to close ports that it thinks shouldn't be open. Depending on the firewall-mode setting, the firewaller diffs the current set of ports against the changed ports coming from the environment (FwGlobal mode) or the instance (FwInstance mode), and then opens or closes ports as needed.

Another thing I noticed in both the EC2 and OpenStack providers is that we ignore the CIDRs when fetching security group rules from the cloud API (i.e. we assume all of them are 0.0.0.0/0), and also set CIDRs to 0.0.0.0/0 unconditionally when opening ports (adding rules). Combined with the equality checks inside the OpenStack provider (which ignore CIDRs) and the "revoke-non-existing-rule-is-ok" AWS behaviour the EC2 provider relies on, this definitely needs more investigation.

As for why secgroup rules are changed after some time, this is because the firewaller attempts to reconcile opened/closed ports on *every* machine, unit, or openedPorts change, as well as service exposing. If you manually change secgroup rules to open 22/tcp, juju will most likely treat this as "oops, I see an opened port which is not marked as opened, better fix that!"
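The interaction Dimiter describes can be captured in a toy model: because the firewaller compares ports only (CIDRs are ignored on read and hardcoded to 0.0.0.0/0 on write), one reconciliation pass both widens any manually narrowed rule and revokes any manually added port. This is a hypothetical sketch, not Juju's actual code; all names are made up.

```python
def reconcile(desired_ports, current_rules):
    """One firewaller-style pass over an instance's security group.

    desired_ports: ports Juju believes should be open.
    current_rules: dict of port -> CIDR currently in the security group.

    Rules for ports Juju doesn't know about are revoked, and every
    desired port is (re)written with a hardcoded 0.0.0.0/0 CIDR --
    the existing CIDR is never consulted.
    """
    return {port: "0.0.0.0/0" for port in desired_ports}

# The operator has narrowed SSH and added an extra monitoring port...
current = {
    22: "203.0.113.5/32",    # manually narrowed by the operator
    5666: "203.0.113.5/32",  # manually added (e.g. an NRPE check)
}
# ...but after the next machine/unit/openedPorts event:
after = reconcile({22, 17070, 37017}, current)
# every desired port is back to 0.0.0.0/0, and 5666 is gone entirely
```

This matches both symptoms reported in the thread: the overnight reset to 0.0.0.0/0 and, later, the Nagios rules being wiped on add-unit.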

Caio Begotti (caio1982) wrote :

Sysadmins from Canonical confirmed that the Juju + Openstack environments where I am testing this problem have no special differences between them. I don't see this problem occurring with Juju + Havana (Canonistack), though that doesn't mean it won't happen there; perhaps the way I tried to reproduce it there just didn't make a difference. I don't really know the release running on the other environment (Stagingstack).

I have tried pinning the SSH secgroup rule to IP range 0.0.0.0/0 just to see whether it was the IP range I was using. It didn't matter, as you can see below.

Using open-port (https://jujucharms.com/u/caio1982/open-port/) does not matter either: I had port 22 closed on Canonistack for the involved units, and their secgroups did not get reset; they were fine the next day. My open ports (all 22) on the other environment got their rules reset overnight, so I would rule this out.

Below you can see the result of my monitoring (running every 15min), showing the time the reset occurred, approximately 04:30 UTC I think. I really wonder why it happens at that specific time...

Thu Feb 12 04:24:48 UTC 2015
+-------------+-----------+---------+------------------+-------------------------+
| IP Protocol | From Port | To Port | IP Range         | Source Group            |
+-------------+-----------+---------+------------------+-------------------------+
| tcp         | 37017     | 37017   | 0.0.0.0/0        |                         |
| tcp         | 17070     | 17070   | 0.0.0.0/0        |                         |
| tcp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| udp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| icmp        | -1        | -1      |                  | juju-stg-pes-capomastro |
| tcp         | 22        | 22      | 91...removed/32  |                         |
| tcp         | 873       | 873     | 91...removed/32  |                         |
| tcp         | 5666      | 5666    | 91...removed/32  |                         |
| icmp        | -1        | -1      | 91...removed/32  |                         |
| tcp         | 22        | 22      | 162...removed/32 | <- my rule              |
| tcp         | 22        | 22      | 162...removed/32 | <- my rule              |
| tcp         | 22        | 22      | 0.0.0.0/0        | <- my rule              |
+-------------+-----------+---------+------------------+-------------------------+
Thu Feb 12 04:39:49 UTC 2015
+-------------+-----------+---------+------------------+-------------------------+
| IP Protocol | From Port | To Port | IP Range         | Source Group            |
+-------------+-----------+---------+------------------+-------------------------+
| tcp         | 37017     | 37017   | 0.0.0.0/0        |                         |
| tcp         | 17070     | 17070   | 0.0.0.0/0        |                         |
| tcp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| udp         | 1         | 65535   |                  | juju-stg-pes-capomastro |
| icmp        | -1        | -1      |                  | juju-stg-pes-capomastro |
| tcp         | 22 ...


Dimiter Naydenov (dimitern) wrote :

Thanks for the update!

However, without any logs I can't really analyze the problem. I understand the deployment is private and there will be sensitive data in the logs, but they can be scrubbed clean of those bits. The suspiciously regular interval (~10-15m) after which the unexpected changes happen leads me to think it might be related to the instance status poller somehow. But to know for sure I'd need some logs with 'logging-config: juju=DEBUG' set in environments.yaml.

Caio Begotti (caio1982) wrote :

Just got more info on the Openstack environment that shows this bug. It is running Icehouse, so something regarding secgroups changed between Havana and Icehouse and is now broken, though I understand that may span a huge load of changes in that part of the code, of course :-)

Dimiter, I can't really change the environment because I don't own it; it is maintained by Canonical's admins. What I can see in the logs is the following (they have some DEBUG entries in the past, but I don't really know if DEBUG was eventually turned off or what):

all-machines.log:machine-0: 2015-02-12 03:36:10 INFO juju.state.apiserver.charmrevisionupdater updater.go:107 retrieving revision information for 0 charms
all-machines.log:machine-0: 2015-02-12 03:36:10 ERROR juju.state.apiserver.charmrevisionupdater updater.go:111 finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused
all-machines.log:machine-0: 2015-02-12 03:36:10 ERROR juju revisionupdater.go:73 worker/charm revision lookup: cannot process charms: finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused
machine-0.log:2015-02-12 03:36:10 INFO juju.state.apiserver.charmrevisionupdater updater.go:107 retrieving revision information for 0 charms
machine-0.log:2015-02-12 03:36:10 ERROR juju.state.apiserver.charmrevisionupdater updater.go:111 finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused
machine-0.log:2015-02-12 03:36:10 ERROR juju revisionupdater.go:73 worker/charm revision lookup: cannot process charms: finding charm revision info: Cannot access the charm store. Are you connected to the internet? Error details: Get https://store.juju.ubuntu.com/charm-info: dial tcp 91.removed:443: connection refused

I will ask some of the sysadmins involved to step in and provide more info, but I can't promise they will do that ASAP.

Dimiter Naydenov (dimitern) wrote :

Ok, in the meantime I can suggest a few things to try:
1. Try adding more rules with different ports, protocols, and IP ranges
2. Paste the output of these commands (using paste.c.c or paste.ubuntu.com links please, not directly here, as that is very hard to follow):
nova list-extensions
nova version-list
nova secgroup-list
nova secgroup-list-rules default
nova secgroup-list-rules juju-stg-pes-capomastro
nova secgroup-list-default-rules
3. If you have juju access to that environment, you can try juju set-env logging-config='juju=DEBUG' to enable debug logs.
Of course, drop any sensitive data before pasting.

Caio Begotti (caio1982) wrote :

Okay, so, I don't want to pollute this report with more private stuff, but we narrowed it down inside Canonical and this does not actually seem to be a Juju bug for us on Stagingstack (running Juju with Icehouse). Only one guy on the sysadmin team knew there were some sanity checks being run overnight in the problematic environment, and he was in a distant timezone, so we didn't know about that until now. This is embarrassing, but please consider my reports somewhat invalid (though I understand it might still be a problem for EC2 users).

Aaron Bentley (abentley) on 2015-02-12
Changed in juju-core:
importance: Undecided → Medium
status: New → Triaged
Haw Loeung (hloeung) wrote :

Confirmed that this also affects environments using the Azure juju provider where 'input endpoints' are removed on changes to the environment.

Greg Mason (gmason) wrote :

A fairly reliable way to reproduce this on Azure is to restart the jujud-machine-0 process. Within a couple minutes, most juju agents should show as lost.

Canonical sysadmins have a rudimentary script which resets the lost configuration on Azure endpoints. Ideally, juju would fix this the same way it is getting broken.

David Lawson (deej) on 2016-07-28
tags: added: canonical-is
Anastasia (anastasia-macmood) wrote :

This will be resolved by a solution to bug #1287658. Marking it as duplicate.

Junien Fridrick (axino) wrote :

Hi,

I don't agree that this is a duplicate of bug #1287658. #1287658 is a feature improvement, and is classified as Wishlist, as it should be.

However, this bug is an actual bug that makes monitoring Azure environments with juju very painful, because the rules that allow said monitoring are removed every once in a while.

Changed in juju-core:
status: Triaged → New
Anastasia (anastasia-macmood) wrote :

@Junien Fridrick

Please clarify what version of Juju you are using. This bug was originally filed in Feb 2015.

Changed in juju-core:
status: New → Incomplete
importance: Medium → Undecided
Changed in juju-core:
status: Incomplete → Triaged
importance: Undecided → High
importance: High → Medium
status: Triaged → New
Barry Price (barryprice) wrote :

Speaking for Junien, this is a Juju 1.25.6 environment.

Changed in juju-core:
status: New → Triaged
Changed in juju-core:
status: Triaged → Won't Fix
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 2.1.0
Changed in juju:
assignee: nobody → Roufique hossain (rtatours)
Changed in juju-core:
assignee: nobody → Roufique hossain (rtatours)
Junien Fridrick (axino) on 2017-01-20
Changed in juju:
assignee: Roufique hossain (rtatours) → nobody
Changed in juju-core:
assignee: Roufique hossain (rtatours) → nobody
Changed in juju:
milestone: 2.1.0 → none
Nick Moffitt (nick-moffitt) wrote :

This bites us regularly on our public cloud instances. We just had an alert storm caused by a simple juju add-unit. When can we see a fix for this?

Tim Penhey (thumper) wrote :

How does this bite you exactly? What is setting off the alert storm?

Junien Fridrick (axino) wrote :

@thumper : we manually create restrictive secgroup rules allowing our Nagios master to run checks on the public cloud instances. We can't use the "open-port" feature of juju because it can only give unrestricted access to a port (i.e. no filter on source IP).

When we juju add-unit, juju wipes out these restrictive rules, so the Nagios master checks start failing, which creates an alert storm.

Tim Penhey (thumper) wrote :

Thanks, I'll add it to our lead chat next week.

Arguably the "edit the security group that juju manages, and then assume
that juju will never manage it again" is a bit of a misfeature.
There are several times that Juju reevaluates if the security group matches
the rules that you have told Juju to support. (what things are exposed,
adding units, etc.)

We are certainly missing the ability to inform Juju about more involved
rules that you would like us to use. A couple of options would be:
1) Allow a separate security group that can be user-managed, distinct from
the one that Juju manages. It's almost never good to have 2 'things'
(people/agents) managing the same object.
2) Allow for something more expressive than just 'expose': exposing to a
CIDR, controlling CIDRs on a per-endpoint/port basis, etc. There is a fair
bit of design work needed to make sure we're capturing abstractions that
let sysadmins express exactly what they're hoping for, while still forming
it as a set of promises rather than just arbitrary configuration that
leaves admins doing all the work to keep everything lined up correctly
all the time.
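Option 1) can be sketched as a small model: each instance carries several security groups, and reconciliation rewrites only the Juju-managed one, so operator rules in a separate group survive unchanged. This is purely illustrative of the proposal, not anything Juju implements; the group names and CIDRs are made up.

```python
def reconcile_instance(groups, juju_group, desired_ports):
    """Rewrite only the Juju-managed group; leave all other groups
    attached to the instance untouched."""
    groups = dict(groups)  # copy so callers keep their original state
    # Juju's group is rebuilt from its own model, wide open as today.
    groups[juju_group] = {p: "0.0.0.0/0" for p in desired_ports}
    return groups

groups = {
    "juju-env-default": {22: "203.0.113.5/32"},  # operator edit here: lost
    "ops-nagios": {5666: "198.51.100.7/32"},     # user-managed group: safe
}
after = reconcile_instance(groups, "juju-env-default", {22, 17070})
# "ops-nagios" is untouched; only the Juju group is rewritten
```

The separation works precisely because each object has exactly one manager, which is the point made above about two 'things' managing the same object.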


Paul Gear (paulgear) wrote :

Option 1) would be a welcome interim step whilst we wait for a solution to 2) (covered in lp:1287658).
