Juju isn't handling errors from ec2 provider re firewall rules properly.

Bug #2038494 reported by Thomas Miller
This bug affects 1 person
Affects: Canonical Juju
Status: Invalid
Importance: High
Assigned to: Nicolas Vinuesa
Milestone: (none)

Bug Description

We have had people outside of Juju's control change security group rules, both adding and removing rules. When Juju then performs an action such as removing a rule that doesn't exist or adding a rule that already exists in the security group, we get errors from AWS such as "not found" or "already exists".

If we are removing a security group rule and get a not found error, we should consider this a success for the purpose of our state. Conversely, if we are adding a rule and it already exists, this shouldn't be an error.

Provided is some Juju output of the problem.

"errorCode": "Client.InvalidPermission.Duplicate",
"errorMessage": "the specified rule \"peer: 172.20.0.0/16, TCP, from port: 8082, to port: 8082, ALLOW\" already exists",

2023-09-28 05:50:10 ERROR juju.worker.dependency engine.go:695 "firewaller" manifold worker returned unexpected error: cannot close ports: operation error EC2: RevokeSecurityGroupIngress, https response error StatusCode: 400, RequestID: 7530a36f-99ca-41d9-8967-b13636f0fd0d, api error InvalidPermission.NotFound: The specified rule does not exist in this security group.

https://pastebin.ubuntu.com/p/sX4TyYtH4K/

This should just be a case of matching on the strongly typed AWS errors and returning a well-typed Juju error to the firewaller worker to make a decision on.

Tags: sts
Thomas Miller (tlmiller)
Changed in juju:
importance: Undecided → High
tags: added: sts
Changed in juju:
status: New → Triaged
Changed in juju:
assignee: nobody → Nicolas Vinuesa (nvinuesa)
milestone: none → 3.1.8
Harry Pidcock (hpidcock)
Changed in juju:
milestone: 3.1.8 → 3.3.3
Nicolas Vinuesa (nvinuesa) wrote :

@tlmiller I cannot reproduce this, maybe I'm doing something wrong. This is the scenario I'm following (both juju 3.1.7 and 3.3.1):

```
juju bootstrap aws/eu-west-3 c
juju add-model m
juju deploy ubuntu
juju exec --unit ubuntu/0 open-port 8080/tcp
```
At this point I see the logs
```
controller-0: 18:34:11 INFO juju.worker.firewaller opened port ranges [8080/tcp from 0.0.0.0/0,::/0] on "machine-0"
```
And the inbound rule has been correctly added to the security group.

Now, if I manually remove the rule from the security group in the aws console, and then run:
```
juju unexpose ubuntu
```
then I see the logs
```
controller-0: 18:35:16 INFO juju.worker.firewaller closed port ranges [8080/tcp from 0.0.0.0/0,::/0] on "machine-0"
```
And no error.

The same happens the other way around (if I manually create the rule before exposing the app).

Do you have a reproducer?

Changed in juju:
status: Triaged → Invalid
Ian Booth (wallyworld)
Changed in juju:
milestone: 3.3.3 → 3.3.4
Changed in juju:
status: Invalid → Incomplete
Koo Zhong Zheng (kzz333) wrote :

Hello,

For your information, I took another look at this bug, which basically involved manual changes to the network security group on the affected cloud machines in cases 00369877 and 00369883.

1) Regarding the loss of service connection: this was due to the removal of port 17070, which is required during upgrade. Below are the sample errors and status after this port was removed in my testing environment:

# juju status
Unit       Workload  Agent  Machine  Public address  Ports  Message
easyrsa/1  unknown   lost   6        10.6.1.94              agent lost, see 'juju show-status-log easyrsa/1'

Machine  State  Address    Inst id                               Series  AZ    Message
6        down   10.6.1.94  5b1612b0-65b8-4f43-819a-96195647729f  jammy   nova  ACTIVE

# sample juju logs, which will fail during upgrade
2024-03-01 09:15:03 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [75f369] "machine-6" cannot open api: unable to connect to API: dial tcp 10.6.1.96:17070: i/o timeout
2024-03-01 09:16:09 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [75f369] "machine-6" cannot open api: unable to connect to API: dial tcp 252.1.96.1:17070: i/o timeout
2024-03-01 09:17:34 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [75f369] "machine-6" cannot open api: unable to connect to API: dial tcp 252.1.96.1:17070: i/o timeout
2024-03-01 09:19:08 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [75f369] "machine-6" cannot open api: unable to connect to API: dial tcp 10.6.1.96:17070: i/o timeout

2) The looping of closing and opening ports happens because Juju gets confused about the previous and current state of the ports during the upgrade after unexpected manual alterations were made. Juju expects the network security group to be managed by Juju only. This looping is most likely to happen when a newer application version changes or adds rules in the network security group.

3) Furthermore, other useful ingress ports, such as port 22 (ssh), and egress ports used to contact other services had also been removed by the unofficial custom script.

4) Hence, I would like to suggest setting the status of this bug report to "Invalid", since this is not likely a bug caused by Juju's logic itself.

Best Regards,
Koo

Changed in juju:
milestone: 3.3.4 → 3.3.5
Changed in juju:
milestone: 3.3.5 → 3.3.6
Harry Pidcock (hpidcock)
Changed in juju:
status: Incomplete → Invalid
milestone: 3.3.6 → none