Firewaller issues on vmware vsphere

Bug #1732665 reported by Merlijn Sebrechts
This bug affects 1 person
Affects  Importance  Assigned to
juju     High        Unassigned
2.3      High        Unassigned

Bug Description

Using Juju 2.3-rc1.1 on vmware vsphere provider, the firewaller seems to be broken. I tested this in the controller model.

As a result, everything is reachable from the outside, exposed or not.

The full log, in all its glory, is here: http://paste.ubuntu.com/25973402/

juju debug-log --replay | grep -v 'juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: Host key verification failed' | grep "firewall"

machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: <nil>
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: <nil>
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker started
machine-0: 17:37:19 DEBUG juju.worker.firewaller started watching opened port ranges for the model
machine-0: 17:37:19 DEBUG juju.worker.firewaller started watching "machine-0"
machine-0: 17:37:19 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: <nil>
machine-0: 17:37:20 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: <nil>
machine-0: 17:37:20 DEBUG juju.worker.dependency "firewaller" manifold worker started
machine-0: 17:37:20 DEBUG juju.worker.dependency "firewaller" manifold worker stopped: failed to list open ports: Host key verification failed.
github.com/juju/juju/worker/firewaller/firewaller.go:258:
machine-0: 17:37:20 DEBUG juju.worker.firewaller started watching opened port ranges for the model
machine-0: 17:46:46 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: ssh: Could not resolve hostname : Name or service not known
machine-0: 17:46:59 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: ssh: Could not resolve hostname : Name or service not known
machine-0: 17:47:31 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: ssh: Could not resolve hostname : Name or service not known
machine-0: 19:02:33 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: ssh: Could not resolve hostname : Name or service not known
machine-0: 19:02:36 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: ssh: Could not resolve hostname : Name or service not known
machine-0: 19:02:40 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: ssh: Could not resolve hostname : Name or service not known
machine-0: 11:15:25 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: cannot respond to units changes for "machine-1": failed to configure ports on external network: Host key verification failed.
machine-0: 11:15:28 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: cannot respond to units changes for "machine-1": failed to configure ports on external network: Host key verification failed.

Merlijn Sebrechts (merlijn-sebrechts) wrote :

Also, if `external-network` is not specified during bootstrap, the logs are spammed with this:

machine-0: 11:33:54 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: cannot respond to units changes for "machine-7": Can't close/open ports without external network

It might be a good idea to stop retrying this when the controller was bootstrapped without `external-network`.

description: updated
John A Meinel (jameinel) wrote :

It would be good to understand where we were trying to SSH that we couldn't actually resolve the hostname.

Changed in juju:
importance: Undecided → High
status: New → Incomplete
Merlijn Sebrechts (merlijn-sebrechts) wrote :

John, do you need extra information from me? You can find the full log here: http://paste.ubuntu.com/25973402/

This may be relevant: the hostnames of the VMs are not resolvable at the moment, although this hasn't posed an issue in the past.

Merlijn Sebrechts (merlijn-sebrechts) wrote :

That might actually explain why I see both "cannot resolve hostname" and "host key verification failed". Juju probably first tries to log in using the hostname; that fails, resulting in a fallback to the IP address, but the host keys are associated with the hostname, not the IP, so host key verification fails?

Merlijn Sebrechts (merlijn-sebrechts) wrote :

I fixed the DNS issues. Now I don't get the `cannot resolve hostname` issues, but I still get the "host key verification failed" issue.

Tim Penhey (thumper)
tags: added: firewaller vmware
Andrew Wilkins (axwalk)
Changed in juju:
milestone: none → 2.3.1
status: Incomplete → Triaged
Mark Shuttleworth (sabdfl) wrote :

What's the root cause on this, Andrew?

Andrew Wilkins (axwalk) wrote :

Merlijn, Mark: it appears that there's a regression due to us now being more strict about SSH host key verification.

The firewalling for vsphere works by managing iptables rules, but currently that is done by the controller via SSH (i.e. the controller connects to the vsphere machines via SSH, and runs iptables commands). The SSH connections previously ignored the host keys, and now they don't; and we don't know the host keys at the point where we make the SSH connections.

The quick and dirty solution is just to go back to not checking host keys. That works and should be OK, as there's no sensitive information being transferred. The worst attack I can conceive of is directing the controller to firewall some other machine.

The better but more involved solution is to not use SSH at all, nor manage it from the controller, and instead have each machine agent run a worker to manage its own iptables rules.
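
As a rough illustration of the trade-off described above, the sketch below builds the kind of `ssh` command line a controller-side process could use to list a machine's iptables rules, with and without strict host key checking. The helper name `iptables_over_ssh`, the `ubuntu` user and the exact iptables invocation are assumptions made for illustration and are not taken from Juju's code; `StrictHostKeyChecking` and `UserKnownHostsFile` are standard OpenSSH client options.

```python
from typing import List


def iptables_over_ssh(host: str, strict: bool, user: str = "ubuntu") -> List[str]:
    """Build an ssh argv that lists the INPUT rules on a remote machine.

    Hypothetical helper: it only illustrates the host key trade-off.
    """
    argv = ["ssh"]
    if strict:
        # Refuse to connect unless the host key is already known.
        argv += ["-o", "StrictHostKeyChecking=yes"]
    else:
        # Accept whatever host key is presented and do not record it.
        argv += [
            "-o", "StrictHostKeyChecking=no",
            "-o", "UserKnownHostsFile=/dev/null",
        ]
    argv += [f"{user}@{host}", "sudo", "iptables", "-S", "INPUT"]
    return argv


if __name__ == "__main__":
    # Print the relaxed variant for one of the machine addresses.
    print(" ".join(iptables_over_ssh("10.10.139.77", strict=False)))
```

With `strict=True` the connection can only succeed if the caller already knows the machine's host key, which is the piece of information the controller is said to lack at that point.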

Changed in juju:
milestone: 2.3.1 → none
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.3.2
Andrew Wilkins (axwalk) wrote :

https://github.com/juju/juju/pull/8200 fixes the regression by disabling strict host key checking, and fixing some iptables-specific bits. I'd like to leave this bug open so that we can do this at the machine agent level.

Andrew Wilkins (axwalk)
Changed in juju:
milestone: 2.3.2 → 2.4-beta1
Merlijn Sebrechts (merlijn-sebrechts) wrote :

This doesn't seem to fix the firewall.

If I create a 2.3.1 controller and I upgrade it to 2.3.2; the error messages change from

juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports: Host key verification failed

to

machine-0: 15:16:26 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: failed to list open ports:

If I bootstrap a 2.3.2 controller, I don't get any errors anymore, but the firewaller doesn't seem to be doing anything.

This is on an exposed unit with Jenkins:

$ sudo iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -i lxdbr0 -p tcp -m tcp --dport 53 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A INPUT -i lxdbr0 -p udp -m udp --dport 53 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A INPUT -i lxdbr0 -p udp -m udp --dport 67 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A FORWARD -o lxdbr0 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A FORWARD -i lxdbr0 -m comment --comment "managed by lxd-bridge" -j ACCEPT

Unit        Workload  Agent  Machine  Public address   Ports               Message
jenkins/0*  active    idle   0        193.190.127.175  8080/tcp,48484/tcp  Jenkins is running

I bootstrapped using the following command

juju bootstrap vmware1 vmware-test2 --config primary-network=V31_TENGU --config datastore=NFSSTORE1 --config external-network=V28_IBBTDMZ2

Note that the machines don't have a public address when they are created; the public address was manually added after deployment to the already existing interface connected to `V28_IBBTDMZ2`. Juju picks up that address after a while, as you can see in the status output above.
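
To make the `iptables -S` output in this comment easier to interpret, here is a minimal sketch that summarizes it: the default policy of each chain, the ports named in INPUT rules, and whether INPUT contains any DROP or REJECT rule. The function name `summarize` and the parsing approach are assumptions for illustration; this is not how Juju inspects the rules.

```python
import re
from typing import Dict, Set, Tuple

# Output of `sudo iptables -S` as posted in this comment.
IPTABLES_S = """\
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -i lxdbr0 -p tcp -m tcp --dport 53 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A INPUT -i lxdbr0 -p udp -m udp --dport 53 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A INPUT -i lxdbr0 -p udp -m udp --dport 67 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A FORWARD -o lxdbr0 -m comment --comment "managed by lxd-bridge" -j ACCEPT
-A FORWARD -i lxdbr0 -m comment --comment "managed by lxd-bridge" -j ACCEPT
"""


def summarize(rules_text: str) -> Tuple[Dict[str, str], Set[Tuple[str, int]], bool]:
    """Return (default policies, ports named in INPUT rules, INPUT has DROP/REJECT)."""
    policies: Dict[str, str] = {}
    input_ports: Set[Tuple[str, int]] = set()
    input_blocks = False
    for line in rules_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "-P":
            # "-P CHAIN POLICY" sets the default policy of a chain.
            policies[parts[1]] = parts[2]
        elif len(parts) > 2 and parts[0] == "-A" and parts[1] == "INPUT":
            if "-j DROP" in line or "-j REJECT" in line:
                input_blocks = True
            match = re.search(r"-p (\w+) .*--dport (\d+)", line)
            if match:
                input_ports.add((match.group(1), int(match.group(2))))
    return policies, input_ports, input_blocks


if __name__ == "__main__":
    policies, ports, blocks = summarize(IPTABLES_S)
    print("INPUT policy:", policies["INPUT"])
    print("ports named in INPUT rules:", sorted(ports))
    print("INPUT has DROP/REJECT rules:", blocks)
```

For the output above, INPUT defaults to ACCEPT, contains no DROP or REJECT rule, and names only the lxd-bridge ports, so nothing in it refers to 8080/tcp or 48484/tcp and nothing blocks other inbound traffic.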

Merlijn Sebrechts (merlijn-sebrechts) wrote :

Andrew, any update on this? Is it the expected behavior that the firewall doesn't work?

Ian Booth (wallyworld) wrote :

@Merlijn, Andrew no longer works at Canonical, but we'll start looking into the issue in this bug again this week.

tags: added: vsphere-provider
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
status: Triaged → In Progress
Heather Lanigan (hmlanigan) wrote :

@merlijn-sebrechts,

We are not able to reproduce what you're seeing in #9. It'd be helpful if you could do the following and verify where it works and where it doesn't.

For the machine in question, run `juju show-machine <#>`. Then, for each address listed under 'ip-addresses', run:
juju ssh --proxy --debug ubuntu@<ip>

If you've manually added addresses, please indicate which ones.
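
The check requested above can be scripted. The following is a minimal sketch that only builds one `juju ssh --proxy --debug` command per address; the function name `ssh_probe_commands` and the example address list are placeholders, and the addresses still have to be copied from the `juju show-machine` output by hand.

```python
from typing import List


def ssh_probe_commands(addresses: List[str], user: str = "ubuntu") -> List[List[str]]:
    """Build one `juju ssh --proxy --debug` command per machine address."""
    return [
        ["juju", "ssh", "--proxy", "--debug", f"{user}@{address}"]
        for address in addresses
    ]


if __name__ == "__main__":
    # Placeholder addresses: replace with the 'ip-addresses' of your machine.
    for command in ssh_probe_commands(["192.0.2.10", "192.0.2.11"]):
        print(" ".join(command))
```

Each printed command can then be run in turn, noting which addresses connect and which time out.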

Changed in juju:
status: In Progress → Incomplete
Merlijn Sebrechts (merlijn-sebrechts) wrote :

# juju show-machine 1

model: controller
machines:
  "1":
    juju-status:
      current: started
      since: 01 Mar 2018 14:13:10+01:00
      version: 2.3.3
    dns-name: 193.190.127.173
    ip-addresses:
    - 193.190.127.173
    - 10.10.139.77
    - 10.10.245.1
    instance-id: juju-0a40b7-1
    machine-status:
      current: running
      message: poweredOn
      since: 16 Nov 2017 15:37:16+01:00
    series: xenial
    network-interfaces:
      ens192:
        ip-addresses:
        - 10.10.139.77
        mac-address: 00:50:56:a4:00:20
        is-up: true
      ens224:
        ip-addresses:
        - 193.190.127.173
        mac-address: 00:50:56:a4:e3:19
        gateway: 193.190.127.129
        is-up: true
      ens256:
        ip-addresses:
        - 10.10.127.63
        mac-address: 00:50:56:a4:e1:7b
        is-up: false
      tun0:
        ip-addresses:
        - 10.15.216.1
        - 10.13.223.1
        - 10.10.244.1
        - 10.10.210.1
        - 10.10.233.1
        - 10.10.236.1
        - 10.10.231.1
        - 10.10.206.1
        - 10.10.201.1
        - 10.10.245.1
        mac-address: ""
        is-up: true
    hardware: arch=amd64 root-disk=8192M

# Which ones can we connect to?

juju ssh --proxy --debug ubuntu@193.190.127.173 -> connects to controller and then timeout
juju ssh --proxy --debug ubuntu@10.10.139.77 -> success!
juju ssh --proxy --debug ubuntu@10.10.245.1 -> connects to controller and then timeout

I can confirm that the `193.190.127.173` address is not reachable from inside the cluster. This is something we're still working on, but this shouldn't be an issue for Juju, since it can still use the other IP address, no?

193.190.127.173 is manually added. 10.10.245.1 is the IP of a tun network created by a VPN running on that machine.

Changed in juju:
assignee: Heather Lanigan (hmlanigan) → nobody
Anastasia (anastasia-macmood) wrote :

We are actively working on this right now. However, the fix will not make the point release. I am adjusting the milestone and assigning the bug to the person making progress in the codebase.

John A Meinel (jameinel) wrote :

The work that Witold is currently doing on VSphere does not directly fix this.
Specifically, we need to change from driving the firewall from outside of the machine, to having a worker running in the agent that is already on the machine, and having it drive the local iptables firewall.
Note that an on-instance firewaller would be useful in many other cases as well, not just on vsphere.
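
To give a feel for what such an on-machine worker might do, the sketch below computes the iptables commands needed to move from the currently allowed ports to the desired ones. Everything here is hypothetical: the `reconcile` function, the `(protocol, port)` representation and the use of plain ACCEPT rules in the INPUT chain are assumptions for illustration, not a description of any existing Juju worker. It also assumes that a default-deny rule or policy is set up elsewhere, since ACCEPT rules alone do not block anything.

```python
from typing import List, Set, Tuple

PortRule = Tuple[str, int]  # (protocol, port), e.g. ("tcp", 8080)


def reconcile(
    desired: Set[PortRule], current: Set[PortRule], chain: str = "INPUT"
) -> List[List[str]]:
    """Return the iptables commands that turn `current` into `desired`."""
    commands: List[List[str]] = []
    # Open ports that should be reachable but have no ACCEPT rule yet.
    for proto, port in sorted(desired - current):
        commands.append(
            ["iptables", "-A", chain, "-p", proto, "--dport", str(port), "-j", "ACCEPT"]
        )
    # Close ports that have an ACCEPT rule but should no longer be reachable.
    for proto, port in sorted(current - desired):
        commands.append(
            ["iptables", "-D", chain, "-p", proto, "--dport", str(port), "-j", "ACCEPT"]
        )
    return commands


if __name__ == "__main__":
    desired = {("tcp", 8080), ("tcp", 48484)}
    current = {("tcp", 8080)}
    for command in reconcile(desired, current):
        print(" ".join(command))
```

Because the commands would run locally on the machine, this approach needs no SSH connection from the controller and therefore no host key handling.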

Changed in juju:
milestone: 2.4-beta1 → none
Anastasia (anastasia-macmood) wrote :

Removing from the milestone as this work will not be done in the 2.3 series.

Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired