juju bootstrap node or deploy service with vSphere provider connects vm to wrong virtual switch

Bug #1619812 reported by Larry Michel
This bug affects 4 people
Affects | Status | Importance | Assigned to | Milestone
Canonical Juju | Fix Released | High | Andrew Wilkins |
2.2 | Fix Released | High | Andrew Wilkins |

Bug Description

It looks like juju bootstrap will pick the first available virtual switch for the bootstrap NIC. The problem is that if the virtual NIC is not on the correct network, then there won't be any network connectivity.

There should be a way to specify which virtual NIC to use. Alternatively, if Juju wants to use a known virtual connection, then the user should be responsible for naming the vswitch correctly, e.g. br100 or br-int.

Larry Michel (lmic)
description: updated
Curtis Hovey (sinzui)
tags: added: ci
Changed in juju:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Richard Harding (rharding)
milestone: none → 2.0-beta18
Changed in juju:
milestone: 2.0-beta18 → 2.0-rc1
Changed in juju:
milestone: 2.0-rc1 → 2.0.1
Revision history for this message
Larry Michel (lmic) wrote :

This one has a workaround. It works on an ESXi host with a single virtual switch connected to the right network. It could also work by swapping the network connections on the virtual switches so that the one Juju picks connects to the right network.

Changed in juju:
importance: Critical → High
milestone: 2.0.1 → 2.1.0
Changed in juju:
status: Triaged → In Progress
status: In Progress → Triaged
Changed in juju:
assignee: Richard Harding (rharding) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Larry,
It would be great to know if you can still see the issue with 2.1-rc1.

I am removing this from 2.1 milestone for now as we are not likely to address it in this release.

Changed in juju:
milestone: 2.1.0 → none
status: Triaged → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

@Anastasia, I will try to recreate.

Revision history for this message
Larry Michel (lmic) wrote :

@Anastasia, I have been able to recreate this with the 2.1 release. I recreated it with juju deploy service, where the VM came up connected to an arbitrary virtual switch. If there is more than one virtual switch, then juju can connect to any of them. As I indicated earlier, the user needs to be able to specify which virtual switch the VMs should be connected to. This is a minimal requirement until spaces can be implemented.

summary: - juju bootstrap node with vSphere provider connects to wrong virtual
- switch
+ juju bootstrap node or deploy service with vSphere provider connects vm
+ to wrong virtual switch
Changed in juju:
status: Incomplete → New
Revision history for this message
Adam Stokes (adam-stokes) wrote :

I am about to finish adding vsphere support to conjure-up and would be exposing the external-network capability when users add credentials. Could we get this looked at for 2.2?

tags: added: conjure
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Larry,

There were some changes between 2.1 point releases. Could you please specify the exact 2.1 version you've used?

@Adam,

Did you already start implementing support and see this bug as well?

Changed in juju:
status: New → Incomplete
Changed in juju:
milestone: none → 2.2.0
status: Incomplete → Triaged
Revision history for this message
John A Meinel (jameinel) wrote :

external-network is not about configuring what vswitch you want to connect to (at least from what I can debug reading the code). It is specifically more about "if I expose an application on this VM what subnet do I ask for IP addresses from".

Specifically, this is what I see:
 if ecfg.externalNetwork() != "" {
  ip, err := vm.WaitForIP(context.TODO())
  if err != nil {
   return nil, errors.Trace(err)
  }
  client := common.NewSshInstanceConfigurator(ip)
  err = client.ConfigureExternalIpAddress(spec.apiPort)
  if err != nil {
   return nil, errors.Trace(err)
  }
 }

and ConfigureExternalIpAddress is doing:
func (c *sshInstanceConfigurator) ConfigureExternalIpAddress(apiPort int) error {
 cmd := `printf 'auto eth1\niface eth1 inet dhcp' | sudo tee -a /etc/network/interfaces.d/eth1.cfg
sudo ifup eth1
sudo iptables -i eth1 -I INPUT -m state --state NEW -j DROP`

 if apiPort > 0 {
  cmd += fmt.Sprintf("\nsudo iptables -I INPUT -p tcp --dport %d -j ACCEPT", apiPort)
 }

Which means that if 'external-network' is supplied, it waits for the instance to start, then SSHes to the instance, creates a new 'eth1' device that is set up to use DHCP, and creates a firewall rule to drop everything but API port traffic.

It does appear that we might be setting up an interface from the outside to be using that device:
  s.DeviceChange = append(s.DeviceChange, &types.VirtualDeviceConfigSpec{
   Operation: types.VirtualDeviceConfigSpecOperationAdd,
   Device: &types.VirtualE1000{
    VirtualEthernetCard: types.VirtualEthernetCard{
     VirtualDevice: types.VirtualDevice{
      Backing: &types.VirtualEthernetCardNetworkBackingInfo{
       VirtualDeviceDeviceBackingInfo: types.VirtualDeviceDeviceBackingInfo{
        DeviceName: ecfg.externalNetwork(),
       },
      },
      Connectable: &types.VirtualDeviceConnectInfo{
       StartConnected: true,
       AllowGuestControl: true,
      },
     },
    },
   },
  })

Anyway, none of this has anything to do with selecting a particular vswitch to interact with. It is a very hackish hard-wired way of getting a machine that can run a Juju controller inside of vsphere.

It may also be that it munges IP tables to expose ports etc for eth1, but a lot of that code is quite hard coded to things like 'eth1' which also means it won't work for Xenial machines, only Trusty.
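
To illustrate the hard-coding concern, here is a minimal sketch of how the quoted shell script could take the interface name as a parameter instead of assuming 'eth1'. The function name externalIPCommand is hypothetical and is not part of Juju. The interface name "ens192" and port 17070 are only example values, and how the correct interface name would be discovered on a Xenial machine is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// externalIPCommand is a hypothetical variant of the script quoted above.
// It builds the same commands, but for a caller-supplied interface name
// rather than a hard-coded "eth1".
func externalIPCommand(iface string, apiPort int) string {
	lines := []string{
		// Raw string: the \n is passed through literally for the shell's printf.
		fmt.Sprintf(`printf 'auto %[1]s\niface %[1]s inet dhcp' | sudo tee -a /etc/network/interfaces.d/%[1]s.cfg`, iface),
		fmt.Sprintf("sudo ifup %s", iface),
		// Drop new inbound connections arriving on the external interface.
		fmt.Sprintf("sudo iptables -i %s -I INPUT -m state --state NEW -j DROP", iface),
	}
	if apiPort > 0 {
		// Re-allow the API port, as in the original snippet.
		lines = append(lines, fmt.Sprintf("sudo iptables -I INPUT -p tcp --dport %d -j ACCEPT", apiPort))
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Example values only; the real interface name depends on the guest.
	fmt.Println(externalIPCommand("ens192", 17070))
}
```

This only removes the literal 'eth1' from the command text. It does not address the broader point that external-network is unrelated to vswitch selection.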

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Larry, I do not understand how to reproduce this issue, primarily because I don't understand VMWare well. Can you please provide either an environment in which the issue manifests, or specific steps (e.g. CLI commands) to reproduce the issue?

AFAICT, we don't specify a vswitch anywhere. The Ubuntu OVFs hard code a connection to "VM Network".

Changed in juju:
status: Triaged → Incomplete
Changed in juju:
milestone: 2.2-beta2 → none
Revision history for this message
Larry Michel (lmic) wrote :

@Andrew, I have been trying to recreate with 2.2 but am not having any success, which is good if it means the problem has been fixed. Do you know whether this hard-coding was added recently? Do note, however, that hard-coding a default network would work for a lot of vsphere environments, but "VM Network" is just the default name for the default portgroup. There is nothing preventing a user from renaming it, and I've verified that the bootstrap will fail if I rename it to, say, VMNetwork. Typically, I'd name my portgroups based on the networks that they connect to.

I'll keep trying to reproduce until I can either get it to fail or conclude that the issue is no longer reproducible. If I see it again, then it should not be any problem to give you access to that environment.

Revision history for this message
Luis San Martin (pathcl) wrote :

It is still failing.. at least on vSphere 6.0 and juju

$ juju bootstrap vsphere/dc0 --bootstrap-constraints "cores=2 mem=4G root-disk=32G" --to zone=cluster1
Creating Juju controller "vsphere-dc0" on vsphere/dc0
Looking for packaged Juju agent version 2.2-beta4 for amd64
Launching controller instance(s) on vsphere/dc0...
 - downloading http://cloud-images.ubuntu.com/releases/server/releases/xenial/release-20
WARNING failed to create instance in availability zone cluster1: creating import spec: Host did not have any virtual network defined.
 - failed to create instance in any availability zone: creating import spec: Host did no
ERROR failed to bootstrap model: cannot start bootstrap instance: failed to create instance in any availability zone: creating import spec: Host did not have any virtual network defined.

$ juju --version
2.2-beta4-xenial-amd64

Guess I'll have to fix the ova

John A Meinel (jameinel)
tags: added: vsphere-provider
removed: vsphere
Revision history for this message
Javier Urien (javierurien) wrote :

Hello,
  +1 here. I have a Distributed Virtual Switch on my datacenter, and the bootstrap fails. I created a Standard Switch with no NICs, but I can't assign NICs to it, so I have no network on the VMs.
  From what I see here, there is no solution yet, and the workaround won't work for me. Is there a plan for how and when this problem will be fixed?

Best Regards.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Luis, sorry for the lack of reply - I was not subscribed, and this was just brought to my attention.

Luis and Javier, I'm not sure if there's a solution for this at the moment. We'll have to investigate and get back to you. I'll reopen this, since there's clearly something we're still missing.

Changed in juju:
status: Incomplete → Triaged
Andrew Wilkins (axwalk)
Changed in juju:
status: Triaged → In Progress
Andrew Wilkins (axwalk)
Changed in juju:
assignee: nobody → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

OK, so I think I've got a handle on what we need to do.

The OVA/OVFs on cloud-images.ubuntu.com specify a NIC using the "VM Network" network. We're not validating that, and not enabling you to override that. Moreover, there's absolutely no way it would work with a distributed port group.

So I'll look at doing the following:
 - introduce a "network" config, via which users will be able to specify the name of the primary network to which each VM will be attached
 - if "network" is unspecified, use "VM Network" if it exists; if it does not, but there is exactly one network available to the host/cluster, use that instead
 - if there's no "VM Network", and no single network, we'll probably just return an error and require the user to make a choice by setting the network config

We'll need to check the type of network, and treat distributed port groups a little bit specially. I'm hoping to have this in a 2.2.3 release soon.
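
The fallback order in the list above can be sketched roughly as follows. This is only an illustration of the proposed rules, not the code from the eventual fix: choosePrimaryNetwork is a hypothetical helper, networks are reduced to plain name strings, and rejecting a configured name that is not in the available list is an assumption that goes beyond what the list states. The special handling of distributed port groups is not modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultNetwork is the name the Ubuntu OVA/OVFs specify for the NIC.
const defaultNetwork = "VM Network"

// choosePrimaryNetwork is a hypothetical helper illustrating the plan above.
// configured is the value of the proposed "network" config ("" if unset);
// available lists the network names visible to the host/cluster.
func choosePrimaryNetwork(configured string, available []string) (string, error) {
	contains := func(name string) bool {
		for _, n := range available {
			if n == name {
				return true
			}
		}
		return false
	}
	if configured != "" {
		// Assumption: an explicitly configured name is validated.
		if !contains(configured) {
			return "", fmt.Errorf("network %q not found", configured)
		}
		return configured, nil
	}
	if contains(defaultNetwork) {
		return defaultNetwork, nil
	}
	if len(available) == 1 {
		// Exactly one candidate, so the choice is unambiguous.
		return available[0], nil
	}
	return "", errors.New("cannot pick a network; set the network config explicitly")
}

func main() {
	for _, available := range [][]string{
		{"VM Network", "storage"},
		{"VMNetwork"},
		{"mgmt", "storage"},
	} {
		name, err := choosePrimaryNetwork("", available)
		fmt.Printf("%v -> %q, err=%v\n", available, name, err)
	}
}
```

With this ordering, a renamed default portgroup (such as VMNetwork) still works when it is the only network, while a host with several non-default networks produces an error asking the user to choose.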

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1619812] Re: juju bootstrap node or deploy service with vSphere provider connects vm to wrong virtual switch

we seem to be chasing rabbits that aren't actually allowing us to model
spaces on VMware and actually support per application networks.

What is the granularity of a network? Is it a list of subnets that can talk
to each other (like AWS VPC) or is it more like a subnet?

John
=:->

On Jul 19, 2017 10:27, "Andrew Wilkins" <email address hidden>
wrote:

> OK, so I think I've got a handle on what we need to do.
>
> The OVA/OVFs on cloud-images.ubuntu.com specify a NIC using the "VM
> Network" network. We're not validating that, and not enabling you to
> override that. Moreover, there's absolutely no way it would work with a
> distributed port group.
>
> So I'll look at doing the following:
> - introduce a "network" config, via which users will be able to specify
> the name of the primary network to which each VM will be attached
> - if "network" is unspecified, use "VM Network" if it exists; if it does
> not, but there is exactly one network available to the host/cluster, use
> that instead
> - if there's no "VM Network", and no single network, we'll probably just
> return an error and require the user to make a choice by setting the
> network config
>
> We'll need to check the type of network, and treat distributed port
> groups a little bit specially. I'm hoping to have this in a 2.2.3
> release soon.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

> What is the granularity of a network? Is it a list of subnets that can talk to each other (like AWS VPC) or is it more like a subnet?

It's all Layer 2. See https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/virtual_networking_concepts.pdf

Revision history for this message
Andrew Wilkins (axwalk) wrote :

I think this PR fixes the issue: https://github.com/juju/juju/pull/7660.

The vCenter I have access to has only one host, and that host has only one available physical NIC. I'm chasing down another one that I can test with, but if anyone watching would like to help out on that front, that would be great.

Andrew Wilkins (axwalk)
Changed in juju:
milestone: none → 2.2.3
milestone: 2.2.3 → 2.3-alpha1
Andrew Wilkins (axwalk)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
nnutter (nnutter) wrote :

I don't know if it was a mistake but that PR was merged to develop and didn't go out in 2.2.3.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

@nnutter: the change was backported to 2.2 in https://github.com/juju/juju/pull/7671.

Changed in juju:
status: Fix Committed → Fix Released