multi nic guests - cannot ping/ssh intermittently-quantum network

Bug #1042397 reported by JB
This bug affects 1 person
Affects: neutron
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

In a two-node Essex-2012-1-1/quantum-2012.1/OpenVSwitch-vlanconf setup:

1) Created a shared network and a private network for a tenant.
2) Launched instances with a single NIC and with two NICs (one on the shared network, the other on the private network), using the ttylinux image.
3) Was able to ping and ssh to all the instances from the controller.

However, on the second NIC (eth1, which connects to the private network), ping and ssh fail intermittently: sometimes they succeed, sometimes they do not.

Whenever ping or ssh fails, I see a "bad udp cksum" error for that NIC's IP address in the tcpdump of that network's gateway:

sudo tcpdump -n -e -vv -ttt -i gw-** (gw-** is the gateway of the private network)

   x.x.x.x > 192.168.5.4.68: [bad udp cksum 0xb94e -> 0x4e4b!] BOOTP/DHCP, Reply, length 313, xid 0xbbc7a634, Flags [none] (0x0000)
          Client-IP 192.168.5.4
          Your-IP 192.168.5.4
          Server-IP 192.168.5.1
          Client-Ethernet-Address fa:16:3e:25:3d:de
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: ACK
            Server-ID Option 54, length 4: 192.168.5.1
            Lease-Time Option 51, length 4: 120
            RN Option 58, length 4: 56
            RB Option 59, length 4: 101
            Subnet-Mask Option 1, length 4: 255.255.255.0
            BR Option 28, length 4: 192.168.5.255
            Default-Gateway Option 3, length 4: 192.168.5.1
            Domain-Name-Server Option 6, length 4: 192.168.5.1
            Domain-Name Option 15, length 9: "novalocal"
            Hostname Option 12, length 8: "host-192"

Revision history for this message
dan wendlandt (danwent) wrote :

Are you sure that the 'bad udp checksum' message is visible only when you see problems, or is it that you are running tcpdump only when you see problems? Seeing that error with tcpdump is actually pretty common, I believe, so it may be a red herring.

I wonder if your VM is getting two different default gateways, one for each NIC. If a VM has multiple NICs, you need to make sure that only one of those subnets has a gateway IP. You can create a subnet without a gateway IP by using the --no-gateway option when creating the subnet (subnets get a gateway IP by default).

Revision history for this message
JB (jaybeltur) wrote : Re: [Bug 1042397] Re: multi nic guests - cannot ping/ssh intermittently-quantum network

1) The "bad udp cksum" message appears when the VM cannot be pinged; otherwise tcpdump shows "udp sum ok". When it shows "udp sum ok", the VM can be pinged and reached over ssh.

x.x.x.x > 192.168.5.4.68: [udp sum ok]BOOTP/DHCP, Reply, length 313, xid 0xbbc7a634, Flags [none] (0x0000)
            Client-IP 192.168.5.4
            Your-IP 192.168.5.4
            Server-IP 192.168.5.1
            Client-Ethernet-Address fa:16:3e:25:3d:de
            Vendor-rfc1048 Extensions
              Magic Cookie 0x63825363
              DHCP-Message Option 53, length 1: ACK
              Server-ID Option 54, length 4: 192.168.5.1
              Lease-Time Option 51, length 4: 120
              RN Option 58, length 4: 56
              RB Option 59, length 4: 101
              Subnet-Mask Option 1, length 4: 255.255.255.0
              BR Option 28, length 4: 192.168.5.255
              Default-Gateway Option 3, length 4: 192.168.5.1
              Domain-Name-Server Option 6, length 4: 192.168.5.1
              Domain-Name Option 15, length 9: "novalocal"
              Hostname Option 12, length 8: "host-192"

2) Yes, I have two gateways for the same tenant:

One gateway (gw-5dc620b6-02) for the shared network in the tenant, with IP 192.168.4.1 (192.168.4.0/24 network)
One gateway (gw-587d10b3-12) for the private network in the same tenant, with IP 192.168.5.1 (192.168.5.0/24 network)

Guest VMs: eth0 is on the shared network and eth1 is on the private network.
Pinging/ssh-ing the eth0 IP addresses from the controller host (which hosts the quantum server) works consistently.
Pinging/ssh-ing the eth1 IP addresses from the controller host works only intermittently.
Pinging/ssh-ing from one guest VM to another, on both the shared network and the private network, works consistently.

So, should there be only one gateway per tenant?

Thanks,
-Jay


Revision history for this message
Aaron Rosen (arosen) wrote :

There should be only one default gateway. Since you have two default gateways, the interface that packets leave on may change during the life of a connection, which would explain why things work intermittently. (This assumes you are pinging/ssh-ing from outside the subnet that the interface resides in.)

The bad UDP checksum is probably due to checksum offloading: the checksum is set only when the packet hits the NIC.

Revision history for this message
dan wendlandt (danwent) wrote :

Yes, only one of those networks can have the default gateway. But we should make sure there is a way in the API to configure host-routes on one of the subnets such that its gateway is not the default gateway, but can still be used to reach a specific set of remote subnets.

I'm a bit worried that the current logic may equate "gateway" and "default gateway".

Changed in quantum:
status: New → Incomplete
milestone: none → folsom-rc1
Revision history for this message
dan wendlandt (danwent) wrote :

Adding Salvatore so he can comment on the API spec for having two networks with a gateway, only one of which is the default gateway.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Thanks for raising this issue.
We believed we had completely sorted it out when we allowed subnets to be created without gateways, but this bug confirms that is not the case.

As this bug concerns the way routes are propagated to VMs, the fix also needs a counterpart in the options set in the DHCP agent, ensuring that the router option is set only for the default subnet.

We neglected that each VM should have only a single default gateway. That is a fairly big oversight, and it is due to the fact that most of our tests are single-NIC.

A possible solution would be to allow an optional attribute in a port's fixed_ips dictionary that specifies which subnet should be the default one for that port. Indeed, we might have 192.168.4.0/24 with gateway 192.168.4.1 and 192.168.5.0/24 with gateway 192.168.5.1, where 192.168.4.1 is the default gateway for one VM and 192.168.5.1 is the default for another.
If the default is missing, we can assume the first subnet is the default one, IMHO.

The above fix would work for multiple IPs on a single NIC, but not for multiple NICs, unless we add a check verifying that a single default subnet is specified across all ports with a given device_id. This is quite convoluted, and I will look at how to simplify it.
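
To make the proposed check concrete, here is a minimal sketch in Python. It is only an illustration of the idea above, not Quantum code: the "default" key inside each fixed_ips entry is a hypothetical attribute name, the port dictionaries are simplified, and applying the "first subnet wins" fallback per device_id rather than per port is an assumption made so that the multi-NIC case has a single answer.

```python
def default_subnet_per_device(ports):
    """Pick one default subnet per device_id, or raise if several are flagged.

    'default' is a hypothetical flag on a fixed_ips entry; it is not an
    existing Quantum attribute.
    """
    flagged = {}     # device_id -> subnet_ids explicitly flagged as default
    first_seen = {}  # device_id -> first subnet_id seen (fallback)
    for port in ports:
        device = port["device_id"]
        for fixed_ip in port["fixed_ips"]:
            first_seen.setdefault(device, fixed_ip["subnet_id"])
            if fixed_ip.get("default"):
                flagged.setdefault(device, []).append(fixed_ip["subnet_id"])

    chosen = {}
    for device, fallback in first_seen.items():
        candidates = flagged.get(device, [])
        if len(candidates) > 1:
            raise ValueError(
                "device %s has %d default subnets" % (device, len(candidates)))
        # No explicit flag: assume the first subnet is the default one.
        chosen[device] = candidates[0] if candidates else fallback
    return chosen


ports = [
    {"device_id": "vm-a", "fixed_ips": [{"subnet_id": "shared"}]},
    {"device_id": "vm-a", "fixed_ips": [{"subnet_id": "private", "default": True}]},
    {"device_id": "vm-b", "fixed_ips": [{"subnet_id": "shared"}]},
    {"device_id": "vm-b", "fixed_ips": [{"subnet_id": "private"}]},
]
print(default_subnet_per_device(ports))
```

Here vm-a explicitly picks the private subnet, while vm-b falls back to the first subnet it is attached to. Two flagged entries for the same device_id are rejected, which is the cross-port check described above.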

The 'quick and dirty' patch would consist of ensuring that the DHCP agent sends the router option only once to the same VM, but this would result in a nondeterministic assignment of the default gateway.

The above however would not address

Revision history for this message
dan wendlandt (danwent) wrote :

Actually, supporting this use case (as I understand it) was part of the original API design; I'm just not sure the implementation followed suit.

Let's say there were two subnets: 'public', which has the default gateway going to the Internet, and 'private', which has a gateway that routes 10.0.0.0/8 via a different gateway to the customer premises. The VM has two NICs, one on public and one on private. Traffic to the Internet should go out the public NIC, and traffic to the customer premises site should go out the private NIC. Here is how it "should" work, based on my understanding:

- Create the public subnet using all default settings. It will get a gateway IP of .1, and should automatically get a host_route associated with the subnet that sends 0.0.0.0/0 to the .1 gateway.
- Create a private subnet. There is no need to specify the gateway, as it can be the .1 in that subnet. However, you must specify host_routes, as leaving the default would mean that the .1 should be the default gateway for all hosts plugged into that network. Instead, host_routes should contain a single route: 10.0.0.0/8 via .1.

Does that solve the use case you're trying to achieve, Jay?
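
As a rough illustration of the two-subnet layout described above, the following Python sketch models the intended effective settings and the routes a two-NIC VM would end up with. It is a sketch under assumptions, not the Quantum implementation: the dictionaries are simplified stand-ins for subnet resources, the "destination"/"nexthop" keys inside host_routes are assumed field names, and the 192.168.4.0/24 and 192.168.5.0/24 CIDRs are borrowed from earlier in this report purely as example values.

```python
import ipaddress


def first_host(cidr):
    """Return the first usable address of a subnet (the '.1' gateway)."""
    return str(next(iter(ipaddress.ip_network(cidr).hosts())))


def make_subnet(name, cidr, route_destinations):
    """Build a simplified subnet whose host_routes all point at its .1 gateway."""
    gateway = first_host(cidr)
    return {
        "name": name,
        "cidr": cidr,
        "gateway_ip": gateway,
        "host_routes": [
            {"destination": destination, "nexthop": gateway}
            for destination in route_destinations
        ],
    }


# 'public' carries the default route; 'private' only routes 10.0.0.0/8.
public = make_subnet("public", "192.168.4.0/24", ["0.0.0.0/0"])
private = make_subnet("private", "192.168.5.0/24", ["10.0.0.0/8"])


def vm_routes(subnets):
    """Collect the (destination, nexthop) routes a VM on these subnets would get."""
    return [
        (route["destination"], route["nexthop"])
        for subnet in subnets
        for route in subnet["host_routes"]
    ]


routes = vm_routes([public, private])
default_routes = [route for route in routes if route[0] == "0.0.0.0/0"]
print(routes)
print("default routes:", len(default_routes))
```

The point of the layout is visible in the result: both subnets keep a gateway, but only the public one contributes a 0.0.0.0/0 route, so the VM has exactly one default gateway.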

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The alternative virtual network design proposed by Dan should actually solve the issue.
However, as confirmed by Dan, something in the implementation probably did not follow the original design.

If associating a gateway only with the public subnet solves this particular issue, I propose to fix this issue post Folsom.
As I stated in my previous, rather confused, comment, we could tweak the API in order to ensure only a single default gateway is assigned. However, considering that the approach is quite convoluted, this will translate into complexity exposed to API users, which we must avoid at all costs.

The correct way of looking at this issue is probably that the subnet's gateway is not necessarily the default gateway.

Revision history for this message
JB (jaybeltur) wrote :

In the quantum-2012.1 version, I could not find the --no_gateway option.

As suggested, I changed the routing table to the following (I changed the default gateway for the private tenant network from 0.0.0.0 to 192.168.17.1):

192.168.13.0    0.0.0.0         255.255.255.0   U     0      0        0 gw-386ccdef-a3
192.168.17.0    192.168.17.1    255.255.255.0   UG    0      0        0 gw-9c7ce537-69

Is this correct?

I still see ping (from the controller) working only intermittently when pinging the guest eth1 address 192.168.17.8.

When I ran traceroute (from the controller) while ping was not working, it did not seem to see the gateway (gw-9c7ce537-69, 192.168.17.1) as the next hop:
 traceroute 192.168.17.8
traceroute to 192.168.17.8 (192.168.17.8), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
30 * * *

When it works, it does not show this gateway as the next hop:
traceroute to 192.168.17.8 (192.168.17.8), 30 hops max, 60 byte packets
 1  192.168.17.8 (192.168.17.8)  2.161 ms  3.122 ms  2.790 ms

Because ping works only intermittently, the floating IP does not seem to work consistently either, since it gets NATed to the private IP address.

Thanks,
-Jay


Revision history for this message
JB (jaybeltur) wrote :

It looks like it is a problem with cloud-init in the images.
With the ttylinux image, cloud-init allows two default gateway routes to be set. However, cloud-init in the Ubuntu Maverick and Oneiric images allows only one default gateway to be injected (the gateway of the first NIC UUID provided on the command line).
[So, either quantum should not inject multiple default gateways into guests, or cloud-init in the guest image should not allow multiple default gateways to be set.]
[I think that is the reason ping (from the controller) works only intermittently for ttylinux instances. For Ubuntu instances, ping (from the controller) to the second NIC's address consistently fails, because the return path does not have the second default gateway route.]
[Also, I want to keep two gateways for these two networks. This is needed when guests launched with a single NIC need to reach out to other networks.]

nova list
| 5e8a5df3-c36c-4358-a19e-c48a1e4fece2 | test-tty-linux | ACTIVE | web-net=192.168.5.24; net-1=192.168.4.43|
| 630f9687-cc44-413c-97c7-61baa48b328e | test-maverick  | ACTIVE | web-net=192.168.5.27; net-1=192.168.4.46|
test-tty-linux$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
192.168.4.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
0.0.0.0         192.168.4.1     0.0.0.0         UG    0      0        0 eth0
0.0.0.0         192.168.5.1     0.0.0.0         UG    0      0        0 eth1

ubuntu@test-maverick:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.4.1     0.0.0.0         UG    100    0        0 eth0
192.168.4.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1

Controller:~$ route -n
192.168.4.0     0.0.0.0         255.255.255.0   U     0      0        0 gw-5dc620b6-02
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 gw-587d10b3-12
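
The difference between the two guests can be checked mechanically. The following Python sketch parses `route -n` output and lists the default routes; the two tables are copied from the guest output above, while the function name and the parsing approach are illustrative assumptions, not part of any OpenStack tool.

```python
TTYLINUX = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
192.168.4.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
0.0.0.0         192.168.4.1     0.0.0.0         UG    0      0        0 eth0
0.0.0.0         192.168.5.1     0.0.0.0         UG    0      0        0 eth1
"""

MAVERICK = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.4.1     0.0.0.0         UG    100    0        0 eth0
192.168.4.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
"""


def default_routes(route_n_output):
    """Return (gateway, interface) for every default route in `route -n` output."""
    found = []
    for line in route_n_output.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip the title line and anything malformed
        destination, gateway, genmask, flags = fields[:4]
        # A default route has destination and mask 0.0.0.0 and the G flag.
        if destination == "0.0.0.0" and genmask == "0.0.0.0" and "G" in flags:
            found.append((gateway, fields[7]))
    return found


for name, table in (("test-tty-linux", TTYLINUX), ("test-maverick", MAVERICK)):
    routes = default_routes(table)
    status = "OK" if len(routes) == 1 else "more than one default gateway"
    print(name, routes, status)
```

Run against the tables above, this reports two default routes for the ttylinux guest (one per NIC) and a single one for the Maverick guest, which matches the behavior described in this comment.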

Thanks,
-Jay


Revision history for this message
JB (jaybeltur) wrote :

Also, I want to keep two gateways for these two networks. This is needed so that guests launched with a single NIC are reachable from other networks through a floating IP.

Thanks,
-Jay


Aaron Rosen (arosen)
Changed in quantum:
assignee: nobody → Aaron Rosen (arosen)
Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Dan's scenario works if we set the no-gateway option for the private subnet.

+1 for Salvatore's spec to have a default fixed IP (subnet).

Revision history for this message
dan wendlandt (danwent) wrote :

Nachi, I think you're misunderstanding the scenario I was describing. In this scenario, both subnets have a gateway, so it does not make sense to use --no-gateway. The tricky part of the scenario is that while both subnets have a gateway, one of the gateways is not the default gateway; it is simply a gateway to a particular prefix. Does that make sense?

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Dan
I may be misunderstanding the Quantum v2.0 spec.
I assumed that gateway_ip is the IP of the default gateway that is distributed by DHCP.
If it is not the default gateway, the current spec does not make sense to me, because it takes only one gateway IP.

Revision history for this message
Aaron Rosen (arosen) wrote :

Hi Nachi,

All of these routes will need to be distributed via DHCP. Also, if I use --no-gateway, the DHCP server will advertise its own address as the default gateway for the subnet (which is not the intended behavior).

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Hi Aaron

Thank you for catching that!

Revision history for this message
JB (jaybeltur) wrote :

For now, the following workaround works for my scenarios:

1) I retain two gateways, one for each network.

2) Inside the guest VM, I delete the second default gateway (the one set on the eth1 device) from the routing table. This works for instances of all images.

This stops ping/ssh to the eth1 address from the controller from working, which is fine. At least the behavior is now consistent.
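
Step 2 could be scripted inside the guest along these lines. This is a minimal Python sketch that only builds and prints the commands rather than running them; it assumes the iproute2 `ip` tool is available in the guest image, that eth0 is the interface whose default route should be kept, and the function name is made up for illustration.

```python
def commands_to_drop_extra_defaults(default_routes, keep_iface="eth0"):
    """Build `ip route del` commands for default routes not on keep_iface.

    default_routes is a list of (gateway, interface) pairs, for example
    taken from the guest's routing table.
    """
    return [
        ["ip", "route", "del", "default", "via", gateway, "dev", iface]
        for gateway, iface in default_routes
        if iface != keep_iface
    ]


# Default routes as seen in the two-NIC ttylinux guest in this report.
guest_defaults = [("192.168.4.1", "eth0"), ("192.168.5.1", "eth1")]

commands = commands_to_drop_extra_defaults(guest_defaults)
for command in commands:
    # Dry run: print the command instead of executing it as root.
    print(" ".join(command))
```

Printing instead of executing keeps the sketch safe to try; the printed command would still have to be run as root in the guest, and it does not persist across a DHCP lease renewal that re-adds the route.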

Thanks,
-Jay


Revision history for this message
dan wendlandt (danwent) wrote :

Hi Jay,

Yes, for Essex, that may be the best workaround possible.

dan

Changed in quantum:
milestone: folsom-rc1 → none
Revision history for this message
dan wendlandt (danwent) wrote :

Aaron, can you file a separate bug to track any issues you and Nachi find while investigating the behavior of the folsom code? I'm untargeting this bug from RC1

Changed in quantum:
status: Incomplete → Won't Fix
Revision history for this message
dan wendlandt (danwent) wrote :

Setting this bug to 'Won't Fix', as this behavior was not in scope for Quantum in Essex.

Aaron Rosen (arosen)
Changed in quantum:
assignee: Aaron Rosen (arosen) → nobody