Joyent instances sometimes can't communicate via the internal network

Bug #1401130 reported by Curtis Hovey
Affects           Status         Importance   Assigned to           Milestone
juju-core         Fix Released   Critical     Horacio Durán
juju-core 1.21    Fix Released   Critical     Menno Finlay-Smits

Bug Description

http://juju-ci.vapour.ws:8080/job/joyent-deploy-precise-amd64/ has been failing over recent runs. I suspect a network issue is the real reason the agents cannot call back to the state-server, because Joyent has a history of network issues.

Both the upgrade tests and the bundle tests show that Joyent was very good for 1.20.x during this period. It looks like a machine got stuck in provisioning at the end of Dec 9.

cloud-init-output.log on a pending machine reports:
+ curl -sSfw tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s --insecure -o /var/lib/juju/tools/1.22-alpha1-precise-amd64/tools.tar.gz https://10.112.4.213:17070/tools/1.22-alpha1-precise-amd64
curl: (7) couldn't connect to host
tools from https://10.112.4.213:17070/tools/1.22-alpha1-precise-amd64 downloaded: HTTP 000; time 63.119s; size 0 bytes; speed 0.000 bytes/s + [ 5 -lt 5 ]
+ sleep 15
+ sha256sum /var/lib/juju/tools/1.22-alpha1-precise-amd64/tools.tar.gz
sha256sum: /var/lib/juju/tools/1.22-alpha1-precise-amd64/tools.tar.gz: No such file or directory
failed: /var/lib/cloud/instance/scripts/runcmd [1]
2014-12-10 14:29:20,646 - cc_scripts_user.py[WARNING]: failed to run-parts in /var/lib/cloud/instance/scripts

This issue may relate to bug 1383922

Revision history for this message
Curtis Hovey (sinzui) wrote :

The last pass for this test was commit a44dabb3; the first failure was commit 61df9231. There is only one suspect commit.

commit 61df9231 Merge pull request #1289 from howbazaar/machine-command

Curtis Hovey (sinzui)
description: updated
Revision history for this message
Tim Penhey (thumper) wrote :

Curtis, that commit didn't touch any networking stuff in any way.

I suspect something else is at play.

Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The test spins up 3 machines. Machines 0 and 1 are fine, but machine 2 can't contact the state server.

To reproduce the problem manually:
   juju bootstrap --upload-tools
   juju add-machine
   juju add-machine

The difference comes down to routing. Machines 1 and 2 both contact machine 0 using its internal address. On machine-1, eth1 (the internal address) is used to get to machine-0. On machine-2, eth0 ends up getting used (the external address, via the default route) and the outbound packets get dropped after the 2nd hop.

Here's an example of how networking looked when I reproed the problem:

machine-0:
   eth0: external address
   eth1: 10.112.1.26/21

machine-1:
   eth0: 165.225.128.14/23
   eth1: 10.112.7.85/21

   default via 165.225.128.1 dev eth0
   10.112.0.0/21 dev eth1 proto kernel scope link src 10.112.7.85
   165.225.128.0/23 dev eth0 proto kernel scope link src 165.225.128.14

machine-2:
   eth0: 165.225.131.78/23
   eth1: 10.112.79.247/21

   default via 165.225.130.1 dev eth0
   10.112.72.0/21 dev eth1 proto kernel scope link src 10.112.79.247
   165.225.130.0/23 dev eth0 proto kernel scope link src 165.225.131.78

It looks like the subnet masks being used aren't quite right but I'm not sure why yet.
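
(A quick way to check which route a machine will pick for the state server's internal address; a sketch using the 10.112.1.26 address from this repro, substitute as needed:)

   # Ask the kernel which route it would choose for the state server's
   # internal address. On a broken machine (like machine-2 above) this
   # shows "dev eth0" via the default gateway instead of "dev eth1".
   ip route get 10.112.1.26

   # The full routing table, for comparison.
   ip route show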

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Commit 61df9231 does NOT appear to be the problem. It still happens for me when I bootstrap with the revision before it.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

At this point this is looking like a problem at Joyent to me.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I'll be picking this up where Menno left off to investigate and possibly propose a solution.

Changed in juju-core:
assignee: Menno Smits (menno.smits) → Dimiter Naydenov (dimitern)
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I couldn't reproduce the issue, but I have an idea what's causing it.

It seems the cloud images on Joyent have a different networking config: /etc/network/interfaces includes /etc/network/interfaces.d/*.cfg (where only eth0.cfg is present) and /etc/network/interfaces.smartdc/*.cfg (where eth1.cfg is). eth0 is configured with the public IP, eth1 with the private one, and both are set to auto start and use DHCP.

That's all fine, but any machine we launch after bootstrapping has JobManageNetworking set. The bootstrap node doesn't have that job (I disabled it with https://github.com/juju/juju/pull/1046 as it was causing issues with MAAS). So what happens is that the networker worker runs in "intrusive mode" when JobManageNetworking is set and overwrites /etc/network/interfaces with our "managed" config, which only includes /etc/network/interfaces.d/*.cfg.
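
(To illustrate, a sketch of the shape of these files, not the exact contents of a Joyent image:)

   # Joyent image's original /etc/network/interfaces pulls in both
   # directories, so eth0 (public) and eth1 (private) each get
   # configured via DHCP:
   source /etc/network/interfaces.d/*.cfg
   source /etc/network/interfaces.smartdc/*.cfg

   # The "managed" file written by the networker keeps only the first
   # include, so /etc/network/interfaces.smartdc/eth1.cfg (the private
   # interface) is no longer sourced after the rewrite:
   source /etc/network/interfaces.d/*.cfg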

I'll propose a fix which will disable the networker on joyent (like we do on maas), until we have the time to fix it properly (i.e. generate a config suitable for joyent).

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Proposed a fix to unblock CI with https://github.com/juju/juju/pull/1308 and filed a separate bug #1401423 to properly fix the networker for joyent, when possible.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Backporting to 1.21 as well.

Changed in juju-core:
status: In Progress → Fix Committed
milestone: none → 1.22
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Backport proposed as https://github.com/juju/juju/pull/1309 - approved and will soon land.

Revision history for this message
Curtis Hovey (sinzui) wrote :

The Joyent deploy test for master (1.22-alpha1) still failed all three times; the 1.21 run passed (as it always has).

I wonder if this is an issue with 1.22 and *precise*.

Changed in juju-core:
status: Fix Committed → In Progress
Changed in juju-core:
assignee: Dimiter Naydenov (dimitern) → Menno Smits (menno.smits)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The problem is still there. I can easily reproduce it with 1.20.12, 1.21-beta3, 1.21-current and master. I don't believe that disabling the networker had any effect. You sometimes have to add several machines (3 or 4) before seeing it but eventually a new machine will not be able to talk to the state server.

The issue seems to be a "feature" of Joyent's cloud. Instances may end up with networks on their internal (eth1) interfaces that can't talk to each other. There's a forum post from Joyent themselves here:

https://help.joyent.com/entries/21748665-Private-vlans-are-not-routing-to-each-other-by-default-

The reason that the Joyent CI test occasionally works is that it only uses 3 machines and sometimes those machines end up being on the same internal network. It really is pure luck.

In the forum post the person from Joyent suggests routing all of 10.0.0.0/8 via eth1. I'm going to try this manually now to see if it helps.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've just found that sometimes the Joyent cloud will give you many machines in a row that are all on the same internal network. It can sometimes take many attempts to be given a machine that isn't on the same network as machine-0.

Revision history for this message
Curtis Hovey (sinzui) wrote :

We know from experience that Joyent will give you a machine that is stuck in provisioning. That is to say: we bring up a machine, it gets stuck, we destroy the env... but you cannot delete a machine that is in provisioning. It lingers. In subsequent bootstraps/deployments, the stuck machine will be offered as the requested machine. Once this happens, all Joyent-based tests fail.

Joyent's inability to delete/clean a machine might be worse than we believed.

I can confirm I bootstrapped with 1.20.14 and 1.20-beta4 several times today. I have not been able to bootstrap with 1.22-alpha1, so I still think something is worse in master than in the other branches.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

When 2 hosts can't communicate, adding a route for all of 10.0.0.0/8 to the internal interface fixes the problem (as described in the forum post). This needs to be done on both sides so that the return packets also use the correct interface.

The required command is: sudo ip route add 10/8 dev eth1

This is a bit of a hack and I'm not sure how we make this happen automatically.
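
(One possible way to make this automatic, sketched under the assumption that the eth1 stanza lives in /etc/network/interfaces.smartdc/eth1.cfg as described in an earlier comment and that an ifupdown post-up hook is acceptable:)

   # Applied by hand on each affected host (both ends need the route):
   sudo ip route add 10.0.0.0/8 dev eth1

   # Or made persistent with a post-up hook in the eth1 stanza, e.g.:
   #
   #   auto eth1
   #   iface eth1 inet dhcp
   #       post-up ip route add 10.0.0.0/8 dev eth1 || true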

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I have filed a support ticket with Joyent to get clarification on this.

summary: - Joyent deployments cannot download agents from state-server
+ Joyent instances sometimes can't communicate via the internal network
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Disabling the networker solved an important case: on all machines where it was running, eth1 was getting removed from /etc/network/interfaces, which means that if such a machine reboots it won't be able to communicate on the local network (10.x.x.x) *at all*. But as it appears, this wasn't the only issue.

Revision history for this message
Curtis Hovey (sinzui) wrote :

Speaking for users, there is a regression.

We have extensively tested master (1.22), 1.21, and 1.20 this week in
Joyent. Master always fails, whereas 1.21 and 1.20 pass and are more
reliable than AWS, where we often see instances not available.

Juju 1.22 and Joyent just don't work (even for small deployments). We
know that 1.22 must get the agent from the state-server, whereas 1.20
and 1.21 will get it from streams or a local container. After the
machine agent is started, it calls home. Maybe the network changes
between the time of cloud-init and starting the agent. Maybe it
doesn't change fast enough and we get an intermittent failure.

As for the extensive testing: we have unlimited resources in Joyent,
so Juju QA is using it to test changes to industrial testing
(repeatability). Using 1.22 built last week, plus 1.21 and 1.20, we
saw high success rates, sometimes 100% for all jujus. We tested bundle
deployments with 1.20.14, which gave us 100% success.

We do see intermittent failures using 1.20 in the Joyent cloud health
check, so we know statistically that the problem exists for every
juju, but we are seeing 100% failure for master tip. The success rates
were better for master last week, and the rates for 1.20 and 1.21 are
great for all weeks.

Revision history for this message
Curtis Hovey (sinzui) wrote :

We retested master and 1.21 in the last two hours. 1.21 passed the first time; master failed repeatedly.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

As discussed on juju-dev, I'm beginning to think there's another issue at play (perhaps unrelated to this one) that's preventing the Joyent tests from passing on master. I'll look at this once this networking issue is resolved.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Fixed in master in 851e44a9eb35a9904181ae28be0d5571768d9b23 and 5a583c4f7e9e1788a9d387ca711ed7ce780c24ff.

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Curtis Hovey (sinzui) wrote :

Precise still has not passed. The results are muddied by bug 1396099. A retest with captured logs shows the tools cannot be downloaded.

State-server: 72.2.114.47, 10.112.1.232
Other machine: 165.225.138.137, 10.112.72.30

Revision history for this message
Curtis Hovey (sinzui) wrote :

The new trusty deploy job did pass the first time. We retested to see if it had just gotten lucky. We see
    http://juju-ci.vapour.ws:8080/job/joyent-deploy-trusty-amd64/11/console
failed to download the tools from the state-server

State-server: 72.2.114.47, 10.112.1.232
Other machine: 165.225.138.137, 10.112.72.30

^ NOTE that these are the same set of addresses that the precise test got 15 minutes later
     http://juju-ci.vapour.ws:8080/job/joyent-deploy-precise-amd64/1253/console

Changed in juju-core:
assignee: Menno Smits (menno.smits) → Horacio Durán (hduran-8)
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Curtis Hovey (sinzui) wrote :

The failures reported in comments 23 and 24 happened when no version of juju could deploy a stack. 1.22-alpha1 was retested alongside 1.20.14 and 1.21-beta4 once the Joyent cloud health check was seen to pass. For each version of juju, these steps were repeated 5 times to gauge the reliability of success, i.e. that all 5 agents start. 1.22-alpha1 could not do this last week:
    juju bootstrap -e curtis-joyent --upload-tools
    juju deploy -e curtis-joyent -n 2 ubuntu
    juju status -e curtis-joyent
    juju destroy-environment -y curtis-joyent

1.20.14 passed 4 of 5 tries
1.21-beta4 passed 5 of 5 tries
1.22-alpha1 passed 5 of 5 tries.

1.22-alpha1 is equal to 1.21-beta4 and may even be better than 1.20.14.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Now backported to 1.21 (e0af8bc1cda130a8725fad2d535a83578ae25a8a and a3b1ac15805cf9db360ac3f51c33c3320499929d).

tags: added: network