multi-machine topology, cannot reach an instance from the CLC

Bug #559230 reported by C de-Avillez
Affects                    Status        Importance  Assigned to  Milestone
eucalyptus (Ubuntu)        Fix Released  Medium      Mathias Gug
Lucid                      Fix Released  Medium      Mathias Gug

Bug Description

Release of Ubuntu: Lucid beta2, UEC images of 20100407.1
Package Version:
Expected Results:
Actual Results:

This is probably a setup error. I am running as follows:

   lucid-amd64-topo2:
     hosts:
       cempedak: CLC
       mabolo: Walrus
       marula: CC
       santol: SC
       sapodilla: NC
       soncoya: NC

Output from the runs is saved (I am setting up a bzr branch for them right now).
When SSH-ing into an instance, SSH usually fails with a timeout (the failure rate was about 94%).

I am running Mathias' uec-testing-scripts.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: eucalyptus-cloud 1.6.2-0ubuntu27
ProcVersionSignature: Ubuntu 2.6.32-19.28-server 2.6.32.10+drm33.1
Uname: Linux 2.6.32-19-server x86_64
Architecture: amd64
Date: Fri Apr 9 09:07:07 2010
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: eucalyptus

Revision history for this message
Thierry Carrez (ttx) wrote :

According to http://iso.qa.ubuntu.com/qatracker/result/3918/496 this works on a manual setup, so maybe it's a test artifact from the testrig... Would be good to confirm though.

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Marking High, as this needs to be resolved before Lucid's release. I'm going to target it at the 10.04 GA.

Marking Incomplete, as we don't yet have enough information to debug this.

I ran instances without a problem in the same setup (manually installed). I'm hoping there's just a problem with the test automation....

Stay tuned, this is being actively worked. Assigning to me.

Changed in eucalyptus (Ubuntu):
importance: Undecided → High
status: New → Incomplete
assignee: nobody → Dustin Kirkland (kirkland)
milestone: none → ubuntu-10.04
Revision history for this message
Thierry Carrez (ttx) wrote :

I'm setting up a 5-machine topology right now to confirm it's a testrig automation artifact, in which case I'll lower the priority.

You mentioned running with several different users. Could it be that the "euca-authorize" call was not issued for all users (but just for the first one), resulting in TCP port 22 connection failures for most of them?
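
For reference, the per-user authorization Thierry is asking about would look something like this; a minimal sketch, assuming each user's credentials (eucarc) are sourced first and that instances run in the "default" group:

# Authorize inbound SSH (TCP port 22) from anywhere for this user's "default" group.
# This must be repeated for every set of user credentials the tests run under.
euca-authorize default -P tcp -p 22 -s 0.0.0.0/0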

Revision history for this message
Thierry Carrez (ttx) wrote :

I definitely can't reproduce on my local topology2 install. Lowering priority until we can reproduce outside the test automation (which still needs to be fixed/stabilized for Lucid).

Changed in eucalyptus (Ubuntu Lucid):
importance: High → Medium
milestone: ubuntu-10.04 → none
Revision history for this message
C de-Avillez (hggdh2) wrote :

I tested it again this evening, with Dustin monitoring. We again used lucid-amd64-topo2, and based the installs on the daily server/UEC images (releases.ubuntu.com is not accessible from tamarind, so I could not use Beta2).

Installation was uneventful.

I then ran the config_single.yaml test. No problems starting instances, but the script (and I, manually) still could not SSH into them; the attempts failed with a timeout.

I then ran, just for the sake of it (I do not know what is, or is not, blocked by the firewall(s)), a traceroute against one of the instances from cempedak. It reached marula (the CC), and then showed only stars.

I then logged in to Marula, and SSH-ed to an instance I had started manually. I *could* reach it (but failed, correctly, on the public key -- I had not added a new key for this run, and the ones used by uec_test.py had already been revoked).

This is the log of the IRC chat between Dustin and myself:

2010-04-13 18:25:32 hggdh kirkland: nodes registered, running a single-instance test now
2010-04-13 18:33:02 hggdh kirkland: test running, log is being written to ~/uec-testing-scripts/resutls/single*
2010-04-13 18:33:09 hggdh kirkland: on cempedak
2010-04-13 18:33:20 kirkland hggdh: cool, and you can ssh in?
2010-04-13 18:35:08 hggdh kirkland: negative
2010-04-13 18:35:19 kirkland hggdh: cannot ssh in
2010-04-13 18:35:25 hggdh kirkland: ssh fails on timeout
2010-04-13 18:35:31 hggdh really sounds like routing
2010-04-13 18:36:18 kirkland hggdh: interesting
2010-04-13 18:36:25 kirkland hggdh: okay, put the log somewhere for me to check out
2010-04-13 18:38:27 hggdh kirkland: k. I just ran one instance by hand, and then tried to ssh into it -- fails with a timeout
2010-04-13 18:39:25 kirkland hggdh: okay, that's easy to reproduce
2010-04-13 18:39:27 kirkland hggdh: log?
2010-04-13 18:42:29 hggdh kirkland: people.c.c/~cerdea/single_test.log.2010-04-13_193218
2010-04-13 18:46:15 kirkland hggdh: rsync -aP people.canonical.com:~cerdea/single_test.log.2010-04-13_193218 .
2010-04-13 18:46:20 kirkland hggdh: file not found
2010-04-13 18:47:04 kirkland hggdh: found it, public_html
2010-04-13 18:47:27 hggdh heh. one wants it on public_html, another on the root ;-)
2010-04-13 18:49:35 kirkland hggdh: ls -alF users/admin/uectest-k0.priv
2010-04-13 18:50:07 kirkland hggdh: and cat that file, make sure it matches -----BEGIN RSA PRIVATE KEY-----
2010-04-13 18:50:33 kirkland hggdh: is that instance still running?
2010-04-13 18:50:43 kirkland hggdh: can you telnet to its port 22 ?
2010-04-13 18:51:03 hggdh kirkland: yes, the instance is still running
2010-04-13 18:52:00 hggdh kirkland: the priv key seems kosher
2010-04-13 18:52:27 kirkland hggdh: and telnet ?
2010-04-13 18:53:50 hggdh kirkland: timeout. Also, a traceroute (FWIW) reaches marula (the CC) and stops there
2010-04-13 18:54:07 kirkland hggdh: oh, interesting
2010-04-13 18:54:22 kirkland hggdh: that's got to be it
2010-04-13 18:54:25 hggdh kirkland: let me try to ssh from marula
2010-04-13 18:54:38 kirkland hggdh: yeah
2010-04-13 18:54:43 kirkland hggdh: scp the priv key over
2010-04-13 18:54:47 kirkland hggdh: and try from there
2010-04-13 18:55:15 hggdh kirkland: first test -- reachability -- succes...


Revision history for this message
C de-Avillez (hggdh2) wrote :

traceroute output:

ubuntu@cempedak:~/uec-testing-scripts$ sudo traceroute -n -p 22 -P tcp 10.55.55.100
[sudo] password for ubuntu:
traceroute to 10.55.55.100 (10.55.55.100), 30 hops max, 60 byte packets
 1 10.55.55.100 12.337 ms 0.062 ms 0.057 ms
 2 * * *
 3 * * *
 4 *^C
ubuntu@cempedak:~/uec-testing-scripts$

SSH try from Cempedak (the CLC):

ubuntu@cempedak:~/uec-testing-scripts$ ssh -vv 10.55.55.100
OpenSSH_5.3p1 Debian-3ubuntu3, OpenSSL 0.9.8k 25 Mar 2009
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to 10.55.55.100 [10.55.55.100] port 22.
debug1: connect to address 10.55.55.100 port 22: Connection timed out
ssh: connect to host 10.55.55.100 port 22: Connection timed out
ubuntu@cempedak:~/uec-testing-scripts$

SSH try from Marula (the CC):

ubuntu@marula:~$ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ./uectest-k0.priv ubuntu@10.55.55.100
Warning: Permanently added '10.55.55.100' (RSA) to the list of known hosts.
Permission denied (publickey). <--- expected, since I had not added a key (and was using the one from the uec_test run)
ubuntu@marula:~$

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Re-assigning to Mathias. I'm fairly certain that this is an issue particular either to the lab setup or to the test automation, as I haven't been able to reproduce it here, and Carlos seems to have triaged it down to a networking issue (traceroute failing from the CLC to the guest).

Carlos has reproduced the issue again and provided the logs that Mathias requested, so I moved it from incomplete to confirmed.

Changed in eucalyptus (Ubuntu Lucid):
assignee: Dustin Kirkland (kirkland) → Mathias Gug (mathiaz)
status: Incomplete → Confirmed
Revision history for this message
Thierry Carrez (ttx) wrote :

Some more information about the network configuration on the CC and the NC could help to compare our working systems with the failing one:

ip address show
sysctl -n net.ipv4.ip_forward

Could it be linked to the metadata service vs. VLAN issue that mathiaz reported?

Revision history for this message
C de-Avillez (hggdh2) wrote :

IP addresses and forward status on Marula:

ubuntu@marula:~$ ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:b3:25:ee:68 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/32 scope link eth0:metadata
    inet 10.55.55.8/24 brd 10.55.55.255 scope global eth0
    inet 172.19.1.1/27 brd 172.19.1.31 scope global eth0:priv
    inet 10.55.55.100/32 scope global eth0:pub
    inet6 fe80::225:b3ff:fe25:ee68/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:25:ee:6a brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:25:ee:6c brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:25:ee:6e brd ff:ff:ff:ff:ff:ff
ubuntu@marula:~$ sysctl -n net.ipv4.ip_forward
1
ubuntu@marula:~$

Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Just a question: why is VNET_PUBLICIPS commented out in eucalyptus.conf?

Revision history for this message
C de-Avillez (hggdh2) wrote :

from a NC (Sapodilla):

ubuntu@sapodilla:~$ ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:b3:1f:d6:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::225:b3ff:fe1f:d6ee/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:1f:d6:f0 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:1f:d6:f2 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:1f:d6:f4 brd ff:ff:ff:ff:ff:ff
6: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:25:b3:1f:d6:ee brd ff:ff:ff:ff:ff:ff
    inet 10.55.55.3/24 brd 10.55.55.255 scope global br0
    inet6 fe80::225:b3ff:fe1f:d6ee/64 scope link
       valid_lft forever preferred_lft forever
7: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether ee:47:b3:d7:54:ca brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
    inet6 fe80::ec47:b3ff:fed7:54ca/64 scope link
       valid_lft forever preferred_lft forever
ubuntu@sapodilla:~$ sysctl -n net.ipv4.ip_forward
1
ubuntu@sapodilla:~$

Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Looking at the comment history, I think it might be that the VNET_PUBLICIPS range and the address of VNET_PUBINTERFACE are in the same subnet, which cannot possibly work. This is a known limitation related to netfilter; ask Dan about it.

But without knowing what the value of VNET_PUBLICIPS on the CC is, I cannot be sure.

Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Just to explain a bit further: the reason I suspect VNET_PUBLICIPS to be wrong is that eth0 on the CC has two IPs in the 10.55.55.0/24 range (see comment #11). I presume 10.55.55.8 is the CC's actual IP, and comment #7 indicates that 10.55.55.100 is the IP of the instance. In this case, VNET_PUBLICIPS overlaps with 10.55.55.0/24, which would screw up SNAT/DNAT on the CC. I am actually fairly sure that if you try to connect using any other machine (not just the CLC), it is going to fail.
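
For context, the CC-side layout Etienne describes would correspond to something like the following in eucalyptus.conf; the variable names are real eucalyptus.conf settings, but the values here are assumptions inferred from the addresses above:

# /etc/eucalyptus/eucalyptus.conf on the CC (hypothetical values)
VNET_PUBINTERFACE="eth0"                    # eth0 itself carries 10.55.55.8/24
VNET_PUBLICIPS="10.55.55.100-10.55.55.199"  # suspected to overlap the 10.55.55.0/24 subnet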

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Dan,

Can you confirm/explain/enhance Etienne's suspicions?

:-Dustin

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Hmm, in my setup here, I have a 255.255.255.0 network, 10.1.1.X.

My CC is on 10.1.1.72, and my PUBLIC_IPS are 10.1.1.100-10.1.1.199. This is working for me.

Revision history for this message
Thierry Carrez (ttx) wrote :

About comment 12, I suspect this is because those are the CLC config files, not the CC ones.

Getting eucalyptus.[local.]conf for the CC would give some additional insight on VNET_PUBLICIPS and the network configuration.
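
As a quick way to pull that information, something like the following on the CC would show the relevant values (a sketch; the standard UEC config paths are assumed):

# On the CC (marula), dump the VNET_* settings from both config files.
grep '^VNET_' /etc/eucalyptus/eucalyptus.conf /etc/eucalyptus/eucalyptus.local.conf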

Revision history for this message
Mathias Gug (mathiaz) wrote :

The issue was that the uec-testing-scripts were not starting the instances in the correct group: euca-run-instances wasn't using the -g option, so instances were started in the default group. If the default group had already been updated to authorize SSH traffic, the test would succeed.

I've pushed a fix to my branch: lp:~mathiaz/+junk/uec-testing-scripts/.
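
For illustration, the corrected invocation would pass the group explicitly, along the lines of the sketch below (-g and -k are real euca-run-instances options; the group name and image ID are hypothetical, and the key name is taken from the run above):

# Start the instance in the group the test actually authorized,
# instead of falling back to "default".
euca-run-instances -k uectest-k0 -g uectest-group emi-xxxxxxxx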

Changed in eucalyptus (Ubuntu Lucid):
status: Confirmed → Fix Released
Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Good news, but can we clarify the addressing part nonetheless? I.e., can VNET_PUBLICIPS and the IP of VNET_PUBINTERFACE on the CC be on the same subnet?

Revision history for this message
Mathias Gug (mathiaz) wrote : Re: [Bug 559230] Re: multi-machine topology, cannot reach an instance from the CLC

On Thu, Apr 15, 2010 at 05:50:35PM -0000, Etienne Goyer wrote:
> Good news, but can we clarify the addressing part nonetheless? Ie, can
> VNET_PUBLICIPS and the IP of VNET_PUBINTERFACE on the CC be on the same
> subnet?
>

Yes they can.

--
Mathias Gug
Ubuntu Developer http://www.ubuntu.com
