multi-machine topology, cannot reach an instance from the CLC

Bug #559230 reported by C de-Avillez
Affects                    Status        Importance  Assigned to  Milestone
eucalyptus (Ubuntu)        Fix Released  Medium      Mathias Gug
Lucid                      Fix Released  Medium      Mathias Gug

Bug Description

Release of Ubuntu: Lucid beta2, UEC images of 20100407.1
Package Version:
Expected Results:
Actual Results:

This is probably a setup error. I am running as follows:

   lucid-amd64-topo2:
     hosts:
       cempedak: CLC
       mabolo: Walrus
       marula: CC
       santol: SC
       sapodilla: NC
       soncoya: NC

Output from the runs is saved (I am setting up a bzr branch for them right now).
When SSH-ing into an instance, SSH usually fails with a timeout (the failure rate was about 94%).

I am running Mathias' uec-testing-scripts.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: eucalyptus-cloud 1.6.2-0ubuntu27
ProcVersionSignature: Ubuntu 2.6.32-19.28-server 2.6.32.10+drm33.1
Uname: Linux 2.6.32-19-server x86_64
Architecture: amd64
Date: Fri Apr 9 09:07:07 2010
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: eucalyptus

Revision history for this message
Thierry Carrez (ttx) wrote :

According to http://iso.qa.ubuntu.com/qatracker/result/3918/496 this works on a manual setup, so maybe it's a test artifact from the testrig... Would be good to confirm though.

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Marking High, as this needs to be resolved before Lucid's release. I'm going to target it at the 10.04 GA.

Marking Incomplete, as we don't yet have enough information to debug this.

I ran instances without a problem in the same setup (manually installed). I'm hoping there's just a problem with the test automation....

Stay tuned, this is being actively worked. Assigning to me.

Changed in eucalyptus (Ubuntu):
importance: Undecided → High
status: New → Incomplete
assignee: nobody → Dustin Kirkland (kirkland)
milestone: none → ubuntu-10.04
Revision history for this message
Thierry Carrez (ttx) wrote :

I'm setting up a 5-machine topology right now to confirm it's a testrig automation artifact, in which case I'll lower the priority.

You mentioned running with several different users. Could it be that the "euca-authorize" call was not issued for all users (but just for the first one), resulting in TCP port 22 connection failures for most of them?
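
For reference, the per-user authorization Thierry is asking about would look something like this; a minimal sketch, assuming each user's credentials (eucarc) are sourced first and that instances run in the "default" group:

# Authorize inbound SSH (TCP port 22) from anywhere for this user's "default" group.
# This must be repeated for every set of user credentials the tests run under.
euca-authorize default -P tcp -p 22 -s 0.0.0.0/0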

Revision history for this message
Thierry Carrez (ttx) wrote :

I definitely can't reproduce on my local topology2 install. Lowering priority until we can reproduce outside the test automation (which still needs to be fixed/stabilized for Lucid).

Changed in eucalyptus (Ubuntu Lucid):
importance: High → Medium
milestone: ubuntu-10.04 → none
Revision history for this message
C de-Avillez (hggdh2) wrote :

I tested it again this evening, with Dustin monitoring. We again used lucid-amd64-topo2, and based the installs on the daily server/UEC images (releases.ubuntu.com is not accessible from tamarind, so I could not use Beta2).

Installation was uneventful.

I then ran the config_single.yaml test. No problems starting instances, but the script (and I, manually) still could not SSH into them; the attempts failed with a timeout.

I then ran, just for the sake of it (I do not know what is, or is not, blocked by the firewall(s)), a traceroute against one of the instances from cempedak. It reached marula (the CC), and then showed only stars.

I then logged in to Marula, and SSH-ed to an instance I had started manually. I *could* reach it (but failed, correctly, on the public key -- I had not added a new key for this run, and the ones used by uec_test.py had already been revoked).

This is the log of the IRC chat between Dustin and myself:

2010-04-13 18:25:32 hggdh kirkland: nodes registered, running a single-instance test now
2010-04-13 18:33:02 hggdh kirkland: test running, log is being written to ~/uec-testing-scripts/resutls/single*
2010-04-13 18:33:09 hggdh kirkland: on cempedak
2010-04-13 18:33:20 kirkland hggdh: cool, and you can ssh in?
2010-04-13 18:35:08 hggdh kirkland: negative
2010-04-13 18:35:19 kirkland hggdh: cannot ssh in
2010-04-13 18:35:25 hggdh kirkland: ssh fails on timeout
2010-04-13 18:35:31 hggdh really sounds like routing
2010-04-13 18:36:18 kirkland hggdh: interesting
2010-04-13 18:36:25 kirkland hggdh: okay, put the log somewhere for me to check out
2010-04-13 18:38:27 hggdh kirkland: k. I just ran one instance by hand, and then tried to ssh into it -- fails with a timeout
2010-04-13 18:39:25 kirkland hggdh: okay, that's easy to reproduce
2010-04-13 18:39:27 kirkland hggdh: log?
2010-04-13 18:42:29 hggdh kirkland: people.c.c/~cerdea/single_test.log.2010-04-13_193218
2010-04-13 18:46:15 kirkland hggdh: rsync -aP people.canonical.com:~cerdea/single_test.log.2010-04-13_193218 .
2010-04-13 18:46:20 kirkland hggdh: file not found
2010-04-13 18:47:04 kirkland hggdh: found it, public_html
2010-04-13 18:47:27 hggdh heh. one wants it on public_html, another on the root ;-)
2010-04-13 18:49:35 kirkland hggdh: ls -alF users/admin/uectest-k0.priv
2010-04-13 18:50:07 kirkland hggdh: and cat that file, make sure it matches -----BEGIN RSA PRIVATE KEY-----
2010-04-13 18:50:33 kirkland hggdh: is that instance still running?
2010-04-13 18:50:43 kirkland hggdh: can you telnet to its port 22 ?
2010-04-13 18:51:03 hggdh kirkland: yes, the instance is still running
2010-04-13 18:52:00 hggdh kirkland: the priv key seems kosher
2010-04-13 18:52:27 kirkland hggdh: and telnet ?
2010-04-13 18:53:50 hggdh kirkland: timeout. Also, a traceroute (FWIW) reaches marula (the CC) and stops there
2010-04-13 18:54:07 kirkland hggdh: oh, interesting
2010-04-13 18:54:22 kirkland hggdh: that's got to be it
2010-04-13 18:54:25 hggdh kirkland: let me try to ssh from marula
2010-04-13 18:54:38 kirkland hggdh: yeah
2010-04-13 18:54:43 kirkland hggdh: scp the priv key over
2010-04-13 18:54:47 kirkland hggdh: and try from there
2010-04-13 18:55:15 hggdh kirkland: first test -- reachability -- succes...


Revision history for this message
C de-Avillez (hggdh2) wrote :

traceroute output:

ubuntu@cempedak:~/uec-testing-scripts$ sudo traceroute -n -p 22 -P tcp 10.55.55.100
[sudo] password for ubuntu:
traceroute to 10.55.55.100 (10.55.55.100), 30 hops max, 60 byte packets
 1 10.55.55.100 12.337 ms 0.062 ms 0.057 ms
 2 * * *
 3 * * *
 4 *^C
ubuntu@cempedak:~/uec-testing-scripts$

SSH try from Cempedak (the CLC):

ubuntu@cempedak:~/uec-testing-scripts$ ssh -vv 10.55.55.100
OpenSSH_5.3p1 Debian-3ubuntu3, OpenSSL 0.9.8k 25 Mar 2009
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to 10.55.55.100 [10.55.55.100] port 22.
debug1: connect to address 10.55.55.100 port 22: Connection timed out
ssh: connect to host 10.55.55.100 port 22: Connection timed out
ubuntu@cempedak:~/uec-testing-scripts$

SSH try from Marula (the CC):

ubuntu@marula:~$ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ./uectest-k0.priv ubuntu@10.55.55.100
Warning: Permanently added '10.55.55.100' (RSA) to the list of known hosts.
Permission denied (publickey). <--- expected, since I had not added a key (and was using the one from the uec_test run)
ubuntu@marula:~$

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Re-assigning to Mathias. I'm fairly certain that this is an issue particular either to the lab setup or to the test automation, as I haven't been able to reproduce it here, and Carlos seems to have triaged it down to a networking issue (traceroute failing from the CLC to the guest).

Carlos has reproduced the issue again and provided the logs that Mathias requested, so I moved it from incomplete to confirmed.

Changed in eucalyptus (Ubuntu Lucid):
assignee: Dustin Kirkland (kirkland) → Mathias Gug (mathiaz)
status: Incomplete → Confirmed
Revision history for this message
Thierry Carrez (ttx) wrote :

Some more information about the network configuration on the CC and the NC could help to compare our working systems with the failing one:

ip address show
sysctl -n net.ipv4.ip_forward

Could it be linked to the metadata service vs. VLAN issue that mathiaz reported?

Revision history for this message
C de-Avillez (hggdh2) wrote :

IP addresses and forward status on Marula:

ubuntu@marula:~$ ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:b3:25:ee:68 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/32 scope link eth0:metadata
    inet 10.55.55.8/24 brd 10.55.55.255 scope global eth0
    inet 172.19.1.1/27 brd 172.19.1.31 scope global eth0:priv
    inet 10.55.55.100/32 scope global eth0:pub
    inet6 fe80::225:b3ff:fe25:ee68/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:25:ee:6a brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:25:ee:6c brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:25:ee:6e brd ff:ff:ff:ff:ff:ff
ubuntu@marula:~$ sysctl -n net.ipv4.ip_forward
1
ubuntu@marula:~$

Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Just a question: why is VNET_PUBLICIPS commented out in eucalyptus.conf?

Revision history for this message
C de-Avillez (hggdh2) wrote :

from a NC (Sapodilla):

ubuntu@sapodilla:~$ ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:b3:1f:d6:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::225:b3ff:fe1f:d6ee/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:1f:d6:f0 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:1f:d6:f2 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:b3:1f:d6:f4 brd ff:ff:ff:ff:ff:ff
6: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:25:b3:1f:d6:ee brd ff:ff:ff:ff:ff:ff
    inet 10.55.55.3/24 brd 10.55.55.255 scope global br0
    inet6 fe80::225:b3ff:fe1f:d6ee/64 scope link
       valid_lft forever preferred_lft forever
7: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether ee:47:b3:d7:54:ca brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
    inet6 fe80::ec47:b3ff:fed7:54ca/64 scope link
       valid_lft forever preferred_lft forever
ubuntu@sapodilla:~$ sysctl -n net.ipv4.ip_forward
1
ubuntu@sapodilla:~$

Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Looking at the comment history, I think it might be that the VNET_PUBLICIPS range and the address of VNET_PUBINTERFACE are in the same subnet, which cannot possibly work. This is a known limitation related to netfilter; ask Dan about it.

But without knowing what the value of VNET_PUBLICIPS on the CC is, I cannot be sure.

Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Just to explain a bit further: the reason I suspect VNET_PUBLICIPS to be wrong is that eth0 on the CC has two IPs in the 10.55.55.0/24 range (see comment #11). I presume 10.55.55.8 is the CC's actual IP, and comment #7 indicates that 10.55.55.100 is the IP of the instance. In this case, VNET_PUBLICIPS overlaps with 10.55.55.0/24, which would screw up SNAT/DNAT on the CC. I am actually fairly sure that if you try to connect using any other machine (not just the CLC), it is going to fail.
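
For context, the CC-side layout Etienne describes would correspond to something like the following in eucalyptus.conf; the variable names are real eucalyptus.conf settings, but the values here are assumptions inferred from the addresses above:

# /etc/eucalyptus/eucalyptus.conf on the CC (hypothetical values)
VNET_PUBINTERFACE="eth0"                    # eth0 itself carries 10.55.55.8/24
VNET_PUBLICIPS="10.55.55.100-10.55.55.199"  # suspected to overlap the 10.55.55.0/24 subnet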

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Dan,

Can you confirm/explain/enhance Etienne's suspicions?

:-Dustin

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Hmm, in my setup here, I have a 255.255.255.0 network, 10.1.1.X.

My CC is on 10.1.1.72, and my PUBLIC_IPS are 10.1.1.100-10.1.1.199. This is working for me.

Revision history for this message
Thierry Carrez (ttx) wrote :

About comment 12, I suspect this is because those are the CLC config files, not the CC ones.

Getting eucalyptus.[local.]conf for the CC would give some additional insight on VNET_PUBLICIPS and the network configuration.
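
As a quick way to pull that information, something like the following on the CC would show the relevant values (a sketch; the standard UEC config paths are assumed):

# On the CC (marula), dump the VNET_* settings from both config files.
grep '^VNET_' /etc/eucalyptus/eucalyptus.conf /etc/eucalyptus/eucalyptus.local.conf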

Revision history for this message
Mathias Gug (mathiaz) wrote :

The issue was that the uec-testing-scripts were not starting the instances in the correct group: euca-run-instances wasn't using the -g option, so instances were started in the default group. If the default group had already been updated to authorize SSH traffic, the test would succeed.

I've pushed a fix to my branch: lp:~mathiaz/+junk/uec-testing-scripts/.
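
For illustration, the corrected invocation would pass the group explicitly, along the lines of the sketch below (-g and -k are real euca-run-instances options; the group name and image ID are hypothetical, and the key name is taken from the run above):

# Start the instance in the group the test actually authorized,
# instead of falling back to "default".
euca-run-instances -k uectest-k0 -g uectest-group emi-xxxxxxxx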

Changed in eucalyptus (Ubuntu Lucid):
status: Confirmed → Fix Released
Revision history for this message
Etienne Goyer (etienne-goyer-outlands) wrote :

Good news, but can we clarify the addressing part nonetheless? I.e., can VNET_PUBLICIPS and the IP of VNET_PUBINTERFACE on the CC be on the same subnet?

Revision history for this message
Mathias Gug (mathiaz) wrote : Re: [Bug 559230] Re: multi-machine topology, cannot reach an instance from the CLC

On Thu, Apr 15, 2010 at 05:50:35PM -0000, Etienne Goyer wrote:
> Good news, but can we clarify the addressing part nonetheless? Ie, can
> VNET_PUBLICIPS and the IP of VNET_PUBINTERFACE on the CC be on the same
> subnet?
>

Yes they can.

--
Mathias Gug
Ubuntu Developer http://www.ubuntu.com
