Ubuntu
eucalyptus package

Second euca-run-instances request in same security group causes eucalyptus to remove network associated with security group

Bug #564355 reported by Piotr T Zbiegiel on 2010-04-16

This bug affects 5 people

Affects	Status	Importance	Assigned to
Eucalyptus	Fix Released	Undecided	Unassigned
1.6.2	Won't Fix	Undecided	Unassigned
eucalyptus (CentOS)	New	Undecided	Unassigned
eucalyptus (Ubuntu)	Invalid	High	C de-Avillez
Lucid	Won't Fix	High	Unassigned
Maverick	Invalid	High	C de-Avillez

Bug Description

We are running eucalyptus 1.6.2-0ubuntu27 on lucid beta1 in MANAGED-NOVLAN mode. I will retest with ubuntu30 as soon as is feasible, but since I see no mention of this issue or a fix in the changelog, I wanted to get the information into your hands.

Eucalyptus has trouble allocating additional VMs to existing security groups in some cases. I tried several tests and saw very similar results. Eucalyptus allows you to request VMs in a given security group. Once all the VMs are running, an additional euca-run-instances request for that security group will fail, and in some cases the network associated with that security group will be removed from iptables (even if there are running VMs within that security group). The network that was freed up can be re-allocated to another security group, but new VMs requested in that security group fail with the same "failed to add host" message.

---------------------------------------------------
A typical cycle looks like this (command-line interspersed with snippets of cc.log):

$ euca-run-instances -n 250 -g default…

[Thu Apr 15 14:14:51 2010][001325][EUCAINFO ] StartNetwork(): called
[Thu Apr 15 14:14:51 2010][001324][EUCAINFO ] ConfigureNetwork(): called
[Thu Apr 15 14:14:51 2010][001324][EUCAINFO ] vnetTableRule(): applying iptables rule: -A user-default -s 0.0.0.0/0 -d 10.0.8.0/24 -p tcp --dport 22:22 -j ACCEPT
[Thu Apr 15 14:14:51 2010][001327][EUCAINFO ] RunInstances(): called

#….Proceeds to run 250 instances successfully…..

$ euca-run-instances -n 1 -g default….

[Thu Apr 15 14:29:46 2010][001376][EUCAINFO ] StartNetwork(): called
[Thu Apr 15 14:29:46 2010][001368][EUCAINFO ] ConfigureNetwork(): called
[Thu Apr 15 14:29:46 2010][001368][EUCAINFO ] vnetTableRule(): applying iptables rule: -A user-default -s 0.0.0.0/0 -d 10.0.8.0/24 -p tcp --dport 22:22 -j ACCEPT
[Thu Apr 15 14:29:46 2010][001328][EUCAINFO ] RunInstances(): called
[Thu Apr 15 14:29:46 2010][001328][EUCAERROR ] vnetAddHost(): failed to add host d0:0d:3B:E6:07:11 on vlan 10
[Thu Apr 15 14:29:46 2010][001328][EUCAERROR ] RunInstances(): could not find/initialize any free network address, failing doRunInstances()

#…..After 15 minutes instance goes to terminated and TerminateInstance() is called many times (once per NC?)…….

[Thu Apr 15 14:39:51 2010][005458][EUCAERROR ] ERROR: TerminateInstance() could not be invoked (check NC host, port, and credentials)
[Thu Apr 15 14:39:51 2010][001326][EUCAINFO ] TerminateInstances(): calling terminate instance (i-3BE60711) on (192.168.1.2)
[Thu Apr 15 14:39:51 2010][005459][EUCAERROR ] ERROR: TerminateInstance() could not be invoked (check NC host, port, and credentials)
[Thu Apr 15 14:39:51 2010][001326][EUCAINFO ] TerminateInstances(): calling terminate instance (i-3BE60711) on (192.168.1.3)
[Thu Apr 15 14:39:51 2010][005460][EUCAERROR ] ERROR: TerminateInstance() could not be invoked (check NC host, port, and credentials)
[Thu Apr 15 14:39:51 2010][001326][EUCAINFO ] TerminateInstances(): calling terminate instance (i-3BE60711) on (192.168.1.4)
[Thu Apr 15 14:39:51 2010][005461][EUCAERROR ] ERROR: TerminateInstance() could not be invoked (check NC host, port, and credentials)

#……It then removes the network allocated for the user's default security group even though there are 250 running VMs!!!……

[Thu Apr 15 14:40:00 2010][001328][EUCAINFO ] StopNetwork(): called

#iptables shows that the chain user-default has disappeared!
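
For anyone who wants to check their own cc.log for the same sequence, here is a minimal Python sketch that flags a vnetAddHost() failure followed by a StopNetwork() call. It assumes the line layout seen in the excerpts above; the function name find_suspect_teardowns is made up for illustration and is not part of Eucalyptus. Pairing is by log order only, since the StopNetwork() line shown here carries no vlan or group name, so treat any hit as a hint rather than proof.

```python
import re

# Line layout assumed from the cc.log excerpts in this report:
#   [timestamp][pid][LEVEL ] message
LINE_RE = re.compile(r"^\[([^\]]+)\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")
ADD_HOST_FAIL_RE = re.compile(
    r"vnetAddHost\(\): failed to add host ([0-9A-Fa-f:]{17}) on vlan (\d+)"
)


def find_suspect_teardowns(lines):
    """Pair each vnetAddHost() failure with the next StopNetwork() call.

    Pairing is by log order only (an assumption): the StopNetwork() line
    shown in the report has no vlan or group name to match on.
    """
    suspects = []
    pending = None  # most recent unpaired vnetAddHost() failure
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # wrapped or unrelated line
        timestamp, _pid, level, message = m.groups()
        fail = ADD_HOST_FAIL_RE.search(message)
        if level == "EUCAERROR" and fail:
            pending = {
                "mac": fail.group(1),
                "vlan": int(fail.group(2)),
                "failed_at": timestamp,
            }
        elif pending and message.startswith("StopNetwork(): called"):
            suspects.append(dict(pending, stopped_at=timestamp))
            pending = None
    return suspects
```

Usage would be something like find_suspect_teardowns(open("cc.log")), with the path adjusted to wherever your cluster controller writes its log.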

---------------------------------------------------
I tried many different combinations of numbers of nodes, etc.
(ADDRSPERNET is 256)

250 + 1 additional (the 1 additional failed, network was removed and VMs are inaccessible)
100 + 1 additional (the 1 additional failed, network was removed and VMs are inaccessible)
20 + 20 additional (the 20 additional failed, network was removed and VMs are inaccessible)

I did have some success adding to existing security groups by 10 or 20 nodes at a time. One security group grew to 80 nodes before I received the "failed to add host" messages. It seemed I was more successful when I was making requests rapidly (waiting only a few minutes between requests) rather than waiting for all the nodes to allocate in a given reservation. I am at a loss as to the exact cause, because some security groups are allowed to expand while others are cut off from receiving additional IPs well before they reach ADDRSPERNET.

Scott Moser (smoser) on 2010-04-19

Changed in eucalyptus (Ubuntu):
importance:	Undecided → High

Piotr T Zbiegiel (pzbiegiel) wrote on 2010-04-19:

I was able to repeat this behavior with ADDRSPERNET set to 128. The system seems more prone to this behavior when a user makes requests for large numbers of VMs in a security group and then attempts to add more. Not sure if this bug manifests based on the size of requests or how many IPs are already allocated in a given security group.

Thierry Carrez (ttx) on 2010-04-21

Changed in eucalyptus (Ubuntu Lucid):
assignee:	nobody → Dustin Kirkland (kirkland)

Dustin Kirkland  (kirkland) wrote on 2010-04-22:

Thierry says that he'll try to reproduce it.

Changed in eucalyptus (Ubuntu Lucid):
assignee:	Dustin Kirkland (kirkland) → Thierry Carrez (ttx)
milestone:	none → lucid-updates

Thierry Carrez (ttx) wrote on 2010-04-23:

Won't have time to reproduce today or next week (away from cloud setup), so I'll put it back in the pool for anyone to pick. Not pre-release material anyway, potential SRU target.

Changed in eucalyptus (Ubuntu Lucid):
assignee:	Thierry Carrez (ttx) → nobody

Thierry Carrez (ttx) on 2010-07-20

Changed in eucalyptus (Ubuntu):
milestone:	lucid-updates → none
Changed in eucalyptus (Ubuntu Lucid):
milestone:	lucid-updates → none

Aimon Bustardo (aimonb) wrote on 2010-08-10:

This also affects the 1.6.2 CentOS release. When ADDRSPERNET is set to 256, it causes the same behavior, with this error:

[Wed Aug 11 08:22:12 2010][005109][EUCADEBUG ] RunInstances(): params: userId=admin, emiId=emi-54A01BEF, kernelId=eki-5A861EB7, ramdiskId=UNSET, emiURL=http://192.168.1.11:8773/services/Walrus/x86_64-CentOS-5.4/managed-public-nil-nil-x86_64-CentOS-5.4-v1.2-2.0.manifest.xml, kernelURL=http://192.168.1.11:8773/services/Walrus/Kernels-x86_64/nil-public-nil-nil-x86_64-vmlinuz-2.6.18-xenU-ec2-v1.2-1.0.manifest.xml, ramdiskURL=UNSET, instIdsLen=1, netNamesLen=1, macAddrsLen=1, networkIndexListLen=1, minCount=1, maxCount=1, ownerId=admin, reservationId=r-43C4087C, keyName=, vlan=10, userData=, launchIndex=0, targetNode=UNSET
[Wed Aug 11 08:22:12 2010][005109][EUCADEBUG ] RunInstances(): running instance i-3816057B with emiId emi-54A01BEF...
[Wed Aug 11 08:22:12 2010][005109][EUCAERROR ] vnetAddHost(): failed to add host d0:0d:38:16:05:7B on vlan 10
[Wed Aug 11 08:22:12 2010][005109][EUCADEBUG ] RunInstances(): assigning MAC/IP: d0:0d:38:16:05:7B/0.0.0.0/0.0.0.0/5
[Wed Aug 11 08:22:12 2010][005109][EUCAERROR ] RunInstances(): could not find/initialize any free network address, failing doRunInstances()
[Wed Aug 11 08:22:12 2010][005109][EUCADEBUG ] RunInstances(): done
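
A side note that may help when matching log lines to instances: in both log excerpts in this report, the MAC in the vnetAddHost() error is "d0:0d" followed by the eight hex digits of the instance ID (i-3816057B gives d0:0d:38:16:05:7B, and i-3BE60711 gives d0:0d:3B:E6:07:11). The Python sketch below encodes that observation. The helper name mac_from_instance_id is hypothetical, and the mapping is inferred from these two examples only, not taken from Eucalyptus documentation.

```python
import re

# Instance IDs in this report look like "i-" plus eight hex digits.
INSTANCE_ID_RE = re.compile(r"i-([0-9A-Fa-f]{8})")


def mac_from_instance_id(instance_id, prefix="d0:0d"):
    """Derive the MAC seen in cc.log from an instance ID.

    Assumption: the "d0:0d" prefix and the octet order are inferred from
    the two ID/MAC pairs visible in this bug report.
    """
    m = INSTANCE_ID_RE.fullmatch(instance_id)
    if not m:
        raise ValueError(f"unexpected instance ID format: {instance_id!r}")
    hex_part = m.group(1)
    # Split the eight hex digits into four two-digit octets.
    octets = [hex_part[i:i + 2] for i in range(0, 8, 2)]
    return ":".join([prefix] + octets)


print(mac_from_instance_id("i-3816057B"))  # d0:0d:38:16:05:7B
```

With this, a failed MAC in cc.log can be searched for by instance ID (or the other way round) without reading the full RunInstances() params line.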

Aimon Bustardo (aimonb) wrote on 2010-08-11:

Actually, this also shows up when it is set to 64 or 128.

C de-Avillez (hggdh2) wrote on 2010-08-25:

@all: could we have logs from all Eucalyptus components showing the issue? Please attach them here. Thank you

Piotr T Zbiegiel (pzbiegiel) wrote on 2010-08-25:

Not sure about logs from all Eucalyptus components. This problem seems to be centered in the cluster controller code. I think the telling log lines I've seen after the second "euca-run-instances" command are:

[Thu Apr 15 14:29:46 2010][001328][EUCAINFO ] RunInstances(): called
[Thu Apr 15 14:29:46 2010][001328][EUCAERROR ] vnetAddHost(): failed to add host d0:0d:3B:E6:07:11 on vlan 10
[Thu Apr 15 14:29:46 2010][001328][EUCAERROR ] RunInstances(): could not find/initialize any free network address, failing doRunInstances()

Once the cluster controller fails to issue network addresses for the new instances it doesn't bother to farm them out to the node controllers. Those instances are never started on any of the NCs.

It almost seems like the cluster controller forgets about the available network addresses on a given network and won't allocate addresses for new instances. The most distressing thing is (and this doesn't happen every time) the network associated with a given security group is deallocated by the cluster controller. Its rule chain is removed from iptables and I've even seen other users get issued the same slice of network addresses for their new security groups. All this while instances in the old security group are still in a running state.

I can confirm Aimon's comment. We have seen this behavior with ADDRSPERNET set to 256, 128, and 64.

Piotr T Zbiegiel (pzbiegiel) wrote on 2010-08-25:

I wanted to add that we have since upgraded to 1.6.2-0ubuntu30.3 and still witness this behavior regularly.

Thierry Carrez (ttx) wrote on 2010-08-27:

RC bugs need to be assigned to someone -- assigning to Carlos for verification (or tracking Eucalyptus verification)

Changed in eucalyptus (Ubuntu Maverick):
assignee:	nobody → C de-Avillez (hggdh2)

Dmitrii Zagorodnov (dmitrii) wrote on 2010-08-28:

#10

If someone has logs from all components they would be helpful, actually. (Although it may *seem* like a CC problem, it *may* originate in the CLC.) Also, if it is easy for someone to check if the problem has been solved (as I suspect it has) in 2.0.0, currently in Maverick, that would help a lot. Thanks!

Thierry Carrez (ttx) on 2010-09-02

tags:	added: server-mrs

Dmitrii Zagorodnov (dmitrii) wrote on 2010-09-02:

#11

I am not able to recreate this problem with 2.0.0-lp code (revno 1342), so I am closing the bug. Marking "Fix Committed" even though we haven't confirmed it. It may well be a genuine problem in 1.6.2/Lucid, though.

Specifically, I repeatedly ran 24 instances followed by another 24 instances in the same (non-default) group, with VNET_ADDRSPERNET=128 and VNET_MODE="MANAGED-NOVLAN", using Eucalyptus built from source. All instances started.

Changed in eucalyptus:
status:	New → Fix Committed

Dmitrii Zagorodnov (dmitrii) wrote on 2010-09-02:

#12

I meant revno 1236, sorry.

C de-Avillez (hggdh2) wrote on 2010-09-07:

#13

Marking Incomplete for the Lucid/Eucalyptus task -- we still need the logs, as Dmitrii and I requested earlier on; marking Triaged for the Maverick/Eucalyptus task: I have been unable to reproduce it on Maverick, but will keep on trying.

Note that this may well be a volume-related issue: I cannot reproduce Piotr's results, but I also do not have a cloud with a few hundred nodes to test on.

Changed in eucalyptus (Ubuntu Lucid):
status:	New → Incomplete
Changed in eucalyptus (Ubuntu Maverick):
status:	New → Triaged

Thierry Carrez (ttx) wrote on 2010-09-09:

#14

We were not able to reproduce this on maverick / 2.0. Please set it back to Triaged if anyone can.

Changed in eucalyptus (Ubuntu Maverick):
status:	Triaged → Invalid

Aimon Bustardo (aimonb) wrote on 2010-10-29:

#15

Hi, here is the log from the CLC that corresponds to the CC log you have seen above (see attached). An interesting note about the attached log: there is only one successful VM in the secgroup. All others fail with the attached and above errors. When I create another secgroup, the new one behaves normally and I can launch VMs.

Aimon

Aimon Bustardo (aimonb) wrote on 2010-10-29:

#16

clc-debug.log (54.5 KiB, text/plain)

Forgot the log.

graziano obertelli (graziano.obertelli) wrote on 2011-12-04:

#17

I'm not able to reproduce it with 2.0.3. Aimon, if you still see this issue, please open a new bug.

no longer affects:	eucalyptus/eucalyptus-devel
Changed in eucalyptus:
status:	Fix Committed → Fix Released

Rolf Leggewie (r0lf) wrote on 2015-06-17:

#18

Lucid has reached the end of its life and is no longer receiving any updates. Marking the Lucid task for this ticket as "Won't Fix".

Changed in eucalyptus (Ubuntu Lucid):
status:	Incomplete → Won't Fix
