Murano picks cidr for subnet which is already used

Bug #1502437 reported by Serg Melikyan
This bug affects 1 person
Affects   Status         Importance   Assigned to          Milestone
Murano    Fix Released   High         Alexander Tivelkov
Future    Confirmed      Medium       Unassigned
Kilo      Won't Fix      High         Alexander Tivelkov
Liberty   Fix Released   High         Alexander Tivelkov
Mitaka    Fix Released   High         Alexander Tivelkov

Bug Description

1. Started Rally tests to deploy environments with the AD application.
2. Murano started the deployment and tried to create a net, subnet, security group and router interface for the Primary Controller:
<134>Oct 2 08:48:16 node-8 murano-engine Pushing: {'heat_template_version': '2013-05-23', 'description': 'This stack was generated by Murano for environment rally_fcEJiJRuEf (ID: f9fdfa18d9a047739de023f0ce5c695f)', 'resources': {'network-f1a314acd5f041298bba1f3f92511873': {'type': 'OS::Neutron::Net', 'properties': {'name': 'rally_fcEJiJRuEf-network-f1a314acd5f041298bba1f3f92511873'}}, 'subnet-f1a314acd5f041298bba1f3f92511873': {'type': 'OS::Neutron::Subnet', 'properties': {'ip_version': 4, 'cidr': u'10.0.20.0/24', 'dns_nameservers': [u'8.8.8.8'], 'network': {'get_resource': 'network-f1a314acd5f041298bba1f3f92511873'}}}, u'MuranoSecurityGroup-rally_fcEJiJRuEf': {'type': 'OS::Neutron::SecurityGroup', 'properties': {'rules': [{'port_range_min': None, 'port_range_max': None, 'protocol': 'icmp', 'remote_ip_prefix': '0.0.0.0/0'}, {'protocol': u'tcp', 'port_range_max': 25, 'port_range_min': 25, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 53, 'port_range_min': 53, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 53, 'port_range_min': 53, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 88, 'port_range_min': 88, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 88, 'port_range_min': 88, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 123, 'port_range_min': 123, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 135, 'port_range_min': 135, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 137, 'port_range_min': 137, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 138, 'port_range_min': 138, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 445, 'port_range_min': 445, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 445, 'port_range_min': 445, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 464, 
'port_range_min': 464, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 464, 'port_range_min': 464, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 389, 'port_range_min': 389, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 389, 'port_range_min': 389, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 636, 'port_range_min': 636, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 3268, 'port_range_min': 3268, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 3269, 'port_range_min': 3269, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 5722, 'port_range_min': 5722, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 9389, 'port_range_min': 9389, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 65535, 'port_range_min': 49152, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 65535, 'port_range_min': 49152, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 3389, 'port_range_min': 3389, 'remote_ip_prefix': '0.0.0.0/0'}], 'description': 'Composite security group of Murano environment rally_fcEJiJRuEf'}}, 'ri-f1a314acd5f041298bba1f3f92511873': {'type': 'OS::Neutron::RouterInterface', 'properties': {'router_id': u'7849909b-29c5-4fa3-b0b6-00682b447c9a', 'subnet': {'get_resource': 'subnet-f1a314acd5f041298bba1f3f92511873'}}}}}

3. The deployment failed because the specified CIDR is already in use (the router interface request is rejected as overlapping):
<135>Oct 2 08:49:30 node-8 murano-engine Publisher.send: sending message results to {'oslo.message': '{"_unique_id": "3d0cfaea8c9848a98bdbd6a6a7cb30b4", "_msg_id": "110afb112d4c407cbae30e3f351b3bd5", "args": {"environment_id": "f9fdfa18d9a047739de023f0ce5c695f", "result": {"action": {"isException": true, "result": {"message": "[exceptions.EnvironmentError]: Unexpected stack state UPDATE_FAILED: BadRequest: resources.ri-f1a314acd5f041298bba1f3f92511873: Bad router request: Cidr 10.0.20.0/24 of subnet b03ce2f8-e6e3-4a6d-a332-d317918d784c overlaps with cidr 10.0.20.0/24 of subnet 3b45a44e-378e-4bb1-ac42-a869f6b3ca6d", "details": "exceptions.EnvironmentError: Unexpected stack state UPDATE_FAILED: BadRequest: resources.ri-f1a314acd5f041298bba1f3f92511873: Bad router request: Cidr 10.0.20.0/24 of subnet b03ce2f8-e6e3-4a6d-a332-d317918d784c overlaps with cidr 10.0.20.0/24 of subnet 3b45a44e-378e-4bb1-ac42-a869f6b3ca6d\\nTraceback (most recent call last):\\n File \\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano/Classes/Environment.yaml\\", line 74:9 in method deploy of class io.murano.Environment\\n $.applications.pselect($.deploy())\\n File \\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano.apps.activeDirectory.ActiveDirectory/Classes/ActiveDirectory.yaml\\", line 56:9 in method deploy of class io.murano.apps.activeDirectory.ActiveDirectory\\n $.primaryController.deploy()\\n File \\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano.apps.activeDirectory.ActiveDirectory/Classes/PrimaryController.yaml\\", line 38:9 in method deploy of class io.murano.apps.activeDirectory.PrimaryController\\n $.super($.deploy())\\n File \\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano.apps.activeDirectory.ActiveDirectory/Classes/Controller.yaml\\", line 28:11 in method deploy of class io.murano.apps.activeDirectory.Controller\\n $.host.deploy()\\n File 
\\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano.apps.activeDirectory.ActiveDirectory/Classes/Host.yaml\\", line 38:9 in method deploy of class io.murano.apps.activeDirectory.Host\\n $.super($.deploy())\\n File \\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano/Classes/reso:q File \\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano/Classes/resources/Instance.yaml\\", line 159:13 in method ensureNetworksDeployed of class io.murano.resources.Instance\\n $.environment.defaultNetworks.environment.deploy()\\n File \\"/tmp/murano-packages-cache/26038d1b-40b6-4865-acb9-33617bd30e43/io.murano/Classes/resources/NeutronNetwork.yaml\\", line 67:13 in method deploy of class io.murano.resources.NeutronNetwork\\n $._environment.stack.push()\\n File \\"/usr/lib/python2.7/dist-packages/murano/engine/system/heat_stack.py\\", line 193 in method push\\n lambda status: status == \'UPDATE_COMPLETE\')\\n File \\"/usr/lib/python2.7/dist-packages/murano/engine/system/heat_stack.py\\", line 143 in method _wait_state\\n \\"Unexpected stack state {0}{1}\\".format(status, reason))"}}, "model": {"ObjectsCopy": {"defaultNetworks": {"environment": {"useDefaultDns": true, "name": "rally_fcEJiJRuEf-network", "dnsNameserver": "8.8.8.8", "externalRouterId": "7849909b-29c5-4fa3-b0b6-00682b447c9a", "autogenerateSubnet": true, "subnetCidr": "10.0.20.0/24", "autoUplink": true, "?": {"type": "io.murano.resources.NeutronNetwork", "id": "f1a314acd5f041298bba1f3f92511873"}}, "flat": null}, "name": "rally_fcEJiJRuEf", "?": {"type": "io.murano.Environment", "id": "f9fdfa18d9a047739de023f0ce5c695f"}, "applications": [{"name": "my.domain-MqzLkKgU6V09yPdG", "adminPassword": "P@ssw0rd", "adminAccountName": "Administrator", "primaryController": {"host": {"availabilityZone": "nova", "name": "murano-1", "adminPassword": "asr5znW0tq!", "assignFloatingIp": false, "securityGroupName": null, "floatingIpAddress": null, "keyname": null, "?": 
{"type": "io.murano.apps.activeDirectory.Host", "id": "73d3827f-b15b-463e-8012-de3d15f54076"}, "ipAddresses": [], "adminAccountName": "Administrator", "flavor": "m1.medium", "image": "Murano_windows_image", "networks": {"useFlatNetwork": false, "primaryNetwork": null, "useEnvironmentNetwork": true, "customNetworks": []}, "sharedIps": []}, "recoveryPassword": "P@ssw0rd", "?": {"type": "io.murano.apps.activeDirectory.PrimaryController", "id": "e67b587a-64a7-4a77-9849-345c916b1f62"}, "dnsIp": null}, "secondaryControllers": [], "?": {"type": "io.murano.apps.activeDirectory.ActiveDirectory", "id": "2ac09a3b-aae2-4530-9e35-2bca474feb0b"}}]}, "Attributes": [["f9fdfa18d9a047739de023f0ce5c695f", "io.murano.Environment", "generatedEnvironmentName", "sgllpif9ephqb61"]], "Objects": {"defaultNetworks": {"environment": {"useDefaultDns": true, "name": "rally_fcEJiJRuEf-network", "dnsNameserver": "8.8.8.8", "externalRouterId": "7849909b-29c5-4fa3-b0b6-00682b447c9a", "autogenerateSubnet": true, "subnetCidr": "10.0.20.0/24", "autoUplink": true, "?": {"type": "io.murano.resources.NeutronNetwork", "_actions": {}, "id": "f1a314acd5f041298bba1f3f92511873"}}, "flat": null}, "name": "rally_fcEJiJRuEf", "?": {"type": "io.murano.Environment", "_actions": {"f9fdfa18d9a047739de023f0ce5c695f_deploy": {"enabled": true, "name": "deploy"}}, "id": "f9fdfa18d9a047739de023f0ce5c695f"}, "applications": [{"name": "my.domain-MqzLkKgU6V09yPdG", "adminPassword": "P@ssw0rd", "adminAccountName": "Administrator", "primaryController": {"host": {"availabilityZone": "nova", "name": "murano-1", "adminPassword": "asr5znW0tq!", "assignFloatingIp": false, "securityGroupName": null, "floatingIpAddress": null, "keyname": null, "?": {"type": "io.murano.apps.activeDirectory.Host", "_actions": {}, "id": "73d3827f-b15b-463e-8012-de3d15f54076"}, "ipAddresses": [], "adminAccountName": "Administrator", "flavor": "m1.medium", "image": "Murano_windows_image", "networks": {"useFlatNetwork": false, "primaryNetwork": null, 
"useEnvironmentNetwork": true, "customNetworks": []}, "sharedIps": []}, "recoveryPassword": "P@ssw0rd", "?": {"type": "io.murano.apps.activeDirectory.PrimaryController", "_actions": {}, "id": "e67b587a-64a7-4a77-9849-345c916b1f62"}, "dnsIp": null}, "secondaryControllers": [], "?": {"type": "io.murano.apps.activeDirectory.ActiveDirectory", "_actions": {}, "id": "2ac09a3b-aae2-4530-9e35-2bca474feb0b"}}]}, "SystemData": {}}}}, "method": "process_result", "_reply_q": "reply_6e0f8d8d2cd443c8bc6160347e3bb569"}', 'oslo.version': '2.0'} with routing key murano

Deployment:
stable/kilo VLAN, ceph-all, 3 Controllers, 15 Computes

Alexander Tivelkov (ativelkov) wrote :

I cannot say for sure (I need to spend more time with the logs), but it seems like this may be expected behavior: by default, we have a limit of 20 networks per tenant. If more networks are required, this may be changed in the config file.

Viktoria Efimova (vefimova) wrote :

With 31 to 36 concurrent deployments, Rally distributes the environments among 10 tenants; the maximum number of environments created in one tenant was only 7.

The bug was reproduced in a test with 32 concurrent deployments. The env with the error was deployed in a tenant with 5 other environments.

As seen from the logs, two environments deployed by different murano-engines each tried to create an interface on murano-default-router using a net whose subnet has the same CIDR, which leads to an error for one of them:

============Murano-engine sent template for first env:
/var/log/murano-all.log.4.gz:<134>Oct 7 03:42:59 node-18 murano-engine Pushing: {'heat_template_version': '2013-05-23', 'description': 'This stack was generated by Murano for environment rally_JkRpl7IA4t (ID: f4830774694b47219a30bdee37bab950)', 'resources': {u'MuranoSecurityGroup-rally_JkRpl7IA4t': {'type': 'OS::Neutron::SecurityGroup', 'properties': {'rules': [{'port_range_min': None, 'port_range_max': None, 'protocol': 'icmp', 'remote_ip_prefix': '0.0.0.0/0'}, {'protocol': u'tcp', 'port_range_max': 25, 'port_range_min': 25, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 53, 'port_range_min': 53, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 53, 'port_range_min': 53, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 88, 'port_range_min': 88, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 88, 'port_range_min': 88, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 123, 'port_range_min': 123, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 135, 'port_range_min': 135, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 137, 'port_range_min': 137, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 138, 'port_range_min': 138, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 445, 'port_range_min': 445, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 445, 'port_range_min': 445, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 464, 'port_range_min': 464, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 464, 'port_range_min': 464, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 389, 'port_range_min': 389, 'remote_mode': 'remote_group_id'}, {'protocol': u'udp', 'port_range_max': 389, 'port_range_min': 389, 'remote_mode': 'remote_group_id'}, {'protocol': 
u'tcp', 'port_range_max': 636, 'port_range_min': 636, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 3268, 'port_range_min': 3268, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 3269, 'port_range_min': 3269, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 5722, 'port_range_min': 5722, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 9389, 'port_range_min': 9389, 'remote_mode': 'remote_group_id'}, {'protocol': u'tcp', 'port_range_max': 65535, 'port_range_min': 49152, 'remote_mode': 'remote_group_id'}, {'proto...

Viktoria Efimova (vefimova) wrote :

=================== Repeating the previous comment with paste links for readability: ===================

With 31 to 36 concurrent deployments, Rally distributes the environments among 10 tenants; the maximum number of environments created in one tenant was only 7.

The bug was reproduced in a test with 32 concurrent deployments. The env with the error was deployed in a tenant with 5 other environments.

As seen from the logs, two environments deployed by different murano-engines each tried to create an interface on murano-default-router using a net whose subnet has the same CIDR, which leads to an error for one of them:

1. Murano-engine on node-18 sent template for first env:
http://paste.openstack.org/raw/475770/

2. Murano-engine on node-1 sent template for second env:
http://paste.openstack.org/raw/475771/

3. Second env gets an error:
http://paste.openstack.org/raw/475772/

So it looks like a concurrency issue: the second engine probably doesn't see this CIDR in Neutron's response to the subnet list request, because the subnet with this CIDR hasn't been created yet at the time the request is processed.
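
The suspected race can be shown in miniature. The sketch below is not Murano code: `FakeNeutron`, `pick_cidr` and the candidate list are hypothetical names, and the "first free CIDR" rule is a simplification of whatever the engine really does. It only illustrates the gap between listing subnets and creating one.

```python
# Illustrative only: FakeNeutron and pick_cidr are hypothetical stand-ins,
# not Murano or Neutron APIs.


class FakeNeutron:
    """Minimal in-memory stand-in for the subnet list/create calls."""

    def __init__(self):
        self.subnets = []

    def list_cidrs(self):
        # Return a snapshot, the way a "list subnets" response would.
        return list(self.subnets)

    def create_subnet(self, cidr):
        self.subnets.append(cidr)


def pick_cidr(taken, candidates):
    """Return the first candidate that is absent from the 'taken' snapshot."""
    for cidr in candidates:
        if cidr not in taken:
            return cidr
    raise RuntimeError("no free CIDR")


candidates = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]
neutron = FakeNeutron()

# Both engines list subnets before either of them has created its subnet.
seen_by_engine_a = neutron.list_cidrs()
seen_by_engine_b = neutron.list_cidrs()

cidr_a = pick_cidr(seen_by_engine_a, candidates)
cidr_b = pick_cidr(seen_by_engine_b, candidates)

# Each engine now creates a subnet with the CIDR it believes is free.
neutron.create_subnet(cidr_a)
neutron.create_subnet(cidr_b)

print(cidr_a, cidr_b, cidr_a == cidr_b)
```

Both engines end up with the same CIDR because neither snapshot contains the other engine's subnet yet; only the later router interface request can notice the overlap.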

BTW, here is an example of the Heat template for an env: http://paste.openstack.org/show/475759/

tags: added: engine
tags: added: kilo-backport-potential
tags: added: liberty-backport-potential
Alexander Tivelkov (ativelkov) wrote :

OK, so some findings on this one.

So, this is indeed a bug, but there is no easy way to fix it properly. The problem is a real collision that happens when several concurrent deployments each pick a CIDR using a pseudo-random algorithm. At the default setting the probability of a conflict is ideally ~3% (in practice it is worse, because the algorithm is not truly random), but it grows significantly with the number of concurrent deployments (at 10 concurrent deployments it is about 80%).
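
These figures are consistent with a simple "birthday problem" estimate. The sketch below assumes that every deployment picks uniformly and independently from a fixed pool of CIDRs, and that the pool holds 32 CIDRs at the default setting; the pool size is an assumption chosen because 1/32 matches the ~3% figure, and `collision_probability` is a hypothetical helper, not Murano code.

```python
def collision_probability(concurrent, pool_size):
    """Chance that at least two of `concurrent` uniform picks coincide."""
    p_all_distinct = 1.0
    for k in range(concurrent):
        # The k-th pick must avoid the k CIDRs that are already chosen.
        p_all_distinct *= (pool_size - k) / pool_size
    return 1.0 - p_all_distinct


# Assumed pool sizes: 32 at the default setting, 256 after an eightfold increase.
for pool_size in (32, 256):
    for concurrent in (2, 10):
        p = collision_probability(concurrent, pool_size)
        print("pool=%d concurrent=%d p=%.3f" % (pool_size, concurrent, p))
```

Under these assumptions, two concurrent deployments collide with probability 1/32, and ten concurrent deployments collide roughly four times out of five, which is in line with the numbers quoted above.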

We have the following options to consider:

1. Increase the max_environments setting from 20 to 200 (or, actually, to anything from 128 to 255; it will have the same effect). This increases the number of available CIDRs eightfold, so the probability of a collision will be much lower. But such a solution cannot be considered a real fix: it does not eliminate the problem, it just makes it less likely to happen.

2. Pre-allocate CIDRs in murano-api instead of murano-engine. The API can synchronise environment creation, as it has access to the database. This will be a real fix, i.e. it will prevent the conflicts from ever happening. Its cons: it requires us to store CIDRs as database columns (so that we can fetch them efficiently), and that is an abstraction leak from the MuranoPL Environment class into the DB model of murano-api, which is not very good from an overall architecture point of view. However, this is a good approach to take in Mitaka and probably to backport to stable/liberty as well.

3. Use some kind of distributed cache (Redis or similar) to synchronise the state between different engines. This will allow us to fix the issue right in place, without abstraction leakage. But this is a major architecture redesign (useful for lots of other good features) and is unlikely to be backported into the Liberty branch (and is also unlikely to happen in Mitaka, as we have quite a lengthy backlog already).

I propose to do option 1 for Liberty and plan to do option 2 for Mitaka, while keeping option 3 in mind.
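
To illustrate the second option, here is a minimal sketch of database-synchronised allocation. It is not the murano-api schema: the `cidr_allocation` table, its columns and the `allocate_cidr` helper are hypothetical, and an in-memory SQLite database stands in for the real one. The point is only that a uniqueness constraint lets the database reject a second claim on the same CIDR.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical allocation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cidr_allocation ("
        " router_id TEXT NOT NULL,"
        " cidr TEXT NOT NULL,"
        " environment_id TEXT NOT NULL,"
        " UNIQUE (router_id, cidr))"
    )
    return conn


def allocate_cidr(conn, router_id, environment_id, candidates):
    """Claim the first candidate CIDR that nobody holds on this router."""
    for cidr in candidates:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO cidr_allocation VALUES (?, ?, ?)",
                    (router_id, cidr, environment_id),
                )
            return cidr
        except sqlite3.IntegrityError:
            # Another environment already holds this CIDR; try the next one.
            continue
    raise RuntimeError("no free CIDR for router %s" % router_id)


conn = make_db()
candidates = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]
first = allocate_cidr(conn, "router-1", "env-a", candidates)
second = allocate_cidr(conn, "router-1", "env-b", candidates)
print(first, second)
```

Because the claim is an insert rather than a list-then-create sequence, two environments cannot both succeed for the same router and CIDR; the loser simply moves on to the next candidate.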

OpenStack Infra (hudson-openstack) wrote : Fix proposed to murano (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234362

Changed in murano:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to murano (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/234363

OpenStack Infra (hudson-openstack) wrote : Fix merged to murano (stable/liberty)

Reviewed: https://review.openstack.org/234363
Committed: https://git.openstack.org/cgit/openstack/murano/commit/?id=f120a953e4a03abbe3467582839e1c89c3e97e1c
Submitter: Jenkins
Branch: stable/liberty

commit f120a953e4a03abbe3467582839e1c89c3e97e1c
Author: Alexander Tivelkov <email address hidden>
Date: Tue Oct 13 20:31:05 2015 +0300

    Increased the number of environments per tenant

    The 'max_environments' setting of [networking] group in murano
    configuration file defines the maximum number of networks which may be
    created by murano for any given router, thus eventually limiting the
    number of environments to simultaneously co-exists within a tenant.

    The previous default (20) was very low, and it was causing CIDR
    conflicts even when the actual number of envs was not reaching the
    limit.

    This change increases the number of CIDRs allowed for environment
    networks, thus reducing the probability of CIDR conflict.

    NOTE: This change just reduces the risk of conflicts but does not
    eleminate it completely.

    Change-Id: Id913d17b8f7207afc9b1983287349a6d70a09edf
    Partial-bug: #1502437
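
To make the effect of the setting concrete, here is a rough sketch of how a larger 'max_environments' value could translate into a larger pool of candidate CIDRs. It assumes /24 subnets carved out of a 10.0.0.0 base (matching the 10.0.20.0/24 seen in the logs) and a pool size rounded up to the next power of two; both are assumptions rather than facts taken from the murano source, and `candidate_cidrs` is a hypothetical helper.

```python
import ipaddress
import math


def candidate_cidrs(max_environments, base="10.0.0.0", subnet_prefix=24):
    """List the subnets available for environment networks (assumed layout)."""
    # Number of address bits needed to tell the environments apart.
    env_bits = math.ceil(math.log2(max_environments))
    supernet = ipaddress.ip_network(
        "%s/%d" % (base, subnet_prefix - env_bits), strict=False
    )
    return [str(net) for net in supernet.subnets(new_prefix=subnet_prefix)]


for setting in (20, 200):
    pool = candidate_cidrs(setting)
    print(setting, len(pool), pool[0], pool[-1])
```

Under these assumptions the old default yields 32 candidate subnets and a value of 200 yields 256, so the chance that two environments land on the same subnet drops accordingly.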

tags: added: in-stable-liberty
OpenStack Infra (hudson-openstack) wrote : Fix merged to murano (master)

Reviewed: https://review.openstack.org/234362
Committed: https://git.openstack.org/cgit/openstack/murano/commit/?id=84bbe9c75820b3eb1cb1c07c001bd86346db08e0
Submitter: Jenkins
Branch: master

commit 84bbe9c75820b3eb1cb1c07c001bd86346db08e0
Author: Alexander Tivelkov <email address hidden>
Date: Tue Oct 13 20:31:05 2015 +0300

    Increased the number of environments per tenant

    The 'max_environments' setting of [networking] group in murano
    configuration file defines the maximum number of networks which may be
    created by murano for any given router, thus eventually limiting the
    number of environments to simultaneously co-exists within a tenant.

    The previous default (20) was very low, and it was causing CIDR
    conflicts even when the actual number of envs was not reaching the
    limit.

    This change increases the number of CIDRs allowed for environment
    networks, thus reducing the probability of CIDR conflict.

    NOTE: This change just reduces the risk of conflicts but does not
    eleminate it completely.

    Change-Id: Id913d17b8f7207afc9b1983287349a6d70a09edf
    Partial-bug: #1502437

Changed in murano:
milestone: mitaka-1 → mitaka-2
Alexander Tivelkov (ativelkov) wrote :

In the L and M releases this one is addressed by increasing the number of CIDRs in the generated range, which reduces the conflict probability.
In future releases (N series) this should be addressed more properly, by utilising distributed locks or by doing CIDR allocation within a Heat plugin.
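
The lock-based approach can be sketched as follows. A local `threading.Lock` stands in for a distributed lock here, and threads stand in for separate engines; the `Allocator` class is hypothetical and only shows that holding the lock across both the check and the claim removes the race.

```python
import threading


class Allocator:
    """Hands out CIDRs; the check and the claim happen under one lock."""

    def __init__(self, candidates):
        self._candidates = list(candidates)
        self._taken = set()
        # Stand-in for a distributed lock shared by all engines.
        self._lock = threading.Lock()

    def allocate(self):
        with self._lock:
            for cidr in self._candidates:
                if cidr not in self._taken:
                    self._taken.add(cidr)
                    return cidr
        raise RuntimeError("no free CIDR")


allocator = Allocator("10.0.%d.0/24" % i for i in range(32))
results = []
results_lock = threading.Lock()


def deploy():
    """Simulate one engine deploying one environment."""
    cidr = allocator.allocate()
    with results_lock:
        results.append(cidr)


threads = [threading.Thread(target=deploy) for _ in range(10)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(results), len(set(results)))
```

With real engines on different nodes the lock would have to live in a shared service, which is why this is described above as work for a later release.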
