Cobbler did not refresh list of nodes after environment deletion

Bug #1460166 reported by Egor Kotko
Affects              Status        Importance  Assigned to               Milestone
Fuel for OpenStack   Fix Released  Medium      Vladimir Sharshov
6.1.x                Won't Fix     Medium      Fuel Python (Deprecated)
7.0.x                Fix Released  Medium      Vladimir Sharshov
8.0.x                Fix Released  Medium      Vladimir Sharshov

Bug Description

{"build_id": "2015-05-27_17-33-07", "build_number": "474", "release_versions": {"2014.2.2-6.1": {"VERSION": {"build_id": "2015-05-27_17-33-07", "build_number": "474", "api": "1.0", "fuel-library_sha": "05b59b7c9279222de734295535d86f53dd3d4225", "nailgun_sha": "ac8668cc06368fe22330e293c9ce8655d46846bd", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce", "astute_sha": "5d570ae5e03909182db8e284fbe6e4468c0a4e3e", "fuel-ostf_sha": "4cd2fef040ae9e7645a6b17a7cb44d3cd8fbe0be", "release": "6.1", "fuelmain_sha": "6b5712a7197672d588801a1816f56f321cbceebd"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "05b59b7c9279222de734295535d86f53dd3d4225", "nailgun_sha": "ac8668cc06368fe22330e293c9ce8655d46846bd", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce", "astute_sha": "5d570ae5e03909182db8e284fbe6e4468c0a4e3e", "fuel-ostf_sha": "4cd2fef040ae9e7645a6b17a7cb44d3cd8fbe0be", "release": "6.1", "fuelmain_sha": "6b5712a7197672d588801a1816f56f321cbceebd"}

1) Set up env: CentOS, 3 Controllers, 2 Computes, 2 Cephs
2) Back up master (dockerctl backup)
3) Reset env
4) Restore master (dockerctl restore)
5) Delete env via UI
6) Create new env via CLI, start deploy: Neutron VLAN, 3 Controllers, 2 Computes, 2 Cephs

Actual result:
<class 'cobbler.cexceptions.CX'>:'MAC address duplicated: 64:0c:2a:51:d2:93'
http://paste.openstack.org/show/245256/
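
One quick check when this error shows up is to ask cobbler which system already owns the offending MAC. Whether find accepts a --mac filter depends on the cobbler version shipped on the master, so treat the flag as an assumption and fall back to grepping cobbler system report output if it does not work:

# run on the Fuel master; --mac as a find filter is assumed to be supported by this cobbler version
dockerctl shell cobbler cobbler system find --mac=64:0c:2a:51:d2:93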

Alex Schultz (alex-schultz) wrote :

I'm poking through the logs, and it looks like the deployment of the new environment was started before the remove-cluster processing, which would result in the duplicate MAC address since node-1 (from the previous environment) was still in the system after the restore. Additionally, the remove_nodes message did not contain any nodes, which is why the node wasn't removed when the environment was deleted.

from astute.log:
2015-05-29T13:12:42 info: [626] Processing RPC call 'remove_nodes'
2015-05-29T13:12:42 debug: [626] 9c2e3a56-31e3-4325-ae3a-746bd03667f2 Node list is empty
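
A quick way to confirm the same symptom on another master is to grep the astute log for that empty node list; the container name and the log path inside it are assumptions here and may differ between releases:

# run on the Fuel master; adjust the log path if your release stores astute logs elsewhere
dockerctl shell astute grep "Node list is empty" /var/log/astute/astute.log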

Changed in fuel:
status: New → Confirmed
Alex Schultz (alex-schultz) wrote :

Looking at the nailgun logs, the deletion happened after a network verify, not a deployment start.

2015-05-29 11:29:06.748 DEBUG [7f735200c740] (logger) Request PUT /api/clusters/1/reset from 10.109.5.2:33112 {}
2015-05-29 11:29:06.753 DEBUG [7f735200c740] (logger) Request DELETE /api/tasks/51 from 10.109.5.2:33459
2015-05-29 11:29:06.796 DEBUG [7f735200c740] (logger) Response code '204 No Content' for DELETE /api/tasks/51 from 10.109.5.2:33459
2015-05-29 11:29:07.042 DEBUG [7f735200c740] (logger) Response code '202 Accepted' for PUT /api/clusters/1/reset from 10.109.5.2:33112
...
2015-05-29 12:44:25.739 DEBUG [7fed87c98740] (logger) Response code '202 Accepted' for PUT /api/clusters/1/network_configuration/neutron/verify from 10.109.5.2:34650
...
2015-05-29 13:12:41.972 DEBUG [7fed87c98740] (logger) Request DELETE /api/clusters/1 from 10.109.5.2:35059
2015-05-29 13:12:42.167 DEBUG [7fed87c98740] (logger) Response code '202 Accepted' for DELETE /api/clusters/1 from 10.109.5.2:35059

Also, following the nailgun logs, after the environment was deleted the node that was checking in (the one that causes the duplicate MAC error) did a POST to /api/nodes/ to register itself, whereas previously it had only been doing updates to /api/nodes/agent/. It got a 400 from /api/nodes/agent/:
2015-05-29 13:13:04.176 DEBUG [7fed87c98740] (logger) Response code '400 Bad Request' for PUT /api/nodes/agent/ from 10.109.5.5:52363
It then called in to /api/nodes/:
2015-05-29 13:13:04.424 DEBUG [7fed87c98740] (logger) Response code '201 Created' for POST /api/nodes/ from 10.109.5.5:52364
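
The check-in flow the logs show can be sketched with curl roughly as below. The port, the minimal JSON payload (the real nailgun agent sends full hardware facts) and the absence of auth headers are all assumptions made for illustration; the point is only the PUT-then-POST fallback:

# normal check-in: update an already-registered node; returns 400 once the node record is gone
curl -s -o /dev/null -w "%{http_code}\n" -X PUT -H "Content-Type: application/json" \
    -d '{"mac": "64:0c:2a:51:d2:93"}' http://10.109.5.2:8000/api/nodes/agent/
# fallback: re-register the node from scratch; this is the POST that creates it under a new ID
curl -s -o /dev/null -w "%{http_code}\n" -X POST -H "Content-Type: application/json" \
    -d '{"mac": "64:0c:2a:51:d2:93"}' http://10.109.5.2:8000/api/nodes/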

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Astute Team (fuel-astute)
Alex Schultz (alex-schultz) wrote :

I was able to reproduce it on my 2nd try. You have to wait until the nodes come back up after being reset before removing the environment. When you delete the environment, it appears that the original definitions of the nodes (node-{1,2,3,4,5}) are removed and they get re-registered as the next set (node-{6,7,8,9,10}).

Here is my cobbler list and my fuel nodes after doing the steps from the original report:
http://paste.openstack.org/show/245702/

I only used 2 nodes, so node-1 and node-2 turned into node-6 and node-7.
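
For reference, the paste above was probably produced with something like the following on the Fuel master (exact output naturally depends on the environment):

dockerctl shell cobbler cobbler system list
fuel node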

summary: - Cobbler did not refresh list of nodes after enironment deletion
+ Cobbler did not refresh list of nodes after environment deletion
Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Vladimir Sharshov (vsharshov) wrote :

Alex Schultz (alex-schultz) is right: there is no info about the nodes in the case of cluster deletion.

Process message from worker queue: "{\"args\": {\"engine\": {\"url\": \"http://10.109.5.2:80/cobbler_api\", \"username\": \"cobbler\", \"password\": \"SQbm9agw\", \"master_ip\": \"10.109.5.2\"}, \"nodes\": [], \"task_uuid\": \"9c2e3a56-31e3-4325-ae3a-746bd03667f2\", \"check_ceph\": false}, \"respond_to\": \"remove_cluster_resp\", \"method\": \"remove_nodes\", \"api_version\": \"1.0\"}"
2015-05-29T13:12:42 debug: [626] Got message with payload "{\"args\": {\"engine\": {\"url\": \"http://10.109.5.2:80/cobbler_api\", \"username\": \"cobbler\", \"password\": \"SQbm9agw\", \"master_ip\": \"10.109.5.2\"}, \"nodes\": [], \"task_uuid\": \"9c2e3a56-31e3-4325-ae3a-746bd03667f2\", \"check_ceph\": false}, \"respond_to\": \"remove_cluster_resp\", \"method\": \"remove_nodes\", \"api_version\": \"1.0\"}"
2015-05-29T13:12:42 debug: [626] Dispatching message: {"args"=>{"engine"=>{"url"=>"http://10.109.5.2:80/cobbler_api", "username"=>"cobbler", "password"=>"SQbm9agw", "master_ip"=>"10.109.5.2"}, "nodes"=>[], "task_uuid"=>"9c2e3a56-31e3-4325-ae3a-746bd03667f2", "check_ceph"=>false}, "respond_to"=>"remove_cluster_resp", "method"=>"remove_nodes", "api_version"=>"1.0"}

Alex Schultz (alex-schultz) wrote :

I'm lowering the severity of this bug as it can be corrected by running the following to remove the old node from cobbler:

dockerctl shell cobbler cobbler system remove --name=node-#

In this command, node-# is the old name of the node. For example:

dockerctl shell cobbler cobbler system remove --name=node-1

Other helpful troubleshooting commands for cobbler:

# show all the systems currently registered in cobbler
dockerctl shell cobbler cobbler system list

# show information for a registered system, using a name from the system list
dockerctl shell cobbler cobbler system report --name=node-5
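
If several stale entries are left behind, a loop along these lines cleans them up in one pass; node-6 and node-7 below are placeholders for the systems you want to keep, so double-check the list against the fuel node output before removing anything:

# remove every cobbler system except the ones that are still valid (placeholders below)
for name in $(dockerctl shell cobbler cobbler system list); do
    case "$name" in
        node-6|node-7) ;;    # still valid, keep
        *) dockerctl shell cobbler cobbler system remove --name="$name" ;;
    esac
done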

Changed in fuel:
status: Confirmed → Triaged
importance: High → Medium
Vladimir Sharshov (vsharshov) wrote :

Why we do not delete these nodes from Cobbler: because we restore the DB and Cobbler as part of the master restore process.
Yes, we show the nodes as discovered instead of ready, but the cluster status is still 'production' after restoring.

According to this comment https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/test/integration/test_deletion_task.py#L128-L129 we do not perform any action for nodes in the discover state.

I think we should mark this scenario as unexpected, because we restore all previous information from the backup and cannot guarantee stability for any changes and tasks made after the backup and before the restore.

Vladimir Sharshov (vsharshov) wrote :

I suggest marking it as medium and moving it to 7.0. We can come up with many scenarios where a backup with old data will affect clusters.

I also suggest adding a note to the documentation that we can guarantee backup/restore only with up-to-date data: do something with the cluster, then back it up.

Changed in fuel:
milestone: 6.1 → 7.0
assignee: Vladimir Sharshov (vsharshov) → Fuel Python Team (fuel-python)
tags: added: docs feature module-nailgun
tags: added: release-notes
Vladimir Sharshov (vsharshov) wrote :

Updated information after a recheck on a local machine: this scenario now works on both 7.0 and 8.0 environments because of these changes:

* https://bugs.launchpad.net/fuel/+bug/1491725;
* https://bugs.launchpad.net/fuel/+bug/1494446;
* https://bugs.launchpad.net/fuel/+bug/1455610;
* https://bugs.launchpad.net/fuel/+bug/1486157.

I have updated the status info and closed this as Fix Committed, because we now support this scenario.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/233743

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/233743
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=7a8687a150b785b6f8799ba1d108613b60acbbee
Submitter: Jenkins
Branch: master

commit 7a8687a150b785b6f8799ba1d108613b60acbbee
Author: Alexander Kurenyshev <email address hidden>
Date: Thu Oct 8 18:25:06 2015 +0300

    Add test for cobbler refresh nodes
    This patch adds a check for the related bug to the system tests

    Test steps:
    1. Create env with 1Controller, 1Compute, 1Ceph
    2. Start provisioning
    3. Backup master
    4. Reset env
    5. Restore master
    6. Delete env
    7. Create new env via CLI with the same configuration
    8. Start provisioning via CLI

    Change-Id: If01f320f54df21505663540cffa663e6a7a432f1
    Closes-Bug: 1503243
    Related-Bug: 1460166

Alexander Kurenyshev (akurenyshev) wrote :

The manual test has passed. An automated test has also been written, and it passed on CI: http://jenkins-product.srt.mirantis.net:8080/view/custom_iso/job/8.0.custom_system_test/168/console
So I am moving this bug to Fix Released.

Alexander Kurenyshev (akurenyshev) wrote :

I've also run this system test against the 7.0 RC4 ISO build and it passed.
Console output for this run is here [1].

[1] https://paste.mirantis.net/show/1278/

So I am moving this bug to Fix Released for 7.0 as well.

Dmitry Pyzhov (dpyzhov)
tags: added: area-python
tags: added: 8.0 release-notes-done
removed: release-notes