Cobbler did not refresh list of nodes after environment deletion

Bug #1460166 reported by Egor Kotko
Affects              Status        Importance  Assigned to               Milestone
Fuel for OpenStack   Fix Released  Medium      Vladimir Sharshov
6.1.x                Won't Fix     Medium      Fuel Python (Deprecated)
7.0.x                Fix Released  Medium      Vladimir Sharshov
8.0.x                Fix Released  Medium      Vladimir Sharshov

Bug Description

{"build_id": "2015-05-27_17-33-07", "build_number": "474", "release_versions": {"2014.2.2-6.1": {"VERSION": {"build_id": "2015-05-27_17-33-07", "build_number": "474", "api": "1.0", "fuel-library_sha": "05b59b7c9279222de734295535d86f53dd3d4225", "nailgun_sha": "ac8668cc06368fe22330e293c9ce8655d46846bd", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce", "astute_sha": "5d570ae5e03909182db8e284fbe6e4468c0a4e3e", "fuel-ostf_sha": "4cd2fef040ae9e7645a6b17a7cb44d3cd8fbe0be", "release": "6.1", "fuelmain_sha": "6b5712a7197672d588801a1816f56f321cbceebd"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "05b59b7c9279222de734295535d86f53dd3d4225", "nailgun_sha": "ac8668cc06368fe22330e293c9ce8655d46846bd", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce", "astute_sha": "5d570ae5e03909182db8e284fbe6e4468c0a4e3e", "fuel-ostf_sha": "4cd2fef040ae9e7645a6b17a7cb44d3cd8fbe0be", "release": "6.1", "fuelmain_sha": "6b5712a7197672d588801a1816f56f321cbceebd"}

1) Set up env: CentOS, 3 Controllers, 2 Computes, 2 Cephs
2) Back up master (dockerctl backup)
3) Reset env
4) Restore master (dockerctl restore)
5) Delete env via UI
6) Create new env via CLI, start deploy: Neutron VLAN, 3 Controllers, 2 Computes, 2 Cephs

Actual result:
<class 'cobbler.cexceptions.CX'>:'MAC address duplicated: 64:0c:2a:51:d2:93'
http://paste.openstack.org/show/245256/
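
One quick check when this error shows up is to ask cobbler which system already owns the offending MAC. Whether find accepts a --mac filter depends on the cobbler version shipped on the master, so treat the flag as an assumption and fall back to grepping cobbler system report output if it does not work:

# run on the Fuel master; --mac as a find filter is assumed to be supported by this cobbler version
dockerctl shell cobbler cobbler system find --mac=64:0c:2a:51:d2:93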

Alex Schultz (alex-schultz) wrote :

I'm poking through the logs, and it looks like the deployment of the new environment was started before the remove-cluster processing, which would result in the duplicate MAC address since node-1 (from the previous environment) was still in the system after the restore. Additionally, the remove_nodes message did not contain any nodes, which is why the node wasn't removed when the environment was deleted.

from astute.log:
2015-05-29T13:12:42 info: [626] Processing RPC call 'remove_nodes'
2015-05-29T13:12:42 debug: [626] 9c2e3a56-31e3-4325-ae3a-746bd03667f2 Node list is empty
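
A quick way to confirm the same symptom on another master is to grep the astute log for that empty node list; the container name and the log path inside it are assumptions here and may differ between releases:

# run on the Fuel master; adjust the log path if your release stores astute logs elsewhere
dockerctl shell astute grep "Node list is empty" /var/log/astute/astute.log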

Changed in fuel:
status: New → Confirmed
Alex Schultz (alex-schultz) wrote :

Looking at the nailgun logs, the deletion happened after a network verify, not a deployment start.

2015-05-29 11:29:06.748 DEBUG [7f735200c740] (logger) Request PUT /api/clusters/1/reset from 10.109.5.2:33112 {}
2015-05-29 11:29:06.753 DEBUG [7f735200c740] (logger) Request DELETE /api/tasks/51 from 10.109.5.2:33459
2015-05-29 11:29:06.796 DEBUG [7f735200c740] (logger) Response code '204 No Content' for DELETE /api/tasks/51 from 10.109.5.2:33459
2015-05-29 11:29:07.042 DEBUG [7f735200c740] (logger) Response code '202 Accepted' for PUT /api/clusters/1/reset from 10.109.5.2:33112
...
2015-05-29 12:44:25.739 DEBUG [7fed87c98740] (logger) Response code '202 Accepted' for PUT /api/clusters/1/network_configuration/neutron/verify from 10.109.5.2:34650
...
2015-05-29 13:12:41.972 DEBUG [7fed87c98740] (logger) Request DELETE /api/clusters/1 from 10.109.5.2:35059
2015-05-29 13:12:42.167 DEBUG [7fed87c98740] (logger) Response code '202 Accepted' for DELETE /api/clusters/1 from 10.109.5.2:35059

Also, following the nailgun logs, after the environment was deleted the node that was checking in (the one that causes the duplicate MAC error) did a POST to /api/nodes/ to register itself, whereas previously it had only been doing updates to /api/nodes/agent/. It got a 400 from /api/nodes/agent/:
2015-05-29 13:13:04.176 DEBUG [7fed87c98740] (logger) Response code '400 Bad Request' for PUT /api/nodes/agent/ from 10.109.5.5:52363
It then called in to /api/nodes/:
2015-05-29 13:13:04.424 DEBUG [7fed87c98740] (logger) Response code '201 Created' for POST /api/nodes/ from 10.109.5.5:52364
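
The check-in flow the logs show can be sketched with curl roughly as below. The port, the minimal JSON payload (the real nailgun agent sends full hardware facts) and the absence of auth headers are all assumptions made for illustration; the point is only the PUT-then-POST fallback:

# normal check-in: update an already-registered node; returns 400 once the node record is gone
curl -s -o /dev/null -w "%{http_code}\n" -X PUT -H "Content-Type: application/json" \
    -d '{"mac": "64:0c:2a:51:d2:93"}' http://10.109.5.2:8000/api/nodes/agent/
# fallback: re-register the node from scratch; this is the POST that creates it under a new ID
curl -s -o /dev/null -w "%{http_code}\n" -X POST -H "Content-Type: application/json" \
    -d '{"mac": "64:0c:2a:51:d2:93"}' http://10.109.5.2:8000/api/nodes/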

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Astute Team (fuel-astute)
Alex Schultz (alex-schultz) wrote :

I was able to reproduce it on my 2nd try. You have to wait until the nodes come back up after being reset before removing the environment. When you delete the environment, it appears that the original definitions of the nodes (node-{1,2,3,4,5}) are removed and they get re-registered as the next set (node-{6,7,8,9,10}).

Here is my cobbler list and my fuel nodes after doing the steps from the original report:
http://paste.openstack.org/show/245702/

I only used 2 nodes, so node-1 and node-2 turned into node-6 and node-7.
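
For reference, the paste above was probably produced with something like the following on the Fuel master (exact output naturally depends on the environment):

dockerctl shell cobbler cobbler system list
fuel node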

summary: - Cobbler did not refresh list of nodes after enironment deletion
+ Cobbler did not refresh list of nodes after environment deletion
Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Vladimir Sharshov (vsharshov) wrote :

Alex Schultz (alex-schultz) is right: there is no info about the nodes in the case of cluster deletion.

Process message from worker queue: "{\"args\": {\"engine\": {\"url\": \"http://10.109.5.2:80/cobbler_api\", \"username\": \"cobbler\", \"password\": \"SQbm9agw\", \"master_ip\": \"10.109.5.2\"}, \"nodes\": [], \"task_uuid\": \"9c2e3a56-31e3-4325-ae3a-746bd03667f2\", \"check_ceph\": false}, \"respond_to\": \"remove_cluster_resp\", \"method\": \"remove_nodes\", \"api_version\": \"1.0\"}"
2015-05-29T13:12:42 debug: [626] Got message with payload "{\"args\": {\"engine\": {\"url\": \"http://10.109.5.2:80/cobbler_api\", \"username\": \"cobbler\", \"password\": \"SQbm9agw\", \"master_ip\": \"10.109.5.2\"}, \"nodes\": [], \"task_uuid\": \"9c2e3a56-31e3-4325-ae3a-746bd03667f2\", \"check_ceph\": false}, \"respond_to\": \"remove_cluster_resp\", \"method\": \"remove_nodes\", \"api_version\": \"1.0\"}"
2015-05-29T13:12:42 debug: [626] Dispatching message: {"args"=>{"engine"=>{"url"=>"http://10.109.5.2:80/cobbler_api", "username"=>"cobbler", "password"=>"SQbm9agw", "master_ip"=>"10.109.5.2"}, "nodes"=>[], "task_uuid"=>"9c2e3a56-31e3-4325-ae3a-746bd03667f2", "check_ceph"=>false}, "respond_to"=>"remove_cluster_resp", "method"=>"remove_nodes", "api_version"=>"1.0"}

Alex Schultz (alex-schultz) wrote :

I'm lowering the severity of this bug as it can be corrected by running the following to remove the old node from cobbler:

dockerctl shell cobbler cobbler system remove --name=node-#

In this command, node-# is the old name of the node. For example:

dockerctl shell cobbler cobbler system remove --name=node-1

Other helpful troubleshooting commands for cobbler:

# show all the systems currently registered in cobbler
dockerctl shell cobbler cobbler system list

# show information for a registered system, using a name from the system list
dockerctl shell cobbler cobbler system report --name=node-5
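
If several stale entries are left behind, a loop along these lines cleans them up in one pass; node-6 and node-7 below are placeholders for the systems you want to keep, so double-check the list against the fuel node output before removing anything:

# remove every cobbler system except the ones that are still valid (placeholders below)
for name in $(dockerctl shell cobbler cobbler system list); do
    case "$name" in
        node-6|node-7) ;;    # still valid, keep
        *) dockerctl shell cobbler cobbler system remove --name="$name" ;;
    esac
done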

Changed in fuel:
status: Confirmed → Triaged
importance: High → Medium
Vladimir Sharshov (vsharshov) wrote :

Why we do not delete these nodes from Cobbler: because we restore the DB and Cobbler as part of the master restore process.
Yes, we show the nodes as discovered instead of ready, but the cluster status is still 'production' after restoring.

According to this comment https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/test/integration/test_deletion_task.py#L128-L129 we do not perform any action for nodes in the discover state.

I think we should mark this scenario as unexpected, because we restore all previous information from the backup and cannot guarantee stability for any changes and tasks made after the backup and before the restore.

Vladimir Sharshov (vsharshov) wrote :

I suggest marking it as medium and moving it to 7.0. We can come up with many scenarios where a backup with old data will affect clusters.

I also suggest adding a note to the documentation that we can guarantee backup/restore only with up-to-date data: do something with the cluster, then back it up.

Changed in fuel:
milestone: 6.1 → 7.0
assignee: Vladimir Sharshov (vsharshov) → Fuel Python Team (fuel-python)
tags: added: docs feature module-nailgun
tags: added: release-notes
Vladimir Sharshov (vsharshov) wrote :

Updated information after a recheck on a local machine: this scenario now works on both 7.0 and 8.0 environments because of these changes:

* https://bugs.launchpad.net/fuel/+bug/1491725;
* https://bugs.launchpad.net/fuel/+bug/1494446;
* https://bugs.launchpad.net/fuel/+bug/1455610;
* https://bugs.launchpad.net/fuel/+bug/1486157.

I have updated the status info and closed this as Fix Committed, because we now support this scenario.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/233743

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/233743
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=7a8687a150b785b6f8799ba1d108613b60acbbee
Submitter: Jenkins
Branch: master

commit 7a8687a150b785b6f8799ba1d108613b60acbbee
Author: Alexander Kurenyshev <email address hidden>
Date: Thu Oct 8 18:25:06 2015 +0300

    Add test for cobbler refresh nodes
    This patch adds a check for the related bug to the system tests

    Test steps:
    1. Create env with 1Controller, 1Compute, 1Ceph
    2. Start provisioning
    3. Backup master
    4. Reset env
    5. Restore master
    6. Delete env
    7. Create new env via CLI with the same configuration
    8. Start provisioning via CLI

    Change-Id: If01f320f54df21505663540cffa663e6a7a432f1
    Closes-Bug: 1503243
    Related-Bug: 1460166

Alexander Kurenyshev (akurenyshev) wrote :

The manual test has passed. An automated test has also been written, and it passed on CI: http://jenkins-product.srt.mirantis.net:8080/view/custom_iso/job/8.0.custom_system_test/168/console
So I am moving this bug to Fix Released.

Alexander Kurenyshev (akurenyshev) wrote :

I've also run this system test against the 7.0 RC4 ISO build and it passed.
Console output for this run is here [1].

[1] https://paste.mirantis.net/show/1278/

So I am moving this bug to Fix Released for 7.0 as well.

Dmitry Pyzhov (dpyzhov)
tags: added: area-python
tags: added: 8.0 release-notes-done
removed: release-notes