Network verification stopped responding after deploying the cluster.
Contact me if you need an environment to investigate the problem.
Build # 187
Scenario:
1. Create new environment
2. Choose Neutron, VLAN
3. Choose Ceph for images
4. Choose Sahara
5. Choose Ceilometer
6. Add 1 controller+ceph
7. Add 1 compute+ceph
8. Add 1 cinder+ceph
9. Add 2 mongo
10. Change the disk configuration on both Mongo nodes: change the 'MongoDB' volume for vdc
11. Deploy the environment
12. Verify networks
Expected result:
At step 12, network verification succeeds.
Actual result:
Network verification does not respond.
In the UI, network verification has been shown as running for more than 30 minutes.
The task table shows the following:
[root@nailgun log]# fuel task
id | status  | name                    | cluster | progress | uuid
---|---------|-------------------------|---------|----------|-------------------------------------
36 | ready   | provision               | 2       | 100      | 5b0c8bc1-c9ad-47e4-af1c-05769b7887a4
38 | running | verify_networks         | 2       | 0        | 4f740d59-3961-411e-8adf-84cf8b90d831
37 | ready   | deployment              | 2       | 100      | 7ff4dc08-76ab-4a8c-bed6-e7d752c052b1
32 | ready   | deploy                  | 2       | 100      | 84368f8d-b370-4578-8d93-c8184181d25c
39 | running | check_dhcp              | 2       | 0        | 2082cd6a-2b82-4e92-9e10-04a9d19533b3
40 | running | check_repo_availability | 2       | 0        | fb173701-8384-4818-94ba-e0cd5c02cabc
41 | running | dump                    | None    | 0        | 69b4d207-7119-4298-a547-02cd90caeeb8
[root@nailgun log]#
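For anyone triaging a similar hang, the stalled tasks in the table above can be picked out mechanically. The following is a minimal sketch, not part of Fuel: the helper names `parse_task_table` and `stalled_tasks` are hypothetical, and it assumes only the pipe-separated layout shown in the `fuel task` output above.

```python
def parse_task_table(text):
    """Parse pipe-separated `fuel task` output into a list of dicts.

    Hypothetical helper: it relies only on the table layout pasted in
    this report, not on any Fuel API.
    """
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip shell prompts and blank lines
        cells = [cell.strip() for cell in line.split("|")]
        if all(set(cell) <= {"-"} for cell in cells):
            continue  # skip the ---|--- separator row
        if header is None:
            header = cells  # first table row holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def stalled_tasks(rows):
    """Return tasks that are still running with zero progress."""
    return [r for r in rows if r["status"] == "running" and r["progress"] == "0"]


SAMPLE = """\
[root@nailgun log]# fuel task
id | status | name | cluster | progress | uuid
---|---------|-------------------------|---------|----------|-------------------------------------
36 | ready | provision | 2 | 100 | 5b0c8bc1-c9ad-47e4-af1c-05769b7887a4
38 | running | verify_networks | 2 | 0 | 4f740d59-3961-411e-8adf-84cf8b90d831
39 | running | check_dhcp | 2 | 0 | 2082cd6a-2b82-4e92-9e10-04a9d19533b3
"""

if __name__ == "__main__":
    for task in stalled_tasks(parse_task_table(SAMPLE)):
        print(task["id"], task["name"])
```

Zero progress alone does not prove a task is hung; the time since the task was started (more than 30 minutes here) is what makes these entries suspicious.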
I looked at the environment; here is what I found.
Nailgun sent the network check message to Astute, but the Astute logs contain nothing about it. The message was found in the RabbitMQ naily queue: it had been delivered to an Astute worker, but the worker never acknowledged it, so RabbitMQ held the message and did not redeliver it to other workers.
So EventMachine received the message but got stuck either before trying to log it [1] or during the logging attempt itself.
We have probably seen a similar issue before, where logging simply hung [2].
After the worker that had received the message was killed, the message was requeued and picked up by another worker.
We had a snapshot of the environment; after reverting to it, Astute reconnected immediately and the message was redelivered.
This makes the issue harder to debug.
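One way to spot a message that was delivered but never acknowledged is to compare the ready and unacknowledged counts per queue, as reported by `rabbitmqctl list_queues name messages_ready messages_unacknowledged`. The sketch below only parses that tab-separated output; the function names are hypothetical, and the sample counts and the second queue name are made up for illustration, not taken from this environment.

```python
def parse_list_queues(output):
    """Map queue name -> (ready, unacknowledged) from rabbitmqctl output.

    Expects tab-separated lines of: name, messages_ready,
    messages_unacknowledged. Banner and header lines are skipped
    because their last two fields are not integers.
    """
    queues = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        name, ready, unacked = parts
        if not (ready.isdigit() and unacked.isdigit()):
            continue
        queues[name] = (int(ready), int(unacked))
    return queues


def held_by_consumer(queues):
    """Return names of queues with delivered but unacknowledged messages."""
    return sorted(name for name, (_, unacked) in queues.items() if unacked > 0)


# Illustrative sample only; the counts are assumptions, not captured data.
SAMPLE_OUTPUT = (
    "Listing queues ...\n"
    "naily\t0\t1\n"
    "example_queue\t0\t0\n"
    "...done.\n"
)

if __name__ == "__main__":
    print(held_by_consumer(parse_list_queues(SAMPLE_OUTPUT)))
```

A queue with zero ready messages and a non-zero unacknowledged count matches the situation described above: the broker considers the message to be in flight to a consumer and will requeue it only when that consumer's channel or connection closes.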
[1] https://github.com/stackforge/fuel-astute/blob/53c86cba593ddbac776ce5a3360240274c20738c/lib/astute/server/server.rb#L62
[2] https://github.com/stackforge/fuel-astute/commit/3ce8643c2d8447256561f0eafb71a258b6f74f17#diff-e58148f7ac9ffd88d4681162773da473