Exception: Agents failed to join

Bug #1779304 reported by Nahian Chowdhury
This bug affects 1 person
Affects: shaker
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am getting the following error; I have tried many times with some changes, but the poll operation is never received:

Exception: Agents failed to join: {'shaker_tderis_slave_0': 'lost', 'shaker_tderis_master_0': 'lost'}

------
Snippet of the debug output:

---
Deployed agents:

2018-06-29 07:13:07.015 3633 DEBUG keystoneauth.session [-] REQ: curl -g -i -X GET http://192.168.2.9:8004/v1/62f36ee8851c42edbaa42be8c828f36a/stacks/shaker_tderis/294edb07-3ad7-498b-911d-549dd53897ef/outputs/shaker_tderis_master_0_ip -H "Accept: application/json" -H "User-Agent: python-heatclient" -H "X-Auth-Token: {SHA1}9b80b66a24e89e2fd8cfec0ddee5c454dd35e733" _http_log_request /root/venv/local/lib/python2.7/site-packages/keystoneauth1/session.py:448
2018-06-29 07:13:07.765 3633 DEBUG keystoneauth.session [-] RESP: [200] Content-Length: 122 Content-Type: application/json Date: Fri, 29 Jun 2018 07:13:07 GMT X-Openstack-Request-Id: req-224074cd-1717-4543-95de-4f29756d0d95 _http_log_response /root/venv/local/lib/python2.7/site-packages/keystoneauth1/session.py:479
2018-06-29 07:13:07.766 3633 DEBUG keystoneauth.session [-] RESP BODY: {"output": {"output_value": "10.0.0.6", "output_key": "shaker_tderis_master_0_ip", "description": "No description given"}} _http_log_response /root/venv/local/lib/python2.7/site-packages/keystoneauth1/session.py:511
2018-06-29 07:13:07.766 3633 DEBUG keystoneauth.session [-] GET call to orchestration for http://192.168.2.9:8004/v1/62f36ee8851c42edbaa42be8c828f36a/stacks/shaker_tderis/294edb07-3ad7-498b-911d-549dd53897ef/outputs/shaker_tderis_master_0_ip used request id req-224074cd-1717-4543-95de-4f29756d0d95 request /root/venv/local/lib/python2.7/site-packages/keystoneauth1/session.py:844
2018-06-29 07:13:07.767 3633 DEBUG shaker.engine.server [-] Deployed agents: {'shaker_tderis_slave_0': {'node': u'kolla-compute1', 'zone': u'nova', 'availability_zone': u'nova:kolla-compute1', 'ip': u'10.0.0.10', 'master': {'node': u'kolla-compute2', 'zone': u'nova', 'availability_zone': u'nova:kolla-compute2', 'ip': u'10.0.0.6', 'mode': 'master', 'slave_id': 'shaker_tderis_slave_0', 'id': 'shaker_tderis_master_0'}, 'mode': 'slave', 'master_id': 'shaker_tderis_master_0', 'id': 'shaker_tderis_slave_0'}, 'shaker_tderis_master_0': {'node': u'kolla-compute2', 'slave': {'node': u'kolla-compute1', 'zone': u'nova', 'availability_zone': u'nova:kolla-compute1', 'ip': u'10.0.0.10', 'mode': 'slave', 'master_id': 'shaker_tderis_master_0', 'id': 'shaker_tderis_slave_0'}, 'zone': u'nova', 'availability_zone': u'nova:kolla-compute2', 'ip': u'10.0.0.6', 'mode': 'master', 'slave_id': 'shaker_tderis_slave_0', 'id': 'shaker_tderis_master_0'}} play_scenario /root/venv/local/lib/python2.7/site-packages/shaker/engine/server.py:182
2018-06-29 07:13:07.767 3633 INFO shaker.engine.quorum [-] Waiting for quorum of agents: set(['shaker_tderis_slave_0', 'shaker_tderis_master_0'])

--
The last lines before the ERROR:
-----

2018-06-29 07:22:58.553 3639 DEBUG shaker.agent.agent [-] Polling task: {'operation': 'poll', 'agent_id': '__heartbeat'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:40
2018-06-29 07:22:58.555 3633 DEBUG shaker.engine.messaging [-] Received request: {'operation': 'poll', 'agent_id': '__heartbeat'} __iter__ /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:51
2018-06-29 07:22:58.557 3633 DEBUG shaker.engine.messaging [-] Sent reply: {'operation': 'none'} reply_handler /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:55
2018-06-29 07:22:58.557 3639 DEBUG shaker.agent.agent [-] Received: {'operation': 'none'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:43
2018-06-29 07:23:08.568 3639 DEBUG shaker.agent.agent [-] Polling task: {'operation': 'poll', 'agent_id': '__heartbeat'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:40
2018-06-29 07:23:08.569 3633 DEBUG shaker.engine.messaging [-] Received request: {'operation': 'poll', 'agent_id': '__heartbeat'} __iter__ /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:51
2018-06-29 07:23:08.570 3633 DEBUG shaker.engine.messaging [-] Sent reply: {'operation': 'none'} reply_handler /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:55
2018-06-29 07:23:08.571 3639 DEBUG shaker.agent.agent [-] Received: {'operation': 'none'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:43
2018-06-29 07:23:18.582 3639 DEBUG shaker.agent.agent [-] Polling task: {'operation': 'poll', 'agent_id': '__heartbeat'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:40
2018-06-29 07:23:18.583 3633 DEBUG shaker.engine.messaging [-] Received request: {'operation': 'poll', 'agent_id': '__heartbeat'} __iter__ /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:51
2018-06-29 07:23:18.584 3633 DEBUG shaker.engine.messaging [-] Sent reply: {'operation': 'none'} reply_handler /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:55
2018-06-29 07:23:18.584 3639 DEBUG shaker.agent.agent [-] Received: {'operation': 'none'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:43
2018-06-29 07:23:28.594 3639 DEBUG shaker.agent.agent [-] Polling task: {'operation': 'poll', 'agent_id': '__heartbeat'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:40
2018-06-29 07:23:28.595 3633 DEBUG shaker.engine.messaging [-] Received request: {'operation': 'poll', 'agent_id': '__heartbeat'} __iter__ /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:51
2018-06-29 07:23:28.596 3633 DEBUG shaker.engine.messaging [-] Sent reply: {'operation': 'none'} reply_handler /root/venv/local/lib/python2.7/site-packages/shaker/engine/messaging.py:55
2018-06-29 07:23:28.596 3639 DEBUG shaker.agent.agent [-] Received: {'operation': 'none'} poll_task /root/venv/local/lib/python2.7/site-packages/shaker/agent/agent.py:43

----

I thought it was a network issue, so I checked the following.

Ping:

From the external router to both instances >>> worked
From the external router to the deploy node (where shaker is deployed) >>> worked
From the external router to the Internet >>> worked

What might be the cause, and at which point should I check for networking issues?

Nodes:

3 controller nodes
2 compute nodes
2 network nodes

All of these are in the overcloud.

Thanks in advance.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Could you please check the following (a sketch of steps 1 and 3 follows after this list):
 1. Start shaker with the parameter --agent-join-timeout set to a large value (e.g. 3600 = 1 hour).
 2. While shaker is running, create a VM inside a network created by shaker (a Cirros VM would be enough).
 3. Telnet to the address:port of the shaker server.
 4. If step 3 succeeds, boot a VM from the shaker image (a key will be required to access it).
 5. Log into the VM and check that shaker-agent is installed.
 6. If it is, start shaker-agent manually and check whether it is able to connect to the shaker server.
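
For reference, steps 1 and 3 might look like the following; the endpoint matches the deploy host mentioned later in this thread, and the scenario name is only an example:

# Step 1: run shaker with a one-hour agent-join timeout
# (assumes OpenStack credentials are already sourced)
shaker --server-endpoint 192.168.1.10:1337 \
       --scenario openstack/full_l2 \
       --agent-join-timeout 3600 \
       --report report.html

# Step 3: from a Cirros VM on the shaker-created network, probe the server port
telnet 192.168.1.10 1337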

Revision history for this message
Nahian Chowdhury (nahian) wrote :

Steps 1 & 2 are done.
On 3,
telnet 192.168.1.10 1337
Trying 192.168.1.10...
Connected to 192.168.1.10.
Escape character is '^]'.
Connection closed by foreign host.

netstat -ln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:1337 0.0.0.0:* LISTEN
tcp6 0 0 :::22 :::* LISTEN
tcp6 0 0 :::12865 :::* LISTEN

Does that indicate a network issue?

For step 4, where can I get the key for the shaker image?

Revision history for this message
Nahian Chowdhury (nahian) wrote :

My deploy host has 192.168.1.10 on interface eth1.

Revision history for this message
Nahian Chowdhury (nahian) wrote :

root@kolla-deploy:~# netstat -tapnl | grep 1337
tcp 0 0 0.0.0.0:1337 0.0.0.0:* LISTEN 17812/python2
tcp 0 0 192.168.1.10:1337 192.168.1.10:55534 TIME_WAIT -
tcp 0 0 192.168.1.10:55518 192.168.1.10:1337 ESTABLISHED 17818/python2
tcp 0 0 192.168.1.10:1337 192.168.1.10:55518 ESTABLISHED 17812/python2

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Networking looks good (since telnet is able to connect from the VM to the shaker server).

For step 4, you need to create an SSH key, upload it to OpenStack, and pass the key name when creating the VM (e.g. https://docs.openstack.org/nova/queens/user/launch-instance-from-image.html). The user name for the Ubuntu-based image is `ubuntu`.
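
With the standard OpenStack CLI that flow might look like this; the key, flavor, network, and server names are placeholders, not values from this bug:

# Upload an existing public key and boot a VM from the shaker image with it
openstack keypair create --public-key ~/.ssh/id_rsa.pub my-key
openstack server create --image <shaker-image> \
                        --flavor m1.small \
                        --network <shaker-network> \
                        --key-name my-key shaker-debug-vm

# Then log in as 'ubuntu' with the matching private key
ssh -i ~/.ssh/id_rsa ubuntu@<vm-address>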

Revision history for this message
Nahian Chowdhury (nahian) wrote :

Interesting point: I ran the 10-instance single-node scenario ("instance_metadata.yaml") and saw:

 Lost agents: set(['shaker_frktis_agent_0', 'shaker_frktis_agent_1', 'shaker_frktis_agent_5', 'shaker_frktis_agent_6', 'shaker_frktis_agent_7', 'shaker_frktis_agent_8', 'shaker_frktis_agent_9'])

This means that shaker_frktis_agent_2, 3, and 4 were not lost. I checked with netstat -tapnl:

root@kolla-deploy:~# netstat -tapnl | grep 13337
tcp 0 0 0.0.0.0:13337 0.0.0.0:* LISTEN 17895/python2
tcp 0 0 192.168.1.10:13337 192.168.1.117:55058 ESTABLISHED 17895/python2
tcp 0 0 192.168.1.10:13337 192.168.1.111:60266 ESTABLISHED 17895/python2
tcp 0 0 192.168.1.10:13337 192.168.1.119:33362 ESTABLISHED 17895/python2
tcp 0 0 192.168.1.10:36314 192.168.1.10:13337 ESTABLISHED 17901/python2
tcp 0 0 192.168.1.10:13337 192.168.1.10:36314 ESTABLISHED 17895/python2

The IPs 192.168.1.111, 192.168.1.117, and 192.168.1.119 belong to shaker_frktis_agent_2, 3, and 4; the other agents are lost.

Any idea?

I will try to boot a VM from the shaker image next.

Revision history for this message
Nahian Chowdhury (nahian) wrote :

When I ran the following commands manually on the command line (found in instance_metadata.yaml):

metadata=`curl http://169.254.169.254/latest/meta-data/ 2>/dev/null`

# print each metadata key/value pair as a JSON object
echo "{"
for line in ${metadata}; do
    val=`curl http://169.254.169.254/latest/meta-data/$line 2>/dev/null | tr "\n" ","`
    echo "\"$line\": \"${val}\","
done
# strip the trailing newline from the key
key=`curl http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key 2>/dev/null | tr -d "\n"`
echo "\"public-keys/0/openssh-key\": \"${key}\""
echo "}"

----
The output:
----

"ami-id": "ami-0000003d",
"ami-launch-index": "0",
"ami-manifest-path": "FIXME",
"block-device-mapping/": "ami,root",
"hostname": "kolla-deploy.novalocal",
"instance-action": "none",
"instance-id": "i-00001ccd",
"instance-type": "m1.xlarge",
"local-hostname": "kolla-deploy.novalocal",
"local-ipv4": "192.168.2.10",
"placement/": "availability-zone",
"public-hostname": "kolla-deploy.novalocal",
"public-ipv4": "141.20.187.79",
"public-keys/": "0=MobTry",
"reservation-id": "r-jmnjrllv",
"security-groups": "default,rules_neutron_external,default,rules_neutron_external",

Revision history for this message
Nahian Chowdhury (nahian) wrote :

I checked the instance log in Horizon. At some point I saw:

[ OK ] Listening on UUID daemon activation socket.
[ 532.226791] cloud-init[1013]: Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub.
[ 532.255624] cloud-init[1013]: The key fingerprint is:
[ 532.279300] cloud-init[1013]: SHA256:tQpkm3ITV92d78drXkSFoiNWVSCPYDroMSOENBBeUIw root@shaker-oazrks-agent-5
[ OK ] Started Daily apt upgrade and clean activities.
[ 532.317553] cloud-init[1013]: The key's randomart image is:
[ OK ] [ 532.338389] Reached target Timers.

Revision history for this message
Nahian Chowdhury (nahian) wrote :

**Sorry, I missed the important line:

[ 497.606235] cloud-init[1013]: 2018-07-05 13:30:08,482 - DataSourceEc2.py[WARNING]: Metadata 'network' key not valid: None.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

> This means that shaker_frktis_agent_2, 3, and 4 were not lost

That's good: it means that networking is configured properly and that the image is correct and contains shaker-agent.

My suggestion is to check whether all VMs get proper metadata via cloud-init. It may be that the metadata server is overloaded and does not reply within the default cloud-init timeout (which, if I recall correctly, is just 10 seconds). If that is the case, the issue should not appear with a lower number of VMs.
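
A sketch of how that check and a mitigation might look, assuming an Ubuntu guest with standard cloud-init paths; the timeout values are illustrative:

# Inside a suspect VM: look for metadata fetch timeouts in cloud-init's log
grep -iE 'datasourceec2|timed out|url_helper' /var/log/cloud-init.log

# Raise the EC2 datasource timeouts so a slow metadata server can still answer;
# this only helps if baked into the image before cloud-init's first boot
sudo tee /etc/cloud/cloud.cfg.d/99-ec2-timeouts.cfg >/dev/null <<'EOF'
datasource:
  Ec2:
    timeout: 60
    max_wait: 240
EOF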

Revision history for this message
Nahian Chowdhury (nahian) wrote :

About shaker_frktis_agent_2, 3, and 4: I checked the netstat output and could see their IP addresses, so I assumed they were not lost.

But I also tried full_l2.yaml, which spawns only two VMs, and that did not work either.

In the VM log from Horizon I found the following:

cloud-init[1013]: 2018-07-05 13:30:08,482 - DataSourceEc2.py[WARNING]: Metadata 'network' key not valid: None.

What does "Metadata 'network' key not valid" mean?

Note: I can create my own network and VM; with a floating IP I can access the VM, and the VM can reach the Internet as well.

Revision history for this message
Nahian Chowdhury (nahian) wrote :

Hi Shakhat,

It works fine now.

I guess it started working after I created a new VM with my key and attached it to the shaker network, following your suggestion:

"For step 4, you need to create an SSH key, upload it to OpenStack, and pass the key name when creating the VM (e.g. https://docs.openstack.org/nova/queens/user/launch-instance-from-image.html). The user name for the Ubuntu-based image is `ubuntu`."

I don't know how to check whether the metadata server is overloaded, or how to verify that VMs get proper metadata via cloud-init; I would like to know both.

Thanks a million.

Now I want to create my own HOT templates to test, but how can I define my own key in a HOT?

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Hi Nahian,

> It works fine now.
> I guess it started working after I created a new VM with my key and attached it to the shaker network.

That's good!
Though it's strange that manually booting a VM helps. I would suggest looking more closely at the differences between booting VMs via HOT and booting one manually. It may be that Heat operates too quickly and some resources are not created properly (e.g. network services start too late and the VMs boot before the metadata agent; if that is the case, then the fewer VMs there are in the HOT, the better the chances that they all work).

Regarding HOT, you can take https://github.com/openstack/shaker/blob/master/shaker/scenarios/openstack/l2.hot as a starting point. The key should be a property of the Server resource (key_name: <name of the key resource>). The HOT file is referenced from the scenario file, so the latter needs to be updated too.
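
A minimal sketch of such a template, written here as a heredoc; the resource and parameter names (my_key, master_0, image, flavor) are illustrative placeholders modeled on l2.hot, not shaker's exact definitions:

# Hypothetical HOT fragment: a key pair resource plus a server that uses it
cat > my_scenario.hot <<'EOF'
heat_template_version: 2013-05-23

parameters:
  image:
    type: string
  flavor:
    type: string

resources:
  my_key:
    type: OS::Nova::KeyPair
    properties:
      name: my_key
      save_private_key: true

  master_0:
    type: OS::Nova::Server
    properties:
      image: { get_param: image }
      flavor: { get_param: flavor }
      key_name: { get_resource: my_key }
EOF

The scenario file would then reference my_scenario.hot in place of the stock template.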
