I have made some progress on the periodic message loss issue. First, I believe the messages are being sent by the VIM, but are not being received by the nova-api-osapi pod. As an example, in one failure case, the VIM would have sent two GET requests to the nova-api here:

2019-02-27T16:54:05.222 controller-1 VIM_Thread[192368] DEBUG _vim_nfvi_audits.py.974 Audit instance called, timer_id=16.
2019-02-27T16:54:05.223 controller-1 VIM_Thread[192368] INFO _vim_nfvi_audits.py.1003 Auditing instance a5d7d490-5e12-4161-9949-b400f79237a5.
2019-02-27T16:54:05.223 controller-1 VIM_Thread[192368] INFO _vim_nfvi_audits.py.1003 Auditing instance c572d839-19f5-4aec-9f57-f34b8bcf2a2d.

I can see that the second message is processed by the nova-api-osapi pod on controller-1 (co-located with the VIM):

2019-02-27 16:54:05,634.634 1 INFO nova.osapi_compute.wsgi.server [req-64ca04ca-684e-4b82-9cde-cd0f4dc8e4f9 3609cdfff9cb486c92732d162375fb9f 9f0e8cacb37b42c7a5b3c7bc99863582 - default default] 128.224.151.84 "GET /v2.1/9f0e8cacb37b42c7a5b3c7bc99863582/servers/c572d839-19f5-4aec-9f57-f34b8bcf2a2d HTTP/1.1" status: 200 len: 2708 time: 0.4034779

The VIM handles the reply:

2019-02-27T16:54:05.636 controller-1 VIM_Thread[192368] INFO _vim_nfvi_audits.py.941 Audit-Instance callback for c572d839-19f5-4aec-9f57-f34b8bcf2a2d
2019-02-27T16:54:05.640 controller-1 VIM_Thread[192368] DEBUG _instance_director.py.1579 Notify other directors that an instance centos-1 audit is inprogress.

But the first message doesn't appear in the nova-api-osapi logs, and the VIM eventually times out waiting for the reply:

2019-02-27T16:54:15.441 controller-1 VIM_Thread[192368] DEBUG _vim_nfvi_audits.py.974 Audit instance called, timer_id=16.
2019-02-27T16:54:15.441 controller-1 VIM_Thread[192368] INFO _vim_nfvi_audits.py.984 Audit instance queries still outstanding, outstanding=OrderedDict([(u'a5d7d490-5e12-4161-9949-b400f79237a5', u'centos-2')])
2019-02-27T16:54:25.234 controller-1 VIM_Thread[192368] INFO _task_worker_pool.py.73 Timeout worker Compute-Worker-0
2019-02-27T16:54:26.239 controller-1 VIM_Thread[192368] ERROR _task.py.200 Task(get_instance) work (get_server) timed out, id=3901.
2019-02-27T16:54:26.240 controller-1 VIM_Thread[192368] ERROR _vim_nfvi_audits.py.961 Audit-Instance callback, not completed, response={'completed': False, 'reason': ''}.

I next determined that this problem is not specific to the VIM.
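As an aside, the correlation above can be repeated for any instance UUID with something like the rough sketch below. The pod name is a placeholder and the VIM log path is an assumption, so adjust both for the system being checked:

#!/bin/bash
# Rough sketch: check whether a given instance UUID shows up in both the VIM
# log and a nova-api-osapi pod log. UUID and POD are placeholders; the VIM
# log path is an assumption and may differ on your system.
UUID=a5d7d490-5e12-4161-9949-b400f79237a5
POD=nova-api-osapi-xxxxxxxxxx-xxxxx   # substitute the real pod name

echo "=== VIM side ==="
grep "$UUID" /var/log/nfv-vim.log

echo "=== nova-api-osapi side ==="
kubectl -n openstack logs "$POD" | grep "$UUID"

If the UUID appears on the VIM side but not in any of the os-api pods, that matches the lost-request pattern described above.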
I wrote a simple script to send GET requests directly to the nova API using curl (run this from a shell with the OpenStack credentials set up):

#!/bin/bash

TENANT_ID=`openstack token issue | grep "| project_id |" | cut -f3 -d'|' | tr -d '[[:space:]]'`
TOKEN_ID=`openstack token issue | grep "| id |" | cut -f3 -d'|' | tr -d '[[:space:]]'`

counter=1
while [ $counter -le 10000 ]
do
  echo
  echo "Request: $counter"
  date +"%D %T %N"
  curl -g -i -X GET http://nova-api.openstack.svc.cluster.local:8774/v2.1/$TENANT_ID/flavors/detail -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: $TOKEN_ID" -H "X-OpenStack-Nova-API-Version: 2.1"
  #curl -g -i -X GET http://nova.openstack.svc.cluster.local:80/v2.1/$TENANT_ID/flavors/detail -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: $TOKEN_ID" -H "X-OpenStack-Nova-API-Version: 2.1"
  #curl -g -i -X GET -H 'Accept: */*' -H 'User-Agent: python-glanceclient' -H 'Connection: keep-alive' -H "X-Auth-Token: $TOKEN_ID" -H 'Content-Type: application/octet-stream' http://glance.openstack.svc.cluster.local:80/v2/schemas/image
  #curl -g -i -X GET -H 'Accept: */*' -H 'User-Agent: python-glanceclient' -H 'Connection: keep-alive' -H "X-Auth-Token: $TOKEN_ID" -H 'Content-Type: application/octet-stream' http://glance-api.openstack.svc.cluster.local:9292/v2/schemas/image
  rc=$?
  if [ $rc -ne 0 ]
  then
    echo $rc
    exit 1
  fi
  ((counter++))
  echo
done

The results are interesting:
- The first GET, to http://nova-api.openstack.svc.cluster.local:8774, fails after a relatively small number of iterations (less than 500). I repeated this many times with the same result.
- The second GET, to http://nova.openstack.svc.cluster.local:80, never fails (after 10000 iterations).
- The third and fourth GETs repeat the same test using the glance URLs, with the same result: http://glance-api.openstack.svc.cluster.local:9292 fails (although it usually runs for about 3000 iterations first) and http://glance.openstack.svc.cluster.local:80 never fails.

So it looks like the problem affects the nova-api service, but not the nova service. I don't know much about the Kubernetes networking, but it looks like the nova service goes through ingress:

[root@controller-1 ~(keystone_admin)]# kubectl -n openstack describe service nova
Name:              nova
Namespace:         openstack
Labels:
Annotations:
Selector:          app=ingress-api
Type:              ClusterIP
IP:                10.98.129.53
Port:              http  80/TCP
TargetPort:        80/TCP
Endpoints:         172.16.0.168:80,172.16.1.25:80
Port:              https  443/TCP
TargetPort:        443/TCP
Endpoints:         172.16.0.168:443,172.16.1.25:443
Session Affinity:  None
Events:

But the nova-api service goes directly to the nova api pods:

[root@controller-1 ~(keystone_admin)]# kubectl -n openstack describe service nova-api
Name:              nova-api
Namespace:         openstack
Labels:
Annotations:
Selector:          application=nova,component=os-api,release_group=osh-openstack-nova
Type:              ClusterIP
IP:                10.106.230.13
Port:              n-api  8774/TCP
TargetPort:        8774/TCP
Endpoints:         172.16.0.180:8774,172.16.1.46:8774
Session Affinity:  None
Events:
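For anyone reproducing this, the routing difference can also be seen at a glance with standard kubectl commands. This is just a sketch; the grep pattern assumes the ingress and nova os-api pods have "ingress" and "nova-api-osapi" in their names:

#!/bin/bash
# Compare how the two services route traffic. "-o wide" adds a SELECTOR
# column for services and a pod IP column for pods, so the endpoint IPs in
# the describe output above can be mapped back to the pods that own them.
kubectl -n openstack get service nova nova-api -o wide
kubectl -n openstack get endpoints nova nova-api
# Assumes the relevant pod names contain "ingress" / "nova-api-osapi".
kubectl -n openstack get pods -o wide | grep -E "ingress|nova-api-osapi"

In other words, requests to the nova service land on the ingress-api pods first, while requests to the nova-api service go straight to the os-api pods, which is where the losses show up.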