Containers: Resolving hostname fails within nova containers leading to config_drive VM migration failures

Bug #1821026 reported by Peng Peng on 2019-03-20
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
Launch a VM using a config drive, add some test data to the config drive on this VM, cold migrate the VM, then lock the host. The host lock fails.

Severity
--------
Major

Steps to Reproduce
------------------
- Launch a vm using config drive
- Add test data to config drive on vm
- cold migrate
- lock host

Expected Behaviour
------------------
host-lock succeeds

Actual Behaviour
----------------
host-lock fails

Reproducibility
---------------
Reproducible
8 of 10 runs failed

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
master as of 20190320T013000Z

This was working with 20190318T233000Z

Timestamp/Logs
--------------
[2019-03-20 11:15:28,278] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show 293dbb8b-b3f3-4162-9cae-c328b5852ae5'
[2019-03-20 11:15:29,852] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |
| OS-EXT-SRV-ATTR:hostname | tenant2-config-drive-5 |
.....
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | True

[2019-03-20 11:16:04,769] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne migrate --poll 293dbb8b-b3f3-4162-9cae-c328b5852ae5'
[2019-03-20 11:16:34,515] 387 DEBUG MainThread ssh.expect :: Output:

Server migrating... 0% complete
Server migrating... 0% complete
Server migrating... 0% complete
Server migrating... 0% complete
Server migrating... 0% complete
Server migrating... 100% complete
Finished

[2019-03-20 11:16:34,619] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show 293dbb8b-b3f3-4162-9cae-c328b5852ae5'
[2019-03-20 11:16:36,212] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-2

[2019-03-20 11:17:08,864] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-2'

[2019-03-20 11:17:44,675] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show compute-2'
[2019-03-20 11:17:46,169] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------------------------------------+
| action | none |
| administrative | unlocked

Ghada Khalil (gkhalil) wrote :

Marking as release gating; requires investigation related to the VIM. Issue appears to be introduced recently.

Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
importance: Undecided → Medium
status: New → Triaged
description: updated
tags: added: stx.2019.05 stx.nfv
Bart Wensley (bartwensley) wrote :

The VIM is attempting to live migrate the instances, which is supported for instances with a config drive attached. Nova accepts the live migration request, but then the live migration fails:
2019-03-20T11:17:10.609 controller-1 VIM_Thread[81893] INFO _instance_director.py.151 Instance action allowed for tenant2-config_drive-5, action_type=live-migrate
2019-03-20T11:17:10.613 controller-1 VIM_Thread[81893] DEBUG _instance.py.1933 Live Migrate instance tenant2-config_drive-5.
2019-03-20T11:17:10.613 controller-1 VIM_Thread[81893] INFO _instance_state_initial.py.36 Exiting state (initial) for tenant2-config_drive-5.
2019-03-20T11:17:10.613 controller-1 VIM_Thread[81893] INFO _instance_state_live_migrate.py.28 Entering state (live-migrate) for tenant2-config_drive-5.
2019-03-20T11:17:10.614 controller-1 VIM_Thread[81893] DEBUG _instance_task_work.py.131 Live-Migrate-Instance for tenant2-config_drive-5.
2019-03-20T11:17:10.651 controller-1 VIM_Event-Log_Thread[82063] INFO fm.py.379 Generated customer log, fm_uuid=5f2cfded-3e11-45cc-967e-415780dd6e09.
2019-03-20T11:17:10.733 controller-1 VIM_Event-Log_Thread[82063] INFO fm.py.379 Generated customer log, fm_uuid=f754ad66-675d-4b75-b381-5647e46da715.
2019-03-20T11:17:10.811 controller-1 VIM_Thread[81893] DEBUG _vim_nfvi_events.py.235 Instance state-change, nfvi_instance={'attached_volumes': [], 'live_migration_timeout': None, 'name': u'tenant2-config_drive-5', 'recovery_priority': None, 'tenant_id': '018a4f4f-b194-48ba-9d4b-dec5205f280f', 'avail_status': [], 'nfvi_data': {'vm_state': u'active', 'task_state': u'migrating', 'power_state': ''}, 'live_migration_support': None, 'instance_type': None, 'oper_state': 'enabled', 'host_name': u'compute-2', 'admin_state': 'unlocked', 'action': 'migrating', 'image_uuid': None, 'uuid': u'293dbb8b-b3f3-4162-9cae-c328b5852ae5'}.
2019-03-20T11:17:10.815 controller-1 VIM_Thread[81893] INFO _instance_state_live_migrate.py.114 Live-Migrate starting for tenant2-config_drive-5.
2019-03-20T11:17:10.815 controller-1 VIM_Thread[81893] INFO _instance_director.py.1601 Instance tenant2-config_drive-5 has recovered on host compute-2.
2019-03-20T11:17:10.815 controller-1 VIM_Thread[81893] INFO _instance_director.py.1591 Instance tenant2-config_drive-5 state change notification.
2019-03-20T11:17:10.827 controller-1 VIM_Thread[81893] DEBUG _instance_task_work.py.110 Live-Migrate-Instance callback for tenant2-config_drive-5, response={'completed': True, 'reason': ''}.
2019-03-20T11:17:10.827 controller-1 VIM_Thread[81893] DEBUG _instance_tasks.py.122 Task (live-migrate-instance_tenant2-config_drive-5) complete.
2019-03-20T11:17:10.827 controller-1 VIM_Thread[81893] DEBUG _instance_state_live_migrate.py.99 Live-Migrate inprogress for tenant2-config_drive-5.
2019-03-20T11:17:10.857 controller-1 VIM_Event-Log_Thread[82063] INFO fm.py.379 Generated customer log, fm_uuid=d252c539-3bb9-4ee4-a59d-3629a42900d4.
2019-03-20T11:17:10.858 controller-1 VIM_Alarm_Thread[82065] INFO fm.py.180 Raised alarm, uuid=541557d8-8658-428d-949a-176a024f38c7, fm_uuid=0d2345c4-431a-460c-ba0f-ba8fe7b8d0f0.
2019-03-20T11:17:11.832 controller-1 VIM_Thread[81893] INFO _vim_nfvi_audits.py.873 Au...


Ghada Khalil (gkhalil) on 2019-03-28
Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Frank Miller (sensfan22)
Frank Miller (sensfan22) wrote :

From Peng's latest reproduction this looks to be an issue with nova not being able to resolve the compute hostname.

As part of a live migration for a VM with a config-drive, nova on the destination worker node needs to scp the config-drive file from the source worker node. This is failing. See timeline below:

nfv-vim log:
2019-04-03T17:19:12.788 controller-1 VIM_Thread[126662] DEBUG _instance.py.1933 Live Migrate instance tenant2-config_drive-1.
2019-04-03T17:19:17.221 controller-1 VIM_Thread[126662] INFO _instance_director.py.908 Migrate of instance tenant2-config_drive-1 from host compute-0 failed.

nova-compute log from compute-0 (source worker):
{"log":"2019-04-03 17:19:17,025.025 43749 ERROR nova.compute.manager [-] [instance: 96e1cbf9-97d0-41af-a02a-f36da0e3fbcd] Pre live migration failed at compute-1: RemoteError: Remote error: ProcessExecutionError Unexpected error while running command.\n","stream":"stdout","time":"2019-04-03T17:19:17.028238913Z"}
{"log":"Command: scp -r compute-0:/var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd/disk.config /var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd\n"
...<snip>...
"ProcessExecutionError: Unexpected error while running command.\\nCommand: scp -r compute-0:/var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd/disk.config /var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd\\nExit code: 1\\nStdout: u''\\nStderr: u'ssh: Could not resolve hostname compute-0: Name or service not known\\\\r\\\\n'\\n\"].\n","stream":"stdout","time":"2019-04-03T17:19:17.028284159Z"}
#followed by a traceback
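
For anyone triaging similar failures, here is a minimal Python sketch for pulling the unresolved hostname out of container JSON logs like the ones above. It assumes each log line is a complete JSON object with a `log` field, as in the nova-compute excerpts; the name `unresolved_hosts` is made up for illustration and is not part of nova or StarlingX.

```python
import json
import re
from typing import Iterable, List

# Matches the ssh/scp error text seen in the nova-compute Stderr above.
UNRESOLVED = re.compile(r"Could not resolve hostname ([^\s:]+)")


def unresolved_hosts(lines: Iterable[str]) -> List[str]:
    """Return the hostnames that ssh/scp reported as unresolvable.

    Hypothetical helper: expects one JSON object per line with a "log" field.
    """
    hosts = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(record, dict):
            continue
        match = UNRESOLVED.search(str(record.get("log", "")))
        if match:
            hosts.append(match.group(1))
    return hosts
```

Feeding it the lines of a nova-compute container log should list each hostname that failed to resolve, once per failing command.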

Ken Young (kenyis) on 2019-04-05
tags: added: stx.2.0
removed: stx.2019.05
Gerry Kopec (gerry-kopec) wrote :

Did some investigation of the "could not resolve hostname" issue in nova.

Looking from inside a nova-compute pod in a standard config and trying to ping another compute, you get intermittent results:
controller-0:~$ kubectl exec -it -n openstack nova-compute-compute-0-75ea0372-rg9kk -c nova-compute /bin/bash
[root@compute-0 /]# while :; do (ping compute-1 -c 1; sleep 2;);done
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.078 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.078/0.078/0.078/0.000 ms
ping: compute-1: Name or service not known
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.100 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.100/0.100/0.100/0.000 ms
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.106 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.106/0.106/0.106/0.000 ms
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.100 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.100/0.100/0.100/0.000 ms
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.101 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.101/0.101/0.101/0.000 ms
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from compute-1 (192....
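
The same probe can be sketched in Python, which makes it easier to tally the results. This is only an illustration: `probe_resolution` is a hypothetical helper, and it assumes it is run inside the nova-compute container so that it goes through the same resolver path as the ping loop.

```python
import socket
import time
from typing import Callable, Dict


def probe_resolution(hostname: str, attempts: int,
                     resolve: Callable[[str], str] = socket.gethostbyname,
                     delay: float = 0.0) -> Dict[str, int]:
    """Resolve `hostname` repeatedly and count successes and failures."""
    counts = {"ok": 0, "failed": 0}
    for _ in range(attempts):
        try:
            resolve(hostname)
            counts["ok"] += 1
        except OSError:
            # socket.gaierror ("Name or service not known") is an OSError subclass.
            counts["failed"] += 1
        if delay:
            time.sleep(delay)
    return counts
```

For example, `probe_resolution("compute-1", attempts=30, delay=2.0)` mirrors the loop above and returns the number of successful and failed lookups.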


Ghada Khalil (gkhalil) on 2019-04-09
tags: added: stx.retestneeded
Wendy Mitchell (wmitchellwr) wrote :

Still failing in regression on load 20190508T013000Z
@ [2019-05-11 17:39:17,336] 'system host-lock compute-0'
@ [2019-05-11 17:39:23,884] Send 'system host-show compute-0'
vim_progress_status | Migrate of instance tenant2-config_drive-1 from host compute-0 failed.

Frank Miller (sensfan22) on 2019-06-18
Changed in starlingx:
assignee: Frank Miller (sensfan22) → Gerry Kopec (gerry-kopec)
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_113_121
Load: 20190624T233000Z

[2019-06-25 12:37:07,477] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

[2019-06-25 12:37:44,291] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show compute-1'
[2019-06-25 12:37:45,880] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+------------------------------------------------------------------------+
| Property | Value |
+---------------------+------------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

Peng Peng (ppeng) wrote :
Wendy Mitchell (wmitchellwr) wrote :

Easily reproducible. Build ID: 20190622T013000Z
{lab wp_3-7 nova/test_config_drive.py::test_vm_with_config_drive}

tags: added: stx.regression stx.retestneded
removed: stx.retestneeded
Wendy Mitchell (wmitchellwr) wrote :

Attaching the ProcessExecutionError output from the nova-compute logs for the failing instance edb6ee9d-8e3b-4497-9114-ef44f345b1c0

{"log":"2019-06-26 17:48:31.653 51432 ERROR nova.compute.manager [-] [instance: edb6ee9d-8e3b-4497-9114-ef44f345b1c0] Pre live migration failed at compute-1: RemoteError: Remote error: ProcessExecutionError Unexpected error while running command.\n","stream":"stdout","time":"2019-06-26T17:48:31.656605907Z"}
{"log":"Command: scp -r compute-0:/var/lib/nova/instances/edb6ee9d-8e3b-4497-9114-ef44f345b1c0/disk.config /var/lib/nova/instances/edb6ee9d-8e3b-4497-9114-ef44f345b1c0\n","stream":"stdout","time":"2019-06-26T17:48:31.656628644Z"}

Matt Peters (mpeters-wrs) wrote :

In order to be able to consistently resolve internal host names (those that are only resolvable by dnsmasq), the coredns configuration should be updated to use the dnsmasq floating IP rather than referencing resolv.conf which also has the external DNS servers listed. This will ensure all DNS resolutions that are not within the K8s cluster will go through dnsmasq running on the controllers.

The *proxy* entry of coredns configmap (Corefile) should be configured to the following:

proxy . <mgmt-floating-ip>

Frank Miller (sensfan22) on 2019-07-15
summary: - Containers: lock_host failed on a host with config_drive VM
+ Containers: Resolving hostname fails within nova containers leading to
+ config_drive VM migration failures
Ghada Khalil (gkhalil) on 2019-07-15
tags: added: stx.retestneeded
removed: stx.retestneded
tags: added: stx.containers stx.networking
removed: stx.nfv
Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → wanghao (wanghao749)
fupingxie (fpxie) on 2019-07-17
Changed in starlingx:
assignee: wanghao (wanghao749) → fupingxie (fpxie)
Frank Miller (sensfan22) on 2019-07-17
tags: removed: stx.containers
Wendy Mitchell (wmitchellwr) wrote :

Fails in regression tests
Lab: wcp_63_66
Build ID: 20190713T013000Z
FAIL 20190714 18:36:22 testcases/functional/nova/test_config_drive.py::test_vm_with_config_drive

fupingxie (fpxie) wrote :

Yesterday I tried to reproduce the problem on an All-in-one duplex R2.0 system, but could not. So I'm now trying to add a separate compute node to reproduce the problem.

fupingxie (fpxie) wrote :

I tried to add a compute node to the All-in-one duplex system. However, after adding it, no services run on the compute node. Here is what I did:
1. Add a compute host via the portal
2. Assign a new interface as the data network, and assign mgmt and cluster-host to one shared interface, via the portal
3. Unlock the compute node

I'm now trying to fix this problem.

fupingxie (fpxie) wrote :

When I added compute nodes and applied helm-charts-stx-openstack-centos-dev-latest.tgz, I got this error in stx-openstack-apply.log:
Timed out waiting for jobs (namespace=openstack, labels=()). These jobs were not ready=['neutron-db-sync', 'nova-cell-setup']

And when I run "kubectl get nodes", the role of the compute node is <none>:

[root@controller-0 08d16c4b6d0b1ca5008fdebd15f4e35d97177985f1eb16c52c8674d80736de92(keystone_admin)]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
compute-1 Ready <none> 13h v1.13.5
controller-0 Ready master 19h v1.13.5
controller-1 Ready master 16h v1.13.5

fupingxie (fpxie) wrote :

@Peng Peng
Hi, what did you do in the step "- Add test data to config drive on vm"? I created an instance with this command:
"nova boot --nic net-id=d84f5dc2-26fd-41fa-a673-b502a0d0de43 --image 7f5915c0-ffcb-489e-b706-1bd38079bd74 --flavor 36d5ae34-8668-4efb-8bc4-2c98972fc217 --config-drive true --admin-pass Fh123456 VM-1"

However, when I cold migrated VM-1 from compute-0 to compute-1 and then locked compute-1, the lock succeeded.

fupingxie (fpxie) wrote :

@Peng Peng
Here is what I did, but I cannot reproduce your problem:
1. Create a VM:
nova boot --nic net-id=d84f5dc2-26fd-41fa-a673-b502a0d0de43 --image adcca643-ba09-437a-a966-9d486bcb782c --flavor 36d5ae34-8668-4efb-8bc4-2c98972fc217 --config-drive true --admin-pass Fh123456 --user-data test.config xiexie-5
and this is my test.config:
chpasswd:
    list: |
        root:rootroot
        centos:centos
    expire: false
ssh_pwauth: yes

hostname: xiexie

resolv_conf:
    nameservers: ['8.8.8.8']
    searchdomains:
        - localdomain
    domain: localdomain
    options:
        rotate: true
        timeout: 1
manage_resolv_conf: true

packages:
    - vim
    - wget
    - httpd

timezone: 'Asia/Shanghai'

runcmd:
    - [ sed, -i, "s/^ *SELINUX=enforcing/SELINUX=disabled/g", /etc/selinux/config ]
    - [ mkdir, /dropme ]
    - [ touch, /root/abc.txt ]

power_state:
    delay: now
    mode: reboot
    message: reboot now
    timeout: 30
    condition: true
2. Migrate the VM from compute-0 to compute-1
3. Lock compute-1
4. The lock succeeds

My ISO is 20190630....
Is my procedure different from yours?

Peng Peng (ppeng) wrote :

Our TC steps:

Test Step 1: boot up a VM and confirm the config drive is set to True in vm
nova --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume a1c83b3c-f28b-4a36-af0a-0e0ecc3dab9d --flavor 0b531cdd-1034-4b8f-9319-ca64cb5bb699 --key-name keypair-tenant2 --config-drive True --nic net-id=335c549f-8900-4777-b856-ef4337776015 --nic net-id=d75e1fef-d804-472a-8637-acb293269a23 tenant2-config_drive-5 --meta foo=bar --poll --block-device source=volume,device=vda,dest=volume,id=acd02b16-59c2-49b6-94be-a601fe3ee9da

| config_drive | True

Test Step 2: Add data to config drive
ssh to vm
mount | grep "/dev/hd" | awk '{print $3} '
python -m json.tool /media/hda/openstack/latest/meta_data.json | grep foo

Test Step 3: Check config_drive vm files on hypervisor after vm launch
Test Step 4: Cold migrate VM
Test Step 5: Check config drive after cold migrate VM
Test Step 6: Lock the compute host

Detailed execution log attached
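
The check in Test Step 2 can also be sketched in Python. Assumptions: the config drive is mounted at /media/hda as in the commands above, and user metadata passed with `--meta` appears under the top-level `meta` key of meta_data.json. `config_drive_meta` is a hypothetical helper name, not part of the test suite.

```python
import json
from pathlib import Path
from typing import Dict

# Path taken from the Test Step 2 commands above; adjust if the drive is mounted elsewhere.
DEFAULT_PATH = "/media/hda/openstack/latest/meta_data.json"


def config_drive_meta(path: str = DEFAULT_PATH) -> Dict[str, str]:
    """Return the user metadata stored on the config drive (empty if none)."""
    data = json.loads(Path(path).read_text())
    return data.get("meta", {})
```

With the VM booted as in Test Step 1 (`--meta foo=bar`), `config_drive_meta().get("foo")` would be expected to return `"bar"` both before and after the migration.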

Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Le, Huifeng (hle2) wrote :

@fupingxie, could you please try with the latest build with train patch (e.g. ISO after 20191115) to see if you can reproduce this issue? Thanks!

Peng Peng (ppeng) wrote :

Issue was reproduced on train
2019-11-21_20-00-00
wcp_3-6

log:
[2019-11-25 21:20:06,726] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0'

[2019-11-25 21:20:42,358] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show compute-0'
[2019-11-25 21:20:43,914] 433 DEBUG MainThread ssh.expect :: Output:
+-----------------------+--------------------------------------------------------------------------+
| Property | Value |
+-----------------------+--------------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

Peng Peng (ppeng) wrote :
Ghada Khalil (gkhalil) wrote :

As per review in the stx networking team meeting (2019-12-12), we agreed that this bug should still be fixed for stx.3.0, so raising the priority to High as only high priority bugs are considered for cherry-picking in maintenance releases.

Changed in starlingx:
importance: Medium → High
Le, Huifeng (hle2) on 2019-12-12
Changed in starlingx:
assignee: fupingxie (fpxie) → marvin Yu (marvin-yu)
marvin Yu (marvin-yu) wrote :

Hi Matt,
I tried to verify your suggestion, but my test shows that all DNS resolutions for names outside k8s already go through dnsmasq.
The resolv.conf file from the coredns pod is shown below.
----------------------------------------------------------------------------------
[sysadmin@controller-0 ~(keystone_admin)]$ cat resolv.conf # this file copy from coredns pod.
nameserver 192.178.204.2 # dnsmasq listen on 192.178.204.2:53
nameserver 10.248.2.1
----------------------------------------------------------------------------------
coredns uses dnsmasq as an upstream DNS server when resolving domains that are not within k8s.
The host interface also receives DNS requests, as seen when tcpdump listens on 192.178.204.2:53.
...
08:41:35.573214 IP controller-1.45569 > controller.domain: 3673+ A? compute-1. (27)
08:41:35.573350 IP controller.domain > controller-1.45569: 3673* 1/0/0 A 192.178.204.39 (43)
...

Is it possible that the problem is in dnsmasq? Do you have any suggestions?

Hi Peng,
Could you please try to reproduce this bug with the latest build? I've tried many times, but it's hard to reproduce.
When you reproduce it in your environment, please check that a host such as controller-0 can ping compute-0 or compute-1 directly.
This is to verify that dnsmasq is working. Thanks.

Matt Peters (mpeters-wrs) wrote :

coredns is by default configured to use the proxy plugin with resolv.conf as the proxy target. resolv.conf lists both dnsmasq (floating mgmt IP) and the public DNS servers, and the default policy for selecting a server for name resolution is "random" [1]. It is therefore possible that a request will occasionally fail (and the failure be cached) when resolving DNS entries that are only resolvable via dnsmasq (host names). Furthermore, in a multi-node system, multiple instances of coredns are in use, each with the above random behavior.

The bug report indicates that this issue is not always reproducible, and the above behavior is the reason. The recommended setup, removing resolv.conf and using the floating mgmt IP for the proxy configuration, ensures that all requests go through dnsmasq.

[1] https://coredns.io/plugins/proxy/
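
To illustrate why a random upstream policy produces intermittent failures, here is a toy Python model. It is a sketch under stated assumptions, not coredns code: the two upstream addresses are the ones from the resolv.conf quoted earlier in this thread, only dnsmasq is assumed to know platform hostnames, and caching and health checks are ignored. All function names are hypothetical.

```python
import random
from typing import Dict, List, Optional

DNSMASQ = "192.178.204.2"   # first nameserver in the quoted resolv.conf
EXTERNAL = "10.248.2.1"     # second nameserver in the quoted resolv.conf

# Toy zone data: only dnsmasq can answer for platform hostnames.
KNOWN: Dict[str, Dict[str, str]] = {
    DNSMASQ: {"compute-1": "192.178.204.39"},
    EXTERNAL: {},
}


def pick_upstream(upstreams: List[str], policy: str, rng: random.Random) -> str:
    """Choose an upstream; 'sequential' is simplified to 'always the first'."""
    if policy == "random":
        return rng.choice(upstreams)
    if policy == "sequential":
        return upstreams[0]
    raise ValueError(f"unknown policy: {policy}")


def resolve(name: str, upstreams: List[str], policy: str,
            rng: random.Random) -> Optional[str]:
    """Return the address, or None when the chosen upstream cannot answer."""
    return KNOWN[pick_upstream(upstreams, policy, rng)].get(name)


def failure_rate(policy: str, trials: int = 10000, seed: int = 0) -> float:
    """Fraction of lookups for compute-1 that fail under the given policy."""
    rng = random.Random(seed)
    upstreams = [DNSMASQ, EXTERNAL]
    failures = sum(1 for _ in range(trials)
                   if resolve("compute-1", upstreams, policy, rng) is None)
    return failures / trials
```

In this model roughly half of the lookups fail under `random`, while none fail when dnsmasq is always tried first. The real failure rate depends on the number of upstreams, caching, and how many coredns instances are involved.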

ChenjieXu (midone) wrote :

Hi all,

A similar problem is reported here:
https://github.com/coredns/coredns/issues/2830

The default policy for the forward plugin, which proxies DNS messages to upstream resolvers, is random: multiple upstreams are randomized (see policy) on first use. In this bug, 10.248.2.1 may be chosen, and 10.248.2.1 can't resolve compute-0.
https://github.com/coredns/coredns/tree/master/plugin/forward

Another possible solution: change the forward policy from random to sequential, which selects hosts in sequential order. (Note that 192.178.204.2 is the first DNS server in /etc/resolv.conf.)

Fix proposed to branch: master
Review: https://review.opendev.org/700100

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil) wrote :

Joseph Richard will take this over as per agreement with Yong Hu

Changed in starlingx:
assignee: marvin Yu (marvin-yu) → Joseph Richard (josephrichard)
Ghada Khalil (gkhalil) on 2020-03-05
tags: added: stx.4.0

Fix proposed to branch: master
Review: https://review.opendev.org/729758

Reviewed: https://review.opendev.org/729758
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=7ecbdadbfb33f407d87e5eb4458e92b86c1c6fb7
Submitter: Zuul
Branch: master

commit 7ecbdadbfb33f407d87e5eb4458e92b86c1c6fb7
Author: Joseph Richard <email address hidden>
Date: Tue May 19 15:30:45 2020 -0400

    Use sequential forward policy in coredns

    If possible, dns should be resolved through dnsmasq, in order to allow
    proper resolution of platform (e.g. controller) hostnames, which would
    fail to resolve from external nameservers.

    Partial-Bug: 1821026
    Change-Id: I4f5cdb7ac79dfe19626623adb5622645cf8569ab
    Signed-off-by: Joseph Richard <email address hidden>

Reviewed: https://review.opendev.org/732910
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=9dc80285641d03e4abf7a2469c0a19e6f557d444
Submitter: Zuul
Branch: master

commit 9dc80285641d03e4abf7a2469c0a19e6f557d444
Author: Matt Peters <email address hidden>
Date: Tue Jun 2 09:59:24 2020 -0500

    Fix host name resolution for AIO-SX IPV6

    dnsmasq is not processing DNS requests sent to the UDP port 53 when binding
    to the loopback interface on an IPv6 system. The requests are processed
    correctly if dnsmasq is explicitly configured to listen for the management
    address via the listen_address parameter.

    Closes-Bug: 1881772
    Related-Bug: 1821026

    Change-Id: I47f733d2a35c946acd2952efd246a973826e8114
    Signed-off-by: Matt Peters <email address hidden>

Change abandoned by Matt Peters (<email address hidden>) on branch: master
Review: https://review.opendev.org/700100
Reason: This has already been fixed by:
https://review.opendev.org/#/c/729758/

Change abandoned by Joseph Richard (<email address hidden>) on branch: master
Review: https://review.opendev.org/735278
Reason: Ran into issues with update to handle with simplex upgrades. Abandoning and moving this change to an upgrade script.
See https://review.opendev.org/#/c/736797/

Reviewed: https://review.opendev.org/736797
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=7639db0d71f53c147031c7edbfd530530a496cd6
Submitter: Zuul
Branch: master

commit 7639db0d71f53c147031c7edbfd530530a496cd6
Author: Joseph Richard <email address hidden>
Date: Thu Jun 18 12:00:42 2020 -0400

    Use sequential forward policy in coredns

    If possible, dns should be resolved through dnsmasq, in order to allow
    proper resolution of platform (e.g. controller) hostnames, which would
    fail to resolve from external nameservers.

    This commit handles setting sequential policy over an upgrade.

    Closes-Bug: 1821026
    Change-Id: Ib9b09bcfe2b84226ef25cfaaa2fa9d1f8051409e
    Signed-off-by: Joseph Richard <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Fixes are merged in stx master and will be included in the stx.4.0 release. Given no users have raised an issue with this when using stx.3.0, the plan is not to port back the changes due to complexity.

tags: removed: stx.3.0
Peng Peng (ppeng) wrote :

TC: test_vm_with_config_drive, all passed on recent loads.

tags: removed: stx.retestneeded