Bug #1821026 “Containers: Resolving hostname fails within nova c...” : Bugs : StarlingX

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-03-20:

#1

Marking as release gating; requires investigation related to the VIM. Issue appears to be introduced recently.

Changed in starlingx:
assignee:	nobody → Bart Wensley (bartwensley)
importance:	Undecided → Medium
status:	New → Triaged
description:	updated
tags:	added: stx.2019.05 stx.nfv

Bart Wensley (bartwensley) wrote on 2019-03-28:

#2

The VIM is attempting to live migrate the instances, which is supported for instances with a config drive attached. Nova accepts the live migration request, but then the live migration fails: 
2019-03-20T11:17:10.609 controller-1 VIM_Thread[81893] INFO _instance_director.py.151 Instance action allowed for tenant2-config_drive-5, action_type=live-migrate 
2019-03-20T11:17:10.613 controller-1 VIM_Thread[81893] DEBUG _instance.py.1933 Live Migrate instance tenant2-config_drive-5. 
2019-03-20T11:17:10.613 controller-1 VIM_Thread[81893] INFO _instance_state_initial.py.36 Exiting state (initial) for tenant2-config_drive-5. 
2019-03-20T11:17:10.613 controller-1 VIM_Thread[81893] INFO _instance_state_live_migrate.py.28 Entering state (live-migrate) for tenant2-config_drive-5. 
2019-03-20T11:17:10.614 controller-1 VIM_Thread[81893] DEBUG _instance_task_work.py.131 Live-Migrate-Instance for tenant2-config_drive-5. 
2019-03-20T11:17:10.651 controller-1 VIM_Event-Log_Thread[82063] INFO fm.py.379 Generated customer log, fm_uuid=5f2cfded-3e11-45cc-967e-415780dd6e09. 
2019-03-20T11:17:10.733 controller-1 VIM_Event-Log_Thread[82063] INFO fm.py.379 Generated customer log, fm_uuid=f754ad66-675d-4b75-b381-5647e46da715. 
2019-03-20T11:17:10.811 controller-1 VIM_Thread[81893] DEBUG _vim_nfvi_events.py.235 Instance state-change, nfvi_instance={'attached_volumes': [], 'live_migration_timeout': None, 'name': u'tenant2-config_drive-5', 'recovery_priority': None, 'tenant_id': '018a4f4f-b194-48ba-9d4b-dec5205f280f', 'avail_status': [], 'nfvi_data': {'vm_state': u'active', 'task_state': u'migrating', 'power_state': ''}, 'live_migration_support': None, 'instance_type': None, 'oper_state': 'enabled', 'host_name': u'compute-2', 'admin_state': 'unlocked', 'action': 'migrating', 'image_uuid': None, 'uuid': u'293dbb8b-b3f3-4162-9cae-c328b5852ae5'}. 
2019-03-20T11:17:10.815 controller-1 VIM_Thread[81893] INFO _instance_state_live_migrate.py.114 Live-Migrate starting for tenant2-config_drive-5. 
2019-03-20T11:17:10.815 controller-1 VIM_Thread[81893] INFO _instance_director.py.1601 Instance tenant2-config_drive-5 has recovered on host compute-2. 
2019-03-20T11:17:10.815 controller-1 VIM_Thread[81893] INFO _instance_director.py.1591 Instance tenant2-config_drive-5 state change notification. 
2019-03-20T11:17:10.827 controller-1 VIM_Thread[81893] DEBUG _instance_task_work.py.110 Live-Migrate-Instance callback for tenant2-config_drive-5, response={'completed': True, 'reason': ''}. 
2019-03-20T11:17:10.827 controller-1 VIM_Thread[81893] DEBUG _instance_tasks.py.122 Task (live-migrate-instance_tenant2-config_drive-5) complete. 
2019-03-20T11:17:10.827 controller-1 VIM_Thread[81893] DEBUG _instance_state_live_migrate.py.99 Live-Migrate inprogress for tenant2-config_drive-5. 
2019-03-20T11:17:10.857 controller-1 VIM_Event-Log_Thread[82063] INFO fm.py.379 Generated customer log, fm_uuid=d252c539-3bb9-4ee4-a59d-3629a42900d4. 
2019-03-20T11:17:10.858 controller-1 VIM_Alarm_Thread[82065] INFO fm.py.180 Raised alarm, uuid=541557d8-8658-428d-949a-176a024f38c7, fm_uuid=0d2345c4-431a-460c-ba0f-ba8fe7b8d0f0. 
2019-03-20T11:17:11.832 controller-1 VIM_Thread[81893] INFO _vim_nfvi_audits.py.873 Audit instances called, timer_id=15. 
2019-03-20T11:17:14.780 controller-1 VIM_Thread[81893] DEBUG nfvi_compute_api.py.3324 Instance action-change: instance_uuid=293dbb8b-b3f3-4162-9cae-c328b5852ae5, task_state=migrating, task_status=start, error_msg=None. 
2019-03-20T11:17:14.780 controller-1 VIM_Thread[81893] DEBUG _vim_nfvi_events.py.288 Instance action-change, uuid=293dbb8b-b3f3-4162-9cae-c328b5852ae5, nfvi_action=live-migrate, nfvi_action_state=started, reason=None. 
2019-03-20T11:17:16.997 controller-1 VIM_Thread[81893] DEBUG _vim_nfvi_events.py.235 Instance state-change, nfvi_instance={'attached_volumes': [], 'live_migration_timeout': None, 'name': u'tenant2-config_drive-5', 'recovery_priority': None, 'tenant_id': '018a4f4f-b194-48ba-9d4b-dec5205f280f', 'avail_status': [], 'nfvi_data': {'vm_state': u'active', 'task_state': 'none', 'power_state': ''}, 'live_migration_support': None, 'instance_type': None, 'oper_state': 'enabled', 'host_name': u'compute-2', 'admin_state': 'unlocked', 'action': '', 'image_uuid': None, 'uuid': u'293dbb8b-b3f3-4162-9cae-c328b5852ae5'}. 
2019-03-20T11:17:17.000 controller-1 VIM_Thread[81893] INFO _instance_state_live_migrate.py.120 Live-Migrate no longer in progress for tenant2-config_drive-5. 
2019-03-20T11:17:17.000 controller-1 VIM_Thread[81893] INFO _instance_director.py.908 Migrate of instance tenant2-config_drive-5 from host compute-2 failed.

Someone from the nova area needs to take a look at the nova logs and determine why nova fails to live migrate the instance.

Ghada Khalil (gkhalil) on 2019-03-28

Changed in starlingx:
assignee:	Bart Wensley (bartwensley) → Frank Miller (sensfan22)

Frank Miller (sensfan22) wrote on 2019-04-03:

#3

From Peng's latest reproduction this looks to be an issue with nova not being able to resolve the compute hostname.

As part of a live migration for a VM with a config-drive, nova on the destination worker node needs to scp the config-drive file from the source worker node. This is failing. See timeline below:

nfv-vim log:
2019-04-03T17:19:12.788 controller-1 VIM_Thread[126662] DEBUG _instance.py.1933 Live Migrate instance tenant2-config_drive-1.
2019-04-03T17:19:17.221 controller-1 VIM_Thread[126662] INFO _instance_director.py.908 Migrate of instance tenant2-config_drive-1 from host compute-0 failed.

nova-compute log from compute-0 (source worker):
{"log":"2019-04-03 17:19:17,025.025 43749 ERROR nova.compute.manager [-] [instance: 96e1cbf9-97d0-41af-a02a-f36da0e3fbcd] Pre live migration failed at compute-1: RemoteError: Remote error: ProcessExecutionError Unexpected error while running command.\n","stream":"stdout","time":"2019-04-03T17:19:17.028238913Z"}
{"log":"Command: scp -r compute-0:/var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd/disk.config /var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd\n"
...<snip>...
"ProcessExecutionError: Unexpected error while running command.\\nCommand: scp -r compute-0:/var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd/disk.config /var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd\\nExit code: 1\\nStdout: u''\\nStderr: u'ssh: Could not resolve hostname compute-0: Name or service not known\\\\r\\\\n'\\n\"].\n","stream":"stdout","time":"2019-04-03T17:19:17.028284159Z"}
#followed by a traceback
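
For anyone scanning a large nova-compute container log for this failure, here is a rough sketch that pulls the unresolved hostname out of the JSON-wrapped log lines. It assumes one JSON object per line with a "log" key, as in the excerpt above; the sample entry is a shortened stand-in built for illustration, not a verbatim log line.

```python
import json
import re

# Matches the ssh error text seen in the Stderr field above.
RESOLVE_RE = re.compile(r"Could not resolve hostname (\S+?):")


def unresolved_hosts(json_lines):
    """Collect hostnames that ssh/scp reported as unresolvable."""
    hosts = []
    for raw in json_lines:
        try:
            entry = json.loads(raw)
        except ValueError:
            continue  # snipped or truncated line, skip it
        if not isinstance(entry, dict):
            continue
        hosts.extend(RESOLVE_RE.findall(entry.get("log", "")))
    return hosts


# Shortened, illustrative entry in the same shape as the container log.
SAMPLE_LINE = json.dumps({
    "log": "Stderr: u'ssh: Could not resolve hostname compute-0: "
           "Name or service not known\\r\\n'\n",
    "stream": "stdout",
    "time": "2019-04-03T17:19:17.028284159Z",
})

if __name__ == "__main__":
    print(unresolved_hosts([SAMPLE_LINE, "...<snip>..."]))
```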

Ken Young (kenyis) on 2019-04-05

tags:

added: stx.2.0
removed: stx.2019.05

Gerry Kopec (gerry-kopec) wrote on 2019-04-08:

#4

Did some investigation of the "could not resolve hostname" issue in nova.

Looking from inside a nova-compute pod in a standard config and trying to ping another compute, you get intermittent results:
controller-0:~$ kubectl exec -it -n openstack nova-compute-compute-0-75ea0372-rg9kk -c nova-compute /bin/bash
[root@compute-0 /]# while :; do (ping compute-1 -c 1; sleep 2;);done
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.078 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.078/0.078/0.078/0.000 ms
ping: compute-1: Name or service not known
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.100 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.100/0.100/0.100/0.000 ms
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
ping: compute-1: Name or service not known
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.106 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.106/0.106/0.106/0.000 ms
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.100 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.100/0.100/0.100/0.000 ms
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.101 ms

--- compute-1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.101/0.101/0.101/0.000 ms
PING compute-1 (192.168.204.122) 56(84) bytes of data.
64 bytes from compute-1 (192.168.204.122): icmp_seq=1 ttl=64 time=0.103 ms

I get the same result for the infra/cluster-host name, though note that both networks are on the same interface in this lab (wcp99-103):
[root@compute-0 /]# while :; do (ping compute-1-infra -c 1; sleep 2;);done
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
PING compute-1-infra (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.107 ms

--- compute-1-infra ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.107/0.107/0.107/0.000 ms
ping: compute-1-infra: Name or service not known
PING compute-1-infra (192.168.204.122) 56(84) bytes of data.
64 bytes from 192.168.204.122 (192.168.204.122): icmp_seq=1 ttl=64 time=0.094 ms

--- compute-1-infra ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.094/0.094/0.094/0.000 ms
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known
ping: compute-1-infra: Name or service not known

I don't see any problems when I attempt this from the compute-0 host, outside of the pod.
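
The ping loop can also be expressed as a small probe that only exercises name resolution, so that ICMP is not a factor. This is a sketch under my own assumptions: the function names are invented, and the resolver and sleep function are injectable purely so the loop can be exercised without a real DNS server.

```python
import socket
import time


def probe_resolution(hostname, attempts=10, delay=2.0,
                     resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve hostname repeatedly; return a list of True/False outcomes."""
    results = []
    for i in range(attempts):
        try:
            resolver(hostname, None)
            results.append(True)
        except socket.gaierror:
            # Same condition that ping reports as "Name or service not known".
            results.append(False)
        if i + 1 < attempts:
            sleep(delay)
    return results


def failure_rate(results):
    """Fraction of failed lookups (0.0 for an empty run)."""
    return results.count(False) / len(results) if results else 0.0


if __name__ == "__main__":
    outcome = probe_resolution("compute-1", attempts=5, delay=2.0)
    print(outcome, failure_rate(outcome))
```

Running it both inside the nova-compute container and on the host should show whether the failures are specific to the pod's resolver path.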

For reference, here's /etc/resolv.conf from inside the pod:
[root@compute-0 /]# cat /etc/resolv.conf
nameserver 10.96.0.10
search openstack.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
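
As a side note on that resolv.conf: with ndots:5, a short name such as compute-1 is first tried with each search suffix and only then as-is. The sketch below models that ordering as documented for the glibc resolver; it is a simplified illustration (the helper names are mine) and does not by itself explain why the final lookup fails intermittently.

```python
def parse_resolv_conf(text):
    """Extract the search list and the ndots value (default 1) from resolv.conf text."""
    search, ndots = [], 1
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "search":
            search = parts[1:]
        elif parts[0] == "options":
            for opt in parts[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


def query_order(name, search, ndots):
    """Order in which candidate names are tried for a lookup of name."""
    if name.endswith("."):
        return [name]  # already fully qualified, search list is skipped
    suffixed = ["%s.%s" % (name, domain) for domain in search]
    if name.count(".") >= ndots:
        return [name] + suffixed
    return suffixed + [name]


# The pod's /etc/resolv.conf as shown above.
POD_RESOLV_CONF = """\
nameserver 10.96.0.10
search openstack.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""

if __name__ == "__main__":
    search, ndots = parse_resolv_conf(POD_RESOLV_CONF)
    for candidate in query_order("compute-1", search, ndots):
        print(candidate)
```

So the bare name compute-1 is the fourth query sent to 10.96.0.10, and it can only succeed if the cluster DNS forwards it to something that knows the platform hostnames.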

Looking at the nova code, it uses the hostname (instance.host) directly, with no configuration option to override it:
nova/virt/libvirt/driver.py:
    def pre_live_migration(self, context, instance, block_device_info,
                           network_info, disk_info, migrate_data):
        ...
            if not is_shared_block_storage:
                # Ensure images and backing files are present.
                LOG.debug('Checking to make sure images and backing files are '
                          'present before live migration.', instance=instance)
                self._create_images_and_backing(
                    context, instance, instance_dir, disk_info,
                    fallback_from_host=instance.host)
                if (configdrive.required_by(instance) and
                        CONF.config_drive_format == 'iso9660'):
                    # NOTE(pkoniszewski): Due to a bug in libvirt iso config
                    # drive needs to be copied to destination prior to
                    # migration when instance path is not shared and block
                    # storage is not shared. Files that are already present
                    # on destination are excluded from a list of files that
                    # need to be copied to destination. If we don't do that
                    # live migration will fail on copying iso config drive to
                    # destination and writing to read-only device.
                    # Please see bug/1246201 for more details.
                    src = "%s:%s/disk.config" % (instance.host, instance_dir)
                    self._remotefs.copy_file(src, instance_dir)

In stx-nova (based on Pike), we had hooked this to convert the hostname to hostname-infra, ensuring that the correct network was used:
                if (configdrive.required_by(instance) and
                        CONF.config_drive_format == 'iso9660'):
                    # NOTE(pkoniszewski): Due to a bug in libvirt iso config
                    # drive needs to be copied to destination prior to
                    # migration when instance path is not shared and block
                    # storage is not shared. Files that are already present
                    # on destination are excluded from a list of files that
                    # need to be copied to destination. If we don't do that
                    # live migration will fail on copying iso config drive to
                    # destination and writing to read-only device.
                    # Please see bug/1246201 for more details.
                    src = "%s:%s/disk.config" % (
                        utils.safe_ip_format(instance.host),
                        instance_dir)
                    self._remotefs.copy_file(src, instance_dir)

nova/utils:
def safe_ip_format(ip):
    """Transform ip string to "safe" format.

    Will return ipv4 addresses unchanged, but will nest ipv6 addresses
    inside square brackets.
    """
    try:
        if netaddr.IPAddress(ip).version == 6:
            return '[%s]' % ip
    except (TypeError, netaddr.AddrFormatError):  # hostname
        # In TiC, we set up ssh keys for passwordless ssh between
        # computes. If we have an infra interface present, the keys
        # will be associated with that interface rather than the
        # mgmt interface. We also always provide hostname
        # resolution for the mgmt interface (compute-n) and the
        # infra interface (compute-n-infra) irrespective of the
        # infra interface actually being provisioned. By ensuring
        # that we use the infra interface hostname we guarantee we
        # will align with the ssh keys.
        if '-infra' not in ip:
            return '%s-infra' % ip
        pass
    # it's IPv4 or hostname
    return ip
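
To show what the hooked helper returns for each kind of input, here is a stand-alone approximation that uses the standard-library ipaddress module in place of netaddr, so it runs without nova installed. It is only a sketch of the behaviour described in the comment above; netaddr and ipaddress do not accept exactly the same set of address spellings.

```python
import ipaddress


def safe_ip_format(ip):
    """Approximation of the stx-nova helper, using ipaddress instead of netaddr."""
    try:
        if ipaddress.ip_address(ip).version == 6:
            return "[%s]" % ip  # IPv6 literals are bracketed for scp/ssh
    except ValueError:
        # Not an IP literal, so treat it as a hostname and steer it to the
        # infra name, which is what the ssh keys were associated with.
        if "-infra" not in ip:
            return "%s-infra" % ip
    return ip  # IPv4 literal, or a hostname that already names the infra interface


if __name__ == "__main__":
    for value in ("compute-0", "compute-0-infra", "192.168.204.122", "fd00::1"):
        print(value, "->", safe_ip_format(value))
```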

So, assuming we want to support live migration of VMs with config drives, we'll need to fix hostname resolution inside the nova pods and find a way for the hostname to resolve to the cluster-host network instead of the management network, because the ssh keys are set up for the cluster-host network only.

Alternatively, we would have to add configurability to nova so that we can specify the IP address to use, as is currently done for live and cold migration via the options live_migration_inbound_addr and my_ip, respectively.
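
Purely as an illustration of the second option, the scp source string could be built from an operator-supplied address instead of instance.host. Everything here is hypothetical: the copy_addr parameter and the function name do not exist in nova, and how the destination would learn the source host's address is left open.

```python
import ipaddress


def config_drive_src(instance_host, instance_dir, copy_addr=None):
    """Build the scp source for disk.config.

    copy_addr is a hypothetical override (not a real nova option); when it is
    unset, the result matches the upstream "%s:%s/disk.config" behaviour.
    """
    host = copy_addr or instance_host
    try:
        if ipaddress.ip_address(host).version == 6:
            host = "[%s]" % host  # scp needs IPv6 literals in brackets
    except ValueError:
        pass  # a hostname, use it unchanged
    return "%s:%s/disk.config" % (host, instance_dir)


if __name__ == "__main__":
    instance_dir = "/var/lib/nova/instances/96e1cbf9-97d0-41af-a02a-f36da0e3fbcd"
    print(config_drive_src("compute-0", instance_dir))
    print(config_drive_src("compute-0", instance_dir, copy_addr="192.168.204.122"))
```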

Ghada Khalil (gkhalil) on 2019-04-09

tags:

added: stx.retestneeded

Wendy Mitchell (wmitchellwr) wrote on 2019-05-13:

#5

Still failing in regression on load 20190508T013000Z.
@ [2019-05-11 17:39:17,336] 'system host-lock compute-0'
@ [2019-05-11 17:39:23,884] Send 'system host-show compute-0'
vim_progress_status | Migrate of instance tenant2-config_drive-1 from host compute-0 failed.

Frank Miller (sensfan22) on 2019-06-18

Changed in starlingx:
assignee:	Frank Miller (sensfan22) → Gerry Kopec (gerry-kopec)

Peng Peng (ppeng) wrote on 2019-06-25:

#6

ALL_NODES_20190625.221452.tar (33.6 MiB, application/x-tar)

Issue was reproduced on
Lab: WCP_113_121
Load: 20190624T233000Z

[2019-06-25 12:37:07,477] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

[2019-06-25 12:37:44,291] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show compute-1'
[2019-06-25 12:37:45,880] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+------------------------------------------------------------------------+
| Property | Value |
+---------------------+------------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

Peng Peng (ppeng) wrote on 2019-06-25:

#7

controller-1_20190625.221808.tar (28.6 MiB, application/x-tar)

Wendy Mitchell (wmitchellwr) wrote on 2019-06-26:

#8

Easily reproducible. Build ID: 20190622T013000Z
{lab wp_3-7 nova/test_config_drive.py::test_vm_with_config_drive}

tags:

added: stx.regression stx.retestneded
removed: stx.retestneeded

Wendy Mitchell (wmitchellwr) wrote on 2019-06-26:

#9

Attaching the output of the ProcessExecutionError from the nova-compute logs for the failing instance edb6ee9d-8e3b-4497-9114-ef44f345b1c0:

{"log":"2019-06-26 17:48:31.653 51432 ERROR nova.compute.manager [-] [instance: edb6ee9d-8e3b-4497-9114-ef44f345b1c0] Pre live migration failed at compute-1: RemoteError: Remote error: ProcessExecutionError Unexpected error while running command.\n","stream":"stdout","time":"2019-06-26T17:48:31.656605907Z"}
{"log":"Command: scp -r compute-0:/var/lib/nova/instances/edb6ee9d-8e3b-4497-9114-ef44f345b1c0/disk.config /var/lib/nova/instances/edb6ee9d-8e3b-4497-9114-ef44f345b1c0\n","stream":"stdout","time":"2019-06-26T17:48:31.656628644Z"}

Wendy Mitchell (wmitchellwr) wrote on 2019-06-26:

#10

instance_edb6ee9d-8e3b-4497-9114-ef44f345b1c0_ProcessExecutionError.txt (90.6 KiB, text/plain)

Matt Peters (mpeters-wrs) wrote on 2019-07-12:

#11

In order to consistently resolve internal host names (those that are only resolvable by dnsmasq), the coredns configuration should be updated to use the dnsmasq floating IP rather than referencing resolv.conf, which also lists the external DNS servers. This will ensure that all DNS resolutions for names outside the K8s cluster go through dnsmasq running on the controllers.

The *proxy* entry of the coredns configmap (Corefile) should be set to the following:

proxy . <mgmt-floating-ip>
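
A sketch of that change applied to Corefile text, for anyone scripting it. The sample Corefile below is an assumed minimal layout (the real configmap on a StarlingX system may contain more plugins), the address 192.0.2.10 is a documentation placeholder standing in for the management floating IP, and the function name is made up.

```python
import re

# Matches a "proxy . <upstream>" directive on its own line, keeping the indentation.
PROXY_RE = re.compile(r"^([ \t]*proxy[ \t]+\.[ \t]+).*$", re.MULTILINE)


def point_proxy_at(corefile, upstream):
    """Return the Corefile text with every 'proxy .' directive pointed at upstream."""
    new_text, count = PROXY_RE.subn(lambda m: m.group(1) + upstream, corefile)
    if count == 0:
        raise ValueError("no 'proxy .' directive found in Corefile")
    return new_text


# Assumed example layout, not copied from a running system.
SAMPLE_COREFILE = """\
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    proxy . /etc/resolv.conf
    cache 30
}
"""

if __name__ == "__main__":
    # 192.0.2.10 is a placeholder for <mgmt-floating-ip>.
    print(point_proxy_at(SAMPLE_COREFILE, "192.0.2.10"))
```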

Frank Miller (sensfan22) on 2019-07-15

summary:

- Containers: lock_host failed on a host with config_drive VM
+ Containers: Resolving hostname fails within nova containers leading to
+ config_drive VM migration failures

Ghada Khalil (gkhalil) on 2019-07-15

tags:	added: stx.retestneeded removed: stx.retestneded
tags:	added: stx.containers stx.networking removed: stx.nfv
Changed in starlingx:
assignee:	Gerry Kopec (gerry-kopec) → wanghao (wanghao749)

fupingxie (fpxie) on 2019-07-17

Changed in starlingx:
assignee:	wanghao (wanghao749) → fupingxie (fpxie)

Frank Miller (sensfan22) on 2019-07-17

tags:

removed: stx.containers

Wendy Mitchell (wmitchellwr) wrote on 2019-07-18:

#12

Fails in regression tests
Lab: wcp_63_66
Build ID: 20190713T013000Z
FAIL 20190714 18:36:22 testcases/functional/nova/test_config_drive.py::test_vm_with_config_drive

fupingxie (fpxie) wrote on 2019-07-19:

#13

Yesterday I tried to reproduce the problem on an All-in-one duplex R2.0 system, but could not. So I'm trying to add a separate compute node to reproduce the problem.

fupingxie (fpxie) wrote on 2019-07-22:

#14

I tried to add a compute node to the All-in-one duplex system. However, after adding it, no services run on the compute node. Here is what I did:
1. Add a compute host via the portal.
2. Assign a new interface as the data network, and assign mgmt and cluster-host to a single shared interface, via the portal.
3. Unlock the compute node.

I'm now trying to fix this problem.

fupingxie (fpxie) wrote on 2019-07-25:

#15

When I added compute nodes and applied helm-charts-stx-openstack-centos-dev-latest.tgz, I got this error in stx-openstack-apply.log:
Timed out waiting for jobs (namespace=openstack, labels=()). These jobs were not ready=['neutron-db-sync', 'nova-cell-setup']

When I run "kubectl get nodes", the role of the compute node is shown as <none>:

[root@controller-0 08d16c4b6d0b1ca5008fdebd15f4e35d97177985f1eb16c52c8674d80736de92(keystone_admin)]# kubectl get nodes
NAME           STATUS   ROLES    AGE   VERSION
compute-1      Ready    <none>   13h   v1.13.5
controller-0   Ready    master   19h   v1.13.5
controller-1   Ready    master   16h   v1.13.5

fupingxie (fpxie) wrote on 2019-08-02:

#16

@Peng Peng
Hi, what did you do in the step "- Add test data to config drive on vm"? I created an instance with this command:
"nova boot --nic net-id=d84f5dc2-26fd-41fa-a673-b502a0d0de43 --image 7f5915c0-ffcb-489e-b706-1bd38079bd74 --flavor 36d5ae34-8668-4efb-8bc4-2c98972fc217 --config-drive true --admin-pass Fh123456 VM-1"

However, when I cold migrated VM-1 from compute-0 to compute-1 and then locked compute-1, the lock succeeded.

fupingxie (fpxie) wrote on 2019-08-05:

#17

@Peng Peng
Here is what I did, but I cannot reproduce your problem:
1. Create a VM:
nova boot --nic net-id=d84f5dc2-26fd-41fa-a673-b502a0d0de43 --image adcca643-ba09-437a-a966-9d486bcb782c --flavor 36d5ae34-8668-4efb-8bc4-2c98972fc217 --config-drive true --admin-pass Fh123456 --user-data test.config xiexie-5
and this is my test.config:
chpasswd:
    list: |
        root:rootroot
        centos:centos
    expire: false
ssh_pwauth: yes

hostname: xiexie

resolv_conf:
    nameservers: ['8.8.8.8']
    searchdomains:
        - localdomain
    domain: localdomain
    options:
        rotate: true
        timeout: 1
manage_resolv_conf: true

packages:
    - vim
    - wget
    - httpd

timezone: 'Asia/Shanghai'

runcmd:
    - [ sed, -i, "s/^ *SELINUX=enforcing/SELINUX=disabled/g", /etc/selinux/config ]
    - [ mkdir, /dropme ]
    - [ touch, /root/abc.txt ]

power_state:
    delay: now
    mode: reboot
    message: reboot now
    timeout: 30
    condition: true
2. Migrate the VM from compute-0 to compute-1.
3. Then lock compute-1.
4. The lock succeeds.

My ISO is 20190630....
Is my procedure different from yours?

Peng Peng (ppeng) wrote on 2019-08-12:

#18

TIS_AUTOMATION.txt (5.8 MiB, text/plain)

Our TC steps:

Test Step 1: boot up a VM and confirm the config drive is set to True in vm
nova --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume a1c83b3c-f28b-4a36-af0a-0e0ecc3dab9d --flavor 0b531cdd-1034-4b8f-9319-ca64cb5bb699 --key-name keypair-tenant2 --config-drive True --nic net-id=335c549f-8900-4777-b856-ef4337776015 --nic net-id=d75e1fef-d804-472a-8637-acb293269a23 tenant2-config_drive-5 --meta foo=bar --poll --block-device source=volume,device=vda,dest=volume,id=acd02b16-59c2-49b6-94be-a601fe3ee9da

| config_drive | True

Test Step 2: Add data to config drive
ssh to vm
mount | grep "/dev/hd" | awk '{print $3} '
python -m json.tool /media/hda/openstack/latest/meta_data.json | grep foo

Test Step 3: Check config_drive vm files on hypervisor after vm launch
Test Step 4: Cold migrate VM
Test Step 5: Check config drive after cold migrate VM
Test Step 6: Lock the compute host

Detailed execution log attached.

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-08-23:

#19

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags:

added: stx.3.0
removed: stx.2.0

Revision history for this message

Le, Huifeng (hle2) wrote on 2019-11-25:

#20

@fupingxie, could you please try with the latest build with train patch (e.g. ISO after 20191115) to see if you can reproduce this issue? Thanks!

Revision history for this message

Peng Peng (ppeng) wrote on 2019-11-29:

#21

ALL_NODES_20191126.001643.tar (200.1 MiB, application/x-tar)

Issue was reproduced on train
2019-11-21_20-00-00
wcp_3-6

log:
[2019-11-25 21:20:06,726] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0'

[2019-11-25 21:20:42,358] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show compute-0'
[2019-11-25 21:20:43,914] 433 DEBUG MainThread ssh.expect :: Output:
+-----------------------+--------------------------------------------------------------------------+
| Property | Value |
+-----------------------+--------------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

Revision history for this message

Peng Peng (ppeng) wrote on 2019-11-29:

#22

TIS_AUTOMATION.log (8.0 MiB, application/octet-stream)

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-12-12:

#23

As per review in the stx networking team meeting (2019-12-12), we agreed that this bug should still be fixed for stx.3.0, so raising the priority to High as only high priority bugs are considered for cherry-picking in maintenance releases.

Changed in starlingx:
importance:	Medium → High

Le, Huifeng (hle2) on 2019-12-12

Changed in starlingx:
assignee:	fupingxie (fpxie) → marvin Yu (marvin-yu)

Revision history for this message

marvin Yu (marvin-yu) wrote on 2019-12-18:

#24

Hi matt,
I tried to verify what you submitted, but the test shows that all DNS resolutions for names that are not within k8s go through dnsmasq.
The coredns resolv.conf file is shown below.
----------------------------------------------------------------------------------
[sysadmin@controller-0 ~(keystone_admin)]$ cat resolv.conf # this file copy from coredns pod.
nameserver 192.178.204.2 # dnsmasq listen on 192.178.204.2:53
nameserver 10.248.2.1
----------------------------------------------------------------------------------
coredns will use dnsmasq as an upstream DNS server when resolving domains that are not within k8s.
The host interface also receives some DNS requests when tcpdump listens on 192.178.204.2:53.
...
08:41:35.573214 IP controller-1.45569 > controller.domain: 3673+ A? compute-1. (27)
08:41:35.573350 IP controller.domain > controller-1.45569: 3673* 1/0/0 A 192.178.204.39 (43)
...

Is it possible that the problem is in dnsmasq? Do you have any suggestions?

Hi Peng,
Could you please try to reproduce this bug with the latest build? I've tried many times, but it's hard to reproduce this bug.
When you reproduce it in your environment, please check that the host, such as controller-0, can ping compute-0 or compute-1 directly.
This is to verify that dnsmasq is working. thx~
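
To make the upstream order easy to inspect when repeating this check, here is a minimal Python sketch that lists the nameservers from a resolv.conf in file order. The function name parse_nameservers and the SAMPLE text are illustrative only; the sample reuses the two lines shown above, including the inline annotation, which the sketch strips as a comment.

```python
def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in file order, ignoring comments."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop anything after a '#' or ';' comment marker.
        for marker in ("#", ";"):
            line = line.split(marker, 1)[0]
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Sample text copied from the coredns pod's resolv.conf quoted above.
SAMPLE = """\
nameserver 192.178.204.2 # dnsmasq listen on 192.178.204.2:53
nameserver 10.248.2.1
"""

if __name__ == "__main__":
    print(parse_nameservers(SAMPLE))
```

The first entry in the returned list is the one that an order-sensitive resolver would try first, so it should be the dnsmasq address.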

Revision history for this message

Matt Peters (mpeters-wrs) wrote on 2019-12-18:

#25

coredns is by default configured to use the proxy plugin with resolv.conf as the proxy target. resolv.conf has both dnsmasq (floating mgmt IP) and the public DNS servers, and the default policy for selecting a server for name resolution is "random" [1]. It is therefore possible that occasionally the request will fail (and be cached) when resolving DNS entries that are only resolvable via dnsmasq (host names). Furthermore, in a multi-node system, there are multiple instances of coredns that are used, each with the above random behavior.

The bug report indicates that this issue is not always reproducible, and that is because of the above behavior. The recommended setup of removing resolv.conf and using the floating mgmt IP for the proxy configuration ensures all requests go through dnsmasq.

[1] https://coredns.io/plugins/proxy/
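
The intermittent nature of the failure can be illustrated with a small Python simulation. This is only a sketch under simplifying assumptions: each query picks one upstream uniformly at random, a negative answer is never retried on another upstream, and caching and health checks are ignored. The addresses come from comment #24; the names pick_upstream and failure_rate are hypothetical and not part of coredns.

```python
import random

# Upstreams as listed in the coredns resolv.conf from comment #24.
DNSMASQ = "192.178.204.2"
EXTERNAL = "10.248.2.1"
UPSTREAMS = [DNSMASQ, EXTERNAL]

# Assumption: only dnsmasq can answer for platform host names such as compute-0.
KNOWS_PLATFORM_HOSTS = {DNSMASQ}


def pick_upstream(upstreams, policy, rng):
    """Pick the upstream for one query under a simplified policy model."""
    if policy == "random":
        return rng.choice(upstreams)
    if policy == "sequential":
        # Simplification: the first upstream is always healthy and always used.
        return upstreams[0]
    raise ValueError("unknown policy: %s" % policy)


def failure_rate(policy, trials=10000, seed=0):
    """Fraction of platform host name lookups sent to an upstream that cannot answer."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        if pick_upstream(UPSTREAMS, policy, rng) not in KNOWS_PLATFORM_HOSTS:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    for policy in ("random", "sequential"):
        print(policy, failure_rate(policy))
```

With two upstreams, roughly half of the lookups in this model land on the server that cannot resolve platform host names under the random policy, while none do when the dnsmasq address is always tried first.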

Revision history for this message

ChenjieXu (midone) wrote on 2019-12-19:

#26

Hi all,

A similar problem is reported here:
https://github.com/coredns/coredns/issues/2830

The default policy of the forward plugin, which proxies DNS messages to upstream resolvers, is random. This means that multiple upstreams are randomized (see policy) on first use. In this bug, 10.248.2.1 may be chosen, and 10.248.2.1 can't resolve compute-0.
https://github.com/coredns/coredns/tree/master/plugin/forward

Another possible solution: change the forward policy from random to sequential, which selects hosts in sequential order. (Note that 192.178.204.2 is the first DNS server in /etc/resolv.conf.)
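
As a rough illustration of that option, the Python sketch below renders a forward stanza with an explicit policy. The Corefile actually shipped by StarlingX is not shown in this thread, so the helper name forward_stanza and the surrounding layout are assumptions; only the forward plugin's policy values (random, round_robin, sequential) are taken from the plugin documentation linked above.

```python
# Policy names accepted by the coredns forward plugin.
ALLOWED_POLICIES = ("random", "round_robin", "sequential")


def forward_stanza(upstreams, policy="sequential"):
    """Build the text of a coredns forward stanza (hypothetical helper)."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError("unsupported policy: %s" % policy)
    if not upstreams:
        raise ValueError("at least one upstream is required")
    targets = " ".join(upstreams)
    return "forward . %s {\n    policy %s\n}" % (targets, policy)


if __name__ == "__main__":
    # Keep /etc/resolv.conf as the target, but honour its ordering.
    print(forward_stanza(["/etc/resolv.conf"]))
```

Keeping /etc/resolv.conf as the target preserves the external servers as a fallback, while the sequential policy makes the first listed server the preferred one.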

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-12-20: Fix proposed to ansible-playbooks (master)

#27

Fix proposed to branch: master
Review: https://review.opendev.org/700100

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2020-02-21:

#28

Joseph Richard will take this over as per agreement with Yong Hu

Changed in starlingx:
assignee:	marvin Yu (marvin-yu) → Joseph Richard (josephrichard)

Ghada Khalil (gkhalil) on 2020-03-05

tags:

added: stx.4.0

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-05-20:

#29

Fix proposed to branch: master
Review: https://review.opendev.org/729758

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-02: Fix merged to ansible-playbooks (master)

#30

Reviewed: https://review.opendev.org/729758
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=7ecbdadbfb33f407d87e5eb4458e92b86c1c6fb7
Submitter: Zuul
Branch: master

commit 7ecbdadbfb33f407d87e5eb4458e92b86c1c6fb7
Author: Joseph Richard <email address hidden>
Date: Tue May 19 15:30:45 2020 -0400

Use sequential forward policy in coredns

    If possible, dns should be resolved through dnsmasq, in order to allow
    proper resolution of platform (e.g. controller) hostnames, which would
    fail to resolve from external nameservers.

    Partial-Bug: 1821026
    Change-Id: I4f5cdb7ac79dfe19626623adb5622645cf8569ab
    Signed-off-by: Joseph Richard <email address hidden>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-02: Related fix merged to stx-puppet (master)

#31

Reviewed: https://review.opendev.org/732910
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=9dc80285641d03e4abf7a2469c0a19e6f557d444
Submitter: Zuul
Branch: master

commit 9dc80285641d03e4abf7a2469c0a19e6f557d444
Author: Matt Peters <email address hidden>
Date: Tue Jun 2 09:59:24 2020 -0500

Fix host name resolution for AIO-SX IPV6

    dnsmasq is not processing DNS requests sent to the UDP port 53 when binding
    to the loopback interface on an IPv6 system. The requests are processed
    correctly if dnsmasq is explicitly configured to listen for the management
    address via the listen_address parameter.

Closes-Bug: 1881772
Related-Bug: 1821026

Change-Id: I47f733d2a35c946acd2952efd246a973826e8114
Signed-off-by: Matt Peters <email address hidden>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-09: Change abandoned on ansible-playbooks (master)

#32

Change abandoned by Matt Peters (<email address hidden>) on branch: master
Review: https://review.opendev.org/700100
Reason: This has already been fixed by:
https://review.opendev.org/#/c/729758/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-12: Fix proposed to stx-puppet (master)

#33

Fix proposed to branch: master
Review: https://review.opendev.org/735278

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-18: Fix proposed to config (master)

#34

Fix proposed to branch: master
Review: https://review.opendev.org/736797

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-18: Change abandoned on stx-puppet (master)

#35

Change abandoned by Joseph Richard (<email address hidden>) on branch: master
Review: https://review.opendev.org/735278
Reason: Ran into issues with update to handle with simplex upgrades. Abandoning and moving this change to an upgrade script.
See https://review.opendev.org/#/c/736797/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-19: Fix merged to config (master)

#36

Reviewed: https://review.opendev.org/736797
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=7639db0d71f53c147031c7edbfd530530a496cd6
Submitter: Zuul
Branch: master

commit 7639db0d71f53c147031c7edbfd530530a496cd6
Author: Joseph Richard <email address hidden>
Date: Thu Jun 18 12:00:42 2020 -0400

Use sequential forward policy in coredns

    If possible, dns should be resolved through dnsmasq, in order to allow
    proper resolution of platform (e.g. controller) hostnames, which would
    fail to resolve from external nameservers.

    This commit handles setting sequential policy over an upgrade.

    Closes-Bug: 1821026
    Change-Id: Ib9b09bcfe2b84226ef25cfaaa2fa9d1f8051409e
    Signed-off-by: Joseph Richard <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2020-06-21:

#37

Fixes are merged in stx master and will be included in the stx.4.0 release. Given no users have raised an issue with this when using stx.3.0, the plan is not to port back the changes due to complexity.

tags:

removed: stx.3.0

Revision history for this message

Peng Peng (ppeng) wrote on 2020-10-13:

#38

TC: test_vm_with_config_drive, all passed on recent loads.

tags:

removed: stx.retestneeded

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-11-16: Related fix proposed to stx-puppet (f/centos8)

#39

Related fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919

StarlingX

Containers: Resolving hostname fails within nova containers leading to config_drive VM migration failures
