Ironic driver hash ring treats hostnames differing only by case as different hostnames

Bug #1866380 reported by melanie witt
This bug affects 1 person
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  Low         melanie witt
Pike                      Fix Released  Low         Elod Illes
Queens                    Fix Released  Low         melanie witt
Rocky                     Fix Released  Low         melanie witt
Stein                     Fix Released  Low         melanie witt
Train                     Fix Released  Low         melanie witt

Bug Description

Recently we had a customer case where attempts to add new ironic nodes to an existing undercloud resulted in half of the nodes failing to be detected and added to nova. The Ironic API returned all of the newly added nodes when called by the driver, but the driver returned only half of them to the compute manager.

There was only one nova-compute service managing all of the ironic nodes in this typical all-in-one undercloud deployment.

After days of investigation and examination of a database dump from the customer, we noticed that at some point the customer had changed the hostname of the machine from something containing uppercase letters to the same name but all lowercase. The nova-compute service record had the mixed case name and the CONF.host (socket.gethostname()) had the lowercase name.

The hash ring logic adds all of the nova-compute service hostnames plus CONF.host to the hash ring. The ironic driver then reports only the nodes it owns, which it determines by retrieving a service hostname from the ring based on a hash of each ironic node UUID.

Because of the machine hostname change, the hash ring contained, for example, {'MachineHostName', 'machinehostname'} when it should have contained only one hostname. With two hostnames in the ring, roughly half of the node UUIDs hashed to the stale mixed-case name, so the driver recognized only the other half as nodes that it owned. As a result, half of the new nodes were excluded and not added as new compute nodes.
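
The effect described above can be reproduced with a small, self-contained sketch. This is not nova's implementation: `SimpleHashRing`, `owned_nodes`, the replica count, and the generated node UUIDs are all hypothetical names and values chosen for illustration, and the ring here is a simplified consistent hash built on `hashlib`. It only shows that two case-variant hostnames in the ring partition the nodes between them.

```python
import bisect
import hashlib
import uuid


def _hash(key):
    # Map a string to a large integer position on the ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class SimpleHashRing:
    """Simplified consistent hash ring (illustrative, not nova's code)."""

    def __init__(self, hosts, replicas=32):
        # Each host gets several points on the ring to even out the split.
        self._points = sorted(
            (_hash("%s-%d" % (host, i)), host)
            for host in set(hosts)
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def get_host(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._points[idx][1]


def owned_nodes(ring, node_uuids, my_host):
    # Exact (case-sensitive) match, as in the behaviour the report describes.
    return [n for n in node_uuids if ring.get_host(n) == my_host]


# Hypothetical data: 100 deterministic node UUIDs and the two hostnames.
node_uuids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, "node-%d" % i))
              for i in range(100)]
service_hosts = {"MachineHostName"}   # stale service record
conf_host = "machinehostname"         # current CONF.host

buggy_ring = SimpleHashRing(service_hosts | {conf_host})
owned = owned_nodes(buggy_ring, node_uuids, conf_host)
orphaned = owned_nodes(buggy_ring, node_uuids, "MachineHostName")

print("owned by CONF.host:", len(owned))
print("assigned to stale name:", len(orphaned))
```

Every node lands on exactly one of the two names, so whatever is assigned to the stale mixed-case name is never reported by the only running service.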

I propose adding some logging to the driver related to the hash ring to help with debugging in the future.

Tags: ironic
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/711680

Changed in nova:
status: New → In Progress
melanie witt (melwitt)
summary: - Difficult to debug unexpected ironic driver behavior related to
- available nodes
+ Ironic driver hash ring treats hostnames differing only by case as
+ different hostnames
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/711680
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7145100ee4e732caa532d614e2149ef2a545287a
Submitter: Zuul
Branch: master

commit 7145100ee4e732caa532d614e2149ef2a545287a
Author: melanie witt <email address hidden>
Date: Fri Mar 6 17:05:28 2020 +0000

    Lowercase ironic driver hash ring and ignore case in cache

    Recently we had a customer case where attempts to add new ironic nodes
    to an existing undercloud resulted in half of the nodes failing to be
    detected and added to nova. Ironic API returned all of the newly added
    nodes when called by the driver, but half of the nodes were not
    returned to the compute manager by the driver.

    There was only one nova-compute service managing all of the ironic
    nodes of the all-in-one typical undercloud deployment.

    After days of investigation and examination of a database dump from the
    customer, we noticed that at some point the customer had changed the
    hostname of the machine from something containing uppercase letters to
    the same name but all lowercase. The nova-compute service record had
    the mixed case name and the CONF.host (socket.gethostname()) had the
    lowercase name.

    The hash ring logic adds all of the nova-compute service hostnames plus
    CONF.host to hash ring, then the ironic driver reports only the nodes
    it owns by retrieving a service hostname from the ring based on a hash
    of each ironic node UUID.

    Because of the machine hostname change, the hash ring contained, for
    example: {'MachineHostName', 'machinehostname'} when it should have
    contained only one hostname. And because the hash ring contained two
    hostnames, the driver was able to retrieve only half of the nodes as
    nodes that it owned. So half of the new nodes were excluded and not
    added as new compute nodes.

    This adds lowercasing of hosts that are added to the hash ring and
    ignores case when comparing the CONF.host to the hash ring members
    to avoid unnecessary pain and confusion for users that make hostname
    changes that are otherwise functionally harmless.

    This also adds logging of the set of hash ring members at level DEBUG
    to help enable easier debugging of hash ring related situations.

    Closes-Bug: #1866380

    Change-Id: I617fd59de327de05a198f12b75a381f21945afb0
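
The approach described in the commit message (lowercase the hosts added to the ring, ignore case when comparing against CONF.host, and log the members at DEBUG) might look roughly like the sketch below. It is not the merged patch: `build_ring_members`, `is_local_host`, and the log message text are hypothetical, and only the standard `logging` module is used.

```python
import logging

LOG = logging.getLogger(__name__)


def build_ring_members(service_hosts, conf_host):
    """Return the lowercased set of hostnames to place on the hash ring."""
    members = {host.lower() for host in service_hosts}
    members.add(conf_host.lower())
    # Logging the members makes a duplicated or stale hostname easy to spot.
    LOG.debug("Hash ring members are %s", sorted(members))
    return members


def is_local_host(ring_host, conf_host):
    """Compare a ring member to CONF.host without regard to case."""
    return ring_host.lower() == conf_host.lower()


# Hypothetical values matching the example in the report.
members = build_ring_members({"MachineHostName"}, "machinehostname")
print(members)
```

With both names normalized, the ring collapses to a single member, so a case-only hostname change no longer splits node ownership.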

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/713739

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/713739
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=588b0484bf6f5fe41514f1428aeaf5613635e35a
Submitter: Zuul
Branch: stable/train

commit 588b0484bf6f5fe41514f1428aeaf5613635e35a
Author: melanie witt <email address hidden>
Date: Fri Mar 6 17:05:28 2020 +0000

    Lowercase ironic driver hash ring and ignore case in cache

    Closes-Bug: #1866380

    Change-Id: I617fd59de327de05a198f12b75a381f21945afb0
    (cherry picked from commit 7145100ee4e732caa532d614e2149ef2a545287a)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/713982

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/713982
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8f8667a8dd0e453eaef8f75a3fff25db62d4cc17
Submitter: Zuul
Branch: stable/stein

commit 8f8667a8dd0e453eaef8f75a3fff25db62d4cc17
Author: melanie witt <email address hidden>
Date: Fri Mar 6 17:05:28 2020 +0000

    Lowercase ironic driver hash ring and ignore case in cache

    Closes-Bug: #1866380

    Change-Id: I617fd59de327de05a198f12b75a381f21945afb0
    (cherry picked from commit 7145100ee4e732caa532d614e2149ef2a545287a)
    (cherry picked from commit 588b0484bf6f5fe41514f1428aeaf5613635e35a)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/723050

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/723054

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/723055

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/723050
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=019e3da75bc6fb171b32a012ce339075fe690ca7
Submitter: Zuul
Branch: stable/rocky

commit 019e3da75bc6fb171b32a012ce339075fe690ca7
Author: melanie witt <email address hidden>
Date: Fri Mar 6 17:05:28 2020 +0000

    Lowercase ironic driver hash ring and ignore case in cache

    Closes-Bug: #1866380

    Conflicts:
        nova/virt/ironic/driver.py

    NOTE(melwitt): Conflict is because change
    I1b184ff37948dc403fe38874613cd4d870c644fd is not in Rocky.

    Change-Id: I617fd59de327de05a198f12b75a381f21945afb0
    (cherry picked from commit 7145100ee4e732caa532d614e2149ef2a545287a)
    (cherry picked from commit 588b0484bf6f5fe41514f1428aeaf5613635e35a)
    (cherry picked from commit 8f8667a8dd0e453eaef8f75a3fff25db62d4cc17)

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/723054
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=620e5da840e50aa8a61030b10081821dc7653b94
Submitter: Zuul
Branch: stable/queens

commit 620e5da840e50aa8a61030b10081821dc7653b94
Author: melanie witt <email address hidden>
Date: Fri Mar 6 17:05:28 2020 +0000

    Lowercase ironic driver hash ring and ignore case in cache

    Closes-Bug: #1866380

    Change-Id: I617fd59de327de05a198f12b75a381f21945afb0
    (cherry picked from commit 7145100ee4e732caa532d614e2149ef2a545287a)
    (cherry picked from commit 588b0484bf6f5fe41514f1428aeaf5613635e35a)
    (cherry picked from commit 8f8667a8dd0e453eaef8f75a3fff25db62d4cc17)
    (cherry picked from commit 019e3da75bc6fb171b32a012ce339075fe690ca7)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
