Upgrade activate reports an error related to removing platform-nfs-ip-address when activate is executed multiple times

Bug #2015392 reported by Fabiano Correa Mercer
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Fabiano Correa Mercer

Bug Description

Brief Description
-----------------
During the upgrade from Release 7 to Release 8, if upgrade-activate fails and has to be executed
again, the runtime manifest

platform::network::update_platform_nfs_ip_references

can be called a second time. On that second run it can fail because of an empty variable:

 $plat_nfs_ip = $::platform::network::mgmt::params::platform_nfs_address

This happens when the manifest has already been executed once and the platform_nfs_address entry
was removed from system.yaml but not from the database.
When upgrade-activate is called again, the check that verifies whether platform-nfs-ip is still in the
database returns TRUE and the runtime manifest executes again, but this time $plat_nfs_ip is empty.
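
To illustrate the failure mode, here is a minimal shell sketch (not the actual Puppet code) of what happens when a command string is built by interpolating an empty address. The helper name build_del_cmd is hypothetical; the address, prefix length and interface are the ones that appear in the logs of this report.

```shell
# Hypothetical helper: builds the delete command by placing the address
# in front of the prefix length, as a template would.
build_del_cmd() {
    addr="$1"
    prefix="$2"
    dev="$3"
    printf 'ip addr del %s/%s dev %s\n' "$addr" "$prefix" "$dev"
}

# First run: the address is still defined.
build_del_cmd "192.168.30.166" 29 vlan603

# Second run: the address is empty, so only "/29" is left.
build_del_cmd "" 29 vlan603
```

The second call prints "ip addr del /29 dev vlan603", which is the command that fails in the logs.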

Severity
--------
Major: Upgrade cannot be completed

Steps to Reproduce
------------------
It is not easy to reproduce; I had to change the code to simulate the error.
Using AIO-DX:
Start an upgrade from an old CentOS release (e.g. Rel 7) to the new Debian Rel 8.
Upgrade controller-1 and controller-0.
Before the activate, change some config to force a failure during the activate
(I changed network.pp and set $plat_nfs_ip to undef and also to '').
Run the upgrade-activate.
Check whether the Ruby error occurs.

Expected Behavior
------------------
The runtime manifest platform::network::update_platform_nfs_ip_references must execute without problems
and must be able to run again without returning an error.

Actual Behavior
----------------
The runtime manifest platform::network::update_platform_nfs_ip_references executes correctly the first time,
but fails if it runs again.

Reproducibility
---------------
seen once.

System Configuration
--------------------
AIO-DX, IPv4

Branch/Pull Time/Commit
-----------------------
This issue is a side effect of the change: https://bugs.launchpad.net/starlingx/+bug/2012387

Last Pass
---------
N/A

Timestamp/Logs
--------------
The first time, the commands executed correctly:
2023-04-01T23:31:29.203 Debug: 2023-04-01 23:31:29 +0000 Executing: 'sm-deprovision service-group-member controller-services platform-nfs-ip --apply'
2023-04-01T23:31:30.372 Notice: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Deprovision platform-nfs-ip (service-group-member platform-nfs-ip)]/returns: executed successfully
2023-04-01T23:31:30.374 Debug: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Deprovision platform-nfs-ip (service-group-member platform-nfs-ip)]: The container Class[Platform::Network::Update_platform_nfs_ip_references] will propagate my refresh event
2023-04-01T23:31:30.375 Debug: 2023-04-01 23:31:30 +0000 Exec[Deprovision Platform-NFS IP service in SM (service platform-nfs-ip)](provider=posix): Executing 'sm-deprovision service platform-nfs-ip'
2023-04-01T23:31:30.377 Debug: 2023-04-01 23:31:30 +0000 Executing: 'sm-deprovision service platform-nfs-ip'
2023-04-01T23:31:30.541 Notice: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Deprovision Platform-NFS IP service in SM (service platform-nfs-ip)]/returns: executed successfully
2023-04-01T23:31:30.543 Debug: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Deprovision Platform-NFS IP service in SM (service platform-nfs-ip)]: The container Class[Platform::Network::Update_platform_nfs_ip_references] will propagate my refresh event
2023-04-01T23:31:30.545 Debug: 2023-04-01 23:31:30 +0000 Exec[Removing Plaform NFS IP address from interface: vlan603](provider=posix): Executing check 'ip -br addr show dev vlan603 2>/dev/null | grep '192.168.30.166/29' 1>/dev/null'
2023-04-01T23:31:30.547 Debug: 2023-04-01 23:31:30 +0000 Executing: 'ip -br addr show dev vlan603 2>/dev/null | grep '192.168.30.166/29' 1>/dev/null'
2023-04-01T23:31:30.549 Debug: 2023-04-01 23:31:30 +0000 Exec[Removing Plaform NFS IP address from interface: vlan603](provider=posix): Executing 'ip addr del 192.168.30.166/29 dev vlan603'
2023-04-01T23:31:30.551 Debug: 2023-04-01 23:31:30 +0000 Executing: 'ip addr del 192.168.30.166/29 dev vlan603'
2023-04-01T23:31:30.553 Notice: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Removing Plaform NFS IP address from interface: vlan603]/returns: executed successfully
2023-04-01T23:31:30.555 Debug: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Removing Plaform NFS IP address from interface: vlan603]: The container Class[Platform::Network::Update_platform_nfs_ip_references] will propagate my refresh event
2023-04-01T23:31:30.560 Debug: 2023-04-01 23:31:30 +0000 Exec[Removing Plaform NFS IP address from /22.12/dnsmasq.hosts](provider=posix): Executing 'sed -i '/controller-platform-nfs/d' /opt/platform/config/22.12/dnsmasq.hosts'
2023-04-01T23:31:30.562 Debug: 2023-04-01 23:31:30 +0000 Executing: 'sed -i '/controller-platform-nfs/d' /opt/platform/config/22.12/dnsmasq.hosts'
2023-04-01T23:31:30.564 Notice: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Removing Plaform NFS IP address from /22.12/dnsmasq.hosts]/returns: executed successfully
2023-04-01T23:31:30.566 Debug: 2023-04-01 23:31:30 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Removing Plaform NFS IP address from /22.12/dnsmasq.hosts]: The container Class[Platform::Network::Update_platform_nfs_ip_references] will propagate my refresh event
2023-04-01T23:31:30.571 Debug: 2023-04-01 23:31:30 +0000 Exec[Removing Plaform NFS IP address from /22.12/hieradata/system.yaml](provider=posix): Executing 'sed -i '/platform_nfs_address/d' /opt/platform/puppet/22.12/hieradata/system.yaml'
2023-04-01T23:31:30.573 Debug: 2023-04-01 23:31:30 +0000 Executing: 'sed -i '/platform_nfs_address/d' /opt/platform/puppet/22.12/hieradata/system.yaml'

Then the manifest is called again and fails:

2023-04-01T23:31:46.407 Notice: 2023-04-01 23:31:46 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Deprovision Platform-NFS IP service in SM (service platform-nfs-ip)]/returns: executed successfully
2023-04-01T23:31:46.408 Debug: 2023-04-01 23:31:46 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Deprovision Platform-NFS IP service in SM (service platform-nfs-ip)]: The container Class[Platform::Network::Update_platform_nfs_ip_references] will propagate my refresh event
2023-04-01T23:31:46.410 Debug: 2023-04-01 23:31:46 +0000 Exec[Removing Plaform NFS IP address from interface: vlan603](provider=posix): Executing check 'ip -br addr show dev vlan603 2>/dev/null | grep '/29' 1>/dev/null'
2023-04-01T23:31:46.412 Debug: 2023-04-01 23:31:46 +0000 Executing: 'ip -br addr show dev vlan603 2>/dev/null | grep '/29' 1>/dev/null'
2023-04-01T23:31:46.414 Debug: 2023-04-01 23:31:46 +0000 Exec[Removing Plaform NFS IP address from interface: vlan603](provider=posix): Executing 'ip addr del /29 dev vlan603'
2023-04-01T23:31:46.415 Debug: 2023-04-01 23:31:46 +0000 Executing: 'ip addr del /29 dev vlan603'
2023-04-01T23:31:46.418 Notice: 2023-04-01 23:31:46 +0000 /Stage[main]/Platform::Network::Update_platform_nfs_ip_references/Exec[Removing Plaform NFS IP address from interface: vlan603]/returns: Error: any valid prefix is expected rather than "/29".
2023-04-01T23:31:46.419 Error: 2023-04-01 23:31:46 +0000 'ip addr del /29 dev vlan603' returned 1 instead of one of [0]
2023-04-01T23:31:46.421 /usr/lib/ruby/vendor_ruby/puppet/util/errors.rb:157:in `fail'
2023-04-01T23:31:46.423 /usr/lib/ruby/vendor_ruby/puppet/type/exec.rb:168:in `sync'
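
The second log excerpt suggests why the pre-check did not prevent the failure: with an empty address the grep pattern shrinks to '/29', which appears to match any other address on the interface with the same prefix length, so the check passes and the delete command runs. A minimal sketch of that effect, assuming a made-up line of `ip -br addr` output (the address 192.168.30.163 and the function name nfs_ip_present are hypothetical):

```shell
# Hypothetical sample of `ip -br addr show dev vlan603` output after the
# NFS address has already been removed; the remaining address is made up.
sample='vlan603@eno1 UP 192.168.30.163/29'

# Mirrors the check in the log: grep for "<address>/<prefix>".
nfs_ip_present() {
    printf '%s\n' "$sample" | grep "$1/$2" >/dev/null
}

# Full pattern: the removed NFS address is not found.
if nfs_ip_present "192.168.30.166" 29; then echo "match"; else echo "no match"; fi

# Empty address: the pattern is just "/29" and matches the other address.
if nfs_ip_present "" 29; then echo "match"; else echo "no match"; fi
```

With this sample, the first check reports no match and the second reports a match, so a check built from an empty address cannot be relied on to skip the command.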

Test Activity
-------------
Regression Testing

Workaround
----------
Abort the upgrade and start again.

Changed in starlingx:
assignee: nobody → Fabiano Correa Mercer (fcorream)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/879683

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/879684

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/879684
Committed: https://opendev.org/starlingx/stx-puppet/commit/27f08b3e31e935267a83cf643a7b924383982e79
Submitter: "Zuul (22348)"
Branch: master

commit 27f08b3e31e935267a83cf643a7b924383982e79
Author: Fabiano Mercer <email address hidden>
Date: Wed Apr 5 17:04:41 2023 -0300

    Ignore platform_nfs cmds if plat_nfs_ip is empty

    During an upgrade from Rel. 7 to Rel. 8 the upgrade-activate failed
    and after some time it was executed again. In the second
    upgrade-activate execution, the update_platform_nfs_ip_references
    was called again, but it failed because plat_nfs_ip was empty.
    It happened because this function was already executed once and at
    the end of it the plat_nfs_ip is removed from system.yaml
    So if this variable is empty, it means that upgrade-activate was
    called again and there is nothing to do.

    Test plan
    PASS AIO-DX upgrade from Rel 7 to 8
    PASS manual change in this function changing the plat_nfs_ip
         value to:
         $plat_nfs_ip = undef
         $plat_nfs_ip = ''
         And executing the upgrade-activate and confirm that
         commands were not executed

    Partial-Bug: #2015392

    Signed-off-by: Fabiano Mercer <email address hidden>
    Change-Id: I1e9a9e31ab33b1d88125daf1195e8ac14b104e17
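
The idea behind this commit, skipping the platform_nfs commands when the address is empty, can be sketched in shell as follows. This is only an analogy under stated assumptions: the actual change is in the Puppet class, and the function name remove_nfs_ip and its dry-run behaviour (printing instead of executing) are hypothetical.

```shell
# Hypothetical wrapper illustrating the guard; it prints the command
# instead of running it, so it is safe to try on any machine.
remove_nfs_ip() {
    addr="$1"
    prefix="$2"
    dev="$3"
    # An empty address means a previous run already cleaned up.
    if [ -z "$addr" ]; then
        echo "platform NFS address is empty, nothing to do"
        return 0
    fi
    echo "ip addr del ${addr}/${prefix} dev ${dev}"
}

remove_nfs_ip "192.168.30.166" 29 vlan603
remove_nfs_ip "" 29 vlan603
```

Returning success on the empty case is what makes a repeated upgrade-activate harmless: the second call becomes a no-op rather than an error.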

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/879683
Committed: https://opendev.org/starlingx/config/commit/dfc10a34a431952026495b280b2b199bf7e59e0b
Submitter: "Zuul (22348)"
Branch: master

commit dfc10a34a431952026495b280b2b199bf7e59e0b
Author: Fabiano Mercer <email address hidden>
Date: Wed Apr 5 16:51:49 2023 -0300

    Log and exception handling for platform_nfs_ip

    Adding log and exception handling for update_platform_nfs_ip_references
    to confirm if the execution was correct.
    During an upgrade from Rel. 7 to Rel. 8 the upgrade_activate failed
    and after some time it was executed again. In the second
    upgrade_activate execution, the update_platform_nfs_ip_references
    was called again, but this is not expected because, according to the
    logs, it was already executed and there is a condition where it just
    runs if the platform_nfs_ip is present in the database. Please note
    that at the end of this function, the platform_nfs IP is removed by
    address_destroy.

    Test plan
    PASS AIO-DX upgrade from Rel 7 to 8

    Partial-Bug: #2015392

    Signed-off-by: Fabiano Mercer <email address hidden>
    Change-Id: I8a162d372c04b26cc3a4db3fc45528678f9989a2

Ghada Khalil (gkhalil) wrote :

Fixed by the above two commits, so marking as Fix Released

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.networking stx.update
Changed in starlingx:
importance: Medium → Low
status: In Progress → Fix Released