Backup & Restore HTTPS: helm repo add command failed due to lighttpd.service after active controller restore

Bug #1845504 reported by Senthil Mukundakumar
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------
The storage lab was converted from http to https and a backup operation was performed. After the active controller was restored and unlocked, it failed to become active.

The “helm repo add” command failed because lighttpd.service was not up. When we tried to bring it up manually, it complained that the file /etc/ssl/private/server-cert.pem didn’t exist.

The “helm repo add” command is failing:

2019-09-25T15:56:56.456 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/File[/opt/platform/helm_charts]/ensure: created
2019-09-25T15:56:56.458 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/File[/opt/platform/helm_charts]: The container Class[Platform::Helm] will propagate my refresh event
2019-09-25T15:56:56.460 Debug: 2019-09-25 15:56:56 +0000 Exec[restart lighttpd for helm](provider=posix): Executing 'systemctl restart lighttpd.service'
2019-09-25T15:56:56.462 Debug: 2019-09-25 15:56:56 +0000 Executing: 'systemctl restart lighttpd.service'
2019-09-25T15:56:56.543 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/Exec[restart lighttpd for helm]/returns: executed successfully
2019-09-25T15:56:56.545 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/Exec[restart lighttpd for helm]: The container Class[Platform::Helm] will propagate my refresh event
2019-09-25T15:56:56.547 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/File[/www/pages/helm_charts/starlingx]/ensure: created
2019-09-25T15:56:56.549 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/File[/www/pages/helm_charts/starlingx]: The container Platform::Helm::Repository[starlingx] will propagate my refresh event
2019-09-25T15:56:56.551 Debug: 2019-09-25 15:56:56 +0000 Exec[Generate index: /www/pages/helm_charts/starlingx](provider=posix): Executing 'helm repo index /www/pages/helm_charts/starlingx'
2019-09-25T15:56:56.554 Debug: 2019-09-25 15:56:56 +0000 Executing with uid=www gid=www: 'helm repo index /www/pages/helm_charts/starlingx'
2019-09-25T15:56:56.598 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Generate index: /www/pages/helm_charts/starlingx]/returns: executed successfully
2019-09-25T15:56:56.600 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Generate index: /www/pages/helm_charts/starlingx]: The container Platform::Helm::Repository[starlingx] will propagate my refresh event
2019-09-25T15:56:56.602 Debug: 2019-09-25 15:56:56 +0000 Exec[Adding StarlingX helm repo: starlingx](provider=posix): Executing 'helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx'
2019-09-25T15:56:56.604 Debug: 2019-09-25 15:56:56 +0000 Executing with uid=sysadmin gid=sys_protected: 'helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx'
2019-09-25T15:56:56.644 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Adding StarlingX helm repo: starlingx]/returns: Error: Looks like "http://127.0.0.1:8080/helm_charts/starlingx" is not a valid chart repository or cannot be reached: Get http://127.0.0.1:8080/helm_charts/starlingx/index.yaml: dial tcp 127.0.0.1:8080: connect: connection refused
2019-09-25T15:56:56.646 Error: 2019-09-25 15:56:56 +0000 helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx returned 1 instead of one of [0]

controller-0:/var/log/puppet# systemctl restart lighttpd.service

controller-0:/var/log/puppet# systemctl status lighttpd.service
● lighttpd.service - Lightning Fast Webserver With Light System Requirements
   Loaded: loaded (/usr/lib/systemd/system/lighttpd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-09-25 18:28:37 UTC; 9s ago
  Process: 135381 ExecStart=/usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf (code=exited, status=255)
Main PID: 135381 (code=exited, status=255)

controller-0:/var/log/puppet# /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
2019-09-25 18:28:54: (configfile.c.59) Warning: please add "mod_openssl" to server.modules list in lighttpd.conf. A future release of lighttpd 1.4.x *will not* automatically load mod_openssl and lighttpd *will not* use SSL/TLS where your lighttpd.conf contains ssl.* directives
2019-09-25 18:28:54: (mod_openssl.c.434) SSL: BIO_read_filename('/etc/ssl/private/server-cert.pem') failed
2019-09-25 18:28:54: (server.c.1191) Initialization of plugins failed. Going down.

controller-0:/var/log/puppet# ls /etc/ssl/private/server-cert.pem
ls: cannot access /etc/ssl/private/server-cert.pem: No such file or directory
controller-0:/var/log/puppet# ls /etc/ssl/private/
openstack registry-cert.crt registry-cert.key registry-cert-pkcs1.key self-signed-server-cert.pem
controller-0:/var/log/puppet# ls /opt/platform/

This lab was converted from http to https; it was not initially installed as an https lab. When https is enabled, the certificate “server-cert.pem” is generated in /etc/ssl/private/. Even though this certificate file is backed up, it is not restored during platform restore.
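One way to confirm that the certificate really is in the backup is to list the archive members under etc/ssl/private. The following is a minimal Python sketch, not part of the StarlingX tooling. It assumes the platform backup is a tar archive whose member names are relative paths such as etc/ssl/private/server-cert.pem (which is what the restore task's pattern suggests). The function names and the sample archive are hypothetical.

```python
import io
import tarfile

PRIVATE_DIR = "etc/ssl/private/"  # assumed member prefix inside the backup


def private_files_in_backup(fileobj):
    """Return the sorted names of regular files under etc/ssl/private."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        return sorted(
            m.name for m in tar.getmembers()
            if m.isfile() and m.name.startswith(PRIVATE_DIR)
        )


def make_sample_archive(names):
    """Build an in-memory tar archive of empty files (illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name=name))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    sample = make_sample_archive([
        "etc/ssl/private/server-cert.pem",
        "etc/ssl/private/registry-cert.crt",
        "etc/hosts",
    ])
    for name in private_files_in_backup(sample):
        print(name)
```

For a real backup, open the archive with open(path, "rb") and pass the resulting file object to private_files_in_backup.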

The code that needs to be modified is in stx/ansible-playbooks/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/setup_registry_certificate_and_keys.yml:

- block:
  - name: Restore certificate and key files
    command: >-
      tar -C /etc/ssl/private -xpf {{ target_backup_dir }}/{{ backup_filename }} --transform='s,.*/,,'
      'etc/ssl/private/registry-cert*'
    args:
      warn: false

  when: mode == 'restore'

The member pattern 'etc/ssl/private/registry-cert*' should be changed to 'etc/ssl/private/*cert*'.
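To illustrate why the narrower pattern skips server-cert.pem, here is a small Python sketch that emulates the extraction: it selects members by pattern and writes them into the destination directory without their leading directories, which is what --transform='s,.*/,,' does. This is an approximation under stated assumptions, not the playbook's implementation. Python's fnmatch stands in for tar's wildcard matching, and the sample archive contents and helper names are hypothetical.

```python
import fnmatch
import io
import os
import tarfile
import tempfile

OLD_PATTERN = "etc/ssl/private/registry-cert*"  # pattern in the current task
NEW_PATTERN = "etc/ssl/private/*cert*"          # pattern proposed in this report

# Hypothetical backup contents, modelled on the file names seen in this bug.
SAMPLE = [
    "etc/ssl/private/registry-cert.crt",
    "etc/ssl/private/registry-cert.key",
    "etc/ssl/private/server-cert.pem",
    "etc/hosts",
]


def build_sample_backup(names):
    """Return an in-memory tar archive with one small file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = name.encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def restore_flat(archive, dest, pattern):
    """Extract members matching pattern into dest, dropping directories."""
    restored = []
    with tarfile.open(fileobj=archive, mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile() and fnmatch.fnmatchcase(member.name, pattern):
                base = os.path.basename(member.name)
                with tar.extractfile(member) as src, \
                        open(os.path.join(dest, base), "wb") as out:
                    out.write(src.read())
                restored.append(base)
    return sorted(restored)


def demo(pattern):
    """Restore the sample backup into a temporary directory."""
    with tempfile.TemporaryDirectory() as dest:
        return restore_flat(build_sample_backup(SAMPLE), dest, pattern)


if __name__ == "__main__":
    print("old pattern restores:", demo(OLD_PATTERN))
    print("new pattern restores:", demo(NEW_PATTERN))
```

With the sample names above, only the wider pattern selects server-cert.pem, the file lighttpd was unable to read in this report.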

It seems that when the lab is initially installed as an https lab, we don’t see this issue (we were able to unlock controller-0 but failed to unlock the other nodes, LP-1844828). Does that mean the certificate “server-cert.pem” is installed automatically in /etc/ssl/private without needing to restore it?

Severity
--------
Critical: Controller-0 failed to become active after unlock

Steps to Reproduce
------------------
1. Create an environment for ansible remote host
2. Bring up the regular system with storage
3. Backup the system using ansible remotely
4. Re-install the controller with the same load
5. Restore the system using ansible remotely.
6. Unlock the active controller

Expected Behavior
------------------
The controller-0 should become Active after unlock

Actual Behavior
----------------
Controller-0 failed to become active after unlock

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Regular system with storage

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-09-24_15-36-38"

Test Activity
-------------
Feature Testing

description: updated
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/685390
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9930bafd71b61f42c3d6d5956cb96f3de1aa7bf2
Submitter: Zuul
Branch: master

commit 9930bafd71b61f42c3d6d5956cb96f3de1aa7bf2
Author: Wei Zhou <email address hidden>
Date: Fri Sep 27 11:22:57 2019 -0400

    Backup & Restore: Failed to unlock controller-0 after platform restore

    This issue happened only when https is enabled.

    Controller-0 failed to unlock because lighttpd.service was not up and
    that caused "helm repo add" command to fail when applying controller
    manifest. Because https is enabled, lighttpd service needs to access
    server-cert.pem certificate to start. Even though this certification
    is backed up but it is not restored during platform restore.

    Change-Id: I0d7915bc95064974675614be8eb4b15cb091f684
    Closes-Bug: 1845504
    Signed-off-by: Wei Zhou <email address hidden>

Changed in starlingx:
status: New → Fix Released
Yang Liu (yliu12)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote : Re: Backup & Restore: helm repo add command failed due to lighttpd.service after active controller restore

Marking as stx.3.0 / high priority as B&R is a feature deliverable for the release.

tags: added: stx.3.0 stx.update
Changed in starlingx:
importance: Undecided → High
Yang Liu (yliu12)
summary: - Backup & Restore: helm repo add command failed due to lighttpd.service
- after active controller restore
+ Backup & Restore HTTPS: helm repo add command failed due to
+ lighttpd.service after active controller restore
Senthil Mukundakumar (smukunda) wrote :

Verified using load 2019-10-20_20-00-00

tags: removed: stx.retestneeded