StarlingX

Backup & Restore HTTPS: helm repo add command failed due to lighttpd.service after active controller restore

Bug #1845504 reported by Senthil Mukundakumar on 2019-09-26

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Ovidiu Poncea

Bug Description

Brief Description
-----------------
The storage lab was converted from http to https, and then a backup operation was performed. After the active controller was restored and unlocked, it failed to become active.

The “helm repo add” command failed because lighttpd.service was not up. When we tried to bring it up manually, it complained that the file /etc/ssl/private/server-cert.pem didn’t exist.

The “helm repo add” command is failing:

2019-09-25T15:56:56.456 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/File[/opt/platform/helm_charts]/ensure: created
2019-09-25T15:56:56.458 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/File[/opt/platform/helm_charts]: The container Class[Platform::Helm] will propagate my refresh event
2019-09-25T15:56:56.460 Debug: 2019-09-25 15:56:56 +0000 Exec[restart lighttpd for helm](provider=posix): Executing 'systemctl restart lighttpd.service'
2019-09-25T15:56:56.462 Debug: 2019-09-25 15:56:56 +0000 Executing: 'systemctl restart lighttpd.service'
2019-09-25T15:56:56.543 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/Exec[restart lighttpd for helm]/returns: executed successfully
2019-09-25T15:56:56.545 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm/Exec[restart lighttpd for helm]: The container Class[Platform::Helm] will propagate my refresh event
2019-09-25T15:56:56.547 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/File[/www/pages/helm_charts/starlingx]/ensure: created
2019-09-25T15:56:56.549 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/File[/www/pages/helm_charts/starlingx]: The container Platform::Helm::Repository[starlingx] will propagate my refresh event
2019-09-25T15:56:56.551 Debug: 2019-09-25 15:56:56 +0000 Exec[Generate index: /www/pages/helm_charts/starlingx](provider=posix): Executing 'helm repo index /www/pages/helm_charts/starlingx'
2019-09-25T15:56:56.554 Debug: 2019-09-25 15:56:56 +0000 Executing with uid=www gid=www: 'helm repo index /www/pages/helm_charts/starlingx'
2019-09-25T15:56:56.598 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Generate index: /www/pages/helm_charts/starlingx]/returns: executed successfully
2019-09-25T15:56:56.600 Debug: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Generate index: /www/pages/helm_charts/starlingx]: The container Platform::Helm::Repository[starlingx] will propagate my refresh event
2019-09-25T15:56:56.602 Debug: 2019-09-25 15:56:56 +0000 Exec[Adding StarlingX helm repo: starlingx](provider=posix): Executing 'helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx'
2019-09-25T15:56:56.604 Debug: 2019-09-25 15:56:56 +0000 Executing with uid=sysadmin gid=sys_protected: 'helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx'
2019-09-25T15:56:56.644 Notice: 2019-09-25 15:56:56 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Adding StarlingX helm repo: starlingx]/returns: Error: Looks like "http://127.0.0.1:8080/helm_charts/starlingx" is not a valid chart repository or cannot be reached: Get http://127.0.0.1:8080/helm_charts/starlingx/index.yaml: dial tcp 127.0.0.1:8080: connect: connection refused
2019-09-25T15:56:56.646 Error: 2019-09-25 15:56:56 +0000 helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx returned 1 instead of one of [0]
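
The "connection refused" in the log above means nothing was listening on 127.0.0.1:8080 when "helm repo add" ran, even though the preceding "systemctl restart lighttpd.service" reported success. A minimal sketch of a listener check that could be run before adding the repo is shown below. The function name port_is_open and the one-second timeout are assumptions for illustration; this is not part of the puppet manifest.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Address and port taken from the failing "helm repo add" command.
print(port_is_open("127.0.0.1", 8080))
```

A check like this only confirms that something accepts connections on the port; it does not verify that the process is lighttpd or that index.yaml is being served.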

controller-0:/var/log/puppet# systemctl restart lighttpd.service

controller-0:/var/log/puppet# systemctl status lighttpd.service
● lighttpd.service - Lightning Fast Webserver With Light System Requirements
   Loaded: loaded (/usr/lib/systemd/system/lighttpd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-09-25 18:28:37 UTC; 9s ago
  Process: 135381 ExecStart=/usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf (code=exited, status=255)
Main PID: 135381 (code=exited, status=255)

controller-0:/var/log/puppet# /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
2019-09-25 18:28:54: (configfile.c.59) Warning: please add "mod_openssl" to server.modules list in lighttpd.conf. A future release of lighttpd 1.4.x *will not* automatically load mod_openssl and lighttpd *will not* use SSL/TLS where your lighttpd.conf contains ssl.* directives
2019-09-25 18:28:54: (mod_openssl.c.434) SSL: BIO_read_filename('/etc/ssl/private/server-cert.pem') failed
2019-09-25 18:28:54: (server.c.1191) Initialization of plugins failed. Going down.

controller-0:/var/log/puppet# ls /etc/ssl/private/server-cert.pem
ls: cannot access /etc/ssl/private/server-cert.pem: No such file or directory
controller-0:/var/log/puppet# ls /etc/ssl/private/
openstack registry-cert.crt registry-cert.key registry-cert-pkcs1.key self-signed-server-cert.pem
controller-0:/var/log/puppet# ls /opt/platform/
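
The manual diagnosis above (start lighttpd in the foreground, read the BIO_read_filename error, then ls the path) can be approximated by scanning the configuration for ssl.pemfile directives and checking that each referenced file exists. The sketch below makes that concrete. The sample configuration text, the port 8443, and the helper name missing_pemfiles are hypothetical; the real /etc/lighttpd/lighttpd.conf on the controller is not shown in this report.

```python
import os
import re

# Matches lines such as: ssl.pemfile = "/path/to/cert.pem"
PEMFILE_RE = re.compile(r'^\s*ssl\.pemfile\s*=\s*"([^"]+)"', re.MULTILINE)


def missing_pemfiles(conf_text, exists=os.path.exists):
    """Return the ssl.pemfile paths in conf_text that do not exist."""
    return [path for path in PEMFILE_RE.findall(conf_text) if not exists(path)]


# Hypothetical configuration fragment; only the certificate path comes
# from the error message in this report.
sample_conf = '''
$SERVER["socket"] == ":8443" {
    ssl.engine  = "enable"
    ssl.pemfile = "/etc/ssl/private/server-cert.pem"
}
'''

# Simulate the restored controller, where only the self-signed
# certificate is present in /etc/ssl/private.
present = {"/etc/ssl/private/self-signed-server-cert.pem"}
print(missing_pemfiles(sample_conf, exists=lambda p: p in present))
```

On a real system the exists argument would be left at its default so that the filesystem is consulted.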

This lab was converted from http to https; it was not initially installed as an https lab. When https is enabled, the certificate “server-cert.pem” is generated in /etc/ssl/private/. Even though this certificate file is backed up, it is not restored during platform restore.

The code that needs to be modified is in stx/ansible-playbooks/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/setup_registry_certificate_and_keys.yml:

- block:
  - name: Restore certificate and key files
    # Proposed change: widen the pattern from 'etc/ssl/private/registry-cert*'
    # to 'etc/ssl/private/*cert*' so that server-cert.pem is also restored.
    command: >-
      tar -C /etc/ssl/private -xpf {{ target_backup_dir }}/{{ backup_filename }} --transform='s,.*/,,'
      'etc/ssl/private/*cert*'
    args:
      warn: false

  when: mode == 'restore'
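
To illustrate why the current pattern skips the lighttpd certificate, the sketch below compares the two patterns against a list of archive member names. It uses Python's fnmatchcase as an approximation of tar's wildcard matching, which is an assumption. The member list is also assumed: it is built from the /etc/ssl/private listing earlier in this report plus server-cert.pem, not read from a real backup archive.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "etc/ssl/private/registry-cert*"
NEW_PATTERN = "etc/ssl/private/*cert*"

# Assumed archive members (illustrative only, not read from a backup).
members = [
    "etc/ssl/private/registry-cert.crt",
    "etc/ssl/private/registry-cert.key",
    "etc/ssl/private/registry-cert-pkcs1.key",
    "etc/ssl/private/self-signed-server-cert.pem",
    "etc/ssl/private/server-cert.pem",
]


def selected(pattern, names):
    """Return the names that the shell-style pattern would select."""
    return [name for name in names if fnmatchcase(name, pattern)]


old_selected = selected(OLD_PATTERN, members)
new_selected = selected(NEW_PATTERN, members)

# Files that only the widened pattern would restore.
missed = sorted(set(new_selected) - set(old_selected))
print(missed)
```

Under these assumptions the old pattern selects only the three registry-cert files, so server-cert.pem is left out of the restore.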

It seems that when the lab is initially installed as an https lab, we don’t see this issue (we were able to unlock controller-0 but failed to unlock the other nodes; see LP-1844828). Does that mean the certificate “server-cert.pem” is installed automatically in /etc/ssl/private without needing to be restored?

Severity
--------
Provide the severity of the defect.
Critical: Controller-0 failed to become active after unlock

Steps to Reproduce
------------------
1. Create an environment for ansible remote host
2. Bring up the regular system with storage
3. Backup the system using ansible remotely
4. Re-install the controller with the same load
5. Restore the system using ansible remotely.
6. Unlock the active controller

Expected Behavior
------------------
The controller-0 should become Active after unlock

Actual Behavior
----------------
Controller-0 failed to become active after unlock

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Regular system with storage

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-09-24_15-36-38"

Test Activity
-------------
Feature Testing

Senthil Mukundakumar (smukunda) on 2019-09-26

description:	updated

Frank Miller (sensfan22) on 2019-09-26

Changed in starlingx:
assignee:	nobody → Ovidiu Poncea (ovidiu.poncea)

OpenStack Infra (hudson-openstack) wrote on 2019-09-27: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/685390
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9930bafd71b61f42c3d6d5956cb96f3de1aa7bf2
Submitter: Zuul
Branch: master

commit 9930bafd71b61f42c3d6d5956cb96f3de1aa7bf2
Author: Wei Zhou <email address hidden>
Date: Fri Sep 27 11:22:57 2019 -0400

Backup & Restore: Failed to unlock controller-0 after platform restore

This issue happened only when https is enabled.

    Controller-0 failed to unlock because lighttpd.service was not up and
    that caused "helm repo add" command to fail when applying controller
    manifest. Because https is enabled, lighttpd service needs to access
    server-cert.pem certificate to start. Even though this certification
    is backed up but it is not restored during platform restore.

    Change-Id: I0d7915bc95064974675614be8eb4b15cb091f684
    Closes-Bug: 1845504
    Signed-off-by: Wei Zhou <email address hidden>

Changed in starlingx:
status:	New → Fix Released

Yang Liu (yliu12) on 2019-10-04

tags:	added: stx.retestneeded

Ghada Khalil (gkhalil) wrote on 2019-10-08: Re: Backup & Restore: helm repo add command failed due to lighttpd.service after active controller restore

Marking as stx.3.0 / high priority as B&R is a feature deliverable for the release.

tags:	added: stx.3.0 stx.update
Changed in starlingx:
importance:	Undecided → High

Yang Liu (yliu12) on 2019-10-30

summary:

- Backup & Restore: helm repo add command failed due to lighttpd.service
- after active controller restore
+ Backup & Restore HTTPS: helm repo add command failed due to
+ lighttpd.service after active controller restore

Senthil Mukundakumar (smukunda) wrote on 2019-10-31:

Verified using load 2019-10-20_20-00-00

Senthil Mukundakumar (smukunda) on 2019-10-31

tags:	removed: stx.retestneeded
