DC: rbd mounted devices become read-only after enabling https on system controller

Bug #1901449 reported by Difu Hu
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
On a DC system, rbd mounted devices become read-only after enabling https on the system controller.

Severity
--------
Major

Steps to Reproduce
------------------
1. Pre-condition: https is not enabled on the DC (a check is sketched below this list).
2. system --os-region-name RegionOne modify --https_enabled="true"
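
The pre-condition can be confirmed with the sketch below; this is not part of the original report and assumes the 'system show' output includes the https_enabled field:

    # Hypothetical pre-condition check (field name assumed):
    system --os-region-name RegionOne show | grep https_enabled
    # https_enabled should report False before running step 2.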

Expected Behavior
------------------
rbd mounted devices should remain mounted with read/write permissions.

Actual Behavior
----------------
rbd mounted devices become read-only after enabling https.
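
The failure state can be confirmed by inspecting the mount options of the rbd-backed filesystems; the commands below are illustrative and the device name is an example, not taken from the report:

    # An "ro" mount option on an rbd device confirms the read-only remount.
    grep rbd /proc/mounts
    # List the currently mapped rbd devices and check one of them directly.
    rbd showmapped
    findmnt -S /dev/rbd0 -o TARGET,FSTYPE,OPTIONS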

Reproducibility
---------------
Happened 1/1 time

System Configuration
--------------------
DC system
Lab-name: WCP_80_91 (DC-1)

Branch/Pull Time/Commit
-----------------------
2020-06-27_18-35-20 + some changes

Last Pass
---------
Not sure

Timestamp/Logs
--------------
controller-0
| 2020-10-21T12:45:59.492 | 326 | service-scn | haproxy | enabled-active | disabling | restart safe requested
| 2020-10-21T12:45:59.544 | 327 | service-scn | haproxy | disabling | disabled | disable success
| 2020-10-21T12:46:00.066 | 328 | service-scn | haproxy | disabled | enabling | enabled-active state requested
| 2020-10-21T12:46:00.085 | 329 | service-scn | haproxy | enabling | enabled-active | enable success
| 2020-10-21T12:46:00.786 | 330 | service-scn | horizon | enabled-active | disabling | restart safe requested
| 2020-10-21T12:46:00.795 | 331 | service-scn | horizon | disabling | disabled | disable success
| 2020-10-21T12:46:01.080 | 332 | service-scn | horizon | disabled | enabling | enabled-active state requested
| 2020-10-21T12:46:01.735 | 333 | service-scn | horizon | enabling | enabled-active | enable success
| 2020-10-21T12:46:02.100 | 334 | service-scn | lighttpd | enabled-active | disabling | restart safe requested

2020-10-21T12:47:48.236 controller-0 kernel: err [ 3127.702215] Aborting journal on device rbd0-8.
2020-10-21T12:47:48.236 controller-0 kernel: err [ 3127.707177] Buffer I/O error on dev rbd0, logical block 491520, lost sync page write
2020-10-21T12:47:48.236 controller-0 kernel: err [ 3127.715832] JBD2: Error -5 detected when updating journal superblock for rbd0-8.
2020-10-21T12:47:48.244 controller-0 kernel: warning [ 3127.724180] EXT4-fs (rbd0): discard request in group:17 block:20517 count:1 failed with -5
2020-10-21T12:47:49.237 controller-0 kernel: crit [ 3128.688532] EXT4-fs error (device rbd0): ext4_find_entry:1318: inode #131076: comm elasticsearch[m: reading directory lblock 0
2020-10-21T12:47:49.237 controller-0 kernel: crit [ 3128.701321] EXT4-fs error (device rbd0): ext4_read_inode_bitmap:163: comm elasticsearch[m: Cannot read inode bitmap - block_group = 16, inode_bitmap = 524304
2020-10-21T12:47:49.253 controller-0 kernel: crit [ 3128.717070] EXT4-fs error (device rbd0): ext4_journal_check_start:56: Detected aborted journal
2020-10-21T12:47:49.253 controller-0 kernel: crit [ 3128.724185] EXT4-fs (rbd0): Remounting filesystem read-only

2020-10-21T12:51:19.346 controller-0 kernel: err [ 3338.745010] Aborting journal on device rbd1-8.
2020-10-21T12:51:19.346 controller-0 kernel: err [ 3338.745013] Buffer I/O error on dev rbd1, logical block 19431424, lost sync page write
2020-10-21T12:51:19.346 controller-0 kernel: err [ 3338.745013] JBD2: Error -5 detected when updating journal superblock for rbd1-8.
2020-10-21T12:51:19.346 controller-0 kernel: crit [ 3338.745130] EXT4-fs error (device rbd1): ext4_journal_check_start:56: Detected aborted journal
2020-10-21T12:51:19.346 controller-0 kernel: crit [ 3338.745131] EXT4-fs (rbd1): Remounting filesystem read-only
2020-10-21T12:51:19.346 controller-0 kernel: crit [ 3338.745300] EXT4-fs error (device rbd1): ext4_wait_block_bitmap:516: comm elasticsearch[m: Cannot read block bitmap - block_group = 8, block_bitmap = 1052
2020-10-21T12:51:19.346 controller-0 kernel: crit [ 3338.745301] EXT4-fs error (device rbd1): ext4_discard_preallocations:4035: comm elasticsearch[m: Error reading block bitmap for 8

Test Activity
-------------
Regression Testing

Ghada Khalil (gkhalil) wrote :

This looks like there was filesystem corruption

tags: added: stx.storage
description: updated
tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium - one time occurrence, but should be investigated

Peng Peng (ppeng) wrote :

Issue was reproduced on
2020-06-27_18-35-20
DC-3

log:
[2020-11-13 01:15:51,605] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:11::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --https_enabled="true"'

from kern.log
2020-11-13T01:18:39.667 controller-1 kernel: crit [ 3052.773012] EXT4-fs error (device rbd0): ext4_wait_block_bitmap:516: comm elasticsearch[m: Cannot read block bitmap - block_group = 7, block_bitmap = 1051

collect log added:
https://files.starlingx.kube.cengn.ca/launchpad/1901449

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
Bob Church (rchurch) wrote :

A runtime certificate update restarts containerd. Due to systemd dependencies, docker is restarted as well, which triggers ceph-preshutdown.sh and unmounts the RBD devices.
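
The dependency chain can be inspected on a running controller with standard systemd tooling; the snippet below is illustrative and assumes the stock unit names docker.service and containerd.service:

    # Show the dependency/ordering properties that tie docker to containerd.
    systemctl show docker.service -p Requires -p BindsTo -p PartOf -p After
    # Dump the full unit (including drop-ins) to see where ceph-preshutdown.sh
    # is expected to be hooked in as a stop action.
    systemctl cat docker.service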

2020-10-21T12:47:46.537 Info: 2020-10-21 12:47:46 +0000 /Stage[pre]/Platform::Config::Certs::Ssl_ca/File[create-ssl-ca-cert]: Filebucketed /etc/pki/ca-trust/source/anchors/ca-cert.pem to puppet with sum ed886921b19e510f522bb5005cf5a4c2
2020-10-21T12:47:46.539 Notice: 2020-10-21 12:47:46 +0000 /Stage[pre]/Platform::Config::Certs::Ssl_ca/File[create-ssl-ca-cert]/content: content changed '{md5}ed886921b19e510f522bb5005cf5a4c2' to '{md5}b30884f73776aab90df0fe8a831c3b44'
2020-10-21T12:47:46.542 Info: 2020-10-21 12:47:46 +0000 /Stage[pre]/Platform::Config::Certs::Ssl_ca/File[create-ssl-ca-cert]: Scheduling refresh of Exec[update-ca-trust ]
2020-10-21T12:47:46.544 Info: 2020-10-21 12:47:46 +0000 /Stage[pre]/Platform::Config::Certs::Ssl_ca/File[create-ssl-ca-cert]: Scheduling refresh of Exec[restart containerd]
2020-10-21T12:47:46.547 Debug: 2020-10-21 12:47:46 +0000 /Stage[pre]/Platform::Config::Certs::Ssl_ca/File[create-ssl-ca-cert]: The container Class[Platform::Config::Certs::Ssl_ca] will propagate my refresh event
2020-10-21T12:47:46.549 Debug: 2020-10-21 12:47:46 +0000 Exec[update-ca-trust ](provider=posix): Executing 'update-ca-trust'
2020-10-21T12:47:46.557 Debug: 2020-10-21 12:47:46 +0000 Executing: 'update-ca-trust'
2020-10-21T12:47:47.037 Notice: 2020-10-21 12:47:47 +0000 /Stage[pre]/Platform::Config::Certs::Ssl_ca/Exec[update-ca-trust ]: Triggered 'refresh' from 1 events
2020-10-21T12:47:47.039 Debug: 2020-10-21 12:47:47 +0000 /Stage[pre]/Platform::Config::Certs::Ssl_ca/Exec[update-ca-trust ]: The container Class[Platform::Config::Certs::Ssl_ca] will propagate my refresh event
2020-10-21T12:47:47.042 Debug: 2020-10-21 12:47:47 +0000 Exec[restart containerd](provider=posix): Executing 'pmon-restart containerd'
2020-10-21T12:47:47.044 Debug: 2020-10-21 12:47:47 +0000 Executing: 'pmon-restart containerd'
2020-10-21T12:47:47.072 [10651.00109] controller-0 pmond mon pmonMsg.cpp ( 701) pmon_service_inbox : Info : containerd process-restart ; by request
2020-10-21T12:47:48.072 [10651.00110] controller-0 pmond mon pmonHdlr.cpp (1107) unregister_process : Info : containerd Unregister (2438)
2020-10-21T12:47:48.072 [10651.00111] controller-0 pmond mon pmonHdlr.cpp ( 946) kill_running_process : Warn : containerd Killed (2438)
2020-10-21T12:47:48.072 [10651.00112] controller-0 pmond mon pmonHdlr.cpp (1311) respawn_process : Info : containerd Spawn (661307)

2020-10-21T12:47:48.077 controller-0 systemd[1]: info Stopping Docker Application Container Engine...

2020-10-21T12:47:48.086 [10651.00113] controller-0 pmond mon pmonHdlr.cpp ( 303) manage_process_failure :Error : dockerd failed (2473) (p:1 a:0)

2020-10-21T12:47:48.000 controller-0 ceph-preshutdown.sh: notice Unmapped /dev/rbd0
2020-10-21T12:47:48.000 controller-0 ceph-preshutdown.sh: notice Unmapped /dev/rbd1
2020-10-21T12:47:48.000 controller-0 ceph-preshutdown.sh: notice Unmounted /dev/rbd0
2020-10-21T12:47:48.000 controller-0 ceph-preshutdown.sh: notice Unmounted /dev/rbd...


Changed in starlingx:
assignee: Mihnea Saracin (msaracin) → Bob Church (rchurch)
Changed in starlingx:
status: Triaged → Fix Committed
Bob Church (rchurch)
Changed in starlingx:
status: Fix Committed → Fix Released
Difu Hu (difuhu) wrote :

Verified on 2021-01-14_20-00-06.

tags: removed: stx.5.0 stx.retestneeded
tags: added: stx.5.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>
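
As a rough illustration of the wipe described in the commit above (the real change lives in the puppet logical_volume module; the volume path and the use of dd here are assumptions, not the actual code):

    # Zero the last 10MB of the logical volume right after lvextend so stale
    # filesystem signatures from a previous install are not detected.
    LV=/dev/cgts-vg/docker-lv                                  # example path
    SIZE_MB=$(( $(blockdev --getsize64 "$LV") / 1024 / 1024 ))
    dd if=/dev/zero of="$LV" bs=1M seek=$(( SIZE_MB - 10 )) count=10 conv=fdatasync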

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...
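
As a rough illustration of the guard described in the commit above (the actual check in ceph.sh may differ; detecting an AIO node via the subfunction line in /etc/platform/platform.conf is an assumption):

    # Skip the worker copy of ceph.sh on AIO nodes so the script is only
    # executed once, from the controller directory.
    if [[ "$0" == /etc/services.d/worker/* ]] && \
       grep -q '^subfunction=.*controller' /etc/platform/platform.conf; then
        exit 0
    fi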

tags: added: in-f-centos8