Controller failed to lock following a failover due to elastic pod failure to shutdown

Bug #1851287 reported by Tee Ngo on 2019-11-04
Affects: StarlingX
Status: In Progress
Importance: High
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
The controller node failed to lock after a failover.

Severity
--------
Major

Steps to Reproduce
------------------
1. Bring up a standard system
2. Apply StarlingX monitor application
3. Force a failover by issuing the command "sudo reboot" on the primary controller (e.g. controller-0).
4. After the standby controller-1 has taken over and controller-0 has recovered from the reboot and become unlocked-enabled-available, perform a host swact
5. Lock the now-standby controller (see the command sketch below)
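
For reference, a minimal command sequence for steps 3-5, assuming controller-0 starts out as the active controller (hostnames are illustrative):

controller-0:~$ sudo reboot                       # step 3: force a failover to controller-1
# (wait until controller-0 is unlocked-enabled-available again)
controller-1:~$ system host-swact controller-1    # step 4: swact activity back to controller-0
controller-0:~$ system host-lock controller-1     # step 5: lock the now-standby controller-1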

Expected Behavior
------------------
The standby controller is in locked state within 3 minutes.

Actual Behavior
----------------
The standby controller (controller-1) failed to lock because the elasticsearch pods were stuck in Terminating. VIM eventually timed out and failed the lock operation. The controller-1 daemon.log kept spewing the following logs:

2019-11-04T17:45:47.696 controller-1 dockerd[12252]: info time="2019-11-04T17:45:47.696483189Z" level=error msg="Error running exec 4e99d41ebaa999838e836cca8d6cdd36f4fdf5fab8bc55f0a3908a646ba19c96 in container: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown"
2019-11-04T17:45:47.696 controller-1 kubelet[88547]: info W1104 17:45:47.696940 88547 prober.go:108] No ref for container "docker://cc09882af3071eaa3d82bd6e91cb78d2a7a52a895960edfb3a24706ed96f540f" (mon-elasticsearch-data-1_monitor(bf423e8c-0fc3-40a4-a662-e9f42d3f581f):elasticsearch)

A follow-up host-lock --force had to be issued in order to lock controller-1. As soon as system host-unlock was issued against controller-1, the monitor-related pods on the primary controller-0 got into a bad state and the Kibana dashboard became unreachable until controller-1 recovered from the unlock.
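
For reference, the forced lock and subsequent unlock were commands roughly of the form below (hostname per this setup):

controller-0:~$ system host-lock --force controller-1
controller-0:~$ system host-unlock controller-1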

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-6bc668cd76-2xcrj 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-elasticsearch-client-1 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-elasticsearch-curator-1572890400-5bl5m 0/1 Error 0 17m fd00:206::8e22:765f:6121:eb55 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-7nbnr 0/1 Error 0 22m fd00:206::8e22:765f:6121:eb63 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-b2r5r 0/1 Error 0 8m51s fd00:206::8e22:765f:6121:eb7a controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-mwgf2 0/1 Error 0 15m fd00:206::8e22:765f:6121:eb42 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-q66mf 0/1 Error 0 12m fd00:206::8e22:765f:6121:eb79 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-tppf8 0/1 Error 0 20m fd00:206::8e22:765f:6121:eb6b controller-0 <none> <none>
monitor mon-elasticsearch-data-1 0/1 Pending 0 14m <none> <none> <none> <none>
monitor mon-elasticsearch-master-1 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-filebeat-bppwv 1/2 CrashLoopBackOff 127 9d fd00:206::a4ce:fec1:5423:e327 controller-1 <none> <none>
monitor mon-kibana-6cf57cfd5b-pg7cf 0/1 CrashLoopBackOff 16 67m fd00:206::8e22:765f:6121:eb59 controller-0 <none> <none>
monitor mon-logstash-0 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-metricbeat-7948cd594c-wwqlp 0/1 Error 18 67m fd00:206::8e22:765f:6121:eb6f controller-0 <none> <none>
monitor mon-metricbeat-tqzgn 0/1 Error 116 9d fd00:204::4 controller-1 <none> <none>
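
The listing above is kubectl output; a comparable view can be obtained with something like the following (exact flags are assumed, not taken from the original report):

controller-0:~$ kubectl get pods --all-namespaces -o wide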

Reproducibility
---------------
Very frequent

System Configuration
--------------------
Standard system

Branch/Pull Time/Commit
-----------------------
Oct. 24th load

Last Pass
---------
I had not executed this scenario previously, and I do not know whether it is part of a regression test suite.

Timestamp/Logs
--------------
Attached is the daemon.log of the controller node that consistently failed host-lock following a failover.

Test Activity
-------------
Ad-hoc testing

Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - pods not shutting down preventing the host lock operation.

summary: - Controller failed to lock following a failover
+ Controller failed to lock following a failover due to elastic pod
+ failure to shutdown
Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
tags: added: stx.3.0 stx.containers
Kevin Smith (kevin.smith.wrs) wrote :

The data and master pods are blocked on I/O writing to their respective rbd volumes, according to the kernel logs. There is a ceph OSD bounce just before the block, which is suspicious. In some cases the situation resolves itself over time: the pods terminate and the lock completes. If not, a host-lock --force with the accompanying reboot will forcefully resolve the problem. The exact same symptoms can be reproduced if the volumes are allowed to go full (which the elasticsearch-curator is preventing). Unfortunately, the ability to reproduce on the lab that reported the problem has been lost.
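
As a side note, the I/O-blocked state described above typically shows up in the kernel logs as hung-task messages, which can be spotted with something like the following (illustrative, not taken from the attached logs):

controller-1:~$ dmesg -T | grep -i 'blocked for more than'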

I currently suspect the index shard re-sync between the data pods that occurs after node recovery. It is possible to disable routing (at least temporarily) before a pod starts up, which could help the situation, but without the ability to reproduce it will be hard to verify whether that helps.
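
For context, the temporary routing disable mentioned above would normally go through the Elasticsearch cluster settings API, along these lines (the service endpoint and port are assumptions based on the pod names in this report; this is only a sketch of the idea):

controller-0:~$ curl -s -X PUT 'http://mon-elasticsearch-client:9200/_cluster/settings' \
                     -H 'Content-Type: application/json' \
                     -d '{"transient": {"cluster.routing.allocation.enable": "primaries"}}'
# ... after the node has recovered, re-enable full shard allocation:
controller-0:~$ curl -s -X PUT 'http://mon-elasticsearch-client:9200/_cluster/settings' \
                     -H 'Content-Type: application/json' \
                     -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'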

Kevin Smith (kevin.smith.wrs) wrote :

Issue reproduced with a host-swact. No host-lock was performed after the swact, but the pods would surely get stuck terminating if one were issued. Logs attached. Ceph is continuously in the state below.

controller-1:~$ ceph -s
  cluster:
    id: 427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools: 1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage: 568 GiB used, 324 GiB / 892 GiB avail
    pgs: 100.000% pgs not active
             64 peering

Bob Church (rchurch) wrote :

Looks like this is a ceph recovery issue. The OSDs appear to be stuck peering, which blocks cluster access.

controller-1:/var/log/ceph$ ceph health detail | head
HEALTH_WARN Reduced data availability: 64 pgs inactive, 64 pgs peering
PG_AVAILABILITY Reduced data availability: 64 pgs inactive, 64 pgs peering
    pg 1.0 is stuck peering for 6124.369015, current state peering, last acting [1,0]
    pg 1.1 is stuck peering for 6119.913881, current state peering, last acting [0,1]
    pg 1.2 is stuck peering for 6149.545071, current state peering, last acting [1,0]
    pg 1.3 is stuck peering for 6148.786716, current state peering, last acting [1,0]
    pg 1.4 is stuck peering for 6187.537512, current state peering, last acting [1,0]
    pg 1.5 is stuck peering for 6182.361878, current state peering, last acting [1,0]
    pg 1.6 is stuck peering for 6119.918444, current state peering, last acting [0,1]
    pg 1.7 is stuck peering for 6119.918502, current state peering, last acting [0,1]

Our current recovery logic will not detect this scenario. A restart of either OSD daemon will fix the issue.
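
For example, something along these lines would restart or bounce an OSD (service management details are an assumption; StarlingX may manage the OSD daemons differently):

controller-1:~$ sudo systemctl restart ceph-osd@1    # restart osd.1 on controller-1
# or mark the OSD down from a monitor node and let it rejoin and re-peer:
controller-1:~$ ceph osd down 1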

controller-1:~$ ceph -s
  cluster:
    id: 427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools: 1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage: 568 GiB used, 324 GiB / 892 GiB avail
    pgs: 100.000% pgs not active
             64 peering

controller-1:~$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.87097 root storage-tier
-2 0.87097 chassis group-0
-4 0.43549 host controller-0
 0 ssd 0.43549 osd.0 up 1.00000 1.00000
-3 0.43549 host controller-1
 1 ssd 0.43549 osd.1 up 1.00000 1.00000

controller-1:~$ tail /var/log/ceph/ceph-osd.1.log
2019-11-13 15:33:52.302 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:53.315 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:54.273 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:55.279 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:56.267 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2723 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirecte...


Frank Miller (sensfan22) wrote :

Re-assigning to Ovidiu to determine who should prime this issue.

Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Ovidiu Poncea (ovidiu.poncea)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Dan Voiculeasa (dvoicule)
Frank Miller (sensfan22) wrote :

There are 2 recommended actions to address this issue:
1) Update the ceph recovery code to detect when OSDs are in a "stuck peering" state and restart the affected OSD (a rough detection sketch follows below).
2) Investigate whether a semantic check needs to be added to prevent locking a host with OSDs while ceph is in a "backfilling" state.
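
A rough CLI-level sketch of the detection in action 1, based on the health output above (the exact checks in the recovery code are still to be designed):

controller-1:~$ ceph health detail | grep -c 'stuck peering'    # non-zero while OSDs are up/in is the suspect state
controller-1:~$ ceph osd stat                                   # confirm the OSDs still report up/in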

Fix proposed to branch: master
Review: https://review.opendev.org/695917

Changed in starlingx:
status: Triaged → In Progress