config_controller failure due to collectd failing to start

Bug #1797909 reported by Maria Yousaf
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
Config_controller failure on step 6/8: Applying controller manifest

Severity
--------
Critical

Steps to Reproduce
------------------
Install system

Expected Behavior
------------------
Config_controller succeeds

Actual Behavior
----------------
Config_controller fails as follows:

...
05/08: Creating system configuration ... DONE
06/08: Applying controller manifest ... Failed to execute controller manifest

Configuration failed: Failed to apply controller manifest. See /var/log/puppet/latest/puppet.log for details.

Reproducibility
---------------
Seen 2 out of 3 attempts on this particular system

System Configuration
--------------------
Storage system

Branch/Pull Time/Commit
-----------------------
stx.10.2018 as of 2018-10-15_01-52-00

Timestamp/Logs
--------------
The following was seen in puppet.log (not sure if it is related):

2018-10-15T14:09:50.323 Error: 2018-10-15 14:09:50 +0000 Systemd start for collectd failed!
2018-10-15T14:09:50.326 journalctl log for collectd:
2018-10-15T14:09:50.328 -- No entries --
2018-10-15T14:09:50.331
2018-10-15T14:09:50.333 /usr/share/ruby/vendor_ruby/puppet/provider/service/systemd.rb:164:in `rescue in start'
2018-10-15T14:09:50.335 /usr/share/ruby/vendor_ruby/puppet/provider/service/systemd.rb:161:in `start'
2018-10-15T14:09:50.338 /usr/share/ruby/vendor_ruby/puppet/type/service.rb:103:in `block (3 levels) in <module:Puppet>'
2018-10-15T14:09:50.340 /usr/share/ruby/vendor_ruby/puppet/property.rb:487:in `set'
2018-10-15T14:09:50.343 /usr/share/ruby/vendor_ruby/puppet/property.rb:561:in `sync'
2018-10-15T14:09:50.345 /usr/share/ruby/vendor_ruby/puppet/type/service.rb:114:in `sync'
2018-10-15T14:09:50.347 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:236:in `sync'
2018-10-15T14:09:50.350 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:134:in `sync_if_needed'
2018-10-15T14:09:50.352 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:80:in `perform_changes'
2018-10-15T14:09:50.354 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:21:in `evaluate'
2018-10-15T14:09:50.356 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:230:in `apply'
2018-10-15T14:09:50.359 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:246:in `eval_resource'
2018-10-15T14:09:50.361 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:163:in `call'
2018-10-15T14:09:50.363 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:163:in `block (2 levels) in evaluate'
2018-10-15T14:09:50.365 /usr/share/ruby/vendor_ruby/puppet/util.rb:386:in `block in thinmark'
2018-10-15T14:09:50.368 /usr/share/ruby/benchmark.rb:296:in `realtime'
2018-10-15T14:09:50.370 /usr/share/ruby/vendor_ruby/puppet/util.rb:385:in `thinmark'
2018-10-15T14:09:50.372 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:163:in `block in evaluate'
2018-10-15T14:09:50.374 /usr/share/ruby/vendor_ruby/puppet/graph/relationship_graph.rb:118:in `traverse'
2018-10-15T14:09:50.376 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:154:in `evaluate'
2018-10-15T14:09:50.378 /usr/share/ruby/vendor_ruby/puppet/resource/catalog.rb:222:in `block in apply'
2018-10-15T14:09:50.381 /usr/share/ruby/vendor_ruby/puppet/util/log.rb:155:in `with_destination'
2018-10-15T14:09:50.383 /usr/share/ruby/vendor_ruby/puppet/transaction/report.rb:142:in `as_logging_destination'
2018-10-15T14:09:50.385 /usr/share/ruby/vendor_ruby/puppet/resource/catalog.rb:221:in `apply'
2018-10-15T14:09:50.387 /usr/share/ruby/vendor_ruby/puppet/configurer.rb:171:in `block in apply_catalog'
2018-10-15T14:09:50.389 /usr/share/ruby/vendor_ruby/puppet/util.rb:223:in `block in benchmark'
2018-10-15T14:09:50.392 /usr/share/ruby/benchmark.rb:296:in `realtime'
2018-10-15T14:09:50.394 /usr/share/ruby/vendor_ruby/puppet/util.rb:222:in `benchmark'
2018-10-15T14:09:50.396 /usr/share/ruby/vendor_ruby/puppet/configurer.rb:170:in `apply_catalog'
2018-10-15T14:09:50.398 /usr/share/ruby/vendor_ruby/puppet/configurer.rb:343:in `run_internal'
2018-10-15T14:09:50.401 /usr/share/ruby/vendor_ruby/puppet/configurer.rb:221:in `block in run'
2018-10-15T14:09:50.403 /usr/share/ruby/vendor_ruby/puppet/context.rb:65:in `override'
2018-10-15T14:09:50.405 /usr/share/ruby/vendor_ruby/puppet.rb:241:in `override'
2018-10-15T14:09:50.407 /usr/share/ruby/vendor_ruby/puppet/configurer.rb:195:in `run'
2018-10-15T14:09:50.409 /usr/share/ruby/vendor_ruby/puppet/application/apply.rb:350:in `apply_catalog'
2018-10-15T14:09:50.411 /usr/share/ruby/vendor_ruby/puppet/application/apply.rb:274:in `block in main'
2018-10-15T14:09:50.413 /usr/share/ruby/vendor_ruby/puppet/context.rb:65:in `override'
2018-10-15T14:09:50.415 /usr/share/ruby/vendor_ruby/puppet.rb:241:in `override'
2018-10-15T14:09:50.417 /usr/share/ruby/vendor_ruby/puppet/application/apply.rb:225:in `main'
2018-10-15T14:09:50.419 /usr/share/ruby/vendor_ruby/puppet/application/apply.rb:170:in `run_command'
2018-10-15T14:09:50.421 /usr/share/ruby/vendor_ruby/puppet/application.rb:344:in `block in run'
2018-10-15T14:09:50.423 /usr/share/ruby/vendor_ruby/puppet/util.rb:540:in `exit_on_fail'
2018-10-15T14:09:50.425 /usr/share/ruby/vendor_ruby/puppet/application.rb:344:in `run'
2018-10-15T14:09:50.427 /usr/share/ruby/vendor_ruby/puppet/util/command_line.rb:132:in `run'
2018-10-15T14:09:50.429 /usr/share/ruby/vendor_ruby/puppet/util/command_line.rb:72:in `execute'
2018-10-15T14:09:50.431 /usr/bin/puppet:5:in `<main>'

Revision history for this message
Ghada Khalil (gkhalil) wrote :

This appears to be an issue with a specific hardware lab; further investigation is in progress, but it will likely not gate stx.2018.10.

Changed in starlingx:
importance: Undecided → Medium
Ghada Khalil (gkhalil)
tags: added: stx.metal
Revision history for this message
Ghada Khalil (gkhalil) wrote :

From Don Penney:
The collectd service is failing to start, with the following in daemon.log:

2018-10-15T14:09:49.379 localhost collectd[41078]: info plugin_load: plugin "network" successfully loaded.
2018-10-15T14:09:49.380 localhost collectd[41078]: info plugin_load: plugin "python" successfully loaded.
2018-10-15T14:09:50.284 localhost collectd[41078]: info degrade notifier config function
2018-10-15T14:09:50.284 localhost collectd[41078]: info degrade notifier configured mtce port: 2101
2018-10-15T14:09:50.284 localhost collectd[41078]: info plugin_load: plugin "threshold" successfully loaded.
2018-10-15T14:09:50.285 localhost collectd[41078]: info plugin_load: plugin "df" successfully loaded.
2018-10-15T14:09:50.285 localhost collectd[41078]: info platform cpu usage plugin config function
2018-10-15T14:09:50.285 localhost collectd[41078]: info platform memory usage configured query command: '/proc/meminfo'
2018-10-15T14:09:50.285 localhost collectd[41078]: info Looking up "controller-0" failed. You have set the "FQDNLookup" option, but I cannot resolve my hostname to a fully qualified domain name. Please fix the network configuration.
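The last line is the real failure: with the "FQDNLookup" option set, collectd tries to turn its short hostname (here "controller-0") into a fully qualified name at startup, and when that lookup fails the daemon appears not to come up, which fits the empty journal and the Puppet systemd error above. To my understanding collectd does this in C with getaddrinfo() and the AI_CANONNAME flag; the Python sketch below only approximates that check with the standard socket module and is not collectd's code. The function name resolve_fqdn is hypothetical.

```python
import socket
from typing import Optional


def resolve_fqdn(hostname: str) -> Optional[str]:
    """Return the canonical (fully qualified) name for hostname, or None.

    Approximates collectd's FQDNLookup step: ask the resolver for the
    canonical name via getaddrinfo() with AI_CANONNAME.
    """
    try:
        results = socket.getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
    except socket.gaierror:
        # Resolution failed, e.g. /etc/hosts or DNS not configured yet,
        # which is the condition collectd logs as "Looking up ... failed".
        return None
    # Only the first entry is expected to carry the canonical name.
    for _family, _type, _proto, canonname, _sockaddr in results:
        if canonname:
            return canonname
    return None
```

For example, resolve_fqdn(socket.gethostname()) returning None on a controller while the manifest is still being applied would correspond to the condition logged above.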

description: updated
summary: - Config_controller failure on step 6/8: Applying controller manifest
+ config_controller failure on step 6/8: Applying controller manifest
summary: - config_controller failure on step 6/8: Applying controller manifest
+ config_controller failure due to collectd failing to start
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
status: New → Triaged
tags: added: stx.2019.03
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 until further investigation. This appears to be an intermittent issue seen on one system only.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/627986

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-metal (master)

Fix proposed to branch: master
Review: https://review.openstack.org/627987

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-integ (master)

Fix proposed to branch: master
Review: https://review.openstack.org/627988

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-integ (master)

Reviewed: https://review.openstack.org/627988
Committed: https://git.openstack.org/cgit/openstack/stx-integ/commit/?id=c8f39de9a0d56b2c86e204445fc4b097ca718deb
Submitter: Zuul
Branch: master

commit c8f39de9a0d56b2c86e204445fc4b097ca718deb
Author: Eric MacDonald <email address hidden>
Date: Wed Jan 2 10:21:06 2019 -0500

    Implement collectd startup in manifest apply post stage

    Starting collectd too early in the manifest apply is seen
    to occasionally fail due to a dependency configuration on
    hostname resolution in FQDNLookup not being complete.

    Since influxdb is used by collectd and is a controller
    only service this update moves it to the manifest apply
    post stage as well and is filtered out from non
    controller load types.

    This issue is fixed by the following multi-git changes.

    stx-metal:
       Filter influxdb out of storage and compute only loads.
       No real inter git merge dependency

    stx-integ: This update.
       Add startup Before=pmond dependency

    stx-config:
       Move collectd config and startup to manifest apply post stage
       Move influxdb config and startup to manifest apply post stage

    Test Plan:
    PASS: Build iso
    PASS: verify install storage system and collectd startup
    PASS: Verify Storage system DOR
    PASS: Verify influxdb and extensions excluded in non-controller loads
    PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK)
    PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK)
    PASS: Verify collectd pmond process monitoring and recovery
    PASS: Verify influxdb pmond process monitoring and recovery

    PEND: Verify collectd statistics storage and fetch to/from influxdb
    PEND: Install AIO DX and verify collectd and influxdb startup

    Change-Id: I47d70b05bdbdd22f8fce2f56fcc287fac7371ace
    Closes-Bug: 1797909
    Signed-off-by: Eric MacDonald <email address hidden>
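The stx-integ part of the fix is a systemd ordering directive: Before=pmond in collectd's unit asks systemd to start collectd ahead of pmond when both are started in the same transaction. Note that Before= only orders jobs; it does not by itself pull pmond in. The Python sketch below collects the values of such a directive from a unit file so the ordering can be checked; the helper name unit_directive and the sample unit (including the pmond.service unit name and the ExecStart path) are hypothetical and may not match the file actually shipped.

```python
from typing import List, Optional


def unit_directive(unit_text: str, section: str, key: str) -> List[str]:
    """Collect every space-separated value of key= inside [section].

    systemd lets a key repeat and lets one line list several units,
    so all occurrences are gathered in file order.
    """
    values: List[str] = []
    current: Optional[str] = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]  # entering a new section
            continue
        if current == section and "=" in line:
            name, _, value = line.partition("=")
            if name.strip() == key:
                values.extend(value.split())
    return values


# Hypothetical collectd unit fragment (unit names and paths are assumptions).
SAMPLE_UNIT = """\
[Unit]
Description=Collectd statistics daemon
After=network-online.target
Before=pmond.service

[Service]
ExecStart=/usr/sbin/collectd
"""

if __name__ == "__main__":
    print(unit_directive(SAMPLE_UNIT, "Unit", "Before"))
```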

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-integ (f/centos76)

Fix proposed to branch: f/centos76
Review: https://review.openstack.org/628060

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-integ (f/centos76)

Reviewed: https://review.openstack.org/628060
Committed: https://git.openstack.org/cgit/openstack/stx-integ/commit/?id=17b864fbebb63d768f65beea0b4a3ef6811cbbe8
Submitter: Zuul
Branch: f/centos76

commit c8f39de9a0d56b2c86e204445fc4b097ca718deb
Author: Eric MacDonald <email address hidden>
Date: Wed Jan 2 10:21:06 2019 -0500

    Implement collectd startup in manifest apply post stage

commit d33383743e5a0b494cbae23d2e389beb993d6a30
Author: slin14 <email address hidden>
Date: Wed Sep 26 20:07:50 2018 +0800

    rebase drbd driver patch to 8.4.11-1 version

    "compat-Statically-initialize-families.patch" is already contained in
    the new version, so delete it.
    Reset TIS_PATCH_VER to 0 since version is upgraded.

    Depends-On: https://review.openstack.org/605292
    Story: 2003597
    Task: 26588

    Change-Id: I628f5b0497df188ea9fa7b7860b56de78382c510
    Signed-off-by: slin14 <email address hidden>

commit fc9b6f94a899eaa9b32fecde0b7a2d9b0a1f65e5
Author: Sun Austin <email address hidden>
Date: Tue Dec 18 13:10:58 2018 +0800

    Fix: "map" issue for Python 2/3 compatible code

    Replace map(func, data) with [func(item) for item in data]

    Story: 2002909
    Task: 24563

    Change-Id: I83004eeba036908da483b247093818a6ac3f19c1
    Signed-off-by: Sun Austin <email address hidden>

commit 707a12317bc5bfbacace9657550aa363ee6d08b4
Author: Sun Austin <email address hidden>
Date: Tue Dec 18 11:43:00 2018 +0800

    Fix: "dict" issue for Python 2/3 compatible code

    Replace dict.iteritems() with dict.items()
    Rep...

tags: added: in-f-centos76
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/627986
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=b83ad728ce18373960c3f23cc90dcf344332cce7
Submitter: Zuul
Branch: master

commit b83ad728ce18373960c3f23cc90dcf344332cce7
Author: Eric MacDonald <email address hidden>
Date: Wed Jan 2 10:13:24 2019 -0500

    Make collectd startup dependent on networking Anchor

    Starting collectd too early in the manifest apply is seen
    to occasionally fail due to a dependency configuration on
    hostname resolution in FQDNLookup not being complete.

    This is fixed by making collectd startup have a hard
    dependency on platform::networking by-way of a manifest
    require Anchor.

    As well, to handle the DOR case when controller manifest
    is not executed, this update also ensures that collectd
    and influxdb services are enabled in its manifest base
    class so these processes are auto started by init.

    Since influxdb is a controller only service it is removed
    from non controller load types.

    This issue is fixed by the following multi-git changes.

    stx-metal:
       Filter influxdb out of storage and compute only loads.
       No real inter git merge dependency

    stx-integ:
       Add startup Before=pmond dependency

    stx-config: This Update.
       Move collectd config and startup to manifest apply post stage
       Move influxdb config and startup to manifest apply post stage

    Test Plan:
    PASS: Build iso
    PASS: Verify install storage system and collectd startup
    PASS: Verify influxdb and extensions excluded in non-controller loads
    PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK)
    PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK)
    PASS: Verify collectd pmond process monitoring and recovery
    PASS: Verify influxdb pmond process monitoring and recovery
    PASS: Verify collectd statistics storage and fetch to/from influxdb
    PASS: Verify Install AIO DX and verify collectd and influxdb startup
    PASS: Verify Storage system DOR
    PASS: Verify AIO DX DOR

    Change-Id: Idff6382d835289f5986e98e3b4ee6e9c7a960287
    Closes-Bug: 1797909
    Signed-off-by: Eric MacDonald <email address hidden>
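The merged stx-config change relies on ordering inside the Puppet catalog: collectd's service resource requires the platform::networking anchor, so Puppet must apply everything ordered before that anchor (assumed here to include the interface and hostname configuration) before it starts collectd. An anchor is a no-op resource used purely as an ordering point. The Python sketch below models that rule as a dependency graph with graphlib (Python 3.9+); it is not Puppet's implementation, and all resource names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def apply_order(requires: Dict[str, Set[str]]) -> List[str]:
    """Return one valid apply order in which each resource follows what it requires."""
    return list(TopologicalSorter(requires).static_order())


# Hypothetical resources: each key maps to the resources it requires.
# The anchor requires every networking resource, so anything requiring
# the anchor is applied only after the whole networking class.
RESOURCES: Dict[str, Set[str]] = {
    "net:interfaces": set(),
    "net:hosts_file": {"net:interfaces"},
    "Anchor[platform::networking]": {"net:interfaces", "net:hosts_file"},
    "Service[collectd]": {"Anchor[platform::networking]"},
    "Service[unrelated]": set(),
}

if __name__ == "__main__":
    for step, name in enumerate(apply_order(RESOURCES), start=1):
        print(step, name)
```

Without the edge from Service[collectd] to the anchor, nothing in the graph would stop collectd from being started before the hostname is resolvable, which matches the intermittent failure reported here.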

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-metal (master)

Reviewed: https://review.openstack.org/627987
Committed: https://git.openstack.org/cgit/openstack/stx-metal/commit/?id=64c1d400b9d17a08da80f9995a24fe87e6b0f6c1
Submitter: Zuul
Branch: master

commit 64c1d400b9d17a08da80f9995a24fe87e6b0f6c1
Author: Eric MacDonald <email address hidden>
Date: Wed Jan 2 09:53:15 2019 -0500

    Implement collectd startup in manifest apply post stage

    Starting collectd too early in the manifest apply is seen
    to occasionally fail due to a dependency configuration on
    hostname resolution in FQDNLookup not being complete.

    Since influxdb is used by collectd and is a controller
    only service this update moves it to the manifest apply
    post stage as well and is filtered out from non
    controller load types.

    This issue is fixed by the following multi-git changes.

    stx-metal: This update.
       Filter influxdb out of storage and compute only loads.
       No real inter git merge dependency

    stx-integ:
       Add startup Before=pmond dependency

    stx-config:
       Move collectd config and startup to manifest apply post stage
       Move influxdb config and startup to manifest apply post stage

    Test Plan:
    PASS: Build iso
    PASS: verify install storage system and collectd startup
    PASS: Verify Storage system DOR
    PASS: Verify influxdb and extensions excluded in non-controller loads
    PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK)
    PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK)
    PASS: Verify collectd pmond process monitoring and recovery
    PASS: Verify influxdb pmond process monitoring and recovery

    PEND: Verify collectd statistics storage and fetch to/from influxdb
    PEND: Install AIO DX and verify collectd and influxdb startup

    Change-Id: I8c71f36978620e0650062cc848bfb9d85f6810b2
    Closes-Bug: 1797909
    Signed-off-by: Eric MacDonald <email address hidden>
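The stx-metal part of the fix is a filtering rule: influxdb is a controller-only service, so it is left out of storage-only and compute-only loads while collectd remains on every node. A minimal Python sketch of that rule follows; the function name, the CONTROLLER_ONLY set and the personality strings are hypothetical, and the sketch models the rule only, not how the stx-metal change implements it.

```python
from typing import FrozenSet, List

# Hypothetical: services that should only be present on controller loads.
CONTROLLER_ONLY: FrozenSet[str] = frozenset({"influxdb"})


def services_for(personality: str, services: List[str]) -> List[str]:
    """Drop controller-only services from non-controller (storage/compute) loads."""
    if personality == "controller":
        return list(services)
    return [s for s in services if s not in CONTROLLER_ONLY]


if __name__ == "__main__":
    print(services_for("storage", ["collectd", "influxdb", "pmond"]))
```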

Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05