[Update] Update cluster from 9.1 to 9.2 failed on neutron task

Bug #1642613 reported by Ilya Bumarskov
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Anton Chevychalov
Milestone: 9.2

Bug Description

Detailed bug description:
Updating the cluster from 9.1 to 9.2 (snapshot id #515) failed on the nova-db task.

Steps to reproduce:
   - Deploy 9.1 env with following nodes:
         * Controller
         * Compute, cinder
         * Compute, cinder
   - Add proposed repo for env (http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-11-14-194322)
   - Download mos-mu tool on master (git clone https://github.com/aepifanov/mos_mu.git)
   - Run the preparation playbook (ansible-playbook playbooks/mos9_prepare.yml -e '{"env_id":<env_id>}')
   - Update fuel node (ansible-playbook playbooks/update_fuel.yml)
   - Update env (fuel2 update --env <ENV_ID> install)
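
For convenience, the same sequence as one shell session on the Fuel master node (a sketch based on the steps above; env id 1 is an assumption, substitute your own):

# proposed repo already added to the environment as described above
git clone https://github.com/aepifanov/mos_mu.git
cd mos_mu
ansible-playbook playbooks/mos9_prepare.yml -e '{"env_id": 1}'
ansible-playbook playbooks/update_fuel.yml
fuel2 update --env 1 install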

Observed behavior:
Deployment failed with:
Could not prefetch neutron_subnet provider 'neutron': Can't retrieve subnet-list because Neutron or Keystone API is not available.
Diagnostic snapshot: https://drive.google.com/open?id=0B8nyPqe6rrN1MU15Nlc4dDZhZFk
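
A quick way to check from the affected controller whether the Neutron and Keystone APIs are really unreachable, which is what the prefetch error above implies (a sketch; /root/openrc is the usual Fuel location for admin credentials, adjust if yours differs):

# source admin credentials and issue the same call the puppet provider makes
source /root/openrc
neutron subnet-list
# confirm the endpoints are registered and keystone answers
openstack endpoint list | grep -Ei 'neutron|keystone'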

Changed in fuel:
importance: Undecided → High
milestone: none → 9.2
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

The issue is related to the package upgrade procedure; after it, pacemaker starts dying constantly.

Linux team, can you check?

Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: check_active_before_startup_processes: Process lrmd terminated (pid=25845)
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: pcmk_process_exit: Respawning failed child process: lrmd
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: start_child: Forked child 6582 for process lrmd
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: check_active_before_startup_processes: Process pengine terminated (pid=25847)
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: pcmk_process_exit: Respawning failed child process: pengine
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: start_child: Using uid=107 and group=114 for process pengine
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: start_child: Forked child 6583 for process pengine
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/root
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: qb_ipcs_us_publish: server name: lrmd
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: error: qb_ipcs_us_publish: Could not bind AF_UNIX (): Address already in use (98)
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: error: mainloop_add_ipc_server: Could not start lrmd IPC server: Address already in use (-98)
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: error: main: Failed to create IPC server: shutting down and inhibiting respawn
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: warning: pcmk_child_exit: The lrmd process (6582) can no longer be respawned, shutting the cluster down.
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: stop_child: Stopping crmd: Sent -15 to process 5986
Nov 16 18:07:54 [5986] node-3.test.domain.local crmd: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated
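
The "Address already in use" errors above suggest that an lrmd instance from before the upgrade is still holding its IPC socket. A minimal diagnostic sketch for the affected controller (assuming the standard pacemaker init script; the stale PID is a placeholder):

# look for lrmd processes that survived the package upgrade
ps -eo pid,ppid,lstart,cmd | grep '[l]rmd'
# current view of the cluster daemons
crm_mon -1
# if a leftover lrmd is found, stop pacemaker, remove it and start pacemaker again
service pacemaker stop
kill <stale_lrmd_pid>
service pacemaker start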

Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
status: New → Confirmed
tags: added: blocker-for-qa
Revision history for this message
Ruslan Khozinov (rkhozinov) wrote :

It seems that the bug is related to incorrect settings for ns_IPaddr2 (/var/lib/heartbeat/trace_ra/ns_IPaddr2), which cannot bring up vip__management.

When I tried to restart the vip__management resource, I saw ERROR messages in syslog:

<27>Nov 18 12:39:22 node-5 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Nov 18 12:39:22 node-5 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Nov 18 12:39:25 node-5 ocf-ns_IPaddr2: ERROR: Error: an inet prefix is expected rather than "undef".

I've enabled trace for the resource:
http://paste.ubuntu.com/23495234/ (stop resource)
http://paste.ubuntu.com/23495235/ (start resource)

The errors are caused by the "undef" values from the vip__* configuration, which are passed to the OCF script and make it call OCF steps with invalid values:

ocf_run ip netns exec haproxy undef
ocf_run ip netns exec haproxy ip route add undef dev b_management

It seems that the puppet manifests pass the literal "undef" value through to the resources instead of ''.
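
A quick way to confirm that literal "undef" strings ended up in the live cluster configuration (a sketch using crmsh, which Fuel environments ship):

# dump the live configuration and flag any parameter whose value is the string "undef"
crm configure show | grep -n undef
# or search the raw CIB
cibadmin --query | grep -c undef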

Revision history for this message
Anton Chevychalov (achevychalov) wrote :

Looks like we have a zombie lrmd process from the previous version of pacemaker.

Revision history for this message
Ruslan Khozinov (rkhozinov) wrote :

We've found a related issue.

During a failover test with shutdown of the primary controller,
we ran into the following problem:

nova-compute lost its connection to RabbitMQ because the default gateway for br-mgmt is set to 10.109.1.6:

auto br-mgmt
iface br-mgmt inet static
bridge_ports enp0s5
address 10.109.1.5/24
gateway 10.109.1.6

root@node-3:~# ip route
default via 10.109.1.6 dev br-mgmt
10.109.0.0/24 dev br-fw-admin proto kernel scope link src 10.109.0.6
10.109.1.0/24 dev br-mgmt proto kernel scope link src 10.109.1.5
10.109.2.0/24 dev br-storage proto kernel scope link src 10.109.2.5
unreachable 169.254.169.254 scope host
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1

and the neutron-openvswitch agent can't come up.

The root cause is in the vip__management resource configuration on the new primary controller:

primitive vip__management ocf:fuel:ns_IPaddr2 \
        params base_veth=v_management bridge=br-mgmt cidr_netmask=24 gateway=none gateway_metric=0 iflabel=ka ip=10.109.1.7 iptables_comment=undef ns=haproxy ns_iptables_start_rules=undef ns_iptables_stop_rules=undef ns_veth=b_management other_networks=undef \
        meta failure-timeout=60 migration-threshold=3 resource-stickiness=1 target-role=Started \
        op monitor interval=5 timeout=20 trace_ra=1 \
        op start interval=0 timeout=30 trace_ra=1 \
        op stop interval=0 timeout=30 trace_ra=1

When I changed the default route on the nova-compute node to 10.109.1.7 (the address configured by pacemaker) and restarted the neutron-openvswitch agent (which then successfully connected to RabbitMQ with 10.109.1.7 as the default route), an instance was created successfully.
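
The workaround described above as a sketch (the addresses come from this environment; the agent service name varies between releases, so treat it as an assumption):

# on the nova-compute node: point the default route at the management VIP
# held by the new primary controller, then restart the agent so it reconnects to RabbitMQ
ip route replace default via 10.109.1.7 dev br-mgmt
service neutron-plugin-openvswitch-agent restart
ip route   # verify the default route now goes via 10.109.1.7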

Revision history for this message
Anton Chevychalov (achevychalov) wrote :

It looks like we have two problems in this bug: one related to pacemaker and one to configuration.

Changed in fuel:
assignee: MOS Linux (mos-linux) → Anton Chevychalov (achevychalov)
Revision history for this message
Anton Chevychalov (achevychalov) wrote :

The pacemaker trouble has been moved to a separate bug: https://launchpad.net/bugs/1644152

Revision history for this message
Anton Chevychalov (achevychalov) wrote :

There are a lot of side effects from the pacemaker bug and from other QA blockers. Putting this on hold until we have a good snapshot that allows us to confirm this bug.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

Can't reproduce on snapshot id #748.

Changed in fuel:
status: Incomplete → Invalid