[Update] Update cluster from 9.1 to 9.2 failed on neutron task

Bug #1642613 reported by Ilya Bumarskov
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Anton Chevychalov
Milestone: 9.2

Bug Description

Detailed bug description:
Updating the cluster from 9.1 to 9.2 (snapshot id #515) failed on the nova-db task.

Steps to reproduce:
   - Deploy 9.1 env with following nodes:
         * Controller
         * Compute, cinder
         * Compute, cinder
   - Add proposed repo for env (http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-11-14-194322)
   - Download mos-mu tool on master (git clone https://github.com/aepifanov/mos_mu.git)
   - Run the preparation playbook (ansible-playbook playbooks/mos9_prepare.yml -e '{"env_id":<env_id>}')
   - Update fuel node (ansible-playbook playbooks/update_fuel.yml)
   - Update env (fuel2 update --env <ENV_ID> install)
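
For convenience, the same sequence as one shell session on the Fuel master node (a sketch based on the steps above; env id 1 is an assumption, substitute your own):

# proposed repo already added to the environment as described above
git clone https://github.com/aepifanov/mos_mu.git
cd mos_mu
ansible-playbook playbooks/mos9_prepare.yml -e '{"env_id": 1}'
ansible-playbook playbooks/update_fuel.yml
fuel2 update --env 1 install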

Observed behavior:
Deployment failed with:
Could not prefetch neutron_subnet provider 'neutron': Can't retrieve subnet-list because Neutron or Keystone API is not available.
Diagnostic snapshot: https://drive.google.com/open?id=0B8nyPqe6rrN1MU15Nlc4dDZhZFk
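
A quick way to check from the affected controller whether the Neutron and Keystone APIs are really unreachable, which is what the prefetch error above implies (a sketch; /root/openrc is the usual Fuel location for admin credentials, adjust if yours differs):

# source admin credentials and issue the same call the puppet provider makes
source /root/openrc
neutron subnet-list
# confirm the endpoints are registered and keystone answers
openstack endpoint list | grep -Ei 'neutron|keystone'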

Changed in fuel:
importance: Undecided → High
milestone: none → 9.2
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

The issue is related to the package upgrade procedure; after it, pacemaker starts dying constantly.

Linux team, can you check?

Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: check_active_before_startup_processes: Process lrmd terminated (pid=25845)
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: pcmk_process_exit: Respawning failed child process: lrmd
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: start_child: Forked child 6582 for process lrmd
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: check_active_before_startup_processes: Process pengine terminated (pid=25847)
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: pcmk_process_exit: Respawning failed child process: pengine
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: start_child: Using uid=107 and group=114 for process pengine
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: start_child: Forked child 6583 for process pengine
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/root
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: qb_ipcs_us_publish: server name: lrmd
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: error: qb_ipcs_us_publish: Could not bind AF_UNIX (): Address already in use (98)
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: error: mainloop_add_ipc_server: Could not start lrmd IPC server: Address already in use (-98)
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: error: main: Failed to create IPC server: shutting down and inhibiting respawn
Nov 16 18:07:54 [6582] node-3.test.domain.local lrmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: warning: pcmk_child_exit: The lrmd process (6582) can no longer be respawned, shutting the cluster down.
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
Nov 16 18:07:54 [5968] node-3.test.domain.local pacemakerd: notice: stop_child: Stopping crmd: Sent -15 to process 5986
Nov 16 18:07:54 [5986] node-3.test.domain.local crmd: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated
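
The "Address already in use" errors above suggest that an lrmd instance from before the upgrade is still holding its IPC socket. A minimal diagnostic sketch for the affected controller (assuming the standard pacemaker init script; the stale PID is a placeholder):

# look for lrmd processes that survived the package upgrade
ps -eo pid,ppid,lstart,cmd | grep '[l]rmd'
# current view of the cluster daemons
crm_mon -1
# if a leftover lrmd is found, stop pacemaker, remove it and start pacemaker again
service pacemaker stop
kill <stale_lrmd_pid>
service pacemaker start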

Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
status: New → Confirmed
tags: added: blocker-for-qa
Revision history for this message
Ruslan Khozinov (rkhozinov) wrote :

It seems that the bug is related to incorrect settings for ns_IPaddr2 (/var/lib/heartbeat/trace_ra/ns_IPaddr2), which cannot bring up vip__management.

When I tried to restart the vip__management resource, I saw ERROR messages in syslog:

<27>Nov 18 12:39:22 node-5 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Nov 18 12:39:22 node-5 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Nov 18 12:39:25 node-5 ocf-ns_IPaddr2: ERROR: Error: an inet prefix is expected rather than "undef".

I've enabled trace for the resource:
http://paste.ubuntu.com/23495234/ (stop resource)
http://paste.ubuntu.com/23495235/ (start resource)

The errors are caused by the "undef" values from the vip__* configuration, which are passed to the OCF script and make it call OCF steps with invalid values:

ocf_run ip netns exec haproxy undef
ocf_run ip netns exec haproxy ip route add undef dev b_management

It seems that the puppet manifests pass the literal "undef" value through to the resources instead of ''.
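
A quick way to confirm that literal "undef" strings ended up in the live cluster configuration (a sketch using crmsh, which Fuel environments ship):

# dump the live configuration and flag any parameter whose value is the string "undef"
crm configure show | grep -n undef
# or search the raw CIB
cibadmin --query | grep -c undef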

Revision history for this message
Anton Chevychalov (achevychalov) wrote :

Looks like we have a zombie lrmd process from the previous version of pacemaker.

Revision history for this message
Ruslan Khozinov (rkhozinov) wrote :

We've found a related issue.

During a failover test with shutdown of the primary controller,
we ran into the following problem:

nova-compute lost its connection to RabbitMQ because the default gateway for br-mgmt is set to 10.109.1.6:

auto br-mgmt
iface br-mgmt inet static
bridge_ports enp0s5
address 10.109.1.5/24
gateway 10.109.1.6

root@node-3:~# ip route
default via 10.109.1.6 dev br-mgmt
10.109.0.0/24 dev br-fw-admin proto kernel scope link src 10.109.0.6
10.109.1.0/24 dev br-mgmt proto kernel scope link src 10.109.1.5
10.109.2.0/24 dev br-storage proto kernel scope link src 10.109.2.5
unreachable 169.254.169.254 scope host
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1

and the neutron-openvswitch agent can't come up.

The root cause is in the vip__management resource configuration on the new primary controller:

primitive vip__management ocf:fuel:ns_IPaddr2 \
        params base_veth=v_management bridge=br-mgmt cidr_netmask=24 gateway=none gateway_metric=0 iflabel=ka ip=10.109.1.7 iptables_comment=undef ns=haproxy ns_iptables_start_rules=undef ns_iptables_stop_rules=undef ns_veth=b_management other_networks=undef \
        meta failure-timeout=60 migration-threshold=3 resource-stickiness=1 target-role=Started \
        op monitor interval=5 timeout=20 trace_ra=1 \
        op start interval=0 timeout=30 trace_ra=1 \
        op stop interval=0 timeout=30 trace_ra=1

When I changed the default route on the nova-compute node to 10.109.1.7 (the address configured by pacemaker) and restarted the neutron-openvswitch agent (which then successfully connected to RabbitMQ with 10.109.1.7 as the default route), an instance was created successfully.
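
The workaround described above as a sketch (the addresses come from this environment; the agent service name varies between releases, so treat it as an assumption):

# on the nova-compute node: point the default route at the management VIP
# held by the new primary controller, then restart the agent so it reconnects to RabbitMQ
ip route replace default via 10.109.1.7 dev br-mgmt
service neutron-plugin-openvswitch-agent restart
ip route   # verify the default route now goes via 10.109.1.7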

Revision history for this message
Anton Chevychalov (achevychalov) wrote :

It looks like we have two problems in this bug: one related to pacemaker and one to configuration.

Changed in fuel:
assignee: MOS Linux (mos-linux) → Anton Chevychalov (achevychalov)
Revision history for this message
Anton Chevychalov (achevychalov) wrote :

The pacemaker trouble has been moved to a separate bug: https://launchpad.net/bugs/1644152

Revision history for this message
Anton Chevychalov (achevychalov) wrote :

There are a lot of side effects from the pacemaker bug and from other QA blockers. Putting this on hold until we have a good snapshot that allows us to confirm this bug.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

Can't reproduce on snapshot id #748.

Changed in fuel:
status: Incomplete → Invalid