No warning for HA with OSD and MON roles on same nodes

Bug #1267937 reported by mauro
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Medium
Assigned to: Dmitry Borodaenko

Bug Description

Mirantis Fuel 4.0 deployment

3 compute nodes + 3 controller nodes

Node-13 is the leader in the Ceph cluster.

[root@node-15 ceph]# ceph quorum_status
{"election_epoch":24,"quorum":[0,1,2],"quorum_names":["node-13","node-14","node-15"],"quorum_leader_name":"node-13","monmap":{"epoch":3,"fsid":"4572353c-8777-4639-a16d-c581a3ebdf7e","modified":"2014-01-10 09:02:03.288830","created":"0.000000","mons":[{"rank":0,"name":"node-13","addr":"192.168.121.3:6789\/0"},{"rank":1,"name":"node-14","addr":"192.168.121.4:6789\/0"},{"rank":2,"name":"node-15","addr":"192.168.121.5:6789\/0"}]}}

After a power crash of node-13, node-14 is elected as the new leader in a cluster of 2 (node-14, node-15), but Ceph remains unavailable:

ceph -s --> output: "pending"

OpenStack objects (instances, volumes) are unavailable from both the CLI and the Horizon dashboard.

The situation recovers only after node-13 comes back into operation.

Revision history for this message
mauro (maurof) wrote :

The ceph.log is attached.

A snapshot is available on request.

Revision history for this message
mauro (maurof) wrote :

The described problem does not seem to occur if the affected node is not the leader.

Andrew Woodward (xarses)
tags: added: ceph
Revision history for this message
mauro (maurof) wrote :

Thanks Andrew, I see you have routed the problem to the Ceph area.
Has someone already started the investigation?

In my opinion ceph-mon is active on the remaining 2 nodes, but Ceph is still not accessible.

[root@node-15 ceph]# ceph -s
2014-01-13 09:02:26.467554 7f5574237700 0 -- :/1002881 >> 192.168.121.3:6789/0 pipe(0x7f556800a480 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f556800a6e0).fault
2014-01-13 09:02:29.468596 7f5574136700 0 -- :/1002881 >> 192.168.121.3:6789/0 pipe(0x7f556800ad80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f5568002f50).fault

[root@node-15 ceph]# ps aux | grep ceph
root 2223 0.2 0.8 787920 143404 ? Ssl Jan10 11:45 /usr/bin/ceph-osd -i 13 --pid-file /var/run/ceph/osd.13.pid -c /etc/ceph/ceph.conf
root 3780 0.2 0.4 713348 71992 ? Ssl Jan10 11:28 /usr/bin/ceph-osd -i 16 --pid-file /var/run/ceph/osd.16.pid -c /etc/ceph/ceph.conf
root 6541 0.2 1.3 867784 216584 ? Ssl Jan10 12:38 /usr/bin/ceph-osd -i 17 --pid-file /var/run/ceph/osd.17.pid -c /etc/ceph/ceph.conf
root 41060 0.0 0.0 103240 884 pts/0 S+ 08:56 0:00 grep ceph
root 42717 0.0 0.2 253716 49184 ? Sl Jan10 0:58 /usr/bin/ceph-mon -i node-15 --pid-file /var/run/ceph/mon.node-15.pid -c /etc/ceph/ceph.conf
root 42870 0.3 0.8 791888 147336 ? Ssl Jan10 12:57 /usr/bin/ceph-osd -i 11 --pid-file /var/run/ceph/osd.11.pid -c /etc/ceph/ceph.conf
root 44785 0.3 0.9 802136 150416 ? Ssl Jan10 14:32 /usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.conf
root 47618 0.2 1.0 806588 166588 ? Ssl Jan10 12:25 /usr/bin/ceph-osd -i 15 --pid-file /var/run/ceph/osd.15.pid -c /etc/ceph/ceph.conf

According to the following output the system is unusable: the DHCP and L3 agents do not work:

[root@node-15 ceph]# crm status
Last updated: Mon Jan 13 08:41:13 2014
Last change: Fri Jan 10 16:38:49 2014 via crmd on node-14.prisma
Stack: openais
Current DC: node-15.prisma - partition with quorum
Version: 1.1.8-1.el6-1f8858c
3 Nodes configured, 3 expected votes
17 Resources configured.

Online: [ node-14.prisma node-15.prisma ]
OFFLINE: [ node-13.prisma ]

 vip__management_old (ocf::heartbeat:IPaddr2): Started node-14.prisma
 vip__public_old (ocf::heartbeat:IPaddr2): Started node-15.prisma

Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-14.prisma node-15.prisma ]
     Stopped: [ p_haproxy:2 ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-14.prisma node-15.prisma ]
     Stopped: [ p_mysql:2 ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-14.prisma node-15.prisma ]
     Stopped: [ p_neutron-openvswitch-agent:2 ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-14.prisma node-15.prisma ]
     Stopped: [ p_neutron-metadata-agent:2 ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-15.prisma (unmanaged) FAILED
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-14.prisma (unmanaged) FAILED
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-14.prisma

...


Changed in fuel:
assignee: nobody → Andrey Korolyov (xdeller)
milestone: none → 4.1
Revision history for this message
Andrey Korolyov (xdeller) wrote :

It is bad practice to mix the OSD and MON roles on the same nodes and expect them to withstand power failures. Also, it looks like your client simply lost connectivity to the quorum between

2014-01-10 14:16:11.699720 mon.0 192.168.121.3:6789/0 391 : [INF] pgmap v272: 492 pgs: 492 active+clean; 644 MB data, 39089 MB used, 44631 GB / 44670 GB avail
2014-01-10 14:39:54.201945 mon.2 192.168.121.5:6789/0 2 : [INF] mon.node-15 calling new monitor election

So please consider separating those roles before failure testing.

Changed in fuel:
importance: Undecided → Wishlist
status: New → Won't Fix
Revision history for this message
Andrey Korolyov (xdeller) wrote :

Dima Borodaenko, please discuss with the UI team the possibility of adding a warning for such a configuration, since it is not true HA.

Revision history for this message
mauro (maurof) wrote :

The deployment was done according to the Fuel options.

I noticed the following behaviour: in a cluster of 3 Ceph nodes, when 1 Ceph node goes down the following occurs:

a) if the crashed Ceph node is not the "original leader":
      --> it is possible to create/attach/detach/delete new volumes
      --> it is not possible to detach/attach/delete existing volumes (the command hangs)

b) if the crashed Ceph node is the "original leader":
   Ceph and RBD commands no longer work at all:
      --> it is not possible to create/attach/detach/delete new volumes (the command hangs)
      --> it is not possible to detach/attach/delete existing volumes (the command hangs)

Concerning point b): in my opinion the ceph.conf file addresses only the ceph-mon of the original leader node, and once it is down there is no longer any way to reach the volume service, RBD and so on.

Here below is the current ceph.conf. (We have now deployed 4.0 with the same topology and OpenStack is up and running:
(node-8, node-10, node-12: controller + ceph)
(node-7, node-9, node-11: compute))

[root@node-8 ~]# more /etc/ceph/ceph.conf
[global]
filestore_xattr_use_omap = true
mon_host = 192.168.121.4
fsid = 1a305a5d-0dd7-488b-9c71-b008158329a6
mon_initial_members = node-8
auth_supported = cephx
osd_journal_size = 2048
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 100
public_network = 192.168.121.0/24
osd_pool_default_pgp_num = 100
osd_mkfs_type = xfs
cluster_network = 192.168.122.0/24

Wouldn't it be useful to set "mon_host = 192.168.121.2" (the virtual IP address)?

[root@node-8 ~]# ip a | grep 192.168.121
    inet 192.168.121.4/24 brd 192.168.121.255 scope global br-mgmt
[root@node-8 ~]# ssh node-10
Warning: Permanently added 'node-10,192.168.121.6' (RSA) to the list of known hosts.
Last login: Mon Jan 27 17:20:13 2014 from 192.168.120.100
[root@node-10 ~]# ip a | grep 192.168.121
    inet 192.168.121.6/24 brd 192.168.121.255 scope global br-mgmt
    inet 192.168.121.2/24 brd 192.168.121.255 scope global secondary br-mgmt:ka
[root@node-10 ~]# ssh node-12
Warning: Permanently added 'node-12,192.168.121.8' (RSA) to the list of known hosts.
Last login: Mon Jan 27 16:48:04 2014 from 192.168.121.6
[root@node-12 ~]# ip a | grep 192.168.121
    inet 192.168.121.8/24 brd 192.168.121.255 scope global br-mgmt

Revision history for this message
Andrey Korolyov (xdeller) wrote :

Ehm. Speaking of setting it to the VIP - no, that is bad practice. I wonder why we're not pushing an array of monitors everywhere. Raising the bug level then, since it is far more important than just daemon placement. Thank you for pointing it out!
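
For illustration, a minimal sketch of what the client-facing part of ceph.conf could look like with the full monitor array listed instead of a single address. The node-8/node-10/node-12 addresses are taken from the ip output in the previous comment; the exact file Fuel generates is not shown in this bug, so treat this as an assumption:

[global]
mon_initial_members = node-8,node-10,node-12
mon_host = 192.168.121.4,192.168.121.6,192.168.121.8

With every monitor listed, clients can fall back to a surviving monitor when one node goes down, instead of depending on a single host address or a VIP.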

Changed in fuel:
assignee: Andrey Korolyov (xdeller) → Dmitry Borodaenko (dborodaenko)
status: Won't Fix → Confirmed
importance: Wishlist → High
Revision history for this message
mauro (maurof) wrote :

OK, you're welcome.

I gather that the described behaviour is the expected one, given the current state of the art and my specific deployment.
Could you confirm it?

thanks
 mauro

Changed in fuel:
importance: High → Critical
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Andrey, we already have a bug about not pushing an array of monitors everywhere:
https://bugs.launchpad.net/fuel/+bug/1268579

As far as I understand, the other remaining issue highlighted in this bug was the lack of a warning in the UI for combining OSD and MON on the same node. This is a more general problem than just Ceph: there should be a general notice in the role assignment screen warning that combining more than one node per role will reduce reliability, and that dedicated nodes should be allocated for all roles if sufficient hardware is available.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Typo, the above comment should read: "combining more than one role per node".

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please confirm my conclusions and, if you agree, downgrade this bug and assign it back to me for resolution of the UI warning aspect.

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Andrey Korolyov (xdeller)
Revision history for this message
Andrey Korolyov (xdeller) wrote :

mauro - yes, you are correct. First, your setup is not expected to always survive a node being killed, according to Inktank; and second, we are not describing the monitor array as we should in the client section of ceph.conf.

summary: - ceph unavailable after leader node power crash
+ No warning for HA with OSD and MON roles on same nodes
Changed in fuel:
importance: Critical → Medium
assignee: Andrey Korolyov (xdeller) → Dmitry Borodaenko (dborodaenko)
tags: added: docs
tags: added: customer-found
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

So what is the resolution? Is there nothing we can do except fix https://bugs.launchpad.net/fuel/+bug/1268579 ?

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

There are two more things we can do:

1) add a note in the documentation that although Fuel allows combining multiple roles on the same node, it is recommended to use dedicated OSD nodes in production for better reliability -- that's why there's a docs tag on this bug.

2) add ceph-mon as a standalone role (trivial to do once granular deployment is implemented) to separate the failure domains of the Ceph monitor and OpenStack controller components -- this should be done separately under the blueprint I've just created:

https://blueprints.launchpad.net/fuel/+spec/ceph-mon-role

Mike Scherbakov (mihgen)
Changed in fuel:
milestone: 4.1 → 5.0
Changed in fuel:
milestone: 5.0 → 5.1
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Vladimir, the only action that still needs to be addressed in this bug is the documentation; the code change requirements are all documented elsewhere. The documentation change should be doable in 5.0.

Mike Scherbakov (mihgen)
Changed in fuel:
milestone: 5.1 → 5.0
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

And what about the separate ceph-mon role?

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

As I mentioned above in comment #14, the separate ceph-mon role should be addressed by a blueprint. The BP linked above has been superseded by a more generic BP that covers ceph-rgw in addition to ceph-mon:
https://blueprints.launchpad.net/fuel/+spec/fuel-ceph-roles

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

In fact, there's already a warning against placing the Ceph OSD role on controller nodes; it was added here:
https://review.openstack.org/#/c/78631/2/pages/reference-architecture/0030-cluster-sizing.rst

Changed in fuel:
status: Confirmed → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released