Comment 44 for bug 1811941

Amer Hwitat (amer.hwitat) wrote :

Well, the solution from my side, to guarantee that all OpenStack services keep running in case of a timeout (connection drop), is to go and edit the .conf files of nova, keystone, swift, cinder, glance, neutron, etc. and add a very large (or at least some reasonable) timeout value, combined with a fix to the way openvswitch handles this, because it is not a fault-tolerant or failover component and needs a patch in your repos immediately. I recommend making it redirect to the loopback NIC, which is running anyway and not used most of the time.
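To give an idea of what I mean (only a sketch; the exact option names differ per service and release, and 999999999 was just to make the point, not a value I would actually recommend), in /etc/nova/nova.conf something like:

[DEFAULT]
# how long an RPC call waits for a reply before giving up (oslo.messaging)
rpc_response_timeout = 600

[database]
# keep retrying the database connection after a drop instead of giving up (oslo.db)
max_retries = -1
retry_interval = 10

and the equivalent options in the other services' .conf files.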

Otherwise, whoever reads this article and benefits from it illegally will put you in the dilemma of a possible service drop in case of a DoS attack, and in the worst-case scenario you might find your servers fried and down and lose a considerable amount of money. It is real that software can fire up CPUs in this case; even though I don't think it is going to take your main site down, you will get into the trouble of bringing everything back, and you may lose customers if you provide a real-time service like telecommunications. Also, even though I didn't test this scenario live on a multi-node environment and I don't think it would be as extreme as that, the error I had with openvswitch that made my VM crash is CPU-related:

[root@localhost log]#
Message from syslogd@localhost at Jan 23 02:23:31 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [ovsdb-server:10088]

So make sure to test it in a test environment that has Director, and to test on all node types (control, compute, volume), because it has a branching effect: it causes an underdog (undercloud) problem, an overdog (overcloud) problem, and a watchdog (OVS) problem... :)
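As for the watchdog part: since there is no failover for ovsdb-server in my installation as far as I can tell, the least you can do is have systemd bring it back automatically when it dies like it did above. A rough sketch (assuming the unit is named ovsdb-server.service, as in the RHEL/CentOS openvswitch packaging, and that a restart policy is not already set):

# /etc/systemd/system/ovsdb-server.service.d/override.conf
[Service]
# restart ovsdb-server automatically if it crashes or gets killed
Restart=on-failure
RestartSec=2

then run "systemctl daemon-reload" so the drop-in takes effect. This only restarts the daemon, it is not real failover, but it beats staying down.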

Like when you have problems with connectivity, you re-route traffic to the loopback so that you don't crash, or you put a failover core switch in the local network as a backup plan for your datacenter; that is common sense.

I know what I do with this installation is not common sense, but the worst-case scenario is that your main site loses connectivity and the DR site comes up and running; then your customers won't be interrupted and they will have a good experience with your network, but you will have a headache bringing the main hardware and OS back to the state they were in before losing power, for example (when the backup generators don't hold), or before the main backbone infrastructure went down for any reason. You might then spend hours rescuing and restoring your main site if you don't have a failover OVS, which is not present in my installation at the moment, and I don't know if one exists.

Computer systems do not sweet-talk you or pay you compliments when they fail; they hit hard, and you get backlash from your management. I learned this back in 2002 when I was put in the situation of bringing back data (the OS) from a mirrored RAID HD, when it should have had a hot-spare mechanism in the BIOS and I shouldn't have needed to do anything except change the HD manually and hot-plug it, and you know what, it didn't work; luckily this was the OS, not the archived RAID 5 database...

That's also why I work on the academic side, not the technical one...