provisioning fails because of cobbler race conditions

Bug #1328873 reported by Aleksandra Fedorova on 2014-06-11
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
High
Vladimir Sharshov
4.1.x
High
MOS Maintenance
5.0.x
High
Vladimir Sharshov

Bug Description

We add nodes to cluster, they get registered in cobbler, then we reboot them and they fail to get data to boot. Supposedly because cobbler is in the process of rewriting their config files.

Attached screenshot with the error from node vncdisplay.

Aleksandra Fedorova (bookwar) wrote :
Changed in fuel:
milestone: 4.1.2 → 5.1
Nastya Urlapova (aurlapova) wrote :
Nastya Urlapova (aurlapova) wrote :

Jun 10 15:12:26 2014 - INFO | received on stdout: Shutting down dnsmasq: [ OK ]
Starting dnsmasq: [FAILED]

Tue Jun 10 15:12:26 2014 - DEBUG | received on stderr:
dnsmasq: failed to bind DHCP server socket: Address already in use

Tue Jun 10 15:12:26 2014 - ERROR | service dnsmasq restart failed

Matthew Mosesohn (raytrac3r) wrote :

The problem looks like astute runs cobbler sync twice in quick succession. Cobbler sync takes 4-5 seconds, so we need to make sure cobbler sync is not running before we call it again. It can break if dnsmasq is already trying to restart and then it gets asked to restart again. It's not so easy to fix the init script, but it is easier to teach astute to not run cobbler sync multiple times.

Evgeniy L (rustyrobot) wrote :

I looked at the logs, and looked at the code.

1. astute calls sync once after it adds all nodes to cobbler [1]

2. in logs I see that sync was called twice

 -rw-r--r-- root/root 2309 fuel-snapshot-2014-06-10_19-39-58/localhost/var/log/cobbler/tasks/2014-06-10_121610_sync.log
 -rw-r--r-- root/root 3091 fuel-snapshot-2014-06-10_19-39-58/localhost/var/log/cobbler/tasks/2014-06-10_121614_sync.log

I assume that first call was caused by puppet, and second by astute (in astute logs I see only one sync call)

[1] https://github.com/stackforge/fuel-astute/blob/9f0a2a3135ace2e8d9bf7a4c5f661b8326ace285/lib/astute/cobbler_manager.rb#L46

Changed in fuel:
assignee: nobody → Evgeniy L (rustyrobot)
importance: Undecided → High
status: New → Triaged
Evgeniy L (rustyrobot) wrote :

Matthew, do you have any idea why this could happen? (see my comment above)

Matthew Mosesohn (raytrac3r) wrote :

Evgeniy, it's really unlikely to see cobbler sync run within 4 seconds of the previous run. I don't believe Puppet could do this. Vladimir Sharshov is back from vacation today, so he should continue looking at this.

Changed in fuel:
assignee: Evgeniy L (rustyrobot) → Vladimir Sharshov (vsharshov)
Vladimir Sharshov (vsharshov) wrote :

Problem logs have unexpected lines.

Typical behavior:
# Start cobbler service
Fri Jun 27 12:01:58 2014 - DEBUG | XMLRPC running on 25151
Fri Jun 27 12:02:15 2014 - DEBUG | XMLRPC running on 25151
# End puppet run, cobbler ready to perform job

In our logs:
Tue Jun 10 12:16:09 2014 - DEBUG | API handle initialized
Tue Jun 10 12:16:09 2014 - DEBUG | XMLRPC running on 25151
Tue Jun 10 12:16:09 2014 - DEBUG | API handle initialized
Tue Jun 10 12:16:09 2014 - DEBUG | XMLRPC running on 25151

And after this lines we process all setup process (and most important of it: add os profiles)

To prevent situation where cobbler container can potentially report us wrong ready status i add new check https://review.openstack.org/103118

This is potentially fix, but bug is hard to reproduce and we should get new logs with this problem after merge 103118.

Vladimir Sharshov (vsharshov) wrote :

Small additional.

Logs in /var/log/cobbler/tasks always contain 2 syncs just after cobbler report as ready to work and before any tasks from Astute.
2014-06-27_131542_sync.log
2014-06-27_131549_sync.log

Reviewed: https://review.openstack.org/103118
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=cd0283ececcb69b2b1e9dbde51bb09d71cf7b7c8
Submitter: Jenkins
Branch: master

commit cd0283ececcb69b2b1e9dbde51bb09d71cf7b7c8
Author: Vladimir Sharshov <email address hidden>
Date: Fri Jun 27 16:47:50 2014 +0400

    Strong check for cobbler container (profiles)

    Instead of just check cobbler process now
    we will also check availability of system
    profiles. Without profiles no nodes can
    process provisioning stage.

    This is possible fix for #1328873.

    Change-Id: If93c21f0f771fce282cd3211903585d3ca2d1d14
    Related-Bug: #1328873

Vladimir Sharshov (vsharshov) wrote :

Can someone from QA try to reproduce it?

Changed in fuel:
status: Triaged → In Progress
Nastya Urlapova (aurlapova) wrote :

for 5.0-76

Reviewed: https://review.openstack.org/103738
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=31470f355c57108ec800c4c5bb398e6e66c656bb
Submitter: Jenkins
Branch: stable/5.0

commit 31470f355c57108ec800c4c5bb398e6e66c656bb
Author: Vladimir Sharshov <email address hidden>
Date: Fri Jun 27 16:47:50 2014 +0400

    Strong check for cobbler container (profiles)

    Instead of just check cobbler process now
    we will also check availability of system
    profiles. Without profiles no nodes can
    process provisioning stage.

    This is possible fix for #1328873.

    Change-Id: If93c21f0f771fce282cd3211903585d3ca2d1d14
    Related-Bug: #1328873
    (cherry picked from commit cd0283ececcb69b2b1e9dbde51bb09d71cf7b7c8)

Changed in fuel:
status: In Progress → Fix Committed
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 4.1.x - there no plans to release maintenance updates for 4.1.x versions

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers