[DIB] Use of rc.local to modify paths creates race condition

Bug #1324582 reported by Trevor McKay
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Sahara
Fix Released
High
Ivan Berezovskiy

Bug Description

The file sahara-image-elements/elements/hadoop/post-install.d/firstboot is installed on images as /etc/rc.d/rc.local. It contains lines to modify /mnt:

    chown root:root /mnt
    mkdir -p /mnt/log/hadoop
    chown hadoop:hadoop /mnt/log/hadoop

The vanilla 1.2.1 plugin also modifies /mnt in the /tmp/sahara-hadoop-init.sh script created by the plugin:

  chown -R hadoop:hadoop /mnt
  chmod -R 755 /mnt

When the rc.local file is run by systemd as the rc-local.service, it creates a race condition with the /tmp/sahara-hadoop-init.sh script. (timestamps showing the race in a failed case are shown below). In the failure case, the datanode service on a worker will not start because /mnt has the wrong ownership (https://bugs.launchpad.net/sahara/+bug/1295807)

This is particularly evident with Fedora 1.2.1 images since the rc.local file contains a 'sleep' statement and a "until do" loop in the Fedora case (but it could theoretically happen with any plugin that modifies /mnt or other things that firstboot touches)

       Fedora )
          sleep 20
          rm /etc/resolv.conf
          service network restart
          if [ $(lsb_release -rs) -ge '19' ]; then
             user=fedora
          else
             user=ec2-user
          fi
          until [[ -n $(grep "$user:" /etc/passwd) && -n $(cat /etc/group | grep "$user:") ]];
          do
             sleep 1
          done
          chown -R $user:$user /home/$user
       ;;

Below is output from a worker node where a timestamp was added to the /etc/rc.d/rc.local file on the image using virt-edit, and another timestamp was added to the /tmp/sahara-hadoop-init.sh script created by the plugin code:

[root@testcluster-workers-003 tmp]# cat rc-local.debug
chowned mnt at Thu May 29 14:24:45 UTC 2014

[root@testcluster-workers-003 tmp]# cat sahara-hadoop-init.log
+ echo -n
+ cat /etc/hadoop/hadoop-env.sh
+ cp /tmp/hadoop-env.sh /etc/hadoop/hadoop-env.sh
+ sed -i 's,export HADOOP_LOG_DIR=.*,export HADOOP_LOG_DIR=/mnt/log/hadoop/$USER/,' /etc/hadoop/hadoop-env.sh
+ sed -i 's,export HADOOP_SECURE_DN_LOG_DIR=.*,export HADOOP_SECURE_DN_LOG_DIR=/mnt/log/hadoop/hdfs,' /etc/hadoop/hadoop-env.sh
+ chown -R hadoop:hadoop /mnt
+ chmod -R 755 /mnt
++ date
+ echo 'sahara-init finished at' Thu May 29 14:24:36 UTC 2014
sahara-init finished at Thu May 29 14:24:36 UTC 2014

We have a machine where this is very easy to reproduce, probably 50% of the time or greater using Fedora images. However, it does not appear on all machines (the nature of a race)

Tags: dib
Revision history for this message
Trevor McKay (tmckay) wrote :

Note, made https://bugs.launchpad.net/sahara/+bug/1295807 a duplicate of this new bug

Changed in sahara:
status: New → Confirmed
importance: Undecided → High
milestone: none → juno-1
Revision history for this message
Trevor McKay (tmckay) wrote :

rc-local was modified as follows:

   $ virt-edit -a ~/src/sahara-icehouse-vanilla-1.2.1-fedora-20.qcow2 /etc/rc.d/rc.local

Add after line 36:

    echo "chowned /mnt to root at " `date` >> /tmp/rc-local.debug

resulting in:

     36 chown root:root /mnt
     37 echo "chowned /mnt to root at " `date` >> /tmp/rc-local.debug
     38 mkdir -p /mnt/log/hadoop
     39 chown hadoop:hadoop /mnt/log/hadoop

Revision history for this message
Chad Roberts (croberts) wrote :

In addition, to produce the desired debug output in sahara-hadoop-init.log, plugins/vanilla/v1_2_1/config_helper.py had the following added at the end of generate_setup_script

script_lines.append("echo 'sahara-init finished at' `date`")

Revision history for this message
Chad Roberts (croberts) wrote :

The datanode startup will fail with the following (because /mnt/lib/hadoop/hdfs/datanode can't be created.

2014-05-29 14:24:47,887 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = testcluster-workers-003.novalocal/10.0.4.9
STARTUP_MSG: args = []
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:27:42 PDT 2013
STARTUP_MSG: java = 1.7.0_51
************************************************************/
2014-05-29 14:24:48,666 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-05-29 14:24:48,817 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2014-05-29 14:24:48,818 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-05-29 14:24:48,818 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2014-05-29 14:24:49,544 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2014-05-29 14:24:49,558 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2014-05-29 14:24:49,814 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in dfs.data.dir: can not create directory: /mnt/lib/hadoop/hdfs/datanode
2014-05-29 14:24:49,814 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: All directories in dfs.data.dir are invalid.
2014-05-29 14:24:49,814 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2014-05-29 14:24:49,824 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at testcluster-workers-003.novalocal/10.0.4.9
************************************************************/

Revision history for this message
Trevor McKay (tmckay) wrote :

Using guestfish with the fedora image, it looks like /mnt is already owned by root when the image is created, so the "chown" in firstboot isn't needed:

$ guestfish -a sahara-icehouse-vanilla-1.2.1-fedora-20.qcow2 -i

><fs> ll /mnt
total 8
drwxr-xr-x 2 root root 4096 Aug 7 2013 .
drwxr-xr-x 18 root root 4096 Mar 22 06:49 ..

><fs> ll /
total 84
....
drwxr-xr-x 2 root root 4096 Aug 7 2013 mnt
....

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara-image-elements (master)

Reviewed: https://review.openstack.org/97193
Committed: https://git.openstack.org/cgit/openstack/sahara-image-elements/commit/?id=44ba82ddfb1e379e605793e71af2fcfa774007b3
Submitter: Jenkins
Branch: master

commit 44ba82ddfb1e379e605793e71af2fcfa774007b3
Author: iberezovskiy <email address hidden>
Date: Mon Jun 2 14:29:47 2014 +0400

    Remove chown of /mnt directory

    /mnt is always owned by root in vanilla images,
    so we don't need to do chown.

    Also execution "chown root:root /mnt" may be run
    after plugin command "chown -R hadoop:hadoop /mnt"
    sometimes (if plugin will make chown before rc.local script)
    and then datanode can't be started.

    Change-Id: If72308a9969fd9d1bf8bab58f8294b43fb292957
    Partial-bug: #1324582
    Closes-bug: #1295807

Changed in sahara:
milestone: juno-1 → juno-2
Changed in sahara:
milestone: juno-2 → juno-3
Changed in sahara:
milestone: juno-3 → next
Revision history for this message
Trevor McKay (tmckay) wrote :

Marked this as fixed. Not sure why the commit is labeled "Partial-bug", afaik this fixes the issue

Changed in sahara:
status: Confirmed → Fix Released
Changed in sahara:
milestone: next → 2014.2
assignee: nobody → Ivan Berezovskiy (iberezovskiy)
Luigi Toscano (ltoscano)
tags: added: dib
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.