Upgrade Jenkins EC2 plugin to 1.17

Bug #1003831 reported by Paul Sokolovsky
This bug affects 1 person
Affects                        Status        Importance  Assigned to       Milestone
Linaro Android Infrastructure  Fix Released  High        Paul Sokolovsky
Linaro CI                      Fix Released  Critical    Stevan Radaković

Bug Description

From mail:

Version 1.15 of the Jenkins EC2 plugin was released a couple of days ago.
It appears to have generic fixes here and there, nothing that would
immediately resolve our woes, but it may help anyway:

https://wiki.jenkins-ci.org/display/JENKINS/Amazon+EC2+Plugin#AmazonEC2Plugin-Changelog

So, I'd like to propose spending time on Connect hacking sessions to
test it (on sandbox) and upgrade production if all goes well.

Changed in linaro-android-infrastructure:
status: New → Confirmed
importance: Undecided → Medium
Paul Sokolovsky (pfalcon) wrote :

Now there's 1.16; there seems to have been a regression in 1.15.

Paul Sokolovsky (pfalcon) wrote :

Testing this on https://ec2-23-20-81-190.compute-1.amazonaws.com/jenkins/ ; so far there seem to be issues.

Paul Sokolovsky (pfalcon) wrote :

A note: once upgraded to 1.16, it's not possible to (easily) downgrade back to 1.14, because 1.16 renames and migrates a few settings in the Jenkins global config.xml.
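
Since the upgrade rewrites settings in config.xml, keeping a copy of the file from before the upgrade seems prudent. The following is a minimal sketch only: the function name, the backup suffix and the idea that a file-level copy helps a manual rollback are assumptions, not something tested in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of linaro-android-build-tools):
# copy the global config.xml aside before upgrading the plugin.
# Argument 1: the Jenkins home directory (location is an assumption).
backup_jenkins_config() {
    local jenkins_home=$1
    local src="$jenkins_home/config.xml"
    local dest
    dest="$src.pre-ec2-upgrade.$(date +%Y%m%d%H%M%S)"

    # Refuse to continue if there is nothing to back up.
    [ -f "$src" ] || { echo "no config.xml in $jenkins_home" >&2; return 1; }

    # -p keeps mode and timestamps; print the backup path on success.
    cp -p "$src" "$dest" && echo "$dest"
}
```

It would be called with the Jenkins home directory as its only argument, and it prints the path of the copy it made.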

Paul Sokolovsky (pfalcon) wrote :

Test plan:

1. Create a new sandbox
1a. Bump instance cap to 25
2. Create a new build on it based on https://android-build.linaro.org/builds/~pfalcon/panda-ics-gcc47-tilt-tracking-blob/
3. Start the build
4. Verify that the build works.
5. Upgrade EC2 plugin to 1.16
6. Start the same build as in step 3
7. Check that the results match step 4

Paul Sokolovsky (pfalcon) wrote :

Result for steps 1-4

Paul Sokolovsky (pfalcon) wrote :
summary: - Upgrade Jenkins EC2 plugin to 1.15
+ Upgrade Jenkins EC2 plugin to 1.16
Changed in linaro-android-infrastructure:
status: Confirmed → Triaged
assignee: nobody → Paul Sokolovsky (pfalcon)
Changed in linaro-android-infrastructure:
status: Triaged → In Progress
Paul Sokolovsky (pfalcon) wrote : Re: Upgrade Jenkins EC2 plugin to 1.16

This was worked on during Connect, and the following was found:

1. 1.16 changed the config format, i.e. after upgrading from 1.14 to 1.16 it's not possible to downgrade back using the Jenkins built-in mechanism. Specifically, 1.16 changes at least the enum value names for instance types, etc.

2. 1.16 changed which user accounts the slave init scripts, etc. are run from, i.e. the current linaro-android-build-tools scripts don't run with it out of the box. Relatively small changes are needed to make them run with 1.16 (the changes are currently present on i-1f150379), but they need more testing, and it's unclear whether these changes in 1.16 are intended or a regression.

Summing up, this needs more work: testing, reviewing the 1.14 to 1.16 source code changes, possibly contacting the Jenkins mailing list/EC2 plugin maintainers, and deciding on our side whether we want to upgrade soon with this situation in mind (we can't easily switch back to 1.14).

Loïc Minier (lool) wrote :

The changelog mentions that the init script is only run once on boot rather than on each build, and also that the user Jenkins connects as isn't hardcoded to root anymore; this seems to match the behavior changes you noticed.

Did we finish this testing by now?

Paul Sokolovsky (pfalcon) wrote :

> Did we finish this testing by now?

No, in the sense that it's not clear how to proceed further: upgrading to 1.16+ will require changing the build scripts and a switchover period (which, with contingency management, may be a few days). There's also a risk that if we see big issues later, we can't easily downgrade. I.e. this needs scheduling at a higher level than individual Infrastructure engineers.

Changed in linaro-android-infrastructure:
status: In Progress → Incomplete
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :
Changed in linaro-android-infrastructure:
status: Incomplete → In Progress
importance: Medium → High
Paul Sokolovsky (pfalcon) wrote :

Step 1: Upgrade the EC2 plugin to 1.17, wait for Jenkins to restart.
Step 2: The new version has per-slave-type instance caps. They're 0 after the upgrade (smart) and need to be set to a sane value. Manage Jenkins -> Configure System -> Cloud -> for each entry in "AMIs" -> Advanced -> Instance Cap, set to the same value as the global cap.
Step 3: The new version no longer does Jenkins internal operations (like copying client jars) as root, but rather as "ubuntu". That means the workspace needs to be owned by "ubuntu". Add the following commands to each slave init script:

# required for ec2 plugin 1.17
mkdir -p /mnt/jenkins
chown ubuntu /mnt/jenkins
chmod 770 /mnt/jenkins
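
The three commands above could also be wrapped in a small function so that the directory and owner are not repeated in every init script. This is only a sketch: the function name is hypothetical, the defaults simply mirror the values given in step 3, and it assumes the init script itself still runs with enough privileges to chown.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the step 3 commands.
# Argument 1: workspace directory (default /mnt/jenkins, as in step 3).
# Argument 2: owner, name or numeric uid (default ubuntu, as in step 3).
prepare_workspace() {
    local dir=${1:-/mnt/jenkins}
    local owner=${2:-ubuntu}

    # Stop at the first failing command so a half-prepared
    # workspace is reported through the return status.
    mkdir -p "$dir" &&
        chown "$owner" "$dir" &&
        chmod 770 "$dir"
}
```

Called without arguments it does the same as the snippet above; both commands are safe to repeat, so running it on every boot should be harmless.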

Paul Sokolovsky (pfalcon) wrote :

Step 4: The job build script is also no longer run as root, but likewise as ubuntu. The following change is required:

-build-tools/node/build us-east-1.ec2-git-mirror.linaro.org "$CONFIG"
+sudo -H -E build-tools/node/build us-east-1.ec2-git-mirror.linaro.org "$CONFIG"

Paul Sokolovsky (pfalcon) wrote :
summary: - Upgrade Jenkins EC2 plugin to 1.16
+ Upgrade Jenkins EC2 plugin to 1.17
Changed in linaro-android-infrastructure:
milestone: none → 2012.10
Changed in linaro-ci:
status: New → Triaged
importance: Undecided → Critical
Paul Sokolovsky (pfalcon) wrote :

Recently, we have been experiencing cases of build slaves that did not start up properly and were not automatically terminated by Jenkins. But I believe I saw such cases both on ci.* (old plugin version) and a-b (new plugin version).

Otherwise, the new plugin works well; closing for android infra, still pending for ci.linaro.org.

Changed in linaro-android-infrastructure:
status: In Progress → Fix Released
Loïc Minier (lool) wrote :

It seems pretty bad to have a snippet added to all build scripts; we should rather fix the AMIs or do that in some common pre-start script; in the worst case, we should add a snippet to pull a common pre-start script from a shared location and run that :-)

Paul Sokolovsky (pfalcon) wrote :

> fix the AMIs
> common pre-start script from a shared location

We don't store build scripts in AMIs. The best practice is exactly to have build scripts in a shared location and pull them directly before the build. The idea is that the AMI stores the slowly-changing background environment, but the actual data/scripts may change at any time and should always be current. Caveat: a-b adheres to the best practice above 100%, ci.* adheres to it <unknown>%; there may still be jobs which have the entire build script in Jenkins.

> some common pre-start script

Yes, adding a wrapper script that does the sudo and dispatches to the original build script is a viable option. For a-b, I opted not to add another layer, but to keep it clean by automatically mass-migrating the existing uniform stuff. Again, for ci.*, a wrapper script is a viable option.
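
For illustration, such a wrapper might look roughly like the sketch below. The function name and the SUDO_CMD override are hypothetical (the override exists only so the function can be exercised without real sudo); the -H -E flags are the ones used in step 4.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the given command as root.
# If we are already root (old plugin behaviour), run it directly;
# otherwise go through sudo with the same flags as in step 4
# (-H: use root's HOME, -E: keep the caller's environment).
run_privileged() {
    if [ "$(id -u)" -eq 0 ]; then
        "$@"
    else
        "${SUDO_CMD:-sudo}" -H -E "$@"
    fi
}
```

A job would then call, for example, run_privileged build-tools/node/build us-east-1.ec2-git-mirror.linaro.org "$CONFIG", and the same job text would work under both the old and the new plugin version.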

Changed in linaro-ci:
milestone: none → 2012.11
Changed in linaro-ci:
status: Triaged → In Progress
assignee: nobody → Stevan Radaković (stevanr)
Paul Sokolovsky (pfalcon) wrote :

Stevan has a ci.linaro.org clone sandbox for migration at https://ec2-54-242-157-93.compute-1.amazonaws.com/jenkins/ . Once the slave init scripts were tweaked, most builds worked immediately; some required minor tweaking per Stevan. The most "complicated" were the OE builds. Investigating them, it turned out that the slave init script as given in step 3 above requires tweaking:

mkdir -p /mnt/ci_build
chown ubuntu /mnt/ci_build
chmod u+rw,g+rw,o-w,+Xr /mnt/ci_build

The main change is to allow everyone to search the directory hierarchy; as a precaution against further failures, I also added world readability, which may not be required. Then, the OE build script required adding "sudo" in front of some commands. The latest version of the updated script is here:

https://ec2-54-242-157-93.compute-1.amazonaws.com/jenkins/job/openembedded-armv8-minimal/configure
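
One detail of the chmod line above that may be worth knowing: the last clause, +Xr, names no user class, and such a clause is filtered through the umask, so on a slave with a restrictive umask the world-readable and world-searchable bits might not actually be set. A variant that spells out a+rX does not depend on the umask. The sketch below is an assumption-laden illustration (hypothetical function name, GNU stat assumed for printing the mode), not the script used on the sandbox.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the tweak above with an explicit "a" class,
# so the result is the same whatever umask the init script runs under.
# Ownership (chown ubuntu) is left out here on purpose.
# Prints the resulting octal mode of the directory.
open_build_dir() {
    local dir=$1
    mkdir -p "$dir" &&
        chmod u+rw,g+rw,o-w,a+rX "$dir" &&
        stat -c '%a' "$dir"   # GNU coreutils stat; an assumption
}
```

On a freshly created directory this should give rwxrwxr-x, i.e. searchable and readable by everyone but writable only by owner and group.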

Stevan Radaković (stevanr) wrote :

The configuration of other job types must be tweaked per the following instructions (mostly adding sudo):

precise, quantal:
mount -t tmpfs -o size=6G tmpfs builddir
to
sudo mount -t tmpfs -o size=6G tmpfs builddir

umount builddir
to
sudo umount builddir

quantal only:
mv linaro-quantal-* ${WORKSPACE}
to
sudo mv linaro-quantal-* ${WORKSPACE}

openembedded:
useradd -d ${WORKSPACE} oe || true
to
sudo useradd -d ${WORKSPACE} oe || true

echo "oe ALL=(ALL) NOPASSWD:ALL" | tee /etc/sudoers.d/91-oe
to
echo "oe ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/91-oe

chmod 640 /etc/sudoers.d/91-oe
to
sudo chmod 640 /etc/sudoers.d/91-oe

chown -R oe:oe builddir
to
sudo chown -R oe:oe builddir
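
Where many jobs share the same commands, the edits listed above could in principle be applied mechanically. The filter below is a rough sketch under stated assumptions: the function name is hypothetical, it only knows the commands listed above (mount, umount, useradd, chown, chmod at the start of a line, and tee after a pipe), it deliberately leaves the quantal-only mv alone, and its output should be reviewed by hand before being pasted into a job, since not every chmod or chown needs root.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read a job script on stdin, write it to stdout
# with "sudo" added in front of the commands listed above.
# Lines that already start with sudo are left untouched, so the
# filter can be run more than once.
add_sudo_to_script() {
    sed -E \
        -e 's/^([[:space:]]*)(mount|umount|useradd|chown|chmod)[[:space:]]+/\1sudo \2 /' \
        -e 's/\|[[:space:]]*tee[[:space:]]+/| sudo tee /'
}
```

Usage would be along the lines of add_sudo_to_script < old-job.sh > new-job.sh, followed by a diff of the two files (the file names are placeholders).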

Stevan Radaković (stevanr) wrote :

The EC2 plugin is upgraded to 1.17 on ci.linaro.org as well. We're currently monitoring the job situation and fixing configurations on the fly if needed. Marking this one as resolved.

Changed in linaro-ci:
status: In Progress → Fix Released