Problem with X11 in silo builders

Bug #1532672 reported by Michi Henning
This bug affects 1 person
Affects: launchpad-buildd
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

When building with citrain, we are getting persistent test failures on s390x and powerpc. The same test passes on all other architectures. The failing test uses xvfb to write some image data to a canvas. If xvfb fails, we capture the error output from xvfb, which is:

_XSERVTransmkdir: Owner of /tmp/.X11-unix should be set to root

The directory is not being created with the correct permissions, so xvfb doesn't do its thing, and the test fails.
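A minimal sketch for inspecting the directory on a builder. It assumes the conventional expectation that /tmp/.X11-unix is a directory owned by root with mode 1777; the function name `check_x11_socket_dir` is hypothetical and not part of any existing tool.

```python
import os
import stat


def check_x11_socket_dir(path="/tmp/.X11-unix"):
    """Return a list of human-readable problems with the X11 socket directory."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]

    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path} is not a directory")
    # The message in the log complains about ownership, so check the uid.
    if st.st_uid != 0:
        problems.append(f"{path} is owned by uid {st.st_uid}, not root")
    # Assumption: the expected mode is 1777 (world-writable with sticky bit).
    mode = stat.S_IMODE(st.st_mode)
    if mode != 0o1777:
        problems.append(f"{path} has mode {mode:04o}, expected 1777")
    return problems


if __name__ == "__main__":
    for problem in check_x11_socket_dir():
        print(problem)
```

An empty result means the directory matches these assumptions, in which case the cause of the failure is probably elsewhere.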

It appears that something is misconfigured in the builders? We've seen this problem once or twice before on s-jenkins, where it would fail only when we happened to get a particular builder; with a different builder, the same test for the same architecture succeeded.

See here for one example of the failure. We simply capture the error output from xvfb and dump it at the end of the test. (The failing test is number 9.)

https://launchpadlibrarian.net/233564855/buildlog_ubuntu-xenial-powerpc.thumbnailer_2.3+16.04.20160109-0ubuntu1_BUILDING.txt.gz

William Grant (wgrant)
affects: launchpad → launchpad-buildd
William Grant (wgrant) wrote :

The "Owner of /tmp/.X11-unix should be set to root" message is just a warning, not a fatal error. Errors earlier in the log suggest that it's reading pixels fine, but the pixels are unexpectedly black. Are you sure it's xvfb-run that's failing, and not simply some buggy app or test code?

Changed in launchpad-buildd:
status: New → Incomplete
Michi Henning (michihenning) wrote :

Yes, I'm sure that it's xvfb. This test has not failed for many months on several architectures. Whenever we have seen it fail, it has failed with this message (as it did previously in s-jenkins), and it has never produced this message when it passes (as far as I know).

We are reading black pixels because nothing was drawn, and nothing was drawn because xvfb couldn't get the buffer off the ground. I think the root cause is that, if /tmp/.X11-unix has the wrong ownership, X refuses to create a socket there that is used by xvfb. (I'm not an X11 person, so the analysis may not be totally correct.)

William Grant (wgrant) wrote : Re: [Bug 1532672] Re: Problem with X11 in silo builders

On 11/01/16 17:50, Michi Henning wrote:
> Yes, I'm sure that it's xvfb. This test has not failed for many months
> on several architectures. Whenever we have seen it fail, it has failed
> with this message (as it did previously in s-jenkins), and it has never
> produced this message when it passes (as far as I know).
>
> We are reading black pixels because nothing was drawn, and nothing was
> drawn because xvfb couldn't get the buffer off the ground. I think the
> root cause is that, if /tmp/.X11-unix has the wrong ownership, X refuses
> to create a socket there that is used by xvfb. (I'm not an X11 person,
> so the analysis may not be totally correct.)

It's only going to display the message when it fails, because it doesn't
display stderr when it doesn't fail. It really doesn't look fatal.

How was the problem fixed on s-jenkins?

Michi Henning (michihenning) wrote :

On s-jenkins, there were some stuck processes that prevented access to a socket. But, come to think of it, that displayed the message about not being owned by root, *plus* something else (and the something else pointed at the real problem).

I'll print out the stderr log unconditionally to verify. Thanks for keeping me honest!
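A minimal sketch of what printing stderr unconditionally could look like, so that the output of a passing run can be compared with that of a failing one. The helper name `run_and_report` is hypothetical, and how the real test harness launches xvfb is not shown in this report.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, echo its stderr regardless of outcome, return (exit code, stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Echo stderr even on success; a warning that also appears in passing
    # runs cannot be what makes the test fail.
    if result.stderr:
        sys.stderr.write(result.stderr)
    return result.returncode, result.stderr


# Hypothetical usage; "./some_test" stands in for the real test binary:
#   code, err = run_and_report(["xvfb-run", "./some_test"])
```

If the ownership message shows up in passing runs as well, it can be ruled out as the cause.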

But this now raises an issue I've long had a problem with: how on earth are we going to debug a failure that is present only on powerpc and s390x? None of us has hardware to do any debugging. Is there porter box access for s390x? (I would expect not.)

For powerpc, we can use a porter box (but only for Xenial). Any failures on Vivid we can't deal with via the porter boxes because they are not available with vivid+overlay.

Almost certainly, the problem is caused by broken codecs for these architectures. But the problem is difficult to diagnose remotely. And, even if we can in this case, we still have the general issue that we are required to work on architectures we don't have access to, and that the builds for these architectures only happen in a silo, but not on s-jenkins. That means we typically find out only at five minutes to midnight.

Colin Watson (cjwatson) wrote :

You can ask Foundations to get access to something roughly equivalent to a porter box for s390x, and they can probably also help you get access to more flexible porting arrangements for powerpc.

Michi Henning (michihenning) wrote :

Thanks for that, Colin, will do! I'll see whether I can track down what's going on here on the arm64 porter box first.

Michi Henning (michihenning) wrote :

Sorry, "powerpc", not "arm64".

Launchpad Janitor (janitor) wrote :

[Expired for launchpad-buildd because there has been no activity for 60 days.]

Changed in launchpad-buildd:
status: Incomplete → Expired