Thursday, June 6, 2013

Problems running lots of X11 sessions concurrently when testing under Hudson/Jenkins

So we are running a quite a bit of automated UI testing and we have found over a large number of concurrent test nodes that even when the Xvnc plugin has correctly started that the environment is not always ready or reliable. This blog post looks at two changes we made that have improved our test stability in recent times.

The first and most important thing to settle when using the Xvnc plugin is to make sure that the user that is running the tests has a normal window manager. By default the default VNC server will start using "twm" as defined in the .vnc/xstartup file. This has significant behavioural differences from the more standard Gnome / KDE and indeed the window managers on Windows and a Mac. Rather than deal with this we have standardised on running with Gnome.

The problem is though is that in our experience if you have multiple gnome sessions starting at the same time that sometimes they will fail with odd error messages. My esteemed colleague Mark Warner came up with the following xstartup script which just puts a simple lock around the startup code.

# It uses a lockfile in the user's .vnc directory to ensure that
# only one instance of gnome will start up for the given user
# at once. This should work around problems associated with
# multiple gnome instances starting and failing to lock certain
# files. It will start gnome after 5 mins of failing to lock as
# a fail safe so a gnome startup will always be attempted.

SESSION_VNC_NAME=VNC-`hostname -f`

function clean_up {
  # don't delete the lock file if we didn't create it
  if [ $LOCKED -eq 1 ]; then
    echo "$(date) All finished, deleting lock file."
    rm -f $LOCK_FILE

# make sure that we delete the lock file if we get killed

# retry for 5 mins then assume something died and delete
# the existing lockfile taking control of it for ourselves.
echo "$(date) Attempting to obtain startup lock."
if lockfile -l 300 $LOCK_FILE; then
  echo "$(date) Startup lock aquired."
  echo "$(date) Lock failed... starting anyway."

# give X a little breather before we start gnome
sleep 10 

# Start Gnome
gnome-session &

# just let gnome settle before we relinquish the lock
sleep 20

Now this resolved spurious test failures that were due to a missing window manager, but as recently the number of nodes increased we found that we were still seeing troubling intermittent failures that took the form of failures to connect to the window server for no obvious reason.

This would happen even if previously in the job we had displayed a window successfully.

[exec] java.lang.InternalError: Can't connect to X11 window server using ':58' as the value of the DISPLAY variable.
     [exec]  at sun.awt.X11GraphicsEnvironment.initDisplay(Native Method)
     [exec]  at sun.awt.X11GraphicsEnvironment.access$200(
     [exec]  at sun.awt.X11GraphicsEnvironment$
     [exec]  at Method)
     [exec]  at sun.awt.X11GraphicsEnvironment.(
     [exec]  at java.lang.Class.forName0(Native Method)
     [exec]  at java.lang.Class.forName(
     [exec]  at java.awt.GraphicsEnvironment.createGE(
     [exec]  at java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(
     [exec]  at org.netbeans.core.startup.Main.start(
     [exec]  at
     [exec]  at
     [exec] Result: 2

This would also happen right at the start when gnome was being set up, as shown here by looking at the content of the logs under .vnc .

_XSERVTransmkdir: Owner of /tmp/.X11-unix should be set to root

Xvnc Free Edition 4.1.2
Copyright (C) 2002-2005 RealVNC Ltd.
See for information on VNC.
Underlying X server release 70101000, The X.Org Foundation

Thu May 30 03:51:34 2013
 vncext:      VNC extension running!
 vncext:      Listening for VNC connections on port 5974
 vncext:      Listening for HTTP connections on port 5874
 vncext:      created VNC server for screen 0
Thu May 30 03:51:37 BST 2013 Attempting to obtain startup lock.
Thu May 30 03:53:45 BST 2013 Startup lock aquired.
Start gnome for OL4 or OL5
AUDIT: Thu May 30 03:53:56 2013: 10280 Xvnc: client 1 rejected from local host
Xlib: connection to ":74.0" refused by server
Xlib: No protocol specified

(gnome-session:10543): Gtk-WARNING **: cannot open display:  
Thu May 30 03:54:15 BST 2013 All finished, deleting lock file.
AUDIT: Thu May 30 04:14:03 2013: 10280 Xvnc: client 1 rejected from local host
AUDIT: Thu May 30 04:14:22 2013: 10280 Xvnc: client 1 rejected from local host
FreeFontPath: FPE "unix/:7100" refcount is 2, should be 1; fixing.

After a little bit of playing around suspicion fell to the .Xauthority file we we knew could cause error messages like this and could become corrupted. Removing /zeroing the file didn't help so like the Gnome problems some kind of concurrency / contention issues was suspected.

After a bit of playing about we decided that for our test user it was going to be best to just have a copy of this file per host. This works for us as we only have the single executor per machine in order to get consistent testing results. In our case we just added the following line in our .bashrc for the NIS user we run our tests as and restarted all our of slave nodes.

# make sure we don't share the Xauthority file between hosts

export XAUTHORITY=~/.Xauthority.`hostname -f`

Since we have made this change we have seen no failures of the above type after nearly week, that of course doesn't rule out that we have just reduced the occurrence of the issue to a reasonably small level; but we are happy that we can run tests again. Of course the moment I post this blog I am going to loose that lovely run of blue balls. :-)

No comments: