Wednesday, March 08, 2006

Running buildbot on various platforms

I started to run a bunch of buildbot slaves at work, on various platforms. I encountered some issues that I want to document here for future reference. The version of buildbot I used in all scenarios was 0.7.2. The buildbot master is running on a RHEL3 server. Note that I'm not going to talk about the general buildbot setup -- if you need guidance in configuring buildbot, read this post of mine.

Before I discuss platform-specific issues, I want to mention the issue of timeouts. If you want to run a command that takes a long time on the buildbot slave, you need to increase the default timeout (which is 1200 sec. = 20 min.) for the ShellCommand definitions in the buildmaster's master.cfg file -- otherwise, the master will mark that command as failed after the timeout expires. To modify the default timeout, simply add a keyword argument such as timeout=3600 to the ShellCommand (or derived class) instantiation in master.cfg. I have for example this line in the builders section of my master.cfg file:

client_smoke_tests = s(ClientSmokeTests, command="%s/buildbot/run_smoke_tests.py" % BUILDBOT_PATH, timeout=3600)

where ClientSmokeTests is a class I derived from ShellCommand (if you need details on this, see again my previous post on buildbot.)


Buildbot on Windows

My setup: Windows 2003 server, Active Python 2.4.2

Issue with subprocess module: I couldn't use the subprocess module to run commands on the slave. I got errors such as these:

    p = Popen(arglist, stdout=PIPE, stderr=STDOUT)
File "C:\Python24\lib\subprocess.py", line 533, in __init__
(p2cread, p2cwrite,
File "C:\Python24\lib\subprocess.py", line 593, in _get_handles
p2cread = self._make_inheritable(p2cread)
File "C:\Python24\lib\subprocess.py", line 634, in _make_inheritable
DUPLICATE_SAME_ACCESS)
TypeError: an integer is required

I didn't have too much time to spend troubleshooting this, so I ended up replacing calls to subprocess to calls to popen2.popen3(). This solved the problem.

Also, I'm not currently running the buildbot process as a Windows service, although it's on my TODO list. I wrote a simple .bat file which I called startbot.bat:

buildbot start C:\qa\pylts\buildbot\QA

To start buildbot, I launched startbot.bat from the command prompt and I left it running.

Note that on Windows, the buildbot script gets installed in C:\Python24\scripts, and there is also a buildbot.bat batch file in the same scripts directory, which calls the buildbot script.

Issue with buildbot.bat: it contains a hardcoded path to Python23. I had to change that to Python24 so that it correctly finds the buildbot script in C:\Python24\scripts.

Buildbot on Solaris

My setup: one Solaris 9 SPARC server, one Solaris 10 SPARC server, both running Python 2.3.3

Issue with ZopeInterface on Solaris 10: when I tried to install ZopeInterface via 'easy_install http://www.zope.org/Products/ZopeInterface/3.1.0c1/ZopeInterface-3.1.0c1.tgz', a compilation step failed with:

/usr/include/sys/wait.h:86: error: parse error before "siginfo_t"

A google search revealed that this was a gcc-related issue specific to Solaris 10. Based on this post, I ran:

# cd /usr/local/lib/gcc-lib/sparc-sun-solaris2.10/3.3.2/install-tools
# ./mkheaders


After these steps, I was able to install ZopeInterface and the rest of the packages required by buildbot.

For reference, here is what I have on the Solaris 10 box in terms of gcc packages:

# pkginfo | grep -i gcc

system SFWgcc2 gcc-2 - GNU Compiler Collection
system SFWgcc2l gcc-2 - GNU Compiler Collection Runtime Libraries
system SFWgcc34 gcc-3.4.2 - GNU Compiler Collection
system SFWgcc34l gcc-3.4.2 - GNU Compiler Collection Runtime Libraries
application SMCgcc gcc
system SUNWgcc gcc - The GNU C compiler
system SUNWgccruntime GCC Runtime libraries

Here is what uname -a returns:

# uname -a
SunOS sunv2403 5.10 Generic sun4u sparc SUNW,Sun-Fire-V240

Issue with exit codes from child processes not intercepted correctly: on both Solaris 9 and Solaris 10, buildbot didn't seem to intercept correctly the exit code from the scripts which were running on the build slaves. I was able to check that I had the correct exit codes by running the scripts at the command line, but within buildbot the scripts just hung as if they hadn't finish.

Some searches on the buildbot-devel mailing list later, I found the solution via this post: I replaced usepty = 1 with usepty = 0 in buildbot.tac on the Solaris slaves, then I restarted the buildbot process on the slaves, and everything was fine.

Buildbot on AIX

My setup: AIX 5.2 on an IBM P510 server, Python 2.4.1

No problems here. Everything went smoothly.

5 comments:

Matthew said...

I'm setting up a buildbot on Debian Etch and I'm going to post my problems here in the hopes that someone else will see them.

Once I got the buildbot running, the buildslave would carry out the first build and then get lost in a nasty connect/disconnect cycle. After fooling around with the keepalive for a while, I changed usePty to 0 and that seems to have worked. I now have a happy green build.

Moral of the story: don't be afraid to fool around with the usePty setting.

Grig Gheorghiu said...

Matthew -- thanks for the comment, and for the solution. In my case, I had a different problem that was solved by the same usePty trick on Solaris. Good to see it worked for you.

Dan said...

I've found that SLES and Suse should/need to have /etc/profile sourced into the running environment for everything to run smoothly. You'll get things to run without this, it will just be lots of annoying work.

Grig Gheorghiu said...

Dan -- thanks a lot for the hint. I'm sure it will be handy for somebody some day :-)

Grig

John Pye said...

Hi there

You might want to see Installing a Buildbot service on Windows for information about setting up a Windows service for Buildbot.

Cheers
JP