Wednesday, March 08, 2006

Running buildbot on various platforms

I started to run a bunch of buildbot slaves at work, on various platforms. I encountered some issues that I want to document here for future reference. The version of buildbot I used in all scenarios was 0.7.2. The buildbot master is running on a RHEL3 server. Note that I'm not going to talk about the general buildbot setup -- if you need guidance in configuring buildbot, read this post of mine.

Before I discuss platform-specific issues, I want to mention the issue of timeouts. If you want to run a command that takes a long time on the buildbot slave, you need to increase the default timeout (which is 1200 sec. = 20 min.) for the ShellCommand definitions in the buildmaster's master.cfg file -- otherwise, the master will mark that command as failed after the timeout expires. As far as I can tell from the buildbot documentation, the timeout measures how long the command goes without producing any output, so it mostly affects long-running commands that stay silent. To modify the default timeout, simply add a keyword argument such as timeout=3600 to the ShellCommand (or derived class) instantiation in master.cfg. For example, I have this line in the builders section of my master.cfg file:

client_smoke_tests = s(ClientSmokeTests, command="%s/buildbot/run_smoke_tests.py" % BUILDBOT_PATH, timeout=3600)

where ClientSmokeTests is a class I derived from ShellCommand (if you need details on this, see again my previous post on buildbot.)


Buildbot on Windows

My setup: Windows 2003 server, Active Python 2.4.2

Issue with subprocess module: I couldn't use the subprocess module to run commands on the slave. I got errors such as these:

    p = Popen(arglist, stdout=PIPE, stderr=STDOUT)
  File "C:\Python24\lib\subprocess.py", line 533, in __init__
    (p2cread, p2cwrite,
  File "C:\Python24\lib\subprocess.py", line 593, in _get_handles
    p2cread = self._make_inheritable(p2cread)
  File "C:\Python24\lib\subprocess.py", line 634, in _make_inheritable
    DUPLICATE_SAME_ACCESS)
TypeError: an integer is required

I didn't have much time to spend troubleshooting this, so I ended up replacing the calls to subprocess with calls to popen2.popen3(). This solved the problem.
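
For reference, here is a possible alternative to switching modules. Tracebacks like the one above, which fail in _make_inheritable while setting up the child's stdin, are often attributed to an invalid inherited stdin handle when Python runs without a console. Assuming that is the cause here (an assumption, not something verified on the setup above), passing all three streams explicitly may avoid the error. The helper name run_command is hypothetical; this is a minimal sketch, not a tested fix for buildbot 0.7.2.

```python
import subprocess
import sys


def run_command(arglist):
    """Run arglist and return (exit_code, combined_output_bytes).

    Assumption: the TypeError comes from an invalid inherited stdin
    handle, so stdin is given its own pipe instead of being inherited.
    """
    p = subprocess.Popen(
        arglist,
        stdin=subprocess.PIPE,     # do not inherit the parent's stdin
        stdout=subprocess.PIPE,    # capture standard output
        stderr=subprocess.STDOUT,  # merge standard error into stdout
    )
    # communicate() closes the child's stdin and reads all output,
    # which avoids deadlocks on full pipe buffers.
    output, _ = p.communicate()
    return p.returncode, output


if __name__ == "__main__":
    code, out = run_command([sys.executable, "-c", "print('hello')"])
    print(code, out)
```

The only difference from the failing call is the explicit stdin argument; the rest mirrors the original Popen invocation.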

Also, I'm not currently running the buildbot process as a Windows service, although it's on my TODO list. I wrote a simple .bat file which I called startbot.bat:

buildbot start C:\qa\pylts\buildbot\QA

To start buildbot, I launched startbot.bat from the command prompt and I left it running.

Note that on Windows, the buildbot script gets installed in C:\Python24\scripts, and there is also a buildbot.bat batch file in the same scripts directory, which calls the buildbot script.

Issue with buildbot.bat: it contains a hardcoded path to Python23. I had to change that to Python24 so that it correctly finds the buildbot script in C:\Python24\scripts.
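
If you have several Windows slaves to fix, the edit can be scripted. The sketch below only does a plain text substitution; the function name retarget_python_dir and the sample batch line are hypothetical, since the exact contents of buildbot.bat are not reproduced here.

```python
def retarget_python_dir(bat_text, old="Python23", new="Python24"):
    """Return bat_text with every occurrence of the old Python
    directory name replaced by the new one."""
    return bat_text.replace(old, new)


if __name__ == "__main__":
    # Hypothetical sample line, for illustration only.
    sample = r'@python C:\Python23\scripts\buildbot %*'
    print(retarget_python_dir(sample))
```

To apply it for real, read C:\Python24\scripts\buildbot.bat, pass its text through the function, and write the result back (after keeping a copy of the original file).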

Buildbot on Solaris

My setup: one Solaris 9 SPARC server, one Solaris 10 SPARC server, both running Python 2.3.3

Issue with ZopeInterface on Solaris 10: when I tried to install ZopeInterface via 'easy_install http://www.zope.org/Products/ZopeInterface/3.1.0c1/ZopeInterface-3.1.0c1.tgz', a compilation step failed with:

/usr/include/sys/wait.h:86: error: parse error before "siginfo_t"

A Google search revealed that this was a gcc-related issue specific to Solaris 10. Based on this post, I ran:

# cd /usr/local/lib/gcc-lib/sparc-sun-solaris2.10/3.3.2/install-tools
# ./mkheaders


After these steps, I was able to install ZopeInterface and the rest of the packages required by buildbot.

For reference, here is what I have on the Solaris 10 box in terms of gcc packages:

# pkginfo | grep -i gcc

system SFWgcc2 gcc-2 - GNU Compiler Collection
system SFWgcc2l gcc-2 - GNU Compiler Collection Runtime Libraries
system SFWgcc34 gcc-3.4.2 - GNU Compiler Collection
system SFWgcc34l gcc-3.4.2 - GNU Compiler Collection Runtime Libraries
application SMCgcc gcc
system SUNWgcc gcc - The GNU C compiler
system SUNWgccruntime GCC Runtime libraries

Here is what uname -a returns:

# uname -a
SunOS sunv2403 5.10 Generic sun4u sparc SUNW,Sun-Fire-V240

Issue with exit codes from child processes not intercepted correctly: on both Solaris 9 and Solaris 10, buildbot didn't seem to correctly intercept the exit code from the scripts that were running on the build slaves. I was able to check that I had the correct exit codes by running the scripts at the command line, but within buildbot the scripts just hung as if they hadn't finished.

Some searches on the buildbot-devel mailing list later, I found the solution via this post: I replaced usepty = 1 with usepty = 0 in buildbot.tac on the Solaris slaves, then I restarted the buildbot process on the slaves, and everything was fine.
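
The command-line check of exit codes mentioned above can also be done from Python, which is handy when the same check has to be repeated on several slaves. This is a minimal sketch; the function name exit_code_of is hypothetical, and it says nothing about how buildbot itself collects exit codes.

```python
import subprocess
import sys


def exit_code_of(argv):
    """Run argv to completion and return its exit code.

    subprocess.call lets the child inherit the caller's stdio and
    does not allocate a pseudo-terminal of its own.
    """
    return subprocess.call(argv)


if __name__ == "__main__":
    # Stand-in for a real test script: a child that exits with status 3.
    child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print("exit code: %d" % exit_code_of(child))
```

If a script reports the expected code here but hangs under buildbot, the usepty setting is a reasonable thing to look at next.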

Buildbot on AIX

My setup: AIX 5.2 on an IBM P510 server, Python 2.4.1

No problems here. Everything went smoothly.

5 comments:

desmaj said...

I'm setting up a buildbot on Debian Etch and I'm going to post my problems here in the hopes that someone else will see them.

Once I got the buildbot running, the buildslave would carry out the first build and then get lost in a nasty connect/disconnect cycle. After fooling around with the keepalive for a while, I changed usePty to 0 and that seems to have worked. I now have a happy green build.

Moral of the story: don't be afraid to fool around with the usePty setting.

Grig Gheorghiu said...

Matthew -- thanks for the comment, and for the solution. In my case, I had a different problem that was solved by the same usePty trick on Solaris. Good to see it worked for you.

BikeMan said...

I've found that SLES and Suse should/need to have /etc/profile sourced into the running environment for everything to run smoothly. You'll get things to run without this, it will just be lots of annoying work.

Grig Gheorghiu said...

Dan -- thanks a lot for the hint. I'm sure it will be handy for somebody some day :-)

Grig

John Pye said...

Hi there

You might want to see Installing a Buildbot service on Windows for information about setting up a Windows service for Buildbot.

Cheers
JP
