Friday, October 09, 2009

Compiling, installing and test-running Scribe

I went to the Hadoop World conference last week and one thing I took away was how Facebook and other companies handle the problem of scalable logging within their infrastructure. The solution found by Facebook was to write their own logging server software called Scribe (more details on the FB blog).

Scribe is mentioned in one of the best presentations I attended at the conference -- 'Hadoop and Hive Development at Facebook' by Dhruba Borthakur and Zheng Shao. If you look at page 4 of the slides, you'll see the scale of the problem they're facing: 4 TB of compressed data (mostly logs) handled every day, and 135 TB of compressed data scanned every day. All this goes through Scribe, so that gives me a warm fuzzy feeling that it's indeed scalable and robust.

For more details on Scribe, see the wiki page of the project. My intention here is to detail the steps needed for compiling and installing it, since I found that to be a non-trivial process, to say the least. I'm glad Facebook open-sourced Scribe, but its packaging could have been a bit more straightforward. Anyway, here's what I did to get it to run. I followed roughly the same steps on Ubuntu and on Gentoo.

1) Install pre-requisite packages

On Ubuntu, I had to install the following packages via apt-get: g++, make, build-essential, flex, bison, libtool, mono-gmcs, libevent-dev.
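
A quick way to see which of the command-line build tools are still missing is a few lines of Python. This is only a sketch: the executable names are my assumption about what those packages put on the PATH, libevent-dev is a library and so isn't covered, and shutil.which needs Python 3.3 or newer.

```python
import shutil

# Assumed executable names for some of the packages listed above;
# adjust the list for your distribution.
BUILD_TOOLS = ["g++", "make", "flex", "bison"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    absent = missing_tools(BUILD_TOOLS)
    if absent:
        print("not found on PATH: " + ", ".join(absent))
    else:
        print("all listed build tools were found")
```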

2) Install the boost libraries

Very important: scribe needs boost 1.36 or newer, so make sure you don't have older boost libraries already installed. If you install libboost-* in Ubuntu, it tries to bring down 1.34 or 1.35, which will NOT work with scribe. If you have libboost-* already installed, you need to uninstall them. Now. Trust me, I spent several hours pulling my hair out over this one.
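
To see which Boost headers are already lying around, one option is to read the BOOST_VERSION macro from boost/version.hpp. Boost encodes it as major * 100000 + minor * 100 + patch, so 1.40.0 is 104000. The sketch below assumes the two header locations listed in CANDIDATE_HEADERS, which are guesses at common install prefixes rather than anything scribe itself looks at.

```python
import re

MIN_BOOST = (1, 36)  # minimum version mentioned above

# Assumed locations; a distro package and a source install usually differ.
CANDIDATE_HEADERS = [
    "/usr/include/boost/version.hpp",
    "/usr/local/include/boost/version.hpp",
]


def parse_boost_version(header_text):
    """Return (major, minor, patch) from the text of version.hpp, or None."""
    match = re.search(
        r"^[ \t]*#[ \t]*define[ \t]+BOOST_VERSION[ \t]+(\d+)",
        header_text,
        re.MULTILINE,
    )
    if match is None:
        return None
    number = int(match.group(1))
    return (number // 100000, number // 100 % 1000, number % 100)


def boost_is_new_enough(header_text, minimum=MIN_BOOST):
    """True if the header declares a Boost version >= minimum (major, minor)."""
    version = parse_boost_version(header_text)
    return version is not None and version[:2] >= minimum


if __name__ == "__main__":
    for path in CANDIDATE_HEADERS:
        try:
            with open(path) as handle:
                text = handle.read()
        except IOError:
            continue  # no Boost headers at this location
        version = parse_boost_version(text)
        if version is None:
            print("%s: no BOOST_VERSION macro found" % path)
            continue
        status = "ok" if boost_is_new_enough(text) else "older than 1.36"
        print("%s: Boost %d.%d.%d, %s" % ((path,) + version + (status,)))
```

If more than one location reports a version, the older one is the likely troublemaker.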

- download the latest boost source code from SourceForge (I got boost 1.40 from here)

- untar it, then cd into the boost directory and run:

$ ./bootstrap.sh
$ ./bjam
$ sudo ./bjam install

3) Install thrift and fb303

- get thrift source code with git, compile and install:

$ git clone git://git.thrift-rpc.org/thrift.git
$ cd thrift
$ ./bootstrap.sh
$ ./configure
$ make
$ sudo make install

- compile and install the Facebook fb303 library:

$ cd contrib/fb303
$ ./bootstrap.sh
$ make
$ sudo make install

- install the Python modules for thrift and fb303:

$ cd TOP_THRIFT_DIRECTORY
$ cd lib/py
$ sudo python setup.py install
$ cd TOP_THRIFT_DIRECTORY
$ cd contrib/fb303/py
$ sudo python setup.py install

To check that the python modules have been installed properly, run:

$ python -c 'import thrift' ; python -c 'import fb303'

4) Install Scribe

- download latest source code from SourceForge (I got it from here)

- untar, then run:

$ cd scribe
$ ./bootstrap.sh
$ make
$ sudo make install
$ sudo ldconfig   # necessary so that the boost shared libraries are loaded

- install Python modules for scribe:

$ cd lib/py
$ sudo python setup.py install

- to test that scribed (the scribe server process) was installed correctly, just run 'scribed' at a command line; you shouldn't get any errors
- to test that the scribe Python module was installed correctly, run:

$ python -c 'import scribe'
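
The import checks from steps 3 and 4 can also be rolled into one small script that reports every module that fails to load, instead of stopping at the first one. A minimal sketch; it only catches ImportError, so any other failure while loading a module will still surface as a traceback.

```python
import importlib


def missing_modules(names):
    """Return the module names that raise ImportError when imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(["thrift", "fb303", "scribe"])
    if absent:
        print("not importable: " + ", ".join(absent))
    else:
        print("thrift, fb303 and scribe all import cleanly")
```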

5) Initial Scribe configuration

- create configuration directory -- in my case I created /etc/scribe
- copy one of the example config files from TOP_SCRIBE_DIRECTORY/examples/example*conf to /etc/scribe/scribe.conf -- a good one to start with is example1.conf
- edit /etc/scribe/scribe.conf and change file_path (which points to /tmp) to a location more suitable for your system
- you may also want to change max_size, which dictates how big the local files can be before they're rotated (by default it's 1 MB, which is too small -- I set it to 100 MB)
- run scribed either with nohup or in a screen session (it doesn't seem to have a daemon mode):

$ scribed -c /etc/scribe/scribe.conf
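
If you'd rather script the two config edits than make them by hand, something like the following would do. It assumes the settings appear as plain key=value lines, which is how the example configs look. SAMPLE is a made-up fragment for illustration, not a copy of example1.conf, and /var/log/scribe is just a placeholder path. Note that it changes every file_path and max_size line it finds, so if your config has several stores they all get the same values.

```python
import re

# Illustrative fragment only; the real example configs contain more settings.
SAMPLE = """\
<store>
category=default
type=file
file_path=/tmp/scribetest
max_size=1000000
</store>
"""


def set_option(conf_text, key, value):
    """Rewrite every `key=...` line to `key=value`; return (text, count)."""
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    return pattern.subn(lambda m: m.group(1) + str(value), conf_text)


def retarget(conf_text, file_path, max_size):
    """Point all stores at file_path and set their rotation size in bytes."""
    text, paths = set_option(conf_text, "file_path", file_path)
    text, sizes = set_option(text, "max_size", max_size)
    if paths == 0 or sizes == 0:
        raise ValueError("expected file_path and max_size lines in the config")
    return text


if __name__ == "__main__":
    # 100 MB, counted the same way as the 1 MB default (1000000 bytes)
    print(retarget(SAMPLE, "/var/log/scribe", 100 * 1000 * 1000))
```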

6) Test run

To test Scribe, you can install it on a remote machine, configure scribed on that machine to use a configuration file similar to examples/example2client.conf, then change remote_host in the config file to point to the central scribe server configured in step 5.

Once scribed is configured and running on the remote machine, you can test it with a nice utility written by Silas Sewell, called scribe_pipe. For example, you can pipe an Apache log file from the remote machine to the central scribe server by running:

$ cat apache_access_log | ./scribe_pipe apache.access
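
If you don't have scribe_pipe handy, the Python modules installed in steps 3 and 4 are enough to push lines to the local scribed yourself. What follows is a rough sketch, not a description of how scribe_pipe works. It assumes scribed listens on port 1463 on localhost (check the port setting in your config), and the batch size of 100 is arbitrary. The script name in the usage line is hypothetical.

```python
import sys


def batches(lines, size=100):
    """Yield lists of at most `size` lines, unchanged, in input order."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send(category, lines, host="127.0.0.1", port=1463):
    """Send each line as one Scribe message under `category`."""
    # Imported here so batches() stays usable without the Scribe modules.
    from scribe import scribe
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TSocket, TTransport

    socket = TSocket.TSocket(host=host, port=port)
    transport = TTransport.TFramedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(
        trans=transport, strictRead=False, strictWrite=False
    )
    client = scribe.Client(iprot=protocol, oprot=protocol)
    transport.open()
    try:
        for batch in batches(lines):
            entries = [
                scribe.LogEntry(category=category, message=line) for line in batch
            ]
            if client.Log(messages=entries) != scribe.ResultCode.OK:
                raise RuntimeError("scribed asked us to try again later")
    finally:
        transport.close()


# Usage (hypothetical file name): cat apache_access_log | python send_log.py apache.access
if __name__ == "__main__" and len(sys.argv) == 2:
    send(sys.argv[1], sys.stdin)
```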

On the scribe server, you should see at this point a directory called apache.access under the main file_path directory, and files called apache.access_00000, apache.access_00001 etc (in chunks of max_size bytes).
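
To check the result on the server without eyeballing the directory, a small helper can list the rotated files for a category in order. This is a sketch that relies only on the naming shown above (category, underscore, number); anything else in the directory is ignored. The demo at the bottom runs against a throwaway directory, so point chunk_files at your real file_path instead.

```python
import os
import re
import tempfile


def chunk_files(file_path, category):
    """Return the numbered files for `category` under file_path, in order."""
    directory = os.path.join(file_path, category)
    pattern = re.compile(r"^" + re.escape(category) + r"_(\d+)$")
    found = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


if __name__ == "__main__":
    # Demo only: build a fake store directory and list its chunks.
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "apache.access"))
    for name in ["apache.access_00001", "apache.access_00000"]:
        open(os.path.join(root, "apache.access", name), "w").close()
    for name in chunk_files(root, "apache.access"):
        print(name)
```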

I'll post separately about actually using Scribe in production. I hope this post will at least get you started on using Scribe and save you some headaches during its installation process.

11 comments:

Anonymous said...

Thank you for posting these instructions here. I remember installing scribe 6 months ago, it was a pain.

Anonymous said...

Installing libboost no longer needs to be done from source, as of Ubuntu Karmic. The packages are sufficiently up to date now.

RJ said...

I looked around a lot before I came across your post. Thank you so much for making my life easier!

Jon said...

Thank you for this post. I have been struggling to install scribe on an Ubuntu 9.10 Karmic instance on Amazon EC2. Your instructions worked like a charm. I neglected to install libevent-dev in my first pass. This resulted in complaints about TNonblockingServer not found when I tried to build scribe. I installed libevent-1.4.13, followed your steps again, and achieved success.

Unknown said...

Tried these instructions on Ubuntu 8.10 and unfortunately there must be a step missing. Running the bootstrap.sh in scribe (after following all the other instructions) results in the following error (including the last couple of lines for context):

checking whether the Boost::System library is available... yes
checking whether the Boost::Filesystem library is available... yes
configure: error: Could not link against !

Unknown said...

A further note, I tried building scribe on Centos and got the same error.

Grig Gheorghiu said...

Raoul -- sorry, I'm really swamped so I can't be of much help. Hopefully other people will find your comment via google and post something...

Unknown said...

Using a later version of boost (I was using 1.39, I switched to 1.42) solved that problem. But now I get the error:

error: ‘class boost::filesystem::basic_directory_entry<boost::filesystem::basic_path<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::filesystem::path_traits> >’ has no member named ‘filename’

The impression I'm coming away with is that the boost library is very troublesome to compile with---the API seems completely unstable, changing with even minor revisions.

Unknown said...

Success, finally. Wish I could say it was something clever, but I just updated my scribe source to the very latest from git.

Grig Gheorghiu said...

Raoul -- glad to hear it finally worked. With cutting-edge technologies, 'update from the latest git revision' often does the trick (I had the same experience with tornado for example.)

Just curious -- how are you using scribe, i.e. in what topology, and for what kind of services?

test said...

I wrote an entry in my blog about compiling with boost 1.47 and a sample mbean to see scribe nodes statistics
http://developersnightmare.blogspot.com/2011/09/facebook-scribe-status-mbean.html
