Posts

Showing posts from 2011

Load balancing and SSL in EC2

Here is another post I wrote for this year's Sysadvent blog. It briefly mentions some ways you can do load balancing in EC2, and focuses on how to upload SSL certificates to an Elastic Load Balancer using command-line tools. Any comments appreciated!

Analyzing logs with Pig and Elastic MapReduce

This is a blog post I wrote for this year's Sysadvent series. If you're at all interested in system administration/devops topics and you're not yet familiar with the Sysadvent blog, you should be. It is maintained by the indefatigable Jordan Sissel, and it contains posts contributed by various authors, 25 posts per year, from Dec 1st through Dec 25th. The posts cover a variety of topics, and to give you a taste, here are the articles so far this year:

Day 1: "Don't bash your process outputs" by Phil Hollenback
Day 2: "Strategies for Java deployment" by Kris Buytaert
Day 3: "Share skills and permissions with code" by Jordan Sissel
Day 4: "A guide to packaging systems" by Jordan Sissel
Day 5: "Tracking requests with Request Tracker" by Christopher Webber
Day 6: "Always be hacking" by John Vincent
Day 7: "Change and proximity of communication" by Aaron Nichols
Day 8: "Running services with systemd" b…

Crowd Mood - an indicator of health for products/projects

I thought I just coined a new term -- Crowd Mood -- but a quick Google search revealed a 2009 paper on "Crowd Behavior at Mass Gatherings: A Literature Review" (PDF) which says:

In the mass-gathering literature, the use of terms “crowd behavior”, “crowd type”, “crowd management”, and “crowd mood” are used in variable contexts. More practically, the term “crowd mood” has become an accepted measure of probable crowd behavior outcomes. This is particularly true in the context of crowds during protests/riots, where attempts have been made to identify factors that lead to a change of mood that may underpin more violent behavior.

Instead of protests and riots, the crowd behavior I'm referring to is the reaction of users of software products or projects. I think the overall Crowd Mood of these users is a good indicator of the health of those products/projects. I may state the obvious here, and maybe it's been done already, but I'm not aware of large-scale studies that tr…

Seven years of blogging in less than 500 words

In a couple of days, this blog will be 7 years old. Hard to believe so much time has passed since my first Hello World post.
I keep reading that blogging is on the wane, and it's true, mainly because of the popularity of Twitter. But I strongly believe that blogging is still important, and that more people should do it. For me, it's a way to give back to the community. I can't even remember how many times I found solutions to my technical problems by reading a blog post. I personally try to post something on my blog every single time I solve an issue that I've struggled with. If you post documentation to a company wiki (assuming it's not confidential), I urge you to also blog publicly about it – think of it as public documentation that can help both you and others.
Blogging is also a great way to further your career. Back in September 2008 I blogged about my experiences with EC2. I had launched an m1.small instance and I had started to play with it. L…

Troubleshooting memory allocation errors in Elastic MapReduce

Yesterday we ran into an issue with some Hive scripts running within an Amazon Elastic MapReduce cluster. Here's the error we got:


Caused by: java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:877)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:474)
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:289)
    ... 11 more
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:176)
    at org.apache.hadoop.util.Shell.run(Shell.java:161)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    …
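The `error=12` in the second "Caused by" is an operating system error number. As a quick illustration (a minimal sketch, not part of the original troubleshooting; the `decode_errno` helper is a hypothetical name), Python's standard `errno` module can translate it into its symbolic name:

```python
import errno
import os
import re


def decode_errno(message):
    """Return (number, symbolic name, OS description) for an 'error=N' token, or None."""
    match = re.search(r"error=(\d+)", message)
    if match is None:
        return None
    number = int(match.group(1))
    # errno.errorcode maps numbers to names for the platform Python runs on.
    name = errno.errorcode.get(number, "UNKNOWN")
    return number, name, os.strerror(number)


# The relevant line from the trace above.
line = ('java.io.IOException: Cannot run program "bash": '
        'java.io.IOException: error=12, Cannot allocate memory')
decoded = decode_errno(line)
print(decoded)
```

On Linux, error 12 is ENOMEM: the operating system could not provide the memory needed to launch the `bash` child process.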

Experiences with Amazon Elastic MapReduce

We started to use AWS Elastic MapReduce (EMR) in earnest a short time ago, with the help of Bradford Stephens from Drawn to Scale. We needed somebody to jumpstart our data analytics processes and workflow, and Bradford's help was invaluable. At some point we'll probably build our own Hadoop cluster either in EC2 or in-house, but for now EMR is doing the job just fine.

We started with an EMR cluster containing the master + 5 slave nodes, all m1.xlarge. We still have that cluster up and running, but in the meantime I've also experimented with launching clusters, running our data analytics processes on them, then shutting them down -- which is the 'elastic' type of workflow that takes full advantage of the pay-per-hour model of EMR.

Before I go into the details of launching and managing EMR clusters, here's the general workflow that we follow for our data analytics processes on a nightly basis:

We gather data from various sources such as production databases, ad i…

More fun with the LSI MegaRAID controllers

As I mentioned in a previous post, we've had some issues with the LSI MegaRAID controllers on our Dell C2100 database servers. Previously we noticed periodic slowdowns of the databases related to decreased I/O throughput. It turned out it was the LSI RAID battery going through its relearning cycle.

Last night we got paged again by increased load on one of the Dell C2100s. The load average went up to 25, when typically it's between 1 and 2. It turns out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be OK even with a bad drive, but the drive didn't go completely offline, so the controller was busy servicing it and failing. This had the effect of decreasing the I/O throughput on the server, and making our database slow.

For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).

Check…
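The excerpt cuts off before the actual commands, so what follows is only a rough sketch of one way to spot a drive that is failing without going offline. It assumes the per-drive output of MegaCli's physical drive listing contains lines such as "Slot Number", "Media Error Count", "Other Error Count", "Predictive Failure Count" and "Firmware state"; the SAMPLE text is made up for illustration and is not output from the servers described here, and the function names are hypothetical.

```python
# Illustrative sample only; field names are an assumption about MegaCli's output.
SAMPLE = """\
Slot Number: 0
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online

Slot Number: 1
Media Error Count: 37
Other Error Count: 4
Predictive Failure Count: 2
Firmware state: Online
"""

COUNTERS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")


def parse_drives(text):
    """Split 'key: value' lines into one dict per drive, keyed off 'Slot Number'."""
    drives = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank or non key/value line
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            current = {"Slot Number": int(value)}
            drives.append(current)
        elif current is not None and key in COUNTERS:
            current[key] = int(value)
        elif current is not None and key == "Firmware state":
            current[key] = value
    return drives


def suspect_drives(drives):
    """Return drives with any non-zero error counter, even if they report Online."""
    return [d for d in drives if any(d.get(c, 0) > 0 for c in COUNTERS)]


suspects = suspect_drives(parse_drives(SAMPLE))
for drive in suspects:
    print(drive)
```

The point of looking at the counters rather than only the firmware state is the situation described above: a drive can still report itself as online while the controller is busy retrying it.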

Good email sending practices

I'm not going to call these 'best practices' but I hope they'll be useful if you're looking for ways to improve your email sending capabilities so as to maximize the odds that a message intended for a given recipient actually reaches that recipient's inbox.

Make sure your mail servers are not configured as open relays
  - This should go without saying, but it should still be your #1 concern
  - Use ACLs and only allow relaying from the IPs of your application servers
  - Check your servers by using a mail relay testing service
Make sure you have reverse DNS entries for the IP addresses you're sending mail from
  - This is another one of the oldies but goldies that you should double-check
Use DKIM
  - From the Wikipedia entry: DomainKeys Identified Mail (DKIM) is a method for associating a domain name to an email message, thereby allowing a person, role, or organization to claim some responsibility for the message. The association is set up by means of a digital signature which can be…
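To double-check the reverse DNS point, here is a minimal sketch of a forward-confirmed reverse DNS check: look up the PTR name for a sending IP, then confirm that name resolves back to the same IP. The function name, the fake resolvers, and the addresses (taken from the documentation ranges 192.0.2.0/24 and example.com) are all assumptions for illustration; the sketch handles IPv4 only, since `socket.gethostbyname_ex` does not return IPv6 addresses.

```python
import socket


def forward_confirmed_rdns(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the PTR hostname if it resolves back to ip, otherwise None."""
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
        addresses = forward(hostname)[2]   # (hostname, aliases, addresses)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname if ip in addresses else None


# Fake resolvers so the example runs without touching real DNS.
def fake_reverse(ip):
    table = {"192.0.2.10": "mail.example.com"}
    if ip not in table:
        raise socket.herror(1, "Unknown host")
    return table[ip], [], [ip]


def fake_forward(hostname):
    return hostname, [], ["192.0.2.10"]


good = forward_confirmed_rdns("192.0.2.10", fake_reverse, fake_forward)
bad = forward_confirmed_rdns("192.0.2.99", fake_reverse, fake_forward)
print(good, bad)
```

Called with just an IP address, the function uses the real resolver, so you can run it against each address your application servers send mail from.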

Slow database? Check RAID battery!

Executive Summary: 

If your Dell database servers get slow suddenly, and I/O seems sluggish, do yourself a favor and check if the RAID battery is currently going through its 'relearning' cycle. If this is so, then the Write-Back policy is disabled and Write-Through is enabled -- as a result writes become very slow compared to normal operation.

Details:

This turns out to be a fairly well known problem with RAID controllers in Dell servers, specifically LSI controllers. The default mode of operation for the RAID battery is to periodically go through a so-called 'relearn cycle', where it discharges, then charges and recalibrates itself by finding the current charge. In this timeframe, as I mentioned, Write-Back is disabled and Write-Through is enabled.
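As a rough sketch of how this condition could be detected automatically, the snippet below looks at two pieces of text: the battery status and the virtual drive info as reported by the MegaCli utilities. It assumes those reports contain the fields "Learn Cycle Active", "Default Cache Policy" and "Current Cache Policy"; the sample strings are invented for illustration, and the helper names are hypothetical.

```python
def parse_fields(text):
    """Turn 'key : value' lines into a dict, splitting on the first colon."""
    fields = {}
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def degraded_by_relearn(bbu_text, ld_text):
    """True if a relearn cycle is active and the cache dropped from WriteBack to WriteThrough."""
    bbu = parse_fields(bbu_text)
    ld = parse_fields(ld_text)
    relearning = bbu.get("Learn Cycle Active", "").lower() == "yes"
    default_wb = ld.get("Default Cache Policy", "").startswith("WriteBack")
    current_wt = ld.get("Current Cache Policy", "").startswith("WriteThrough")
    return relearning and default_wb and current_wt


# Illustrative samples only; the field names are an assumption about MegaCli's output.
BBU_RELEARNING = "Learn Cycle Requested : No\nLearn Cycle Active : Yes\n"
BBU_IDLE = "Learn Cycle Requested : No\nLearn Cycle Active : No\n"
LD_DEGRADED = ("Default Cache Policy: WriteBack, ReadAheadNone, Direct\n"
               "Current Cache Policy: WriteThrough, ReadAheadNone, Direct\n")
LD_NORMAL = ("Default Cache Policy: WriteBack, ReadAheadNone, Direct\n"
             "Current Cache Policy: WriteBack, ReadAheadNone, Direct\n")

print(degraded_by_relearn(BBU_RELEARNING, LD_DEGRADED))
print(degraded_by_relearn(BBU_IDLE, LD_NORMAL))
```

Comparing the default and current cache policies is what ties the slowdown to the battery: the configured policy is still Write-Back, but the controller is temporarily running in Write-Through.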

For our MySQL servers, we have innodb_flush_log_at_trx_commit set to 1, which means that every commit is flushed to disk. As a consequence, the Write-Through mode will severely impact the performance of the writes to the…