Friday, April 14, 2006

Should acceptance tests be included in the continuous build process?

This is the title of a post by Dave Nicolette, a post prompted by some back-and-forth comments Dave and I left to each other on my blog regarding the frequency of running acceptance tests. Dave argues that acceptance tests do not really belong in a continuous integration build, because they do not have the same scope as unit tests, and they do not give the developers the feedback they need, regardless of how fast they actually run.

Here are some of my thoughts on this subject. First of all -- thank you, Dave, for your comments and blog post, which prompted me to better clarify to myself some of these things. I will argue in what follows that the speed of tests is of the essence, and is a big factor in determining which tests are run when.

Following Brian Marick's terminology, let's first distinguish between customer-facing (or business-facing) tests and code-facing (or technology-facing) tests. I think it's an important distinction.

Customer-facing tests are high-level tests expressed in the business domain language, and they are created (ideally) through the collaboration of customers, business analysts, testers and developers. When a customer-facing test passes, it gives the customer a warm fuzzy feeling that the application does what it's supposed to do. Customer-facing tests are usually called acceptance tests. They can operate at the business logic level (in which case, at least in agile environments, they're usually created and executed with tools such as Fit or FitNesse), or at the GUI level (in which case a variety of tools can be used; for Web applications, a combination of twill and Selenium will usually do the trick).

Code-facing tests are lower-level tests expressed in the language of the programmers. They deal with the nitty-gritty of the application, and when they pass, they give developers a warm fuzzy feeling that their code does what they intended it to do, and that they didn't break any existing code when they refactored it. Unit tests are a prime example of code-facing tests.

That said, there are some types of tests that can be seen as both customer-facing and code-facing. I'll talk more about them later on in this post.

Let me now discuss the various types of testing that Dave mentions in his post.

Unit tests

Unit tests are clearly code-facing tests. Michael Feathers, in his now-classic book "Working Effectively With Legacy Code", says that good unit tests have two qualities:
  • they run fast
  • they help us localize problems
Note that the very first quality is related to the speed of execution. If unit tests are not fast, that usually means that they depend on external interfaces such as databases and various network-related services, and thus they are not truly unit tests. Here is Michael Feathers again:

"Unit tests run fast. If they don't run fast, they aren't unit tests.

Other kinds of tests often masquerade as unit tests. A test is not a unit test if:

  1. It talks to a database.

  2. It communicates across a network.

  3. It touches the file system.

  4. You have to do special things to your environment (such as editing configuration files) to run it.

Tests that do these things aren't bad. Often they are worth writing, and you generally will write them in unit test harnesses. However, it is important to be able to separate them from true unit tests so that you can keep a set of tests that you can run fast whenever you make changes."

For more great advice on writing good unit tests, see Roy Osherove's blog post on "Achieving and Recognizing Testable Software Designs". Fast run time is again one of Roy's criteria for good unit tests.

How fast should a unit test be? Michael Feathers says that if a unit test takes more than 1/10 of a second to run, it's too slow. I'd say a good number to shoot for in terms of the time it should take for all your unit tests to run is 2 minutes, give or take.
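
To make the distinction concrete, here is a minimal sketch (the parse_config function is a hypothetical example, not something from Michael's book): the first test is a true unit test by Feathers' criteria, while the second one masquerades as a unit test because it touches the file system, and so belongs in a separate, slower suite.

    import os
    import tempfile
    import unittest

    def parse_config(text):
        """Hypothetical code under test: parse 'key=value' lines into a dict."""
        return dict(line.split('=', 1) for line in text.splitlines() if '=' in line)

    class ParseConfigUnitTest(unittest.TestCase):
        """A true unit test: all in memory, no database, network or file system."""
        def test_parses_key_value_pairs(self):
            self.assertEqual(parse_config("host=localhost\nport=8080"),
                             {'host': 'localhost', 'port': '8080'})

    class ParseConfigFileTest(unittest.TestCase):
        """Not a unit test by Feathers' criteria -- it touches the file system.
        Worth writing, but worth keeping in a separate, slower suite."""
        def test_parses_config_read_from_disk(self):
            handle, path = tempfile.mkstemp()
            try:
                os.close(handle)
                with open(path, 'w') as f:
                    f.write("host=localhost\n")
                with open(path) as f:
                    self.assertEqual(parse_config(f.read()), {'host': 'localhost'})
            finally:
                os.remove(path)

    if __name__ == '__main__':
        unittest.main()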

Integration/functional tests

The terminology becomes a bit muddier from now on. While most people agree on what a unit test is, there are many different definitions for integration/functional/acceptance/system tests.

By integration testing, I mean testing several pieces of your code together. An integration test exercises a certain path through your code, a path that is usually described by means of an acceptance test. The boundary between integration tests and acceptance tests is somewhat fuzzy -- which is why I think it helps to keep Brian Marick's categories in mind. In this discussion, I look at integration tests as code-facing tests that are extracted from customer-facing acceptance tests by shunting/stubbing/mocking external interfaces -- databases, network services, even heavy-duty file system operations.

This has the immediate effect of speeding up the tests. Another effect of stubbing/mocking the interfaces is that errors can be easily simulated by the stub objects, so that your error checking and exception-handling code can be thoroughly exercised. It's much harder to exercise these aspects of your code if you depend on real failures from external interfaces, which tend to be random and hard to reproduce.

One other benefit I found in stubbing/mocking external interfaces is that it keeps you on your toes when you write code. You tend to think more about dependencies between different parts of your code (see Jim Shore's post on "Dependency Injection Demystified") and as a result, your code becomes cleaner and more testable.
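
To illustrate both points -- simulating errors via a stub and keeping dependencies explicit -- here is a minimal sketch with hypothetical class names (not code from any real project): the store is injected through the constructor, so the test can pass in a stub that fails on demand and exercise the exception-handling path.

    import unittest

    class StubUserStore(object):
        """Stand-in for a real database-backed store: no network, no database.
        It can be told to simulate a failure, something that is hard to
        provoke reliably with a real database."""
        def __init__(self, users=None, fail=False):
            self.users = users or {}
            self.fail = fail

        def load(self, user_id):
            if self.fail:
                raise IOError("simulated connection failure")
            return self.users.get(user_id)

    class UserService(object):
        """Hypothetical application code: the store is injected via the
        constructor, so tests pass in a stub and production code passes in
        the real database-backed implementation."""
        def __init__(self, store):
            self.store = store

        def greeting(self, user_id):
            try:
                user = self.store.load(user_id)
            except IOError:
                return "Service temporarily unavailable"
            return "Hello, %s" % user if user else "Hello, stranger"

    class UserServiceIntegrationTest(unittest.TestCase):
        def test_known_user(self):
            service = UserService(StubUserStore({42: "Grig"}))
            self.assertEqual(service.greeting(42), "Hello, Grig")

        def test_store_failure_is_handled_gracefully(self):
            service = UserService(StubUserStore(fail=True))
            self.assertEqual(service.greeting(42), "Service temporarily unavailable")

    if __name__ == '__main__':
        unittest.main()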

The downside of stubbing/mocking all the external interfaces is that it can take considerable time, which is the main reason why not many teams do it. But one can also say that writing unit tests takes precious time away from development, and we all know the dark and scary places where that concept leads...

To come back to the question expressed in the title of this post, and in Dave's post, I think that the type of integration/acceptance testing that I just described is a prime candidate for inclusion in the continuous integration build. It runs fast, it is code-facing, it gives developers instant feedback, and it exercises paths through the application that are not being exercised by unit tests alone. Apart from the fact that it does take time to write these tests, I see no downside in including them in the continuous build.

One caveat: if you're using a tool such as Fit or FitNesse to describe and run your acceptance tests, then you'll probably be using the same tool for running the integration tests with stubbed/mocked interfaces. This has the potential of confusing the customers, who will not necessarily know whether they're looking at the real deal or at mock testing (see this story by David Chelimsky on "Fostering Credibility in Customer Tests" for more details on what can go wrong in this context). In cases like these, I think it's worth labelling the tests in big bold letters on the FitNesse wiki pages: "END TO END TESTS AGAINST A LIVE DATABASE" vs. "PURE BUSINESS LOGIC TESTS AGAINST MOCK OBJECTS".

Although it is a somewhat artificial distinction, for the purpose of this discussion it might help to differentiate between integration tests (those tests that have all external interfaces stubbed) and functional tests, which I define as tests that have some, but not all, external interfaces stubbed. For example, in the MailOnnaStick application, we stubbed the HTTP interface by using twill and WSGI in-process testing. This made the tests much faster, since we didn't have to start up a Web server or incur the network latency penalty.
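
Roughly, that technique looks like the sketch below. This assumes the twill API of that era (add_wsgi_intercept/remove_wsgi_intercept, and the go/code/find commands); the make_app factory, the import and the page contents are hypothetical.

    import twill
    from twill.commands import go, code, find

    def make_app():
        """Return the application's WSGI callable (hypothetical factory)."""
        from mailonnastick import wsgi_app   # hypothetical import
        return wsgi_app

    # Route requests for localhost:8080 straight into the WSGI app, in-process:
    # no Web server to start, no network latency.
    twill.add_wsgi_intercept('localhost', 8080, make_app)
    try:
        go('http://localhost:8080/')
        code(200)            # assert on the HTTP status code
        find('Incidents')    # assert the page body matches a regex (hypothetical)
    finally:
        twill.remove_wsgi_intercept('localhost', 8080)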

I argue that if such functional tests are fast, they should be included in the continuous integration process, for the same reasons that integration tests should. Continuous integration is all about frequent feedback, and the more feedback you have about different paths through your application, as well as about different interactions between the components of your application, the better off you are. The main criterion for deciding whether to include functional tests in the build that happens on every check-in is again the execution time of these tests.

And to touch on another point raised by Dave, I think you shouldn't include in your continuous integration process those tests that you know will fail because they have no code to support them yet. I agree with Dave that those tests serve no purpose in terms of feedback to the developers. Those are the true, "classical" acceptance tests that I discuss next.

However, here's a comment Titus had on this feedback issue when I showed him a draft of this post:
"One thing to point out about the
acceptance tests is that (depending on how you do your iteration
planning) you may well write the acceptance tests well in advance of
the code that will make them succeed -- TDD on a long leash, I guess.
My bet is that people would take a different view of them if we
agreed that the green/red status of the continuous integration tests
would be independent of the acceptance tests, or if you could flag
the acceptance tests that are *supposed* to pass vs those that aren't.

That way you retain the feel-good aspect of knowing everything is
working to date."
I think flagging acceptance tests as "must pass" vs. "should pass at some point", and gradually moving tests from the second category into the first is a good way to go. There's nothing like having a green bar to give you energy and boost your morale :-)
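
Here is a minimal sketch of that flagging idea, using plain unittest and hypothetical test names: tests marked "not implemented yet" are left out of the suite that determines the red/green status of the build, and get promoted into it once the supporting code exists.

    import unittest

    def not_implemented_yet(test_func):
        """Mark a test as 'should pass at some point' rather than 'must pass'."""
        test_func.must_pass = False
        return test_func

    class AccountAcceptanceTests(unittest.TestCase):
        # Hypothetical acceptance tests; in practice these would drive FitNesse,
        # twill or Selenium rather than assert on arithmetic.
        def test_existing_story(self):
            self.assertEqual(2 + 2, 4)              # must pass today

        @not_implemented_yet
        def test_future_story(self):
            self.fail("story not implemented yet")  # expected to fail for now

    def must_pass_suite(test_case_class):
        """Build a suite containing only the 'must pass' tests."""
        suite = unittest.TestSuite()
        for name in unittest.TestLoader().getTestCaseNames(test_case_class):
            if getattr(getattr(test_case_class, name), 'must_pass', True):
                suite.addTest(test_case_class(name))
        return suite

    if __name__ == '__main__':
        unittest.TextTestRunner(verbosity=2).run(must_pass_suite(AccountAcceptanceTests))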

Acceptance tests

In an agile environment, acceptance tests are often expressed as "storytests" by capturing the user stories/requirements together with acceptance criteria (tests) that validate their implementation. I believe Joshua Kerievsky first coined the term storytesting, a term which is being used more and more in the agile community. I'm a firm believer in writing tests that serve as documentation -- I call this Agile Documentation, and I gave a talk about it at PyCon 2006.

Acceptance tests are customer-facing tests par excellence. As such, I believe they do need to exercise your application in environments that are as close as possible to your customers' environments. I wrote a blog post about this, where I argued that acceptance tests should talk to the database -- and you can replace database with any external interface that your application talks to.

However, even with these restrictions, I believe you can still put together a subset of acceptance tests that run fast and can be included in a continuous integration process -- a process which runs perhaps not on each and every check-in, but definitely every 3 hours, for example.

For example, an acceptance test that talks to a database can be sped up by having a small test database, but a database which still contains corner cases that can potentially wreak havoc on unsuspecting code. In-memory databases can sometimes be used successfully in cases like this. For network services -- for example getting weather information -- you can build a mock service running on a local server. This avoids dependencies on external services, while still exercising the application in a way that is much closer to the real environment.
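
For instance, here is a minimal sketch of the small-test-database idea, using an in-memory SQLite database seeded with a few corner cases (the table, the data and the count_users_with_email function are all hypothetical):

    import sqlite3

    def make_test_db():
        """Small in-memory database seeded with corner cases that tend to
        wreak havoc on unsuspecting code."""
        conn = sqlite3.connect(':memory:')
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
        rows = [
            (1, "Alice", "alice@example.com"),
            (2, "", None),                         # empty name, missing email
            (3, "O'Brien", "obrien@example.com"),  # embedded quote
        ]
        conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
        conn.commit()
        return conn

    def count_users_with_email(conn):
        """Hypothetical piece of application logic under test."""
        cur = conn.execute("SELECT COUNT(*) FROM users WHERE email IS NOT NULL")
        return cur.fetchone()[0]

    def test_count_users_with_email():
        conn = make_test_db()
        assert count_users_with_email(conn) == 2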

System tests

By system testing, people usually understand testing the application in an environment that reproduces the production environment as closely as possible. Many times, these types of tests are hard to automate, and a certain amount of manual exploratory testing is definitely needed. That being said, there is ample room for automation, if for nothing else, then for regression testing/smoke testing purposes -- by which I mean deploying a new build of the application in the system test environment and making sure that nothing is horribly broken.

Now this is a type of test that would be hard to run in a continuous integration process, mostly because of time constraints. I think it should still be run periodically, perhaps overnight.
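
As an example of the smoke-testing piece, here is a minimal sketch using only the Python standard library (the base URL and the list of paths are hypothetical): after deploying a new build to the system test environment, request a handful of key pages and fail loudly if any of them is broken.

    import sys
    from urllib.request import urlopen
    from urllib.error import URLError

    BASE_URL = 'http://systemtest.example.com'       # hypothetical environment
    PATHS = ['/', '/login', '/search?q=smoke']       # hypothetical key pages

    def smoke_test():
        broken = []
        for path in PATHS:
            try:
                urlopen(BASE_URL + path)             # raises HTTPError on 4xx/5xx
            except URLError as exc:                  # HTTPError subclasses URLError
                broken.append('%s (%s)' % (path, exc))
        return broken

    if __name__ == '__main__':
        problems = smoke_test()
        for problem in problems:
            sys.stderr.write('SMOKE TEST FAILURE: %s\n' % problem)
        sys.exit(1 if problems else 0)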

Performance/load/stress tests

Here there be dragons. I'm not going to cover this topic here, as I devoted a few blog posts to it already: "Performance vs. load vs. stress testing", "More on performance vs. load testing", "HTTP performance testing with httperf, autobench and openload".

Conclusions

This has been a somewhat long-winded post, but I hope my point is clear: the more tests you run on a continuous basis, and the more aspects of your application these tests cover, the better off you are. I call this holistic testing, and I touch on it in this post where I mention Elisabeth Hendrickson's "Better Testing, Worse Testing" article. One of the main conclusions Titus and I reached in our Agile Testing work and tutorial was that holistic testing is the way to go, as no one type of testing does it all.

As a developer, you need the comfort of knowing that your refactoring hasn't broken existing code. If your continuous integration process can run unit, integration and functional tests fast and give you instant feedback, then your comfort level is so much higher (fast being the operative word here, though). Here is an excerpt from a great post by Jeffrey Fredrick to the agile-testing mailing list:
"the sooner you learn about a problem the cheaper it is to fix,
thus the highest value comes from the test that provides feedback the
soonest. in practical terms this means that tests that run
automatically after a check-in/build will provide more value than
those that require human intervention, and among automated tests those
that run faster will provide more value than those that run slower.
thus my question about "how long does the test take to execute?" -- I
don't want a slow test getting in the way of the fast feedback I could
be getting from other tests; better to buy a separate machine to run
those slow ones.

related note: imperfect tests that are actually running provide
infinitely more value than long automation projects that spend months
writing testing frameworks but no actual tests. 'nuff said?"
My thoughts exactly :-)

If you're curious to see how Titus and I integrated our tests in buildbot, read Titus's Buildbot Technology Narrative.

To summarize, here's a continuous integration strategy that I think would work well for most people:

1. run unit tests and integration tests (with all external interfaces stubbed) on every check-in
2. also run fast functional tests (with some external interfaces stubbed) on every check-in
3. run acceptance tests with small data sets every 2 or 3 hours
4. run acceptance tests with large data sets and system tests every night
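
A tool-agnostic sketch of how this strategy might be wired up: the continuous integration server (buildbot, in our case) invokes a small driver script with the name of the trigger, and the script decides which suites to run. The suite names and the run_tests.py driver are hypothetical placeholders, not part of any particular tool.

    import subprocess
    import sys

    # Map each build trigger to the test suites it should run.
    SUITES_BY_TRIGGER = {
        'checkin':  ['unit', 'integration', 'fast_functional'],
        'every_3h': ['acceptance_small_data'],
        'nightly':  ['acceptance_large_data', 'system_smoke'],
    }

    def run_suite(name):
        # In reality this would invoke nose, py.test, FitNesse, twill scripts, etc.
        return subprocess.call([sys.executable, 'run_tests.py', '--suite', name])

    if __name__ == '__main__':
        trigger = sys.argv[1] if len(sys.argv) > 1 else 'checkin'
        failed = [name for name in SUITES_BY_TRIGGER[trigger] if run_suite(name) != 0]
        sys.exit(1 if failed else 0)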

As usual, comments are very welcome.

9 comments:

Anonymous said...

Grig,

I'm really grateful you took the time to respond in this way. It's possible I wasn't the only reader who got the wrong message from the original post, and it's an important topic. Sometimes long-windedness is required to explain a concept properly, so don't worry on that account.

Regarding what you describe as integration/functional tests, I completely agree. It seems we have the same view about all the details you brought up regarding the purpose and scope of various kinds of tests as well as when it is appropriate to stub or mock external interfaces.

The type of test I had in mind was the customer-facing acceptance test. Even there, it seems we really are in agreement about when it is or is not appropriate to include an acceptance test in the continuous build.

I fully agree with Titus that we should try to do TDD on a long leash. It sounds like BDD, and it makes perfect sense to extend the concept of TDD to that level. I'd like to see automated acceptance tests written before code is written for the explicit purpose of serving as executable "specifications." Then we're down to no documentation and no formal hand-off from analysts to developers (a common practice in larger organizations), and no ambiguity about the definition of "done." That eliminates a significant potential point of miscommunication from every iteration of every project — a huge value-add, especially in a company that has a lot of projects going on at any given time. To paraphrase Titus, there's nothing like having a red bar for the right reasons to boost your morale.

We've been trying to get to that stage on some of our projects, but it's a tough nut to crack when it comes to the customer-facing tests. Reality keeps intruding on our beautiful theory. Bad, bad, reality! And some of the tools we know and love just don't handle Ajax very gracefully (yet?).

Concerning Titus' suggestion about flagging the "should pass" acceptance tests, what about the idea of having different build scripts? We've taken the approach of having a second build script that is not initiated automatically on check-in, but is available for developers to run whenever they want to run acceptance tests.

I also like your idea of Agile Documentation. I'm not sure it's an entirely new idea, but it's good that you've given it a name and are promoting the concept formally. In a perfect world, everyone follows sound software engineering practices, designs and refactors to patterns, and follows widely-accepted conventions. Meanwhile, API documentation can be generated for us. That leaves little need for the comprehensive technical documentation that is an expected deliverable in most traditional methodologies. A high level architectural diagram and some explanation of anything not done according to convention is all the documentation we need. The test suite serves as an accurate technical "document." Unlike traditional technical documents, the test suite is guaranteed to be completely up to date at all times, since it's checked in along with the code and we know (because it's a perfect world) the next team to touch the code base will use TDD to make their enhancements.

In our corporate environment, system testing is out of scope of the project team's work. There's a separate group that puts new apps through their paces for security, performance, capacity, failover, and living in harmony with other apps that share the same server environment. I can see that level of testing being brought into the scope of a project team's work, but our company doesn't happen to do things that way.

Thanks again for the excellent explanation. Keep up the good work!

Grig Gheorghiu said...

Dave,

Glad you found this post useful, and it does seem like we are in agreement on most of the points.

Regarding separate build scripts, I was thinking about that myself, but I ended up not mentioning it in my post. I actually know people who have 2 sets of FitNesse tests: one set containing acceptance-tests-as-specifications-which-are-not-implemented-yet (and in consequence it stays red until implemented) and another set with tests that exercise code already implemented -- and this set should always be green. I think it's a very good strategy.

Regarding Ajax testing, have you looked at Selenium? I have a couple of posts on testing Ajax-specific functionality with Selenium:

http://agiletesting.blogspot.com/2006/01/testing-commentary-and-thus-ajax-with.html
http://agiletesting.blogspot.com/2006/03/ajax-testing-with-selenium-using_21.html

Finally, you're right, I didn't invent the term Agile Documentation, far from it. I believe that people understand different things by this concept though, that's why I'm careful to always define my understanding of it. Part of my talk was inspired by an article written by Brian Button in Better Software, called "Double duty" (where the tests act as documentation).

Anonymous said...

I was disappointed in your notion of integration testing. The traditional view of integration testing, sometimes referred to as boundary testing, is to exercise the interfaces between S/W units being integrated. It is assumed that the developers have adequately performed unit tests, using a host of white box techniques, but when you get to integration testing, you must ensure that the interfaces are solid.

It would help if the interfaces are defined and hopefully documented in some way, and this is where I think the Agile methodology fails - who wants to spend time keeping functional and interface documentation up to date and in synch with the code?

I've been searching but so far have not found any agile solutions to managing interfaces between S/W modules, which suggests to me nobody regards this as an issue.

Grig Gheorghiu said...

Anonymous -- I think you're missing the fact that in an Agile methodology, the integration tests *are* the documentation for the interfaces under test. Take FitNesse (fitnesse.org) for example: a FitNesse Wiki page contains interface documentation/specification *and* live running tests at the same time.

Anonymous said...

Fantastic article -- thanks!

I'd like to respond to the discussion about the "should pass" tests...

In my experience, those tests that just sit there running red (or in some sort of "ignored" state) amount to a Hunt & Thomas-style "broken window." A failing test that should pass now looks just like a failing test that should pass two or three sprints down the road... once you get used to tolerating the red, it's hard to get back to an all-green-all-the-time mindset.

The idea does have merit, though... If you have some way to time-bomb the "should-pass" tests so that after X sprints they come to life, that might help things.

Thanks again for the thoughtful piece.

Anonymous said...

Great post. I have been a follower of yourself and David Nicolette for some time now.

On my projects I have always strived to create a Health Dashboard in which I can understand if we are missing a beat in those places that might be hard to find if we don't deal with them sooner. With a powerful machine or network of machines, I have strived to always run the acceptance tests as part of the logical continuous integration environment. I might have one build running all of the programmer/unit tests on every check-in. In a different build I have run the acceptance tests on a different schedule - Dave mentions this too.

One of the issues that comes up is in dealing with the acceptance tests not passing at the beginning of the iteration/release. I have found using numbers to indicate how many acceptance tests are completed or remain incomplete is helpful for the iteration. Also important is to know whether existing functionality has ceased to work. Using a rudimentary flag, it is possible to spot when a test that was passing is now not passing, which is when the alarm sounds on the build machine.

Thanks for a stimulating read.

Nick.
www.lab49.com

Anonymous said...

Thanks for putting together this post, it's a great overview.

Here's a question: I'm working on transitioning my team from a traditional stage/gate process with separate QA and build staff to more of a test driven agile approach. A key piece of that transition effort is to find the high value test development efforts and get the team to buy into those early. Where would you start--i.e. which of the flavors of testing--where the goal is to get some early buy in from the dev team to put the energy into automating tests?

Grig Gheorghiu said...

Hi, Jim

Thanks for your comments. I'd definitely try to get the developers to write unit tests, since they are the ones most qualified to do so. Testers don't write production code, so they shouldn't write unit tests. They can help developers on that path though by showing them how to write unit tests, or by suggesting boundary condition-type tests that developers should think about.

Once you have a solid unit testing suite in place, you can have developers collaborate with testers and business analysts/customers in defining acceptance tests. I've seen cases where a developer was dedicated to the task of helping testers write acceptance tests in FitNesse/PyFIT -- since that required considerable scripting/programming skills.

HTH,

Grig

Jeff Anderson said...

great post,
and the subject matter of this blog is extremely timely...

I'm currently figuring out how to incorporate "agile testing" and to our developer teams. I plan to share your blog with the rest of the team...

http://agileconsulting.blogspot.com
