Should acceptance tests be included in the continuous build process?
Here are some of my thoughts on this subject. First of all -- thank you, Dave, for your comments and blog post, which prompted me to better clarify to myself some of these things. I will argue in what follows that the speed of tests is of the essence, and is a big factor in determining which tests are run when.
Following Brian Marick's terminology, let's first distinguish between customer-facing (or business-facing) tests and code-facing (or technology-facing tests). I think it's an important distinction.
Customer-facing tests are high-level tests expressed in the business domain language, and they are created (ideally) through the collaboration of customers, business analysts, testers and developers. When a customer-facing test passes, it gives the customer a warm fuzzy feeling that the application does what it's supposed to do. Customer-facing tests are usually called acceptance tests. They can operate at the business logic level (in which case, at least in agile environments, they're usually created in executed with tools such as Fit or FitNesse), or at the GUI level (in which case a variety of tools can be used; for Web application, a combination of twill and Selenium will usually do the trick).
Code-facing tests are lower-level tests expressed in the language of the programmers. They deal with the nitty-gritty of the application, and when they pass, they give developers a warm fuzzy feeling that their code does what they intended it to do, and that they didn't break any existing code when they refactored it. Unit tests are a prime example of code-facing tests.
That said, there are some types of tests that can be seen as both customer-facing and code-facing. I'll talk more about them later on in this post.
Let me now discuss the various types of testing that Dave mentions in his post.
Unit tests are clearly code-facing tests. Michael Feathers, in his now-classical book "Working Effectively With Legacy Code", says that good unit tests have two qualities:
- they run fast
- they help us localize problems
"Unit tests run fast. If they don't run fast, they aren't unit tests.
Other kinds of tests often masquerade as unit tests. A test is not a unit test if:
It talks to a database.
It communicates across a network.
It touches the file system.
You have to do special things to your environment (such as editing configuration files) to run it.
For more great advice on writing good unit tests, see Roy Osherove's blog post on "Achieving and Recognizing Testable Software Designs". Fast run time is again one of Roy's criteria for good unit tests.
How fast should a unit test be? Michael Feathers says that if a unit test takes more than 1/10 of a second to run, it's too slow. I'd say a good number to shoot for in terms of the time it should take for all your unit tests to run is 2 minutes, give or take.
The terminology becomes a bit muddier from now on. While most people agree on what a unit test is, there are many different definitions for integration/functional/acceptance/system tests.
By integration testing, I mean testing several pieces of your code together. An integration test exercises a certain path through your code, a path that is usually described by means of an acceptance test. The boundary between integration tests and acceptance tests is somehow fuzzy -- which is why I think it helps to keep Brian Marick's categories in mind. In this discussion, I look at integration tests as code-facing tests that are extracted from customer-facing acceptance tests by shunting/stubbing/mocking external interfaces -- databases, network services, even heavy-duty file system operations.
This has the immediate effect of speeding up the tests. Another effect of stubbing/mocking the interfaces is that errors can be easily simulated by the stub objects, so that your error checking and exception-handling code can be thoroughly exercised. It's much harder to exercise these aspects of your code if you depend on real failures from external interfaces, which tend to be random and hard to reproduce.
One other benefit I found in stubbing/mocking external interfaces is that it keeps you on your toes when you write code. You tend to think more about dependencies between different parts of your code (see Jim Shore's post on "Dependency Injection Demystified") and as a result, your code becomes cleaner and more testable.
The downside of stubbing/mocking all the external interfaces is that it can take considerable time, which is the main reason why not many teams do it. But one can also say that writing unit tests takes precious time away from development, and we all know the dark and scary places where that concept leads...
To come back to the question expressed in the title of this post, and in Dave's post, I think that the type of integration/acceptance testing that I just described is a prime candidate for inclusion in the continuous integration build. It runs fast, it is code-facing, it gives developers instant feedback, it exercises paths through the application that are not being exercised by unit tests alone. Apart from the fact that it does take time to write this type of tests, I see no downside in including them in the continuous build.
One caveat: if you're using a tool such as Fit or FitNesse to describe and run your acceptance tests, then you'll be probably using the same tool for running the integration tests with stubbed/mocked interfaces. This has the potential of confusing the customers, who will not necessarily know whether they're looking at the real deal or at mock testing (see this story by David Chelimsky on "Fostering Credibility in Customer Tests" for more details on what can go wrong in this context). In cases like these, I think it's worth labelling the tests in big bold letters on the FitNesse wiki pages: "END TO END TESTS AGAINST A LIVE DATABASE" vs. "PURE BUSINESS LOGIC TESTS AGAINST MOCK OBJECTS".
Although it is a somewhat artificial distinction, for the purpose of this discussion it might help to differentiate between integration tests (those tests that have all the external interface stubbed) and functional tests, which I define as tests that have some, but not all external interfaces stubbed. For example, in the MailOnnaStick application, we stubbed the HTTP interface by using twill and WSGI in-process testing. This made the tests much faster, since we didn't have to start up a Web server or incur the network latency penalty.
I argue that if such functional tests are fast, they should be included in the continuous integration process, for the same reasons that integration tests should. Continous integration is all about frequent feedback, and the more feedback you have about different paths through your application, as well as about different interactions between the components of your application, the better off you are. The main criterion for deciding whether to include functional tests in the build that happens on every check-in is again the execution time of these tests.
And to touch on another point raised by Dave, I think you shouldn't include in your continuous integration process those tests that you know will fail because they have no code to support them yet. I agree with Dave that those tests serve no purpose in terms of feedback to the developers. Those are the true, "classical" acceptance tests that I discuss next.
However, here's a comment Titus had on this feedback issue when I showed him a draft of this post:
"One thing to point out about theI think flagging acceptance tests as "must pass" vs. "should pass at some point", and gradually moving tests from the second category into the first is a good way to go. There's nothing like having a green bar to give you energy and boost your morale :-)
acceptance tests is that (depending on how you do your iteration
planning) you may well write the acceptance tests well in advance of
the code that will make them succeed -- TDD on a long leash, I guess.
My bet is that people would take a different view of them if we
agreed that the green/red status of the continuous integration tests
would be independent of the acceptance tests, or if you could flag
the acceptance tests that are *supposed* to pass vs those that aren't.
That way you retain the feel-good aspect of knowing everything is
working to date."
In an agile environment, acceptance tests are often expressed as "storytests" by capturing the user stories/requirements together with acceptance criteria (tests) that validate their implementation. I believe Joshua Kerievsky first coined the term storytesting, a term which is being used more and more in the agile community. I'm a firm believer in writing tests that serve as documentation -- I call this Agile Documentation, and I gave a talk about it at PyCon 2006.
Acceptance tests are customer-facing tests par excellence. As such, I believe they do need to exercise your application in environments that are as close as possible to your customers' environments. I wrote a blog post about this, where I argued that acceptance tests should talk to the database -- and you can replace database with any external interface that your application talks to.
However, even with these restrictions, I believe you can still put together a subset of acceptance tests that run fast and can be included in a continous integration process -- a process which run maybe not on each and every check-in, but definitely every 3 hours for example.
For example, an acceptance test that talks to a database can be sped up by having a small test database, but a database which still contains corner cases that can potentially wreak havoc on unsuspecting code. In-memory databases can sometimes be used successfully in cases like this. For network services -- for example getting weather information -- you can build a mock service running on a local server. This avoids dependencies on external services, while still exercising the application in a way that is much closer to the real environment.
By system testing, people usually understand testing the application in an environment that reproduces as close as possible the production environment. Many times, these types of tests are hard to automate, and definitely a certain amount of manual exploratory testing is needed. This being said, there is ample room for automation, if for nothing else but for regression testing/smoke testing purposes -- by which I mean deploying a new build of the application in the system test environment and making sure that nothing is horribly broken.
Now this is a type of tests that would be hard to run in a continous integration process, mostly because of time constraints. I think it should still be run periodically, perhaps overnight.
Here there be dragons. I'm not going to cover this topic here, as I devoted a few blog posts to it already: "Performance vs. load vs. stress testing", "More on performance vs. load testing", "HTTP performance testing with httperf, autobench and openload".
This has been a somewhat long-winded post, but I hope my point is clear: the more tests you run on a continuous basis, and the more aspects of your application these tests cover, the better off you are. I call this holistic testing, and I touch on it in this post where I mention Elisabeth Hendrickson's "Better Testing, Worse Testing" article. One of the main conclusions Titus and I reached in our Agile Testing work and tutorial was that holistic testing is the way to go, as no one type of testing does it all.
As a developer, you need to the comfort of knowing that your refactoring hasn't broken existing code. If your continuous integration process can run unit, integration and functional tests fast and give you instant feedback, then your comfort level is so much higher (fast is the operational word here though). Here is an excerpt from a great post by Jeffrey Fredrick to the agile-testing mailing list:
"the sooner you learn about a problem the cheaper it is to fix,My thoughts exactly :-)
thus the highest value comes from the test that provides feedback the
soonest. in practical terms this means that tests that run
automatically after a check-in/build will provide more value than
those that require human intervention, and among automated tests those
that run faster will provide more value than those that run slower.
thus my question about "how long does the test take to execute?" -- I
don't want a slow test getting in the way of the fast feedback I could
be getting from other tests; better to buy a separate machine to run
those slow ones.
related note: imperfect tests that are actually running provide
infinitely more value than long automation projects that spend months
writing testing frameworks but no actual tests. 'nuff said?"
If you're curious to see how Titus and I integrated our tests in buildbot, read Titus's Buildbot Technology Narrative.
To summarize, here's a continuous integration strategy that I think would work well for most people:
1. run unit tests and integration tests (with all external interfaces stubbed) on every check-in
2. also run fast functional tests (with some external interfaces stubbed) on every check-in
3. run acceptance tests with small data sets every 2 or 3 hours
4. run acceptance tests with large data sets and system tests every night
As usual, comments are very welcome.