Chasing Bugs: PHPUnit Hides Errors

I love stories about hunting down interesting bugs. I find things like left-pad breaking the internet, cloudbleed, and the sleep bug in Eve Online extraordinarily interesting.

The Bug

Recently, I went on my own bug hunt. I’d seen some strange behavior in our PHPUnit test suite. The tests were passing when run normally, but certain tests would consistently fail with process isolation enabled. The behavior was similar to what is shown below, which is a from a simple test suite I made to reproduce the problem:

mkasberg$ phpunit .
PHPUnit 5.7.15 by Sebastian Bergmann and contributors.

...                                                                 3 / 3 (100%)

Time: 124 ms, Memory: 8.00MB

OK (3 tests, 3 assertions)

mkasberg$ phpunit --process-isolation .
PHPUnit 5.7.15 by Sebastian Bergmann and contributors.

..E                                                                 3 / 3 (100%)

Time: 421 ms, Memory: 8.00MB

There was 1 error:

1) ThingTest::testTheThing
array_keys() expects parameter 1 to be array, integer given

/home/mkasberg/Desktop/php-test/ThingTest.php:7

ERRORS!
Tests: 3, Assertions: 2, Errors: 1.

Uh-oh. That’s not good. I mean, the failure alone is bad enough, but the fact that it only shows up with process isolation turned on is particularly troubling. Process isolation starts a new PHP process for each unit test. If tests are passing without process isolation and failing with process isolation, it implies that the tests are changing global state. These type of bugs are notoriously hard to track down.

The Search Begins

I started by simply debugging one of the failing tests. And I found rather quickly that the error (shown with process isolation enabled) was correct. Very simply, a function was being called with incorrect parameters. Now, that’s easy to fix, but I didn’t want to fix it yet. Before I could fix it, I had to understand why it appeared to pass when run as part of a larger test suite. A test suite that hides failures is a really scary thing.

The scale of the problem I faced was overwhelming. The results shown above are from a simple test suite designed to reproduce the problem. The suite where I originally discovered the problem had more than 1000 different tests, more than 20 of which would fail with process isolation enabled. Figuring out which one of the 1000+ tests was affecting the global state was a daunting task (because it could literally be any of them).

Needle in a Haystack

My first step was to reduce the size of the problem. Disable as many test cases as possible while still producing the behavior I originally saw. Basically, I turned off different parts of the test suite until I figured out exactly which combination of tests was causing the problem. This is a bit strange - imagine, I have a test suite with more than 1000 tests that should be producing roughly 20 errors, but isn’t. And I’m turning off entire chunks of the test suite, hoping that I still don’t see any errors even though I should. Eventually, I was able to reduce the problem to 3 tests.

One of the tests was the test that is supposed to fail. The other two tests were required to run before the failing test to reproduce the behavior that hides the error. If I turned off either of the other two tests, the error would be shown. These two tests were somehow affecting the global state of the PHP process. I created a gist on Github that shows my own (simplified) reproduction of the problem.

Finding the Bug

At this point, although I knew which tests were causing the problem, I still didn’t know why. In the codebase where I first discovered the problem, the problematic function call was buried 15 layers down the call stack. To find it, I took the same approach that I took to finding the 3 problematic tests. Comment out large chunks of code ’til something changes. If I comment out a large chunk of code and the error shows up because of it, I know that something in that chunk is causing the problem. I didn’t care if I broke any of the other tests along the way - I just needed to find out which line of code was preventing PHPUnit from showing the correct error message.

After a bit of searching, I found it. And after finding it, it made perfect sense. Our code was calling set_error_handler() to do some custom logging with errors. In a production environment, our custom error handler would be configured to send error messages to certain places, or act on them in other ways. In PHPUnit, none of this config was loaded. So it would eat error messages without reporting them anywhere.

The Solution

If I was writing all of this from scratch, I’d probably solve this problem with dependency injection. In production, set_error_handler() would actually be called. And in unit tests, I’d use a mock that would never actually call the set_error_handler() function. But I wasn’t writing this from scratch. This error handler was in a very old part of a very large codebase. I needed to be sure that set_error_handler() would be called 100% of the time in production. And, for fear of running into the same problem again, I needed to be sure that it would never be called inside a unit test. The solution I came up with is ugly. But it always works.

if (!isset($GLOBALS['argv'][0]) || strpos($GLOBALS['argv'][0], 'phpunit') === false) {
    set_error_handler(...);
}

Lessons Learned

The biggest lesson I learned from this experience is really just reinforcing something I already knew. PHP has several functions like ini_set(), set_error_handler(), etc. that can affect the global state of a PHP process. If you can’t avoid using these functions entirely, at least isolate them in your code base so they’re not inter-mingled with other logic in your code.
Run your PHPUnit test suite with --process-isolation occasionally. It’s slower, but it might show you problems you didn’t otherwise know about.

Mike Kasberg