How to Write Unit Tests That Actually Catch Bugs (Not Just Pass)

A passing test suite is not the same as a safe codebase.

This is the uncomfortable truth that every developer eventually discovers the hard way: you can have 90% code coverage, hundreds of green checkmarks in your CI pipeline, and still ship a bug that takes down production. Not because testing doesn't work — but because the tests were written to pass, not to break.

The distinction matters enormously. Tests written to pass find only the problems you already knew to look for. Tests written to break find the problems you didn't know existed — the edge cases, the integration failures, the logic errors hiding behind a happy path that always works in your local environment.

This guide is about the second kind. It covers what makes unit tests genuinely effective, the specific patterns that expose real bugs versus the ones that generate false confidence, and the practical techniques that separate test suites that protect production from test suites that decorate a CI dashboard.

Why most unit tests don't actually catch bugs

The problem starts with how most developers think about testing. The mental model is: write code, write a test that confirms the code works, move on. This produces tests that verify the implementation you just wrote — which is exactly the wrong thing to verify.

A false negative occurs when a test passes even though a bug is present (a false positive is the opposite: a test that fails although the code is correct). This is the most dangerous kind of test failure because it doesn't look like a failure at all. Your CI pipeline is green. Your coverage report looks healthy. And somewhere in your codebase, a bug is waiting for a specific input combination, a concurrent access pattern, or an edge case nobody thought to write a test for.

The three root causes of tests that pass without catching bugs:

1. Testing the implementation instead of the behaviour. When your test knows too much about how a function works internally — which private methods it calls, which intermediate values it produces — it becomes brittle and incomplete. It breaks when you refactor the implementation even if the behaviour is correct, and it passes when the behaviour is wrong but the internal mechanics match expectations.

2. Only testing the happy path. The happy path is what happens when everything works. Most bugs live in the unhappy paths: invalid inputs, empty collections, null values, concurrent access, network failures, integer overflow, division by zero, off-by-one errors at boundary conditions. Testing only the case where the code works tells you almost nothing about whether the code is reliable.

3. Writing tests after the code is already working. When you write tests after implementation, your mental model is anchored to the working case. You know the code works, so you write tests that confirm what you know. You systematically miss the cases that would reveal the code's failure modes — because you're not thinking about failure modes, you're thinking about verification.
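
As a small illustration of the second and third causes, here is a sketch in Python. The functions `average` and `average_or_none` are hypothetical, invented only for this example. The happy-path test passes even though `average` still crashes on an empty list; the bug only becomes visible once someone asks which input would break it.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    # Latent bug: an empty list raises ZeroDivisionError.
    return sum(values) / len(values)


def test_average_of_three_numbers():
    # Happy path: the only case the author had in mind while writing the code.
    assert average([2, 4, 6]) == 4


def average_or_none(values):
    """Same calculation, with the empty case made part of the contract."""
    if not values:
        return None
    return sum(values) / len(values)


def test_average_of_empty_list_is_none():
    # Unhappy path: written by asking "what input would break this?"
    assert average_or_none([]) is None
```

Both tests are green, but only the second one says anything about how the code behaves when the input is not what the author expected.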

Principle 1: Test behaviour, not implementation

The most impactful shift in unit testing philosophy is moving from "does this code do what I wrote?" to "does this code behave correctly from the outside?"

Implementation-testing (wrong approach):

// Test knows too much about internals
test('calculateDiscount calls validateUser and applyPercentage', () => {
  const validateSpy = jest.spyOn(utils, 'validateUser');
  const applySpy = jest.spyOn(utils, 'applyPercentage');

  const user = { id: 1, tier: 'premium', orderTotal: 100 };
  calculateDiscount(user, 0.1);

  expect(validateSpy).toHaveBeenCalledWith(user);
  expect(applySpy).toHaveBeenCalledWith(100, 0.1);
});

Behaviour-testing (right approach):

// Test only cares about observable output
test('calculateDiscount applies 10% to eligible user order', () => {
  const eligibleUser = { id: 1, tier: 'premium', orderTotal: 100 };
  const result = calculateDiscount(eligibleUser, 0.10);
  expect(result).toBe(90);
});

test('calculateDiscount returns full price for ineligible user', () => {
  const basicUser = { id: 2, tier: 'basic', orderTotal: 100 };
  const result = calculateDiscount(basicUser, 0.10);
  expect(result).toBe(100);
});

The second approach survives refactoring. You can completely rewrite how calculateDiscount works internally — different validation logic, different calculation method — and the tests still pass if and only if the behaviour is correct. The first approach breaks whenever you touch the internals, regardless of whether the behaviour changed.

The practical rule: your test should only call the public interface of the unit under test. If your test reaches into private methods or checks internal state that isn't exposed publicly, you're testing implementation.
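
To make the refactoring claim concrete, here is a minimal Python sketch. Both implementations and the helper `check_behaviour` are hypothetical; the dictionary key `order_total` mirrors `orderTotal` from the examples above, and the rate 0.25 is chosen only because it keeps the floating-point arithmetic exact. The same behavioural checks run against two different implementations and never look inside either one.

```python
def calculate_discount_v1(user, rate):
    """First implementation: explicit branching on the tier."""
    if user["tier"] == "premium":
        return user["order_total"] - user["order_total"] * rate
    return user["order_total"]


ELIGIBLE_TIERS = {"premium"}


def calculate_discount_v2(user, rate):
    """Refactored implementation: set lookup and a multiplier."""
    multiplier = 1 - rate if user["tier"] in ELIGIBLE_TIERS else 1
    return user["order_total"] * multiplier


def check_behaviour(calculate_discount):
    """Behavioural checks: only inputs and the returned value are used."""
    premium = {"id": 1, "tier": "premium", "order_total": 100}
    basic = {"id": 2, "tier": "basic", "order_total": 100}
    assert calculate_discount(premium, 0.25) == 75
    assert calculate_discount(basic, 0.25) == 100


# The same checks pass before and after the refactor.
check_behaviour(calculate_discount_v1)
check_behaviour(calculate_discount_v2)
```

A spy-based test written against `calculate_discount_v1` would have had to be rewritten for `calculate_discount_v2`, even though no caller could tell the two apart.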

Principle 2: Structure every test with Arrange-Act-Assert

The Arrange-Act-Assert (AAA) pattern is the most reliable structure for writing readable, maintainable, effective unit tests. Every test is divided into three distinct phases with a clear separation between them.

import pytest

def test_transfer_reduces_sender_balance():
    # Arrange — set up the state needed for this test
    sender   = Account(id=1, balance=500.00)
    receiver = Account(id=2, balance=100.00)
    transfer = TransferService(db=FakeDatabase())

    # Act — perform the single action being tested
    transfer.execute(from_account=sender, to_account=receiver, amount=200.00)

    # Assert — verify the expected outcome
    assert sender.balance == 300.00

def test_transfer_increases_receiver_balance():
    # Arrange
    sender   = Account(id=1, balance=500.00)
    receiver = Account(id=2, balance=100.00)
    transfer = TransferService(db=FakeDatabase())

    # Act
    transfer.execute(from_account=sender, to_account=receiver, amount=200.00)

    # Assert
    assert receiver.balance == 300.00

def test_transfer_raises_when_insufficient_funds():
    # Arrange
    sender   = Account(id=1, balance=50.00)
    receiver = Account(id=2, balance=100.00)
    transfer = TransferService(db=FakeDatabase())

    # Act + Assert (for exceptions, combined is standard)
    with pytest.raises(InsufficientFundsError):
        transfer.execute(from_account=sender, to_account=receiver, amount=200.00)

Notice that each test covers exactly one assertion about one behaviour. This is deliberate. When a test with ten assertions fails, you know something broke — but you don't know what. When a test with one assertion fails, you know exactly which behaviour is broken.

The AAA structure also exposes design problems. If your Arrange section is enormous — many lines of setup, many dependencies to configure — that's a signal that the unit under test has too many responsibilities or too many external dependencies. Hard-to-test code is usually poorly designed code.
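
The transfer tests above rely on `Account`, `TransferService`, `FakeDatabase`, and `InsufficientFundsError`, none of which are defined in this guide. The sketch below is one possible shape for them, assumed purely so the tests have something to run against; a real service would look different.

```python
from dataclasses import dataclass


class InsufficientFundsError(Exception):
    """Raised when the sender cannot cover the transfer amount."""


@dataclass
class Account:
    id: int
    balance: float


class FakeDatabase:
    """In-memory stand-in that records saved accounts instead of persisting them."""

    def __init__(self):
        self.saved = []

    def save(self, account):
        self.saved.append(account)


class TransferService:
    def __init__(self, db):
        self._db = db

    def execute(self, from_account, to_account, amount):
        # Validate before mutating anything, so a failure leaves no partial state.
        if amount > from_account.balance:
            raise InsufficientFundsError(
                f"balance {from_account.balance} is less than {amount}"
            )
        from_account.balance -= amount
        to_account.balance += amount
        self._db.save(from_account)
        self._db.save(to_account)
```

Note how short the Arrange sections stay: the service takes one dependency, so each test needs only two accounts and a fake database.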

Principle 3: Write tests for the cases that actually break in production

The single most effective change most teams can make to their test suites is expanding what they test beyond the happy path. Production bugs don't live in the cases you thought of — they live in the cases you didn't.

The categories of test cases most teams skip:

Boundary values. If your function accepts values from 1 to 100, test 0, 1, 100, and 101. Bugs cluster at boundaries. An off-by-one error at a boundary condition is one of the most common bugs in production, and it's invisible if you only test with values in the middle of the range.

def test_discount_applies_at_minimum_qualifying_amount():
    # Boundary: exactly at the threshold
    assert calculate_discount(order_total=50.00) == 45.00

def test_discount_does_not_apply_below_minimum():
    # Boundary: one unit below
    assert calculate_discount(order_total=49.99) == 49.99

def test_discount_applies_above_minimum():
    # Boundary: clearly above
    assert calculate_discount(order_total=100.00) == 90.00
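
The 1-to-100 example from the paragraph above can be sketched the same way. The function `is_valid_quantity` and its limits are hypothetical; the point is the four inputs, one on each side of each boundary.

```python
MIN_QUANTITY = 1
MAX_QUANTITY = 100


def is_valid_quantity(quantity):
    """Accept whole quantities from 1 to 100 inclusive (hypothetical rule)."""
    return MIN_QUANTITY <= quantity <= MAX_QUANTITY


def test_quantity_just_below_minimum_is_rejected():
    assert is_valid_quantity(0) is False


def test_quantity_at_minimum_is_accepted():
    assert is_valid_quantity(1) is True


def test_quantity_at_maximum_is_accepted():
    assert is_valid_quantity(100) is True


def test_quantity_just_above_maximum_is_rejected():
    assert is_valid_quantity(101) is False
```

If someone later changes `<=` to `<` on either side, exactly one of these tests fails, and its name says which boundary moved.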

Null, empty, and zero values. null and undefined cause a disproportionate share of production bugs. Test what happens when a required field is missing, when a collection is empty, when a numeric input is zero.

test('formatUserName returns empty string for null input', () => {
  expect(formatUserName(null)).toBe('');
});

test('processOrders handles empty order list gracefully', () => {
  expect(() => processOrders([])).not.toThrow();
  expect(processOrders([])).toEqual({ total: 0, count: 0 });
});

Error and exception paths. If your code can fail, test that it fails correctly. Wrong error type, swallowed exceptions, partial state after a failure — these are production incidents waiting to happen.

test('saveUser rolls back transaction when email validation fails', async () => {
  const invalidUser = { name: 'Jane', email: 'not-an-email' };
  const db = new FakeDatabase();

  await expect(saveUser(invalidUser, db)).rejects.toThrow(ValidationError);
  expect(db.getUserCount()).toBe(0); // Verify rollback happened
});

Concurrent and ordering-dependent scenarios. If multiple calls can happen in sequence or in parallel, test the combinations that break things: what if the same resource is modified twice in quick succession? What if a callback fires before the initial setup completes?
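
A minimal sketch of the "modified twice in quick succession" case, in Python. The `Order` class and `FakeRefundGateway` are hypothetical. This covers only the repeated sequential call, which a unit test can exercise deterministically; true parallel races usually need different tooling.

```python
class FakeRefundGateway:
    """Fake that records refunds in memory instead of calling a payment API."""

    def __init__(self):
        self.refunds = []

    def refund(self, order_id, amount):
        self.refunds.append((order_id, amount))


class Order:
    def __init__(self, order_id, total, gateway):
        self.order_id = order_id
        self.total = total
        self.status = "paid"
        self._gateway = gateway

    def cancel(self):
        # Guard against a repeated call: only a paid order can be refunded.
        if self.status != "paid":
            return
        self.status = "cancelled"
        self._gateway.refund(self.order_id, self.total)


def test_cancelling_twice_refunds_only_once():
    # Arrange
    gateway = FakeRefundGateway()
    order = Order(order_id="order-1", total=40, gateway=gateway)

    # Act: the same operation arrives twice, e.g. a double click or a retry
    order.cancel()
    order.cancel()

    # Assert
    assert gateway.refunds == [("order-1", 40)]
```

Removing the status guard in `cancel` makes this test fail with two recorded refunds, which is exactly the production incident it is meant to prevent.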

Principle 4: Use the right test double for the right job

Test doubles — mocks, stubs, fakes, and spies — are the mechanism that lets you test a unit in isolation from its dependencies. Using the wrong type creates tests that either miss real bugs or produce false failures.

Stub — returns a predefined value. Use when you need a dependency to return something specific to set up your test's scenario, but you don't care how it was called.

class StubPaymentGateway:
    def charge(self, amount, card):
        return PaymentResult(success=True, transaction_id="txn_test_123")

def test_order_creates_confirmation_on_successful_payment():
    gateway = StubPaymentGateway()
    order = Order(items=[...], total=99.99)
    result = checkout(order, gateway, card=test_card)
    assert result.confirmation_number is not None

Mock — verifies that a specific interaction happened. Use when the interaction itself is the thing being tested — not just the output.

from unittest.mock import Mock

def test_failed_payment_sends_notification_to_user():
    mock_notifier = Mock()
    failing_gateway = StubFailingPaymentGateway()
    order = Order(items=[...], total=99.99)

    checkout(order, failing_gateway, card=test_card, notifier=mock_notifier)

    mock_notifier.send_failure_notification.assert_called_once_with(
        user_id=order.user_id,
        reason="Payment declined"
    )

Fake — a working simplified implementation of a dependency. Use for dependencies that would be impractical to use in tests (real databases, real filesystems, real HTTP clients) but where you need the full logic to work correctly.

class FakeUserRepository:
    def __init__(self):
        self._users = {}

    def save(self, user):
        self._users[user.id] = user

    def find_by_id(self, user_id):
        return self._users.get(user_id)

    def find_by_email(self, email):
        return next(
            (u for u in self._users.values() if u.email == email),
            None
        )

The rule of thumb: use stubs and fakes by default. Reach for mocks only when the interaction itself is the behaviour you're testing — not just the output. Over-reliance on mocks produces tests that verify implementation rather than behaviour, which is the root cause of the problem this guide is trying to solve.
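
To show why a fake is often the better default, here is a sketch of a test that uses the repository above. The `User` dataclass and the `register_user` service are assumptions made for this example; the fake is repeated so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: str
    email: str


class FakeUserRepository:
    """Dictionary-backed repository with real lookup logic."""

    def __init__(self):
        self._users = {}

    def save(self, user):
        self._users[user.id] = user

    def find_by_id(self, user_id):
        return self._users.get(user_id)

    def find_by_email(self, email):
        return next(
            (u for u in self._users.values() if u.email == email),
            None,
        )


def register_user(repo, user):
    """Hypothetical service: refuse an email that is already registered."""
    if repo.find_by_email(user.email) is not None:
        return False
    repo.save(user)
    return True


def test_register_user_rejects_duplicate_email():
    repo = FakeUserRepository()
    register_user(repo, User(id="u1", email="jane@example.com"))

    accepted = register_user(repo, User(id="u2", email="jane@example.com"))

    assert accepted is False
    assert repo.find_by_id("u2") is None
```

A stub would have to be told in advance what `find_by_email` returns. The fake works it out from what was saved earlier, so the test exercises the service's real decision and asserts only on observable outcomes.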

Principle 5: Name tests as precise failure descriptions

The name of a failing test is the first thing a developer reads when the CI pipeline breaks. A good test name tells you exactly what broke, in what context, and what was expected — without opening the test file.

Bad test names:

test_user()
test_discount_calculation()
testEmailValidation()
test1()

Good test names:

test_calculateDiscount_returns_zero_for_negative_order_total
test_saveUser_raises_DuplicateEmailError_when_email_already_exists
test_transfer_execute_rolls_back_both_accounts_when_receiver_update_fails
test_processPayment_sends_notification_on_gateway_timeout

The pattern that works: [function or unit]_[expected behaviour]_[condition or input]. It's verbose but precise. When a test named this way fails in CI at 2 AM, the engineer on call knows exactly what broke without opening the code.

A good test name is itself a specification. If you read all your test names in sequence, they should describe the complete expected behaviour of your unit — every success case, every failure case, every edge case — without a line of code being visible.

Principle 6: Keep tests deterministic — eliminate randomness and time dependencies

A test that sometimes passes and sometimes fails is worse than no test at all. It erodes trust in the test suite, causes engineers to re-run CI hoping for a green pass, and eventually gets ignored or disabled. This is called a flaky test, and it's one of the most corrosive problems in a test suite.

The two most common causes of flaky tests:

Randomness. Any test that depends on a random value — a random ID, a randomly generated input, a random delay — can produce different results on different runs. Replace randomness with fixed, predictable values in tests.

# Flaky — test outcome depends on random input
def test_user_id_is_unique():
    user1 = create_user()  # generates random ID internally
    user2 = create_user()
    assert user1.id != user2.id  # might fail if IDs collide

# Deterministic — use controlled, predictable IDs
def test_two_users_with_same_id_raises_duplicate_error():
    repo = FakeUserRepository()
    repo.save(User(id="fixed-id-123", email="a@example.com"))
    with pytest.raises(DuplicateIdError):
        repo.save(User(id="fixed-id-123", email="b@example.com"))

Time. Any test that calls Date.now(), datetime.now(), or equivalent is implicitly depending on when the test runs. Inject time as a dependency so tests can control it.

// Flaky — depends on real time
function isTokenExpired(token) {
  return Date.now() > token.expiresAt;
}

// Testable — time is injected
function isTokenExpired(token, currentTime = Date.now()) {
  return currentTime > token.expiresAt;
}

// Deterministic test
test('isTokenExpired returns true for token expiring in the past', () => {
  const expiredToken = { expiresAt: 1000000 };
  expect(isTokenExpired(expiredToken, 2000000)).toBe(true);
});

test('isTokenExpired returns false for token expiring in the future', () => {
  const validToken = { expiresAt: 9999999999 };
  expect(isTokenExpired(validToken, 1000000)).toBe(false);
});

The general principle: any external dependency that can vary between test runs — time, randomness, network state, filesystem state — should be injected so the test controls it.
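
The same idea in Python, covering both time and randomness. This is a sketch under assumptions: `is_token_expired`, `generate_code`, and `FIXED_NOW` are hypothetical names, and the seed value is arbitrary. The test for randomness checks only that equal seeds give equal output; it does not depend on any particular generated value.

```python
import random
from datetime import datetime, timedelta, timezone


def is_token_expired(expires_at, now=None):
    """Return True once `now` is past `expires_at`; `now` defaults to the real clock."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now > expires_at


def generate_code(rng, length=6):
    """Build a numeric code from the injected random number generator."""
    return "".join(str(rng.randint(0, 9)) for _ in range(length))


# Arbitrary fixed instant, so the tests never read the real clock.
FIXED_NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)


def test_token_is_expired_one_second_after_expiry():
    expires_at = FIXED_NOW - timedelta(seconds=1)
    assert is_token_expired(expires_at, now=FIXED_NOW) is True


def test_token_is_not_expired_at_exact_expiry_instant():
    assert is_token_expired(FIXED_NOW, now=FIXED_NOW) is False


def test_same_seed_produces_same_code():
    assert generate_code(random.Random(42)) == generate_code(random.Random(42))
```

Production callers omit `now` and pass a default `random.Random()`, so the injection costs them nothing, while the tests get full control over both sources of variation.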

Principle 7: Avoid logic in tests

When your test contains if statements, for loops, switch cases, or string concatenation, you have introduced the possibility of a bug in your test itself. The last place you want to find a bug is in the code that's supposed to find bugs.

Tests should be a direct, unconditional statement of expected behaviour. No branching, no iteration, no conditional assertions.

Test with logic (wrong):

test('discount is applied correctly for all user tiers', () => {
  const tiers = ['bronze', 'silver', 'gold'];
  const expectedDiscounts = [0, 0.05, 0.10];

  for (let i = 0; i < tiers.length; i++) {
    const user = { tier: tiers[i], orderTotal: 100 };
    const result = calculateDiscount(user);
    expect(result).toBe(100 - (100 * expectedDiscounts[i]));
    // Bug hiding here: if expectedDiscounts[i] is wrong,
    // the loop passes with an incorrect assertion
  }
});

Three flat tests (right):

test('calculateDiscount applies no discount for bronze tier', () => {
  expect(calculateDiscount({ tier: 'bronze', orderTotal: 100 })).toBe(100);
});

test('calculateDiscount applies 5% discount for silver tier', () => {
  expect(calculateDiscount({ tier: 'silver', orderTotal: 100 })).toBe(95);
});

test('calculateDiscount applies 10% discount for gold tier', () => {
  expect(calculateDiscount({ tier: 'gold', orderTotal: 100 })).toBe(90);
});

Three separate tests are more lines of code. They're also independently verifiable, independently named, and independently reportable in your CI output. When one fails, you know exactly which tier's discount logic is broken.

Principle 8: Test at the right level — unit tests are not the only answer

Unit tests are fast, focused, and cheap to run. They're also limited: a unit that works correctly in isolation can still fail when integrated with another unit that works correctly in isolation. Both units pass their tests. The integration fails.

The testing pyramid describes the right distribution:

Unit tests (base — most tests): Fast, isolated, test one function or class in isolation. Run on every commit. Cover all edge cases and failure modes at this level because they're cheapest here.

Integration tests (middle — fewer tests): Test how units work together — a service class working with a real (or realistic fake) database, an API endpoint processing a real HTTP request through the full middleware stack. Catch the bugs that unit tests miss.

End-to-end tests (top — fewest tests): Test critical user journeys through the full system. Slow, expensive, brittle — reserve them for the highest-value flows you cannot afford to break.

The mistake most teams make is treating unit tests as the only kind, reaching 80% code coverage, and wondering why bugs still reach production. Unit tests cannot catch integration failures. The gaps between your units are where the most expensive production bugs live.

Know what level to write each test at. A behaviour that's trivial to test at the unit level should be tested there. A behaviour that emerges from the interaction of multiple components needs an integration test.

Principle 9: Treat tests as first-class code

Tests are not a secondary deliverable. They are part of the codebase and should be held to the same standards as production code: readable, maintainable, reviewed, and refactored when they accumulate complexity.

The patterns that produce unmaintainable test suites:

Repeated setup. If every test in a file starts with the same 20 lines of setup, extract it into a beforeEach or a shared factory function. When the setup changes, you change it in one place.

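Magic numbers. expect(result).toBe(90): where does 90 come from? expect(result).toBe(ORDER_TOTAL - (ORDER_TOTAL * GOLD_DISCOUNT_RATE)) makes the assertion self-documenting, at the cost of repeating the production formula inside the test. A named constant such as EXPECTED_GOLD_TOTAL = 90 with a one-line comment explaining the derivation documents the value without that duplication.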

Giant test files. A test file with 500 tests covering ten different classes is a maintenance burden. Organise tests to mirror the structure of the code they test — one test file per module or class.

Dead tests. Tests that always pass no matter what the production code does, tests that are skipped permanently, tests whose describe and it blocks no longer match what the code does. Audit and remove them. Tests you don't trust are worse than no tests — they generate false confidence.

The Boy Scout Rule applies to tests as much as production code: every time you touch a test file, leave it slightly cleaner than you found it.
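
For the repeated-setup pattern, a shared factory function is often enough. The sketch below is in Python; `make_user` and `is_premium` are hypothetical, and the default field values are placeholders.

```python
def make_user(**overrides):
    """Build a valid user dict; each test overrides only the field it cares about."""
    user = {
        "id": 1,
        "name": "Jane",
        "email": "jane@example.com",
        "tier": "basic",
    }
    user.update(overrides)
    return user


def is_premium(user):
    """Hypothetical unit under test."""
    return user["tier"] == "premium"


def test_is_premium_true_for_premium_tier():
    assert is_premium(make_user(tier="premium")) is True


def test_is_premium_false_for_default_basic_tier():
    assert is_premium(make_user()) is False
```

Each test now states the one detail that matters to it. When the user shape gains a required field, only `make_user` changes.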

Principle 10: Use mutation testing to find gaps in your coverage

Code coverage tells you which lines your tests execute. It does not tell you whether your tests would catch a bug on those lines. A test that covers a line without asserting anything about its outcome provides almost no protection: it fails only if the code throws.

Mutation testing is the technique that reveals this gap. A mutation testing tool makes small, systematic changes to your production code — inverting a boolean, changing a > to >=, replacing a + with a - — and runs your test suite against each mutated version. If your tests pass on the mutated code, you have found a gap: your tests execute that code but don't verify its correctness.

// Production code
function isEligibleForDiscount(user) {
  return user.orderCount > 10 && user.tier === 'premium';
}

// A mutation testing tool might change > to >= and rerun tests:
// return user.orderCount >= 10 && user.tier === 'premium';
// If your tests still pass, you don't have a test for the boundary at 10.

Popular mutation testing tools by ecosystem:

Language	Tool
JavaScript / TypeScript	Stryker
Python	mutmut, Cosmic Ray
Java	PIT (Pitest)
Go	go-mutesting
C#	Stryker.NET
PHP	Infection

Mutation testing is slower than regular test runs — it runs your suite once per mutation, and there are many mutations. Run it on your most critical modules rather than your entire codebase. The mutations that survive (tests don't catch the change) are your highest-priority gaps to fill.
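
What a surviving mutant looks like can be shown by hand, without any tool. In this Python sketch the mutant is written manually and the two "suites" are plain functions that return True when every check passes; all names are hypothetical, and a real mutation tool generates and runs the mutants for you.

```python
def is_eligible(order_count, tier):
    """Original rule: more than 10 orders and premium tier."""
    return order_count > 10 and tier == "premium"


def is_eligible_mutant(order_count, tier):
    """Hand-written mutant: > replaced with >=."""
    return order_count >= 10 and tier == "premium"


def weak_suite(fn):
    """Executes every line but never probes the boundary at 10."""
    return fn(50, "premium") is True and fn(0, "premium") is False


def boundary_suite(fn):
    """Adds one case on each side of the threshold."""
    return (
        weak_suite(fn)
        and fn(10, "premium") is False
        and fn(11, "premium") is True
    )


# The weak suite passes on both versions: the mutant survives.
mutant_survives_weak_suite = weak_suite(is_eligible) and weak_suite(is_eligible_mutant)

# The boundary suite passes on the original and fails on the mutant: it is killed.
mutant_killed_by_boundary_suite = (
    boundary_suite(is_eligible) and not boundary_suite(is_eligible_mutant)
)
```

Both suites would report the same line coverage. Only the second one distinguishes the correct rule from the mutated one.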

A pre-commit checklist for every test you write

Before committing any new test, run through this:

Check	What it catches
Does the test name describe the failure precisely?	Uninformative test names
Does the test have exactly one assertion (or a tight group)?	Tests that mix multiple behaviours
Is the Arrange section minimal?	Overly complex units under test
Does it test behaviour, not implementation internals?	Brittle, refactoring-sensitive tests
Does it cover at least one edge case or failure path?	Happy-path-only coverage
Is the test free of logic (if/for/switch)?	Bugs hiding in test logic
Would this test behave the same way on any machine, at any time?	Flaky time or environment dependencies
Would this test catch the bug it's supposed to catch if I introduced it?	False confidence from weak assertions

The last question is the most important. Before finalising a test, deliberately introduce the bug it's meant to prevent — change the implementation to be wrong — and confirm the test fails. If it still passes, the test is not protecting you.

Frequently asked questions

What is the difference between a unit test and an integration test? A unit test tests a single function, method, or class in complete isolation from its real dependencies — databases, networks, and external services are replaced with fakes, stubs, or mocks. An integration test tests how multiple units work together, often using real or realistic versions of dependencies. Unit tests are faster and cheaper; integration tests catch the bugs that emerge from how components interact.

What is a good unit test coverage percentage? Coverage percentage is a weak proxy for test quality. 80% coverage with tests that only verify the happy path provides less protection than 60% coverage with tests that systematically cover edge cases and failure paths. That said, below 70% coverage for critical business logic is a warning sign. Above 90% can indicate over-testing implementation details. Focus on coverage of your most critical and complex logic, not on a target number.

What is the AAA pattern in unit testing? Arrange, Act, Assert. Every test is structured into three phases: Arrange sets up the data and dependencies for the test; Act performs the single operation being tested; Assert verifies the expected outcome. The pattern improves readability, reveals poorly designed code through overly complex Arrange sections, and ensures each test has a single clear purpose.

What are test doubles and when should I use each type? Test doubles are stand-ins for real dependencies. Stubs return predefined values and are used when you need a dependency to behave a certain way but don't care how it was called. Mocks verify that specific interactions happened and are used when the interaction itself is the behaviour being tested. Fakes are simplified working implementations used when you need a dependency's logic to work correctly but can't use the real implementation in tests. Use stubs and fakes by default; reach for mocks only when interaction verification is specifically what you're testing.

What is mutation testing? Mutation testing systematically modifies your production code in small ways (inverting booleans, changing comparison operators, swapping arithmetic operators) and checks whether your tests detect each change. Mutations that survive (don't cause test failures) reveal gaps in your test suite where code is executed but not meaningfully verified. It's one of the most direct ways to measure whether your tests would actually catch bugs, not just pass.

How do I test code that depends on the current time or random values? Inject time and randomness as dependencies rather than calling them directly in your code. Instead of calling Date.now() inside a function, accept the current time as a parameter with Date.now() as the default. In tests, pass a fixed timestamp. Instead of generating a random value inside a function, accept a random number generator as a dependency and pass a seeded, predictable generator in tests. This pattern — dependency injection for non-deterministic values — is the standard approach to making time and randomness testable.

Should I write tests before or after the code? Writing tests first (Test-Driven Development) tends to produce more testable code, because you have to decide what correct and incorrect behaviour look like before an implementation exists to anchor your thinking. Writing tests after code is better than not writing tests at all, but it anchors your thinking to the working case, making it harder to identify the edge cases that reveal real bugs. At minimum, write tests immediately after each small unit of code, not at the end of a feature: the further removed testing is from writing, the harder it becomes to think about failure modes.