Coping Strategies for Unreliable Regression Tests

Stefan Winter

LMU Munich

Slides Available

https://www.stefan-winter.net/presentations/flaky_tests_coping.html

Regression Testing

Flaky tests: Non-deterministic failures of automated regression tests in continuous integration pipelines that are not caused by software regressions.

  • Waste developer time
  • Block and delay releases
  • Diminish trust in testing

Comprehensive root cause analyses (Luo et al. 2014; Gruber et al. 2021; Hashemi, Tahir, and Rasheed 2022)

Flaky Tests

Example: libkdumpfile

#! /bin/sh

name=xlatmap
resultfile="out/${name}.result"
expectfile="$srcdir/$name.expect"

echo -n "Checking... "
./xlatmap >"$resultfile"

Excerpt from:
https://github.com/ptesarik/libkdumpfile/blob/c54a90c2756e0ca7f9b45662ad3c987403ee7360/tests/xlatmap-check

Test Interference

Example: libkdumpfile

#! /bin/sh

name=xlatmap
resultfile="out/${name}.result"
expectfile="$srcdir/$name.expect"

mkdir -p out

echo -n "Checking... "
./xlatmap >"$resultfile"

Excerpt from accepted fix:
https://github.com/ptesarik/libkdumpfile/blob/e6c5fde6ac7201185292539bef7203c9618ac773/tests/xlatmap-check
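The two excerpts differ only in the `mkdir -p out` line, which suggests the original test passed only when something else had already created `out/`. A minimal sketch of that interference, assuming a throwaway directory and using `echo data` as a stand-in for `./xlatmap`; the helper names `run_check` and `check_in_fresh_dir` are hypothetical and not part of libkdumpfile:

```shell
#!/bin/sh
# Hypothetical reproduction of the interference; not part of libkdumpfile.

# Stand-in for the test body: write a result file below out/.
# 'echo data' replaces ./xlatmap, which is not available here.
run_check() {
    ( cd "$1" && echo data >"out/xlatmap.result" ) 2>/dev/null
}

# Run the check in a fresh directory, optionally after a "neighbor"
# test has created out/ as a side effect.
check_in_fresh_dir() {
    dir=$(mktemp -d)
    if [ "$1" = "after-neighbor" ]; then
        mkdir -p "$dir/out"
    fi
    if run_check "$dir"; then echo pass; else echo fail; fi
    rm -rf "$dir"
}

check_in_fresh_dir alone            # prints "fail": out/ does not exist
check_in_fresh_dir after-neighbor   # prints "pass": out/ was created earlier
```

The outcome depends on what ran before the test, not on the code under test, which is what the `mkdir -p out` line in the accepted fix removes.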

Flaky Test Coping Strategy in Industry: Rerun

  • Google: 10x
  • Mozilla: 10x + 5x + “CHAOS_MODE”
  • Spotify: 3-5x with coverage instrumentation
  • Dropbox Athena: Mark tests in pre-submit, re-run in post-submit
  • Microsoft Azure: Configurable test/pipeline reruns

If flakiness is detected: Proceed with integration (no regression) and skip execution of the test in the future
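A minimal sketch of the rerun idea, assuming a test is any command whose exit status signals pass or fail. The function name `classify` and the rerun budget argument are illustrative and not taken from the tooling of any company listed above:

```shell
#!/bin/sh
# classify MAX_RERUNS CMD [ARGS...]
# Prints "pass" if CMD succeeds on the first try, "flaky" if it fails
# first but succeeds within MAX_RERUNS reruns, and "fail" otherwise.
classify() {
    max_reruns=$1
    shift
    if "$@"; then
        echo pass
        return 0
    fi
    i=0
    while [ "$i" -lt "$max_reruns" ]; do
        if "$@"; then
            echo flaky      # same code, different outcome
            return 0
        fi
        i=$((i + 1))
    done
    echo fail               # consistent failure: treat as regression
}

classify 10 true    # prints "pass"
classify 10 false   # prints "fail"
```

Note that a test passing on its first run is never rerun here, so this sketch only detects flakiness that shows up as an initial failure.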

Viability of the Strategy

(Wendler and Winter 2024)

  1. Select Java projects from IDoFT whose flaky tests have a known flakiness-introducing commit (FIC) (Lam et al. 2020)
  2. Run the test suites for the commits from the FIC to the iDFlakies commit, repeating each run 30 times

Result: 5 flaky tests in the study reveal regressions in the commit history
→ Quarantining tests diminishes test suite power
→ Better approaches than “flag + skip” desirable

Flaky Test Coping Strategies in Academia

Detection and Repair

Order Dependencies (OD)

Source: (Luo et al. 2014)

Order Dependencies (OD)

Source: (Gruber et al. 2021)

OD Detection Overhead

  • Complete detection: Run all test suite permutations (\(n!\))
    • libkdumpfile has 184 tests
    • \(2.2 \times 10^{338}\) test suite permutations
    • 22s per test suite run → \(4.9 \times 10^{339}\)s
      (estimated age of universe: \(4.3 \times 10^{17}\)s)
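The magnitudes above can be checked with Stirling's approximation (a standard estimate, added here only as a sanity check):

\[
\log_{10}(184!) \approx \frac{184 \ln 184 - 184 + \tfrac{1}{2}\ln(2\pi \cdot 184)}{\ln 10} \approx \frac{959.55 - 184 + 3.53}{2.3026} \approx 338.35
\]

\[
184! \approx 10^{338.35} \approx 2.23 \times 10^{338}, \qquad 2.23 \times 10^{338} \cdot 22\,\mathrm{s} \approx 4.9 \times 10^{339}\,\mathrm{s}
\]

Dividing by the estimated age of the universe, \(4.3 \times 10^{17}\)s, gives a factor of roughly \(10^{322}\).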

OD Detection Overhead

  • Empirical results: Pairwise permutations mostly suffice (\(n\cdot(n-1)\)) (Zhang et al. 2014; Shi et al. 2019)
    • Factorial down to quadratic complexity
    • 33,672 test pair runs and > 2h for libkdumpfile
    • Still too long to run in CI
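The pairwise scheme can be sketched as follows, assuming tests are identified by unique names without newlines. The helper names `ordered_pairs` and `pair_count` are illustrative:

```shell
#!/bin/sh
# ordered_pairs NAME...
# Prints every ordered pair of distinct tests, one "first second" per line.
# Both (a, b) and (b, a) are emitted because execution order matters.
ordered_pairs() {
    for first in "$@"; do
        for second in "$@"; do
            [ "$first" = "$second" ] || printf '%s %s\n' "$first" "$second"
        done
    done
}

# pair_count N: number of ordered pairs of N distinct tests, n * (n - 1).
pair_count() {
    echo $(( $1 * ($1 - 1) ))
}

ordered_pairs t1 t2 t3   # prints 6 lines
pair_count 184           # prints 33672
```

Each printed pair corresponds to one two-test run, which is why the cost grows quadratically with the number of tests.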

Reducing OD Detection Cost

(Eder and Winter 2024)

Insight: No shared resource access → no order dependency

Idea: Run every test once and record access rights on files, sockets, …
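One way to picture the filtering step, as a sketch under stated assumptions: each test's recorded accesses are stored in a log with one `MODE RESOURCE` line per access, where `MODE` is `r` or `w` and resource names contain no whitespace. This log format and the function name `may_interfere` are hypothetical, and the sketch further assumes that two tests can only interfere if at least one of them may write the shared resource:

```shell
#!/bin/sh
# may_interfere LOG_A LOG_B
# Succeeds if some resource is writable by one test and accessed
# (read or write) by the other; fails if no such resource exists.
may_interfere() {
    awk '
        # First file: collect the access modes of test A per resource.
        NR == FNR { modes[$2] = modes[$2] $1; next }
        # Second file: a shared resource conflicts if either side writes.
        ($2 in modes) && ($1 == "w" || modes[$2] ~ /w/) { found = 1; exit }
        END { exit !found }
    ' "$1" "$2"
}
```

Only pairs for which `may_interfere` succeeds would then need to be executed in both orders; all other pairs are skipped.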

Reducing OD Detection Cost

File Descriptors Filtered (FDF)

Reducing OD Detection Cost

Overlay FS + FDF (OFSFDF)

  • Insight: Not every write permission leads to an actual change
  • Idea: Snapshot filesystem before/after test run
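The before/after comparison can be approximated without an overlay filesystem. The following sketch swaps in a checksum listing of a directory tree for the overlay snapshot, so it illustrates the idea rather than the mechanism used in the work cited above. The function names `snapshot` and `changed_paths` are hypothetical, and file names are assumed to contain no whitespace:

```shell
#!/bin/sh
# snapshot DIR: print "checksum size path" for every regular file below DIR.
snapshot() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort )
}

# changed_paths DIR CMD [ARGS...]
# Runs CMD and prints the paths below DIR that were created, removed,
# or modified by it. Opening a file for writing without changing its
# content does not show up.
changed_paths() {
    dir=$1
    shift
    before=$(mktemp)
    after=$(mktemp)
    snapshot "$dir" >"$before"
    "$@" >/dev/null 2>&1
    snapshot "$dir" >"$after"
    # Lines present in only one snapshot belong to changed files.
    cat "$before" "$after" | sort | uniq -u | awk '{ print $3 }' | sort -u
    rm -f "$before" "$after"
}
```

A test that opens a file in append mode but writes nothing leaves both snapshots identical, so it would not be reported as a writer of that file.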

Results

Test Order Reductions


libkdumpfile:
33,672 test pair runs and > 2h
→ 4 test pair runs and < 1s

Summary

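Flaky tests threaten regression testing.
Coping strategies:

  • Industry: rerun, then flag and skip (diminishes test suite power)
  • Academia: detection and repair (e.g., cheaper OD detection by recording resource accesses)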

Research Overview

Research focus: Software Dependability, Software Testing, Reproducibility