Regression-Test History Data for Flaky-Test Research

Philipp Wendler

LMU Munich

Stefan Winter

LMU Munich

Ulm University

Flakiness Definitions

The repetitionist view

Flaky tests are “tests that can intermittently pass or fail, even for the same code version.” (Luo et al. 2014)

“tests that cause spurious failures without any code changes, i.e., flaky tests” (Gruber et al. 2021)

“Tests that fail inconsistently, without changes to the code under test, are described as flaky.” (Parry et al. 2021)

The regressionist view

“some test failures may not be due to the latest changes but due to non-determinism in the tests, popularly called flaky tests” (Bell et al. 2018)

“our goal is to distinguish between flaky failures and failures caused by regression” (Gruber et al. 2023a)

Imbalanced Availability of Research Data

The repetitionist view

Is there any flaky test in our code base that we should fix?

Numerous research datasets available, many integrated in IDoFT (Lam 2020)

The regressionist view

Is a test failure in CI due to a regression in the code under test (CUT) or not?

Single dataset of flaky test failures within regression test history (Gruber et al. 2023b)

Options for Filling the Gap

Requirements:

  1. Regression test data along commit history
  2. Including flaky failures

Options:

  1. Search regression test data for flaky failures
  2. Simulate regression testing for projects with known flaky tests

Our Initial Attempt

  1. Select projects from IDoFT with NOD (non-order-dependent) flaky tests that have a known flakiness-introducing commit (FIC) (Lam et al. 2020)

Test | Category | TIC | FIC | iDFlakies-commit
org.fluentd.logger.TestFluentLogger.testReconnection | NDOD;UD;NOD | 5fd46383 | 5fd46383 | da14ec34
org.fluentd.logger.TestFluentLogger.testClose | NDOD;UD | 87e957ae | 87e957ae | da14ec34

  2. Run the test suite of every commit from the FIC to the iDFlakies-commit and repeat each run 30 times (a minimal rerun loop is sketched below)
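
A minimal sketch of such a rerun loop, assuming a local clone of fluent-logger-java, a linear history between the two hashes from the table above, and Maven with Surefire; the paths, helper names, and report handling are illustrative only:

```python
import shutil
import subprocess
from pathlib import Path

REPO = Path("fluent-logger-java")  # hypothetical local clone of the project
RESULTS = Path("results")
REPETITIONS = 30

def git(*args):
    return subprocess.run(["git", *args], cwd=REPO, check=True,
                          capture_output=True, text=True).stdout

# All commits from the FIC up to the iDFlakies-commit (hashes from the table above).
commits = git("rev-list", "--reverse", "5fd46383^..da14ec34").split()

for commit in commits:
    git("checkout", commit)
    for rep in range(REPETITIONS):
        # Run the Maven test suite; failing tests must not abort the loop.
        subprocess.run(["mvn", "-q", "test"], cwd=REPO, check=False)
        # Archive the per-test Surefire XML reports per commit and repetition.
        dest = RESULTS / commit / f"rep-{rep:02d}"
        dest.mkdir(parents=True, exist_ok=True)
        for report in (REPO / "target" / "surefire-reports").glob("*.xml"):
            shutil.copy(report, dest / report.name)
```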

Summary Stats (fluent-logger)

Slug (Module) | FIC Hash | Tests | Commits | Av. Commits/Test | Flaky Tests | Tests w/ Consistent Failures | Total Distinct Histories
fluent/fluent-logger-java | 5fd463 | 19 | 131 | 105.6 | 11 | 2 | 8.0x10^32
fluent/fluent-logger-java | 87e957 | 19 | 160 | 122.4 | 11 | 3 | 2.1x10^31

Total Distinct Histories: Example

Commit Hash Test Method Distinct Results
5fd4638 testNormal03 2
de2b9f4 testNormal03 1
30a7221 testNormal03 1
6aece14 testNormal03 1
d1077ae testNormal03 3
a7da917 testNormal03 1
7f5eb6b testNormal03 1
43869ca testNormal03 2
a646dbf testNormal03 2
2f3f8a2 testNormal03 1

Number of possible histories (the product of distinct results over all commits; a computation sketch follows the verdict table below):
\(2 \times 3 \times 2 \times 2 = 24\)

Commit Hash | Verdict | Verdict Type | Message
5fd4638 | passed | . | .
5fd4638 | failure | java.lang.AssertionError | expected:<10000> but was:<0>
d1077ae | passed | . | .
d1077ae | failure | java.lang.AssertionError | expected:<10000> but was:<3543>
d1077ae | failure | java.lang.AssertionError | expected:<10000> but was:<2234>
43869ca | passed | . | .
43869ca | failure | java.lang.AssertionError | expected:<10000> but was:<2234>
a646dbf | passed | . | .
a646dbf | failure | java.lang.AssertionError | expected:<10000> but was:<0>
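
To make the "Total Distinct Histories" column concrete, here is a small sketch that reproduces the example computation above; the (commit, result) list layout is an assumption for illustration, not the dataset's actual schema:

```python
from collections import defaultdict
from math import prod

def total_distinct_histories(runs):
    """Number of possible result histories for one test: the product, over
    commits, of the number of distinct results observed across the reruns."""
    per_commit = defaultdict(set)
    for commit, result in runs:
        per_commit[commit].add(result)
    return prod(len(results) for results in per_commit.values())

# Results for testNormal03, encoded as (verdict, message), from the tables above.
runs = [
    ("5fd4638", ("passed", "")),
    ("5fd4638", ("failure", "expected:<10000> but was:<0>")),
    ("d1077ae", ("passed", "")),
    ("d1077ae", ("failure", "expected:<10000> but was:<3543>")),
    ("d1077ae", ("failure", "expected:<10000> but was:<2234>")),
    ("43869ca", ("passed", "")),
    ("43869ca", ("failure", "expected:<10000> but was:<2234>")),
    ("a646dbf", ("passed", "")),
    ("a646dbf", ("failure", "expected:<10000> but was:<0>")),
    ("de2b9f4", ("passed", "")),  # commits with one distinct result contribute a factor of 1
]

print(total_distinct_histories(runs))  # 2 * 3 * 2 * 2 * 1 = 24
```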

Summary Stats (All Projects)

Slug (Module) | FIC Hash | Tests | Commits | Av. Commits/Test | Flaky Tests | Tests w/ Consistent Failures | Total Distinct Histories
TooTallNate/Java-WebSocket | 822d40 | 146 | 75 | 75.0 | 24 | 1 | 2.6x10^9
apereo/java-cas-client (cas-client-core) | 5e3655 | 157 | 65 | 61.7 | 3 | 2 | 1.0x10^7
eclipse-ee4j/tyrus (tests/e2e/standard-config) | ce3b8c | 185 | 16 | 16.0 | 12 | 0 | 261
feroult/yawp (yawp-testing/yawp-testing-appengine) | abae17 | 1 | 191 | 191.0 | 1 | 1 | 8
fluent/fluent-logger-java | 5fd463 | 19 | 131 | 105.6 | 11 | 2 | 8.0x10^32
fluent/fluent-logger-java | 87e957 | 19 | 160 | 122.4 | 11 | 3 | 2.1x10^31
javadelight/delight-nashorn-sandbox | d0d651 | 81 | 113 | 100.6 | 2 | 5 | 4.2x10^10
javadelight/delight-nashorn-sandbox | d19eee | 81 | 93 | 83.5 | 1 | 5 | 2.6x10^9
sonatype-nexus-community/nexus-repository-helm | 5517c8 | 18 | 32 | 32.0 | 0 | 0 | 18
spotify/helios (helios-services) | 023260 | 190 | 448 | 448.0 | 0 | 37 | 190
spotify/helios (helios-testing) | 78a864 | 43 | 474 | 474.0 | 0 | 7 | 43

Initial Insights from Dataset Preparation

  • Result encoding makes a difference!
    • Most studies to date make a binary pass/fail distinction
    • Taking exception messages into account can change a test's classification
  • Flaky tests can reveal actual regressions in other commits
    • Actual regressions in commit histories on main/master
    • 5 tests in the dataset detect actual regressions and are flaky
  • 31 flaky tests fail significantly differently across different commits

Binary vs. Non-Binary Results

testNormal03 from fluent-logger in commit 43b2c3d (a sketch of both encodings follows the table):
Verdict only: non-flaky
Verdict including message: flaky

Verdict Verdict Type Message
failure java.lang.AssertionError expected:<10000> but was:<9339>
failure java.lang.AssertionError expected:<10000> but was:<7726>
failure java.lang.AssertionError expected:<10000> but was:<8166>
failure java.lang.AssertionError expected:<10000> but was:<5235>
failure java.lang.AssertionError expected:<10000> but was:<6180>
failure java.lang.AssertionError expected:<10000> but was:<8818>
failure java.lang.AssertionError expected:<10000> but was:<9630>
failure java.lang.AssertionError expected:<10000> but was:<8801>
failure java.lang.AssertionError expected:<10000> but was:<8067>
failure java.lang.AssertionError expected:<10000> but was:<8507>
failure java.lang.AssertionError expected:<10000> but was:<6533>
failure java.lang.AssertionError expected:<10000> but was:<5308>
failure java.lang.AssertionError expected:<10000> but was:<7450>
failure java.lang.AssertionError expected:<10000> but was:<7889>
failure java.lang.AssertionError expected:<10000> but was:<9343>
failure java.lang.AssertionError expected:<10000> but was:<7490>
failure java.lang.AssertionError expected:<10000> but was:<8353>
failure java.lang.AssertionError expected:<10000> but was:<8815>
failure java.lang.AssertionError expected:<10000> but was:<7697>
failure java.lang.AssertionError expected:<10000> but was:<8965>
failure java.lang.AssertionError expected:<10000> but was:<8459>
failure java.lang.AssertionError expected:<10000> but was:<8326>
failure java.lang.AssertionError expected:<10000> but was:<8372>
failure java.lang.AssertionError expected:<10000> but was:<8292>
failure java.lang.AssertionError expected:<10000> but was:<5553>
failure java.lang.AssertionError expected:<10000> but was:<8938>
failure java.lang.AssertionError expected:<10000> but was:<9669>
failure java.lang.AssertionError expected:<10000> but was:<7566>
failure java.lang.AssertionError expected:<10000> but was:<8791>
failure java.lang.AssertionError expected:<10000> but was:<6360>
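
A minimal sketch of the two encodings, assuming per-commit results are available as (verdict, message) pairs; the function and variable names are illustrative only:

```python
def is_flaky(results, include_message=False):
    """A test is treated as flaky in a commit if its reruns produce more than
    one distinct result under the chosen encoding."""
    encoded = {(verdict, message) if include_message else verdict
               for verdict, message in results}
    return len(encoded) > 1

# testNormal03 in commit 43b2c3d: every rerun fails, but the assertion
# messages differ (a subset of the table above).
runs_43b2c3d = [("failure", f"expected:<10000> but was:<{n}>")
                for n in (9339, 7726, 8166, 5235, 6180)]

print(is_flaky(runs_43b2c3d))                        # False (verdict-only encoding)
print(is_flaky(runs_43b2c3d, include_message=True))  # True  (message-aware encoding)
```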

Regressions in Commit Histories

  • testNormal01 from fluent-logger flags a regression in commit 7046496
  • The regression is fixed in the next commit, 4924e54
  • The same test is flaky in the commit preceding the regression
  • The same test is flaky in the commit after the fixing commit

Failure Distribution Differences Across Commits

  • Fisher’s exact test
  • Null hypothesis: the result distribution is independent of the commit hash
  • Rejected for 31 tests at \(\alpha = 0.05\) (a simplified sketch follows)
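
A hedged sketch of such a check in its simplest form, assuming a binary pass/fail encoding and a pairwise comparison of two commits with SciPy's 2x2 fisher_exact; the counts below are made up, and the full analysis over all commits and result categories would need an RxC variant of the test:

```python
from scipy.stats import fisher_exact

# Hypothetical pass/fail counts of one test over 30 reruns in two commits.
# Null hypothesis: the pass/fail distribution is independent of the commit.
table = [[28, 2],    # commit A: 28 passed, 2 failed
         [12, 18]]   # commit B: 12 passed, 18 failed

_, p_value = fisher_exact(table)
print(f"p = {p_value:.4f}, reject independence at alpha=0.05: {p_value < 0.05}")
```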

Example: testNormal03 from fluent-logger

Commit Hash Distinct results
167dee4 2
189337a 3
2e67bc0 2
3268963 2
36ae754 2
37744e2 2
3ae1bbd 2
43869ca 2
43b2c3d 30
4ecd3f2 2
58610c7 2
82b109d 2
87e957a 27
8d418ae 2
8fe164f 2
a061b9e 2
abc5024 2
aef9865 2
b70b1f0 2
b97b239 2
cc7a1f8 3
cd9bae3 2
cfffb7e 29
d608b06 2
ff26da1 2

Dataset Availability & Future Directions

  • Dataset extension
    • FIC knowledge becomes less important → more projects, arbitrary commit ranges
  • Dataset usage
    • “Give me \(a\) test result histories with \(b\) flaky test results, \(c\) non-flaky failures, and at least \(d\) commits” (a query sketch follows this list)
  • Significant differences in failure distributions: automate localization
  • Your ideas?
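
One way such a query could look, as a hypothetical sketch over per-test result histories; the data layout and the field names 'flaky' and 'consistent_failure' are assumptions, not the dataset's actual schema:

```python
import random

def sample_histories(histories, a, b, c, d, seed=0):
    """Hypothetical query interface: pick `a` per-test result histories with
    `b` flaky results, `c` non-flaky (consistent) failures, and >= `d` commits.
    `histories` maps (project, test) -> list of per-commit records."""
    candidates = [
        history for history in histories.values()
        if sum(record["flaky"] for record in history) == b
        and sum(record["consistent_failure"] for record in history) == c
        and len(history) >= d
    ]
    random.Random(seed).shuffle(candidates)
    return candidates[:a]
```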

Summary

Test repetitions, but along a commit range, for studying the effects of flaky tests on regression testing.

Slug (Module) | FIC Hash | Tests | Commits | Av. Commits/Test | Flaky Tests | Tests w/ Consistent Failures | Total Distinct Histories
TooTallNate/Java-WebSocket | 822d40 | 146 | 75 | 75.0 | 24 | 1 | 2.6x10^9
apereo/java-cas-client (cas-client-core) | 5e3655 | 157 | 65 | 61.7 | 3 | 2 | 1.0x10^7
eclipse-ee4j/tyrus (tests/e2e/standard-config) | ce3b8c | 185 | 16 | 16.0 | 12 | 0 | 261
feroult/yawp (yawp-testing/yawp-testing-appengine) | abae17 | 1 | 191 | 191.0 | 1 | 1 | 8
fluent/fluent-logger-java | 5fd463 | 19 | 131 | 105.6 | 11 | 2 | 8.0x10^32
fluent/fluent-logger-java | 87e957 | 19 | 160 | 122.4 | 11 | 3 | 2.1x10^31
javadelight/delight-nashorn-sandbox | d0d651 | 81 | 113 | 100.6 | 2 | 5 | 4.2x10^10
javadelight/delight-nashorn-sandbox | d19eee | 81 | 93 | 83.5 | 1 | 5 | 2.6x10^9
sonatype-nexus-community/nexus-repository-helm | 5517c8 | 18 | 32 | 32.0 | 0 | 0 | 18
spotify/helios (helios-services) | 023260 | 190 | 448 | 448.0 | 0 | 37 | 190
spotify/helios (helios-testing) | 78a864 | 43 | 474 | 474.0 | 0 | 7 | 43

References

Alshammari, Abdulrahman, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. “Flaky Test Dataset to Accompany ‘FlakeFlagger: Predicting Flakiness Without Rerunning Tests’.” Zenodo. https://doi.org/10.5281/zenodo.5014076.
Bell, Jonathan, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. “DeFlaker: Automatically Detecting Flaky Tests.” In Proc. ICSE, 433–44. ICSE ’18. ACM. https://doi.org/10.1145/3180155.3180164.
Gruber, Martin, Michael Heine, Norbert Oster, Michael Philippsen, and Gordon Fraser. 2023a. “Practical Flaky Test Prediction using Common Code Evolution and Test History Data.” In Proc. ICST, 210–21. IEEE. https://doi.org/10.1109/ICST57152.2023.00028.
———. 2023b. “Practical flaky test prediction using common code evolution and test history data [replication package].” Figshare. https://doi.org/10.6084/m9.figshare.21363075.
Gruber, Martin, Stephan Lukasczyk, Florian Kroiß, and Gordon Fraser. 2021. “An Empirical Study of Flaky Tests in Python.” In 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST), 148–58. https://doi.org/10.1109/ICST49551.2021.00026.
Lam, Wing. 2020. “International Dataset of Flaky Tests (IDoFT).” http://mir.cs.illinois.edu/flakytests.
Lam, Wing, Stefan Winter, Anjiang Wei, Tao Xie, Darko Marinov, and Jonathan Bell. 2020. “A Large-Scale Longitudinal Study of Flaky Tests.” Proc. ACM Program. Lang. 4 (OOPSLA). https://doi.org/10.1145/3428270.
Luo, Qingzhou, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. “An Empirical Analysis of Flaky Tests.” In Proc. FSE, 643–53. FSE 2014. ACM. https://doi.org/10.1145/2635868.2635920.
Mozilla. 2023. “treeherder.” https://treeherder.mozilla.org/.
Parry, Owain, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2021. “A Survey of Flaky Tests.” ACM Trans. Softw. Eng. Methodol. 31 (1). https://doi.org/10.1145/3476105.
Song, Xuezhi, Yun Lin, Siang Hwee Ng, Yijian Wu, Xin Peng, Jin Song Dong, and Hong Mei. 2022. “RegMiner: Towards Constructing a Large Regression Dataset from Code Evolution History.” In Proc. ISSTA, 314–26. ACM. https://doi.org/10.1145/3533767.3534224.