Regression-Test History Data for Flaky-Test Research

Flakiness Definitions

The repetitionist view

Flaky tests are “tests that can intermittently pass or fail, even for the same code version.” (Luo et al. 2014)

“tests that cause spurious failures without any code changes, i.e., flaky tests” (Gruber et al. 2021)

“Tests that fail inconsistently, without changes to the code under test, are described as flaky.” (Parry et al. 2021)

The regressionist view

“some test failures may not be due to the latest changes but due to non-determinism in the tests, popularly called flaky tests” (Bell et al. 2018)

“our goal is to distinguish between flaky failures and failures caused by regression” (Gruber et al. 2023a)

Imbalanced Availability of Research Data

The repetitionist view

Is there any flaky test in our code base that we should fix?

Numerous research datasets available, many integrated in IDoFT (Lam 2020)

The regressionist view

Is a test failure in CI due to a regression in the CUT or not?

Single dataset of flaky test failures within regression test history (Gruber et al. 2023b)

Options for Filling the Gap

Requirements:

Regression test data along commit history
Including flaky failures

Options:

Search regression test data for flaky failures
- RegMiner (Song et al. 2022)
- Mozilla treeherder (Mozilla 2023)

Simulate regression testing for projects with known flaky tests
- IDoFT (Lam 2020)
- FlakeFlagger (Alshammari et al. 2021)
- …

Our Initial Attempt

Select projects from IDoFT with NOD flaky tests that have a known flakiness-introducing commit (FIC) (Lam et al. 2020)

Test	Category	TIC	FIC	iDFlakies-commit
org.fluentd.logger.TestFluentLogger.testReconnection	NDOD;UD;NOD	5fd46383	5fd46383	da14ec34
org.fluentd.logger.TestFluentLogger.testClose	NDOD;UD	87e957ae	87e957ae	da14ec34

Run the test suite on every commit from the FIC to the iDFlakies-commit, repeating each run 30 times
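The collection procedure above could be sketched as follows. This is a minimal sketch under stated assumptions, not the tooling used to build the dataset: the helper names (`list_commits`, `run_suite`, `collect`), the plain `mvn test` invocation, and the use of the process exit code as the recorded result are all hypothetical. Per-test verdicts and messages would have to be parsed from the test reports instead.

```python
import subprocess


def list_commits(repo, fic, end):
    """Return the commits from fic to end (both inclusive), oldest first.

    Assumes fic has a parent commit and is an ancestor of end.
    """
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--reverse", f"{fic}^..{end}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def run_suite(repo, commit):
    """Check out one commit and run its test suite once (assumes a Maven build)."""
    subprocess.run(
        ["git", "-C", repo, "checkout", "--force", commit],
        check=True, capture_output=True,
    )
    proc = subprocess.run(["mvn", "test"], cwd=repo, capture_output=True, text=True)
    return proc.returncode  # assumption: the exit code stands in for the result


def collect(commits, run_once, repetitions=30):
    """Repeat run_once(commit) `repetitions` times for every commit."""
    return {c: [run_once(c) for _ in range(repetitions)] for c in commits}
```

A call such as `collect(list_commits(repo, fic, end), lambda c: run_suite(repo, c))` would then yield 30 recorded results per commit. Passing the runner as a parameter keeps the repetition logic independent of the build system.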

Summary Stats (fluent-logger)

Slug (Module)	FIC Hash	Tests	Commits	Av. Commits/Test	Flaky Tests	Tests w/ Consistent Failures	Total Distinct Histories
fluent/fluent-logger-java	5fd463	19	131	105.6	11	2	8.0x10^32
fluent/fluent-logger-java	87e957	19	160	122.4	11	3	2.1x10^31

Total Distinct Histories: Example

Commit Hash	Test Method	Distinct Results
5fd4638	testNormal03	2
de2b9f4	testNormal03	1
30a7221	testNormal03	1
6aece14	testNormal03	1
d1077ae	testNormal03	3
a7da917	testNormal03	1
7f5eb6b	testNormal03	1
43869ca	testNormal03	2
a646dbf	testNormal03	2
2f3f8a2	testNormal03	1

Number of possible histories:
\(2 \times 3 \times 2 \times 2 = 24\)

Commit Hash	Verdict	Verdict Type	Message
5fd4638	passed	.	.
5fd4638	failure	java.lang.AssertionError	expected:<10000> but was:<0>
d1077ae	passed	.	.
d1077ae	failure	java.lang.AssertionError	expected:<10000> but was:<3543>
d1077ae	failure	java.lang.AssertionError	expected:<10000> but was:<2234>
43869ca	passed	.	.
43869ca	failure	java.lang.AssertionError	expected:<10000> but was:<2234>
a646dbf	passed	.	.
a646dbf	failure	java.lang.AssertionError	expected:<10000> but was:<0>
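The count in the example above is the product of the number of distinct results per commit. The sketch below recomputes it from the result rows listed for testNormal03. Commits with a single distinct result contribute a factor of 1 and are omitted. The function name `count_histories` is hypothetical.

```python
from collections import defaultdict
from math import prod

# (commit, verdict, verdict type, message) rows taken from the example table
rows = [
    ("5fd4638", "passed", ".", "."),
    ("5fd4638", "failure", "java.lang.AssertionError", "expected:<10000> but was:<0>"),
    ("d1077ae", "passed", ".", "."),
    ("d1077ae", "failure", "java.lang.AssertionError", "expected:<10000> but was:<3543>"),
    ("d1077ae", "failure", "java.lang.AssertionError", "expected:<10000> but was:<2234>"),
    ("43869ca", "passed", ".", "."),
    ("43869ca", "failure", "java.lang.AssertionError", "expected:<10000> but was:<2234>"),
    ("a646dbf", "passed", ".", "."),
    ("a646dbf", "failure", "java.lang.AssertionError", "expected:<10000> but was:<0>"),
]


def count_histories(rows):
    """Multiply the number of distinct results observed at each commit."""
    distinct = defaultdict(set)
    for commit, *result in rows:
        distinct[commit].add(tuple(result))
    return prod(len(results) for results in distinct.values())


print(count_histories(rows))  # prints 24
```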

Summary Stats (All Projects)

Slug (Module)	FIC Hash	Tests	Commits	Av. Commits/Test	Flaky Tests	Tests w/ Consistent Failures	Total Distinct Histories
TooTallNate/Java-WebSocket	822d40	146	75	75.0	24	1	2.6x10^9
apereo/java-cas-client (cas-client-core)	5e3655	157	65	61.7	3	2	1.0x10^7
eclipse-ee4j/tyrus (tests/e2e/standard-config)	ce3b8c	185	16	16.0	12	0	261
feroult/yawp (yawp-testing/yawp-testing-appengine)	abae17	1	191	191.0	1	1	8
fluent/fluent-logger-java	5fd463	19	131	105.6	11	2	8.0x10^32
fluent/fluent-logger-java	87e957	19	160	122.4	11	3	2.1x10^31
javadelight/delight-nashorn-sandbox	d0d651	81	113	100.6	2	5	4.2x10^10
javadelight/delight-nashorn-sandbox	d19eee	81	93	83.5	1	5	2.6x10^9
sonatype-nexus-community/nexus-repository-helm	5517c8	18	32	32.0	0	0	18
spotify/helios (helios-services)	023260	190	448	448.0	0	37	190
spotify/helios (helios-testing)	78a864	43	474	474.0	0	7	43

Initial Insights in Dataset Preparation

Result encoding makes a difference!
- Most studies to date make a binary pass/fail distinction
- Looking at exception messages can make a difference

Flaky tests can reveal actual regressions in other commits
- Actual regressions in commit histories on main/master
- 5 tests in the dataset both detect actual regressions and are flaky

31 flaky tests have significantly different failure distributions across commits

Binary vs. Non-Binary Results

testNormal03 from fluent-logger in commit 43b2c3d, repeated 30 times.
Considering only the verdict: non-flaky
Including the failure message: flaky

Verdict	Verdict Type	Message
failure	java.lang.AssertionError	expected:<10000> but was:<9339>
failure	java.lang.AssertionError	expected:<10000> but was:<7726>
failure	java.lang.AssertionError	expected:<10000> but was:<8166>
failure	java.lang.AssertionError	expected:<10000> but was:<5235>
failure	java.lang.AssertionError	expected:<10000> but was:<6180>
failure	java.lang.AssertionError	expected:<10000> but was:<8818>
failure	java.lang.AssertionError	expected:<10000> but was:<9630>
failure	java.lang.AssertionError	expected:<10000> but was:<8801>
failure	java.lang.AssertionError	expected:<10000> but was:<8067>
failure	java.lang.AssertionError	expected:<10000> but was:<8507>
failure	java.lang.AssertionError	expected:<10000> but was:<6533>
failure	java.lang.AssertionError	expected:<10000> but was:<5308>
failure	java.lang.AssertionError	expected:<10000> but was:<7450>
failure	java.lang.AssertionError	expected:<10000> but was:<7889>
failure	java.lang.AssertionError	expected:<10000> but was:<9343>
failure	java.lang.AssertionError	expected:<10000> but was:<7490>
failure	java.lang.AssertionError	expected:<10000> but was:<8353>
failure	java.lang.AssertionError	expected:<10000> but was:<8815>
failure	java.lang.AssertionError	expected:<10000> but was:<7697>
failure	java.lang.AssertionError	expected:<10000> but was:<8965>
failure	java.lang.AssertionError	expected:<10000> but was:<8459>
failure	java.lang.AssertionError	expected:<10000> but was:<8326>
failure	java.lang.AssertionError	expected:<10000> but was:<8372>
failure	java.lang.AssertionError	expected:<10000> but was:<8292>
failure	java.lang.AssertionError	expected:<10000> but was:<5553>
failure	java.lang.AssertionError	expected:<10000> but was:<8938>
failure	java.lang.AssertionError	expected:<10000> but was:<9669>
failure	java.lang.AssertionError	expected:<10000> but was:<7566>
failure	java.lang.AssertionError	expected:<10000> but was:<8791>
failure	java.lang.AssertionError	expected:<10000> but was:<6360>
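The effect of the result encoding can be illustrated with the first five rows of the table above. This is a minimal sketch that assumes a test counts as flaky at a commit when its repeated runs yield more than one distinct encoded result. The names `is_flaky`, `verdict_only`, and `with_message` are hypothetical.

```python
# (verdict, verdict type, message) for the first five repetitions listed above
results = [
    ("failure", "java.lang.AssertionError", "expected:<10000> but was:<9339>"),
    ("failure", "java.lang.AssertionError", "expected:<10000> but was:<7726>"),
    ("failure", "java.lang.AssertionError", "expected:<10000> but was:<8166>"),
    ("failure", "java.lang.AssertionError", "expected:<10000> but was:<5235>"),
    ("failure", "java.lang.AssertionError", "expected:<10000> but was:<6180>"),
]


def verdict_only(result):
    """Binary encoding: keep only the verdict."""
    return result[0]


def with_message(result):
    """Non-binary encoding: keep verdict, exception type, and message."""
    return result


def is_flaky(results, encode):
    """Flaky if the repetitions produce more than one distinct encoded result."""
    return len({encode(r) for r in results}) > 1


print(is_flaky(results, verdict_only))  # prints False
print(is_flaky(results, with_message))  # prints True
```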

Regressions in Commit Histories

testNormal01 from fluent-logger flags a regression in commit 7046496

The regression is fixed in the next commit 4924e54

The same test is flaky in the commit preceding the regression

The same test is flaky in the commit after the fixing commit

Failure Distribution Differences Across Commits

Fisher’s exact test
Null hypothesis: the result distribution is independent of the commit hash
Rejected for 31 tests at \(\alpha = 0.05\)
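To make the test concrete, the sketch below implements the two-sided Fisher's exact test for the simplest case: a 2x2 table comparing pass/fail counts of one test at two commits. This is a simplification, since the slides do not state the table dimensions used; comparing all commits and non-binary results at once would need a larger table. The function name and the counts in the usage example are hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]].

    Rows are commits and columns are pass/fail counts. The p-value sums the
    hypergeometric probabilities of all tables with the same margins that
    are at most as likely as the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: 15 passes and 15 failures at one commit,
# 0 passes and 30 failures at another.
p = fisher_exact_2x2(15, 15, 0, 30)
print(p < 0.05)  # prints True
```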

Example: testNormal03 from fluent-logger

Commit Hash	Distinct results
167dee4	2
189337a	3
2e67bc0	2
3268963	2
36ae754	2
37744e2	2
3ae1bbd	2
43869ca	2
43b2c3d	30
4ecd3f2	2
58610c7	2
82b109d	2
87e957a	27
8d418ae	2
8fe164f	2
a061b9e	2
abc5024	2
aef9865	2
b70b1f0	2
b97b239	2
cc7a1f8	3
cd9bae3	2
cfffb7e	29
d608b06	2
ff26da1	2

Dataset Availability & Future Directions

Dataset extension
- FIC knowledge of lesser importance → more projects, any commit range

Dataset usage
- “Give me \(a\) test result histories with \(b\) flaky test results, \(c\) non-flaky failures, and at least \(d\) commits”
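Such a query could be served by sampling one recorded result per commit and keeping the histories that match. The sketch below is one possible reading, not an existing interface of the dataset. It assumes that \(b\) and \(c\) are exact counts, that a failure is flaky when the same commit also has a passing run, and that a failure is non-flaky when all runs at that commit fail. The names `classify` and `sample_histories` are hypothetical.

```python
import random


def classify(history, runs):
    """Count (flaky failures, non-flaky failures) in one history."""
    flaky = consistent = 0
    for commit, result in history:
        if result == "passed":
            continue
        if "passed" in runs[commit]:
            flaky += 1       # the same commit also passed in some repetition
        else:
            consistent += 1  # every repetition failed at this commit
    return flaky, consistent


def sample_histories(runs, a, b, c, d, seed=0, max_tries=10000):
    """Draw up to `a` histories with b flaky and c non-flaky failures.

    runs maps each commit (in history order) to its repeated results.
    Returns an empty list if fewer than d commits are available.
    Sampled histories may repeat.
    """
    commits = list(runs)
    if len(commits) < d:
        return []
    rng = random.Random(seed)
    found = []
    for _ in range(max_tries):
        if len(found) == a:
            break
        history = [(commit, rng.choice(runs[commit])) for commit in commits]
        if classify(history, runs) == (b, c):
            found.append(history)
    return found


# Hypothetical data: c1 is flaky, c2 fails consistently, c3 passes consistently.
runs = {
    "c1": ["passed", "failure"],
    "c2": ["failure", "failure"],
    "c3": ["passed", "passed"],
}
histories = sample_histories(runs, a=2, b=1, c=1, d=3)
```

Rejection sampling is the simplest option here. For large commit ranges with rare flaky failures, constructing matching histories directly would likely be more efficient.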

Significant failure distribution differences: Automate localization

Your ideas?

Summary

Test repetitions
but along a commit range
for studying flaky test effects on regression testing.


References

Alshammari, Abdulrahman, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. “Flaky Test Dataset to Accompany "FlakeFlagger: Predicting Flakiness Without Rerunning Tests".” Zenodo. https://doi.org/10.5281/zenodo.5014076.

Bell, Jonathan, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. “DeFlaker: Automatically Detecting Flaky Tests.” In Proc. ICSE, 433–44. ICSE ’18. ACM. https://doi.org/10.1145/3180155.3180164.

Gruber, Martin, Michael Heine, Norbert Oster, Michael Philippsen, and Gordon Fraser. 2023a. “Practical Flaky Test Prediction using Common Code Evolution and Test History Data.” In Proc. ICST, 210–21. IEEE. https://doi.org/10.1109/ICST57152.2023.00028.

———. 2023b. “Practical flaky test prediction using common code evolution and test history data [replication package].” Figshare. https://doi.org/10.6084/m9.figshare.21363075.

Gruber, Martin, Stephan Lukasczyk, Florian Kroiß, and Gordon Fraser. 2021. “An Empirical Study of Flaky Tests in Python.” In Proc. ICST, 148–58. IEEE. https://doi.org/10.1109/ICST49551.2021.00026.

Lam, Wing. 2020. “International Dataset of Flaky Tests (IDoFT).” http://mir.cs.illinois.edu/flakytests.

Lam, Wing, Stefan Winter, Anjiang Wei, Tao Xie, Darko Marinov, and Jonathan Bell. 2020. “A Large-Scale Longitudinal Study of Flaky Tests.” Proc. ACM Program. Lang. 4 (OOPSLA). https://doi.org/10.1145/3428270.

Luo, Qingzhou, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. “An Empirical Analysis of Flaky Tests.” In Proc. FSE, 643–53. FSE 2014. ACM. https://doi.org/10.1145/2635868.2635920.

Mozilla. 2023. “treeherder.” https://treeherder.mozilla.org/.

Parry, Owain, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2021. “A Survey of Flaky Tests.” ACM Trans. Softw. Eng. Methodol. 31 (1). https://doi.org/10.1145/3476105.

Song, Xuezhi, Yun Lin, Siang Hwee Ng, Yijian Wu, Xin Peng, Jin Song Dong, and Hong Mei. 2022. “RegMiner: Towards Constructing a Large Regression Dataset from Code Evolution History.” In Proc. ISSTA, 314–26. ACM. https://doi.org/10.1145/3533767.3534224.