Error Propagation Analyses for Shared-Memory Systems

Stefan Winter

stefan.winter@uni-ulm.de

University of Ulm

LMU Munich

Slide Deck Availability

https://www.stefan-winter.net/presentations/shared-memory_EPA.html

Speaker Background

Dr.-Ing: TU Darmstadt, 2015
Postdoc: TU Darmstadt, UIUC, LMU Munich
Lecturer/Interim prof.: HAW Landshut, LMU Munich, Ulm University

Research focus: Dependable software, mostly “low-level” code

Motivation

// Your Program

void printSum(int a, int b) {
  printf(“%d”, a – b);
}

Motivation

// Your Program

void printSum(int a, int b) {
-  printf(“%d”, a – b);
+  printf(“%d”, a + b);
}

libc

Operating System

Problem Magnitude

How many defects in external dependencies?
- NASA: 0.1 defects/KLOC after deployment (Dvorak 2009)
- For commodity software: ~1 defect/KLOC good

How many KLOC?
- Support libraries (glib, etc.): Several MLOC
- System libraries (libc, etc.): 1-2 MLOC
- Operating System: 20-60 MLOC

Fault Containment

// Your Program

void printSum(int a, int b) {
-  printf(“%d”, a – b);
+  printf(“%d”, a + b);
}

libc

Operating System

Error Propagation Analyses (EPA)

How do defects in one component affect other components?

Three Steps:

Inject defect into external component

2. Execute program

Detect behavioral deviations

Assumption: Federated systems

Isolated components
Explicit interfaces

Interface Injections

Interface Error Injections: Mutating interface parameters

Interface Injections

Federated Systems Assumption

void user() {
  int a = 40;
  int b = 2;
  int result = library_function(a, b);
  ...
}

int library_function(int a, int b) {
  return a + b;
}

Interface injection:

Return value of library_function is external input

Replace by
- Fixed value: 42 → 0
- Random value: 42 → 2275
- Mutated value: 42 → 43

Interface Injections

Federated Systems Assumption

void user() {
  struct test *t;
  library_function(1, t);
  ...
}

void library_function(int a, struct test *b) {
  bool init = (a - 1 ==  0);
  if (init)
    b->val = 0;
}

No return value as external input

Shared-memory via heap pointer t

Injection into the pointer
- Maybe other valid heap
- Mostly unallocated/protected memory
  → segfault

Just inject in pointed-to heap

Interface Injections

void user() {
  struct test *t;
  library_function(1, t);
  ...
}

void library_function(int a, struct test *b) {
  bool init = (a - 1 ==  0);
  if (init)
-   b->val = 0;
+   char *str = "test"; 
+   b->string = str;
}

Heap contains pointers
→ Injection likely to invalidate
Only parts of shared data touched by external code
Strategy 1: Ignore it (and miss relevant tests)
Strategy 2: Inject into heap memory
- Many wasted tests
Strategy 3: Track all memory “stores”
- Filter irrelevant cases
- Requires tracking allocations, frees
- Orientation: sysbench cpu run
  - 4.8M reads & 1.2M stores

Analyzing Error Propagation in Shared Memory (Lanzaro et al. 2014)

Execute with dynamic binary instrumentation
Construct reachability graphs from function parameters & return values
Compare clean/faulty store traces for reachable memory

Trace Comparison Example

Alignment of longest common sub-sequences (Hunt and Szymanski 1977)

Experimental Evaluation

Name	Size (loc)	Faults
Libxml2	155k	1471
Libbzip2	6k	463
SQLite	78k	1023

User code: Developer tests

Results

Memory Corruptions Are Frequent

38.2% of all experiments
61.8% showed no effects
- Not uncommon for fault injections
- No activation
- No propagation

Results

Memory Corruptions Are Large

Results

Memory Corruptions Are Silent

Return values with memory corruption
Name	−1	NULL ptr	0	Wrong ptr	Wrong value
Libxml2	32 (24.2%)	33 (25.0%)	1 (0.8%)	5 (3.8%)	2 (1.5%)
Libbzip2	0 (0%)	4 (11.8%)	0 (0%)	28 (82.3%)	2 (5.9%)
SQLite	0 (0%)	2 (2.8%)	0 (0%)	1 (1.4%)	69 (95.8%)

Limitations

Scalability

Trace size
Dynamic binary instrumentation overheads

Scalability Improvement

Debugger-based introspection

Break-points at library function call/return (call-site)
Ad-hoc reachability graph construction
- Entry: Determine root-set (parameters, return value)
- Exit: Dump root-set-reachable data
Comparison across clean/faulty runs

Results

Empirical study of interface errors (Natella et al. 2018):

SPECInt 2006 benchmark
17-50 minutes run time
Interface: Benchmark management ↔︎ main algorithm

Confirmed findings: Interface data corruptions are

frequent (31.8%)
large (up to 183 MiB)
silent (45.8% without indication)

Additional contribution: Pattern catalog of interface errors

Applied EPA

Source: (Palix et al. 2011)

Shared-Memory EPA for Unknown User Code

Unknown call-sites for driver code
Memory visible everywhere
No dynamic binary instrumentation
kgdb, but where to set breakpoints

TrEKer (Coppik et al. 2017)

TrEKer Instrumentation and Trace Processing

LLVM optimization pass:

load, store, getelementptr instructions → add tracing call
trace function calls, return addresses, parameters, return values
Reachability analysis in post-processing

TrEKer Results

Evaluation subjects: btrfs, f2fs, nvme
23376 injected faults
Workload: load module, FS operations, unload module

Summary

References

Coppik, Nicolas, Oliver Schwahn, Stefan Winter, and Neeraj Suri. 2017. “TrEKer: Tracing Error Propagation in Operating System Kernels.” In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017.

Dvorak, Daniel. 2009. “NASA Study on Flight Software Complexity.” In AIAA Infotech@Aerospace Conference. Infotech@Aerospace Conferences. American Institute of Aeronautics; Astronautics. https://arc.aiaa.org/doi/10.2514/6.2009-1882.

Hunt, James W., and Thomas G. Szymanski. 1977. “A fast algorithm for computing longest common subsequences.” Commun. ACM 20 (5): 350–53. https://doi.org/10.1145/359581.359603.

Lanzaro, Anna, Roberto Natella, Stefan Winter, Domenico Cotroneo, and Neeraj Suri. 2014. “An Empirical Study of Injected Versus Actual Interface Errors.” In Proceedings of the 2014 International Symposium on Software Testing and Analysis, 397–408. ISSTA.

Natella, R., S. Winter, D. Cotroneo, and N. Suri. 2018. “Analyzing the Effects of Bugs on Software Interfaces.” IEEE Transactions on Software Engineering 46 (3): 280–301.

Palix, Nicolas, Gaël Thomas, Suman Saha, Christophe Calvès, Julia Lawall, and Gilles Muller. 2011. “Faults in linux: ten years later.” In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 305–18. ASPLOS XVI. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1950365.1950401.