Error Propagation Analyses for Shared-Memory Systems

Stefan Winter

University of Ulm

LMU Munich

Slide Deck Availability

https://www.stefan-winter.net/presentations/shared-memory_EPA.html

Speaker Background

  • Dr.-Ing: TU Darmstadt, 2015
  • Postdoc: TU Darmstadt, UIUC, LMU Munich
  • Lecturer/Interim prof.: HAW Landshut, LMU Munich, Ulm University

Research focus: Dependable software, mostly “low-level” code

[ICSE’11] [DSN’13] [ASE‘17] [TSE’20] [ISSTA’14] [ICSE’15] [ISSTA’19] [ISSRE’19] [PRDC’18] [ICST’17] [DSN’20] [ISSRE’20] [OOPSLA’20]

Motivation

// Your Program

void printSum(int a, int b) {
  printf(%d”, a – b);
}

Motivation

// Your Program

void printSum(int a, int b) {
-  printf(“%d”, a – b);
+  printf(“%d”, a + b);
}

libc

Operating System

Problem Magnitude

  1. How many defects in external dependencies?

    • NASA: 0.1 defects/KLOC after deployment (Dvorak 2009)
    • For commodity software: ~1 defect/KLOC good
  1. How many KLOC?
    • Support libraries (glib, etc.): Several MLOC
    • System libraries (libc, etc.): 1-2 MLOC
    • Operating System: 20-60 MLOC

Fault Containment

// Your Program

void printSum(int a, int b) {
-  printf(“%d”, a – b);
+  printf(“%d”, a + b);
}

libc

Operating System

Error Propagation Analyses (EPA)

How do defects in one component affect other components?

Three Steps:

    1. Inject defect into external component

    2. Execute program

    1. Detect behavioral deviations
Assumption: Federated systems
    • Isolated components
    • Explicit interfaces

Interface Injections

Interface Error Injections: Mutating interface parameters

Interface Injections

Federated Systems Assumption

void user() {
  int a = 40;
  int b = 2;
  int result = library_function(a, b);
  ...
}
int library_function(int a, int b) {
  return a + b;
}

Interface injection:

  • Return value of library_function is external input
  • Replace by
    • Fixed value: 420
    • Random value: 422275
    • Mutated value: 4243

Interface Injections

Federated Systems Assumption

void user() {
  struct test *t;
  library_function(1, t);
  ...
}
void library_function(int a, struct test *b) {
  bool init = (a - 1 ==  0);
  if (init)
    b->val = 0;
}

Stack (user) Stack (component) Heap struct test t b points to input-output parameter other data

  • No return value as external input
  • Shared-memory via heap pointer t
  • Injection into the pointer
    • Maybe other valid heap
    • Mostly unallocated/protected memory
      → segfault
  • Just inject in pointed-to heap

Interface Injections

void user() {
  struct test *t;
  library_function(1, t);
  ...
}
void library_function(int a, struct test *b) {
  bool init = (a - 1 ==  0);
  if (init)
-   b->val = 0;
+   char *str = "test"; 
+   b->string = str;
}

Stack (user) Stack (component) Heap struct test t b points to input-output parameter char* str

  • Heap contains pointers
    → Injection likely to invalidate
  • Only parts of shared data touched by external code
  • Strategy 1: Ignore it (and miss relevant tests)
  • Strategy 2: Inject into heap memory
    • Many wasted tests
  • Strategy 3: Track all memory “stores”
    • Filter irrelevant cases
    • Requires tracking allocations, frees
    • Orientation: sysbench cpu run
      • 4.8M reads & 1.2M stores

Analyzing Error Propagation in Shared Memory (Lanzaro et al. 2014)

  1. Execute with dynamic binary instrumentation
  2. Construct reachability graphs from function parameters & return values
  3. Compare clean/faulty store traces for reachable memory

Trace Comparison Example

Alignment of longest common sub-sequences (Hunt and Szymanski 1977)

Experimental Evaluation

Name Size (loc) Faults
Libxml2 155k 1471
Libbzip2 6k 463
SQLite 78k 1023

User code: Developer tests

Results

Memory Corruptions Are Frequent

  • 38.2% of all experiments
  • 61.8% showed no effects
    • Not uncommon for fault injections
    • No activation
    • No propagation

Results

Memory Corruptions Are Large

Results

Memory Corruptions Are Silent

Return values with memory corruption
Name −1 NULL ptr 0 Wrong ptr Wrong value No indication
Libxml2 32 (24.2%) 33 (25.0%) 1 (0.8%) 5 (3.8%) 2 (1.5%)
Libbzip2 0 (0%) 4 (11.8%) 0 (0%) 28 (82.3%) 2 (5.9%)
SQLite 0 (0%) 2 (2.8%) 0 (0%) 1 (1.4%) 69 (95.8%)

Limitations

Scalability

  • Trace size
  • Dynamic binary instrumentation overheads

Scalability Improvement

Debugger-based introspection

  • Break-points at library function call/return (call-site)
  • Ad-hoc reachability graph construction
    • Entry: Determine root-set (parameters, return value)
    • Exit: Dump root-set-reachable data
  • Comparison across clean/faulty runs

Results

Empirical study of interface errors (Natella et al. 2018):

  • SPECInt 2006 benchmark
  • 17-50 minutes run time
  • Interface: Benchmark management ↔︎ main algorithm

Confirmed findings: Interface data corruptions are

  • frequent (31.8%)
  • large (up to 183 MiB)
  • silent (45.8% without indication)

Additional contribution: Pattern catalog of interface errors

Applied EPA

Source: (Palix et al. 2011)

Shared-Memory EPA for Unknown User Code

  • Unknown call-sites for driver code
  • Memory visible everywhere
  • No dynamic binary instrumentation
  • kgdb, but where to set breakpoints

TrEKer (Coppik et al. 2017)

image/svg+xml Module InstrumentedModule Kernel Runtime QEMU ExperimentControl&ResultDetection SSH serial console Trace Exp. Log ProcessedTrace Trace AnalysisSymbolification instr.

TrEKer Instrumentation and Trace Processing

LLVM optimization pass:

  • load, store, getelementptr instructions → add tracing call
  • trace function calls, return addresses, parameters, return values
  • Reachability analysis in post-processing

TrEKer Results

  • Evaluation subjects: btrfs, f2fs, nvme
  • 23376 injected faults
  • Workload: load module, FS operations, unload module

Summary

References

Coppik, Nicolas, Oliver Schwahn, Stefan Winter, and Neeraj Suri. 2017. TrEKer: Tracing Error Propagation in Operating System Kernels.” In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017.
Dvorak, Daniel. 2009. NASA Study on Flight Software Complexity.” In AIAA Infotech@Aerospace Conference. Infotech@Aerospace Conferences. American Institute of Aeronautics; Astronautics. https://arc.aiaa.org/doi/10.2514/6.2009-1882.
Hunt, James W., and Thomas G. Szymanski. 1977. A fast algorithm for computing longest common subsequences.” Commun. ACM 20 (5): 350–53. https://doi.org/10.1145/359581.359603.
Lanzaro, Anna, Roberto Natella, Stefan Winter, Domenico Cotroneo, and Neeraj Suri. 2014. An Empirical Study of Injected Versus Actual Interface Errors.” In Proceedings of the 2014 International Symposium on Software Testing and Analysis, 397–408. ISSTA.
Natella, R., S. Winter, D. Cotroneo, and N. Suri. 2018. Analyzing the Effects of Bugs on Software Interfaces.” IEEE Transactions on Software Engineering 46 (3): 280–301.
Palix, Nicolas, Gaël Thomas, Suman Saha, Christophe Calvès, Julia Lawall, and Gilles Muller. 2011. Faults in linux: ten years later.” In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 305–18. ASPLOS XVI. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1950365.1950401.