Pipeline and Ideas


With fuzzing and other automatic test generation techniques, the generated test cases are so numerous and redundant that it is almost impossible for human testers to examine the outcome of every test case (i.e. whether a test passes or fails). Moreover, human testers usually focus only on the output of the program and seldom check whether the execution trace is actually correct.

For Fintech software, the very large inputs and outputs also make it hard to establish a test oracle that can automatically classify the outcome of test cases. Thus, an automatic test oracle generation tool may be crucial to the current Fintech industry.


Inspired by Roper [1], and based on our current set of SUTs (software under test), we have the initial pipeline as follows:

Some of this work has already been done in earlier research. The procedures in the pipeline above are listed below:

| Procedure | Mark | Note |
| --- | --- | --- |
| Mutation and Instrumentation | Gong Xin has done related work | |
| Test case generation (FACTS/FinExpert) | Already done for CSTP and bcbip, work needed for UVC | FinExpert needs to be changed (combination) |
| Run test and collect info | Harry may need to do this himself | |
| IO pair + Execution trace $\rightarrow$ vector | How did Gong convert the trace to a vector? Anything reusable? | Gong has done research on the level of trace, Zhenming has read the paper |
| Unsupervised learning | Gong implemented this with HCA (Hierarchical Cluster Analysis) | |
| Select a subset of test cases to label | | |
| Semi-supervised learning | Harry needs to do this himself | Several papers have already evaluated some algorithms |
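
As a rough illustration of the "Execution trace $\rightarrow$ vector" and "Unsupervised learning" steps, the sketch below turns each trace into a binary statement-coverage vector and groups the vectors with a small average-linkage agglomerative clustering. This is not Gong's implementation: the trace format, the statement ids (`s1`, `s2`, ...), the test names, the Hamming distance, and the choice of two clusters are all assumptions made for the example. Features derived from the IO pair could be appended to the same vector.

```python
from itertools import combinations


def trace_to_vector(trace, all_statements):
    """Binary coverage vector: 1 if the statement id occurs in the trace."""
    covered = set(trace)
    return [1 if s in covered else 0 for s in all_statements]


def hamming(u, v):
    """Number of positions where two coverage vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    dists = [hamming(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def hierarchical_clusters(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Hypothetical traces: test name -> executed statement ids (assumed format).
traces = {
    "t1": ["s1", "s2", "s3"],
    "t2": ["s1", "s2", "s3"],
    "t3": ["s1", "s4", "s5"],
    "t4": ["s1", "s4", "s5", "s6"],
}
statements = sorted({s for tr in traces.values() for s in tr})
names = list(traces)
vectors = [trace_to_vector(traces[n], statements) for n in names]

clusters = hierarchical_clusters(vectors, n_clusters=2)
named_clusters = [sorted(names[i] for i in c) for c in clusters]
print(named_clusters)
```

Block-level coverage would only change what the ids in each trace refer to; the vectorisation and clustering steps stay the same.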

Potential Research Topics, Ideas, and Variables in the Pipeline

Below are some potential research topics and variables in the pipeline that need to be decided or tested:

  1. Mutation: how should faulty versions be simulated? How many and which mutation operators should we implement? How many bugs should one mutant contain?
  2. Execution trace collection: at what level should coverage be recorded (statement level, block level, ...)?
  3. Unsupervised learning: what algorithm? How many clusters?
  4. Labeling data: how many test cases should be "manually labeled"?
  5. Semi-supervised learning: what algorithms?
  6. Result evaluation:
    1. Confusion matrix metrics
    2. Correct output but incorrect execution trace?
    3. For the test reduction done by Gong Xin, does it reduce the failed cases but leave the passed cases in the same cluster?
    4. How does this procedure work in real industry?
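
To make items 4 to 6 more concrete, the sketch below labels one test per cluster, spreads that label to the rest of the cluster by majority vote, and then computes confusion counts against the ground truth. The majority-vote propagation is a deliberately simple stand-in, not one of the semi-supervised algorithms evaluated in the literature, and the clusters, labels, and test names are made up for the example.

```python
from collections import Counter


def propagate_labels(clusters, labeled):
    """Give every test in a cluster the majority label of its labeled members."""
    predicted = {}
    for cluster in clusters:
        votes = Counter(labeled[t] for t in cluster if t in labeled)
        if not votes:
            continue  # no labeled member: leave this cluster unclassified
        majority = votes.most_common(1)[0][0]
        for test in cluster:
            predicted[test] = majority
    return predicted


def confusion_counts(truth, predicted, positive="fail"):
    """Count TP/FP/TN/FN, treating a failing test as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for test, guess in predicted.items():
        actual = truth[test]
        if guess == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def safe_ratio(num, den):
    """Division that returns 0.0 instead of raising when the denominator is 0."""
    return num / den if den else 0.0


# Hypothetical data: two clusters, one manually labeled test in each.
clusters = [["t1", "t2", "t3"], ["t4", "t5"]]
labeled = {"t1": "pass", "t4": "fail"}
truth = {"t1": "pass", "t2": "pass", "t3": "fail", "t4": "fail", "t5": "fail"}

predicted = propagate_labels(clusters, labeled)
cm = confusion_counts(truth, predicted)
precision = safe_ratio(cm["tp"], cm["tp"] + cm["fp"])
recall = safe_ratio(cm["tp"], cm["tp"] + cm["fn"])
print(cm, precision, recall)
```

In this toy data `t3` is a failing test that sits in a cluster of passing tests, so it is counted as a false negative. This is the situation raised in item 6.3, and recall is the metric that exposes it.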

Questions and Difficulties

  1. CSTP has no output; its outcome depends only on whether exceptions occur. Is it suitable for this topic?
  2. SUT: is Defects4J a good set of SUTs for this topic?
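
For a SUT whose only observable outcome is whether an exception occurs, one possible way to keep an "output" feature in the vector is to record the exception type as a categorical value and one-hot encode it. The sketch below only illustrates that idea: `hypothetical_sut` and its exceptions are invented for the example and have nothing to do with the real CSTP code.

```python
def run_and_record(sut, test_input):
    """Run one test and return the exception type name, or 'no_exception'."""
    try:
        sut(test_input)
    except Exception as exc:
        return type(exc).__name__
    return "no_exception"


def hypothetical_sut(amount):
    """Stand-in for a SUT that produces no output and only raises on errors."""
    if amount < 0:
        raise ValueError("negative amount")
    if amount == 0:
        raise ZeroDivisionError("zero amount")


outcomes = [run_and_record(hypothetical_sut, x) for x in (5, -1, 0)]

# One-hot encode the outcome so it can be concatenated with the trace vector.
categories = sorted(set(outcomes))
one_hot = [[1 if o == c else 0 for c in categories] for o in outcomes]
print(outcomes, categories, one_hot)
```

Whether such a coarse outcome feature carries enough information for classification is exactly the open question above. The execution trace would then have to supply most of the signal.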

  1. M. Roper, "Using Machine Learning to Classify Test Outcomes," 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), 2019, pp. 99-100, doi: 10.1109/AITest.2019.00009.