Stage 2.3: Comparing Clones

Introduction

Now that my pass correctly identifies clones by comparing base names (Stage 2.2), it’s time to implement logic to see if the clones are significantly the same. To check if the functions are the same, I will have to implement strong logic to compare the functions based on significant logic (ignoring inconsistant, or unimportant logic), or I would have to normalize the functions in some way.

Planning

Approach 1: Storing Normalized Function Code as Text

While researching how to do this, I stumbled upon the word “Canonicalization”. According to Wikipedia, in Computer Science this term often means converting data that has more than one possible representation, into a “standard” form. The first approach I considered was to normalize the code itself, at that specific point in the pass chain. This would involve iterating through the function text, and whenever we encounter an SSA variable, BB, or Label – we map it to a sequential number, instead of whatever value is there. This would be a good, simple solution, but it isn’t very efficient, and requires a hefty amount of memory.

Approach 2: Comparing Functions Line by Line

Similarly to the first approach, this approach would be dealing with the function code at that point in the compilation. There are a few problems with this approach – in order to have both clones available for iteration, there would have to be extra logic added to retrieve the “other” function variant, for iteration. Or alternatively, store a potentially long text of the function, just to make those comparisons. The second problem I saw with this, is that it seems like a very long process. There are many “ivtmp”, and seemingly meaningless lines that are included in the source code, that could simply be ignored when we are considering the function logic itself.

Ultimately, this approach wouldn’t take advantage of GIMPLE, which is an intermediate representation of the code’s logic. This seems like a less efficient, or appropriate approach, especially since our pass was deliberately placed after our code was converted to GIMPLE.

Approach 3: Generating Signature

This approach, which is the one I ultimately chose, involves generating some sort of unique function signature for every function. This signature would encompass logic to differentiate functions based on logic and flow, ignoring temporary or varying representations, using GIMPLE.

Ideally, I would only generate a signature for the cloned functions, to avoid extra operations, but I still haven’t found a way to iterate through 2 function bodies at the same time, without storing the data (which is memory intensive).

Pass Structure

First, I will modify the function collection storage method, to include the function signature. For this, I created a small struct, funcData, that will hold both strings – name and signature. Storing the signature of every function will ensure that I can compare 2 function signatures later on. I also added a new method, get_signature(), which will return a string of the signature.

// funcData holds the function's full name, as well as signature.
struct funcData { std::string full_name; std::string signature; };

/* 
   Currently, this pass generates a function signature based on 
   gimple statements, and other metrics. It detects function 
   clones using the base name of the function, and compares their 
   signature, outputting PRUNE or NOPRUNE based on analysis.
*/
class pass_efagin : public gimple_opt_pass
{
public:
  pass_efagin (gcc::context *ctxt): gimple_opt_pass 
    (pass_data_efagin, ctxt){}
  bool gate (function *) final override { return 1; }

  unsigned int execute (function *) final override;
	
private:  
  // Holds function clone names and signatures, with base name.
  static std::unordered_map<std::string, std::vector<funcData>> 
    fun_map;

  std::string get_signature(function* fun);
};

// Initialize static member
std::unordered_map<std::string, std::vector<funcData>> 
  pass_efagin::fun_map;

I modified the execute() method, to call the get_signature() method to generate it, create an instance of the funcData struct, and add it to the vector in the map. When 2 clones are found, I added logic to compare the signatures of the clones, and make a decision – output PRUNE if the functions are significantly the same, an NOPRUNE if they are different.

For now, I ensured that all the functions output a signature, for debugging.

unsigned int pass_efagin::execute(function* fun)
{
  // If dump file not enabled, exit
  if (!dump_file || !fun )
    return 0;

  // Get the full name of the function
  std::string name = IDENTIFIER_POINTER(DECL_NAME(fun->decl));
  fprintf(dump_file, "--------Examining Func: %s-----\n", 
    name.c_str());
	
  // Skip resolver functions
  if (name.find(".resolver") != std::string::npos) {
    fprintf(dump_file, "---End of Diagnostic (resolver)---\n");
    return 0;
  }

  // Extract base function name (remove everything after first dot)
  std::string base_name = name.find_first_of(".") 
    != std::string::npos 
    ? name.substr(0, name.find_first_of(".")) 
    : name;

  funcData fdata = {name, get_signature(fun)};
  fprintf(dump_file, "SIGNATURE:\n%s\n", fdata.signature.c_str());

  auto it = fun_map.find(base_name);
 
  // If this base name already has registered clones, print them
  if (it != fun_map.end()) {
    for (auto& variant : it->second) {
      fprintf(dump_file, "CLONE: %s\nCLONE SIGNATURE:\n%s\n", 
        variant.full_name.c_str(), variant.signature.c_str());
      
      if (fdata.signature == variant.signature) {
        fprintf(dump_file, "[%s] = PRUNE\n", base_name.c_str());
      } else {
        fprintf(dump_file, "[%s] = NOPRUNE\n", base_name.c_str());
      }
    }
  }
  
  // Add function data to the clones map
  fun_map[base_name].push_back(fdata);
  fprintf(dump_file, "--------End of Diagnostic---\n\n");

  return 0;
}

Signature Components

The most important and complicated part of this stage, is the decision of what to put in the signature. The balance between efficiency and accuracy is extremely important. At this stage, I chose a couple of easy to use metrics, that will not always be accurate, but will still give insight into the code.

Basic Block Count

First, I will calculate the number of basic blocks in the function, using an existing utility function. This value will be simliar to the number of lines in the code, although 2 functions with the same amount of lines do not always generate the same number of basic blocks. 2 variants for the same base function, may not have the same number of basic blocks, due to different optimizations.

n_basic_blocks_for_fn(fun)

Edge Counts (Control Flow Graph)

An edge, in the context of analyzing the function flow, refers to a flow between 2 basic blocks. The arrows on the control flow graph here, represent edges – connections between blocks. Conditional statements can create more edges, possible paths. Therefore, the number of edges is indicative of the overall structure of the function.

Source code with control flow graph, from researchgate.

n_edges_for_fn(fun)

Gimple Statements Code

Each gimple statement can be classified into a type, that has a cooresponding code (documented in gimple.def). This is a simple way to signature unique – add the gimple statement codes, in the order they appear in, in the code. Technically speaking, since all the codes are registered in the signature, there is no point to output a total count of statements. But for ease of testing, I will still include it.

Implementation

Here is the implementation of the get_signature() function, which generates the signature. At this stage, the signature prioritizes the readability and analysis capability, as opposed to efficiency. This will help me diagnose possible problems, if they come along.

// Generate a signature for a function
std::string pass_efagin::get_signature(function* fun) {
  std::stringstream signature;

  basic_block bb;
  int num_gimple_stmts = 0;

  FOR_EACH_BB_FN(bb, fun) {
    // iterate through all gimple statments in block
    for (gimple_stmt_iterator gsi = gsi_start_bb(bb); 
      !gsi_end_p(gsi); gsi_next(&gsi)) {
      
      // Record statement type with a space prefix
      signature << gimple_code(gsi_stmt(gsi)) << " ";
      num_gimple_stmts++;
    }
  }
  
  signature 
    << "\nBB\t" << n_basic_blocks_for_fn(fun) 
    << "\nEG\t" << n_edges_for_fn(fun) 
    << "\nGMP-S\t" << num_gimple_stmts;
  
  return signature.str();
}

Testing with Function Clones

I tested the pass with the sample code provided by professor Chris Tyler, same files mentioned and cited in Stage 2.2. Here are the results of the test on the x86 machine.

Test results from aarch64

`aarch64`: PRUNE Output

;; Function scale_samples.rng (scale_samples._Mrng, funcdef_no=25, decl_uid=5531, cgraph_uid=30, symbol_order=28)

--------Examining Func: scale_samples.rng-----
SIGNATURE:
2 2 2 2 1 6 6 6 6 6 1 6 6 6 1 1 6 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 1 6 6 1 6 6 6 6 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 6 6 6 2 2 6 6 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      34
EG      44
GMP-S   130
CLONE: scale_samples
CLONE SIGNATURE:
2 2 2 2 1 6 6 6 6 6 1 6 6 6 1 1 6 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 1 6 6 1 6 6 6 6 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 6 6 6 2 2 6 6 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      34
EG      44
GMP-S   130
[scale_samples] = PRUNE
--------End of Diagnostic---

`aarch64` NOPRUNE Output

;; Function scale_samples.sve2 (scale_samples._Msve2, funcdef_no=25, decl_uid=5531, cgraph_uid=30, symbol_order=28)

--------Examining Func: scale_samples.sve2-----
SIGNATURE:
2 2 2 2 1 6 6 6 6 6 6 1 6 6 8 2 2 6 8 6 6 6 6 6 8 2 2 2 6 6 8 1 6 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      15
EG      18
GMP-S   49
CLONE: scale_samples
CLONE SIGNATURE:
2 2 2 2 1 6 6 6 6 6 1 6 6 6 1 1 6 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 1 6 6 1 6 6 6 6 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 6 6 6 2 2 6 6 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      34
EG      44
GMP-S   130
[scale_samples] = NOPRUNE
--------End of Diagnostic---

PRUNE Output:

;; Function scale_samples.popcnt (scale_samples.popcnt, funcdef_no=25, decl_uid=3985, cgraph_uid=30, symbol_order=28)

-------------Examining Func: scale_samples.popcnt--------
SIGNATURE:
2 2 2 2 1 6 6 6 6 6 1 6 6 6 1 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 1 2 2 6 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      33
EG      44
GMP-S   165
CLONE IDENTIFIED: scale_samples
CLONE SIGNATURE:
2 2 2 2 1 6 6 6 6 6 1 6 6 6 1 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 1 2 2 6 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      33
EG      44
GMP-S   165
[scale_samples] = PRUNE
-------------End of Diagnostic-------------

NOPRUNE Output:

;; Function scale_samples.arch_x86_64_v3 (scale_samples.arch_x86_64_v3, funcdef_no=25, decl_uid=3985, cgraph_uid=30, symbol_order=28)

--------Examining Func: scale_samples.arch_x86_64_v3-----
SIGNATURE:
2 2 2 2 1 6 6 6 6 6 1 6 6 6 1 1 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 1 6 6 1 6 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 6 2 2 6 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      42
EG      56
GMP-S   193
CLONE: scale_samples
CLONE SIGNATURE:
2 2 2 2 1 6 6 6 6 6 1 6 6 6 1 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 1 6 6 1 2 2 6 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 6 2 2 1 2 2 6 6 6 6 6 6 6 6 6 2 2 2 6 6 2 2 6 6 6 6 6 6 2 2 2 6 1 10
BB      33
EG      44
GMP-S   165
[scale_samples] = NOPRUNE
--------End of Diagnostic---

Testing Edge Cases

To test edge cases and possible areas of improvment for my pass, I designed a test file that has similar functions. The functions are almost the same, the only difference is the operation being performed – addition vs. multiplication.

#include <stdio.h>

int func1(int a, int b) {
    if (a > 0) {
        return a + b;
    } else {
        return b;
    }
}

int func2(int x, int y) {
    if (x > 10) {
        return x * y;
    } else {
        return y;
    }
}

int main(void) {
  printf("%d %d\n", func2(1,2), func1(1,2));
  return 1;
}

The result was surprising – since the GIMPLE statement code of the operations is the same, the resulting signature was also the same.

;; Function func1 (func1, funcdef_no=11, decl_uid=3305, cgraph_uid=12, symbol_order=11)

-------------Examining Func: func1--------
SIGNATURE:
2 1 2 6 10
BB      6
EG      6
GMP-S   5
-------------End of Diagnostic-------------

...

;; Function func2 (func2, funcdef_no=12, decl_uid=3309, cgraph_uid=13, symbol_order=12)

-------------Examining Func: func2--------
SIGNATURE:
2 1 2 6 10
BB      6
EG      6
GMP-S   5
-------------End of Diagnostic-------------
...

This test highlighted a weekness in my pass – although the signature correctly identifies the function’s overall structure, it does not include any logic from the function.

Conclusion

The criteria for Stage 2 was to improve the gcc pass so that it:

Identifies functions that have been cloned.
Assess the clones, and determine if they are “substantially the same”.
Output an organized diagnostic dump, that indicates which functions we should PRUNE(the same), which we NOPRUNE.

Although my pass accomplished all these goals, the signature should be improved to encompass some function logic, to avoid “false positives” when it comes to identifying similar clones. As always, the balance between efficiency and accuracy is important. In the coming blogs, I will be attempting to improve the accuracy of the pass, while keeping efficiency in mind.