Tuesday, October 15, 2013

Using Clang to analyze legacy C++ source code

Legacy Code

At IAR Systems where I work, we have a fair amount of C++ source code, some of which is very old. The oldest parts probably date back to the mid 90s. We've done a fairly good job at maintaining the code so many parts are still in active use.

There is one particular code base which has been of interest to me for some time. The code base is (in code years) very old, I believe many parts of it has roots in the mid 90s. The code is used to describe configuration data together with functional parts which describe relationships between the objects in the configuration. There are two main problems with the code:

  1. The only way to reliably change properties in the model is to interact with a UI: pressing buttons, entering text, etc. The model can be serialized to XML, but the XML contents cannot be correctly understood without access to the C++ code.
  2. The only output possible is a sequence of strings on a certain format to be passed on to other tools.
In other words, you can edit the model using a UI, and you can execute the model by passing the results to other tools. For example, it is not possible to
  • Enumerate all properties which can be modified.
  • Describe the properties (their type, the possible values)
  • Describe how properties interact with other properties. The set of valid values of a property may depend on the values of other properties, but this is hidden in the C++ code.
A typical piece of code might look like this:

class Person;
class Manager : public Person
{
    ...
};

class Assassin : public Person
{
    ...
};

class SecretAgentBehavior : public DefaultSecretAgentBehavior
{
    SecretAgentBehavior(Person boss) { ... }
    std::string getArguments()
    {
        Assignment a = boss.getAssignment();
        return "-dostuff " + a.getName + " -o output.txt";
    }
};

// Bob The Boss is a secret agent manager
Manager boss("Bob The Boss", new BossBehavior /* defined elsewhere */);

// John Smith is an assassin, and Bob The Boss is his manager
Assassin john("John Smith", new SecretAgentBehavior(boss));

I would like to be able to extract as much useful information from this code base as possible, to facilitate a future migration of the configuration data to a system which makes it easier to analyze the data and generate different outputs based on the data. Enter Clang.

Clang

The first approach I tried was to try to modify the code itself to produce some useful information, but this did not work well. The necessary metadata was simply missing. For example, the only place where the type of the model elements (the secret agents in the example above) was stored was as C++ types. The limited support for reflection in C++ made this approach impossible.

I then got this "crazy idea" to try to use Clang to parse the C++ source code. The source code itself is fairly structured and follows a handful of different patterns, so we are not talking about analyzing arbitrary C++ source code. Also, we are only interested in parsing the source code, not generating executable code. This is fortunate, since the code in question has never been compiled by anything other than Visual Studio, and uses lots and lots of Visual Studio-specific things (MFC, for example). Would it be possible for Clang to at least be able to build an AST of the code?

It turned out to be if not trivial, at least fairly easy. The main problem was that Clang was unable to parse the Windows-specific header files defining things like CString, CWinApp, etc. I solved this by placing dummy definitions in a special header file. To make sure that all source files which expects to get these definitions actually get them, I created a set of replacement header files (afx.h, windows.h, etc.) which all simply included the header file with the dummy defininitions. For example, the definition of CWinApp looks like this:

class CWinApp
{
public:
  HICON LoadIcon(LPCTSTR name);
  HICON LoadIcon(UINT resid);
};

That's it. Since Clang does not need to compile and link, these kind of dummy definitions are enough.

Ok, so once the code base passed successfully through clang-check, then what? How do we get any useful information out?

AST Matchers

There are good tutorials on how to write a Clang tool, so I will skip over that here.

The Clang AST is a complex beast (run clang-check -ast-dump on any non-trivial program), and to make it easier to navigate and make sense of the AST, the AST Matchers API allows you to write "simple" rules describing the parts of the AST that you are interested in.

For the example above, a rule which matches the persons may look like this:

varDecl(
    hasDescendant(
        constructExpr(
            hasType(recordDecl(isSameOrDerivedFrom("Person"))),
            hasArgument(0, expr().bind("name")),
            hasArgument(1, expr().bind("behavior")))));

The bind() calls are placed so that the corresponding AST node can be extracted in the callback function which is invoked each time the rule matches.

The rule will be invoked twice, once for the boss and one for the assassin. The callback function looks like this:

virtual void run(const MatchFinder::MatchResult &result) 
{
    const Expr *name = result.Nodes.getNodeAs("name");
    const Expr *behavior = result.Nodes.getNodeAs("behavior");

    // Now we can do interesting things
    name->dumpColor();
    behavior->dumpColor();
}

Since Clang gives us access to the entire C++ AST (including information from the preprocessor and the source code itself), we can extract all sorts of useful information from here. For example, we can generate output which contains the configuration data on a structured format, together with the source code implementing the functional parts.

person:
   name: Bob The Boss
   type: Manager
   source: { 
Manager boss("Bob The Boss", new BossBehavior /* defined elsewhere */);
   }

Of course, we still would need to implement the getArguments function somewhere.

Conclusion

Clang is the perfect tool (or rather, platform) for analyzing C/C++ source code. It gives full access to the entire AST, since this is the same AST which is used by the actual Clang compiler, it gives you the complete AST, and not some approximation.  The AST matchers framework also is a major time-saver, since it allows you to match out the parts of the AST you are interested in without having to write large statemachine-like code to keep track of where in the AST you are.