Tuesday, October 15, 2013

Using Clang to analyze legacy C++ source code

Legacy Code

At IAR Systems, where I work, we have a fair amount of C++ source code, some of which is very old. The oldest parts probably date back to the mid 90s. We've done a fairly good job of maintaining the code, so many parts are still in active use.

There is one particular code base which has been of interest to me for some time. It is (in code years) very old; I believe many parts of it have roots in the mid 90s. The code is used to describe configuration data together with functional parts which describe relationships between the objects in the configuration. There are two main problems with the code:

  1. The only way to reliably change properties in the model is to interact with a UI: pressing buttons, entering text, etc. The model can be serialized to XML, but the XML contents cannot be correctly understood without access to the C++ code.
  2. The only output possible is a sequence of strings in a certain format to be passed on to other tools.
In other words, you can edit the model using a UI, and you can execute the model by passing the results to other tools, but not much else. For example, it is not possible to:
  • Enumerate all properties which can be modified.
  • Describe the properties (their type, the possible values).
  • Describe how properties interact with other properties. The set of valid values of a property may depend on the values of other properties, but this is hidden in the C++ code.
A typical piece of code might look like this:

class Person { ... };

class Manager : public Person { ... };

class Assassin : public Person { ... };

class SecretAgentBehavior : public DefaultSecretAgentBehavior
{
public:
    // Take the boss by reference so that a Manager is not sliced to a Person
    SecretAgentBehavior(Person &boss) { ... }
    std::string getArguments()
    {
        Assignment a = boss.getAssignment();
        return "-dostuff " + a.getName() + " -o output.txt";
    }
};

// Bob The Boss is a secret agent manager
Manager boss("Bob The Boss", new BossBehavior /* defined elsewhere */);

// John Smith is an assassin, and Bob The Boss is his manager
Assassin john("John Smith", new SecretAgentBehavior(boss));

I would like to be able to extract as much useful information from this code base as possible, to facilitate a future migration of the configuration data to a system which makes it easier to analyze the data and generate different outputs based on the data. Enter Clang.


My first approach was to modify the code itself to produce some useful information, but this did not work well. The necessary metadata was simply missing. For example, the type of the model elements (the secret agents in the example above) was stored only as C++ types. The limited support for reflection in C++ made this approach impossible.

I then got this "crazy idea" to try to use Clang to parse the C++ source code. The source code itself is fairly structured and follows a handful of different patterns, so we are not talking about analyzing arbitrary C++ source code. Also, we are only interested in parsing the source code, not generating executable code. This is fortunate, since the code in question has never been compiled by anything other than Visual Studio, and uses lots and lots of Visual Studio-specific things (MFC, for example). Would it be possible for Clang to at least be able to build an AST of the code?

It turned out to be, if not trivial, at least fairly easy. The main problem was that Clang was unable to parse the Windows-specific header files defining things like CString, CWinApp, etc. I solved this by placing dummy definitions in a special header file. To make sure that all source files which expect these definitions actually get them, I created a set of replacement header files (afx.h, windows.h, etc.) which all simply include the header file with the dummy definitions. For example, the definition of CWinApp looks like this:

class CWinApp
{
public:
  HICON LoadIcon(LPCTSTR name);
  HICON LoadIcon(UINT resid);
};

That's it. Since Clang only needs to parse the code, not compile and link it, these kinds of dummy definitions are enough.
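To see why declarations alone are enough, here is a small self-contained sketch. The typedefs for HICON, LPCTSTR and UINT are simplified stand-ins chosen for this example (not the real Windows definitions), and the names IconByName and IconById are made up for illustration. The calls sit inside decltype, which is an unevaluated context, so the compiler resolves the overloads and checks the types without ever needing a function body.

```cpp
#include <type_traits>
#include <utility>

// Simplified stand-ins for the Windows types (assumptions for this sketch;
// the real definitions in the Windows headers are different).
typedef void *HICON;
typedef const char *LPCTSTR;
typedef unsigned int UINT;

// Stub class: declarations only, no function bodies anywhere.
class CWinApp
{
public:
    HICON LoadIcon(LPCTSTR name);
    HICON LoadIcon(UINT resid);
};

// "Legacy" calls placed in an unevaluated context. Overload resolution and
// type checking happen here, but no definition of LoadIcon is required.
typedef decltype(std::declval<CWinApp &>().LoadIcon("app.ico")) IconByName;
typedef decltype(std::declval<CWinApp &>().LoadIcon(42u)) IconById;

// Both overloads were found and both produce an HICON.
static_assert(std::is_same<IconByName, HICON>::value,
              "string overload should return HICON");
static_assert(std::is_same<IconById, HICON>::value,
              "resource-id overload should return HICON");
```

Ordinary legacy code that really calls LoadIcon would fail at link time against such a stub, but it passes a parse-only run such as clang++ -fsyntax-only, which is all an AST-based tool needs.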

Ok, so once the code base passed successfully through clang-check, then what? How do we get any useful information out?

AST Matchers

There are good tutorials on how to write a Clang tool, so I will skip over that here.

The Clang AST is a complex beast (run clang-check -ast-dump on any non-trivial program), and to make it easier to navigate and make sense of the AST, the AST Matchers API allows you to write "simple" rules describing the parts of the AST that you are interested in.

For the example above, a rule which matches the persons may look like this:

            hasArgument(0, expr().bind("name")),
            hasArgument(1, expr().bind("behavior")))));

The bind() calls are placed so that the corresponding AST node can be extracted in the callback function which is invoked each time the rule matches.

The rule will match twice, once for the boss and once for the assassin. The callback function looks like this:

virtual void run(const MatchFinder::MatchResult &result)
{
    // Look up the nodes bound in the matcher by their bind() names
    const Expr *name = result.Nodes.getNodeAs<Expr>("name");
    const Expr *behavior = result.Nodes.getNodeAs<Expr>("behavior");

    // Now we can do interesting things
}

Since Clang gives us access to the entire C++ AST (including information from the preprocessor and the source code itself), we can extract all sorts of useful information from here. For example, we can generate output which contains the configuration data in a structured format, together with the source code implementing the functional parts.

   name: Bob The Boss
   type: Manager
   source: {
Manager boss("Bob The Boss", new BossBehavior /* defined elsewhere */);
   }
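As a rough illustration of this last step, here is a minimal sketch of how one extracted element could be written out in the format shown above. The ModelElement struct and the formatElement function are hypothetical names invented for this example; in a real tool the three strings would come from the matched AST nodes, and the actual output format may differ in its details.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical record for one model element. The fields are assumed to have
// been filled in already from the nodes bound in the matcher callback.
struct ModelElement
{
    std::string name;    // first constructor argument, e.g. "Bob The Boss"
    std::string type;    // C++ class of the element, e.g. "Manager"
    std::string source;  // verbatim source text of the declaration
};

// Renders one element as a name/type/source record.
std::string formatElement(const ModelElement &element)
{
    // An element without a name or a type could not be identified later.
    assert(!element.name.empty() && !element.type.empty());

    std::ostringstream out;
    out << "   name: " << element.name << "\n"
        << "   type: " << element.type << "\n"
        << "   source: {\n"
        << element.source << "\n"
        << "   }\n";
    return out.str();
}
```

Keeping the formatting in one small function like this means the record layout can later be swapped for XML or another structured format without touching the code that walks the AST.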

Of course, we would still need to implement the getArguments function somewhere.


Clang is the perfect tool (or rather, platform) for analyzing C/C++ source code. Since it exposes the same AST that the actual Clang compiler uses, you get the complete AST and not some approximation. The AST matchers framework is also a major time-saver, since it lets you pick out the parts of the AST you are interested in without having to write large state-machine-like code to keep track of where in the AST you are.


Jamming123 said...

Hi there,
thanks a lot for this blog. Is it possible to get the visual studio project that uses clang to analyze old MFC code? I really appreciate that. My email is j_dayyeh@yahoo.ca.

Jesper Eskilson said...

There is no Visual Studio project involved. We've implemented a Clang-based tool which can analyze C++ source code, but this is not integrated into Visual Studio itself in any way. The tool takes a bunch of C++ source code and produces an XML file containing information about the code base.

To be able to analyze code which uses MFC classes, we have a header file which contains enough MFC declarations for the code to go through the Clang parser. This header file contains things like "typedef long LONG_PTR;" and "class CMenu;".

(This is actually easier to do on Linux, as Clang is better integrated with C++ standard libraries on Linux. On Windows, you need to figure out which C++ standard libs to use. We ended up using a subset of GCC's libstdc++ headers.)

What exactly do you mean by "analyzing old MFC code"? What do you want to do?

Jamming123 said...

I have some old MFC projects that have been developed in VC6 (1998). It is well-known that VC6 is not standards-compliant. A lot of those projects fail to even build in VC2005 and later.
I'm trying to migrate the non-GUI stuff to other platforms. One example of what Clang could be used for is, say, to find all instances of CString, and find all calls to member functions of CString on those instances. Then, we could replace CString with the standard std::string.
That's just one example.
There are many ways in which, I believe, Clang could help out.

I use Windows as my dev machine.
It would be nice to take a look at the source code of your tool (including the header file with enough MFC-stuff). That way, I could have a starting point.

Jesper Eskilson said...

I'll see if it is possible to publish at least parts of the source code in some sort of open-source fashion. Most of the internals are of little public interest, although the stuff which sets up the compilation database might be useful for others.

Jamming123 said...

That would be awesome. Thanks again.

ruben2020 said...

Hi Jesper,

I'm not sure what kind of information you want to extract from your analysis of the C++ code.

I have an open source project, CodeQuery, which can do some level of analysis of C++ source code with the help of cscope and ctags.

Do take a look: https://github.com/ruben2020/codequery

Jesper Eskilson said...

The main benefit of using Clang is that you know that this is not some approximation of your C++ code produced to aid in navigating source code, this is the actual AST produced by a real C++ compiler. When stumbling over weird templating code and other ugly C++ stuff, having a real C++ compiler is of great help.