RTL-check --------- RTL-check is a framework for static analysis of programs from a safety and security perspective. It performs analysis on RTL, which is the low-level intermediate representation generated by GCC. We created this framework for a research project that deals with static analysis and memory safety of programs developed in unsafe languages such as C or C++. In particular, we want to detect buffer overflows staticaly and prove their absence. However, we believe this framework will be useful for other projects involving static analysis. The projected features of the system will be: - Intelligent structures to deal with RTL (see below) - Language to describe program invariants (see below) - RTL checker to verify if program invariants hold - Automatic inference of program invariants, when possible - A framework in which it is easy to plug in and test new static analysis algorithms, save the results and reuse them in a subsequent phase of analysis. (more below) However, this is an open source project, and its developement will be driven by its developers. We expect it to grow beyond what was initialy planned... Intermediate Representation of GCC (RTL) ---------------------------------------- Doing analysis on the intermediate representation of GCC (RTL) is an important feature of this project. Analysing code at the RTL level brings us several advantages: - Independance of source language (almost all languages supported by GCC are translated to RTL) - Independance of target architecture (architecture-specific code is generated from RTL) - A language much more simple than C to base our analysis on (C is very complex, C++ is still more complex) - We can ignore ambiguities from the C standard because GCC resolved them and we can concentrate on verifying that the particular code generated by GCC is safe instead of trying to ensure that all interpretations of the C standard are safe. (We assume the same compiler is used to build the program and to extract RTL used for the analysis.) - We can potentially find errors introduced by GCC if it generates the wrong code or if it does bad optimizations since many optimizations are done on RTL. - We can use some results of basic static analysis done by GCC which are stored in RTL (assuming we trust GCC) On the other hand, there are inconvenients: - Data representation is not described in RTL - Data types are not described either - It will not be possible to check code containing inline assembly - We do not check if there are errors in the translation from RTL to native - When an error is found in RTL, it may be difficult to tell from where in the source code errors it originate We believe we can workaround all these limitations. Language to describe invariants ------------------------------- This language must be powerful enough to explain why each memory acces is correct in a formally verifiable way on RTL. We do not know yet what this language will look like. It will most likely evolve rapidly at the begining of the project as we discover how complex it is to explain why some memory access is safe. The language might be based on TAL annotations, Splint annotations or we might invent something completely new. There are things to keep in mind when designing this language: - The properties it expresses must be verifiable automatically (the level of abstraction cannot be too high) - They must be verifiable efficiently (finding invariants can be difficult, but checking them should be easy) - This language does not have to be human-readable (It might help, but if the language is too complex or if it requires writing a lot of boring stuff, we can create macros or tools that make it easier to write (and understand) invariants. Automatic inference of program invariants ----------------------------------------- We know it is not possible to infer all the program invariants required to ensure that any given safe program, in fact, is safe. There are algorithms to find some of them, but for others, we will have to resort to some tricks. Here are some possibilities. (We do not plan to implement them all!): - We can create a database of common patterns and their corresponding invariants. The system would try to match them in the program. - We can allow the user to add more patterns and their corresponding to the database invariants - We can use statistical techniques to guess some invariants - We can use AI techniques to find more invariants or to find them more efficiently (automatic learning, ...) - The user can enter the whole invariant for a loop or a function (it might be useful to describe a library that is not available in source form) - The user can enter it partialy, and ask the system to infer the rest (he might tell it explicitely what pattern or algorithm to use for a more in-depth analysis) In all cases, the system will have to verify formally that the invariants are correct. Framework --------- We will build a framework that will consist of many parts: - A class library to manipulate and analyze RTL - Basic static analysis algorithms (demand or not) - Code to verify invariants - Algorithms to infer invariants required to check safety - A database containing: - the RTL of the modules to be analyzed - auxillary information about modules (data, types, what is exported, ...) - relasionships between modules (how to link a program) - infered (or entered) invariants about the code and their dependancies - the results of the different static analysis algorithms - A shell (gui and/or command oriented): - that can guide the user in certifying that a program is secure - that gives the user access to all the power of the framework (to test new hypothesis, to plug in new algorithms, create new patterns, ...) - Possibly, a method to execute computation intensive algorithms remotely (distributed) When a module is modified, we would like to recompute only the minimum information and keep the results that are still valid in the database. We will also have to make modifications to GCC to dump RTL and auxillary informations about modules at compile-time. Implementation -------------- We have not yet decided the language that we will use to implement all this. The prototype is in Python. The dynamic nature of Python makes it a very good language for this kind of developement, especially at the begining, when there is a lot of experimentation to do. The main drawback of Python is its speed. If it becomes a problem, possibly when more complex algorithms will be implemented, we may switch to another language or try Psycho, Pyrex or Starkiller. Copyright (C) 2004 Patrice Lacroix