I need to do some project specific automatic checking of source codes written in
ID: 659057 • Letter: I
Question
I need to do some project specific automatic checking of source codes written in C++
Limitations:
Algorithm and its implementation should be simple, easily maintainable, extendable and understandable without need of Computer Science degree.
Speed and memory consumption of algorithm does not matter. I can run it at night and see result next morning. So it can be 100-1000x slower than fast but hard to maintain algorithms.
I want to write it myself and have fine control about it. No parser generator tools like flex, bison etc.
I dont want to use existing C++ parsers like CLang etc.
Please in answer dont use formal language theory. Try to explain it with simple language, understandable to ordinary self-taught programmer.
So what is the best and easy to implement universal parsing algorithm if we assume speed and memory is not an issue?
EDIT: I experimenting with this approaches and their combinations:
a) Initially some time ago i tried some form of recursive descent parsing.
partical succes done with it, but its too rigid and hard to maintain, hard to hack and extend.
b) Repeated applying set of transformations on array of objects until there is no transformation to apply.
At beginning each character from parsed source code is one object in array Then i have set of transformation functions:
bool Build_...( vector<unique_ptr<CObject>>& objects );
And each of those function removes one or more lower level object from array and replaces them with higher level object for example Build_Whitespacefinds sequence of space and tab characters and replaces them with one CWhitespace object. Build_Comments find sequence of characters which makes comment and replaces them with CComment object etc. (I dont want to remove whitespace and comments, whole processing needs to be done with them ) After each cycle is removed some lower level object and replaced with higher level object. And at end of this process there is few higher level objects like CFunctionDefinition, CClassDefiniton etc.
So basically it works this way:
bool objectsChanged = true;
while ( objectsChanged ) {
objectsChanged = false;
objectsChanged |= Build_Whitespace( objects );
objectsChanged |= Build_EndLines( objects );
objectsChanged |= Build_Names( objects );
.
.
objectsChanged |= Build_ForCycle( objects );
.
.
}
This algorithm is of course very slow, but works, is simple and i think it can parse language of any complexity. It has also advantage of partial parsing, can parse incomplete code because it leaves language objects as is if they have unknown syntax.
c) Currently i am experimenting with parsing using reverse generation of source codes.
Instead of trying to parse source code i am generating syntax tree for all possible source codes and compare each resulting code with one that needs to be parsed until i find one that match. This sounds crazy because there is infinite many possible source codes. But its not so bad as it sounds because comparing can be done early and whole tested subtrees avoided at first generated character which does not match.
Actually i think b) is better for my purposes and easier to use and hack than c) which is more rigid and for succesful parsing of some subset of code needs complete gramatical rules for every possible code.
What other posibilites you know about? What do you think about selected approaches? Some other ideas?
Explanation / Answer
Parsing full C++ is very hard. And in a large enough code, nearly all of C++ syntactical features are used (since the hard to parse features occur in most C++ code).
Also, what is actually parsed is preprocessed output. So you should at least parse the output of the preprocessor.
For example, it is quite hard -since very contextual- to understand if an occurrence of sort refers to one of the std::sort (which one?) or to something else.
BTW, even if you believe to use a small subset of C++ syntax, it is still hard to parse (since contextual), and you need to somehow understand the standard headers (which are internally using a lot of C++ features).
So I believe you are going on the wrong route. The realistic way is to reuse an existing C++ parser. You could buy one (Edison C++ parser is rumored to be very expansive), or you could reuse one from an existing C++ free software compiler, such as Clang/LLVM (which you don't want to use: I don't know it well but I guess it would be reasonable to use Clang) or GCC.
GCC can be customized, notably at the Gimple level, by coding your own extension in MELT; you won't see C++ parse trees, but Gimple instructions. MELT can also handle GCC Generic parse trees.
You should explain what practical project specific automatic checking you want to do. FWIW, MELT was exactly designed for such use cases (project specific checking).
Actually, if your code base is small, it might be perhaps easier to convert it (manually) to something else (perhaps your own C or C++ like DSL) and implement a translator from your language to C++; but to do that you need a computer science background
BTW, parsing is not the hardest problem (it is very complex, but compilers solved it). Analyzing your code might be harder (intractable), or undecidable (read about the halting problem). Both parsing and static program analysis require some expertise (this is why Computer Science degrees are useful for) - so get or buy some.
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.