From Source Code to Executable: a fully functional compiler implementing the Lexical Analysis, Syntax Analysis, Semantic Analysis, TAC Generation, Assembly, Linking, and Loading phases.
This compiler implements the classical compilation pipeline described in:
- Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2006). Compilers: Principles, Techniques, and Tools (2nd ed.). Addison-Wesley. (The "Dragon Book")
- Cooper, K. D., & Torczon, L. (2011). Engineering a Compiler (2nd ed.). Morgan Kaufmann.
- Appel, A. W. (2004). Modern Compiler Implementation in C. Cambridge University Press.
compiler_project/
│
├── Makefile ← Build system (run: make)
├── sample.src ← Sample source file to compile
│
├── include/ ← All header files (.h)
│ ├── Token.h ← Token types + Token struct (Phase 1)
│ ├── Lexer.h ← Lexical Analyser class (Phase 1)
│ ├── AST.h ← Abstract Syntax Tree node types (Phase 2)
│ ├── Parser.h ← Recursive descent Parser (Phase 2)
│ ├── SemanticAnalyzer.h ← Type checker + Symbol Table (Phase 3)
│ ├── CodeGen.h ← Three-Address Code generator (Phase 4)
│ ├── Assembler.h ← TAC → x86-like Assembly (Phase 5)
│ ├── Linker.h ← Object file linker (Phase 6)
│ └── Loader.h ← Virtual machine / executor (Phase 7)
│
├── src/ ← All implementation files (.cpp)
│ ├── main.cpp ← Driver: runs all phases in order
│ ├── Token.cpp ← Token type name helper
│ ├── AST.cpp ← AST node toString() methods
│ ├── Lexer.cpp ← Character-by-character scanner
│ ├── Parser.cpp ← Grammar rules → AST builder
│ ├── SemanticAnalyzer.cpp ← Scope/type checking + symbol table
│ ├── CodeGen.cpp ← AST → Three-Address Code
│ ├── Assembler.cpp ← TAC → Assembly mnemonics
│ ├── Linker.cpp ← Symbol resolution + code merging
│ └── Loader.cpp ← Load + execute instructions
│
└── output/ ← Created by make; holds .o files + binary
└── compiler ← The final executable
- g++ with C++17 support (GCC 7+ or Clang 5+)
- make
cd compiler_project
make
g++ -std=c++17 -o compiler.exe main.cpp Lexer.cpp Parser.cpp AST.cpp Token.cpp SemanticAnalyzer.cpp CodeGen.cpp Assembler.cpp Linker.cpp Loader.cpp
./compiler.exe
./output/compiler sample.src
make clean
Input: Raw source code string
Output: List of Token objects
The Lexer reads characters one by one and groups them into tokens (the smallest meaningful units). For example:
int x = 42 + y;
→ [INT,"int"] [IDENT,"x"] [EQUAL,"="] [NUMBER,"42"] [PLUS,"+"] [IDENT,"y"] [SEMI,";"]
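The scanning loop can be sketched roughly as follows. This is a minimal illustration, not the project's Lexer.cpp: the `Tok` struct and the `tokenize` function are hypothetical names, and only the token kinds from the example above are recognised.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token shape; the real Token.h may differ.
struct Tok {
    std::string type;
    std::string text;
};

// Scans the input one character at a time and groups characters into tokens.
std::vector<Tok> tokenize(const std::string& src) {
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }           // skip whitespace
        if (std::isdigit(c)) {                            // integer literal
            std::size_t start = i;
            while (i < src.size() &&
                   std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({"NUMBER", src.substr(start, i - start)});
        } else if (std::isalpha(c) || c == '_') {         // keyword or identifier
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) ||
                    src[i] == '_')) ++i;
            std::string word = src.substr(start, i - start);
            out.push_back({word == "int" ? "INT" : "IDENT", word});
        } else {                                          // single-character symbol
            std::string type = c == '=' ? "EQUAL"
                             : c == '+' ? "PLUS"
                             : c == ';' ? "SEMI" : "UNKNOWN";
            out.push_back({type, std::string(1, src[i])});
            ++i;
        }
    }
    return out;
}
```

Calling `tokenize("int x = 42 + y;")` on this sketch yields the seven tokens listed above.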
Input: List of tokens
Output: Abstract Syntax Tree (AST)
The Parser checks that tokens follow the grammar rules and builds a tree structure. For example:
int sum = x + y;
→ VarDeclNode("int","sum")
└─ BinaryOpNode("+")
├─ IdentifierNode("x")
└─ IdentifierNode("y")
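To illustrate the recursive descent idea, here is a small sketch that parses only `+` and `*` expressions over a pre-split token list. The names `Node`, `makeNode`, `MiniParser`, and `show` are assumptions made for this example; the project's Parser.h and AST.h define their own classes and a larger grammar.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type; leaves have null children.
struct Node {
    std::string label;
    std::unique_ptr<Node> left, right;
};

std::unique_ptr<Node> makeNode(std::string label,
                               std::unique_ptr<Node> l = nullptr,
                               std::unique_ptr<Node> r = nullptr) {
    auto n = std::make_unique<Node>();
    n->label = std::move(label);
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

class MiniParser {
public:
    explicit MiniParser(std::vector<std::string> toks) : toks_(std::move(toks)) {}

    // expr := term ('+' term)*
    std::unique_ptr<Node> parseExpr() {
        auto lhs = parseTerm();
        while (peek() == "+") {
            ++pos_;
            auto rhs = parseTerm();
            lhs = makeNode("+", std::move(lhs), std::move(rhs));
        }
        return lhs;
    }

private:
    // term := factor ('*' factor)*
    std::unique_ptr<Node> parseTerm() {
        auto lhs = parseFactor();
        while (peek() == "*") {
            ++pos_;
            auto rhs = parseFactor();
            lhs = makeNode("*", std::move(lhs), std::move(rhs));
        }
        return lhs;
    }

    // factor := identifier | number
    std::unique_ptr<Node> parseFactor() {
        if (pos_ >= toks_.size()) throw std::runtime_error("unexpected end of input");
        return makeNode(toks_[pos_++]);
    }

    std::string peek() const { return pos_ < toks_.size() ? toks_[pos_] : ""; }

    std::vector<std::string> toks_;
    std::size_t pos_ = 0;
};

// Renders the tree in prefix form, e.g. "(+ a (* b 2))".
std::string show(const Node& n) {
    if (!n.left) return n.label;
    return "(" + n.label + " " + show(*n.left) + " " + show(*n.right) + ")";
}
```

Because each grammar rule is its own function, operator precedence falls out of the call structure: `parseExpr` calls `parseTerm`, so `*` binds tighter than `+`.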
Input: AST
Output: Symbol Table + validation
Checks that the program is meaningful: reports variables used before declaration, type mismatches, and duplicate declarations. Builds a Symbol Table mapping each name to its type.
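A symbol table supporting these three checks could look roughly like the sketch below. It assumes a single flat scope and treats any difference between the declared type and the assigned type as a mismatch; both are simplifications for illustration, and the `SymbolTable` class and its method names are hypothetical rather than taken from SemanticAnalyzer.h.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class SymbolTable {
public:
    // Records a new name; reports an error if it was already declared.
    bool declare(const std::string& name, const std::string& type) {
        if (!types_.emplace(name, type).second) {
            errors_.push_back("duplicate declaration of '" + name + "'");
            return false;
        }
        return true;
    }

    // Returns the declared type, or "" (plus an error) if the name is unknown.
    std::string use(const std::string& name) {
        auto it = types_.find(name);
        if (it == types_.end()) {
            errors_.push_back("'" + name + "' used before declaration");
            return "";
        }
        return it->second;
    }

    // Assumed rule for this sketch: the value type must equal the declared type.
    bool checkAssign(const std::string& name, const std::string& valueType) {
        std::string declared = use(name);
        if (declared.empty()) return false;
        if (declared != valueType) {
            errors_.push_back("type mismatch: cannot assign " + valueType +
                              " to " + declared + " '" + name + "'");
            return false;
        }
        return true;
    }

    const std::vector<std::string>& errors() const { return errors_; }

private:
    std::unordered_map<std::string, std::string> types_;
    std::vector<std::string> errors_;
};
```

Collecting errors in a list instead of stopping at the first one lets the phase log every failed check in a single run.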
Input: AST
Output: Three-Address Code (TAC) instructions
Converts the tree into flat, simple instructions where each instruction refers to at most three addresses (two operands and one result):
t0 = 10
x = t0
t1 = x + y
sum = t1
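The listing above can be reproduced by a post-order walk over the expression tree that allocates a fresh temporary for every literal and every operator. The sketch below assumes exactly that policy; `Expr`, `leaf`, `bin`, and `TacGen` are hypothetical names and not the interface of CodeGen.h.

```cpp
#include <cctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression node: a leaf has no children.
struct Expr {
    std::string text;                 // operator, identifier, or literal
    std::shared_ptr<Expr> lhs, rhs;
};

std::shared_ptr<Expr> leaf(const std::string& text) {
    return std::make_shared<Expr>(Expr{text, nullptr, nullptr});
}

std::shared_ptr<Expr> bin(const std::string& op,
                          std::shared_ptr<Expr> l,
                          std::shared_ptr<Expr> r) {
    return std::make_shared<Expr>(Expr{op, std::move(l), std::move(r)});
}

class TacGen {
public:
    std::vector<std::string> code;    // emitted TAC, one instruction per entry

    // Emits code for `value`, then copies the result into `target`.
    void assign(const std::string& target, const Expr& value) {
        std::string result = gen(value);
        code.push_back(target + " = " + result);
    }

private:
    int next_ = 0;

    std::string newTemp() { return "t" + std::to_string(next_++); }

    // Returns the name that holds the value of `e` after the emitted code runs.
    std::string gen(const Expr& e) {
        if (e.lhs && e.rhs) {                         // binary operator
            std::string a = gen(*e.lhs);
            std::string b = gen(*e.rhs);
            std::string t = newTemp();
            code.push_back(t + " = " + a + " " + e.text + " " + b);
            return t;
        }
        if (!e.text.empty() &&
            std::isdigit(static_cast<unsigned char>(e.text[0]))) {
            std::string t = newTemp();                // literal goes into a temp
            code.push_back(t + " = " + e.text);
            return t;
        }
        return e.text;                                // identifier: use directly
    }
};
```

With this sketch, `assign("x", *leaf("10"))` followed by `assign("sum", *bin("+", leaf("x"), leaf("y")))` emits the four instructions shown above.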
Input: TAC instructions
Output: x86-like assembly mnemonics
Maps each TAC instruction to one or more assembly instructions using registers (EAX, EBX…) and opcodes (MOV, ADD, CMP, JE…).
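As a rough illustration of this mapping, the sketch below lowers one TAC instruction into a load, an optional arithmetic opcode, and a store, always going through EAX. The `Tac` struct, the `lower` function, and the bracketed `[name]` notation for variables are assumptions for this example; the listing produced by Assembler.cpp may look different, and division is left out because IDIV needs extra register setup.

```cpp
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical TAC shape: dst = a op b, or dst = a when op is empty.
struct Tac {
    std::string dst, a, op, b;
};

// Literals are used as immediates; names are treated as memory operands.
std::string operand(const std::string& s) {
    if (!s.empty() && std::isdigit(static_cast<unsigned char>(s[0]))) return s;
    return "[" + s + "]";
}

// Lowers one TAC instruction to a short sequence of assembly mnemonics.
std::vector<std::string> lower(const Tac& t) {
    static const std::map<std::string, std::string> opcodes = {
        {"+", "ADD"}, {"-", "SUB"}, {"*", "IMUL"}};

    std::vector<std::string> out;
    out.push_back("MOV EAX, " + operand(t.a));            // load first operand
    if (!t.op.empty()) {
        auto it = opcodes.find(t.op);
        if (it == opcodes.end())
            throw std::invalid_argument("unsupported operator: " + t.op);
        out.push_back(it->second + " EAX, " + operand(t.b)); // apply operator
    }
    out.push_back("MOV " + operand(t.dst) + ", EAX");     // store the result
    return out;
}
```

For example, `t1 = x + y` lowers to `MOV EAX, [x]`, `ADD EAX, [y]`, `MOV [t1], EAX` under these assumptions.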
Input: One or more "object files"
Output: Fully linked instruction stream with resolved symbols
Merges object files, injects the runtime library (print_val), assigns virtual addresses to all labels, and resolves cross-file symbol references.
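Symbol resolution of this kind is commonly done in two passes: first assign an address to every label, then patch every reference. The sketch below assumes addresses start at 0, that labels occupy no space, and that only `JMP` and `CALL` carry label operands; `Instr`, `Linked`, and `link` are hypothetical names, not the interface of Linker.h.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical instruction: op == "LABEL" marks a label definition named arg.
struct Instr {
    std::string op, arg;
};

struct Linked {
    std::vector<Instr> code;              // merged stream, labels removed
    std::map<std::string, int> symbols;   // label name -> virtual address
};

Linked link(const std::vector<std::vector<Instr>>& objects) {
    Linked out;

    // Pass 1: walk all object files in order and assign label addresses.
    int addr = 0;
    for (const auto& obj : objects) {
        for (const auto& ins : obj) {
            if (ins.op == "LABEL") {
                if (!out.symbols.emplace(ins.arg, addr).second)
                    throw std::runtime_error("duplicate symbol: " + ins.arg);
            } else {
                ++addr;                   // only real instructions take space
            }
        }
    }

    // Pass 2: copy instructions, replacing label operands with addresses.
    for (const auto& obj : objects) {
        for (const auto& ins : obj) {
            if (ins.op == "LABEL") continue;
            Instr resolved = ins;
            if (ins.op == "JMP" || ins.op == "CALL") {
                auto it = out.symbols.find(ins.arg);
                if (it == out.symbols.end())
                    throw std::runtime_error("unresolved symbol: " + ins.arg);
                resolved.arg = std::to_string(it->second);
            }
            out.code.push_back(resolved);
        }
    }
    return out;
}
```

Passing the runtime library as one more entry in `objects` is enough for a `CALL print_val` in user code to resolve, because pass 1 sees every label before pass 2 patches any reference.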
Input: Linked instructions
Output: Program output (runs the program!)
Simulates an OS loader placing the program in memory and a CPU running it via a fetch-decode-execute loop. Supports MOV, ADD, SUB, IMUL, IDIV, CMP, Jcc, CALL, RET.
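A fetch-decode-execute loop can be sketched in a few lines. The version below handles only a subset of the opcodes listed above (MOV, ADD, SUB, CMP, JMP, JE, JNE), models the flags as a single zero flag, and stops when the program counter runs past the last instruction; the `Instr` and `Cpu` types are hypothetical and not taken from Loader.h.

```cpp
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical decoded instruction: opcode plus up to two operands.
struct Instr {
    std::string op, a, b;
};

struct Cpu {
    std::map<std::string, int> reg;   // register file, e.g. reg["EAX"]
    bool zero = false;                // set by CMP when both operands are equal
    std::size_t pc = 0;               // program counter

    // An operand is either an integer literal or a register name.
    int value(const std::string& s) const {
        if (!s.empty() &&
            (std::isdigit(static_cast<unsigned char>(s[0])) || s[0] == '-'))
            return std::stoi(s);
        auto it = reg.find(s);
        return it == reg.end() ? 0 : it->second;
    }

    void run(const std::vector<Instr>& prog) {
        pc = 0;
        while (pc < prog.size()) {
            const Instr& in = prog[pc++];             // fetch, then advance pc
            if (in.op == "MOV")      reg[in.a] = value(in.b);
            else if (in.op == "ADD") reg[in.a] = value(in.a) + value(in.b);
            else if (in.op == "SUB") reg[in.a] = value(in.a) - value(in.b);
            else if (in.op == "CMP") zero = (value(in.a) == value(in.b));
            else if (in.op == "JMP") pc = std::stoul(in.a);
            else if (in.op == "JE")  { if (zero)  pc = std::stoul(in.a); }
            else if (in.op == "JNE") { if (!zero) pc = std::stoul(in.a); }
            else throw std::runtime_error("unknown opcode: " + in.op);
        }
    }
};
```

Jumps simply overwrite `pc`, so the next fetch continues at the target address. A program that never reaches the end would loop forever in this sketch; a step limit would be one way to guard against that.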
// Variable declaration
int x = 10;
float y = 3.14;
// Assignment
x = x + 1;
// Arithmetic: + - * /
int result = a + b * 2;
// Comparison: == != < > <= >=
if (x < y) {
print(x);
} else {
print(y);
}
// While loop
while (x > 0) {
print(x);
x = x - 1;
}
// Print
print(expression);
// Return
return value;
PHASE 1: Tokens printed in a table
PHASE 2: AST printed as indented tree
PHASE 3: Each semantic check logged + Symbol Table
PHASE 4: Numbered TAC instructions
PHASE 5: Assembly listing with comments
PHASE 6: Symbol map with virtual addresses
PHASE 7: Program output lines prefixed with >>>