Ashitosh0302/CompileX

🧠 C++ Compiler: Complete 7-Stage Pipeline Execution

Full pipeline: Source → Lexer → Parser → Semantic → TAC → Assembly → Linker → Loader

From Source Code to Executable — A fully functional compiler implementing Lexical, Syntax, Semantic, TAC Generation, Assembly, Linking, and Loading phases.


📚 Context

This compiler implements the classical compilation pipeline described in:

  • Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2006). Compilers: Principles, Techniques, and Tools (2nd ed.). Addison-Wesley. (The "Dragon Book")
  • Cooper, K. D., & Torczon, L. (2011). Engineering a Compiler (2nd ed.). Morgan Kaufmann.
  • Appel, A. W. (2004). Modern Compiler Implementation in C. Cambridge University Press.

Directory Structure

compiler_project/
│
├── Makefile              ← Build system (run: make)
├── sample.src            ← Sample source file to compile
│
├── include/              ← All header files (.h)
│   ├── Token.h           ← Token types + Token struct (Phase 1)
│   ├── Lexer.h           ← Lexical Analyser class (Phase 1)
│   ├── AST.h             ← Abstract Syntax Tree node types (Phase 2)
│   ├── Parser.h          ← Recursive descent Parser (Phase 2)
│   ├── SemanticAnalyzer.h← Type checker + Symbol Table (Phase 3)
│   ├── CodeGen.h         ← Three-Address Code generator (Phase 4)
│   ├── Assembler.h       ← TAC → x86-like Assembly (Phase 5)
│   ├── Linker.h          ← Object file linker (Phase 6)
│   └── Loader.h          ← Virtual machine / executor (Phase 7)
│
├── src/                  ← All implementation files (.cpp)
│   ├── main.cpp          ← Driver: runs all phases in order
│   ├── Token.cpp         ← Token type name helper
│   ├── AST.cpp           ← AST node toString() methods
│   ├── Lexer.cpp         ← Character-by-character scanner
│   ├── Parser.cpp        ← Grammar rules → AST builder
│   ├── SemanticAnalyzer.cpp ← Scope/type checking + symbol table
│   ├── CodeGen.cpp       ← AST → Three-Address Code
│   ├── Assembler.cpp     ← TAC → Assembly mnemonics
│   ├── Linker.cpp        ← Symbol resolution + code merging
│   └── Loader.cpp        ← Load + execute instructions
│
└── output/               ← Created by make; holds .o files + binary
    └── compiler          ← The final executable

How to Build and Run

Prerequisites

  • g++ with C++17 support (GCC 7+ or Clang 5+)
  • make

Build

cd compiler_project
make

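Run with built-in sample program

Running the compiler with no arguments uses the built-in sample. To build without make, run the following from the directory containing the .cpp files (add an -I flag pointing at include/ if the headers are not found):

g++ -std=c++17 -o compiler.exe main.cpp Lexer.cpp Parser.cpp AST.cpp Token.cpp SemanticAnalyzer.cpp CodeGen.cpp Assembler.cpp Linker.cpp Loader.cpp

./compiler.exe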

Run with your own source file

./output/compiler sample.src

Clean build artifacts

make clean

The 7 Compilation Phases — What Each Does

Phase 1: Lexical Analysis (Lexer.h / Lexer.cpp)

Input: Raw source code string
Output: List of Token objects

The Lexer reads characters one by one and groups them into tokens (the smallest meaningful units). For example:

int x = 42 + y;
→ [INT,"int"] [IDENT,"x"] [EQUAL,"="] [NUMBER,"42"] [PLUS,"+"] [IDENT,"y"] [SEMI,";"]
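A minimal sketch of this character-by-character grouping is given here. It is not the project's Lexer class: the names TokKind, Tok, and tokenize are hypothetical, and only the token kinds needed for the example line are covered.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; the project's Token.h may name these differently.
enum class TokKind { Int, Ident, Number, Equal, Plus, Semi, Unknown };

struct Tok {
    TokKind kind;
    std::string text;
};

// Scan the input one character at a time and group characters into tokens.
std::vector<Tok> tokenize(const std::string& src) {
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }          // whitespace separates tokens
        if (std::isalpha(c) || c == '_') {               // keyword or identifier
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
            std::string word = src.substr(start, i - start);
            out.push_back({word == "int" ? TokKind::Int : TokKind::Ident, word});
        } else if (std::isdigit(c)) {                    // integer literal
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({TokKind::Number, src.substr(start, i - start)});
        } else {                                         // single-character symbol
            TokKind k = c == '=' ? TokKind::Equal
                      : c == '+' ? TokKind::Plus
                      : c == ';' ? TokKind::Semi
                                 : TokKind::Unknown;
            out.push_back({k, std::string(1, src[i])});
            ++i;
        }
    }
    return out;
}
```

Calling tokenize("int x = 42 + y;") yields seven tokens in the same order as the listing above. A fuller lexer would also track line numbers and handle multi-character operators such as <= and ==.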

Phase 2: Syntax Analysis (Parser.h / Parser.cpp + AST.h / AST.cpp)

Input: List of tokens
Output: Abstract Syntax Tree (AST)

The Parser checks that tokens follow the grammar rules and builds a tree structure. For example:

int sum = x + y;
→ VarDeclNode("int","sum")
       └─ BinaryOpNode("+")
              ├─ IdentifierNode("x")
              └─ IdentifierNode("y")
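The recursive descent idea can be sketched for arithmetic expressions as follows. This is an illustration rather than the project's Parser: Node, show, and ExprParser are hypothetical names, and to stay short the sketch reads characters directly instead of consuming a token list and does no error reporting.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical AST node: a leaf (identifier or number) or a binary operator.
struct Node {
    std::string text;                 // operator symbol or leaf text
    std::unique_ptr<Node> lhs, rhs;   // both null for leaves
};

// Render the tree in prefix form, e.g. (+ a (* b 2)).
std::string show(const Node& n) {
    if (!n.lhs) return n.text;
    return "(" + n.text + " " + show(*n.lhs) + " " + show(*n.rhs) + ")";
}

class ExprParser {
public:
    explicit ExprParser(std::string s) : src_(std::move(s)) {}

    // expr := term (('+' | '-') term)*
    std::unique_ptr<Node> parseExpr() {
        auto left = parseTerm();
        while (peek() == '+' || peek() == '-') {
            std::string op(1, src_[pos_++]);
            left = makeBinary(op, std::move(left), parseTerm());
        }
        return left;
    }

private:
    // term := primary (('*' | '/') primary)*  (binds tighter than + and -)
    std::unique_ptr<Node> parseTerm() {
        auto left = parsePrimary();
        while (peek() == '*' || peek() == '/') {
            std::string op(1, src_[pos_++]);
            left = makeBinary(op, std::move(left), parsePrimary());
        }
        return left;
    }

    // primary := identifier | number
    std::unique_ptr<Node> parsePrimary() {
        skipSpaces();
        std::size_t start = pos_;
        while (pos_ < src_.size() &&
               std::isalnum(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        auto n = std::make_unique<Node>();
        n->text = src_.substr(start, pos_ - start);
        return n;
    }

    static std::unique_ptr<Node> makeBinary(std::string op, std::unique_ptr<Node> l,
                                            std::unique_ptr<Node> r) {
        auto n = std::make_unique<Node>();
        n->text = std::move(op);
        n->lhs = std::move(l);
        n->rhs = std::move(r);
        return n;
    }

    void skipSpaces() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }

    // Look at the next non-space character without consuming it.
    char peek() { skipSpaces(); return pos_ < src_.size() ? src_[pos_] : '\0'; }

    std::string src_;
    std::size_t pos_ = 0;
};
```

Each grammar rule becomes one function, and operator precedence falls out of which rule calls which: parseExpr calls parseTerm, so a + b * 2 groups as a + (b * 2).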

Phase 3: Semantic Analysis (SemanticAnalyzer.h / .cpp)

Input: AST
Output: Symbol Table + validation

Checks that the program is meaningful, reporting variables used before declaration, type mismatches, and duplicate declarations. Along the way it builds a Symbol Table that maps each name to its type.
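The three checks can be sketched with a single-scope symbol table. SymbolTable and its methods are hypothetical names for illustration, and the strict type-equality rule is an assumption; the project's analyzer may allow conversions such as int to float and may support nested scopes.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical single-scope symbol table: maps each name to its declared type.
class SymbolTable {
public:
    // Record a declaration; report an error if the name already exists.
    bool declare(const std::string& name, const std::string& type) {
        if (!types_.emplace(name, type).second) {
            errors_.push_back("duplicate declaration of '" + name + "'");
            return false;
        }
        return true;
    }

    // Return the type of a name, or "" (plus an error) if it was never declared.
    std::string lookup(const std::string& name) {
        auto it = types_.find(name);
        if (it == types_.end()) {
            errors_.push_back("use of undeclared variable '" + name + "'");
            return "";
        }
        return it->second;
    }

    // Assumption: an assignment is valid only when both types are identical.
    bool checkAssign(const std::string& name, const std::string& valueType) {
        std::string declared = lookup(name);
        if (declared.empty()) return false;
        if (declared != valueType) {
            errors_.push_back("type mismatch: cannot assign " + valueType +
                              " to " + declared + " '" + name + "'");
            return false;
        }
        return true;
    }

    const std::vector<std::string>& errors() const { return errors_; }

private:
    std::unordered_map<std::string, std::string> types_;
    std::vector<std::string> errors_;
};
```

Collecting errors in a list instead of stopping at the first one lets a single run report every problem in the source file.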

Phase 4: Intermediate Code Generation (CodeGen.h / .cpp)

Input: AST
Output: Three-Address Code (TAC) instructions

Converts the tree into a flat sequence of simple instructions, each with at most three addresses (one result and up to two operands):

t0 = 10
x  = t0
t1 = x + y
sum = t1
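One way to produce a listing like this is a post-order walk that gives each literal and each operator result a fresh temporary. The sketch below assumes that scheme; Expr, leaf, bin, and TacGen are hypothetical names, not the project's CodeGen API.

```cpp
#include <cctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression tree: a leaf (name or literal) or a binary operator.
struct Expr {
    std::string text;
    std::shared_ptr<Expr> lhs, rhs;   // both null for leaves
};

std::shared_ptr<Expr> leaf(std::string text) {
    return std::make_shared<Expr>(Expr{std::move(text), nullptr, nullptr});
}

std::shared_ptr<Expr> bin(std::string op, std::shared_ptr<Expr> l, std::shared_ptr<Expr> r) {
    return std::make_shared<Expr>(Expr{std::move(op), std::move(l), std::move(r)});
}

class TacGen {
public:
    // Emit code for e and return the name that holds its value.
    std::string gen(const Expr& e) {
        if (!e.lhs) {
            bool isLiteral = !e.text.empty() &&
                             std::isdigit(static_cast<unsigned char>(e.text[0]));
            if (!isLiteral) return e.text;        // identifiers are used directly
            std::string t = newTemp();            // literals are loaded into a temp
            code_.push_back(t + " = " + e.text);
            return t;
        }
        std::string a = gen(*e.lhs);              // operands first (post-order)
        std::string b = gen(*e.rhs);
        std::string t = newTemp();
        code_.push_back(t + " = " + a + " " + e.text + " " + b);
        return t;
    }

    // Emit code for "target = e".
    void assign(const std::string& target, const Expr& e) {
        std::string value = gen(e);
        code_.push_back(target + " = " + value);
    }

    const std::vector<std::string>& code() const { return code_; }

private:
    std::string newTemp() { return "t" + std::to_string(next_++); }

    std::vector<std::string> code_;
    int next_ = 0;
};
```

Under these assumptions, assigning the literal 10 to x and then x + y to sum emits the same four instructions as the listing above.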

Phase 5: Assembly (Assembler.h / .cpp)

Input: TAC instructions
Output: x86-like assembly mnemonics

Maps each TAC instruction to one or more assembly instructions using registers (EAX, EBX…) and opcodes (MOV, ADD, CMP, JE…).
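The mapping can be sketched as a load, operate, store pattern through a single register. The Tac layout, the bracket notation for variables, and the function names are assumptions made for illustration; the project's Assembler may allocate registers and format operands differently. Division is left out because IDIV needs extra setup.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical TAC record: dst = lhs op rhs, or dst = lhs when op is empty.
struct Tac {
    std::string dst, lhs, op, rhs;
};

// Literals become immediates; names become memory operands like [x].
std::string operand(const std::string& s) {
    bool isLiteral = !s.empty() && std::isdigit(static_cast<unsigned char>(s[0]));
    return isLiteral ? s : "[" + s + "]";
}

// Lower one TAC instruction to x86-like mnemonics using EAX as scratch.
std::vector<std::string> lower(const Tac& t) {
    std::vector<std::string> out;
    out.push_back("MOV EAX, " + operand(t.lhs));              // load first operand
    if (!t.op.empty()) {
        std::string opcode;
        if (t.op == "+") opcode = "ADD";
        else if (t.op == "-") opcode = "SUB";
        else if (t.op == "*") opcode = "IMUL";
        else throw std::invalid_argument("unsupported operator: " + t.op);
        out.push_back(opcode + " EAX, " + operand(t.rhs));    // apply the operator
    }
    out.push_back("MOV " + operand(t.dst) + ", EAX");         // store the result
    return out;
}
```

For example, t1 = x + y lowers to MOV EAX, [x] followed by ADD EAX, [y] and MOV [t1], EAX, while a plain copy such as x = 10 needs only the two MOV instructions.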

Phase 6: Linking (Linker.h / .cpp)

Input: One or more "object files"
Output: Fully linked instruction stream with resolved symbols

Merges object files, injects the runtime library (print_val), assigns virtual addresses to all labels, and resolves cross-file symbol references.
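Address assignment and symbol resolution are commonly done in two passes, which the following sketch illustrates. Instr, Linked, and linkObjects are hypothetical names, an instruction's address is assumed to be its index in the merged stream, and the runtime library is modelled as just another object file that defines print_val.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical instruction: an optional label defined here, an opcode,
// and an optional symbol it refers to (for example the target of CALL).
struct Instr {
    std::string label;
    std::string op;
    std::string target;
    std::size_t address = 0;   // filled in by the linker when target is set
};

struct Linked {
    std::vector<Instr> code;                      // merged instruction stream
    std::map<std::string, std::size_t> symbols;   // label -> virtual address
};

Linked linkObjects(const std::vector<std::vector<Instr>>& objects) {
    Linked out;
    // Pass 1: merge all objects and give every label the index of its instruction.
    for (const auto& obj : objects) {
        for (const auto& ins : obj) {
            if (!ins.label.empty() &&
                !out.symbols.emplace(ins.label, out.code.size()).second) {
                throw std::runtime_error("duplicate symbol: " + ins.label);
            }
            out.code.push_back(ins);
        }
    }
    // Pass 2: replace each symbolic reference with the address found in pass 1.
    for (auto& ins : out.code) {
        if (ins.target.empty()) continue;
        auto it = out.symbols.find(ins.target);
        if (it == out.symbols.end()) {
            throw std::runtime_error("undefined symbol: " + ins.target);
        }
        ins.address = it->second;
    }
    return out;
}
```

Two passes are needed because a reference may appear before its definition, as when main calls print_val from an object file that is merged later.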

Phase 7: Loading & Execution (Loader.h / .cpp)

Input: Linked instructions
Output: Program output (runs the program!)

Simulates an OS loader placing the program in memory and a CPU running it via a fetch-decode-execute loop. Supports MOV, ADD, SUB, IMUL, IDIV, CMP, Jcc, CALL, RET.
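A fetch-decode-execute loop can be sketched with a small register machine. The instruction set here is a simplified, hypothetical subset (immediate moves, register arithmetic, compare, a jump-if-not-equal, and a print operation), not the project's Loader or its exact opcodes.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical opcodes, loosely modelled on MOV, ADD, SUB, CMP, and JNE.
enum class Op { MovImm, Add, Sub, CmpImm, Jne, Print, Halt };

struct Ins {
    Op op;
    int a;   // register index, or jump target for Jne
    int b;   // register index or immediate, depending on op
};

// Run the program and return every value it printed.
std::vector<int> run(const std::vector<Ins>& program) {
    std::array<int, 4> reg{};     // four general-purpose registers, zeroed
    bool zeroFlag = false;        // set by CmpImm, read by Jne
    std::vector<int> output;
    std::size_t pc = 0;           // program counter

    while (pc < program.size()) {
        const Ins ins = program[pc++];                       // fetch, then advance
        switch (ins.op) {                                    // decode
            case Op::MovImm: reg.at(ins.a) = ins.b; break;   // execute
            case Op::Add:    reg.at(ins.a) += reg.at(ins.b); break;
            case Op::Sub:    reg.at(ins.a) -= reg.at(ins.b); break;
            case Op::CmpImm: zeroFlag = (reg.at(ins.a) == ins.b); break;
            case Op::Jne:    if (!zeroFlag) pc = static_cast<std::size_t>(ins.a); break;
            case Op::Print:  output.push_back(reg.at(ins.a)); break;
            case Op::Halt:   return output;
        }
    }
    return output;
}
```

A countdown loop equivalent to while (x > 0) { print(x); x = x - 1; } with x starting at 3 can be encoded as MovImm r0 3, MovImm r1 1, Print r0, Sub r0 r1, CmpImm r0 0, Jne 2, Halt. Running it collects the values 3, 2, and 1.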


Mini-Language Syntax

// Variable declaration
int  x = 10;
float y = 3.14;

// Assignment
x = x + 1;

// Arithmetic: + - * /
int result = a + b * 2;

// Comparison: == != < > <= >=
if (x < y) {
    print(x);
} else {
    print(y);
}

// While loop
while (x > 0) {
    print(x);
    x = x - 1;
}

// Print
print(expression);

// Return
return value;

Expected Output (built-in sample)

PHASE 1: Tokens printed in a table
PHASE 2: AST printed as indented tree
PHASE 3: Each semantic check logged + Symbol Table
PHASE 4: Numbered TAC instructions
PHASE 5: Assembly listing with comments
PHASE 6: Symbol map with virtual addresses
PHASE 7: Program output lines prefixed with >>>
