This repository contains the code for VulGNN's model and data processing pipeline.
The primary dependencies of this project are:
- PyTorch Geometric (version 2.3.1)
- PyTorch
- NumPy
- scikit-learn
- Joern (we used version 2.0.107; other versions may work, but significantly older ones may not parse some samples properly)
The code is built for a CUDA-enabled environment, but it can be adapted to run on CPU by removing the calls to "cuda()" or "to()" on tensors.
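As an alternative to deleting those calls, a common PyTorch pattern is to select the device once and pass it to "to()". The following is a minimal sketch of that pattern, not code from this repository; the names `device`, `model`, and `batch` are illustrative stand-ins.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the repository's network and an input batch.
model = torch.nn.Linear(64, 2).to(device)
batch = torch.randn(8, 64).to(device)  # replaces a hard-coded batch.cuda()

logits = model(batch)
print(logits.shape, logits.device)
```

With this pattern the same script runs unchanged on machines with or without a GPU.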
- Extract the normalized Juliet source code
- Follow data processing steps in the DataPreparation readme, starting at generate_cpgs.py and skipping cpg_normalizer.py (for our Juliet dataset - otherwise, run all necessary scripts). Make sure to change any in-code relative paths to the location you want.
- Run main.py to start training
- Top level
  - main.py - Main training execution script
  - open_data.py - Functions used in main to load the dataset
  - network.py - Contains the GNN models
- DataPreparation
  - Contains scripts used to prepare data for the model. See readme in directory for more information.
- sent2vec
  - A dockerized version of sent2vec. Can be used to (manually) generate CPGs using sent2vec tokenization. This processing is not automated in any way - the container is purely an environment to run sent2vec.
- Data
  - Contains an archive of our normalized subset of Juliet. It is the same as the one used in VulCNN.
- Five attentional message passing layers, each containing:
  - A 160-node message passing operation with 4 attention heads, using GeneralConv for most tests or RGATConv for heterogeneous data
  - Parametric ReLU with a learnable parameter for each node, as opposed to one for the entire layer
  - Graph normalization
  - Random 3.5% dropout
- Global mean pool
- Random 3.5% dropout
- A final linear layer with two nodes representing the binary classification
GeneralConv is configured with mean aggregation and dot-product attention, while RGATConv is configured with within-relation attention, F-scaled cardinality preservation, and concatenation disabled. Parameters not mentioned here or above are left at their defaults. Training uses weighted cross entropy loss (except in the weighted loss experiment), the Adam optimizer with default parameters, 350 epochs, and a batch size of 256.
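The training settings above map onto standard PyTorch objects roughly as in the sketch below. The class counts are placeholder values, and the inverse-frequency weighting is an assumption, since how the loss weights are derived is not stated here. The linear `model` is a stand-in for the actual network.

```python
import torch
from torch import nn

EPOCHS = 350
BATCH_SIZE = 256

# Hypothetical label counts (index 0 = non-vulnerable, 1 = vulnerable).
class_counts = torch.tensor([900.0, 100.0])
# Assumption: inverse-frequency class weights, so the rarer class counts more.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

model = nn.Linear(160, 2)  # stand-in for the GNN
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters())  # Adam with default parameters

# One optimization step on a random batch, to show how the pieces connect.
features = torch.randn(BATCH_SIZE, 160)
labels = torch.randint(0, 2, (BATCH_SIZE,))
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```

In a full run this step would be repeated over every batch of 256 graphs for each of the 350 epochs.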
We provide the following general hardware info, notes, and timings to allow for estimation of feasibility/timings on other hardware.
- One RTX 8000 (48GB PCIe)
  - Note that the heterogeneous network uses nearly the entire VRAM of this GPU - it will not function on GPUs with less than ~44-48GB VRAM without modification of the network.
- Dual Xeon E5-2630 v3 (total of 16 cores)
  - Multicore performance is important in most data processing steps. The more cores you have, the faster most of these steps complete.
- 540 GB of RAM
  - Relatively little system memory is used for this application.
Training the standard, 64-length, homogeneous network for 350 epochs takes about 70-80 minutes on this hardware with no other intensive software running concurrently.
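For rough planning, the figure above works out to about 12 to 14 seconds per epoch. The short sketch below does that arithmetic and scales it to a different epoch count, assuming time grows linearly with the number of epochs on the same hardware. The helper names are illustrative.

```python
EPOCHS = 350
LOW_MINUTES, HIGH_MINUTES = 70, 80  # reported range for the full run


def seconds_per_epoch(total_minutes, epochs=EPOCHS):
    """Average wall-clock seconds per epoch for a run of the given length."""
    return total_minutes * 60 / epochs


def estimate_minutes(epochs):
    """Estimated (low, high) minutes for another epoch count, assuming linear scaling."""
    low = seconds_per_epoch(LOW_MINUTES) * epochs / 60
    high = seconds_per_epoch(HIGH_MINUTES) * epochs / 60
    return low, high


low_s = seconds_per_epoch(LOW_MINUTES)
high_s = seconds_per_epoch(HIGH_MINUTES)
print(f"{low_s:.1f} to {high_s:.1f} seconds per epoch")

low_100, high_100 = estimate_minutes(100)
print(f"100 epochs: about {low_100:.0f} to {high_100:.0f} minutes")
```

Timings on other GPUs will differ, so these numbers are only a starting point for an estimate.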
If you use something from this repository or reference this work, please use the following citation:
@article{farmer2026software,
title={Software Vulnerability Detection Using a Lightweight Graph Neural Network},
author={Farmer, Miles and Ufuktepe, Ekincan and Watson, Anne and Carvalho, Hialo Muniz and Okun, Vadim and Maasaoui, Zineb and Palaniappan, Kannappan},
journal={arXiv preprint arXiv:2603.29216},
year={2026}
}