CIVA-Lab/VulGNN

VulGNN

This repository contains the code for VulGNN's model and data processing pipeline.

Getting Started

Environment

The primary dependencies of this project are:

  • PyTorch Geometric (version 2.3.1)
  • PyTorch
  • NumPy
  • scikit-learn
  • Joern (we used version 2.0.107; other versions may work, but significantly older ones may fail to parse some samples)

The code targets a CUDA-enabled environment, but it can be adapted to run on CPU by removing the calls to "cuda()" or "to()" on tensors.
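
A common alternative to deleting these calls is to choose the device once and pass it to every "to()" call, so the same script runs on either CPU or GPU. The sketch below is only an illustration and is not code from this repository: pick_device, the Linear layer, and the tensor shapes are hypothetical placeholders.

```python
import torch


def pick_device(prefer_cuda: bool = True) -> torch.device:
    """Return a CUDA device when one is available and wanted, else the CPU."""
    if prefer_cuda and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device()

# Hypothetical stand-ins for the real model and input batch.
model = torch.nn.Linear(4, 2).to(device)
features = torch.randn(3, 4).to(device)

# The forward pass runs on whichever device was selected.
logits = model(features)
```

With this pattern, hard-coded "cuda()" calls would be replaced by "to(device)", and the model and its inputs always end up on the same device.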

Execution

  1. Extract the normalized Juliet source code from the archive in the Data directory
  2. Follow the data processing steps in the DataPreparation readme, starting at generate_cpgs.py. For our Juliet dataset, skip cpg_normalizer.py; for other datasets, run all necessary scripts. Change any in-code relative paths to match your own directory layout.
  3. Run main.py to start training

Repository Structure

  • Top level
    • main.py - Main training execution script
    • open_data.py - Functions used in main to load the dataset
    • network.py - Contains the GNN models
    • DataPreparation
      • Contains scripts used to prepare data for the model. See readme in directory for more information.
    • sent2vec
      • A dockerized version of sent2vec. It can be used to manually generate CPGs with sent2vec tokenization. This processing is not automated in any way; the container is purely an environment for running sent2vec.
    • Data
      • Contains an archive of our normalized subset of Juliet. It is the same as the one used in VulCNN.

Model Architecture

  • Five attentional message passing layers containing:
    • A 160-node message passing operation with 4 attention heads, using GeneralConv for most tests or RGATConv for heterogeneous data
    • Parametric ReLU with one learnable parameter per node, rather than a single parameter for the entire layer
    • Graph normalization
    • Dropout with a probability of 3.5%
  • Global mean pool
  • Dropout with a probability of 3.5%
  • A final linear layer with two nodes representing the binary classification

GeneralConv is configured with mean aggregation and dot-product attention, while RGATConv is configured with within-relation attention, F-scaled cardinality preservation, and concatenation disabled. Parameters not mentioned here or above are left at their defaults. Training uses weighted cross entropy loss (except in the weighted loss experiment), the Adam optimizer with default parameters, 350 epochs, and a batch size of 256.
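
This README does not state how the class weights for the weighted cross entropy loss are chosen. One common choice, shown here purely as an assumption, is inverse class frequency. The function name inverse_frequency_weights and the toy label list are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes=2):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for cls in range(num_classes):
        if counts[cls] == 0:
            raise ValueError(f"class {cls} has no samples")
        weights.append(total / (num_classes * counts[cls]))
    return weights


# Toy example: three non-vulnerable samples (0) and one vulnerable sample (1).
toy_labels = [0, 0, 0, 1]
class_weights = inverse_frequency_weights(toy_labels)
```

The resulting list can be converted to a tensor and passed as the weight argument of torch.nn.CrossEntropyLoss, so that the rarer class contributes more to the loss.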

Hardware Used for Training

The hardware details, notes, and timings below are provided to help estimate feasibility and training time on other hardware.

Hardware and Notes

  • One RTX 8000 (48GB PCIe)
    • Note that the heterogeneous network uses nearly all of this GPU's VRAM; it will not run on GPUs with less than ~44-48GB of VRAM without modifying the network.
  • Dual Xeon E5-2630 v3 (total of 16 cores)
    • Multicore performance is important in most data processing steps. The more cores you have, the faster most of these steps complete.
  • 540 GB of RAM
    • Relatively little system memory is used for this application.

Timings

Training the standard, 64-length, homogeneous network for 350 epochs takes about 70-80 minutes on this hardware with no other intensive software running concurrently.
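
For rough planning, the figures above can be converted into a per-epoch cost. The arithmetic below uses only the 70-80 minute and 350 epoch numbers stated here. The helper name seconds_per_epoch and the 100-epoch scenario are hypothetical, and the estimate assumes time scales linearly with the number of epochs on the same hardware.

```python
def seconds_per_epoch(total_minutes, epochs=350):
    """Average wall-clock seconds spent on one epoch."""
    return total_minutes * 60 / epochs


low = seconds_per_epoch(70)   # lower end of the reported range
high = seconds_per_epoch(80)  # upper end of the reported range

# Hypothetical shorter run of 100 epochs on the same hardware, in minutes.
short_run_minutes = (low * 100 / 60, high * 100 / 60)

print(f"{low:.1f}-{high:.1f} s per epoch")
print(f"100 epochs: {short_run_minutes[0]:.0f}-{short_run_minutes[1]:.0f} min")
```

This works out to roughly 12 to 14 seconds per epoch on the hardware listed above.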

Referencing This Work

If you use something from this repository or reference this work, please use the following citation:

@article{farmer2026software,
  title={Software Vulnerability Detection Using a Lightweight Graph Neural Network},
  author={Farmer, Miles and Ufuktepe, Ekincan and Watson, Anne and Carvalho, Hialo Muniz and Okun, Vadim and Maasaoui, Zineb and Palaniappan, Kannappan},
  journal={arXiv preprint arXiv:2603.29216},
  year={2026}
}

About

Code vulnerability detection with Graph Attention Networks