Skip to content

Latest commit

 

History

History
409 lines (305 loc) · 11.3 KB

File metadata and controls

409 lines (305 loc) · 11.3 KB

Model Optimization Guide

This guide explains how to use cross-validation and hyperparameter tuning to optimize your Intrusion Detection System's Random Forest classifier for better performance.

Overview

Model optimization helps you:

  • Improve accuracy: Find the best hyperparameters for your specific dataset
  • Reduce overfitting: Validate model performance across multiple data splits
  • Increase reliability: Ensure consistent performance on unseen data
  • Optimize trade-offs: Balance between precision, recall, and computational cost

What is Cross-Validation?

Cross-validation is a technique to evaluate model performance by:

  1. Splitting the training data into K folds (typically 5)
  2. Training the model K times, each time using a different fold as validation
  3. Averaging the results to get a more reliable performance estimate

Benefits:

  • Detects overfitting (model memorizing training data)
  • Provides confidence intervals for performance metrics
  • Uses all data for both training and validation

What is Hyperparameter Tuning?

Hyperparameters are settings that control how the model learns. For Random Forest:

  • n_estimators: Number of trees in the forest
  • max_depth: Maximum depth of each tree
  • min_samples_split: Minimum samples required to split a node
  • min_samples_leaf: Minimum samples required at a leaf node
  • max_features: Number of features to consider for best split
  • bootstrap: Whether to use bootstrap sampling

Grid Search systematically tests all combinations to find the best settings.

Prerequisites

1. Complete Initial Training

You must first run the main notebook to:

  • Generate preprocessed data (train_processed.csv, test_processed.csv)
  • Train the baseline model (intrusion_detection_model_unsw.pkl)

2. Verify Setup

python3 test_setup.py

This checks that all required files and packages are available.

Running the Optimization

Step 1: Hyperparameter Tuning

python3 model_optimization.py

What it does:

  1. Loads the preprocessed training and test data
  2. Evaluates the baseline model performance
  3. Performs 5-fold cross-validation on the baseline
  4. Runs GridSearchCV to test parameter combinations
  5. Trains the optimized model with best parameters
  6. Compares baseline vs optimized performance
  7. Saves the optimized model and results

Expected runtime: 10-30 minutes (depends on dataset size and CPU)

Output:

INTRUSION DETECTION SYSTEM - MODEL OPTIMIZATION
================================================================

Step 1: Loading preprocessed data...
✓ Training data loaded: (X, Y)
✓ Test data loaded: (X, Y)

Step 2: Evaluating baseline model...
Baseline Model Performance:
  Accuracy:  0.9009
  Precision: 0.9876
  Recall:    0.7982
  F1-Score:  0.8829
  ROC-AUC:   0.9654

Step 3: Performing K-Fold Cross-Validation...
Cross-Validation Results:
  Mean CV Accuracy: 0.8995 (+/- 0.0123)

Step 4: Hyperparameter Tuning with GridSearchCV...
Parameter grid size: 288 combinations
[GridSearchCV progress output...]

Best Parameters:
  n_estimators: 200
  max_depth: 30
  min_samples_split: 2
  min_samples_leaf: 1
  max_features: sqrt
  bootstrap: True

Step 5: Evaluating optimized model...
Optimized Model Performance:
  Accuracy:  0.9156
  Precision: 0.9901
  Recall:    0.8301
  F1-Score:  0.9029
  ROC-AUC:   0.9721

Performance Improvements:
  Accuracy  : +1.47%
  Precision : +0.25%
  Recall    : +3.19%
  F1        : +2.00%
  Roc_auc   : +0.67%

✓ Optimized model saved
✓ Results saved

Step 2: Generate Visualizations

python3 model_optimization_visualizations.py

What it does:

  1. Loads both baseline and optimized models
  2. Generates 5 comparison visualizations
  3. Saves all charts as PNG files

Expected runtime: 2-5 minutes

Generated Outputs

Models

  1. intrusion_detection_model_optimized.pkl
    • Random Forest model with tuned hyperparameters
    • Ready for deployment
    • Better performance than baseline

Data Files

  1. optimization_results.csv

    • Comparison table with all metrics
    • Columns: Metric, Baseline, Optimized, Improvement (%)
    • Easy to import into Excel or other tools
  2. best_hyperparameters.csv

    • Optimal parameter values found by GridSearch
    • Use these settings for future training
    • Documents the model configuration

Visualizations

  1. optimization_metrics_comparison.png

    • Side-by-side bar chart
    • Compares all 5 metrics (accuracy, precision, recall, F1, ROC-AUC)
    • Shows exact values on each bar
    • Best for: Presentations, reports
  2. optimization_improvement.png

    • Horizontal bar chart
    • Shows percentage improvement for each metric
    • Green bars = improvement, Red bars = degradation
    • Best for: Quick assessment of optimization impact
  3. optimization_confusion_matrices.png

    • Two heatmaps side-by-side
    • Left: Baseline model confusion matrix
    • Right: Optimized model confusion matrix
    • Best for: Understanding prediction errors
  4. optimization_roc_curves.png

    • ROC curves for both models
    • Includes AUC scores
    • Diagonal line shows random classifier
    • Best for: Evaluating classification quality
  5. optimization_learning_curves.png

    • Two plots showing training vs validation scores
    • Left: Baseline model learning curve
    • Right: Optimized model learning curve
    • Shaded areas show standard deviation
    • Best for: Diagnosing overfitting/underfitting

Understanding the Results

Interpreting Metrics

Accuracy: Overall correctness

  • Higher is better
  • Can be misleading with imbalanced datasets

Precision: Of all predicted attacks, how many were real?

  • Important for minimizing false alarms
  • High precision = fewer false positives

Recall: Of all real attacks, how many did we detect?

  • Important for catching all threats
  • High recall = fewer missed attacks

F1-Score: Harmonic mean of precision and recall

  • Balances both metrics
  • Good overall performance indicator

ROC-AUC: Area under ROC curve

  • Measures classification quality across all thresholds
  • 1.0 = perfect, 0.5 = random

What to Look For

Good Signs:

  • ✓ All metrics improved or stayed the same
  • ✓ Cross-validation scores are consistent (low std dev)
  • ✓ Learning curves converge (training and validation close)
  • ✓ ROC-AUC close to 1.0

Warning Signs:

  • ⚠ Large gap between training and validation scores (overfitting)
  • ⚠ High variance in cross-validation scores (unstable model)
  • ⚠ Some metrics improved but others degraded significantly
  • ⚠ Very long training time with minimal improvement

Typical Improvements

Based on network intrusion detection, you might see:

  • Accuracy: +1-3% improvement
  • Recall: +2-5% improvement (better attack detection)
  • F1-Score: +1-3% improvement
  • ROC-AUC: +0.5-1% improvement

Even small improvements are valuable in security applications!

Use Cases

1. Production Deployment

Use the optimized model for:

  • Real-time intrusion detection
  • Integration with Snort IDS
  • REST API deployment
  • Automated threat response

2. Model Documentation

The results provide:

  • Proof of model validation
  • Performance benchmarks
  • Configuration documentation
  • Comparison baselines

3. Further Optimization

Use insights to:

  • Identify which metrics need improvement
  • Decide if more data is needed
  • Determine if feature engineering would help
  • Guide selection of alternative algorithms

4. Stakeholder Communication

Use visualizations to:

  • Demonstrate model improvements
  • Justify computational costs
  • Explain model reliability
  • Support deployment decisions

Troubleshooting

GridSearch takes too long

Solutions:

  1. Reduce parameter grid size (fewer values to test)
  2. Use RandomizedSearchCV instead of GridSearchCV
  3. Reduce cv parameter (use 3-fold instead of 5-fold)
  4. Use a smaller subset of training data for tuning

Memory errors during optimization

Solutions:

  1. Reduce n_estimators in parameter grid
  2. Limit max_depth values
  3. Use fewer cross-validation folds
  4. Process on a machine with more RAM

Minimal or no improvement

Possible reasons:

  1. Baseline model already well-tuned
  2. Dataset too small for complex models
  3. Features not informative enough
  4. Need different algorithm (try XGBoost, Neural Networks)

Next steps:

  • Try feature engineering
  • Collect more training data
  • Experiment with different algorithms
  • Consider ensemble methods

Optimized model performs worse

Possible causes:

  1. Overfitting to validation set
  2. Random variation in data splits
  3. Parameter grid doesn't include optimal values

Solutions:

  • Use nested cross-validation
  • Expand parameter search space
  • Increase cross-validation folds
  • Check for data leakage

Advanced Usage

Custom Parameter Grid

Edit model_optimization.py to test different parameters:

param_grid = {
    'n_estimators': [50, 100, 150, 200, 250, 300],
    'max_depth': [5, 10, 15, 20, 25, 30, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'class_weight': ['balanced', None]  # For imbalanced datasets
}

Using RandomizedSearchCV

For faster tuning with large parameter spaces:

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_grid,
    n_iter=50,  # Test 50 random combinations
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2,
    random_state=42
)

Optimizing for Different Metrics

Change the scoring parameter to optimize for specific goals:

# Optimize for recall (catch more attacks)
grid_search = GridSearchCV(..., scoring='recall')

# Optimize for precision (fewer false alarms)
grid_search = GridSearchCV(..., scoring='precision')

# Optimize for F1 (balance)
grid_search = GridSearchCV(..., scoring='f1')

# Optimize for ROC-AUC
grid_search = GridSearchCV(..., scoring='roc_auc')

Best Practices

  1. Always use cross-validation: Never tune on test data
  2. Document parameters: Save best_hyperparameters.csv with model
  3. Version models: Keep both baseline and optimized for comparison
  4. Monitor production: Track if optimized model performs as expected
  5. Retune periodically: As data changes, optimal parameters may change
  6. Consider trade-offs: Faster models vs more accurate models

Next Steps

After optimization, consider:

  1. Feature Selection: Use optimized model with reduced features
  2. Ensemble Methods: Combine multiple optimized models
  3. Multi-class Classification: Optimize for specific attack types
  4. Real-time Deployment: Deploy optimized model to production
  5. Continuous Learning: Retrain periodically with new data

References

Support

For issues or questions:

  1. Check this guide first
  2. Review the main README.md
  3. Check console output for error messages
  4. Verify all prerequisites are met

Happy Optimizing! 📊