Week 18 Lab: CSE-CIC-IDS2018 Dataset Analysis

Learning Outcomes

By the end of this lab, you will be able to:

Understand the structure and contents of the CSE-CIC-IDS2018 dataset
Analyze network traffic and system logs from a cloud environment
Detect various attack scenarios, such as infiltration and botnet activity
Use Splunk to analyze and visualize large-scale security datasets

Objective

This lab will provide hands-on experience with the CSE-CIC-IDS2018 dataset, a large-scale, diverse, and labeled dataset for intrusion detection. You will learn to analyze and detect various attacks in a cloud environment.

Dataset Overview

The CSE-CIC-IDS2018 dataset contains network traffic and system logs from a cloud-based infrastructure. It was created jointly by the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) and includes:

10 days of network activity, captured between 14/02/2018 and 02/03/2018 (see the Attack Timeline below)
Multiple attack scenarios: Brute Force, DoS, DDoS, Web Attacks, Infiltration, Botnet
80+ features extracted from network flows
Labeled data with attack types and timestamps
Total size: Approximately 16GB (compressed)

Dataset Location: /home/ubuntu/soc-training-program/Datasets/CSE-CIC-IDS2018/

Attack Timeline

| Date | Day | Attack Type | Time Window |
|------|-----|-------------|-------------|
| 14/02/2018 | Wednesday | Benign | All day |
| 15/02/2018 | Thursday | Benign | All day |
| 16/02/2018 | Friday | Benign | All day |
| 20/02/2018 | Tuesday | FTP-BruteForce | 10:00 - 11:00 |
| 20/02/2018 | Tuesday | SSH-Bruteforce | 14:00 - 15:00 |
| 21/02/2018 | Wednesday | DoS-GoldenEye | 10:00 - 11:00 |
| 21/02/2018 | Wednesday | DoS-Slowloris | 15:00 - 16:00 |
| 22/02/2018 | Thursday | DoS-SlowHTTPTest | 10:00 - 11:00 |
| 22/02/2018 | Thursday | DoS-Hulk | 15:00 - 16:00 |
| 23/02/2018 | Friday | DDoS-LOIC-HTTP | 10:00 - 11:00 |
| 23/02/2018 | Friday | DDoS-HOIC | 15:30 - 16:30 |
| 28/02/2018 | Wednesday | Infiltration | 14:00 - 17:00 |
| 01/03/2018 | Thursday | Botnet | 10:00 - 12:00 |
| 02/03/2018 | Friday | Web Attack - Brute Force | 10:00 - 11:00 |
| 02/03/2018 | Friday | Web Attack - XSS | 13:30 - 14:30 |
| 02/03/2018 | Friday | Web Attack - SQL Injection | 15:30 - 16:00 |
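
When you later check Splunk results against this timeline, it can help to have the windows in machine-readable form. The following is a minimal sketch, not part of the original lab: the function name `expected_attack` is hypothetical, and the windows are copied from the table above on the assumption that timestamps use the same dd/mm/YYYY convention.

```python
from datetime import datetime

# Attack windows copied from the timeline table (date, start, end, attack type).
ATTACK_WINDOWS = [
    ("20/02/2018", "10:00", "11:00", "FTP-BruteForce"),
    ("20/02/2018", "14:00", "15:00", "SSH-Bruteforce"),
    ("21/02/2018", "10:00", "11:00", "DoS-GoldenEye"),
    ("21/02/2018", "15:00", "16:00", "DoS-Slowloris"),
    ("22/02/2018", "10:00", "11:00", "DoS-SlowHTTPTest"),
    ("22/02/2018", "15:00", "16:00", "DoS-Hulk"),
    ("23/02/2018", "10:00", "11:00", "DDoS-LOIC-HTTP"),
    ("23/02/2018", "15:30", "16:30", "DDoS-HOIC"),
    ("28/02/2018", "14:00", "17:00", "Infiltration"),
    ("01/03/2018", "10:00", "12:00", "Botnet"),
    ("02/03/2018", "10:00", "11:00", "Web Attack - Brute Force"),
    ("02/03/2018", "13:30", "14:30", "Web Attack - XSS"),
    ("02/03/2018", "15:30", "16:00", "Web Attack - SQL Injection"),
]


def _parse(date_str, time_str):
    """Parse a dd/mm/YYYY date and HH:MM time into a datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%d/%m/%Y %H:%M")


def expected_attack(ts):
    """Return the attack type whose window contains ts, or 'Benign' if none does."""
    for date_str, start, end, name in ATTACK_WINDOWS:
        if _parse(date_str, start) <= ts <= _parse(date_str, end):
            return name
    return "Benign"


print(expected_attack(datetime(2018, 2, 28, 15, 0)))
print(expected_attack(datetime(2018, 2, 14, 12, 0)))
```

A helper like this is only as accurate as the table it encodes, so treat a mismatch with the Label field as something to investigate rather than as an error in the data.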

Prerequisites

Splunk instance running (from Module 3)
At least 30GB of free disk space
Basic understanding of cloud infrastructure
Wireshark installed
Python 3.x with pandas library (for data analysis)

Lab Duration

Approximately 5-6 hours

Part 1: Dataset Download and Preparation (45 minutes)

Step 1: Download the CSE-CIC-IDS2018 Dataset

Navigate to the dataset directory:

cd /home/ubuntu/soc-training-program/Datasets/CSE-CIC-IDS2018/

Read the README for download instructions:
```
cat README.md
```
Download the dataset from the official source:
- Official URL: https://www.unb.ca/cic/datasets/ids-2018.html
- Download both CSV files and PCAPs (if available)
Verify the download:
```
ls -lh
du -sh *
```

Step 2: Understand the Dataset Structure

The dataset is organized by date and attack type:

CSE-CIC-IDS2018/
├── Processed Traffic Data for ML Algorithms/
│   ├── Friday-02-03-2018_TrafficForML_CICFlowMeter.csv
│   ├── Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv
│   ├── Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv
│   ├── Friday-23-02-2018_TrafficForML_CICFlowMeter.csv
│   ├── Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv
│   ├── Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv
│   ├── Tuesday-20-02-2018_TrafficForML_CICFlowMeter.csv
│   ├── Friday-16-02-2018_TrafficForML_CICFlowMeter.csv
│   ├── Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv
│   └── Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv
└── README.md

Step 3: Examine the CSV Structure

View the first few lines of a CSV file:

head -20 "Processed Traffic Data for ML Algorithms/Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv"

Count the number of rows:

wc -l "Processed Traffic Data for ML Algorithms/"*.csv

Get column names:

head -1 "Processed Traffic Data for ML Algorithms/Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv" | tr ',' '\n' | nl

Key Features in the Dataset:

Flow identifiers: Dst Port, Protocol, Timestamp, Flow ID
Packet statistics: Tot Fwd Pkts, Tot Bwd Pkts, TotLen Fwd Pkts, TotLen Bwd Pkts
Timing features: Flow Duration, Flow IAT Mean, Flow IAT Std, Flow IAT Max, Flow IAT Min
Flag counts: FIN Flag Cnt, SYN Flag Cnt, RST Flag Cnt, PSH Flag Cnt, ACK Flag Cnt
Packet length statistics: Fwd Pkt Len Max, Fwd Pkt Len Min, Fwd Pkt Len Mean, Fwd Pkt Len Std
Label: Attack type or "Benign"
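
The same inspection can be scripted. Below is a minimal sketch using only the Python standard library; the sample rows are synthetic and invented for illustration (they are not taken from the dataset), and `summarize` is a hypothetical helper name. For a real file, pass an open file handle instead of the in-memory sample.

```python
import csv
import io
from collections import Counter

# Synthetic sample using column names from the feature list above.
SAMPLE = """Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,Label
443,6,14/02/2018 08:31:01,112641,3,0,Benign
22,6,20/02/2018 14:01:10,958,1,1,SSH-Bruteforce
22,6,20/02/2018 14:01:11,1021,1,1,SSH-Bruteforce
"""


def summarize(handle):
    """Return the column names and a count of rows per Label value."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames)
    labels = Counter(row["Label"] for row in reader)
    return columns, labels


columns, labels = summarize(io.StringIO(SAMPLE))
print(columns)
print(labels.most_common())
```

Because `csv.DictReader` streams rows one at a time, this approach should also cope with the multi-gigabyte files without loading them into memory.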

Part 2: Importing Data into Splunk (60 minutes)

Step 4: Prepare Data for Splunk Ingestion

Create a new index in Splunk for the CSE-CIC-IDS2018 dataset:
- Log in to Splunk
- Go to Settings → Indexes → New Index
- Index name: cse_cic_ids_2018
- Max size: 50GB
- Click Save

Optional: Combine all CSV files into one (for easier ingestion):

cd "Processed Traffic Data for ML Algorithms/"

# Extract header from first file
head -1 Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv > combined_dataset.csv

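# Append all data (skip headers); the pattern excludes combined_dataset.csv itself,
# which a plain *.csv glob would match and append to itself
for file in *_TrafficForML_CICFlowMeter.csv; do
    tail -n +2 "$file" >> combined_dataset.csv
done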

# Check the combined file
wc -l combined_dataset.csv
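
If you prefer to combine the files in Python, the sketch below does the same job with two extra safeguards: it never reads the output file as an input, and it drops any line identical to the header. It assumes every input file shares the same header; `combine_csvs` is a hypothetical helper, and the demo files are synthetic.

```python
import tempfile
from pathlib import Path


def combine_csvs(input_paths, output_path):
    """Concatenate CSV files, writing the header once. Returns the data row count."""
    output_path = Path(output_path)
    paths = sorted(Path(p) for p in input_paths)
    header = None
    rows_written = 0
    with output_path.open("w", newline="") as out:
        for path in paths:
            if path.resolve() == output_path.resolve():
                continue  # never append the output file to itself
            with path.open(newline="") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # last line of a file may lack a newline
                    if header is None:
                        header = line
                        out.write(line)
                    elif line != header:
                        out.write(line)
                        rows_written += 1
    return rows_written


# Demo on two small synthetic files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.csv").write_text("Dst Port,Label\n80,Benign\n22,SSH-Bruteforce\n")
    (tmp / "b.csv").write_text("Dst Port,Label\n21,FTP-BruteForce")
    out_file = tmp / "combined_dataset.csv"
    rows = combine_csvs(tmp.glob("*.csv"), out_file)
    combined_lines = out_file.read_text().splitlines()

print(rows)
print(combined_lines)
```

If the files turn out to have different column sets, concatenating them this way would misalign fields, so compare the headers first.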

Step 5: Upload Data to Splunk

Method 1: Web UI Upload (for smaller files)

Go to Settings → Add Data → Upload
Select the CSV file
Set Source Type:
- Source type: csv
- Delimiter: , (comma)
- Header: Check "Extract field names"
Input Settings:
- Index: cse_cic_ids_2018
- Source type: cse_cic_ids_2018_csv
Click Submit

Method 2: Splunk Forwarder (for large files)

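Copy the CSV files to a directory that Splunk will monitor. Avoid /opt/splunk/var/spool/splunk/: Splunk treats that directory as a batch (sinkhole) input and deletes files after indexing them with default settings. The path below is only an example; any directory readable by the Splunk user works:
```
sudo mkdir -p /opt/splunk-data/cse_cic_ids_2018
sudo cp *.csv /opt/splunk-data/cse_cic_ids_2018/
```

Configure inputs.conf:

sudo nano /opt/splunk/etc/system/local/inputs.conf

Add the following:

[monitor:///opt/splunk-data/cse_cic_ids_2018/*.csv]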
disabled = false
index = cse_cic_ids_2018
sourcetype = csv

Restart Splunk:
```
sudo /opt/splunk/bin/splunk restart
```

Step 6: Verify Data Ingestion

Search for the ingested data:
```
index=cse_cic_ids_2018
| stats count
```

Check the time range:

index=cse_cic_ids_2018
| stats min(_time) as earliest, max(_time) as latest
| eval earliest=strftime(earliest, "%Y-%m-%d %H:%M:%S"), latest=strftime(latest, "%Y-%m-%d %H:%M:%S")

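Verify labels. Note the exact spelling and capitalization of each value, because later searches filter on Label and must match it exactly: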

index=cse_cic_ids_2018
| stats count by Label
| sort -count

Part 3: Analyzing Infiltration Attack (90 minutes)

Exercise 1: Understanding the Infiltration Scenario

Attack Description: The infiltration attack simulates an Advanced Persistent Threat (APT) scenario where an attacker:

Gains initial access through a vulnerability
Establishes persistence
Performs reconnaissance
Moves laterally within the network
Exfiltrates data

Date: Wednesday, 28/02/2018, 14:00 - 17:00

Step 7: Filter Infiltration Traffic in Splunk

Basic infiltration query:

index=cse_cic_ids_2018 Label="Infiltration"
| table _time, "Src IP", "Dst IP", "Src Port", "Dst Port", Protocol, "Flow Duration", "Tot Fwd Pkts", "Tot Bwd Pkts"

Identify the attacker's IP:

index=cse_cic_ids_2018 Label="Infiltration"
| stats count by "Src IP"
| sort -count

Identify targeted systems:

index=cse_cic_ids_2018 Label="Infiltration"
| stats count by "Dst IP"
| sort -count

Analyze the timeline:

index=cse_cic_ids_2018 Label="Infiltration"
| timechart span=5m count by "Src IP"

Step 8: Identify Initial Compromise

Find the first infiltration event:

index=cse_cic_ids_2018 Label="Infiltration"
| sort _time
| head 10
| table _time, "Src IP", "Dst IP", "Dst Port", Protocol

Analyze port usage:

index=cse_cic_ids_2018 Label="Infiltration"
| stats count by "Dst Port"
| sort -count

Questions to Answer:

What is the attacker's IP address?
What was the first port targeted?
What protocol was used for initial access?
How long did the infiltration last?

Step 9: Detect Lateral Movement

Identify connections between internal hosts:

index=cse_cic_ids_2018 Label="Infiltration"
| where match('Src IP', "^192\.168\.") AND match('Dst IP', "^192\.168\.")
| stats count by "Src IP", "Dst IP"
| sort -count

Visualize lateral movement:

index=cse_cic_ids_2018 Label="Infiltration"
| stats count by "Src IP", "Dst IP"
| where count > 10
| eval connection='Src IP' . " -> " . 'Dst IP'
| table connection, count

Create a network diagram:
- Use the Splunk Force Directed app (if available)
- Or export data and visualize with Gephi/Cytoscape
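
If you export the flows, internal-to-internal pairs can also be counted in Python. This is a sketch under stated assumptions: the addresses are made up, `is_internal` and `internal_pairs` are hypothetical helpers, and "internal" is taken to mean the three RFC 1918 private ranges. Checking all three matters because a `^192\.168\.` pattern would miss hosts in 10.0.0.0/8 or 172.16.0.0/12.

```python
import ipaddress
from collections import Counter

# RFC 1918 private ranges; adjust to the address plan you actually observe.
PRIVATE_NETS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(ip):
    """Return True if ip falls inside one of the private ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_NETS)


def internal_pairs(flows):
    """Count flows per (src, dst) pair where both ends are internal."""
    return Counter(
        (src, dst) for src, dst in flows if is_internal(src) and is_internal(dst)
    )


# Made-up flows for illustration only.
flows = [
    ("192.168.1.10", "192.168.1.20"),
    ("192.168.1.10", "192.168.1.20"),
    ("172.31.69.25", "172.31.69.28"),
    ("192.168.1.10", "8.8.8.8"),
]
pairs = internal_pairs(flows)
for (src, dst), count in pairs.most_common():
    print(f"{src} -> {dst}: {count}")
```

The resulting (source, destination, count) triples map directly onto an edge list, which is the input format Gephi and Cytoscape expect.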

Step 10: Detect Data Exfiltration

Find large data transfers:

index=cse_cic_ids_2018 Label="Infiltration"
| eval total_bytes='TotLen Fwd Pkts' + 'TotLen Bwd Pkts'
| where total_bytes > 1000000
| table _time, "Src IP", "Dst IP", total_bytes
| sort -total_bytes

Identify outbound connections:

index=cse_cic_ids_2018 Label="Infiltration"
| where NOT match('Dst IP', "^192\.168\.")
| stats sum("TotLen Fwd Pkts") as bytes_sent by "Src IP", "Dst IP"
| sort -bytes_sent
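
The aggregation behind the outbound query can be sketched in Python as follows. The flows are synthetic, `outbound_bytes` is a hypothetical helper, and the internal network defaults to 192.168.0.0/16 only to mirror the query above; change it if your data uses another range.

```python
import ipaddress
from collections import defaultdict


def outbound_bytes(flows, internal_net="192.168.0.0/16"):
    """Sum forward bytes per (src, dst) for flows leaving the internal network."""
    net = ipaddress.ip_network(internal_net)
    totals = defaultdict(int)
    for src, dst, fwd_bytes in flows:
        src_inside = ipaddress.ip_address(src) in net
        dst_inside = ipaddress.ip_address(dst) in net
        if src_inside and not dst_inside:
            totals[(src, dst)] += fwd_bytes
    # Largest senders first, like `sort -bytes_sent`.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Synthetic flows: (source IP, destination IP, forward bytes).
flows = [
    ("192.168.1.10", "203.0.113.5", 1_000_000),
    ("192.168.1.10", "203.0.113.5", 500_000),
    ("192.168.1.11", "198.51.100.7", 2_000),
    ("192.168.1.10", "192.168.1.20", 9_000_000),
]
ranked = outbound_bytes(flows)
for (src, dst), total in ranked:
    print(f"{src} -> {dst}: {total} bytes")
```

Summing across flows is the important step: exfiltration split over many small flows would never cross a per-flow size threshold.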

Deliverable: Create a timeline of the infiltration attack with key events:

Initial compromise
Reconnaissance activities
Lateral movement
Data exfiltration

Part 4: Analyzing Botnet Activity (90 minutes)

Exercise 2: Detecting Botnet C2 Communication

Attack Description: The botnet scenario includes infected hosts communicating with a Command and Control (C2) server.

Date: Thursday, 01/03/2018, 10:00 - 12:00

Step 11: Identify Botnet Traffic Patterns

Filter botnet traffic:

index=cse_cic_ids_2018 Label="Bot"
| table _time, "Src IP", "Dst IP", "Dst Port", Protocol, "Flow Duration", "Flow IAT Mean"

Identify potential C2 servers:

index=cse_cic_ids_2018 Label="Bot"
| stats count, dc("Src IP") as unique_sources by "Dst IP"
| where unique_sources > 5
| sort -count

Analyze beaconing behavior:

index=cse_cic_ids_2018 Label="Bot"
| stats avg("Flow IAT Mean") as avg_interval, stdev("Flow IAT Mean") as stdev_interval by "Src IP", "Dst IP"
| where stdev_interval < 1000
| sort avg_interval

Beaconing Characteristics:

Regular, periodic connections
Low standard deviation in connection intervals
Small packet sizes
Long-lived connections
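
The first two characteristics can be made concrete with a small calculation. The sketch below scores regularity as the coefficient of variation (standard deviation divided by mean) of the gaps between connection start times. This is one possible measure, not the lab's prescribed method; `beacon_score` is a hypothetical helper and the timestamps are synthetic. A relative measure has the advantage of not depending on the time unit, unlike a fixed cutoff on the standard deviation.

```python
from statistics import mean, pstdev


def beacon_score(timestamps):
    """Return (mean interval, coefficient of variation) for connection start times.

    Returns None when there are too few connections to judge regularity.
    """
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(intervals) < 2:
        return None
    avg = mean(intervals)
    if avg == 0:
        return None
    return avg, pstdev(intervals) / avg


# Synthetic connection start times in seconds.
regular = [0, 60, 120, 181, 240, 300]      # roughly one connection per minute
irregular = [0, 5, 90, 95, 400, 410]       # bursty, human-like activity

regular_avg, regular_cv = beacon_score(regular)
irregular_avg, irregular_cv = beacon_score(irregular)
print(f"regular:   mean={regular_avg:.1f}s cv={regular_cv:.3f}")
print(f"irregular: mean={irregular_avg:.1f}s cv={irregular_cv:.3f}")
```

A coefficient of variation near zero suggests machine-driven periodic traffic, while values around or above one are more typical of interactive use. Any cutoff between the two is a tuning decision for your data.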

Step 12: Analyze Botnet Communication Patterns

Packet size analysis:

index=cse_cic_ids_2018 Label="Bot"
| stats avg("Fwd Pkt Len Mean") as avg_fwd_size, avg("Bwd Pkt Len Mean") as avg_bwd_size by "Src IP"
| eval size_ratio=avg_fwd_size/avg_bwd_size
| table "Src IP", avg_fwd_size, avg_bwd_size, size_ratio

Connection duration analysis:

index=cse_cic_ids_2018 Label="Bot"
| stats avg("Flow Duration") as avg_duration, max("Flow Duration") as max_duration by "Src IP"
| sort -avg_duration

Protocol analysis:

index=cse_cic_ids_2018 Label="Bot"
| stats count by Protocol, "Dst Port"
| sort -count

Step 13: Create a Botnet Detection Dashboard

Create a Splunk dashboard with the following panels:

Botnet Activity Timeline:

index=cse_cic_ids_2018 Label="Bot"
| timechart span=10m count

Top Infected Hosts:

index=cse_cic_ids_2018 Label="Bot"
| stats count by "Src IP"
| sort -count
| head 10

C2 Servers:

index=cse_cic_ids_2018 Label="Bot"
| stats count, dc("Src IP") as infected_hosts by "Dst IP"
| sort -infected_hosts

Beaconing Visualization:

index=cse_cic_ids_2018 Label="Bot"
| timechart span=1m count by "Src IP" limit=5

Deliverable: Screenshot of your botnet detection dashboard

Part 5: Comparative Analysis of Attack Types (60 minutes)

Exercise 3: Compare Multiple Attack Scenarios

Create a comparison query:

index=cse_cic_ids_2018
| eval attack_category=case(
    Label="Benign", "Benign",
    Label LIKE "%Web Attack%", "Web Attack",
    Label LIKE "%DDoS%", "DDoS",
    Label LIKE "%DoS%", "Denial of Service",
    Label LIKE "%Brute%", "Brute Force",
    Label="Infiltration", "Infiltration",
    Label="Bot", "Botnet",
    1=1, "Other"
)
| stats count, avg("Flow Duration") as avg_duration, avg("Tot Fwd Pkts") as avg_fwd_pkts, avg("Tot Bwd Pkts") as avg_bwd_pkts by attack_category
| sort -count
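
Because `case()` returns the first branch that matches, rule order matters whenever one pattern is contained in another: "DDoS" contains "DoS", and "Web Attack - Brute Force" contains "Brute". The Python sketch below illustrates the same first-match logic with the more specific patterns listed first; `categorize` is a hypothetical helper and the labels are the ones shown in the Attack Timeline.

```python
# Substring rules checked in order; more specific patterns must come first.
RULES = [
    ("Web Attack", "Web Attack"),
    ("DDoS", "DDoS"),
    ("DoS", "Denial of Service"),
    ("Brute", "Brute Force"),
]

# Labels that are matched exactly rather than by substring.
EXACT = {"Benign": "Benign", "Infiltration": "Infiltration", "Bot": "Botnet"}


def categorize(label):
    """Map a Label value to an attack category using first-match semantics."""
    if label in EXACT:
        return EXACT[label]
    for needle, category in RULES:
        if needle in label:
            return category
    return "Other"


for label in ["DDoS-HOIC", "DoS-Hulk", "Web Attack - Brute Force", "SSH-Bruteforce"]:
    print(f"{label} -> {categorize(label)}")
```

If the "DoS" rule were placed before "DDoS", every DDoS label would be reported as Denial of Service and the DDoS category would stay empty.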

Analyze packet characteristics by attack type:

index=cse_cic_ids_2018
| stats avg("Fwd Pkt Len Mean") as avg_fwd_len, avg("Bwd Pkt Len Mean") as avg_bwd_len, avg("Flow IAT Mean") as avg_iat by Label
| sort Label

Create a feature comparison table:

index=cse_cic_ids_2018
| stats 
    avg("Flow Duration") as avg_duration,
    avg("Tot Fwd Pkts") as avg_fwd_pkts,
    avg("Tot Bwd Pkts") as avg_bwd_pkts,
    avg("Flow Byts/s") as avg_bytes_per_sec,
    avg("Flow Pkts/s") as avg_pkts_per_sec
    by Label
| sort Label

Exercise 4: Build a Machine Learning Model (Optional)

If you have Python with scikit-learn installed:

Export data from Splunk:

index=cse_cic_ids_2018
| fields "Flow Duration", "Tot Fwd Pkts", "Tot Bwd Pkts", "Flow Byts/s", "Flow Pkts/s", Label
| outputlookup cse_cic_ids_2018_ml_data.csv

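Train a simple classifier. Note that outputlookup writes the CSV to a lookups directory under the Splunk installation, so copy it into your working directory first: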

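import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load data
data = pd.read_csv("cse_cic_ids_2018_ml_data.csv")

# Select the feature columns explicitly so any extra exported fields are ignored
features = ["Flow Duration", "Tot Fwd Pkts", "Tot Bwd Pkts", "Flow Byts/s", "Flow Pkts/s"]
data = data[features + ["Label"]].copy()

# Rate features can contain non-numeric, NaN, or infinite values
# (for example on zero-duration flows), which scikit-learn rejects
data[features] = data[features].apply(pd.to_numeric, errors="coerce")
data = data.replace([np.inf, -np.inf], np.nan).dropna()

# Prepare features and labels
X = data[features]
y = data["Label"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))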

Part 6: Creating Detection Rules (45 minutes)

Step 14: Develop Splunk Alerts

Infiltration Detection Alert:

index=cse_cic_ids_2018
| where match('Src IP', "^192\.168\.") AND NOT match('Dst IP', "^192\.168\.")
| stats sum("TotLen Fwd Pkts") as bytes_sent by "Src IP"
| where bytes_sent > 10000000
| eval alert_type="Potential Data Exfiltration"

Save as alert: "Potential Data Exfiltration"
Trigger: When results > 0
Schedule: Every 5 minutes

Botnet Beaconing Detection Alert:

index=cse_cic_ids_2018
| stats count, avg("Flow IAT Mean") as avg_interval, stdev("Flow IAT Mean") as stdev_interval by "Src IP", "Dst IP"
| where count > 20 AND stdev_interval < 1000
| eval alert_type="Potential Botnet Beaconing"

Save as alert: "Botnet Beaconing Detected"
Trigger: When results > 0
Schedule: Every 10 minutes

Brute Force Detection Alert:

index=cse_cic_ids_2018 "Dst Port" IN (21, 22, 3389)
| stats count by "Src IP", "Dst IP", "Dst Port"
| where count > 50
| eval alert_type="Brute Force Attack"

Save as alert: "Brute Force Attack Detected"
Trigger: When results > 0
Schedule: Every 5 minutes
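
The detection logic of the brute force alert is a simple count-and-threshold, which can be sketched in Python to test thresholds offline before saving the alert. The flows are synthetic and `brute_force_candidates` is a hypothetical helper; the ports and the threshold of 50 are taken from the query above.

```python
from collections import Counter

# Ports from the alert query: FTP, SSH, RDP.
AUTH_PORTS = {21, 22, 3389}


def brute_force_candidates(flows, threshold=50):
    """Return (src, dst, port) triples with more than `threshold` flows."""
    counts = Counter(
        (src, dst, port) for src, dst, port in flows if port in AUTH_PORTS
    )
    return {key: n for key, n in counts.items() if n > threshold}


# Synthetic flows: 60 SSH attempts from one host, plus unrelated traffic.
flows = [("192.168.1.50", "192.168.1.20", 22)] * 60
flows += [("192.168.1.51", "192.168.1.20", 22)] * 5
flows += [("192.168.1.50", "192.168.1.20", 80)] * 100

alerts = brute_force_candidates(flows)
for (src, dst, port), n in alerts.items():
    print(f"{src} -> {dst}:{port} ({n} flows)")
```

Note that the count only means something relative to the search time range: 50 connections in five minutes is suspicious, while 50 over a full day may be routine administration.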

Deliverables

Submit the following:

Lab Report (Markdown or PDF):
- Analysis of infiltration attack with timeline
- Analysis of botnet activity with C2 identification
- Comparison of attack types
- Answers to all questions
Splunk Dashboards:
- Infiltration attack dashboard
- Botnet detection dashboard
- Overall security monitoring dashboard
- Export as XML or screenshots
Detection Rules:
- All Splunk queries and alerts
- Explanation of detection logic
IOC List:
- Malicious IP addresses
- C2 servers
- Attack signatures
Optional: ML Model Results:
- Classification report
- Confusion matrix
- Feature importance analysis

Evaluation Criteria

Completeness: Did you complete all exercises?
Technical Accuracy: Are your analyses correct?
Detection Rules: Are your Splunk queries effective?
Documentation: Is your report well-organized?
Dashboards: Are your dashboards informative?
Insights: Did you provide meaningful insights?

Additional Resources

Next Steps

After completing this lab, you will have experience analyzing sophisticated attacks in a cloud environment. In the next module, you will learn about threat hunting and proactive security monitoring techniques.
