Network Intrusion Detection

By Eldar · Published June 29, 2026

Compares Logistic Regression, Random Forest, XGBoost, and LightGBM on the UNSW-NB15 dataset for binary network intrusion detection.

  • network-security
  • intrusion-detection
  • lightgbm
  • xgboost
  • random-forest
  • logistic-regression
13 cells · 1 experiment · 192 views · 0 forks

Inside this notebook

# Network Intrusion Detection: Lightweight Model Comparison on UNSW-NB15

This notebook compares four lightweight classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM) for binary network intrusion detection using the **UNSW-NB15** dataset (temporal train/test split).

**Dataset:** [`lacg030175/UNSW-NB15`](https://huggingface.co/datasets/lacg030175/UNSW-NB15) on Hugging Face, with 175,341 training samples, 82,332 test samples, and 44 columns (42 input features plus the `attack_cat` and `label` targets).

**Goal:** Classify network flows as Normal (0) or Attack (1), evaluate on precision, recall, F1-score, and ROC-AUC, and select the best-performing lightweight model for deployment.
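The four metrics named in the goal can be computed with `sklearn.metrics`. The following is a minimal sketch, not the notebook's own evaluation cell (which is not shown in this preview): the helper name `evaluate_binary`, the 0.5 threshold, and the toy labels and scores are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Hypothetical helper: precision, recall, F1 at `threshold`, plus ROC-AUC.

    `y_score` is the predicted probability of the Attack class (label 1).
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    # Threshold-dependent metrics need hard predictions
    y_pred = (y_score >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # ROC-AUC is threshold-free, so it takes the raw scores
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Toy labels and scores, invented for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0]
y_score_demo = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]
metrics_demo = evaluate_binary(y_true_demo, y_score_demo)
print(metrics_demo)
```

Precision, recall, and F1 depend on the chosen threshold, while ROC-AUC summarizes ranking quality across all thresholds, so the two groups can disagree about which model looks best.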

# Cell 2: Install and import all dependencies

%pip install -q datasets xgboost lightgbm imbalanced-learn

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import issparse

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
…
Note: you may need to restart the kernel to use updated packages.
✅ All imports successful
pandas 2.2.3 | numpy 2.1.2 | sklearn loaded | xgboost 3.2.0 | lightgbm 4.6.0
# Cell 3: Load UNSW-NB15 dataset from Hugging Face (temporal split)

print("Loading UNSW-NB15 dataset (temporal split)...")
ds = load_dataset("lacg030175/UNSW-NB15", "temporal")

X_train = ds["train"].to_pandas()
X_test = ds["test"].to_pandas()

print(f"\nTrain shape: {X_train.shape}")
print(f"Test shape:  {X_test.shape}")

print("\n--- Class distribution (label) ---")
print("Training set:")
print(X_train['label'].value_counts().rename({0: 'Normal', 1: 'Attack'}))
print(f"Attack ratio: {X_train['label'].mean():.2%}")

print("\nTest set:")
print(X_test['label'].value_counts().rename({0: 'Normal', 1: 'Attack'}))
…
Loading UNSW-NB15 dataset (temporal split)...

Train shape: (175341, 44)
Test shape:  (82332, 44)

--- Class distribution (label) ---
Training set:
label
Attack    119341
Normal     56000
Name: count, dtype: int64
Attack ratio: 68.06%

Test set:
label
Attack    45332
Normal    37000
Name: count, dtype: int64
Attack ratio: 55.06%

First 5 rows:

Column dtypes:
int64      29
float64    11
object      4
Name: count, dtype: int64
# Cell 4: EDA — class distribution, missing values, feature types

print("=" * 60)
print("CLASS DISTRIBUTION")
print("=" * 60)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

train_counts = X_train['label'].value_counts()
test_counts = X_test['label'].value_counts()

axes[0].pie(train_counts, labels=['Normal', 'Attack'], autopct='%1.1f%%',
            colors=['#2ecc71', '#e74c3c'], startangle=90, explode=(0.05, 0.05))
axes[0].set_title(f'Training Set (n={len(X_train):,})')

axes[1].pie(test_counts, labels=['Normal', 'Attack'], autopct='%1.1f%%',
            colors=['#2ecc71', '#e74c3c'], startangle=90, explode=(0.05, 0.05))
axes[1].set_title(f'Test Set (n={len(X_test):,})')
…
============================================================
CLASS DISTRIBUTION
============================================================

============================================================
MISSING VALUES
============================================================
✅ No missing values in either split!

============================================================
FEATURE TYPES
============================================================
Numeric columns:     40
Categorical columns: 4…
# Cell 5: Preprocessing — drop leakage cols, encode categoricals, scale numerics

from sklearn.preprocessing import StandardScaler

print("=" * 60)
print("PREPROCESSING PIPELINE")
print("=" * 60)

# Drop the row id (if present) and 'attack_cat', the multi-class attack label,
# which would leak the binary target
drop_cols = [c for c in ('id', 'attack_cat') if c in X_train.columns]

X_train_clean = X_train.drop(columns=drop_cols, errors='ignore')
X_test_clean = X_test.drop(columns=drop_cols, errors='ignore')

print(f"Dropped columns: {drop_cols}")
print(f"Train after drop: {X_train_clean.shape}")
…
============================================================
PREPROCESSING PIPELINE
============================================================
Dropped columns: ['attack_cat']
Train after drop: (175341, 43)
Test after drop:  (82332, 43)

Categorical to encode: ['proto', 'service', 'state']
Numeric features: 39

One-hot encoding categorical features...
One-hot columns: 155
Scaling numeric features with StandardScaler...

Final training features:   (175341, 194)
Final test features:       (82332,…
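
The preview truncates the encoding and scaling code, so the exact calls used are not visible. One way to one-hot encode `proto`, `service`, and `state` and standardize the numeric columns, fitting only on the training data, is a `ColumnTransformer`. This is a sketch under assumptions: the tiny frames, their values, and the numeric column names `dur` and `sbytes` are placeholders, and forcing dense output is a choice made here for easy inspection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in frames with invented values (assumption, not notebook data)
train_demo = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", "arp"],
    "service": ["http", "dns", "-", "-"],
    "state": ["FIN", "INT", "FIN", "INT"],
    "dur": [0.12, 0.01, 0.50, 0.00],
    "sbytes": [258, 114, 1024, 46],
})
test_demo = pd.DataFrame({
    "proto": ["tcp", "icmp"],  # 'icmp' never appears in train_demo
    "service": ["http", "dns"],
    "state": ["FIN", "INT"],
    "dur": [0.20, 0.03],
    "sbytes": [300, 60],
})

cat_cols = ["proto", "service", "state"]
num_cols = ["dur", "sbytes"]

preprocess = ColumnTransformer(
    [
        # Categories unseen during fit become all-zero rows instead of errors
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array in this small demo
)

# Fit on the training split only, then reuse the fitted statistics on test
X_train_demo = preprocess.fit_transform(train_demo)
X_test_demo = preprocess.transform(test_demo)

print(X_train_demo.shape, X_test_demo.shape)
```

Fitting the encoder and scaler on the training split alone keeps test-set statistics out of the features, and `handle_unknown="ignore"` guarantees both splits end up with the same number of columns even when the test split contains a category the training split never saw.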
# Cell 6: Stratified train/validation split (80/20)

from sklearn.model_selection import train_test_split

print("=" * 60)
print("TRAIN / VALIDATION SPLIT (Stratified)")
print("=" * 60)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_final, y_train,
    test_size=0.2,
    random_state=42,
    stratify=y_train
)

print(f"Training set:       {X_tr.shape[0]:,} samples")
print(f"  Normal: {np.sum(y_tr == 0):,} | Attack: {np.sum(y_tr == 1):,}")
print(f"Validation set:     {X_val.shape[0]:,} samples")
…
============================================================
TRAIN / VALIDATION SPLIT (Stratified)
============================================================
Training set:       140,272 samples
  Normal: 44,800 | Attack: 95,472
Validation set:     35,069 samples
  Normal: 11,200 | Attack: 23,869

Features: 194

Class ratio (neg/pos) for XGBoost scale_pos_weight: 0.4692

✅ Split complete
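
The `scale_pos_weight` value printed above is the ratio of negative (Normal) to positive (Attack) training samples. The arithmetic can be reproduced from the counts in the split output; the variable names below are illustrative, not taken from the notebook.

```python
# Class counts copied from the training-split output above
n_normal_train = 44_800  # label 0 (negative class)
n_attack_train = 95_472  # label 1 (positive class)

# neg/pos ratio, the value the notebook reports for XGBoost's scale_pos_weight
scale_pos_weight = n_normal_train / n_attack_train
print(f"{scale_pos_weight:.4f}")  # prints 0.4692
```

Because attacks are the majority class in this training split, the ratio is below 1, which down-weights the positive class rather than up-weighting it.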
# Cell 7: Train 4 lightweight models

import time
import warnings
warnings.filterwarnings('ignore')

print("=" * 60)
print("TRAINING LIGHTWEIGHT MODELS")
print("=" * 60)

models = {}
training_times = {}

# ── 1. Logistic Regression ──
print("\n[1/4] Logistic Regression (max_iter=1000, class_weight='balanced')...")
t0 = time.time()
lr = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42, n_jobs=-1)
lr.fit(X_tr, y_tr)
…
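
The rest of this cell is cut off in the preview. The pattern it starts (fit a model, record wall-clock time, store both in dictionaries) can be sketched as a loop. This is an illustration under assumptions, not the notebook's code: it uses synthetic data from `make_classification`, covers only the two scikit-learn models, and the Random Forest settings are placeholders.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic, mildly imbalanced stand-in data (assumption, not UNSW-NB15)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.3, 0.7], random_state=42
)

# Placeholder hyperparameters except for the LogisticRegression settings,
# which mirror the ones visible in the cell above
candidates = {
    "Logistic Regression": LogisticRegression(
        max_iter=1000, class_weight="balanced", random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    ),
}

models_demo = {}
training_times_demo = {}

for name, estimator in candidates.items():
    t0 = time.perf_counter()  # monotonic clock, suited to timing intervals
    estimator.fit(X_demo, y_demo)
    training_times_demo[name] = time.perf_counter() - t0
    models_demo[name] = estimator
    print(f"{name}: trained in {training_times_demo[name]:.3f}s")
```

Keeping the fitted estimators and their training times in parallel dictionaries makes it straightforward to tabulate speed against accuracy later.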
