Network Intrusion Detection
By Eldar · Published June 29, 2026
Compares Logistic Regression, Random Forest, XGBoost, and LightGBM on the UNSW-NB15 dataset for binary network intrusion detection.
- network-security
- intrusion-detection
- lightgbm
- xgboost
- random-forest
- logistic-regression
Inside this notebook
# Network Intrusion Detection: Lightweight Model Comparison on UNSW-NB15

This notebook compares four lightweight classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM) for binary network intrusion detection on the **UNSW-NB15** dataset, using its temporal train/test split.

**Dataset:** [`lacg030175/UNSW-NB15`](https://huggingface.co/datasets/lacg030175/UNSW-NB15) on Hugging Face: 175,341 training samples, 82,332 test samples, and 44 columns (including the `label` target and the `attack_cat` attack category).

**Goal:** Classify network flows as Normal (0) or Attack (1), evaluate each model on precision, recall, F1-score, and ROC-AUC, and select the best-performing lightweight model for deployment.
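The four metrics named in the goal can be computed with scikit-learn roughly as follows. This is a minimal sketch on made-up toy labels and scores, not results from this notebook; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy ground truth and predicted attack probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.3])

# Threshold at 0.5 (an assumed default) to get hard predictions.
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
    "roc_auc": roc_auc_score(y_true, y_score),     # uses scores, not hard labels
}
```

Precision, recall, and F1 depend on the chosen threshold, while ROC-AUC is computed from the raw scores and summarizes ranking quality across all thresholds.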
# Cell 2: Install and import all dependencies
%pip install -q datasets xgboost lightgbm imbalanced-learn
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import issparse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
…
Note: you may need to restart the kernel to use updated packages.
✅ All imports successful
pandas 2.2.3 | numpy 2.1.2 | sklearn loaded | xgboost 3.2.0 | lightgbm 4.6.0
# Cell 3: Load UNSW-NB15 dataset from Hugging Face (temporal split)
print("Loading UNSW-NB15 dataset (temporal split)...")
ds = load_dataset("lacg030175/UNSW-NB15", "temporal")
X_train = ds["train"].to_pandas()
X_test = ds["test"].to_pandas()
print(f"\nTrain shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
print("\n--- Class distribution (label) ---")
print("Training set:")
print(X_train['label'].value_counts().rename({0: 'Normal', 1: 'Attack'}))
print(f"Attack ratio: {X_train['label'].mean():.2%}")
print("\nTest set:")
print(X_test['label'].value_counts().rename({0: 'Normal', 1: 'Attack'}))
…Loading UNSW-NB15 dataset (temporal split)...

Train shape: (175341, 44)
Test shape: (82332, 44)

--- Class distribution (label) ---
Training set:
label
Attack    119341
Normal     56000
Name: count, dtype: int64
Attack ratio: 68.06%

Test set:
label
Attack    45332
Normal    37000
Name: count, dtype: int64
Attack ratio: 55.06%

First 5 rows:

Column dtypes:
int64      29
float64    11
object      4
Name: count, dtype: int64
# Cell 4: EDA — class distribution, missing values, feature types
print("=" * 60)
print("CLASS DISTRIBUTION")
print("=" * 60)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
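# value_counts() sorts by frequency (Attack first here), so sort by class id
# to make the order match labels=['Normal', 'Attack'] in the pie calls below.
train_counts = X_train['label'].value_counts().sort_index()
test_counts = X_test['label'].value_counts().sort_index()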
axes[0].pie(train_counts, labels=['Normal', 'Attack'], autopct='%1.1f%%',
colors=['#2ecc71', '#e74c3c'], startangle=90, explode=(0.05, 0.05))
axes[0].set_title(f'Training Set (n={len(X_train):,})')
axes[1].pie(test_counts, labels=['Normal', 'Attack'], autopct='%1.1f%%',
colors=['#2ecc71', '#e74c3c'], startangle=90, explode=(0.05, 0.05))
axes[1].set_title(f'Test Set (n={len(X_test):,})')
…============================================================
CLASS DISTRIBUTION
============================================================
============================================================
MISSING VALUES
============================================================
✅ No missing values in either split!
============================================================
FEATURE TYPES
============================================================
Numeric columns: 40
Categorical columns: 4…
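Because `value_counts()` orders its result by frequency rather than by class id, positional labels such as `['Normal', 'Attack']` only line up with the counts if the result is sorted by index first. The sketch below shows the difference on a toy label column; the `names` mapping is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy label column with more attacks (1) than normal flows (0).
labels = pd.Series([1, 1, 1, 0, 0], name="label")

by_freq = labels.value_counts()                # most frequent class first
by_class = labels.value_counts().sort_index()  # class 0 first, then class 1

# Map class ids to display names instead of relying on position.
names = {0: "Normal", 1: "Attack"}
pie_labels = [names[c] for c in by_class.index]
```

Deriving the labels from the index keeps the chart correct even if the class balance flips between splits.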
# Cell 5: Preprocessing — drop leakage cols, encode categoricals, scale numerics
from sklearn.preprocessing import StandardScaler
print("=" * 60)
print("PREPROCESSING PIPELINE")
print("=" * 60)
# Drop the row identifier (if present) and attack_cat, the multi-class attack
# label, which would leak the binary target into the features.
drop_cols = ['id'] if 'id' in X_train.columns else []
if 'attack_cat' in X_train.columns:
    drop_cols.append('attack_cat')
X_train_clean = X_train.drop(columns=drop_cols, errors='ignore')
X_test_clean = X_test.drop(columns=drop_cols, errors='ignore')
print(f"Dropped columns: {drop_cols}")
print(f"Train after drop: {X_train_clean.shape}")
…============================================================
PREPROCESSING PIPELINE
============================================================
Dropped columns: ['attack_cat']
Train after drop: (175341, 43)
Test after drop: (82332, 43)
Categorical to encode: ['proto', 'service', 'state']
Numeric features: 39
One-hot encoding categorical features...
One-hot columns: 155
Scaling numeric features with StandardScaler...
Final training features: (175341, 194)
Final test features: (82332,…
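The encoding and scaling code is cut off in this preview, so the following is only a sketch of one common way to get matching train and test matrices, not necessarily what the notebook does. It assumes `pd.get_dummies` plus `reindex` for column alignment and a `StandardScaler` fitted on the training split only; the toy frames and column values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for the cleaned splits (values are illustrative).
train = pd.DataFrame({"proto": ["tcp", "udp", "tcp"], "dur": [0.1, 2.0, 0.5]})
test = pd.DataFrame({"proto": ["tcp", "arp"], "dur": [1.0, 0.3]})

cat_cols, num_cols = ["proto"], ["dur"]

# One-hot encode, then align the test columns to the training columns:
# categories unseen in training ("arp") are dropped, missing ones become 0.
train_ohe = pd.get_dummies(train[cat_cols], dtype=float)
test_ohe = pd.get_dummies(test[cat_cols], dtype=float).reindex(
    columns=train_ohe.columns, fill_value=0.0
)

# Fit the scaler on training data only, then reuse it for the test split.
scaler = StandardScaler().fit(train[num_cols])
train_num = pd.DataFrame(scaler.transform(train[num_cols]), columns=num_cols)
test_num = pd.DataFrame(scaler.transform(test[num_cols]), columns=num_cols)

train_final = pd.concat([train_num, train_ohe], axis=1)
test_final = pd.concat([test_num, test_ohe], axis=1)
```

Fitting both the encoder columns and the scaler statistics on the training split avoids leaking test-set information and guarantees both matrices have the same width.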
# Cell 6: Stratified train/validation split (80/20)
from sklearn.model_selection import train_test_split
print("=" * 60)
print("TRAIN / VALIDATION SPLIT (Stratified)")
print("=" * 60)
X_tr, X_val, y_tr, y_val = train_test_split(
X_train_final, y_train,
test_size=0.2,
random_state=42,
stratify=y_train
)
print(f"Training set: {X_tr.shape[0]:,} samples")
print(f" Normal: {np.sum(y_tr == 0):,} | Attack: {np.sum(y_tr == 1):,}")
print(f"Validation set: {X_val.shape[0]:,} samples")
…============================================================
TRAIN / VALIDATION SPLIT (Stratified)
============================================================
Training set: 140,272 samples
Normal: 44,800 | Attack: 95,472
Validation set: 35,069 samples
Normal: 11,200 | Attack: 23,869
Features: 194
Class ratio (neg/pos) for XGBoost scale_pos_weight: 0.4692
✅ Split complete
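The numbers in this output can be checked with a short sketch: stratification preserves the attack share in both parts, and the reported class ratio is the negative count divided by the positive count (44,800 / 95,472 ≈ 0.4692). The code below uses synthetic labels with the same class counts as the training split and a placeholder feature matrix, so only the label arithmetic is meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the training split above.
y = np.array([0] * 56_000 + [1] * 119_341)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification keeps the attack share (about 68%) the same in both parts.
attack_share_tr = y_tr.mean()
attack_share_val = y_val.mean()

# Negative-to-positive ratio, a typical starting value for XGBoost's
# scale_pos_weight parameter.
scale_pos_weight = np.sum(y_tr == 0) / np.sum(y_tr == 1)
```

A ratio below 1 reflects that attacks are the majority class in this split, so the positive class is down-weighted rather than up-weighted.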
# Cell 7: Train 4 lightweight models
import time
import warnings
warnings.filterwarnings('ignore')
print("=" * 60)
print("TRAINING LIGHTWEIGHT MODELS")
print("=" * 60)
models = {}
training_times = {}
# ── 1. Logistic Regression ──
print("\n[1/4] Logistic Regression (max_iter=1000, class_weight='balanced')...")
t0 = time.time()
lr = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42, n_jobs=-1)
lr.fit(X_tr, y_tr)
…
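The rest of the training cell is not shown in this preview. As a rough sketch of the pattern it starts (fit each model, record the wall-clock time in `training_times`), the loop below times two scikit-learn estimators on synthetic data. The `candidates` dictionary, the synthetic dataset, and the Random Forest settings are assumptions for illustration; XGBoost and LightGBM classifiers expose the same `fit()` interface and would slot in the same way.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem standing in for the preprocessed features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hypothetical candidate set (not the notebook's exact configuration).
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

models, training_times = {}, {}
for name, estimator in candidates.items():
    t0 = time.perf_counter()
    models[name] = estimator.fit(X, y)  # fit() returns the fitted estimator
    training_times[name] = time.perf_counter() - t0
```

Keeping the fitted models and their timings in parallel dictionaries makes it straightforward to tabulate accuracy against training cost when choosing a lightweight model.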