Student Dropout Risk Prediction
By Eldar · Published June 29, 2026
Predicts student dropout/failure risk using OULAD dataset with XGBoost, Logistic Regression baseline, and SHAP explainability analysis.
- classification
- education
- xgboost
- shap
- imbalanced-data
- eda
Inside this notebook
# OULAD Student Dropout/Failure Prediction: Stage-Wise Pipeline

**Goal:** Predict which students are at risk of dropping out or failing, using the Open University Learning Analytics Dataset (OULAD), a public dataset covering **demographics, assessment scores, and VLE (Virtual Learning Environment) clickstream activity**.

**Dataset:** OULAD (Kaggle: `saksh1mishr4/oulada-clean-dataset-for-mooc-dropout-prediction`)
**Rows:** 32,593 student-course enrolments | **Features:** 21 columns (demographics + engagement + assessments)
**Target:** Binary, `at_risk = 1` (Fail/Withdrawn) vs `0` (Pass/Distinction)
**Models:** XGBoost Classifier + Logistic Regression baseline
**Explainability:** SHAP (TreeExplainer)
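The binary target described above can be sketched on a toy frame. This is a minimal illustration, assuming the outcome column is named `final_result` and holds the four labels shown in the notebook's output; the `toy` frame and the `AT_RISK_LABELS` constant are hypothetical names, and the notebook's own derivation is not visible in this preview.

```python
import pandas as pd

# Toy stand-in for the real dataset: one row per outcome label.
toy = pd.DataFrame({"final_result": ["Pass", "Fail", "Withdrawn", "Distinction"]})

# Labels treated as "at risk" (assumed from the target definition above).
AT_RISK_LABELS = ["Fail", "Withdrawn"]

# 1 for Fail/Withdrawn, 0 for Pass/Distinction.
toy["at_risk"] = toy["final_result"].isin(AT_RISK_LABELS).astype(int)
```

Using `isin` on an explicit label list keeps the mapping readable and makes any unexpected label fall into class 0, so it is worth checking the distinct values of the column first.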
%pip install kagglehub==0.3.10 shap==0.46.0 xgboost==2.1.3 -q
Note: you may need to restart the kernel to use updated packages.
# ============================================================
# Cell: Imports & Aesthetic Configuration
# ============================================================
import warnings
warnings.filterwarnings('ignore')
import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
…
✓ Libraries loaded successfully.
✓ Style configuration loaded.
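The imports above point to a scaled Logistic Regression baseline evaluated on a held-out split. The sketch below shows one way those pieces fit together on synthetic data. The 80/20 split, the stratification, the `random_state`, and the use of `make_pipeline` are assumptions for illustration, since the notebook's modelling cells are not part of this preview.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 3 numeric features and a noisy binary label.
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling inside the pipeline is fitted on the training fold only,
# which avoids leaking test statistics into the model.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
recall = recall_score(y_test, y_pred)  # share of true positives recovered
```

Recall is highlighted here because missing an at-risk student is usually the costlier error in this kind of task; the notebook imports accuracy and precision as well.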
# ============================================================
# Step 1 β Download & Load Dataset
# ============================================================
print("=" * 70)
print("📥 DOWNLOADING OULAD DATASET")
print("=" * 70)
path = kagglehub.dataset_download("saksh1mishr4/oulada-clean-dataset-for-mooc-dropout-prediction")
print(f" ✓ Downloaded to: {path}")
df = pd.read_csv(f"{path}/final_dataset.csv")
print(f"\n Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\n Columns ({len(df.columns)}):")
for c in df.columns:
    print(f" • {c:30s} [{df[c].dtype}]")
print("\n ── First 5 rows ──")
display(df.head(5))
…
======================================================================
📥 DOWNLOADING OULAD DATASET
======================================================================
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.10), please consider upgrading to the latest version (1.0.2).
✓ Downloaded to: /root/.cache/kagglehub/datasets/saksh1mishr4/oulada-clean-dataset-for-mooc-dropout-prediction/versions/1
Shape: 32,593 rows × 21 columns
Columns (21):
• code…
# ============================================================
# Step 2a β EDA: Overview, Missing Values & Target
# ============================================================
print("=" * 70)
print("EDA: DATA OVERVIEW")
print("=" * 70)
# Missing values
mv = df.isnull().sum()
mv = mv[mv > 0]
if len(mv) > 0:
    print("\n ⚠️ Missing Values Found:")
    for col, cnt in mv.items():
        print(f" {col}: {cnt} ({cnt/len(df)*100:.2f}%)")
else:
    print("\n ✅ No missing values - dataset is clean!")
# Target distribution
…
======================================================================
EDA: DATA OVERVIEW
======================================================================
✅ No missing values - dataset is clean!
── Target: final_result ──
Pass 12361 (37.9%)
Withdrawn 10156 (31.2%)
Fail 7052 (21.6%)
Distinction 3024 (9.3%)
── Binary Target: at_risk ──
0 = Pass/Distinction: 15385 (47.2%)
1 = Fail/Withdrawn:…
# ============================================================
# Step 2b β EDA: Target Distribution Visualizations
# ============================================================
print("TARGET DISTRIBUTION VISUALIZATIONS\n")
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
fig.suptitle('🎯 Target Variable: Student Final Outcome', fontsize=15, fontweight='bold', y=1.02)
# 1. Donut pie chart for original 4-class target
ax = axes[0]
colors_donut = ['#27ae60', '#e74c3c', '#f39c12', '#3498db']
explode = (0, 0.05, 0.05, 0)
wedges, texts, autotexts = ax.pie(
    target_counts.values, labels=None, autopct='%1.1f%%',
    startangle=90, colors=colors_donut, explode=explode,
    pctdistance=0.78, textprops={'fontsize': 9, 'fontweight': 'bold'})
# Draw centre circle for donut effect
centre = plt.Circle((0, 0), 0.55, fc='white', linewidth=1.5, edgecolor='#ddd')
…
TARGET DISTRIBUTION VISUALIZATIONS
✓ Saved: fig_target_distribution.png
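The four outcome counts printed in the data overview are enough to check the binary class balance by hand. The sketch below redoes that arithmetic in plain Python. It also derives a candidate value for XGBoost's `scale_pos_weight` from the negative-to-positive ratio, which is a common heuristic and an assumption here: the preview does not show whether the notebook sets this parameter.

```python
# Outcome counts as printed in the EDA overview output.
counts = {"Pass": 12361, "Withdrawn": 10156, "Fail": 7052, "Distinction": 3024}

# Labels mapped to at_risk = 1, per the notebook's target definition.
at_risk_labels = {"Fail", "Withdrawn"}

n_pos = sum(v for k, v in counts.items() if k in at_risk_labels)      # at_risk = 1
n_neg = sum(v for k, v in counts.items() if k not in at_risk_labels)  # at_risk = 0
total = n_pos + n_neg

pos_share = n_pos / total  # fraction of enrolments that are at risk
neg_share = n_neg / total

# Heuristic only: ratio of negative to positive examples.
scale_pos_weight = n_neg / n_pos
```

With roughly 53% of enrolments in the positive class, the binary target is close to balanced, so the ratio lands just below 1 and would change the model very little. The imbalance is far more visible in the original four-class target.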
# ============================================================
# Step 2c β EDA: Feature Distributions by Risk (Violin + KDE)
# ============================================================
print("FEATURE DISTRIBUTIONS BY RISK STATUS\n")
# Pick top 6 numeric features correlating with at_risk
target_corr = df[num_cols].corrwith(df['at_risk']).abs().sort_values(ascending=False)
top6_feats = target_corr.head(6).index.tolist()
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
fig.suptitle('🎻 Feature Distributions: At-Risk (Red) vs Not at-Risk (Green)', fontsize=15, fontweight='bold', y=1.01)
for i, feat in enumerate(top6_feats):
    ax = axes[i // 3][i % 3]
    # Violin plot with split by class
    data_risk = df[df['at_risk'] == 1][feat].dropna()
    data_safe = df[df['at_risk'] == 0][feat].dropna()
…
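The feature-ranking step at the top of Step 2c (absolute correlation with `at_risk`, keeping the strongest features) can be tried on synthetic data. In this sketch the column names `strong_signal`, `weak_signal`, and `noise` are hypothetical, and excluding the target from `num_cols` is an assumption, because the cell that defines `num_cols` in the notebook is not shown.

```python
import numpy as np
import pandas as pd

# Synthetic frame: a binary target plus three numeric features
# with decreasing dependence on it.
rng = np.random.default_rng(0)
n = 500
at_risk = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "at_risk": at_risk,
    "strong_signal": at_risk * 2.0 + rng.normal(scale=0.5, size=n),
    "weak_signal": at_risk * 0.3 + rng.normal(scale=1.0, size=n),
    "noise": rng.normal(size=n),
})

# Numeric columns, leaving out the target itself; otherwise it would
# rank first with a correlation of exactly 1.
num_cols = [c for c in toy.select_dtypes(include="number").columns if c != "at_risk"]

# Same ranking expression as in the notebook, applied to the toy frame.
target_corr = toy[num_cols].corrwith(toy["at_risk"]).abs().sort_values(ascending=False)
top_feats = target_corr.head(2).index.tolist()
```

Pearson correlation with a binary target only captures linear, monotonic separation, so a feature that matters through a threshold or an interaction can rank low here and still be useful to XGBoost.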