Student Dropout Risk Prediction

By Eldar · Published June 29, 2026

Predicts student dropout/failure risk using OULAD dataset with XGBoost, Logistic Regression baseline, and SHAP explainability analysis.

  • classification
  • education
  • xgboost
  • shap
  • imbalanced-data
  • eda
18 cells · 1 experiment · 140 views · 0 forks

Inside this notebook

# 🎓 OULAD Student Dropout/Failure Prediction: Stage-Wise Pipeline

**Goal:** Predict which students are at risk of dropping out or failing, using the Open University Learning Analytics Dataset (OULAD), a public dataset covering **demographics, assessment scores, and VLE (Virtual Learning Environment) clickstream activity**.

**Dataset:** OULAD (Kaggle: `saksh1mishr4/oulada-clean-dataset-for-mooc-dropout-prediction`)

**Rows:** 32,593 student-course enrolments | **Features:** 21 columns (demographics + engagement + assessments)

**Target:** Binary: `at_risk = 1` (Fail/Withdrawn) vs `0` (Pass/Distinction)

**Models:** XGBoost Classifier + Logistic Regression baseline

**Explainability:** SHAP (TreeExplainer)

%pip install kagglehub==0.3.10 shap==0.46.0 xgboost==2.1.3 -q
Note: you may need to restart the kernel to use updated packages.
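Because the install cell pins exact versions and asks for a kernel restart, it can help to confirm what the restarted kernel actually sees. The following is a minimal sketch, not part of the original notebook; the `check_pins` helper and the `PINS` dictionary are hypothetical names, and the version numbers are simply copied from the `%pip install` line above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the %pip install line; adjust if that line changes.
PINS = {"kagglehub": "0.3.10", "shap": "0.46.0", "xgboost": "2.1.3"}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# An empty dict means every pinned package is present at the pinned version.
print(check_pins(PINS) or "all pinned versions match")
```

An empty result suggests the restart picked up the pinned packages; any entry shows the expected version next to the one found.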
# ============================================================
# Cell: Imports & Aesthetic Configuration
# ============================================================
import warnings
warnings.filterwarnings('ignore')

import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
…
✅ Libraries loaded successfully.
✅ Style configuration loaded.
# ============================================================
# Step 1 - Download & Load Dataset
# ============================================================
print("=" * 70)
print("📥 DOWNLOADING OULAD DATASET")
print("=" * 70)

path = kagglehub.dataset_download("saksh1mishr4/oulada-clean-dataset-for-mooc-dropout-prediction")
print(f"  → Downloaded to: {path}")

df = pd.read_csv(f"{path}/final_dataset.csv")
print(f"\n  Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\n  Columns ({len(df.columns)}):")
for c in df.columns:
    print(f"    • {c:30s}  [{df[c].dtype}]")

print("\n  ── First 5 rows ──")
display(df.head(5))
…
======================================================================
📥 DOWNLOADING OULAD DATASET
======================================================================
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.10), please consider upgrading to the latest version (1.0.2).
  → Downloaded to: /root/.cache/kagglehub/datasets/saksh1mishr4/oulada-clean-dataset-for-mooc-dropout-prediction/versions/1

  Shape: 32,593 rows × 21 columns

  Columns (21):
    • code…
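The load step above assumes `final_dataset.csv` sits directly in the download directory. If a different dataset version nests the file in a subfolder, a small lookup helper can make the failure mode clearer. This is a sketch under that assumption; `find_csv` is a hypothetical helper, not something the notebook or `kagglehub` provides.

```python
from pathlib import Path


def find_csv(download_dir, filename="final_dataset.csv"):
    """Return the path to `filename` under `download_dir`, searching subfolders."""
    root = Path(download_dir)
    direct = root / filename
    if direct.is_file():
        return direct  # the layout the notebook expects
    matches = sorted(root.rglob(filename))
    if not matches:
        # List the CSV files that do exist so the error is actionable.
        available = sorted(p.name for p in root.rglob("*.csv"))
        raise FileNotFoundError(
            f"{filename!r} not found under {root}; CSV files present: {available}"
        )
    return matches[0]


# Possible use in the notebook (assumes `path` and `pd` from the cells above):
# df = pd.read_csv(find_csv(path))
```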
# ============================================================
# Step 2a - EDA: Overview, Missing Values & Target
# ============================================================
print("=" * 70)
print("🔍 EDA: DATA OVERVIEW")
print("=" * 70)

# Missing values
mv = df.isnull().sum()
mv = mv[mv > 0]
if len(mv) > 0:
    print("\n  ⚠️  Missing Values Found:")
    for col, cnt in mv.items():
        print(f"      {col}: {cnt} ({cnt/len(df)*100:.2f}%)")
else:
    print("\n  ✅ No missing values - dataset is clean!")

# Target distribution
…
======================================================================
🔍 EDA: DATA OVERVIEW
======================================================================

  ✅ No missing values - dataset is clean!

  ── Target: final_result ──
      Pass              12361  (37.9%)
      Withdrawn         10156  (31.2%)
      Fail               7052  (21.6%)
      Distinction        3024  (9.3%)

  ── Binary Target: at_risk ──
      0 = Pass/Distinction:    15385  (47.2%)
      1 = Fail/Withdrawn:…
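The binary target collapses the four `final_result` classes into two. As a quick arithmetic check, the sketch below regroups the printed class counts using the mapping stated in the notebook's introduction and recomputes the shares. The counts are copied from the output above; the `AT_RISK` dictionary and `binary_shares` function are illustrative names, and the notebook's own (elided) cell may build `at_risk` differently.

```python
# Class counts copied from the printed `final_result` distribution.
counts = {"Pass": 12361, "Withdrawn": 10156, "Fail": 7052, "Distinction": 3024}

# Mapping taken from the stated target definition:
# at_risk = 1 for Fail/Withdrawn, 0 for Pass/Distinction.
AT_RISK = {"Pass": 0, "Distinction": 0, "Fail": 1, "Withdrawn": 1}


def binary_shares(counts, mapping):
    """Regroup class counts by binary label; return {label: (count, percent)}."""
    total = sum(counts.values())
    grouped = {0: 0, 1: 0}
    for label, n in counts.items():
        grouped[mapping[label]] += n
    return {k: (n, round(100 * n / total, 1)) for k, n in grouped.items()}


shares = binary_shares(counts, AT_RISK)
print(shares[0])  # count and share for Pass/Distinction
print(shares[1])  # count and share for Fail/Withdrawn
```

The class 0 result should agree with the 15385 (47.2%) line printed by the notebook, and the two counts should sum to the 32,593 rows reported at load time.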
# ============================================================
# Step 2b - EDA: Target Distribution Visualizations
# ============================================================
print("📊 TARGET DISTRIBUTION VISUALIZATIONS\n")

fig, axes = plt.subplots(1, 3, figsize=(20, 6))
fig.suptitle('🎯 Target Variable: Student Final Outcome', fontsize=15, fontweight='bold', y=1.02)

# 1. Donut pie chart for original 4-class target
ax = axes[0]
colors_donut = ['#27ae60', '#e74c3c', '#f39c12', '#3498db']
explode = (0, 0.05, 0.05, 0)
wedges, texts, autotexts = ax.pie(
    target_counts.values, labels=None, autopct='%1.1f%%',
    startangle=90, colors=colors_donut, explode=explode,
    pctdistance=0.78, textprops={'fontsize': 9, 'fontweight': 'bold'})
# Draw centre circle for donut effect
centre = plt.Circle((0, 0), 0.55, fc='white', linewidth=1.5, edgecolor='#ddd')
…
📊 TARGET DISTRIBUTION VISUALIZATIONS

  ✅ Saved: fig_target_distribution.png
# ============================================================
# Step 2c - EDA: Feature Distributions by Risk (Violin + KDE)
# ============================================================
print("📊 FEATURE DISTRIBUTIONS BY RISK STATUS\n")

# Pick top 6 numeric features correlating with at_risk
target_corr = df[num_cols].corrwith(df['at_risk']).abs().sort_values(ascending=False)
top6_feats = target_corr.head(6).index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(20, 12))
fig.suptitle('🎻 Feature Distributions: At-Risk (Red) vs Not at-Risk (Green)', fontsize=15, fontweight='bold', y=1.01)

for i, feat in enumerate(top6_feats):
    ax = axes[i // 3][i % 3]
    
    # Violin plot with split by class
    data_risk = df[df['at_risk'] == 1][feat].dropna()
    data_safe = df[df['at_risk'] == 0][feat].dropna()
…
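The feature selection at the top of this cell ranks numeric columns by the absolute Pearson correlation with `at_risk` and keeps the top six. To make that ranking concrete, here is a dependency-free sketch of the same idea on toy data. The feature names and values are hypothetical, not taken from OULAD, and unlike pandas (which sorts NaN scores to the end) this version drops constant columns outright.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def rank_by_abs_corr(features, target, k=6):
    """Return up to k feature names, strongest absolute correlation first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    scores = {name: s for name, s in scores.items() if not math.isnan(s)}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: all names and values below are made up for illustration.
target = [1, 1, 1, 0, 0, 0]
features = {
    "clicks": [5, 8, 6, 40, 55, 47],     # clearly lower for at-risk rows
    "score": [30, 45, 80, 60, 85, 70],   # weaker separation
    "noise": [3, 9, 4, 4, 9, 3],         # same values in both groups
    "constant": [1, 1, 1, 1, 1, 1],      # undefined correlation, dropped
}
print(rank_by_abs_corr(features, target, k=2))
```

Taking the absolute value means strongly negative predictors (for example, low engagement among at-risk students) rank as highly as strongly positive ones, which is presumably why the notebook applies `.abs()` before sorting.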
