Spotify Lightweight Models
By Eldar · Published June 29, 2026
Train a genre classifier, a popularity predictor, and a track recommender in parallel on 114K Spotify tracks using scikit-learn and joblib.
- spotify
- parallel-training
- multi-model
- scikit-learn
- eda
Inside this notebook
# 🎵 Spotify Tracks: 3 Lightweight Models Trained in Parallel

**Dataset:** [maharshipandya/spotify-tracks-dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset) (114,000 tracks, 114 genres, 21 columns)

**Goal:** Train 3 different lightweight ML models side by side on the Spotify tracks dataset using **parallel execution** (`joblib.Parallel`) to demonstrate a multi-model training workflow.

| Model | Task | Algorithm |
|-------|------|-----------|
| **1: Genre Classifier** | Predict `track_genre` (114 classes) from audio features | Random Forest |
| **2: Popularity Predictor** | Predict `popularity` (0–100) from audio features | Ridge Regression |
| **3: Track Recommender** | Find similar tracks by cosine distance on the audio-feature embedding | NearestNeighbors |

**Key Inputs:** `dataset.csv` (from HuggingFace) → cleaned numeric features + encoded targets

**Key Outputs:** 3 trained model artifacts + metrics comparison + inference smoke tests
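The recommender row in the table can be illustrated with a minimal sketch. It runs on synthetic data, not the Spotify features: the array shape, the `query_idx` name, and `n_neighbors=6` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for scaled audio features: 200 "tracks" x 14 features.
features = StandardScaler().fit_transform(rng.normal(size=(200, 14)))

# Cosine distance is 1 - cosine similarity, so it compares direction, not magnitude.
recommender = NearestNeighbors(n_neighbors=6, metric="cosine")
recommender.fit(features)

# Query with a track that is already in the index.
query_idx = 0
distances, indices = recommender.kneighbors(features[query_idx:query_idx + 1])

# The closest hit is the query track itself (distance ~0), so skip it.
similar_idx = indices[0][1:]
similar_dist = distances[0][1:]
print("Most similar rows:", similar_idx.tolist())
```

When the query track is part of the fitted index, asking for one extra neighbor and dropping the first hit is a common way to keep a track from recommending itself.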
# ========== Imports & Setup ==========
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import json
import os
import time
import warnings
import requests
from datetime import datetime
from io import StringIO
# ML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors
…
✅ All imports ready
pandas 2.2.3 | sklearn 1.5.2 | joblib 1.5.3
# ========== Download Spotify Dataset ==========
CSV_URL = "https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/main/dataset.csv"
print("Downloading dataset…")
resp = requests.get(CSV_URL, timeout=120)
resp.raise_for_status()
df = pd.read_csv(StringIO(resp.text))
print(f"✅ Loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f" Size: {len(resp.content) / 1e6:.1f} MB")
df.head(3)

Downloading dataset…
✅ Loaded: 114,000 rows × 21 columns
 Size: 20.1 MB
# ========== EDA & Data Profile ==========
print("=== COLUMNS & DTYPES ===")
print(df.dtypes.to_string())
print()
print("=== MISSING VALUES ===")
missing = df.isnull().sum()
missing = missing[missing > 0]
if len(missing) > 0:
    print(missing.to_string())
else:
    print("✅ No missing values in any column")
print()
print("=== BASIC STATS (numeric) ===")
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
print(df[num_cols].describe().to_string())
print()
…
=== COLUMNS & DTYPES ===
Unnamed: 0            int64
track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
…
# ========== Preprocessing for 3 Tasks ==========
# Drop ID / text columns — keep only audio features + targets
id_cols = ['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name', 'source']
drop_cols = [c for c in id_cols if c in df.columns]
# Candidate columns for all models; `popularity` is also Model 2's regression target
feature_cols = ['popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
                'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']
X_all = df[feature_cols].copy()
y_genre = df['track_genre'].copy()
# Keep only rows that have a genre label (a no-op on this data: all 114,000 rows remain)
ok = df['track_genre'].notna()
X_all = X_all[ok].reset_index(drop=True)
y_genre = y_genre[ok].reset_index(drop=True)
…
Rows after clean: 114,000
Train set: 91,200 | Test set: 22,800
Audio features: 14
Genre classes: 114
Scaler mean shape: (14,)
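The output above reports an 80/20 split (91,200 / 22,800), encoded genre classes, and a fitted scaler, but the code that produces them is elided. The following is a minimal sketch of one common way to do those steps, run on a small synthetic frame. The `RANDOM_SEED` value, the stratified split, the toy column subset, and all variable names are assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

RANDOM_SEED = 42  # assumed value; the notebook's seed is not shown

# Synthetic stand-in for the cleaned frame (3 features, 4 genres).
rng = np.random.default_rng(RANDOM_SEED)
n_rows = 1000
toy = pd.DataFrame({
    "danceability": rng.random(n_rows),
    "energy": rng.random(n_rows),
    "tempo": rng.normal(120, 30, n_rows),
    "track_genre": rng.choice(["rock", "jazz", "techno", "folk"], n_rows),
})

X = toy[["danceability", "energy", "tempo"]]

# Encode string genre labels as integers 0..n_classes-1.
le = LabelEncoder()
y = le.fit_transform(toy["track_genre"])

# 80/20 split; stratifying keeps genre proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED
)

# Fit the scaler on the training rows only, then apply it to both parts,
# so no test-set statistics leak into training.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(f"Train set: {len(X_train_s):,} | Test set: {len(X_test_s):,}")
print(f"Genre classes: {len(le.classes_)}")
print(f"Scaler mean shape: {scaler.mean_.shape}")
```

Fitting the scaler before splitting would still run, but the test rows would then influence the means and variances used in training.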
## Parallel Training: 3 Models at Once

The cell below uses `joblib.Parallel(n_jobs=3)` to train all 3 models simultaneously. Each model is wrapped in a standalone function that receives pre-split data, trains, and returns evaluation metrics + the fitted model.

| Worker | Model | Hyperparams |
|--------|-------|-------------|
| 1 | `RandomForestClassifier`: Genre (114 classes) | `n_estimators=100, max_depth=20, n_jobs=2` |
| 2 | `Ridge`: Popularity (regression) | `alpha=1.0` |
| 3 | `NearestNeighbors`: Track recommendation | `n_neighbors=20, metric=cosine` |
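The one-function-per-model pattern described above can be sketched on synthetic data as follows. The function names, result keys, and reduced hyperparameters are hypothetical. The sketch also passes `prefer="threads"` so it stays lightweight, which is an assumption of this example: `joblib.Parallel(n_jobs=3)` on its own uses joblib's default process-based backend.

```python
import time

import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 14))                      # synthetic features
y_class = rng.integers(0, 5, size=600)              # synthetic class labels
y_reg = X @ rng.normal(size=14) + rng.normal(scale=0.1, size=600)


def fit_classifier(X, y):
    """Hypothetical worker: fit a small random forest and time it."""
    t0 = time.time()
    model = RandomForestClassifier(n_estimators=20, max_depth=8, random_state=0)
    model.fit(X, y)
    return "genre_rf", model, time.time() - t0


def fit_regressor(X, y):
    """Hypothetical worker: fit a ridge regression and time it."""
    t0 = time.time()
    model = Ridge(alpha=1.0).fit(X, y)
    return "popularity_ridge", model, time.time() - t0


def fit_neighbors(X):
    """Hypothetical worker: build a cosine nearest-neighbor index and time it."""
    t0 = time.time()
    model = NearestNeighbors(n_neighbors=20, metric="cosine").fit(X)
    return "recommender_nn", model, time.time() - t0


# delayed(...) records the call without running it; Parallel runs the tasks
# concurrently and returns the results in the order they were submitted.
tasks = [
    delayed(fit_classifier)(X, y_class),
    delayed(fit_regressor)(X, y_reg),
    delayed(fit_neighbors)(X),
]
results = Parallel(n_jobs=3, prefer="threads")(tasks)

fitted = {name: model for name, model, _ in results}
for name, _, seconds in results:
    print(f"{name}: fitted in {seconds:.2f}s")
```

Because the three fits differ widely in cost, total wall time is bounded by the slowest worker, which in the notebook's log is the random forest.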
# ========== Train 3 Models in Parallel ==========
MODEL_DIR = "/home/user/models/spotify_triple"
os.makedirs(MODEL_DIR, exist_ok=True)
def train_random_forest(X_tr, X_te, y_tr, y_te, le_obj, feat_names):
    """Model 1: Genre Classification with Random Forest"""
    print("[Worker 1] Starting Random Forest genre classifier…")
    t0 = time.time()
    rf = RandomForestClassifier(
        n_estimators=100, max_depth=20, min_samples_leaf=5,
        n_jobs=2, random_state=RANDOM_SEED, verbose=0,
        class_weight='balanced'
    )
    rf.fit(X_tr, y_tr)
    train_time = time.time() - t0
    # Evaluate
    y_pred = rf.predict(X_te)
…
============================================================
Launching 3 workers in parallel on 8 CPUs…
============================================================
[Worker 2] Starting Ridge popularity predictor…
[Worker 3] Starting NearestNeighbors recommender…
[Worker 3] ✅ Fitted on 91,200 tracks (0.0s)
[Worker 2] ✅ R²=0.0270 RMSE=22.06 (0.1s)
[Worker 1] Starting Random Forest genre classifier…
[Worker 1] ✅ Acc=0.2549 Top-5 Acc=0.5915 (18.1s)
==============================================…
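The metrics in the log (accuracy, top-5 accuracy, R², RMSE) can be computed with scikit-learn as in the sketch below. The evaluation code in the notebook is elided, so this is an illustration on synthetic scores and not the author's implementation. All arrays and names here are made up.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_squared_error,
    r2_score,
    top_k_accuracy_score,
)

rng = np.random.default_rng(0)
n_samples, n_classes = 500, 10

# Synthetic class probabilities with a mild boost on the true class.
y_true = rng.integers(0, n_classes, size=n_samples)
proba = rng.random((n_samples, n_classes))
proba[np.arange(n_samples), y_true] += 0.5
proba /= proba.sum(axis=1, keepdims=True)

# Top-1 accuracy uses the argmax; top-5 counts a hit when the true class
# is among the five highest-scoring classes.
acc = accuracy_score(y_true, proba.argmax(axis=1))
top5 = top_k_accuracy_score(y_true, proba, k=5, labels=np.arange(n_classes))

# Synthetic regression target and a predictor that hovers around the mean,
# which yields an R² close to zero.
y_pop = rng.uniform(0, 100, size=n_samples)
y_pop_pred = y_pop.mean() + rng.normal(scale=1.0, size=n_samples)

rmse = np.sqrt(mean_squared_error(y_pop, y_pop_pred))
r2 = r2_score(y_pop, y_pop_pred)

print(f"Acc={acc:.4f} Top-5 Acc={top5:.4f}")
print(f"R²={r2:.4f} RMSE={rmse:.2f}")
```

Two reference points help when reading the notebook's numbers. With 114 classes of roughly equal size (114,000 tracks over 114 genres averages 1,000 per genre), random guessing gives about 0.9% top-1 and 4.4% top-5 accuracy, so 0.2549 and 0.5915 are far above chance. For the regressor, RMSE = σ·√(1 − R²), where σ is the standard deviation of the test target. An R² of 0.0270 with RMSE 22.06 therefore implies σ of about 22.4, meaning the Ridge model does only slightly better than always predicting the mean popularity.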