Spotify Lightweight Models

By Eldar · Published June 29, 2026

Train a genre classifier, a popularity predictor, and a track recommender in parallel on 114K Spotify tracks using scikit-learn and joblib.

spotify
parallel-training
multi-model
scikit-learn
eda



Inside this notebook

# 🎵 Spotify Tracks: 3 Lightweight Models Trained in Parallel

**Dataset:** [maharshipandya/spotify-tracks-dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset) (114,000 tracks, 21 columns; the loaded data contains 114 genre classes)

**Goal:** Train three lightweight ML models side by side on a Spotify dataset using **parallel execution** (`joblib.Parallel`) to demonstrate a multi-model training workflow.

| Model | Task | Algorithm |
|-------|------|-----------|
| **1: Genre Classifier** | Predict `track_genre` (114 classes) from audio features | Random Forest |
| **2: Popularity Predictor** | Predict `popularity` (0–100) from audio features | Ridge Regression |
| **3: Track Recommender** | Find similar tracks by cosine distance on audio embedding | NearestNeighbors |

**Key Inputs:** `dataset.csv` (from HuggingFace) → cleaned numeric features + encoded targets

**Key Outputs:** 3 trained model artifacts + metrics comparison + inference smoke tests

# ========== Imports & Setup ==========
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import json
import os
import time
import warnings
import requests
from datetime import datetime
from io import StringIO

# ML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors
…

✅ All imports ready
   pandas 2.2.3 | sklearn 1.5.2 | joblib 1.5.3

# ========== Download Spotify Dataset ==========
CSV_URL = "https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/main/dataset.csv"

print("Downloading dataset…")
resp = requests.get(CSV_URL, timeout=120)
resp.raise_for_status()

df = pd.read_csv(StringIO(resp.text))
print(f"✅ Loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"   Size: {len(resp.content) / 1e6:.1f} MB")
df.head(3)

Downloading dataset…
✅ Loaded: 114,000 rows × 21 columns
   Size: 20.1 MB

# ========== EDA & Data Profile ==========
print("=== COLUMNS & DTYPES ===")
print(df.dtypes.to_string())
print()

print("=== MISSING VALUES ===")
missing = df.isnull().sum()
missing = missing[missing > 0]
if len(missing) > 0:
    print(missing.to_string())
else:
    print("✅ No missing values in any column")
print()

print("=== BASIC STATS (numeric) ===")
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
print(df[num_cols].describe().to_string())
print()
…

=== COLUMNS & DTYPES ===
Unnamed: 0            int64
track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64…

# ========== Preprocessing for 3 Tasks ==========
# Drop ID / text columns — keep only audio features + targets
id_cols = ['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name', 'source']
drop_cols = [c for c in id_cols if c in df.columns]

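# Columns pulled into X_all. `popularity` is the regression target; the output
# reports 14 audio features, so it is presumably split off in the elided code.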
feature_cols = ['popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
                'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']

X_all = df[feature_cols].copy()
y_genre = df['track_genre'].copy()

# Keep only rows that have a genre label (per the output, all 114,000 rows pass this filter)
ok = df['track_genre'].notna()
X_all = X_all[ok].reset_index(drop=True)
y_genre = y_genre[ok].reset_index(drop=True)

…

Rows after clean: 114,000
Train set: 91,200 | Test set: 22,800
Audio features: 14
Genre classes: 114
Scaler mean shape: (14,)
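
The preprocessing cell above is truncated, so the exact split and scaling code is not visible. The reported sizes (91,200 train, 22,800 test) are consistent with an 80/20 split, and the scaler's mean has one entry per audio feature. The following is a minimal sketch of that pattern on a small synthetic frame, not the notebook's actual code. The seed value, the stratified split, the three-column `audio_cols` list, and the `demo` frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

RANDOM_SEED = 42  # assumed value; the notebook defines its own seed elsewhere

# Tiny synthetic stand-in for the Spotify frame (hypothetical data).
rng = np.random.default_rng(RANDOM_SEED)
n_rows = 500
demo = pd.DataFrame({
    "danceability": rng.random(n_rows),
    "energy": rng.random(n_rows),
    "tempo": rng.uniform(60, 180, n_rows),
    "popularity": rng.integers(0, 101, n_rows),
    "track_genre": rng.choice(["rock", "jazz", "pop", "folk", "edm"], n_rows),
})

# popularity is left out of the feature matrix because it is a target.
audio_cols = ["danceability", "energy", "tempo"]
X = demo[audio_cols].to_numpy(dtype=float)

# Encode genre strings as integer class ids.
le = LabelEncoder()
y_genre_enc = le.fit_transform(demo["track_genre"])
y_pop = demo["popularity"].to_numpy(dtype=float)

# Split row indices once so every task sees the same train/test rows.
idx = np.arange(n_rows)
idx_tr, idx_te = train_test_split(
    idx, test_size=0.2, random_state=RANDOM_SEED, stratify=y_genre_enc
)

# Fit the scaler on training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X[idx_tr])
X_tr, X_te = scaler.transform(X[idx_tr]), scaler.transform(X[idx_te])

print(f"Train set: {len(idx_tr):,} | Test set: {len(idx_te):,}")
print(f"Audio features: {len(audio_cols)}")
print(f"Genre classes: {len(le.classes_)}")
print(f"Scaler mean shape: {scaler.mean_.shape}")
```

Splitting indices rather than arrays makes it easy to slice `y_genre_enc` and `y_pop` with the same `idx_tr` and `idx_te`, so the classifier and the regressor are evaluated on identical held-out rows.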

## Parallel Training: 3 Models at Once

The cell below uses `joblib.Parallel(n_jobs=3)` to train all 3 models simultaneously. Each model is wrapped in a standalone function that receives pre-split data, trains, and returns evaluation metrics + the fitted model.

| Worker | Model | Hyperparams |
|--------|-------|-------------|
| 1 | `RandomForestClassifier`: Genre (114 classes) | `n_estimators=100, max_depth=20, n_jobs=2` |
| 2 | `Ridge`: Popularity (regression) | `alpha=1.0` |
| 3 | `NearestNeighbors`: Track recommendation | `n_neighbors=20, metric=cosine` |
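
As a rough illustration of the pattern described above, the sketch below wraps three small models in standalone functions and dispatches them with `joblib.Parallel` and `delayed`. It runs on synthetic data with smaller hyperparameters than the table lists, and the function names (`fit_classifier`, `fit_regressor`, `fit_neighbors`) and the result dictionaries are hypothetical, not the notebook's own.

```python
import time

import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data (hypothetical): 600 rows, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
y_reg = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=600)
X_tr, X_te = X[:480], X[480:]


def fit_classifier(X_tr, y_tr, X_te, y_te):
    """Train a small random forest and report test accuracy."""
    t0 = time.time()
    model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
    model.fit(X_tr, y_tr)
    return {"name": "rf", "model": model,
            "score": model.score(X_te, y_te), "seconds": time.time() - t0}


def fit_regressor(X_tr, y_tr, X_te, y_te):
    """Train ridge regression and report test R^2."""
    t0 = time.time()
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return {"name": "ridge", "model": model,
            "score": model.score(X_te, y_te), "seconds": time.time() - t0}


def fit_neighbors(X_tr):
    """Index the training rows for cosine-distance lookups (no score)."""
    t0 = time.time()
    model = NearestNeighbors(n_neighbors=5, metric="cosine").fit(X_tr)
    return {"name": "knn", "model": model,
            "score": None, "seconds": time.time() - t0}


# One delayed call per model; Parallel returns results in submission order.
tasks = [
    delayed(fit_classifier)(X_tr, y_class[:480], X_te, y_class[480:]),
    delayed(fit_regressor)(X_tr, y_reg[:480], X_te, y_reg[480:]),
    delayed(fit_neighbors)(X_tr),
]
results = Parallel(n_jobs=3)(tasks)

for r in results:
    print(r["name"], r["score"], f"{r['seconds']:.2f}s")

# Smoke test for the recommender: nearest training rows to one test row.
distances, indices = results[2]["model"].kneighbors(X_te[:1])
```

Because the result list follows the order in which tasks were submitted, `results[0]` is always the classifier even if another worker finishes first, which is why the log lines in a run can appear in a different order than the table.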

# ========== Train 3 Models in Parallel ==========
MODEL_DIR = "/home/user/models/spotify_triple"
os.makedirs(MODEL_DIR, exist_ok=True)

def train_random_forest(X_tr, X_te, y_tr, y_te, le_obj, feat_names):
    """Model 1: Genre Classification with Random Forest"""
    print("[Worker 1] Starting Random Forest genre classifier…")
    t0 = time.time()
    rf = RandomForestClassifier(
        n_estimators=100, max_depth=20, min_samples_leaf=5,
        n_jobs=2, random_state=RANDOM_SEED, verbose=0,
        class_weight='balanced'
    )
    rf.fit(X_tr, y_tr)
    train_time = time.time() - t0

    # Evaluate
    y_pred = rf.predict(X_te)
…

============================================================
Launching 3 workers in parallel on 8 CPUs…
============================================================
[Worker 2] Starting Ridge popularity predictor…
[Worker 3] Starting NearestNeighbors recommender…
[Worker 3] ✅ Fitted on 91,200 tracks  (0.0s)
[Worker 2] ✅ R²=0.0270  RMSE=22.06  (0.1s)
[Worker 1] Starting Random Forest genre classifier…
[Worker 1] ✅ Acc=0.2549  Top-5 Acc=0.5915  (18.1s)
==============================================…
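
The evaluation code for Worker 1 is truncated, so how the notebook computes its top-5 accuracy is not shown. One common way to get both numbers is scikit-learn's `accuracy_score` together with `top_k_accuracy_score` applied to `predict_proba` output. The sketch below does that on a synthetic 10-class problem; the data and hyperparameters are assumptions for illustration only. For context, if the 114 genre classes are roughly balanced, random guessing would score about 1/114 for top-1 and 5/114 for top-5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, top_k_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 10-class problem standing in for the genre task (hypothetical data).
X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=8,
    n_classes=10, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Top-1: the single most probable class must match the label.
top1 = accuracy_score(y_te, clf.predict(X_te))

# Top-5: the label only needs to be among the five most probable classes.
proba = clf.predict_proba(X_te)
top5 = top_k_accuracy_score(y_te, proba, k=5, labels=clf.classes_)

# Chance levels for a balanced 114-class problem, for comparison.
chance_top1 = 1 / 114
chance_top5 = 5 / 114

print(f"Acc={top1:.4f}  Top-5 Acc={top5:.4f}")
print(f"Chance (114 balanced classes): top-1={chance_top1:.4f}  top-5={chance_top5:.4f}")
```

Passing `labels=clf.classes_` tells the metric which class each probability column refers to, which matters when a class happens to be absent from the test labels.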