NLP Model Performance: Raw vs Cleaned Data
By Eldar · Published June 29, 2026
Compares sentiment classification on IMDB reviews using raw vs cleaned text pipelines with LogisticRegression, RandomForest, and LinearSVC to determine whether data cleaning or model choice has greater impact.
- nlp
- sentiment-analysis
- data-cleaning
- text-preprocessing
- model-comparison
- imdb
Inside this notebook
# Raw vs Cleaned NLP: Does Data Cleaning Beat Model Choice?

This notebook compares **two training pipelines** on the IMDB 50K movie reviews sentiment dataset:

1. **Raw text**: minimal preprocessing (just lowercase + basic tokenization)
2. **Cleaned text**: full NLP cleaning (remove HTML, URLs, punctuation, stopwords, lemmatize)

The same **LogisticRegression** model is trained on both, then compared against a **RandomForest** on cleaned data to answer: *does simple cleaning improve results more than changing the model?*
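The shape of that three-way comparison can be sketched on a toy corpus. This is a minimal illustration, not the notebook's actual cells: the helper names `to_raw`, `to_cleaned`, and `evaluate`, the six example reviews, and the hyperparameters are all assumptions made here, and `to_cleaned` is only a simplified stand-in for the full cleaning step.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline


def to_raw(text):
    """Raw pipeline stand-in: lowercase only."""
    return text.lower()


def to_cleaned(text):
    """Hypothetical, simplified cleaning: drop HTML tags and non-letters."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    return re.sub(r"[^a-z\s]", " ", text)


def evaluate(model, prep, train, test):
    """Fit TF-IDF + model on preprocessed train texts, score on test texts."""
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    pipe.fit([prep(t) for t, _ in train], [y for _, y in train])
    preds = pipe.predict([prep(t) for t, _ in test])
    labels = [y for _, y in test]
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, pos_label="positive", zero_division=0),
    }


# Tiny made-up corpus, only to show the structure of the experiment.
train = [
    ("A wonderful, moving film.<br />Loved it", "positive"),
    ("Terrible plot and awful acting", "negative"),
    ("Great cast, great story", "positive"),
    ("Boring and bad.<br />Avoid it", "negative"),
]
test = [("wonderful story", "positive"), ("awful and boring", "negative")]

runs = {
    "LogReg + raw": (LogisticRegression(max_iter=1000), to_raw),
    "LogReg + cleaned": (LogisticRegression(max_iter=1000), to_cleaned),
    "RandomForest + cleaned": (
        RandomForestClassifier(n_estimators=50, random_state=42),
        to_cleaned,
    ),
}
results = {name: evaluate(m, prep, train, test) for name, (m, prep) in runs.items()}
for name, scores in results.items():
    print(f"{name}: {scores}")
```

Comparing the first two rows isolates the effect of cleaning with the model held fixed. Comparing the last two isolates the effect of the model with the text held fixed.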
# ── Setup & Imports ──────────────────────────────────────────────
%pip install -q kagglehub
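import kagglehub  # used by the download cell
import pandas as pd
import numpy as np
import re
import html
from pathlib import Path  # used to build the CSV path
import matplotlib.pyplot as plt  # used by the visual EDA cell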
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
…
Note: you may need to restart the kernel to use updated packages.
✅ Imports ready
# ── 1. Download Dataset ──────────────────────────────────────────
print("Downloading IMDB 50K Movie Reviews from Kaggle...")
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
data_path = Path(path) / "IMDB Dataset.csv"
df = pd.read_csv(data_path)
print(f"✅ Loaded! Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst 3 reviews:")
df.head(3)

Downloading IMDB 50K Movie Reviews from Kaggle...
Downloading to /root/.cache/kagglehub/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/1.archive...
Extracting files...
✅ Loaded! Shape: (50000, 2)
Columns: ['review', 'sentiment']

First 3 reviews:
## Step 1: Explore the Dataset

Before building pipelines, let's understand the data quality: class balance, text length, and baseline noisy elements (HTML tags, URLs, punctuation).
# ── 2. EDA / Profile the Dataset ────────────────────────────────
df = pd.read_csv(data_path) # fresh copy
print("─── Dataset Profile ───")
print(f"Rows: {df.shape[0]:,} Columns: {df.shape[1]}")
print(f"\n--- Sentiment Distribution ---")
print(df['sentiment'].value_counts())
print(f"\n--- Missing values ---")
print(df.isnull().sum())
# Quick text stats
df['review_len'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()
print(f"\n--- Text Length Stats (chars) ---")
print(df['review_len'].describe())
print(f"\n--- Word Count Stats ---")
print(df['word_count'].describe())
…
─── Dataset Profile ───
Rows: 50,000 Columns: 2

--- Sentiment Distribution ---
sentiment
positive    25000
negative    25000
Name: count, dtype: int64

--- Missing values ---
review       0
sentiment    0
dtype: int64

--- Text Length Stats (chars) ---
count    50000.000000
mean      1309.431020
std        989.728014
min         32.000000
25%        699.000000
50%        970.000000
75%       1590.250000
max      13704.000000
Name: review_len, dtype: float64

--- Word Count Stats ---
count…
# ── Visual EDA ──────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
# 1. Class balance
df['sentiment'].value_counts().plot(kind='bar', ax=axes[0], color=['#4e79a7', '#e15759'])
axes[0].set_title('Class Balance', fontsize=13)
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)
# 2. Review length distribution by sentiment
for sent in ['positive', 'negative']:
    subset = df[df['sentiment'] == sent]['review_len']
    axes[1].hist(subset, bins=60, alpha=0.6, label=sent, density=True)
axes[1].set_title('Review Length Distribution', fontsize=13)
axes[1].set_xlabel('Character Count')
axes[1].legend()
# 3. Estimate noise: % of reviews with HTML tags, URLs
…
HTML tags: 58.4% of reviews
URLs: 0.2% of reviews
Special chars: 100.0% of reviews
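The code that produced these percentages is cut off in this preview. One plausible way to compute figures of this kind is to count the reviews that match a regular expression for each noise type. The patterns, the `noise_share` helper, and the sample reviews below are assumptions for illustration, not the notebook's own definitions, so they will not necessarily reproduce the numbers above.

```python
import re

# Assumed patterns; the notebook's own definitions are not shown in the preview.
NOISE_PATTERNS = {
    "HTML tags": re.compile(r"<[^>]+>"),
    "URLs": re.compile(r"https?://\S+|www\.\S+"),
    "Special chars": re.compile(r"[^A-Za-z0-9\s]"),
}


def noise_share(reviews):
    """Return the percentage of reviews containing each kind of noise."""
    reviews = list(reviews)
    total = len(reviews)
    if total == 0:
        return {name: 0.0 for name in NOISE_PATTERNS}
    return {
        name: 100.0 * sum(1 for r in reviews if pattern.search(r)) / total
        for name, pattern in NOISE_PATTERNS.items()
    }


# Made-up sample; on the real data one would pass df['review'] instead.
sample = [
    "Great film.<br /><br />Would watch again!",
    "See http://example.com for the trailer",
    "plain words only",
    "Not bad, not great.",
]
shares = noise_share(sample)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of reviews")
```

Because almost every review contains at least one period or comma, a "special characters" pattern this broad will match nearly all of them, which is consistent with a figure close to 100%.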
## Step 2: Text Preprocessing Pipelines

Two parallel pipelines:

- **Raw**: only lowercase + basic splitting (minimal)
- **Cleaned**: strip HTML, decode entities, remove URLs/@mentions/punctuation, lowercase, remove stopwords, lemmatize
# ── 3. Text Preprocessing Pipelines ──────────────────────────────
%pip install -q nltk
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def pipeline_raw(text: str) -> str:
    """Minimal: lowercase only."""
    return text.lower()
…
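The cleaned pipeline itself is not visible in this preview. The steps listed for it in Step 2 could be sketched roughly as follows, using only the standard library. Everything here is an assumption rather than the notebook's code: the name `pipeline_cleaned_sketch`, the regular expressions, and the deliberately tiny `DEMO_STOPWORDS` set, which stands in for NLTK's English list. Lemmatization is left as an optional callable so that, for example, `WordNetLemmatizer().lemmatize` can be passed in.

```python
import html
import re

# Hypothetical, deliberately tiny stopword set used only for this demo.
DEMO_STOPWORDS = {"a", "an", "the", "is", "it", "was", "and", "of", "to", "this"}


def pipeline_cleaned_sketch(text, stopwords=DEMO_STOPWORDS, lemmatize=None):
    """Apply the cleaning steps in the order listed in Step 2."""
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = html.unescape(text)                          # decode entities (&amp; -> &)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)               # drop punctuation and digits
    tokens = [t for t in text.split() if t not in stopwords]
    if lemmatize is not None:                           # optional lemmatization step
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


example = "This movie was <b>GREAT</b>!<br />See http://example.com &amp; thank @bob."
cleaned = pipeline_cleaned_sketch(example)
print(cleaned)  # movie great see thank
```

URLs and @mentions are removed before punctuation is stripped. If punctuation went first, `http://example.com` would break into ordinary-looking tokens and survive into the vocabulary.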