NLP Model Performance: Raw vs Cleaned Data

By Eldar · Published June 29, 2026

Compares sentiment classification on IMDB reviews using raw vs cleaned text pipelines with LogisticRegression, RandomForest, and LinearSVC to determine whether data cleaning or model choice has greater impact.

nlp
sentiment-analysis
data-cleaning
text-preprocessing
model-comparison
imdb

32 cells · 1 experiment · 116 views · 0 forks


Inside this notebook

# Raw vs Cleaned NLP: Does Data Cleaning Beat Model Choice?

This notebook compares **two training pipelines** on the IMDB 50K movie reviews sentiment dataset:

1. **Raw text**: minimal preprocessing (just lowercase + basic tokenization)
2. **Cleaned text**: full NLP cleaning (remove HTML, URLs, punctuation, stopwords, lemmatize)

The same **LogisticRegression** model is trained on both, then compared against a **RandomForest** on cleaned data to answer: *does simple cleaning improve results more than changing the model?*
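As a rough illustration of the experiment design described above, the sketch below runs the three configurations (raw + LogisticRegression, cleaned + LogisticRegression, cleaned + RandomForest) through one shared evaluation helper. Everything here is an assumption made for illustration: the `evaluate` helper, the toy corpus (a stand-in, not the IMDB data), the placeholder `to_clean` function, and the hyperparameters are not taken from the notebook.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def evaluate(texts, labels, preprocess, model, seed=42):
    """Hypothetical helper: preprocess, split, fit TF-IDF + model, return test accuracy."""
    processed = [preprocess(t) for t in texts]
    X_train, X_test, y_train, y_test = train_test_split(
        processed, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    pipe.fit(X_train, y_train)
    return accuracy_score(y_test, pipe.predict(X_test))


def to_raw(text):
    """Raw pipeline: lowercase only."""
    return text.lower()


def to_clean(text):
    """Placeholder cleaning (assumption): strip HTML tags and punctuation, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text).lower()
    return re.sub(r"[^a-z\s]", " ", text)


# Toy stand-in corpus, NOT the IMDB reviews; only here to make the sketch runnable.
positive = [
    "A wonderful film with great acting<br />and a moving story",
    "Loved it, brilliant direction and a great cast",
    "Great fun, wonderful music, I loved every minute",
    "Brilliant and moving, a truly great movie",
    "Wonderful performances, I loved the ending",
    "A great story, brilliant and fun to watch",
    "Loved the cast, wonderful and moving film",
    "Great movie, brilliant acting, loved it",
]
negative = [
    "A terrible film with awful acting<br />and a boring story",
    "Hated it, dull direction and a bad cast",
    "Boring mess, awful music, I hated every minute",
    "Dull and tedious, a truly bad movie",
    "Awful performances, I hated the ending",
    "A bad story, dull and boring to watch",
    "Hated the cast, terrible and tedious film",
    "Bad movie, awful acting, hated it",
]
texts = positive + negative
labels = ["positive"] * len(positive) + ["negative"] * len(negative)

results = {
    "raw + LogisticRegression": evaluate(
        texts, labels, to_raw, LogisticRegression(max_iter=1000)
    ),
    "cleaned + LogisticRegression": evaluate(
        texts, labels, to_clean, LogisticRegression(max_iter=1000)
    ),
    "cleaned + RandomForest": evaluate(
        texts, labels, to_clean, RandomForestClassifier(n_estimators=50, random_state=0)
    ),
}
for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Using the same split seed for every configuration keeps the comparison fair: any difference in accuracy comes from the preprocessing or the model, not from which reviews landed in the test set. Scores on a toy corpus this small say nothing about the real dataset.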

# ── Setup & Imports ──────────────────────────────────────────────
%pip install -q kagglehub

import pandas as pd
import numpy as np
import re
import html

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

…

Note: you may need to restart the kernel to use updated packages.
✅ Imports ready

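# ── 1. Download Dataset ──────────────────────────────────────────
# Imported here so this cell runs on its own (harmless if already imported above).
from pathlib import Path
import kagglehub

print("Downloading IMDB 50K Movie Reviews from Kaggle...")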
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
data_path = Path(path) / "IMDB Dataset.csv"

df = pd.read_csv(data_path)
print(f"✅ Loaded! Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst 3 reviews:")
df.head(3)

Downloading IMDB 50K Movie Reviews from Kaggle...
Downloading to /root/.cache/kagglehub/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/1.archive...
Extracting files...
✅ Loaded! Shape: (50000, 2)
Columns: ['review', 'sentiment']

First 3 reviews:

## Step 1: Explore the Dataset

Before building the pipelines, let's check data quality: class balance, text length, and how much noise (HTML tags, URLs, punctuation) the raw reviews contain.

# ── 2. EDA / Profile the Dataset ────────────────────────────────
df = pd.read_csv(data_path)  # fresh copy

print("─── Dataset Profile ───")
print(f"Rows: {df.shape[0]:,}   Columns: {df.shape[1]}")
print("\n--- Sentiment Distribution ---")
print(df['sentiment'].value_counts())
print("\n--- Missing values ---")
print(df.isnull().sum())

# Quick text stats
df['review_len'] = df['review'].str.len()               # characters per review
df['word_count'] = df['review'].str.split().str.len()   # whitespace-separated tokens per review
print("\n--- Text Length Stats (chars) ---")
print(df['review_len'].describe())
print("\n--- Word Count Stats ---")
print(df['word_count'].describe())

…

─── Dataset Profile ───
Rows: 50,000   Columns: 2

--- Sentiment Distribution ---
sentiment
positive    25000
negative    25000
Name: count, dtype: int64

--- Missing values ---
review       0
sentiment    0
dtype: int64

--- Text Length Stats (chars) ---
count    50000.000000
mean      1309.431020
std        989.728014
min         32.000000
25%        699.000000
50%        970.000000
75%       1590.250000
max      13704.000000
Name: review_len, dtype: float64

--- Word Count Stats ---
count…

# ── Visual EDA ──────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# 1. Class balance
df['sentiment'].value_counts().plot(kind='bar', ax=axes[0], color=['#4e79a7', '#e15759'])
axes[0].set_title('Class Balance', fontsize=13)
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

# 2. Review length distribution by sentiment
for sent in ['positive', 'negative']:
    subset = df[df['sentiment'] == sent]['review_len']
    axes[1].hist(subset, bins=60, alpha=0.6, label=sent, density=True)
axes[1].set_title('Review Length Distribution', fontsize=13)
axes[1].set_xlabel('Character Count')
axes[1].legend()

# 3. Estimate noise: % of reviews with HTML tags, URLs
…

HTML tags: 58.4% of reviews
URLs:      0.2% of reviews
Special chars: 100.0% of reviews
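The cell that produces these percentages is truncated in this preview, so the following is only a minimal sketch of how such a noise estimate could be computed. The regular expressions and the `noise_profile` helper are assumptions for illustration, not the notebook's own definitions, and the sample strings are made up.

```python
import re

# Assumed patterns; the notebook's actual patterns are not visible in the preview.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
SPECIAL = re.compile(r"[^A-Za-z0-9\s]")  # anything other than letters, digits, whitespace


def noise_profile(reviews):
    """Return the percentage of reviews that contain each kind of noise."""
    total = len(reviews)
    patterns = {"html": HTML_TAG, "url": URL, "special": SPECIAL}
    if total == 0:
        return {name: 0.0 for name in patterns}
    return {
        name: 100.0 * sum(bool(p.search(r)) for r in reviews) / total
        for name, p in patterns.items()
    }


# Made-up sample reviews, not taken from the dataset.
sample = [
    "Great movie!<br /><br />Would watch again.",
    "See http://example.com for the trailer",
    "plain text with no noise",
    "Terrible... just terrible.",
]
profile = noise_profile(sample)
for name, pct in profile.items():
    print(f"{name}: {pct:.1f}% of reviews")
```

On a DataFrame, the same idea can be written as `df['review'].str.contains(r"<[^>]+>").mean() * 100`, since the mean of a boolean Series is the fraction of rows that match.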

## Step 2: Text Preprocessing Pipelines

Two parallel pipelines:

- **Raw**: only lowercase + basic splitting (minimal)
- **Cleaned**: strip HTML, decode entities, remove URLs/@mentions/punctuation, lowercase, remove stopwords, lemmatize

# ── 3. Text Preprocessing Pipelines ──────────────────────────────
%pip install -q nltk

import nltk
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def pipeline_raw(text: str) -> str:
    """Minimal: lowercase only."""
    return text.lower()

…
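The cleaned pipeline itself is cut off in this preview. As a hedged sketch of what the steps listed under Step 2 might look like, the function below applies them in the listed order using only the standard library. The name `pipeline_cleaned_sketch`, the regular expressions, and the tiny `DEMO_STOPWORDS` set are all assumptions; the notebook uses NLTK's English stopword list and a `WordNetLemmatizer`, and lemmatization is left out here to keep the example dependency-free.

```python
import html
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to", "at", "for"}


def pipeline_cleaned_sketch(text: str) -> str:
    """Hypothetical cleaning steps in the order listed under Step 2 (no lemmatization)."""
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = html.unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = text.lower()
    # Keep only a-z and whitespace; this drops punctuation, digits, and non-ASCII letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS]
    return " ".join(tokens)


example = "This movie was <b>GREAT</b> &amp; fun!<br />See http://example.com @critic"
print(pipeline_cleaned_sketch(example))  # prints "movie great fun see"
```

With NLTK available, a lemmatization step could be added by mapping `lemmatizer.lemmatize` over `tokens` before joining. Stripping tags before decoding entities means an escaped tag such as `&lt;b&gt;` is decoded to `<b>` and then lost to the punctuation filter rather than the tag filter; either order gives the same tokens in this sketch.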
