TP53 Missense Mutation Pathogenicity Classifier
By Ju Lin · Published June 29, 2026
Predict whether TP53 missense mutations are pathogenic or benign using amino-acid physicochemical properties and Random Forest classification on ClinVar data.
- genomics
- classification
- protein-structure
- clinvar
- random-forest
- feature-engineering
Inside this notebook
# TP53 Missense Mutation Classifier

Predict whether a single-amino-acid substitution in the **tumor suppressor p53 (TP53)** is **harmful (pathogenic)** or **benign** using features derived from the mutation position, the original and substituted amino acids, and their physicochemical properties.

**Data source:** [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/), the NCBI public archive of human variant interpretations, filtered for TP53 missense mutations with known clinical significance (~1,475 unique variants).

**Protein:** TP53 (UniProt P04637), 393 residues, one of the most extensively studied proteins in cancer genetics.

**Pipeline:** Download → Parse HGVS notation → Engineer amino-acid Δ-properties → Train Random Forest classifier → Evaluate → Explain example predictions.
# ============================================================
# IMPORTS & SETUP
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import gzip
import io
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
…All imports loaded successfully.
# ============================================================
# DOWNLOAD & FILTER TP53 VARIANTS FROM CLINVAR
# ============================================================
# Stream the gzip file from NCBI FTP; filter for TP53 missense
# mutations with known clinical significance.
import urllib.request
import sys
URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz"
print("Downloading variant_summary.txt.gz from ClinVar FTP (streaming)...")
req = urllib.request.urlopen(URL)
buf = gzip.GzipFile(fileobj=req)
records = []
for i, raw in enumerate(buf):
    if i == 0:
…Downloading variant_summary.txt.gz from ClinVar FTP (streaming)...
Header loaded: 43 columns
Downloaded & parsed 3,154 raw records.
Unique missense mutations after dedup: 1454
Label distribution (ClinSigSimple):
0 (Benign): 951
1 (Pathogenic): 503
Preview:
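The filtering logic itself is cut off in this preview. As a rough sketch of what such a filter could look like, the snippet below reads tab-delimited rows with the standard library and keeps TP53 records whose `Name` field carries a three-letter protein change. The column names `GeneSymbol`, `Name`, and `ClinSigSimple` are taken from the `variant_summary.txt` header as we understand it; the helper `filter_tp53_missense`, the reduced header in `SAMPLE`, and the dedup rule are illustrative assumptions, not the notebook's actual code.

```python
import csv
import io
import re

# Matches the protein-level part of a ClinVar Name, e.g. "(p.Arg248Gln)".
PROTEIN_CHANGE = re.compile(r"\(p\.([A-Z][a-z]{2}\d+[A-Z][a-z]{2})\)")


def filter_tp53_missense(lines):
    """Hypothetical helper: yield (protein_change, label) for TP53 missense rows."""
    reader = csv.DictReader(lines, delimiter="\t")
    seen = set()
    for row in reader:
        if row["GeneSymbol"] != "TP53":
            continue
        match = PROTEIN_CHANGE.search(row["Name"])
        if match is None:
            continue
        change = match.group(1)
        # Nonsense ("Ter") and synonymous changes are not missense.
        if change.endswith("Ter") or change[:3] == change[-3:]:
            continue
        # Assumption: keep only rows labelled 0 or 1 in ClinSigSimple.
        if row["ClinSigSimple"] not in ("0", "1"):
            continue
        # Assumption: dedup on the protein change alone.
        if change in seen:
            continue
        seen.add(change)
        yield change, int(row["ClinSigSimple"])


# Toy input with a reduced header; the real file has many more columns.
SAMPLE = (
    "GeneSymbol\tName\tClinSigSimple\n"
    "TP53\tNM_000546.6(TP53):c.743G>A (p.Arg248Gln)\t1\n"
    "TP53\tNM_000546.6(TP53):c.743G>A (p.Arg248Gln)\t1\n"
    "TP53\tNM_000546.6(TP53):c.215C>G (p.Pro72Arg)\t0\n"
    "BRCA1\tNM_007294.4(BRCA1):c.181T>G (p.Cys61Gly)\t1\n"
    "TP53\tNM_000546.6(TP53):c.637C>T (p.Arg213Ter)\t1\n"
)

result = list(filter_tp53_missense(io.StringIO(SAMPLE)))
print(result)
```

Against the real download, the gzip stream yields bytes, so it would need wrapping (for example with `io.TextIOWrapper`) before being passed to a reader like this one.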
# ============================================================
# AMINO ACID PROPERTY DICTIONARIES
# ============================================================
# Three-letter → one-letter code
AA1 = {
    'Ala': 'A', 'Arg': 'R', 'Asn': 'N', 'Asp': 'D', 'Cys': 'C',
    'Gln': 'Q', 'Glu': 'E', 'Gly': 'G', 'His': 'H', 'Ile': 'I',
    'Leu': 'L', 'Lys': 'K', 'Met': 'M', 'Phe': 'F', 'Pro': 'P',
    'Ser': 'S', 'Thr': 'T', 'Trp': 'W', 'Tyr': 'Y', 'Val': 'V',
}
# Physicochemical properties per amino acid (one-letter code)
# Hydrophobicity: Kyte-Doolittle scale (higher = more hydrophobic)
# Charge at pH 7.4: +1 (positive), -1 (negative), 0 (neutral)
# Molecular weight (g/mol)
# Volume (Angstrom^3, from Zamyatnin 1972)
# Chemical class
…AA properties defined for 20 amino acids.
TP53 domains: Transactivation, Proline-rich, DNA-binding, Linker, Tetramerization, C-terminal
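The property tables are not shown in this preview. To make the idea of a Δ-property concrete, here is a minimal sketch using the Kyte-Doolittle hydropathy scale named in the comments above and a simplified charge table. The names `KD_HYDROPHOBICITY`, `CHARGE`, and `delta_properties` are hypothetical, and two choices are assumptions: deltas are computed as mutant minus original, and histidine is treated as neutral at pH 7.4.

```python
# Kyte-Doolittle hydropathy index per residue (one-letter code).
KD_HYDROPHOBICITY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

# Simplified side-chain charge near pH 7.4 (assumption: His counted as neutral).
CHARGE = {aa: 0 for aa in KD_HYDROPHOBICITY}
CHARGE.update({'R': 1, 'K': 1, 'D': -1, 'E': -1})


def delta_properties(orig, mut):
    """Hypothetical helper: signed and absolute property changes (mutant - original)."""
    d_hydro = KD_HYDROPHOBICITY[mut] - KD_HYDROPHOBICITY[orig]
    d_charge = CHARGE[mut] - CHARGE[orig]
    return {
        'delta_hydrophobicity': round(d_hydro, 2),
        'abs_delta_hydrophobicity': round(abs(d_hydro), 2),
        'delta_charge': d_charge,
        'abs_delta_charge': abs(d_charge),
    }


# Arg248Trp: a charged, hydrophilic residue replaced by a neutral aromatic one.
example = delta_properties('R', 'W')
print(example)
```

Keeping both the signed and the absolute delta lets a tree model distinguish the direction of a change (gaining versus losing charge) from its magnitude.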
# ============================================================
# FEATURE ENGINEERING PIPELINE
# ============================================================
# Parse each protein_change like "Arg248Trp" into:
# - orig_aa, position, mut_aa
# - Δ-properties between orig and mut
# - functional domain
def parse_protein_change(pc):
    """Parse 'Arg248Trp' → ('Arg', 248, 'Trp')"""
    m = re.match(r'^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$', pc)
    if m:
        return m.group(1), int(m.group(2)), m.group(3)
    return None, None, None

def build_features(df):
    rows = []
    for _, row in df.iterrows():
…Feature matrix: 1368 rows, 21 columns
Label distribution:
target
0    944
1    424
Name: count, dtype: int64
Domains represented:
domain
DNA-binding        829
Transactivation    146
Tetramerization    104
Proline-rich       102
C-terminal         102
Linker              85
Name: count, dtype: int64
Feature columns: orig_aa, mut_aa, position, position_norm, domain, delta_hydrophobicity, delta_charge, delta_mw, delta_volume, abs_delta_hydrophobicity, abs_delta_charge, abs_delta_mw, abs_delta_v…
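The body of `build_features` is cut off here, so the way positions are mapped to domains is not visible. One plausible way to derive the `position_norm` and `domain` columns is sketched below. The residue boundaries in `DOMAIN_BOUNDS` are approximate, contiguous ranges chosen for illustration and may differ from the ones the notebook uses; `position_features` is a hypothetical helper, and dividing by the 393-residue length is likewise an assumption.

```python
TP53_LENGTH = 393  # residues, as stated in the introduction (UniProt P04637)

# ASSUMPTION: approximate, contiguous boundaries for illustration only.
DOMAIN_BOUNDS = [
    (1, 61, 'Transactivation'),
    (62, 93, 'Proline-rich'),
    (94, 292, 'DNA-binding'),
    (293, 324, 'Linker'),
    (325, 356, 'Tetramerization'),
    (357, 393, 'C-terminal'),
]


def position_features(position):
    """Hypothetical helper: positional features for a 1-based residue index."""
    if not 1 <= position <= TP53_LENGTH:
        raise ValueError(f"position {position} is outside 1..{TP53_LENGTH}")
    for start, end, name in DOMAIN_BOUNDS:
        if start <= position <= end:
            return {
                'position': position,
                # Assumption: normalise by the full protein length.
                'position_norm': position / TP53_LENGTH,
                'domain': name,
            }


for pos in (72, 248, 337):
    print(pos, position_features(pos)['domain'])
```

Because the ranges are contiguous, every valid position falls in exactly one domain, which matches the six categories listed in the output above.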
# ============================================================
# DATA EXPLORATION & VISUALIZATION
# ============================================================
fig, axes = plt.subplots(2, 3, figsize=(16, 8))
fig.suptitle('TP53 Missense Mutation Dataset Exploration', fontsize=14, fontweight='bold')
# 1. Class balance
ax = axes[0, 0]
counts = df_feat['target'].map({0: 'Benign', 1: 'Pathogenic'}).value_counts()
colors = ['#2ecc71', '#e74c3c']
ax.bar(counts.index, counts.values, color=colors, width=0.5)
ax.set_title('Class Balance')
ax.set_ylabel('Count')
for i, v in enumerate(counts.values):
    ax.text(i, v + 20, str(v), ha='center', fontweight='bold')
ax.set_ylim(0, counts.max() * 1.15)
…
# ============================================================
# TRAIN/TEST SPLIT (Stratified 80/20)
# ============================================================
from sklearn.model_selection import train_test_split
# Define feature columns for the model
feature_cols = [
    'position', 'position_norm',
    'delta_hydrophobicity', 'abs_delta_hydrophobicity',
    'delta_charge', 'abs_delta_charge',
    'delta_mw', 'abs_delta_mw',
    'delta_volume', 'abs_delta_volume',
    'is_same_polarity', 'is_same_class',
]
# Add one-hot encoded categoricals for domain, orig_aa, mut_aa, orig_class, mut_class
categorical_cols = ['domain', 'orig_aa', 'mut_aa', 'orig_class', 'mut_class']
…Train set: 1094 samples (339 pathogenic, 755 benign)
Test set: 274 samples (85 pathogenic, 189 benign)
Total features: 72
Features: position, position_norm, delta_hydrophobicity, abs_delta_hydrophobicity, delta_charge, abs_delta_charge, delta_mw, abs_delta_mw, delta_volume, abs_delta_volume, is_same_polarity, is_same_class, domain_C-terminal, domain_DNA-binding, domain_Linker, domain_Proline-rich, domain_Tetramerization, domain_Transactivation, orig_aa_A, orig_aa_C, orig_aa_D, orig_aa_E, orig_…
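The encoding and split code is truncated in this preview. A minimal sketch of one way to get `prefix_value` column names like those listed above, followed by a stratified 80/20 split, is shown below on a synthetic frame. The values in `toy` are made up purely to exercise the code and are not ClinVar records; using `pd.get_dummies` and `random_state=42` are assumptions about the notebook, not confirmed details.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame: 15 "benign" (0) and 5 "pathogenic" (1) rows.
toy = pd.DataFrame({
    'position': list(range(10, 210, 10)),
    'delta_charge': [0, 1, -1, 0] * 5,
    'domain': ['DNA-binding', 'Linker', 'Transactivation', 'DNA-binding'] * 5,
    'target': [0] * 15 + [1] * 5,
})

numeric_cols = ['position', 'delta_charge']
categorical_cols = ['domain']

# One-hot encode the categoricals; numeric columns pass through unchanged.
X = pd.get_dummies(toy[numeric_cols + categorical_cols], columns=categorical_cols)
y = toy['target']

# stratify=y keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(list(X.columns))
print(f"Train: {len(y_train)} rows, {int(y_train.sum())} pathogenic")
print(f"Test:  {len(y_test)} rows, {int(y_test.sum())} pathogenic")
```

Stratifying matters here because the classes are imbalanced (roughly 31% pathogenic in the reported split); without it, a small test set could end up with too few pathogenic examples to evaluate reliably.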