TP53 Missense Mutation Pathogenicity Classifier

By Ju Lin · Published June 29, 2026

Predict whether TP53 missense mutations are pathogenic or benign using amino-acid physicochemical properties and Random Forest classification on ClinVar data.

genomics
classification
protein-structure
clinvar
random-forest
feature-engineering

13 cells · 1 experiment · 150 views · 0 forks


Inside this notebook

# TP53 Missense Mutation Classifier

Predict whether a single-amino-acid substitution in the **tumor suppressor p53 (TP53)** is **harmful (pathogenic)** or **benign** using features derived from the mutation position, the original and substituted amino acids, and their physicochemical properties.

**Data source:** [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/), the NCBI public archive of human variant interpretations, filtered for TP53 missense mutations with known clinical significance (~1,475 unique variants).

**Protein:** TP53 (UniProt P04637), 393 residues, one of the most extensively studied proteins in cancer genetics.

**Pipeline:** Download → Parse HGVS notation → Engineer amino-acid Δ-properties → Train Random Forest classifier → Evaluate → Explain example predictions.

# ============================================================
# IMPORTS & SETUP
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import gzip
import io
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
…

All imports loaded successfully.

# ============================================================
# DOWNLOAD & FILTER TP53 VARIANTS FROM CLINVAR
# ============================================================
# Stream the gzip file from NCBI FTP; filter for TP53 missense
# mutations with known clinical significance.

import urllib.request
import sys

URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz"

print("Downloading variant_summary.txt.gz from ClinVar FTP (streaming)...")
req = urllib.request.urlopen(URL)
buf = gzip.GzipFile(fileobj=req)

records = []
for i, raw in enumerate(buf):
    if i == 0:
…

Downloading variant_summary.txt.gz from ClinVar FTP (streaming)...
Header loaded: 43 columns

Downloaded & parsed 3,154 raw records.
Unique missense mutations after dedup: 1454

Label distribution (ClinSigSimple):
  0 (Benign):      951
  1 (Pathogenic):  503

Preview:
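The download cell is truncated in this preview, so the following is only a minimal sketch of how the filtering and deduplication step could look, not the notebook's actual code. It assumes the `GeneSymbol`, `Name`, and `ClinSigSimple` columns of `variant_summary.txt`, and that `ClinSigSimple` is 1 for pathogenic or likely pathogenic records, 0 otherwise, and -1 when no assertion exists. The sample rows, their labels, and the helper name `filter_tp53_missense` are illustrative.

```python
import csv
import io
import re

# Toy stand-in for a handful of columns; the real file has many more.
# Labels here are illustrative, not copied from ClinVar.
SAMPLE_TSV = (
    "#AlleleID\tType\tName\tGeneSymbol\tClinSigSimple\n"
    "1\tsingle nucleotide variant\tNM_000546.6(TP53):c.742C>T (p.Arg248Trp)\tTP53\t1\n"
    "2\tsingle nucleotide variant\tNM_000546.6(TP53):c.742C>T (p.Arg248Trp)\tTP53\t1\n"
    "3\tsingle nucleotide variant\tNM_000546.6(TP53):c.215C>G (p.Pro72Arg)\tTP53\t0\n"
    "4\tsingle nucleotide variant\tNM_000546.6(TP53):c.637C>T (p.Arg213Ter)\tTP53\t1\n"
    "5\tsingle nucleotide variant\tNM_007294.4(BRCA1):c.181T>G (p.Cys61Gly)\tBRCA1\t1\n"
    "6\tsingle nucleotide variant\tNM_000546.6(TP53):c.375G>A (p.Thr125=)\tTP53\t-1\n"
)

# Three-letter residue, position, three-letter residue inside "(p....)".
MISSENSE_RE = re.compile(r"\(p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)")


def filter_tp53_missense(lines):
    """Return {protein_change: label} for TP53 missense rows, first record wins."""
    reader = csv.DictReader(lines, delimiter="\t")
    variants = {}
    for row in reader:
        if row["GeneSymbol"] != "TP53":
            continue
        if row["ClinSigSimple"] not in {"0", "1"}:
            continue  # skip rows without a usable label
        m = MISSENSE_RE.search(row["Name"])
        # 'Ter' (stop codon) also fits the three-letter pattern, so exclude it.
        if m is None or "Ter" in (m.group(1), m.group(3)):
            continue
        change = m.group(1) + m.group(2) + m.group(3)
        variants.setdefault(change, int(row["ClinSigSimple"]))  # dedup
    return variants


variants = filter_tp53_missense(io.StringIO(SAMPLE_TSV))
print(variants)
```

Against the real stream, the same function could be fed `io.TextIOWrapper(gzip.GzipFile(fileobj=response), encoding="utf-8")`, which yields decoded lines one at a time without loading the whole file.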

# ============================================================
# AMINO ACID PROPERTY DICTIONARIES
# ============================================================
# Three-letter → one-letter code
AA1 = {
    'Ala': 'A', 'Arg': 'R', 'Asn': 'N', 'Asp': 'D', 'Cys': 'C',
    'Gln': 'Q', 'Glu': 'E', 'Gly': 'G', 'His': 'H', 'Ile': 'I',
    'Leu': 'L', 'Lys': 'K', 'Met': 'M', 'Phe': 'F', 'Pro': 'P',
    'Ser': 'S', 'Thr': 'T', 'Trp': 'W', 'Tyr': 'Y', 'Val': 'V',
}

# Physicochemical properties per amino acid (one-letter code)
# Hydrophobicity: Kyte-Doolittle scale (higher = more hydrophobic)
# Charge at pH 7.4: +1 (positive), -1 (negative), 0 (neutral)
# Molecular weight (g/mol)
# Volume (Angstrom^3, from Zamyatnin 1972)
# Chemical class

…

AA properties defined for 20 amino acids.
TP53 domains: Transactivation, Proline-rich, DNA-binding, Linker, Tetramerization, C-terminal
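The property dictionaries themselves are elided in this preview. As a rough sketch of what such lookup tables might contain, the block below lists the Kyte-Doolittle hydropathy index, a simple charge assignment, and a position-to-domain lookup. The neutral treatment of histidine and the domain boundaries are assumptions chosen for illustration (published boundaries vary by source), and the names `HYDROPHOBICITY`, `CHARGE`, `TP53_DOMAINS`, and `domain_of` are hypothetical.

```python
# Kyte-Doolittle hydropathy index (higher = more hydrophobic).
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Side-chain charge at pH 7.4. Assumption: His is treated as neutral.
CHARGE = {aa: 0 for aa in HYDROPHOBICITY}
CHARGE.update({"R": 1, "K": 1, "D": -1, "E": -1})

# Inclusive residue ranges. These boundaries are assumptions for this sketch;
# only the domain names come from the notebook output.
TP53_DOMAINS = [
    ("Transactivation", 1, 61),
    ("Proline-rich", 62, 93),
    ("DNA-binding", 94, 292),
    ("Linker", 293, 324),
    ("Tetramerization", 325, 356),
    ("C-terminal", 357, 393),
]


def domain_of(position):
    """Map a 1-based residue position to its domain name."""
    for name, start, end in TP53_DOMAINS:
        if start <= position <= end:
            return name
    raise ValueError(f"position {position} is outside TP53 (1-393)")


print(domain_of(248), HYDROPHOBICITY["R"], CHARGE["R"])
```

Keeping the domains as an ordered list of ranges makes it easy to swap in different boundaries without touching the lookup logic.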

# ============================================================
# FEATURE ENGINEERING PIPELINE
# ============================================================
# Parse each protein_change like "Arg248Trp" into:
#   - orig_aa, position, mut_aa
#   - Δ-properties between orig and mut
#   - functional domain

def parse_protein_change(pc):
    """Parse 'Arg248Trp' → ('Arg', 248, 'Trp')"""
    m = re.match(r'^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$', pc)
    if m:
        return m.group(1), int(m.group(2)), m.group(3)
    return None, None, None

def build_features(df):
    rows = []
    for _, row in df.iterrows():
…

Feature matrix: 1368 rows, 21 columns

Label distribution:
target
0    944
1    424
Name: count, dtype: int64

Domains represented:
domain
DNA-binding        829
Transactivation    146
Tetramerization    104
Proline-rich       102
C-terminal         102
Linker              85
Name: count, dtype: int64

Feature columns:
orig_aa, mut_aa, position, position_norm, domain, delta_hydrophobicity, delta_charge, delta_mw, delta_volume, abs_delta_hydrophobicity, abs_delta_charge, abs_delta_mw, abs_delta_v…
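Since the body of `build_features` is truncated here, the following is a hedged sketch of how a single feature row could be derived from a protein change. It assumes deltas are computed as mutant minus original and that `position_norm` is the position divided by the 393-residue protein length; neither convention is confirmed by the preview. The three-residue `PROPS` table and the function name `delta_features` are illustrative only.

```python
import re

TP53_LENGTH = 393  # residues, as stated in the notebook introduction

# Toy table covering three residues; a full table would cover all 20.
# Hydrophobicity: Kyte-Doolittle; volume in cubic angstroms (Zamyatnin 1972).
PROPS = {
    "Arg": {"hydrophobicity": -4.5, "charge": 1, "volume": 173.4},
    "Trp": {"hydrophobicity": -0.9, "charge": 0, "volume": 227.8},
    "Gln": {"hydrophobicity": -3.5, "charge": 0, "volume": 143.8},
}


def delta_features(protein_change, props=PROPS):
    """Build signed and absolute property deltas for e.g. 'Arg248Trp'."""
    m = re.match(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$", protein_change)
    if m is None:
        return None
    orig, pos, mut = m.group(1), int(m.group(2)), m.group(3)
    if orig not in props or mut not in props:
        return None  # unknown residue code, e.g. the stop codon 'Ter'
    row = {
        "orig_aa": orig,
        "mut_aa": mut,
        "position": pos,
        "position_norm": pos / TP53_LENGTH,  # assumed normalization
    }
    for name in ("hydrophobicity", "charge", "volume"):
        delta = props[mut][name] - props[orig][name]  # mutant minus original
        row[f"delta_{name}"] = delta
        row[f"abs_delta_{name}"] = abs(delta)
    return row


row = delta_features("Arg248Trp")
print(row)
```

Keeping both the signed and the absolute delta lets a tree model distinguish the direction of a change (for example, losing a positive charge) from its magnitude alone.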

# ============================================================
# DATA EXPLORATION & VISUALIZATION
# ============================================================

fig, axes = plt.subplots(2, 3, figsize=(16, 8))
fig.suptitle('TP53 Missense Mutation Dataset Exploration', fontsize=14, fontweight='bold')

# 1. Class balance
ax = axes[0, 0]
counts = df_feat['target'].map({0: 'Benign', 1: 'Pathogenic'}).value_counts()
colors = ['#2ecc71', '#e74c3c']
ax.bar(counts.index, counts.values, color=colors, width=0.5)
ax.set_title('Class Balance')
ax.set_ylabel('Count')
for i, v in enumerate(counts.values):
    ax.text(i, v + 20, str(v), ha='center', fontweight='bold')
ax.set_ylim(0, counts.max() * 1.15)

…

# ============================================================
# TRAIN/TEST SPLIT (Stratified 80/20)
# ============================================================
from sklearn.model_selection import train_test_split

# Define feature columns for the model
feature_cols = [
    'position', 'position_norm',
    'delta_hydrophobicity', 'abs_delta_hydrophobicity',
    'delta_charge', 'abs_delta_charge',
    'delta_mw', 'abs_delta_mw',
    'delta_volume', 'abs_delta_volume',
    'is_same_polarity', 'is_same_class',
]

# Add one-hot encoded categoricals for domain, orig_aa, mut_aa, orig_class, mut_class
categorical_cols = ['domain', 'orig_aa', 'mut_aa', 'orig_class', 'mut_class']

…

Train set: 1094 samples (339 pathogenic, 755 benign)
Test set:  274 samples (85 pathogenic, 189 benign)

Total features: 72
Features:
position, position_norm, delta_hydrophobicity, abs_delta_hydrophobicity, delta_charge, abs_delta_charge, delta_mw, abs_delta_mw, delta_volume, abs_delta_volume, is_same_polarity, is_same_class, domain_C-terminal, domain_DNA-binding, domain_Linker, domain_Proline-rich, domain_Tetramerization, domain_Transactivation, orig_aa_A, orig_aa_C, orig_aa_D, orig_aa_E, orig_…
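The encoding and split code is cut off in this preview. One plausible way to do it, sketched below, is `pandas.get_dummies` for the categorical columns followed by a stratified `train_test_split`. The synthetic frame reuses only the class counts reported in the feature-matrix output (944 benign, 424 pathogenic); the feature values are random placeholders, and `random_state=42` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_benign, n_pathogenic = 944, 424  # class counts from the feature matrix
n = n_benign + n_pathogenic

# Placeholder features; only the label counts mirror the notebook.
toy = pd.DataFrame({
    "position": rng.integers(1, 394, size=n),
    "delta_charge": rng.integers(-2, 3, size=n),
    "domain": rng.choice(["DNA-binding", "Linker", "Tetramerization"], size=n),
    "target": [0] * n_benign + [1] * n_pathogenic,
})

# One-hot encode the categorical column; numeric columns pass through.
X = pd.get_dummies(toy.drop(columns="target"), columns=["domain"])
y = toy["target"]

# stratify=y keeps the pathogenic/benign ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(int(y_train.sum()), int(y_test.sum()))
```

With roughly 31% pathogenic variants, stratification matters: an unstratified split of this size could shift the class ratio between train and test by a few percentage points and make the evaluation harder to interpret.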

This is a preview; the live notebook contains all 13 cells with their charts and full outputs.