Preparing Data for Machine Learning (ETL)
End-to-End Machine Learning: Titanic Survival Prediction
Published Nov 18 2025
Tags: Keras, Machine Learning, Matplotlib, NumPy, Pandas, Python, scikit-learn, SciPy, Seaborn, TensorFlow
Before a model can be trained, the data must meet a few requirements:
- No missing values
- All inputs numeric
- Categorical variables encoded as numbers
- Separate training and test sets
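A quick way to see which of these requirements a DataFrame still fails is to count missing values and list the non-numeric columns. The sketch below uses a small made-up frame (`toy`, with hypothetical values) rather than the real Titanic data:

```python
import pandas as pd

# Toy frame with hypothetical values, mimicking a few Titanic columns.
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 51.86],
    "sex": ["male", "female", "female", "male"],
    "survived": [0, 1, 1, 0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Columns that are not numeric and therefore still need encoding.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts[missing_counts > 0])
print(non_numeric_cols)
```

Here `age` has one gap to fill and `sex` is the only column that needs encoding.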
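We define the feature list and the target, select those columns, and fill the missing values:
features = [
    "pclass", "sex", "age", "sibsp", "parch",
    "fare", "embarked", "class", "who", "alone",
]
target = "survived"

# Work on a copy so the original titanic DataFrame stays untouched.
data = titanic[features + [target]].copy()

# Numeric gaps get the column median; categorical gaps get the most frequent value.
data["age"] = data["age"].fillna(data["age"].median())
data["fare"] = data["fare"].fillna(data["fare"].median())
data["embarked"] = data["embarked"].fillna(data["embarked"].mode()[0])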
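The fill values above are computed from the whole dataset, so the test rows contribute to the median and mode. A stricter variant, which this guide does not use, learns the fill values from the training rows only and reuses them on the test rows. A minimal sketch on made-up frames (`train` and `test` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical training and test rows with gaps in both columns.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "embarked": ["S", "S", "C", np.nan],
})
test = pd.DataFrame({
    "age": [np.nan, 50.0],
    "embarked": [np.nan, "Q"],
})

# Learn the fill values from the training rows only.
fill_values = {
    "age": train["age"].median(),
    "embarked": train["embarked"].mode()[0],
}

# Apply the same values to both sets.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

On a dataset the size of Titanic the difference is usually small, but this ordering keeps the test set from influencing preprocessing.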
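Split the data into training and test sets, stratifying on the target so both sets keep the same survival rate:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data[target],
    test_size=0.2, random_state=42, stratify=data[target]
)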
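To illustrate what `stratify` does, the sketch below splits a made-up frame (`toy`, 100 hypothetical rows with 40% positives) and compares the positive rate in each part:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 hypothetical rows: 40 positives followed by 60 negatives.
toy = pd.DataFrame({"x": range(100), "survived": [1] * 40 + [0] * 60})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x"]], toy["survived"],
    test_size=0.2, random_state=42, stratify=toy["survived"]
)

# With stratification, both parts keep the overall class ratio.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Because 40% of 80 and 40% of 20 are whole numbers, both parts end up with exactly the same positive rate here. On real data the rates match only as closely as the row counts allow.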
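Define the numeric and categorical transformers. Numeric columns are standardized and categorical columns are one-hot encoded:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "sibsp", "parch", "fare", "pclass"]
categorical_features = ["sex", "embarked", "class", "who", "alone"]

preprocessor = ColumnTransformer(
    transformers=[
        # Scale numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), numeric_features),
        # Categories unseen during fit are encoded as all zeros instead of raising.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)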














