Visualizing scikit-learn pipelines in Jupyter#
The goal of keeping this notebook is to:

- make it available for users who want to reproduce it locally;
- archive the script in case we want to rerecord this video after a change in the scikit-learn UI in a future release.
First we load the dataset#
We need to define our data and target. In this case, we build a binary classification model to predict whether a house sells for more than 200,000 dollars:
import pandas as pd

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data, target = (
    ames_housing.drop(columns=target_name),
    ames_housing[target_name],
)
# Binarize the target: 1 if the house sold for more than 200,000 dollars
target = (target > 200_000).astype(int)
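Thresholding the price turns this into a binary target. As a quick sanity check, one can look at the resulting class balance, for instance:

# Proportion of houses above (1) and below (0) the 200,000 dollar threshold
target.value_counts(normalize=True)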
We inspect the first and last rows of the dataframe:
data
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal |
1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal |
1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal |
1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal |
1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal |
1460 rows × 80 columns
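Note that the dataset mixes numerical and categorical columns, several of which contain missing values (the NaN entries above). A quick way to inspect the column types and non-null counts with standard pandas:

# Summary of dtypes and non-null counts for every column
data.info()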
For the sake of simplicity, we cherry-pick a few features and retain only this arbitrary subset of the data:
numeric_features = ["LotArea", "FullBath", "HalfBath"]
categorical_features = ["Neighborhood", "HouseStyle"]
data = data[numeric_features + categorical_features]
Then we create the pipeline#
The first step is to define the preprocessing of the numerical and categorical features:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
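Setting handle_unknown="ignore" makes the encoder output a row of zeros for any category that was not seen during fit, instead of raising an error at prediction time. A minimal standalone sketch, using illustrative category values:

import numpy as np

demo_encoder = OneHotEncoder(handle_unknown="ignore")
demo_encoder.fit(np.array([["OldTown"], ["NAmes"]]))  # known categories
# A category unseen at fit time encodes as all zeros instead of raising
demo_encoder.transform(np.array([["Somerst"]])).toarray()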
The next step is to apply these transformations to the appropriate columns using ColumnTransformer:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
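The preprocessor can already be fitted on its own, which is a convenient way to check the shape of the transformed feature matrix before adding a model:

# Fit the preprocessing alone: 3 scaled numerical columns plus the
# one-hot encoded categories
preprocessor.fit_transform(data).shape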
Then we define the classification model and chain all the steps in order:
from sklearn.linear_model import LogisticRegression
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)
Let's visualize it!
model
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['LotArea', 'FullBath', 'HalfBath']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Neighborhood', 'HouseStyle'])])),
                ('classifier', LogisticRegression())])

In a Jupyter environment, this output renders as an interactive HTML diagram in which each step of the pipeline can be expanded; the text above is the equivalent plain representation.
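If only the plain-text representation shows up, the interactive diagram can be enabled explicitly with set_config (in recent scikit-learn versions it is the default display):

from sklearn import set_config

# Ask scikit-learn to render estimators as interactive HTML diagrams
set_config(display="diagram")
model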
Finally we score the model#
from sklearn.model_selection import cross_validate

# 5-fold cross-validation of the full pipeline (preprocessing + classifier)
cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)
The mean cross-validation accuracy is: 0.859 ± 0.018
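Beyond test scores, cross_validate also reports the fit and score times for each fold, which can be handy when comparing pipelines:

# The other entries returned by cross_validate
print(cv_results.keys())       # fit_time, score_time, test_score
print(cv_results["fit_time"])  # seconds spent fitting on each fold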
Note
In this case, the pipeline correctly predicts whether the price of a house is above or below the 200,000 dollar threshold about 86% of the time. But be aware that this score was obtained by picking some features by hand, which is not necessarily the best thing we can do for this classification task. In this example we can hope that fitting a more complex machine learning pipeline on a richer set of features could improve upon this performance level.
Reducing a price estimation problem to a binary classification problem with a single threshold at 200,000 dollars is probably too coarse to be useful in practice. Treating this problem as a regression problem is likely a better idea. We will see later in this MOOC how to train and evaluate the performance of various regression models.
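As a glimpse of that direction, the preprocessing defined above can be reused unchanged with a regression model on the raw SalePrice target. A minimal sketch, where the choice of Ridge is arbitrary:

from sklearn.linear_model import Ridge

# Same preprocessing, but predicting the raw sale price this time
regression_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", Ridge()),
    ]
)
cv_results = cross_validate(regression_model, data, ames_housing[target_name], cv=5)
# cross_validate uses the regressor's default R^2 score here
print(f"Mean R2: {cv_results['test_score'].mean():.3f}")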