import pandas as pd
from sklearn.linear_model import LinearRegression
Now we perform the data modeling in Python, working inside an isolated environment created with venv. The steps are straightforward: we first create the virtual environment from the terminal with the following command:
python -m venv /path/to/new/virtual/environment
Then we can activate it with this command:
source <venv>/bin/activate
where <venv> is the path to the virtual environment.
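As a quick sanity check (not part of the original workflow), we can confirm from within Python that the interpreter we are running lives inside the freshly created environment:

import sys

# Both paths should point inside /path/to/new/virtual/environment once the venv is active
print(sys.prefix)
print(sys.executable)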
Simple Regression
# Load the penguins data and fit a simple linear regression of body mass on bill length
df = pd.read_csv("~/Desktop/Projects/DO4DS/eda_modeling/data/penguins.csv")

model = LinearRegression()
x = df["bill_length_mm"]
y = df["body_mass_g"]
model.fit(x.values.reshape(-1, 1), y)
LinearRegression()
To have a look at the estimated slope coefficient, we can use the following:
print(f"the estimated slope parameter for the only predictor {x.name} is {model.coef_[0]}")
the estimated slope parameter for the only predictor bill_length_mm is 86.79175964755542
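The intercept is exposed in the same way; as a small optional addition (not in the original output) we can print it alongside the slope:

# Estimated intercept: the predicted body mass at bill_length_mm = 0,
# an extrapolation far outside the observed range, so mainly of formal interest
print(f"the estimated intercept is {model.intercept_}")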
y_pred = model.predict(x.values.reshape(-1, 1))
from sklearn.metrics import r2_score
r2_score(y, y_pred)
0.34745261128883764
The \(R^2\) score is pretty low, suggesting that the model could be improved by adding further predictors, for example the variation coming from the species.
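As a sanity check of what this score means, \(R^2\) can be recomputed directly from its definition, \(1 - SS_{res}/SS_{tot}\); the short sketch below should reproduce the value returned by r2_score:

import numpy as np

# R^2 from its definition: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)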
from pins import board_folder
from vetiver import vetiver_pin_write, VetiverModel
# Create a local pins board and pin the fitted model with vetiver
model_board = board_folder(
    "data/model",
    allow_pickle_read=True
)
v = VetiverModel(model, model_name='mock_model', prototype_data=x.values.reshape(-1, 1))
vetiver_pin_write(model_board, v)
Model Cards provide a framework for transparent, responsible reporting.
Use the vetiver `.qmd` Quarto template as a place to start,
with vetiver.model_card()
('The hash of pin "mock_model" has not changed. Your pin will not be stored.',)
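To use the pinned model later, for example in a separate deployment script, vetiver can read it back from the board; a minimal sketch, assuming the pin name used above:

# Reload the pinned model from the local board; the name must match the pin written above
v_loaded = VetiverModel.from_pin(model_board, "mock_model")
print(v_loaded.model.coef_)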
from vetiver import VetiverAPI
app = VetiverAPI(v, check_prototype=True)
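The API object can then be served locally; a minimal sketch (the port choice here is arbitrary):

# Start a local server exposing the model's /predict endpoint
app.run(port=8080)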
Multiple Regression
# One-hot encode the species column and combine it with bill length
dummies_species = pd.get_dummies(df["species"])
df_x = pd.concat([x, dummies_species], axis=1)
df_x.head()
|   | bill_length_mm | Adelie | Chinstrap | Gentoo |
|---|---|---|---|---|
| 0 | 39.1 | True | False | False |
| 1 | 39.5 | True | False | False |
| 2 | 40.3 | True | False | False |
| 3 | 36.7 | True | False | False |
| 4 | 39.3 | True | False | False |
Now we can fit a multiple regression model:
multiple_reg = LinearRegression()
multiple_reg.fit(df_x, y)
LinearRegression()
print(df_x.columns)
print(multiple_reg.coef_)
Index(['bill_length_mm', 'Adelie', 'Chinstrap', 'Gentoo'], dtype='object')
[ 90.29759774 93.41326282 -783.52837788 690.11511506]
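Note that keeping all three species dummies together with the model intercept makes the design matrix perfectly collinear, so the individual dummy coefficients should not be over-interpreted. A common alternative, sketched below with illustrative names (df_x_drop, reg_drop), is to drop one level so the remaining dummies are contrasts against it:

# Drop one species level to avoid perfect collinearity with the intercept
df_x_drop = pd.concat([x, pd.get_dummies(df["species"], drop_first=True)], axis=1)
reg_drop = LinearRegression().fit(df_x_drop, y)
print(df_x_drop.columns)
print(reg_drop.coef_)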
y_pred_multiple = multiple_reg.predict(df_x)
r2_score(y, y_pred_multiple)
0.7848476112531085
By including the variation coming from the different species, the \(R^2\) score increases substantially, from roughly 0.35 to 0.78.