```python
import pandas as pd
from sklearn.linear_model import LinearRegression
```

Now we perform the data modeling using Python, working in an isolated environment created with `venv`. First, create the virtual environment from the terminal with the following command:

```shell
python -m venv /path/to/new/virtual/environment
```

Then activate it:

```shell
source <venv>/bin/activate
```

where `<venv>` is the path of the virtual environment.
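Once activated, you can confirm from Python itself that the interpreter is running inside the virtual environment. The sketch below (the `in_virtualenv` helper is just an illustrative name, not part of any library) relies on the fact that `sys.prefix` points inside the venv while `sys.base_prefix` still points at the system installation:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix is redirected to the environment directory,
    # while sys.base_prefix keeps pointing at the base Python installation.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```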
## Simple Regression
```python
df = pd.read_csv("~/Desktop/Projects/DO4DS/eda_modeling/data/penguins.csv")

model = LinearRegression()
x = df["bill_length_mm"]
y = df["body_mass_g"]
model.fit(x.values.reshape(-1, 1), y)
```
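As a side note, scikit-learn estimators expect the feature matrix to be two-dimensional, which is why the single predictor is passed through `reshape(-1, 1)` before fitting. A minimal illustration with made-up bill lengths:

```python
import numpy as np

bill = np.array([39.1, 39.5, 40.3])  # made-up bill lengths in mm

print(bill.shape)                 # a 1-D array: (3,)
print(bill.reshape(-1, 1).shape)  # a 2-D column vector: (3, 1)
```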
To have a look at the estimated slope coefficient we can use the following:
```python
print(f"the estimated slope parameter for the only predictor {x.name} is {model.coef_[0]}")
```

```
the estimated slope parameter for the only predictor bill_length_mm is 86.79175964755542
```
```python
from sklearn.metrics import r2_score

y_pred = model.predict(x.values.reshape(-1, 1))
r2_score(y, y_pred)
```

```
0.34745261128883764
```
The \(R^2\) score is pretty low, suggesting that the model could be improved by adding further predictors, for example the variation coming from the species.
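To see what the score actually measures, \(R^2 = 1 - SS_{res}/SS_{tot}\) can be computed by hand and checked against `r2_score`. The sketch below uses synthetic stand-in data (invented bill lengths and masses, not the real penguins file):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the penguins data (values are made up).
rng = np.random.default_rng(0)
x_demo = rng.uniform(35, 55, size=100)              # bill lengths in mm
y_demo = 50 * x_demo + rng.normal(0, 400, size=100)  # body mass with noise

reg = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
pred = reg.predict(x_demo.reshape(-1, 1))

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(np.isclose(manual_r2, r2_score(y_demo, pred)))
```

In words: \(R^2\) is the fraction of the variance in body mass that the fitted line explains, which is why a value of 0.35 reads as "most of the variation is still unaccounted for".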
```python
from pins import board_folder
from vetiver import vetiver_pin_write, VetiverModel

model_board = board_folder(
    "data/model",
    allow_pickle_read=True
)
v = VetiverModel(model, model_name="mock_model", prototype_data=x.values.reshape(-1, 1))
vetiver_pin_write(model_board, v)
```

```
Model Cards provide a framework for transparent, responsible reporting.
Use the vetiver `.qmd` Quarto template as a place to start,
with vetiver.model_card()
('The hash of pin "mock_model" has not changed. Your pin will not be stored.',)
```
```python
from vetiver import VetiverAPI

app = VetiverAPI(v, check_prototype=True)
```

## Multiple regression
```python
dummies_species = pd.get_dummies(df["species"])
df_x = pd.concat([x, dummies_species], axis=1)
df_x.head()
```

|   | bill_length_mm | Adelie | Chinstrap | Gentoo |
|---|---|---|---|---|
| 0 | 39.1 | True | False | False |
| 1 | 39.5 | True | False | False |
| 2 | 40.3 | True | False | False |
| 3 | 36.7 | True | False | False |
| 4 | 39.3 | True | False | False |
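One caveat worth noting: because `LinearRegression` fits an intercept by default, keeping all three dummy columns makes the design matrix collinear (the "dummy variable trap"); scikit-learn still returns a least-squares solution, but the individual coefficients are no longer uniquely identified. `pd.get_dummies` can drop one level to avoid this. A small sketch with made-up species values:

```python
import pandas as pd

species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"], name="species")

# drop_first=True removes the first level (Adelie), which becomes the
# baseline absorbed by the intercept; the remaining dummies measure
# offsets relative to it.
dummies = pd.get_dummies(species, drop_first=True)
print(dummies.columns.tolist())  # ['Chinstrap', 'Gentoo']
```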
Now we can fit a multiple regression model:
```python
multiple_reg = LinearRegression()
multiple_reg.fit(df_x, y)
```

```python
print(df_x.columns)
print(multiple_reg.coef_)
```

```
Index(['bill_length_mm', 'Adelie', 'Chinstrap', 'Gentoo'], dtype='object')
[  90.29759774   93.41326282 -783.52837788  690.11511506]
```
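Each fitted value decomposes as the intercept plus the dot product of the coefficient vector with the feature row. The sketch below verifies this identity on a tiny made-up frame with the same column layout as `df_x` (all values are invented, not taken from the penguins data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny synthetic frame mimicking df_x's layout (values are made up).
demo_x = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.0],
    "Adelie":    [1, 0, 0, 1],
    "Chinstrap": [0, 1, 0, 0],
    "Gentoo":    [0, 0, 1, 0],
})
demo_y = np.array([3750, 3800, 5200, 3600])

reg = LinearRegression().fit(demo_x, demo_y)

# A prediction is intercept_ + sum(coef_i * feature_i) for the row.
row = demo_x.iloc[0]
manual = reg.intercept_ + np.dot(reg.coef_, row.values)
print(np.isclose(manual, reg.predict(demo_x.iloc[[0]])[0]))
```

Read this way, the species dummies act as additive offsets to the baseline mass predicted from bill length alone.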
```python
y_pred_multiple = multiple_reg.predict(df_x)
r2_score(y, y_pred_multiple)
```

```
0.7848476112531085
```
By including the variation coming from the different species, the \(R^2\) score more than doubles, from about 0.35 to 0.78.