```python
import pandas as pd
from sklearn.linear_model import LinearRegression
```

Now we perform the data modeling using Python, working in an isolated environment created with `venv`. First, create the virtual environment from the terminal with the following command:

```shell
python -m venv /path/to/new/virtual/environment
```

Then activate it:

```shell
source <venv>/bin/activate
```

where `<venv>` is the path of the virtual environment.
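Once activated, you can confirm from Python itself that the interpreter is running inside the virtual environment. The sketch below (the `in_virtualenv` helper is just an illustrative name, not part of any library) relies on the fact that `sys.prefix` points inside the venv while `sys.base_prefix` still points at the system installation:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix is redirected to the environment directory,
    # while sys.base_prefix keeps pointing at the base Python installation.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```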
## Simple Regression
```python
df = pd.read_csv("~/Desktop/Projects/DO4DS/eda_modeling/data/penguins.csv")

model = LinearRegression()
x = df["bill_length_mm"]
y = df["body_mass_g"]
model.fit(x.values.reshape(-1, 1), y)
```
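As a side note, scikit-learn estimators expect the feature matrix to be two-dimensional, which is why the single predictor is passed through `reshape(-1, 1)` before fitting. A minimal illustration with made-up bill lengths:

```python
import numpy as np

bill = np.array([39.1, 39.5, 40.3])  # made-up bill lengths in mm

print(bill.shape)                 # a 1-D array: (3,)
print(bill.reshape(-1, 1).shape)  # a 2-D column vector: (3, 1)
```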
To have a look at the estimated slope coefficient we can use the following:
```python
print(f"the estimated slope parameter for the only predictor {x.name} is {model.coef_[0]}")
```

```
the estimated slope parameter for the only predictor bill_length_mm is 86.79175964755542
```
```python
from sklearn.metrics import r2_score

y_pred = model.predict(x.values.reshape(-1, 1))
r2_score(y, y_pred)
```

```
0.34745261128883764
```
The \(R^2\) score is pretty low, suggesting that the model could be improved by adding further predictors, for example the variation coming from the species.
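To see what the score actually measures, \(R^2 = 1 - SS_{res}/SS_{tot}\) can be computed by hand and checked against `r2_score`. The sketch below uses synthetic stand-in data (invented bill lengths and masses, not the real penguins file):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the penguins data (values are made up).
rng = np.random.default_rng(0)
x_demo = rng.uniform(35, 55, size=100)              # bill lengths in mm
y_demo = 50 * x_demo + rng.normal(0, 400, size=100)  # body mass with noise

reg = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
pred = reg.predict(x_demo.reshape(-1, 1))

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(np.isclose(manual_r2, r2_score(y_demo, pred)))
```

In words: \(R^2\) is the fraction of the variance in body mass that the fitted line explains, which is why a value of 0.35 reads as "most of the variation is still unaccounted for".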
```python
from pins import board_folder
from vetiver import vetiver_pin_write, VetiverModel

model_board = board_folder(
    "data/model",
    allow_pickle_read=True
)
v = VetiverModel(model, model_name="mock_model", prototype_data=x.values.reshape(-1, 1))
vetiver_pin_write(model_board, v)
```

```
Model Cards provide a framework for transparent, responsible reporting.
Use the vetiver `.qmd` Quarto template as a place to start,
with vetiver.model_card()
('The hash of pin "mock_model" has not changed. Your pin will not be stored.',)
```
```python
from vetiver import VetiverAPI

app = VetiverAPI(v, check_prototype=True)
```

## Multiple regression
```python
dummies_species = pd.get_dummies(df["species"])
df_x = pd.concat([x, dummies_species], axis=1)
df_x.head()
```

|   | bill_length_mm | Adelie | Chinstrap | Gentoo |
|---|---|---|---|---|
| 0 | 39.1 | True | False | False |
| 1 | 39.5 | True | False | False |
| 2 | 40.3 | True | False | False |
| 3 | 36.7 | True | False | False |
| 4 | 39.3 | True | False | False |
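One caveat worth noting: because `LinearRegression` fits an intercept by default, keeping all three dummy columns makes the design matrix collinear (the "dummy variable trap"); scikit-learn still returns a least-squares solution, but the individual coefficients are no longer uniquely identified. `pd.get_dummies` can drop one level to avoid this. A small sketch with made-up species values:

```python
import pandas as pd

species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"], name="species")

# drop_first=True removes the first level (Adelie), which becomes the
# baseline absorbed by the intercept; the remaining dummies measure
# offsets relative to it.
dummies = pd.get_dummies(species, drop_first=True)
print(dummies.columns.tolist())  # ['Chinstrap', 'Gentoo']
```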
Now we can fit a multiple regression model:
```python
multiple_reg = LinearRegression()
multiple_reg.fit(df_x, y)
```

```python
print(df_x.columns)
print(multiple_reg.coef_)
```

```
Index(['bill_length_mm', 'Adelie', 'Chinstrap', 'Gentoo'], dtype='object')
[  90.29759774   93.41326282 -783.52837788  690.11511506]
```
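Each fitted value decomposes as the intercept plus the dot product of the coefficient vector with the feature row. The sketch below verifies this identity on a tiny made-up frame with the same column layout as `df_x` (all values are invented, not taken from the penguins data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny synthetic frame mimicking df_x's layout (values are made up).
demo_x = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.0],
    "Adelie":    [1, 0, 0, 1],
    "Chinstrap": [0, 1, 0, 0],
    "Gentoo":    [0, 0, 1, 0],
})
demo_y = np.array([3750, 3800, 5200, 3600])

reg = LinearRegression().fit(demo_x, demo_y)

# A prediction is intercept_ + sum(coef_i * feature_i) for the row.
row = demo_x.iloc[0]
manual = reg.intercept_ + np.dot(reg.coef_, row.values)
print(np.isclose(manual, reg.predict(demo_x.iloc[[0]])[0]))
```

Read this way, the species dummies act as additive offsets to the baseline mass predicted from bill length alone.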
```python
y_pred_multiple = multiple_reg.predict(df_x)
r2_score(y, y_pred_multiple)
```

```
0.7848476112531085
```
By including the variation coming from the different species, the \(R^2\) score more than doubles, from about 0.35 to 0.78.