2 Data Project Architecture
2.1 Key Takeaways
This chapter gives an opinionated overview of good design and conceptual layout practices for a data project. The areas of responsibility within the project are broken out into
- Presentation,
- Processing, and
- Data layers.
The categories that a given data project may fall into are further divided into
- jobs,
- apps,
- reports and
- APIs.
The rest of the chapter discusses how to break a project down into the previously mentioned layers, as well as considerations for optimizing the Processing and Data layers.
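To make the three layers concrete, here is a minimal sketch of my own (not taken from the chapter) that keeps each responsibility in its own function. The function names and the hard-coded body masses are hypothetical placeholders; in a real project the data layer would read from a file, database, or pin.

```python
from statistics import mean

# Data layer: the only place that knows where the data lives.
# The hard-coded list is a made-up stand-in for a real data source.
def load_body_masses():
    return [3750.0, 3800.0, 3250.0, 3450.0]

# Processing layer: pure logic, with no I/O and no formatting.
def summarize(masses):
    return {"n": len(masses), "mean_g": mean(masses)}

# Presentation layer: turns the result into something a reader or client consumes.
def render(summary):
    return f"{summary['n']} penguins, mean body mass {summary['mean_g']:.1f} g"

report = render(summarize(load_body_masses()))
print(report)
```

Because each layer only talks to the one next to it, the data source or the output format can be swapped without touching the processing logic.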
2.2 Lab / Project
2.2.1 Initial Setup
The last chapter, Environments as Code, introduced the example project that we will use throughout the book. You can either clone a starter template or fork the project repo from do4ds_project, or create the project from scratch yourself using the following Quarto CLI commands (taken from the Quarto documentation):
quarto create project website do4ds_project
# Choose (don't open) when prompted
quarto preview do4ds_project
If the quarto preview command loads a new website in your web browser, go back to the terminal and use Ctrl+C to terminate the preview server. Change to the project directory and set up a local Python virtual environment (you can grab the requirements.txt file from here, if needed):
cd do4ds_project
# If using python, create and activate a local virtual environment
python -m venv ./venv
source venv/bin/activate
venv/bin/python -m pip install -r requirements.txt
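As a quick sanity check, you can confirm from Python itself that the virtual environment is the one in use. This is a small sketch using only the standard library; the helper name in_virtual_env is my own.

```python
import sys

def in_virtual_env():
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtual_env())
print("interpreter:", sys.executable)
```

If the first line reports False, the environment was probably not activated in the current shell.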
Now that you are in the local project directory, you can use the quarto preview command without arguments to continue seeing updates to the local project in your browser:
quarto preview
# Alternatively, if you forked the project sample from GitHub, you can use npm...
npm run preview
If you did not fork the project sample, make sure to create the eda.qmd and model.qmd files from chapter 1 and add them to the sidebar section of _quarto.yml:
project:
  type: website

website:
  title: "do4ds_project"
  navbar:
    left:
      - href: index.qmd
        text: Home
  sidebar:
    style: "docked"
    search: true
    contents:
      - eda.qmd
      - model.qmd
2.2.2 Updates
To complete part 1 of the lab, I had to modify the example code. First, I added a line that generates a vetiver model and assigns it to v, and then I changed the path to the local folder where the model is stored:
from pins import board_folder
from vetiver import vetiver_pin_write
from vetiver import VetiverModel

# `model` is the fitted model from the chapter 1 lab
v = VetiverModel(model, model_name = "penguin_model")

# Store the pinned model in a local folder
model_board = board_folder(
    "data/model",
    allow_pickle_read = True
)
vetiver_pin_write(model_board, v)
In addition to these changes, I created a separate Python file, called api.py, with the code to run the vetiver API. This also required updates to the VetiverAPI call to ensure that the API server had the correct input parameters to process the prediction:
from palmerpenguins import penguins
from pandas import get_dummies
from sklearn.linear_model import LinearRegression
from pins import board_folder
from vetiver import VetiverModel
from vetiver import VetiverAPI

# This is how you would reload the model from disk...
b = board_folder('data/model', allow_pickle_read = True)
v = VetiverModel.from_pin(b, 'penguin_model')

# ... however VetiverAPI also uses the model inputs to define params from the prototype
df = penguins.load_penguins().dropna()
df.head(3)
X = get_dummies(df[['bill_length_mm', 'species', 'sex']], drop_first = True)
y = df['body_mass_g']

model = LinearRegression().fit(X, y)

v = VetiverModel(model, model_name = "penguin_model", prototype_data = X)

app = VetiverAPI(v, check_prototype = True)
app.run(port = 8000)
I then used python api.py to run the API. Once it is running, you can navigate to http://127.0.0.1:8000/docs in a web browser to see the autogenerated API documentation.
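With the server up, a prediction can also be requested from code. The sketch below uses only the standard library and rests on a few assumptions that you should check against the /docs page: that predictions are served at /predict, that the endpoint accepts a JSON list of records, and that the dummy-encoded column names match what get_dummies produced for X (species_Chinstrap, species_Gentoo, sex_male). The helper names build_payload and predict are my own.

```python
import json
from urllib import request

# Assumed endpoint: /predict on the port passed to app.run().
PREDICT_URL = "http://127.0.0.1:8000/predict"

def build_payload(bill_length_mm, species, sex):
    # Mirror get_dummies(..., drop_first=True): Adelie and female are the
    # dropped baseline levels, so they are encoded as all zeros.
    # Column names are an assumption; compare them with X.columns.
    row = {
        "bill_length_mm": bill_length_mm,
        "species_Chinstrap": int(species == "Chinstrap"),
        "species_Gentoo": int(species == "Gentoo"),
        "sex_male": int(sex == "male"),
    }
    return json.dumps([row]).encode("utf-8")

def predict(payload, url=PREDICT_URL):
    # POST the JSON records and return the decoded response body.
    req = request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling predict(build_payload(45.0, "Gentoo", "male")) while api.py is running should return the model's predicted body mass for that penguin. If the server rejects the request, the schema shown at /docs is the place to compare field names and types.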