Weekly Review June 12 - 17, 2022: MLOps learning and diagrams

Focusing this week on building small ML deployments and making Figma diagrams for different MLOps architectures.

MLOps

Courses and books

I'm looking at doing the new Machine Learning Engineering for Production Specialization from Andrew Ng on Coursera.

I did the classic ML course (CS229) years ago, and I've always liked how clear Andrew Ng's explanations are.

Production data and deployments raise lots of interesting issues: data validation, feature skew, data drift, concept drift, and so on.
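As a rough illustration of the data drift item in that list, here is a minimal sketch of a Population Stability Index (PSI) check in plain Python. The function name `psi`, the bin count, and the sample data are all assumptions made for the example. It only compares one numeric input feature between a reference sample and a live sample, and it says nothing about concept drift, which generally needs labels or outcomes to detect.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if all values are equal.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: a training-time feature and the same feature shifted in production.
reference = [i / 100 for i in range(1000)]
live = [v + 3 for v in reference]
print(psi(reference, live))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and is worth tuning.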

The other learning option is the newly released Designing Machine Learning Systems by Chip Huyen:

1. Overview of Machine Learning Systems
  1. When to Use Machine Learning
    1. Machine Learning Use Cases
  2. Understanding Machine Learning Systems
    1. Machine Learning in Research Versus in Production
    2. Machine Learning Systems Versus Traditional Software
2. Introduction to Machine Learning Systems Design
  1. Business and ML Objectives
  2. Requirements for ML Systems
    1. Reliability
    2. Scalability
    3. Maintainability
    4. Adaptability
  3. Iterative Process
  4. Framing ML Problems
    1. Types of ML Tasks
    2. Objective Functions
  5. Mind Versus Data
3. Data Engineering Fundamentals
  1. Data Sources
  2. Data Formats
    1. JSON
    2. Row-Major Versus Column-Major Format
    3. Text Versus Binary Format
  3. Data Models
    1. Relational Model
    2. NoSQL
    3. Structured Versus Unstructured Data
  4. Data Storage Engines and Processing
    1. Transactional and Analytical Processing
    2. ETL: Extract, Transform, and Load
  5. Modes of Dataflow
    1. Data Passing Through Databases
    2. Data Passing Through Services
    3. Data Passing Through Real-Time Transport
  6. Batch Processing Versus Stream Processing
4. Training Data
  1. Sampling
    1. Nonprobability Sampling
    2. Simple Random Sampling
    3. Stratified Sampling
    4. Weighted Sampling
    5. Reservoir Sampling
    6. Importance Sampling
  2. Labeling
    1. Hand Labels
    2. Natural Labels
    3. Handling the Lack of Labels
  3. Class Imbalance
    1. Challenges of Class Imbalance
    2. Handling Class Imbalance
  4. Data Augmentation
    1. Simple Label-Preserving Transformations
    2. Perturbation
    3. Data Synthesis
5. Feature Engineering
  1. Learned Features Versus Engineered Features
  2. Common Feature Engineering Operations
    1. Handling Missing Values
    2. Scaling
    3. Discretization
    4. Encoding Categorical Features
    5. Feature Crossing
    6. Discrete and Continuous Positional Embeddings
  3. Data Leakage
    1. Common Causes for Data Leakage
    2. Detecting Data Leakage
  4. Engineering Good Features
    1. Feature Importance
    2. Feature Generalization
6. Model Development and Offline Evaluation
  1. Model Development and Training
    1. Evaluating ML Models
    2. Ensembles
    3. Experiment Tracking and Versioning
    4. Distributed Training
    5. AutoML
  2. Model Offline Evaluation
    1. Baselines
    2. Evaluation Methods
7. Model Deployment and Prediction Service
  1. Machine Learning Deployment Myths
    1. Myth 1: You Only Deploy One or Two ML Models at a Time
    2. Myth 2: If We Don’t Do Anything, Model Performance Remains the Same
    3. Myth 3: You Won’t Need to Update Your Models as Much
    4. Myth 4: Most ML Engineers Don’t Need to Worry About Scale
  2. Batch Prediction Versus Online Prediction
    1. From Batch Prediction to Online Prediction
    2. Unifying Batch Pipeline and Streaming Pipeline
  3. Model Compression
    1. Low-Rank Factorization
    2. Knowledge Distillation
    3. Pruning
    4. Quantization
  4. ML on the Cloud and on the Edge
    1. Compiling and Optimizing Models for Edge Devices
    2. ML in Browsers
8. Data Distribution Shifts and Monitoring
  1. Causes of ML System Failures
    1. Software System Failures
    2. ML-Specific Failures
  2. Data Distribution Shifts
    1. Types of Data Distribution Shifts
    2. General Data Distribution Shifts
    3. Detecting Data Distribution Shifts
    4. Addressing Data Distribution Shifts
  3. Monitoring and Observability
    1. ML-Specific Metrics
    2. Monitoring Toolbox
    3. Observability
9. Continual Learning and Test in Production
  1. Continual Learning
    1. Stateless Retraining Versus Stateful Training
    2. Why Continual Learning?
    3. Continual Learning Challenges
    4. Four Stages of Continual Learning
    5. How Often to Update Your Models
  2. Test in Production
    1. Shadow Deployment
    2. A/B Testing
    3. Canary Release
    4. Interleaving Experiments
    5. Bandits
10. Infrastructure and Tooling for MLOps
  1. Storage and Compute
    1. Public Cloud Versus Private Data Centers
  2. Development Environment
    1. Dev Environment Setup
    2. Standardizing Dev Environments
    3. From Dev to Prod: Containers
  3. Resource Management
    1. Cron, Schedulers, and Orchestrators
    2. Data Science Workflow Management
  4. ML Platform
    1. Model Deployment
    2. Model Store
    3. Feature Store
  5. Build Versus Buy
11. The Human Side of Machine Learning
  1. User Experience
    1. Ensuring User Experience Consistency
    2. Combatting “Mostly Correct” Predictions
    3. Smooth Failing
  2. Team Structure
    1. Cross-functional Teams Collaboration
    2. End-to-End Data Scientists
  3. Responsible AI
    1. Irresponsible AI: Case Studies
    2. A Framework for Responsible AI

Currently working through the Made With ML MLOps course. It's fairly compact and covers a lot of production processes, concepts, and libraries.

Python dependency management is ... a bit messy. The course project could not be installed with Python 3.10, which I was using. 3.7.10 is recommended by Goku Mohandas, but ... my new M1 Mac cannot build numpy with pyenv and Python 3.7.10!

Python 3.7.13 fixed the problem. I filed an issue and Goku has updated the course now.

According to a comment on GitHub, these versions of Python work fine:

"3.7.13, 3.8.13, 3.9.11 and 3.10.3 all work fine on my Intel mac. The GCC solution was not a viable one for me as it caused issues with pip modules that were built with clang." (github)

Python ML tools

Used Optuna for hyperparameter optimization.

Used cleanlab, which finds labelling errors in real-world datasets. It can discard suspect noisy labels during training.
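The idea of flagging suspect labels can be sketched without the library. This is a simplified take on the per-class confidence thresholds used in confident learning (the approach cleanlab builds on), not cleanlab's API. The function name `flag_suspect_labels` and the toy data are made up for illustration, and the sketch assumes you already have out-of-sample predicted probabilities for each example.

```python
from typing import List, Sequence


def flag_suspect_labels(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]]
) -> List[int]:
    """Return indices whose given label disagrees with a confident prediction."""
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k
    # over the examples that are labelled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                suspects.append(i)
    return suspects


# Toy data (hypothetical): example 2 is labelled 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
print(flag_suspect_labels(labels, pred_probs))
```

The flagged indices are candidates for review or for exclusion from training; the real library adds ranking and pruning strategies on top of this basic idea.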

Set up a [[Weights and Biases]] account and set up tracking. The Made With ML course uses (local) MLflow. I wanted to dive into Weights and Biases, which is much more advanced than MLflow. The basics were easy, though there's lots more to learn.

Used Snorkel to define slicing functions, select the matching results, and calculate metrics. Slicing functions pick out segments of your data so you can get metrics for each slice. Useful for seeing how your model is performing on specific problem areas.
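A minimal sketch of the slicing idea in plain Python follows. It does not use Snorkel's API. The slice predicates, the `accuracy_by_slice` helper, and the example records are all hypothetical, and accuracy stands in for whatever metric you actually track.

```python
from typing import Callable, Dict, List

Example = Dict[str, object]


def short_text(example: Example) -> bool:
    """Slice: texts with fewer than five words."""
    return len(str(example["text"]).split()) < 5


def mentions_python(example: Example) -> bool:
    """Slice: texts that mention Python."""
    return "python" in str(example["text"]).lower()


def accuracy_by_slice(
    examples: List[Example], slice_fns: Dict[str, Callable[[Example], bool]]
) -> Dict[str, float]:
    """Compute accuracy separately for every slice that has at least one member."""
    report = {}
    for name, fn in slice_fns.items():
        members = [ex for ex in examples if fn(ex)]
        if members:
            correct = sum(ex["label"] == ex["prediction"] for ex in members)
            report[name] = correct / len(members)
    return report


examples = [
    {"text": "python cli tips", "label": 1, "prediction": 1},
    {"text": "a long post about deploying models on kubernetes", "label": 0, "prediction": 0},
    {"text": "Python packaging on the new M1 is messy", "label": 1, "prediction": 0},
    {"text": "short note", "label": 0, "prediction": 0},
]
slices = {
    "overall": lambda ex: True,
    "short_text": short_text,
    "mentions_python": mentions_python,
}
print(accuracy_by_slice(examples, slices))
```

Comparing each slice against the overall number is the point: a model can look fine in aggregate while doing badly on one slice.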

The Snorkel Team is now focusing their efforts on Snorkel Flow, an end-to-end AI application development platform based on the core ideas behind Snorkel.

Everybody is building an end-to-end platform now! ML in production has been evolving rapidly, and usage patterns are now settling down enough for full platforms to make sense. We are at that stage.

Studying Seldon for deploying ML in Kubernetes with monitoring and metrics. I haven't installed it locally yet.

Diagrams

My Figma skills are slowly improving. I need to spend more time on these.

I made a diagram of a Kubeflow pipeline, tracked with MLflow and deployed with Seldon.

(Image: training pipeline.png)

It still looks complex and doesn't communicate the work process yet. That is what the diagrams are for: increasing understanding or identifying confusion.

Links of the week

GHGSat (Montreal): GHG emissions tracking via satellite

ghgsat.com/en

If I'm gone

A template for "A cheat sheet for if I am somehow incapacitated." github.com/christophercalm/if-im-gone/blob/..

Typer

typer.tiangolo.com: a [[python]] [[cli]] generator using decorators and Python type hints.

Though of course Click is very nice.
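To show the general idea of driving a CLI from a decorator plus type hints, here is a toy sketch built on the standard library's `argparse`. It is not Typer's API or implementation (Typer itself is built on top of Click). The `command` decorator, the attached `cli` attribute, and the `greet` function are all invented for this example.

```python
import argparse
import inspect
from typing import Callable, List, Optional


def command(func: Callable) -> Callable:
    """Attach a `cli(argv)` runner whose parser is derived from func's signature."""
    parser = argparse.ArgumentParser(prog=func.__name__)
    for name, param in inspect.signature(func).parameters.items():
        # Use the type hint as the converter; fall back to str when there is none.
        arg_type = str if param.annotation is inspect.Parameter.empty else param.annotation
        if param.default is inspect.Parameter.empty:
            # No default: required positional argument.
            parser.add_argument(name, type=arg_type)
        else:
            # Has a default: optional --flag.
            parser.add_argument(f"--{name}", type=arg_type, default=param.default)

    def cli(argv: Optional[List[str]] = None):
        args = parser.parse_args(argv)
        return func(**vars(args))

    func.cli = cli
    return func


@command
def greet(name: str, times: int = 1) -> str:
    """Build a greeting, repeated `times` times."""
    return " ".join([f"Hello {name}!"] * times)


# The function still works as a normal Python call...
print(greet("Ada"))
# ...and can also be driven by a list of command-line style arguments.
print(greet.cli(["Ada", "--times", "2"]))
```

The sketch only handles simple types such as `str` and `int`; the real libraries cover help text, subcommands, booleans, and much more.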

Chris Sattinger