June 15 - Aug 7 2022 : Vacation, ML Systems, Labelling

I was on vacation the last two weeks of July. Bicycling down the Loire and up to Bretagne.

On the train I bought Designing Machine Learning Systems. (I have to say I like being able to buy a book on my phone and start reading it right away.)

I finished the book over the course of 2 weeks. Many of my questions about production practices, data drift, and data engineering were answered. I have a good framework for approaching ML problems now.

Lots of useful surveys of industry practices. A few surprises:

  • Most companies don't do online (continual) learning in the form of: get new data, update the live model immediately. This wouldn't scale for millions of users and many regions. Instead they just rebuild their models once a minute/hour/day/week with the latest data, run validation checks during CI/CD, and deploy if the checks pass.
  • The new sexiness, feature stores, has only been adopted by 40% of companies, and half of those built their own.
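The "rebuild, validate, deploy if it passes" loop from the first bullet can be sketched as a simple gate. This is only an illustration, not something taken from the book: the function name `should_deploy`, the `accuracy` metric key, and both thresholds are assumptions made up for the example.

```python
from typing import Mapping


def should_deploy(
    candidate: Mapping[str, float],
    production: Mapping[str, float],
    min_accuracy: float = 0.80,    # assumed absolute floor for any release
    max_regression: float = 0.01,  # assumed tolerated drop vs. the live model
) -> bool:
    """Return True if a freshly retrained model passes the validation gate."""
    cand_acc = candidate["accuracy"]
    prod_acc = production["accuracy"]
    # Reject models that are simply not good enough.
    if cand_acc < min_accuracy:
        return False
    # Reject models that are clearly worse than what is already deployed.
    return cand_acc >= prod_acc - max_regression


if __name__ == "__main__":
    live = {"accuracy": 0.85}
    print(should_deploy({"accuracy": 0.86}, live))
    print(should_deploy({"accuracy": 0.82}, live))
```

In a CI/CD job, a check like this would run on a held-out set after each scheduled rebuild, and the deploy step would only run when it returns `True`.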

My first week back was only a 3-day week. There's another heatwave in Berlin, my partner is working, and my son is at home (he starts school soon).

Each day I start work at 6 or 7 when it's cool and quiet.


I'm working with a dataset of scraped web data to classify content into a few buckets. I will also want to classify the domains, identify the company or organization behind each one, and classify those.

The content to classify will be updating constantly, so I will need to detect data drift and this classifier will need to be periodically retrained.
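One simple way to watch for drift in text data is to compare the token distribution of a reference sample (for example, the training set) with a recent sample. The sketch below uses Jensen-Shannon divergence with whitespace tokenization. The function names and the `threshold` value are assumptions for illustration, and a real setup would likely use a better tokenizer or compare embeddings instead.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def token_distribution(docs: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each lowercased, whitespace-split token."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    div = 0.0
    for tok in set(p) | set(q):
        pi, qi = p.get(tok, 0.0), q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


def has_drifted(reference: Iterable[str], current: Iterable[str],
                threshold: float = 0.1) -> bool:
    """Flag drift when divergence exceeds an (assumed) threshold."""
    return js_divergence(token_distribution(reference),
                         token_distribution(current)) > threshold


if __name__ == "__main__":
    old = ["cheap flights to berlin", "berlin hotel deals"]
    new = ["crypto wallet review", "best crypto exchange"]
    print(has_drifted(old, old), has_drifted(old, new))
```

The threshold would need tuning on real data; a flagged window is a signal to inspect the content and possibly retrain, not proof that the classifier has degraded.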

Initially I just need one labelled set to train a baseline model with.

I'm using Label Studio to browse the content and do some manual labelling to get some insights. I quickly realized which classes make sense, which are more complicated, and which ones conflict with each other.

I worked with Snorkel to write heuristic labelling functions (simple rules, NLP-powered functions, Naive Bayes etc.). Their outputs are blended to produce noise-aware training labels: probabilistic, confidence-weighted labels that can then be used in training.
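To make the idea concrete, here is a dependency-free sketch of heuristic labelling functions whose votes are blended into a probabilistic label. It deliberately does not use Snorkel's API: the blending is a plain unweighted vote, whereas Snorkel's label model estimates how accurate each function is and weights it accordingly. The class names and the three rules are hypothetical.

```python
from typing import Callable, Dict, List, Optional

ABSTAIN = None                          # a function may decline to vote
LABELS = ["news", "product", "other"]   # hypothetical buckets


def lf_contains_price(text: str) -> Optional[str]:
    """Vote 'product' when the text mentions a price."""
    return "product" if "$" in text or "price" in text.lower() else ABSTAIN


def lf_news_words(text: str) -> Optional[str]:
    """Vote 'news' on typical reporting phrases."""
    lowered = text.lower()
    hits = ("reported", "announced", "according to")
    return "news" if any(h in lowered for h in hits) else ABSTAIN


def lf_short_text(text: str) -> Optional[str]:
    """Vote 'other' for very short snippets."""
    return "other" if len(text.split()) < 4 else ABSTAIN


LFS: List[Callable[[str], Optional[str]]] = [
    lf_contains_price, lf_news_words, lf_short_text,
]


def soft_label(text: str) -> Optional[Dict[str, float]]:
    """Blend the votes into a probability per class (None if all abstain)."""
    votes = [v for v in (lf(text) for lf in LFS) if v is not ABSTAIN]
    if not votes:
        return None
    return {label: votes.count(label) / len(votes) for label in LABELS}


if __name__ == "__main__":
    print(soft_label("The company announced a new price of $20 today"))
    print(soft_label("hello"))
```

When two functions disagree, the example gets a split label rather than a forced hard label, which is the "noise-aware" part.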

Next step: active learning, integrating Snorkel with Label Studio.

Label Studio is probably the best free, open-source solution. However, to do active learning you either have to upgrade to Enterprise ("Talk to Sales...") or figure out how best to insert a human in the loop yourself.

Two ways I am considering:

  1. raw -> label with Snorkel functions -> Label Studio to manually label the rest
  2. Label Studio can fetch predictions from a Snorkel-powered service.
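A rough sketch of the first option: keep the items the heuristics are confident about, and queue the rest for manual labelling, most uncertain first. The uncertain items are wrapped as Label Studio tasks with a pre-annotation, following Label Studio's task import format as I understand it. The `from_name` and `to_name` values must match the labelling config and are assumptions here, as are the threshold and the `model_version` string.

```python
import json
import math
from typing import Dict, List, Tuple

Item = Tuple[str, Dict[str, float]]  # (text, class probabilities)


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy in bits; higher means less certain."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def split_by_confidence(items: List[Item], threshold: float = 0.9):
    """Return (confident, uncertain); uncertain is sorted most-uncertain first."""
    confident: List[Item] = []
    uncertain: List[Item] = []
    for text, probs in items:
        bucket = confident if max(probs.values()) >= threshold else uncertain
        bucket.append((text, probs))
    uncertain.sort(key=lambda item: entropy(item[1]), reverse=True)
    return confident, uncertain


def to_label_studio_task(text: str, probs: Dict[str, float],
                         from_name: str = "category",  # assumed config name
                         to_name: str = "text") -> dict:  # assumed config name
    """Wrap one item as a task carrying the heuristic label as a prediction."""
    best = max(probs, key=probs.get)
    return {
        "data": {"text": text},
        "predictions": [{
            "model_version": "heuristics-v0",  # hypothetical version tag
            "score": probs[best],
            "result": [{
                "from_name": from_name,
                "to_name": to_name,
                "type": "choices",
                "value": {"choices": [best]},
            }],
        }],
    }


if __name__ == "__main__":
    items = [
        ("a", {"news": 0.95, "product": 0.05}),
        ("b", {"news": 0.5, "product": 0.5}),
        ("c", {"news": 0.7, "product": 0.3}),
    ]
    _, todo = split_by_confidence(items)
    print(json.dumps([to_label_studio_task(t, p) for t, p in todo], indent=2))
```

The resulting JSON list could be imported into a Label Studio project, so the annotator sees the heuristic guess pre-filled and only has to confirm or correct it.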

Prodigy, from the makers of spaCy, is excellent at active learning. It's $390/490 to buy, which is totally reasonable if you are otherwise going to spend $390 worth of time trying to set up something similar.

Looked at briefly:

  • Diffgram
  • doccano
  • brat

Paid:

  • Labelbox
  • SuperAnnotate
  • V7 Labs (Darwin)
  • BasicAI
  • SuperbAI
  • Kili-Technology
  • Cord
  • HastyAI
  • Dataloop
  • Keymakr
  • Scale Nucleus

Signed up for Kumu, a beautiful community site for systems mapping, network graphs, concept mapping etc.
