Apache Beam, Google Dataflow, Flink, AWS Kinesis and Batch - an overview

I did a screening call with a recruiter for a startup that is using IoT video processing to detect leaks in chemical plants and reduce emissions. Like many startups, they have customers and installations but now it's time to scale up the software architecture and the team.

This got me thinking about how I would design a system to do streaming video analysis, ML on windowed IoT metrics, an alert system, administration etc.

I wrote a first design and will do another round on that soon.

I'm trying to increase my writing-to-reading ratio. I read and ingest a lot and take notes, but I don't produce enough publishable content. Editing and publishing is the missing step.

Analyzing streaming video frames led me to dig into some of the current stream processors. This is just a quick outline, not a detailed comparison.

Quick take:

  • GCP Dataflow: serverless streaming and batch processing for data small or BIG
  • Apache Beam: the open-source programming model and SDK that Dataflow runs
  • AWS Batch: AWS's managed batch-job service, the closest AWS analog to the batch side of Dataflow
  • Flink: very fast batch or streaming, runs on Kubernetes. The dominant open-source stream processor, used at huge scale.
  • AWS Kinesis: streaming ingest hooked up to the rest of AWS; its analytics service runs managed Flink.

Apache Beam is a pipeline orchestrator for working with batch or streaming data.

You write Pipelines, which are DAGs of Transforms. The same pipeline can run on a small input file, on terabytes, or on a streaming firehose of data (there's a minimal sketch after the list below).

  • Python or Java
  • Runs:
    • locally (great for developing and running integration tests on your pipelines)
    • on Google Dataflow which is a managed Beam runner
    • in your Flink cluster (should you have one lying around)
  • Great for:
    • ETL
      • Extract from and Load back out to:
        • local files
        • object storage buckets (S3 / GCS)
        • PubSub
        • Kafka
        • Snowflake
    • Transforms
      • Feature engineering
      • Windowing functions, aggregation
    • Run PyTorch, TensorFlow, or scikit-learn inference inside your pipelines!
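
Here's a minimal sketch of a windowed streaming pipeline; the Pub/Sub topic names are purely illustrative assumptions. It reads messages, counts them per key in fixed 60-second windows, and writes the counts back out:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Topic names are hypothetical; any Beam source/sink would work here.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/events")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second fixed windows
        | "Pair" >> beam.Map(lambda key: (key, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda k, n: f"{k}: {n}".encode("utf-8"))
        | "Write" >> beam.io.WriteToPubSub(topic="projects/my-proj/topics/counts")
    )
```

Run it locally with the DirectRunner for development and tests; the same code runs unchanged on Dataflow or Flink.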

Google Dataflow is a managed Apache Beam runner; a runner-options sketch follows the list below.

  • Serverless: scales from 0 to TB
  • Good choice for occasional processing, side projects, and startups; no need to run a cluster continuously.
  • Much more affordable than managing your own Flink cluster if your jobs only run at certain times of day (which is most companies).
  • Metrics and observability tools built in
  • Direct input and output connections to Google services
  • Can easily run PyTorch, scikit-learn, and TensorFlow inference on streaming or batch data
  • TFX (TensorFlow Extended) is built on top of this.
  • The name "Dataflow" is awfully generic.
  • For anyone running on GCP this is a useful tool that could replace some jobs running on Airflow or Argo Workflows at much lower operational cost and complexity.
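
Targeting Dataflow is mostly a matter of pipeline options; the pipeline code above stays the same. Project, region, and bucket here are placeholder values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket; swap in your own GCP resources.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
)
```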

AWS Batch is AWS's closest counterpart to the batch side of Dataflow: it runs containerized batch jobs, but has no streaming mode. My initial impression is that it's a bit more "some assembly required", as the sketch below suggests.
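
A minimal sketch of that assembly with boto3: you create the compute environment, job queue, and job definition yourself, then submit jobs against them. All names here are hypothetical.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Assumes a compute environment, job queue, and job definition were
# already created (the "assembly"); these names are hypothetical.
response = batch.submit_job(
    jobName="nightly-etl",
    jobQueue="default-queue",
    jobDefinition="etl-container:1",
)
print(response["jobId"])
```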

Flink is a framework and distributed processing engine for batch and streaming data (a tiny PyFlink example follows the list below).

  • More popular than Beam
  • Runs on Kubernetes
  • It's powering a lot of bigger startups right now
  • Trillions of events per day
  • Can maintain terabytes of application state (in memory, spilling to local disk)!
  • Python, Java, Scala
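
For flavor, here's a tiny PyFlink job that runs locally; the in-memory list is just a stand-in for a real event stream:

```python
from pyflink.datastream import StreamExecutionEnvironment

# Minimal local PyFlink job; a list stands in for a real event stream.
env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([1, 2, 3, 4, 5]) \
    .map(lambda x: x * x) \
    .print()
env.execute("square_numbers")
```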

AWS Kinesis

  • Runs managed Flink (Kinesis Data Analytics) and connects all your data in and out; a short producer sketch follows this list.
  • Can run Apache Beam Pipelines
  • Uses:
    • Web clickstream
    • IoT
    • Cameras can stream video to Kinesis Video Streams, then do facial Rekognition.
    • Log processing
    • Connected to the rest of AWS offerings (Redshift etc)
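
A minimal producer sketch with boto3; the stream name and payload are hypothetical, and PartitionKey determines which shard a record lands on:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical stream and payload; PartitionKey controls shard routing.
kinesis.put_record(
    StreamName="iot-metrics",
    Data=json.dumps({"sensor": "valve-7", "ppm": 412}).encode("utf-8"),
    PartitionKey="valve-7",
)
```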