Apache Beam, Google Dataflow, Flink, AWS Kinesis and Batch - an overview

I did a screening call with a recruiter for a startup that is using IoT video processing to detect leaks in chemical plants and reduce emissions. Like many startups, they have customers and installations but now it's time to scale up the software architecture and the team.

This got me thinking about how I would design a system to do streaming video analysis, ML on windowed IoT metrics, an alert system, administration etc.

I wrote a first design and will do another round on that soon.

I'm trying to increase my writing-to-reading ratio. I read and ingest a lot and take notes, but I don't produce enough publishable content. Editing and publishing is the missing step.

Analyzing streaming video frames led me to dig into some of the current stream processors. This is just a quick outline, not a detailed comparison.

Quick take:

  • GCP Dataflow: serverless streaming and batch processing for data small or BIG
  • Apache Beam: the open-source programming model and SDK that Dataflow runs
  • AWS Batch: AWS's managed batch-job service, the closest AWS analog to the batch side of Dataflow
  • Flink: very fast batch or streaming, runs on Kubernetes. The dominant open-source stream processor, used at huge scale.
  • AWS Kinesis: streaming ingest hooked up to the rest of AWS; its analytics service runs managed Flink.

Apache Beam is a pipeline orchestrator for working with batch or streaming data.

You write Pipelines, which are DAGs of Transforms. The same pipeline can run on a small input file, on terabytes, or on a streaming firehose of data (there's a minimal sketch after the list below).

  • Python or Java
  • Runs:
    • locally (great for developing and running integration tests on your pipelines)
    • on Google Dataflow which is a managed Beam runner
    • in your Flink cluster (should you have one lying around)
  • Great for:
    • ETL
      • Extract from and Load back out to:
        • local files
        • object storage buckets (S3 / GCS)
        • PubSub
        • Kafka
        • Snowflake
    • Transforms
      • Feature engineering
      • Windowing functions, aggregation
    • Run PyTorch, TensorFlow, or scikit-learn inference inside your pipelines!
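
Here's a minimal sketch of a windowed streaming pipeline; the Pub/Sub topic names are purely illustrative assumptions. It reads messages, counts them per key in fixed 60-second windows, and writes the counts back out:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Topic names are hypothetical; any Beam source/sink would work here.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/events")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second fixed windows
        | "Pair" >> beam.Map(lambda key: (key, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda k, n: f"{k}: {n}".encode("utf-8"))
        | "Write" >> beam.io.WriteToPubSub(topic="projects/my-proj/topics/counts")
    )
```

Run it locally with the DirectRunner for development and tests; the same code runs unchanged on Dataflow or Flink.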

Google Dataflow is a managed Apache Beam runner; a runner-options sketch follows the list below.

  • Serverless: scales from 0 to TB
  • Good choice for occasional processing, side projects, and startups; no need to run a cluster continuously.
  • Much more affordable than managing your own Flink cluster if your jobs only run at certain times of day (which is most companies).
  • Metrics and observability tools built in
  • Direct input and output connections to Google services
  • Can easily run PyTorch, scikit-learn, and TensorFlow inference on streaming or batch data
  • TFX (TensorFlow Extended) is built on top of this.
  • The name "Dataflow" is awfully generic.
  • For anyone running on GCP this is a useful tool that could replace some jobs running on Airflow or Argo Workflows at much lower operational cost and complexity.
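
Targeting Dataflow is mostly a matter of pipeline options; the pipeline code above stays the same. Project, region, and bucket here are placeholder values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket; swap in your own GCP resources.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
)
```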

AWS Batch is AWS's closest counterpart to the batch side of Dataflow: it runs containerized batch jobs, but has no streaming mode. My initial impression is that it's a bit more "some assembly required", as the sketch below suggests.
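
A minimal sketch of that assembly with boto3: you create the compute environment, job queue, and job definition yourself, then submit jobs against them. All names here are hypothetical.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Assumes a compute environment, job queue, and job definition were
# already created (the "assembly"); these names are hypothetical.
response = batch.submit_job(
    jobName="nightly-etl",
    jobQueue="default-queue",
    jobDefinition="etl-container:1",
)
print(response["jobId"])
```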

Flink is a framework and distributed processing engine for batch and streaming data (a tiny PyFlink example follows the list below).

  • More popular than Beam
  • Runs on Kubernetes
  • It's powering a lot of bigger startups right now
  • Trillions of events per day
  • Can maintain terabytes of application state (in memory, spilling to local disk)!
  • Python, Java, Scala
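
For flavor, here's a tiny PyFlink job that runs locally; the in-memory list is just a stand-in for a real event stream:

```python
from pyflink.datastream import StreamExecutionEnvironment

# Minimal local PyFlink job; a list stands in for a real event stream.
env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([1, 2, 3, 4, 5]) \
    .map(lambda x: x * x) \
    .print()
env.execute("square_numbers")
```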

AWS Kinesis

  • Runs managed Flink (Kinesis Data Analytics) and connects all your data in and out; a short producer sketch follows this list.
  • Can run Apache Beam Pipelines
  • Uses:
    • Web clickstream
    • IoT
    • Cameras can stream video to Kinesis Video Streams, then do facial Rekognition.
    • Log processing
    • Connected to the rest of AWS offerings (Redshift etc)
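
A minimal producer sketch with boto3; the stream name and payload are hypothetical, and PartitionKey determines which shard a record lands on:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical stream and payload; PartitionKey controls shard routing.
kinesis.put_record(
    StreamName="iot-metrics",
    Data=json.dumps({"sensor": "valve-7", "ppm": 412}).encode("utf-8"),
    PartitionKey="valve-7",
)
```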