---
title: Storm
updated: 2022-05-24 19:15:01Z
created: 2021-05-04 14:58:11Z
---

# Why use Apache Storm?

Apache Storm is a free and open source distributed real-time computation system. Apache Storm makes it easy to reliably process **unbounded streams of data**, doing for real-time processing what Hadoop did for batch processing. Apache Storm is simple and can be used with any programming language.

Apache Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, re-partitioning the streams between each stage of the computation however needed. Read more in the tutorial.

- Real-time continuous streaming data on clusters
- Can run on top of YARN
- Works on individual events (**NOT** micro-batches like Spark)
  - Storm is a better solution than Spark Streaming when events must be processed one at a time
- Storm is well suited to sub-second latency (fast)
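
The latency difference between per-event and micro-batch processing can be sketched with a small model. This is plain Python, not Storm or Spark code; the function names, the arrival times, and the 1-second batch interval are assumptions made for illustration, and processing time itself is ignored.

```python
import math

def per_event_wait(arrivals):
    """Per-event model: each event is handed to processing as soon as it arrives."""
    return [0.0 for _ in arrivals]

def micro_batch_wait(arrivals, batch_interval):
    """Micro-batch model: each event waits until the end of the interval it arrived in."""
    waits = []
    for t in arrivals:
        batch_end = (math.floor(t / batch_interval) + 1) * batch_interval
        waits.append(batch_end - t)
    return waits

arrivals = [0.25, 0.5, 1.75]  # hypothetical arrival times in seconds
print(per_event_wait(arrivals))                         # [0.0, 0.0, 0.0]
print(micro_batch_wait(arrivals, batch_interval=1.0))   # [0.75, 0.5, 0.25]
```

In this simplified model the micro-batch wait is bounded by the batch interval, which is why per-event processing is the more natural fit when sub-second latency matters.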

## Storm Topology

<img src="../images/StormSpoutBolt.png" width="200">

- Streams consist of ___tuples___ that flow through the topology
- Spouts are ___sources___ of stream data (from Kafka, Twitter, etc.)
- ___Bolts___ process stream data as it is received
  - transform, aggregate, write to database / HDFS
  - So there is no final state: the data stream goes on and on forever
- A Storm topology is a graph of spouts and bolts that process the stream
  - can get complex (in Spark you get the DAG for free)
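
The spout and bolt roles above can be modelled with Python generators. This is only a sketch of the data flow, not Storm's API: `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical names, and a real topology would run these components in parallel across a cluster.

```python
from collections import Counter
from itertools import islice

def sentence_spout():
    """Spout: an endless source of tuples (here, one-field tuples holding a sentence)."""
    sentences = ["the cow jumped", "the moon rose"]
    i = 0
    while True:
        yield (sentences[i % len(sentences)],)
        i += 1

def split_bolt(stream):
    """Bolt: transforms each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after every input tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

# Wire the graph: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))

# The stream never ends, so only the first few emitted tuples are inspected.
print(list(islice(topology, 6)))
```

Because the spout never stops, the count bolt never reaches a final result; it only emits updated running counts, which matches the "no final state" point above.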

## Storm Architecture

<img src="../images/StormArchitecture.png" width="300">

- Nimbus is a single point of failure
  - acts as the job tracker
  - can restart quickly without losing any data
  - HA is available through a backup Nimbus server
- ZooKeeper (is itself HA)
- Supervisors run the worker processes that do the actual work

## Developing Storm applications

- usually in Java
  - Bolts may be directed through scripts in other languages
  - This is a selling point of Storm, but in practice most code is written in Java
- Storm Core
  - lower-level API for Storm
  - "At-least-once" semantics (possibility of duplicated data)
- Trident
  - High-level API for Storm (preferred)
  - "Exactly once" semantics
  - Once submitted, a topology runs forever, until explicitly stopped
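
The difference between the two guarantees can be sketched in plain Python. This is not the Storm Core or Trident API; it assumes batches arrive in order and that a replayed batch carries the same transaction id as the original delivery, which is one common way to make state updates effectively exactly-once. All names and data are hypothetical.

```python
def replayed_batches():
    """Batches as (transaction_id, words); batch 2 is delivered twice, as after a failure."""
    return [
        (1, ["a", "b"]),
        (2, ["a"]),
        (2, ["a"]),  # replay of the same batch
        (3, ["b"]),
    ]

def count_at_least_once(batches):
    """Apply every delivery, so a replayed batch is counted again."""
    counts = {}
    for _txid, words in batches:
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return counts

def count_exactly_once(batches):
    """Remember the last applied transaction id and skip a delivery that repeats it."""
    counts = {}
    last_txid = None
    for txid, words in batches:
        if txid == last_txid:
            continue  # already applied, ignore the replay
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        last_txid = txid
    return counts

print(count_at_least_once(replayed_batches()))  # {'a': 3, 'b': 2}
print(count_exactly_once(replayed_batches()))   # {'a': 2, 'b': 2}
```

With at-least-once delivery no data is lost, but the replay inflates the count for `a`; tracking the transaction id alongside the state removes the duplicate.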

## Storm vs Spark Streaming

Storm
- tumbling window
  - i.e. all events in the past 5 seconds exactly; no overlap of events
- sliding window
  - windows can overlap by design
- in practice, Java only
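
The two window types can be illustrated with a short plain-Python sketch. It does not use Storm's windowing API; the function names, the integer timestamps, and the 5-second size with a 2-second slide are assumptions chosen for the example.

```python
def tumbling_windows(events, size):
    """Group (timestamp, value) events into back-to-back windows of `size` seconds."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size
        windows.setdefault(start, []).append(value)
    return [(start, start + size, vals) for start, vals in sorted(windows.items())]

def sliding_windows(events, size, slide):
    """Windows of `size` seconds starting every `slide` seconds; they overlap when slide < size."""
    if not events:
        return []
    last_ts = max(ts for ts, _ in events)
    result = []
    start = 0
    while start <= last_ts:
        vals = [v for ts, v in events if start <= ts < start + size]
        result.append((start, start + size, vals))
        start += slide
    return result

events = [(1, "a"), (4, "b"), (6, "c"), (9, "d")]  # hypothetical (second, value) pairs

# Tumbling: every event lands in exactly one window.
print(tumbling_windows(events, size=5))
# [(0, 5, ['a', 'b']), (5, 10, ['c', 'd'])]

# Sliding: the same event can appear in several windows.
print(sliding_windows(events, size=5, slide=2))
# [(0, 5, ['a', 'b']), (2, 7, ['b', 'c']), (4, 9, ['b', 'c']), (6, 11, ['c', 'd']), (8, 13, ['d'])]
```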

Spark
- graph processing, ML, micro-batch streaming
- Scala and Python

Kafka and Storm => perfect combination