---
title: Storm
updated: 2022-05-24 19:15:01Z
created: 2021-05-04 14:58:11Z
---

# Why use Apache Storm?

Apache Storm is a free and open source distributed real-time computation system. Apache Storm makes it easy to reliably process **unbounded streams of data**, doing for real-time processing what Hadoop did for batch processing. Apache Storm is simple and can be used with any programming language.

Apache Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, re-partitioning the streams between each stage of the computation however needed. Read more in the tutorial.

- Real-time continuous streaming data on clusters
- Can run on top of YARN
- Works on individual events (**NOT** micro-batches like Spark Streaming)
- Storm is an alternative to Spark Streaming, and the better fit when each event must be processed as it arrives
- Storm is well suited to sub-second latency (fast)

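The difference between per-event processing and micro-batching can be sketched in plain Python. This is only an illustration and uses no Storm or Spark API: the function names, the `(arrival time, payload)` event shape, and the one-second interval are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload), hypothetical shape


def per_event(events: Iterable[Event]) -> Iterator[Event]:
    """Handle every event on its own, at the moment it arrives."""
    for arrival, payload in events:
        yield arrival, payload


def micro_batch(events: Iterable[Event], interval: float) -> Iterator[Tuple[float, List[str]]]:
    """Group events into fixed intervals; a batch is handled only when its interval closes."""
    batches: Dict[int, List[str]] = {}
    for arrival, payload in events:
        index = int(arrival // interval)  # which interval the event falls into
        batches.setdefault(index, []).append(payload)
    for index in sorted(batches):
        # The processing time of a batch is the end of its interval.
        yield (index + 1) * interval, batches[index]


events = [(0.1, "a"), (0.4, "b"), (1.2, "c")]
print(list(per_event(events)))
print(list(micro_batch(events, 1.0)))
```

In the micro-batch version, event `a` arrives at 0.1 s but is not handled until its interval closes at 1.0 s. The per-event version avoids that waiting time.
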
## Storm Topology

<img src="../images/StormSpoutBolt.png" width="200">

- Streams consist of ___tuples___ that flow through the topology
- Spouts are ___sources___ of stream data (from Kafka, Twitter, etc.)
- ___Bolts___ process stream data as it's received
    - transform, aggregate, write to a database / HDFS
    - So there is no final state: the data stream goes on and on forever
- A Storm topology is a graph of spouts and bolts that process the stream
    - can get complex (in Spark you get the DAG for free)

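The spout and bolt roles above can be sketched with plain Python generators. This is a toy model, not Storm's API: `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical names, and a finite list stands in for what would be an unbounded stream in a real topology.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: a source that emits one tuple per incoming sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: transforms each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count and emits an updated tuple for every word."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences: Iterable[str]) -> List[Tuple[str, int]]:
    # Wire the graph: spout -> split bolt -> count bolt.
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


print(run_topology(["the cow", "the moon"]))
```

Each stage emits a result as soon as a tuple reaches it, so the count bolt produces a running total per word instead of a final answer. This matches the "no final state" point in the list.
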
## Storm Architecture

<img src="../images/StormArchitecture.png" width="300">

- Nimbus is a single point of failure
    - Acts as the job tracker
    - can restart quickly without losing any data
    - HA is available as a Nimbus backup server
- ZooKeeper (which is itself HA)
- Supervisors run the worker processes that do the actual work

## Developing Storm applications

- usually in Java
    - Bolts may be directed through scripts in other languages
    - This is a selling point of Storm, but in practice most code is written in Java
- Storm Core
    - lower-level API for Storm
    - "At-least-once" semantics (possibility of duplicated data)
- Trident
    - High-level API for Storm <=== prefer
    - "Exactly-once" semantics
- Once submitted, a topology runs forever, until explicitly stopped

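The practical difference between the two delivery guarantees can be shown with a small simulation. This is a sketch under stated assumptions and is neither Storm Core nor Trident code: it assumes batches carry a transaction id (`txid`), arrive in txid order, and may be replayed after a failure. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Batch = Tuple[int, List[str]]  # (transaction id, words in the batch), hypothetical shape


def count_at_least_once(deliveries: List[Batch]) -> Dict[str, int]:
    """Naive counting: a replayed batch is counted a second time."""
    counts: Dict[str, int] = {}
    for _txid, words in deliveries:
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return counts


def count_exactly_once(deliveries: List[Batch]) -> Dict[str, int]:
    """Idempotent counting: remember the last applied txid and skip replays.

    Assumes batches are applied in increasing txid order.
    """
    counts: Dict[str, int] = {}
    last_txid = -1
    for txid, words in deliveries:
        if txid <= last_txid:
            continue  # replay of a batch that was already applied
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        last_txid = txid
    return counts


# Batch 2 is delivered twice, as it would be after a failed acknowledgement.
deliveries = [(1, ["a", "b"]), (2, ["a"]), (2, ["a"])]
print(count_at_least_once(deliveries))
print(count_exactly_once(deliveries))
```

With at-least-once delivery the replayed batch inflates the count for `a` to 3. Tracking the transaction id keeps it at 2, which is the effect the "exactly-once" label describes.
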
## Storm vs Spark Streaming

Storm
- tumbling window
    - i.e. exactly the events from the past 5 seconds; no event falls into two windows
- sliding window
    - windows can overlap by design

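The two window types can be compared with a short plain-Python sketch. It is a simplification and does not use Storm's windowing API: windows here are aligned to start at time 0, timestamps are whole seconds, and the function names are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, payload), hypothetical shape


def sliding_windows(events: List[Event], size: int, slide: int) -> List[List[str]]:
    """Windows of `size` seconds that start every `slide` seconds.

    Windows overlap whenever slide < size.
    """
    if not events:
        return []
    last = max(ts for ts, _ in events)
    windows: List[List[str]] = []
    start = 0
    while start <= last:
        # A window covers the half-open range [start, start + size).
        windows.append([p for ts, p in events if start <= ts < start + size])
        start += slide
    return windows


def tumbling_windows(events: List[Event], size: int) -> List[List[str]]:
    """Non-overlapping windows: a sliding window whose slide equals its size."""
    return sliding_windows(events, size, slide=size)


events = [(1, "a"), (4, "b"), (6, "c"), (9, "d")]
print(tumbling_windows(events, 5))
print(sliding_windows(events, 5, 3))
```

In the tumbling case every event appears in exactly one window. In the sliding case `b` and `c` each appear in two consecutive windows, because each window starts before the previous one has ended.
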
Storm: Java in practice

Spark
- graph, ML, micro-batch streaming

Spark: Scala and Python (Java is also supported)

Kafka and Storm => perfect combination