title updated created
Storm 2022-05-24 19:15:01Z 2021-05-04 14:58:11Z

Why use Apache Storm?

Apache Storm is a free and open source distributed real-time computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Apache Storm is simple and can be used with any programming language.

Apache Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, re-partitioning the streams between each stage of the computation however needed. Read more in the tutorial.

  • Real-time continuous streaming data on clusters
  • Can run on top of YARN (it also runs as a standalone cluster)
  • Works on individual events (NOT micro-batches like Spark)
    • Storm is a better fit than Spark Streaming when each event must be handled as it arrives
  • Storm is well suited to sub-second latency (fast)

Storm Topology

  • Streams consist of tuples that flow through the topology
  • Spouts are sources of stream data (from Kafka, Twitter, etc.)
  • Bolts process stream data as it is received
    • transform, aggregate, write to database / HDFS
    • There is no final state: the data stream goes on continuously, forever
  • A Storm topology is a graph of spouts and bolts that process the stream
    • can get complex (in Spark you get the DAG for free)
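
As a rough mental model of the spout/bolt graph described above, the following plain-Java sketch wires one source to two processing stages (split sentences into words, then keep a running count per word). It deliberately does not use the Storm API: `ToyTopology`, `Bolt`, `SplitBolt` and `CountBolt` are hypothetical stand-ins, and everything runs in a single thread on a finite input, whereas a real topology is distributed and never terminates.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Plain-Java model of a topology: spout -> split bolt -> count bolt.
// None of these types come from Storm; they only mirror the roles.
final class ToyTopology {

    // A "bolt" receives one tuple at a time and may emit tuples downstream.
    interface Bolt<I, O> {
        void execute(I tuple, Consumer<O> emit);
    }

    // Splits a sentence tuple into lower-cased word tuples.
    static final class SplitBolt implements Bolt<String, String> {
        @Override
        public void execute(String sentence, Consumer<String> emit) {
            for (String word : sentence.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word.toLowerCase());
                }
            }
        }
    }

    // Keeps a running count per word and emits the updated (word, count) pair.
    static final class CountBolt implements Bolt<String, Map.Entry<String, Integer>> {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        public void execute(String word, Consumer<Map.Entry<String, Integer>> emit) {
            int updated = counts.merge(word, 1, Integer::sum);
            emit.accept(Map.entry(word, updated));
        }
    }

    // Pulls tuples from the "spout" (here: any iterator) and pushes each one
    // through both bolts as soon as it arrives, one event at a time.
    static Map<String, Integer> run(Iterator<String> spout) {
        SplitBolt split = new SplitBolt();
        CountBolt count = new CountBolt();
        while (spout.hasNext()) {
            // The final emit is discarded: the count bolt is the last stage.
            split.execute(spout.next(), word -> count.execute(word, update -> { }));
        }
        return count.counts;
    }
}
```

Calling `ToyTopology.run(List.of("the cat", "The dog").iterator())` returns the counts accumulated so far; in a real topology there is no such return point, which is why bolts typically write their results to a database or HDFS as they go.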

Storm Architecture

  • Nimbus is a single point of failure
    • Acts as the job tracker
    • can restart quickly without losing any data
    • HA is available as a Nimbus backup server
  • ZooKeeper (itself highly available)
  • Supervisors run the worker processes that do the actual work

Developing Storm applications

  • usually in Java
    • Bolts can hand work off to scripts in other languages
    • This is a selling point of Storm, but in practice most code is Java
  • Storm Core
    • lower-level API for Storm
    • "At-least-once" semantics (possibility of duplicated data)
  • Trident
    • High-level API for Storm (preferred)
    • "Exactly-once" semantics
    • Once submitted, a topology runs forever, until explicitly stopped
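
The difference between the two guarantees can be illustrated without Storm itself. In the sketch below (plain Java, with hypothetical names such as `DeliverySketch` and `txId`), a batch is replayed after a simulated failure: the naive counter counts it twice, while the second counter stores the last applied transaction id next to the count and skips the replay. This is similar in spirit to how Trident's transactional state gets exactly-once updates, but it is a simplification under the assumption that batches arrive in transaction-id order, not Trident's actual implementation.

```java
import java.util.List;

// Plain-Java illustration of at-least-once vs. exactly-once state updates.
// These classes are hypothetical and do not use the Storm or Trident API.
final class DeliverySketch {

    // At-least-once: every delivery is applied, so a replayed batch is
    // counted again and the total drifts upward.
    static final class NaiveCounter {
        long count;

        void apply(long txId, List<String> batch) {
            count += batch.size();
        }
    }

    // Exactly-once effect: the last applied txId is kept with the count,
    // and a batch carrying that same txId is recognised as a replay.
    // Assumes batches are applied in txId order.
    static final class TransactionalCounter {
        long count;
        long lastTxId = -1;

        void apply(long txId, List<String> batch) {
            if (txId == lastTxId) {
                return; // replay of the batch that was just applied
            }
            count += batch.size();
            lastTxId = txId;
        }
    }
}
```

Applying batch 1 (two tuples) twice and then batch 2 (one tuple) leaves the naive counter at 5 and the transactional counter at 3. With at-least-once semantics, the usual alternative is to make the downstream write idempotent.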

Storm vs Spark Streaming

Storm

  • tumbling window
    • i.e. all events in the past 5 sec exactly; no overlap of events
  • sliding window
    • can overlap by design
  • Storm: Java only
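
The two window types above differ only in how far the window advances each step. The following plain-Java sketch (hypothetical `WindowSketch.windows` helper, not Storm's windowing API) groups event timestamps into windows of a fixed length: when the slide equals the length the windows tumble, and when the slide is shorter they overlap. Aligning windows to start at time 0 is an assumption made for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of tumbling vs. sliding windows over timestamps.
final class WindowSketch {

    // Groups timestamps (ms) into windows [start, start + lengthMs) that
    // begin every slideMs, for all starts before endMs.
    // slideMs == lengthMs -> tumbling (no event appears twice)
    // slideMs <  lengthMs -> sliding  (an event can appear in several windows)
    static List<List<Long>> windows(List<Long> timestamps, long lengthMs, long slideMs, long endMs) {
        if (lengthMs <= 0 || slideMs <= 0) {
            throw new IllegalArgumentException("lengthMs and slideMs must be positive");
        }
        List<List<Long>> result = new ArrayList<>();
        for (long start = 0; start < endMs; start += slideMs) {
            List<Long> window = new ArrayList<>();
            for (long t : timestamps) {
                if (t >= start && t < start + lengthMs) {
                    window.add(t);
                }
            }
            result.add(window);
        }
        return result;
    }
}
```

For events at 1000, 4000, 6000 and 9000 ms, `windows(events, 5000, 5000, 10000)` yields two disjoint 5-second windows, while `windows(events, 5000, 2500, 10000)` yields four windows in which the event at 4000 ms appears twice.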

Spark

  • graph, ML, micro-batch streaming
  • Spark: Scala and Python

Kafka and Storm => perfect combination