A curated list of awesome streaming (stream processing) frameworks, applications, readings and other resources. Inspired by other awesome projects.

Website

https://manuzhang.github.io/awesome-streaming/ is a more dynamic website where you can find updates on the awesome projects listed here.

Streaming Engine

  • Apache Apex [Java] - unified platform for big data stream and batch processing.
  • Apache Ballista [Rust] - distributed compute platform powered by Apache Arrow.
  • Apache Flink [Java] - system for high-throughput, low-latency data stream processing that supports stateful computation, data-driven windowing semantics and iterative stream processing.
  • Apache Heron (incubating) [Java] - a realtime, distributed, fault-tolerant stream processing engine from Twitter.
  • Apache Samza [Scala/Java] - distributed stream processing framework that builds on Kafka (messaging, storage) and YARN (fault tolerance, processor isolation, security and resource management).
  • Apache Spark Streaming [Scala] - makes it easy to build scalable fault-tolerant streaming applications.
  • Apache Storm [Clojure/Java] - distributed real-time computation system. Storm is to stream processing what Hadoop is to batch processing.
  • AthenaX [Java] - Uber's Stream Analytics Framework used in production
  • Faust [Python] - stream processing library, porting the ideas from Kafka Streams to Python
  • Gearpump [Scala] - lightweight real-time distributed streaming engine built on Akka.
  • Hazelcast Jet [Java] - A general purpose distributed data processing engine, built on top of Hazelcast.
  • hailstorm [Haskell] - distributed stream processing with exactly-once semantics based on Storm.
  • Maki Nage [Python] - A stream processing framework for data scientists, based on Kafka and ReactiveX.
  • mantis [Java] - Netflix's platform to build an ecosystem of realtime stream processing applications
  • mupd8 (muppet) [Scala/Java] - MapReduce-style framework for processing fast/streaming data.
  • Onyx [Clojure] - Distributed, masterless, high performance, fault tolerant data processing.
  • s4 [Java] - general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
  • SABER [Java/C] - Window-Based Hybrid CPU/GPU Stream Processing Engine.
  • Scramjet Transform Hub [JavaScript/Node.js] - data processing engine for running multiple data processing apps (sequences) written in JavaScript or TypeScript
  • SPQR [Java] - dynamic framework for processing high-volume data streams through pipelines.
  • tigon [C++/Java] - high throughput real-time streaming processing framework built on Hadoop and HBase.
  • Teknek [Java] - Simple elegant stream processing with interactive prototyping shell SOL (Stream Operator Language).
  • Trill [.NET/C#] - Trill is a high-performance one-pass in-memory streaming analytics engine from Microsoft Research.
  • Wallaroo [Python] - A fast, stream-processing framework. Wallaroo makes it easy to react to data in real-time. By eliminating infrastructure complexity, going from prototype to production has never been simpler.
  • LightSaber [C++] - Multi-core Window-Based Stream Processing Engine. LightSaber uses code generation for efficient window aggregation.
  • HStreamDB [Haskell] - The streaming database built for IoT data storage and real-time processing.
  • Kuiper [Golang] - Lightweight IoT edge data analytics and streaming software written in Go that can run on all kinds of resource-constrained edge devices.
  • WindFlow [C++] - A C++17 Data Stream Processing Parallel Library for Multicores and GPUs

Streaming Library

  • Apache Kafka Streams [Java] - lightweight stream processing library included in Apache Kafka (since version 0.10).
  • Akka Streams [Scala] - stream processing library on Akka Actors.
  • Daggy [C++] - real-time streams aggregation and catching.
  • Benthos [Go] - Benthos is a high performance and resilient message streaming service, able to connect various sources and sinks and perform arbitrary actions, transformations and filters on payloads
  • FS2 (prev. 'Scalaz-Stream') [Scala] - Compositional, streaming I/O library for Scala.
  • monix [Scala] - high-performance Scala / Scala.js library for composing asynchronous and event-based programs.
  • Scramjet Framework - functional reactive stream programming framework written on top of Node.js object streams.
  • Streamline [Java] - Stream Analytics Framework by Hortonworks, designed as a wrapper around existing streaming solutions like Storm. Aimed to allow users to drag-and-drop streaming components to focus on business logic.
  • StreamAlert [Python] - Airbnb's Real-time Data Analysis and Alerting.
  • Swave [Scala] - A lightweight Reactive Streams Infrastructure Toolkit for Scala.
  • Streamz [Python] - A lightweight library for building pipelines to manage continuous streams of data; supports complex pipelines that involve branching, joining, flow control, feedback, back pressure, and so on.
  • Stream Ops [Java] - A fully embeddable data streaming engine and stream processing API for Java.
  • Tributary [Python] - A Python library for constructing dataflow graphs. Supports synchronous, reactive data streams built using Python generators that mimic complex event processors, as well as lazily-evaluated acyclic graphs and functional currying streams.
  • YoMo [Go] - An open source streaming serverless framework for building low-latency, geo-distributed systems. YoMo is built atop the QUIC transport protocol and a functional reactive programming interface.

Streaming Application

  • straw [Python/Java] - A platform for real-time streaming search.
  • storm-crawler [Java] - Web crawler SDK based on Apache Storm.

IoT

  • sensorbee [Go] - lightweight stream processing engine for IoT.
  • Apache Edgent [Java] - a programming model and runtime that enables continuous streaming analytics on gateways and edge devices, which can work with centralized systems to provide efficient and timely analytics across the whole IoT ecosystem, from the center to the edge; open sourced by IBM.
  • Apache StreamPipes [Java] - a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams.

DSL

  • Apache Beam [Java, Python, SQL, Scala, Go] - unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs), open sourced by Google.
  • coast [Scala] - a DSL that builds DAGs on top of Samza and provides exactly-once semantics.
  • Esper [Java] - component for complex event processing (CEP) and event series analysis.
  • Streamparse [Python] - lets you run Python code against real-time streams of data via Apache Storm.
  • summingbird [Scala] - library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Data Pipeline

  • Apache Kafka [Scala/Java] - distributed, partitioned, replicated commit log service, which provides the functionality of a messaging system, but with a unique design.
  • Apache Pulsar [Java] - distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
  • Apache RocketMQ [Java] - distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
  • brooklin [Java] - a distributed system intended for streaming data between various heterogeneous source and destination systems with high reliability and throughput at scale, from LinkedIn (replaced databus).
  • camus [Java] - LinkedIn's Kafka -> HDFS pipeline.
  • databus [Java] - LinkedIn's source-agnostic distributed change data capture system.
  • flume [Java] - distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  • fluvio [Rust/WASM] - Real-time programmable data streaming platform with in-line computation capabilities.
  • Gazette [golang] - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
  • LogDevice [C++] - a high-performance distributed system by Facebook for streaming and storing sequential data, using a log structure.
  • metaq [Java] - Taobao's highly available, high performance distributed messaging system.
  • NATS streaming [Go] - fast disk-backed messaging solution
  • nsq [Go] - realtime distributed messaging platform designed to operate at scale, handling billions of messages per day.
  • Redpanda [C++] - Redpanda is Kafka compatible, ZooKeeper-free, JVM-free and source available.
  • RudderStack [Go] - an open source customer data infrastructure (segment, mparticle alternative).
  • suro [Java] - data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data.
  • StreamSets Data Collector [Java] - continuous big data ingestion infrastructure that reads from and writes to a large number of end-points, including S3, JDBC, Hadoop, Kafka, Cassandra and many others.

Online Machine Learning

  • Apache Samoa [Java] - distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.
  • DataSketches [Java] - sketches library from Yahoo!.
  • streamDM [Scala] - mining Big Data streams using Spark Streaming from Huawei.
  • StreamingBandit [Python] - Provides a webserver to quickly set up and evaluate possible solutions to contextual multi-armed bandit (cMAB) problems.
  • StormCV [Java] - enables the use of Apache Storm for video processing by adding computer vision (CV) specific operations and data model.
  • trident-ml [Java] - realtime online machine learning library based on Trident.
  • yurita [Scala] - Anomaly detection framework built on Spark Structured Streaming, from PayPal.

Streaming SQL

  • pipelinedb [C] - An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
  • squall [Java] - Squall executes SQL queries on top of Storm for doing online processing.
  • StreamCQL [Java] - Continuous Query Language on RealTime Computation System.
  • ksqlDB [Java] - A cloud-native, source-available database purpose-built for stream processing applications
  • Materialize [Rust] - A source-available streaming SQL engine for maintaining materialized views on data from message brokers and databases.
  • Siddhi [Java] - A cloud native Streaming and Complex Event Processing engine that understands Streaming SQL queries in order to capture events from diverse data sources, process them, detect complex conditions, and publish output to various endpoints in real time.

Benchmark

  • storm-perf-test [Java] - a simple Storm performance/stress test.
  • streaming-benchmarks [Java] - Benchmarks for Low Latency (Streaming) solutions including Apache Storm, Apache Spark, Apache Flink, etc.
  • flotilla [Go] - Automated message queue orchestration for scaled-up benchmarking.

Toolkit

  • akka [Scala] - toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
  • pulsar [Python] - Actor based event driven concurrent framework for Python.
  • aeron [Java/C++] - efficient reliable unicast and multicast message transport.
  • StreamFlow [Java] - stream processing tool designed to help build and monitor processing workflows.
  • samza-luwak [Java] - uses Luwak, a stored-query engine built on Lucene, to implement full-text search on streams.
  • Turbine [Java] - tool for aggregating streams of Server-Sent Event (SSE) JSON data into a single stream.
  • Nussknacker [Scala] - A visual tool to define and run real-time decision algorithms.

Closed Source

  • Amazon Kinesis Streams [Java] - real-time, fully managed and scalable data stream engine provided by AWS.
  • Azure Stream Analytics [.NET] - a massively scalable, fully managed, real-time data stream engine provided by Microsoft Azure.
  • Cloud Dataflow [Java, Python, SQL, Scala] - Google's managed stream and batch data processing engine. Supports running Beam pipelines.
  • concord [C++] - a distributed stream processing framework built in C++ on top of Apache Mesos, designed for high performance data processing jobs that require flexibility and control.
  • IBM Streams [Python/Java/Scala] - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
  • jubatus [C++] - distributed processing framework and streaming machine learning library.
  • millwheel - framework for building low-latency data-processing applications that is widely used at Google.

Readings

  1. In-Stream Big Data Processing
  2. The world beyond batch: Streaming 101 by Tyler Akidau.
  3. Real Time Analytics: Algorithms and Systems (VLDB 2015)
  4. Grokking Streaming Systems by Josh Fischer & Ning Wang
  5. Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Reuven Lax, Slava Chernyak, and Tyler Akidau
  6. Data Pipelines with Apache Airflow by Bas P. Harenslak and Julian Rutger de Ruiter

Original: https://github.com/manuzhang/awesome-streaming