
A curated list of awesome Apache Spark packages and resources.

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance (Wikipedia 2017).

Users of Apache Spark may choose between the Python, R, Scala and Java programming languages to interface with the Apache Spark APIs.

Contents

Packages

Language Bindings

Notebooks and IDEs

  • almond  - A Scala kernel for Jupyter.
  • Apache Zeppelin  - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box.
  • Polynote  - An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.
  • Spark Notebook  - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
  • sparkmagic  - Jupyter magics and kernels for working interactively with remote Spark clusters through Livy.

General Purpose Libraries

  • Succinct - Support for efficient queries on compressed data.
  • itachi  - A library that brings useful functions from modern database management systems to Apache Spark.
  • spark-daria  - A Scala library with essential Spark functions and extensions to make you more productive.
  • quinn  - A native PySpark implementation of spark-daria.
  • Apache DataFu  - A library of general purpose functions and UDFs.
  • Joblib Apache Spark Backend  - joblib backend for running tasks on Spark clusters.

SQL Data Sources

SparkSQL has several built-in Data Sources for files. These include csv, json, parquet, orc, and avro. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or writing your own.
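As a rough illustration of how the built-in file sources are selected by name, the sketch below maps a file extension to a data source and passes it to PySpark's `DataFrameReader`. The helper names `infer_format` and `load` are hypothetical and not part of Spark; only `spark.read.format(...).options(...).load(...)` is the PySpark API. It assumes an existing `SparkSession` is passed in.

```python
from pathlib import PurePosixPath

# File-based data sources named in the paragraph above.
# Note: avro support ships as the separate spark-avro module, which has to be
# added to the application (for example via --packages) before it can be used.
BUILTIN_FILE_FORMATS = {"csv", "json", "parquet", "orc", "avro"}


def infer_format(path: str) -> str:
    """Guess the data source name from the file extension (hypothetical helper)."""
    suffix = PurePosixPath(path).suffix.lstrip(".").lower()
    if suffix not in BUILTIN_FILE_FORMATS:
        raise ValueError(f"no built-in file source for extension {suffix!r}")
    return suffix


def load(spark, path: str, **options):
    """Read `path` into a DataFrame using the source inferred from its extension.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    return spark.read.format(infer_format(path)).options(**options).load(path)
```

With a running session, `load(spark, "events.csv", header="true")` would read a CSV file with a header row. Sources provided by the packages listed below are selected the same way, by passing their source name to `format`.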

Storage

  • Delta Lake  - Storage layer with ACID transactions.
  • lakeFS  - Integration with the lakeFS atomic versioned storage layer.

Bioinformatics

  • ADAM  - Set of tools designed to analyse genomics data.
  • Hail  - Genetic analysis framework.

GIS

  • Magellan  - Geospatial analytics using Spark.
  • Apache Sedona  - Cluster computing system for processing large-scale spatial data.

Time Series Analytics

  • Spark-Timeseries  - Scala / Java / Python library for interacting with time series data on Apache Spark.
  • flint  - A time series library for Apache Spark.

Graph Processing

  • Mazerunner  - Graph analytics platform on top of Neo4j and GraphX.
  • GraphFrames  - Data frame based graph API.
  • neo4j-spark-connector  - Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support.
  • SparklingGraph  - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).

Machine Learning Extension

Middleware

  • Livy  - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
  • spark-jobserver  - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
  • Mist  - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
  • Apache Toree  - IPython protocol based middleware for interactive applications.
  • Apache Kyuubi  - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.
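
To give a feel for how a REST middleware such as Livy is driven, here is a minimal sketch that only builds the HTTP requests for creating an interactive session and submitting a statement, without sending them. The endpoint paths (`/sessions`, `/sessions/{id}/statements`) and payload keys (`kind`, `code`) follow Livy's REST API; the helper names and the base URL are assumptions for illustration.

```python
import json
from urllib.request import Request


def livy_post(base_url: str, path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request against a Livy server."""
    return Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(base_url: str, kind: str = "pyspark") -> Request:
    """Request that asks Livy to start an interactive session of the given kind."""
    return livy_post(base_url, "/sessions", {"kind": kind})


def run_statement_request(base_url: str, session_id: int, code: str) -> Request:
    """Request that submits a code snippet to an existing session."""
    return livy_post(base_url, f"/sessions/{session_id}/statements", {"code": code})
```

Passing either request to `urllib.request.urlopen` would send it; `http://localhost:8998` is a typical address for a local Livy server, but the actual host and port depend on the deployment.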

Monitoring

Utilities

  • silex  - Collection of tools varying from ML extensions to additional RDD methods.
  • sparkly  - Helpers & syntactic sugar for PySpark.
  • pyspark-stubs  - Static type annotations for PySpark (obsolete since Spark 3.1. See SPARK-32681).
  • Flintrock  - A command-line tool for launching Spark clusters on EC2.
  • Optimus  - Data Cleansing and Exploration utilities with the goal of simplifying data cleaning.

Natural Language Processing

Streaming

  • Apache Bahir  - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Interfaces

  • Apache Beam  - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
  • Blaze  - Interface for querying larger-than-memory datasets using Pandas-like syntax. It supports both Spark DataFrames and RDDs.
  • Koalas  - Pandas DataFrame API on top of Apache Spark.

Testing

  • deequ  - A library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
  • spark-testing-base  - Collection of base test classes.
  • spark-fast-tests  - A lightweight and fast testing framework.

Web Archives

Workflow Management

Resources

Books

Papers

MOOCS

Workshops

Projects Using Spark

  • Oryx 2 - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time large scale machine learning.
  • Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
  • PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
  • Crossdata - Data integration platform with extended DataSource API and multi-user environment.

Docker Images

Miscellaneous

References

Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.

Source: https://github.com/awesome-spark/awesome-spark