Jina lets you build multimodal AI services and pipelines that communicate via gRPC, HTTP and WebSockets, then scale them up and deploy them to production. You can focus on your logic and algorithms without worrying about infrastructure complexity.
Jina provides a smooth Pythonic experience for serving ML models, transitioning from local deployment to advanced orchestration frameworks such as Docker Compose, Kubernetes, or Jina AI Cloud. Jina makes advanced solution engineering and cloud-native technologies accessible to every developer.
- Build and serve models for any data type and any mainstream deep learning framework.
- Design high-performance services with easy scaling, duplex client-server streaming, batching, dynamic batching, async/non-blocking data processing and any protocol.
- Serve LLM models while streaming their output.
- Docker container integration via Executor Hub, and OpenTelemetry/Prometheus observability.
- Streamlined CPU/GPU hosting via Jina AI Cloud.
- Deploy to your own cloud or system with our Kubernetes and Docker Compose integrations.
Wait, how is Jina different from FastAPI?
Jina's value proposition may seem very similar to that of FastAPI. However, there are several fundamental differences:
- Data structures and communication protocols
  - FastAPI communication relies on Pydantic, while Jina relies on DocArray, which allows Jina to support multiple protocols for exposing its services. Support for the gRPC protocol is especially useful for data-intensive applications such as embedding services, where embeddings and tensors can be serialized more efficiently (see the sketch after this list).
- Advanced orchestration and scaling capabilities
  - Jina lets you easily containerize and orchestrate your services and models, providing concurrency and scalability.
  - Jina lets you deploy applications composed of multiple microservices that can be containerized and scaled independently.
- Journey to the cloud
  - Jina provides a smooth transition from local development (using DocArray) to local serving (using Deployment and Flow) to production-ready serving, by leveraging Kubernetes' ability to orchestrate container lifetimes.
  - By using Jina AI Cloud, you get access to scalable and serverless deployment of your applications with a single command.
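To make the data-layer difference concrete, here is a minimal sketch of a DocArray document carrying a tensor and an embedding alongside text; the class and field names below are illustrative, not part of Jina's API:

```python
from typing import Optional

from docarray import BaseDoc
from docarray.typing import NdArray


class ImageEmbeddingDoc(BaseDoc):
    # illustrative schema: text plus array-valued payloads
    caption: str = ''
    tensor: Optional[NdArray] = None          # e.g. raw pixel or feature data
    embedding: Optional[NdArray[512]] = None  # fixed-size embedding vector
```

Because such documents serialize to Protobuf, exposing them over gRPC avoids the JSON encoding overhead that large arrays would incur over a plain HTTP/Pydantic stack.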
Documentation
Install
```shell
pip install jina
```
Find more install options on Apple Silicon/Windows.
Get Started
Basic Concepts
Jina has three fundamental layers:
- Data layer: BaseDoc and DocList (from DocArray) are the input/output formats in Jina.
- Serving layer: An Executor is a Python class that transforms and processes Documents. Simply wrapping your models in an Executor lets Jina serve and scale them. The Gateway is the service that connects all Executors inside a Flow.
- Orchestration layer: Deployment serves a single Executor, while a Flow serves Executors chained into a pipeline.
The full glossary is explained here.
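As a quick, hypothetical illustration of how the three layers fit together (the document schema and Executor below are made up for this sketch):

```python
from docarray import BaseDoc, DocList            # data layer
from jina import Executor, requests, Deployment  # serving and orchestration layers


class TextDoc(BaseDoc):
    text: str = ''


class UpperCaseExecutor(Executor):
    @requests
    def upper(self, docs: DocList[TextDoc], **kwargs) -> DocList[TextDoc]:
        # serving layer: transform the incoming Documents
        for doc in docs:
            doc.text = doc.text.upper()
        return docs


# orchestration layer: a Deployment serves this single Executor
dep = Deployment(uses=UpperCaseExecutor, port=12345)
with dep:
    dep.block()
```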
Serve AI models
Let's build a fast, reliable and scalable gRPC-based AI service. In Jina we call this an Executor. Our simple Executor will wrap the StableLM LLM from Stability AI. We'll then use a Deployment to serve it.
Note A Deployment serves just one Executor. To combine multiple Executors into a pipeline and serve that, use a Flow.
Let's implement the service's logic:
`executor.py`:

```python
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # each pipeline output is a list of dicts with a 'generated_text' key
            generations.append(
                Generation(prompt=prompt, text=output[0]['generated_text'])
            )
        return generations
```
Then we deploy it with either the Python API or YAML:
Python API (`deployment.py`):

```python
from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()
```

YAML (`deployment.yml`):

```yaml
jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345
```

And run the YAML Deployment with the CLI: `jina deployment --uses deployment.yml`
Use Jina Client to make requests to the service:
```python
from jina import Client
from docarray import DocList, BaseDoc


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


prompt = Prompt(
    text='suggest an interesting image generation prompt for a mona lisa variant'
)

client = Client(port=12345)  # use port from output above
response = client.post(on='/', inputs=[prompt], return_type=DocList[Generation])
print(response[0].text)
```
```text
a steampunk version of the Mona Lisa, incorporating mechanical gears, brass elements, and Victorian era clothing details
```
Note In a notebook, you can't use `deployment.block()` and then make requests with the client. Please refer to the Colab link above for reproducible Jupyter Notebook code snippets.
Build a pipeline
Sometimes you want to chain microservices together into a pipeline. That's where a Flow comes in.
A Flow is a DAG pipeline composed of a set of steps. It orchestrates a set of Executors and a Gateway to offer an end-to-end service.
Note If you just want to serve a single Executor, you can use a Deployment.
For instance, let's combine our StableLM language model with a Stable Diffusion image generation model. Chaining these services together into a Flow will give us a service that will generate images based on a prompt generated by the LLM.
`text_to_image.py`:

```python
import numpy as np
from jina import Executor, requests
from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc


class Generation(BaseDoc):
    prompt: str
    text: str


class TextToImage(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        from diffusers import StableDiffusionPipeline
        import torch

        self.pipe = StableDiffusionPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
        ).to("cuda")

    @requests
    def generate_image(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
        # images are returned in PIL format (https://pillow.readthedocs.io/en/stable/)
        images = self.pipe(docs.text).images
        result = DocList[ImageDoc]()
        for image in images:
            result.append(ImageDoc(tensor=np.array(image)))
        return result
```
Build the Flow with either Python or YAML:
Python API (`flow.py`):

```python
from jina import Flow
from executor import StableLM
from text_to_image import TextToImage

flow = (
    Flow(port=12345)
    .add(uses=StableLM, timeout_ready=-1)
    .add(uses=TextToImage, timeout_ready=-1)
)

with flow:
    flow.block()
```

YAML (`flow.yml`):

```yaml
jtype: Flow
with:
  port: 12345
executors:
  - uses: StableLM
    timeout_ready: -1
    py_modules:
      - executor.py
  - uses: TextToImage
    timeout_ready: -1
    py_modules:
      - text_to_image.py
```

Then run the YAML Flow with the CLI: `jina flow --uses flow.yml`
Then, use Jina Client to make requests to the Flow:
```python
from jina import Client
from docarray import DocList, BaseDoc
from docarray.documents import ImageDoc


class Prompt(BaseDoc):
    text: str


prompt = Prompt(
    text='suggest an interesting image generation prompt for a mona lisa variant'
)

client = Client(port=12345)  # use port from output above
response = client.post(on='/', inputs=[prompt], return_type=DocList[ImageDoc])
response[0].display()
```
Easy scalability and concurrency
Why not just use standard Python to build that service and pipeline? Jina accelerates time to market of your application by making it more scalable and cloud-native. Jina also handles the infrastructure complexity in production and other Day-2 operations so that you can focus on the data application itself.
Increase your application's throughput with scalability features out of the box, like replicas, shards and dynamic batching.
Let's scale a Stable Diffusion Executor deployment with replicas and dynamic batching:
- Create two replicas, with a GPU assigned for each.
- Enable dynamic batching to process incoming parallel requests together with the same model inference.
Normal Deployment:

```yaml
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
```

Scaled Deployment:

```yaml
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching:  # configure dynamic batching
    /default:
      preferred_batch_size: 10
      timeout: 200
```
Assuming your machine has two GPUs, using the scaled deployment YAML will give better throughput compared to the normal deployment.
These features apply to both Deployment YAML and Flow YAML. Thanks to the YAML syntax, you can inject deployment configurations regardless of Executor code.
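If you prefer the Python API over YAML, a rough equivalent of the scaled Deployment could look like the sketch below; the keyword arguments are assumed to mirror the YAML keys above, so double-check them against the Deployment reference:

```python
from jina import Deployment
from text_to_image import TextToImage

dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    replicas=2,  # two copies of the Executor
    env={'CUDA_VISIBLE_DEVICES': 'RR'},  # round-robin GPU assignment
    uses_dynamic_batching={'/default': {'preferred_batch_size': 10, 'timeout': 200}},
)

with dep:
    dep.block()
```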
Deploy to the cloud
Containerize your Executor
In order to deploy your solutions to the cloud, you need to containerize your services. Jina provides the Executor Hub, a tool that streamlines this process and takes much of the trouble off your hands. It also lets you share these Executors publicly or privately.
You just need to structure your Executor in a folder:
```text
TextToImage/
├── executor.py
├── config.yml
├── requirements.txt
```
`config.yml`:

```yaml
jtype: TextToImage
py_modules:
  - executor.py
metas:
  name: TextToImage
  description: Text to Image generation Executor based on StableDiffusion
  url:
  keywords: []
```

`requirements.txt`:

```text
diffusers
accelerate
transformers
```
Then push the Executor to the Hub by running `jina hub push TextToImage`. This will give you a URL that you can use in your Deployment and Flow to use the pushed Executor's container.
```yaml
jtype: Flow
with:
  port: 12345
executors:
  - uses: jinaai+docker://<user-id>/StableLM
  - uses: jinaai+docker://<user-id>/TextToImage
```
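The same containerized Hub Executors can also be referenced from the Python API; a minimal sketch, assuming you have pushed both Executors under your own `<user-id>`:

```python
from jina import Flow

flow = (
    Flow(port=12345)
    .add(uses='jinaai+docker://<user-id>/StableLM', timeout_ready=-1)
    .add(uses='jinaai+docker://<user-id>/TextToImage', timeout_ready=-1)
)

with flow:
    flow.block()
```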
Get on the fast lane to cloud-native
Using Kubernetes with Jina is easy:
```shell
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
```
And so is Docker Compose:
```shell
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
```
Note You can also export Deployment YAML to Kubernetes and Docker Compose.
That's not all. We also support OpenTelemetry, Prometheus, and Jaeger.
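As a hedged sketch of what enabling tracing could look like in the Python API (the argument names reflect Jina's OpenTelemetry integration as we understand it; verify them against the observability docs and point them at your own collector):

```python
from jina import Flow

# assumes an OpenTelemetry collector is listening on localhost:4317
flow = Flow(
    port=12345,
    tracing=True,
    traces_exporter_host='http://localhost',
    traces_exporter_port=4317,
).add(uses='jinaai+docker://<user-id>/TextToImage')

with flow:
    flow.block()
```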
Which cloud-native technology is still challenging for you? Tell us and we'll handle the complexity and make it easy for you.
Deploy to JCloud
You can also deploy a Flow to JCloud, where you can easily enjoy autoscaling, monitoring and more with a single command.
First, turn the `flow.yml` file into a JCloud-compatible YAML by specifying resource requirements and using containerized Hub Executors. Then, use the `jina cloud deploy` command to deploy to the cloud:
```shell
wget https://raw.githubusercontent.com/jina-ai/jina/master/.github/getting-started/jcloud-flow.yml
jina cloud deploy jcloud-flow.yml
```
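For orientation, a JCloud-compatible Flow YAML usually adds a `jcloud` section with resource requests on top of containerized Hub Executors. The snippet below is a hypothetical sketch with illustrative values, not the contents of the downloaded `jcloud-flow.yml`:

```yaml
jtype: Flow
executors:
  - uses: jinaai+docker://<user-id>/TextToImage
    timeout_ready: -1
    jcloud:
      resources:
        memory: 16G
        gpu: shared
```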
Warning
Make sure to delete/clean up the Flow once you are done with this tutorial to save resources and credits.
Read more about deploying Flows to JCloud.
Streaming for LLMs
Large Language Models can power a wide range of applications from chatbots to assistants and intelligent systems. However, these models can be heavy and slow and your users want systems that are both intelligent and fast!
Large language models work by turning your questions into tokens and then generating new tokens one at a time until the model decides that generation should stop. This means you want to stream the output tokens generated by a large language model to the client. In this tutorial, we will discuss how to achieve this with Streaming Endpoints in Jina.
Service Schemas
The first step is to define the streaming service schemas, as you would do in any other service framework. The input to the service is the prompt and the maximum number of tokens to generate, while the output is simply the token ID:
```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
```
Service initialization
Our service depends on a large language model. As an example, we will use the `gpt2` model. This is how you would load such a model in your Executor:

```python
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
```
Implement the streaming endpoint
Our streaming endpoint accepts a `PromptDocument` as input and streams `ModelOutputDocument`s back. To stream a document back to the client, use the `yield` keyword in the endpoint implementation. Therefore, we use the model to generate up to `max_tokens` tokens and yield them until generation stops:
```python
class TokenStreamingExecutor(Executor):
    ...

    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        input = tokenizer(doc.prompt, return_tensors='pt')
        input_len = input['input_ids'].shape[1]

        for _ in range(doc.max_tokens):
            output = self.model.generate(**input, max_new_tokens=1)
            if output[0][-1] == tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=output[0][-1],
                generated_text=tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            input = {
                'input_ids': output,
                'attention_mask': torch.ones(1, len(output[0])),
            }
```
Learn more about streaming endpoints from the Executor documentation.
Serve and send requests
The final step is to serve the Executor and send requests using the client. To serve the Executor using gRPC:
```python
from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()
```
To send requests from a client:
```python
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
```
```text
The
The capital
The capital of
The capital of France
The capital of France is
The capital of France is Paris
The capital of France is Paris.
```