A question answering system built on semantic search and an LLM is currently one of the most popular applications of LLM functionality. But what comes after we build it? How do we evaluate the work of a QnA system?

I would like to cover the evaluation of QnA systems in this article. I will describe several methods that I tried myself, and maybe they will also be useful for you.

Let’s start!

Evaluate the whole QnA system with a validation dataset

This is the first method that came to my mind. I generated the validation dataset with the help of colleagues who have knowledge of the domain area the QnA system was built for. It doesn't have to be a big dataset; in my case I had 30–35 questions. But here is one of the main tricks: we can generate the validation dataset in two ways:

  1. Set up pairs of a question and a full, consistent answer that we would like to see as the answer from our QnA system:

question = “Which document states do we have on our project?”

answer = “The first state is ‘new’, then we can change it to ‘in progress’ or ‘postpone’, after that the final state will be ‘success’ or ‘failed’”

  2. Set up pairs of questions and answers, but in this case the answer will be just a list of entities/points that our QnA application should cover in its answer:

question = “Which document states do we have on our project?”

answer = “new, in progress, postpone, success, failed”
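To make the difference concrete, here is how the two variants could be represented in code (a minimal sketch; the query/answer keys are my own naming, chosen so that the later evaluation examples can reuse them):

# Variant 1: a full, consistent reference answer for each question
validation_set_full = [
    {
        "query": "Which document states do we have on our project?",
        "answer": "The first state is 'new', then we can change it to 'in progress' "
                  "or 'postpone', after that the final state will be 'success' or 'failed'.",
    },
]

# Variant 2: only the entities/points the answer must cover
validation_set_points = [
    {
        "query": "Which document states do we have on our project?",
        "answer": "new, in progress, postpone, success, failed",
    },
]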

So if we choose the first variant, we can evaluate the answer from the QnA system in two ways: using the ROUGE metric, or using an LLM as an evaluator of the ground-truth and predicted answers. With the second variant only the LLM evaluator is possible, and it needs a small modification of the evaluation prompt.
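For the ROUGE option, here is a minimal sketch using the rouge-score package and the validation_set_full example above; the qna_system call stands for your own QnA pipeline, so treat both names as placeholders:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

ground_truth = validation_set_full[0]["answer"]
predicted = qna_system(validation_set_full[0]["query"])  # placeholder for your own pipeline

# the scorer expects the reference (target) first and the prediction second
scores = scorer.score(ground_truth, predicted)
print(scores["rougeL"].fmeasure)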

I use the evaluation prompt from the LangChain library, but you can create a new one that is more suitable for you.

So here is the prompt for the first variant:

“You are a teacher grading a quiz.
You are given a question, the student’s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.

Example Format:
QUESTION: question here
STUDENT ANSWER: student’s answer here
TRUE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here”
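In code, grading with this prompt can be done through LangChain's QAEvalChain. The sketch below assumes the classic LangChain API (module paths and argument names have changed between versions), and qna_system is again a placeholder for your own pipeline:

from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

examples = validation_set_full  # question / ground-truth pairs (first variant)
predictions = [{"result": qna_system(ex["query"])} for ex in examples]

eval_chain = QAEvalChain.from_llm(ChatOpenAI(temperature=0))
graded = eval_chain.evaluate(
    examples, predictions,
    question_key="query", answer_key="answer", prediction_key="result",
)
for example, grade in zip(examples, graded):
    print(example["query"], "->", grade)  # each grade contains CORRECT or INCORRECT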

For the second variant, when we only have a set of entities, we should use another prompt:

“You are a teacher grading a quiz.
You are given a question, the student’s answer, and the points, that should be in the student answer, and are asked to score the student answer as either CORRECT or INCORRECT.

Example Format:
QUESTION: question here
STUDENT ANSWER: student’s answer here
POINTS IN THE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here”
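To plug this "points" prompt into the same flow, the classic API lets you pass a custom PromptTemplate to QAEvalChain; the only requirement is that the variable names match the keys the chain fills in. Again, this is a sketch rather than the library's built-in behaviour:

from langchain.prompts import PromptTemplate
from langchain.evaluation.qa import QAEvalChain
from langchain.chat_models import ChatOpenAI

points_template = """You are a teacher grading a quiz.
You are given a question, the student's answer, and the points that should be in the student answer,
and are asked to score the student answer as either CORRECT or INCORRECT.

QUESTION: {query}
STUDENT ANSWER: {result}
POINTS IN THE ANSWER: {answer}
GRADE:"""

points_prompt = PromptTemplate(
    input_variables=["query", "result", "answer"],
    template=points_template,
)
eval_chain = QAEvalChain.from_llm(ChatOpenAI(temperature=0), prompt=points_prompt)
# evaluate() is then called exactly as in the previous sketch, with the second-variant dataset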

So here we are. This is our first method, which can help us evaluate the QnA system as a whole.

Evaluate the whole QnA system with an automatically generated validation dataset

This method is very similar to the first one, but with an important detail.

Here we ask an LLM to generate the validation dataset, using the following prompt, which you can also find in the LangChain library, particularly in its sub-project auto-evaluator (and they have a very comprehensive guide):

### Human

You are question-answering assistant tasked with answering questions based on the provided context.

Here is the question: \

{question}

Use the following pieces of context to answer the question at the end. Use three sentences maximum. \

{context}

### Assistant

Answer: Think step by step.

It is possible to evaluate the QnA application yourself or by using the auto-evaluator in LangChain. But here is a small notice: for now the auto-evaluator works only with the 'stuff' type of request to the LLM, where we put all relevant documents into one prompt. However, we can add custom embeddings and document stores to the auto-evaluator, which helps to simulate our custom QnA application and evaluate it almost automatically.
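For reference, this is roughly what the 'stuff' mode looks like with the classic LangChain API: all retrieved documents are concatenated into a single prompt. The vectorstore below stands for your own embedding index, so this is only a sketch:

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",                    # put every relevant document into one prompt
    retriever=vectorstore.as_retriever(),  # vectorstore: your own document index
)
answer = qa_chain.run("Which document states do we have on our project?")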

Evaluate the semantic search part of a QnA system

For evaluating only the semantic search part of a QnA system we can use a variety of metrics, but before diving deeper into them, I would like to cover several methods of semantic search which I found interesting:

  1. Similarity search using different distance measures, such as cosine similarity, Euclidean distance, dot product, etc. This is the simplest approach.
  2. Maximal Marginal Relevance (MMR) is a more complex method; I recommend looking at the details in this source. The main idea is that besides similarity we also take the diversity of the documents into account (a short retrieval sketch follows this list).
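As an illustration of the retrieval step, this is how MMR search is usually invoked on a classic LangChain vector store (FAISS, Chroma, etc.); vectorstore again stands for your own index, so treat it as a sketch:

# return 4 documents, chosen from 20 candidates for both relevance and diversity
docs = vectorstore.max_marginal_relevance_search(
    "Which document states do we have on our project?",
    k=4,
    fetch_k=20,
)

# or the same behaviour wrapped in a retriever
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 4, "fetch_k": 20})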

So for my case I chose MMR as the method for finding the most relevant documents for the user's question. Then came the evaluation stage. I tried several metrics; I put the link to a full review of them here and totally recommend you have a look at it. Below you can find the list of metrics I used, followed by a small implementation sketch:

  1. Mean Reciprocal Rank (MRR) is based on the rank of the first relevant result, so it doesn't take the rest of the predicted answers into account, which can be considered a drawback.
  2. Mean Average Precision (MAP) already takes all relevant answers into account, but it works with binary relevance only, so it doesn't reflect how relevant each answer is.
  3. Normalized Discounted Cumulative Gain (NDCG) is the cumulative gain of the top answers, but discounted: the value of the discount depends on the rank of the answer.
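Here is a small self-contained sketch of these three metrics for a single question; MRR and MAP are then simply the mean of reciprocal_rank and average_precision over all questions in the validation dataset (the doc-id representation is my own simplification):

import math

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant document, 0 if none was retrieved
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    # mean of precision@k over the positions where a relevant document appears
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def ndcg(ranked_ids, relevance):
    # relevance: dict mapping doc_id -> graded gain; the discount grows with the rank
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids, start=1))
    ideal = sorted(relevance.values(), reverse=True)[:len(ranked_ids)]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0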

So here is a short overview of the semantic search metrics.

I used them together with the LLM-based evaluation metrics described above to estimate the results of our QnA system.

Below I would like to quickly mention Test-Driven Development (TDD) for creating prompts. It can be useful not only for QnA but also for other projects that use an LLM.

Test-Driven Development (TDD) for prompt engineering

One of the best practices in software development is TDD, and prompt engineering is no exception here.

The concept is very simple. We create a prompt and also an evaluation function for the answer that the LLM returns in response to that prompt. In this way we implement our expectations of the communication with the LLM as a test case.

For this purpose I found the Promptimize library suitable; below you can find the simplest test case that can be defined with Promptimize:

PromptCase("hello there!", lambda x: evals.any_word(x, ["hi", "hello"]))

Here we just check that the answer to our prompt "hello there!" contains one of the words "hi" or "hello".

That was simple.
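For completeness, here is the same case with the imports it needs, following the Promptimize README (the API may have changed since, so treat it as a sketch):

from promptimize import evals
from promptimize.prompt_cases import PromptCase

prompt_cases = [
    # passes if the model's reply to "hello there!" contains either "hi" or "hello"
    PromptCase("hello there!", lambda x: evals.any_word(x, ["hi", "hello"])),
]

The cases defined this way are then executed by the Promptimize runner, which reports how many of them pass, just like a unit-test suite.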

But this library can also help us check the answer from the LLM in cases where we need to generate code or a SQL-like query. We can build the evaluator function as a runner of that code and pass it to PromptCase as an input parameter.

Here you can find an example of it.
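As a rough illustration of the idea (not the linked example itself), an evaluator can execute the generated SQL against a small in-memory database and check the result; the schema, the prompt and the pass condition below are all hypothetical:

import sqlite3

def run_generated_sql(generated_sql: str) -> bool:
    # returns True if the generated query executes and selects the expected row
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER, state TEXT)")
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?)",
        [(1, "new"), (2, "in progress"), (3, "success")],
    )
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False       # the model produced SQL that doesn't run
    finally:
        conn.close()
    return len(rows) == 1  # only one document in the toy table is in the 'success' state

PromptCase(
    "Write a SQL query that returns all rows in the 'success' state "
    "from a table documents(id, state). Return only the SQL.",
    run_generated_sql,
)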

So, thank you for reaching the end of this article! I hope you have found it useful!

Stay tuned! In the future I may continue with this topic and other exciting things connected with AI, LLMs, etc.