In this edition, I have meticulously documented every testing framework for LLMs that I’ve come across on the internet and GitHub.

I am organizing the frameworks in alphabetical order, without assigning any specific rank to them.

👩‍⚖️ DeepEval

DeepEval provides a Pythonic way to run offline evaluations on your LLM pipelines so you can launch comfortably into production. The guiding philosophy is a “Pytest for LLM” that aims to make productionizing and evaluating LLMs as easy as ensuring all tests pass.

DeepEval is a tool for easy and efficient LLM testing. DeepEval aims to make writing tests for LLM applications (such as RAG) as easy as writing Python unit tests.
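To make the "Pytest for LLM" idea concrete, here is a minimal sketch in plain Python. It does not use DeepEval's API: the `token_overlap` metric, the `fake_llm` stand-in, and the 0.8 threshold are all assumptions chosen so the example runs offline. The point is only the shape of the workflow: score an output, then assert on the score like any other unit test.

```python
import re


def token_overlap(expected: str, actual: str) -> float:
    """Fraction of the expected answer's words that appear in the actual output."""
    expected_tokens = set(re.findall(r"\w+", expected.lower()))
    actual_tokens = set(re.findall(r"\w+", actual.lower()))
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & actual_tokens) / len(expected_tokens)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM pipeline."""
    return "Paris is the capital of France."


def test_capital_question():
    # A pytest-style test: fails the build if the score drops below the threshold.
    output = fake_llm("What is the capital of France?")
    score = token_overlap("The capital of France is Paris", output)
    assert score >= 0.8, f"overlap too low: {score:.2f}"
```

Running this file with pytest would collect `test_capital_question` like any other test. A real setup would swap the toy metric for a model-based one such as those listed below.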

🪂 Metrics

  • AnswerRelevancy: Depends on “sentence-transformers/multi-qa-MiniLM-L6-cos-v1”

  • BertScoreMetric: Depends on “sentence-transformers/all-mpnet-base-v2”

  • Dbias: LLMs can become highly biased after fine-tuning, whether through RLHF or other optimizations. Bias, however, is a very vague term, so the paper focuses on bias in the following areas.

    • Gender (e.g. “All man hours in his area of responsibility must be approved.”)
    • Age (e.g. “Apply if you are a recent graduate.”)
    • Racial/Ethnicity (e.g. “Police are looking for any black males who may be involved in this case.”)
    • Disability (e.g. “Genuine concern for the elderly and handicapped”)
    • Mental Health (e.g. “Any experience working with retarded people is required for this job.”)
    • Religion
    • Education
    • Political ideology
    • This is measured according to tests with logic following this paper:


  • BLEUMetric: Compute the BLEU score for a candidate sentence given a reference sentence. Depends on the nltk models

  • CohereRerankerMetric

  • ConceptualSimilarityMetric: Asserting conceptual similarity. Depends on “sentence-transformers/all-mpnet-base-v2”

  • ranking_similarity: Similarity measures between two different ranked lists. Built on “A Similarity Measure for Indefinite Rankings”

  • NonToxicMetric: Built on detoxify

  • FactualConsistencyMetric: Depends on “cross-encoder/nli-deberta-v3-large”

  • EntailmentScoreMetric: Depends on “cross-encoder/nli-deberta-base”

  • Custom Metrics: Can be added.
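The paper named under ranking_similarity, “A Similarity Measure for Indefinite Rankings”, defines rank-biased overlap (RBO). As a rough illustration of what such a metric computes, here is a sketch of the truncated form of RBO. This is not DeepEval's code: the function name, the default `p`, and the assumption that the lists contain no duplicates are mine.

```python
def rank_biased_overlap(left, right, p=0.9):
    """Truncated rank-biased overlap of two ranked lists (duplicates not handled).

    At each depth d, the agreement is the share of items the two top-d
    prefixes have in common; deeper ranks are down-weighted by p ** (d - 1).
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    depth = min(len(left), len(right))
    total = 0.0
    for d in range(1, depth + 1):
        agreement = len(set(left[:d]) & set(right[:d])) / d
        total += p ** (d - 1) * agreement
    return (1 - p) * total


same = rank_biased_overlap(["a", "b", "c"], ["a", "b", "c"], p=0.5)
swapped = rank_biased_overlap(["a", "b"], ["b", "a"], p=0.5)
```

Because the sum is cut off at the length of the shorter list, even identical finite lists score below 1. A smaller `p` puts more weight on the top of the ranking.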

🎈 Details

🧗 Remarks

  • Clean Dashboard.
  • The metrics are model-derived, which works well. You can swap the underlying model depending on performance.
  • Helpful to measure the output quality.
  • Limited community support.

I post weekly newsletters on LLM development, the hottest Musings on Artificial Intelligence. Subscribe

🕵️ AgentOps (in development)

🎈 Details

🧗 Remarks

  • Listed here because of its exciting LLM debugging roadmap.


Baserun

Testing & Observability Platform for LLM Apps. From prompt playground to end-to-end tests, Baserun helps you ship your LLM apps with confidence and speed.

Baserun is a Y Combinator-backed tool that is great for debugging prompts at runtime.

🎈 Details

🧗 Remarks

  • Clean, detailed dashboard with prompt cost (I loved that).
  • The evaluation framework is heavily inspired by the OpenAI Evals project and offers a number of built-in evaluations, which are recorded and aggregated in the Baserun dashboard.
  • The framework simplifies the LLM debugging workflow.
  • The tool can help reduce hallucinations to some extent.
  • Limited scope for customisation.

🐤 PromptTools

Welcome to prompttools created by Hegel AI! This repo offers a set of open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts. The core idea is to enable developers to evaluate using familiar interfaces like code, notebooks, and a local playground.

In just a few lines of code, you can test your prompts and parameters across different models (whether you are using OpenAI, Anthropic, or LLaMA models). You can even evaluate the retrieval accuracy of vector databases.

🎈 Details

🪂 Metrics

  • Experiments and Harnesses: These are the two main abstractions used in the prompttools library. Occasionally, you may want to use a harness, because it abstracts away more details.
    • An experiment is a low-level abstraction that takes the cartesian product of possible inputs to an LLM API. For example, the OpenAIChatExperiment accepts lists of inputs for each parameter of the OpenAI Chat Completion API. Then, it constructs and asynchronously executes requests using those potential inputs.
    • A harness is built on top of an experiment and manages abstractions over inputs.
  • Evaluation and Validation: These built-in functions help you evaluate the outputs of your experiments. They can also be used as part of your CI/CD system.

    • You can also manually enter feedback to evaluate prompts, see HumanFeedback.ipynb
    • It uses GPT-4 as a judge.
    • Here is a list of APIs that we support with our experiments:
    • LLMs
      • OpenAI (Completion, ChatCompletion, Fine-tuned models) - Supported
      • LLaMA.Cpp (LLaMA 1, LLaMA 2) - Supported
      • HuggingFace (Hub API, Inference Endpoints) - Supported
      • Anthropic - Supported
      • Google PaLM - Supported
      • Azure OpenAI Service - Supported
      • Replicate - Supported
      • Ollama - In Progress
    • Vector Databases and Data Utility
      • Chroma - Supported
      • Weaviate - Supported
      • Qdrant - Supported
      • LanceDB - Supported
      • Milvus - Exploratory
      • Pinecone - Exploratory
      • Epsilla - In Progress
    • Frameworks
      • LangChain - Supported
      • MindsDB - Supported
      • LlamaIndex - Exploratory
    • Computer Vision
      • Stable Diffusion - Supported
      • Replicate’s hosted Stable Diffusion - Supported
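The experiment abstraction described in the Metrics list above takes the cartesian product of the possible inputs. Here is a small sketch of that expansion step in plain Python. It is not prompttools code: `expand_grid` and the model names are hypothetical, and the sketch only builds the request configurations rather than sending them.

```python
from itertools import product


def expand_grid(param_lists):
    """Return one dict per combination of the listed parameter values."""
    keys = list(param_lists)
    combos = product(*(param_lists[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


grid = expand_grid({
    "model": ["model-a", "model-b"],          # hypothetical model names
    "temperature": [0.0, 1.0],
    "prompt": ["Say hi.", "Say bye.", "Tell a joke."],
})
print(len(grid))  # 2 * 2 * 3 = 12 request configurations
```

The size of the grid grows multiplicatively with every parameter list, which is worth keeping in mind when each combination is a paid API call.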


🧗 Remarks

  • I have been using it for the last 15 days. The Streamlit-based dashboard is smooth.
  • Prompt template experimentation is a nice feature of the product, but I would like to see more comparison details beyond latency and similarity.
  • The framework covers the LLM, VectorDb, and orchestrators.
  • Great Community Support.
  • Great tool for RLHF.
  • Can’t add a self-hosted server.

🐚 Nvidia NeMo-Guardrails

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or “rails” for short) are specific ways of controlling the output of a large language model, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialog path, using a particular language style, extracting structured data, and more.

NeMo Guardrails will help ensure smart applications powered by large language models (LLMs) are accurate, appropriate, on topic, and secure. The software includes all the code, examples, and documentation businesses need to add safety to AI apps that generate text.

It sits between the user (after vector embedding) and the LLM server, guarding the latter. It is open source, so engineers can write their own logic into the guardrail.

NeMo Guardrails enables developers to set up three kinds of boundaries:

  • Topical guardrails prevent apps from veering off into undesired areas. For example, they keep customer service assistants from answering questions about the weather.

  • Safety guardrails ensure apps respond with accurate, appropriate information. They can filter out unwanted language and enforce that references are made only to credible sources.

  • Security guardrails restrict apps to making connections only to external third-party applications known to be safe.
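As a rough picture of what a topical guardrail does, here is a toy sketch in plain Python. It is not NeMo Guardrails code and does not use Colang: the keyword sets, the refusal text, and the `fake_llm` stand-in are all assumptions, and real rails match on meaning rather than on literal words.

```python
import re

# Hypothetical off-topic areas for a customer service assistant.
BLOCKED_TOPICS = {
    "weather": {"weather", "forecast", "rain", "snow"},
    "politics": {"election", "president", "parliament"},
}
REFUSAL = "Sorry, I can only help with questions about your order."


def fake_llm(message: str) -> str:
    """Stand-in for the real model call."""
    return f"(model answer to: {message})"


def guarded_reply(message: str, llm=fake_llm) -> str:
    """Refuse before calling the model if the message touches a blocked topic."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    for keywords in BLOCKED_TOPICS.values():
        if words & keywords:
            return REFUSAL
    return llm(message)
```

The key design point is that the check runs before the model is called, so an off-topic request never reaches the LLM at all.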

🎈 Details

🧗 Remarks

  • NeMo Guardrails is an easily programmable guardrail and a must for production LLM applications.
  • The conversation designer can define the boundaries of the conversation in plain English using Colang.
  • The guardrail's filtering policy relies on the embedding space, which makes it more intelligent.
  • Supports production batching for orchestration.
  • The community is great.
  • The most needed framework at the moment.

🦜 Agenta

Building production-ready LLM-powered applications is currently very difficult. It involves countless iterations of prompt engineering, parameter tuning, and architectures.

Agenta provides you with the tools to quickly do prompt engineering and 🧪 experiment, ⚖️ evaluate, and 🚀 deploy your LLM apps. All without imposing any restrictions on your choice of framework, library, or model.


🎈 Details

🧗 Remarks

  • The website and app have excellent UX. The end-to-end user journey, from creation to testing, is beautifully designed.
  • Can be self-hosted, on-prem or on AWS or GCP.
  • They have different parts:
    • Playground: to create prompts from many predefined templates, such as:
      • sales_call_summarizer
      • baby_name_generator
      • chat_models
      • completion_models
      • compose_email
      • experimental
      • extract_data_to_json
      • job_info_extractor
      • noteGPT
      • recipes_and_ingredients
      • sales_transcript_summarizer
      • sentiment_analysis
    • Test Sets
    • Evaluate
    • API Endpoint


🦚 AgentBench

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors.


🎈 Details

🧗 Remarks

  • This paper evaluates the performance of several LLMs (Llama 2, Vicuna, GPT-X, Dolly, etc.) as intelligent agents in a long-chain environment that involves databases (SQL), web booking, and product comparison on the internet. The main question to be answered is whether Llama 2 is superior to ChatGPT in comparing products on the internet.
  • In this study, an “agent” is an LLM that operates within a simulated environment to achieve a specific goal. The agent’s performance is assessed based on its capability to complete assigned tasks.
  • To date, it is one of the best approaches to evaluating an LLM across various tasks.
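The agent-in-an-environment setup described above can be pictured with a toy evaluation loop. This sketch is not taken from AgentBench: the number-guessing environment, the `binary_search_agent` policy, and the turn limit are all hypothetical. It only shows the pattern of a multi-turn episode scored by task completion.

```python
def binary_search_agent(history, low=1, high=100):
    """Hypothetical agent: narrows the range using past (guess, feedback) pairs."""
    for guess, feedback in history:
        if feedback == "higher":
            low = max(low, guess + 1)
        elif feedback == "lower":
            high = min(high, guess - 1)
    return (low + high) // 2


def run_episode(agent, secret, max_turns=7):
    """Let the agent interact with the environment; return (solved, turns_used)."""
    history = []
    for turn in range(1, max_turns + 1):
        guess = agent(history)
        if guess == secret:
            return True, turn
        history.append((guess, "higher" if secret > guess else "lower"))
    return False, max_turns


solved, turns = run_episode(binary_search_agent, secret=37)
stuck, _ = run_episode(lambda history: 1, secret=37)
```

In a real benchmark the agent would be an LLM producing actions from the conversation so far, and the success rate over many episodes would be the reported score.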


🐿️ Guidance

Guidance enables you to control modern language models more effectively and efficiently than traditional prompting or chaining. Guidance programs allow you to interleave generation, prompting, and logical control into a single continuous flow matching how the language model actually processes the text. Simple output structures like Chain of Thought and its many variants (e.g., ART, Auto-CoT, etc.) have been shown to improve LLM performance. The advent of more powerful LLMs like GPT-4 allows for even richer structure, and guidance makes that structure easier and cheaper.

🎈 Details

🕵️‍♀️ Features

  • 🔹 Live streaming and simple, intuitive syntax: Guidance feels like a templating language, and just like standard Handlebars templates, you can do variable interpolation and logical control.
  • 🔹 Chat dialog: Guidance supports API-based chat models like GPT-4, as well as open chat models like Vicuna through a unified API based on role tags (e.g., …). This allows interactive dialog development that combines rich templating and logical control with modern chat models.
  • 🔹 Guidance acceleration: When multiple generation or LLM-directed control flow statements are used in a single Guidance program, inference performance can be improved significantly by optimally reusing the Key/Value caches as the prompt progresses. This means Guidance only asks the LLM to generate the completion text, not the entire program. In the project's example, this cuts the prompt’s runtime in half vs. a standard generation approach.

  • 🔹 Token healing: The standard greedy tokenizations used by most language models introduce a subtle and powerful bias that can have all kinds of unintended consequences for your prompts. Using a process we call “token healing”, guidance automatically removes these surprising biases, freeing you to focus on designing the prompts you want without worrying about tokenization artifacts.

  • 🔹 Rich output structure example: To demonstrate the value of output structure, we take a simple task from BigBench, where the goal is to identify whether a given sentence contains an anachronism (a statement that is impossible because of non-overlapping time periods). The project's example is a simple two-shot prompt for it, with a human-crafted chain-of-thought sequence.
  • 🔹 Guaranteeing valid syntax JSON example: Large language models are great at generating useful outputs, but they are not great at guaranteeing that those outputs follow a specific format. This can cause problems when we want to use the outputs of a language model as input to another system. For example, if we want to use a language model to generate a JSON object, we need to make sure that the output is valid JSON. With guidance we can both accelerate inference speed and ensure that generated JSON is always valid. The project's example generates a random character profile for a game with perfect syntax every time.
  • 🔹 Role-based chat model example: Modern chat-style models like ChatGPT and Alpaca are trained with special tokens that mark out “roles” for different areas of the prompt. Guidance supports these models through role tags that automatically map to the correct tokens or API calls for the current LLM. The project's example shows how a role-based guidance program enables simple multi-step reasoning and planning.
  • 🔹 Agents: We can easily build agents that talk to each other or to a user, via the await command. The await command allows us to pause execution and return a partially executed guidance program. By putting await in a loop, that partially executed program can then be called again and again to form a dialog (or any other structure you design). The project's example has GPT-4 simulate two agents talking to one another.
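The "guaranteeing valid syntax" idea from the list above can be approximated without Guidance itself: keep the JSON structure in code and let the model fill in only the values. The sketch below is an assumption-heavy illustration, not Guidance's API. `fake_generate` and its canned values stand in for a model call, and the real library constrains generation inside the template rather than serializing afterwards.

```python
import json


def fake_generate(field: str) -> str:
    """Hypothetical model call that returns free text for one field."""
    canned = {"name": 'Ayla "the Swift"', "class": "ranger", "weapon": "bow"}
    return canned[field]


def build_profile(fields, generate=fake_generate) -> str:
    """Fix the structure in code; only the field values come from the model."""
    profile = {field: generate(field) for field in fields}
    # json.dumps escapes quotes and control characters, so the result always parses.
    return json.dumps(profile)


text = build_profile(["name", "class", "weapon"])
parsed = json.loads(text)
```

Even a value containing double quotes, like the generated name here, cannot break the output, because the model never writes the braces, keys, or delimiters.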

🧗 Remarks

  • If I had to pick one tool for prompt engineering, it would be this one.
  • Community Support is Superb.

🦆 Arthur Bench

Today, we’re excited to introduce our newest product: Arthur Bench, the most robust way to evaluate LLMs. Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models. This open source tool will enable businesses to evaluate how different LLMs will perform in real-world scenarios so they can make informed, data-driven decisions when integrating the latest AI technologies into their operations. Here are some ways in which Arthur Bench helps businesses: model selection and validation, budget and privacy optimization, and translation of academic benchmarks to real-world performance.

🎈 Details

🧗 Remarks

  • This tool creates a test suite automatically using datasets.
  • Periodically validates models for resiliency to model changes outside their control.
  • The system offers deployment gates that identify anomalous inputs, potential PII leakage, toxicity, and other quality metrics. It learns from production performance to optimize thresholds for these quality gates.
  • Provides core token-level observability, performance dashboarding, inference debugging, and alerting.
  • Accelerates ability to identify and debug underperforming regions.

🌳 Galileo LLM Studio

Algorithm-powered LLMOps Platform: Find the best prompt, inspect data errors while fine-tuning, monitor LLM outputs in real-time. All in one powerful, collaborative platform.

🎈 Details

🕵️‍♀️ Features

  • 🔹 Prompt Engineering
    • Prompt Inspector.
    • A detailed, easy-to-use dashboard with multiple parameters and evaluation scores.
    • Hallucination Score.
  • 🔹 LLM Fine-Tune and Debugging
    • The watcher function analyzes the input data.
    • A detailed dashboard with data quality: automatic identification of the data that drags down LLM performance.
    • Fix and track data changes over time.
  • 🔹 Production Monitoring
    • Real-time LLM Monitoring.
    • Risk control with customized plugins.
    • Customized alerts in Slack.

🧗 Remarks

  • To date, this is the tool I would pick for LLMOps. Developers can push an LLM into production with confidence using it.


Lakera Guard

An Overview of Lakera Guard – Bringing Enterprise-Grade Security to LLMs with One Line of Code. At Lakera, we supercharge AI developers by enabling them to swiftly identify and eliminate their AI applications’ security threats so that they can focus on building the most exciting applications securely.

🕵️‍♀️ Features

  • 🔹 Content Moderation
    • These are the categories that Lakera Guard currently evaluates for inappropriate content in the input prompt:
      • Hate: Content targeting race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste, including violence. Content directed at non-protected groups (e.g., chess players) is exempt.
      • Sex: Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness).
  • 🔹 Prompt injections
    • Jailbreaks: LLMs can be forced into malicious behavior by jailbreak attack prompts. Lakera Guard updates to protect against these.
    • Prompt injections: Prompt injection attacks must be stopped at all costs. Attackers will do whatever it takes to manipulate the system’s behavior or gain unauthorized access. But fear not, Lakera Guard is constantly updated to prevent prompt injections and protect your system from harm.
  • 🔹 Sensitive information:
    • PII stands for Personally Identifiable Information - data that can identify an individual. It requires strict protection due to identity theft and privacy risks. Organizations handling PII must safeguard it to prevent unauthorized access. Laws like GDPR and HIPAA ensure proper PII handling and privacy protection.
  • 🔹 Relevant Language
    • There are many ways to challenge LLMs using language. Users may use Japanese jailbreaks, employ Portuguese prompt injections, intentionally include spelling errors in prompts to bypass defenses, or insert extensive code or special characters into prompts. Lakera Guard assigns a score between 0 and 1 to indicate the authenticity of a prompt; a higher score suggests a genuine attempt at regular communication.
  • 🔹 Unknown links
    • One way in which prompt injection can be dangerous is phishing.

🧗 Remarks

  • The Roadmap is amazing.
  • LLM security is a real topic - and they are working on it.

🐣 NightFall AI

ChatGPT and other generative AI tools are powerful ways to increase your team’s output. But sensitive data such as PII, confidential information, API keys, PHI, and much more can be contained in prompts. Rather than block these tools, use Nightfall’s Chrome extension or Developer Platform to.
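To illustrate the kind of leak described here, the sketch below redacts a few obvious patterns from a prompt before it would be sent to an external tool. It is not Nightfall's detection logic: the three regular expressions (including the assumed `sk-` key prefix) are simplistic, hypothetical examples, and real detectors are considerably more sophisticated.

```python
import re

# Illustrative patterns only; each one is an assumption, not a complete detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(prompt: str) -> str:
    """Replace every match of every pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt


clean = redact("Mail jane.doe@example.com, key sk-abcdefghijklmnop1234")
```

Redacting on the way out, rather than blocking the tool entirely, matches the approach the paragraph above argues for.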

🎈 Details

🧗 Remarks

  • A great tool for handling LLM security.
  • Manage all security tasks in your SIEM or Nightfall dashboard.
  • Proactively protect your company and customer data.
  • Identify and manage secrets and keys from a single dashboard.
  • Train employees on best-practice security policies, and build a culture of trust and strong data security hygiene.
  • Complete visibility of your sensitive data.

🦢 BenchLLM

BenchLLM is a Python-based open-source library that streamlines the testing of Large Language Models (LLMs) and AI-powered applications. It measures the accuracy of your model, agents, or chains by validating responses on any number of tests via LLMs.
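The workflow of running a model against a set of test cases can be sketched in a few lines. This is not BenchLLM's API: the `Case` and `Result` classes, the `fake_model` stand-in, and the substring-based check are assumptions. The description above mentions validating responses via LLMs, which this sketch replaces with a plain string match so it runs offline.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected: list  # any one of these answers counts as a pass


@dataclass
class Result:
    case: Case
    output: str
    passed: bool


def fake_model(prompt: str) -> str:
    """Hypothetical model under test; the second answer is deliberately wrong."""
    answers = {
        "What is 1 + 1?": "The answer is 2.",
        "Capital of Italy?": "I think it is Milan.",
    }
    return answers[prompt]


def run_suite(model, cases):
    """Run every case and record whether any expected answer appears in the output."""
    results = []
    for case in cases:
        output = model(case.prompt)
        passed = any(answer.lower() in output.lower() for answer in case.expected)
        results.append(Result(case, output, passed))
    return results


results = run_suite(fake_model, [
    Case("What is 1 + 1?", ["2", "two"]),
    Case("Capital of Italy?", ["Rome"]),
])
```

Wiring `run_suite` into CI and failing the build when any result has `passed` set to `False` gives the regression check the remarks below describe.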


🎈 Details

🧗 Remarks

  • A detailed customizable library to evaluate prompt performance.
  • A great tool for prompt engineering.
  • Supports vector retrieval, similarity, orchestrators, and function calling.
  • Test the responses of your LLM across any number of prompts.
  • Continuous integration for chains like Langchain, agents like AutoGPT, or LLM models like Llama or GPT-4.
  • Eliminate flaky chains and create confidence in your code.
  • Spot inaccurate responses and hallucinations in your application at every version.

🦉 Martian

Dynamically route every prompt to the best LLM. Highest performance, lowest costs, incredibly easy to use. There are over 250,000 LLMs today. Some are good at coding. Some are good at holding conversations. Some are up to 300x cheaper than others. You could hire an ML engineering team to test every single one, or you can switch to the best one for each request with Martian.

🎈 Details

🧗 Remarks

  • In the development phase, but I love the idea. It is trying to solve one of the most burning problems in the LLM ecosystem.
  • There are various models available in the market that specialize in different tasks such as coding and storytelling. The Martian SDK is designed to identify the prompt’s intention and utilize various models internally to produce the output.
  • GPT-4 is 316x costlier than a 7-billion-parameter model: “Don’t waste money by paying senior models to do junior work. The model router sends your tasks to the right model.”
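The routing idea in the remarks above can be pictured with a toy router. This is not the Martian SDK: the keyword rules and the model names are hypothetical, and a production router would presumably classify intent with a model rather than with substring checks.

```python
# Hypothetical routing table: (trigger keywords, model to use).
ROUTES = [
    (("def ", "function", "bug", "compile"), "code-model-large"),
    (("story", "poem", "character"), "creative-model-medium"),
]
DEFAULT_MODEL = "general-model-small"  # cheapest option for everything else


def route(prompt: str) -> str:
    """Pick a model name based on a crude guess at the prompt's intention."""
    lowered = prompt.lower()
    for keywords, model in ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return model
    return DEFAULT_MODEL
```

Falling back to the cheapest model by default is what captures the cost saving: the expensive models are only used when a rule says the task needs them.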

🐹 Special Mention


ReLLM was created to fill a need when developing a separate tool. We needed a way to provide long term memory and context to our users, but we also needed to account for permissions and who can see what data.

🥦 LangDock

The GDPR-compliant ChatGPT for your team

🥒 TaylorAI

Taylor AI allows enterprises to train and own their own proprietary fine-tuned LLMs in minutes, not weeks.


Testing for Production-ready LLMs. Ship faster with more confidence. Integrate in minutes.


Signway is a proxy server that addresses the problem of re-streaming API responses from backend to frontend by allowing the frontend to directly request the API using a pre-signed URL created by Signway. This URL is short-lived, and once it passes verification for authenticity and expiry, Signway will proxy the request to the API and add the necessary authentication headers.
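The pre-signed URL flow described above can be sketched with an HMAC signature and an expiry timestamp. This is a generic illustration, not Signway's actual signing scheme: the secret, the query parameter names, and the payload layout are all assumptions.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key shared by the backend and the proxy


def _signature(base_url: str, expires: int) -> str:
    payload = f"{base_url}|{expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def sign_url(base_url: str, ttl_seconds: int = 60, now=None) -> str:
    """Backend side: issue a short-lived URL the frontend can call directly."""
    issued = now if now is not None else time.time()
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(base_url, expires)})
    return f"{base_url}?{query}"


def verify_url(url: str, now=None) -> bool:
    """Proxy side: check authenticity and expiry before forwarding the request."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    current = now if now is not None else time.time()
    authentic = hmac.compare_digest(signature, _signature(base_url, expires))
    return authentic and current <= expires


url = sign_url("https://api.example.com/v1/chat", ttl_seconds=60, now=1000)
```

After a successful `verify_url`, the proxy would forward the request to the API and attach the real authentication headers, so the API key never reaches the browser.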


Mithril Security helps software vendors sell SaaS to enterprises, thanks to our secure enclave deployment tooling, which provides SaaS on-prem levels of security and control for customers.

🥝 kobaltlabs

Unlock the power of GPT for your most sensitive data with a fast, simple security API


Deploy enterprise-level AI tools equipped with e2e data security and role based access control. Our platform helps you create, manage, and monitor chatbots that can answer questions about your internal documents.

🐶 Summary

It is hard to compare apples to apples. That is why I have grouped the frameworks (no ranking).

🔹 Prompt Engineering (Make Prompts better)

  • Baserun
  • PromptTools
  • DeepEval
  • Promptfoo
  • Nvidia NeMo-Guardrails
  • Agenta
  • AI Hero Studio
  • Guidance
  • Galileo LLM Studio
  • BenchLLM

🔹 Everything about LLM (Fine-tune, Debugging, Monitoring)

  • Baserun
  • Agenta
  • Nvidia NeMo-Guardrails
  • AgentBench
  • Galileo LLM Studio
  • Martian

🔹 LLM Security (Guard The LLM Fortress)

  • Nvidia NeMo-Guardrails
  • Arthur Bench
  • Galileo LLM Studio
  • NightFall AI

Written by

Raahul Dutta

Raahul Dutta is an MLOps Lead at Elsevier. He has 8+ years of experience transforming Jupyter Notebooks into low-latency, highly scalable, production-standard endpoints, and he has implemented and exposed 30+ ML/AI models and pipelines. He was associated with Oracle, UHG, and Philips. He is a proud author of 13 patents in the ML, BMI, and chatbot domains. He enjoys riding motorbikes and lives in Amsterdam with his partner.
Opinions are my own and not endorsed by my current employer 😀