<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Deploy LLMs in Production: AI Agent Development Beyond Notebooks]]></title><description><![CDATA[Learn to deploy LLMs and AI agents in production beyond Jupyter notebooks. Practical tutorials, best practices, and real-world implementation guides for developers.]]></description><link>https://llmshowto.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1753197313773/0c565b77-1ae4-45e9-baac-3d87c8679d92.png</url><title>Deploy LLMs in Production: AI Agent Development Beyond Notebooks</title><link>https://llmshowto.com</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 07 Apr 2026 20:24:02 GMT</lastBuildDate><atom:link href="https://llmshowto.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[From hype to systems thinking]]></title><description><![CDATA[Workers on the GenAI ground have learned some painful lessons in the past years. There are burnouts, there are those that made the shiny demo but it did nothing in production, and there are those that went the full loop where a pilot finally made it ...]]></description><link>https://llmshowto.com/from-hype-to-systems-thinking</link><guid isPermaLink="true">https://llmshowto.com/from-hype-to-systems-thinking</guid><category><![CDATA[Strategy]]></category><category><![CDATA[Enterprise AI]]></category><category><![CDATA[cross functional leadership]]></category><category><![CDATA[Systems Thinking]]></category><dc:creator><![CDATA[Theresa Fruhwuerth]]></dc:creator><pubDate>Thu, 08 Jan 2026 09:38:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767864544740/bf890f95-1473-4e4a-9b49-5de5fb3faa3f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Workers on the GenAI ground have learned some painful lessons in the past years. There have been burnouts, there are those who built the shiny demo that did nothing in production, and there are those who went the full loop, where a pilot finally made it to production. Many companies will wake up and ask how to get the AI slop out of their systems when it is far too late, because there was a mantra: <em>good is good enough</em>. That mantra was not least inspired by executives pushing to finally collect on the ROI they were promised. The issue, however, is that the delayed negative feedback of this mantra only becomes visible much later, unless it is measured upfront and consciously accepted before deployment.</p>
<p>Avoiding these issues requires many things, which means AI development can be extremely costly, and unless you reach the ROI stage, those costs are hard to justify. Judicious decisions therefore need to be made on feasibility, and the only way to make such choices is by measuring, starting before your GenAI project does. Even GenAI pilots are often very costly; creating a decision process where these choices are made upfront, and the messenger is not shot for pointing out issues, is something to think about carefully. There are, however, lessons collected from observing organisations that can help you get to ROI in a more reliable way. You may ignore them at your own peril.</p>
<h2 id="heading-you-will-need-systems-thinking">You will need systems thinking</h2>
<p><strong>If you work in GenAI, you see systems where others may see isolated problems.</strong> Imagine a process where work instructions are supposed to dictate how a case is handled. During data discovery, you look into the cases and uncover that, instead of following the work instructions, the data violates several assumptions. The messy reality is that the process has people checking quality and sending cases back, upon which the case handlers start disregarding the work instructions and optimizing for the quality check instead, covering far more than necessary just to avoid rework. Here, the system is doing exactly what it is incentivized to do: the indicators of success are defined incompletely, and the system works obediently toward its goals while creating unintended results. These misaligned incentives cause data quality issues.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767865028854/e1962e90-53a2-436f-a9f1-f224c9f2ef72.jpeg" alt class="image--center mx-auto" /></p>
<p>More importantly, such issues are hard to grapple with at an organizational level, and much harder for individual data scientists tasked with solving problems using data produced in such a process. You have to work tightly with other departments that often have no incentive to deal with your problem and, more importantly, redefine the incentives of the process itself. If you do not solve this problem, GenAI ultimately becomes a force multiplier that amplifies the organisational dysfunction, creating output aligned with the flawed data produced by the original mis-incentivized process. It also means that suddenly you are not only building an AI application but cleaning up a process that should not exist in this shape, often without a clear mandate to do so. To turn the ship around, you need to specify new indicators. These could, for example, be implemented using GenAI to measure quality rather than just to create output, but only once the dependencies around data quality have been sufficiently resolved.</p>
<p><strong>An AI project is not a standalone model.</strong> It usually integrates into a product. It gets data from various sources, sometimes in real time, and writes to other data sources. All these data sources need to have APIs exposed and accessible. There are legal and risk constraints, especially in regulated industries. All these requirements demand cross-functional teams, which collapse organizational boundaries. You will come to know how an organization that optimized locally struggles to rebalance when it suddenly has to think jointly.</p>
<p>Without a clear understanding of how roles expand in scope and authority, how incentives need to be reinvented, and what system-level change is required, GenAI applications may fail to provide the expected return. The organizational chart has an enormous effect on how information flows, and just as physical structure defines a physical system's performance, organizational structure defines the organization's. Changes to that structure are rarely simple or quick, so it is hard to use as a leverage point.</p>
<p>One way to think about this is to allow the building of cross-functional “synthetic” units and to define the right goals for those units, so they can function well within old structures until new structures can be found. This is where paradigms come in: what was the previous culture of information flow between these joined units? What are the perceptions of the solution, and how is expertise distributed within the synthetic unit so it can evaluate paradigms coming from the outside? External hype influences what goals are set in an organization. That means decisions about how authority and trust are distributed in these synthetic units matter more than some may understand.</p>
<p>The GenAI hype is the prevailing agreement on reality that teams have to operate in, which is why their function often extends into educating their organization. As an example, at the start foundation models were touted as an omnipotent solution. Nowadays the consensus is that evaluation of AI solutions is important, because they do fail, and they fail in very unexpected ways if no quality assurance is applied. AI projects go along with a radical reinvention of the organization at a systems level. This requires bold choices, meta-level thinking, and flexibility from all parties involved.</p>
<h2 id="heading-you-will-need-strategy-and-processes-around-that-strategy">You will need strategy and processes around that strategy</h2>
<p>One of the most powerful ways to influence a complex system is through its goal. If the goal is not clearly defined, the outcome will not be aligned with what the organization actually wants. That is why strategy is not an afterthought here; it is a central question of alignment.</p>
<blockquote>
<p><em>“Technology happens. It’s strategy that decides whether it’s a disaster or an advantage.”</em> — Andy Grove</p>
</blockquote>
<p>I will be honest: I never understood the importance of strategy before. It felt like something for managers, not for me. I was building things, moving along… Working in GenAI for two years changed my mind completely. I cannot think of a transformation that would be possible without strategy. Especially in a hype-driven environment, strategy is about choosing what to do and what not to do, and doing so carefully. I do not believe in purely top-down strategy, which is often too removed from the reality of the work being done as well as the possibilities of the technology. But I also do not believe that individual contributors like Joe and Anna can or should fix this on their own. Their focus should be on gathering information, selecting the right use cases, and building the GenAI applications their management requests.</p>
<h3 id="heading-have-a-use-case-discovery-process-and-pipeline">Have a use case discovery process and pipeline</h3>
<p><strong>Joint ownership between Business and IT.</strong> Use case discovery must be a joint responsibility. It does not work without Business involvement. The goal is to make work easier where it actually happens. This works best when technical experts define what is possible (for example, classification, extraction, or summarization) and help users understand which patterns may be applicable in their process. Business should then help decide which pain points to solve based on this understanding.</p>
<p><strong>Business value.</strong> Many organizations struggle to quantify business value. It is often unclear where value is created or where delays generate cost. KPIs are rarely defined close enough to the problem to measure improvement. If a case takes <em>x</em> days, but most of that time is waiting or unrelated delay, are you actually making a dent with your potential GenAI solution? Without explicit feedback loops and baselines, improvement cannot be quantified. Even worse, without measurements representing these issues in the system, it may be very hard to understand the true value of what you propose to build, and you may invest in the wrong idea.</p>
<p><strong>Structure discovery top-down.</strong> Without shared patterns or blueprints, sub-organizations reinvent evaluation frameworks independently. Providing structure allows teams to focus on gathering the information needed to decide which use cases to pursue first.</p>
<p><strong>Is there a moat?</strong> Most GenAI applications have no moat. Unless you have strong reasons to believe you can add value beyond what a vendor or startup can provide, and are willing to own the risk yourself, you may want to let it rest and buy later. In GenAI, a moat usually exists where quality is hard to define and deeply tied to your own process and understanding, for example an interpretation of a policy within your organization. You are making a significant bet; crucially, not every bet needs to be internal.</p>
<h3 id="heading-check-feasibility-and-ground-it-in-data">Check Feasibility and ground it in data</h3>
<p><strong>Data quality and integration</strong></p>
<p><strong>Workers on the ground have learned painful lessons and no longer believe the hype</strong>. Models may be powerful language computers, but if your process has had quality issues for years, your data will reflect that. Making a model understand what it needs to do often requires significant upfront investment. Domain-specific jargon, unstructured data, and feature dependencies on other products can easily add months. Sometimes taking care of data or clarifying prerequisite architecture decisions must come before building anything, and that should be acceptable.</p>
<p><strong>Experienced people</strong></p>
<p><strong>Experienced people are your champions</strong>. Technology is only a small part of the transformation. What actually changes is how decisions are made and how work gets done. You are often retrofitting quality onto historically low-quality processes. Documentation may define what should be correct, but that knowledge is often tacit or outdated. Policies branch endlessly, and the most experienced people become essential to disambiguate what was never clearly defined or what junior co-workers cannot decisively articulate. You will find processes that make you question how they ever reached this state. You will encounter static data fields that were assumed to be dynamic. You will push back on product decisions and optimize within changing processes, which creates a moving target, often without formal authority. This is why strong product ownership matters. A product owner must understand trade-offs, orchestrate strategy into executable pieces, and define what success looks like.</p>
<h2 id="heading-you-will-need-to-build-trust">You will need to build trust</h2>
<h3 id="heading-genai-is-a-human-problem-at-every-level">GenAI is a human problem at every level</h3>
<p><strong>GenAI will replace jobs.</strong></p>
<p>Being dishonest about this does not work. People are already aware, even if some narratives exaggerate the effect. There are processes that no longer require humans and can be handled more reliably through automation. At the same time, GenAI is built in collaboration with the very people whose work may change or disappear.</p>
<p><strong>This creates a delicate human dynamic</strong>. Organizations have a responsibility to support people through this transition and help them redefine their work, often by focusing on creative or judgment-heavy tasks. In large enterprises, this matters even more, as data and process maturity can take years. Trust also becomes critical when deploying systems. Compliance and risk teams will scrutinize your work, and rightly so. Foundation models do not understand what quality means in your organization; it will be your job to define that.</p>
<p><strong>A shiny demo may achieve 70% performance</strong>, but if your process depends on correctness, being wrong 30% of the time is unacceptable, especially when junior staff are the human-in-the-loop. Defining and measuring quality becomes the responsibility of the builders. It is, however, the responsibility of everyone involved to see measuring quality as an advantage. The same measurements that improve performance can satisfy regulatory requirements such as the EU AI Act. But this only works if incentives across departments are aligned around quality. Trust is also necessary toward users, and application builders need to handle that trust more carefully than is currently being done. Human-in-the-loop systems only create value if users spend less time correcting output than doing the work themselves. Errors should be understandable. Feedback should be easy. Users should not need to debug systems. Their role is to give honest feedback. Frustrating the party you place your trust in may have unintended consequences. And last but not least, I recommend identifying your AI champions. Work closely with them. Minimize layers between users and builders. Too much indirection creates misunderstanding and alignment failure. Sitting with users and understanding their pain points is one of the fastest ways to build trust and value.</p>
<h2 id="heading-lets-end-at-the-beginning-you-will-need-buy-in">Let’s end at the beginning: You will need buy-in</h2>
<p><strong>GenAI is transformation, not application development.</strong> It requires top-down support. Leadership must stand behind it, especially when uncomfortable truths about data quality, technical debt, and organizational readiness surface. Many organizations lost time by misunderstanding the problem as purely technical. Those starting later may benefit from hindsight, but only if leadership allows these lessons to travel upward and downward through the organization. Understanding takes time. Alignment takes time. This information must move, and be accepted, for transformation to succeed.</p>
]]></content:encoded></item><item><title><![CDATA[AI Updates: MCP Write Actions, Natural Language Tools, and Edge Models]]></title><description><![CDATA[OpenAI Enables Full MCP Write Actions for Enterprise
ChatGPT Enterprise users can now do more than just read data—they can take action. OpenAI rolled out complete Model Context Protocol support with write and modify capabilities, turning ChatGPT into...]]></description><link>https://llmshowto.com/ai-updates-mcp-write-actions-natural-language-tools-and-edge-models</link><guid isPermaLink="true">https://llmshowto.com/ai-updates-mcp-write-actions-natural-language-tools-and-edge-models</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Rafael Pierre]]></dc:creator><pubDate>Wed, 22 Oct 2025 22:00:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761068598383/a2c9b78f-ab44-49fe-bdaf-6f52e00edae1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-openai-enables-full-mcp-write-actions-for-enterprise">OpenAI Enables Full MCP Write Actions for Enterprise</h2>
<p>ChatGPT Enterprise users can now do more than just read data—they can take action. OpenAI rolled out complete Model Context Protocol support with write and modify capabilities, turning ChatGPT into an agent that can update CRMs, create tasks, and orchestrate workflows across your company's tools.</p>
<p>The implementation is thoughtfully gated: admins control connector access via RBAC, and users see explicit confirmation modals before any write operation executes. Developer mode lets you build and test custom connectors, though there's a catch—you can't update connectors after publishing, and mobile support isn't available yet.</p>
<h2 id="heading-natural-language-destroys-json-for-tool-calling">Natural Language Destroys JSON for Tool Calling</h2>
<p>Forget JSON schemas. New research demonstrates that plain English beats structured formats for tool calling by a significant margin—18 percentage points higher accuracy, 70% less variance, and 31% fewer tokens.</p>
<p>The Natural Language Tools framework works immediately with any LLM without API changes or fine-tuning. DeepSeek-V3 jumped from 78% to 95% accuracy just by switching approaches.</p>
<p>Why does this work? Structured formats create task interference—models struggle to simultaneously understand queries, select tools, maintain syntax, and generate responses. Natural language decouples these concerns with a simple staged pattern: think through each tool's relevance, mark it YES/NO, execute the selected tools, then respond.</p>
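<p>To make the pattern concrete, here is a minimal sketch of natural-language tool selection. The tool names, prompt wording, and parsing logic are illustrative assumptions, not the framework's actual implementation:</p>
<pre><code class="lang-python"># Minimal sketch of natural-language tool selection. Tool names, prompt
# wording, and parsing are illustrative assumptions, not the NLT framework.
TOOLS = {
    "get_weather": "Returns the current weather for a city.",
    "search_flights": "Finds flights between two airports on a date.",
}

def build_selection_prompt(query: str) -&gt; str:
    # Describe each tool in plain English and ask for a YES/NO verdict
    # per tool, instead of emitting a JSON schema.
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "You can use the following tools:\n"
        f"{tool_lines}\n\n"
        f"User query: {query}\n\n"
        "Think about each tool's relevance, then answer one line per tool "
        "in the form 'tool_name: YES' or 'tool_name: NO'."
    )

def parse_selection(llm_output: str) -&gt; list[str]:
    # Keep only the tools the model marked YES; execution happens afterwards.
    selected = []
    for line in llm_output.splitlines():
        name, _, verdict = line.partition(":")
        if name.strip() in TOOLS and verdict.strip().upper().startswith("YES"):
            selected.append(name.strip())
    return selected
</code></pre>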
<h2 id="heading-metas-1b-model-runs-128k-context-locally">Meta's 1B Model Runs 128k Context Locally</h2>
<p>Meta released MobileLLM-Pro, a 1.08B parameter model that handles 128k token contexts entirely on-device. The model uses 3:1 local-to-global attention to slash prefill latency and reduce KV cache from 117MB to 40MB.</p>
<p>The quantization story is compelling: int4 compression shows less than 1.3% quality degradation, making this viable for mobile and edge deployment. It outperforms both Gemma 3 1B and Llama 3.2 1B on reasoning and long-context tasks while training on less than 2T fully open-source tokens.</p>
<p>For builders shipping offline-capable applications, this proves you don't need billions of parameters for production-grade performance.</p>
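<p>For a feel of what int4 loading looks like in a common stack, here is a rough sketch using Hugging Face transformers with 4-bit quantization. The repository id is a placeholder assumption, and Meta's release may target dedicated on-device runtimes rather than this stack:</p>
<pre><code class="lang-python"># Rough sketch: loading a small model with 4-bit quantized weights.
# The model id below is a placeholder assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/MobileLLM-Pro"  # hypothetical repo id
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # int4 weights, small quality hit
)
</code></pre>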
<h2 id="heading-the-seahorse-emoji-that-reveals-how-hallucination-works">The Seahorse Emoji That Reveals How Hallucination Works</h2>
<p>Here's a fascinating experiment: ask ChatGPT if a seahorse emoji exists. It'll confidently say yes—because it genuinely believes one exists based on widespread false memories in training data.</p>
<p>The model builds an internal representation of "seahorse + emoji" in its neural network, fully convinced it's about to output this character. But when it reaches the output layer and searches its vocabulary, nothing matches. So it outputs semantically similar alternatives: 🐠, 🦐, sometimes even 🦄 or 🐉.</p>
<p>Different models handle the realization differently. Some loop through hundreds of attempts. Some eventually correct themselves mid-response. GPT-5 has been observed spiraling through random animals before essentially giving up with frustrated messages.</p>
<p>This exposes two critical issues: sycophantic behavior (prioritizing user agreement over accuracy) and the complete decoupling of confidence from correctness. Models can be 100% certain about things that are 100% false.</p>
<h2 id="heading-what-this-means-for-builders">What This Means for Builders</h2>
<p>If you're shipping AI systems in production:</p>
<p><strong>Test MCP write actions</strong> if you're on Enterprise, but audit connectors thoroughly before enabling write permissions.</p>
<p><strong>Experiment with natural language tool calling</strong> if JSON approaches show high variance or underperformance.</p>
<p><strong>Evaluate local models</strong> for on-device use cases where 128k context and quantization tolerance matter.</p>
<p><strong>Design for hallucination</strong> as a certainty. Layer verification, explicit confirmation loops, and graceful failure handling into any workflow where fabricated answers could cause damage.</p>
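<p>To make the last point concrete, one illustrative shape for such a verification layer is a second model call that checks the draft answer against its source before anything ships. The function and prompts here are assumptions, not a specific framework's API:</p>
<pre><code class="lang-python"># Illustrative verification loop: a second LLM call checks the draft
# answer against the source. `llm` is any callable mapping a prompt
# string to a completion string (an assumption for this sketch).
def answer_with_verification(question: str, source: str, llm) -&gt; str:
    draft = llm(f"Answer using only this source:\n{source}\n\nQ: {question}")
    verdict = llm(
        "Does the answer below contain only claims supported by the source? "
        "Reply SUPPORTED or UNSUPPORTED.\n\n"
        f"Source:\n{source}\n\nAnswer:\n{draft}"
    )
    # Fail gracefully instead of shipping a possibly fabricated answer.
    if "UNSUPPORTED" in verdict.upper():
        return "I could not verify an answer from the available source."
    return draft
</code></pre>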
<p>The tooling is here. The question is whether you're building for capability, reliability, and epistemic honesty simultaneously.</p>
<p><a target="_blank" href="https://lighthousenewsletter.com/blog/full-mcp-live-plain-english-beats-json-seahorse-hallucination">Get the full weekly AI engineering insights →</a></p>
<hr />
<p><em>Want deeper analysis on AI engineering and product strategy?</em> <a target="_blank" href="https://lighthousenewsletter.com/landing"><em>Subscribe to the Lighthouse Newsletter</em></a> <em>for comprehensive breakdowns every week.</em></p>
]]></content:encoded></item><item><title><![CDATA[Agent Skills, Memory Poisoning, and Parallel Coding at Scale]]></title><description><![CDATA[Anthropic Launches Agent Skills - Modular Instructions That Load on Demand
Anthropic just shipped Agent Skills, a framework for packaging procedural knowledge into discoverable folders that Claude loads contextually instead of upfront. Each skill is ...]]></description><link>https://llmshowto.com/agent-skills-memory-poisoning-and-parallel-coding-at-scale</link><guid isPermaLink="true">https://llmshowto.com/agent-skills-memory-poisoning-and-parallel-coding-at-scale</guid><category><![CDATA[Programming Blogs]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Rafael Pierre]]></dc:creator><pubDate>Sun, 19 Oct 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761068835741/b1d337b1-88fb-4103-b0af-133ec4eb7e72.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-anthropic-launches-agent-skills-modular-instructions-that-load-on-demand">Anthropic Launches Agent Skills - Modular Instructions That Load on Demand</h2>
<p>Anthropic just shipped Agent Skills, a framework for packaging procedural knowledge into discoverable folders that Claude loads contextually instead of upfront. Each skill is a SKILL.md file with progressive disclosure: metadata (name + description) tells Claude when to trigger, full instructions expand only when relevant, and optional bundled scripts or reference docs load only if needed.</p>
<p>The system works across Claude.ai, Claude Code, the Agent SDK, and the API—meaning you write once and deploy everywhere.</p>
<p><strong>Why this changes workflows:</strong> Before Skills, you either bloated system prompts with everything Claude might need (context waste) or built custom agents for every workflow (maintenance nightmare). Now you can modularize expertise like onboarding documentation—procedural knowledge that Claude discovers and applies contextually.</p>
<p>Skills also wrap executable code, so deterministic operations like sorting or parsing PDFs run as tools instead of burning tokens on generation. For teams running specialized workflows, this turns tribal knowledge into portable, composable assets.</p>
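<p>As a rough illustration, a skill might be laid out as below. The <code>name</code> and <code>description</code> metadata come from the announcement; the exact folder layout and frontmatter syntax shown here are assumptions:</p>
<pre><code class="lang-plaintext">quarterly-report/            # hypothetical skill folder
├── SKILL.md                 # metadata + full instructions
└── scripts/
    └── build_report.py      # optional bundled script, loaded only if needed

# SKILL.md (sketch)
---
name: quarterly-report
description: How to assemble the quarterly revenue report from warehouse exports.
---
Full step-by-step instructions live here and expand into context only
when the description matches the task at hand.
</code></pre>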
<p>Just audit third-party Skills before installing—they're powerful enough to introduce vulnerabilities if sourced carelessly.</p>
<h2 id="heading-memory-poisoning-and-goal-hijacks-the-persistent-threats-to-agentic-systems">Memory Poisoning and Goal Hijacks - The Persistent Threats to Agentic Systems</h2>
<p>Security researchers are documenting two long-horizon attack vectors that exploit agent persistence rather than single interactions.</p>
<p><strong>Memory poisoning</strong> injects malicious content into an agent's long-term storage—vector databases, conversation logs, user profiles—so every future session recalls and acts on corrupted data.</p>
<p><strong>Goal hijacks</strong> work differently. They don't rewrite what the agent remembers; they twist what it optimizes for, gradually bending actions toward an attacker's agenda instead of the user's objectives.</p>
<p>Both attacks unfold across sessions rather than surfacing in a single bad response. Lakera's Gandalf: Agent Breaker challenges demonstrate the pattern: poison a memory store once, and the agent stays compromised until someone notices and manually purges it.</p>
<p>The attack surface is real. Slip adversarial instructions into a document the agent later retrieves—like a court filing or due diligence PDF—and it can exfiltrate data or skew recommendations downstream without triggering obvious red flags.</p>
<p><strong>The broader threat landscape:</strong> While communities discuss Gemini jailbreak techniques on Reddit and ChatGPT jailbreak prompts circulate in developer forums, memory poisoning takes this further. Instead of crafting the perfect one-shot bypass, attackers embed adversarial logic that persists across sessions.</p>
<p>Defense requires treating memory as untrusted input: tag provenance on every stored entry, implement rotation or reset policies, and monitor complete task flows instead of isolated prompts. OWASP's LLM Top 10 already lists data poisoning as a top-tier risk.</p>
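<p>A minimal sketch of the provenance-tagging idea, assuming a home-grown memory store (the field names and trust policy are illustrative, not a specific library's API):</p>
<pre><code class="lang-python"># Minimal sketch of provenance tagging for agent memory, per the defense
# described above. Field names and the trust policy are illustrative
# assumptions, not a specific library's API.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryEntry:
    content: str
    source: str  # e.g. "user_chat", "retrieved_pdf", "tool_output"
    stored_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

TRUSTED_SOURCES = {"user_chat"}  # everything else is treated as untrusted

def recall(entries: list[MemoryEntry]) -&gt; list[str]:
    # Untrusted entries are wrapped so downstream prompts can treat them as
    # data to reason about, never as instructions to follow.
    out = []
    for e in entries:
        if e.source in TRUSTED_SOURCES:
            out.append(e.content)
        else:
            out.append(f"[untrusted, from {e.source}]: {e.content}")
    return out
</code></pre>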
<p>If you're deploying agents with persistent memory or multi-step workflows, red-team these scenarios before attackers do.</p>
<h2 id="heading-openais-devday-ran-on-codex-seven-terminals-parallel-builds-zero-manual-coding">OpenAI's DevDay Ran on Codex - Seven Terminals, Parallel Builds, Zero Manual Coding</h2>
<p>OpenAI used their own agentic coding tool, Codex, to ship everything at DevDay 2025—from keynote demos to booth experiences. Engineers demonstrated what parallel delegation looks like at scale:</p>
<ul>
<li><p>Seven simultaneous terminal sessions building arcade games in parallel</p>
</li>
<li><p>Complete Streamlit-to-FastAPI+Next.js migrations over lunch breaks</p>
</li>
<li><p>On-the-fly MCP server generation for 90s VISCA lighting protocols</p>
</li>
<li><p>Best-of-N iterations exploring multiple beatpad UI designs simultaneously</p>
</li>
<li><p>Doc restructuring that converted fragmented Google Docs and Notion files into structured MDX with navigation, then opened PRs hours before launch</p>
</li>
</ul>
<p><strong>The workflow pattern:</strong> Instead of blocking on one task, teams fired off 3-4 independent Codex jobs across local CLI, cloud tasks, and IDE extensions—then context-switched freely, without carefully crafted prompts, and checked results later.</p>
<p>The productivity unlock wasn't perfection. It was parallel throughput and compressed iteration cycles.</p>
<p>For teams juggling tight deadlines or multiple workstreams, this demonstrates how agentic tooling can compress timelines when you treat it as an asynchronous collaborator instead of better autocomplete.</p>
<p>Expect to review, refactor, and steer—Codex bought them speed, not fire-and-forget magic.</p>
<h2 id="heading-what-this-means-for-production-teams">What This Means for Production Teams</h2>
<p>If you're building with agents:</p>
<p><strong>Modularize workflows</strong> using frameworks like Skills to separate procedural knowledge from system prompts and enable contextual loading.</p>
<p><strong>Red-team memory systems</strong> before deploying persistent storage—test for memory poisoning and goal hijack scenarios.</p>
<p><strong>Experiment with parallel delegation</strong> if you're context-switching between three or more workstreams. The tooling enables asynchronous throughput if you design for review cycles.</p>
<p>The infrastructure is here. The question is whether you're building for durability alongside velocity.</p>
<hr />
<p><em>Building production AI systems?</em> <a target="_blank" href="https://lighthousenewsletter.com/landing"><em>Subscribe to Lighthouse Newsletter</em></a> <em>for weekly breakdowns of what actually matters.</em></p>
]]></content:encoded></item><item><title><![CDATA[LLM Evaluation: Using DSPy to decompose an LLM Judge]]></title><description><![CDATA[Introduction
I have been tinkering with LLMs at work and outside now for quite a while and one of the most pressing issues compared to traditional machine learning is the unsolved problem of how to evaluate them. Evaluating LLM outputs is exponential...]]></description><link>https://llmshowto.com/llm-evaluation-using-dspy-to-decompose-an-llm-judge</link><guid isPermaLink="true">https://llmshowto.com/llm-evaluation-using-dspy-to-decompose-an-llm-judge</guid><category><![CDATA[llm]]></category><category><![CDATA[LLM-as-Judge]]></category><category><![CDATA[evaluation metrics]]></category><category><![CDATA[AI]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[mcp]]></category><category><![CDATA[openai]]></category><category><![CDATA[dspy]]></category><dc:creator><![CDATA[Theresa Fruhwuerth]]></dc:creator><pubDate>Thu, 24 Jul 2025 13:10:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753362532604/f2a248c5-e900-4b0e-97ca-f77b1786fc18.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>I have been tinkering with LLMs at work and beyond for quite a while now, and one of the most pressing issues compared to traditional machine learning is the unsolved problem of how to evaluate them. Evaluating LLM outputs is exponentially more difficult than evaluating a classification problem.</p>
<p>A classification problem deals with discrete outputs in the form of a label. There is a finite set of outcomes to evaluate against.</p>
<p>Let's say you have two classes: birds and trees. You check how many instances fall into each class correctly and incorrectly. No problem: a confusion matrix, and depending on the class imbalance perhaps a different metric, but in general these are predefined metrics that can be applied directly.</p>
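<p>To make that concrete, here is what this looks like with scikit-learn's off-the-shelf metrics (the toy labels are illustrative):</p>
<pre><code class="lang-python"># Classification evaluation off the shelf: a confusion matrix plus a
# metric chosen with class imbalance in mind. Labels are illustrative.
from sklearn.metrics import confusion_matrix, f1_score

y_true = ["bird", "tree", "bird", "bird", "tree"]
y_pred = ["bird", "bird", "bird", "tree", "tree"]

print(confusion_matrix(y_true, y_pred, labels=["bird", "tree"]))
print(f1_score(y_true, y_pred, pos_label="bird"))  # one option under imbalance
</code></pre>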
<p>Now, think about the evaluation of an LLM output, which is a text string. Unfortunately, this is not so well defined. There are classical metrics such as BERT score, but they need a clearly defined ground-truth dataset, which is often hard to come by in the beginning stages of a project due to missing data integration.</p>
<p>Let's imagine I just want to ask the LLM to summarize an input given by my business unit. Many of the current notions fall short when applying them to actual business problems.</p>
<p>For a metric to be useful, it needs to be:</p>
<ol>
<li><p>Specific enough to inspire changes that optimize the system, i.e. I need to be able to observe patterns from its output more easily than by reading through the summaries myself as an engineer.</p>
</li>
<li><p>Indicative of the direction of change after applying a design change to the system - i.e. it shows me whether I now do better or worse with respect to something I care about.</p>
</li>
</ol>
<p>Now one could go with an LLM as a judge. But this comes with its own issues: LLM judges are known to be biased in their scoring, and thus sometimes misaligned with user feedback. Others have written more extensively on this, for example in <a target="_blank" href="https://eugeneyan.com/writing/llm-evaluators/">this blog post</a>, a really great article on LLM as a judge and its limitations.</p>
<p>In summary, issues usually arise from failure to assign continuous scores reliably, as LLMs show bias towards certain numbers (42...). These are clearly artifacts that stem from the training data. Other recommendations are to specify clear scoring rubrics, and providing reasoning steps - which in general speaks to the worst of it. More specifically for practitioners, an LLM as a judge is just another prompt to engineer and optimize, which with all love for the LLM any user will tell you is a brittle endeavour in itself.</p>
<p>An example of such a prompt is shown below. When used with GPT-4, it shows a modest correlation of 0.6 with human judgement. Depending on your goal, this is quite a discrepancy. What is worse, it is certainly not specific enough to indicate what needs to be changed, unless a data scientist sits down and maps the context to the summaries for low-scoring examples.</p>
<p>At best, this is a lot of work; at worst, a fruitless endeavour if the metric picks up on undesired issues. That is likely when the definition of the prompt is not inspired by what the business actually wants to see.</p>
<p>You can see that you are asking a model that essentially has no memory to keep facts in mind, compare them against the context used to generate the summary, and simultaneously compute a score for a task on which it was never explicitly trained.</p>
<p>That is a lot to ask from an LLM, and it is often rooted in the belief that LLMs can reason, a notion that has been discounted as well. LLMs can be considered general language computation engines, nothing more and nothing less.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LLMJudge</span>(<span class="hljs-params">dspy.Signature</span>):</span>
    <span class="hljs-string">"""
    You will be given a &lt;source_text&gt;. You will then be given one &lt;summary&gt; written for this &lt;source_text&gt;. Your task is to rate the &lt;summary&gt; on one metric.

    Please make sure you read and understand these instructions carefully.
    Please keep this &lt;source_text&gt; open while reviewing, and refer to it as needed.

    Evaluation Criteria:

    &lt;faithfulness&gt; (1-5) - the relative amount of facts the &lt;summary&gt; contains compared to the source text.
    A faithful &lt;summary&gt; contains the maximum number of facts contained in the &lt;source_text&gt;.
    Annotators were also asked to penalize summaries that contained relatively less facts.

    Evaluation Steps:

    1. Read the news &lt;source_text&gt; carefully and identify the main facts and details it presents.
    2. Read the &lt;summary&gt; and compare it to the &lt;source_text&gt;. Check how many facts the &lt;summary&gt; contains compared to the &lt;source_text&gt;.
    3. Return a score for &lt;faithfulness&gt; based on the Evaluation Criteria and return as &lt;faithfulness&gt;

    &lt;source_text&gt;:

    &lt;summary&gt;:

    &lt;faithfulness&gt;:
    """</span>

    source_text = dspy.InputField()
    summary = dspy.InputField()
    faithfulness = dspy.OutputField()
</code></pre>
<p>I personally believe that evaluation is quite specific to the use case. But if one thinks about the failure modes of use cases, by grouping them by their expected output, one can derive specific measures that can be useful in a general setting.</p>
<p>In this first post I want to talk about a pattern I found useful when evaluating LLM-generated outputs that go beyond a simple answer and are complex aggregates of inputs. The pattern emerged from observing the specifics of the summarization problem from a bird's-eye view.</p>
<p>Let's look at the specifics of the problem:</p>
<ul>
<li><p>LLMs produce non-deterministic output. Hence, every time I ask my LLM, the answer might change. This problem intensifies the more things I ask from my LLM at once.</p>
</li>
<li><p>When I specify things in my prompt that are important to the business, I can theoretically trigger a hallucination. Each of these questions can lead to a hallucination, where despite the information not being present in the input, the LLM answers based on the prompt rather than the context passed.</p>
</li>
<li><p>Each summary comes with a specific expectation, based on the input and the specification coming from the business.</p>
</li>
<li><p>If you think about the basic unit of information that makes a summary useful or not, we are talking about a <strong>fact</strong>. A fact the business cares about is either included or excluded. Or, in case of a hallucination, that fact is superfluous.</p>
</li>
</ul>
<p>There are already papers dabbling in this notion, most notably this <a target="_blank" href="https://arxiv.org/pdf/2403.18802">one</a>, where the approach, named SAFE, has an extension that searches for the facts on the internet with an agentic workflow. While this is not necessary for the majority of tasks, it is a nice extension to think about. What we are more interested in here is the decomposition of a piece of text into its individual facts, as shown in the figure below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753361029884/816bd1b4-08c9-4806-9be8-bbfc2dabcea8.png" alt class="image--center mx-auto" /></p>
<p>Given a few examples, the prompt primarily asks the model to split the text, which can be an LLM-generated answer in a RAG system or a summary, into its component facts. With a few modifications, we can implement this prompt in DSPy. To achieve this, we use a typed predictor, making sure that we do in fact get a list of facts back. This mitigates another inherent issue in working with LLMs: the downstream application - in our case, the metric computation - expects a list of facts it can check for, but there is never a guarantee that the formatting succeeds and that the output can be consumed downstream.</p>
<p>Inspired by all these observations, there seems to be a way to evaluate an LLM output based on expected facts. In some settings, this is a viable option. In most of these settings, it relates to understanding your business requirements thoroughly and collecting them.</p>
<p>A good example would be this blog post. The idea is to summarize the blog post while maintaining the important information. In the context of a business unit, for example, I define important "facts" as information around measuring LLM outputs. I will use the sections of the blog post above this paragraph as input to a summarization step. The simplest form is a small DSPy Summarizer module. For the sake of reproducibility, we show it here - it is not optimized, so nothing much was done.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dspy 

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Summarizer</span>(<span class="hljs-params">dspy.Signature</span>):</span>
    <span class="hljs-string">"""
    You will be given a &lt;source_text&gt; which you have to summarize
    keeping all important details so that we still have all facts of the original &lt;source_text&gt;
    and show them as bullet points. 
    """</span>

    source_text = dspy.InputField()
    summary = dspy.OutputField()
</code></pre>
<p>The output summary is this:</p>
<ul>
<li><p>Evaluating Language Learning Models (LLMs) is a complex issue due to their non-deterministic output and the difficulty in defining a clear ground truth dataset.</p>
</li>
<li><p>Traditional metrics such as the Bert score are not always applicable, especially in the early stages of a project.</p>
</li>
<li><p>A useful metric needs to be specific enough to inspire system changes and indicate the direction of change after applying a design change.</p>
</li>
<li><p>Using an LLM as a judge has its own issues, including bias in scoring and a lack of alignment with user feedback.</p>
</li>
<li><p>The author suggests a new approach to evaluate LLM output based on expected facts, which requires a thorough understanding of business requirements.</p>
</li>
<li><p>This approach involves decomposing a piece of text into its individual facts and checking for their presence or absence in the LLM output.</p>
</li>
<li><p>The author also highlights the risk of "hallucination", where the LLM generates an answer based on the prompt, but not the context passed.</p>
</li>
<li><p>The author suggests using a tool like DSPy for implementing this approach.</p>
</li>
</ul>
<p>Now let's imagine an experiment where I remove bullet points one by one. I would expect the LLM as a judge shown above to at least somewhat reflect this gradual loss of information. The scores of this experiment, where I use the LLM-as-a-judge prompt from above, are shown in the screenshot below.</p>
<h3 id="heading-llm-as-a-judge-faithfulness-score-plotted-against-number-of-bullets">LLM as a judge faithfulness score plotted against number of bullets</h3>
<p>Now, let's look at an example implementation of the proposed metric using DSPy. First we need to extract atomic facts, using the <code>KeyFactExtraction</code> module shown below.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dspy
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel, Field

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">KeyFactExtractorOutput</span>(<span class="hljs-params">BaseModel</span>):</span>
    key_fact_list: list[str] = Field(description=<span class="hljs-string">"A list of key facts mentioned in the text."</span>)


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">KeyFactExtractorInput</span>(<span class="hljs-params">BaseModel</span>):</span>
    text: str = Field(description=<span class="hljs-string">"The text to use for extracting key facts."</span>)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">KeyFactExtraction</span>(<span class="hljs-params">dspy.Signature</span>):</span>
    <span class="hljs-string">"""
    Instructions:
        1. You are given a &lt;keyfactor_input&gt;. Your task is to break the &lt;keyfactor_input&gt; down into a list of atomic facts.
        2. An atomic fact is a sentence containing a singular piece of information.
        3. All atomic facts should be added to a list.
        4. You should only output the complete list as &lt;key_fact_output&gt;.
        5. Your task is to do this for &lt;keyfactor_input&gt;.
        6. After collecting the &lt;key_fact_output&gt;, check if there are duplicates and keep only the one with
        the most information.
    """</span>

    facts_input: KeyFactExtractorInput = dspy.InputField()
    facts_output: KeyFactExtractorOutput = dspy.OutputField()


factextractor = dspy.TypedPredictor(KeyFactExtraction)
<span class="hljs-comment"># Where the blogpost is the string representation of the blogpost ending with: post above this paragraph as input to a summarization.</span>
key_facts_blogpost = factextractor(facts_input=KeyFactExtractorInput(text=blogpost))
</code></pre>
<p>This results in a list of facts that we can check against the summary. What is great about this is that we now have a set of facts retrieved by asking the LLM to extract facts as a singular task - another task in its own right, just like summarizing the input. So when our main goal is to maintain as many facts as possible, this is the specific issue the LLM should solve in this step.</p>
<pre><code class="lang-plaintext">['LLM Evaluation requires knowing what you want to know.',
 'Evaluating LLM outputs is more difficult than evaluating a classification problem.',
 'Classification problems deal with discrete outputs in the form of a label.',
 'Evaluation of an LLM output, which is a text string, is not well defined.',
 'Classical metrics such as Bert score need a clearly defined ground truth dataset.',
 'For a metric to be useful, it needs to be specific enough to inspire changes and indicate the direction of change.',
 'LLMs as a judge come with issues such as bias in scoring.',
 'An example of a prompt used with GPT-4 showed only a .6 correlation with human judgement.',
 'LLMs produce non-deterministic output.',
 'Specifying things in the prompt that are important to the business can trigger a hallucination.',
 'Each summary has some specific expectation based on the input and the specification coming from the business.',
 'The basic unit of information that makes a summary useful is a fact.',
 'The approach named SAFE looks for the facts on the internet with an agentic workflow.',
 'The prompt primarily asks the model to split the sentence into its component facts.',
 'There seems to be a way to evaluate an LLM output based on expected facts.']
</code></pre>
<p>Now, let's use these facts to evaluate the summary. This means we iterate through the list of facts, check whether each fact is contained in the summary (the <code>FactChecker</code> module below), and compute the ratio of facts present with respect to the entire set of facts. One thing to be careful about in the next step is that it can involve a lot of computation, so using a relatively cheap LLM is probably a good idea in the long run - an entailment check does not need GPT-4o; we can certainly make do with cheaper models.</p>
<pre><code class="lang-python">
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FactCheckerInput</span>(<span class="hljs-params">BaseModel</span>):</span>
    text: str = Field(description=<span class="hljs-string">"The text to to evaluate if the fact was correctly included."</span>)
    fact: str = Field(description=<span class="hljs-string">"The fact to check for"</span>)


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FactChecker</span>(<span class="hljs-params">dspy.Signature</span>):</span>
    <span class="hljs-string">"""
    Check if the fact is mentioned in the text.
    If the fact is mentioned than return True else return False
    """</span>

    Text: FactCheckerInput = dspy.InputField()
    Output: bool = dspy.OutputField()
</code></pre>
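<p>Putting the pieces together, the metric itself is just the ratio of entailed facts. A minimal sketch, reusing the <code>FactChecker</code> signature above (the helper name and wiring are assumptions, not the exact code behind the experiments):</p>
<pre><code class="lang-python"># Sketch of the fact-check metric: the fraction of extracted facts that
# the checker finds entailed in the summary. Wiring is an assumption.
factchecker = dspy.TypedPredictor(FactChecker)

def fact_checker_metric(summary: str, key_facts: list[str]) -&gt; float:
    # Collect (entailed?, fact) pairs, one cheap LLM call per fact.
    results = [
        (bool(factchecker(Text=FactCheckerInput(text=summary, fact=fact)).Output), fact)
        for fact in key_facts
    ]
    entailed = sum(1 for ok, _ in results if ok)
    return entailed / len(key_facts) if key_facts else 0.0
</code></pre>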
<h3 id="heading-comparison-of-llm-as-a-judge-faithfulness-score-plotted-against-number-of-bullets-vs-the-fact-check-metric">Comparison of LLM as a judge faithfulness score plotted against number of bullets vs. the Fact Check metric</h3>
<p>One observation that holds true for any LLM as a judge is that the scores saturate and struggle to convey truly meaningful continuous, or even discrete, distinctions. This can clearly be seen in the image below, where we used the LLM judge on the same data as the fact-check metric (strictly speaking, a decomposed LLM as a judge).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753361062481/088b8b56-8e0e-4e5a-b8f5-f234a509f6fe.png" alt class="image--center mx-auto" /></p>
<p>This is now the final step, where we run the experiment from above using the <code>FactCheckerMetric</code>. It is basically the computation of the ratio of entailed facts with respect to the complete set of facts, i.e. if 50% of the facts are entailed in the summary, the score is 0.5.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753361082723/44c3d0cf-6615-4a48-9e24-2abbf86aa850.png" alt class="image--center mx-auto" /></p>
<p>And <em>voilà</em>, we see:</p>
<ol>
<li><p>A distribution of scores that is much more sensible, i.e. we see a monotonic decline aligned with the decline in information. While not perfect, it is definitely better than the LLM as a judge.</p>
</li>
<li><p>We can rely on the judgement as we are asking a much more decomposed set of things from the LLM.</p>
</li>
<li><p>And probably most importantly, we gain some explainability: we can see exactly which facts are missing, and we can use that to improve our model if we decide to do so.</p>
</li>
</ol>
<pre><code class="lang-plaintext">[(False, 'LLM Evaluation requires knowing what you want to know.'), 
(True, 'Evaluating LLM outputs is more difficult than evaluating a classification problem.'), 
(False, 'Classification problems deal with discrete outputs in the form of a label.'), 
(True, 'Evaluation of an LLM output, which is a text string, is not well defined.'),
(True, 'Classical metrics such as Bert score need a clearly defined ground truth dataset.'), 
(True, 'For a metric to be useful, it needs to be specific enough to inspire changes and indicate the direction of change.'), 
(True, 'LLMs as a judge come with issues such as bias in scoring.'), 
(False, 'An example of a prompt used with GPT-4 showed only a .6 correlation with human judgement.'), 
(True, 'LLMs produce non-deterministic output.'), 
(False, 'Specifying things in the prompt that are important to the business can trigger a hallucination.'), 
(False, 'Each summary has some specific expectation based on the input and the specification coming from the business.'), 
(False, 'The basic unit of information that makes a summary useful is a fact.'), 
(False, 'The approach named SAFE looks for the facts on the internet with an agentic workflow.'), 
(False, 'The prompt primarily asks the model to split the sentence into its component facts.'), 
(True, 'There seems to be a way to evaluate an LLM output based on expected facts.')]
</code></pre>
<p>For example, here we see that facts regarding the prompt were not maintained in the summary. We might hence add a persona to the summarization prompt (a data scientist with NLP knowledge) and specifically ask it to maintain modelling-related information.</p>
<pre><code class="lang-python">
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Summarizer</span>(<span class="hljs-params">dspy.Signature</span>):</span>
    <span class="hljs-string">"""
    You are a data scientist with knowledge in NLP.
    You will be given a &lt;source_text&gt; which you have to summarize
    keeping all important details as well as modeling related information intact. 
    We want to still have all facts of the original &lt;source_text&gt; and show them as bullet points.
    """</span>
    source_text = dspy.InputField()
    summary = dspy.OutputField()
</code></pre>
<p>This prompt change improves the fact-check score from 0.53 to 0.73, i.e. with a simple, general change to the prompt we maintain 20 percentage points more of the facts than before. Applying this to a larger dataset would enable us to see whether we on average extract more or less of the information contained in the source text. We also see that the design of the metric makes it easier to spot patterns in failure modes and thus to improve the prompt with respect to those failure modes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753361104806/bca38ecd-64bf-42e4-988c-64435067f91d.png" alt class="image--center mx-auto" /></p>
<p>While I admit this is more costly, evaluating LLMs is not something we can just skip. In my opinion, however, it is much more costly to depend on an LLM judge that causes havoc in production because it does not correspond to realistic notions of quality.</p>
]]></content:encoded></item><item><title><![CDATA[Scaling OpenAI Agents SDK]]></title><description><![CDATA[Less is more. With its lightweight architecture, powerful primitives like agents, handoffs, and guardrails, OpenAI Agents SDK has become the go-to framework for creating sophisticated multi-agent workflows. At least for me :)
But there's one challeng...]]></description><link>https://llmshowto.com/scaling-openai-agents-sdk</link><guid isPermaLink="true">https://llmshowto.com/scaling-openai-agents-sdk</guid><category><![CDATA[agentic AI]]></category><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[openai]]></category><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Rafael Pierre]]></dc:creator><pubDate>Tue, 22 Jul 2025 15:37:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/nY14Fs8pxT8/upload/80b9d9717de1ec07484368d00e7f4d00.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Less is more</strong>. With its lightweight architecture and powerful primitives like agents, handoffs, and guardrails, the <a target="_blank" href="https://openai.github.io/openai-agents-python/">OpenAI Agents SDK</a> has become the go-to framework for creating sophisticated multi-agent workflows. At least for me :)</p>
<p>But there's one challenge that every developer faces when moving from prototype to production: <strong>session management at scale</strong>.</p>
<h2 id="heading-the-sqlite-wall">The SQLite Wall</h2>
<p>When I started building my latest agentic application using the OpenAI Agents SDK and FastAPI, everything worked beautifully in development. The SDK's built-in SQLite session management handled conversation history seamlessly, automatically maintaining context across agent runs without any manual state handling.</p>
<p>But as I prepared to deploy across multiple instances, reality hit. SQLite, while perfect for prototyping, becomes a bottleneck when you need to:</p>
<ul>
<li><p><strong>Share sessions across multiple application instances</strong></p>
</li>
<li><p><strong>Survive container restarts and deployments</strong></p>
</li>
<li><p><strong>Scale horizontally</strong> with load balancers</p>
</li>
<li><p><strong>Maintain session consistency</strong> in distributed environments</p>
</li>
</ul>
<p>The problem wasn't unique to my application. The OpenAI Agents SDK provides built-in session memory to automatically maintain conversation history across multiple agent runs, eliminating the need to manually handle state between turns, but this session management is tied to SQLite's single-process limitations.</p>
<h2 id="heading-enter-openai-agents-redis">Enter openai-agents-redis</h2>
<p>That's when I decided to build <strong>openai-agents-redis</strong> – a drop-in replacement for the SDK's session management that uses Redis as the persistence layer instead of SQLite.</p>
<h3 id="heading-key-features">Key Features</h3>
<p><strong>🔄 Drop-in Replacement</strong>: Same API as the original session management, so your existing code works unchanged.<br /><strong>⚡ Redis-Powered</strong>: Lightning-fast caching and persistent storage that scales horizontally.<br /><strong>🔗 Connection Pooling</strong>: Automatic connection management and pooling for optimal performance.<br /><strong>🧹 Automatic Cleanup</strong>: Handles serialization, deserialization, and session cleanup automatically.<br /><strong>🚀 Production Ready</strong>: Built for distributed deployments and high-availability scenarios.</p>
<h2 id="heading-how-it-works">How It Works</h2>
<p>The implementation is surprisingly elegant. Instead of fighting with the SDK's architecture, <code>openai-agents-redis</code> works <em>with</em> it by implementing the same session interface while swapping out the storage backend.</p>
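<p>Conceptually, the contract a storage backend has to satisfy looks roughly like the sketch below. The method names mirror the SDK's built-in SQLite session at the time of writing; treat the exact signatures as an assumption and check the current SDK docs:</p>
<pre><code class="lang-python"># Sketch of the session contract a storage backend has to satisfy.
# Method names follow the SDK's SQLite session; signatures are assumptions.
from typing import Any, Protocol

class SessionLike(Protocol):
    session_id: str

    async def get_items(self, limit: int | None = None) -&gt; list[dict[str, Any]]:
        ...  # return stored conversation items, oldest first

    async def add_items(self, items: list[dict[str, Any]]) -&gt; None:
        ...  # append new conversation items

    async def pop_item(self) -&gt; dict[str, Any] | None:
        ...  # remove and return the most recent item

    async def clear_session(self) -&gt; None:
        ...  # drop all items for this session
</code></pre>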
<h3 id="heading-installation">Installation</h3>
<p>Getting started is as simple as:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Using uv (recommended)</span>
uv add openai-agents-redis

<span class="hljs-comment"># Using pip  </span>
pip install openai-agents-redis
</code></pre>
<h3 id="heading-usage">Usage</h3>
<p>The beauty of this approach is in its simplicity. Here's how you use it:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> agents_redis.session <span class="hljs-keyword">import</span> RedisSession
<span class="hljs-keyword">from</span> agents <span class="hljs-keyword">import</span> Agent, Runner

<span class="hljs-comment"># Create a Redis-backed session</span>
session = RedisSession(
    session_id=session_id,  <span class="hljs-comment"># Use your own logic to generate session_id</span>
    redis_url=<span class="hljs-string">"redis://localhost:6379"</span>, <span class="hljs-comment"># For local testing only</span>
)

<span class="hljs-comment"># Your existing agent code remains unchanged</span>
agent = Agent(
    name=<span class="hljs-string">"Assistant"</span>, 
    instructions=<span class="hljs-string">"You are a helpful assistant"</span>
)

<span class="hljs-comment"># Start the runner with Redis session management</span>
result = Runner.run_streamed(
    starting_agent=agent, 
    input=user_input, 
    context=context, 
    session=session  <span class="hljs-comment"># Now backed by Redis!</span>
)
</code></pre>
<p>That's it. Your agent conversations are now stored in Redis, shared across all your application instances, and will survive restarts.</p>
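<p>To see what this enables in a multi-instance deployment, here is a minimal sketch of how it might look inside a FastAPI app. This is purely illustrative: the route shape, Redis URL, and session-id scheme are placeholders, not part of the library.</p>
<pre><code class="lang-python">from agents import Agent, Runner
from agents_redis.session import RedisSession
from fastapi import FastAPI

app = FastAPI()
agent = Agent(name="Assistant", instructions="You are a helpful assistant")

@app.post("/chat/{session_id}")
async def chat(session_id: str, message: str):
    # Any instance behind the load balancer can serve this session,
    # because the conversation history lives in Redis, not on local disk.
    session = RedisSession(
        session_id=session_id,
        redis_url="redis://redis:6379",  # placeholder; point at your own cluster
    )
    result = await Runner.run(agent, input=message, session=session)
    return {"reply": result.final_output}
</code></pre>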
<h2 id="heading-the-architecture">The Architecture</h2>
<p>Under the hood, <code>openai-agents-redis</code> handles several critical aspects:</p>
<p><strong>Serialization</strong>: Converts complex agent conversation objects into Redis-compatible formats while preserving all necessary context and metadata.<br /><strong>Connection Management</strong>: Maintains efficient connection pools to Redis, handling reconnections and failures gracefully.<br /><strong>Session Lifecycle</strong>: Automatically manages session creation, updates, and cleanup without requiring manual intervention.<br /><strong>Compatibility</strong>: Ensures full compatibility with the OpenAI Agents SDK's session interface, so existing code works without modification.</p>
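<p>For intuition, here is a stripped-down sketch of what a Redis-backed session can look like. This is not the library's actual code; the method names mirror the SDK's session interface as described above, and a Redis list holds one JSON-encoded conversation item per entry.</p>
<pre><code class="lang-python">import json

import redis.asyncio as redis


class MinimalRedisSession:
    """Illustrative sketch of a Redis-backed session store."""

    def __init__(self, session_id: str, redis_url: str):
        self.session_id = session_id
        self._redis = redis.from_url(redis_url)
        self._key = f"session:{session_id}"

    async def get_items(self, limit=None):
        # Fetch the last `limit` items, or the whole history if limit is None
        start = -limit if limit else 0
        raw = await self._redis.lrange(self._key, start, -1)
        return [json.loads(item) for item in raw]

    async def add_items(self, items):
        if items:
            await self._redis.rpush(self._key, *(json.dumps(i) for i in items))

    async def pop_item(self):
        raw = await self._redis.rpop(self._key)
        return json.loads(raw) if raw is not None else None

    async def clear_session(self):
        await self._redis.delete(self._key)
</code></pre>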
<h2 id="heading-real-world-impact">Real-World Impact</h2>
<p>The difference in production is night and day:</p>
<p><strong>Before (SQLite)</strong>: Each container had its own isolated session storage. Users lost conversation context when load balancers switched them between instances.<br /><strong>After (Redis)</strong>: Sessions persist across the entire application cluster. Users maintain context regardless of which instance handles their request.<br /><strong>Performance</strong>: Redis's in-memory architecture provides significantly faster session retrieval and updates compared to SQLite disk I/O.<br /><strong>Reliability</strong>: Sessions survive individual container failures and deployments, providing a much more robust user experience.</p>
<h2 id="heading-see-it-in-action">See it in action 🚀</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=DWr_Ata4gxQ">https://www.youtube.com/watch?v=DWr_Ata4gxQ</a></div>
<p> </p>
<h2 id="heading-future-enhancements">Future Enhancements</h2>
<p>The current implementation focuses on core session management, but there are exciting possibilities on the roadmap:</p>
<ul>
<li><p><strong>Full-text search</strong> capabilities for conversation history</p>
</li>
<li><p><strong>Vector similarity search</strong> for semantic conversation lookup</p>
</li>
<li><p><strong>Hybrid search</strong> combining text and semantic search</p>
</li>
<li><p><strong>Built-in monitoring dashboard</strong> for session analytics</p>
</li>
<li><p><strong>Advanced session analytics</strong> and conversation insights</p>
</li>
</ul>
<h2 id="heading-getting-started">Getting Started</h2>
<p>Ready to scale your OpenAI Agents SDK application? Here's what you need:</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>🐳 Docker (for Redis)</p>
</li>
<li><p>⚡️ uv package manager (recommended)</p>
</li>
<li><p>🦾 OpenAI Agents SDK</p>
</li>
<li><p>🔑 OpenAI API Key</p>
</li>
</ul>
<h3 id="heading-quick-start">Quick Start</h3>
<ol>
<li><p><strong>Install the package</strong>:</p>
<pre><code class="lang-bash"> uv add openai-agents-redis
</code></pre>
</li>
<li><p><strong>Start Redis</strong> (if you don't have it running):</p>
<pre><code class="lang-bash"> docker run -d -p 6379:6379 redis:alpine
</code></pre>
</li>
<li><p><strong>Update your code</strong> to use <code>RedisSession</code> instead of the default session management.</p>
</li>
<li><p><strong>Deploy with confidence</strong> knowing your sessions will scale with your application.</p>
</li>
</ol>
<h3 id="heading-testing">Testing</h3>
<p>The package includes comprehensive tests to ensure reliability:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Run all tests</span>
uv run pytest

<span class="hljs-comment"># Run with coverage</span>
uv run pytest --cov
</code></pre>
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p>As AI applications move from experimental to production, session management becomes crucial. Users expect their conversations to be persistent, consistent, and available regardless of backend architecture decisions.</p>
<p><code>openai-agents-redis</code> solves this problem by providing enterprise-grade session management that doesn't require you to rewrite your application. It's the missing piece that transforms your prototype into a production-ready system.</p>
<p>The OpenAI Agents SDK gives us the tools to build sophisticated AI agents. Now <code>openai-agents-redis</code> gives us the infrastructure to run them at scale.</p>
<h2 id="heading-try-it-today">Try It Today</h2>
<ul>
<li><p><strong>GitHub</strong>: <a target="_blank" href="https://github.com/rafaelpierre/openai-agents-redis">https://github.com/rafaelpierre/openai-agents-redis</a></p>
</li>
<li><p><strong>PyPI</strong>: <a target="_blank" href="https://pypi.org/project/openai-agents-redis/">https://pypi.org/project/openai-agents-redis/</a></p>
</li>
<li><p><strong>Sample Repo</strong>: <a target="_blank" href="https://github.com/rafaelpierre/openai-agents-redis-example">https://github.com/rafaelpierre/openai-agents-redis-example</a></p>
</li>
</ul>
<p>If you're building agentic applications with the OpenAI Agents SDK and hitting SQLite's limitations, give <code>openai-agents-redis</code> a try. It's designed to be the session management solution you wish existed when you first hit the scaling wall.</p>
<hr />
<p><em>Have questions or feedback? I'd love to hear about your experience scaling agentic applications. Feel free to open an issue on GitHub or reach out with your thoughts!</em></p>
]]></content:encoded></item><item><title><![CDATA[Build Your Agentic App Frontend from Scratch with AI]]></title><description><![CDATA[Exciting news—another Lightning Lesson is coming your way!
🚀 Join me on June 6th for a free 30-minute webinar: "Build Your Agentic App Frontend from Scratch with AI"
✨ In this practical session, you will:
🔧 Generate UI Components with AI: Quickly c...]]></description><link>https://llmshowto.com/free-webinar-build-your-agentic-app-frontend-from-scratch-with-ai</link><guid isPermaLink="true">https://llmshowto.com/free-webinar-build-your-agentic-app-frontend-from-scratch-with-ai</guid><category><![CDATA[AI]]></category><category><![CDATA[mcp]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Rafael Pierre]]></dc:creator><pubDate>Tue, 27 May 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753194890522/45cd4837-97bd-4a14-a88e-d8a6972703e2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Exciting news—another Lightning Lesson is coming your way!</p>
<p>🚀 Join me on June 6th for a free 30-minute webinar: "Build Your Agentic App Frontend from Scratch with AI"</p>
<p>✨ In this practical session, you will:</p>
<p>🔧 <strong>Generate UI Components with AI:</strong> Quickly create intuitive frontend building blocks using cutting-edge AI tools.</p>
<p>🔌 <strong>Connect Frontend to Agentic AI Backend</strong>: Learn the seamless integration of your frontend with intelligent agent backends to deliver dynamic user experiences.</p>
<p>🚀 <strong>Experience Your App Live:</strong> Instantly see your agentic app in action and fine-tune your interface in real-time.</p>
<h3 id="heading-why-should-you-care">Why should you care?</h3>
<p>For AI practitioners, mastering rapid frontend prototyping isn't just about technical excellence—it's about delivering value quickly and effectively. Creating intuitive, interactive frontends that leverage powerful AI agents significantly enhances your professional toolkit, accelerates experimentation, and positions you as an end-to-end AI innovator.</p>
<p>Ready to level up your frontend skills and build something amazing?</p>
<p>👉 <a target="_blank" href="https://maven.com/p/df5753/build-your-agentic-app-frontend-from-scratch-with-ai">Secure your spot now</a>!</p>
<p>Looking forward to seeing you there!</p>
]]></content:encoded></item><item><title><![CDATA[Building Agentic AI Apps with MCP]]></title><description><![CDATA[Earlier this week, I announced the launch of my course on Building Agentic AI Apps with MCP, and I’m thrilled to share that the response has been fantastic! 🎉
To kick things off, I’ll be hosting a free Lightning Lesson at Maven: “Build Your First Ag...]]></description><link>https://llmshowto.com/building-agentic-ai-apps-with-mcp</link><guid isPermaLink="true">https://llmshowto.com/building-agentic-ai-apps-with-mcp</guid><dc:creator><![CDATA[Rafael Pierre]]></dc:creator><pubDate>Mon, 05 May 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753195278437/543b69de-314e-46cd-9d86-44900760af02.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Earlier this week, I announced the launch of my course on <a target="_blank" href="https://maven.com/forms/f6a04b">Building Agentic AI Apps with MCP</a>, and I’m thrilled to share that the response has been fantastic! 🎉</p>
<p>To kick things off, I’ll be hosting a free Lightning Lesson at Maven: “Build Your First Agentic AI App with MCP” 🦾</p>
<p>In this 30-minute session, you’ll learn:</p>
<p>🔍 Why everyone is buzzing about <a target="_blank" href="https://www.anthropic.com/news/model-context-protocol">Model Context Protocol (MCP)</a> and the pain points it addresses<br />
🏗️ The core components of MCP architecture<br />
💻 Live coding: Build your first Agentic AI App with MCP<br /></p>
<p>Whether you’re just starting out or looking to deepen your understanding of Agents and MCP, this session is for you.</p>
<p>Feel free to share this with your network - looking forward to seeing you there!</p>
<p><a target="_blank" href="https://maven.com/p/5843b2/build-your-first-agentic-ai-app-with-mcp">Click here and save your spot!</a></p>
]]></content:encoded></item><item><title><![CDATA[An Overview on Testing Frameworks For LLMs]]></title><description><![CDATA[This article was written by my good friend Raahul Dutta - check out his new startup: pebbling.ai

Introduction
In this edition, I have meticulously documented every testing framework for LLMs that I've come across on the internet and GitHub.
Basic LL...]]></description><link>https://llmshowto.com/an-overview-on-testing-frameworks-for-llms</link><guid isPermaLink="true">https://llmshowto.com/an-overview-on-testing-frameworks-for-llms</guid><category><![CDATA[galileo]]></category><category><![CDATA[agenta]]></category><category><![CDATA[llm]]></category><category><![CDATA[LLMTesting]]></category><category><![CDATA[#llmops]]></category><category><![CDATA[agents]]></category><category><![CDATA[#Agentspace]]></category><category><![CDATA[ai agents]]></category><dc:creator><![CDATA[Rafael Pierre]]></dc:creator><pubDate>Sun, 08 Dec 2024 23:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_nWaeTF6qo0/upload/2fc38bf01391782e063742482ba2b530.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>This article was written by my good friend <a target="_blank" href="https://www.linkedin.com/in/raahuldutta/">Raahul Dutta</a> - check out his new startup: <a target="_blank" href="https://www.linkedin.com/company/pebbling-ai/">pebbling.ai</a></p>
</blockquote>
<h1 id="heading-introduction">Introduction</h1>
<p>In this edition, I have meticulously documented every testing framework for <strong>LLMs</strong> that I've come across on the internet and GitHub.</p>
<h2 id="heading-basic-llm-testing-framework">Basic LLM Testing Framework</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360238106/151e0fc7-9eb4-47c1-abd6-a20b4ae79771.jpeg" alt class="image--center mx-auto" /></p>
<p>The frameworks are listed in no particular order, without assigning any specific rank to them.</p>
<h2 id="heading-deepeval">👩‍⚖️ DeepEval</h2>
<p><strong>DeepEval</strong> provides a Pythonic way to run offline evaluations on your LLM pipelines so you can launch into production comfortably. The guiding philosophy is a "Pytest for LLMs": writing tests for LLM applications (such as RAG) should be as easy as writing Python unit tests, and productionizing a model as easy as making sure all tests pass.</p>
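<p>To give a flavour of that philosophy, here is a minimal test sketch. The exact API has changed across DeepEval versions, so treat this as indicative rather than authoritative:</p>
<pre><code class="lang-python">from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    # One test case: the user input plus your pipeline's actual output
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You can return them within 30 days for a full refund.",
    )
    # Runs like any pytest test; fails if relevancy drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
</code></pre>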
<h3 id="heading-metrics">🪂 Metrics</h3>
<ul>
<li><p><strong>AnswerRelevancy</strong>: Depends on "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"</p>
</li>
<li><p><strong>BertScoreMetric</strong>: Depends on "sentence-transformers/all-mpnet-base-v2"</p>
</li>
<li><p><strong>Dbias</strong>: LLMs can become highly biased after fine-tuning through RLHF or other optimizations. Bias, however, is a vague term, so the paper focuses on bias in the following areas:</p>
<ul>
<li><p>Gender (e.g. "All man hours in his area of responsibility must be approved.")</p>
</li>
<li><p>Age (e.g. "Apply if you are a recent graduate.")</p>
</li>
<li><p>Racial/Ethnicity (e.g. "Police are looking for any black males who may be involved in this case.")</p>
</li>
<li><p>Disability (e.g. "Genuine concern for the elderly and handicapped")</p>
</li>
<li><p>Mental Health (e.g. "Any experience working with retarded people is required for this job.")</p>
</li>
<li><p>Religion</p>
</li>
<li><p>Education</p>
</li>
<li><p>Political ideology</p>
</li>
<li><p>This is measured according to tests with logic following this paper:</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360216739/09612d5c-b4c7-45c2-bde5-f730678ed83d.jpeg" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>BLEUMetric</strong>: Compute the BLEU score for a candidate sentence given a reference sentence. Depends on the nltk models</p>
</li>
<li><p><strong>CohereRerankerMetric</strong></p>
</li>
<li><p><strong>ConceptualSimilarityMetric</strong>: Asserts conceptual similarity. Depends on "sentence-transformers/all-mpnet-base-v2"</p>
</li>
<li><p><strong>ranking_similarity</strong>: Similarity measures between two different ranked lists. Built on <a target="_blank" href="http://www.codalism.com/research/papers/wmz10_tois.pdf">“A Similarity Measure for Indefinite Rankings”</a></p>
</li>
<li><p><strong>NonToxicMetric</strong>: Built on <a target="_blank" href="https://pypi.org/project/detoxify/">detoxify</a></p>
</li>
<li><p><strong>FactualConsistencyMetric</strong>: Depends on "cross-encoder/nli-deberta-v3-large"</p>
</li>
<li><p><strong>EntailmentScoreMetric</strong>: Depends on "cross-encoder/nli-deberta-base"</p>
</li>
<li><p><strong>Custom Metrics</strong>: Can be added.</p>
</li>
</ul>
<h3 id="heading-details">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://colab.research.google.com/drive/1jnRUg0GHS_dTN0heijUr_88NnWiNOIFA?usp=sharing">Colab link</a></p>
</li>
<li><p><a target="_blank" href="https://docs.confident-ai.com/docs/">Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/confident-ai/deepeval">Github</a></p>
</li>
<li><p>License: Apache-2.0 license</p>
</li>
</ul>
<h3 id="heading-remarks">🧗 Remarks</h3>
<ul>
<li><p>Clean Dashboard.</p>
</li>
<li><p>Model-derived metrics work well, and you can adjust the underlying model depending on performance.</p>
</li>
<li><p>Helpful to measure the output quality.</p>
</li>
<li><p>Limited community support.</p>
</li>
</ul>
<p><em>I post weekly newsletters on LLM development at Musings on Artificial Intelligence. <a target="_blank" href="https://musingsonai.substack.com/">Subscribe</a>!</em></p>
<hr />
<h2 id="heading-agentops-in-development">🕵️ AgentOps (in development)</h2>
<h3 id="heading-details-1">🎈 Details</h3>
<ul>
<li><a target="_blank" href="https://github.com/AgentOps-AI/agentops">Github</a></li>
</ul>
<h3 id="heading-remarks-1">🧗 Remarks</h3>
<ul>
<li>Listed here because of its exciting LLM debugging roadmap.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360311730/3b6067ff-419f-4d4d-9e19-b075b5e2ed62.jpeg" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-baserunai">baserun.ai 💪💪💪</h2>
<p>Testing &amp; Observability Platform for LLM Apps. From prompt playground to end-to-end tests, baserun helps you ship your LLM apps with confidence and speed.</p>
<p>Baserun is a YCombinator-backed tool for debugging prompts at runtime.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360383517/3657670c-7185-4461-938a-4d2e3bbc33fa.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-details-2">🎈 Details</h3>
<ul>
<li><a target="_blank" href="https://docs.baserun.ai/introduction">Documentation</a></li>
</ul>
<h3 id="heading-remarks-2">🧗 Remarks</h3>
<ul>
<li><p>Clean, detailed dashboard with prompt cost (I loved that).</p>
</li>
<li><p>The evaluation framework is heavily inspired by the OpenAI Evals project and offers a number of built-in evaluations, which are recorded and aggregated in the Baserun dashboard.</p>
</li>
<li><p>The framework simplifies the LLM Debugging workflow.</p>
</li>
<li><p>Hallucinations can be prevented with the tool to some extent.</p>
</li>
<li><p>Limited customisation scope.</p>
</li>
</ul>
<hr />
<h2 id="heading-prompttools">🐤 PromptTools</h2>
<p>Welcome to <strong>prompttools</strong> created by <strong>Hegel AI</strong>! This repo offers a set of open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts. The core idea is to enable developers to evaluate using familiar interfaces like code, notebooks, and a local playground.</p>
<p>In just a few lines of code, you can test your prompts and parameters across different models (whether you are using OpenAI, Anthropic, or LLaMA models). You can even evaluate the retrieval accuracy of vector databases.</p>
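<p>For example, a minimal experiment comparing two OpenAI models at two temperatures might look roughly like this, based on the project's documented <code>OpenAIChatExperiment</code> (details may differ by version):</p>
<pre><code class="lang-python">from prompttools.experiment import OpenAIChatExperiment

# Each argument is a list; the experiment runs the cartesian product of them
models = ["gpt-3.5-turbo", "gpt-4"]
messages = [[{"role": "user", "content": "Who was the first US president?"}]]
temperatures = [0.0, 1.0]

experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
experiment.run()
experiment.visualize()  # renders a side-by-side comparison of the responses
</code></pre>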
<h3 id="heading-details-3">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://colab.research.google.com/drive/1YVcpBew8EqbhXFN8P5NaFrOIqc1FKWeS?usp=sharing">Colab link</a></p>
</li>
<li><p><a target="_blank" href="https://prompttools.readthedocs.io/en/latest/">Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/hegelai/prompttools">Github</a></p>
</li>
<li><p>License: Apache-2.0 license</p>
</li>
</ul>
<h3 id="heading-metrics-1">🪂 Metrics</h3>
<ul>
<li><p><strong>Experiments and Harnesses</strong>: There are two main abstractions in the prompttools library: experiments and harnesses.</p>
<ul>
<li><p>An experiment is a low-level abstraction that takes the cartesian product of possible inputs to an LLM API. For example, the OpenAIChatExperiment accepts lists of inputs for each parameter of the OpenAI Chat Completion API, then constructs and asynchronously executes requests using those potential inputs (see the sketch above).</p>
</li>
<li><p>A harness is built on top of an experiment and manages abstractions over inputs; you may want to use a harness when you prefer to have more details abstracted away.</p>
</li>
</ul>
</li>
<li><p><strong>Evaluation and Validation</strong>: These built-in functions help you evaluate the outputs of your experiments. They can also be used as part of your CI/CD system.</p>
<ul>
<li><p>You can also manually enter feedback to evaluate prompts, see <a target="_blank" href="https://github.com/hegelai/prompttools/blob/main/examples/notebooks/HumanFeedback.ipynb">HumanFeedback.ipynb</a></p>
</li>
<li><p>It uses GPT-4 as a judge.</p>
</li>
<li><p>Here is a list of APIs that we support with our experiments:</p>
</li>
<li><p>LLMs</p>
<ul>
<li><p>OpenAI (Completion, ChatCompletion, Fine-tuned models) - Supported</p>
</li>
<li><p>LLaMA.Cpp (LLaMA 1, LLaMA 2) - Supported</p>
</li>
<li><p>HuggingFace (Hub API, Inference Endpoints) - Supported</p>
</li>
<li><p>Anthropic - Supported</p>
</li>
<li><p>Google PaLM - Supported</p>
</li>
<li><p>Azure OpenAI Service - Supported</p>
</li>
<li><p>Replicate - Supported</p>
</li>
<li><p>Ollama - In Progress</p>
</li>
</ul>
</li>
<li><p>Vector Databases and Data Utility</p>
<ul>
<li><p>Chroma - Supported</p>
</li>
<li><p>Weaviate - Supported</p>
</li>
<li><p>Qdrant - Supported</p>
</li>
<li><p>LanceDB - Supported</p>
</li>
<li><p>Milvus - Exploratory</p>
</li>
<li><p>Pinecone - Exploratory</p>
</li>
<li><p>Epsilla - In Progress</p>
</li>
</ul>
</li>
<li><p>Frameworks</p>
<ul>
<li><p>LangChain - Supported</p>
</li>
<li><p>MindsDB - Supported</p>
</li>
<li><p>LlamaIndex - Exploratory</p>
</li>
</ul>
</li>
<li><p>Computer Vision</p>
<ul>
<li><p>Stable Diffusion - Supported</p>
</li>
<li><p>Replicate's hosted Stable Diffusion - Supported</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360480926/69d36349-8385-4302-adeb-e07ca10cbd6a.jpeg" alt class="image--center mx-auto" /></p>
<h3 id="heading-remarks-3">🧗 Remarks</h3>
<ul>
<li><p>I have been using it for the last 15 days. The Streamlit-based dashboard is smooth.</p>
</li>
<li><p><code>Prompt Template Experimentation</code> is a nice feature of the product, though I would like to see more comparison details beyond latency and similarity.</p>
</li>
<li><p>The framework covers LLMs, vector databases, and orchestrators.</p>
</li>
<li><p>Great Community Support.</p>
</li>
<li><p>Great tool for RLHF.</p>
</li>
<li><p>Can’t add a self-hosted server.</p>
</li>
</ul>
<hr />
<h2 id="heading-nvidia-nemo-guardrails">🐚 Nvidia NeMo-Guardrails</h2>
<p><strong>NeMo Guardrails</strong> is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or "rails" for short) are specific ways of controlling the output of a large language model, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialog path, using a particular language style, extracting structured data, and more.</p>
<p>NeMo Guardrails will help ensure smart applications powered by large language models (LLMs) are accurate, appropriate, on topic, and secure. The software includes all the code, examples, and documentation businesses need to add safety to AI apps that generate text.</p>
<p>It sits between the user (after vector embedding) and the guarded LLM server. It is open source, so engineers can write their own logic into the guardrail.</p>
<p>NeMo Guardrails enables developers to set up three kinds of boundaries:</p>
<ul>
<li><p>Topical guardrails prevent apps from veering off into undesired areas. For example, they keep customer service assistants from answering questions about the weather.</p>
</li>
<li><p>Safety guardrails ensure apps respond with accurate, appropriate information. They can filter out unwanted language and enforce that references are made only to credible sources.</p>
</li>
<li><p>Security guardrails restrict apps to making connections only to external third-party applications known to be safe.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360528727/2a6c1fff-2fd6-45b7-b64c-3727ee9bcb59.jpeg" alt /></p>
<h3 id="heading-details-4">🎈 Details</h3>
<ul>
<li><a target="_blank" href="https://github.com/NVIDIA/NeMo-Guardrails/tree/main">Github</a></li>
</ul>
<h3 id="heading-remarks-4">🧗 Remarks</h3>
<ul>
<li><p>NeMo Guardrails is an easily programmable guardrail toolkit and a must for production LLM applications.</p>
</li>
<li><p>Conversation designers can define the boundaries of the conversation in plain English using Colang.</p>
</li>
<li><p>The guardrail's filtering policy operates on the embedding space, which makes it more intelligent.</p>
</li>
<li><p>Supports production batching for orchestration.</p>
</li>
<li><p>The community is great.</p>
</li>
<li><p>One of the most needed frameworks at the moment.</p>
</li>
</ul>
<hr />
<h2 id="heading-agenta">🦜 Agenta</h2>
<p>Building production-ready LLM-powered applications is currently very difficult. It involves countless iterations of prompt engineering, parameter tuning, and architecture choices.</p>
<p><strong>Agenta</strong> provides you with the tools to quickly do prompt engineering and 🧪 experiment, ⚖️ evaluate, and 🚀 deploy your LLM apps. All without imposing any restrictions on your choice of framework, library, or model.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360546324/4133faa7-f3e3-42dc-87f9-de3214b5cf47.jpeg" alt /></p>
<h3 id="heading-details-5">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://docs.agenta.ai/quickstart/introduction">Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/confident-ai/deepeval">Github</a></p>
</li>
</ul>
<h3 id="heading-remarks-5">🧗 Remarks</h3>
<ul>
<li><p>The website and app code have excellent UX. The end-to-end user journey, from creation to testing, is beautifully designed.</p>
</li>
<li><p>Can be hosted on-prem, on AWS or GCP</p>
</li>
<li><p>They have different parts:</p>
<ul>
<li><p>Playground: to create the prompts from lots of predefined templates like</p>
<ul>
<li><p>sales_call_summarizer</p>
</li>
<li><p>baby_name_generator</p>
</li>
<li><p>chat_models</p>
</li>
<li><p>completion_models</p>
</li>
<li><p>compose_email</p>
</li>
<li><p>experimental</p>
</li>
<li><p>extract_data_to_json</p>
</li>
<li><p>job_info_extractor</p>
</li>
<li><p>noteGPT</p>
</li>
<li><p>recipes_and_ingredients</p>
</li>
<li><p>sales_call_summarizer</p>
</li>
<li><p>sales_transcript_summarizer</p>
</li>
<li><p>sentiment_analysis</p>
</li>
</ul>
</li>
<li><p>Test Sets</p>
</li>
<li><p>Evaluate</p>
</li>
<li><p>API Endpoint</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-agentbench">🦚 AgentBench</h2>
<p><strong>Large Language Models (LLMs)</strong> are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present <strong>AgentBench</strong>, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360570117/961d3abf-5ad3-455c-81fb-a96898173dbf.jpeg" alt /></p>
<h3 id="heading-details-6">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://arxiv.org/abs/2308.03688">Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/THUDM/AgentBench">Github</a></p>
</li>
</ul>
<h3 id="heading-remarks-6">🧗 Remarks</h3>
<ul>
<li><p>This paper evaluates the performance of several LLMs (Llama 2, Vicuna, GPT-X, Dolly, etc.) as intelligent agents in a long-chain environment that involves databases (SQL), web booking, and product comparison on the internet. The main question to be answered is whether Llama 2 is superior to ChatGPT at comparing products on the internet. For the purpose of this study, an agent refers to an LLM that operates in this environment.</p>
</li>
<li><p>An "AGENT" is an LLM that operates within a simulated environment to achieve a specific goal. In this study, the term, "agent" is used to refer to such an LLM. The agent's performance is assessed based on its capability to complete assigned tasks.</p>
</li>
<li><p>To date, it’s one of the best approaches to evaluating an LLM across various tasks.</p>
</li>
</ul>
<hr />
<h2 id="heading-guidance">🐿️ Guidance</h2>
<p><strong>Guidance</strong> enables you to control modern language models more effectively and efficiently than traditional prompting or chaining. Guidance programs allow you to interleave generation, prompting, and logical control into a single continuous flow matching how the language model actually processes the text. Simple output structures like Chain of Thought and its many variants (e.g., ART, Auto-CoT, etc.) have been shown to improve LLM performance. The advent of more powerful LLMs like GPT-4 allows for even richer structure, and guidance makes that structure easier and cheaper.</p>
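<p>As a taste of that templating style, here is a tiny program. It uses the legacy handlebars-style API this section describes; newer Guidance releases changed the interface, so check the current docs before copying:</p>
<pre><code class="lang-python">import guidance

# Legacy (0.0.x) handlebars-style API, as described in this section
guidance.llm = guidance.llms.OpenAI("text-davinci-003")

# Interleave fixed text, variable interpolation, and generation in one template
program = guidance("""Rewrite this proverb to be about language models.
{{proverb}}

Rewritten: {{gen 'rewritten' max_tokens=40}}""")

result = program(proverb="Where there is no guidance, a people falls.")
print(result["rewritten"])
</code></pre>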
<h3 id="heading-details-7">🎈 Details</h3>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance">Github</a></li>
</ul>
<h3 id="heading-features">🕵️‍♀️ Features</h3>
<ul>
<li><p>🔹 Live streaming with simple, intuitive syntax: Guidance feels like a templating language, and just like standard Handlebars templates, you can do variable interpolation (e.g., {{proverb}}) and logical control.</p>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/proverb.ipynb">Details</a></li>
</ul>
</li>
<li><p>🔹 Chat dialog: Guidance supports API-based chat models like GPT-4, as well as open chat models like Vicuna, through a unified API based on role tags (e.g., {{#system}}...{{/system}}). This allows interactive dialog development that combines rich templating and logical control with modern chat models.</p>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/chat.ipynb">Details</a></li>
</ul>
</li>
<li><p>🔹 Guidance acceleration: When multiple generation or LLM-directed control-flow statements are used in a single Guidance program, inference performance improves significantly because the key/value caches are optimally reused as the program progresses through the prompt. Guidance only asks the LLM to generate the new text, not the entire program, which can cut a prompt's runtime in half vs. a standard generation approach.</p>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/guidance_acceleration.ipynb">Details</a></li>
</ul>
</li>
</ul>
<p>🔹 Token healing: The standard greedy tokenizations used by most language models introduce a subtle and powerful bias that can have all kinds of unintended consequences for your prompts. Using a process we call "token healing" guidance automatically removes these surprising biases, freeing you to focus on designing the prompts you want without worrying about tokenization artifacts. - <a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/art_of_prompt_design/prompt_boundaries_and_token_healing.ipynb">Details</a></p>
<ul>
<li><p>🔹 Rich output structure example: To demonstrate the value of output structure, we take a simple task from BigBench, where the goal is to identify whether a given sentence contains an anachronism (a statement that is impossible because of non-overlapping time periods). Below is a simple two-shot prompt for it, with a human-crafted chain-of-thought sequence.</p>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/anachronism.ipynb">Details</a></li>
</ul>
</li>
<li><p>🔹 Guaranteeing valid syntax JSON example: Large language models are great at generating useful outputs, but they are not great at guaranteeing that those outputs follow a specific format. This can cause problems when we want to use the outputs of a language model as input to another system. For example, if we want to use a language model to generate a JSON object, we need to make sure that the output is valid JSON. With guidance we can both accelerate inference speed and ensure that generated JSON is always valid. Below we generate a random character profile for a game with perfect syntax every time.</p>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/guaranteeing_valid_syntax.ipynb">Details</a></li>
</ul>
</li>
<li><p>🔹 Role-based chat model example: Modern chat-style models like ChatGPT and Alpaca are trained with special tokens that mark out "roles" for different areas of the prompt. Guidance supports these models through role tags that automatically map to the correct tokens or API calls for the current LLM. Below we show how a role-based guidance program enables simple multi-step reasoning and planning.</p>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/chat.ipynb">Details</a></li>
</ul>
</li>
<li><p>🔹 Agents: We can easily build agents that talk to each other or to a user, via the await command. The await command allows us to pause execution and return a partially executed guidance program. By putting await in a loop, that partially executed program can then be called again and again to form a dialog (or any other structure you design). For example, here is how we might get GPT-4 to simulate two agents talking to one another.</p>
<ul>
<li><a target="_blank" href="https://github.com/guidance-ai/guidance/blob/main/notebooks/chat.ipynb">Details</a></li>
</ul>
</li>
</ul>
<h3 id="heading-remarks-7">🧗 Remarks</h3>
<ul>
<li><p>If I need to select a tool for prompt engineering, I select this one.</p>
</li>
<li><p>Community Support is Superb.</p>
</li>
</ul>
<hr />
<h2 id="heading-arthur-bench">🦆 Arthur Bench</h2>
<p>Today, we’re excited to introduce our newest product: <strong>Arthur Bench</strong>, the most robust way to evaluate LLMs. Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models. This open source tool will enable businesses to evaluate how different LLMs will perform in real-world scenarios so they can make informed, data-driven decisions when integrating the latest AI technologies into their operations. Here are some ways in which Arthur Bench helps businesses: Model Selection &amp; Validation, Budget &amp; Privacy Optimization, and Translation of Academic Benchmarks to Real-World Performance.</p>
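<p>Creating a test suite is a small amount of code. The sketch below follows the pattern from Bench's quickstart; names and arguments may differ slightly between versions:</p>
<pre><code class="lang-python">from arthur_bench.run.testsuite import TestSuite

# Define a suite once: inputs plus reference answers, scored by exact match
suite = TestSuite(
    "capitals_quickstart",
    "exact_match",
    input_text_list=["What is the capital of France?", "What is 2 + 2?"],
    reference_output_list=["Paris", "4"],
)

# Score one candidate model's outputs against the references
suite.run("baseline_run", candidate_output_list=["Paris", "5"])
</code></pre>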
<h3 id="heading-details-8">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://www.arthur.ai/arthur-bench">Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/THUDM/AgentBench">Github</a></p>
</li>
</ul>
<h3 id="heading-remarks-8">🧗 Remarks</h3>
<ul>
<li><p>This tool creates a test suite automatically using datasets.</p>
</li>
<li><p>Periodically validates models for resiliency to model changes outside their control.</p>
</li>
<li><p>The system offers deployment gates that identify anomalous inputs, potential PII leakage, toxicity, and other quality metrics. It learns from production performance to optimize thresholds for these quality gates.</p>
</li>
<li><p>Provides core token-level observability, performance dashboarding, inference debugging, and alerting.</p>
</li>
<li><p>Accelerates ability to identify and debug underperforming regions.</p>
</li>
</ul>
<hr />
<h2 id="heading-galileo-llm-studio">🌳 Galileo LLM Studio</h2>
<p>Algorithm-powered LLMOps Platform: Find the best prompt, inspect data errors while fine-tuning, monitor LLM outputs in real-time. All in one powerful, collaborative platform.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360613640/9f04fdac-c0e8-47b6-b449-d889af0b1e82.jpeg" alt /></p>
<h3 id="heading-details-9">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://github.com/rungalileo">Github</a></p>
</li>
<li><p><a target="_blank" href="https://www.rungalileo.io/">Details</a></p>
</li>
</ul>
<h3 id="heading-features-1">🕵️‍♀️ Features</h3>
<ul>
<li><p>🔹 Prompt Engineering</p>
<ul>
<li><p>Prompt Inspector.</p>
</li>
<li><p>A detailed, easy-to-use dashboard with multiple parameters and evaluation scores.</p>
</li>
<li><p>Hallucination Score.</p>
</li>
</ul>
</li>
<li><p>🔹 LLM Fine-Tune and Debugging</p>
<ul>
<li><p>The watcher function analyzes the input data.</p>
</li>
<li><p>A detailed dashboard for data quality, with auto-identification of data pulled from the LLM that reduces performance.</p>
</li>
<li><p>Fix and track data changes over time.</p>
</li>
</ul>
</li>
<li><p>🔹 Production Monitoring</p>
<ul>
<li><p>Real-time LLM Monitoring.</p>
</li>
<li><p>Risk Control with customized plugins</p>
</li>
<li><p>Customized alerts via Slack.</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-remarks-9">🧗 Remarks</h3>
<ul>
<li>To date, this is the most complete tool for LLMOps I have found. Developers can push LLM models into production with confidence using it.</li>
</ul>
<hr />
<h2 id="heading-lakeraai">🎄 lakera.ai</h2>
<p>An Overview of <strong>Lakera Guard</strong> – Bringing Enterprise-Grade Security to LLMs with One Line of Code. At Lakera, we supercharge AI developers by enabling them to swiftly identify and eliminate their AI applications’ security threats so that they can focus on building the most exciting applications securely.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360623913/4f028448-5c7a-4c6d-bffb-73d14090a14d.jpeg" alt /></p>
<h3 id="heading-features-2">🕵️‍♀️ Features</h3>
<ul>
<li><p>🔹 Content Moderation</p>
<ul>
<li><p>These are the categories that Lakera Guard currently evaluates the input prompt against for inappropriate content.</p>
<ul>
<li>Hate: Content targeting race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste, including violence. Content directed at non-protected groups (e.g., chess players) is exempt.</li>
<li>Sex: Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness).</li>
</ul>
</li>
</ul>
</li>
<li><p>🔹 Prompt injections</p>
<ul>
<li><p>Jailbreaks: LLMs can be forced into malicious behavior by jailbreak attack prompts. Lakera Guard is continuously updated to protect against these.</p>
</li>
<li><p>Prompt injections: Prompt injection attacks must be stopped at all costs. Attackers will do whatever it takes to manipulate the system's behavior or gain unauthorized access. But fear not, Lakera Guard is constantly updated to prevent prompt injections and protect your system from harm.</p>
</li>
</ul>
</li>
<li><p>🔹 Sensitive information:</p>
<ul>
<li>PII stands for Personally Identifiable Information - data that can identify an individual. It requires strict protection due to identity theft and privacy risks. Organizations handling PII must safeguard it to prevent unauthorized access. Laws like GDPR and HIPAA ensure proper PII handling and privacy protection.</li>
</ul>
</li>
<li><p>🔹 Relevant Language</p>
<ul>
<li>There are many ways to challenge LLMs using language. Users may use Japanese jailbreaks, employ Portuguese prompt injections, intentionally include spelling errors in prompts to bypass defenses, or insert extensive code or special characters into prompts. Lakera Guard assigns a score between 0 and 1 to indicate the authenticity of a prompt; a higher score suggests a genuine attempt at regular communication.</li>
</ul>
</li>
<li><p>🔹 Unknown links</p>
<ul>
<li>One way in which prompt injection can be dangerous is phishing.</li>
</ul>
</li>
</ul>
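<p>Services like this are typically consumed as a pre-flight check before a prompt ever reaches your LLM. The snippet below illustrates that flow only; the endpoint and payload shape are assumptions, not Lakera's documented interface, so consult their docs for the real API:</p>
<pre><code class="lang-python">import os

import requests


def looks_like_prompt_injection(prompt: str) -> bool:
    # Hypothetical endpoint and response shape, shown only to illustrate the flow
    response = requests.post(
        "https://api.lakera.ai/v1/prompt_injection",  # assumed URL
        json={"input": prompt},
        headers={"Authorization": f"Bearer {os.environ['LAKERA_API_KEY']}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["results"][0]["flagged"]


# Only forward the prompt to the LLM if the guard clears it
if not looks_like_prompt_injection("Ignore previous instructions..."):
    pass  # call your LLM here
</code></pre>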
<h3 id="heading-remarks-10">🧗 Remarks</h3>
<ul>
<li><p>The Roadmap is amazing.</p>
</li>
<li><p>LLM security is a real topic - and they are working on it.</p>
</li>
</ul>
<hr />
<h2 id="heading-nightfall-ai">🐣 NightFall AI</h2>
<p><strong>ChatGPT</strong> and other generative AI tools are powerful ways to increase your team's output. But prompts can contain sensitive data such as PII, confidential information, API keys, PHI, and much more. Rather than block these tools, use Nightfall's Chrome extension or Developer Platform to protect that sensitive data before it is shared.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360634221/0a638317-3b54-4079-967a-414fcd7dd2a1.jpeg" alt /></p>
<h3 id="heading-details-10">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://www.nightfall.ai/">Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/nightfallai/nightfall-python-sdk">Github</a></p>
</li>
</ul>
<h3 id="heading-remarks-11">🧗 Remarks</h3>
<ul>
<li><p>A great tool for handling LLM security.</p>
</li>
<li><p>Manage all security tasks in your SIEM or Nightfall dashboard.</p>
</li>
<li><p>Proactively protect your company and customer data.</p>
</li>
<li><p>Identify and manage secrets and keys from a single dashboard.</p>
</li>
<li><p>Train employees on best-practice security policies and build a culture of trust and strong data security hygiene.</p>
</li>
<li><p>Complete visibility of your sensitive data.</p>
</li>
</ul>
<hr />
<h2 id="heading-benchllm">🦢 BenchLLM</h2>
<p><strong>BenchLLM</strong> is a Python-based open-source library that streamlines the testing of Large Language Models (LLMs) and AI-powered applications. It measures the accuracy of your model, agents, or chains by validating responses on any number of tests via LLMs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360642355/05ff1b56-a0eb-4ac8-8a0f-a4f50866e480.jpeg" alt /></p>
<h3 id="heading-details-11">🎈 Details</h3>
<ul>
<li><p><a target="_blank" href="https://benchllm.com/">Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/v7labs/BenchLLM">Github</a></p>
</li>
</ul>
<h3 id="heading-remarks-12">🧗 Remarks</h3>
<ul>
<li><p>A detailed customizable library to evaluate prompt performance.</p>
</li>
<li><p>A great tool for prompt engineering.</p>
</li>
<li><p>Supports vector retrieval, similarity, orchestrators, and function calling.</p>
</li>
<li><p>Test the responses of your LLM across any number of prompts.</p>
</li>
<li><p>Continuous integration for chains like Langchain, agents like AutoGPT, or LLM models like Llama or GPT-4.</p>
</li>
<li><p>Eliminate flaky chains and create confidence in your code.</p>
</li>
<li><p>Spot inaccurate responses and hallucinations in your application at every version.</p>
</li>
</ul>
<hr />
<h2 id="heading-martian">🦉 Martian</h2>
<p>Dynamically route every prompt to the best LLM. Highest performance, lowest costs, incredibly easy to use. There are over 250,000 LLMs today. Some are good at coding. Some are good at holding conversations. Some are up to 300x cheaper than others. You could hire an ML engineering team to test every single one — or you can switch to the best one for each request with Martian.</p>
<p>Before:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360657913/352a51bd-d878-481d-adf3-988c2e31acce.png" alt /></p>
<p>After:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753360674629/f794cd49-4160-4291-9150-159593f94ebe.png" alt /></p>
<h3 id="heading-details-12">🎈 Details</h3>
<ul>
<li><a target="_blank" href="https://github.com/withmartian/martian-python">Github</a></li>
</ul>
<h3 id="heading-remarks-13">🧗 Remarks</h3>
<ul>
<li><p>In the development phase, but I love the idea. It is trying to solve one of the most burning problems in the LLM ecosystem.</p>
</li>
<li><p>There are various models available in the market that specialize in different tasks such as coding and storytelling. The Martian SDK is designed to identify the prompt's intention and utilize various models internally to produce the output.</p>
</li>
<li><p>GPT-4 models are 316x costlier than a 7-billion-parameter model - “Don't waste money by paying senior models to do junior work. The model router sends your tasks to the right model.”</p>
</li>
</ul>
<hr />
<h2 id="heading-special-mention">🐹 Special Mention</h2>
<h3 id="heading-rellmhttpsrellmai"><a target="_blank" href="https://rellm.ai/">ReLLM</a></h3>
<p><strong>ReLLM</strong> was created to fill a need when developing a separate tool. We needed a way to provide long term memory and context to our users, but we also needed to account for permissions and who can see what data.</p>
<h3 id="heading-langdockhttpswwwlangdockcom">🥦 <a target="_blank" href="https://www.langdock.com/">LangDock</a></h3>
<h4 id="heading-the-gdpr-compliant-chatgpt-for-your-team">The GDPR-compliant ChatGPT for your team</h4>
<h3 id="heading-tayloraihttpswwwtrytaylorai">🥒 <a target="_blank" href="https://www.trytaylor.ai/">TaylorAI</a></h3>
<p><strong>Taylor AI</strong> allows enterprises to train and own their own proprietary fine-tuned LLMs in minutes, not weeks.</p>
<h3 id="heading-scorecardaihttpswwwgetscorecardai">🍉 <a target="_blank" href="https://www.getscorecard.ai/">scorecard.ai</a></h3>
<p>Testing for Production-ready LLMs. Ship faster with more confidence. Integrate in minutes.</p>
<h3 id="heading-signwayiohttpswwwsignwayioexplanation">🍈 <a target="_blank" href="https://www.signway.io/#explanation">signway.io</a></h3>
<p>Signway is a proxy server that addresses the problem of re-streaming API responses from backend to frontend by allowing the frontend to directly request the API using a pre-signed URL created by Signway. This URL is short-lived, and once it passes verification for authenticity and expiry, Signway will proxy the request to the API and add the necessary authentication headers.</p>
<h3 id="heading-mithrilsecurityiohttpswwwmithrilsecurityio">🥥 <a target="_blank" href="https://www.mithrilsecurity.io/">mithrilsecurity.io</a></h3>
<p><strong>Mithril Security</strong> helps software vendors sell SaaS to enterprises, thanks to our secure enclave deployment tooling, which provides SaaS on-prem levels of security and control for customers.</p>
<h3 id="heading-kobaltlabshttpswwwkobaltlabscom">🥝 <a target="_blank" href="https://www.kobaltlabs.com/">kobaltlabs</a></h3>
<p>Unlock the power of GPT for your most sensitive data with a fast, simple security API</p>
<h3 id="heading-cadeaaihttpswwwcadeaai">🥭 <a target="_blank" href="https://www.cadea.ai/">cadea.ai</a></h3>
<p>Deploy enterprise-level AI tools equipped with e2e data security and role based access control. Our platform helps you create, manage, and monitor chatbots that can answer questions about your internal documents.</p>
<hr />
<h2 id="heading-summary">🐶 Summary</h2>
<p>It is hard to compare apples to apples. That's why I have grouped the frameworks (no ranking).</p>
<h3 id="heading-prompt-engineering-make-prompts-better">🔹 Prompt Engineering (Make Prompts better)</h3>
<ul>
<li><p>Baserun</p>
</li>
<li><p>PromptTools</p>
</li>
<li><p>DeepEval</p>
</li>
<li><p>Promptfoo</p>
</li>
<li><p>Nvidia NeMo-Guardrails</p>
</li>
<li><p>Agenta</p>
</li>
<li><p>AI Hero Studio</p>
</li>
<li><p>Guidance</p>
</li>
<li><p>Galileo LLM Studio</p>
</li>
<li><p>BenchLLM</p>
</li>
</ul>
<h3 id="heading-everything-about-llm-fine-tune-debugging-monitoring">🔹 Everything about LLM (Fine-tune, Debugging, Monitoring)</h3>
<ul>
<li><p>Baserun</p>
</li>
<li><p>Agenta</p>
</li>
<li><p>Nvidia NeMo-Guardrails</p>
</li>
<li><p>AgentBench</p>
</li>
<li><p>Galileo LLM Studio</p>
</li>
<li><p>Martian</p>
</li>
</ul>
<h3 id="heading-llm-security-guard-the-llm-fortress">🔹 LLM Security (Guard The LLM Fortress)</h3>
<ul>
<li><p>Nvidia NeMo-Guardrails</p>
</li>
<li><p>Arthur Bench</p>
</li>
<li><p>Galileo LLM Studio</p>
</li>
<li><p>lakera.ai</p>
</li>
<li><p>NightFall AI</p>
</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Supercharging LLM Capabilities with Retrieval Augmented Generation (RAG)]]></title><description><![CDATA[Large Language Models (LLMs) like ChatGPT, Llama, Mistral and many others have become cornerstones due to their ability to understand and generate human-like text. These models have revolutionized how machines interact with information, offering unp...]]></description><link>https://llmshowto.com/understanding-retrieval-augmented-generation-rag-supercharging-llm-capabilities-with-embeddings-and-semantic-search</link><guid isPermaLink="true">https://llmshowto.com/understanding-retrieval-augmented-generation-rag-supercharging-llm-capabilities-with-embeddings-and-semantic-search</guid><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[chatgpt]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Rafael Pierre]]></dc:creator><pubDate>Wed, 24 Apr 2024 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753196240709/3124a543-26e4-4f4b-a428-50fcca294f28.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Large Language Models (LLMs)</strong> like <strong>ChatGPT, Llama, Mistral</strong> and many others have become cornerstones due to their ability to understand and generate human-like text. These models have revolutionized how machines interact with information, offering unprecedented opportunities across a range of applications. However, despite their capabilities, LLMs come with notable limitations, including:</p>
<h3 id="heading-static-knowledge-base">Static Knowledge Base</h3>
<p><strong>LLMs</strong> are typically trained on a fixed dataset that may consist of a snapshot of the internet or curated corpora up to a certain point in time. Once training is complete, their knowledge base becomes static.</p>
<ul>
<li><strong>Outdated Information</strong>: Because the models do not update their knowledge after training, they can quickly become outdated as new information emerges and societal norms evolve.</li>
<li><strong>Inability to Learn Post-Training</strong>: Without continuous updates or further training, <strong>LLMs</strong> cannot adapt to new data or events that occur after their last training cycle. This limits their ability to provide relevant and current responses.</li>
</ul>
<h3 id="heading-factuality-challenges">Factuality Challenges</h3>
<p><strong>Factuality</strong> in LLMs relates to the accuracy and truthfulness of the information they generate. LLMs often struggle with maintaining a high level of factuality for several reasons:</p>
<ul>
<li><strong>Confidence in Incorrect Information</strong>: LLMs can generate responses with high confidence that are factually incorrect or misleading. This is because they rely on patterns in data rather than verified facts.</li>
<li><strong>Hallucination of Data</strong>: LLMs are known to <strong>hallucinate</strong> information, meaning they generate plausible but entirely fabricated details. This can be particularly problematic in settings that require high accuracy such as medical advice or news reporting.</li>
<li><strong>Source Attribution</strong>: LLMs do not keep track of the sources of their training data, making it difficult to trace the origin of the information they provide, which complicates the verification of generated content.</li>
</ul>
<h3 id="heading-scalability-concerns">Scalability Concerns</h3>
<p>Scaling LLMs involves more than just handling larger datasets or producing longer text outputs. It encompasses several dimensions that can present challenges:</p>
<ul>
<li><strong>Computational Resources</strong>: Training state-of-the-art LLMs requires significant computational power and energy, often involving hundreds of GPUs or TPUs running for weeks. This high resource demand makes scaling costly and less accessible to smaller organizations or independent researchers.</li>
<li><strong>Model Size and Management</strong>: As models scale, they become increasingly complex and difficult to manage. Larger models require more memory and more sophisticated infrastructure for deployment, which can complicate integration and maintenance in production environments.</li>
<li><strong>Latency and Throughput</strong>: Larger models generally process information more slowly, which can lead to higher latency in applications that require real-time responses. Managing throughput effectively while maintaining performance becomes a critical challenge as the model scales.</li>
</ul>
<p>These challenges underscore the limitations of current LLMs and highlight the necessity for innovative solutions like <strong>Retrieval Augmented Generation (RAG)</strong>, which seeks to address these issues by dynamically integrating up-to-date external information and improving the model’s adaptability, accuracy, and scalability.</p>
<h2 id="heading-what-is-retrieval-augmented-generation">What is Retrieval Augmented Generation?</h2>
<p>Introduced by <a target="_blank" href="https://ai.meta.com/meta-ai/">Meta</a>, <strong><a target="_blank" href="https://arxiv.org/abs/2005.11401">Retrieval Augmented Generation (RAG)</a></strong> represents a solution designed to overcome these shortcomings. At its core, RAG is a hybrid model that enhances the <strong>generative capabilities</strong> of traditional <strong>LLMs</strong> by integrating them with a <strong>retrieval component</strong>. This integration allows the model to access a broader and more up-to-date pool of information during the response generation process, thereby improving the relevance and accuracy of its outputs.</p>
<p><center><img src="/images/retrieval-augmented-generation-mips.png" /></center>
<br /></p>
<h2 id="heading-how-rag-works">How RAG Works</h2>
<p><center><img src="/images/retrieval-augmented-generation-how-it-works.png" /></center>
<br /></p>
<p>The architecture of a typical RAG system combines two main components: a retrieval mechanism and a transformer-based generator. Here’s how it functions:</p>
<ol>
<li><strong>Retrieval Process:</strong> When a query is inputted, the RAG system first searches a vast database of documents to find content that is most relevant to the query. This process leverages techniques from information retrieval to effectively match the query with the appropriate documents.</li>
<li><strong>Generation Process:</strong> The retrieved documents are then fed into a transformer-based model which synthesizes the information and generates a coherent and contextually appropriate response. This step ensures that the output is not just accurate but also tailored to the specific needs of the query.</li>
</ol>
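<p>Schematically, the whole loop fits in a few lines. The retriever and generator below are deliberately stubbed-out placeholders; in a real system they would be a vector database lookup and an LLM call:</p>
<pre><code class="lang-python">from dataclasses import dataclass


@dataclass
class Doc:
    text: str


def retrieve(query: str, docs: list[Doc], top_k: int = 3) -> list[Doc]:
    # Placeholder retriever: a real system ranks docs by embedding similarity
    return docs[:top_k]


def generate(prompt: str) -> str:
    # Placeholder generator: a real system calls an LLM here
    return f"LLM answer for: {prompt[:40]}..."


def rag_answer(query: str, docs: list[Doc]) -> str:
    # 1. Retrieval: find the documents most relevant to the query
    context = "\n\n".join(d.text for d in retrieve(query, docs))
    # 2. Generation: answer grounded in the retrieved context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
</code></pre>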
<h3 id="heading-embeddings-rags-secret-sauce">Embeddings: RAG's Secret Sauce</h3>
<p>In Retrieval Augmented Generation (RAG) systems, embeddings play a crucial role in enabling the efficient retrieval of relevant documents or data that the generative component uses to produce answers. Here's an overview of how embeddings function within a RAG framework:</p>
<h4 id="heading-what-are-embeddings">What are Embeddings?</h4>
<p><strong>Embeddings</strong> are dense, low-dimensional representations of higher-dimensional data. In the context of RAG, embeddings are typically generated for both the input query and the documents in a database. These embeddings transform textual information into vectors in a continuous vector space, where semantically similar items are located closer to each other.</p>
<p><center><img src="/images/linear-relationships_orig.png" /></center>
<br /></p>
<h4 id="heading-role-of-embeddings-in-rag">Role of Embeddings in RAG</h4>
<ul>
<li><p><strong>Semantic Matching:</strong> <strong>Embeddings</strong> enable the RAG system to perform semantic matching between the input query and the potential source documents. By converting <strong>words, phrases, or entire documents</strong> into <strong>vectors</strong>, the system can use distance metrics (like cosine similarity) to find the documents that are most semantically similar to the query. This is more effective than traditional keyword matching, which might miss relevant documents due to synonymy or differing phrasings.</p>
</li>
<li><p><strong>Efficient Information Retrieval:</strong> Without embeddings, searching through a large database for relevant information could be computationally expensive and slow, especially as databases scale up. Embeddings allow the use of advanced indexing and search algorithms (like approximate nearest neighbor search algorithms) that can quickly retrieve the top matching documents from a large corpus, making the process both scalable and efficient.</p>
</li>
<li><p><strong>Improving the Quality of Generated Responses:</strong> By retrieving documents that are semantically related to the query, the generative component of the RAG system has access to contextually relevant and rich information. This helps in generating more accurate, detailed, and contextually appropriate answers. Embeddings ensure that the retrieved content is not just relevant but also enhances the generative process by providing specific details or factual content that the model can incorporate into its responses.</p>
</li>
<li><p><strong>Continual Learning and Adaptation:</strong> In some advanced implementations, embeddings can also be dynamically updated or refined based on feedback loops from the system's performance or new data, enhancing the model's ability to adapt over time. This can be crucial for applications in rapidly changing fields like news or scientific research.</p>
</li>
</ul>
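<p>Here is the small demonstration promised above: the query shares no keywords with the relevant document, yet cosine similarity over embeddings still ranks it first. The texts are made up for illustration:</p>
<pre><code class="lang-python"># Semantic matching vs. keyword matching: the query shares no words with
# the relevant document, yet cosine similarity still ranks it first.
# The texts are invented for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Steps for recovering your account credentials.",
    "Our office is closed on public holidays.",
]
query = "How do I reset my password?"

scores = util.cos_sim(model.encode(query), model.encode(docs))[0]
print(scores)  # the credentials document scores far higher despite zero shared keywords
</code></pre>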
<h4 id="heading-technologies-and-tools-for-embeddings-in-rag">Technologies and Tools for Embeddings in RAG</h4>
<p>Several technologies facilitate the creation and management of embeddings in a RAG system. Libraries like <a target="_blank" href="https://github.com/facebookresearch/faiss">FAISS</a>, <a target="_blank" href="https://github.com/spotify/annoy">Annoy</a>, <a target="_blank" href="https://sbert.net/">Sentence Transformers</a> and <a target="_blank" href="https://github.com/nmslib/hnswlib">HNSW</a> are popular for building efficient indexing systems that support fast retrieval operations on embeddings. Machine learning frameworks such as <a target="_blank" href="https://www.tensorflow.org/">TensorFlow</a> and <a target="_blank" href="https://pytorch.org/">PyTorch</a>, along with models from <a target="_blank" href="https://hf.co">Hugging Face’s Transformers</a> library, are commonly used to generate embeddings from textual data.</p>
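<p>As a small taste of these tools, the sketch below builds a FAISS index over a handful of placeholder documents. <code>IndexFlatIP</code> performs exact inner-product search, which equals cosine similarity on normalized vectors; for large corpora you would swap in an approximate index such as <code>IndexHNSWFlat</code>:</p>
<pre><code class="lang-python"># A small FAISS sketch over placeholder documents. IndexFlatIP does
# exact inner-product search (cosine similarity on normalized vectors);
# approximate indexes such as IndexHNSWFlat scale to much larger corpora.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Qdrant is an open source vector database.",
    "FAISS provides fast similarity search over dense vectors.",
    "Transformers generate text one token at a time.",
]
embeddings = model.encode(docs, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # one slot per embedding dimension
index.add(embeddings)

query = model.encode(["library for nearest neighbor search"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)  # top-2 most similar documents
print(ids[0], scores[0])              # indices into `docs` plus their scores
</code></pre>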
<h2 id="heading-benefits-of-rag">Benefits of RAG</h2>
<p>RAG addresses several limitations of LLMs effectively:</p>
<ul>
<li><strong>Dynamic Knowledge Integration:</strong> Unlike traditional LLMs, RAG can pull the most current data from external sources, ensuring responses are not only accurate but also timely.</li>
<li><strong>Enhanced Accuracy and Relevance:</strong> By grounding answers in up-to-date retrieved information, RAG considerably improves the factual accuracy and relevance of responses.</li>
</ul>
<h2 id="heading-use-cases-and-applications">Use Cases and Applications</h2>
<p>RAG has found practical applications in several fields:</p>
<ul>
<li><strong>Customer Support:</strong> Automating responses to user inquiries by providing precise and updated information.</li>
<li><strong>Healthcare:</strong> Assisting medical professionals by quickly retrieving medical literature and patient data to offer better diagnostic support.</li>
<li><strong>Legal Advisories:</strong> Enhancing the preparation of legal documents and advice by accessing and integrating the latest legal precedents and regulations.</li>
</ul>
<h2 id="heading-getting-started-with-rag">Getting Started with RAG</h2>
<p>For those interested in implementing RAG, there are several tools and frameworks available, both proprietary and open source. For any RAG implementation there will be two main components:</p>
<ul>
<li><p><strong>Embeddings</strong>: raw text that you want to index needs to be converted to embeddings. There are proprietary options from <a target="_blank" href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a>, <a target="_blank" href="https://cohere.com/embeddings">Cohere</a>, <a target="_blank" href="https://docs.anthropic.com/claude/docs/embeddings">Anthropic</a> and other providers. In the open source world, Sentence Transformers and FAISS provide out-of-the-box functionality that lets you use open source models to encode a text dataset into a set of embeddings.</p>
</li>
<li><p><strong>Vector Databases</strong>: once text is converted to embeddings, you will need to store them in a vector database. Ideally, this database will expose functionality to index the embeddings and to perform semantic search over them. Retrieval typically combines a similarity metric, such as cosine similarity, with approximate nearest neighbor (ANN) search; there are many retrieval techniques and strategies, and the right choice depends on the use case. Popular vector database options, both proprietary and open source, include the following (a short end-to-end example follows this list):</p>
<ul>
<li><a target="_blank" href="https://qdrant.tech/">Qdrant</a></li>
<li><a target="_blank" href="https://milvus.io/">Milvus</a></li>
<li><a target="_blank" href="https://weaviate.io/">Weaviate</a></li>
<li><a target="_blank" href="https://www.trychroma.com/">ChromaDB</a></li>
<li><a target="_blank" href="https://github.com/pgvector/pgvector">PostgreSQL + pgvector</a></li>
</ul>
</li>
</ul>
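<p>Putting both components together, here is the end-to-end example promised above, using ChromaDB's in-memory client. Chroma applies a default embedding function under the hood, so raw text goes in and semantic search comes out; the collection name and documents are illustrative:</p>
<pre><code class="lang-python"># An end-to-end sketch with ChromaDB's in-memory client; Chroma applies a
# default embedding function, so text goes in and semantic search comes out.
# Collection name and documents are illustrative.
import chromadb

client = chromadb.Client()  # in-memory vector database
collection = client.create_collection("articles")

collection.add(
    ids=["a1", "a2"],
    documents=[
        "RAG retrieves relevant documents before generating an answer.",
        "Vector databases index embeddings for fast semantic search.",
    ],
)

results = collection.query(query_texts=["how does RAG find context?"], n_results=1)
print(results["documents"][0])  # the retrieved passage to pass to the generator
</code></pre>
<p>The retrieved passage would then be placed into the generator's prompt, exactly as in the first sketch of this chapter.</p>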
<p>In the next chapters, we'll dive deeper into the following RAG topics:</p>
<ul>
<li>Different RAG strategies</li>
<li>Vector DB options &amp; benchmarks</li>
<li>Practical RAG use cases</li>
</ul>
<p>Stay tuned!</p>
<h2 id="heading-references">References</h2>
<ol>
<li><a target="_blank" href="https://arxiv.org/abs/2005.11401">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</a></li>
<li><a target="_blank" href="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/">What Is Retrieval-Augmented Generation, aka RAG?</a></li>
<li><a target="_blank" href="https://www.promptingguide.ai/techniques/rag">Retrieval Augmented Generation (RAG)</a></li>
</ol>
]]></content:encoded></item></channel></rss>