Teaching Mistral Agents to Say No: Content Moderation from Prompt to Response


Jun 23, 2025 - 11:00

In this tutorial, we’ll implement content moderation guardrails for Mistral agents to ensure safe and policy-compliant interactions. By using Mistral’s moderation APIs, we’ll validate both the user input and the agent’s response against categories like financial advice, self-harm, PII, and more. This helps prevent harmful or inappropriate content from being generated or processed — a key step toward building responsible and production-ready AI systems.

The categories are mentioned in the table below:
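The original table did not survive formatting. As a reference sketch, the category names below follow Mistral's moderation API documentation; treat both the names and the descriptions as assumptions to verify against the current docs:

```python
# Moderation categories reported by Mistral's moderation API, with short
# descriptions. Names taken from Mistral's docs; verify against the live API.
MODERATION_CATEGORIES = {
    "sexual": "sexually explicit content",
    "hate_and_discrimination": "hateful or discriminatory content",
    "violence_and_threats": "violent content or threats",
    "dangerous_and_criminal_content": "dangerous or criminal instructions",
    "selfharm": "self-harm or suicide-related content",
    "health": "unqualified health advice",
    "financial": "unqualified financial advice",
    "law": "unqualified legal advice",
    "pii": "personally identifiable information",
}

def describe(category: str) -> str:
    """Return a human-readable description for a moderation category."""
    return MODERATION_CATEGORIES.get(category, "unknown category")
```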

Setting up dependencies

Install the Mistral library
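The SDK is available from PyPI and can be installed with pip:

```shell
pip install mistralai
```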

Loading the Mistral API Key

You can get an API key from https://console.mistral.ai/api-keys

from getpass import getpass
MISTRAL_API_KEY = getpass('Enter Mistral API Key: ')

Creating the Mistral client and Agent

We’ll begin by initializing the Mistral client and creating a simple Math Agent using the Mistral Agents API. This agent will be capable of solving math problems and evaluating expressions.

from mistralai import Mistral

client = Mistral(api_key=MISTRAL_API_KEY)
math_agent = client.beta.agents.create(
    model="mistral-medium-2505",
    description="An agent that solves math problems and evaluates expressions.",
    name="Math Helper",
    instructions="You are a helpful math assistant. You can explain concepts, solve equations, and evaluate math expressions using the code interpreter.",
    tools=[{"type": "code_interpreter"}],
    completion_args={
        "temperature": 0.2,
        "top_p": 0.9
    }
)
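To exercise the agent, a minimal sketch of sending it a prompt might look like the following. This assumes the Agents API exposes conversations via `client.beta.conversations.start`, as in recent `mistralai` SDK versions; the helper name is ours:

```python
def ask_math_agent(client, agent_id: str, prompt: str):
    """Start a conversation with the given agent and return the raw response.

    `client` is any object exposing `beta.conversations.start`; with a real
    Mistral client this performs a network call and needs a valid API key.
    """
    return client.beta.conversations.start(agent_id=agent_id, inputs=prompt)

# Example (requires network access and an API key):
# response = ask_math_agent(client, math_agent.id, "Solve 2x + 3 = 11 for x.")
```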

Creating Safeguards

Getting the Agent response

Since our agent utilizes the code_interpreter tool to execute Python code, we’ll combine both the general response and the final output from the code execution into a single, unified reply.

def get_agent_response(response) -> str:
    # The first output holds the agent's text; with the code_interpreter tool,
    # the third output (index 2) holds the executed code's result, when present.
    general_response = response.outputs[0].content if len(response.outputs) > 0 else ""
    code_output = response.outputs[2].content if len(response.outputs) > 2 else ""

    if code_output:
        return f"{general_response}\n\nCode Output:\n{code_output}"
    return general_response
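As a sketch of the moderation safeguard itself, the helpers below assume the SDK's `client.classifiers.moderate` endpoint and a `category_scores` mapping on each result, per Mistral's moderation docs; the 0.2 threshold is an arbitrary choice for illustration:

```python
def flagged_categories(category_scores: dict, threshold: float = 0.2) -> list:
    """Return the moderation categories whose score exceeds the threshold."""
    return [cat for cat, score in category_scores.items() if score > threshold]

def moderate_text(client, text: str, threshold: float = 0.2) -> list:
    """Run Mistral's raw-text moderation on `text` and return flagged categories.

    Requires network access and a valid API key; `mistral-moderation-latest`
    is the moderation model name per Mistral's docs.
    """
    response = client.classifiers.moderate(
        model="mistral-moderation-latest",
        inputs=[text],
    )
    return flagged_categories(response.results[0].category_scores, threshold)
```

The same threshold check can be applied to both the user's prompt (before the agent runs) and the combined agent reply (before it is returned), rejecting the interaction if any category is flagged.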