AI agents are no longer science fiction – they’re already drafting code, triaging security alerts, negotiating freight rates, and even scouting M&A candidates, often without direct human supervision. This rapid adoption brings a new challenge: many CIOs and IT leaders admit, “We’ve unleashed AI pilots across the enterprise, but now we can’t see what they’re doing.” The lack of observability isn’t just inconvenient – it’s a board-level risk. Unmonitored agents could inadvertently expose sensitive data, introduce malicious packages via unchecked tools, rack up runaway API bills, or quietly influence critical decisions based on faulty reasoning.
Transparency and accountability must therefore become foundational pillars of any enterprise AI strategy. As both Microsoft and Google have stressed, agent actions need to be observable and auditable via robust logging and clear action characterization.
In this post, I outline practical steps to evolve from agent chaos to operational clarity, drawing on lessons from our recent CIO Think Tank session with Iain Paterson (WELL Health), Dan Lausted (Paragon Micro), and Michael Paul (Speedy Transport), as well as the latest Microsoft best practices for secure AI applications.
Early detection is critical, especially as threats shift toward the AI software supply chain. The average cost of a data breach now exceeds $4.8 million, but that figure drops sharply when a vulnerability is caught during development rather than in production, shrinking a potential incident from millions of dollars to mere hundreds.
Transparent agents make it easier to demonstrate adherence to policies and retain tamper-proof records.
When stakeholders clearly see how agents behave and deliver value, both adoption and budget support tend to accelerate.
As Iain Paterson, CSO of WELL Health, put it, “An agent is basically a high-speed employee you didn’t train. Would you let a new hire roam without supervision?”
In other words, the business will trust AI when IT provides the supervision and metrics to manage it responsibly.
To achieve end-to-end clarity, treat observability as a layered architecture. The following tables break down what to monitor at each layer.
Important Note: The tools in this space are evolving rapidly; the products mentioned here are current examples of a category of solution (e.g., observability platforms, LLM gateways). The principle is to understand the category and select the best-fit tool for your enterprise stack.
Each layer feeds a central observability plane—think “Splunk for AI.” Early adopters are building lightweight proxies today because commercial suites remain immature. Notably, Microsoft’s Azure AI Foundry is introducing a unified monitoring dashboard (in preview) that brings together performance, safety, quality, and usage metrics in real time, much of it via integration with Azure Monitor Application Insights.
Whether you use a built-in solution or assemble your own, the goal is the same: centralize visibility so you can query and correlate across all layers when investigating an incident or optimizing performance.
If a problem occurs, you want to quickly answer: What was the prompt? What did the agent decide internally? Did it call any external system? How long did it run? What result did it produce and was it correct? Complete observability means having those answers at your fingertips.
Create a mandatory exit point for all Large Language Model (LLM) requests coming from your applications. This LLM proxy works like an API gateway: it intercepts every prompt and response, logs the traffic, and enforces rate limits and quotas. It also scrubs or masks sensitive data and can add standard system instructions, like Azure’s Prompt Guidelines and safe-completion policies. The gateway can also check content safety in real time. For example, you can use Azure’s Prompt Shields and the new Spotlighting feature at this point to spot and remove prompt injection attacks, which occur when hidden instructions in inputs try to trick the model.
Quick win: open-source projects like LLM-Proxy or LiteLLM provide drop-in proxy servers that mimic the OpenAI API, so redirecting your traffic through them (via a simple base URL change) and containerizing them can often be done in hours. In an enterprise Azure environment, you might also place Azure API Management in front of the Azure OpenAI service, gaining logging and policy control.
The end result is that every prompt and response passes through a checkpoint where it is recorded and can be modified or blocked if needed, dramatically reducing the chance of unmonitored or malicious model interactions.
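To make the pattern concrete, here is a minimal sketch of such a gateway, assuming a FastAPI/httpx pass-through in front of an OpenAI-compatible endpoint. The UPSTREAM_URL/UPSTREAM_KEY variables, the single email-masking regex, and the log format are illustrative placeholders, not a production design.

```python
import json
import logging
import os
import re
import time

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logging.basicConfig(filename="llm_gateway.log", level=logging.INFO)
app = FastAPI()

UPSTREAM_URL = os.environ["UPSTREAM_URL"]  # placeholder: your OpenAI-compatible endpoint
UPSTREAM_KEY = os.environ["UPSTREAM_KEY"]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # naive example pattern; real PII masking needs more

def scrub(text: str) -> str:
    """Mask obvious PII before anything is written to the observability log."""
    return EMAIL.sub("[EMAIL]", text)

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request) -> JSONResponse:
    body = await request.json()
    started = time.time()
    async with httpx.AsyncClient(timeout=60) as client:
        upstream = await client.post(
            f"{UPSTREAM_URL}/v1/chat/completions",
            headers={"Authorization": f"Bearer {UPSTREAM_KEY}"},
            json=body,
        )
    payload = upstream.json()
    # Record the scrubbed prompt/response pair plus latency for the central observability plane.
    logging.info(json.dumps({
        "prompt": scrub(json.dumps(body.get("messages", []))),
        "response": scrub(json.dumps(payload)),
        "latency_s": round(time.time() - started, 2),
    }))
    return JSONResponse(payload, status_code=upstream.status_code)
```

This is also the natural place to bolt on rate limits, quotas, and content-safety checks before the response is returned to the calling application.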
Each agent “tool” (database connector, email sender, cloud-storage writer) must carry a signed manifest declaring scopes and limits. CI/CD pipelines block unsigned adapters, reducing supply chain risk.
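As one illustration of that CI/CD gate, here is a hedged sketch assuming Ed25519-signed JSON manifests verified with the `cryptography` library; the manifest fields and the `load_trusted_manifest` helper are hypothetical, and your pipeline’s signing scheme will likely differ.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def load_trusted_manifest(manifest_path: str, sig_path: str, pubkey_bytes: bytes) -> dict | None:
    data = open(manifest_path, "rb").read()
    signature = open(sig_path, "rb").read()
    try:
        Ed25519PublicKey.from_public_bytes(pubkey_bytes).verify(signature, data)
    except InvalidSignature:
        return None  # CI should fail the build here: unsigned or tampered adapters never ship
    manifest = json.loads(data)
    # Example of declared limits the runtime can then enforce:
    # {"tool": "email_sender", "scopes": ["send:internal"], "max_calls_per_hour": 50}
    return manifest
```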
Michael Paul’s team uses synthetic or tokenized datasets for new agents. Only after passing deterministic tests do agents receive read-only production credentials; write access requires a second approval.
Traditional ML monitoring tracked model drift; agents also require workflow drift metrics (a minimal calculation sketch follows the list below):
• Consistency Rate – semantically similar inputs yield functionally equivalent outputs in ≥ 95% of test cases.
• Escalation Rate – % of tasks requiring human intervention.
• Exploit Exposure – # of successful jailbreaks vs. attempts.
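Here is the minimal calculation sketch referenced above. It assumes you already collect per-task records; the field names (`escalated_to_human`, `jailbreak_succeeded`) and the exact-match comparison standing in for semantic equivalence are illustrative only.

```python
def consistency_rate(paired_runs: list[tuple[str, str]]) -> float:
    """paired_runs: (output_a, output_b) for semantically similar inputs; target >= 0.95."""
    equivalent = sum(1 for a, b in paired_runs if a == b)  # swap in a semantic comparison in practice
    return equivalent / len(paired_runs)

def escalation_rate(tasks: list[dict]) -> float:
    """Share of tasks that required human intervention."""
    return sum(t["escalated_to_human"] for t in tasks) / len(tasks)

def exploit_exposure(red_team_results: list[dict]) -> float:
    """Successful jailbreaks as a fraction of attempts, from red-team or canary prompts."""
    return sum(r["jailbreak_succeeded"] for r in red_team_results) / len(red_team_results)
```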
Retain separate logs for (a) operational replay, (b) legal/regulatory evidence. Immutable storage (e.g., Azure Immutable Blob) satisfies retention policies without hampering day-to-day analytics.
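One hedged sketch of that dual-log pattern, assuming the `azure-storage-blob` SDK, an `agent-evidence` container that already has a time-based retention (immutability) policy configured, and event records carrying `agent_id`/`event_id` fields:

```python
import json
import logging

from azure.storage.blob import BlobServiceClient

analytics_log = logging.getLogger("agent.operational")  # hot store for day-to-day analytics

def record_agent_event(blob_service: BlobServiceClient, event: dict) -> None:
    line = json.dumps(event)
    analytics_log.info(line)  # copy 1: operational replay, dashboards, debugging
    # Copy 2: write-once evidence; the container's retention policy rejects overwrites and deletes.
    blob_service.get_blob_client(
        container="agent-evidence",
        blob=f"{event['agent_id']}/{event['event_id']}.json",
    ).upload_blob(line)
```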
Adopt attribute-based access control (ABAC) for prompts. Example: finance personas are blocked from invoking code generation tools; developers are restricted from reading sensitive HR payloads.
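A toy version of that kind of policy check, evaluated at the gateway before a tool call is allowed; the attribute names, personas, and tool identifiers are examples, not a standard:

```python
# Deny rules keyed on caller attributes; evaluated before the agent may invoke a tool.
POLICY = [
    {"tool": "code_generation", "deny": {"persona": "finance"}},
    {"tool": "hr_records_read", "deny": {"persona": "developer"}},
]

def is_allowed(user_attrs: dict, tool: str) -> bool:
    for rule in POLICY:
        if rule["tool"] == tool and all(user_attrs.get(k) == v for k, v in rule["deny"].items()):
            return False
    return True

assert not is_allowed({"persona": "finance"}, "code_generation")
assert is_allowed({"persona": "developer"}, "code_generation")
```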
Scenario
WELL Health Technologies embedded an AI orchestration layer in its Security Operations Center. Agents now auto-close benign EDR alerts and collect context for deeper investigations.
Transparency Approach
Outcomes
The lesson: when agents are transparent, efficiency and assurance rise in tandem.
Accountability extends beyond seeing actions; we must also be able to explain them. Techniques include structured reasoning fields in agent outputs, self-critique steps, and lightweight evaluation harnesses (piloted in weeks 9–10 of the roadmap below).
Implement these patterns early; retrofitting clarity after wide deployment is exponentially harder. Once an agent is widely deployed and integrated into business processes, adding logging or changing prompts to get explanations can be like changing the engine of a plane in mid-flight.
Incorporate explainability from the outset. It will pay dividends not only in compliance and debugging, but also in continuous improvement – your team can learn much more from an AI when the AI shows its work.
I’ve seen CISOs either clamp down—banning all external LLMs—or relinquish oversight, hoping for the best. Both extremes drive shadow AI. Instead, build a culture where visibility tools are shared resources that empower teams.
How do you cultivate this? A few practices that have proven effective:
When teams realize logging protects them—guarding budgets, reputations, and customer trust—compliance becomes voluntary, not forced. Not a “Department of No” but a “Center of Know” that provides knowledge, safety, and support for innovative uses of AI.
If you’re wondering how to get from today’s state (maybe a scattered set of AI pilot projects, each doing their own thing) to a well-governed, transparent AI environment, it helps to break the journey into manageable phases.
Here’s a 90-day (roughly 3-month) roadmap that many organizations can follow to build a baseline of AI observability and control.
Weeks 1–2: Inventory Agents & Data Flows:
Start by cataloging what you have. Which AI agents or LLM applications already exist in your enterprise (even experimental ones)? Where are they running, and who owns them? What data sources do they touch, and what APIs or tools do they call?
The deliverable is a living catalog or register of AI use cases, including for each: the business purpose, the technical implementation (model used, etc.), the integration points (data in/out), and current monitoring if any. This inventory is crucial – you can’t secure or monitor what you don’t know about. Don’t be surprised if you find more than you expected; this step often uncovers a few “rogue” experiments. The outcome is awareness and a list of targets for the following steps.
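If you want the catalog to be machine-readable from day one, a simple schema like the following works; the fields mirror the items above, and the `AgentRegisterEntry` name, structure, and sample entry are assumptions rather than a mandated format.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRegisterEntry:
    name: str
    business_purpose: str
    owner: str
    model: str                                              # technical implementation (model used, etc.)
    data_sources: list[str] = field(default_factory=list)   # integration points: data in/out
    tools_and_apis: list[str] = field(default_factory=list)
    monitoring_in_place: str = "none"                       # current monitoring, if any

register = [
    AgentRegisterEntry(
        name="support-faq-bot",
        business_purpose="Deflect tier-1 helpdesk tickets",
        owner="IT Service Desk",
        model="gpt-4o (Azure OpenAI)",
        data_sources=["knowledge base"],
        tools_and_apis=["ticketing API"],
    ),
]
```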
Weeks 3–4: Stand Up an LLM Proxy:
Pick one or two high-priority AI applications (perhaps those in production or those handling sensitive data) and route their model calls through a newly deployed LLM proxy/gateway. This might be a self-hosted solution (like the open-source ones mentioned earlier) or a service in preview such as Azure’s prompt management proxy. Aim to capture at least 80% of the prompt/response traffic out of the gate.
You’ll likely uncover some integration challenges – e.g., switching the base URLs, handling authentication – but by the end of week 4 you should see logs of prompts and completions flowing into a central store. Also implement basic PII masking at this stage: ensure the proxy scrubs obvious personal data from the logs or uses a PII detection service (such as Azure AI Language’s PII detection) to redact sensitive info, so that monitoring doesn’t create a privacy problem.
The key deliverable here is a working “wire tap” on your AI’s conversations, with an initial view of what’s being asked and answered across that 80% of traffic.
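On the application side, the redirect is often just a base URL change. A sketch with the OpenAI Python SDK, where the gateway URL, key, and model name are placeholders:

```python
from openai import OpenAI

# Point the existing app at the internal gateway instead of the public endpoint.
client = OpenAI(
    base_url="https://llm-gateway.internal.example/v1",  # placeholder gateway URL
    api_key="gateway-issued-key",                        # placeholder credential
)
reply = client.chat.completions.create(
    model="gpt-4o",  # whichever model the gateway routes to
    messages=[{"role": "user", "content": "Summarize yesterday's closed tickets."}],
)
print(reply.choices[0].message.content)
```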
Weeks 5–6: Define Reliability SLOs:
With some visibility in place, now define what good looks like for your AI agents. Work with business and technical leaders to set service level objectives (SLOs) or benchmarks for the AI’s performance and behavior. For example, you might decide: “Our customer support AI should correctly answer at least 95% of known FAQ questions, and anything below that triggers a retraining.” Or, “The sales email generator AI should have <1% rate of unacceptable content outputs (as flagged by compliance).”
Also consider SLOs for cost: “This agent should not exceed $X in compute per month without approval.” Microsoft recommends establishing metrics like consistency and escalation rates, as discussed, and getting executive agreement on those targets. Document these reliability and safety objectives and get them approved by the stakeholders (IT governance, compliance, line-of-business owners).
The deliverable is an agreed set of AI performance metrics and targets. This not only guides the AI team’s improvements but also gives executives confidence that there is a measurable way to judge the AI’s impact and risks.
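To keep those targets actionable, some teams encode them as configuration so a scheduled job can flag breaches automatically. A hypothetical sketch reusing the example thresholds above; the metric names are illustrative:

```python
SLOS = {
    "support_agent": {
        "faq_accuracy_min": 0.95,              # "answer at least 95% of known FAQs correctly"
        "unacceptable_output_rate_max": 0.01,  # "<1% unacceptable content outputs"
        "monthly_compute_budget_usd": 5000,    # stands in for "$X per month"
    },
}

def slo_breaches(agent: str, observed: dict) -> list[str]:
    slo, issues = SLOS[agent], []
    if observed["faq_accuracy"] < slo["faq_accuracy_min"]:
        issues.append("accuracy below target: trigger a retraining review")
    if observed["unacceptable_output_rate"] > slo["unacceptable_output_rate_max"]:
        issues.append("unacceptable-content rate above target: notify compliance")
    if observed["monthly_compute_usd"] > slo["monthly_compute_budget_usd"]:
        issues.append("compute spend over budget: requires approval")
    return issues
```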
Weeks 7–8: Instrument Top-Risk Agents (Layers 2 & 3):
Now that org-wide policies are set, focus on the most critical or risky AI agent and enhance its observability in depth. This means enabling Layer 2 (agent logic) and Layer 3 (external action) monitoring for it. Concretely, if the agent uses a framework like LangChain or Azure’s Orchestration, turn on verbose logging of each step/tool use (many SDKs have debug modes for this). If not, add custom logging in the code for each decision point. Simultaneously, set up monitoring on any external actions: for instance, if the agent can access a database, ensure all its DB queries are logged (perhaps by routing through an API that logs or using database auditing features).
Feed these detailed traces into your security information and event management (SIEM) system – such as Microsoft Sentinel – or whatever log aggregator your security team uses. By the end of week 8, your highest-risk AI agent should be essentially transparent: you can see its chain-of-thought and any outside calls in one place. If it tries something odd or unauthorized, you’ll catch it almost immediately because you’ve instrumented those potential points of failure. This step often uncovers surprises and is a chance to fix issues (maybe you discover the agent was using an API key that had more access than needed, etc.).
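If your agent framework lacks built-in tracing, a framework-agnostic fallback is to wrap every tool the agent can call so each decision point and external action lands in the same log stream your SIEM ingests. A sketch, with illustrative field names and a hypothetical `crm_lookup` tool:

```python
import functools
import json
import logging
import time

trace_log = logging.getLogger("agent.trace")  # ship this stream to your SIEM

def traced_tool(tool_name: str):
    """Wrap a tool so every invocation (and failure) is logged as a structured event."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started, error = time.time(), None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                trace_log.info(json.dumps({
                    "tool": tool_name,
                    "arguments": repr((args, kwargs)),  # consider scrubbing before logging
                    "error": error,
                    "duration_s": round(time.time() - started, 3),
                }))
        return wrapper
    return decorator

@traced_tool("crm_lookup")
def crm_lookup(customer_id: str) -> dict:
    ...  # the real external call (DB query, API request) goes here
```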
The deliverable here is a fully instrumented agent feeding data to your central logging/SIEM, with alerts configured for any truly abnormal events.
Weeks 9–10: Pilot Self-Critique and Explanation:
Take two important AI workflows (for example, the two most common tasks the agent does, or one common task and one high-risk task) and add explainability features to them. This could involve modifying the agent’s prompt to include a “reasoning” field in its output, or implementing a simple analysis script that takes the agent’s output and compares it to a ground truth (if available) for accuracy.
The idea is to start capturing the why and how for key decisions. For instance, if the agent closes a helpdesk ticket automatically, have it add a note like “Closed because the issue matched known resolution pattern X, confidence 98%.” Encourage it to justify its actions in logs or comments. You might also deploy a basic evaluation harness for these workflows: e.g., once a day, run a test set of prompts through the agent and calculate consistency or quality metrics, then log those results. By week 10, you have an initial feedback loop in place.
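A lightweight way to capture that “why” is to ask the model for a structured answer that includes its reasoning and confidence, then log it alongside the action. A sketch using the OpenAI Python SDK’s JSON output mode; the field names and the ticket-closing example are assumptions:

```python
import json

from openai import OpenAI

client = OpenAI()  # or set base_url to route through your gateway

def close_ticket_with_rationale(ticket_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Decide whether this helpdesk ticket can be closed. "
                "Reply as JSON with fields: action, reasoning, confidence."
            )},
            {"role": "user", "content": ticket_text},
        ],
    )
    decision = json.loads(response.choices[0].message.content)
    # e.g. {"action": "close", "reasoning": "matches known resolution pattern X", "confidence": 0.98}
    return decision
```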
The deliverable might be a report or dashboard that shows, for those two workflows, the agent’s reasoning excerpts and a metric like “accuracy yesterday 97% vs last week 95%, trend improving.” This pilot will demonstrate the value of explainability to others and can be extended to more workflows later. It’s essentially a trial of the self-regulating AI concept, where the agent’s own outputs are being monitored for quality continuously (what Microsoft calls continuous evaluation in Azure AI Foundry).
Weeks 11–12: Executive Dashboard & Review:
In the final stretch of the quarter, consolidate everything into a high-level AI Ops dashboard for leadership. This “single pane of glass” should display the volume of AI agent activities, key metrics (the SLOs you defined), any notable anomalies or incidents caught, and the trend lines.
For example, it might have: number of agent-driven transactions this week, % that were successful, % escalated to a human; cost this week vs. budget; any policy violations or blocks that occurred (e.g., “5 prompts were blocked for disallowed content”); and an overall reliability score. If you have multiple agents, it could break down stats by agent or department. Leverage your existing BI tools or Azure Monitor workbooks to build this; it doesn’t have to be fancy at first, and even a spreadsheet works as a prototype.
The point is to provide visibility up the chain, so executives can see at a glance that “AI is under control and delivering value.” In week 12, hold an executive review meeting to walk through the dashboard and the progress made in the last 3 months. Show before-and-after examples of visibility (e.g., “In January we had no idea what the marketing GPT was doing; by March we can show every prompt and response and how it aligns to policy.”). This is your chance to get buy-in (and maybe budget) for the next phase – broader rollout or more advanced tooling. The deliverable is a polished AI governance dashboard and a presentation that tells the story of improved clarity.
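Behind that single pane of glass can sit something very modest at first, for example a weekly rollup computed from the JSON event logs your gateway and instrumentation already produce. A sketch with illustrative field names:

```python
import json
from collections import Counter

def weekly_summary(log_lines: list[str]) -> dict:
    events = [json.loads(line) for line in log_lines]
    outcomes = Counter(e.get("outcome", "unknown") for e in events)
    total = len(events) or 1
    return {
        "agent_transactions": len(events),
        "success_rate": outcomes["success"] / total,
        "escalated_to_human_rate": outcomes["escalated"] / total,
        "blocked_prompts": outcomes["blocked"],        # e.g. disallowed content
        "spend_usd": round(sum(e.get("cost_usd", 0) for e in events), 2),
    }
```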
These small wins compound: gaining visibility on one agent gives you a template to do the same for others; one dashboard for leadership will spark ideas for additional metrics they want to see. With a defensible foundation in place, you can scale AI with much greater confidence.
The goal is that, three months from now, no one says “we have AI running and don’t know what it’s doing” – instead they’ll say, “we have AI running and we watch it like a hawk (in real time and after the fact), so we do know what it’s doing and can steer it appropriately.”
AI agents will only accelerate—multimodal inputs, larger action spaces, autonomous fleets. As Microsoft’s guidance highlights and our recent CIO Think Tank findings show time and again, security and governance can’t be an afterthought; they must be woven into every stage of AI development and operations.
Unmanaged chaos is optional. With deliberate design and cultural alignment, clarity becomes your competitive edge. The sooner IT leaders embed transparency and accountability, the sooner innovation can flourish without sleepless nights.
Richard Harbridge is a Microsoft MVP and leader in the Microsoft ecosystem who frequently speaks and shares on AI adoption, digital workplace strategy, and enterprise governance.