<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AWS Community Builders</title>
    <description>The latest articles on DEV Community by AWS Community Builders (@aws-builders).</description>
    <link>https://dev.to/aws-builders</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png</url>
      <title>DEV Community: AWS Community Builders</title>
      <link>https://dev.to/aws-builders</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aws-builders"/>
    <language>en</language>
    <item>
      <title>AI Terms, Simply Explained: Notes from My Learning Journey</title>
      <dc:creator>Sandeep Sangu</dc:creator>
      <pubDate>Wed, 20 May 2026 05:38:51 +0000</pubDate>
      <link>https://dev.to/aws-builders/ai-terms-simply-explained-notes-from-my-learning-journey-3b52</link>
      <guid>https://dev.to/aws-builders/ai-terms-simply-explained-notes-from-my-learning-journey-3b52</guid>
      <description>&lt;p&gt;While preparing for the &lt;code&gt;AWS Certified AI Practitioner exam&lt;/code&gt;, I thought it would be helpful to write down my understanding of some common &lt;code&gt;AI&lt;/code&gt; and &lt;code&gt;GenAI&lt;/code&gt; terms. ✍️&lt;/p&gt;

&lt;p&gt;These notes reflect my understanding, shaped by different learning resources, including publicly available &lt;code&gt;AWS&lt;/code&gt; content, and by my own experience.&lt;/p&gt;

&lt;p&gt;This is not a textbook or a glossary. 📚 &lt;/p&gt;

&lt;p&gt;It’s a simple explanation of key terms, written in a way that I would have liked to read when I first started — with real-world analogies and no jargon.&lt;/p&gt;

&lt;p&gt;Let’s get started. 🚀&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Fundamentals Matter
&lt;/h3&gt;

&lt;p&gt;As we all know, terms like &lt;code&gt;Machine Learning&lt;/code&gt;, &lt;code&gt;AI&lt;/code&gt;, &lt;code&gt;Generative AI&lt;/code&gt;, and &lt;code&gt;Agentic AI&lt;/code&gt; are becoming common. These are the ones we hear the most, but there are many more working quietly behind the scenes.&lt;/p&gt;

&lt;p&gt;Personally, I believe staying relevant and up to date is the key.&lt;/p&gt;

&lt;p&gt;When you get the fundamentals right, it becomes easier to connect the dots on real AI projects, and that confidence makes a real difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Fundamentals&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;&lt;code&gt;Artificial Intelligence (AI)&lt;/code&gt;&lt;/strong&gt; is the idea of making computers do things that would normally require human intelligence. 🤖&lt;/p&gt;

&lt;p&gt;Think of it as teaching machines to solve problems, understand language, or even make decisions, all tasks that previously needed a person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-life examples we already use:&lt;/strong&gt; 📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice Assistants&lt;/strong&gt; like Siri and Alexa that understand what you say and respond. 🗣️&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation Systems&lt;/strong&gt; on Netflix or Amazon that suggest what to watch or buy. 🎬🛒&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatbots&lt;/strong&gt; that help answer your questions on websites. 💬&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AI&lt;/code&gt; is now behind many tools and services we use daily. Knowing the basics helps you understand how these systems are built and what’s happening behind the scenes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;🔍 Quick Note: Why Data Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All &lt;code&gt;AI systems&lt;/code&gt;, whether they are built on &lt;code&gt;Machine Learning&lt;/code&gt;, &lt;code&gt;Generative AI&lt;/code&gt;, or &lt;code&gt;Chatbots&lt;/code&gt;, rely heavily on data. Data is what helps AI learn, find patterns, and make decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where does the data come from?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It can be collected from public datasets, user interactions, company records, or even purchased from authorized data providers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In short: &lt;em&gt;No data, no AI.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The better and more representative the data, the more accurate the AI's results tend to be.&lt;/p&gt;

&lt;p&gt;2️⃣ &lt;strong&gt;&lt;code&gt;Machine Learning (ML)&lt;/code&gt;&lt;/strong&gt; 🧠 is a branch of &lt;code&gt;AI&lt;/code&gt; focused on teaching computers to learn from data, without being explicitly programmed for every task.&lt;/p&gt;

&lt;p&gt;While AI is the broader idea of making machines intelligent, &lt;strong&gt;ML is one way we achieve it&lt;/strong&gt; — by helping machines find patterns in data and improve over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-life examples:&lt;/strong&gt; 📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Movie recommendations&lt;/strong&gt; on Netflix that get better the more you watch. 🎬&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spam filters&lt;/strong&gt; in your email that learn what to block. ✉️🚫&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud detection systems&lt;/strong&gt; 🏦 used by banks to spot unusual transactions. &lt;/li&gt;
&lt;/ul&gt;
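
&lt;p&gt;To make "learning from data" a bit more concrete, here is a minimal Python sketch of the spam filter idea. It is only an illustration under simple assumptions: the four training messages and the function names are made up for this example, and real spam filters use far more data and more careful statistics. The point is that nobody writes a rule like "block the word prize"; the program counts what it saw in the examples.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical, tiny training set: (message, is_spam).
# Real filters learn from millions of labelled messages.
TRAINING = [
    ("win a free prize now", True),
    ("free money claim your prize", True),
    ("meeting moved to friday", False),
    ("lunch on friday with the team", False),
]


def train(examples):
    """Count how often each word appears in spam and in normal mail."""
    spam_words, normal_words = Counter(), Counter()
    for text, is_spam in examples:
        target = spam_words if is_spam else normal_words
        target.update(text.split())
    return spam_words, normal_words


def looks_like_spam(text, spam_words, normal_words):
    """Flag a message when its words were seen more often in spam."""
    words = text.split()
    spam_score = sum(spam_words[word] for word in words)
    normal_score = sum(normal_words[word] for word in words)
    return spam_score > normal_score


if __name__ == "__main__":
    spam_words, normal_words = train(TRAINING)
    print(looks_like_spam("claim your free prize", spam_words, normal_words))
    print(looks_like_spam("team meeting on friday", spam_words, normal_words))
```

&lt;p&gt;Adding more labelled messages to &lt;code&gt;TRAINING&lt;/code&gt; changes the counts and therefore the decisions, without touching the code. That is the "improves over time" part in miniature.&lt;/p&gt;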

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Machine Learning&lt;/code&gt; powers many of the AI applications we interact with daily. Understanding how ML works helps demystify how intelligent systems make decisions based on data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;3️⃣ &lt;strong&gt;&lt;code&gt;Artificial Neural Networks (ANN)&lt;/code&gt;&lt;/strong&gt; are computer systems inspired by how the human brain works.&lt;/p&gt;

&lt;p&gt;They are made up of &lt;code&gt;layers&lt;/code&gt; of simple units called &lt;strong&gt;neurons&lt;/strong&gt;, connected to each other, and are designed to recognize patterns in data — much like how our brain processes information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;input layer&lt;/strong&gt; receives the raw data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hidden layers&lt;/strong&gt; work through the data to find patterns and relationships.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;output layer&lt;/strong&gt; gives the final result or decision.&lt;/li&gt;
&lt;/ul&gt;
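
&lt;p&gt;The three layers above can be sketched in a few lines of Python. This is a toy example, not a trained model: the weights and biases are hand-picked placeholder numbers, and the helper names are my own. In a real network these numbers are learned from data.&lt;/p&gt;

```python
import math


def relu(x):
    """Pass positive values through and turn negative values into zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One layer: each neuron takes a weighted sum of the inputs,
    adds its bias, and applies the activation function."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical hand-picked numbers; a real network learns these in training.
HIDDEN_WEIGHTS = [[0.5, -0.2], [0.3, 0.8]]
HIDDEN_BIASES = [0.0, -0.1]
OUTPUT_WEIGHTS = [[1.0, -1.0]]
OUTPUT_BIASES = [0.0]


def forward(raw_inputs):
    """Input layer, then one hidden layer, then the output layer."""
    hidden = dense(raw_inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    output = dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, sigmoid)
    return output[0]


if __name__ == "__main__":
    print(forward([1.0, 2.0]))
```

&lt;p&gt;The result is a single number between 0 and 1, which could be read as a yes/no style decision. Deep networks follow the same pattern with many more hidden layers.&lt;/p&gt;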

&lt;p&gt;&lt;strong&gt;Real-life examples:&lt;/strong&gt; 📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Facial recognition&lt;/strong&gt; systems that unlock your phone. 📱🔓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice recognition&lt;/strong&gt; 🎙️ in assistants like Alexa or Google Assistant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handwriting recognition&lt;/strong&gt; when you digitize notes. ✍️📝&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Neural networks&lt;/code&gt; are at the heart of many AI applications that require &lt;code&gt;pattern recognition&lt;/code&gt;. They help machines &lt;code&gt;process&lt;/code&gt; complex data and make &lt;code&gt;decisions&lt;/code&gt; more like how humans do.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;4️⃣ &lt;strong&gt;&lt;code&gt;Deep Learning&lt;/code&gt;&lt;/strong&gt; is a type of Machine Learning that uses large neural networks with many layers, which is why it's called &lt;code&gt;deep&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can think of it as a more powerful way for machines to learn complex tasks in stages: early layers pick up simple patterns (such as edges in an image) and later layers combine them into more complex ones (such as faces). It is similar to how we build a house brick by brick 🧱🏠, or how we first set up infrastructure before deploying an app in tech projects. 🖥️🚀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-life examples:&lt;/strong&gt; 📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-driving cars&lt;/strong&gt; 🚗🚦recognizing traffic signs and pedestrians. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Photo apps&lt;/strong&gt; 📸🧑‍🤝‍🧑 that automatically recognize and tag faces.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Deep Learning&lt;/code&gt; has made it possible for machines to perform tasks that once needed human-level skills — like seeing, recognizing, and even understanding — at a much higher scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;5️⃣ &lt;strong&gt;&lt;code&gt;Generative AI (GenAI)&lt;/code&gt;&lt;/strong&gt; is a type of AI that creates new content — like text, images, or even music — based on what it has learned.🧩&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;You can think of it like a chef who has studied thousands of recipes and can now create a new dish using that knowledge&lt;/em&gt;.🍳&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Real-life examples we already see:&lt;/strong&gt; 📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT&lt;/strong&gt; helping write emails or answer questions.📝&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Q Developer&lt;/strong&gt; suggesting code, helping troubleshoot, and assisting in building AWS applications.💻&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI tools&lt;/strong&gt; that generate artwork from text prompts.🎨&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Generative AI&lt;/code&gt; is speeding up how we create, design, and problem-solve — helping us move from ideas to results much faster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;6️⃣ &lt;strong&gt;&lt;code&gt;Foundation Models (FM)&lt;/code&gt;&lt;/strong&gt; are large &lt;code&gt;AI models&lt;/code&gt; trained on a huge variety of data — text, images, or both — so they can handle many different tasks without being specialized for just one thing.&lt;/p&gt;

&lt;p&gt;You can think of a &lt;code&gt;Foundation Model&lt;/code&gt; like a strong base in construction — once built, it can support different types of buildings on top.🏗️🏢&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-life examples you might know:&lt;/strong&gt;📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4&lt;/strong&gt;,📝which powers ChatGPT for understanding and generating text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable Diffusion&lt;/strong&gt;, 🎨used for creating realistic images from text prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of building a new AI model for every task, &lt;code&gt;Foundation Models&lt;/code&gt; give us a powerful starting point that can be &lt;code&gt;fine-tuned&lt;/code&gt; for specific needs — making AI development faster and more flexible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;7️⃣ &lt;strong&gt;&lt;code&gt;Large Language Models (LLMs)&lt;/code&gt;&lt;/strong&gt; are AI systems trained on huge amounts of text data to understand and generate human language.🧠📝&lt;/p&gt;

&lt;p&gt;You can think of an &lt;code&gt;LLM&lt;/code&gt; like a &lt;code&gt;smart virtual assistant&lt;/code&gt; — or like a &lt;code&gt;doctor&lt;/code&gt; who has seen thousands of cases and can diagnose based on experience, without having to look things up every time. 🩺📚&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where you see LLMs in action:&lt;/strong&gt; 📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chatbots&lt;/strong&gt; that answer customer service questions.💬&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email writing assistants&lt;/strong&gt; that suggest better sentences.✉️&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI search tools&lt;/strong&gt; that provide direct answers instead of links.🔍&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LLMs&lt;/code&gt; are powering a new generation of tools that can understand human language and respond naturally, helping make information and communication faster and easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All &lt;code&gt;LLMs&lt;/code&gt; are &lt;code&gt;Foundation Models (FMs)&lt;/code&gt;, but not all FMs are LLMs — FMs can handle other types of data too, like images or video.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AWS offers a service called &lt;code&gt;Amazon Bedrock&lt;/code&gt;, where you can access different LLMs, such as &lt;code&gt;Anthropic's Claude&lt;/code&gt;, &lt;code&gt;Meta's Llama 2&lt;/code&gt;, and AWS's own &lt;a href="https://aws.amazon.com/bedrock/amazon-models/titan/" rel="noopener noreferrer"&gt;Amazon Titan&lt;/a&gt; models, to build language-based applications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;8️⃣ &lt;strong&gt;&lt;code&gt;Natural Language Processing (NLP)&lt;/code&gt;&lt;/strong&gt; is the part of AI that helps computers understand and work with human language — both what we write and what we say. 🗣️💻&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can think of &lt;code&gt;NLP&lt;/code&gt; like teaching a computer how to read, listen, and respond in ways that feel natural to us&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Behind the scenes:&lt;/strong&gt; 🔍&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NLP&lt;/code&gt; uses algorithms that learn from lots of examples — books, conversations, articles — so that computers can figure out what we mean and reply in a way that feels human.&lt;br&gt;
Modern NLP is mostly not hard-coded with rules; it learns patterns from data and improves as it is trained on more examples, just like we do when we practice a new language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Two important sides of NLP:&lt;/strong&gt; 📚&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understanding Language (NLU):&lt;/strong&gt; This is where the computer tries to figure out what the words really mean — like detecting the mood behind a sentence (happy, sad) or guessing what someone wants based on what they said.😊😠&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creating Language (NLG):&lt;/strong&gt; This is where the computer produces language of its own, for example writing a chatbot reply or summarizing a long document. Related speech technologies sit on either side of it: &lt;code&gt;speech-to-text&lt;/code&gt; turns spoken voice into written words before they are analyzed, and &lt;code&gt;text-to-speech&lt;/code&gt; turns the generated text into spoken voice.✍️🔊&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NLP&lt;/code&gt; is what makes it possible for computers to have more natural conversations with us — whether it’s chatting with a support bot or using voice commands on a device.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;9️⃣ &lt;strong&gt;&lt;code&gt;Transformer Models&lt;/code&gt;&lt;/strong&gt; are a type of neural network architecture designed to process sequences, such as sentences, more effectively.🧠💬&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Unlike older models that read sentences one word at a time, Transformers look at the entire sentence all at once.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What makes them special is a mechanism called &lt;code&gt;attention&lt;/code&gt;: for each word, the model works out which other words in the sentence matter most for understanding it.&lt;/p&gt;

&lt;p&gt;For example, in a customer review:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“The food was amazing, but the service was slow.”&lt;/em&gt;&lt;/p&gt;

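&lt;p&gt;The model pays more attention to words like &lt;strong&gt;“food,” “amazing,” “service,”&lt;/strong&gt; and &lt;strong&gt;“slow”&lt;/strong&gt; than to small filler words, because they carry the real meaning. It also links “amazing” to “food” and “slow” to “service,” which is how it can tell that the review praises one thing and criticizes another.&lt;/p&gt;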
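
&lt;p&gt;Here is a small Python sketch of how attention weights could be computed for the review sentence above. It is heavily simplified and rests on assumptions: the two-number "embeddings" are hand-picked for illustration, and the same vectors are used as both queries and keys, whereas a real Transformer learns high-dimensional vectors and separate projections for each role. The scoring step (dot product, scaling, softmax) is the standard scaled dot-product form.&lt;/p&gt;

```python
import math

# Hypothetical 2-D "embeddings", hand-picked for illustration only.
# Real models learn vectors with hundreds or thousands of dimensions.
EMBEDDINGS = {
    "food": [1.0, 0.0],
    "amazing": [0.9, 0.1],
    "service": [0.0, 1.0],
    "slow": [0.1, 0.9],
}


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_weights(query_word, embeddings=EMBEDDINGS):
    """Scaled dot-product attention weights from one word to every word."""
    query = embeddings[query_word]
    scale = math.sqrt(len(query))
    words = list(embeddings)
    scores = [
        sum(q * k for q, k in zip(query, embeddings[word])) / scale
        for word in words
    ]
    return dict(zip(words, softmax(scores)))


if __name__ == "__main__":
    for word, weight in attention_weights("amazing").items():
        print(f"amazing attends to {word}: {weight:.2f}")
```

&lt;p&gt;With these made-up vectors, "amazing" gives its largest weight to "food" and a much smaller one to "service", which mirrors the intuition in the review example.&lt;/p&gt;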

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Transformers have become the foundation for many advanced AI systems, helping them understand language faster and more accurately than before.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pages.awscloud.com/NAMER-LN-cch-generative-ai-glossary-for-leaders-2024-learn.html?nc1=h_ls" rel="noopener noreferrer"&gt;AWS Generative AI Glossary&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://d1.awsstatic.com/training-and-certification/docs-ai-practitioner/AWS-Certified-AI-Practitioner_Exam-Guide.pdf" rel="noopener noreferrer"&gt;AWS exam guide&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>genai</category>
      <category>cloudcomputing</category>
      <category>learning</category>
    </item>
    <item>
      <title>VPC Peering: The network bridge that lets isolated resources communicate</title>
      <dc:creator>Javier Madriz</dc:creator>
      <pubDate>Tue, 19 May 2026 20:47:25 +0000</pubDate>
      <link>https://dev.to/aws-builders/vpc-peering-el-puente-de-red-para-que-recursos-aislados-se-comuniquen-3k5e</link>
      <guid>https://dev.to/aws-builders/vpc-peering-el-puente-de-red-para-que-recursos-aislados-se-comuniquen-3k5e</guid>
      <description>&lt;p&gt;Welcome, everyone, to a new networking workshop! Today we will work with VPC Peering: we will learn what it is, when to use it, what its benefits are, and even when it is better to avoid it. We won't stop at theory, though; we will implement the solution by connecting two fully isolated networks and run tests that move real traffic between the resources deployed in each VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is VPC Peering?
&lt;/h2&gt;

&lt;p&gt;A VPC peering connection is a networking connection between two VPCs that routes traffic between them using private IPv4 or IPv6 addresses. Resources deployed in these networks can communicate with each other as if they were in the same local network. Best of all, it can connect VPCs in the same Region, in different Regions, and even VPCs that belong to different AWS accounts.&lt;/p&gt;
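
&lt;p&gt;One practical constraint worth checking before creating a peering connection is that the CIDR blocks of the two VPCs must not match or overlap. The following is a minimal Python sketch of that check using the standard &lt;code&gt;ipaddress&lt;/code&gt; module. The first CIDR is the one used for VPC01 in the template further down; the second is a hypothetical value chosen only for illustration, and the function name is my own.&lt;/p&gt;

```python
import ipaddress


def cidrs_do_not_overlap(cidr_a, cidr_b):
    """Return True when two CIDR blocks share no addresses."""
    network_a = ipaddress.ip_network(cidr_a)
    network_b = ipaddress.ip_network(cidr_b)
    return not network_a.overlaps(network_b)


VPC01_CIDR = "10.0.0.0/24"  # taken from the workshop template
VPC02_CIDR = "10.1.0.0/24"  # hypothetical value, for illustration only

if __name__ == "__main__":
    # Separate ranges: peering is possible as far as addressing goes.
    print(cidrs_do_not_overlap(VPC01_CIDR, VPC02_CIDR))
    # A range inside VPC01: this pair could not be peered.
    print(cidrs_do_not_overlap(VPC01_CIDR, "10.0.0.128/25"))
```

&lt;p&gt;This only covers addressing. Routes and security group rules are still needed before traffic actually flows.&lt;/p&gt;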

&lt;h3&gt;
  
  
  Workshop scope
&lt;/h3&gt;

&lt;p&gt;We will focus on how to establish the peering connection and on the security rules needed to move traffic from one VPC to the other securely, applying the principle of least privilege.&lt;/p&gt;

&lt;p&gt;Since earlier guides already covered the fundamental concepts of VPCs and their components, we won't repeat that work manually. To keep our attention on the peering connection, the routes, and security, I have prepared a CloudFormation template in YAML. With it we will automatically deploy the VPCs we are going to peer and the resources that will exchange traffic once the peering is established.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudFormation template to deploy the resources
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;AWSTemplateFormatVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2010-09-09'&lt;/span&gt;
&lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Base&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;infrastructure&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VPC&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Peering&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EIC&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Endpoint&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Workshop&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Private&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Environment'&lt;/span&gt;

&lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;LatestAmiId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS::SSM::Parameter::Value&amp;lt;AWS::EC2::Image::Id&amp;gt;'&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64'&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latest&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Amazon&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Linux&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2023&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AMI"&lt;/span&gt;

  &lt;span class="na"&gt;InstanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;t3.micro&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lab"&lt;/span&gt;

&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# --- VPC 01 INFRASTRUCTURE ---&lt;/span&gt;
  &lt;span class="na"&gt;VPC01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::VPC&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;CidrBlock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.0.0/24&lt;/span&gt;
      &lt;span class="na"&gt;EnableDnsSupport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;EnableDnsHostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vpc-01-workshop&lt;/span&gt;

  &lt;span class="na"&gt;Subnet01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::Subnet&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC01&lt;/span&gt;
      &lt;span class="na"&gt;CidrBlock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.0.0/25&lt;/span&gt;
      &lt;span class="na"&gt;AvailabilityZone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Select&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;!GetAZs&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;subnet-01-privada&lt;/span&gt;

  &lt;span class="na"&gt;RouteTable01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::RouteTable&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC01&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rt-01-privada&lt;/span&gt;

  &lt;span class="na"&gt;SubnetRouteTableAssociation01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SubnetRouteTableAssociation&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;SubnetId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Subnet01&lt;/span&gt;
      &lt;span class="na"&gt;RouteTableId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RouteTable01&lt;/span&gt;

  &lt;span class="c1"&gt;# --- VPC 01 SECURITY ---&lt;/span&gt;
  &lt;span class="na"&gt;SGEIC01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SecurityGroup&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GroupDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Group&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EIC&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Endpoint&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;01"&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC01&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sg-eic-01&lt;/span&gt;

  &lt;span class="na"&gt;SGInstance01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SecurityGroup&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GroupDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Group&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;01"&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC01&lt;/span&gt;
      &lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
          &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
          &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
          &lt;span class="na"&gt;SourceSecurityGroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGEIC01&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sg-instance-01&lt;/span&gt;

  &lt;span class="c1"&gt;# Egress rule so the EIC endpoint can reach the instance&lt;/span&gt;
  &lt;span class="na"&gt;EIC01Egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SecurityGroupEgress&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGEIC01&lt;/span&gt;
      &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
      &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
      &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
      &lt;span class="na"&gt;DestinationSecurityGroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGInstance01&lt;/span&gt;

  &lt;span class="c1"&gt;# --- VPC 01 RESOURCES ---&lt;/span&gt;
  &lt;span class="na"&gt;EICEndpoint01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::InstanceConnectEndpoint&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;SubnetId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Subnet01&lt;/span&gt;
      &lt;span class="na"&gt;SecurityGroupIds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGEIC01&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eic-endpoint-01&lt;/span&gt;

  &lt;span class="na"&gt;Instance01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::Instance&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;InstanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InstanceType&lt;/span&gt;
      &lt;span class="na"&gt;ImageId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LatestAmiId&lt;/span&gt;
      &lt;span class="na"&gt;SubnetId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Subnet01&lt;/span&gt;
      &lt;span class="na"&gt;SecurityGroupIds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGInstance01&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instancia-01-requester&lt;/span&gt;

  &lt;span class="c1"&gt;# --- VPC 02 INFRASTRUCTURE ---&lt;/span&gt;
  &lt;span class="na"&gt;VPC02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::VPC&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;CidrBlock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;11.0.0.0/24&lt;/span&gt;
      &lt;span class="na"&gt;EnableDnsSupport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;EnableDnsHostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vpc-02-workshop&lt;/span&gt;

  &lt;span class="na"&gt;Subnet02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::Subnet&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC02&lt;/span&gt;
      &lt;span class="na"&gt;CidrBlock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;11.0.0.0/25&lt;/span&gt;
      &lt;span class="na"&gt;AvailabilityZone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Select&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;!GetAZs&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;subnet-02-privada&lt;/span&gt;

  &lt;span class="na"&gt;RouteTable02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::RouteTable&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC02&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rt-02-privada&lt;/span&gt;

  &lt;span class="na"&gt;SubnetRouteTableAssociation02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SubnetRouteTableAssociation&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;SubnetId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Subnet02&lt;/span&gt;
      &lt;span class="na"&gt;RouteTableId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RouteTable02&lt;/span&gt;

  &lt;span class="c1"&gt;# --- VPC 02 SECURITY ---&lt;/span&gt;
  &lt;span class="na"&gt;SGEIC02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SecurityGroup&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GroupDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Group&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;para&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EIC&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Endpoint&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;02"&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC02&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sg-eic-02&lt;/span&gt;

  &lt;span class="na"&gt;EIC02Egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SecurityGroupEgress&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGEIC02&lt;/span&gt;
      &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
      &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
      &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
      &lt;span class="na"&gt;DestinationSecurityGroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGInstance02&lt;/span&gt;

  &lt;span class="na"&gt;SGInstance02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::SecurityGroup&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GroupDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Group&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;para&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Instancia&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;02"&lt;/span&gt;
      &lt;span class="na"&gt;VpcId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VPC02&lt;/span&gt;
      &lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
          &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
          &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
          &lt;span class="na"&gt;SourceSecurityGroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGEIC02&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sg-instance-02&lt;/span&gt;


  &lt;span class="c1"&gt;# --- VPC 02 RESOURCES ---&lt;/span&gt;
  &lt;span class="na"&gt;EICEndpoint02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::InstanceConnectEndpoint&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;SubnetId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Subnet02&lt;/span&gt;
      &lt;span class="na"&gt;SecurityGroupIds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGEIC02&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eic-endpoint-02&lt;/span&gt;

  &lt;span class="na"&gt;Instance02&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::EC2::Instance&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;InstanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InstanceType&lt;/span&gt;
      &lt;span class="na"&gt;ImageId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LatestAmiId&lt;/span&gt;
      &lt;span class="na"&gt;SubnetId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Subnet02&lt;/span&gt;
      &lt;span class="na"&gt;SecurityGroupIds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SGInstance02&lt;/span&gt;
      &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instancia-02-accepter&lt;/span&gt;

&lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Instancia01ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;de&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;la&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Instancia&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;01"&lt;/span&gt;
    &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Instance01&lt;/span&gt;
  &lt;span class="na"&gt;Instancia02ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;de&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;la&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Instancia&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;02"&lt;/span&gt;
    &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Instance02&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What exactly does this template deploy?
&lt;/h3&gt;

&lt;p&gt;To keep the environment secure and portable, the template automates the base infrastructure under a fully private model:&lt;/p&gt;

&lt;p&gt;2 isolated VPCs: vpc-01-workshop (10.0.0.0/24) and vpc-02-workshop (11.0.0.0/24), configured with no path to the internet (no Internet Gateways or NAT Gateways).&lt;/p&gt;

&lt;p&gt;2 private subnets: subnet-01-privada and subnet-02-privada, each a /25 block placed in the first Availability Zone of the region.&lt;/p&gt;

&lt;p&gt;2 base route tables: rt-01-privada and rt-02-privada, associated with their respective subnets and ready to receive the peering routes that we will add manually.&lt;/p&gt;

&lt;p&gt;2 EC2 Instance Connect (EIC) Endpoints: eic-endpoint-01 and eic-endpoint-02. They act as the secure bridge that lets us connect over SSH from our terminal without public IPs or .pem keys.&lt;/p&gt;

&lt;p&gt;2 EC2 instances: instancia-01-requester and instancia-02-accepter, running Amazon Linux 2023 inside their private networks.&lt;/p&gt;

&lt;p&gt;4 Security Groups: two for the endpoints (with outbound rules on port 22) and two for the instances, configured under the principle of least privilege so that they accept SSH connections only from their respective EIC Endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Deploying the base infrastructure
&lt;/h3&gt;

&lt;p&gt;Save the template above to a .yaml file and open CloudFormation in the AWS console. The code works in any region, but I recommend N. Virginia (us-east-1) so that your screen matches the reference images shown in each step.&lt;/p&gt;

&lt;p&gt;Create the stack with these quick steps:&lt;/p&gt;

&lt;p&gt;Click Create stack (With new resources).&lt;/p&gt;

&lt;p&gt;Select Choose an existing template -&amp;gt; Upload a template file and upload your .yaml file.&lt;/p&gt;

&lt;p&gt;Give your stack a name and click Next through the remaining screens (leave the other options at their defaults).&lt;/p&gt;

&lt;p&gt;Click Submit.&lt;/p&gt;

&lt;p&gt;Provisioning takes a couple of minutes. Once the status changes to CREATE_COMPLETE, the automated part is done and we are ready to start the manual configuration of our VPC Peering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnaxcla2kkz2n98522z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnaxcla2kkz2n98522z5.png" alt=" " width="484" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Verifying the isolation (the expected failure)
&lt;/h3&gt;

&lt;p&gt;With the stack deployed, the first step is to connect to instancia-01-requester, located in vpc-01-workshop. We will do so through the EIC Endpoint (EC2 Instance Connect Endpoint) that the template created. This is a good security practice: it removes the need to manage .pem access keys and adds an extra layer of protection. If you have never worked with EIC Endpoints, here is the previous workshop, where we explain every type of VPC endpoint, EIC included, and use each one in an example: &lt;a href="https://dev.to/aws-builders/domina-la-conectividad-privada-en-aws-con-vpc-endpoints-ahorra-mes-360i"&gt;Workshop VPC Endpoints&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the EC2 service, select the instancia-01-requester instance, and click the Connect button.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3se30awxfkkvqfjn6ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3se30awxfkkvqfjn6ih.png" alt=" " width="794" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the next screen you will see a few warning messages such as "No public IPv4 or IPv6 address assigned" and "Instance is not in a public subnet". Don't worry! Far from being an error, this is a good sign: it confirms that our subnets and resources are completely isolated from the outside world.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the EC2 Instance Connect tab, change the connection type to Connect using EC2 Instance Connect Endpoint.&lt;/li&gt;
&lt;li&gt;The console will automatically select our eic-endpoint-01 in the dropdown list.&lt;/li&gt;
&lt;li&gt;To finish, click the Connect button.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwdura2i7eb1nyeyg0c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwdura2i7eb1nyeyg0c1.png" alt=" " width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once inside the terminal of instancia-01-requester, we will run a command to try to reach instancia-02-accepter. (Go to the EC2 console and copy the private IP address of that second instance; you will need it.)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ping -c 4 &amp;lt;IP_PRIVADA_DE_INSTANCIA_02&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Quick note: The ping command checks basic connectivity between two resources. The -c 4 (count) flag tells it to send exactly 4 test packets to the destination IP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run the command and look at the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvsratcvwot7qsdog2ez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvsratcvwot7qsdog2ez.png" alt=" " width="640" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagnosis is clear: 4 packets transmitted, 0 received (100% packet loss). The command waits without getting any reply and then times out.&lt;/p&gt;

&lt;p&gt;Why does this happen? Because the two VPCs are completely isolated from each other. In networking terms, the router of vpc-01-workshop receives the packet destined for the 11.0.0.x network, checks its route table and, finding no route that tells it how to get there, simply drops the packet. As far as it is concerned, that IP address does not exist.&lt;/p&gt;

&lt;p&gt;Now comes the good part...&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Creating a VPC Peering connection
&lt;/h3&gt;

&lt;p&gt;Next, we will build the bridge between the two networks so that our servers are no longer isolated.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Go to the VPC service in the AWS console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the left column, find and select Peering connections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click the Create peering connection button.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Give it a name; I will use pc-vpc1-to-vpc2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VPC ID (Requester): Select vpc-01-workshop. This is the network that initiates the peering request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For this lab, leave the default options selected: My account and This region.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the VPC ID (Accepter) field, select vpc-02-workshop.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxfefiobawljofqapoi0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxfefiobawljofqapoi0.png" alt=" " width="746" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finish by clicking Create peering connection at the bottom.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Architect's note: In this exercise both VPCs live in the same account and region, but the process is essentially the same if you need to connect networks in different regions or in different accounts; in that case, you would simply choose Another account or Another region as appropriate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have created our peering, but watch out: a crucial step is still missing. Remember that when we configured this connection we defined a requester and an accepter. This means the request is still pending, and it must be formally accepted on the vpc-02-workshop side for the status to change from Pending acceptance to Active.&lt;/p&gt;

&lt;p&gt;On the same screen where you just created the peering (right below the green success banner), open the Actions dropdown menu in the top-right corner, choose Accept request, and confirm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0uwd61i1tcooz6e2jct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0uwd61i1tcooz6e2jct.png" alt=" " width="799" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Architect's note: We are accepting the request ourselves because both VPCs are in our AWS account. If the destination VPC belonged to another corporate account or to an external customer, the administrator of that account would have to sign in to their own console to accept your connection.&lt;/p&gt;
&lt;/blockquote&gt;
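
&lt;p&gt;For reference, the console steps above could also be written as a CloudFormation resource. This is only a sketch and is not part of the workshop template: it assumes the requester VPC has the logical ID VPC01 (mirroring VPC02 in the template), and the logical ID PeeringVPC01ToVPC02 is a made-up name for illustration. To my understanding, when both VPCs belong to the same account, CloudFormation accepts the peering on your behalf, so there is no separate Accept request step.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  # Sketch only: IaC equivalent of the manual Step 3.
  # VPC01 and PeeringVPC01ToVPC02 are assumed / hypothetical logical IDs.
  PeeringVPC01ToVPC02:
    Type: AWS::EC2::VPCPeeringConnection
    Properties:
      # Requester side of the peering
      VpcId:
        Ref: VPC01
      # Accepter side of the peering
      PeerVpcId:
        Ref: VPC02
      Tags:
        - Key: Name
          Value: pc-vpc1-to-vpc2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this workshop we keep the peering manual on purpose, so that each piece can be seen failing and then working.&lt;/p&gt;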

&lt;h3&gt;
  
  
  Step 4: Configuring the route tables (the network map)
&lt;/h3&gt;

&lt;p&gt;Our bridge (the VPC Peering) is now built and active, but if you try the ping again, you will see that it still fails. Why? Because although the logical link exists, the routers of our VPCs do not yet know that they should use it. We still have to configure the navigation instructions: the route tables.&lt;/p&gt;

&lt;p&gt;We will start with the outbound path. In the VPC service, open Route tables and select the one associated with our source subnet (rt-01-privada). We will tell it that whenever a resource tries to reach the CIDR block 11.0.0.0/24 (the VPC-02 network), it should send that traffic through our peering connection (pc-vpc1-to-vpc2).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the rt-01-privada route table and go to the Routes tab at the bottom.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Observation: For now there is only one route, whose Target is set to local. This default rule tells the VPC that any traffic addressed to the CIDR block 10.0.0.0/24 must stay inside its own local network.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Click the Edit routes button (top-right corner of the tab).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the route editor, click Add route and fill in the following fields:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Destination: Enter the full CIDR block of VPC-02 (11.0.0.0/24). This gives the router the destination network range. You are telling it: "Any packet headed to any resource inside VPC-02 must follow this rule." (Note: we do not enter the IP of the individual instance, but the range of the whole neighboring network.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Target: Select Peering Connection in the dropdown list. When you click the empty search box, the console will show the ID of our peering, pc-vpc1-to-vpc2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, click Save changes to lock in the outbound map.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxrc6w793zl6dnbepf8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxrc6w793zl6dnbepf8g.png" alt=" " width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;
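
&lt;p&gt;As a point of comparison, the outbound route configured above could be sketched in CloudFormation roughly as follows. The names are assumptions: RouteTable01 is taken to be the logical ID of rt-01-privada (mirroring RouteTable02 in the template), while PeeringVPC01ToVPC02 and RouteVPC01ToVPC02 are hypothetical logical IDs, the first one standing for an AWS::EC2::VPCPeeringConnection resource defined elsewhere in the same template.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  # Sketch only: IaC equivalent of the manual route in Step 4.
  # RouteTable01 and PeeringVPC01ToVPC02 are assumed / hypothetical logical IDs.
  RouteVPC01ToVPC02:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId:
        Ref: RouteTable01
      # Whole CIDR of VPC-02, not the IP of a single instance
      DestinationCidrBlock: 11.0.0.0/24
      # Send matching traffic across the peering connection
      VpcPeeringConnectionId:
        Ref: PeeringVPC01ToVPC02
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;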

&lt;h3&gt;
  
  
  The lab's "wall"
&lt;/h3&gt;

&lt;p&gt;OK, the peering is active and our route is defined. If we go back to the terminal and run the command again:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ping -c 4 &amp;lt;IP_PRIVADA_DE_INSTANCIA_02&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;What happens? Yes, we run into exactly the same message: 4 packets transmitted, 0 received, 100% packet loss, time 3153ms.&lt;/p&gt;

&lt;p&gt;This is the classic networking scenario that frustrates most people and makes them doubt their configuration. Not today! If our peering connection is OK and the source route table is OK, what else is blocking us? We need to check the guardian that directly protects the resource: the Security Group of instancia-02-accepter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the security group of instancia-02-accepter (sg-instance-02) and select the Inbound rules tab. You will notice that our CloudFormation template created this security group with a single rule: allow SSH traffic on port 22 exclusively from the EIC Endpoint. That is why we can connect to the instance without problems, while any other kind of access is denied by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;For our destination instance to accept and answer the Echo Requests sent by the ping command, we need to allow the ICMP protocol (Internet Control Message Protocol). Unlike common web services, this traffic does not use TCP or UDP ports; it operates directly at the network layer to carry diagnostic messages. Because Security Groups block all inbound traffic by default, we must add an explicit rule to allow it. Let's do it!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Click Edit inbound rules and then the Add rule button.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure the following fields in the new row:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Type: Select All ICMP - IPv4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source: Leave it as Custom and enter the CIDR block of VPC-01 (10.0.0.0/24) in the text box. We use the VPC-01 CIDR because that is where instancia-01-requester lives, the instance from which we run ping; it must be able to reach instancia-02-accepter, which this security group protects, and receive its replies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click Save rules.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ym9c66buy80n7nd7t49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ym9c66buy80n7nd7t49.png" alt=" " width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What did we just do? We told the security group of instancia-02-accepter to allow inbound diagnostic packets (ping), as long as they come from a resource inside the VPC-01 network.&lt;/p&gt;
&lt;/blockquote&gt;
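
&lt;p&gt;The same inbound rule can also be expressed programmatically. The sketch below is only an illustration and is not part of the original workshop: it builds the rule in the shape that boto3's &lt;code&gt;authorize_security_group_ingress&lt;/code&gt; accepts in &lt;code&gt;IpPermissions&lt;/code&gt;, where protocol &lt;code&gt;icmp&lt;/code&gt; with both ports set to -1 means all ICMP types and codes. The helper name and the security group ID in the comment are hypothetical.&lt;/p&gt;

```python
import ipaddress


def icmp_ingress_permission(source_cidr, description="ping from peer VPC"):
    """Build an 'All ICMP - IPv4' inbound rule for a given source CIDR."""
    # Validate the CIDR early; ip_network raises ValueError on bad input.
    network = ipaddress.ip_network(source_cidr)
    return {
        "IpProtocol": "icmp",
        "FromPort": -1,  # ICMP type: -1 means all types
        "ToPort": -1,    # ICMP code: -1 means all codes
        "IpRanges": [{"CidrIp": str(network), "Description": description}],
    }


rule = icmp_ingress_permission("10.0.0.0/24")
print(rule)

# With AWS credentials configured, the rule could be applied like this
# (the group ID is a placeholder, not a value from the workshop):
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0", IpPermissions=[rule])
```

&lt;p&gt;Keeping the rule builder separate from the API call makes it easy to check the rule locally before touching a real security group.&lt;/p&gt;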

&lt;h3&gt;
  
  
  The second ping
&lt;/h3&gt;

&lt;p&gt;With the right rules in the security group, we're ready to run our command in the terminal again:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ping -c 4 IP_PRIVADA_DE_INSTANCIA_02&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And the result is... the same again: 4 packets transmitted, 0 received, 100% packet loss, time 3100ms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You're probably asking yourself: "Hold on, the peering is active, the outbound route is in place, and the Security Group now allows ping... what on earth is going on? Javier, are you trying to stress me out?"&lt;/p&gt;

&lt;p&gt;Not at all. In network architecture, though, the best way to learn is to fail, understand why, and fix it. What we're seeing here is a vital concept: routing in AWS is not bidirectional by default.&lt;/p&gt;

&lt;p&gt;For a ping to succeed, the packet needs a path there and a path back. Let's look at what is happening behind the scenes right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The outbound trip: The packet leaves instancia-01, the VPC-01 router sees the route to the peering connection, the packet crosses the bridge successfully and reaches VPC-02, and the Security Group confirms it is an allowed ICMP packet and delivers it to instancia-02. The outbound trip works perfectly!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The return trip: instancia-02 receives the packet and, being polite, generates a response (Echo Reply) addressed to the IP in VPC-01.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; When this return packet reaches the VPC-02 router, the router checks its own route table. Since we haven't touched the VPC-02 route table, the router only sees its local route (11.0.0.0/24). With no explicit instruction on how to get back to the 10.0.0.0/24 network, the router doesn't know what to do and throws the reply away.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: instancia-02 does receive the message, but its replies stay trapped in its own network. instancia-01 is left waiting forever for an echo that will never come back.&lt;/p&gt;
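
&lt;p&gt;The one-way behavior can be reproduced with a small simulation. This is a simplified sketch, not how AWS implements routing: each route table is modeled as a dictionary of CIDR to target, and a lookup returns the most specific match or nothing at all. The instance IP addresses are hypothetical values inside the workshop's CIDR blocks.&lt;/p&gt;

```python
import ipaddress


def lookup(route_table, destination_ip):
    """Return the target of the most specific matching route, or None."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route: the packet is dropped
    # The longest prefix wins, as in a real route table.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical addresses inside the workshop's CIDR blocks.
INSTANCE_01 = "10.0.0.10"
INSTANCE_02 = "11.0.0.10"

rt_vpc01 = {"10.0.0.0/24": "local", "11.0.0.0/24": "pcx-vpc01-to-vpc02"}
rt_vpc02 = {"11.0.0.0/24": "local"}  # return route not added yet

echo_request = lookup(rt_vpc01, INSTANCE_02)
echo_reply = lookup(rt_vpc02, INSTANCE_01)
print("Echo Request leaves VPC-01 via:", echo_request)
print("Echo Reply leaves VPC-02 via:", echo_reply)

# Adding the return route gives the reply a way back.
rt_vpc02["10.0.0.0/24"] = "pcx-vpc01-to-vpc02"
fixed_reply = lookup(rt_vpc02, INSTANCE_01)
print("Echo Reply after adding the route:", fixed_reply)
```

&lt;p&gt;The first reply lookup finds no match, which corresponds to the 100% packet loss seen in the terminal even though the request itself was delivered.&lt;/p&gt;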

&lt;h3&gt;
  
  
  Configuring the return path
&lt;/h3&gt;

&lt;p&gt;So what do we need to do to fix the trapped packet? Exactly: go to the rt-02-privada route table (associated with VPC-02) and repeat the procedure we followed at the beginning. This time, we'll add a route stating that all traffic destined for the VPC-01 network (10.0.0.0/24) must leave through our Peering Connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ga7u0irabzyd856843j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ga7u0irabzyd856843j.png" alt=" " width="799" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Save the changes and, this time for real, go back to the terminal of our first instance and run the command again:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ping -c 4 IP_PRIVADA_DE_INSTANCIA_02&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq73ukuvs2owyt67osjqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq73ukuvs2owyt67osjqd.png" alt=" " width="617" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Victory! A fully successful response: 4 packets transmitted, 4 received, 0% packet loss, time 3126ms. This is officially the moment to play We Are the Champions by Queen in the background. Ha ha!&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing bidirectionality
&lt;/h3&gt;

&lt;p&gt;With routes configured in both directions, the network bridge is now fully bidirectional, no matter which resource starts the communication. This means that if we connect to instancia-02-accepter in VPC-02, we can ping instancia-01-requester in VPC-01 in return.&lt;/p&gt;

&lt;p&gt;That said... I hope you remembered the vital detail we just learned about security groups (firewalls). For instancia-01 to process that request, its own Security Group must allow it.&lt;/p&gt;

&lt;p&gt;Go to the sg-instance-01 security group (the one protecting the instance in VPC-01) and add the corresponding rule under Inbound rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Type: All ICMP - IPv4&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source: The CIDR of VPC-02 (11.0.0.0/24)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvskxlzevnopna8dqf2iq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvskxlzevnopna8dqf2iq.png" alt=" " width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now connect to the instance in VPC-02 (using its EC2 Instance Connect tab, as we did in Step 1) and ping the origin:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ping -c 4 IP_PRIVADA_DE_INSTANCIA_01&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s82p4eb4vceoclgbgwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4s82p4eb4vceoclgbgwe.png" alt=" " width="617" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flawless results! 4 packets transmitted, 4 received, 0% packet loss, time 3114ms.&lt;/p&gt;

&lt;p&gt;We did it! We established smooth, private, and secure communication between resources living in fully isolated networks, thanks to VPC Peering and correct routing and security management.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to use (and when to avoid) VPC Peering?
&lt;/h3&gt;

&lt;p&gt;VPC Peering is the ideal tool when you need a direct, ONE-to-ONE connection between two networks. However, there is a golden rule at the infrastructure level that you must remember: Peering connections are NOT transitive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2fezk8v4v0okrb3l0x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2fezk8v4v0okrb3l0x6.png" alt=" " width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does this mean in practice? Imagine the following scenario:&lt;/p&gt;

&lt;p&gt;You have one Peering connecting VPC-A with VPC-B.&lt;/p&gt;

&lt;p&gt;You have another Peering connecting VPC-B with VPC-C.&lt;/p&gt;

&lt;p&gt;Beginners very often assume that, because VPC-B sits in the middle, network A could talk to network C by using it as a bridge. In AWS this is not possible. Because routing is not transitive, VPC-A and VPC-C can only talk to each other if you create a third, direct Peering between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scaling problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this very reason, VPC Peering is recommended only when you manage a small number of networks. If your infrastructure grows and you need to interconnect 10, 20, or 50 VPCs with each other, configuring "one-to-one" connections creates an unmanageable web of links and route tables that becomes an administration nightmare.&lt;/p&gt;
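
&lt;p&gt;To put numbers on that web of links: since peering is not transitive, connecting every VPC to every other VPC takes one peering per pair, which is n(n-1)/2 connections for n VPCs. A quick sketch of the arithmetic (the function name is only illustrative):&lt;/p&gt;

```python
def peerings_for_full_mesh(vpc_count):
    """Number of one-to-one peering connections needed to link every VPC pair."""
    return vpc_count * (vpc_count - 1) // 2


for n in (3, 10, 20, 50):
    print(n, "VPCs need", peerings_for_full_mesh(n), "peering connections")
```

&lt;p&gt;On top of the connections themselves, each VPC needs a route for every one of its peers, so the route tables grow along with the mesh.&lt;/p&gt;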

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Architect's Note&lt;/strong&gt;: When you face a scenario where you need to interconnect many networks at scale, VPC Peering is no longer the answer; at that point you should move up to a centralized routing service such as &lt;strong&gt;AWS Transit Gateway&lt;/strong&gt; which, spoiler alert, will be the topic of our next workshop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Final step: Deleting the resources (don't skip this step!)
&lt;/h3&gt;

&lt;p&gt;Once you have explored and tested everything we just built, it is essential to delete the resources to avoid unnecessary charges on your AWS account. Follow this specific order to ensure a clean teardown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;VPC Peering: Go to the VPC service, select Peering connections in the left column, find the peering we created (pcx-vpc01-to-vpc02), select it, and click Delete.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CloudFormation Stack: Once the peering is deleted, go to the CloudFormation service, select the Stack we deployed at the start, and click Delete. AWS will take care of deleting the instances, VPCs, route tables, and Security Groups automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And with that, we're done! I sincerely hope you learned something new today about routing and security in virtual private clouds.&lt;/p&gt;

&lt;p&gt;If you have any opinions, feedback, or questions, don't forget to leave them in the comments section. I also invite you to share this technical content; it could be a great help to others on their learning path!&lt;/p&gt;

&lt;p&gt;See you in the next workshop, where we'll tackle the next level: AWS Transit Gateway.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>spanish</category>
      <category>tutorial</category>
      <category>aws</category>
    </item>
    <item>
      <title>Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP.</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Tue, 19 May 2026 09:10:53 +0000</pubDate>
      <link>https://dev.to/aws-builders/escape-vendor-lock-in-multi-backend-log-delivery-with-otel-collector-for-fsx-for-ontap-2inb</link>
      <guid>https://dev.to/aws-builders/escape-vendor-lock-in-multi-backend-log-delivery-with-otel-collector-for-fsx-for-ontap-2inb</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We shipped the same FSx for ONTAP audit logs to &lt;strong&gt;three backends simultaneously&lt;/strong&gt; — Datadog, Grafana Cloud, and Honeycomb — without changing a single line of Lambda code. The OpenTelemetry Collector sits between our Lambda and the backends as a routing layer. Adding or removing a backend is a YAML config change, not a code deployment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same audit logs → 3 backends simultaneously&lt;/li&gt;
&lt;li&gt;Zero Lambda code changes between backends (SHA-256 verified)&lt;/li&gt;
&lt;li&gt;OTel Collector as the vendor-neutral routing layer&lt;/li&gt;
&lt;li&gt;All 3 event sources work: FSx audit logs via S3 Access Point, EMS webhooks, FPolicy file operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt;, we built a Lambda that speaks Datadog's API directly. It works great — but what happens when your security team wants Splunk, your SRE team wants Grafana, and your platform team is evaluating Honeycomb?&lt;/p&gt;

&lt;p&gt;You'd need three separate Lambdas, each with vendor-specific formatting, auth, and retry logic. That's vendor lock-in expressed as infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Vendor-Specific APIs = Lock-in
&lt;/h3&gt;

&lt;p&gt;Every observability vendor has its own combination of auth header, payload format, and endpoint:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vendor&lt;/th&gt;
&lt;th&gt;Auth Header&lt;/th&gt;
&lt;th&gt;Payload Format&lt;/th&gt;
&lt;th&gt;Endpoint Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DD-API-KEY: &amp;lt;key&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Custom JSON schema&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://http-intake.logs.{site}/api/v2/logs&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Splunk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Splunk &amp;lt;token&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HEC &lt;code&gt;event&lt;/code&gt; wrapper&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://&amp;lt;host&amp;gt;:8088/services/collector/event&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana Cloud&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Basic &amp;lt;b64&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OTLP&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://otlp-gateway-prod-&amp;lt;region&amp;gt;.grafana.net/otlp&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honeycomb&lt;/td&gt;
&lt;td&gt;&lt;code&gt;x-honeycomb-team: &amp;lt;key&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OTLP&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://api.honeycomb.io&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your Lambda speaks Datadog's API, switching to Grafana Cloud means rewriting your Lambda. That's the lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: OTLP as the Producer-to-Collector Contract
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry Protocol (OTLP) is the vendor-neutral producer-to-Collector contract. Our Lambda speaks OTLP — period. The OTel Collector handles routing, processing, and backend-specific export.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│ AWS Account                                                         │
│                                                                     │
│  ┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐  │
│  │ Audit Logs   │────▶│ Lambda           │     │ OTel Collector  │  │
│  │ (via S3 AP)  │────▶│ (OTLP Shipper)   │────▶│ (Docker/Fargate)│  │
│  │ EMS/FPolicy  │────▶│                  │     │                 │  │
│  └──────────────┘     └──────────────────┘     └─┬──────┬──────┬─┘  │
│                                                  │      │      │    │
└──────────────────────────────────────────────────┼──────┼──────┼────┘
                                                   │      │      │
                                                   ▼      ▼      ▼
                                              Datadog  Grafana Honeycomb
                                               (AP1)    Cloud    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda sends OTLP/HTTP to the Collector. The Collector fans out to any combination of backends. Adding Honeycomb? Add 5 lines of YAML. Dropping Datadog? Remove 4 lines. No Lambda redeployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FSx for ONTAP with audit logging&lt;/strong&gt; configured (see &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt; for setup)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; installed locally (Colima works — see troubleshooting for compose compatibility)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At least one backend account&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Datadog: API key + site (e.g., &lt;code&gt;ap1.datadoghq.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Grafana Cloud: Instance ID + API token (Cloud Portal → OTLP)&lt;/li&gt;
&lt;li&gt;Honeycomb: Ingest API key (starts with &lt;code&gt;hcaik_&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS account&lt;/strong&gt; with Lambda deployment capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parts 1–4 context&lt;/strong&gt; (recommended but not required — this integration works standalone)&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;FSx for ONTAP S3 Access Point note&lt;/strong&gt;: The Lambda reads audit logs through an S3 Access Point attached to the FSx for ONTAP volume. Data remains on the FSx file system — it is not copied to a separate S3 bucket. S3 API throughput via FSx depends on the file system's &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/performance.html" rel="noopener noreferrer"&gt;provisioned throughput capacity&lt;/a&gt;, not standard S3 scaling. Validate FSx read throughput separately from Collector and backend ingest throughput.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The OTel Collector Configuration
&lt;/h2&gt;

&lt;p&gt;The Collector config is the heart of this pattern. Here's the full verified configuration for multi-backend delivery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otel-collector-config.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ VERIFIED WORKING (2026-05-18)&lt;/span&gt;
&lt;span class="c1"&gt;# Image: otel/opentelemetry-collector-contrib:0.152.0&lt;/span&gt;
&lt;span class="c1"&gt;# Backends: Grafana Cloud (ap-northeast-0) + Honeycomb&lt;/span&gt;

&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# memory_limiter:        # Recommended for production&lt;/span&gt;
  &lt;span class="c1"&gt;#   check_interval: 1s&lt;/span&gt;
  &lt;span class="c1"&gt;#   limit_mib: 512&lt;/span&gt;
  &lt;span class="c1"&gt;#   spike_limit_mib: 128&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:GRAFANA_OTLP_ENDPOINT}&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${env:GRAFANA_BASIC_AUTH}"&lt;/span&gt;

  &lt;span class="na"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://api.honeycomb.io&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_API_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_DATASET}&lt;/span&gt;

&lt;span class="na"&gt;extensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:13133&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;extensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Depending on your Honeycomb environment and dataset model, &lt;code&gt;x-honeycomb-dataset&lt;/code&gt; may be optional or handled differently. Refer to your &lt;a href="https://docs.honeycomb.io/send-data/opentelemetry/" rel="noopener noreferrer"&gt;Honeycomb OTLP setup page&lt;/a&gt; for the recommended configuration.&lt;/p&gt;

&lt;p&gt;This article uses &lt;code&gt;otlp_http&lt;/code&gt; (the forward-compatible component name). If your Collector version does not recognize it, use the older &lt;code&gt;otlphttp&lt;/code&gt; alias or upgrade the Collector.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Section Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Settings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;receivers.otlp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accepts OTLP/HTTP from Lambda&lt;/td&gt;
&lt;td&gt;Port 4318 (OTLP standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;processors.batch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Buffers logs before export&lt;/td&gt;
&lt;td&gt;5s timeout OR 1000 records (whichever first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exporters.otlp_http/*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sends to each backend&lt;/td&gt;
&lt;td&gt;Per-backend auth headers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extensions.health_check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Liveness probe&lt;/td&gt;
&lt;td&gt;Port 13133 for &lt;code&gt;curl -f&lt;/code&gt; checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.pipelines&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wires components together&lt;/td&gt;
&lt;td&gt;logs: receiver → processor → exporters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production note&lt;/strong&gt;: This configuration is suitable for development and validation. For production, add &lt;code&gt;retry_on_failure&lt;/code&gt; and &lt;code&gt;sending_queue&lt;/code&gt; settings to the exporters, enable the &lt;code&gt;memory_limiter&lt;/code&gt; processor, and consider a persistent storage extension. Without persistent buffering, telemetry held in the Collector's in-memory batch can be lost when the Collector restarts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Adding Datadog as a Third Backend
&lt;/h3&gt;

&lt;p&gt;To send to all three simultaneously, add the Datadog exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... existing grafana + honeycomb exporters ...&lt;/span&gt;

  &lt;span class="na"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:DD_API_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;site&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:DD_SITE}&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Restart the Collector. Same Lambda, same OTLP payload, now three destinations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For Datadog, this example uses the Collector's dedicated &lt;code&gt;datadog&lt;/code&gt; exporter rather than generic &lt;code&gt;otlp_http&lt;/code&gt;, because it handles Datadog-specific intake behavior, metadata mapping, and host tagging.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Lambda Handler (OTLP Shipper)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Design Decisions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why OTLP?&lt;/strong&gt; — It gives the Lambda a single producer-to-Collector contract. The Collector then handles each backend's supported exporter or intake path. One format to maintain, not three.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why no vendor SDK?&lt;/strong&gt; — SDKs add cold start latency, dependency management, and vendor coupling. Pure &lt;code&gt;urllib3&lt;/code&gt; + JSON keeps the Lambda lean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why AUTH_MODE?&lt;/strong&gt; — Different Collectors may need different auth. The Lambda supports &lt;code&gt;none&lt;/code&gt;, &lt;code&gt;basic&lt;/code&gt;, and &lt;code&gt;bearer&lt;/code&gt; modes without code changes.&lt;/li&gt;
&lt;/ol&gt;
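
&lt;p&gt;As a rough illustration of the third point, an auth-mode switch can be a single small function. This is a sketch under assumptions, not the article's actual handler: the function name and its parameters are hypothetical, and only the three mode names (&lt;code&gt;none&lt;/code&gt;, &lt;code&gt;basic&lt;/code&gt;, &lt;code&gt;bearer&lt;/code&gt;) come from the text above.&lt;/p&gt;

```python
import base64


def build_auth_headers(auth_mode, username="", password="", token=""):
    """Return the HTTP auth headers for the configured mode."""
    mode = auth_mode.strip().lower()
    if mode == "none":
        return {}
    if mode == "basic":
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if mode == "bearer":
        return {"Authorization": "Bearer " + token}
    # Fail loudly on typos instead of silently sending unauthenticated requests.
    raise ValueError(f"unsupported AUTH_MODE: {auth_mode!r}")


print(build_auth_headers("none"))
print(build_auth_headers("basic", username="user", password="pass"))
print(build_auth_headers("bearer", token="example-token"))
```

&lt;p&gt;In a Lambda, the mode and credentials would typically come from environment variables or a secrets store, so switching modes stays a configuration change.&lt;/p&gt;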

&lt;h3&gt;
  
  
  Field Mapping: FSx ONTAP → OTLP Attributes
&lt;/h3&gt;

&lt;p&gt;The Lambda maps FSx ONTAP audit fields to semantic OTLP attribute keys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;FSx ONTAP Field&lt;/th&gt;
&lt;th&gt;OTLP Attribute Key&lt;/th&gt;
&lt;th&gt;Example Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EventID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;event.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4663&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UserName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;user.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;admin@corp.local&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ClientIP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;client.address&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.50&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReadData&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ObjectName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/vol/data/reports/q4.xlsx&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Success&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SVMName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn.svm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;The examples above focus on the audit logs read through the S3 Access Point because they are the highest-volume path. The same OTLP shipper pattern is reused for EMS webhook events and FPolicy file operations using source-specific field mappers (&lt;code&gt;ems_handler.py&lt;/code&gt;, &lt;code&gt;fpolicy_handler.py&lt;/code&gt;), while preserving the same Collector-facing OTLP contract. For EMS and FPolicy, source-specific service names are used (&lt;code&gt;fsxn-ems&lt;/code&gt;, &lt;code&gt;fsxn-fpolicy&lt;/code&gt;) to distinguish event sources in the backend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Resource-level attributes (set once per payload, not per log record):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn-audit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Service identification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cloud.provider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cloud context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cloud.platform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws_fsx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Platform context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cloud.platform=aws_fsx&lt;/code&gt; is a project-specific value used to identify FSx for ONTAP as the data source. It is not one of the standard &lt;code&gt;cloud.platform&lt;/code&gt; values defined by the &lt;a href="https://opentelemetry.io/docs/specs/semconv/resource/cloud/" rel="noopener noreferrer"&gt;OpenTelemetry semantic conventions&lt;/a&gt; (which include &lt;code&gt;aws_ec2&lt;/code&gt;, &lt;code&gt;aws_ecs&lt;/code&gt;, &lt;code&gt;aws_eks&lt;/code&gt;, &lt;code&gt;aws_lambda&lt;/code&gt;, etc.).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Severity Determination Logic
&lt;/h3&gt;

&lt;p&gt;The Lambda determines OTLP severity from the &lt;code&gt;Result&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;WARN_KEYWORDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;determine_severity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Determine OTLP severity from FSx ONTAP Result field.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;WARN_KEYWORDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means failed access attempts (&lt;code&gt;Result: "Failure"&lt;/code&gt;) automatically get &lt;code&gt;severityNumber: 13&lt;/code&gt; (WARN), making them easy to filter in any backend.&lt;/p&gt;

&lt;p&gt;The Lambda sets both &lt;code&gt;severityNumber&lt;/code&gt; and &lt;code&gt;severityText&lt;/code&gt; according to the &lt;a href="https://opentelemetry.io/docs/specs/otel/logs/data-model/#severity-fields" rel="noopener noreferrer"&gt;OpenTelemetry Logs Data Model&lt;/a&gt; severity level definitions, in which 9 is the first value of the INFO range (9 to 12) and 13 is the first value of the WARN range (13 to 16).&lt;/p&gt;
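
&lt;p&gt;To make the rule concrete, the sketch below applies the same keyword check to a few sample &lt;code&gt;Result&lt;/code&gt; values. Apart from &lt;code&gt;Success&lt;/code&gt; and &lt;code&gt;Failure&lt;/code&gt;, the sample values are illustrative assumptions, not values confirmed to appear in FSx for ONTAP audit logs.&lt;/p&gt;

```python
from typing import Optional

WARN_KEYWORDS = ("fail", "denied", "error")


def determine_severity(result: Optional[str]):
    """Same rule as the Lambda snippet: WARN if any keyword appears, else INFO."""
    if not result:
        return (9, "INFO")
    lower = result.lower()
    if any(keyword in lower for keyword in WARN_KEYWORDS):
        return (13, "WARN")
    return (9, "INFO")


# Illustrative inputs, including the empty and missing cases.
samples = ["Success", "Failure", "Access Denied", "", None]
severities = {repr(sample): determine_severity(sample) for sample in samples}
for label, severity in severities.items():
    print(label, severity)
```

&lt;p&gt;Because the match is a case-insensitive substring test, any &lt;code&gt;Result&lt;/code&gt; value that contains one of the keywords is classified as WARN, and empty or missing values fall back to INFO.&lt;/p&gt;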

&lt;h3&gt;
  
  
  OTLP Payload Construction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_otlp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build OTLP Log Data Model payload.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;log_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;map_log_record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resourceLogs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attributes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud.provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud.platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_fsx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scopeLogs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fsxn-otel-shipper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logRecords&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;log_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No vendor SDK. No vendor-specific formatting. Just the OTLP Log Data Model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retry with Exponential Backoff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;BASE_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# seconds
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_send_otlp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send OTLP payload via HTTP POST with retry logic.

    Retries on HTTP 429 and 5xx. Does not retry on 4xx (except 429).
    Exponential backoff: 2s, 4s, 8s with jitter.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;json_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json_body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BASE_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="c1"&gt;# Client error (4xx except 429) — don't retry
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
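
&lt;p&gt;The retry parameters also determine how long a single send can take in the worst case, which matters when sizing the Lambda timeout. The sketch below is a rough, conservative estimate rather than a measured figure: it counts a full request timeout and a full backoff (including maximum jitter) for every attempt, so the real duration should not exceed it. The helper names are hypothetical, and the Lambda timeout itself is not stated in this article.&lt;/p&gt;

```python
MAX_RETRIES = 3
BASE_INTERVAL = 2        # seconds, as in the retry snippet
REQUEST_TIMEOUT = 30.0   # seconds, the timeout passed to http.request
MAX_JITTER = 1.0         # upper bound of random.uniform(0, 1)


def backoff_schedule(max_retries=MAX_RETRIES, base=BASE_INTERVAL):
    """Return the base wait (without jitter) for each attempt index."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]


def worst_case_seconds(max_retries=MAX_RETRIES):
    """Upper bound: each attempt uses the full timeout, then a full backoff."""
    waits = sum(backoff_schedule(max_retries)) + MAX_JITTER * max_retries
    return REQUEST_TIMEOUT * max_retries + waits


print(backoff_schedule())    # [2, 4, 8]
print(worst_case_seconds())  # 107.0
```

&lt;p&gt;Under these assumptions, a Lambda timeout comfortably above this bound leaves room for parsing and payload construction in addition to delivery.&lt;/p&gt;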



&lt;h3&gt;
  
  
  AUTH_MODE Support
&lt;/h3&gt;

&lt;p&gt;The Lambda supports three authentication modes via the &lt;code&gt;AUTH_MODE&lt;/code&gt; environment variable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AUTH_MODE&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No auth headers sent&lt;/td&gt;
&lt;td&gt;Local Collector (no auth needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;basic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Basic &amp;lt;base64(instanceId:apiToken)&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Grafana Cloud direct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bearer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Authorization: Bearer &amp;lt;token&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generic OTLP endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When using the Collector pattern, set &lt;code&gt;AUTH_MODE=none&lt;/code&gt; on the Lambda — the Collector handles backend auth via its own config.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Direct auth modes (&lt;code&gt;basic&lt;/code&gt;, &lt;code&gt;bearer&lt;/code&gt;) are useful for testing, or for sending straight to a single backend without a Collector. They are not needed in the multi-backend pattern, where the Collector holds the backend credentials.&lt;/p&gt;
&lt;/blockquote&gt;
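
&lt;p&gt;The three modes in the table can be sketched as a small header builder. This is an illustration under stated assumptions, not the Lambda's actual code: &lt;code&gt;build_auth_headers&lt;/code&gt; is a hypothetical name, and the credential is assumed to arrive unencoded (for &lt;code&gt;basic&lt;/code&gt;, the full string to be Base64-encoded).&lt;/p&gt;

```python
import base64


def build_auth_headers(auth_mode, credential=None):
    """Return the Authorization header dict for a given AUTH_MODE value."""
    mode = (auth_mode or "none").lower()
    if mode == "none":
        return {}  # the Collector handles backend auth
    if not credential:
        raise ValueError(f"AUTH_MODE={mode} requires a credential")
    if mode == "basic":
        # Assumption: the credential is passed in unencoded.
        encoded = base64.b64encode(credential.encode("utf-8")).decode("ascii")
        return {"Authorization": f"Basic {encoded}"}
    if mode == "bearer":
        return {"Authorization": f"Bearer {credential}"}
    raise ValueError(f"Unsupported AUTH_MODE: {auth_mode}")


print(build_auth_headers("none"))  # {}
print(build_auth_headers("bearer", "example-token"))
```

&lt;p&gt;Rejecting unknown modes and missing credentials up front means a misconfigured &lt;code&gt;AUTH_MODE&lt;/code&gt; fails at startup rather than as repeated 4xx responses from the backend.&lt;/p&gt;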

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Local Development: Docker Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Configure credentials&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;integrations/otel-collector
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env with your backend credentials:&lt;/span&gt;
&lt;span class="c"&gt;#   GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp&lt;/span&gt;
&lt;span class="c"&gt;#   GRAFANA_BASIC_AUTH=&amp;lt;base64(instanceId:apiToken)&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;#   HONEYCOMB_API_KEY=hcaik_&amp;lt;your-ingest-key&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;#   HONEYCOMB_DATASET=fsxn-audit&lt;/span&gt;

&lt;span class="c"&gt;# 2. Start OTel Collector&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; otel-collector &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="nt"&gt;-p&lt;/span&gt; 13133:13133 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env &lt;span class="se"&gt;\&lt;/span&gt;
  otel/opentelemetry-collector-contrib:0.152.0

&lt;span class="c"&gt;# 3. Verify health&lt;/span&gt;
curl &lt;span class="nt"&gt;-f&lt;/span&gt; http://localhost:13133/
&lt;span class="c"&gt;# Expected: HTTP 200 — {"status":"Server available", ...}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;health_check&lt;/code&gt; extension confirms the Collector process is available; it does not guarantee that each backend exporter is successfully delivering logs. Monitor exporter errors separately using the Collector's internal telemetry metrics if enabled and exposed.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 4. Send a test payload&lt;/span&gt;
bash scripts/generate-otlp-payload.sh &lt;span class="nt"&gt;--output&lt;/span&gt; /tmp/payload.json
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:4318/v1/logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @/tmp/payload.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
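
&lt;p&gt;If the repository script is not at hand, a minimal payload can also be produced in Python. The sketch below is an assumption-laden stand-in, not the contents of &lt;code&gt;generate-otlp-payload.sh&lt;/code&gt;: &lt;code&gt;make_test_payload&lt;/code&gt;, the body text, and the chosen attributes are illustrative. It stamps the record with the current time so that the event shows up as fresh in the backend.&lt;/p&gt;

```python
import json
import time


def make_test_payload(service_name="fsxn-audit"):
    """Build a one-record OTLP/JSON logs payload stamped with the current time."""
    now_ns = str(time.time_ns())  # OTLP/JSON carries 64-bit integers as strings
    record = {
        "timeUnixNano": now_ns,
        "severityNumber": 9,
        "severityText": "INFO",
        "body": {"stringValue": "fsxn test event"},  # illustrative body text
        "attributes": [
            {"key": "fsxn.operation", "value": {"stringValue": "ReadData"}},
            {"key": "fsxn.result", "value": {"stringValue": "Success"}},
        ],
    }
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeLogs": [{
                "scope": {"name": "fsxn-otel-shipper"},
                "logRecords": [record],
            }],
        }]
    }


payload_json = json.dumps(make_test_payload(), indent=2)
print(payload_json)
```

&lt;p&gt;Writing &lt;code&gt;payload_json&lt;/code&gt; to &lt;code&gt;/tmp/payload.json&lt;/code&gt; gives a file that the &lt;code&gt;curl&lt;/code&gt; command in step 4 can post unchanged.&lt;/p&gt;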



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Colima users&lt;/strong&gt;: Colima does not bundle the &lt;code&gt;docker compose&lt;/code&gt; v2 plugin, so it is unavailable unless you have installed it separately. All scripts in this repo detect this and fall back to &lt;code&gt;docker run&lt;/code&gt;. If you see "docker compose: command not found", this is expected behavior.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  First Success Path
&lt;/h3&gt;

&lt;p&gt;If you're trying this for the first time, start small:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the Collector locally with &lt;strong&gt;one&lt;/strong&gt; backend.&lt;/li&gt;
&lt;li&gt;Send one fresh OTLP payload.&lt;/li&gt;
&lt;li&gt;Confirm the event appears in that backend.&lt;/li&gt;
&lt;li&gt;Add the second exporter.&lt;/li&gt;
&lt;li&gt;Only then move to multi-backend or AWS deployment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps the first validation focused on the producer-to-Collector contract before introducing backend parity and production networking.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Deployment: CloudFormation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/otel-collector/template.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-otel-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;S3AccessPointArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;OtlpEndpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://&amp;lt;your-collector-endpoint&amp;gt;:4318 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-otel-key-XXXXXX &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;AuthMode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This template deploys only the Lambda-side OTLP shipper. The Collector endpoint must already be reachable from the Lambda: for example, an EC2-hosted Collector or an ECS/Fargate-based Collector in the same VPC. A Collector running locally for development works only if it is exposed on an address the Lambda can route to. If the Lambda is in a VPC, ensure its security group allows outbound TCP 4318 to the Collector and that the Collector's security group allows the matching inbound traffic. See the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/vpc-deployment.md" rel="noopener noreferrer"&gt;VPC Deployment Guide&lt;/a&gt; and &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/security-hardening.md" rel="noopener noreferrer"&gt;Security Hardening Guide&lt;/a&gt; for production Collector deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the Collector handles auth, set &lt;code&gt;AuthMode=none&lt;/code&gt; on the Lambda. The Collector config then reads the per-backend credentials from environment variables, sourced from &lt;code&gt;.env&lt;/code&gt; in local development or from Secrets Manager in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Variables
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Lambda&lt;/th&gt;
&lt;th&gt;Collector&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OTLP_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Collector URL (e.g., &lt;code&gt;http://collector:4318&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AUTH_MODE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;none&lt;/code&gt; / &lt;code&gt;basic&lt;/code&gt; / &lt;code&gt;bearer&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SERVICE_NAME&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;OTLP &lt;code&gt;service.name&lt;/code&gt; attribute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GRAFANA_OTLP_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Grafana Cloud OTLP gateway URL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GRAFANA_BASIC_AUTH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;base64(instanceId:apiToken)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HONEYCOMB_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Ingest key (hcaik_...)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HONEYCOMB_DATASET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Dataset name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Datadog API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_SITE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Datadog site (&lt;code&gt;datadoghq.com&lt;/code&gt;, &lt;code&gt;datadoghq.eu&lt;/code&gt;, &lt;code&gt;ap1.datadoghq.com&lt;/code&gt;, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
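
&lt;p&gt;On the Lambda side, the three variables can be read and validated in one place. The following is a minimal sketch, not the handler's actual code: &lt;code&gt;load_lambda_config&lt;/code&gt; is a hypothetical helper, and the defaults for &lt;code&gt;AUTH_MODE&lt;/code&gt; and &lt;code&gt;SERVICE_NAME&lt;/code&gt; are assumptions chosen to match the values used elsewhere in this article.&lt;/p&gt;

```python
import os

VALID_AUTH_MODES = ("none", "basic", "bearer")


def load_lambda_config(environ=None):
    """Read the Lambda-side variables from the table and validate AUTH_MODE."""
    env = os.environ if environ is None else environ
    # Strip a trailing slash so that appending "/v1/logs" never doubles it.
    endpoint = env.get("OTLP_ENDPOINT", "").rstrip("/")
    if not endpoint:
        raise ValueError("OTLP_ENDPOINT is required")
    auth_mode = env.get("AUTH_MODE", "none").lower()  # default is an assumption
    if auth_mode not in VALID_AUTH_MODES:
        raise ValueError(f"AUTH_MODE must be one of {VALID_AUTH_MODES}")
    return {
        "endpoint": endpoint,
        "auth_mode": auth_mode,
        "service_name": env.get("SERVICE_NAME", "fsxn-audit"),  # assumed default
    }


config = load_lambda_config({"OTLP_ENDPOINT": "http://collector:4318/"})
print(config)
```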

&lt;h2&gt;
  
  
  Verified Results
&lt;/h2&gt;

&lt;p&gt;All backends were tested on 2026-05-18 using &lt;code&gt;otel/opentelemetry-collector-contrib:0.152.0&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Region/Site&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Event Sources&lt;/th&gt;
&lt;th&gt;Auth Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;ap1.datadoghq.com&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;S3 audit + EMS + FPolicy&lt;/td&gt;
&lt;td&gt;Datadog exporter (&lt;code&gt;DD-API-KEY&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana Cloud&lt;/td&gt;
&lt;td&gt;ap-northeast-0&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;S3 audit + EMS + FPolicy&lt;/td&gt;
&lt;td&gt;Basic Auth via &lt;code&gt;otlp_http&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honeycomb&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;S3 audit + EMS + FPolicy&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;x-honeycomb-team&lt;/code&gt; via &lt;code&gt;otlp_http&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Backend&lt;/td&gt;
&lt;td&gt;Grafana + Honeycomb&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;Simultaneous delivery&lt;/td&gt;
&lt;td&gt;Both auth methods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Backend&lt;/td&gt;
&lt;td&gt;Datadog + Grafana + Honeycomb&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;Simultaneous 3-way delivery&lt;/td&gt;
&lt;td&gt;All three exporters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three backends received the same structured attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;event.type&lt;/code&gt;, &lt;code&gt;user.name&lt;/code&gt;, &lt;code&gt;client.address&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsxn.operation&lt;/code&gt;, &lt;code&gt;fsxn.path&lt;/code&gt;, &lt;code&gt;fsxn.result&lt;/code&gt;, &lt;code&gt;fsxn.svm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cloud.provider=aws&lt;/code&gt;, &lt;code&gt;cloud.platform=aws_fsx&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;OTLP standardizes the producer-to-Collector contract, but backend-specific indexing, query semantics, and retention behavior still need to be validated per destination. OpenTelemetry is not a backend — it defines APIs, protocols, and Collector components for telemetry generation, collection, processing, and export. Storage, visualization, and alerting are handled by the backends themselves. See the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/backend-parity-matrix.md" rel="noopener noreferrer"&gt;Backend Parity Matrix&lt;/a&gt; and &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/poc-checklist.md" rel="noopener noreferrer"&gt;PoC Checklist&lt;/a&gt; for backend-specific validation details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Proof: Zero Code Changes
&lt;/h2&gt;

&lt;p&gt;Here's the key evidence. The Lambda handler's SHA-256 hash is identical regardless of which backend receives the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;shasum &lt;span class="nt"&gt;-a&lt;/span&gt; 256 integrations/otel-collector/lambda/handler.py
&lt;span class="c"&gt;# Same hash whether targeting Datadog, Grafana Cloud, or Honeycomb&lt;/span&gt;
&lt;span class="c"&gt;# The file never changes — only the Collector config does&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What changes between backends? &lt;strong&gt;Only the OTel Collector config file.&lt;/strong&gt;&lt;/p&gt;
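&lt;p&gt;The same check can be scripted when several copies of the handler are deployed. This is a minimal sketch using Python's standard &lt;code&gt;hashlib&lt;/code&gt;; the function names and the idea of comparing one file per deployment are assumptions for illustration, not code from the repository.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def sha256_of_file(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def handlers_identical(paths):
    """True when every file in paths has the same SHA-256 digest."""
    digests = {sha256_of_file(p) for p in paths}
    return len(digests) == 1
```

&lt;p&gt;Calling &lt;code&gt;handlers_identical&lt;/code&gt; with the handler path from each deployment should return &lt;code&gt;True&lt;/code&gt; if only the Collector config differs between them.&lt;/p&gt;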

&lt;h3&gt;
  
  
  Demonstration: Adding a Backend
&lt;/h3&gt;

&lt;p&gt;Starting state: Grafana Cloud only.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: single backend&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding Honeycomb:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After: add 5 lines to exporters section + update pipeline&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://api.honeycomb.io&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_API_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;x-honeycomb-dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${env:HONEYCOMB_DATASET}&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;otlp_http/honeycomb&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the Collector. Done. No Lambda redeployment, no code review, no CI/CD pipeline for the shipper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demonstration: Removing a Backend
&lt;/h3&gt;

&lt;p&gt;Dropping Datadog during a migration to Grafana Cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remove from exporters list — that's it&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# removed: datadog&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Timestamp Rejection / Static Payload Gotcha
&lt;/h3&gt;

&lt;p&gt;Datadog documents that logs older than 18 hours are dropped at intake (&lt;a href="https://docs.datadoghq.com/api/latest/logs/" rel="noopener noreferrer"&gt;Datadog Logs API docs&lt;/a&gt;). Other backends may also reject or hide events with timestamps outside their accepted windows. In my testing, future timestamps also caused ingestion issues on some backends. When testing with static payloads, always generate fresh timestamps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Use the payload generator to create fresh timestamps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/generate-otlp-payload.sh &lt;span class="nt"&gt;--output&lt;/span&gt; /tmp/fresh-payload.json
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:4318/v1/logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @/tmp/fresh-payload.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
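&lt;p&gt;If the shell generator is not at hand, a fresh payload can also be built in a few lines of Python. The sketch below is not the repository's &lt;code&gt;generate-otlp-payload.sh&lt;/code&gt;; the function name, the &lt;code&gt;example&lt;/code&gt; scope name, and the sample attribute values are assumptions. It follows the OTLP/HTTP JSON layout (&lt;code&gt;resourceLogs&lt;/code&gt;, &lt;code&gt;scopeLogs&lt;/code&gt;, &lt;code&gt;logRecords&lt;/code&gt;) and stamps each record with the current time in nanoseconds.&lt;/p&gt;

```python
import json
import time


def build_otlp_log_payload(body, attributes, now_ns=None):
    """Build a minimal OTLP/HTTP JSON logs payload stamped with the current time."""
    if now_ns is None:
        now_ns = time.time_ns()
    record = {
        # OTLP JSON encodes 64-bit integers as decimal strings.
        "timeUnixNano": str(now_ns),
        "observedTimeUnixNano": str(now_ns),
        "severityText": "INFO",
        "body": {"stringValue": body},
        "attributes": [
            {"key": key, "value": {"stringValue": value}}
            for key, value in attributes.items()
        ],
    }
    return {
        "resourceLogs": [
            {
                "resource": {
                    "attributes": [
                        {"key": "cloud.provider", "value": {"stringValue": "aws"}}
                    ]
                },
                "scopeLogs": [{"scope": {"name": "example"}, "logRecords": [record]}],
            }
        ]
    }


if __name__ == "__main__":
    # Sample attribute values are placeholders, not real audit data.
    payload = build_otlp_log_payload(
        "file read", {"fsxn.operation": "read", "fsxn.result": "success"}
    )
    print(json.dumps(payload, indent=2))
```

&lt;p&gt;Redirect the output to a file and post it with the same &lt;code&gt;curl&lt;/code&gt; command shown above.&lt;/p&gt;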



&lt;h3&gt;
  
  
  Grafana Cloud Auth Format
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;loki&lt;/code&gt; exporter is &lt;strong&gt;NOT&lt;/strong&gt; the correct approach for OTLP → Grafana Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;code&gt;loki&lt;/code&gt; exporter with Loki push API&lt;/li&gt;
&lt;li&gt;✅ &lt;code&gt;otlp_http/grafana&lt;/code&gt; with OTLP gateway endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Basic Auth value must be &lt;code&gt;base64(instanceId:apiToken)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the auth value&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;your-instance-id&amp;gt;:&amp;lt;your-grafana-cloud-api-token&amp;gt;"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where the instance ID is your numeric Grafana Cloud instance ID (found in Cloud Portal → OTLP configuration).&lt;/p&gt;
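&lt;p&gt;The same value can be produced in Python, which avoids the trailing-newline mistake that happens when &lt;code&gt;echo&lt;/code&gt; is used without &lt;code&gt;-n&lt;/code&gt;. A minimal sketch, assuming standard HTTP Basic authentication; the function name is hypothetical.&lt;/p&gt;

```python
import base64


def grafana_basic_auth_header(instance_id, api_token):
    """Return the Authorization header value for HTTP Basic auth."""
    raw = f"{instance_id}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

&lt;p&gt;The part after &lt;code&gt;Basic&lt;/code&gt; is the same string the shell command prints.&lt;/p&gt;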

&lt;h3&gt;
  
  
  Honeycomb Key Types
&lt;/h3&gt;

&lt;p&gt;Honeycomb has two key types. Only ingest keys work for data ingestion:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key Prefix&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Works for OTLP?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hcaik_&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ingest API key&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hcxik_&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Environment key&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you see &lt;code&gt;401 Unauthorized&lt;/code&gt; from Honeycomb, check your key prefix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Colima Docker Compose Compatibility
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;docker compose&lt;/code&gt; v2 plugin is often missing in Colima environments, because Colima provides the container runtime but does not install the Compose plugin itself. All scripts in this repository detect this automatically and fall back to &lt;code&gt;docker run&lt;/code&gt;. This is expected, not an error.&lt;/p&gt;

&lt;p&gt;If you need compose-like orchestration on Colima, use the explicit &lt;code&gt;docker run&lt;/code&gt; commands shown in the Deployment section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Mistake: loki Exporter vs otlp_http
&lt;/h3&gt;

&lt;p&gt;A frequent misconfiguration when targeting Grafana Cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ WRONG — loki exporter uses Loki-specific push API&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;loki&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://logs-prod-&amp;lt;region&amp;gt;.grafana.net/loki/api/v1/push&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ CORRECT — otlp_http uses the OTLP gateway&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp_http/grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://otlp-gateway-prod-&amp;lt;region&amp;gt;.grafana.net/otlp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OTLP gateway is Grafana Cloud's native OTLP ingestion endpoint. It handles logs, metrics, and traces through a single URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Model: How to Think About It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lambda Cost (OTLP Path vs Direct Send)
&lt;/h3&gt;

&lt;p&gt;In my validation, the OTLP Lambda was simpler and shorter-lived than the vendor-specific direct-send path. Your duration will vary depending on batching, payload size, network path, and backend response time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Direct Send (Part 2)&lt;/th&gt;
&lt;th&gt;OTLP + Collector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda complexity&lt;/td&gt;
&lt;td&gt;Vendor formatting + HTTP + retry&lt;/td&gt;
&lt;td&gt;OTLP POST to nearby Collector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda memory&lt;/td&gt;
&lt;td&gt;256MB&lt;/td&gt;
&lt;td&gt;256MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor SDK deps&lt;/td&gt;
&lt;td&gt;Yes (adds cold start)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry complexity&lt;/td&gt;
&lt;td&gt;Per-vendor&lt;/td&gt;
&lt;td&gt;Delegated to Collector&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  OTel Collector Cost
&lt;/h3&gt;

&lt;p&gt;The Collector introduces a fixed infrastructure cost that is independent of event volume. The right deployment target depends on the environment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Docker on local machine&lt;/td&gt;
&lt;td&gt;Development, testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker on EC2 Spot (t3.small)&lt;/td&gt;
&lt;td&gt;Low-volume production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate (0.5 vCPU, 1GB)&lt;/td&gt;
&lt;td&gt;Production (no OS management)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate + NAT Gateway&lt;/td&gt;
&lt;td&gt;VPC-internal production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When to Use Each Pattern
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single vendor, low volume&lt;/td&gt;
&lt;td&gt;Direct Send (Part 2 pattern) — no Collector overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single vendor, high volume&lt;/td&gt;
&lt;td&gt;Collector (buffering + backpressure benefits)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-vendor evaluation&lt;/td&gt;
&lt;td&gt;Collector (add/remove exporters freely)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor migration in progress&lt;/td&gt;
&lt;td&gt;Collector (parallel delivery during cutover)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance: logs in multiple systems&lt;/td&gt;
&lt;td&gt;Collector (fan-out is a config change)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Because the Collector's infrastructure cost is fixed, the Collector path becomes more cost-effective as volume increases or vendors multiply: it processes each event once and fans out outside the Lambda. Direct-send can also fan out within one Lambda, but that pushes vendor-specific formatting, retry behavior, and failure isolation back into application code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Backend ingest/retention costs are not included in these AWS-side estimates. Datadog, Grafana Cloud, and Honeycomb each have their own pricing models that can become the dominant cost at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When to Use This Pattern
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-Vendor Evaluation
&lt;/h3&gt;

&lt;p&gt;Want to try Honeycomb for a month alongside your existing Datadog setup? Add one exporter to the Collector config. No Lambda redeployment. No risk to your existing pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance: Logs in Multiple Systems
&lt;/h3&gt;

&lt;p&gt;Some organizations require audit logs in multiple systems — security team uses Splunk, dev team uses Datadog, compliance team needs a cold archive. The Collector fans out to all simultaneously from a single OTLP stream.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Between Vendors
&lt;/h3&gt;

&lt;p&gt;Moving from Datadog to Grafana Cloud? Run both exporters in parallel during migration. Verify data parity in the new system. Remove the old exporter when satisfied. Zero-downtime vendor migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Optimization: Route by Volume
&lt;/h3&gt;

&lt;p&gt;Use the Collector's processor pipeline to route high-volume noisy logs (read operations) to a cheaper backend while keeping security-critical events (deletes, permission changes) on a premium platform with alerting.&lt;/p&gt;
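&lt;p&gt;In a real deployment this rule lives in the Collector config (the repository's Routing and Filtering Examples cover that). The Python sketch below only illustrates the decision itself. The operation names and tier labels are hypothetical; the actual values depend on what the pipeline writes into &lt;code&gt;fsxn.operation&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical operation names; the real values depend on what the pipeline
# writes into the fsxn.operation attribute.
NOISY_OPERATIONS = {"read", "open", "close"}
CRITICAL_OPERATIONS = {"delete", "rename", "setattr"}


def choose_tier(attributes):
    """Pick a backend tier for one log record based on fsxn.operation."""
    operation = str(attributes.get("fsxn.operation", "")).lower()
    if operation in CRITICAL_OPERATIONS:
        return "premium"
    if operation in NOISY_OPERATIONS:
        return "low_cost"
    # Unknown operations go to the alerting backend so nothing is silently hidden.
    return "premium"
```

&lt;p&gt;Defaulting unknown operations to the premium tier is a deliberate choice in this sketch: a misclassified noisy event costs money, while a misclassified security event costs visibility.&lt;/p&gt;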

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;For production hardening, the repository includes guides covering VPC deployment, health monitoring, persistent buffering, security hardening, and benchmarking. Auto-scaling and Multi-AZ deployment are natural next steps for production Collector operations.&lt;/p&gt;

&lt;p&gt;The full set of guides for production and partner-led deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/architecture-decision.md" rel="noopener noreferrer"&gt;Architecture Decision Record&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/vpc-deployment.md" rel="noopener noreferrer"&gt;VPC Deployment Guide&lt;/a&gt; — private networking, security groups, and Collector reachability from Lambda&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/config-governance.md" rel="noopener noreferrer"&gt;Config Governance Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/security-hardening.md" rel="noopener noreferrer"&gt;Security Hardening Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/operations-guide.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/cost-model.md" rel="noopener noreferrer"&gt;Cost Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/poc-checklist.md" rel="noopener noreferrer"&gt;PoC Checklist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/routing-filtering-examples.md" rel="noopener noreferrer"&gt;Routing and Filtering Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/compliance-note.md" rel="noopener noreferrer"&gt;Compliance Evidence Note&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/migration-guide.md" rel="noopener noreferrer"&gt;Migration Guide&lt;/a&gt; — zero-downtime migration from direct-send to the Collector path&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/otel-semantic-mapping.md" rel="noopener noreferrer"&gt;OTel Semantic Mapping Guide&lt;/a&gt; — standard vs project-specific attributes, schema evolution, and what OTLP does not solve&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/backend-parity-matrix.md" rel="noopener noreferrer"&gt;Backend Parity Matrix&lt;/a&gt; — visibility and query behavior across Datadog, Grafana Cloud, and Honeycomb&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/glossary.md" rel="noopener noreferrer"&gt;Glossary / 用語集&lt;/a&gt; — English/Japanese OTel terminology used in this project&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/enterprise-workload-addendum.md" rel="noopener noreferrer"&gt;Enterprise Workload Addendum&lt;/a&gt; — SAP, VMware, and mission-critical workload considerations&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/integrations/otel-collector/docs/en/storage-service-selection.md" rel="noopener noreferrer"&gt;Storage Service Selection Note&lt;/a&gt; — when to use FSx for ONTAP, Amazon S3, Amazon EFS, and Amazon EBS&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OTLP is the stable producer contract&lt;/strong&gt;. Your Lambda speaks one protocol; the Collector handles backend-specific exporters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTel Collector is the routing and processing layer&lt;/strong&gt; that decouples log producers from observability backends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Lambda code changes&lt;/strong&gt; when switching or adding backends — verified with SHA-256 hash comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-backend delivery is a config change&lt;/strong&gt;, not a code change. Add 5 lines of YAML, restart the Collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three FSx for ONTAP event sources work&lt;/strong&gt;: FSx audit logs via S3 Access Point (Part 2), EMS webhooks (Part 3), and FPolicy file operations (Part 4).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collector economics improve&lt;/strong&gt; as volume increases or vendors multiply — fixed Collector cost is amortized across all destinations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start with direct send&lt;/strong&gt; (Part 2) for simplicity. &lt;strong&gt;Graduate to the Collector&lt;/strong&gt; when you need multi-backend, vendor migration, or volume-based routing.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Why Your FSx for ONTAP Logs Deserve Better&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Shipping FSx for ONTAP Logs to Datadog — The Serverless Way&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Event-Driven Ransomware Detection with ONTAP ARP + Datadog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing"&gt;FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5&lt;/strong&gt;: Escape Vendor Lock-in with OTel Collector (this post)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Questions about the OTel Collector pattern or multi-backend delivery? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing"&gt;Part 4 — FPolicy File Activity Pipeline&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/fsxn-observability-integrations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>CTF Event Report: Security-JAWS 10th Anniversary Day 2 — All 27 AWS Security Challenges Solved</title>
      <dc:creator>TOMOAKI ishihara</dc:creator>
      <pubDate>Mon, 18 May 2026 13:13:37 +0000</pubDate>
      <link>https://dev.to/aws-builders/ctf-event-report-security-jaws-10th-anniversary-day-2-all-27-aws-security-challenges-solved-51c3</link>
      <guid>https://dev.to/aws-builders/ctf-event-report-security-jaws-10th-anniversary-day-2-all-27-aws-security-challenges-solved-51c3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I participated in the CTF held on &lt;a href="https://s-jaws.connpass.com/event/383752/" rel="noopener noreferrer"&gt;Day 2 of "Security-JAWS DAYS ~10th Anniversary Event~"&lt;/a&gt;, organized by &lt;a href="https://s-jaws.connpass.com/" rel="noopener noreferrer"&gt;Security-JAWS&lt;/a&gt;, a Japanese AWS user community focused on cloud security.&lt;/p&gt;

&lt;p&gt;The CTF was themed around a fictional SaaS company called "TechVault", and the scenario had us conducting an intrusion investigation, starting from their employee portal and ultimately uncovering evidence of fraudulent transactions. It was an exceptionally well-crafted CTF with a cohesive narrative running through all challenges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ey14azbl3tt4yy3jojv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ey14azbl3tt4yy3jojv.jpg" alt="secjaws10th-000.jpg" width="649" height="344"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Event Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; 13:00–17:00 / 4 hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total challenges:&lt;/strong&gt; 27

&lt;ul&gt;
&lt;li&gt;Tutorial: 6 / 290 pt&lt;/li&gt;
&lt;li&gt;Mainline: 12 / 2,300 pt&lt;/li&gt;
&lt;li&gt;Bonus: 5 / 1,400 pt&lt;/li&gt;
&lt;li&gt;Advanced: 2 / 700 pt&lt;/li&gt;
&lt;li&gt;Blue Team: 1 / 300 pt&lt;/li&gt;
&lt;li&gt;Finale: 1 / 600 pt&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Setting:&lt;/strong&gt; AWS environment of a fictional SaaS company "TechVault"&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The story begins with an intrusion investigation of TechVault's portal service and culminates in gathering evidence of someone's fraudulent transactions. The level of polish in the scenario design was remarkable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ercmszg4ulm5xgsnv39.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ercmszg4ulm5xgsnv39.jpg" alt="secjaws10th-001.jpg" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Challenges solved:&lt;/strong&gt; 27 out of 27&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score:&lt;/strong&gt; 5,310 pt (max: ~5,590–5,650 pt)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to complete:&lt;/strong&gt; 2 hours 48 minutes 36 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final ranking:&lt;/strong&gt; 12th out of 125 participants&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What went well:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Working hands-on with knowledge and CLI tools I don't normally touch in daily work gave me a much deeper understanding of each technique.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;What I'd improve:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;I should have set up my environment beforehand. I normally work in Dev Containers, so my host machine only had the minimum: AWS CLI and Python. Docker, OpenSSL, Boto3, and similar tools were missing, which cost me more time than necessary.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nmnlrn9ihulnxzactw4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nmnlrn9ihulnxzactw4.jpg" alt="secjaws10th-003.jpg" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpo24r1bsny8w2ujb3cn9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpo24r1bsny8w2ujb3cn9.jpg" alt="secjaws10th-002.jpg" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge Structure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tutorial (290 pt)
&lt;/h3&gt;

&lt;p&gt;A step-by-step introduction to web reconnaissance and AWS CLI basics.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T1&lt;/td&gt;
&lt;td&gt;Web Recon · robots.txt&lt;/td&gt;
&lt;td&gt;Discover hidden paths from &lt;code&gt;robots.txt&lt;/code&gt; Disallow directives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T2&lt;/td&gt;
&lt;td&gt;The Unlocked Warehouse · Public S3&lt;/td&gt;
&lt;td&gt;Retrieve files directly from a publicly exposed S3 bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T3&lt;/td&gt;
&lt;td&gt;Behind the Page · HTML Source&lt;/td&gt;
&lt;td&gt;Investigate credentials buried in HTML comments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T4&lt;/td&gt;
&lt;td&gt;First Steps with curl&lt;/td&gt;
&lt;td&gt;Check information embedded in HTTP response headers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T5&lt;/td&gt;
&lt;td&gt;Leaked Config · .env File&lt;/td&gt;
&lt;td&gt;Find a &lt;code&gt;.env&lt;/code&gt; file mistakenly placed in the web root&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T6&lt;/td&gt;
&lt;td&gt;First Steps with AWS CLI&lt;/td&gt;
&lt;td&gt;Use the key found in &lt;code&gt;.env&lt;/code&gt; to run &lt;code&gt;sts get-caller-identity&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The T5→T6 flow was clever. You grab a key from &lt;code&gt;.env&lt;/code&gt; and immediately use it with the AWS CLI — a hands-on demonstration of how a web vulnerability becomes an AWS entry point.&lt;/p&gt;
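&lt;p&gt;The reconnaissance step in T1 is easy to automate. The sketch below pulls the paths out of &lt;code&gt;Disallow&lt;/code&gt; directives in a &lt;code&gt;robots.txt&lt;/code&gt; body; the function name is an assumption, and this is an illustration of the idea rather than the tooling used during the event.&lt;/p&gt;

```python
def disallowed_paths(robots_txt):
    """Collect the paths named in Disallow directives, in order, without duplicates."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths
```

&lt;p&gt;Each returned path is a candidate to request with &lt;code&gt;curl&lt;/code&gt;, since a Disallow entry tells crawlers to stay away but does nothing to restrict access.&lt;/p&gt;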

&lt;h3&gt;
  
  
  Mainline (2,300 pt)
&lt;/h3&gt;

&lt;p&gt;The core of the CTF: an attack chain that follows the path of intrusion → privilege escalation → evidence collection, starting from Stage 0.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stage 0&lt;/td&gt;
&lt;td&gt;The Forgotten Debug Mode&lt;/td&gt;
&lt;td&gt;Extract AWS keys from debug output left in an auth API error response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 1A&lt;/td&gt;
&lt;td&gt;Flip the Bucket&lt;/td&gt;
&lt;td&gt;Find a file hidden under a &lt;code&gt;.hidden/&lt;/code&gt; prefix in an S3 bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 1B&lt;/td&gt;
&lt;td&gt;Who Am I?&lt;/td&gt;
&lt;td&gt;Read IAM policy metadata to understand the compromised user's permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 1C&lt;/td&gt;
&lt;td&gt;The Past Never Disappears&lt;/td&gt;
&lt;td&gt;Recover AWS keys left in a Git repository's commit history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 1D&lt;/td&gt;
&lt;td&gt;Ask the AI&lt;/td&gt;
&lt;td&gt;Prompt injection against an AI assistant embedded in the employee portal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2A&lt;/td&gt;
&lt;td&gt;The Deleted File&lt;/td&gt;
&lt;td&gt;Recover a deleted file using S3 object versioning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2B&lt;/td&gt;
&lt;td&gt;The Permission Map&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;sts:AssumeRole&lt;/code&gt; to pivot laterally into the DataAnalystRole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2D&lt;/td&gt;
&lt;td&gt;The Function's Secret&lt;/td&gt;
&lt;td&gt;Retrieve sensitive data stored in Lambda environment variables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2E&lt;/td&gt;
&lt;td&gt;The Parameter Labyrinth&lt;/td&gt;
&lt;td&gt;Navigate SSM Parameter Store paths to collect secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2G&lt;/td&gt;
&lt;td&gt;The AI's Permissions&lt;/td&gt;
&lt;td&gt;Extract S3 data via an over-privileged Bedrock agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3A&lt;/td&gt;
&lt;td&gt;The Vault Key&lt;/td&gt;
&lt;td&gt;Retrieve the ZIP decryption password from Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final&lt;/td&gt;
&lt;td&gt;Consolidate the Evidence&lt;/td&gt;
&lt;td&gt;Decrypt the evidence file using all collected information to expose the CEO's fraud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Bedrock agent challenge (Stage 2G) was a fresh idea. The agent was configured with direct S3 access, so data from a bucket I couldn't read directly could be pulled out simply by asking the agent "show me the project metadata." It drove home how important permission design is when integrating AI into your stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus / Advanced / Blue Team / Finale (3,000 pt)
&lt;/h3&gt;

&lt;p&gt;Additional challenges branching off the mainline, each requiring deeper technical knowledge.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2C&lt;/td&gt;
&lt;td&gt;The Server's Shadow&lt;/td&gt;
&lt;td&gt;bonus&lt;/td&gt;
&lt;td&gt;Flag stored in EC2 instance tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2F&lt;/td&gt;
&lt;td&gt;Find It Automatically&lt;/td&gt;
&lt;td&gt;bonus&lt;/td&gt;
&lt;td&gt;Scan all branches for secrets using &lt;code&gt;gitleaks&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3B&lt;/td&gt;
&lt;td&gt;The Invisible Voice&lt;/td&gt;
&lt;td&gt;bonus&lt;/td&gt;
&lt;td&gt;SSRF to IMDSv1 to steal EC2 role temporary credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3C&lt;/td&gt;
&lt;td&gt;The False Face&lt;/td&gt;
&lt;td&gt;advanced&lt;/td&gt;
&lt;td&gt;Self-declare &lt;code&gt;custom:role=admin&lt;/code&gt; during Cognito sign-up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3D&lt;/td&gt;
&lt;td&gt;The Truth Inside the Image&lt;/td&gt;
&lt;td&gt;bonus&lt;/td&gt;
&lt;td&gt;Recover files deleted by &lt;code&gt;RUN rm&lt;/code&gt; from Docker image layers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3E&lt;/td&gt;
&lt;td&gt;The Neighbor's Vault&lt;/td&gt;
&lt;td&gt;advanced&lt;/td&gt;
&lt;td&gt;Read another tenant's data via wildcard permissions on S3 Vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 4&lt;/td&gt;
&lt;td&gt;Follow the Trail&lt;/td&gt;
&lt;td&gt;blueteam&lt;/td&gt;
&lt;td&gt;Identify attacker operation timestamps from CloudTrail logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 5&lt;/td&gt;
&lt;td&gt;Suspicious Activity&lt;/td&gt;
&lt;td&gt;finale&lt;/td&gt;
&lt;td&gt;Decrypt CTO complicity evidence by tracing late-night activity in CloudTrail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 5B&lt;/td&gt;
&lt;td&gt;Combined Attack Surface&lt;/td&gt;
&lt;td&gt;bonus&lt;/td&gt;
&lt;td&gt;Call an internal API by combining intelligence from Stage 3D and Stage 3E&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Stage 3D was by far the most time-consuming for me — because Docker wasn't installed in my CTF environment. Instead of using &lt;code&gt;docker history&lt;/code&gt; (which would have shown it in seconds), I had to query the ECR API directly to fetch the image manifest, download each layer, and extract the tarballs manually. Painful on the clock, but I ended up with a much deeper understanding of how Docker image layers actually work.&lt;/p&gt;
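
&lt;p&gt;The manual layer search described above can be sketched in Python. This is only an illustration, not the exact steps used during the CTF: it assumes the layer blobs have already been downloaded as tar archives (registry layers are typically gzip-compressed), and the helper names and file paths are hypothetical.&lt;/p&gt;

```python
import io
import os
import tarfile
import tempfile


def find_in_layers(layer_paths, filename):
    """Yield (layer path, member name) for every copy of filename in the layers."""
    for path in layer_paths:
        # "r:*" lets tarfile auto-detect gzip-compressed or plain tar layers.
        with tarfile.open(path, mode="r:*") as layer:
            for member in layer.getmembers():
                base = member.name.rsplit("/", 1)[-1]
                if base == filename:
                    yield path, member.name
                elif base == ".wh." + filename:
                    # Whiteout entry: a later layer marking the file as deleted.
                    yield path, member.name + " (whiteout)"


def write_layer(path, files):
    """Demo helper: build a small gzip-compressed layer archive."""
    with tarfile.open(path, mode="w:gz") as layer:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            layer.addfile(info, io.BytesIO(data))


# Two fake layers: one adds the secret, the next one "removes" it.
workdir = tempfile.mkdtemp()
layer1 = os.path.join(workdir, "layer1.tar.gz")
layer2 = os.path.join(workdir, "layer2.tar.gz")
write_layer(layer1, {"app/secret.txt": b"FLAG"})
write_layer(layer2, {"app/.wh.secret.txt": b""})

results = list(find_in_layers([layer1, layer2], "secret.txt"))
for layer_path, member_name in results:
    print(os.path.basename(layer_path), member_name)
```

&lt;p&gt;The point of the sketch is that the deleting layer only records a whiteout marker; the earlier layer still contains the full file.&lt;/p&gt;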

&lt;p&gt;Stage 4 and Stage 5 involved parsing large CloudTrail log files with &lt;code&gt;jq&lt;/code&gt; to reconstruct the attacker's footsteps — a great taste of what SOC/incident response work feels like. Stage 5 in particular required chaining multiple steps: find the suspicious late-night (JST) operations in the logs, track down the Secrets Manager path they pointed to, and decrypt the encrypted evidence file with OpenSSL. OpenSSL wasn't available either, so I ended up implementing the decryption in Python.&lt;/p&gt;
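
&lt;p&gt;The late-night filter can also be sketched in Python instead of &lt;code&gt;jq&lt;/code&gt;. This is a minimal illustration rather than the exact queries used in the CTF: it assumes the standard CloudTrail log layout (a top-level &lt;code&gt;Records&lt;/code&gt; array with &lt;code&gt;eventTime&lt;/code&gt; in UTC), and the function name, sample events, and the 00:00 to 04:59 JST window are hypothetical choices.&lt;/p&gt;

```python
import json
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def late_night_events(log_text, start_hour=0, end_hour=5):
    """Return (JST time, event name, caller ARN) for events inside the JST hour window."""
    records = json.loads(log_text).get("Records", [])
    hits = []
    for record in records:
        # CloudTrail timestamps look like 2024-01-01T17:30:00Z (UTC).
        utc_time = datetime.fromisoformat(record["eventTime"].replace("Z", "+00:00"))
        jst_time = utc_time.astimezone(JST)
        if jst_time.hour in range(start_hour, end_hour):
            arn = record.get("userIdentity", {}).get("arn", "unknown")
            hits.append((jst_time.isoformat(), record.get("eventName", ""), arn))
    return sorted(hits)


# Hypothetical sample log: one event at 02:30 JST, one at 12:00 JST.
sample = json.dumps({"Records": [
    {"eventTime": "2024-01-01T17:30:00Z", "eventName": "GetSecretValue",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"}},
    {"eventTime": "2024-01-01T03:00:00Z", "eventName": "ListBuckets",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"}},
]})

for hit in late_night_events(sample):
    print(hit)
```

&lt;p&gt;Converting to JST before filtering matters here, because the raw timestamps are in UTC and a "late-night" event can fall on a different calendar day.&lt;/p&gt;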




&lt;h2&gt;
  
  
  The Full Attack Chain
&lt;/h2&gt;

&lt;p&gt;Each challenge looks independent, but they're all connected as a single story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Obtain AWS keys from debug API response (Stage 0)
        ↓
IAM recon reveals an AssumeRole-able role (Stage 1B)
        ↓
Pivot laterally into DataAnalystRole (Stage 2B)
        ↓
┌─────────────────────────────────────────────────────┐
│  Collect intelligence across multiple parallel paths │
│  · EC2 tags (Stage 2C)                              │
│  · S3 versioning — recover deleted files (Stage 2A) │
│  · Lambda environment variables (Stage 2D)          │
│  · SSM Parameter Store (Stage 2E)                   │
│  · SSRF → IMDSv1 (Stage 3B)                        │
│  · ECR Docker layer analysis (Stage 3D)             │
│  · S3 Vectors cross-tenant leak (Stage 3E)          │
└─────────────────────────────────────────────────────┘
        ↓
Retrieve password from Secrets Manager (Stage 3A)
        ↓
Decrypt ZIP to obtain CEO fraud evidence (Final)
        ↓
Trace CTO complicity via CloudTrail (Stage 4 → 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each individual vulnerability might look limited in isolation, but chaining them together produces a critical breach. Stage 5B is the perfect example: its internal API is only reachable by combining intelligence gathered in two separate stages, 3D and 3E.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways and Mitigations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Information Leakage (Debug, Headers, etc.)
&lt;/h3&gt;

&lt;p&gt;Debug output that's convenient during development can leak AWS keys if left enabled in production. HTML comments, response headers, and &lt;code&gt;robots.txt&lt;/code&gt; are all reconnaissance vectors attackers regularly check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disable debug mode in production environments.&lt;/li&gt;
&lt;li&gt;Remove unnecessary response headers like &lt;code&gt;X-Powered-By&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Never place secrets in front-end source code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  S3 Misconfiguration
&lt;/h3&gt;

&lt;p&gt;Three distinct S3 issues appeared: public access enabled, a &lt;code&gt;.hidden/&lt;/code&gt; prefix used as security-by-obscurity, and deleted files recoverable via versioning. All three stem from treating S3 like a traditional filesystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable Block Public Access on all buckets.&lt;/li&gt;
&lt;li&gt;Prefixes are not access controls.&lt;/li&gt;
&lt;li&gt;If versioning is enabled, also design lifecycle policies to expire delete markers and old versions.&lt;/li&gt;
&lt;/ul&gt;
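
&lt;p&gt;As a rough sketch of the last mitigation, a lifecycle configuration along these lines expires orphaned delete markers and old versions. The rule ID, the 30-day retention, and the bucket name in the comment are assumptions for illustration, not values from the CTF.&lt;/p&gt;

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-old-versions",        # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},           # empty prefix: applies to the whole bucket
            # Remove delete markers that no longer hide any version.
            "Expiration": {"ExpiredObjectDeleteMarker": True},
            # Permanently delete noncurrent versions after 30 days (assumed value).
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle_configuration)
```

&lt;p&gt;Until such a rule removes them, "deleted" objects in a versioned bucket remain retrievable by anyone allowed to list and read object versions.&lt;/p&gt;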

&lt;h3&gt;
  
  
  Secrets in Git History
&lt;/h3&gt;

&lt;p&gt;Even after deleting a &lt;code&gt;.env&lt;/code&gt; file and committing the removal, &lt;code&gt;git log -p&lt;/code&gt; surfaces it instantly. Tools like &lt;code&gt;gitleaks&lt;/code&gt; can scan every branch and every commit in seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrate &lt;code&gt;git-secrets&lt;/code&gt; or &lt;code&gt;gitleaks&lt;/code&gt; as a pre-commit hook.&lt;/li&gt;
&lt;li&gt;If a secret was already committed, rewrite history with &lt;code&gt;git filter-repo&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; rotate the key immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prompt Injection
&lt;/h3&gt;

&lt;p&gt;A single sentence — "ignore previous instructions" — was enough to extract the contents of the system prompt. Using the system prompt as a "hidden" information store is not a security boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never put sensitive information in system prompts.&lt;/li&gt;
&lt;li&gt;Validate both inputs and outputs. Make the boundary between user input and system instructions explicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Overly Broad IAM Permissions and AssumeRole
&lt;/h3&gt;

&lt;p&gt;A principal that is allowed to call &lt;code&gt;sts:AssumeRole&lt;/code&gt; on a role, and that the role's trust policy accepts, can switch into that role and inherit its permissions. In this CTF, flags were embedded in IAM policy descriptions and EC2 tags for challenge purposes, but in the real world, metadata fields are an underappreciated place for sensitive data to accumulate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply the principle of least privilege rigorously.&lt;/li&gt;
&lt;li&gt;When granting &lt;code&gt;sts:AssumeRole&lt;/code&gt;, restrict the &lt;code&gt;Resource&lt;/code&gt; element to the specific role ARNs that are needed.&lt;/li&gt;
&lt;li&gt;Use Condition keys in trust policies to restrict callers.&lt;/li&gt;
&lt;/ul&gt;
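
&lt;p&gt;A quick way to spot overly broad grants is to scan policy documents for wildcard resources. The sketch below assumes the policy has already been loaded as a Python dictionary; the function name, the sample policy, and the account ID are hypothetical.&lt;/p&gt;

```python
def find_wildcard_statements(policy):
    """Return the Allow statements whose Resource includes a bare wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may hold a single statement object instead of a list.
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        resources = statement.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if any(resource == "*" for resource in resources):
            flagged.append(statement)
    return flagged


# Hypothetical policy: the first statement is too broad, the second is scoped.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TooBroad", "Effect": "Allow",
         "Action": "sts:AssumeRole", "Resource": "*"},
        {"Sid": "Scoped", "Effect": "Allow", "Action": "sts:AssumeRole",
         "Resource": "arn:aws:iam::111122223333:role/DataAnalystRole"},
    ],
}

for statement in find_wildcard_statements(sample_policy):
    print("Wildcard resource in statement:", statement.get("Sid"))
```

&lt;p&gt;This only catches the bare &lt;code&gt;*&lt;/code&gt; case; partial wildcards inside an ARN would need a stricter check.&lt;/p&gt;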

&lt;h3&gt;
  
  
  Poor Secret Management
&lt;/h3&gt;

&lt;p&gt;Three storage locations appeared: Lambda environment variables, SSM Parameter Store, and Secrets Manager. Even Secrets Manager provides no protection if the IAM permissions granting &lt;code&gt;GetSecretValue&lt;/code&gt; are too broad.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage secrets in Secrets Manager.&lt;/li&gt;
&lt;li&gt;Scope the &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt; permission to the specific secret ARN rather than a wildcard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SSRF × IMDSv1
&lt;/h3&gt;

&lt;p&gt;A URL preview feature in the dashboard was fetching external URLs server-side — and there was no filtering to block requests to &lt;code&gt;http://169.254.169.254&lt;/code&gt;. IMDSv1 requires no token, so SSRF access to the link-local address yields EC2 role temporary credentials directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enforce IMDSv2 (&lt;code&gt;HttpTokens: required&lt;/code&gt;) on all EC2 instances.&lt;/li&gt;
&lt;li&gt;URL-fetching features should use an allowlist, and must block private IP ranges and link-local addresses.&lt;/li&gt;
&lt;/ul&gt;
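
&lt;p&gt;The second mitigation can be sketched with the Python standard library. This is a simplified illustration under stated assumptions, not the CTF application's code: the function names are hypothetical, and the allowlist is passed in by the caller.&lt;/p&gt;

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_address(ip_text):
    """True for loopback, private, link-local, and other non-public addresses."""
    ip = ipaddress.ip_address(ip_text)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_url_allowed(url, allowed_hosts):
    """Allow only http(s) URLs whose host is allowlisted and resolves to public IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    if parts.hostname not in allowed_hosts:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, None)
    except socket.gaierror:
        return False
    # Drop any IPv6 scope suffix before parsing the resolved address.
    addresses = [info[4][0].split("%")[0] for info in infos]
    return not any(is_blocked_address(address) for address in addresses)


print(is_blocked_address("169.254.169.254"))   # instance metadata address
print(is_url_allowed("http://169.254.169.254/latest/meta-data/", {"example.com"}))
```

&lt;p&gt;A check like this is only a first layer: because DNS answers can change between validation and fetch, the fetcher should connect to the address that was validated, and IMDSv2 should still be enforced on the instance.&lt;/p&gt;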

&lt;h3&gt;
  
  
  Secrets Persisted in Docker Image Layers
&lt;/h3&gt;

&lt;p&gt;A Dockerfile pattern like &lt;code&gt;COPY secret.txt .&lt;/code&gt; → &lt;code&gt;RUN python setup.py&lt;/code&gt; → &lt;code&gt;RUN rm secret.txt&lt;/code&gt; produces a final image where &lt;code&gt;secret.txt&lt;/code&gt; is not visible at runtime. However, downloading the image layers directly from the ECR API reveals &lt;code&gt;secret.txt&lt;/code&gt; intact in a prior layer's tarball.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use multi-stage builds; never copy secrets into build contexts.&lt;/li&gt;
&lt;li&gt;Retrieve secrets from Secrets Manager at runtime instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Broken Multi-Tenant Permission Design
&lt;/h3&gt;

&lt;p&gt;The S3 Vectors resource policy was set to &lt;code&gt;Resource: "*"&lt;/code&gt;, allowing a role scoped to one tenant to query another tenant's vector data. Tenant isolation in SaaS demands rigorous permission separation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Constrain the &lt;code&gt;Resource&lt;/code&gt; and &lt;code&gt;Condition&lt;/code&gt; in resource policies to tenant-specific identifiers.&lt;/li&gt;
&lt;li&gt;If sharing a vector bucket, scope queries and metadata access by tenant at the API level.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cognito Authorization Design Flaw
&lt;/h3&gt;

&lt;p&gt;Passing &lt;code&gt;custom:role=admin&lt;/code&gt; in &lt;code&gt;--user-attributes&lt;/code&gt; during &lt;code&gt;aws cognito-idp sign-up&lt;/code&gt; was enough to self-declare administrator status, which the application then trusted for authorization decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control role assignment server-side (e.g., in a Pre Sign-up Lambda Trigger).&lt;/li&gt;
&lt;li&gt;Never use attributes that external parties can set as the basis for authorization.&lt;/li&gt;
&lt;/ul&gt;
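
&lt;p&gt;One possible shape for the server-side control is a Pre Sign-up trigger that refuses client-supplied privilege attributes. The handler below is a hedged sketch, not the CTF's fix: the set of protected attribute names is an assumption, and it relies on Cognito passing the submitted attributes in &lt;code&gt;event["request"]["userAttributes"]&lt;/code&gt; and rejecting the sign-up when the trigger raises an error.&lt;/p&gt;

```python
# Attributes that only the back end should ever set (assumed list).
PROTECTED_ATTRIBUTES = {"custom:role"}


def lambda_handler(event, context):
    """Pre sign-up trigger sketch: refuse client-supplied privilege attributes."""
    attributes = event.get("request", {}).get("userAttributes", {})
    forbidden = sorted(name for name in attributes if name in PROTECTED_ATTRIBUTES)
    if forbidden:
        # Raising from a pre sign-up trigger makes Cognito reject the sign-up.
        raise ValueError("attributes not allowed at sign-up: " + ", ".join(forbidden))
    return event


# Local demonstration with hypothetical events.
normal_event = {"request": {"userAttributes": {"email": "user@example.com"}}}
print(lambda_handler(normal_event, None) is normal_event)

malicious_event = {"request": {"userAttributes": {"custom:role": "admin"}}}
try:
    lambda_handler(malicious_event, None)
except ValueError as error:
    print("rejected:", error)
```

&lt;p&gt;Removing &lt;code&gt;custom:role&lt;/code&gt; from the app client's writable attributes closes the same gap at the configuration level, and the application should still derive roles from data the back end controls.&lt;/p&gt;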

&lt;h3&gt;
  
  
  AI Agent Over-Privilege
&lt;/h3&gt;

&lt;p&gt;The Bedrock agent's IAM role had access to S3 buckets that the DataAnalystRole itself could not read. By asking the agent a natural-language question, data from otherwise-inaccessible buckets was pulled out indirectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply the principle of least privilege to AI agent roles as well.&lt;/li&gt;
&lt;li&gt;Explicitly enumerate and restrict the resources an agent is permitted to access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CloudTrail for Evidence Preservation
&lt;/h3&gt;

&lt;p&gt;With CloudTrail logs in place, "what happened, when, and by whom" can be reconstructed almost completely. A handful of &lt;code&gt;jq&lt;/code&gt; filters were enough to trace the attacker's full activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable CloudTrail in all regions and ship logs to S3.&lt;/li&gt;
&lt;li&gt;Pair with GuardDuty for real-time detection.&lt;/li&gt;
&lt;li&gt;Apply Object Lock to the log bucket to prevent tampering.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;All 27 challenges together gave me a visceral sense of how AWS misconfigurations cascade. Each individual problem represented a realistic "seen-in-the-wild" vulnerability — but what made this CTF special was that they were all woven into a single coherent story.&lt;/p&gt;

&lt;p&gt;AWS certifications don't teach you &lt;em&gt;why&lt;/em&gt; something is dangerous. Solving these challenges hands-on — making mistakes, working around missing tools, figuring out the low-level APIs when Docker wasn't available — built an intuition that studying documentation alone never could.&lt;/p&gt;

&lt;p&gt;I highly recommend participating if a similar opportunity comes around. And if you're building on AWS, I hope this report serves as a useful checklist of things worth double-checking in your own environment.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>ctf</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>I Built an ML Churn Predictor in Minutes: Here's How Kiro Made It Possible</title>
      <dc:creator>Adeline Makokha [AWS Hero]</dc:creator>
      <pubDate>Mon, 18 May 2026 12:25:13 +0000</pubDate>
      <link>https://dev.to/aws-builders/i-built-a-ml-churn-predictor-in-minutes-heres-how-kiro-made-it-possible-2bdl</link>
      <guid>https://dev.to/aws-builders/i-built-a-ml-churn-predictor-in-minutes-heres-how-kiro-made-it-possible-2bdl</guid>
      <description>&lt;p&gt;Customer churn is one of the most expensive problems in the telecom industry. Acquiring a new customer costs 5–10× more than retaining an existing one, yet most companies only discover a customer has churned &lt;em&gt;after&lt;/em&gt; they've already left. The goal of this project is to flip that, give analysts a tool to identify at-risk customers &lt;em&gt;before&lt;/em&gt; they churn, so retention teams can act proactively.&lt;/p&gt;

&lt;p&gt;What would normally take days of planning, scaffolding, and wiring together took a fraction of the time, because I built it with &lt;strong&gt;&lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;&lt;/strong&gt;, an AI-powered development environment that thinks in specs, not just code completions.&lt;/p&gt;

&lt;p&gt;In this article I'll walk through building a complete churn prediction web application from scratch using Python, Flask, scikit-learn, and Plotly. By the end you'll have a working app that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts CSV uploads of customer data&lt;/li&gt;
&lt;li&gt;Runs a Random Forest churn prediction model&lt;/li&gt;
&lt;li&gt;Visualises results with three interactive charts&lt;/li&gt;
&lt;li&gt;Lets you browse, filter, and sort at-risk customers&lt;/li&gt;
&lt;li&gt;Exports results to CSV for downstream use&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How Kiro Accelerated This Build
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, it's worth explaining &lt;em&gt;why&lt;/em&gt; this came together so fast.&lt;/p&gt;

&lt;p&gt;Most AI coding tools are reactive: you write code, and they autocomplete. Kiro works differently. It starts with a &lt;strong&gt;spec-driven workflow&lt;/strong&gt; where you describe what you want to build, and Kiro helps you think through requirements, design, and implementation tasks &lt;em&gt;before&lt;/em&gt; a single line of code is written.&lt;/p&gt;

&lt;p&gt;Here's exactly how this project unfolded:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Requirements in minutes, not hours
&lt;/h3&gt;

&lt;p&gt;I described the project in plain English (&lt;em&gt;"a telecom customer churn prediction website using Python"&lt;/em&gt;), and Kiro generated a full requirements document covering 7 requirement areas with precise, testable acceptance criteria in EARS format. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"IF a Dataset contains up to 10,000 Customer Records, THEN THE Predictor SHALL complete prediction within 30 seconds."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No ambiguity. No back-and-forth. Edge cases I hadn't even thought about, like what happens when &lt;code&gt;tenure = 0&lt;/code&gt;, or when a CSV is valid but contains zero data rows, were already covered.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Technical design with 15 correctness properties
&lt;/h3&gt;

&lt;p&gt;From the requirements, Kiro produced a full technical design document: component interfaces with Python signatures, data models, an architecture diagram, a Flask route table, and &lt;strong&gt;15 formal correctness properties&lt;/strong&gt; to be verified with property-based tests. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"For any array of churn scores and any threshold in [0.0, 1.0], &lt;code&gt;compute_churn_rate&lt;/code&gt; SHALL return exactly &lt;code&gt;round((count of scores &amp;gt;= threshold / total count) * 100, 2)&lt;/code&gt;."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the kind of rigour that usually only happens on large teams with dedicated QA. Kiro baked it in from the start.&lt;/p&gt;
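
&lt;p&gt;To make that property concrete, here is one way a function satisfying it could look. This is a sketch written for this explanation, not the code Kiro generated: it takes a plain list of scores, and the handling of an empty list is an assumption.&lt;/p&gt;

```python
from bisect import bisect_left


def compute_churn_rate(scores, threshold):
    """Percentage of scores at or above the threshold, rounded to 2 decimals."""
    if not scores:
        return 0.0  # assumption: an empty input counts as a 0% churn rate
    ordered = sorted(scores)
    # bisect_left gives the index of the first score at or above the threshold.
    at_risk = len(ordered) - bisect_left(ordered, threshold)
    return round(at_risk / len(ordered) * 100, 2)


print(compute_churn_rate([0.2, 0.5, 0.9], 0.5))
```

&lt;p&gt;A property-based test would then generate many random score lists and thresholds and check the result against the formula, instead of relying on a few hand-picked cases.&lt;/p&gt;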

&lt;h3&gt;
  
  
  3. Implementation tasks, automatically sequenced
&lt;/h3&gt;

&lt;p&gt;Kiro then broke the design into a dependency-ordered task list: 13 top-level tasks across 8 parallel waves, from project scaffolding through to integration tests. Each task referenced specific requirements for traceability.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Code generation that actually matches the spec
&lt;/h3&gt;

&lt;p&gt;With the spec in place, Kiro generated all the Python modules, Flask routes, Jinja2 templates, JavaScript, and sample data, and the code matched the design document precisely. No hallucinated APIs, no mismatched interfaces.&lt;/p&gt;

&lt;p&gt;The result is a production-quality app with 146 passing tests (unit, integration, and property-based), generated from a single plain-English description.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The takeaway:&lt;/strong&gt; Kiro doesn't just write code faster. It helps you build the &lt;em&gt;right&lt;/em&gt; thing by front-loading the thinking. The spec becomes the source of truth, and the code follows from it.&lt;/p&gt;
&lt;/blockquote&gt;







&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;Here's the full feature set at a glance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV Upload&lt;/td&gt;
&lt;td&gt;Validates format, size (≤50 MB), required columns, and row-level data quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Churn Prediction&lt;/td&gt;
&lt;td&gt;Random Forest model, configurable threshold (default 0.5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Summary stats + 3 Plotly charts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;At-Risk Table&lt;/td&gt;
&lt;td&gt;Paginated (25/page), sortable, filterable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export&lt;/td&gt;
&lt;td&gt;Download results as CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Info&lt;/td&gt;
&lt;td&gt;Displays model name, version, and training date&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The app follows a clean separation of concerns. Each responsibility lives in its own module:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim3kcfqib1ogv5ejq0s2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim3kcfqib1ogv5ejq0s2.png" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request flows:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upload&lt;/strong&gt; → &lt;code&gt;Validator&lt;/code&gt; checks the file → valid rows stored in &lt;code&gt;AppState&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predict&lt;/strong&gt; → &lt;code&gt;Predictor&lt;/code&gt; scores every row → &lt;code&gt;Visualizer&lt;/code&gt; builds chart specs → stored in &lt;code&gt;AppState&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export&lt;/strong&gt; → &lt;code&gt;Exporter&lt;/code&gt; serialises results → streamed as file download&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startup&lt;/strong&gt; → &lt;code&gt;ModelLoader&lt;/code&gt; loads &lt;code&gt;model.joblib&lt;/code&gt; once; if it fails, prediction is disabled&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;telecom-churn-app/
├── app.py               # Flask routes and AppState
├── validator.py         # CSV upload validation
├── predictor.py         # Churn scoring
├── visualizer.py        # Plotly chart builders
├── exporter.py          # CSV export
├── model_loader.py      # joblib model loading
├── table_helpers.py     # Pagination, sort, filter
├── generate_model.py    # One-time model training script
├── requirements.txt
├── data/
│   ├── sample_customers.csv   # 200 rows
│   └── sample_small.csv       # 20 rows for quick testing
├── templates/
│   ├── base.html
│   └── dashboard.html
└── static/
    └── app.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;li&gt;pip&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Install dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;requirements.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=3.0.3&lt;/span&gt;
&lt;span class="py"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=2.2.2&lt;/span&gt;
&lt;span class="py"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.26.4&lt;/span&gt;
&lt;span class="py"&gt;scikit-learn&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.5.0&lt;/span&gt;
&lt;span class="py"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=1.4.2&lt;/span&gt;
&lt;span class="py"&gt;plotly&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=5.22.0&lt;/span&gt;
&lt;span class="py"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=8.2.2&lt;/span&gt;
&lt;span class="py"&gt;hypothesis&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=6.103.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generate the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python generate_model.py
&lt;span class="c"&gt;# → Model saved to model.joblib&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the app
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;span class="c"&gt;# → Running on http://localhost:5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;The app expects a CSV with these five columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tenure&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;0 – 999 (months)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;monthly_charges&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;&amp;gt; 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;total_charges&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;&amp;gt; 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;contract_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;"Month-to-month", "One year", "Two year"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sample rows from &lt;code&gt;data/sample_small.csv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;customer&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;tenure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;monthly&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;charges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;total&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;charges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;contract&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;
&lt;span class="k"&gt;CUST&lt;/span&gt;&lt;span class="mf"&gt;0001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;69&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;113.04&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;7560.84&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;One&lt;/span&gt; &lt;span class="k"&gt;year&lt;/span&gt;
&lt;span class="k"&gt;CUST&lt;/span&gt;&lt;span class="mf"&gt;0004&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;59.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2325.40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;Month&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;
&lt;span class="k"&gt;CUST&lt;/span&gt;&lt;span class="mf"&gt;0007&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;113.62&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;7797.64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;Two&lt;/span&gt; &lt;span class="k"&gt;year&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1: Training the Model (&lt;code&gt;generate_model.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;We generate 1,000 rows of synthetic training data where churn probability is driven by three realistic signals: short tenure, high monthly charges, and month-to-month contracts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum4u0ulceferkon7l09y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum4u0ulceferkon7l09y.png" alt=" " width="800" height="337"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_training_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tenure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;monthly_charges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;20.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;120.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_charges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenure&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;monthly_charges&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;contract_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Month-to-month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;One year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Churn probability: higher for short tenure, month-to-month, high charges
&lt;/span&gt;    &lt;span class="n"&gt;churn_prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tenure&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monthly_charges&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contract_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Month-to-month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;churn_prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;churn_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;churn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;churn_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
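
&lt;p&gt;To make the weighting concrete, here is a small stand-alone sketch that evaluates the same formula for two hand-picked customers. The helper name &lt;code&gt;churn_probability&lt;/code&gt; and the sample values are illustrative assumptions, not part of the original script, and it works on plain Python floats instead of NumPy arrays.&lt;/p&gt;

```python
def churn_probability(tenure: int, monthly_charges: float, contract_type: str) -> float:
    """Scalar version of the synthetic churn formula (hypothetical helper)."""
    raw = (
        0.4 * (1 - tenure / 72)
        + 0.3 * (monthly_charges - 20) / 100
        + 0.3 * float(contract_type == "Month-to-month")
    )
    # Same effect as np.clip(churn_prob, 0.05, 0.95) on a single value
    return min(max(raw, 0.05), 0.95)


# Brand-new, expensive, month-to-month customer: every term is at its maximum,
# so the raw score is about 1.0 and gets clipped to the upper bound
new_customer = churn_probability(0, 120.0, "Month-to-month")

# 72-month, cheapest-plan, two-year customer: every term is zero,
# so the raw score is 0.0 and gets clipped to the lower bound
loyal_customer = churn_probability(72, 20.0, "Two year")

print(new_customer, loyal_customer)
```

&lt;p&gt;The clipping step means no synthetic customer is ever a guaranteed churner or a guaranteed stayer, which keeps some noise in the labels.&lt;/p&gt;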



&lt;p&gt;After training a &lt;code&gt;RandomForestClassifier&lt;/code&gt;, we attach metadata directly to the model object before saving with joblib:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RandomForestChurnModel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.joblib&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the model and its metadata in a single file, so no separate config file is needed.&lt;/p&gt;
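
&lt;p&gt;Attaching the dictionary as an attribute works because joblib serializes the object with Python's pickle machinery, so instance attributes travel with it. The sketch below shows the same round trip with the standard-library &lt;code&gt;pickle&lt;/code&gt; module and a &lt;code&gt;SimpleNamespace&lt;/code&gt; standing in for the trained model. Both substitutions are assumptions made only to keep the example dependency-free.&lt;/p&gt;

```python
import pickle
from types import SimpleNamespace

# Stand-in for a trained estimator (assumption: any picklable object
# with instance attributes behaves the same way here)
model = SimpleNamespace(n_estimators=100)
model.metadata = {
    "name": "RandomForestChurnModel",
    "version": "1.0.0",
    "training_date": "2024-01-15",
}

# Round-trip through bytes, the way a file on disk would
restored = pickle.loads(pickle.dumps(model))

print(restored.metadata["version"])
```

&lt;p&gt;As with any pickle-based format, only load model files that come from a source you trust.&lt;/p&gt;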




&lt;h2&gt;
  
  
  Step 2: Loading the Model (&lt;code&gt;model_loader.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;The model is loaded once at startup. If the file is missing or corrupt, the app enters a degraded state where prediction is disabled but everything else still works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0w3ps7icvmyxy2gbwoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0w3ps7icvmyxy2gbwoh.png" alt=" " width="800" height="308"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
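&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;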
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;training_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;  &lt;span class="c1"&gt;# displayed as ISO 8601 YYYY-MM-DD
&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoadedModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelMetadata&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelLoadError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LoadedModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;FileNotFoundError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ModelLoadError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model file not found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ModelLoadError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to load model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;

    &lt;span class="c1"&gt;# Treat missing or malformed metadata like a corrupt file
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw_meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;training_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromisoformat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;AttributeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ModelLoadError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid model metadata: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LoadedModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;app.py&lt;/code&gt;, this runs before the first request is served:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;app_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AppState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_load_model_on_startup&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loaded_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ModelLoadError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_load_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;_load_model_on_startup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
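
&lt;p&gt;The &lt;code&gt;AppState&lt;/code&gt; class itself is not shown in this excerpt. A minimal sketch of what it could look like, assuming only the two attributes used above plus a hypothetical &lt;code&gt;prediction_enabled&lt;/code&gt; property that a route handler could check before accepting an upload:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppState:
    """Hypothetical container for startup results.

    The field names come from the startup snippet; everything else is assumed.
    """
    loaded_model: Optional[object] = None
    model_load_error: Optional[str] = None

    @property
    def prediction_enabled(self) -> bool:
        # Degraded state: the app is running, but there is no model to predict with
        return self.loaded_model is not None


healthy = AppState(loaded_model=object())
degraded = AppState(model_load_error="Model file not found: model.joblib")

print(healthy.prediction_enabled, degraded.prediction_enabled)
```

&lt;p&gt;Keeping the error message on the state object lets the UI explain why prediction is unavailable instead of failing on the first upload.&lt;/p&gt;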






&lt;h2&gt;
  
  
  Step 3: Validating Uploads (&lt;code&gt;validator.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;The validator runs a multi-stage pipeline. Each stage can fail fast with a clear error message:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokjw39rseix3ad28lukn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokjw39rseix3ad28lukn.png" alt=" " width="800" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The row-level rules are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;monthly_charges&lt;/code&gt; and &lt;code&gt;total_charges&lt;/code&gt; must be numeric and &amp;gt; 0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tenure&lt;/code&gt; must be numeric, ≥ 0, and ≤ 999 (tenure = 0 is valid)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw33b45dt3h7w9315ao6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw33b45dt3h7w9315ao6s.png" alt=" " width="800" height="349"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_validate_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;work_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_tenure_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;          &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_monthly_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_total_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;valid_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_tenure_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_tenure_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_tenure_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_monthly_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_monthly_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_total_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_total_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;valid_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;valid_mask&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;invalid_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;valid_mask&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;valid_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invalid_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If some rows are invalid but at least one is valid, the app warns the user and proceeds with the clean rows. If &lt;em&gt;all&lt;/em&gt; rows are invalid, prediction is blocked.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ValidationResult&lt;/code&gt; dataclass carries everything the route handler needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ValidationResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;warning_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;total_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;valid_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;invalid_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
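
&lt;p&gt;How the row counts could map onto such a result is sketched below. The helper &lt;code&gt;summarize_rows&lt;/code&gt;, the message wording, and the trimmed-down &lt;code&gt;RowSummary&lt;/code&gt; dataclass (no &lt;code&gt;dataframe&lt;/code&gt; field, so the example runs without pandas) are assumptions for illustration, not the app's actual code.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowSummary:
    """Simplified stand-in for ValidationResult (hypothetical, no dataframe field)."""
    success: bool
    error_message: Optional[str] = None
    warning_message: Optional[str] = None
    total_rows: int = 0
    valid_rows: int = 0
    invalid_rows: int = 0


def summarize_rows(total_rows: int, invalid_rows: int) -> RowSummary:
    valid_rows = total_rows - invalid_rows
    if valid_rows == 0:
        # Nothing usable: block prediction
        return RowSummary(False, error_message="No valid rows found.",
                          total_rows=total_rows, invalid_rows=invalid_rows)
    warning = None
    if invalid_rows > 0:
        # Partial success: proceed with the clean rows, but tell the user
        warning = f"{invalid_rows} of {total_rows} rows were skipped."
    return RowSummary(True, warning_message=warning, total_rows=total_rows,
                      valid_rows=valid_rows, invalid_rows=invalid_rows)


partial = summarize_rows(total_rows=10, invalid_rows=3)
blocked = summarize_rows(total_rows=4, invalid_rows=4)
print(partial.warning_message, blocked.error_message)
```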






&lt;h2&gt;
  
  
  Step 4: Running Predictions (&lt;code&gt;predictor.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;The predictor one-hot encodes &lt;code&gt;contract_type&lt;/code&gt; to match the training feature set, then calls &lt;code&gt;predict_proba&lt;/code&gt; to get churn probabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdl2doelbicnrb1cegfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdl2doelbicnrb1cegfl.png" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PredictionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;feature_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;feature_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Ensure all contract type columns exist even if not in this batch
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type_Month-to-month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type_One year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type_Two year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;feature_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;feature_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="n"&gt;feature_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type_Month-to-month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type_One year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type_Two year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feature_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;feature_cols&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;probas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;probas&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# probability of churn
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;PredictionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;customer_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                                &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;PredictionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The churn rate is the percentage of customers whose score is at or above the threshold, rounded to two decimal places:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_churn_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
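    &lt;span class="c1"&gt;# An empty score array would divide by zero below, so report a 0% churn rate
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;at_risk_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;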
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;at_risk_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
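
&lt;p&gt;To make the formula concrete, here is a minimal plain-Python sketch of the same calculation with a small worked example. The name &lt;code&gt;churn_rate&lt;/code&gt;, the sample scores, and the empty-input guard are illustrative assumptions, not part of the original module. The comparison is inclusive, so a score exactly equal to the threshold counts as at-risk.&lt;/p&gt;

```python
def churn_rate(scores, threshold):
    """Percentage of scores at or above the threshold, rounded to 2 decimals."""
    # Assumption for this sketch: an empty list yields 0.0 instead of an error
    if not scores:
        return 0.0
    at_risk = sum(1 for s in scores if s >= threshold)
    return round(at_risk / len(scores) * 100, 2)


scores = [0.91, 0.15, 0.50, 0.49, 0.72, 0.08]
print(churn_rate(scores, 0.5))   # 3 of 6 customers -> 50.0
print(churn_rate(scores, 0.75))  # 1 of 6 customers -> 16.67
```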






&lt;h2&gt;
  
  
  Step 5: Visualising Results (&lt;code&gt;visualizer.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Three Plotly charts are built server-side and serialised to JSON, then rendered client-side with &lt;code&gt;Plotly.newPlot&lt;/code&gt;. The server only produces figure specifications; the browser does the drawing, so the server never renders images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chart 1: At-Risk vs Non-At-Risk bar chart&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;At-Risk      ████████████████  87
Non-At-Risk  █████████████████████  113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
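
&lt;p&gt;The counts behind this chart come from one pass over the scores. A minimal sketch, assuming the figure is expressed as a plain dictionary with &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;layout&lt;/code&gt; keys (the shape &lt;code&gt;Plotly.newPlot&lt;/code&gt; consumes). The function name &lt;code&gt;build_at_risk_bar&lt;/code&gt; and the labels are hypothetical, and the original presumably builds the figure with &lt;code&gt;plotly.graph_objects&lt;/code&gt;, as the histogram builder in this step does.&lt;/p&gt;

```python
def build_at_risk_bar(scores, threshold):
    """Return a Plotly-style figure spec counting at-risk vs other customers."""
    # A customer is at-risk when the score is at or above the threshold
    at_risk = sum(1 for s in scores if s >= threshold)
    not_at_risk = len(scores) - at_risk
    return {
        "data": [
            {
                "type": "bar",
                "x": ["At-Risk", "Non-At-Risk"],
                "y": [at_risk, not_at_risk],
            }
        ],
        "layout": {"title": {"text": "At-Risk vs Non-At-Risk"}},
    }


spec = build_at_risk_bar([0.9, 0.2, 0.5, 0.1], threshold=0.5)
print(spec["data"][0]["y"])  # [2, 2]
```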



&lt;p&gt;&lt;strong&gt;Chart 2: Churn Score Distribution (histogram)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exactly 10 bins of width 0.1 spanning [0.0, 1.0]:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqhetz0pg2waniz1c7bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqhetz0pg2waniz1c7bt.png" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_score_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bin_edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 11 edges = 10 bins
&lt;/span&gt;    &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bin_edges&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bin_centers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;bin_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bin_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bin_centers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.09&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Churn Score Distribution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
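
&lt;p&gt;The binning itself can be pictured without NumPy. A rough sketch, assuming scores lie in [0.0, 1.0]: each score maps to the bin index &lt;code&gt;int(score * 10)&lt;/code&gt;, and a score of exactly 1.0 is clamped into the last bin, which mirrors how &lt;code&gt;np.histogram&lt;/code&gt; treats its final bin as closed on the right. &lt;code&gt;bin_scores&lt;/code&gt; is a hypothetical helper, and floating-point rounding at exact bin edges can differ slightly from NumPy's result.&lt;/p&gt;

```python
def bin_scores(scores, n_bins=10):
    """Count scores per equal-width bin over [0.0, 1.0]."""
    counts = [0] * n_bins
    for score in scores:
        # Clamp so that a score of exactly 1.0 falls into the last bin
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


print(bin_scores([0.05, 0.15, 0.17, 0.95, 1.0]))
# [1, 2, 0, 0, 0, 0, 0, 0, 0, 2]
```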



&lt;p&gt;&lt;strong&gt;Chart 3: Churn Rate by Contract Type&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Month-to-month  ████████████████████████  62.4%
One year        ████████  21.3%
Two year        ████  10.1%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
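
&lt;p&gt;A minimal sketch of the per-contract calculation, assuming the rate is the share of customers in each contract type whose score is at or above the threshold. The function name and the sample data are illustrative; the original implementation is not shown here.&lt;/p&gt;

```python
from collections import defaultdict


def churn_rate_by_contract(contract_types, scores, threshold):
    """Return {contract_type: percent of its customers that are at-risk}."""
    totals = defaultdict(int)
    at_risk = defaultdict(int)
    for contract, score in zip(contract_types, scores):
        totals[contract] += 1
        if score >= threshold:
            at_risk[contract] += 1
    # Every key in totals has a count of at least 1, so the division is safe
    return {c: round(at_risk[c] / totals[c] * 100, 1) for c in totals}


contracts = ["Month-to-month", "Month-to-month", "One year",
             "Two year", "Month-to-month", "One year"]
sample_scores = [0.9, 0.6, 0.2, 0.1, 0.3, 0.7]
print(churn_rate_by_contract(contracts, sample_scores, 0.5))
# {'Month-to-month': 66.7, 'One year': 50.0, 'Two year': 0.0}
```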






&lt;h2&gt;
  
  
  Step 6: The Flask Application (&lt;code&gt;app.py&lt;/code&gt;)
&lt;/h2&gt;

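&lt;p&gt;All mutable state lives in a single &lt;code&gt;AppState&lt;/code&gt; dataclass, held as one shared instance. That is enough for a single-user deployment, but every request reads and writes the same object, so it is not suitable for multiple concurrent users or multiple worker processes:&lt;/p&gt;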

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd42vu8kxe6frlvywxn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd42vu8kxe6frlvywxn6.png" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AppState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;prediction_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PredictionResult&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
    &lt;span class="n"&gt;loaded_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LoadedModel&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;model_load_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;chart_specs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The six routes map cleanly to user actions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Route&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GET&lt;/td&gt;
&lt;td&gt;Redirect to dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GET&lt;/td&gt;
&lt;td&gt;Render main page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/upload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;POST&lt;/td&gt;
&lt;td&gt;Validate and store CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/predict&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;POST&lt;/td&gt;
&lt;td&gt;Run prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;POST&lt;/td&gt;
&lt;td&gt;Update churn threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/export&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GET&lt;/td&gt;
&lt;td&gt;Stream CSV download&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
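
&lt;p&gt;As an illustration of the last route, the sketch below builds a CSV body with the standard library. The column names, the four-decimal score format, and the helper name &lt;code&gt;build_export_csv&lt;/code&gt; are assumptions; the original &lt;code&gt;/export&lt;/code&gt; handler is not shown and may export different fields or only at-risk customers.&lt;/p&gt;

```python
import csv
import io


def build_export_csv(customer_ids, scores, threshold):
    """Return CSV text with one row per customer and an at-risk flag."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["customer_id", "churn_score", "at_risk"])
    for customer_id, score in zip(customer_ids, scores):
        # The flag uses the same inclusive comparison as the churn rate
        writer.writerow([customer_id, f"{score:.4f}", score >= threshold])
    return buffer.getvalue()


print(build_export_csv(["C1", "C2"], [0.8, 0.25], threshold=0.5))
```

&lt;p&gt;In a Flask view, a string like this can be returned in a &lt;code&gt;Response&lt;/code&gt; with &lt;code&gt;mimetype="text/csv"&lt;/code&gt; and a &lt;code&gt;Content-Disposition: attachment&lt;/code&gt; header so the browser downloads it.&lt;/p&gt;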

&lt;p&gt;The upload route shows the validation pipeline in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/upload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;uploaded_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;file_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uploaded_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uploaded_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Clear previous results when new data is uploaded
&lt;/span&gt;    &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;
    &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prediction_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chart_specs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File uploaded successfully. &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; customer record(s) loaded.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The threshold route validates the range before accepting the new value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_threshold&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
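    &lt;span class="c1"&gt;# A missing or non-numeric value makes float() raise ValueError, so report it instead of failing
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;threshold_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Threshold must be a number in the range [0.0, 1.0].&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;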

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;validate_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold_val&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;threshold_val&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is out of range. Valid range is [0.0, 1.0].&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold_val&lt;/span&gt;
    &lt;span class="c1"&gt;# Rebuild charts immediately if results exist
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prediction_result&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prediction_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prediction_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold_val&lt;/span&gt;
        &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chart_specs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_build_chart_specs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prediction_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Threshold updated to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;threshold_val&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
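
&lt;p&gt;&lt;code&gt;validate_threshold&lt;/code&gt; is called in the route above but not shown. One plausible implementation, offered as an assumption rather than the original code, is a plain inclusive range check:&lt;/p&gt;

```python
def validate_threshold(value: float) -> bool:
    """Return True only for numbers inside the inclusive range [0.0, 1.0]."""
    # NaN compares False against everything, so float("nan") is rejected too
    return value >= 0.0 and 1.0 >= value


print(validate_threshold(0.5))           # True
print(validate_threshold(1.0))           # True (upper bound is inclusive)
print(validate_threshold(1.01))          # False
print(validate_threshold(float("nan")))  # False
```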






&lt;h2&gt;
  
  
  Step 7: The Dashboard UI
&lt;/h2&gt;

&lt;p&gt;The dashboard uses a two-column Bootstrap 5 layout: a narrow left sidebar for controls, and a wide right panel for results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqgmydpp7gnwrppf1dv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqgmydpp7gnwrppf1dv3.png" alt=" " width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chart data is injected into the page as JSON and rendered by Plotly client-side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- In dashboard.html --&amp;gt;&lt;/span&gt;
{% if has_results %}
&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"chart-data"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nx"&gt;chart_data_json&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;safe&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"table-data"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nx"&gt;at_risk_table_json&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;safe&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
{% endif %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In app.js&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;renderCharts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chartData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chart-data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;responsive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;displayModeBar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;Plotly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPlot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chart-at-risk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;at_risk_bar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;at_risk_bar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;Plotly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPlot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chart-histogram&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score_histogram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score_histogram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;Plotly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPlot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chart-contract&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contract_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contract_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
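
&lt;p&gt;One caveat with &lt;code&gt;| safe&lt;/code&gt;: the JSON is written into the page verbatim, so a string value containing &lt;code&gt;&amp;lt;/script&amp;gt;&lt;/code&gt; (for example in a customer ID from an uploaded CSV) would end the script element early. Below is a minimal sketch of a server-side serialiser that avoids this by turning markup characters into JSON unicode escapes; &lt;code&gt;to_script_json&lt;/code&gt; is a hypothetical helper, not part of the original app.&lt;/p&gt;

```python
import json


def to_script_json(payload) -> str:
    """Serialise payload to JSON that is safe inside a script element."""
    text = json.dumps(payload)
    # Replace the less-than (\x3c), greater-than (\x3e) and ampersand (\x26)
    # characters with JSON unicode escapes so no tag can be opened or closed.
    for char in ("\x3c", "\x3e", "\x26"):
        text = text.replace(char, "\\u%04x" % ord(char))
    return text


record = {"customer_id": "\x3c/script\x3e"}
encoded = to_script_json(record)
print(encoded)  # {"customer_id": "\u003c/script\u003e"}
```

&lt;p&gt;&lt;code&gt;JSON.parse&lt;/code&gt; decodes the escapes back to the original characters, so the client code stays unchanged. Jinja's built-in &lt;code&gt;tojson&lt;/code&gt; filter applies similar escaping when the template receives the object rather than a pre-serialised string.&lt;/p&gt;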






&lt;h2&gt;
  
  
  Step 8: At-Risk Table: Pagination, Sort, Filter
&lt;/h2&gt;

&lt;p&gt;The table helpers are pure Python functions, independently testable and reusable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mcew143ps8hgu343dqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mcew143ps8hgu343dqp.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# table_helpers.py
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sort_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reverse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;desc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;
    &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
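&lt;p&gt;To see how the three helpers compose, here is a minimal sketch that filters first, then sorts, then paginates. The helper bodies are condensed copies of the ones above; the &lt;code&gt;view&lt;/code&gt; wrapper and the sample records are made up for illustration and are not part of the project.&lt;/p&gt;

```python
import math

def paginate(records: list, page: int, page_size: int = 25) -> list:
    start = (page - 1) * page_size
    return records[start:start + page_size]

def sort_records(records: list, column: str, direction: str) -> list:
    reverse = direction.lower() == "desc"
    return sorted(records, key=lambda r: (r.get(column) is None, r.get(column)), reverse=reverse)

def filter_records(records: list, search_term: str) -> list:
    if not search_term:
        return records
    term = search_term.lower()
    return [r for r in records if term in str(r.get("customer_id", "")).lower()]

def view(records, search_term, column, direction, page, page_size=25):
    """Hypothetical wrapper: filter, then sort, then slice out one page."""
    matched = filter_records(records, search_term)
    ordered = sort_records(matched, column, direction)
    # Page count is based on the filtered rows, never less than one page.
    total_pages = max(1, math.ceil(len(ordered) / page_size))
    return paginate(ordered, page, page_size), total_pages

# Made-up sample data, not taken from the project's CSV.
records = [
    {"customer_id": "CUST0004", "churn_score": 0.82},
    {"customer_id": "CUST0005", "churn_score": 0.77},
    {"customer_id": "CUST0104", "churn_score": 0.91},
    {"customer_id": "ACME0001", "churn_score": 0.64},
]

rows, total_pages = view(records, "cust", "churn_score", "desc", page=1, page_size=2)
print([r["customer_id"] for r in rows], total_pages)
```

&lt;p&gt;Filtering before paginating keeps the page count in step with the matching rows. One behaviour to be aware of: with &lt;code&gt;reverse=True&lt;/code&gt; the &lt;code&gt;is None&lt;/code&gt; flag in the sort key is reversed too, so records missing the sort column come first in descending order and last in ascending order.&lt;/p&gt;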



&lt;p&gt;The client-side JavaScript mirrors this logic for instant interactivity without round-trips:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Sort on column header click&lt;/span&gt;
&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#atRiskTable thead th[data-col]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;th&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;th&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data-col&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;sortDirection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sortColumn&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;col&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;sortDirection&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;asc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;desc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;asc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;sortColumn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;currentPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;renderTable&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Filter on search input&lt;/span&gt;
&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tableSearch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;input&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;term&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;filteredRecords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allRecords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;term&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;currentPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;renderTable&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 9: Exporting Results (&lt;code&gt;exporter.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;The export produces a clean CSV with a fixed column order. One detail worth noting: pandas serialises Python booleans as &lt;code&gt;True&lt;/code&gt;/&lt;code&gt;False&lt;/code&gt; (capitalised) by default, but the spec requires lowercase &lt;code&gt;true&lt;/code&gt;/&lt;code&gt;false&lt;/code&gt;. We handle this explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
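&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;EXPORT_COLUMNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;churn_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_at_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;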
                  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_export_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;at_risk_flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;churn_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;       &lt;span class="c1"&gt;# 4 decimal places
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_at_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;at_risk_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})[&lt;/span&gt;&lt;span class="n"&gt;EXPORT_COLUMNS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_csv_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;export_df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;out_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;export_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_at_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_at_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample export output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;customer&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;churn&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;contract&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;tenure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;monthly&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;charges&lt;/span&gt;
&lt;span class="k"&gt;CUST&lt;/span&gt;&lt;span class="mf"&gt;0004&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.8231&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;Month&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;59.11&lt;/span&gt;
&lt;span class="k"&gt;CUST&lt;/span&gt;&lt;span class="mf"&gt;0005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.7654&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;Month&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;47.03&lt;/span&gt;
&lt;span class="k"&gt;CUST&lt;/span&gt;&lt;span class="mf"&gt;0009&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.1203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;One&lt;/span&gt; &lt;span class="k"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;97.46&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
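&lt;p&gt;One way to guard this format is a small contract check on the exported bytes. The sketch below uses only the standard library's &lt;code&gt;csv&lt;/code&gt; module; &lt;code&gt;check_export&lt;/code&gt; is a hypothetical helper, not part of &lt;code&gt;exporter.py&lt;/code&gt;, and the sample bytes mirror the output shown above.&lt;/p&gt;

```python
import csv
import io

EXPORT_COLUMNS = ["customer_id", "churn_score", "is_at_risk",
                  "contract_type", "tenure", "monthly_charges"]

def check_export(csv_bytes: bytes) -> list:
    """Hypothetical helper: return a list of problems found in an export."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    # The header must match the fixed column order exactly.
    if reader.fieldnames != EXPORT_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        if row["is_at_risk"] not in ("true", "false"):
            problems.append(f"line {line_no}: is_at_risk={row['is_at_risk']!r}")
        decimals = row["churn_score"].partition(".")[2]
        if len(decimals) > 4:
            problems.append(f"line {line_no}: churn_score has more than 4 decimals")
    return problems

sample = (
    b"customer_id,churn_score,is_at_risk,contract_type,tenure,monthly_charges\n"
    b"CUST0004,0.8231,true,Month-to-month,41,59.11\n"
    b"CUST0009,0.1203,false,One year,70,97.46\n"
)

print(check_export(sample))                            # well-formed sample
print(check_export(sample.replace(b"true", b"True")))  # capitalised boolean gets flagged
```

&lt;p&gt;In a test suite, the same check could be run against the return value of &lt;code&gt;to_csv_bytes&lt;/code&gt; and asserted to be empty.&lt;/p&gt;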






&lt;h2&gt;
  
  
  Testing Strategy
&lt;/h2&gt;

&lt;p&gt;The project uses two complementary testing approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example-based tests (pytest)
&lt;/h3&gt;

&lt;p&gt;These cover specific scenarios and the error messages they produce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/unit/test_validator.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_rejects_non_csv_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.xlsx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xlsx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_tenure_zero_is_valid&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id,tenure,monthly_charges,total_charges,contract_type&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C001,0,50.0,0.01,Month-to-month&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_rows&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Property-based tests (Hypothesis)
&lt;/h3&gt;

&lt;p&gt;These verify properties that should hold for every input, checked against many generated cases (200 per run in the first test below, Hypothesis's default in the second):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/property/test_predictor_properties.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hypothesis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;
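&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hypothesis.strategies&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;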

&lt;span class="nd"&gt;@given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_classify_at_risk_consistency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Property 2: For any scores and threshold, classify_at_risk returns
    True iff score &amp;gt;= threshold — including identical scores.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_at_risk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_file_size_boundary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Property 9: _check_file_size returns True iff size &amp;lt;= 52,428,800.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_check_file_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;52_428_800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why a singleton &lt;code&gt;AppState&lt;/code&gt; instead of Flask sessions?&lt;/strong&gt;&lt;br&gt;
Flask's default sessions live in a signed browser cookie, which is limited to about 4 KB and can't hold DataFrames. For a single-user analytics tool, a module-level singleton is simpler and more practical than a database or Redis cache.&lt;/p&gt;
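&lt;p&gt;As a rough sketch of the pattern, a module-level singleton can be as small as a dataclass instantiated once at import time. The field names and the default threshold here are hypothetical; the project's actual &lt;code&gt;AppState&lt;/code&gt; may hold different attributes.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class AppState:
    """In-memory state shared by all requests in one process (hypothetical fields)."""
    uploaded_df: Optional[Any] = None   # would hold the uploaded pandas DataFrame
    scores: Optional[Any] = None        # would hold the model's churn scores
    threshold: float = 0.5              # assumed default, for illustration only
    records: list = field(default_factory=list)

    def reset(self) -> None:
        """Clear everything, for example when a new file is uploaded."""
        self.uploaded_df = None
        self.scores = None
        self.records = []

# Created once at import time; every module that imports `state` sees the same object.
state = AppState()

state.records.append({"customer_id": "C001", "churn_score": 0.9})
print(len(state.records))
state.reset()
print(len(state.records))
```

&lt;p&gt;This approach relies on the app running as a single process. With several worker processes, each one would hold its own independent copy of the state.&lt;/p&gt;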

&lt;p&gt;&lt;strong&gt;Why Plotly JSON instead of server-rendered images?&lt;/strong&gt;&lt;br&gt;
Plotly charts are interactive: users can hover, zoom, and pan. Serialising chart specs as JSON and rendering them client-side means the server doesn't need a headless browser or an image generation library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why separate &lt;code&gt;table_helpers.py&lt;/code&gt;?&lt;/strong&gt;&lt;br&gt;
Keeping pagination, sort, and filter as pure functions makes them trivially testable without spinning up a Flask test client. The JavaScript mirrors the same logic for instant client-side interactivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why one-hot encode at prediction time?&lt;/strong&gt;&lt;br&gt;
The uploaded CSV may not contain all three contract types. Encoding at prediction time and filling missing columns with 0 ensures the feature vector always matches what the model was trained on.&lt;/p&gt;
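&lt;p&gt;A minimal, pandas-free sketch of that alignment step is shown below. The category names, the &lt;code&gt;contract_&lt;/code&gt; prefix, and the &lt;code&gt;encode_contract&lt;/code&gt; helper are assumptions for illustration; the project itself may rely on pandas for this.&lt;/p&gt;

```python
# Assumed training-time categories; "Two year" is a guess at the third type.
CONTRACT_TYPES = ["Month-to-month", "One year", "Two year"]
FEATURE_COLUMNS = [f"contract_{name}" for name in CONTRACT_TYPES]

def encode_contract(rows: list) -> list:
    """One-hot encode contract_type against the fixed training-time categories.

    Every output row has all columns in FEATURE_COLUMNS, so a category that is
    absent from the upload still appears as a column of zeros.
    """
    encoded = []
    for row in rows:
        encoded.append({
            f"contract_{name}": int(row.get("contract_type") == name)
            for name in CONTRACT_TYPES
        })
    return encoded

# Made-up upload with no "Two year" customers.
upload = [
    {"customer_id": "C001", "contract_type": "Month-to-month"},
    {"customer_id": "C002", "contract_type": "One year"},
]

for features in encode_contract(upload):
    print([features[col] for col in FEATURE_COLUMNS])
```

&lt;p&gt;With pandas, a similar effect can usually be had by calling &lt;code&gt;pd.get_dummies&lt;/code&gt; and then &lt;code&gt;reindex(columns=..., fill_value=0)&lt;/code&gt; with the training-time column list.&lt;/p&gt;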




&lt;h2&gt;
  
  
  Running the Full App
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# 2. Train and save the model&lt;/span&gt;
python generate_model.py

&lt;span class="c"&gt;# 3. Start the server&lt;/span&gt;
python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:5000&lt;/code&gt;, upload &lt;code&gt;data/sample_customers.csv&lt;/code&gt;, and click &lt;strong&gt;Predict Churn&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You should see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summary stats (total customers, at-risk count, churn rate %)&lt;/li&gt;
&lt;li&gt;Three interactive Plotly charts&lt;/li&gt;
&lt;li&gt;A paginated, sortable, filterable table of at-risk customers&lt;/li&gt;
&lt;li&gt;An export button to download the full results as CSV&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;A few natural extensions from here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User authentication&lt;/strong&gt; - add Flask-Login for multi-user support with per-user state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model retraining&lt;/strong&gt; - add an admin route to upload new training data and retrain in-place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled batch jobs&lt;/strong&gt; - use Celery + Redis to run predictions on a schedule and email results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database persistence&lt;/strong&gt; - swap the in-memory &lt;code&gt;AppState&lt;/code&gt; for SQLAlchemy + PostgreSQL to persist results across restarts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SHAP explanations&lt;/strong&gt; - add feature importance explanations per customer using the &lt;code&gt;shap&lt;/code&gt; library&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;The full source is available on GitHub: &lt;a href="https://github.com/adeline-pepela/Agentic-AI-Demo-Day" rel="noopener noreferrer"&gt;Agentic AI Kiro&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All Python modules with docstrings&lt;/li&gt;
&lt;li&gt;Sample CSV data (200 rows)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generate_model.py&lt;/code&gt; to reproduce the model&lt;/li&gt;
&lt;li&gt;Unit, integration, and property-based tests&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try Kiro Yourself
&lt;/h2&gt;

&lt;p&gt;If you want to build something like this or anything else - &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; is worth trying. The spec-driven workflow changes how you approach a project. Instead of diving straight into code and figuring out the design as you go, you start with a clear picture of what you're building and why. The code becomes the easy part.&lt;/p&gt;

&lt;p&gt;The entire requirements document, technical design, task list, and implementation for this project came from a single prompt. That's the difference.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Flask, pandas, scikit-learn, and Plotly. Spec-driven development powered by &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;. Tested with pytest and Hypothesis.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>flask</category>
      <category>kiro</category>
    </item>
    <item>
      <title>Stop Using Lambda for ML at This Scale (Benchmark + Cost Analysis)</title>
      <dc:creator>Matia Rašetina</dc:creator>
      <pubDate>Mon, 18 May 2026 07:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/stop-using-lambda-for-ml-at-this-scale-benchmark-cost-analysis-57jh</link>
      <guid>https://dev.to/aws-builders/stop-using-lambda-for-ml-at-this-scale-benchmark-cost-analysis-57jh</guid>
      <description>&lt;p&gt;As a CTO, your job during a Proof of Concept (POC) is deceptively simple: don’t over-engineer, and don’t overspend.&lt;/p&gt;

&lt;p&gt;You don’t need the perfect ML infrastructure—you need the cheapest architecture that works well enough.&lt;/p&gt;

&lt;p&gt;Here’s the pipeline we built for our ML POC:&lt;/p&gt;

&lt;p&gt;Audio file → S3 → Compute → Prediction → DynamoDB&lt;/p&gt;

&lt;p&gt;The real question isn’t &lt;em&gt;how&lt;/em&gt; to run inference—it’s:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At what point does Lambda stop being the smartest choice?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this blog post, we compare three ways of processing the data with an already trained Machine Learning model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Lambda with standard configuration&lt;/li&gt;
&lt;li&gt;AWS Lambda with SnapStart enabled&lt;/li&gt;
&lt;li&gt;AWS Lambda used as a proxy to use AWS SageMaker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To access the full project code, you can click the link here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment setup
&lt;/h2&gt;

&lt;p&gt;The architecture across the board is very similar — all compute resources (Lambdas and SageMaker instance) have the same 4GB RAM configuration.&lt;/p&gt;

&lt;p&gt;There is a subtle difference in vCPU allocation: AWS Lambda provides the equivalent of one vCPU per 1,769 MB of memory, so our 4 GB (4,096 MB) Lambdas get roughly 2.31 vCPUs (based on AWS docs &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-memory.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;), while our SageMaker instance (&lt;code&gt;ml.t2.medium&lt;/code&gt;) has 2 vCPUs.&lt;/p&gt;

&lt;p&gt;In addition, the SageMaker stack has a proxy Lambda with 128MB of RAM, which reads the information about the uploaded file from S3, forwards it to SageMaker, and saves the results into DynamoDB.&lt;/p&gt;

&lt;p&gt;None of the stacks use GPU instances, which keeps the playing field as level as possible.&lt;/p&gt;

&lt;p&gt;Here are some other experiment choices made to keep the benchmark fair:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same CPU architecture everywhere (x86_64): Lambda functions use the x86_64 architecture, dependencies are bundled with the SAM x86_64 Python 3.12 image, and the SageMaker container image is built for linux/amd64 so ONNX and wheels behave the same across paths.&lt;/li&gt;
&lt;li&gt;Same language runtime: All Lambda handlers run Python 3.12 with the same packaged &lt;code&gt;lambda_src&lt;/code&gt; layout (only the handler and SnapStart wiring differ).&lt;/li&gt;
&lt;li&gt;Same model, with an intentional container vs zip trade-off: One shared ONNX artifact from S3; standard and SnapStart load it inside the function, SageMaker serves it from a dedicated container behind an endpoint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SageMaker serverless was intentionally excluded, both to keep the cost of running the ML model as low as possible and to keep the performance comparison fair across the board.&lt;/p&gt;

&lt;p&gt;The architecture diagram for this benchmark can be seen in the following picture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bozgsy9glmjhut2alco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bozgsy9glmjhut2alco.png" alt="Architecture Diagram" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an overview of all stacks in this experiment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Cost Model&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda (4GB)&lt;/td&gt;
&lt;td&gt;Model runs directly inside Lambda, ~2.31 vCPU, 4GB of RAM&lt;/td&gt;
&lt;td&gt;Pay-per-request&lt;/td&gt;
&lt;td&gt;Scales to zero, no idle cost, fast per request&lt;/td&gt;
&lt;td&gt;High memory cost, not ideal at very high traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda with SnapStart Enabled&lt;/td&gt;
&lt;td&gt;Model runs directly inside Lambda, ~2.31 vCPU, 4GB of RAM&lt;/td&gt;
&lt;td&gt;Pay-per-request&lt;/td&gt;
&lt;td&gt;Scales to zero, no idle cost, more predictable latency, SnapStart reduces cold-start time&lt;/td&gt;
&lt;td&gt;High memory cost, not ideal at very high traffic, additional SnapStart cost if traffic is sporadic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker Endpoint&lt;/td&gt;
&lt;td&gt;Model hosted on ml.t2.medium (2 vCPU with 4GB of RAM), invoked via 128MB Lambda&lt;/td&gt;
&lt;td&gt;Fixed monthly&lt;/td&gt;
&lt;td&gt;Predictable performance, cost-efficient at scale&lt;/td&gt;
&lt;td&gt;Always-on, pays even when idle, slightly higher latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Configuration in CDK code
&lt;/h3&gt;

&lt;p&gt;Here are the code snippets of all compute resources used in this experiment. All Lambdas are created with the following method, to reduce code duplication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_python_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;memory_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;architecture&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Architecture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Architecture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X86_64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;snapstart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SnapStartConf&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;runtime_environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;_DEFAULT_RUNTIME_ENV&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;runtime_environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PYTHON_3_12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;architecture&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;architecture&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;bundled_lambda_code&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memory_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runtime_environment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;snap_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapstart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard Lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize a Lambda with the standard configuration&lt;/span&gt;
standard_lambda &lt;span class="o"&gt;=&lt;/span&gt; create_python_function&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;self,
    &lt;span class="nv"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"standard-predictor"&lt;/span&gt;,
    &lt;span class="nv"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"standard_handler.handler"&lt;/span&gt;,
    &lt;span class="nb"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Duration.seconds&lt;span class="o"&gt;(&lt;/span&gt;90&lt;span class="o"&gt;)&lt;/span&gt;,
    &lt;span class="nv"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
        &lt;span class="s2"&gt;"PREDICTIONS_TABLE"&lt;/span&gt;: self.predictions_table.table_name,
        &lt;span class="s2"&gt;"PREDICTOR"&lt;/span&gt;: &lt;span class="s2"&gt;"standard"&lt;/span&gt;,
        &lt;span class="s2"&gt;"MODEL_S3_URI"&lt;/span&gt;: f&lt;span class="s2"&gt;"s3://{self.model_asset.s3_bucket_name}/{self.model_asset.s3_object_key}"&lt;/span&gt;,
    &lt;span class="o"&gt;}&lt;/span&gt;,
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SnapStart Lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize the SnapStart Lambda&lt;/span&gt;
snapstart_lambda &lt;span class="o"&gt;=&lt;/span&gt; create_python_function&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;self,
    &lt;span class="nv"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"snapstart-predictor"&lt;/span&gt;,
    &lt;span class="nv"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"snapstart_handler.handler"&lt;/span&gt;,
    &lt;span class="nb"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Duration.seconds&lt;span class="o"&gt;(&lt;/span&gt;90&lt;span class="o"&gt;)&lt;/span&gt;,
    &lt;span class="nv"&gt;snapstart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;_lambda.SnapStartConf.ON_PUBLISHED_VERSIONS, &lt;span class="c"&gt;# Very important to configure this parameter!&lt;/span&gt;
    &lt;span class="nv"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
        &lt;span class="s2"&gt;"PREDICTIONS_TABLE"&lt;/span&gt;: predictions_table.table_name,
        &lt;span class="s2"&gt;"PREDICTOR"&lt;/span&gt;: &lt;span class="s2"&gt;"snapstart"&lt;/span&gt;,
        &lt;span class="s2"&gt;"MODEL_S3_URI"&lt;/span&gt;: f&lt;span class="s2"&gt;"s3://{model_asset.s3_bucket_name}/{model_asset.s3_object_key}"&lt;/span&gt;,
    &lt;span class="o"&gt;}&lt;/span&gt;,
&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Initializing an Alias, as SnapStart doesn't work without it&lt;/span&gt;
live_alias &lt;span class="o"&gt;=&lt;/span&gt; _lambda.Alias&lt;span class="o"&gt;(&lt;/span&gt;
    self,
    &lt;span class="s2"&gt;"SnapStartLiveAlias"&lt;/span&gt;,
    &lt;span class="nv"&gt;alias_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"live"&lt;/span&gt;,
    &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;snapstart_lambda.current_version,
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SageMaker endpoint + Lambda proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configure the SageMaker endpoint&lt;/span&gt;
endpoint_config &lt;span class="o"&gt;=&lt;/span&gt; SageMaker.CfnEndpointConfig&lt;span class="o"&gt;(&lt;/span&gt;
    self,
    &lt;span class="s2"&gt;"AudioPredictorEndpointConfig"&lt;/span&gt;,
    &lt;span class="nv"&gt;production_variants&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;
        SageMaker.CfnEndpointConfig.ProductionVariantProperty&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;variant_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"AllTraffic"&lt;/span&gt;,
            &lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model.attr_model_name,
            &lt;span class="nv"&gt;initial_instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1,
            &lt;span class="nv"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ml.t2.medium"&lt;/span&gt;,
            &lt;span class="nv"&gt;initial_variant_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0,
        &lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;,
&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Define the endpoint&lt;/span&gt;
endpoint &lt;span class="o"&gt;=&lt;/span&gt; SageMaker.CfnEndpoint&lt;span class="o"&gt;(&lt;/span&gt;
    self,
    &lt;span class="s2"&gt;"AudioPredictorEndpoint"&lt;/span&gt;,
    &lt;span class="nv"&gt;endpoint_config_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;endpoint_config.attr_endpoint_config_name,
&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Initialize the SageMaker Lambda proxy&lt;/span&gt;
SageMaker_trigger &lt;span class="o"&gt;=&lt;/span&gt; create_python_function&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;self,
    &lt;span class="nv"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"SageMaker-predictor"&lt;/span&gt;,
    &lt;span class="nv"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"SageMaker_trigger_handler.handler"&lt;/span&gt;,
    &lt;span class="nb"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Duration.seconds&lt;span class="o"&gt;(&lt;/span&gt;90&lt;span class="o"&gt;)&lt;/span&gt;,
    &lt;span class="nv"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
        &lt;span class="s2"&gt;"PREDICTIONS_TABLE"&lt;/span&gt;: predictions_table.table_name,
        &lt;span class="s2"&gt;"PREDICTOR"&lt;/span&gt;: &lt;span class="s2"&gt;"SageMaker"&lt;/span&gt;,
        &lt;span class="s2"&gt;"MODEL_S3_URI"&lt;/span&gt;: f&lt;span class="s2"&gt;"s3://{model_asset.s3_bucket_name}/{model_asset.s3_object_key}"&lt;/span&gt;,
        &lt;span class="s2"&gt;"ENDPOINT_NAME"&lt;/span&gt;: endpoint.attr_endpoint_name,
    &lt;span class="o"&gt;}&lt;/span&gt;,
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;In the following image, you can see the execution duration of all the stacks used in the experiment (note: the SnapStart Lambda was run once beforehand to save the environment snapshot, and was then left idle for 10 minutes so that it would cold start again):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5adcvwl456n96uagxpzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5adcvwl456n96uagxpzz.png" alt="Latency Comparison" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Method&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Median&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Stability (Std)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;280.72 ms&lt;/td&gt;
&lt;td&gt;127.65 ms&lt;/td&gt;
&lt;td&gt;443.29 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SnapStart&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;178.60 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;124.69 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;166.35 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SageMaker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;339.18 ms&lt;/td&gt;
&lt;td&gt;226.91 ms&lt;/td&gt;
&lt;td&gt;350.76 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From the graph, we can see that the Lambdas execute faster than the SageMaker endpoint, with most invocations staying under the 200ms mark. The circles represent the cold starts, and you can see that the SnapStart Lambda's cold start was at least 2x faster than those of the other stacks, thanks to SnapStart. The SageMaker stack performed the worst, but not by a lot, with most of its invocations landing just above the 200ms mark and its cold start taking almost 1.4 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lambda Cost (per request)
&lt;/h3&gt;

&lt;p&gt;Formula: &lt;code&gt;Cost = Duration (s) × Memory (GB) × $0.0000166667&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duration: ~200 ms (0.2 s)&lt;/li&gt;
&lt;li&gt;Memory: 4 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost per request:&lt;/strong&gt; ~$0.0000133&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost per 1M requests:&lt;/strong&gt; ~$13.80&lt;/p&gt;
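
&lt;p&gt;The arithmetic can be reproduced with a short sketch. The per-request charge of $0.20 per 1M requests is an assumption taken from public Lambda pricing and is not part of the formula above. At exactly 200 ms the formula gives about $13.33 per 1M requests for compute alone, and about $13.53 with the request charge, so the ~$13.80 figure presumably reflects a measured duration slightly above 200 ms.&lt;/p&gt;

```python
PRICE_PER_GB_SECOND = 0.0000166667    # x86 Lambda compute price from the formula
PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed request charge (0.20 USD per 1M)


def lambda_cost_per_request(duration_ms, memory_gb, include_request_fee=False):
    """Cost of one invocation in USD: GB-seconds times price, plus optional fee."""
    compute = (duration_ms / 1000) * memory_gb * PRICE_PER_GB_SECOND
    return compute + (PRICE_PER_REQUEST if include_request_fee else 0.0)


compute_only = lambda_cost_per_request(200, 4)
with_fee = lambda_cost_per_request(200, 4, include_request_fee=True)
print(f"per request (compute only): {compute_only:.7f} USD")
print(f"per 1M (compute only):      {compute_only * 1_000_000:.2f} USD")
print(f"per 1M (with request fee):  {with_fee * 1_000_000:.2f} USD")
```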

&lt;h3&gt;
  
  
  SageMaker Cost (fixed)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ml.t2.medium ≈ $40.88/month (the fixed monthly cost used in the comparison table below)&lt;/li&gt;
&lt;li&gt;Runs 24/7, even when idle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lambda has &lt;strong&gt;variable costs&lt;/strong&gt; that scale with usage. SageMaker has &lt;strong&gt;fixed costs&lt;/strong&gt;, making the tradeoff clear when requests grow.&lt;/p&gt;

&lt;p&gt;The main question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When does SageMaker become the better option?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ve done the math.&lt;/p&gt;

&lt;p&gt;SageMaker becomes a better option at ~72 requests per minute — take a look at the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv1v9wojk5m18144d73t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv1v9wojk5m18144d73t.png" alt="Cost comparison of Lambda and Sagemaker" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At low traffic, Lambda is cheaper: its serverless pricing means you pay only per request, while the SageMaker endpoint has a fixed price regardless of usage. As traffic grows, SageMaker handles it more cheaply.&lt;/p&gt;

&lt;p&gt;You can notice that the green line, representing the SageMaker endpoint, also rises with traffic. That is expected, since every request still goes through a Lambda invocation; however, the cost stays manageable because the proxy Lambda mentioned earlier uses the lowest memory configuration.&lt;/p&gt;

&lt;p&gt;The following table gives a broader view of the expected cost at different traffic levels, based on the latest pricing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Traffic Volume&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Lambda (4GB)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;SnapStart Lambda (4GB)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;SageMaker (ml.t2.medium + 128MB Caller)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price per 1M Req (Variable)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$13.80&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$16.82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.81&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fixed Monthly Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$40.88&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total: 10 RPM (~438k req/mo)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$6.05&lt;/td&gt;
&lt;td&gt;$7.37&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$41.24&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total: 50 RPM (~2.1M req/mo)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$30.24&lt;/td&gt;
&lt;td&gt;$36.87&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$42.66&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total: 72 RPM (~3.1M req/mo)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$43.51&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$53.05&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$43.43&lt;/strong&gt; &lt;em&gt;(Crossover point)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total: 200 RPM (~8.7M req/mo)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$120.84&lt;/td&gt;
&lt;td&gt;$147.32&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$47.97&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total: 1000 RPM (~43.8M req/mo)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$604.22&lt;/td&gt;
&lt;td&gt;$736.62&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$76.38&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
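
&lt;p&gt;The crossover point can be derived from the table's own numbers. The sketch below is only a back-of-the-envelope check: it assumes an average month of 43,830 minutes (which matches the ~438k requests at 10 RPM) and takes the per-1M and fixed prices directly from the table.&lt;/p&gt;

```python
MINUTES_PER_MONTH = 60 * 24 * 365.25 / 12   # 43,830 minutes in an average month

LAMBDA_PER_MILLION = 13.80     # standard Lambda, USD per 1M requests (from the table)
SAGEMAKER_PER_MILLION = 0.81   # 128MB proxy Lambda, USD per 1M requests (from the table)
SAGEMAKER_FIXED = 40.88        # endpoint, USD per month (from the table)


def monthly_cost(rpm, per_million, fixed=0.0):
    """Total monthly cost in USD for a steady requests-per-minute rate."""
    requests = rpm * MINUTES_PER_MONTH
    return fixed + requests / 1_000_000 * per_million


def break_even_rpm():
    """Solve fixed + r * sagemaker == r * lambda for r, then convert to RPM."""
    per_request_gap = (LAMBDA_PER_MILLION - SAGEMAKER_PER_MILLION) / 1_000_000
    requests_per_month = SAGEMAKER_FIXED / per_request_gap
    return requests_per_month / MINUTES_PER_MONTH


print(f"break-even: {break_even_rpm():.1f} RPM")
```

&lt;p&gt;With these inputs the break-even lands at roughly 71.8 requests per minute, in line with the ~72 RPM crossover shown in the table.&lt;/p&gt;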

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SageMaker pricing - &lt;a href="https://aws.amazon.com/sagemaker/ai/pricing/" rel="noopener noreferrer"&gt;link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lambda pricing - &lt;a href="https://aws.amazon.com/lambda/pricing/" rel="noopener noreferrer"&gt;link&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CTO Verdict: A Decision Framework
&lt;/h3&gt;

&lt;p&gt;Think in thresholds, not services.&lt;/p&gt;

&lt;p&gt;Use Standard Lambda when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re in POC or early stage&lt;/li&gt;
&lt;li&gt;Traffic is low or unpredictable&lt;/li&gt;
&lt;li&gt;You want zero idle cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use Lambda with SnapStart when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic is low and sporadic&lt;/li&gt;
&lt;li&gt;You are willing to pay for the SnapStart snapshot restoration&lt;/li&gt;
&lt;li&gt;You also want a zero idle cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use SageMaker when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You consistently exceed the roughly 72 requests/minute crossover point&lt;/li&gt;
&lt;li&gt;Traffic is steady&lt;/li&gt;
&lt;li&gt;You want predictable cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final Rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda is the default&lt;/li&gt;
&lt;li&gt;SageMaker is the optimization&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>aws</category>
      <category>programming</category>
      <category>startup</category>
    </item>
    <item>
      <title>FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Mon, 18 May 2026 02:31:34 +0000</pubDate>
      <link>https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing</link>
      <guid>https://dev.to/aws-builders/fpolicy-file-activity-pipeline-ontap-to-datadog-via-ecs-fargate-2ing</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;ONTAP FPolicy pushes file operation notifications over a persistent TCP connection. We run a lightweight Python server on ECS Fargate that receives these events, normalizes them, and forwards them to SQS → Lambda → Datadog. In my validation environment, create events reached Datadog in about 6 seconds. Rename/delete behavior depends on FPolicy mode, protocol, and ONTAP/FSx behavior, so this post documents both the working path and the limitations observed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why FPolicy Needs Fargate
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3&lt;/a&gt;, we showed how EMS webhooks deliver ARP alerts via API Gateway → Lambda. That works because EMS uses standard HTTPS.&lt;/p&gt;

&lt;p&gt;FPolicy is different. ONTAP's FPolicy subsystem uses a &lt;strong&gt;proprietary binary protocol over persistent TCP connections&lt;/strong&gt;. ONTAP initiates the connection to the FPolicy server and maintains it with periodic KeepAlive messages. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Lambda&lt;/strong&gt; — No persistent TCP connections, max 15-minute timeout&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;API Gateway&lt;/strong&gt; — HTTP/HTTPS only, no raw TCP&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;ECS Fargate&lt;/strong&gt; — Persistent TCP listener, private IP, auto-restart&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why I Did Not Use an NLB in This Validation
&lt;/h3&gt;

&lt;p&gt;I tested an NLB-based approach, but it did not work reliably in my validation. The issue was not that NLB cannot forward binary TCP traffic; it can. The challenge was FPolicy's stateful session negotiation and ONTAP's expectation of configured FPolicy server IPs. Health checks and connection behavior introduced additional complexity. For this validation, the simplest reliable path was to let ONTAP connect directly to the Fargate task's private IP and automate external-engine IP updates on task restart.&lt;/p&gt;

&lt;p&gt;The Fargate task runs a Python server that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Listens on TCP:9898&lt;/li&gt;
&lt;li&gt;Handles FPolicy protocol negotiation (version handshake)&lt;/li&gt;
&lt;li&gt;Receives KeepAlive messages (connection health)&lt;/li&gt;
&lt;li&gt;Parses file operation notifications&lt;/li&gt;
&lt;li&gt;Forwards structured events to SQS&lt;/li&gt;
&lt;/ol&gt;
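
&lt;p&gt;As a rough illustration of step 1 only, the sketch below keeps a TCP connection open and hands every received chunk to a callback. It is not the repository's server: the names &lt;code&gt;serve&lt;/code&gt;, &lt;code&gt;handle_connection&lt;/code&gt;, and &lt;code&gt;on_chunk&lt;/code&gt; are hypothetical, and FPolicy framing, the version handshake, and KeepAlive replies are deliberately left out because they depend on the wire format.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

LISTEN_PORT = 9898  # port used throughout this post


async def handle_connection(reader, writer, on_chunk):
    """Read from one persistent connection until the peer closes it."""
    peer = writer.get_extra_info("peername")
    print(f"[+] Connection from {peer}")
    try:
        while True:
            chunk = await reader.read(65536)
            if not chunk:  # empty read means the peer closed the connection
                break
            on_chunk(chunk)  # protocol decoding would happen in this callback
    finally:
        writer.close()
        await writer.wait_closed()


async def serve(on_chunk, host="0.0.0.0", port=LISTEN_PORT):
    """Start the listener and return the asyncio server object."""
    async def client_connected(reader, writer):
        await handle_connection(reader, writer, on_chunk)

    return await asyncio.start_server(client_connected, host, port)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Binding to port &lt;code&gt;0&lt;/code&gt; on &lt;code&gt;127.0.0.1&lt;/code&gt; is a convenient way to try the listener locally before pointing ONTAP at port 9898.&lt;/p&gt;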

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SMB/NFS Client
    │ file create/write/rename/delete
    ▼
FSx for ONTAP (FPolicy enabled)
    │ proprietary TCP protocol
    ▼
ECS Fargate (TCP:9898)
    │ parse → normalize → forward
    ▼
SQS Queue
    │ event source mapping
    ▼
Lambda (fpolicy_handler)
    │ format → ship
    ▼
Datadog Logs API v2 (source:fsxn-fpolicy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP connects TO Fargate&lt;/strong&gt; — the Fargate task must be reachable on a private IP. Because that IP can change on task restart, the ONTAP external engine must be updated whenever it changes, either by automation or by a manual operational step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQS decouples&lt;/strong&gt; the TCP server from the shipping logic — if Datadog is slow, events buffer in SQS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda handles Datadog shipping&lt;/strong&gt; — retry logic, batch formatting, API key management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No NLB&lt;/strong&gt; — ONTAP connects directly to the Fargate task's private IP&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;FSx for ONTAP file system with a CIFS-enabled SVM&lt;/li&gt;
&lt;li&gt;VPC with private subnets (same as FSx for ONTAP)&lt;/li&gt;
&lt;li&gt;ECR repository with the FPolicy server image&lt;/li&gt;
&lt;li&gt;Private subnet egress for Fargate: either a NAT Gateway or VPC endpoints for ECR image pull, CloudWatch Logs, and SQS access&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy the Fargate Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; shared/templates/fpolicy-server-fargate.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-fpolicy-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;VpcId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-vpc-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;SubnetIds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-private-subnet&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;FsxnSvmSecurityGroupId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;fsx-sg-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ContainerImage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;account&amp;gt;.dkr.ecr.&amp;lt;region&amp;gt;.amazonaws.com/fsxn-fpolicy-server:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS Cluster + Fargate Service (1 task)&lt;/li&gt;
&lt;li&gt;SQS Queue for FPolicy events&lt;/li&gt;
&lt;li&gt;Security Group (inbound TCP:9898 from FSx SG)&lt;/li&gt;
&lt;li&gt;CloudWatch Log Group&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Deploy the Datadog Shipping Lambda
&lt;/h3&gt;

&lt;p&gt;The template accepts the SQS queue ARN as a parameter and automatically creates the event source mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get the SQS queue ARN from Step 1 outputs&lt;/span&gt;
&lt;span class="nv"&gt;SQS_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws cloudformation describe-stacks &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-fpolicy-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"Stacks[0].Outputs[?OutputKey=='FPolicyQueueArn'].OutputValue"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template-ems-fpolicy.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-ems-fpolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;secret-arn&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogSite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap1.datadoghq.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;FPolicySqsQueueArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SQS_ARN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates the Lambda function with an SQS event source mapping — no manual &lt;code&gt;create-event-source-mapping&lt;/code&gt; needed.&lt;/p&gt;
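
&lt;p&gt;To make the SQS-to-Lambda hand-off concrete, here is a minimal sketch of a handler that decodes SQS record bodies and reports how many events it shipped. It is an illustration rather than the repository's &lt;code&gt;fpolicy_handler&lt;/code&gt;: &lt;code&gt;parse_sqs_records&lt;/code&gt; and the injected &lt;code&gt;ship&lt;/code&gt; callable are hypothetical names, and the return shape simply mirrors the &lt;code&gt;shipped&lt;/code&gt; counter that appears in the Lambda logs in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json


def parse_sqs_records(event):
    """Decode the JSON body of every SQS record, skipping malformed ones."""
    events = []
    for record in event.get("Records", []):
        try:
            events.append(json.loads(record["body"]))
        except (KeyError, ValueError):
            print(f"[WARNING] skipping unparseable record {record.get('messageId')}")
    return events


def handler(event, context, ship=None):
    """Lambda entry point; `ship` is injected so the sketch runs offline."""
    events = parse_sqs_records(event)
    if ship is not None and events:
        ship(events)  # e.g. a function that posts the batch to Datadog
    return {"statusCode": 200, "body": {"shipped": len(events)}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Skipping malformed records keeps one bad message from blocking the batch; a dead-letter queue is the usual place to keep such messages if you need them later.&lt;/p&gt;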

&lt;h3&gt;
  
  
  Step 3: Get the Fargate Task IP
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TASK_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws ecs list-tasks &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; fsxn-fpolicy-server-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; fsxn-fpolicy-server-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"taskArns[0]"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

aws ecs describe-tasks &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; fsxn-fpolicy-server-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tasks&lt;/span&gt; &lt;span class="nv"&gt;$TASK_ARN&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"tasks[0].containers[0].networkInterfaces[0].privateIpv4Address"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ONTAP FPolicy Configuration
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CLI note&lt;/strong&gt;: Some ONTAP versions show these commands under &lt;code&gt;vserver fpolicy ...&lt;/code&gt;, while newer CLI contexts may allow shortened forms. Use the command form supported by your ONTAP version. The examples below use the form validated in my environment (FSx for ONTAP 9.17.1). See &lt;a href="https://docs.netapp.com/us-en/ontap-cli-9151/vserver-fpolicy-policy-external-engine-create.html" rel="noopener noreferrer"&gt;NetApp CLI reference&lt;/a&gt; for the full command syntax.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;FPolicy requires three components: an External Engine (where to send events), an Event (what to monitor), and a Policy (linking them together).&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the External Engine
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vserver fpolicy policy external-engine create -vserver &amp;lt;svm-name&amp;gt; \
  -engine-name fpolicy_aws_engine \
  -primary-servers &amp;lt;fargate-task-ip&amp;gt; \
  -port 9898 \
  -extern-engine-type asynchronous \
  -ssl-option no-auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production note&lt;/strong&gt;: For production deployments, evaluate &lt;code&gt;server-auth&lt;/code&gt; or &lt;code&gt;mutual-auth&lt;/code&gt; instead of &lt;code&gt;no-auth&lt;/code&gt;, and validate certificate handling between ONTAP and the FPolicy server. See &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/create-fpolicy-external-engine-task.html" rel="noopener noreferrer"&gt;NetApp FPolicy external engine documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Create the FPolicy Event
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vserver fpolicy policy event create -vserver &amp;lt;svm-name&amp;gt; \
  -event-name cifs_file_events \
  -protocol cifs \
  -file-operations create,write,rename,delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: For write-heavy workloads, review the protocol-specific FPolicy filters supported by your ONTAP version and protocol. Where supported, use close/modify-oriented filters to reduce duplicate or noisy write events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Create and Enable the Policy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vserver fpolicy policy create -vserver &amp;lt;svm-name&amp;gt; \
  -policy-name fpolicy_aws \
  -events cifs_file_events \
  -engine fpolicy_aws_engine \
  -is-mandatory false

vserver fpolicy enable -vserver &amp;lt;svm-name&amp;gt; \
  -policy-name fpolicy_aws \
  -sequence-number 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example uses an asynchronous, non-mandatory policy so client file operations are not blocked by FPolicy server processing or Datadog delivery. If the FPolicy server is unavailable, file operations continue unimpeded — but notifications may be buffered or lost depending on your ONTAP version and configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify Connection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vserver fpolicy show-engine -vserver &amp;lt;svm-name&amp;gt; -engine-name fpolicy_aws_engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server status should show &lt;code&gt;connected&lt;/code&gt;. In the ECS logs, KeepAlive messages confirm the connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO] fpolicy-server: [+] Connection from ('10.0.x.x', 44107)
[INFO] fpolicy-server: [Handshake] Policy=fpolicy_aws | Session=... | VsUUID=...
[INFO] fpolicy-server: [Send] NEGO_RESP | Version=1.2 | Policy=fpolicy_aws
[INFO] fpolicy-server: [KeepAlive] Received — connection healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  E2E Validation Results
&lt;/h2&gt;

&lt;p&gt;File operations on the SMB share produce events that flow through the entire pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;ECS Log&lt;/th&gt;
&lt;th&gt;SQS&lt;/th&gt;
&lt;th&gt;Lambda&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;create &lt;code&gt;blog_demo_create.txt&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ shipped:1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;~6 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;create &lt;code&gt;blog_demo_write.txt&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ shipped:1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;~6 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;create &lt;code&gt;confidential_report_2026.xlsx&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ shipped:1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;~6 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  ECS Fargate Logs — Connection Lifecycle
&lt;/h3&gt;

&lt;p&gt;The FPolicy server logs show the complete lifecycle: server start → ONTAP connection → protocol handshake → KeepAlive → file events → SQS delivery.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7vo4iwqknavoepj0auh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7vo4iwqknavoepj0auh.png" alt="ECS Fargate CloudWatch Logs" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda CloudWatch Logs — Event Processing
&lt;/h3&gt;

&lt;p&gt;In this low-volume validation, each SQS message triggered its own Lambda invocation, and processing typically took 30-50 ms per event.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasqa80zoyvvglzeurkok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasqa80zoyvvglzeurkok.png" alt="Lambda CloudWatch Logs" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Datadog Log Explorer
&lt;/h3&gt;

&lt;p&gt;Query: &lt;code&gt;source:fsxn-fpolicy&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Each event contains structured attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;operation_type&lt;/code&gt;: The file operation (create, write, rename, delete)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;file_path&lt;/code&gt;: The file that was operated on&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;client_ip&lt;/code&gt;: The client that performed the operation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;volume_name&lt;/code&gt;: The ONTAP volume&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;svm&lt;/code&gt;: The ONTAP SVM name (may show "unknown" if not resolved from handshake context)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;: When the operation occurred&lt;/li&gt;
&lt;/ul&gt;
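
&lt;p&gt;The sketch below shows one way these fields could be wrapped into items for the Datadog Logs API v2. Nesting the event under an &lt;code&gt;attributes&lt;/code&gt; key is an assumption chosen to match the &lt;code&gt;@attributes.*&lt;/code&gt; queries used in this post, and &lt;code&gt;build_datadog_logs&lt;/code&gt; and &lt;code&gt;intake_url&lt;/code&gt; are hypothetical helper names, not functions from the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_datadog_logs(events, service="fsxn-fpolicy"):
    """Wrap normalized FPolicy events as Datadog log items."""
    items = []
    for event in events:
        operation = event.get("operation_type", "unknown")
        path = event.get("file_path", "")
        items.append({
            "ddsource": "fsxn-fpolicy",   # matches the source: query above
            "service": service,           # assumed service name
            "message": f"{operation} {path}".strip(),
            "attributes": event,          # assumed nesting, see lead-in
        })
    return items


def intake_url(site):
    """Logs intake endpoint for a Datadog site such as ap1.datadoghq.com."""
    return f"https://http-intake.logs.{site}/api/v2/logs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The resulting list is sent as the JSON body of a POST request with the API key in the &lt;code&gt;DD-API-KEY&lt;/code&gt; header.&lt;/p&gt;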

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr69olz7f2wi2lw3jau2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr69olz7f2wi2lw3jau2.png" alt="FPolicy events in Datadog Log Explorer" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs3qvl81xuxmmbecvd1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs3qvl81xuxmmbecvd1a.png" alt="FPolicy event detail — structured attributes visible in the side panel" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Correlating FPolicy with ARP
&lt;/h2&gt;

&lt;p&gt;The real power emerges when you combine FPolicy file activity with ARP ransomware detection from Part 3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source:(fsxn-fpolicy OR fsxn-ems) @attributes.svm:svm-prod-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This correlation query shows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ARP alert&lt;/strong&gt; (from EMS): "Ransomware detected on volume X"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File operations&lt;/strong&gt; (from FPolicy): Which client IP (and which user, where the notification carries one) touched which files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Together they answer the critical incident response questions: &lt;em&gt;What happened, who did it, and from where?&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Use Case: Detecting Suspicious File Creation Bursts
&lt;/h3&gt;

&lt;p&gt;With FPolicy create events in Datadog, you can create a Monitor that fires when a single client creates more than 50 files in 5 minutes — a potential indicator of ransomware encryption or unauthorized bulk operations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog Monitor query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logs("source:fsxn-fpolicy @attributes.operation_type:create").rollup("count").by("@attributes.client_ip").last("5m") &amp;gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alert message:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚨 Suspicious file creation burst detected on FSx for ONTAP

Client IP: {{@attributes.client_ip}}
Volume: {{@attributes.volume_name}}
Count: {{value}} file creations in 5 minutes

Investigate immediately — check if this is authorized batch processing or potential ransomware activity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on delete monitoring&lt;/strong&gt;: If your FPolicy configuration and ONTAP version reliably deliver delete events (e.g., synchronous mode or a future ONTAP release), you can extend this pattern to bulk deletion detection. In my async-mode validation, delete notifications were not reliably delivered — I recommend using audit logs from &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt; for delete-event completeness.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Burst detection at this speed is difficult to achieve with traditional audit log polling, which depends on rotation and scheduler intervals. FPolicy's event-driven delivery makes sub-minute detection possible for the operations it reliably captures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fargate Task IP Changes
&lt;/h3&gt;

&lt;p&gt;When a Fargate task restarts (deployment, crash, scaling), it gets a new private IP. ONTAP's External Engine must be updated with the new IP. Options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Manual update&lt;/strong&gt;: &lt;code&gt;vserver fpolicy policy external-engine modify -vserver &amp;lt;svm-name&amp;gt; -engine-name fpolicy_aws_engine -primary-servers &amp;lt;new-ip&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated&lt;/strong&gt;: Lambda triggered by ECS task state change → ONTAP REST API update&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The repository includes a helper script (&lt;code&gt;shared/scripts/fpolicy-update-engine-ip.sh --auto&lt;/code&gt;) that detects the current task IP and updates the ONTAP engine. For full automation, wire an EventBridge rule on ECS task state changes to an update Lambda — this is not included in the base stack but is straightforward to add. Automated updates require network reachability to the ONTAP management endpoint and credentials (stored in Secrets Manager) with permission to modify the FPolicy external engine.&lt;/p&gt;
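
&lt;p&gt;A minimal sketch of the automated path, under stated assumptions: it pulls the new private IP out of an ECS Task State Change event and prepares (but does not send) a PATCH request for the external engine. The REST path and the &lt;code&gt;primary_servers&lt;/code&gt; field are assumptions to check against the ONTAP REST API reference for your version; the function names, management hostname, and SVM UUID are placeholders; authentication and TLS handling are omitted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request


def task_private_ip(ecs_event):
    """Return the ENI private IPv4 from an ECS Task State Change event, or None."""
    for attachment in ecs_event.get("detail", {}).get("attachments", []):
        for item in attachment.get("details", []):
            if item.get("name") == "privateIPv4Address":
                return item.get("value")
    return None


def build_engine_patch(mgmt_host, svm_uuid, engine_name, new_ip):
    """Prepare the PATCH request; the URL layout is an assumption to verify."""
    url = f"https://{mgmt_host}/api/protocols/fpolicy/{svm_uuid}/engines/{engine_name}"
    body = json.dumps({"primary_servers": [new_ip]}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An update Lambda would act only on events whose &lt;code&gt;lastStatus&lt;/code&gt; is &lt;code&gt;RUNNING&lt;/code&gt;, then send the prepared request with credentials read from Secrets Manager.&lt;/p&gt;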

&lt;h3&gt;
  
  
  Restart Resilience — Validated
&lt;/h3&gt;

&lt;p&gt;I tested the full restart cycle to confirm the pipeline recovers gracefully:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stop Fargate (scale to 0)&lt;/td&gt;
&lt;td&gt;Task stopped&lt;/td&gt;
&lt;td&gt;~30s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restart Fargate (scale to 1)&lt;/td&gt;
&lt;td&gt;New task, new IP&lt;/td&gt;
&lt;td&gt;~45s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update ONTAP Engine IP&lt;/td&gt;
&lt;td&gt;Reconnection&lt;/td&gt;
&lt;td&gt;~20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File operation after restart&lt;/td&gt;
&lt;td&gt;Event delivered to Datadog&lt;/td&gt;
&lt;td&gt;~6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total recovery time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2 minutes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Lambda's retry logic also proved itself: on the first request after reconnection, a transient &lt;code&gt;RemoteDisconnected&lt;/code&gt; error occurred. The exponential backoff retry succeeded on the second attempt — exactly the behavior we designed for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[WARNING] HTTP error shipping to Datadog (attempt 1/3): RemoteDisconnected
[INFO]    Processing complete: {"statusCode": 200, "body": {"shipped": 1}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
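
&lt;p&gt;The retry pattern behind that log line can be sketched as follows. This is a simplified illustration, not the Lambda's actual code: &lt;code&gt;ship_with_retry&lt;/code&gt;, &lt;code&gt;send&lt;/code&gt;, and the one-second base delay are assumptions, while the three-attempt limit matches the &lt;code&gt;attempt 1/3&lt;/code&gt; shown in the log.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time


def ship_with_retry(send, payload, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send(payload), retrying connection errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except OSError as exc:  # RemoteDisconnected and URLError derive from OSError
            if attempt == max_attempts:
                raise  # out of attempts: let Lambda and SQS handle redelivery
            print(f"[WARNING] shipping failed (attempt {attempt}/{max_attempts}): {exc}")
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, then 2s, ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter keeps the function easy to test without real delays.&lt;/p&gt;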



&lt;h3&gt;
  
  
  Cost Profile
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly Cost (estimate)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fargate (0.25 vCPU, 0.5 GB)&lt;/td&gt;
&lt;td&gt;~$10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQS (low volume)&lt;/td&gt;
&lt;td&gt;&amp;lt; $1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda (event-driven)&lt;/td&gt;
&lt;td&gt;&amp;lt; $1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Logs&lt;/td&gt;
&lt;td&gt;~$2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$14/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare this with an always-on EC2-based collector, which adds OS patching, agent management, and HA considerations on top of the instance cost. Exact EC2 costs vary by region and instance type.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is an AWS-side estimate and excludes Datadog ingest/retention costs, NAT Gateway or VPC endpoint charges, ECR storage, and high-volume CloudWatch Logs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Scaling
&lt;/h3&gt;

&lt;p&gt;A single Fargate task is sufficient for the low-volume validation scenarios in this post. The architecture can scale by tuning Fargate CPU/memory, SQS buffering, and Lambda concurrency, but you should benchmark your own workload before assuming a specific events/sec capacity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;Key CloudWatch metrics to watch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AWS/ECS CPUUtilization&lt;/code&gt; — Fargate task health&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AWS/SQS ApproximateNumberOfMessagesVisible&lt;/code&gt; — Queue depth (should stay near 0)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AWS/Lambda Errors&lt;/code&gt; — Shipping failures&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AWS/Lambda Duration&lt;/code&gt; — Processing time per batch&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The FPolicy Server
&lt;/h2&gt;

&lt;p&gt;The FPolicy server (&lt;code&gt;shared/fpolicy-server/fpolicy_server.py&lt;/code&gt;) implements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Protocol negotiation&lt;/strong&gt;: Responds to ONTAP's version handshake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KeepAlive handling&lt;/strong&gt;: Acknowledges connection health checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event parsing&lt;/strong&gt;: Extracts file path, operation, user, client IP from binary frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQS forwarding&lt;/strong&gt;: Sends normalized JSON events to the queue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write coalescing&lt;/strong&gt;: Configurable delay to batch rapid write events (default: 5 seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The server runs in &lt;code&gt;realtime&lt;/code&gt; mode — events are forwarded as they arrive, with an optional write-complete delay to avoid duplicate notifications for multi-write operations.&lt;/p&gt;
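
&lt;p&gt;One way a write-complete delay could work is a per-file debounce: every new write to a path restarts a timer, and only the last event is forwarded once the file has been quiet for the delay. The sketch below illustrates that idea under my own assumptions; &lt;code&gt;WriteCoalescer&lt;/code&gt; is a hypothetical class and not the implementation in &lt;code&gt;fpolicy_server.py&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio


class WriteCoalescer:
    """Forward only the last write event per file after a quiet period."""

    def __init__(self, delay_sec, forward):
        self.delay_sec = delay_sec  # e.g. 5 seconds, the default named in this post
        self.forward = forward      # callable invoked once per quiet file
        self._timers = {}           # file_path mapped to its pending timer

    def on_write(self, event):
        """Must be called from inside a running asyncio event loop."""
        path = event["file_path"]
        previous = self._timers.pop(path, None)
        if previous is not None:
            previous.cancel()       # a newer write restarts the quiet period
        loop = asyncio.get_running_loop()
        self._timers[path] = loop.call_later(self.delay_sec, self._flush, path, event)

    def _flush(self, path, event):
        self._timers.pop(path, None)
        self.forward(event)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The trade-off is the one described above: a longer delay suppresses more duplicates but adds the same amount of latency to every write event.&lt;/p&gt;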

&lt;h2&gt;
  
  
  Limitations and Future Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rename/Delete Events Not Delivered in Async Mode
&lt;/h3&gt;

&lt;p&gt;In my E2E testing, ONTAP did not deliver rename or delete notifications to the FPolicy server in asynchronous mode — even though these operations are configured in the FPolicy event definition. Only create events were reliably delivered. This appears to be a limitation of FSx for ONTAP's FPolicy implementation in async mode for certain operation types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use synchronous mode (adds latency to file operations — not recommended for production)&lt;/li&gt;
&lt;li&gt;Combine FPolicy (event-driven create) with audit log polling (catches rename/delete in EVTX)&lt;/li&gt;
&lt;li&gt;Accept create-only monitoring for event-driven alerting, use audit logs for forensic completeness&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  NFS Protocol Support
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;FPolicy Support&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SMB/CIFS&lt;/td&gt;
&lt;td&gt;✅ Verified&lt;/td&gt;
&lt;td&gt;Primary validation protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv3&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;td&gt;Requires explicit &lt;code&gt;vers=3&lt;/code&gt; mount option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4.0&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;td&gt;Requires explicit &lt;code&gt;vers=4.0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4.1&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;td&gt;Requires ONTAP 9.15.1+, explicit &lt;code&gt;vers=4.1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFSv4.2&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;td&gt;ONTAP FPolicy does not monitor NFSv4.2 operations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For protocol support details, verify your ONTAP version. NetApp &lt;a href="https://kb.netapp.com/onprem/ontap/da/NAS/Does_ONTAP_support_FPolicy_for_NFS_4.2" rel="noopener noreferrer"&gt;documents&lt;/a&gt; that FPolicy does not currently support NFSv4.2; supported NFS protocols include NFSv3, NFSv4.0, and NFSv4.1 (ONTAP 9.15.1+).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical gotcha:&lt;/strong&gt; &lt;code&gt;mount -o vers=4&lt;/code&gt; on modern Linux negotiates to NFSv4.2, which ONTAP FPolicy does &lt;strong&gt;not&lt;/strong&gt; support. Always specify the version explicitly: &lt;code&gt;mount -o vers=4.1&lt;/code&gt; or &lt;code&gt;vers=3&lt;/code&gt;.&lt;/p&gt;
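
&lt;p&gt;Because the negotiated version can differ from what was requested, it helps to check the effective mount options (for example the options column in &lt;code&gt;/proc/mounts&lt;/code&gt;). The following is a small sketch of such a check; &lt;code&gt;nfs_version&lt;/code&gt; and &lt;code&gt;fpolicy_can_monitor&lt;/code&gt; are hypothetical helpers, and the supported set simply restates the table above (NFSv4.1 additionally requires ONTAP 9.15.1+).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;SUPPORTED_NFS_VERSIONS = ("3", "4.0", "4.1")  # per the table in this post


def nfs_version(mount_options):
    """Extract the vers= (or nfsvers=) value from a comma-separated option string."""
    for option in mount_options.split(","):
        key, _, value = option.strip().partition("=")
        if key in ("vers", "nfsvers"):
            return value
    return None


def fpolicy_can_monitor(mount_options):
    """True when the negotiated NFS version is one FPolicy is documented to monitor."""
    return nfs_version(mount_options) in SUPPORTED_NFS_VERSIONS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a mount whose options include &lt;code&gt;vers=4.2&lt;/code&gt;, &lt;code&gt;fpolicy_can_monitor&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt;, which is the signal to remount with an explicit version.&lt;/p&gt;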

&lt;p&gt;&lt;strong&gt;NFS + FPolicy latency:&lt;/strong&gt; NFSv3 lacks close semantics, so the FPolicy server cannot know when a write is complete. The server uses a configurable &lt;code&gt;WRITE_COMPLETE_DELAY_SEC&lt;/code&gt; (default: 5s) to wait before forwarding the event. This adds latency but prevents premature processing of incomplete files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NFS write hang (observed):&lt;/strong&gt; In some configurations, NFS write operations may hang when FPolicy is enabled — even with &lt;code&gt;is-mandatory=false&lt;/code&gt;. This is a &lt;a href="https://kb.netapp.com/onprem/ontap/da/NAS/NFS_hung_slowness_issue_when_dealing_with_long_path_names_with_FPolicy_enabled" rel="noopener noreferrer"&gt;known ONTAP behavior&lt;/a&gt; related to FPolicy notification processing. If you experience this, verify your ONTAP version and consider limiting FPolicy scope to specific volumes.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Identity
&lt;/h3&gt;

&lt;p&gt;In the current implementation, the &lt;code&gt;user&lt;/code&gt; field may be empty for some operations depending on ONTAP's FPolicy notification content. The FPolicy binary frame includes user identity in extended attributes that require additional parsing logic. Future versions will extract this from the NOTI_REQ body.&lt;/p&gt;

&lt;h3&gt;
  
  
  Event Durability During Restarts
&lt;/h3&gt;

&lt;p&gt;In my validation, events generated while the Fargate server was disconnected were not observed downstream in Datadog after reconnection. Treat FPolicy delivery during server outages as something you must validate in your own environment.&lt;/p&gt;

&lt;p&gt;ONTAP &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/synchronous-asynchronous-notifications-concept.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; describes buffering behavior for asynchronous notifications — notifications generated during a network outage are stored on the storage node and can be fetched when the server comes back online. Beginning with ONTAP 9.14.1, &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;FPolicy persistent store&lt;/a&gt; support is available for asynchronous non-mandatory policies. If you cannot tolerate event loss during FPolicy server restarts, evaluate persistent store and validate the behavior on your FSx for ONTAP version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git

&lt;span class="c"&gt;# Deploy prerequisites (if not already done)&lt;/span&gt;
aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; shared/templates/fpolicy-server-fargate.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-fpolicy-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;VpcId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-vpc&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;SubnetIds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-subnet&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;FsxnSvmSecurityGroupId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;fsx-sg&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ContainerImage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-ecr-image&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM

&lt;span class="c"&gt;# Configure ONTAP FPolicy (see ONTAP section above)&lt;/span&gt;
&lt;span class="c"&gt;# Create a file on the SMB share&lt;/span&gt;
&lt;span class="c"&gt;# Check Datadog: source:fsxn-fpolicy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where FPolicy Fits in ONTAP Telemetry
&lt;/h2&gt;

&lt;p&gt;This series covers three ONTAP telemetry sources. Each serves a different purpose:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Source&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compliance audit trail&lt;/td&gt;
&lt;td&gt;Audit logs (&lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;Minutes (scheduler interval)&lt;/td&gt;
&lt;td&gt;Complete historical record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ransomware detection&lt;/td&gt;
&lt;td&gt;ARP via EMS (&lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;~30 seconds (webhook)&lt;/td&gt;
&lt;td&gt;ML-based pattern detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event-driven file activity signal&lt;/td&gt;
&lt;td&gt;FPolicy (this post)&lt;/td&gt;
&lt;td&gt;~6 seconds (TCP)&lt;/td&gt;
&lt;td&gt;Create events validated; other operations depend on mode/version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forensic investigation&lt;/td&gt;
&lt;td&gt;Audit logs + FPolicy correlation&lt;/td&gt;
&lt;td&gt;Combined&lt;/td&gt;
&lt;td&gt;Timeline reconstruction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;FPolicy is not a replacement for audit logs.&lt;/strong&gt; It provides an event-driven signal for detection and alerting. Audit logs provide the authoritative, complete historical record for compliance and forensics. Use them together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use Fargate for the FPolicy TCP listener&lt;/strong&gt;: Lambda cannot host a long-lived inbound TCP listener, since its invocations are short-lived. Fargate provides the long-running listener without OS management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use SQS to decouple ingestion from shipping&lt;/strong&gt;: If Datadog is slow or Lambda is throttled, events buffer in SQS until they can be shipped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate operation coverage in your environment&lt;/strong&gt;: Async mode reliably delivered create events in my testing. Rename and delete behavior varies by ONTAP version and mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use audit logs for forensic completeness&lt;/strong&gt;: FPolicy provides an event-driven signal for detection; audit logs (Part 2) provide the complete historical record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat FPolicy as event-driven alerting, not a full audit replacement&lt;/strong&gt;: The two are complementary, not interchangeable.&lt;/li&gt;
&lt;/ol&gt;
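
&lt;p&gt;As a rough illustration of takeaway 2, the following Python sketch models the decoupling with an in-memory queue standing in for SQS. It is not code from the repository: &lt;code&gt;ingest&lt;/code&gt;, &lt;code&gt;ship_batch&lt;/code&gt;, and the &lt;code&gt;send&lt;/code&gt; callback are hypothetical names, and a real pipeline would rely on SQS visibility timeouts rather than manual re-queueing.&lt;/p&gt;

```python
import queue

# In-memory stand-in for SQS; every name in this sketch is hypothetical.
buffer = queue.Queue()


def ingest(event):
    """Listener side: enqueue and return immediately, never call the backend."""
    buffer.put(event)


def ship_batch(send, max_batch=10):
    """Shipper side: drain up to max_batch events and hand them to send().

    If send() raises, the drained events are put back so nothing is lost.
    Returns the number of events shipped.
    """
    batch = []
    while len(batch) != max_batch:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    if not batch:
        return 0
    try:
        send(batch)
    except Exception:
        for event in batch:
            buffer.put(event)  # re-queue for a later retry
        return 0
    return len(batch)
```

&lt;p&gt;When &lt;code&gt;send&lt;/code&gt; fails, the batch goes back on the queue, so a slow or unavailable backend delays events instead of dropping them.&lt;/p&gt;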

&lt;h2&gt;
  
  
  Production Considerations Beyond This Validation
&lt;/h2&gt;

&lt;p&gt;This post validates the end-to-end path. For production deployments, the following topics warrant additional design work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Key Questions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HA / Multi-AZ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ONTAP external engine supports &lt;code&gt;primary-servers&lt;/code&gt; and &lt;code&gt;secondary-servers&lt;/code&gt;. How to run multiple Fargate tasks across AZs?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope Design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Which volumes, operations, and protocols to monitor? How to avoid noisy workloads?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Hardening&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TLS/mTLS for FPolicy, ECR image scanning, VPC Flow Logs, task role least-privilege&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FPolicy generates events per file operation — Datadog ingest can become the dominant cost at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operations Runbook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Task restart, engine disconnected, SQS backlog, Datadog missing events, NFS hang&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stable Endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-update Lambda for engine IP, or primary/secondary server design for zero-downtime restarts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These topics are documented in the repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-production-architecture-patterns.md" rel="noopener noreferrer"&gt;Production Architecture Patterns&lt;/a&gt;&lt;/strong&gt; — Single task, primary/secondary, auto-update, multi-AZ patterns with failure mode matrix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-operational-guide.md" rel="noopener noreferrer"&gt;Operational Guide&lt;/a&gt;&lt;/strong&gt; — 4-layer health model, runbooks, IP reconciliation, synthetic health check&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-poc-checklist.md" rel="noopener noreferrer"&gt;PoC Checklist&lt;/a&gt;&lt;/strong&gt; — Preconditions, scope, validation steps, success criteria, go/no-go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Contributions and questions are welcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Why Your FSx for ONTAP Logs Deserve Better&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Shipping FSx for ONTAP Logs to Datadog, The Serverless Way&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Event-Driven Ransomware Detection with ONTAP ARP + Datadog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: FPolicy File Activity Pipeline (this post)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coming next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Splunk&lt;/strong&gt;: Replacing EC2 + Universal Forwarder with Lambda + HEC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: The vendor-neutral escape hatch&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Questions about FPolicy or the Fargate architecture? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/fsxn-observability-integrations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>datadog</category>
      <category>amazonfsxfornetappontap</category>
    </item>
    <item>
      <title>A Builder's Diary: The Road to Orchestrating Two Worlds</title>
      <dc:creator>Diana Castro</dc:creator>
      <pubDate>Mon, 18 May 2026 00:38:32 +0000</pubDate>
      <link>https://dev.to/aws-builders/diario-de-una-builder-el-camino-hacia-la-orquestacion-de-dos-mundos-4fd0</link>
      <guid>https://dev.to/aws-builders/diario-de-una-builder-el-camino-hacia-la-orquestacion-de-dos-mundos-4fd0</guid>
      <description>&lt;h1&gt;
  
  
  Learning a Second Cloud Without Starting from Scratch
&lt;/h1&gt;

&lt;p&gt;In technology there is an uncomfortable but also liberating truth: we never finish mastering a topic completely. What you knew yesterday can be obsolete tomorrow, and in the world of public clouds, where services evolve constantly, it is practically impossible to know every detail of every tool.&lt;/p&gt;

&lt;p&gt;Rather than aspiring to know everything, the real focus is on understanding the fundamentals and specializing in certain domains. It is about recognizing which services exist, what they were designed for, and in which scenarios they add value. That way, when you face a real problem, you do not start from scratch: you know what to look for and where to find support.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Challenge of Learning Another Cloud
&lt;/h2&gt;

&lt;p&gt;Rather than mastering a cloud in its entirety, the real focus is on continuous learning and on developing the technical judgment to understand how services work and when to use them.&lt;/p&gt;

&lt;p&gt;And through one of those opportunities life hands you, for which I am enormously grateful, I ended up facing a new challenge: learning a second cloud.&lt;/p&gt;

&lt;p&gt;A challenge that commands respect.&lt;br&gt;&lt;br&gt;
One that can even create some uncertainty.&lt;br&gt;&lt;br&gt;
But one that also expands the way we think about architecture.&lt;/p&gt;

&lt;p&gt;The question then became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do you approach this challenge without starting completely from scratch?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer was to reuse foundational knowledge.&lt;/p&gt;

&lt;p&gt;Instead of learning from a blank page, I started looking for patterns, equivalences, and analogies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This service resembles that one.&lt;/li&gt;
&lt;li&gt;This solution solves a similar problem in another cloud.&lt;/li&gt;
&lt;li&gt;This concept changes its name, but not necessarily its purpose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yes, that approach works… until it stops working.&lt;/p&gt;


&lt;h2&gt;
  
  
  When Equivalences Are No Longer Enough
&lt;/h2&gt;

&lt;p&gt;The first impulse when learning a second cloud is to look for direct translations between services. That is natural: we need familiar references to orient ourselves.&lt;/p&gt;

&lt;p&gt;But eventually the important differences show up.&lt;/p&gt;

&lt;p&gt;You discover that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Region Pairs&lt;/code&gt; in Azure approach Disaster Recovery differently.&lt;/li&gt;
&lt;li&gt;The identity model does not map &lt;code&gt;1:1&lt;/code&gt; to AWS.&lt;/li&gt;
&lt;li&gt;Assumptions about automatic failover can be completely inverted.&lt;/li&gt;
&lt;li&gt;Resource organization follows different philosophies.&lt;/li&gt;
&lt;li&gt;Even the way you operate and navigate the platform changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then something interesting happens: you stop trying to translate one cloud into the other and start understanding how each provider thinks.&lt;/p&gt;

&lt;p&gt;That is usually the point where the learning really begins.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Most Valuable Thing Is Not Memorizing Services
&lt;/h2&gt;

&lt;p&gt;Over time I understood something important:&lt;/p&gt;

&lt;p&gt;Multi-cloud does not mean knowing more service names.&lt;/p&gt;

&lt;p&gt;It means developing the ability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify patterns,&lt;/li&gt;
&lt;li&gt;understand different operating models,&lt;/li&gt;
&lt;li&gt;question assumptions,&lt;/li&gt;
&lt;li&gt;and design around the strengths of each cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in the end, architecture is not about memorizing catalogs.&lt;br&gt;&lt;br&gt;
It is about technical judgment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt; &lt;br&gt;
Multi-cloud is not about knowing more services. &lt;br&gt;
It is about learning to think differently and &lt;br&gt;
designing around the strengths of each cloud.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Shared Responsibility Model
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;(AWS Shared Responsibility Model &amp;amp; Azure Shared Responsibility Model)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The shared responsibility model is conceptually the same in AWS and Azure: the provider secures the cloud infrastructure, while the customer is responsible for configuration, data, and access.&lt;/p&gt;

&lt;p&gt;However, although the principle is equivalent, its implementation varies with the service's level of abstraction and each provider's philosophy.&lt;/p&gt;

&lt;p&gt;At first glance it can look like a simple concept… until you get to the details.&lt;/p&gt;

&lt;p&gt;Default values, initial configurations, and the way each cloud applies its controls are not identical. And, as is often the case in technology, the devil is in the details.&lt;/p&gt;

&lt;p&gt;Consider the classic house analogy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The provider builds the structure.&lt;/li&gt;
&lt;li&gt;It guarantees that the infrastructure is secure.&lt;/li&gt;
&lt;li&gt;But you decide who gets in, what permissions they have, and how you protect what you keep inside.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem is that not every house comes configured the same way.&lt;/p&gt;

&lt;p&gt;Some platforms enable more controls from the start.&lt;br&gt;&lt;br&gt;
Others require the customer to define them explicitly.&lt;/p&gt;

&lt;p&gt;And that is where it becomes evident that, although the model is the same in theory, the implementation changes significantly in practice.&lt;/p&gt;

&lt;p&gt;Because in multi-cloud it is not enough to understand &lt;em&gt;what&lt;/em&gt; you are responsible for protecting.&lt;/p&gt;

&lt;p&gt;You also need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how each provider interprets that responsibility,&lt;/li&gt;
&lt;li&gt;which controls come enabled by default,&lt;/li&gt;
&lt;li&gt;which configurations require manual intervention,&lt;/li&gt;
&lt;li&gt;and which security assumptions you are inheriting without realizing it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is usually one of the first moments when you discover that learning a second cloud is not about memorizing services… but about adjusting the way you think about security.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cloud Structure
&lt;/h2&gt;

&lt;p&gt;It would be a mistake to try to define equivalences between services without first understanding how each cloud is organized. Before talking about services, networking, or security, we need to understand the foundation everything is built on.&lt;/p&gt;

&lt;p&gt;Because although AWS and Azure share many concepts, the way they structure their infrastructure reflects quite different philosophies.&lt;/p&gt;

&lt;p&gt;This walkthrough does not aim to be exhaustive.&lt;br&gt;&lt;br&gt;
The idea is to build a quick mental map that helps you see where the similarities begin… and where the important differences appear.&lt;/p&gt;


&lt;h3&gt;
  
  
  Global Organization
&lt;/h3&gt;

&lt;p&gt;At the global level, Azure and AWS adopt different strategies to organize and isolate their infrastructure.&lt;/p&gt;

&lt;p&gt;In Azure, global organization is based on &lt;strong&gt;Geographies&lt;/strong&gt;, which group multiple regions within a single boundary oriented mainly toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;regulatory compliance,&lt;/li&gt;
&lt;li&gt;data residency,&lt;/li&gt;
&lt;li&gt;and latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These geographies are part of a highly interconnected environment where services, identity, and governance are managed in a relatively unified way.&lt;/p&gt;

&lt;p&gt;AWS, by contrast, structures its global organization through &lt;strong&gt;Partitions&lt;/strong&gt;, which represent much sharper isolation boundaries at both the technical and regulatory level.&lt;/p&gt;

&lt;p&gt;Each partition works practically as an independent environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;separate services,&lt;/li&gt;
&lt;li&gt;distinct endpoints,&lt;/li&gt;
&lt;li&gt;its own controls,&lt;/li&gt;
&lt;li&gt;and even IAM isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That approach means AWS places much more emphasis on decoupling its global environments from one another.&lt;/p&gt;
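
&lt;p&gt;One place where partitions become visible is the ARN, whose second field names the partition (&lt;code&gt;arn:partition:service:region:account-id:resource&lt;/code&gt;). The Python sketch below is only an illustration with hypothetical helper names; it extracts that field and checks whether two ARNs live in the same partition.&lt;/p&gt;

```python
def arn_partition(arn):
    """Return the partition field of an ARN (the second colon-separated field)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("not an ARN: " + arn)
    return parts[1]


def same_partition(arn_a, arn_b):
    """True when both ARNs belong to the same partition."""
    return arn_partition(arn_a) == arn_partition(arn_b)
```

&lt;p&gt;For example, &lt;code&gt;arn_partition("arn:aws-us-gov:iam::123456789012:role/demo")&lt;/code&gt; returns &lt;code&gt;aws-us-gov&lt;/code&gt;, and comparing that ARN with one from the commercial &lt;code&gt;aws&lt;/code&gt; partition yields &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;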


&lt;h4&gt;
  
  
  Regions and Availability Zones
&lt;/h4&gt;

&lt;p&gt;At this level, the organization of AWS and Azure becomes much more comparable, although important differences remain.&lt;/p&gt;

&lt;p&gt;Both providers operate globally distributed regions, each composed of multiple &lt;strong&gt;Availability Zones (AZs)&lt;/strong&gt; designed to offer high availability and resilience.&lt;/p&gt;

&lt;p&gt;However, the implementation differs considerably.&lt;/p&gt;

&lt;p&gt;One of the most relevant differences is that Azure works with the concept of &lt;strong&gt;Region Pairs&lt;/strong&gt;, where a region typically has a defined counterpart for disaster recovery scenarios.&lt;/p&gt;

&lt;p&gt;This allows Microsoft to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coordinate updates,&lt;/li&gt;
&lt;li&gt;prioritize recovery,&lt;/li&gt;
&lt;li&gt;and maintain more structured continuity strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS has no automatic equivalent.&lt;/p&gt;

&lt;p&gt;Multi-region strategies must be designed explicitly by the architect.&lt;br&gt;&lt;br&gt;
That provides more flexibility, but also more responsibility.&lt;/p&gt;

&lt;p&gt;There are also relevant differences at the AZ level.&lt;/p&gt;

&lt;p&gt;AWS maintains fairly consistent coverage: most regions have between 2 and 6 Availability Zones.&lt;/p&gt;

&lt;p&gt;In Azure, although many modern regions do have multiple AZs, not all regions offer full Availability Zone support, which can affect architecture decisions depending on the location you choose.&lt;/p&gt;


&lt;h3&gt;
  
  
  Datacenters and Low-Latency Extensions
&lt;/h3&gt;

&lt;p&gt;At the lowest level of infrastructure, both providers run on physical datacenters.&lt;/p&gt;

&lt;p&gt;In both Azure and AWS, these datacenters are part of a higher-level abstraction: Availability Zones, which group multiple physical facilities to reduce single points of failure.&lt;/p&gt;

&lt;p&gt;In Azure, although the datacenter is not exposed directly as a resource, there are important concepts such as &lt;strong&gt;Fault Domains&lt;/strong&gt; and &lt;strong&gt;Update Domains&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These make it possible to distribute virtual machines in a way that minimizes the impact of physical failures or scheduled maintenance.&lt;/p&gt;

&lt;p&gt;AWS does not expose exactly the same granularity.&lt;/p&gt;

&lt;p&gt;Instead, it uses mechanisms such as &lt;strong&gt;Placement Groups&lt;/strong&gt;, distribution across AZs, and resilience design at the regional level.&lt;/p&gt;


&lt;h3&gt;
  
  
  Local Zones and Edge Computing
&lt;/h3&gt;

&lt;p&gt;Beyond the traditional datacenter, both providers have extended their infrastructure to locations closer to the end user to reduce latency.&lt;/p&gt;

&lt;p&gt;In AWS, this takes the form of &lt;strong&gt;Local Zones&lt;/strong&gt;, which extend a region into specific metropolitan areas so that workloads can run with very low latency without deploying a full region.&lt;/p&gt;

&lt;p&gt;Azure offers similar initiatives such as &lt;strong&gt;Azure Local Zones&lt;/strong&gt; and &lt;strong&gt;Azure Stack Edge&lt;/strong&gt;, although their availability is currently more limited and the approach tends to combine low latency with hybrid integration.&lt;/p&gt;


&lt;h3&gt;
  
  
  Comparative Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Azure&lt;/th&gt;
&lt;th&gt;AWS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Level 1: Global&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Geography&lt;/strong&gt; (&lt;code&gt;US&lt;/code&gt;, &lt;code&gt;Europe&lt;/code&gt;, &lt;code&gt;Asia Pacific&lt;/code&gt;)  &lt;br&gt;&lt;br&gt;• Groups multiple regions  &lt;br&gt;• Defines data residency  &lt;br&gt;• Compliance boundary  &lt;br&gt;• Unified environment&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Partition&lt;/strong&gt; (&lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;aws-cn&lt;/code&gt;, &lt;code&gt;aws-us-gov&lt;/code&gt;)  &lt;br&gt;&lt;br&gt;• Groups multiple regions  &lt;br&gt;• Complete isolation of IAM, services, and endpoints  &lt;br&gt;• Legal and regulatory boundary  &lt;br&gt;• Independent environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Level 2: Regional&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Region&lt;/strong&gt; (&lt;code&gt;East US&lt;/code&gt;, &lt;code&gt;West Europe&lt;/code&gt;)  &lt;br&gt;&lt;br&gt;• Multiple global regions  &lt;br&gt;• Each region can have multiple AZs  &lt;br&gt;• Defined Region Pairs  &lt;br&gt;• Coordinated updates  &lt;br&gt;• Recovery prioritization&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Region&lt;/strong&gt; (&lt;code&gt;us-east-1&lt;/code&gt;, &lt;code&gt;eu-west-1&lt;/code&gt;)  &lt;br&gt;&lt;br&gt;• Multiple global regions  &lt;br&gt;• Each region has multiple AZs  &lt;br&gt;• No automatic pairing  &lt;br&gt;• Multi-region strategy defined by the architect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Level 3: Availability Zones&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Availability Zone (AZ)&lt;/strong&gt;  &lt;br&gt;&lt;br&gt;• 3 or more AZs in supported regions  &lt;br&gt;• Physically separate datacenters  &lt;br&gt;• Low latency between AZs  &lt;br&gt;• Not all regions have AZs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Availability Zone (AZ)&lt;/strong&gt;  &lt;br&gt;&lt;br&gt;• Most regions have multiple AZs  &lt;br&gt;• Physically separate datacenters  &lt;br&gt;• Low latency between AZs  &lt;br&gt;• More consistent coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Level 4: Datacenter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Datacenter (not exposed to the user)&lt;/strong&gt;  &lt;br&gt;&lt;br&gt;• Multiple datacenters per AZ  &lt;br&gt;• Fault Domains  &lt;br&gt;• Update Domains  &lt;br&gt;• Platform-managed abstraction&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Datacenter (not exposed to the user)&lt;/strong&gt;  &lt;br&gt;&lt;br&gt;• Multiple datacenters per AZ  &lt;br&gt;• Placement Groups  &lt;br&gt;• Distribution managed through architecture  &lt;br&gt;• No direct equivalent to Update Domains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local extensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Azure Local Zones / Azure Stack Edge&lt;/strong&gt;  &lt;br&gt;&lt;br&gt;• Low latency  &lt;br&gt;• Hybrid scenarios  &lt;br&gt;• More limited availability&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local Zones / Wavelength Zones&lt;/strong&gt;  &lt;br&gt;&lt;br&gt;• Metropolitan extension of regions  &lt;br&gt;• Ultra-low latency  &lt;br&gt;• 5G integration and edge computing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The similarities between AWS and Azure make learning easier, but the differences in their implementation are what really define a good architecture.&lt;br&gt;&lt;br&gt;
Designing correctly means adapting patterns, not translating them literally.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  How the Clouds Are Organized
&lt;/h3&gt;

&lt;p&gt;One of my first mental shocks in the multi-cloud journey was realizing that AWS and Azure do not organize their resources the same way. It looks like an administrative detail of little importance… until the conversations about environments, permissions, billing, governance, or workload separation begin. That is when you quickly understand that each cloud's organizational structure matters far more than you imagined at the start.&lt;/p&gt;

&lt;p&gt;In fact, this has probably been one of the hardest topics both to understand and to explain when I talk with colleagues who come mainly from working with a single cloud.&lt;/p&gt;

&lt;p&gt;In AWS, the mental model revolves around the account. From my point of view, that is where the first major organizational separation is normally established. For example, if someone says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I want to separate environments”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The natural answer is usually to create separate accounts for production, development, security, or logging, which is closely aligned with AWS best practices.&lt;/p&gt;

&lt;p&gt;On top of those accounts, organizational structures are built with AWS Organizations, which lets you group them for administrative and control purposes. From there come concepts such as Organizational Units (OUs), Service Control Policies (SCPs), and centralized identities, which help establish common rules across multiple accounts.&lt;/p&gt;

&lt;p&gt;In Azure, the approach feels much more hierarchical and integrated from the start. The model is usually understood like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tenant → Subscription → Resource Group → Resource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each level serves a distinct purpose related to organization, billing, permissions, and administration. A subscription does not represent the same level of operational separation as an AWS account; it often works more as an administrative container within a larger hierarchy controlled by the tenant.&lt;/p&gt;

&lt;p&gt;From my perspective, AWS prioritizes separation through accounts more explicitly, while Azure approaches organization through a hierarchy deeply integrated into the platform's operating model. And mind you, that does not mean AWS lacks hierarchies or organizational structures; the account simply tends to become the main element around which many architectural decisions are designed.&lt;/p&gt;

&lt;p&gt;Let's look at each element in more detail from each provider's perspective.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Azure Approach
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Tenant
&lt;/h4&gt;

&lt;p&gt;It is the highest level. It represents the entire organization in Azure and is associated with a Microsoft Entra ID instance (formerly Azure Active Directory). When a company signs up for Azure, a tenant is created. Everything else lives inside it.&lt;/p&gt;




&lt;h4&gt;
  
  
  Management Group
&lt;/h4&gt;

&lt;p&gt;It is optional, but very useful in large organizations. It lets you group subscriptions to apply policies and permissions centrally.&lt;/p&gt;

&lt;p&gt;For example, you can have one Management Group for all production subscriptions and another for development, applying different rules without having to configure each subscription individually. You could also have a Management Group that groups all of the organization's subscriptions solely for governance and compliance.&lt;/p&gt;




&lt;h4&gt;
  
  
  Subscription
&lt;/h4&gt;

&lt;p&gt;It is the main administrative and financial container. Every resource created in Azure lives inside a subscription. It is also where quotas are applied and where billing is consolidated.&lt;/p&gt;

&lt;p&gt;Many organizations use separate subscriptions for production, development, or business units, more for administration and financial control than for technical separation between environments.&lt;/p&gt;

&lt;p&gt;An important detail, and a frequent source of confusion, is that although the subscription is an administrative container, you cannot mix resources from different subscriptions within the same Resource Group.&lt;/p&gt;




&lt;h4&gt;
  
  
  Resource Group
&lt;/h4&gt;

&lt;p&gt;It is a logical container within a subscription that groups resources related to a workload: App Services, databases, Cosmos DB, networks, and so on.&lt;/p&gt;

&lt;p&gt;As long as the resources belong to the same administrative scope, they can be grouped within a Resource Group. Besides organizing resources, it lets you apply permissions through RBAC and manage the full lifecycle of a solution: if you delete the Resource Group, you delete everything it contains.&lt;/p&gt;
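
&lt;p&gt;This containment shows up in Azure resource IDs, which follow the pattern &lt;code&gt;/subscriptions/{id}/resourceGroups/{name}/providers/...&lt;/code&gt;. As a minimal sketch (the function name and the sample IDs are hypothetical), the Python snippet below pulls out the subscription and the Resource Group, which makes it clear that a group name only has meaning inside one subscription.&lt;/p&gt;

```python
def parse_resource_id(resource_id):
    """Return the subscription and resource group encoded in a resource ID."""
    segments = resource_id.strip("/").split("/")
    head = [s.lower() for s in segments[:4]]
    if len(head) != 4 or head[0] != "subscriptions" or head[2] != "resourcegroups":
        raise ValueError("not a resource-group-scoped ID: " + resource_id)
    return {"subscription": segments[1], "resource_group": segments[3]}


# Hypothetical IDs: same group name, different subscriptions.
prod_vm = parse_resource_id(
    "/subscriptions/1111/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm1"
)
dev_vm = parse_resource_id(
    "/subscriptions/2222/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm1"
)
same_container = prod_vm == dev_vm  # False: the subscriptions differ
```

&lt;p&gt;Two groups that share the name &lt;code&gt;rg-app&lt;/code&gt; in different subscriptions are separate containers with separate lifecycles.&lt;/p&gt;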

&lt;p&gt;Personally, this is one of the elements that helped me most during my Azure adoption process.&lt;/p&gt;




&lt;h4&gt;
  
  
  Resource
&lt;/h4&gt;

&lt;p&gt;It is the concrete resource: a VM, a Storage Account, a NAT Gateway, or a database. It represents the smallest unit of infrastructure or service within Azure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshyr9m6gyc5r5tjcc6n6.jpg" alt=" " width="800" height="343"&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The AWS Approach
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Root Account
&lt;/h4&gt;

&lt;p&gt;It is the initial account created when an organization starts using AWS. Its root user has full, unrestricted access to all resources and services in the account.&lt;/p&gt;

&lt;p&gt;The general recommendation is not to use it for daily work, to protect it with MFA, and to reserve it only for very specific administrative tasks.&lt;/p&gt;




&lt;h4&gt;
  
  
  AWS Organizations
&lt;/h4&gt;

&lt;p&gt;It is the structure that lets you govern multiple AWS accounts from a central point. It is enabled from the Root Account, which then becomes the organization's Management Account.&lt;/p&gt;

&lt;p&gt;From there you can create member accounts, group them, and apply common policies.&lt;/p&gt;




&lt;h4&gt;
  
  
  Organizational Unit (OU)
&lt;/h4&gt;

&lt;p&gt;It is a container within AWS Organizations that groups accounts with a common purpose.&lt;/p&gt;

&lt;p&gt;For example, you can have one OU for production, another for development, and another for security, and OUs can be nested up to five levels deep.&lt;/p&gt;

&lt;p&gt;Policies applied to an OU are inherited by all the accounts contained in it, which lets you govern at scale without configuring each account individually.&lt;/p&gt;
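
&lt;p&gt;The inheritance described above can be pictured as a walk from the account up to the root. The Python sketch below is a simplified model, not the AWS API: the &lt;code&gt;Node&lt;/code&gt; class, the OU names, and the policy names are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """An OU or account in a hypothetical organization tree."""
    name: str
    parent: Optional["Node"] = None
    policies: List[str] = field(default_factory=list)


def inherited_policies(node):
    """Collect policy names from the root down to the node itself."""
    chain = []
    current = node
    while current is not None:
        chain.append(current)
        current = current.parent
    result = []
    for item in reversed(chain):
        result.extend(item.policies)
    return result


# Hypothetical tree: Root, then a Production OU, then one account.
root = Node("Root", policies=["deny-leave-organization"])
production = Node("Production", parent=root, policies=["restrict-regions"])
payments = Node("payments-prod", parent=production)
applied = inherited_policies(payments)
```

&lt;p&gt;Here the account carries no policy of its own, yet &lt;code&gt;applied&lt;/code&gt; contains both the root policy and the one attached to the Production OU.&lt;/p&gt;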




&lt;h4&gt;
  
  
  Service Control Policy (SCP)
&lt;/h4&gt;

&lt;p&gt;It is a control mechanism applied to OUs or accounts.&lt;/p&gt;

&lt;p&gt;It defines the maximum set of actions allowed within an account. Even if a user has broad permissions through IAM, if an SCP restricts an action, the restriction prevails.&lt;/p&gt;

&lt;p&gt;SCPs do not grant permissions by themselves; they only set limits.&lt;/p&gt;
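
&lt;p&gt;A simple way to reason about this is as a set intersection: an action is usable only if IAM allows it and every SCP level above the account allows it too. The Python sketch below is a deliberately simplified model with hypothetical names; it treats policies as plain allow lists and ignores explicit denies, wildcards, and conditions.&lt;/p&gt;

```python
def effective_actions(iam_allowed, scp_allowed_by_level):
    """Actions permitted by IAM and by every SCP level above the account."""
    allowed = set(iam_allowed)
    for scp_allowed in scp_allowed_by_level:
        allowed = allowed.intersection(scp_allowed)
    return allowed


# Hypothetical inputs: IAM grants two actions, the OU-level SCP allows only S3.
iam = {"s3:GetObject", "ec2:RunInstances"}
scps = [
    {"s3:GetObject", "s3:PutObject", "ec2:RunInstances"},  # root level
    {"s3:GetObject", "s3:PutObject"},                      # OU level
]
usable = effective_actions(iam, scps)
```

&lt;p&gt;In this example &lt;code&gt;usable&lt;/code&gt; contains only &lt;code&gt;s3:GetObject&lt;/code&gt;: the SCP blocks &lt;code&gt;ec2:RunInstances&lt;/code&gt; despite the IAM grant, and &lt;code&gt;s3:PutObject&lt;/code&gt; stays unavailable because the SCP alone does not grant it.&lt;/p&gt;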




&lt;h4&gt;
  
  
  AWS Account
&lt;/h4&gt;

&lt;p&gt;It is probably the most important organizational building block within the AWS model.&lt;/p&gt;

&lt;p&gt;Each account has its own resources, networks, billing, and service limits. Access between accounts does not happen automatically; it normally requires explicit configuration through IAM, networking, or shared services.&lt;/p&gt;

&lt;p&gt;It is the closest conceptual equivalent to an Azure Subscription, although with a much sharper operational separation built into the platform's design.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F048kaqtvkcyoomk34sow.jpg" alt=" " width="800" height="349"&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Conceptual equivalences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Azure level&lt;/th&gt;
&lt;th&gt;AWS conceptual equivalent&lt;/th&gt;
&lt;th&gt;Key note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tenant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Organizations / Root Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In Azure everything lives inside a tenant associated with Entra ID; in AWS the organizational context is usually built around Organizations and the root account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Management Group&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Organizational Unit (OU)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Both let you group child containers to apply policies and centralized governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subscription&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS Account&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Both act as administrative and financial containers, although the AWS account usually represents a more pronounced operational separation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Group&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No direct equivalent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS uses tags, stacks, and organizational conventions to group resources, but there is no container with the same operational weight and lifecycle as a Resource Group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Resource&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The smallest consumable unit of infrastructure or service in both clouds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;And this brings us to billing, which also says a lot about each cloud's organizational philosophy.&lt;/p&gt;

&lt;p&gt;In Azure, the subscription carries significant administrative and financial weight; many governance, limit, and cost-control strategies are built around it.&lt;/p&gt;

&lt;p&gt;In AWS, although the account is still a key financial element, fine-grained cost analysis tends to rely much more heavily on tagging strategies and on consolidation through AWS Organizations.&lt;/p&gt;

&lt;p&gt;My personal impression is that Azure encourages hierarchical segmentation through the organizational structure itself, while AWS favors account-based separation complemented by detailed tagging models for financial and operational governance.&lt;/p&gt;
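&lt;p&gt;To make the difference between account-level and tag-level cost views concrete, here is a minimal sketch. The cost records, the &lt;code&gt;project&lt;/code&gt; tag, and the helper &lt;code&gt;total_by&lt;/code&gt; are made up for illustration; in practice the data would come from a billing export.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical cost records; the accounts, tags and amounts are invented.
records = [
    {"account": "prod", "tags": {"project": "genomics"}, "cost_usd": 120.0},
    {"account": "prod", "tags": {"project": "imaging"}, "cost_usd": 80.0},
    {"account": "dev", "tags": {"project": "genomics"}, "cost_usd": 30.0},
    {"account": "dev", "tags": {}, "cost_usd": 5.0},
]

def total_by(records, key_fn):
    """Sum cost_usd grouped by whatever key_fn extracts from each record."""
    totals = defaultdict(float)
    for record in records:
        totals[key_fn(record)] += record["cost_usd"]
    return dict(totals)

# Coarse view: one number per account.
by_account = total_by(records, lambda r: r["account"])
# Finer view: the same spend sliced by a tag that cuts across accounts.
by_project = total_by(records, lambda r: r["tags"].get("project", "untagged"))

print(by_account)
print(by_project)
```

&lt;p&gt;The "untagged" bucket shows why a consistent tagging policy matters once cost analysis depends on tags.&lt;/p&gt;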

&lt;h3&gt;
  
  
  Let's look at a practical example
&lt;/h3&gt;

&lt;p&gt;Imagine a research and development organization that is starting its cloud adoption and needs to build an orderly, secure, and scalable structure in both AWS and Azure.&lt;/p&gt;

&lt;p&gt;The organization wants to clearly separate its environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Development&lt;/li&gt;
&lt;li&gt;Testing&lt;/li&gt;
&lt;li&gt;Pre-production&lt;/li&gt;
&lt;li&gt;Production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also wants to implement well-defined controls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;permissions and access&lt;/li&gt;
&lt;li&gt;billing and cost control&lt;/li&gt;
&lt;li&gt;governance&lt;/li&gt;
&lt;li&gt;compliance&lt;/li&gt;
&lt;li&gt;shared networking&lt;/li&gt;
&lt;li&gt;centralized security services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, the goal looks identical in both clouds: organize resources, separate environments, and apply policies. However, once we start designing the structure, important differences in each provider's organizational philosophy quickly appear.&lt;/p&gt;

&lt;p&gt;In AWS, the design usually leans toward separation by account, where each environment lives in an independent account managed through AWS Organizations and Organizational Units (OUs).&lt;/p&gt;

&lt;p&gt;In Azure, the approach is normally built around an organizational hierarchy based on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tenant → Management Groups → Subscriptions → Resource Groups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where governance and administration are deeply integrated into the platform's hierarchical structure.&lt;/p&gt;

&lt;p&gt;The following diagram shows how this same scenario could be modeled in both clouds and helps visualize why, even though the goals are similar, the way of thinking about and organizing infrastructure changes considerably between AWS and Azure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmy9r6yy6zpninrnctvp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmy9r6yy6zpninrnctvp.jpg" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Identity: where everything begins
&lt;/h2&gt;

&lt;p&gt;You can replicate infrastructure across clouds, but if you don't understand how identity works, you can't govern them. And this is perhaps one of the most complex aspects of moving between two worlds.&lt;/p&gt;

&lt;p&gt;Personally, I struggled a bit with this topic. Both environments solve the same need in similar ways, but (and here is the key point) similar is not the same.&lt;/p&gt;

&lt;p&gt;My biggest confusion came from this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS gives you fine-grained control from the start, while Azure offers an initial layer of abstraction and then lets you go deeper.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's look at it in more detail.&lt;/p&gt;




&lt;h3&gt;
  
  
  AWS: identity and permissions in a single system
&lt;/h3&gt;

&lt;p&gt;In AWS, identity and permissions are defined within a single system: &lt;strong&gt;AWS Identity and Access Management (IAM)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here you have granular control through policies, where you define exactly what each identity can do on each resource.&lt;/p&gt;

&lt;p&gt;This is how I see it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Users / Groups / Roles
Policies (JSON)
Permissions on services and resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assignments are highly granular.&lt;/p&gt;

&lt;p&gt;That fine-grained control lets you apply the principle of least privilege from the start, although it can be more complex and, at times, a little dry at first.&lt;/p&gt;
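&lt;p&gt;For readers who have not seen one yet, this is roughly what a least-privilege IAM policy document looks like, built here as a Python dictionary and serialized to JSON. The bucket name &lt;code&gt;example-research-data&lt;/code&gt; is a placeholder, not something from a real environment.&lt;/p&gt;

```python
import json

# Illustrative least-privilege policy: read-only access to a single bucket.
# "example-research-data" is a placeholder bucket name.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-research-data",
                "arn:aws:s3:::example-research-data/*",
            ],
        }
    ],
}

# IAM expects the policy as a JSON document.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

&lt;p&gt;Each statement names an effect, the actions, and the resources, which is where the granularity comes from.&lt;/p&gt;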

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhd0ai95jh2290gx2o5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhd0ai95jh2290gx2o5t.png" alt=" " width="452" height="421"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Azure: identity and authorization as separate layers
&lt;/h3&gt;

&lt;p&gt;In Azure, by contrast, the model is split into two well-defined layers.&lt;/p&gt;

&lt;p&gt;On one side is identity, managed in &lt;strong&gt;Microsoft Entra ID&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Users
Groups
Applications / Service Principals
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where you define who you are.&lt;/p&gt;




&lt;p&gt;On the other side is authorization, managed through &lt;strong&gt;Azure Role-Based Access Control (RBAC)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Roles: Owner, Contributor, Reader (and many more)

Assignments at the level of:
- Management Group
- Subscription
- Resource Group
- Specific resource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where you define what that identity can do.&lt;/p&gt;
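&lt;p&gt;The scope levels above are hierarchical: a role assigned at a higher scope also applies to everything beneath it. The sketch below models that inheritance with simple prefix matching on resource-ID-shaped strings. The principal, the IDs, and the helper &lt;code&gt;roles_at&lt;/code&gt; are hypothetical, and real Azure RBAC also has deny assignments and conditions that are ignored here.&lt;/p&gt;

```python
# Simplified model: a role assigned at a scope also applies to every scope below it.
# The scope strings imitate Azure resource IDs; all IDs are placeholders.
assignments = [
    ("ana", "Reader", "/subscriptions/sub-1"),
    ("ana", "Contributor", "/subscriptions/sub-1/resourceGroups/rg-dev"),
]

def roles_at(principal, scope, assignments):
    """Return the roles a principal holds at a scope, including inherited ones."""
    roles = set()
    for who, role, assigned_scope in assignments:
        inherited = scope == assigned_scope or scope.startswith(assigned_scope + "/")
        if who == principal and inherited:
            roles.add(role)
    return roles

storage = ("/subscriptions/sub-1/resourceGroups/rg-dev"
           "/providers/Microsoft.Storage/storageAccounts/st1")

print(roles_at("ana", storage, assignments))  # both roles apply inside rg-dev
print(roles_at("ana", "/subscriptions/sub-1/resourceGroups/rg-prod", assignments))  # only Reader
print(roles_at("ana", "/subscriptions/sub-2", assignments))  # nothing in another subscription
```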

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbrninn1wzkpbqr1zjm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbrninn1wzkpbqr1zjm3.png" alt=" " width="456" height="469"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The important difference
&lt;/h3&gt;

&lt;p&gt;This separation is key to understanding Azure.&lt;/p&gt;

&lt;p&gt;While in AWS everything lives in a single system, in Azure you have to think in two dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identity&lt;/li&gt;
&lt;li&gt;permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And although both models end up solving the same problem, the way you get there changes quite a bit between platforms.&lt;/p&gt;




&lt;h2&gt;
  
  
  How resources communicate - Networking
&lt;/h2&gt;

&lt;p&gt;And this is where the strong philosophical differences between the two clouds really begin. To be very honest, networking is not my strong suit. AWS and Azure look quite alike on the surface, but I think the mental model shifts a little, so I'll share my “Rosetta Stone” to make adapting to another cloud easier, along with some reflections on the most notable networking elements.&lt;/p&gt;




&lt;h3&gt;
  
  
  VPC vs VNet
&lt;/h3&gt;

&lt;p&gt;Conceptually, both services serve the same purpose: creating logical private networks in the cloud to isolate and connect resources securely.&lt;/p&gt;

&lt;p&gt;Both AWS and Azure let you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;define CIDR ranges,&lt;/li&gt;
&lt;li&gt;segment with subnets,&lt;/li&gt;
&lt;li&gt;control traffic,&lt;/li&gt;
&lt;li&gt;connect on-premises environments,&lt;/li&gt;
&lt;li&gt;and even other clouds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far it seems we are talking about exactly the same thing. But once again, the model can look similar while the design philosophy behind it changes quite a bit.&lt;/p&gt;

&lt;p&gt;In AWS, the VPC feels very explicit about isolation. The architect very deliberately defines how the network is segmented, how traffic is routed, and which components allow traffic out to or in from the Internet. I come from software, so this has always been hard for me.&lt;/p&gt;

&lt;p&gt;Many elements must be declared explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internet Gateways&lt;/li&gt;
&lt;li&gt;Route Tables&lt;/li&gt;
&lt;li&gt;NAT Gateways&lt;/li&gt;
&lt;li&gt;subnet associations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the start there is a lot of control and awareness of what is and isn't allowed, and of course plenty of headaches when you can't reach a resource.&lt;/p&gt;

&lt;p&gt;In Azure, the VNet feels more integrated into the overall ecosystem of the subscription and the region. The model tends to feel more abstracted and tied to Azure's operational design.&lt;/p&gt;

&lt;p&gt;Although route tables, gateways, and segmentation also exist, several behaviors come built into the platform model.&lt;/p&gt;

&lt;p&gt;One of the most important details is the relationship between subnets and availability zones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In AWS, a subnet belongs to a specific Availability Zone.&lt;/li&gt;
&lt;li&gt;In Azure, subnets live at the regional level, and it is the resources that are then distributed across zones when the service supports it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a small detail that significantly changes how you think about resilience and network design.&lt;/p&gt;
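&lt;p&gt;A small sketch of what that means when planning address space, using only Python's &lt;code&gt;ipaddress&lt;/code&gt; module. The CIDR range, the zone names, and the two plan dictionaries are illustrative assumptions, not output from either cloud.&lt;/p&gt;

```python
import ipaddress

vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # example AZ names

# AWS-style plan: carve one /24 per Availability Zone, because each
# subnet is pinned to exactly one zone.
subnet_iter = vpc_cidr.subnets(new_prefix=24)
aws_plan = {zone: str(next(subnet_iter)) for zone in zones}

# Azure-style plan: a single regional subnet; the zone is chosen per
# resource (for example per VM), not per subnet.
azure_plan = {"subnet": "10.0.0.0/24", "vm_zones": ["1", "2", "3"]}

print(aws_plan)
print(azure_plan)
```

&lt;p&gt;In the AWS-style plan, spreading a workload across zones means spreading it across subnets; in the Azure-style plan the subnet stays the same and only the resource placement changes.&lt;/p&gt;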

&lt;blockquote&gt;
&lt;p&gt;At the time of writing this article, one region had only one AZ.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  NSG vs Security Groups: how similar are they?
&lt;/h3&gt;

&lt;p&gt;At first, Azure Network Security Groups (NSGs) and AWS Security Groups look practically the same, but don't be fooled. At the beginning it's just that false feeling of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I know this”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both let you control inbound and outbound traffic to resources within the network. However, as you dig deeper, important differences in philosophy and behavior appear.&lt;/p&gt;

&lt;p&gt;In AWS, Security Groups are &lt;strong&gt;stateful&lt;/strong&gt; and focus mainly on protecting specific workloads or network interfaces such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2&lt;/li&gt;
&lt;li&gt;RDS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They work exclusively through &lt;strong&gt;ALLOW&lt;/strong&gt; rules; if traffic is not explicitly allowed, it is implicitly denied.&lt;/p&gt;

&lt;p&gt;There are no &lt;strong&gt;DENY&lt;/strong&gt; rules.&lt;/p&gt;

&lt;p&gt;AWS also has a separate component called the &lt;strong&gt;Network ACL (NACL)&lt;/strong&gt;, which works at the subnet level.&lt;/p&gt;

&lt;p&gt;NACLs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;are stateless,&lt;/li&gt;
&lt;li&gt;support ALLOW rules,&lt;/li&gt;
&lt;li&gt;support DENY rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a fairly clear separation between subnet-level controls and workload-level controls.&lt;/p&gt;

&lt;p&gt;In Azure, NSGs consolidate part of both concepts.&lt;/p&gt;

&lt;p&gt;They are also &lt;strong&gt;stateful&lt;/strong&gt;, but they can be applied both to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;subnets,&lt;/li&gt;
&lt;li&gt;and directly to NICs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike AWS Security Groups, NSGs do support explicit &lt;strong&gt;DENY&lt;/strong&gt; rules.&lt;/p&gt;

&lt;p&gt;That small detail changes the mental approach quite a bit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS separates the network security layers more explicitly.&lt;/li&gt;
&lt;li&gt;Azure tends to integrate more functionality into a single component.&lt;/li&gt;
&lt;/ul&gt;
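&lt;p&gt;The two evaluation styles can be sketched in a few lines. This is a toy model that only looks at destination ports: the function names and rule tuples are hypothetical, statefulness is not modeled, and unmatched NSG traffic is simply treated as denied, whereas real NSGs ship with default rules.&lt;/p&gt;

```python
def security_group_allows(port, allow_rules):
    """AWS Security Group style: allow rules only; anything unmatched is denied."""
    return any(port in range(low, high + 1) for low, high in allow_rules)

def nsg_allows(port, rules):
    """Azure NSG style: rules are checked by ascending priority; first match wins."""
    for priority, action, low, high in sorted(rules):
        if port in range(low, high + 1):
            return action == "Allow"
    return False  # simplification: no matching rule means denied

# Security Group: you can only list what is allowed.
sg_rules = [(443, 443), (8000, 8100)]

# NSG: an explicit Deny with a lower priority number overrides a broad Allow.
nsg_rules = [
    (100, "Deny", 22, 22),
    (200, "Allow", 0, 65535),
]

print(security_group_allows(443, sg_rules))  # True
print(security_group_allows(22, sg_rules))   # False: implicitly denied
print(nsg_allows(22, nsg_rules))             # False: explicit Deny wins
print(nsg_allows(443, nsg_rules))            # True
```

&lt;p&gt;With Security Groups, blocking port 22 means never allowing it; with an NSG you can allow broadly and carve out exceptions with Deny rules.&lt;/p&gt;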




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
While AWS works with separate layers of control (NACLs for subnets and Security Groups at the service level), Azure consolidates the model in NSGs.&lt;br&gt;&lt;br&gt;
This hints at the philosophical difference: AWS tends to separate components, while Azure consolidates functionality.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  As promised: my “Rosetta Stone”
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Azure VNet&lt;/th&gt;
&lt;th&gt;AWS VPC&lt;/th&gt;
&lt;th&gt;Key differences&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Regional private virtual network&lt;/td&gt;
&lt;td&gt;Regional private virtual network&lt;/td&gt;
&lt;td&gt;Azure integrates the VNet more visibly into the subscription and Resource Group model, while AWS treats the VPC as a more explicit, decoupled isolation boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regional subnets&lt;/td&gt;
&lt;td&gt;Subnets tied to a specific AZ&lt;/td&gt;
&lt;td&gt;In Azure, subnets belong to the regional VNet; in AWS, each subnet lives within a specific Availability Zone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NSG applicable to a subnet or NIC&lt;/td&gt;
&lt;td&gt;Security Groups applied to interfaces/instances&lt;/td&gt;
&lt;td&gt;Azure lets you apply controls at both the subnet and NIC level and supports Allow and Deny rules; in AWS, Security Groups focus mainly on interfaces and workloads and support only Allow rules, and the NACL concept does not exist as a separate component in Azure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User Defined Routes (UDR)&lt;/td&gt;
&lt;td&gt;Route Tables&lt;/td&gt;
&lt;td&gt;Azure handles routing in a more integrated way within the platform; in AWS, the associations between subnets and Route Tables tend to be more explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPN Gateway&lt;/td&gt;
&lt;td&gt;Site-to-Site VPN&lt;/td&gt;
&lt;td&gt;Both services connect on-premises networks to the cloud through IPsec tunnels, supporting hybrid scenarios and dynamic routing with BGP. However, Azure exposes traditional networking concepts more explicitly, such as VPN types (route-based and policy-based), SKUs, active-active configurations, and advanced options, right from the initial deployment process. In AWS, although these capabilities also exist, the managed service abstracts away more of the operational complexity and the flow tends to feel more guided during implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ExpressRoute&lt;/td&gt;
&lt;td&gt;Direct Connect&lt;/td&gt;
&lt;td&gt;Both Azure ExpressRoute and AWS Direct Connect usually require carriers or specialized partners to establish the physical connectivity. Both services aim to reduce dependence on the public Internet and offer more stable, predictable connections. Historically, however, ExpressRoute has been more tightly oriented toward the Microsoft ecosystem, with different peering models that enable private connectivity not only to VNets but also to Microsoft services and associated SaaS platforms. Direct Connect, for its part, tends to be perceived as more focused on dedicated connectivity to specific VPCs, networks, and workloads within AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service Endpoints / Private Endpoints&lt;/td&gt;
&lt;td&gt;VPC Endpoints&lt;/td&gt;
&lt;td&gt;Azure distinguishes two explicit approaches: Service Endpoints restrict access to the service to authorized VNets without creating additional network interfaces, while Private Endpoints assign a private IP inside the VNet and allow resolution through private DNS, which also makes it possible to optionally disable public access to the service. AWS groups these patterns under the concept of VPC Endpoints, distinguishing internally between Gateway Endpoints (integrated through route tables and limited mainly to S3 and DynamoDB) and Interface Endpoints, which create an ENI with a private IP and allow private connectivity to a wide range of AWS services and PrivateLink-compatible services, including in hybrid scenarios over VPN or Direct Connect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;Both clouds use a NAT Gateway so that resources in private subnets can reach the internet without exposing their IP directly. In Azure, it is enough to associate it with the subnet, without touching route tables. In AWS the process is more explicit: it requires an Internet Gateway, a public subnet where the NAT Gateway resides, and a manual entry in the route table of each private subnet, which gives more control but also more room for error, especially in multi-zone architectures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public IP&lt;/td&gt;
&lt;td&gt;Elastic IP&lt;/td&gt;
&lt;td&gt;Azure treats the public IP as an independent resource that can be associated with components such as NICs, Load Balancers, or NAT Gateways. Although the IP exists as a separate resource, operationally it is usually created and managed together with the associated service. To keep it, you only need to use static allocation and dissociate it without deleting the resource, so it can be reused later. In AWS the mental model is somewhat different: Elastic IPs are the main mechanism for persistent public addresses. They are explicitly reserved within the account and can be associated with or moved between instances and services independently. Both clouds charge for unassociated static public IPs; the difference is that AWS makes explicit reassignment a natural part of the operating model, while Azure tends to tie IP management more closely to the lifecycle of the resource that consumes it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Interacting with the cloud
&lt;/h2&gt;

&lt;p&gt;I couldn't close this first part without talking about something else that changes a great deal between providers: the way we interact with the cloud day to day.&lt;/p&gt;

&lt;p&gt;Both platforms offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a web console,&lt;/li&gt;
&lt;li&gt;APIs,&lt;/li&gt;
&lt;li&gt;SDKs,&lt;/li&gt;
&lt;li&gt;Infrastructure as Code,&lt;/li&gt;
&lt;li&gt;and a CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once again, though, the philosophy behind the design feels quite different.&lt;/p&gt;

&lt;p&gt;At the console level, Azure Resource Manager (ARM) in Azure acts as a unified management layer for deployments, permissions, policies, and resource organization. That integration makes many operations feel more centralized and consistent with the hierarchical structure highlighted earlier.&lt;/p&gt;

&lt;p&gt;In AWS, the experience tends to feel more oriented toward individual services.&lt;/p&gt;

&lt;p&gt;Although there are unifying mechanisms such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organizations,&lt;/li&gt;
&lt;li&gt;CloudFormation,&lt;/li&gt;
&lt;li&gt;or Control Tower,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;day-to-day interaction often means navigating between services that are relatively decoupled from one another.&lt;/p&gt;

&lt;p&gt;That offers a great deal of control and flexibility, but it can also require a better understanding of how multiple components interact in order to operate fluently.&lt;/p&gt;

&lt;p&gt;I don't consider one approach “better” than the other; rather, they highlight the difference in philosophy between the two clouds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final reflections
&lt;/h2&gt;

&lt;p&gt;This is just a first look at the challenge of becoming a multi-cloud architect.&lt;/p&gt;

&lt;p&gt;At a time when more and more organizations are leaving behind the idea of depending on a single provider, we need to develop the ability to understand the strengths, limitations, and operating philosophy of each platform.&lt;/p&gt;

&lt;p&gt;Being multi-cloud doesn't just mean learning equivalent services between AWS and Azure. It also means understanding how each ecosystem thinks, how it organizes its resources, how it governs its infrastructure, and how it makes operational decisions.&lt;/p&gt;

&lt;p&gt;In the end, the real challenge is knowing which piece to adjust in each environment to build solutions that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sustainable,&lt;/li&gt;
&lt;li&gt;efficient,&lt;/li&gt;
&lt;li&gt;and financially responsible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm still learning along the way, and later on I also want to share my experiences and strategies around AI in both cloud worlds.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>azure</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Operational Hardening — Guardrails, Secrets Rotation &amp; SLO — FSx ONTAP S3AP Phase 12</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 18:21:39 +0000</pubDate>
      <link>https://dev.to/aws-builders/operational-hardening-guardrails-secrets-rotation-slo-fsx-ontap-s3ap-phase-12-1k4o</link>
      <guid>https://dev.to/aws-builders/operational-hardening-guardrails-secrets-rotation-slo-fsx-ontap-s3ap-phase-12-1k4o</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.&lt;/p&gt;

&lt;p&gt;Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Phase 12&lt;/strong&gt; of the FSx for ONTAP S3AP serverless pattern library. Building on &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt; and &lt;a href="https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8"&gt;Phase 11&lt;/a&gt;, Phase 12 delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Guardrails&lt;/strong&gt;: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Rotation&lt;/strong&gt;: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on a 90-day interval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic Monitoring&lt;/strong&gt;: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Forecasting&lt;/strong&gt;: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Lineage Tracking&lt;/strong&gt;: DynamoDB table with GSI for processing history and opt-in integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protobuf TCP Framing&lt;/strong&gt;: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO Definition&lt;/strong&gt;: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy Pipeline E2E&lt;/strong&gt;: NFS file creation → FPolicy → SQS delivery confirmed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Store Replay&lt;/strong&gt;: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Property-Based Testing&lt;/strong&gt;: 16 Hypothesis properties, 53 tests, 3 bugs discovered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Point Deep Dive&lt;/strong&gt;: Multi-layer authorization, IAM ARN format, VPC network constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key metrics&lt;/strong&gt;: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph LR
    A[Auto-Expand Request] --&amp;gt; B{GuardrailMode?}
    B --&amp;gt;|DRY_RUN| C[Log + Allow&amp;lt;br/&amp;gt;fail-open on DDB error]
    B --&amp;gt;|ENFORCE| D[Check + Block&amp;lt;br/&amp;gt;fail-closed on DDB error]
    B --&amp;gt;|BREAK_GLASS| E[Bypass All Checks&amp;lt;br/&amp;gt;SNS Alert + Audit Log]
    C --&amp;gt; F[DynamoDB Tracking]
    D --&amp;gt; F
    E --&amp;gt; F
    F --&amp;gt; G[CloudWatch EMF Metrics]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior on Check Failure&lt;/th&gt;
&lt;th&gt;Behavior on DynamoDB Error&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DRY_RUN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Log warning, allow action&lt;/td&gt;
&lt;td&gt;Fail-open (allow)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ENFORCE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Block action, emit metric&lt;/td&gt;
&lt;td&gt;Fail-closed (deny)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BREAK_GLASS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip all checks&lt;/td&gt;
&lt;td&gt;SNS alert + audit log&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
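&lt;p&gt;One way to read the table is as a small decision function. The sketch below is not the repository's code: the &lt;code&gt;Mode&lt;/code&gt; enum and &lt;code&gt;decide&lt;/code&gt; are hypothetical names, and the SNS alert and audit logging for &lt;code&gt;BREAK_GLASS&lt;/code&gt; are left out.&lt;/p&gt;

```python
from enum import Enum

class Mode(Enum):
    """Mirrors the three modes in the table; not the repository's own class."""
    DRY_RUN = "DRY_RUN"
    ENFORCE = "ENFORCE"
    BREAK_GLASS = "BREAK_GLASS"

def decide(mode, check_passed, tracking_ok=True):
    """Return True if the action may proceed under the given mode."""
    if mode is Mode.BREAK_GLASS:
        return True  # all checks bypassed; alerting and audit logging happen elsewhere
    if not tracking_ok:
        # The tracking store is unavailable: DRY_RUN fails open, ENFORCE fails closed.
        return mode is Mode.DRY_RUN
    if mode is Mode.DRY_RUN:
        return True  # a failed check is only logged
    return check_passed  # ENFORCE blocks when a check fails

print(decide(Mode.DRY_RUN, check_passed=False))                    # True
print(decide(Mode.ENFORCE, check_passed=False))                    # False
print(decide(Mode.ENFORCE, check_passed=True, tracking_ok=False))  # False
print(decide(Mode.BREAK_GLASS, check_passed=False))                # True
```

&lt;p&gt;Keeping the tracking-error branch ahead of the check result makes the fail-open versus fail-closed difference explicit.&lt;/p&gt;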

&lt;h3&gt;
  
  
  Core implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.guardrails&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CapacityGuardrail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GuardrailMode&lt;/span&gt;

&lt;span class="n"&gt;guardrail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CapacityGuardrail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Mode from GUARDRAIL_MODE env var
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guardrail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_and_execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;action_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume_grow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requested_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;execute_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_grow_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;volume_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vol-abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action executed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action denied: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Three safety checks (ENFORCE mode)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit&lt;/strong&gt;: Max 10 actions per day per action type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily cap&lt;/strong&gt;: Max 500 GB cumulative expansion per day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooldown&lt;/strong&gt;: 300-second minimum interval between actions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All thresholds are configurable via environment variables (&lt;code&gt;GUARDRAIL_RATE_LIMIT&lt;/code&gt;, &lt;code&gt;GUARDRAIL_DAILY_CAP_GB&lt;/code&gt;, &lt;code&gt;GUARDRAIL_COOLDOWN_SECONDS&lt;/code&gt;).&lt;/p&gt;
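
&lt;p&gt;The three checks above can be sketched as a pure function over the day's tracking state. This is a minimal illustration, not the project's &lt;code&gt;CapacityGuardrail&lt;/code&gt; implementation: &lt;code&gt;Limits&lt;/code&gt;, &lt;code&gt;DailyState&lt;/code&gt;, &lt;code&gt;limits_from_env&lt;/code&gt;, and &lt;code&gt;evaluate&lt;/code&gt; are hypothetical names, and the order in which the checks run is an assumption. The defaults, environment variable names, and denial reasons are the ones listed in this article.&lt;/p&gt;

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Thresholds for ENFORCE mode (defaults taken from the article)."""
    rate_limit: int = 10
    daily_cap_gb: float = 500.0
    cooldown_seconds: int = 300


@dataclass
class DailyState:
    """Hypothetical view of one tracking item (one action type, one day)."""
    action_count: int = 0
    daily_total_gb: float = 0.0
    last_action_ts: Optional[datetime] = None


def limits_from_env(env=os.environ) -> Limits:
    """Read the thresholds from the documented environment variables."""
    return Limits(
        rate_limit=int(env.get("GUARDRAIL_RATE_LIMIT", 10)),
        daily_cap_gb=float(env.get("GUARDRAIL_DAILY_CAP_GB", 500.0)),
        cooldown_seconds=int(env.get("GUARDRAIL_COOLDOWN_SECONDS", 300)),
    )


def evaluate(state: DailyState, requested_gb: float, now: datetime,
             limits: Limits) -> Optional[str]:
    """Return None when the action is allowed, else the denial reason."""
    if state.action_count >= limits.rate_limit:
        return "rate_limit_exceeded"
    if state.daily_total_gb + requested_gb > limits.daily_cap_gb:
        return "daily_cap_exceeded"
    if state.last_action_ts is not None:
        elapsed = (now - state.last_action_ts).total_seconds()
        remaining = limits.cooldown_seconds - elapsed
        if remaining > 0:
            return "cooldown_active"
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    recent = DailyState(action_count=1, daily_total_gb=50.0,
                        last_action_ts=now - timedelta(seconds=60))
    print(evaluate(recent, 50.0, now, limits_from_env()))
```

&lt;p&gt;Keeping the decision separate from the DynamoDB read and write makes the thresholds easy to unit test without touching AWS.&lt;/p&gt;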

&lt;h3&gt;
  
  
  DynamoDB tracking schema
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Action type (e.g., &lt;code&gt;volume_grow&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Date (&lt;code&gt;YYYY-MM-DD&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;daily_total_gb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Cumulative GB expanded today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;action_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Number of actions today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;last_action_ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;ISO timestamp of last action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;actions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List&lt;/td&gt;
&lt;td&gt;Audit trail of all actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ttl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;30-day auto-expiry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-dynamodb-guardrails-table.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-dynamodb-guardrails-table.png" alt="DynamoDB Guardrails Table" width="800" height="1272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BREAK_GLASS production considerations
&lt;/h3&gt;

&lt;p&gt;In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant SM as Secrets Manager
    participant Lambda as Rotation Lambda (VPC)
    participant ONTAP as FSx ONTAP REST API

    SM-&amp;gt;&amp;gt;Lambda: Step 1: createSecret
    Lambda-&amp;gt;&amp;gt;SM: Generate new password, store as AWSPENDING

    SM-&amp;gt;&amp;gt;Lambda: Step 2: setSecret
    Lambda-&amp;gt;&amp;gt;ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
    ONTAP--&amp;gt;&amp;gt;Lambda: 200 OK

    SM-&amp;gt;&amp;gt;Lambda: Step 3: testSecret
    Lambda-&amp;gt;&amp;gt;ONTAP: GET /api/cluster (using new password)
    ONTAP--&amp;gt;&amp;gt;Lambda: 200 OK (cluster UUID returned)

    SM-&amp;gt;&amp;gt;Lambda: Step 4: finishSecret
    Lambda-&amp;gt;&amp;gt;SM: Promote AWSPENDING → AWSCURRENT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
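
&lt;p&gt;Secrets Manager invokes the rotation Lambda once per step and passes the step name in the event's &lt;code&gt;Step&lt;/code&gt; field, alongside &lt;code&gt;SecretId&lt;/code&gt; and &lt;code&gt;ClientRequestToken&lt;/code&gt;. A dispatcher for the four steps in the diagram might look like the sketch below. &lt;code&gt;make_rotation_handler&lt;/code&gt; and the per-step callables are hypothetical names used for illustration; the project's actual handler may be structured differently.&lt;/p&gt;

```python
ROTATION_STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


def make_rotation_handler(step_functions):
    """Build a Lambda-style handler from a mapping of step name to callable.

    Each callable receives (secret_id, client_request_token).
    """
    missing = [name for name in ROTATION_STEPS if name not in step_functions]
    if missing:
        raise ValueError(f"Missing step implementations: {missing}")

    def handler(event, context=None):
        step = event["Step"]
        if step not in step_functions:
            raise ValueError(f"Unknown rotation step: {step}")
        return step_functions[step](event["SecretId"], event["ClientRequestToken"])

    return handler


if __name__ == "__main__":
    # Stub implementations that only report which step ran.
    stubs = {name: (lambda sid, tok, name=name: print(name, sid))
             for name in ROTATION_STEPS}
    handler = make_rotation_handler(stubs)
    for name in ROTATION_STEPS:
        handler({"Step": name, "SecretId": "demo-secret",
                 "ClientRequestToken": "demo-token"})
```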



&lt;h3&gt;
  
  
  Key design decisions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VPC deployment&lt;/strong&gt;: Lambda must be in the same VPC as the ONTAP management LIF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90-day interval&lt;/strong&gt;: Configurable via CloudFormation parameter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Step 3 (&lt;code&gt;testSecret&lt;/code&gt;) verifies the new password works by calling the ONTAP cluster API&lt;/li&gt;
&lt;li&gt;
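&lt;strong&gt;Rollback safety&lt;/strong&gt;: If &lt;code&gt;testSecret&lt;/code&gt; fails, the old password remains as AWSCURRENT in Secrets Manager. Because &lt;code&gt;setSecret&lt;/code&gt; runs first, ONTAP may already hold the new password at that point, so a failed rotation should be retried or investigated promptly&lt;/li&gt;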
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bugs discovered during live testing
&lt;/h3&gt;

&lt;p&gt;Three bugs were found and fixed during the actual rotation execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWSPENDING empty check&lt;/strong&gt;: &lt;code&gt;createSecret&lt;/code&gt; must handle the case where &lt;code&gt;get_secret_value(VersionStage='AWSPENDING')&lt;/code&gt; raises &lt;code&gt;ResourceNotFoundException&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;management_ip fallback&lt;/strong&gt;: The Lambda must support both &lt;code&gt;management_ip&lt;/code&gt; (new) and &lt;code&gt;ontap_mgmt_ip&lt;/code&gt; (legacy) keys in the secret JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster UUID validation&lt;/strong&gt;: &lt;code&gt;testSecret&lt;/code&gt; now validates the response contains a valid &lt;code&gt;uuid&lt;/code&gt; field, not just HTTP 200&lt;/li&gt;
&lt;/ol&gt;
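
&lt;p&gt;The three fixes can be illustrated with small helpers. This is a sketch under stated assumptions rather than the project's code: the function names are hypothetical, &lt;code&gt;client&lt;/code&gt; is expected to behave like a boto3 Secrets Manager client, and "valid &lt;code&gt;uuid&lt;/code&gt;" is interpreted here as "parses with Python's &lt;code&gt;uuid.UUID&lt;/code&gt;".&lt;/p&gt;

```python
import json
import uuid


def pending_version_exists(client, secret_id, token):
    """Bug 1: treat a missing AWSPENDING version as 'not created yet'."""
    try:
        client.get_secret_value(
            SecretId=secret_id, VersionId=token, VersionStage="AWSPENDING"
        )
        return True
    except client.exceptions.ResourceNotFoundException:
        return False


def resolve_management_ip(secret_string):
    """Bug 2: accept the new key first, then fall back to the legacy key."""
    secret = json.loads(secret_string)
    ip = secret.get("management_ip") or secret.get("ontap_mgmt_ip")
    if not ip:
        raise KeyError("secret has neither management_ip nor ontap_mgmt_ip")
    return ip


def has_cluster_uuid(status_code, body):
    """Bug 3: require a parseable uuid field, not just HTTP 200."""
    if status_code != 200:
        return False
    try:
        uuid.UUID(str(body.get("uuid")))
    except ValueError:
        return False
    return True
```

&lt;p&gt;In &lt;code&gt;createSecret&lt;/code&gt;, a &lt;code&gt;False&lt;/code&gt; result from the first helper is the signal to generate a password and store it as AWSPENDING; a &lt;code&gt;True&lt;/code&gt; result means a previous attempt already did so and the step can return without changes.&lt;/p&gt;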

&lt;h3&gt;
  
  
  Verification result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret):    ✅ ONTAP password changed via REST API
Step 3 (testSecret):   ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Operational note
&lt;/h3&gt;

&lt;p&gt;Rotating &lt;code&gt;fsxadmin&lt;/code&gt; affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default, so ensure that the rotation Lambda's &lt;code&gt;urllib3&lt;/code&gt; or &lt;code&gt;requests&lt;/code&gt; configuration handles certificate verification appropriately (see &lt;code&gt;shared/ontap_client.py&lt;/code&gt; for the pattern used in this project).&lt;/p&gt;

&lt;p&gt;For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing &lt;code&gt;fsxadmin&lt;/code&gt; across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Synthetic Monitoring — CloudWatch Synthetics Canary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP Health Check&lt;/strong&gt;: REST API call to the management endpoint (VPC-internal)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Point Check&lt;/strong&gt;: ListObjectsV2 against the S3AP alias&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Critical finding: network-origin and endpoint configuration matter
&lt;/h3&gt;

&lt;p&gt;During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.&lt;/p&gt;

&lt;p&gt;This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.&lt;/p&gt;

&lt;p&gt;In this Phase 12 environment (Internet-origin S3 AP), the two checks behaved as follows from the initial VPC Canary configuration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Observed requirement in this environment&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP REST API&lt;/td&gt;
&lt;td&gt;VPC-internal access to management LIF&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3AP health check&lt;/td&gt;
&lt;td&gt;Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy&lt;/td&gt;
&lt;td&gt;⚠️ Timed out from the initial VPC Canary configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Split into two monitoring paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONTAP health: VPC-internal Canary (confirmed working, 88 ms response)&lt;/li&gt;
&lt;li&gt;S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is documented as a critical constraint in &lt;code&gt;docs/guides/s3ap-fsxn-specification.md&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canary runtime version lesson
&lt;/h3&gt;

&lt;p&gt;The template initially specified &lt;code&gt;syn-python-selenium-3.0&lt;/code&gt;, which was deprecated on 2026-02-03; it now uses &lt;code&gt;syn-python-selenium-11.0&lt;/code&gt;. CloudWatch Synthetics runtimes are deprecated frequently, so parameterize the runtime version or keep the default current.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS builder lesson: VPC placement is a design choice
&lt;/h3&gt;

&lt;p&gt;A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc-internet.html" rel="noopener noreferrer"&gt;connected to a VPC&lt;/a&gt;, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ztvwmwi58ki19r00lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ztvwmwi58ki19r00lr.png" alt="Synthetics Canary" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Capacity Forecasting — Linear Regression with stdlib Only
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A Lambda function running on a daily EventBridge schedule:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetches 30 days of FSx &lt;code&gt;StorageUsed&lt;/code&gt; metrics from CloudWatch&lt;/li&gt;
&lt;li&gt;Performs linear regression using only Python's &lt;code&gt;math&lt;/code&gt; module (zero external dependencies)&lt;/li&gt;
&lt;li&gt;Publishes &lt;code&gt;DaysUntilFull&lt;/code&gt; as a CloudWatch custom metric&lt;/li&gt;
&lt;li&gt;Sends SNS alert when forecast drops below threshold (default: 30 days)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Linear regression implementation (stdlib only)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;linear_regression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Least-squares linear regression using only math module.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Need at least 2 data points for regression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
        &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

    &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;denominator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1e-10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
    &lt;span class="n"&gt;intercept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intercept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Edge cases handled
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;DaysUntilFull&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 2 data points&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;Insufficient data, no prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;slope ≤ 0 (shrinking/flat)&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;Never fills up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already over capacity&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Immediate alert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very low usage (0.03%)&lt;/td&gt;
&lt;td&gt;169,374&lt;/td&gt;
&lt;td&gt;Normal — far future prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
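
&lt;p&gt;The first three rows of the table can be sketched as a small function that turns the fitted slope into a &lt;code&gt;DaysUntilFull&lt;/code&gt; value. This is a minimal illustration under assumptions: &lt;code&gt;fit_line&lt;/code&gt; is a compact restatement of the least-squares fit so the block runs on its own, &lt;code&gt;days_until_full&lt;/code&gt; is a hypothetical name, the latest observed usage is used as the starting point, and the over-capacity check is assumed to take precedence over the slope check.&lt;/p&gt;

```python
import math


def fit_line(points):
    """Ordinary least squares over (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    denominator = n * sum_x2 - sum_x * sum_x
    if math.isclose(denominator, 0.0, abs_tol=1e-10):
        # All x values coincide: report a flat line through the mean.
        return 0.0, sum_y / n
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    return slope, (sum_y - slope * sum_x) / n


def days_until_full(points, capacity_gb):
    """points: (day_index, used_gb) pairs, oldest first.

    Returns -1 when no forecast is possible or usage is not growing,
    0 when usage already meets or exceeds capacity.
    """
    if len(points) in (0, 1):
        return -1  # insufficient data
    latest_used_gb = points[-1][1]
    if latest_used_gb >= capacity_gb:
        return 0  # already full: alert immediately
    slope, _ = fit_line(points)
    if slope > 0:
        return math.ceil((capacity_gb - latest_used_gb) / slope)
    return -1  # flat or shrinking usage never fills the volume


if __name__ == "__main__":
    history = [(0, 100.0), (1, 110.0), (2, 120.0)]
    print(days_until_full(history, capacity_gb=200.0))
```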

&lt;h3&gt;
  
  
  Live verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"days_until_full"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;169374&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_usage_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_capacity_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1024.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"growth_rate_gb_per_day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.006&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"forecast_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2490-02-06T06:26:42Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test environment has 0.03% usage, so a forecast of 169,374 days is the expected result. The alert threshold (30 days) ensures notifications only fire when action is genuinely needed.&lt;/p&gt;

&lt;p&gt;This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat &lt;code&gt;DaysUntilFull&lt;/code&gt; as an early-warning signal, not an exact prediction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-lambda-capacity-forecast.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-lambda-capacity-forecast.png" alt="Capacity Forecast Lambda" width="800" height="1105"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Data Lineage Tracking — DynamoDB with GSI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;When a file is processed through the pipeline, operators need to trace which UC processed it, when, what outputs were generated, and whether it succeeded or failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A DynamoDB table with a Global Secondary Index (GSI) provides three query patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    subgraph "DynamoDB: fsxn-s3ap-data-lineage"
        PK[PK: source_file_key&amp;lt;br/&amp;gt;SK: processing_timestamp]
        GSI[GSI: uc_id-timestamp-index&amp;lt;br/&amp;gt;PK: uc_id, SK: processing_timestamp]
    end

    Q1[Query by file] --&amp;gt;|PK lookup| PK
    Q2[Query by UC + time range] --&amp;gt;|GSI query| GSI
    Q3[Query by execution ARN] --&amp;gt;|Scan + filter| PK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For high-volume environments, consider adding a dedicated GSI on &lt;code&gt;step_functions_execution_arn&lt;/code&gt;. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.&lt;/p&gt;
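
&lt;p&gt;The first two query patterns map onto DynamoDB &lt;code&gt;Query&lt;/code&gt; requests. The sketch below builds the request parameters for the low-level client. The table and index names come from the diagram; treating &lt;code&gt;source_file_key&lt;/code&gt;, &lt;code&gt;uc_id&lt;/code&gt;, and &lt;code&gt;processing_timestamp&lt;/code&gt; as the literal attribute names is an assumption, and the helper names are hypothetical.&lt;/p&gt;

```python
TABLE_NAME = "fsxn-s3ap-data-lineage"
GSI_NAME = "uc_id-timestamp-index"


def query_by_file(source_file_key):
    """All lineage records for one source file (base-table partition key)."""
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": "source_file_key = :key",
        "ExpressionAttributeValues": {":key": {"S": source_file_key}},
    }


def query_by_uc(uc_id, start_ts, end_ts):
    """Records for one UC within an inclusive ISO-8601 time range (GSI)."""
    return {
        "TableName": TABLE_NAME,
        "IndexName": GSI_NAME,
        "KeyConditionExpression": (
            "uc_id = :uc AND processing_timestamp BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":uc": {"S": uc_id},
            ":start": {"S": start_ts},
            ":end": {"S": end_ts},
        },
    }


if __name__ == "__main__":
    params = query_by_uc("legal-compliance",
                         "2026-05-16T00:00:00Z", "2026-05-16T23:59:59Z")
    print(params["KeyConditionExpression"])
```

&lt;p&gt;With boto3 installed and credentials configured, the parameters can be passed as &lt;code&gt;boto3.client("dynamodb").query(**params)&lt;/code&gt;. Range queries on &lt;code&gt;processing_timestamp&lt;/code&gt; rely on the timestamps being stored in a consistent, lexicographically sortable format.&lt;/p&gt;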

&lt;h3&gt;
  
  
  Integration helper (opt-in)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.lineage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LineageTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LineageRecord&lt;/span&gt;

&lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LineageTracker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LineageRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source_file_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/vol1/legal/contracts/deal-001.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;processing_timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-16T14:30:45.123Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_functions_execution_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:states:...:execution:...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;uc_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;legal-compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://output-bucket/legal/reports/deal-001-analysis.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4523&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lineage_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Design principles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-blocking&lt;/strong&gt;: Write failures emit a warning log but never interrupt the main processing pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL&lt;/strong&gt;: 365-day auto-expiry via DynamoDB TTL (configurable via &lt;code&gt;LINEAGE_TTL_DAYS&lt;/code&gt; environment variable; regulated environments may require 7+ years — disable TTL and use S3 export for long-term retention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opt-in&lt;/strong&gt;: UCs integrate by importing the helper — no mandatory coupling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PAY_PER_REQUEST&lt;/strong&gt;: No capacity planning needed for variable workloads&lt;/li&gt;
&lt;/ul&gt;
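
&lt;p&gt;The non-blocking and TTL principles can be sketched as follows. This is an illustration rather than the &lt;code&gt;LineageTracker&lt;/code&gt; source: &lt;code&gt;ttl_epoch&lt;/code&gt; and &lt;code&gt;record_non_blocking&lt;/code&gt; are hypothetical names, and &lt;code&gt;put_item&lt;/code&gt; stands in for whatever callable performs the DynamoDB write. DynamoDB TTL expects the expiry as a Unix epoch timestamp in seconds.&lt;/p&gt;

```python
import logging
import os
import time

logger = logging.getLogger("lineage")
SECONDS_PER_DAY = 86400


def ttl_epoch(now_epoch, env=os.environ):
    """Expiry time in epoch seconds, LINEAGE_TTL_DAYS after now_epoch."""
    days = int(env.get("LINEAGE_TTL_DAYS", 365))
    return int(now_epoch) + days * SECONDS_PER_DAY


def record_non_blocking(put_item, item):
    """Attempt the write; log and swallow any failure.

    Returns True on success and False on failure so callers can count
    dropped records without interrupting the main pipeline.
    """
    try:
        put_item(item)
        return True
    except Exception as exc:  # deliberately broad: lineage must never block
        logger.warning("lineage write failed: %s", exc)
        return False


if __name__ == "__main__":
    item = {"source_file_key": "/vol1/example.pdf",
            "ttl": ttl_epoch(time.time())}
    print(record_non_blocking(lambda _item: None, item))
```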

&lt;h3&gt;
  
  
  Future: compliance-grade lineage (v2)
&lt;/h3&gt;

&lt;p&gt;For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future &lt;code&gt;LineageRecord&lt;/code&gt; v2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input_checksum&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 of source file for integrity verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;output_checksum&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 of generated output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fpolicy_sequence_number&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ONTAP-assigned sequence for ordering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;policy_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FPolicy policy configuration version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;uc_template_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UC CloudFormation template version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;guardrail_mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Active guardrail mode at processing time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retention_profile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retention class for compliance tiering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Protobuf TCP Framing — Adaptive Reader
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing &lt;code&gt;read_fpolicy_message()&lt;/code&gt; assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;An adaptive &lt;code&gt;ProtobufFrameReader&lt;/code&gt; that supports three framing modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    A[Incoming TCP Stream] --&amp;gt; B{FramingMode}
    B --&amp;gt;|AUTO_DETECT| C[Probe first 4 bytes]
    C --&amp;gt;|Valid uint32 length| D[LENGTH_PREFIXED]
    C --&amp;gt;|Otherwise| E[FRAMELESS]
    B --&amp;gt;|LENGTH_PREFIXED| D
    B --&amp;gt;|FRAMELESS| E
    D --&amp;gt; F[4-byte big-endian header → payload]
    E --&amp;gt; G[varint-delimited → payload]
    F --&amp;gt; H[Decoded Message]
    G --&amp;gt; H
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Three modes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Wire Format&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LENGTH_PREFIXED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4-byte big-endian length + payload&lt;/td&gt;
&lt;td&gt;XML mode (legacy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FRAMELESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;varint-delimited protobuf&lt;/td&gt;
&lt;td&gt;Protobuf mode (ONTAP 9.15.1+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AUTO_DETECT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Probe first bytes, then lock mode&lt;/td&gt;
&lt;td&gt;Unknown/mixed environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
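
&lt;p&gt;To make the two wire formats concrete, the sketch below splits one frame off the front of an in-memory buffer for each mode. It is an illustration under assumptions, not the &lt;code&gt;ProtobufFrameReader&lt;/code&gt; code: the function names are hypothetical, "varint-delimited" is taken to mean a protobuf base-128 varint length followed by the payload, and the varint's low seven bits are extracted arithmetically instead of with bit masks.&lt;/p&gt;

```python
import struct


def split_length_prefixed(buf):
    """LENGTH_PREFIXED: 4-byte big-endian length, then the payload.

    Returns (payload, remaining_bytes).
    """
    (length,) = struct.unpack("!I", buf[:4])
    end = 4 + length
    if end > len(buf):
        raise ValueError("incomplete frame")
    return buf[4:end], buf[end:]


def decode_varint(buf, offset=0):
    """Decode a base-128 varint; returns (value, next_offset)."""
    value = 0
    for index in range(10):  # a 64-bit varint spans at most 10 bytes
        byte = buf[offset + index]
        value += (byte % 128) * (128 ** index)  # low 7 bits carry data
        if byte >= 128:  # high bit set: another byte follows
            continue
        return value, offset + index + 1
    raise ValueError("varint longer than 10 bytes")


def split_varint_delimited(buf):
    """FRAMELESS: varint length, then the payload.

    Returns (payload, remaining_bytes).
    """
    length, start = decode_varint(buf)
    end = start + length
    if end > len(buf):
        raise ValueError("incomplete frame")
    return buf[start:end], buf[end:]


if __name__ == "__main__":
    print(split_length_prefixed(b"\x00\x00\x00\x03abcXYZ"))
    print(split_varint_delimited(b"\x03abcXYZ"))
```

&lt;p&gt;A streaming reader would apply the same logic incrementally against the socket rather than a complete buffer; a truncated varint in this sketch surfaces as an &lt;code&gt;IndexError&lt;/code&gt;.&lt;/p&gt;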

&lt;h3&gt;
  
  
  Auto-detection heuristic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_auto_detect_and_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Probe first 4 bytes to determine framing mode.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;peek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readexactly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peek&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;candidate_length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max_message_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Valid length header → LENGTH_PREFIXED
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_detected_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LENGTH_PREFIXED&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readexactly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Not a valid length → FRAMELESS (varint-delimited)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_detected_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FRAMELESS&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;peek&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read_varint_delimited&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Safety features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max message size enforcement&lt;/strong&gt; (default 1 MB): Prevents DoS via malformed length headers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FramingError exception&lt;/strong&gt;: Structured error with offset and raw data for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful EOF handling&lt;/strong&gt;: Returns &lt;code&gt;None&lt;/code&gt; on connection close without raising&lt;/li&gt;
&lt;/ul&gt;
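
&lt;p&gt;The three safety features can be sketched together for the length-prefixed path. This is a minimal illustration under stated assumptions, not the project's implementation: the &lt;code&gt;FramingError&lt;/code&gt; fields (&lt;code&gt;offset&lt;/code&gt;, &lt;code&gt;raw&lt;/code&gt;) are inferred from the description above, and &lt;code&gt;read_length_prefixed&lt;/code&gt; is a hypothetical standalone function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import struct


class FramingError(Exception):
    """Structured framing error carrying the stream offset and raw bytes."""

    def __init__(self, message, offset, raw):
        super().__init__(message)
        self.offset = offset
        self.raw = raw


async def read_length_prefixed(reader, max_message_size=1024 * 1024, offset=0):
    """Read one frame; return None on a clean close before a new frame."""
    try:
        header = await reader.readexactly(4)
    except asyncio.IncompleteReadError as exc:
        if not exc.partial:
            return None  # connection closed between frames
        raise FramingError("truncated length header", offset, exc.partial) from exc

    (length,) = struct.unpack("!I", header)
    if length not in range(1, max_message_size + 1):
        # Reject before waiting for (or buffering) the advertised payload.
        raise FramingError("invalid frame length", offset, header)

    try:
        return await reader.readexactly(length)
    except asyncio.IncompleteReadError as exc:
        raise FramingError("truncated payload", offset + 4, exc.partial) from exc


async def demo():
    reader = asyncio.StreamReader()
    reader.feed_data(struct.pack("!I", 5) + b"hello")
    reader.feed_eof()
    first = await read_length_prefixed(reader)
    second = await read_length_prefixed(reader)
    return first, second


print(asyncio.run(demo()))  # (b'hello', None)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Checking the length before reading the payload is what keeps a corrupt or hostile header from making the server wait for gigabytes of data.&lt;/p&gt;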

&lt;h3&gt;
  
  
  Integration with existing FPolicy server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.integrations.protobuf_integration&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_fpolicy_reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_fpolicy_message_v2&lt;/span&gt;

&lt;span class="c1"&gt;# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_fpolicy_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;read_fpolicy_message_v2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.&lt;/p&gt;
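
&lt;p&gt;The environment-variable switch described in the integration snippet could be resolved along the following lines. This is a sketch only: &lt;code&gt;resolve_framing_mode&lt;/code&gt; is a hypothetical helper, the &lt;code&gt;FramingMode&lt;/code&gt; enum here is a stand-in for the project's own, and accepting lowercase values is an assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from enum import Enum


class FramingMode(Enum):
    AUTO_DETECT = "AUTO_DETECT"
    LENGTH_PREFIXED = "LENGTH_PREFIXED"
    FRAMELESS = "FRAMELESS"


def resolve_framing_mode(environ=None):
    """Return the configured FramingMode, or None for the legacy path."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PROTOBUF_FRAMING_MODE", "").strip()
    if not raw:
        return None  # unset: keep the legacy read_fpolicy_message() path
    try:
        return FramingMode[raw.upper()]
    except KeyError:
        valid = ", ".join(mode.name for mode in FramingMode)
        raise ValueError(f"PROTOBUF_FRAMING_MODE must be one of: {valid}") from None


print(resolve_framing_mode({}))                                      # None
print(resolve_framing_mode({"PROTOBUF_FRAMING_MODE": "frameless"}))  # FramingMode.FRAMELESS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Failing fast on an unrecognized value avoids silently falling back to the legacy reader because of a typo in the task definition.&lt;/p&gt;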

&lt;h3&gt;
  
  
  Phase 13 protobuf validation scope
&lt;/h3&gt;

&lt;p&gt;The following open questions will be resolved with NetApp support during live wire validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)&lt;/li&gt;
&lt;li&gt;Message boundary behavior under high throughput&lt;/li&gt;
&lt;li&gt;Keep-alive behavior in protobuf mode vs XML mode&lt;/li&gt;
&lt;li&gt;Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?&lt;/li&gt;
&lt;li&gt;Mixed-mode migration path (XML → protobuf transition without event loss)&lt;/li&gt;
&lt;li&gt;Maximum message size guidance from ONTAP side&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. SLO Definition — 4 Targets with CloudWatch Dashboard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;Four SLO targets covering the critical path of the event-driven pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SLO&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;SLO met when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Event Ingestion Latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;EventIngestionLatency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;P99 &amp;lt; 5,000 ms&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing Success Rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ProcessingSuccessRate_pct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt; 99.5%&lt;/td&gt;
&lt;td&gt;GreaterThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reconnect Time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FPolicyReconnectTime_sec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 sec&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay Completion Time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReplayCompletionTime_sec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 300 sec (5 min)&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For success rate, the CloudWatch Alarm fires when the metric drops &lt;em&gt;below&lt;/em&gt; 99.5% (ComparisonOperator: &lt;code&gt;LessThanThreshold&lt;/code&gt;), even though the SLO target is expressed as "&amp;gt; 99.5%".&lt;/p&gt;
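
&lt;p&gt;The relationship between the "SLO met when" column and the alarm operator can be sketched as a small lookup. The target values come from the table above; the names &lt;code&gt;SloTarget&lt;/code&gt;, &lt;code&gt;EXAMPLE_TARGETS&lt;/code&gt;, &lt;code&gt;is_met&lt;/code&gt;, and &lt;code&gt;alarm_operator&lt;/code&gt; are hypothetical, and using &lt;code&gt;GreaterThanThreshold&lt;/code&gt; as the alarm operator for the three "less than" SLOs is an assumption made by symmetry with the success-rate case.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import operator
from dataclasses import dataclass

# How to decide that an SLO is met, keyed by the "SLO met when" column.
MET_WHEN = {
    "LessThanThreshold": operator.lt,
    "GreaterThanThreshold": operator.gt,
}
# The alarm signals a violation, so it uses the opposite comparison.
ALARM_OPERATOR = {
    "LessThanThreshold": "GreaterThanThreshold",   # assumed by symmetry
    "GreaterThanThreshold": "LessThanThreshold",   # as stated for success rate
}


@dataclass(frozen=True)
class SloTarget:
    name: str
    metric: str
    threshold: float
    met_when: str


EXAMPLE_TARGETS = [
    SloTarget("Event Ingestion Latency", "EventIngestionLatency_ms", 5000, "LessThanThreshold"),
    SloTarget("Processing Success Rate", "ProcessingSuccessRate_pct", 99.5, "GreaterThanThreshold"),
    SloTarget("Reconnect Time", "FPolicyReconnectTime_sec", 30, "LessThanThreshold"),
    SloTarget("Replay Completion Time", "ReplayCompletionTime_sec", 300, "LessThanThreshold"),
]


def is_met(target, value):
    """True when the observed value satisfies the SLO target."""
    return MET_WHEN[target.met_when](value, target.threshold)


def alarm_operator(target):
    """ComparisonOperator for the alarm that signals a violation."""
    return ALARM_OPERATOR[target.met_when]


success = EXAMPLE_TARGETS[1]
print(is_met(success, 99.9), is_met(success, 99.0))  # True False
print(alarm_operator(success))                       # LessThanThreshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With strict comparisons on both sides, a value exactly equal to the threshold neither meets the SLO nor triggers the alarm, which is worth keeping in mind when reading the dashboard.&lt;/p&gt;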

&lt;h3&gt;
  
  
  CloudWatch Dashboard
&lt;/h3&gt;

&lt;p&gt;The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.slo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SLO_TARGETS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate_slos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_dashboard_widgets&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate all SLOs programmatically
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_slos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cloudwatch_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;met&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VIOLATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slo_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (value=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, threshold=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate dashboard widget JSON for CloudFormation
&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_dashboard_widgets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Alarm-based violation detection
&lt;/h3&gt;

&lt;p&gt;Each SLO has a corresponding CloudWatch Alarm:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alarm Name&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Evaluation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-ingestion-latency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-success-rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-reconnect-time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-replay-completion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw6nv9al1lm7hzzq8exu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw6nv9al1lm7hzzq8exu.png" alt="SLO Dashboard" width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. FPolicy Pipeline E2E Verification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  The verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant NFS as NFS Client (Bastion)
    participant ONTAP as FSx for ONTAP
    participant FP as FPolicy Server (Fargate)
    participant SQS as SQS Queue

    NFS-&amp;gt;&amp;gt;ONTAP: echo "test" &amp;gt; /mnt/fpolicy_vol/test_fpolicy_event.txt
    ONTAP-&amp;gt;&amp;gt;FP: NOTI_REQ (FILE_CREATE event)
    FP-&amp;gt;&amp;gt;FP: Parse event, extract metadata
    FP-&amp;gt;&amp;gt;SQS: SendMessage (JSON payload)
    SQS--&amp;gt;&amp;gt;SQS: Message available for consumers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Timeline (actual observed)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T+0s&lt;/td&gt;
&lt;td&gt;TCP connection test&lt;/td&gt;
&lt;td&gt;ONTAP → Fargate IP (10.0.128.98:9898)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+10s&lt;/td&gt;
&lt;td&gt;Session established&lt;/td&gt;
&lt;td&gt;NEGO_REQ → NEGO_RESP handshake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+12s&lt;/td&gt;
&lt;td&gt;KEEP_ALIVE starts&lt;/td&gt;
&lt;td&gt;2-minute interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+30s&lt;/td&gt;
&lt;td&gt;NFS file created&lt;/td&gt;
&lt;td&gt;&lt;code&gt;echo "test" &amp;gt; /mnt/fpolicy_vol/test_fpolicy_event.txt&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+31s&lt;/td&gt;
&lt;td&gt;NOTI_REQ received&lt;/td&gt;
&lt;td&gt;FPolicy server receives file creation event&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+32s&lt;/td&gt;
&lt;td&gt;SQS delivery&lt;/td&gt;
&lt;td&gt;Event sent to SQS queue (FPolicy_Q)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  SQS message format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FILE_CREATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"svm_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FSxN_OnPre"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volume_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vol1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/vol1/test_fpolicy_event.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.0.128.98"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T08:45:32Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequence_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IAM issue discovered and fixed
&lt;/h3&gt;

&lt;p&gt;The ECS task role's SQS policy used a Resource ARN pattern &lt;code&gt;arn:aws:sqs:...:fsxn-fpolicy-*&lt;/code&gt; that didn't match the actual queue name &lt;code&gt;FPolicy_Q&lt;/code&gt;. Fix: reference the queue's explicit ARN in the template, or widen the Resource to a &lt;code&gt;*&lt;/code&gt; wildcard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: A queue name that doesn't match the template's Resource pattern is not flagged at deploy time; the mismatch only shows up at runtime, when the task's SQS calls are denied. Either parameterize the queue ARN (preferable for least privilege) or use a broader resource pattern.&lt;/p&gt;
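
&lt;p&gt;A mismatch like this can be caught before deployment with a quick pattern check. The sketch below approximates IAM wildcard matching with Python's &lt;code&gt;fnmatch&lt;/code&gt;, which is close enough for simple &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;?&lt;/code&gt; patterns but is not the IAM policy evaluator. The region and account ID are placeholders, and &lt;code&gt;resource_covers&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fnmatch import fnmatchcase

# Placeholder region and account ID, used only to build complete ARNs.
REGION = "ap-northeast-1"
ACCOUNT = "123456789012"

policy_resource = f"arn:aws:sqs:{REGION}:{ACCOUNT}:fsxn-fpolicy-*"
queue_arn = f"arn:aws:sqs:{REGION}:{ACCOUNT}:FPolicy_Q"


def resource_covers(pattern, arn):
    """Approximate IAM wildcard matching for simple * and ? patterns."""
    return fnmatchcase(arn, pattern)


print(resource_covers(policy_resource, queue_arn))  # False
print(resource_covers(queue_arn, queue_arn))        # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running a check of this kind against the deployed queue name in CI would surface the problem before the first event is dropped.&lt;/p&gt;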

&lt;h3&gt;
  
  
  Event contract assumptions
&lt;/h3&gt;

&lt;p&gt;The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate events can occur (especially during Persistent Store replay)&lt;/li&gt;
&lt;li&gt;Delivery order is not guaranteed (confirmed in Section 9)&lt;/li&gt;
&lt;li&gt;Consumers must be idempotent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;file_path + timestamp + sequence_number&lt;/code&gt; is a candidate idempotency key&lt;/li&gt;
&lt;li&gt;Replay events may arrive after newer events&lt;/li&gt;
&lt;li&gt;Schema versioning should be introduced before multi-UC production rollout&lt;/li&gt;
&lt;/ul&gt;
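
&lt;p&gt;A consumer that honors these assumptions might deduplicate as sketched below. The field names come from the SQS message format shown earlier; &lt;code&gt;idempotency_key&lt;/code&gt; and &lt;code&gt;process_once&lt;/code&gt; are hypothetical helpers, and the in-memory set stands in for whatever durable store a real consumer would use.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def idempotency_key(event):
    """Build the candidate key from file_path, timestamp, and sequence_number."""
    return (event["file_path"], event["timestamp"], event["sequence_number"])


def process_once(events, handler, seen=None):
    """Invoke handler once per distinct key; return the number of duplicates skipped."""
    seen = set() if seen is None else seen
    skipped = 0
    for event in events:
        key = idempotency_key(event)
        if key in seen:
            skipped += 1
            continue
        handler(event)
        seen.add(key)  # mark as done only after the handler succeeds
    return skipped


event = {
    "event_type": "FILE_CREATE",
    "file_path": "/vol1/test_fpolicy_event.txt",
    "timestamp": "2026-05-16T08:45:32Z",
    "sequence_number": 1,
}
handled = []
print(process_once([event, dict(event)], handled.append))  # 1
print(len(handled))                                        # 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Recording the key only after the handler returns means a crash mid-processing leads to a retry rather than a lost event, which matches the at-least-once contract.&lt;/p&gt;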




&lt;h2&gt;
  
  
  9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important prerequisite&lt;/strong&gt;: FPolicy Persistent Store is available for &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;asynchronous non-mandatory policies&lt;/a&gt; only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have &lt;a href="https://docs.netapp.com/us-en/ontap-restapi/protocols_fpolicy_svm.uuid_persistent-stores_endpoint_overview.html" rel="noopener noreferrer"&gt;only one Persistent Store&lt;/a&gt;, and the same store can be used by multiple policies within that SVM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The test procedure
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Stop Fargate task (ECS &lt;code&gt;stop-task&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Create 5 files via NFS during downtime (&lt;code&gt;replay-test-1.txt&lt;/code&gt; through &lt;code&gt;replay-test-5.txt&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Wait for ECS service auto-recovery (new task launch)&lt;/li&gt;
&lt;li&gt;Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)&lt;/li&gt;
&lt;li&gt;Verify all 5 events arrive in SQS&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Events generated during downtime&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events replayed to SQS&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lost events&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay delivery order&lt;/td&gt;
&lt;td&gt;3, 1, 2, 5, 4 (non-sequential)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay completion time&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key observation: Out-of-order replay
&lt;/h3&gt;

&lt;p&gt;Persistent Store replays events in a &lt;strong&gt;non-sequential order&lt;/strong&gt; — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency&lt;/strong&gt;: Deduplicate by file path + timestamp (plus &lt;code&gt;sequence_number&lt;/code&gt; where available)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp-based ordering&lt;/strong&gt;: Sort by event timestamp, not arrival order&lt;/li&gt;
&lt;/ul&gt;
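
&lt;p&gt;Timestamp-based ordering can be sketched as follows. The batch below is hypothetical: it mimics the observed arrival order 3, 1, 2, 5, 4 with made-up timestamps and paths, and it assumes that &lt;code&gt;sequence_number&lt;/code&gt; is a usable tie-breaker when two events share a one-second timestamp. &lt;code&gt;parse_ts&lt;/code&gt; and &lt;code&gt;order_by_event_time&lt;/code&gt; are illustrative names.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime


def parse_ts(event):
    """Parse the ISO-8601 UTC timestamp format shown in the SQS payload."""
    return datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")


def order_by_event_time(events):
    """Sort by event timestamp, using sequence_number to break ties."""
    return sorted(events, key=lambda e: (parse_ts(e), e["sequence_number"]))


# Hypothetical replay batch arriving in the observed order 3, 1, 2, 5, 4.
arrival_order = [3, 1, 2, 5, 4]
batch = [
    {
        "file_path": f"/vol1/replay-test-{n}.txt",
        "timestamp": f"2026-05-16T09:00:0{n}Z",
        "sequence_number": n,
    }
    for n in arrival_order
]

ordered = order_by_event_time(batch)
print([e["sequence_number"] for e in ordered])  # [1, 2, 3, 4, 5]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sorting only works over a bounded window, so a consumer still has to decide how long to wait for stragglers before acting on a batch.&lt;/p&gt;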

&lt;h3&gt;
  
  
  20-file burst validation
&lt;/h3&gt;

&lt;p&gt;Additionally, a 20-file burst test showed zero event loss at a larger event count:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Files Created&lt;/th&gt;
&lt;th&gt;Events Delivered&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Replay (5 files)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Burst (20 files)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 replay storm metrics
&lt;/h3&gt;

&lt;p&gt;The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store volume usage before/after replay&lt;/td&gt;
&lt;td&gt;Capacity planning for the store volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events queued vs events replayed&lt;/td&gt;
&lt;td&gt;Completeness verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay throughput (events/sec)&lt;/td&gt;
&lt;td&gt;Performance baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay duration&lt;/td&gt;
&lt;td&gt;SLO calibration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-order distance&lt;/td&gt;
&lt;td&gt;Downstream buffer sizing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate events&lt;/td&gt;
&lt;td&gt;Idempotency requirement validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP EMS logs around disconnect/reconnect&lt;/td&gt;
&lt;td&gt;Root cause correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational framing: event durability as RPO/RTO
&lt;/h3&gt;

&lt;p&gt;Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while &lt;code&gt;ReplayCompletionTime_sec&lt;/code&gt; provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 12 validation scope
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Phase 12 Assumption&lt;/th&gt;
&lt;th&gt;Production Consideration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SVM&lt;/td&gt;
&lt;td&gt;Single SVM validation&lt;/td&gt;
&lt;td&gt;Multi-SVM needs per-SVM policy and Persistent Store planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;Test volume&lt;/td&gt;
&lt;td&gt;Production volumes should be grouped by UC/event profile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;NFS-based E2E test&lt;/td&gt;
&lt;td&gt;NFSv3/NFSv4.1/SMB replay validation remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event types&lt;/td&gt;
&lt;td&gt;File create&lt;/td&gt;
&lt;td&gt;Modify/delete/rename validation remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy mode&lt;/td&gt;
&lt;td&gt;Async non-mandatory&lt;/td&gt;
&lt;td&gt;Required for Persistent Store (&lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;NetApp docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need a much broader, generated exploration of the input space than hand-written examples can provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  The approach
&lt;/h3&gt;

&lt;p&gt;Using Python's &lt;a href="https://hypothesis.readthedocs.io/" rel="noopener noreferrer"&gt;Hypothesis&lt;/a&gt; library, we defined 16 properties across the Phase 12 modules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property Group&lt;/th&gt;
&lt;th&gt;Properties&lt;/th&gt;
&lt;th&gt;Tests&lt;/th&gt;
&lt;th&gt;Bugs Found&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Protobuf Frame Reader&lt;/td&gt;
&lt;td&gt;5 (round-trip, max size, EOF, multi-message, auto-detect)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Guardrails&lt;/td&gt;
&lt;td&gt;4 (mode behavior, rate limit, daily cap, cooldown)&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lineage&lt;/td&gt;
&lt;td&gt;3 (record/query round-trip, GSI consistency, TTL)&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLO Evaluation&lt;/td&gt;
&lt;td&gt;2 (threshold comparison, no-data handling)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Forecast&lt;/td&gt;
&lt;td&gt;2 (regression accuracy, edge cases)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Bugs discovered
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Protobuf reader&lt;/strong&gt;: &lt;code&gt;AUTO_DETECT&lt;/code&gt; mode failed when the first 4 bytes happened to form a valid-looking length that exceeded &lt;code&gt;max_message_size&lt;/code&gt;. Fix: treat oversized candidate lengths as a &lt;code&gt;FRAMELESS&lt;/code&gt; indicator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;: &lt;code&gt;BREAK_GLASS&lt;/code&gt; mode didn't emit the &lt;code&gt;GuardrailBypass&lt;/code&gt; metric when DynamoDB tracking update failed. Fix: move metric emission before the tracking update call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLO evaluation&lt;/strong&gt;: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), &lt;code&gt;max(datapoints, key=lambda dp: dp["Timestamp"])&lt;/code&gt; picked whichever tied datapoint came first, so the result depended on the order of the response. Fix: break ties with the value as a secondary key.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
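
&lt;p&gt;The third fix can be illustrated with a composite key. This is a sketch of the idea rather than the project's code: &lt;code&gt;latest_datapoint&lt;/code&gt; is a hypothetical helper, and the &lt;code&gt;Average&lt;/code&gt; statistic is assumed as the value field.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone


def latest_datapoint(datapoints, stat="Average"):
    """Pick the newest datapoint; break timestamp ties by the statistic value."""
    if not datapoints:
        return None  # the no-data case is left to the caller
    return max(datapoints, key=lambda dp: (dp["Timestamp"], dp[stat]))


ts = datetime(2026, 5, 16, 8, 45, tzinfo=timezone.utc)
a = {"Timestamp": ts, "Average": 4200.0}
b = {"Timestamp": ts, "Average": 5100.0}

# The same datapoint wins regardless of the order it arrives in.
print(latest_datapoint([a, b]) is latest_datapoint([b, a]))  # True
print(latest_datapoint([a, b])["Average"])                   # 5100.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Preferring the larger value on a tie is the conservative choice for latency-style metrics; for a success-rate metric the smaller value would be the conservative one, so the tie-break direction is a per-SLO decision.&lt;/p&gt;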

&lt;h3&gt;
  
  
  Example property test
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nd"&gt;@settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_length_prefixed_round_trip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Property: LENGTH_PREFIXED encode → decode preserves all messages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;stream_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_length_prefixed_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_stream_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;frame_reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtobufFrameReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LENGTH_PREFIXED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_message_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_message&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;  &lt;span class="c1"&gt;# Round-trip property
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The critical finding
&lt;/h3&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points are &lt;strong&gt;not standard S3 endpoints&lt;/strong&gt;. They use the FSx data plane, which has different network routing characteristics than standard S3.&lt;/p&gt;

&lt;p&gt;In this pattern library, FSx for ONTAP S3 Access Points serve as an &lt;strong&gt;AWS service integration boundary&lt;/strong&gt;: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-layer authorization model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    Client[S3 API Client] --&amp;gt; IAM{Layer 1: IAM Policy}
    IAM --&amp;gt;|identity-based policy| AP{Layer 2: AP Resource Policy}
    AP --&amp;gt;|resource policy| FS{Layer 3: File System Identity}
    FS --&amp;gt;|UNIX UID or AD user| Volume[ONTAP Volume]

    IAM -.-&amp;gt;|❌ Denied| Block1[Access Denied]
    AP -.-&amp;gt;|❌ Denied| Block2[Access Denied]
    FS -.-&amp;gt;|❌ No permission| Block3[Access Denied]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-ap-manage-access-fsxn.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.&lt;/p&gt;
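&lt;p&gt;To make the "any layer can block" behavior concrete, here is a minimal Python sketch. It is a simplified model, not how AWS or ONTAP actually evaluate policies: the layer names, the &lt;code&gt;first_blocking_layer&lt;/code&gt; helper, and the example principals are all hypothetical, and each check is a stand-in lambda.&lt;/p&gt;

```python
from typing import Callable, Optional

# A layer is a (name, check) pair; the check returns True to let the request continue.
Layer = tuple[str, Callable[[dict], bool]]

# Hypothetical stand-ins for the real checks; actual evaluation happens inside AWS and ONTAP.
LAYERS: list[Layer] = [
    ("iam_identity_policy", lambda req: req["action"] in {"s3:GetObject", "s3:ListBucket"}),
    ("access_point_policy", lambda req: req["principal"] in {"role/reader", "role/unmapped"}),
    ("file_system_identity", lambda req: req["principal"] != "role/unmapped"),
]


def first_blocking_layer(request: dict, layers: list[Layer]) -> Optional[str]:
    """Return the name of the first layer that blocks the request, or None if all pass."""
    for name, allows in layers:
        if not allows(request):
            return name
    return None
```

&lt;p&gt;When debugging an Access Denied, the useful question is the one this helper answers: which layer blocked the request. A principal that passes IAM and the access point policy can still be rejected at the file system identity layer.&lt;/p&gt;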

&lt;h3&gt;
  
  
  Correct IAM ARN format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:&amp;lt;ACCOUNT_ID&amp;gt;:accesspoint/fsxn-eda-s3ap"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:&amp;lt;ACCOUNT_ID&amp;gt;:accesspoint/fsxn-eda-s3ap/object/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Common mistake&lt;/strong&gt;: Using the S3AP alias (&lt;code&gt;xxx-ext-s3alias&lt;/code&gt;) as a bucket ARN. The alias is valid only where an S3 API call expects a bucket name (for example, the &lt;code&gt;Bucket&lt;/code&gt; parameter in boto3 calls); IAM policies require the full access point ARN.&lt;/p&gt;
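&lt;p&gt;One way to avoid hand-typing these ARNs is to generate them. The sketch below builds the access point ARN and the object ARN in the format used in this section and assembles an identity-based policy document. The helper names are hypothetical, and &lt;code&gt;111122223333&lt;/code&gt; is a placeholder account ID.&lt;/p&gt;

```python
import json


def access_point_arn(region: str, account_id: str, name: str) -> str:
    """ARN of the access point itself (for bucket-level actions such as s3:ListBucket)."""
    return f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"


def access_point_object_arn(region: str, account_id: str, name: str, key_pattern: str = "*") -> str:
    """ARN for objects behind the access point (for object-level actions such as s3:GetObject)."""
    return f"{access_point_arn(region, account_id, name)}/object/{key_pattern}"


def read_only_policy(region: str, account_id: str, name: str) -> dict:
    """Build an identity-based policy document granting list and read through the access point."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": access_point_arn(region, account_id, name),
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": access_point_object_arn(region, account_id, name),
            },
        ],
    }


# Placeholder account ID; the access point name matches the example in this section.
policy = read_only_policy("ap-northeast-1", "111122223333", "fsxn-eda-s3ap")
print(json.dumps(policy, indent=2))
```

&lt;p&gt;Note that the alias never appears here: it belongs in the S3 call, while the policy always carries the ARN.&lt;/p&gt;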

&lt;h3&gt;
  
  
  VPC network constraint (environment-specific observation)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Observed Result&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint)&lt;/td&gt;
&lt;td&gt;⚠️ Timeout in this config&lt;/td&gt;
&lt;td&gt;Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet → S3 AP (NetworkOrigin=Internet)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Routes correctly with valid IAM credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC)&lt;/td&gt;
&lt;td&gt;Supported per AWS docs; not verified in Phase 12&lt;/td&gt;
&lt;td&gt;Requires VPC-origin AP and matching endpoint policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → ONTAP REST API&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Direct management LIF access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural implication for this pattern&lt;/strong&gt;: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must do one of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run outside VPC (with Internet access)&lt;/li&gt;
&lt;li&gt;Use NAT Gateway for outbound routing&lt;/li&gt;
&lt;li&gt;Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Write support and practical constraints
&lt;/h3&gt;

&lt;p&gt;FSx ONTAP S3 Access Points support &lt;code&gt;PutObject&lt;/code&gt;, &lt;code&gt;DeleteObject&lt;/code&gt;, multipart uploads (&lt;code&gt;CreateMultipartUpload&lt;/code&gt;, &lt;code&gt;UploadPart&lt;/code&gt;, &lt;code&gt;CompleteMultipartUpload&lt;/code&gt;), and other write operations — they are not read-only. The &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;access point compatibility table&lt;/a&gt; documents the full list of supported S3 API operations.&lt;/p&gt;

&lt;p&gt;However, S3 Access Points are not full S3 buckets. Key constraints include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum upload size: 5 GB&lt;/li&gt;
&lt;li&gt;Only &lt;code&gt;FSX_ONTAP&lt;/code&gt; storage class&lt;/li&gt;
&lt;li&gt;Only SSE-FSX encryption&lt;/li&gt;
&lt;li&gt;No ACLs (except &lt;code&gt;bucket-owner-full-control&lt;/code&gt;), no Object Versioning, no Object Lock, no presigned URLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Cross-Project Feedback — Template Hardening
&lt;/h2&gt;

&lt;p&gt;During Phase 12, the companion project &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;fsxn-observability-integrations&lt;/a&gt; reviewed our CloudFormation templates and provided actionable feedback. All items were applied:&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Group: SourceSecurityGroupId over CIDR
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; (broad):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9898&lt;/span&gt;
    &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9898&lt;/span&gt;
    &lt;span class="na"&gt;CidrIp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; (precise):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FPolicyPort&lt;/span&gt;
    &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FPolicyPort&lt;/span&gt;
    &lt;span class="na"&gt;SourceSecurityGroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FsxnSvmSecurityGroupId&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FPolicy TCP from FSxN SVM Security Group&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This limits inbound traffic to the FSxN SVM's security group instead of the broad &lt;code&gt;10.0.0.0/8&lt;/code&gt; range, a significant security improvement for production deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  ONTAP CLI: Deprecated &lt;code&gt;vserver&lt;/code&gt; prefix
&lt;/h3&gt;

&lt;p&gt;ONTAP 9.11+ deprecates the &lt;code&gt;vserver&lt;/code&gt; prefix on FPolicy commands. All templates and documentation (8 languages) now use the recommended format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deprecated (still works for backward compatibility)&lt;/span&gt;
vserver fpolicy policy external-engine create &lt;span class="nt"&gt;-vserver&lt;/span&gt; FSxN_OnPre ...

&lt;span class="c"&gt;# Recommended (ONTAP 9.11+)&lt;/span&gt;
fpolicy policy external-engine create &lt;span class="nt"&gt;-vserver&lt;/span&gt; FSxN_OnPre ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  KMS Decrypt: When it's needed (and when it's not)
&lt;/h3&gt;

&lt;p&gt;Added documentation clarifying SQS encryption behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SqsManagedSseEnabled: true&lt;/code&gt; → kms:Decrypt is &lt;strong&gt;NOT&lt;/strong&gt; needed (transparent)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KmsMasterKeyId: alias/aws/sqs&lt;/code&gt; → kms:Decrypt &lt;strong&gt;IS&lt;/strong&gt; needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our templates use &lt;code&gt;SqsManagedSseEnabled: true&lt;/code&gt;, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.&lt;/p&gt;
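&lt;p&gt;The rule above can be encoded as a small check on the queue properties. This is a sketch that simply mirrors the two cases listed here; the function names are hypothetical, and the base SQS actions are an assumed typical set for a Lambda consumer, not a copy of the templates.&lt;/p&gt;

```python
def consumer_needs_kms_decrypt(queue_properties: dict) -> bool:
    """Apply the rule above: a queue with KmsMasterKeyId set needs kms:Decrypt, SSE-SQS does not."""
    return bool(queue_properties.get("KmsMasterKeyId"))


def consumer_actions(queue_properties: dict) -> list[str]:
    """IAM actions for a hypothetical SQS consumer role, adding kms:Decrypt only when required."""
    actions = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
    if consumer_needs_kms_decrypt(queue_properties):
        actions.append("kms:Decrypt")
    return actions


print(consumer_actions({"SqsManagedSseEnabled": True}))
print(consumer_actions({"KmsMasterKeyId": "alias/aws/sqs"}))
```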

&lt;h3&gt;
  
  
  EC2 AMI: Removed redundant Docker install
&lt;/h3&gt;

&lt;p&gt;ECS-optimized AMIs (&lt;code&gt;{{resolve:ssm:/aws/service/ecs/optimized-ami/...}}&lt;/code&gt;) already include Docker. Removed the unnecessary &lt;code&gt;yum install -y docker&lt;/code&gt; from UserData scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cpu/Memory: String type is intentional
&lt;/h3&gt;

&lt;p&gt;Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with &lt;code&gt;AllowedValues&lt;/code&gt; provides better validation than Number type for this constrained parameter space.&lt;/p&gt;
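&lt;p&gt;The same constraint can be checked before deployment. The sketch below keeps CPU and memory as strings, as the template parameters do, and covers task sizes up to 4 vCPU only (larger sizes are omitted for brevity). The table name and helper are hypothetical.&lt;/p&gt;

```python
# Valid Fargate memory values (MiB) per CPU value (CPU units), as strings to match
# String-typed CloudFormation parameters. Sizes above 4096 CPU units are left out.
FARGATE_MEMORY_BY_CPU: dict[str, list[str]] = {
    "256": ["512", "1024", "2048"],
    "512": [str(m) for m in range(1024, 4096 + 1, 1024)],
    "1024": [str(m) for m in range(2048, 8192 + 1, 1024)],
    "2048": [str(m) for m in range(4096, 16384 + 1, 1024)],
    "4096": [str(m) for m in range(8192, 30720 + 1, 1024)],
}


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """Return True when the cpu/memory pair is one of the combinations in the table."""
    return memory in FARGATE_MEMORY_BY_CPU.get(cpu, [])


print(is_valid_fargate_size("256", "1024"))
print(is_valid_fargate_size("256", "4096"))
```

&lt;p&gt;&lt;code&gt;AllowedValues&lt;/code&gt; validates each parameter on its own, so a cross-parameter check like this one is a complement to it rather than a replacement.&lt;/p&gt;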




&lt;h2&gt;
  
  
  13. What's Next — Phase 13 Outlook
&lt;/h2&gt;

&lt;p&gt;Phase 12 completes the operational hardening layer. The pipeline now has the following production hardening baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Capacity guardrails preventing runaway auto-scaling&lt;/li&gt;
&lt;li&gt;✅ Automated secrets rotation on 90-day cycle&lt;/li&gt;
&lt;li&gt;✅ Proactive capacity forecasting with daily predictions&lt;/li&gt;
&lt;li&gt;✅ SLO-based observability with alarm-driven alerting&lt;/li&gt;
&lt;li&gt;✅ Data lineage tracking for audit and debugging&lt;/li&gt;
&lt;li&gt;✅ Validated zero-event-loss replay under Fargate restarts in tested 5-event and 20-event scenarios&lt;/li&gt;
&lt;li&gt;✅ Property-based testing catching real bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ownership boundary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Primary Owner&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shared event platform&lt;/td&gt;
&lt;td&gt;Platform / storage team&lt;/td&gt;
&lt;td&gt;FPolicy server, SQS queue, EventBridge bus, Persistent Store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP operations&lt;/td&gt;
&lt;td&gt;Storage team&lt;/td&gt;
&lt;td&gt;SVM, volume, FPolicy policy, Persistent Store capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security operations&lt;/td&gt;
&lt;td&gt;Security / platform team&lt;/td&gt;
&lt;td&gt;Secrets rotation, BREAK_GLASS approval, IAM policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload UC&lt;/td&gt;
&lt;td&gt;Application / data team&lt;/td&gt;
&lt;td&gt;Step Functions, UC routing rules, output destinations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Platform + workload teams&lt;/td&gt;
&lt;td&gt;SLO dashboard, UC-specific alarms, runbooks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Production Readiness Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Phase 12 Status&lt;/th&gt;
&lt;th&gt;Remaining Work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Guardrails&lt;/td&gt;
&lt;td&gt;Verified (DRY_RUN/ENFORCE/BREAK_GLASS)&lt;/td&gt;
&lt;td&gt;Approval workflow optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Rotation&lt;/td&gt;
&lt;td&gt;4-step rotation verified&lt;/td&gt;
&lt;td&gt;Ensure all clients read from Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLO Dashboard&lt;/td&gt;
&lt;td&gt;Deployed, 4 alarms active&lt;/td&gt;
&lt;td&gt;Runbooks and alarm response automation in Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store Replay&lt;/td&gt;
&lt;td&gt;5-event + 20-event scenarios verified&lt;/td&gt;
&lt;td&gt;1000+ replay storm testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3AP Monitoring&lt;/td&gt;
&lt;td&gt;ONTAP health path verified&lt;/td&gt;
&lt;td&gt;Split S3AP health check (VPC-external)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protobuf Framing&lt;/td&gt;
&lt;td&gt;Property/integration tested&lt;/td&gt;
&lt;td&gt;Live ONTAP protobuf wire validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-account OAM&lt;/td&gt;
&lt;td&gt;Stack deployed conditionally&lt;/td&gt;
&lt;td&gt;Second-account validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production UC E2E&lt;/td&gt;
&lt;td&gt;Pipeline verified to SQS delivery&lt;/td&gt;
&lt;td&gt;Full TriggerMode=EVENT_DRIVEN UC flow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Dashboard&lt;/td&gt;
&lt;td&gt;Not yet deployed&lt;/td&gt;
&lt;td&gt;Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 candidates
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Operational readiness&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Canary S3AP check separation&lt;/strong&gt;: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO violation runbooks&lt;/strong&gt;: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay storm testing&lt;/strong&gt;: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Enterprise deployment&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-account OAM validation&lt;/strong&gt;: Deploy workload-account-oam-link.yaml in a second AWS account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared platform vs workload boundary&lt;/strong&gt;: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production UC end-to-end&lt;/strong&gt;: Deploy a UC template with &lt;code&gt;TriggerMode=EVENT_DRIVEN&lt;/code&gt; and verify the complete flow from NFS file creation through Step Functions execution to output generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Protocol and cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Protobuf live wire validation&lt;/strong&gt;: Confirm protobuf TCP framing with NetApp support and validate &lt;code&gt;AUTO_DETECT&lt;/code&gt; mode against real ONTAP protobuf traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization dashboard&lt;/strong&gt;: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Decision trees and operational guides&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decision trees&lt;/strong&gt;: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NetApp Partner Delivery Checklist&lt;/strong&gt;: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Cost model awareness
&lt;/h3&gt;

&lt;p&gt;While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Cost Type&lt;/th&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy server (Fargate/EC2)&lt;/td&gt;
&lt;td&gt;Fixed baseline&lt;/td&gt;
&lt;td&gt;Always-on listener&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;Fixed + per-GB&lt;/td&gt;
&lt;td&gt;Required if VPC Lambda needs Internet-origin S3AP access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Synthetics&lt;/td&gt;
&lt;td&gt;Per-canary-run&lt;/td&gt;
&lt;td&gt;5-minute interval = 8,640 runs/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch custom metrics + Logs&lt;/td&gt;
&lt;td&gt;Per-metric + per-GB ingested&lt;/td&gt;
&lt;td&gt;SLO metrics, FPolicy server logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (lineage + guardrails)&lt;/td&gt;
&lt;td&gt;Per-request (PAY_PER_REQUEST)&lt;/td&gt;
&lt;td&gt;Event volume dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQS / EventBridge&lt;/td&gt;
&lt;td&gt;Per-message / per-event&lt;/td&gt;
&lt;td&gt;Event volume dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store volume&lt;/td&gt;
&lt;td&gt;Per-GB provisioned&lt;/td&gt;
&lt;td&gt;Sized for max queued events during downtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
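&lt;p&gt;The Synthetics run count in the table is plain arithmetic. Here is a small sketch, assuming a 30-day month and a hypothetical helper name, that reproduces it and lets you try other intervals.&lt;/p&gt;

```python
def canary_runs_per_month(interval_minutes: int, days: int = 30) -> int:
    """Number of canary runs in a month of `days` days at a fixed interval."""
    minutes_per_month = days * 24 * 60
    return minutes_per_month // interval_minutes


print(canary_runs_per_month(5))   # 5-minute interval
print(canary_runs_per_month(15))  # a cheaper 15-minute interval
```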

&lt;p&gt;&lt;strong&gt;Design decision for new deployments&lt;/strong&gt;: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).&lt;/p&gt;

&lt;h3&gt;
  
  
  NetworkOrigin decision table
&lt;/h3&gt;

&lt;p&gt;Based on &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;, the following decision criteria apply:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose VPC-origin when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All consumers are Lambda/ECS/EC2 inside the same VPC&lt;/li&gt;
&lt;li&gt;Private connectivity is mandatory (no internet-routed path allowed)&lt;/li&gt;
&lt;li&gt;VPC endpoint policy is part of the security boundary&lt;/li&gt;
&lt;li&gt;The network restriction must be built in, so that an accidental policy misconfiguration cannot loosen it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Internet-origin when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External accounts or on-premises clients need access&lt;/li&gt;
&lt;li&gt;Consumers are outside the bound VPC&lt;/li&gt;
&lt;li&gt;Internet-routed access with IAM controls is acceptable&lt;/li&gt;
&lt;li&gt;Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;VPC-origin&lt;/th&gt;
&lt;th&gt;Internet-origin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network enforcement&lt;/td&gt;
&lt;td&gt;Built-in explicit Deny for non-VPC traffic&lt;/td&gt;
&lt;td&gt;Policy-based only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC endpoint required&lt;/td&gt;
&lt;td&gt;Yes (Gateway or Interface in bound VPC)&lt;/td&gt;
&lt;td&gt;Only if using &lt;code&gt;aws:SourceVpc&lt;/code&gt; conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-VPC access&lt;/td&gt;
&lt;td&gt;Via Interface endpoint + peering/TGW to bound VPC&lt;/td&gt;
&lt;td&gt;Via policy conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change access scope&lt;/td&gt;
&lt;td&gt;Must recreate access point&lt;/td&gt;
&lt;td&gt;Update policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises access&lt;/td&gt;
&lt;td&gt;Via Interface endpoint in bound VPC&lt;/td&gt;
&lt;td&gt;Direct with IAM credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost implication&lt;/td&gt;
&lt;td&gt;VPC endpoint (Gateway=free, Interface=hourly)&lt;/td&gt;
&lt;td&gt;NAT Gateway if VPC Lambda needs access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.&lt;/p&gt;
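&lt;p&gt;Because the choice is permanent, it can help to write the criteria down as code and review them before the access point is created. The sketch below is a deliberately simplified reading of the criteria in this section; the function and its two inputs are hypothetical, and a real decision would weigh more factors (cost, endpoint policy, multi-VPC routing).&lt;/p&gt;

```python
def choose_network_origin(
    *,
    private_connectivity_mandatory: bool,
    consumers_outside_bound_vpc: bool,
) -> str:
    """Return "VPC" or "Internet" based on a simplified version of the criteria above."""
    if private_connectivity_mandatory:
        # Outside consumers then have to reach the bound VPC, e.g. through an Interface endpoint.
        return "VPC"
    if consumers_outside_bound_vpc:
        return "Internet"
    # All consumers are inside the bound VPC: VPC origin avoids the NAT Gateway path.
    return "VPC"


print(choose_network_origin(private_connectivity_mandatory=False, consumers_outside_bound_vpc=True))
```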

&lt;h3&gt;
  
  
  Phase 12 readiness by workload type
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Phase 12 Ready?&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Controlled PoC / single-account&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;All core components verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low/moderate event volume (&amp;lt; 100 events/day)&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;20-event burst validated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DRY_RUN guardrail validation&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;Safe to deploy immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets rotation validation&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;4-step rotation verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-volume replay storm (1000+ events)&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Throughput curve and store capacity not yet measured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-account production&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;OAM link deployed but second-account validation pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strict SLO operations requiring runbooks&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Dashboard deployed, runbooks not yet written&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live protobuf production mode&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Wire validation with NetApp support pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full EVENT_DRIVEN UC end-to-end&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Pipeline verified to SQS, Step Functions flow pending&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 runbook scope: first-response diagnostic bundle
&lt;/h3&gt;

&lt;p&gt;For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# FPolicy status&lt;/span&gt;
fpolicy show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt; &lt;span class="nt"&gt;-fields&lt;/span&gt; policy-name,status
fpolicy policy external-engine show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;
fpolicy persistent-store show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;

&lt;span class="c"&gt;# Connection and event state&lt;/span&gt;
fpolicy show-engine &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;
fpolicy show-passthrough-read-connection &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;

&lt;span class="c"&gt;# EMS logs for FPolicy events&lt;/span&gt;
event log show &lt;span class="nt"&gt;-messagename&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;fpolicy&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployed Infrastructure
&lt;/h2&gt;

&lt;p&gt;7 CloudFormation stacks deployed and verified:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-guardrails-table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;DynamoDB tracking table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-lineage-table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Data lineage DynamoDB + GSI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-slo-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;CloudWatch dashboard + 4 alarms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-oam-link&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-capacity-forecast&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Lambda + EventBridge schedule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-secrets-rotation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;VPC Lambda + rotation config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-synthetic-monitoring&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcihlbv8m756gbdrr1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcihlbv8m756gbdrr1o.png" alt="CloudFormation Stacks" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Results Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit Tests&lt;/td&gt;
&lt;td&gt;116&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property Tests (Hypothesis)&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFormation Deployments&lt;/td&gt;
&lt;td&gt;7 stacks&lt;/td&gt;
&lt;td&gt;AWS integration&lt;/td&gt;
&lt;td&gt;✅ All CREATE_COMPLETE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Invocations&lt;/td&gt;
&lt;td&gt;2 (forecast + rotation)&lt;/td&gt;
&lt;td&gt;AWS integration&lt;/td&gt;
&lt;td&gt;✅ Successful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy E2E&lt;/td&gt;
&lt;td&gt;1 pipeline test&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Event delivered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay E2E&lt;/td&gt;
&lt;td&gt;5 events&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Zero loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-file burst&lt;/td&gt;
&lt;td&gt;20 events&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Zero loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs found (property testing)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  NetApp-Specific Takeaways
&lt;/h2&gt;

&lt;p&gt;For NetApp users and partners evaluating this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy Persistent Store&lt;/strong&gt; works as the durability layer for asynchronous non-mandatory FPolicy policies (&lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;NetApp docs&lt;/a&gt;), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Points for FSx for ONTAP&lt;/strong&gt; are not standard S3 buckets: they support &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;selected S3 API operations&lt;/a&gt; including write operations (PutObject, DeleteObject, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NetworkOrigin is a design-time decision&lt;/strong&gt;. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP-common vs AWS-specific&lt;/strong&gt;: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational readiness&lt;/strong&gt; requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.&lt;/p&gt;

&lt;p&gt;The property-based testing investment paid immediate dividends: the 53 property tests uncovered 3 real bugs that example-based testing had missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.&lt;/p&gt;

&lt;p&gt;With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests, and validated end-to-end event delivery, Phase 12 delivers the operational maturity required for enterprise production workloads on FSx for ONTAP.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Previous phases&lt;/strong&gt;: &lt;a href="https://dev.to/yoshikifujiwara/fsx-for-ontap-s3-access-points-as-a-serverless-automation-boundary-ai-data-pipelines-ili"&gt;Phase 1&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/public-sector-use-cases-unified-output-destination-and-a-localization-batch-fsx-for-ontap-s3-2hmo"&gt;Phase 7&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/operational-hardening-ci-grade-validation-and-pattern-c-b-hybrid-fsx-for-ontap-s3-access-587h"&gt;Phase 8&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/production-rollout-vpc-endpoint-auto-detection-and-the-cdk-no-go-fsx-for-ontap-s3-access-3lni"&gt;Phase 9&lt;/a&gt; · &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt; · &lt;a href="https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8"&gt;Phase 11&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>amazonfsxfornetappontap</category>
      <category>s3accesspoints</category>
    </item>
    <item>
      <title>Everything is Under Control</title>
      <dc:creator>mgbec</dc:creator>
      <pubDate>Sun, 17 May 2026 16:43:44 +0000</pubDate>
      <link>https://dev.to/aws-builders/everything-is-under-control-gaf</link>
      <guid>https://dev.to/aws-builders/everything-is-under-control-gaf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8n26p31s1pbvs3wrviyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8n26p31s1pbvs3wrviyo.png" width="742" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m a control enthusiast, not a control freak. And control is part of my job description, so no apologies. As an enterprise, with all the new AI tools entering the atmosphere every day, we want to enable innovation and efficiency. We also need to have governance over these tools and their usage. Organizations want to make sure they minimize any potential risks, and of course, have observability into everything that is happening.&lt;/p&gt;

&lt;p&gt;I wanted to test an AgentCore Gateway workflow with multiple control mechanisms: &lt;a href="https://github.com/mgbec/CEDAR-plus-interceptor" rel="noopener noreferrer"&gt;https://github.com/mgbec/CEDAR-plus-interceptor&lt;/a&gt;. There are three pieces I put into play:&lt;/p&gt;

&lt;h3&gt;
  
  
  OAuth 2.1 (via Cognito) — “Who are you?”
&lt;/h3&gt;

&lt;p&gt;The problem it solves: Identity and authentication. Before the gateway can make any access decisions, it needs to know who’s making the request and verify they’re legitimate.&lt;/p&gt;

&lt;p&gt;What it does in this scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent (or user) authenticates against Cognito with their email and password.&lt;/li&gt;
&lt;li&gt;Cognito issues a JWT containing the user’s identity (&lt;code&gt;sub&lt;/code&gt;) and group memberships (&lt;code&gt;cognito:groups: ["engineering"]&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The gateway’s CUSTOM_JWT authorizer validates the token signature, expiry, audience, and issuer against Cognito’s OIDC discovery endpoint.&lt;/li&gt;
&lt;li&gt;If the token is invalid or missing, the gateway returns 401 immediately and nothing else runs.&lt;/li&gt;
&lt;/ul&gt;
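&lt;p&gt;To make the claim checks above concrete, here is a minimal Python sketch that decodes a JWT payload and checks issuer, audience, and expiry. It deliberately skips signature verification, which the gateway’s authorizer handles, so treat it as an inspection aid rather than an authorizer. The helper names (&lt;code&gt;decode_claims&lt;/code&gt;, &lt;code&gt;check_claims&lt;/code&gt;, &lt;code&gt;fake_token&lt;/code&gt;) and the example issuer are illustrative assumptions, not code from the repository.&lt;/p&gt;

```python
import base64
import json
import time


def decode_claims(token):
    """Decode the payload segment of a JWT WITHOUT verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore the padding first.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def check_claims(claims, issuer, audience, now=None):
    """Return a list of problems found in the claims (an empty list means OK)."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append("unexpected issuer")
    # Cognito ID tokens carry "aud"; access tokens carry "client_id" instead.
    if audience not in (claims.get("aud"), claims.get("client_id")):
        problems.append("unexpected audience")
    if now >= claims.get("exp", 0):
        problems.append("token expired")
    return problems


def fake_token(claims):
    """Build an unsigned, JWT-shaped string for local experiments (hypothetical helper)."""
    def seg(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return ".".join([seg({"alg": "none"}), seg(claims), "sig"])


if __name__ == "__main__":
    token = fake_token({
        "sub": "user-1",
        "iss": "https://issuer.example",  # placeholder issuer, not a real Cognito URL
        "client_id": "demo-client",
        "exp": time.time() + 3600,
        "cognito:groups": ["engineering"],
    })
    claims = decode_claims(token)
    print(claims["sub"], claims["cognito:groups"])
    print(check_claims(claims, "https://issuer.example", "demo-client"))
```

&lt;p&gt;Decoding a token this way is handy when debugging which groups a test user actually carries, but it proves nothing about authenticity. Only a verified signature does that.&lt;/p&gt;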

&lt;p&gt;What it can’t do: It has no opinion on what the authenticated user is allowed to do. A valid token from a marketing user looks the same as one from an admin at this layer — both pass authentication.&lt;/p&gt;

&lt;p&gt;One detail here confused me at first. Cognito returns both an ID token and an access token. The ID token tells the client application who the user is, while the access token tells the gateway about the application client and the scopes it was granted. The access token does not authorize the user to do anything beyond reaching the gateway, however. Its scope claim only gets the request past the gateway’s front door. It’s a binary check: “does this token have a valid scope?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxfzugkxbe4xsdmvrgh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxfzugkxbe4xsdmvrgh0.png" width="616" height="785"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Real-world analogy: The badge reader at the building entrance. It confirms you’re an employee, but doesn’t know which floors you’re allowed on.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cedar Policy — “Are you allowed to do this?”
&lt;/h3&gt;

&lt;p&gt;The problem it solves: Authorization. Given a verified identity with known group memberships, should this specific tool invocation be permitted?&lt;/p&gt;

&lt;p&gt;What it does in this scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads the &lt;code&gt;cognito:groups&lt;/code&gt; claim from the validated JWT to determine the principal.&lt;/li&gt;
&lt;li&gt;Evaluates Cedar rules: “Is &lt;code&gt;Group::"engineering"&lt;/code&gt; permitted &lt;code&gt;Action::"InvokeTool"&lt;/code&gt; on &lt;code&gt;Tool::"DatabaseTools___delete_records"&lt;/code&gt;?”&lt;/li&gt;
&lt;li&gt;Returns allow or deny based purely on the static policy set.&lt;/li&gt;
&lt;li&gt;Treats the forbid on &lt;code&gt;delete_records&lt;/code&gt; for engineers as absolute: no other rule can override it.&lt;/li&gt;
&lt;/ul&gt;
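&lt;p&gt;The “forbid is absolute” behavior follows from how Cedar combines policies: a request is allowed only if at least one permit matches and no forbid matches, and with no match at all the answer is deny. The toy Python model below mimics that decision rule. It is not the Cedar engine, and the policy dictionaries are an invented shape used only for illustration.&lt;/p&gt;

```python
def matches(policy, group, action, tool):
    """A policy field set to None acts as a wildcard in this toy model."""
    return all(
        want is None or want == got
        for want, got in (
            (policy["group"], group),
            (policy["action"], action),
            (policy["tool"], tool),
        )
    )


def is_allowed(policies, group, action, tool):
    """Apply Cedar-style combination: forbid overrides permit, default is deny."""
    hits = [p["effect"] for p in policies if matches(p, group, action, tool)]
    return "forbid" not in hits and "permit" in hits


# Hypothetical policy set: engineers may invoke any tool except delete_records.
POLICIES = [
    {"effect": "permit", "group": "engineering", "action": "InvokeTool", "tool": None},
    {"effect": "forbid", "group": "engineering", "action": "InvokeTool",
     "tool": "DatabaseTools___delete_records"},
]

if __name__ == "__main__":
    for group, tool in [
        ("engineering", "DatabaseTools___run_query"),
        ("engineering", "DatabaseTools___delete_records"),
        ("marketing", "DatabaseTools___run_query"),
    ]:
        print(group, tool, is_allowed(POLICIES, group, "InvokeTool", tool))
```

&lt;p&gt;The marketing request is denied even though nothing forbids it, because no permit matches. That is the default-deny behavior that also shows up later when policies reference a stale gateway ARN.&lt;/p&gt;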

&lt;p&gt;What it can’t do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can’t count how many times you’ve called a tool today.&lt;/li&gt;
&lt;li&gt;It can’t call an external service to check something.&lt;/li&gt;
&lt;li&gt;It can’t modify the request or response.&lt;/li&gt;
&lt;li&gt;It can’t make decisions based on the request body content (e.g., “only allow SELECT queries, not DELETE queries”).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Real-world analogy: The access control list on each floor. Engineering badges open the lab doors but not the server room. Marketing badges only open the conference rooms.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Request Interceptor (Rate Limiter Lambda): “Should we let this through right now?”
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/mgbec/CEDAR-plus-interceptor/tree/main/lambdas/rate-limiter" rel="noopener noreferrer"&gt;https://github.com/mgbec/CEDAR-plus-interceptor/tree/main/lambdas/rate-limiter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem it solves: Runtime enforcement that requires state, external lookups, or data transformation — things that can’t be expressed as static allow/deny rules.&lt;/p&gt;

&lt;p&gt;What it does in this scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs only after OAuth and Cedar have both passed (there is no point rate-limiting a request that would be denied anyway).&lt;/li&gt;
&lt;li&gt;Reads the user ID and group from the request context.&lt;/li&gt;
&lt;li&gt;Queries DynamoDB: “How many requests has this user made in the current hour?”&lt;/li&gt;
&lt;li&gt;Compares the count against the role-based quota (admins: 100, engineering: 50, marketing: 20).&lt;/li&gt;
&lt;li&gt;Either passes the request through or returns 429.&lt;/li&gt;
&lt;/ul&gt;
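&lt;p&gt;The steps above amount to a fixed-window counter keyed by user and hour. Here is a minimal Python sketch of that logic, with a plain dictionary standing in for the DynamoDB table. The key format, the function names, and the choice to reject unknown groups are assumptions for illustration, and the actual Lambda in the repository may differ.&lt;/p&gt;

```python
import time

# Requests per hour for each role, taken from the quotas listed above.
QUOTAS = {"admins": 100, "engineering": 50, "marketing": 20}


def window_key(user_id, now):
    """Bucket requests by user and clock hour, e.g. 'u1#2026-05-17T16'."""
    return user_id + "#" + time.strftime("%Y-%m-%dT%H", time.gmtime(now))


def check_rate_limit(counters, user_id, group, now=None):
    """Return 200 to pass the request through, or 429 when the quota is used up."""
    now = time.time() if now is None else now
    key = window_key(user_id, now)
    used = counters.get(key, 0)
    # Assumption: a group without a configured quota gets no requests at all.
    if used >= QUOTAS.get(group, 0):
        return 429
    # With DynamoDB this increment would need to be an atomic counter update.
    counters[key] = used + 1
    return 200


if __name__ == "__main__":
    table = {}  # stand-in for the DynamoDB table
    results = [check_rate_limit(table, "mkt-user", "marketing", now=0) for _ in range(21)]
    print(results.count(200), results.count(429))
    # One hour later the key changes, so the count starts again from zero.
    print(check_rate_limit(table, "mkt-user", "marketing", now=3600))
```

&lt;p&gt;Because the counter lives in the table rather than in the Lambda, it survives between invocations and between test runs within the same hour.&lt;/p&gt;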

&lt;p&gt;&lt;em&gt;Real-world analogy: The security guard who checks if the parking lot is full before letting your car in, even though your badge is valid and you’re allowed on that floor.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Response Interceptor (PII Redactor Lambda): “Is this role allowed to view PII?”
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/mgbec/CEDAR-plus-interceptor/tree/main/lambdas/pii-redactor" rel="noopener noreferrer"&gt;https://github.com/mgbec/CEDAR-plus-interceptor/tree/main/lambdas/pii-redactor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Lambda reads the user’s Cognito group and determines whether they are allowed to see PII based on that group membership. Mine is a pretty simple PII detector that only covers SSNs, credit card numbers, email addresses, and phone numbers. In production you would want something more robust.&lt;/p&gt;

&lt;p&gt;Depending on the caller’s group, the PII is redacted from responses before they reach the agent.&lt;/p&gt;
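&lt;p&gt;A regex-based redactor of the kind described here can be sketched in a few lines of Python. The patterns below are deliberately simplistic and US-centric, and the rule that only &lt;code&gt;admins&lt;/code&gt; see raw values is an assumption made for the example. The repository’s patterns and group rules may differ.&lt;/p&gt;

```python
import re

# Deliberately simple patterns; real PII detection needs far more care.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

# Assumption for this sketch: only members of "admins" may see raw PII.
PII_ALLOWED_GROUPS = {"admins"}


def redact(text, groups):
    """Return text unchanged for privileged groups, otherwise mask each PII match."""
    if PII_ALLOWED_GROUPS.intersection(groups):
        return text
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub("[REDACTED_" + label + "]", text)
    return text


if __name__ == "__main__":
    row = "SSN 123-45-6789, card 4111 1111 1111 1111, jane@example.com, 555-123-4567"
    print(redact(row, ["engineering"]))
    print(redact(row, ["admins"]))
```

&lt;p&gt;Patterns like these miss many formats and will produce false positives, which is why a managed or more thorough detector is the better choice for production.&lt;/p&gt;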

&lt;p&gt;Static access control is a poor fit for response handling, since a Cedar policy cannot modify a response. The reverse is also true: you could implement role-based permissions in a Lambda, but they’d be harder to audit, version, and reason about than Cedar policies.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Real-world analogy: On the way out of the building, the guard would check you for contraband items being removed from company premises.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Building (and Troubleshooting)
&lt;/h3&gt;

&lt;p&gt;Building this out took quite a bit of troubleshooting. I tried both CDK and Terraform. Terraform seemed to work better, but some resources were still problematic. Kiro was incredibly helpful with debugging, and part of the trouble may have been user error. The issues that appeared to be real were:&lt;/p&gt;

&lt;p&gt;Rate limit counters persist across tests — DynamoDB counters use a 1-hour window. If you test marketing (limit 20) and then test again in the same hour, the counter is already at 20+ and everything gets blocked immediately. Clear the table between test runs or wait for the next hour.&lt;/p&gt;

&lt;p&gt;UpdateGateway replaces everything: the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_UpdateGateway.html" rel="noopener noreferrer"&gt;UpdateGateway API&lt;/a&gt; is a full replacement, not a patch. If you call it to attach the policy engine but don’t include &lt;code&gt;interceptorConfigurations&lt;/code&gt;, the interceptor gets wiped. Every update must pass through ALL existing fields. This caused my interceptors to disappear multiple times.&lt;/p&gt;
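&lt;p&gt;One way to guard against this is to always start from the gateway’s current configuration and layer the change on top before sending the update. The Python sketch below shows only that merge step with plain dictionaries. The dictionary shape and the &lt;code&gt;policy_engine_placeholder&lt;/code&gt; key are hypothetical, and only &lt;code&gt;interceptorConfigurations&lt;/code&gt; is a field name taken from the text above, so check the API reference for the real request fields.&lt;/p&gt;

```python
import copy


def build_full_update(current, changes):
    """Merge changes into a copy of the current config so no existing field is dropped."""
    merged = copy.deepcopy(current)
    merged.update(copy.deepcopy(changes))
    return merged


# Hypothetical shape of a gateway description. Only interceptorConfigurations
# is a field name that appears in the article; the values are placeholders.
current_gateway = {
    "name": "demo-gateway",
    "interceptorConfigurations": [{"example": "rate-limiter"}],
}

# Hypothetical key standing in for whatever field attaches the policy engine.
changes = {"policy_engine_placeholder": {"example": "cedar-engine"}}

payload = build_full_update(current_gateway, changes)
print(sorted(payload))
```

&lt;p&gt;Sending &lt;code&gt;changes&lt;/code&gt; alone is exactly the mistake described above: on a full-replacement API, the missing interceptor list would be read as “no interceptors”.&lt;/p&gt;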

&lt;p&gt;Cedar policy entity types: &lt;code&gt;AgentCore::Group&lt;/code&gt; doesn’t exist. The valid principal type is &lt;code&gt;AgentCore::OAuthUser&lt;/code&gt;. Group membership is checked via tags: &lt;code&gt;principal.hasTag("cognito:groups") &amp;amp;&amp;amp; principal.getTag("cognito:groups") like "*engineering*"&lt;/code&gt;. See &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-understanding-cedar.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-understanding-cedar.html&lt;/a&gt;.&lt;/p&gt;
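&lt;p&gt;Cedar’s &lt;code&gt;like&lt;/code&gt; operator treats &lt;code&gt;*&lt;/code&gt; as a wildcard for any run of characters, so the tag check above is effectively a substring match. The Python approximation below shows what that implies. It assumes the tag value is the groups claim rendered as a single string, which is a guess about the serialization, and it ignores Cedar’s escape syntax for literal asterisks.&lt;/p&gt;

```python
import re


def cedar_like(value, pattern):
    """Approximate Cedar's like operator: * matches any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


# Hypothetical tag values: the groups claim rendered as one string.
print(cedar_like('["engineering"]', "*engineering*"))          # True
print(cedar_like('["marketing"]', "*engineering*"))            # False
print(cedar_like('["reverse-engineering"]', "*engineering*"))  # True (substring match)
```

&lt;p&gt;The last line is the one to keep in mind when naming groups: any group whose name merely contains “engineering” would satisfy the same policy.&lt;/p&gt;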

&lt;p&gt;Tool-specific policies require the exact gateway ARN. You can’t use “resource is AgentCore::Gateway” for tool-scoped policies — the API rejects it. And when the gateway gets recreated (new ID), all policies become stale and need to be recreated with the new ARN.&lt;/p&gt;

&lt;p&gt;Gateway recreation breaks policy references: when Terraform recreates the gateway (e.g., &lt;code&gt;terraform apply -replace&lt;/code&gt;), it gets a new ID and ARN. All Cedar policies that reference the old gateway ARN stop matching (default-deny kicks in). You have to delete and recreate the policies with the new ARN.&lt;/p&gt;

&lt;p&gt;From my understanding, the gateway ARN coupling is by design (security isolation between gateways). The best practice is to treat the gateway as a long-lived resource and avoid recreating it.&lt;/p&gt;

&lt;p&gt;Using a combination of scripts and Terraform seemed to work best for me, as long as I remembered the correct order of operations. The danger zone is when either tool updates the gateway, because it can wipe what the other one set. The safest workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;terraform apply (creates/updates gateway shell)&lt;/li&gt;
&lt;li&gt;create-policies.sh (attaches policy engine + interceptor, preserving existing config)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Observability (and Troubleshooting)
&lt;/h3&gt;

&lt;p&gt;My first test was a bit of a failure. There is a small amount of observability built into the output of the tests, so we can at least see that things did not go as planned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7yx63etvlqog83jzfs6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7yx63etvlqog83jzfs6.png" width="768" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, one of the best things about AgentCore is all of the detailed observability baked into the components.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbix4sig9wka13kj3dey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbix4sig9wka13kj3dey.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can even dig down into the trace level to watch our policies in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ohqhz7pe28tlpsy2wbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ohqhz7pe28tlpsy2wbi.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can look at the bigger picture of our gateway performance with metrics like denied and allowed policy decisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15bzk8o5nuk4a7tuf3ov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15bzk8o5nuk4a7tuf3ov.png" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One important thing to note for observability of the PII redactor response interceptor:&lt;br&gt;&lt;br&gt;
The traces and logs capture the response from the Lambda target, which contains the full unredacted PII. The PII redactor runs after that, as the last step before the client receives the response. The observability system records what the Lambda returned, not what the client ultimately saw.&lt;/p&gt;

&lt;p&gt;The flow is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lambda returns full PII
    │
    ├──→ CloudWatch logs/traces capture THIS (unredacted)
    │
    ▼
PII Redactor intercepts
    │
    ▼
Client receives redacted response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is actually correct from a security audit perspective — you want the logs to show the full data so that security teams can audit what data was accessed. You can verify the redactor is working by comparing logs versus client response. To quickly see what is returned to the client, you can manually set the token and Gateway URL and then test with curl.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;TOKEN=$(./scripts/get-token.sh engineer@example.com 2&amp;gt;/dev/null)
GATEWAY_URL=$(terraform -chdir=terraform output -raw gateway_url)
curl -s -X POST "$GATEWAY_URL" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "DatabaseTools___run_query", "arguments": {"sql": "SELECT * FROM users", "database": "analytics"}}}' \
  | jq -r '.result.content[0].text' | python3 -m json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, try with an admin user, which should receive unredacted data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;TOKEN=$(./scripts/get-token.sh admin@example.com 2&amp;gt;/dev/null)
GATEWAY_URL=$(terraform -chdir=terraform output -raw gateway_url)
curl -s -X POST "$GATEWAY_URL" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "DatabaseTools___run_query", "arguments": {"sql": "SELECT * FROM users", "database": "analytics"}}}' \
  | jq -r '.result.content[0].text' | python3 -m json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
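&lt;p&gt;If you would rather script the comparison than run curl by hand, the same JSON-RPC call can be built with Python’s standard library. This is a sketch under the assumption that the gateway accepts exactly the request shown in the commands above. The function names are illustrative, and the request is only constructed here, not sent.&lt;/p&gt;

```python
import json
import urllib.request


def build_tool_call(gateway_url, token, sql, database):
    """Build the same tools/call request that the curl commands send."""
    body = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "DatabaseTools___run_query",
            "arguments": {"sql": sql, "database": database},
        },
    }
    return urllib.request.Request(
        gateway_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_text(response_body):
    """Mirror the jq filter .result.content[0].text used in the commands above."""
    return json.loads(response_body)["result"]["content"][0]["text"]


if __name__ == "__main__":
    # Placeholder URL and token; substitute the values from get-token.sh and Terraform.
    req = build_tool_call("https://gateway.example/mcp", "TOKEN", "SELECT * FROM users", "analytics")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

&lt;p&gt;Passing the request to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; once per user and feeding each response body to &lt;code&gt;extract_text&lt;/code&gt; gives two strings you can compare to confirm that only the non-admin result is redacted.&lt;/p&gt;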

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80xbzubwr96tm0jtsxac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80xbzubwr96tm0jtsxac.png" width="533" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, do I feel like I have things completely under control? Not really, on many levels, but that may be a personal issue. AgentCore Gateway, together with OAuth 2.1, Cedar policies, and Lambda interceptors, helps us with constraints and oversight and gives us some assistance with governance. Again, as we have heard over and over, this is such a dynamic field. I’m looking forward to the evolution of our GenAI and cybersecurity fields and the technological transformations we will see. Thanks for reading!&lt;/p&gt;

</description>
      <category>amazonbedrock</category>
      <category>aigovernance</category>
      <category>ai</category>
      <category>amazonbedrockagentco</category>
    </item>
    <item>
      <title>Event-Driven Ransomware Detection with ONTAP ARP + Datadog</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 09:16:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda</link>
      <guid>https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;ONTAP's Autonomous Ransomware Protection (ARP) detects encryption patterns at the storage layer. When ARP fires, an EMS event is pushed via webhook to API Gateway → Lambda → Datadog. In my validation environment, end-to-end latency was around 30 seconds. This post shows how to wire it up, what the alert looks like, and how to respond.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Threat Model
&lt;/h2&gt;

&lt;p&gt;Ransomware can encrypt hundreds or thousands of files per minute. Traditional detection (antivirus signatures, host-based EDR) often catches it only after significant damage is done.&lt;/p&gt;

&lt;p&gt;What if your &lt;em&gt;storage&lt;/em&gt; could detect the encryption pattern before the host-based tools react?&lt;/p&gt;

&lt;p&gt;That's exactly what ONTAP Autonomous Ransomware Protection (ARP) does. It runs ML-based entropy analysis at the storage layer, detecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sudden spikes in file entropy (encryption)&lt;/li&gt;
&lt;li&gt;Mass file extension changes (&lt;code&gt;.docx&lt;/code&gt; → &lt;code&gt;.encrypted&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Abnormal write patterns inconsistent with normal workload behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When ARP detects an attack, it changes the volume state to &lt;code&gt;attack-detected&lt;/code&gt; and fires an EMS event. Our job is to get that event to the security team in seconds, not hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Detection Pipeline
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2&lt;/a&gt;, we built the audit log pipeline and showed Datadog search queries for file access events. Now we turn those patterns into event-driven security alerting — starting with ONTAP's most powerful detection signal: Autonomous Ransomware Protection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ONTAP ARP detects encryption behavior
    │
    ▼ EMS event: arw.volume.state (severity: alert)
ONTAP EMS Webhook (HTTPS POST)
    │
    ▼
API Gateway (REST endpoint)
    │
    ▼
Lambda (EMS handler)
    │
    ▼ normalize → format → ship
Datadog Logs API v2 (source:fsxn-ems)
    │
    ▼
Datadog Monitor → PagerDuty / Slack / Email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;End-to-end latency: &lt;strong&gt;around 30 seconds&lt;/strong&gt; in my validation environment (ap-northeast-1). Your latency will vary depending on ONTAP event delivery, API Gateway/Lambda behavior, Datadog ingest latency, and notification routing.&lt;/p&gt;

&lt;p&gt;Compare this to the audit log path (Part 2), which depends on rotation interval + scheduler frequency. EMS webhooks are event-driven rather than scheduled, delivering alerts within seconds rather than minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the EMS Integration
&lt;/h2&gt;

&lt;p&gt;The EMS Lambda is deployed alongside the FPolicy shipping Lambda in a single stack. Note that the FPolicy TCP listener itself remains a separate ECS Fargate-based path (as described in Part 1) because ONTAP FPolicy requires a persistent TCP connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template-ems-fpolicy.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-ems-fpolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogSite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap1.datadoghq.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Gets Created
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EMS Lambda&lt;/td&gt;
&lt;td&gt;Receives EMS webhooks, normalizes, ships to Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy Lambda&lt;/td&gt;
&lt;td&gt;Receives FPolicy events from SQS, ships to Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway (from shared EMS webhook stack)&lt;/td&gt;
&lt;td&gt;HTTPS endpoint for ONTAP EMS webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Roles&lt;/td&gt;
&lt;td&gt;Least-privilege for each Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Log Groups&lt;/td&gt;
&lt;td&gt;Execution logs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Webhook Security
&lt;/h3&gt;

&lt;p&gt;For production, do not expose an unauthenticated webhook endpoint. ONTAP EMS webhook destinations support HTTPS and mutual authentication options. Use HTTPS for the API Gateway endpoint, restrict access where possible, and consider validating a shared secret or header in the Lambda handler.&lt;/p&gt;
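&lt;p&gt;As a rough illustration of the shared-secret idea, the Python sketch below checks a header on the API Gateway proxy event before any EMS processing happens. The header name &lt;code&gt;x-webhook-secret&lt;/code&gt; and the function name are hypothetical, and how the secret gets attached on the ONTAP side is not covered here, so treat this as one possible shape rather than the repository’s implementation.&lt;/p&gt;

```python
import hmac


def is_authorized(event, expected_secret, header_name="x-webhook-secret"):
    """Check a shared-secret header on an API Gateway proxy event."""
    headers = event.get("headers") or {}
    # Header casing is not guaranteed, so compare names case-insensitively.
    wanted = header_name.lower()
    supplied = next((v for k, v in headers.items() if k.lower() == wanted), "")
    # compare_digest avoids leaking the secret through timing differences.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())


if __name__ == "__main__":
    good = {"headers": {"X-Webhook-Secret": "s3cret"}}
    bad = {"headers": {"X-Webhook-Secret": "guess"}}
    print(is_authorized(good, "s3cret"), is_authorized(bad, "s3cret"))
```

&lt;p&gt;A handler would call this first and return a 401 or 403 response when it fails. The expected secret belongs in Secrets Manager alongside the Datadog API key rather than in the code.&lt;/p&gt;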

&lt;h3&gt;
  
  
  ONTAP EMS Configuration
&lt;/h3&gt;

&lt;p&gt;After deployment, configure ONTAP EMS to forward ARP-related events to the API Gateway endpoint. At minimum, include &lt;code&gt;arw.volume.state&lt;/code&gt; and other &lt;code&gt;arw.*&lt;/code&gt; events you want to monitor. Refer to the &lt;a href="https://docs.netapp.com/us-en/ontap/error-messages/configure-webhooks-event-notifications-task.html" rel="noopener noreferrer"&gt;NetApp EMS webhook documentation&lt;/a&gt; for destination and filter configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The EMS Lambda Handler
&lt;/h2&gt;

&lt;p&gt;The handler receives an API Gateway proxy event containing the EMS webhook payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Process EMS webhook from ONTAP via API Gateway.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_request_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMS handler invoked: requestId=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract EMS events from webhook body
&lt;/span&gt;    &lt;span class="n"&gt;ems_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_ems_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parsed %d EMS event(s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ems_events&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Normalize to common schema
&lt;/span&gt;    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_normalize_ems_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ems_events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Format for Datadog
&lt;/span&gt;    &lt;span class="n"&gt;dd_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_format_for_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Ship to Datadog
&lt;/span&gt;    &lt;span class="n"&gt;shipped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_ship_to_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dd_logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_api_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMS events processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ems_events&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shipped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EMS Event Normalization
&lt;/h3&gt;

&lt;p&gt;ONTAP EMS events arrive with fields such as &lt;code&gt;messageName&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;, &lt;code&gt;node&lt;/code&gt;, &lt;code&gt;svmName&lt;/code&gt;, and &lt;code&gt;parameters&lt;/code&gt;. The handler maps them to a common internal schema and falls back to a default when a field is missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_normalize_ems_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Normalize raw EMS events to internal schema.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messageName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svmName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Datadog Formatting (source:fsxn-ems)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_format_for_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Format normalized EMS events for Datadog Logs API v2.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;dd_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dd_logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ddsource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fsxn-ems&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ddtags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source:fsxn-ems,service:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DD_SERVICE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,env:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DD_ENV&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hostname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DD_SERVICE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attributes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dd_logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ARP Event Payload (Normalized by Lambda)
&lt;/h2&gt;

&lt;p&gt;ONTAP EMS webhooks deliver event notifications to the API Gateway endpoint. The Lambda's &lt;code&gt;_extract_ems_events()&lt;/code&gt; function parses the incoming API Gateway proxy event body, then &lt;code&gt;_normalize_ems_events()&lt;/code&gt; produces the following internal schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arw.volume.state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_node"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fsxn-node-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"svm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"svm-prod-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-17T01:04:22Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Anti-ransomware: Volume vol_data state changed to attack-detected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"volume_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vol_data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"attack-detected"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



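&lt;p&gt;In Datadog, the log carries these tags and attributes (here with &lt;code&gt;DD_SERVICE&lt;/code&gt; set to &lt;code&gt;fsxn-ontap&lt;/code&gt;):&lt;br&gt;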
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-ems&lt;/span&gt;
&lt;span class="py"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-node-01&lt;/span&gt;
&lt;span class="py"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-ontap&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.event_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;arw.volume.state&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.svm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;svm-prod-01&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.parameters.volume_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;vol_data&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.parameters.state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;attack-detected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6938t9vha04oxhtypb5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6938t9vha04oxhtypb5q.png" alt="ARP event in Datadog Log Explorer" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffptforpw8q8ts2uiuyw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffptforpw8q8ts2uiuyw3.png" alt="ARP event detail panel" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Datadog Monitor
&lt;/h2&gt;

&lt;p&gt;Create a Monitor that triggers whenever ARP reports an &lt;code&gt;attack-detected&lt;/code&gt; volume state:&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Log Explorer search query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Datadog Monitor API JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"🚨 FSx for ONTAP: Ransomware Detected (ARP)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"log alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"logs(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;).index(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;*&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;).rollup(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;).last(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;5m&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;) &amp;gt; 0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"🚨 ONTAP Autonomous Ransomware Protection detected suspicious activity.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;**Volume**: {{attributes.parameters.volume_name}}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;**SVM**: {{attributes.svm}}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;**Node**: {{host}}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;**Time**: {{date}}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;## Recommended Actions&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;1. Verify the ARP event in ONTAP and Datadog.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;2. Check FPolicy/audit logs for user/client IP correlation.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;3. Follow the approved storage incident response runbook for snapshot, access restriction, or recovery actions.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;@pagerduty @slack-security-alerts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"thresholds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"notify_no_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_delay"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
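
&lt;p&gt;The JSON above can be submitted to the Datadog Monitors API (&lt;code&gt;POST /api/v1/monitor&lt;/code&gt;). Here is a minimal sketch that builds the definition and the request with the standard library. The function names are hypothetical, the notification message is shortened, and the default site is assumed to be US1.&lt;/p&gt;

```python
import json
import urllib.request

# Same search as the Log Explorer query above
ARP_SEARCH = (
    "source:fsxn-ems "
    "@attributes.event_name:arw.volume.state "
    "@attributes.parameters.state:attack-detected"
)


def build_arp_monitor(window: str = "5m") -> dict:
    """Monitor definition mirroring the JSON above, with the window as a parameter."""
    return {
        "name": "FSx for ONTAP: Ransomware Detected (ARP)",
        "type": "log alert",
        "query": f'logs("{ARP_SEARCH}").index("*").rollup("count").last("{window}") > 0',
        # Shortened message; paste the full notification template here
        "message": (
            "ONTAP Autonomous Ransomware Protection detected suspicious activity. "
            "@pagerduty @slack-security-alerts"
        ),
        "options": {
            "thresholds": {"critical": 0},
            "notify_no_data": False,
            "evaluation_delay": 0,
        },
    }


def build_create_monitor_request(
    monitor: dict, api_key: str, app_key: str, site: str = "datadoghq.com"
) -> urllib.request.Request:
    """Build (but do not send) the POST that creates the monitor."""
    return urllib.request.Request(
        url=f"https://api.{site}/api/v1/monitor",
        data=json.dumps(monitor).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )
```

&lt;p&gt;Passing the returned request to &lt;code&gt;urllib.request.urlopen()&lt;/code&gt; would create the monitor. Unlike log intake, the Monitors API requires an application key in addition to the API key.&lt;/p&gt;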



&lt;h3&gt;
  
  
  What This Monitor Does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggers on&lt;/strong&gt;: Any &lt;code&gt;arw.volume.state&lt;/code&gt; event with &lt;code&gt;state:attack-detected&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold&lt;/strong&gt;: Critical when count &amp;gt; 0 in a 5-minute window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification&lt;/strong&gt;: PagerDuty + Slack with volume name, SVM, and response steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-data handling&lt;/strong&gt;: Disabled (absence of ARP events is normal)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Adjust template variables (&lt;code&gt;{{attributes.*}}&lt;/code&gt;, &lt;code&gt;{{host}}&lt;/code&gt;, &lt;code&gt;{{date}}&lt;/code&gt;) based on how your Datadog site renders log attributes in monitor notifications. Test with a simulated event before relying on production alerts.&lt;/p&gt;
&lt;/blockquote&gt;
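
&lt;p&gt;One way to produce such a simulated event is to build a raw EMS-style payload and wrap it in an API Gateway proxy event, then invoke the handler in a non-production environment. The sketch below uses the field names the handler reads (&lt;code&gt;messageName&lt;/code&gt;, &lt;code&gt;svmName&lt;/code&gt;, and so on). The function names, the test volume and SVM names, and the assumption that the handler accepts a single JSON object as the body are all hypothetical.&lt;/p&gt;

```python
import json
import uuid
from datetime import datetime, timezone


def build_simulated_arp_event(volume: str = "vol_test", svm: str = "svm-test-01") -> dict:
    """Raw EMS-style event using the field names the handler reads."""
    return {
        "messageName": "arw.volume.state",
        "severity": "alert",
        "node": "fsxn-node-test",
        "svmName": svm,
        # The prefix makes the test event easy to spot in Log Explorer
        "message": f"[SIMULATED] Anti-ransomware: Volume {volume} state changed to attack-detected",
        "parameters": {"volume_name": volume, "state": "attack-detected"},
        "time": datetime.now(timezone.utc).isoformat(),
    }


def build_proxy_event(ems_event: dict) -> dict:
    """Wrap the EMS event the way a Lambda proxy integration delivers it."""
    return {
        "httpMethod": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(ems_event),
        "isBase64Encoded": False,
        "requestContext": {"requestId": f"simulated-{uuid.uuid4()}"},
    }
```

&lt;p&gt;Calling &lt;code&gt;lambda_handler(build_proxy_event(build_simulated_arp_event()), None)&lt;/code&gt; should then push one log that matches the monitor query. Because it matches, it will also notify PagerDuty and Slack unless the monitor is muted or pointed at a test channel first.&lt;/p&gt;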

&lt;h2&gt;
  
  
  FPolicy: The Complementary Signal
&lt;/h2&gt;

&lt;p&gt;While ARP detects the encryption &lt;em&gt;pattern&lt;/em&gt;, FPolicy provides the file-level &lt;em&gt;detail&lt;/em&gt;. Together they answer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is ransomware active?&lt;/td&gt;
&lt;td&gt;ARP (EMS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which files are affected?&lt;/td&gt;
&lt;td&gt;FPolicy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who is doing it?&lt;/td&gt;
&lt;td&gt;FPolicy (&lt;code&gt;user&lt;/code&gt; field)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;From where?&lt;/td&gt;
&lt;td&gt;FPolicy (&lt;code&gt;client_ip&lt;/code&gt; field)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What operations?&lt;/td&gt;
&lt;td&gt;FPolicy (&lt;code&gt;operation&lt;/code&gt;: create, write, rename, delete)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  FPolicy Event in Datadog
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;fsxn-fpolicy&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;create&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;/vol/data/finance/confidential_report.xlsx&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;suspicious_user@corp.local&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.client_ip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;10.0.1.55&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="py"&gt;attributes.protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;cifs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda2xm6pmhu8bqw5883pj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda2xm6pmhu8bqw5883pj.png" alt="FPolicy events in Datadog" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlation Query
&lt;/h3&gt;

&lt;p&gt;After an ARP alert, investigate with FPolicy data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source:fsxn-fpolicy @attributes.svm:svm-prod-01 @attributes.operation:(create OR write OR rename)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



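&lt;p&gt;This returns the create, write, and rename operations on the affected SVM, which helps identify the responsible user and client. Add &lt;code&gt;delete&lt;/code&gt; to the operation list to include deletions as well.&lt;/p&gt;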
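
&lt;p&gt;During an incident it can help to generate this query from the SVM named in the alert instead of typing it by hand. The sketch below builds the query string and a request body for the Datadog Logs Search API (&lt;code&gt;POST /api/v2/logs/events/search&lt;/code&gt;). The function names, the 15-minute lookback, and the page size are illustrative assumptions.&lt;/p&gt;

```python
def build_fpolicy_query(svm: str, operations: list[str]) -> str:
    """Build the FPolicy correlation query for one SVM."""
    ops = " OR ".join(operations)
    return f"source:fsxn-fpolicy @attributes.svm:{svm} @attributes.operation:({ops})"


def build_search_body(query: str, since: str = "now-15m", limit: int = 100) -> dict:
    """Request body for the Logs Search API, oldest events first."""
    return {
        "filter": {"query": query, "from": since, "to": "now"},
        "sort": "timestamp",
        "page": {"limit": limit},
    }


if __name__ == "__main__":
    query = build_fpolicy_query("svm-prod-01", ["create", "write", "rename"])
    print(query)
    print(build_search_body(query))
```

&lt;p&gt;Sorting by ascending timestamp puts the earliest matching operations first, which is usually where the initial access shows up.&lt;/p&gt;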

&lt;h2&gt;
  
  
  Incident Response Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;1.&lt;/span&gt; ARP fires → EMS webhook → Datadog alert (around 30 seconds)
     │
&lt;span class="p"&gt;2.&lt;/span&gt; Responder receives PagerDuty/Slack notification
     │
&lt;span class="p"&gt;3.&lt;/span&gt; Verify in Datadog and ONTAP:
&lt;span class="p"&gt;   -&lt;/span&gt; source:fsxn-ems → confirm ARP event details
&lt;span class="p"&gt;   -&lt;/span&gt; source:fsxn-fpolicy → identify user, IP, affected files
&lt;span class="p"&gt;   -&lt;/span&gt; ONTAP: security anti-ransomware volume show
     │
&lt;span class="p"&gt;4.&lt;/span&gt; Correlate and assess:
&lt;span class="p"&gt;   -&lt;/span&gt; Is this a true positive or legitimate bulk operation?
&lt;span class="p"&gt;   -&lt;/span&gt; What is the blast radius (volumes, files, users)?
     │
&lt;span class="p"&gt;5.&lt;/span&gt; Containment (only after verification, per approved runbook):
&lt;span class="p"&gt;   -&lt;/span&gt; Create snapshot (preserve recovery point)
&lt;span class="p"&gt;   -&lt;/span&gt; Restrict volume access if confirmed malicious
&lt;span class="p"&gt;   -&lt;/span&gt; Review ARP suspect list
     │
&lt;span class="p"&gt;6.&lt;/span&gt; Recovery:
&lt;span class="p"&gt;   -&lt;/span&gt; Restore from snapshot (pre-attack state)
&lt;span class="p"&gt;   -&lt;/span&gt; Re-enable access after containment
&lt;span class="p"&gt;   -&lt;/span&gt; Update audit policies if gaps found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
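
&lt;p&gt;Step 4 asks for the blast radius. As an illustration, FPolicy events exported from Datadog could be summarized by user, client, and file count. This sketch assumes each event has already been flattened to a dict with the &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;client_ip&lt;/code&gt;, and &lt;code&gt;file_path&lt;/code&gt; keys shown earlier. The function name and output shape are hypothetical.&lt;/p&gt;

```python
from collections import Counter


def summarize_blast_radius(fpolicy_events: list[dict], top_n: int = 3) -> dict:
    """Count operations per user and client, and distinct files touched."""
    users = Counter()
    clients = Counter()
    files = set()
    for event in fpolicy_events:
        users[event.get("user", "unknown")] += 1
        clients[event.get("client_ip", "unknown")] += 1
        if "file_path" in event:
            files.add(event["file_path"])
    return {
        "top_users": users.most_common(top_n),
        "top_clients": clients.most_common(top_n),
        "distinct_files": len(files),
    }


if __name__ == "__main__":
    sample = [
        {"user": "suspicious_user@corp.local", "client_ip": "10.0.1.55", "file_path": "/vol/data/a.xlsx"},
        {"user": "suspicious_user@corp.local", "client_ip": "10.0.1.55", "file_path": "/vol/data/b.xlsx"},
        {"user": "other_user@corp.local", "client_ip": "10.0.1.60", "file_path": "/vol/data/a.xlsx"},
    ]
    print(summarize_blast_radius(sample))
```

&lt;p&gt;A single user or client IP accounting for most of the operations points toward a compromised account or host, while a flat distribution is more consistent with a legitimate bulk job.&lt;/p&gt;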



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: ARP alerts are high-confidence signals, but false positives can occur (e.g., legitimate backup encryption, bulk file operations). Always verify before applying disruptive containment actions such as restricting volume access. Follow your organization's incident response process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a more detailed role-based runbook, see the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/arp-incident-response-guide.md" rel="noopener noreferrer"&gt;ARP Incident Response Guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond ARP: Other EMS Use Cases
&lt;/h2&gt;

&lt;p&gt;The same EMS webhook pipeline handles other critical ONTAP events:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EMS Event&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arw.volume.state&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;alert&lt;/td&gt;
&lt;td&gt;Ransomware detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wafl.quota.softlimit.exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;warning&lt;/td&gt;
&lt;td&gt;Capacity planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wafl.quota.hardlimit.exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;Immediate capacity action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cf.fsm.takeover&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;alert&lt;/td&gt;
&lt;td&gt;HA failover notification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sms.vol.full&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;Volume full — data at risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;net.linkDown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;warning&lt;/td&gt;
&lt;td&gt;Network connectivity issue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All arrive in Datadog as &lt;code&gt;source:fsxn-ems&lt;/code&gt; with the event name in &lt;code&gt;@attributes.event_name&lt;/code&gt;, enabling targeted Monitors for each scenario. For the full cross-vendor field mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/normalized-event-schema.md" rel="noopener noreferrer"&gt;Normalized Event Schema&lt;/a&gt;.&lt;/p&gt;
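
&lt;p&gt;As a rough sketch of how per-event Monitors could be derived from the table above, the snippet below builds one log search query per EMS event name. The &lt;code&gt;monitor_query&lt;/code&gt; helper and the &lt;code&gt;EMS_EVENTS&lt;/code&gt; mapping are illustrative names, not part of the repository; only the &lt;code&gt;source:fsxn-ems&lt;/code&gt; tag and the &lt;code&gt;@attributes.event_name&lt;/code&gt; attribute come from this post.&lt;/p&gt;

```python
# Event names and use cases are copied from the table above.
# EMS_EVENTS and monitor_query are hypothetical names for illustration.
EMS_EVENTS = {
    "arw.volume.state": "Ransomware detection",
    "wafl.quota.softlimit.exceeded": "Capacity planning",
    "wafl.quota.hardlimit.exceeded": "Immediate capacity action",
    "cf.fsm.takeover": "HA failover notification",
    "sms.vol.full": "Volume full",
    "net.linkDown": "Network connectivity issue",
}


def monitor_query(event_name):
    """Build a Datadog log search query that matches one EMS event name."""
    return f"source:fsxn-ems @attributes.event_name:{event_name}"


if __name__ == "__main__":
    for name, use_case in EMS_EVENTS.items():
        print(f"{use_case}: {monitor_query(name)}")
```

&lt;p&gt;Each query string can be pasted into Log Explorer first to confirm it matches real events before it is used as the basis for a Monitor.&lt;/p&gt;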

&lt;h2&gt;
  
  
  Validation Results
&lt;/h2&gt;

&lt;p&gt;This integration was validated end-to-end:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ARP event → Datadog&lt;/td&gt;
&lt;td&gt;✅ Arrived&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quota exceeded → Datadog&lt;/td&gt;
&lt;td&gt;✅ Arrived&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy file create → Datadog&lt;/td&gt;
&lt;td&gt;✅ Arrived (via SQS → Lambda path)&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda error handling&lt;/td&gt;
&lt;td&gt;✅ DLQ capture&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key from Secrets Manager&lt;/td&gt;
&lt;td&gt;✅ Cached&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Validation performed in ap-northeast-1 with the deployed &lt;code&gt;fsxn-datadog-ems-fpolicy&lt;/code&gt; stack.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaw8jtsuq80xpwmutw7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaw8jtsuq80xpwmutw7y.png" alt="Lambda execution logs" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Considerations for Security Teams
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Webhook security&lt;/strong&gt;: Use HTTPS for EMS webhook delivery. Do not expose an unauthenticated API Gateway endpoint in production. Validate a shared secret, header, or mTLS identity where possible.&lt;/p&gt;
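
&lt;p&gt;A minimal sketch of the shared-secret check, assuming the webhook request carries the secret in a custom header. The header name &lt;code&gt;x-webhook-secret&lt;/code&gt; and the &lt;code&gt;is_authorized&lt;/code&gt; helper are hypothetical, and whether your EMS webhook destination can attach such a header is something to verify in your environment.&lt;/p&gt;

```python
import hmac

# Hypothetical header name; use whatever your webhook sender actually provides.
SECRET_HEADER = "x-webhook-secret"


def is_authorized(headers, expected_secret):
    """Return True only if the request carries the expected shared secret."""
    # Header names can arrive in mixed case, so normalize before lookup.
    normalized = {str(k).lower(): str(v) for k, v in (headers or {}).items()}
    provided = normalized.get(SECRET_HEADER, "")
    if not expected_secret or not provided:
        return False
    # compare_digest runs in constant time, so it does not leak how many
    # leading characters of the secret matched.
    return hmac.compare_digest(
        provided.encode("utf-8"), expected_secret.encode("utf-8")
    )
```

&lt;p&gt;In a Lambda handler this check would run before any parsing of the request body, returning a 401 or 403 response when it fails. Load the expected secret from Secrets Manager rather than hard-coding it.&lt;/p&gt;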

&lt;p&gt;&lt;strong&gt;Detection latency&lt;/strong&gt;: EMS webhooks are event-driven. ARP detection itself depends on ONTAP's ML model — it typically fires within seconds of detecting the pattern, not after a fixed interval. End-to-end latency from ARP detection to Datadog visibility depends on webhook delivery, Lambda processing, and Datadog ingest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False positives&lt;/strong&gt;: ARP can trigger on legitimate bulk encryption operations (e.g., backup software encrypting files). Design your response workflow to include a verification step before disruptive actions like restricting volume access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coverage&lt;/strong&gt;: ARP behavior depends on your ONTAP version, volume type, and whether ARP/AI is available. Older NAS FlexVol configurations may start in learning mode before active detection, while newer ONTAP versions (9.16.1+ with ARP/AI) can become active immediately for supported volumes. Always verify &lt;code&gt;security anti-ransomware volume show&lt;/code&gt; before relying on alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;: The EMS event in Datadog serves as the detection timestamp for incident timelines. FPolicy events provide the forensic detail. Together they form a complete audit trail from detection to response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost profile&lt;/strong&gt;: EMS events are usually low-volume and alert-oriented, while FPolicy can be high-volume depending on policy scope. Treat their Datadog ingest and alerting cost profiles separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;If you want the shortest path to a first successful ARP alert test, see the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/quick-start-minimum.md" rel="noopener noreferrer"&gt;minimum quick start&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following simulated event exercises the Lambda normalization and Datadog shipping path. A real ONTAP EMS webhook payload may differ depending on how the webhook destination is configured, so validate with a real EMS event before production use.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy EMS + FPolicy integration&lt;/span&gt;
aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template-ems-fpolicy.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-ems-fpolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-secret-arn&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogSite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap1.datadoghq.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1

&lt;span class="c"&gt;# Create a test event file&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; arp-test-event.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
  "body": "{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;messageName&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;arw.volume.state&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;severity&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;alert&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;node&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;fsxn-node-01&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;svmName&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;svm-prod-01&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;time&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%SZ&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;Anti-ransomware: Volume vol_data state changed to attack-detected&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;parameters&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;volume_name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;vol_data&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;state&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;attack-detected&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="sh"&gt;}}",
  "requestContext": {"requestId": "test"}
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Invoke Lambda with the test event&lt;/span&gt;
aws lambda invoke &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; fsxn-datadog-ems-fpolicy-ems &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; file://arp-test-event.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-binary-format&lt;/span&gt; raw-in-base64-out &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1 &lt;span class="se"&gt;\&lt;/span&gt;
  arp-test-output.json

&lt;span class="c"&gt;# Check Datadog: source:fsxn-ems @attributes.event_name:arw.volume.state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This completes the Datadog series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: Architecture and project introduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: Audit log pipeline implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: Event-driven ransomware detection (this post)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coming up next, the same approach for other observability platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Splunk&lt;/strong&gt;: Replacing EC2 + Universal Forwarder with Lambda + HEC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: The vendor-neutral escape hatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Cloud&lt;/strong&gt;: Loki Push API with label cardinality guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each will follow the same pattern: deploy, validate, document the gotchas.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions about ARP detection or the EMS pipeline? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c"&gt;Part 2 — Shipping FSx for ONTAP Logs to Datadog, The Serverless Way&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/fsxn-observability-integrations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>datadog</category>
      <category>amazonfsxfornetappontap</category>
    </item>
    <item>
      <title>Shipping FSx for ONTAP Logs to Datadog — The Serverless Way</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 09:16:31 +0000</pubDate>
      <link>https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c</link>
      <guid>https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Deploy a CloudFormation stack, configure ONTAP audit logging, and see structured file access events in Datadog Log Explorer within minutes — no EC2, no NFS mounts, no agents. This post walks through the full implementation: CloudFormation template, Lambda handler code, Datadog field mapping, and operational validation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Part 1&lt;/a&gt;, I introduced the architecture: FSx for ONTAP audit volume → S3 Access Point → EventBridge Scheduler → Lambda → Datadog. Now let's build it.&lt;/p&gt;

&lt;p&gt;By the end of this post, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A deployed CloudFormation stack with Lambda, Scheduler, DLQ, and alarms&lt;/li&gt;
&lt;li&gt;ONTAP audit events flowing into Datadog Log Explorer&lt;/li&gt;
&lt;li&gt;Structured attributes (&lt;code&gt;@attributes.svm&lt;/code&gt;, &lt;code&gt;@attributes.user&lt;/code&gt;, &lt;code&gt;@attributes.operation&lt;/code&gt;, &lt;code&gt;@attributes.path&lt;/code&gt;, &lt;code&gt;@attributes.client_ip&lt;/code&gt;, &lt;code&gt;@attributes.result&lt;/code&gt;) ready for search, filtering, and Datadog facet creation&lt;/li&gt;
&lt;li&gt;An operational CloudWatch dashboard monitoring pipeline health&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before deploying, you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FSx for ONTAP file system&lt;/strong&gt; with an SVM configured for audit logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FSx for ONTAP S3 Access Point&lt;/strong&gt; attached to the audit volume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog account&lt;/strong&gt; (free trial works) with an API Key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key in Secrets Manager&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; fsxn-datadog-api-key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"api_key":"&amp;lt;your-dd-api-key&amp;gt;"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP audit logging enabled&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Time-based rotation for quick validation&lt;/span&gt;
vserver audit create &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;svm-name&amp;gt; &lt;span class="nt"&gt;-destination&lt;/span&gt; /audit_log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-events&lt;/span&gt; file-ops &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-format&lt;/span&gt; evtx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-rotate-schedule-minute&lt;/span&gt; 0,5,10,15,20,25,30,35,40,45,50,55
vserver audit &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;svm-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;For quick validation, use time-based rotation. If you only use &lt;code&gt;-rotate-size&lt;/code&gt;, low-volume environments may not produce rotated audit files within the expected validation window. Adjust the &lt;code&gt;-events&lt;/code&gt; list based on what you want to audit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Enabling &lt;code&gt;vserver audit&lt;/code&gt; is only one part of file access auditing. Make sure the target SMB folders have SACLs configured, or NFSv4 ACL audit flags are set for NFS workloads. Otherwise, the audit pipeline may be healthy but no file access events will be generated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For detailed ONTAP-side setup, including audit volume sizing, SACL/NFSv4 ACL examples, and source health checks, see the repository's &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/ontap-audit-setup.md" rel="noopener noreferrer"&gt;ONTAP Audit Setup Guide&lt;/a&gt; and &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/operational-guide.md" rel="noopener noreferrer"&gt;Operational Guide&lt;/a&gt;.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;
&lt;strong&gt;Verify how audit files appear via S3 API&lt;/strong&gt; (to set &lt;code&gt;AuditLogPrefix&lt;/code&gt; correctly):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api list-objects-v2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; &amp;lt;fsx-s3-access-point-arn-or-alias&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-keys&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;AuditLogPrefix&lt;/code&gt; to match the key prefix you see. If the access point is attached directly to the audit volume root, this may be empty.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;code&gt;/audit_log&lt;/code&gt; is the ONTAP namespace path. The S3 object key prefix can differ depending on the access point attachment, so always verify with &lt;code&gt;list-objects-v2&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
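
&lt;p&gt;If the listing returns many keys, a small helper can suggest a value for &lt;code&gt;AuditLogPrefix&lt;/code&gt;. This is only a sketch: &lt;code&gt;common_key_prefix&lt;/code&gt; is a hypothetical function, it expects the object keys already extracted from the &lt;code&gt;list-objects-v2&lt;/code&gt; output, and it assumes keys do not start with a slash.&lt;/p&gt;

```python
import posixpath


def common_key_prefix(keys):
    """Return the directory prefix shared by all S3 object keys.

    Returns an empty string when any key sits at the access point root
    or when the keys share no common directory.
    """
    if not keys:
        return ""
    dirs = [posixpath.dirname(key) for key in keys]
    if not all(dirs):
        # At least one key has no directory part, so there is no prefix.
        return ""
    prefix = posixpath.commonpath(dirs)
    return prefix + "/" if prefix else ""


if __name__ == "__main__":
    sample = ["audit_log/audit_1.evtx", "audit_log/2026/audit_2.evtx"]
    print(repr(common_key_prefix(sample)))
```

&lt;p&gt;An empty result corresponds to the case described above, where the access point is attached directly to the audit volume root.&lt;/p&gt;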

&lt;h2&gt;
  
  
  The CloudFormation Stack
&lt;/h2&gt;

&lt;p&gt;The Datadog integration deploys as a single self-contained stack:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-file&lt;/span&gt; integrations/datadog/template.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;FsxS3AccessPointArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogApiKeySecretArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;DatadogSite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ap1.datadoghq.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;AuditLogPrefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;prefix-from-list-objects-v2&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ScheduleRate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"rate(5 minutes)"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Gets Created
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Function&lt;/td&gt;
&lt;td&gt;Reads audit logs from S3 AP, parses EVTX/XML, ships to Datadog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EventBridge Scheduler&lt;/td&gt;
&lt;td&gt;Invokes Lambda every 5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler IAM Role&lt;/td&gt;
&lt;td&gt;Allows Scheduler to invoke Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Execution Role&lt;/td&gt;
&lt;td&gt;S3 AP read, Secrets Manager read, CloudWatch Logs, DLQ send permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dead Letter Queue (SQS)&lt;/td&gt;
&lt;td&gt;Captures failed events for replay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Alarms (3)&lt;/td&gt;
&lt;td&gt;Errors, throttles, DLQ depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Dashboard&lt;/td&gt;
&lt;td&gt;Operational health: errors, duration, invocations, DLQ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Log Group&lt;/td&gt;
&lt;td&gt;Lambda execution logs (30-day retention)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Parameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FsxS3AccessPointArn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;FSx for ONTAP S3 Access Point ARN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DatadogApiKeySecretArn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Secrets Manager ARN for the API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DatadogSite&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Datadog site (default: &lt;code&gt;ap1.datadoghq.com&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ScheduleRate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Processing frequency (default: &lt;code&gt;rate(5 minutes)&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AuditLogPrefix&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Object key prefix as seen via S3 API. Leave empty if audit files appear at the access point root.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VpcEnabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Enable VPC config — requires NAT Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Lambda Handler
&lt;/h2&gt;

&lt;p&gt;The handler follows a straightforward flow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scheduled invocation
  → List objects from FSx for ONTAP S3 AP (via S3 ListObjectsV2)
  → Filter by checkpoint (skip already-processed files)
  → For each new file:
      → Read via S3 GetObject
      → Detect format (EVTX magic bytes or XML declaration)
      → Parse into normalized events
      → Format for Datadog Logs API v2
      → Batch (≤5MB, ≤1000 items per request)
      → Ship with exponential backoff (max 3 attempts)
  → Update checkpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
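
&lt;p&gt;The format-detection step in the flow above can be sketched as follows. This is an illustration rather than the repository's handler code: &lt;code&gt;detect_format&lt;/code&gt; is a hypothetical name, and the sketch only recognizes the EVTX file signature and a UTF-8 XML declaration (UTF-16 encoded XML would need extra handling).&lt;/p&gt;

```python
EVTX_MAGIC = b"ElfFile\x00"   # signature at the start of an EVTX file header
UTF8_BOM = b"\xef\xbb\xbf"    # optional byte order mark before XML content
XML_DECL = b"\x3c?xml"        # "\x3c" is the opening angle bracket


def detect_format(data):
    """Classify raw audit file bytes as 'evtx', 'xml', or 'unknown'."""
    if data.startswith(EVTX_MAGIC):
        return "evtx"
    head = data[:64]
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    # Tolerate leading whitespace before the XML declaration.
    if head.lstrip().startswith(XML_DECL):
        return "xml"
    return "unknown"
```

&lt;p&gt;Returning &lt;code&gt;unknown&lt;/code&gt; instead of guessing lets the caller skip or flag files that are still being written or are not audit logs at all.&lt;/p&gt;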



&lt;h3&gt;
  
  
  Datadog API Limits
&lt;/h3&gt;

&lt;p&gt;The Datadog Logs API v2 enforces the following per-request limits (&lt;a href="https://docs.datadoghq.com/api/latest/logs/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum payload size (uncompressed): &lt;strong&gt;5MB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Maximum size for a single log: &lt;strong&gt;1MB&lt;/strong&gt; (larger logs are truncated, not rejected)&lt;/li&gt;
&lt;li&gt;Maximum array size: &lt;strong&gt;1000 entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shipper batches conservatively below these limits.&lt;/p&gt;
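
&lt;p&gt;One way to batch under those limits is sketched below. It is not the repository's &lt;code&gt;_create_batches&lt;/code&gt; implementation; the function name &lt;code&gt;create_batches&lt;/code&gt;, the 4.5MB safety margin, and the size accounting are assumptions chosen for illustration.&lt;/p&gt;

```python
import json

MAX_BATCH_BYTES = 4_500_000  # assumed margin below the 5MB payload limit
MAX_BATCH_ITEMS = 1000       # array size limit per request


def create_batches(logs, max_bytes=MAX_BATCH_BYTES, max_items=MAX_BATCH_ITEMS):
    """Yield lists of logs that stay within the byte and item limits.

    A single log larger than max_bytes is still yielded on its own,
    because splitting one log entry is not possible at this layer.
    """
    batch, batch_bytes = [], 2  # 2 bytes for the enclosing brackets
    for log in logs:
        # +2 covers the ", " separator json.dumps places between items.
        size = len(json.dumps(log).encode("utf-8")) + 2
        full = len(batch) == max_items or batch_bytes + size > max_bytes
        if batch and full:
            yield batch
            batch, batch_bytes = [], 2
        batch.append(log)
        batch_bytes += size
    if batch:
        yield batch
```

&lt;p&gt;Measuring the encoded size of each log, rather than estimating from the event count, keeps a batch of unusually long file paths from exceeding the payload limit.&lt;/p&gt;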

&lt;h3&gt;
  
  
  Core Shipping Logic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_ship_to_datadog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Ship normalized logs to Datadog Logs Intake API v2.

    If any batch fails after retries, raise an exception so the Lambda
    invocation is treated as failed and the checkpoint is not advanced.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;shipped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;failed_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_create_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_send_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;shipped&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;failed_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failed_batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed_batches&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; batch(es) failed after retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shipped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Checkpoint Semantics
&lt;/h3&gt;

&lt;p&gt;The checkpoint is advanced only after all batches for an audit log file are successfully delivered to Datadog. If any batch fails after retries, the Lambda invocation fails (raises an exception) and the checkpoint is not updated.&lt;/p&gt;

&lt;p&gt;This makes the pipeline &lt;strong&gt;at-least-once&lt;/strong&gt;: the same audit file may be retried on the next scheduled invocation, so downstream queries should tolerate duplicate events. For production, consider adding a deterministic event ID derived from the audit file key and event record offset to support deduplication where your observability platform supports it.&lt;/p&gt;
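
&lt;p&gt;A deterministic event ID of that kind could look like the sketch below. The &lt;code&gt;event_id&lt;/code&gt; helper, the choice of SHA-256, and the 32-character truncation are assumptions for illustration, and any deduplication still has to happen wherever the events are queried or stored.&lt;/p&gt;

```python
import hashlib


def event_id(file_key, record_offset):
    """Derive a stable ID from the audit file key and the record's offset.

    The same (file_key, record_offset) pair always yields the same ID,
    so a retried file produces events with identical IDs.
    """
    raw = f"{file_key}:{int(record_offset)}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


if __name__ == "__main__":
    print(event_id("audit_log/audit_1.evtx", 0))
```

&lt;p&gt;Attaching the ID as a log attribute makes duplicates from a retried file easy to spot when grouping by that attribute.&lt;/p&gt;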

&lt;blockquote&gt;
&lt;p&gt;Because EventBridge Scheduler invokes Lambda asynchronously, a failed invocation (unhandled exception) triggers Lambda's built-in retry behavior (up to 2 retries by default). After all retries are exhausted, the event payload is sent to the configured DLQ.&lt;/p&gt;
&lt;/blockquote&gt;
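
&lt;p&gt;The checkpoint rule above can be sketched as a small loop. This is an assumption-laden illustration, not the repository's code: it treats the checkpoint as the last processed object key and assumes keys sort in processing order, and &lt;code&gt;ship_file&lt;/code&gt; and &lt;code&gt;save_checkpoint&lt;/code&gt; are hypothetical callbacks.&lt;/p&gt;

```python
def process_new_files(keys, checkpoint, ship_file, save_checkpoint):
    """Ship files newer than the checkpoint, advancing it one file at a time.

    ship_file(key) must raise if any batch for that file fails, so that
    save_checkpoint(key) is reached only after a fully delivered file.
    """
    for key in sorted(keys):
        if not key > checkpoint:
            continue  # already processed in an earlier invocation
        ship_file(key)
        save_checkpoint(key)
        checkpoint = key
    return checkpoint
```

&lt;p&gt;If &lt;code&gt;ship_file&lt;/code&gt; raises partway through, the exception propagates, the invocation fails, and the next run resumes from the last saved key, which is what makes delivery at-least-once.&lt;/p&gt;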

&lt;h3&gt;
  
  
  Retry with Exponential Backoff
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_send_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a single batch with retry on 429/5xx, up to MAX_RETRIES attempts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;DATADOG_LOGS_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DD-API-KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# skip the sleep after the final attempt
&lt;/span&gt;                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# jitter
&lt;/span&gt;            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="c1"&gt;# Client error (4xx) — don't retry
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation uses exponential backoff with jitter (&lt;code&gt;2 ** attempt&lt;/code&gt; seconds plus a random 0 to 1 second) to avoid synchronized retries when multiple Lambda invocations hit vendor-side throttling simultaneously. Note that &lt;code&gt;MAX_RETRIES&lt;/code&gt; in the code is the total number of attempts, not the number of retries after an initial attempt.&lt;/p&gt;
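
&lt;p&gt;To get a feel for the waits this produces, the sketch below recomputes the delay schedule outside the Lambda. &lt;code&gt;backoff_delays&lt;/code&gt; is a hypothetical helper, not part of the project, and the attempt count of 3 is an assumed value rather than the stack's &lt;code&gt;MAX_RETRIES&lt;/code&gt;.&lt;/p&gt;

```python
import random


def backoff_delays(max_attempts: int, rng: random.Random) -> list[float]:
    """Return the sleep (in seconds) computed for each attempt index.

    Mirrors the 2 ** attempt + uniform(0, 1) expression from the retry loop.
    """
    return [2 ** attempt + rng.uniform(0, 1) for attempt in range(max_attempts)]


if __name__ == "__main__":
    # max_attempts=3 is illustrative only (assumption).
    for attempt, delay in enumerate(backoff_delays(3, random.Random())):
        print(f"attempt {attempt}: sleep about {delay:.2f}s")
```

&lt;p&gt;Each delay falls between &lt;code&gt;2 ** attempt&lt;/code&gt; and &lt;code&gt;2 ** attempt + 1&lt;/code&gt; seconds, so with three attempts the waits are roughly 1 to 2 s, 2 to 3 s, and 4 to 5 s. Keep the sum well below the Lambda timeout.&lt;/p&gt;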

&lt;h3&gt;
  
  
  API Key Caching
&lt;/h3&gt;

&lt;p&gt;The API key is fetched from Secrets Manager once per Lambda execution context (cold start) and cached in a module-level variable. This avoids per-invocation Secrets Manager calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_api_key_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_KEY_SECRET_ARN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;_api_key_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dd_api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_api_key_cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
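
&lt;p&gt;One thing to watch in the snippet above: &lt;code&gt;json.loads&lt;/code&gt; raises if the secret is stored as a plain string rather than a JSON object, so the final fallback to &lt;code&gt;SecretString&lt;/code&gt; is only reached for JSON objects that lack both keys. If plain-string secrets need to be supported, a minimal sketch could look like the following. &lt;code&gt;extract_api_key&lt;/code&gt; is a hypothetical helper, not part of the project.&lt;/p&gt;

```python
import json


def extract_api_key(secret_string: str) -> str:
    """Return the API key from a JSON-object secret or a plain-string secret."""
    try:
        secret = json.loads(secret_string)
    except json.JSONDecodeError:
        # Not JSON at all: treat the whole string as the key.
        return secret_string
    if isinstance(secret, dict):
        # Same lookup order as the original snippet: api_key, then dd_api_key.
        return secret.get("api_key", secret.get("dd_api_key", secret_string))
    # Valid JSON but not an object (for example an all-digit string).
    return secret_string


if __name__ == "__main__":
    print(extract_api_key('{"dd_api_key": "example-key"}'))
```

&lt;p&gt;Inside &lt;code&gt;get_api_key&lt;/code&gt;, the result of this helper would be what gets stored in &lt;code&gt;_api_key_cache&lt;/code&gt;.&lt;/p&gt;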



&lt;h2&gt;
  
  
  Datadog Field Mapping
&lt;/h2&gt;

&lt;p&gt;Every audit event arrives in Datadog with structured attributes. The Lambda sends these via the Datadog Logs API v2 payload fields (&lt;code&gt;ddsource&lt;/code&gt;, &lt;code&gt;hostname&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;) and custom attributes nested under &lt;code&gt;attributes&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Datadog Log Explorer&lt;/th&gt;
&lt;th&gt;Payload Field&lt;/th&gt;
&lt;th&gt;ONTAP Source&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;source&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ddsource&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configured&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configured&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fsxn-ontap&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;host&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hostname&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SVM name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.svm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.svm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SVMName / Computer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.user&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.user&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UserName / SubjectUserName&lt;/td&gt;
&lt;td&gt;&lt;code&gt;admin@corp.local&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.client_ip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.client_ip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ClientIP / IpAddress&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.50&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.operation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operation / ObjectType&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReadData&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ObjectName&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/vol/data/reports/q4.xlsx&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Result / Keywords&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Success&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes.event_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes.event_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;EventID&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4663&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes._pipeline.processed_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes._pipeline.processed_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lambda timestamp&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-05-17T01:30:00Z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attributes._pipeline.source_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;attributes._pipeline.source_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;S3 object key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;audit_log/audit_svm_20260517.evtx&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
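
&lt;p&gt;As a rough illustration of the mapping in the table, the sketch below assembles one log entry from an already normalized event. &lt;code&gt;build_log_entry&lt;/code&gt; and &lt;code&gt;ATTRIBUTE_KEYS&lt;/code&gt; are hypothetical names, and serializing the whole event into &lt;code&gt;message&lt;/code&gt; is an assumption; the project's actual handler may differ.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Attribute names taken from the table above.
ATTRIBUTE_KEYS = ("svm", "user", "client_ip", "operation", "path", "result", "event_type")


def build_log_entry(event: dict, source_file: str) -> dict:
    """Map one normalized audit event to a log entry shaped like the table."""
    attributes = {key: event.get(key) for key in ATTRIBUTE_KEYS}
    attributes["_pipeline"] = {
        "processed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source_file": source_file,
    }
    return {
        "ddsource": "fsxn",            # configured value
        "service": "fsxn-ontap",       # configured value
        "hostname": event.get("svm"),  # SVM name
        "message": json.dumps(event),  # assumption: event serialized as the message
        "attributes": attributes,
    }


if __name__ == "__main__":
    sample = {
        "svm": "svm-prod-01",
        "user": "admin@corp.local",
        "client_ip": "10.0.1.50",
        "operation": "ReadData",
        "path": "/vol/data/reports/q4.xlsx",
        "result": "Success",
        "event_type": "4663",
    }
    entry = build_log_entry(sample, "audit_log/audit_svm_20260517.evtx")
    print(json.dumps(entry, indent=2))
```

&lt;p&gt;A batch is then a list of such entries, serialized with &lt;code&gt;json.dumps&lt;/code&gt; and posted as a single JSON array.&lt;/p&gt;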

&lt;blockquote&gt;
&lt;p&gt;Set &lt;code&gt;DatadogSite&lt;/code&gt; to your Datadog site, such as &lt;code&gt;datadoghq.com&lt;/code&gt; (US1), &lt;code&gt;datadoghq.eu&lt;/code&gt; (EU1), or &lt;code&gt;ap1.datadoghq.com&lt;/code&gt; (AP1/Tokyo). The site determines the API endpoint.&lt;/p&gt;
&lt;/blockquote&gt;
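
&lt;p&gt;For illustration, this is how a site value could translate into the intake URL. &lt;code&gt;logs_intake_url&lt;/code&gt; is a hypothetical helper; the &lt;code&gt;http-intake.logs.&amp;lt;site&amp;gt;&lt;/code&gt; host pattern follows Datadog's HTTP intake naming, and whether the stack builds &lt;code&gt;DATADOG_LOGS_URL&lt;/code&gt; exactly this way is an assumption.&lt;/p&gt;

```python
def logs_intake_url(site: str) -> str:
    """Build the Logs API v2 intake URL for a Datadog site such as datadoghq.com."""
    return f"https://http-intake.logs.{site}/api/v2/logs"


if __name__ == "__main__":
    for site in ("datadoghq.com", "datadoghq.eu", "ap1.datadoghq.com"):
        print(site, logs_intake_url(site))
```

&lt;p&gt;A mismatch between the site and the API key's organization typically shows up as a rejected request, which is worth ruling out before debugging the parser.&lt;/p&gt;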

&lt;p&gt;For the full cross-vendor mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/normalized-event-schema.md" rel="noopener noreferrer"&gt;Normalized Event Schema&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datadog Search Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# All FSx for ONTAP audit events
source:fsxn

# Failed access attempts
source:fsxn @attributes.result:Failure

# Specific user activity
source:fsxn @attributes.user:"admin@corp.local"

# Delete operations on sensitive paths
source:fsxn @attributes.operation:delete @attributes.path:/vol/data/confidential/*

# Pipeline processing metadata
source:fsxn @attributes._pipeline.source_file:*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Part 3, we'll turn these queries into Datadog Monitors for ARP ransomware detection and suspicious file activity alerting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Investigation Query Starters
&lt;/h3&gt;

&lt;p&gt;When investigating an incident, start with these patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Search query&lt;/th&gt;
&lt;th&gt;Then group by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What did this user do?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.user:"suspect@corp.local"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@attributes.operation&lt;/code&gt; or &lt;code&gt;@attributes.path&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who accessed this file?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.path:"/vol/data/secret.pdf"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@attributes.user&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which clients generated failures?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.result:Failure&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@attributes.client_ip&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where are deletes concentrated?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.operation:delete&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@attributes.path&lt;/code&gt; or a path prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What happened on this SVM in the last hour?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source:fsxn @attributes.svm:svm-prod-01&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@attributes.operation&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;For high-volume environments, avoid grouping by full file path unless needed. Consider deriving a lower-cardinality field such as a path prefix or data area classification.&lt;/p&gt;
&lt;/blockquote&gt;
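
&lt;p&gt;One way to derive such a lower-cardinality field is to keep only the leading path components before shipping the event. The sketch below is an assumption about how that could be done; &lt;code&gt;path_prefix&lt;/code&gt; and its &lt;code&gt;depth&lt;/code&gt; parameter are hypothetical and not part of the project.&lt;/p&gt;

```python
def path_prefix(path: str, depth: int = 2) -> str:
    """Keep only the first `depth` components of an absolute path."""
    parts = [part for part in path.split("/") if part]
    return "/" + "/".join(parts[:depth])


if __name__ == "__main__":
    print(path_prefix("/vol/data/reports/q4.xlsx"))                    # /vol/data
    print(path_prefix("/vol/data/confidential/plan.docx", depth=3))    # /vol/data/confidential
```

&lt;p&gt;Stored as an extra attribute, the prefix can be used for grouping while the full path stays available for drill-down.&lt;/p&gt;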

&lt;h2&gt;
  
  
  Operational Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quick Validation (5–10 minutes)
&lt;/h3&gt;

&lt;p&gt;With a 5-minute audit rotation and a 5-minute Scheduler interval, the first events typically appear within a few minutes; allow up to 10 minutes, depending on where the rotation and schedule cycles fall relative to each other.&lt;/p&gt;

&lt;p&gt;Before waiting for logs, generate a test file operation on the audited SMB/NFS share — such as creating and deleting a small test file — to ensure ONTAP produces an audit event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 0. Get stack outputs (log group name, DLQ URL, etc.)&lt;/span&gt;
aws cloudformation describe-stacks &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Stacks[0].Outputs'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1

&lt;span class="c"&gt;# 1. Confirm Scheduler is invoking Lambda&lt;/span&gt;
aws logs filter-log-events &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &amp;lt;LambdaLogGroupName from outputs&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import time; print(int((time.time()-300)*1000))"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1

&lt;span class="c"&gt;# 2. Confirm DLQ is empty&lt;/span&gt;
aws sqs get-queue-attributes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--queue-url&lt;/span&gt; &amp;lt;dlq-url&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-names&lt;/span&gt; All &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Attributes.ApproximateNumberOfMessages'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1

&lt;span class="c"&gt;# 3. Search in Datadog&lt;/span&gt;
&lt;span class="c"&gt;#    source:fsxn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CloudWatch Dashboard
&lt;/h3&gt;

&lt;p&gt;The stack includes a pre-built dashboard (&lt;code&gt;fsxn-datadog-integration-health&lt;/code&gt;) with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda Errors &amp;amp; Throttles&lt;/li&gt;
&lt;li&gt;Lambda Duration (avg/max)&lt;/li&gt;
&lt;li&gt;Lambda Invocations&lt;/li&gt;
&lt;li&gt;DLQ Depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production, consider publishing custom metrics such as files processed, events shipped, batch failures, and checkpoint lag to gain deeper pipeline observability beyond Lambda-level metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to Watch For
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No logs in Datadog&lt;/td&gt;
&lt;td&gt;Scheduler not running, or no new audit files&lt;/td&gt;
&lt;td&gt;Check CloudWatch Logs for Lambda invocations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs arrive but fields are empty&lt;/td&gt;
&lt;td&gt;EVTX/XML parsing issue&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;@attributes.event_type&lt;/code&gt; — if "unknown", parser needs tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DLQ messages appearing&lt;/td&gt;
&lt;td&gt;Datadog API rejection&lt;/td&gt;
&lt;td&gt;Check API key validity, site configuration, timestamp age&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda timeout&lt;/td&gt;
&lt;td&gt;S3 Access Point unreachable (Lambda in a VPC with only an S3 Gateway Endpoint)&lt;/td&gt;
&lt;td&gt;Add a NAT Gateway or deploy Lambda outside the VPC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Old Timestamps May Not Appear in Log Explorer
&lt;/h3&gt;

&lt;p&gt;The Datadog Logs API accepts log events with timestamps up to 18 hours in the past. If your audit files are rotated or processed too late, older events may not appear as expected in Log Explorer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Use a time-based ONTAP audit rotation schedule and a Scheduler frequency that keeps processing well within the 18-hour window.&lt;/p&gt;
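
&lt;p&gt;A pre-send check can make late events visible instead of letting them disappear silently. The following is a minimal sketch under the 18-hour assumption stated above; &lt;code&gt;is_within_intake_window&lt;/code&gt; and &lt;code&gt;MAX_EVENT_AGE&lt;/code&gt; are hypothetical names, not part of the project.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# 18-hour limit as described in the text above.
MAX_EVENT_AGE = timedelta(hours=18)


def is_within_intake_window(event_time: datetime, now: datetime) -> bool:
    """Return True when the event is no older than MAX_EVENT_AGE."""
    return MAX_EVENT_AGE >= now - event_time


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_within_intake_window(now - timedelta(hours=2), now))
    print(is_within_intake_window(now - timedelta(hours=20), now))
```

&lt;p&gt;Counting or logging the events that fail this check gives an early signal that rotation or scheduling has fallen behind.&lt;/p&gt;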

&lt;h3&gt;
  
  
  Gzip Compression Issue (AP1 Site)
&lt;/h3&gt;

&lt;p&gt;During E2E validation, gzip-compressed payloads were accepted (HTTP 202) but not indexed on the AP1 site. The &lt;code&gt;ENABLE_GZIP&lt;/code&gt; parameter defaults to &lt;code&gt;false&lt;/code&gt; for this reason.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 Access Point Timeout in VPC
&lt;/h3&gt;

&lt;p&gt;If Lambda is in a VPC with only an S3 Gateway Endpoint, reads from FSx for ONTAP S3 Access Points will time out. Add a NAT Gateway or deploy Lambda outside the VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day-2 Operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DLQ Replay
&lt;/h3&gt;

&lt;p&gt;This stack uses an SQS queue as the Lambda asynchronous invocation DLQ. Because the DLQ is attached to Lambda (not an SQS source queue), &lt;code&gt;sqs start-message-move-task&lt;/code&gt; cannot redrive messages automatically.&lt;/p&gt;

&lt;p&gt;For replay, inspect the DLQ message, identify the failed invocation payload, and re-invoke Lambda manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inspect failed messages&lt;/span&gt;
aws sqs receive-message &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--queue-url&lt;/span&gt; &amp;lt;dlq-url&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-number-of-messages&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-names&lt;/span&gt; All &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--message-attribute-names&lt;/span&gt; All
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fixing the root cause (e.g., expired API key, Datadog site misconfiguration), re-run the scheduled processor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda invoke &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; &amp;lt;lambda-function-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-binary-format&lt;/span&gt; raw-in-base64-out &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1 &lt;span class="se"&gt;\&lt;/span&gt;
  replay-output.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this pattern, replay means re-running the scheduled processor after fixing the root cause, not re-submitting the DLQ message itself. Because the checkpoint is not advanced on failed delivery, the affected audit files remain eligible and are picked up again on the next invocation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For production, consider adding a dedicated replay Lambda that reads DLQ messages, validates the payload, and re-submits failed processing requests in a controlled way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Checkpoint Reset (Reprocess All Files)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Warning&lt;/strong&gt;: Resetting the checkpoint causes previously processed audit files to be eligible for reprocessing. This can generate duplicate logs in Datadog. Use only for controlled replay or testing.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws dynamodb delete-item &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; fsxn-observability-audit-checkpoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="s1"&gt;'{"svm_name": {"S": "svm-prod-01"}, "file_key": {"S": "LATEST"}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Teardown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation delete-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; fsxn-datadog-integration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; ap-northeast-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Deleting the stack does not affect ONTAP audit logging or data on the FSx for ONTAP volume.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Cost Estimate
&lt;/h2&gt;

&lt;p&gt;For a typical deployment (1 SVM, 100 MB of audit logs per day, 5-minute schedule):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda (288 invocations/day × 5s avg)&lt;/td&gt;
&lt;td&gt;~$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EventBridge Scheduler&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (checkpoint)&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Manager&lt;/td&gt;
&lt;td&gt;~$0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Logs (30-day)&lt;/td&gt;
&lt;td&gt;~$1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway (if VPC)&lt;/td&gt;
&lt;td&gt;Region-dependent hourly + per-GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (no VPC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (with VPC/NAT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$30–50+/month depending on Region&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
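
&lt;p&gt;The arithmetic behind the table can be checked with a few lines. This sketch only re-derives the invocation count from the 5-minute schedule and sums the line items as listed; it does not query AWS pricing, and the variable names are illustrative.&lt;/p&gt;

```python
MINUTES_PER_DAY = 24 * 60
SCHEDULE_MINUTES = 5

invocations_per_day = MINUTES_PER_DAY // SCHEDULE_MINUTES

# Monthly line items in USD, copied from the table (NAT Gateway excluded).
line_items = {
    "lambda": 0.50,
    "eventbridge_scheduler": 0.01,
    "dynamodb": 0.01,
    "secrets_manager": 0.40,
    "cloudwatch_logs": 1.00,
}
total_no_vpc = round(sum(line_items.values()), 2)

print(invocations_per_day)  # 288
print(total_no_vpc)         # 1.92
```

&lt;p&gt;The sum of $1.92 is what the table rounds to about $2 per month for the no-VPC case.&lt;/p&gt;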

&lt;blockquote&gt;
&lt;p&gt;Cost numbers are illustrative. They assume a 5-minute schedule, a 5-second average runtime, and 100 MB/day of audit logs. NAT Gateway pricing is regional and includes hourly charges plus per-GB data processing charges. Check the &lt;a href="https://calculator.aws/" rel="noopener noreferrer"&gt;AWS Pricing Calculator&lt;/a&gt; for your target Region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Datadog ingest and retention costs are not included in this AWS-side estimate and can become the dominant cost driver for high-volume audit policies, especially when read auditing is enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence retention&lt;/strong&gt;: This pipeline optimizes search and alerting via normalized events in Datadog. If you need audit evidence retention for compliance, design raw EVTX/XML retention separately on the audit volume or in an archive path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost control&lt;/strong&gt;: For high-volume environments, consider a tiered strategy: send security-relevant operations such as deletes, permission changes, and failed access to indexed logs; reduce, archive, or exclude noisy read events only if your audit and compliance requirements allow it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Compare this to an always-on EC2 collector instance, plus EBS, patching labor, and agent licensing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3&lt;/a&gt;, we'll add event-driven security alerting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONTAP Autonomous Ransomware Protection (ARP) detection&lt;/li&gt;
&lt;li&gt;EMS webhook → API Gateway → Lambda → Datadog&lt;/li&gt;
&lt;li&gt;Datadog Monitor configuration for instant alerts&lt;/li&gt;
&lt;li&gt;Incident response workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Datadog is the first E2E-verified integration in this pattern library; the same structure will be used for the remaining vendor integrations as they are validated.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions about the Datadog integration? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod"&gt;Part 1 — Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Next: &lt;a href="https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda"&gt;Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>datadog</category>
      <category>amazonfsxfornetappontap</category>
    </item>
  </channel>
</rss>
