DEV Community: AWS Community Builders

AI Terms, Simply Explained: Notes from My Learning Journey

Sandeep Sangu — Wed, 20 May 2026 05:38:51 +0000

While preparing for the AWS Certified AI Practitioner exam, I thought it would be helpful to write down ✍️ my understanding of some common AI and GenAI terms.

These notes reflect my own understanding, shaped by different learning resources, including publicly available AWS content, and by hands-on experience.

This is not a textbook or a glossary. 📚

It’s a simple explanation of key terms, written in a way that I would have liked to read when I first started — with real-world analogies and no jargon.

Let’s get started. 🚀

Why Fundamentals Matter

As we all know, terms like Machine Learning, AI, Generative AI, and Agentic AI are becoming common. These are the ones we hear the most, but there are many more working quietly behind the scenes.

Personally, I believe staying relevant and up to date is the key.

When you get the fundamentals right, it becomes easier to connect the dots on real AI projects, and that confidence makes a real difference.

Fundamentals

1️⃣ Artificial Intelligence (AI) is the idea of making computers do things that would normally require human intelligence. 🤖

Think of it as teaching machines to solve problems, understand language, or even make decisions — tasks that earlier needed a person.

Real-life examples we already use: 📚

Voice Assistants like Siri and Alexa that understand what you say and respond. 🗣️
Recommendation Systems on Netflix or Amazon that suggest what to watch or buy. 🎬🛒
Chatbots that help answer your questions on websites. 💬

Why it matters:

AI is now behind many tools and services we use daily. Knowing the basics helps you understand how these systems are built and what’s happening behind the scenes.

🔍 Quick Note: Why Data Matters

All AI systems, whether they are Machine Learning models, Generative AI tools, or chatbots, rely heavily on data. Data is what helps AI learn, find patterns, and make decisions.

Where does the data come from?

It can be collected from public datasets, user interactions, company records, or even purchased from authorized data providers.

In short: No data, no AI.

The better the data, the smarter the AI becomes.

2️⃣ Machine Learning (ML) 🧠 is a branch of AI focused on teaching computers to learn from data, without being explicitly programmed for every task.

While AI is the broader idea of making machines intelligent, ML is one way we achieve it — by helping machines find patterns in data and improve over time.

Real-life examples: 📚

Movie recommendations on Netflix that get better the more you watch. 🎬
Spam filters in your email that learn what to block. ✉️🚫
Fraud detection systems 🏦 used by banks to spot unusual transactions.

Why it matters:

Machine Learning powers many of the AI applications we interact with daily. Understanding how ML works helps demystify how intelligent systems make decisions based on data.

3️⃣ Artificial Neural Networks (ANN) are computer systems inspired by how the human brain works.

They are made up of layers of simple units called neurons, connected to each other, and are designed to recognize patterns in data — much like how our brain processes information.

How it works:

The input layer receives the raw data.
Hidden layers work through the data to find patterns and relationships.
The output layer gives the final result or decision.
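
Here is a minimal sketch of that three-layer flow in plain Python. The weights and layer sizes are made-up numbers chosen only for illustration; a real network learns its weights from data.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of its inputs, adds a bias, applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical weights for illustration only (not trained values).
HIDDEN_WEIGHTS = [[0.5, -0.2], [0.3, 0.8]]  # 2 inputs -> 2 hidden neurons
HIDDEN_BIASES = [0.0, -0.1]
OUTPUT_WEIGHTS = [[1.0, -1.0]]              # 2 hidden neurons -> 1 output neuron
OUTPUT_BIASES = [0.0]


def predict(raw_input):
    hidden = layer(raw_input, HIDDEN_WEIGHTS, HIDDEN_BIASES)  # hidden layer: find patterns
    output = layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)     # output layer: final result
    return output[0]


score = predict([1.0, 2.0])  # input layer: the raw data
print(f"score = {score:.3f}")
```

The score is always between 0 and 1, so it can be read as "how confident is the network", for example that a photo matches your face.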

Real-life examples: 📚

Facial recognition systems that unlock your phone. 📱🔓
Voice recognition 🎙️ in assistants like Alexa or Google Assistant.
Handwriting recognition when you digitize notes. ✍️📝

Why it matters:

Neural networks are at the heart of many AI applications that require pattern recognition. They help machines process complex data and make decisions more like how humans do.

4️⃣ Deep Learning is a type of Machine Learning that uses large neural networks with many layers — which is why it's called deep.

You can think of it as a more powerful way for machines to learn complex tasks layer by layer: early layers pick up simple patterns such as edges, and later layers combine them into more complex ones such as faces. It is similar to how we build a house brick by brick 🧱🏠, or how we first set up infrastructure before deploying an app in tech projects. 🖥️🚀

Real-life examples: 📚

Self-driving cars 🚗🚦recognizing traffic signs and pedestrians.
Photo apps 📸🧑‍🤝‍🧑 that automatically recognize and tag faces.

Why it matters:

Deep Learning has made it possible for machines to perform tasks that once needed human-level skills — like seeing, recognizing, and even understanding — at a much higher scale.

5️⃣ Generative AI (GenAI) is a type of AI that creates new content — like text, images, or even music — based on what it has learned.🧩

You can think of it like a chef who has studied thousands of recipes and can now create a new dish using that knowledge.🍳

Real-life examples we already see: 📚

ChatGPT helping write emails or answer questions.📝
Amazon Q Developer suggesting code, helping troubleshoot, and assisting in building AWS applications.💻
AI tools that generate artwork from text prompts.🎨

Why it matters:

Generative AI is speeding up how we create, design, and problem-solve — helping us move from ideas to results much faster.

6️⃣ Foundation Models (FM) are large AI models trained on a huge variety of data — text, images, or both — so they can handle many different tasks without being specialized for just one thing.

You can think of a Foundation Model like a strong base in construction — once built, it can support different types of buildings on top.🏗️🏢

Real-life examples you might know:📚

GPT-4 📝, which powers ChatGPT for understanding and generating text.
Stable Diffusion 🎨, used for creating realistic images from text prompts.

Why it matters:

Instead of building a new AI model for every task, Foundation Models give us a powerful starting point that can be fine-tuned for specific needs — making AI development faster and more flexible.

7️⃣ Large Language Models (LLMs) are AI systems trained on huge amounts of text data to understand and generate human language.🧠📝

You can think of an LLM like a smart virtual assistant — or like a doctor who has seen thousands of cases and can diagnose based on experience, without having to look things up every time. 🩺📚

Where you see LLMs in action: 📚

Chatbots that answer customer service questions.💬
Email writing assistants that suggest better sentences.✉️
AI search tools that provide direct answers instead of links.🔍

Why it matters:

LLMs are powering a new generation of tools that can understand human language and respond naturally, helping make information and communication faster and easier.

Quick Note:

All LLMs are Foundation Models (FMs), but not all FMs are LLMs — FMs can handle other types of data too, like images or video.

Real-world example:

AWS offers a service called Amazon Bedrock, where you can access different LLMs, such as Anthropic's Claude, Meta's Llama 2, and AWS's own Amazon Titan models, to build language-based applications.

8️⃣ Natural Language Processing (NLP) is the part of AI that helps computers understand and work with human language — both what we write and what we say. 🗣️💻

You can think of NLP like teaching a computer how to read, listen, and respond in ways that feel natural to us.

Behind the scenes: 🔍

NLP uses algorithms that learn from lots of examples — books, conversations, articles — so that computers can figure out what we mean and reply in a way that feels human.
It’s not hard-coded with rules — it learns patterns and improves over time, just like we do when we practice a new language.

Two important sides of NLP: 📚

Understanding Language (NLU): This is where the computer tries to figure out what the words really mean, like detecting the mood behind a sentence (happy, sad) or guessing what someone wants based on what they said.😊😠
Creating Language (NLG): This is where the computer produces new text, for example writing a reply to a question or summarizing a long document. Related speech technologies sit on either side of this: speech-to-text turns spoken voice into written words, and text-to-speech turns written words into spoken voice.✍️🔊

Why it matters:

NLP is what makes it possible for computers to have more natural conversations with us — whether it’s chatting with a support bot or using voice commands on a device.

9️⃣ Transformer Models are a type of AI model designed to understand and process language more effectively.🧠💬

Unlike older models that read sentences one word at a time, Transformers look at the entire sentence all at once.

What makes them special is a trick called attention — they figure out which words in a sentence are more important to focus on.

For example, in a customer review:

“The food was amazing, but the service was slow.”

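The model gives more weight to words like “food,” “amazing,” “service,” and “slow” because they carry the real meaning, and it links each description to the right subject (“amazing” to “food,” “slow” to “service”) instead of focusing on small filler words.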
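
To make the attention idea a bit more concrete, here is a toy sketch of scaled dot-product attention, the scoring step used inside Transformers. The two-number word vectors are invented for illustration; real models learn much larger vectors during training.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention: how much the query word attends to each word."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Toy vectors, invented for illustration only.
words = ["food", "was", "amazing"]
vectors = {"food": [1.0, 0.2], "was": [0.1, 0.1], "amazing": [0.9, 0.4]}

# Ask: when the model reads "amazing", which words does it focus on?
weights = attention_weights(vectors["amazing"], [vectors[w] for w in words])
for word, weight in zip(words, weights):
    print(f"{word:8s} {weight:.2f}")
```

With these toy vectors the filler word "was" receives the smallest weight, which mirrors the review example above.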

Why it matters:

Transformers have become the foundation for many advanced AI systems, helping them understand language faster and more accurately than before.

VPC Peering: The network bridge that lets isolated resources communicate

Javier Madriz — Tue, 19 May 2026 20:47:25 +0000

Welcome, everyone, to a new networking workshop! Today we'll work with VPC Peering: we'll learn what it is, when to use it, what its benefits are, and even when it's better to avoid it. We won't stop at theory, though; we'll implement the solution by connecting two fully isolated networks and run tests that move real traffic between the resources deployed in each VPC.

What is VPC Peering?

A VPC peering connection is a network connection between two VPCs that routes traffic between them using private IPv4 or IPv6 addresses. Resources deployed in these networks can communicate with each other as if they were on the same local network. Best of all, it can connect VPCs in the same region, in different regions, and even in different AWS accounts.

Workshop scope

We'll focus on how to establish the peering connection and the security rules needed to move traffic from one VPC to the other securely, applying the principle of least privilege.

Since earlier guides already covered the fundamentals of VPCs and their components, we won't repeat that work by hand. To keep our attention on the peering connection, the routes, and the security, I've prepared a CloudFormation template in YAML. It automatically deploys the VPCs we're going to peer and the resources that will exchange traffic once the peering is established.

CloudFormation template to deploy the resources

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Base infrastructure for the VPC Peering and EIC Endpoint workshop - private environment'

Parameters:
  LatestAmiId:
    Type: 'AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>'
    Default: '/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64'
    Description: "Latest Amazon Linux 2023 AMI"

  InstanceType:
    Type: String
    Default: t3.micro
    Description: "Instance type for the lab"

Resources:
  # --- VPC 01 INFRASTRUCTURE ---
  VPC01:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/24
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: vpc-01-workshop

  Subnet01:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId:
        Ref: VPC01
      CidrBlock: 10.0.0.0/25
      AvailabilityZone: !Select [ 0, !GetAZs "" ]
      Tags:
        - Key: Name
          Value: subnet-01-privada

  RouteTable01:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId:
        Ref: VPC01
      Tags:
        - Key: Name
          Value: rt-01-privada

  SubnetRouteTableAssociation01:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId:
        Ref: Subnet01
      RouteTableId:
        Ref: RouteTable01

  # --- VPC 01 SECURITY ---
  SGEIC01:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security Group for EIC Endpoint 01"
      VpcId:
        Ref: VPC01
      Tags:
        - Key: Name
          Value: sg-eic-01

  SGInstance01:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security Group for Instance 01"
      VpcId:
        Ref: VPC01
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          SourceSecurityGroupId:
            Ref: SGEIC01
      Tags:
        - Key: Name
          Value: sg-instance-01

  # Egress rule so the EIC Endpoint can reach the instance
  EIC01Egress:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId:
        Ref: SGEIC01
      IpProtocol: tcp
      FromPort: 22
      ToPort: 22
      DestinationSecurityGroupId:
        Ref: SGInstance01

  # --- VPC 01 RESOURCES ---
  EICEndpoint01:
    Type: AWS::EC2::InstanceConnectEndpoint
    Properties:
      SubnetId:
        Ref: Subnet01
      SecurityGroupIds:
        - Ref: SGEIC01
      Tags:
        - Key: Name
          Value: eic-endpoint-01

  Instance01:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType:
        Ref: InstanceType
      ImageId:
        Ref: LatestAmiId
      SubnetId:
        Ref: Subnet01
      SecurityGroupIds:
        - Ref: SGInstance01
      Tags:
        - Key: Name
          Value: instancia-01-requester

  # --- VPC 02 INFRASTRUCTURE ---
  VPC02:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 11.0.0.0/24
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: vpc-02-workshop

  Subnet02:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId:
        Ref: VPC02
      CidrBlock: 11.0.0.0/25
      AvailabilityZone: !Select [ 0, !GetAZs "" ]
      Tags:
        - Key: Name
          Value: subnet-02-privada

  RouteTable02:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId:
        Ref: VPC02
      Tags:
        - Key: Name
          Value: rt-02-privada

  SubnetRouteTableAssociation02:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId:
        Ref: Subnet02
      RouteTableId:
        Ref: RouteTable02

  # --- VPC 02 SECURITY ---
  SGEIC02:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security Group for EIC Endpoint 02"
      VpcId:
        Ref: VPC02
      Tags:
        - Key: Name
          Value: sg-eic-02

  EIC02Egress:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId:
        Ref: SGEIC02
      IpProtocol: tcp
      FromPort: 22
      ToPort: 22
      DestinationSecurityGroupId:
        Ref: SGInstance02

  SGInstance02:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security Group for Instance 02"
      VpcId:
        Ref: VPC02
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          SourceSecurityGroupId:
            Ref: SGEIC02
      Tags:
        - Key: Name
          Value: sg-instance-02


  # --- VPC 02 RESOURCES ---
  EICEndpoint02:
    Type: AWS::EC2::InstanceConnectEndpoint
    Properties:
      SubnetId:
        Ref: Subnet02
      SecurityGroupIds:
        - Ref: SGEIC02
      Tags:
        - Key: Name
          Value: eic-endpoint-02

  Instance02:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType:
        Ref: InstanceType
      ImageId:
        Ref: LatestAmiId
      SubnetId:
        Ref: Subnet02
      SecurityGroupIds:
        - Ref: SGInstance02
      Tags:
        - Key: Name
          Value: instancia-02-accepter

Outputs:
  Instancia01ID:
    Description: "ID of Instance 01"
    Value:
      Ref: Instance01
  Instancia02ID:
    Description: "ID of Instance 02"
    Value:
      Ref: Instance02

What exactly does this template deploy?

To guarantee a secure and portable environment, the template automates the base infrastructure under a 100% private model:

2 isolated VPCs: vpc-01-workshop (10.0.0.0/24) and vpc-02-workshop (11.0.0.0/24), configured with no path to the internet (no Internet Gateways or NAT Gateways).

2 private subnets: subnet-01-privada and subnet-02-privada, carved out with /25 masks in the first Availability Zone of the region.

2 base route tables: rt-01-privada and rt-02-privada, associated with their respective subnets and ready to receive the peering routes manually.

2 EC2 Instance Connect (EIC) Endpoints: eic-endpoint-01 and eic-endpoint-02. These components act as the secure bridge that lets us connect over SSH from our terminal without public IPs or .pem keys.

2 EC2 instances: instancia-01-requester and instancia-02-accepter running Amazon Linux 2023, located in the heart of their private networks.

4 Security Groups: two for the endpoints (with egress rules on port 22) and two for the instances, configured under the principle of least privilege to accept SSH connections only when they come from their respective EIC Endpoint.
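
If you want to sanity-check this address plan before deploying, Python's standard ipaddress module is enough. The sketch below only inspects the CIDR blocks from the template; it does not talk to AWS. It also checks that the two VPC ranges do not overlap, which AWS requires before it will let you peer two VPCs.

```python
import ipaddress

vpc01 = ipaddress.ip_network("10.0.0.0/24")  # vpc-01-workshop
vpc02 = ipaddress.ip_network("11.0.0.0/24")  # vpc-02-workshop

# VPC peering requires CIDR blocks that do not overlap.
print(vpc01.overlaps(vpc02))   # False

# Splitting a /24 into /25 blocks yields two halves; the template uses the first one.
halves = list(vpc01.subnets(new_prefix=25))
subnet01 = halves[0]
print(subnet01)                # 10.0.0.0/25
print(subnet01.num_addresses)  # 128
```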

Step 1: Deploying the base infrastructure

Save the template above in a .yaml file and open CloudFormation in the AWS console. Although the code works in any region, I recommend using N. Virginia (us-east-1) so your screen matches the reference images you'll see in each step.

Create the Stack with these quick steps:

Click Create stack (With new resources).

Select Choose an existing template -> Upload a template file and upload your .yaml file.

Give your Stack a name and move forward by pressing Next (leave the remaining options at their defaults).

Click Submit.

Provisioning takes a couple of minutes. Once the status changes to CREATE_COMPLETE, the automation is finished and we're ready to start the manual configuration of our VPC Peering.
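
If you prefer a script to the console, the same deployment can be sketched with boto3. This is an optional alternative, not part of the workshop steps: the function expects a CloudFormation client to be passed in, and the stack name and file name in the usage comment are placeholders I made up.

```python
def deploy_stack(cfn, stack_name: str, template_body: str) -> str:
    """Create the stack and block until CloudFormation reports CREATE_COMPLETE."""
    response = cfn.create_stack(StackName=stack_name, TemplateBody=template_body)
    # The waiter polls the stack status and raises if creation fails.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]


# Usage sketch (needs boto3 and AWS credentials; names below are placeholders):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-east-1")
#   with open("vpc-peering-base.yaml") as template_file:
#       deploy_stack(cfn, "vpc-peering-workshop", template_file.read())
```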

Step 2: Verifying the isolation (the expected failure)

With our Stack deployed, the first step is to connect to our instancia-01-requester (in vpc-01-workshop). We'll do it through the EIC Endpoint (EC2 Instance Connect Endpoint) that we automated with the template. This is an excellent security practice: it removes the management of .pem access keys entirely and adds an extra layer of protection. If you've never worked with EIC Endpoints, see the previous workshop, where we explain every endpoint type, including EIC, and use each one in an example: Workshop VPC Endpoints.

Go to the EC2 service, select the instance instancia-01-requester, and click the Connect button.

On the next screen you'll notice some warning messages such as "No public IPv4 or IPv6 address assigned" and "Instance is not in a public subnet". Don't worry! Far from being an error, this is an excellent sign: it confirms that our subnets and resources are completely isolated from the outside world.

In the EC2 Instance Connect tab, change the connection type to Connect using EC2 Instance Connect Endpoint.
You'll see that the system automatically selects our eic-endpoint-01 in the drop-down list.
To finish, click the Connect button.

Once inside the terminal of instancia-01-requester, we'll run a command to try to reach instancia-02-accepter. (Go to the EC2 console and copy the private IP address of that second instance; you're going to need it.)

ping -c 4 <IP_PRIVADA_DE_INSTANCIA_02>

Quick note: the ping command checks basic connectivity between two resources. The -c 4 (count) flag tells the system to send exactly 4 test packets to the destination IP.

Run the command and look at the result:

The diagnosis is clear: 4 packets transmitted, 0 received (100% packet loss). The command stalls with no replies and then gives up.

Why does this happen? Because the two VPCs are completely isolated from each other. In networking terms, the router of our vpc-01-workshop receives the packet destined for the 11.0.0.x network, checks its route table and, finding no instruction that tells it how to get there, simply drops the packet. As far as it is concerned, that IP address doesn't exist.
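
The router's decision can be modeled in a few lines of Python. This is a simplified sketch of a route lookup, not AWS's actual implementation, and the IP addresses 10.0.0.20 and 11.0.0.20 are made-up examples inside each VPC's range.

```python
import ipaddress


def lookup(route_table, destination_ip):
    """Return the target of the most specific matching route, or None if nothing matches."""
    address = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route: the packet is dropped
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# rt-01-privada as created by the template: only the implicit local route.
rt_01 = {"10.0.0.0/24": "local"}

print(lookup(rt_01, "10.0.0.20"))  # local
print(lookup(rt_01, "11.0.0.20"))  # None
```

The second lookup returns None, which is the situation behind the 100% packet loss above.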

Now for the good part...

Step 3: Creating a VPC Peering connection

Next, we'll build the bridge between the two networks so our servers are no longer isolated.

Go to the VPC service in the AWS console.
In the left column, find and select Peering connections.
Click the Create peering connection button.
Give it a name; I'll use pc-vpc1-to-vpc2.
VPC ID (Requester): select vpc-01-workshop. This is the network that initiates the peering request.
For this lab, leave the default options selected: My account and This region.
In the VPC ID (Accepter) field, select vpc-02-workshop.

Finish by clicking Create peering connection at the bottom.

Architect's note: in this exercise both VPCs live in the same account and region, but the process is the same if you decide to peer networks in different regions or under different corporate accounts; in that case, you would simply choose the Another account or Another region option as appropriate.

We've created our peering, but WATCH OUT! A crucial step is missing. Remember that when we configured this connection we defined a requester and an accepter. That means the request is still pending, and vpc-02-workshop must formally accept it for the status to move from Pending acceptance to Active.

On the same screen where you just created the peering (right below the green success banner), click the Actions drop-down menu (top right corner), choose Accept request, and confirm.

Architect's note: we're accepting the request ourselves because both VPCs are in our own AWS account. If the destination VPC belonged to another corporate account or to an external customer, that account's administrator would have to sign in to their own console to accept your connection.
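
For reference, the request-then-accept handshake looks roughly like this with boto3. Treat it as a sketch for the same-account, same-region case used in this lab: the function takes an EC2 client as a parameter, and the VPC IDs in the usage comment are placeholders.

```python
def create_and_accept_peering(ec2, requester_vpc_id: str, accepter_vpc_id: str) -> str:
    """Request a peering connection and accept it (same account, same region)."""
    response = ec2.create_vpc_peering_connection(
        VpcId=requester_vpc_id,     # requester: vpc-01-workshop
        PeerVpcId=accepter_vpc_id,  # accepter: vpc-02-workshop
    )
    peering_id = response["VpcPeeringConnection"]["VpcPeeringConnectionId"]
    # Until this call succeeds, the connection stays in the pending-acceptance state.
    ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)
    return peering_id


# Usage sketch (needs boto3 and AWS credentials; the VPC IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   create_and_accept_peering(ec2, "vpc-REQUESTER_ID", "vpc-ACCEPTER_ID")
```

For a cross-account peering, the accept call would have to be made with the other account's credentials, which matches the note above.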

Step 4: Configuring the route tables (the network map)

Our bridge (the VPC Peering connection) is now built and active, but if you try the ping again you'll see it still fails. Why? Because although the logical link exists, the routers of our VPCs don't yet know they should use it. We still have to configure the navigation instructions: the route tables.

We'll start with the outbound path. In the VPC service, open Route tables and select the one associated with our source subnet (rt-01-privada) to tell it that when a resource tries to reach the CIDR block 11.0.0.0/24 (the VPC-02 network), it should send that traffic through our peering connection (pc-vpc1-to-vpc2).

Select the route table rt-01-privada and go to the Routes tab at the bottom.

Observation note: for now there is only one route, whose Target is set to local. This default rule tells the VPC that any traffic addressed to the CIDR block 10.0.0.0/24 must stay inside its own local network.

Click the Edit routes button (top right corner of the tab).
In the route editor, click Add route and configure the following fields:

Destination: enter the full CIDR block of VPC-02 (11.0.0.0/24). This tells the router the destination network range. You're saying: "Any packet headed for any resource inside VPC-02 must follow this rule." (Note: we don't enter the IP of the individual instance, but the range of the whole neighboring network.)

Target: select Peering Connection in the drop-down list. When you click the empty search box, the system automatically shows the ID of our peering, pc-vpc1-to-vpc2.
Finally, click Save changes to lock in the outbound map.
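
The same route can be added with a single boto3 call. This is a sketch that assumes you already have an EC2 client plus the route table ID and the peering connection ID; the IDs in the usage comment are placeholders.

```python
def add_peering_route(ec2, route_table_id: str, destination_cidr: str, peering_id: str) -> bool:
    """Send traffic for destination_cidr through the given peering connection."""
    response = ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=destination_cidr,  # the whole neighboring VPC, not one instance
        VpcPeeringConnectionId=peering_id,
    )
    return bool(response.get("Return", False))


# Usage sketch (placeholder IDs): route from rt-01-privada to the VPC-02 network.
#   add_peering_route(ec2, "rtb-RT01_ID", "11.0.0.0/24", "pcx-PEERING_ID")
```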

The lab's "wall"

OK, the peering is active and our route is defined. If we go back to the terminal and run the command again:

ping -c 4 <IP_PRIVADA_DE_INSTANCIA_02>

What happens? Yes, we run into exactly the same message: 4 packets transmitted, 0 received, 100% packet loss, time 3153ms.

This is the typical networking scenario that frustrates most people and makes them doubt their configuration. But not us, not today! If our peering connection is OK and the source route table is OK, what else is blocking us? We need to check the guardian that directly protects the resource: the Security Group of instancia-02-accepter.

Open that Security Group (sg-instance-02) and select the Inbound rules tab. You'll notice the group was created by our CloudFormation template with a single rule: allow SSH traffic on port 22 exclusively from the EIC Endpoint. That's why we can connect without problems, but any other type of access is denied by default.

For our destination instance to accept and answer the Echo Requests sent by the ping command, we need to allow ICMP (Internet Control Message Protocol). Unlike common web services, this traffic doesn't use TCP or UDP ports; it operates directly at the network layer to carry diagnostic messages. Because Security Groups block all inbound traffic by default, we must add an explicit rule to allow it. Let's do it!

Click Edit inbound rules and then the Add rule button.
Configure the following fields in the new row:
Type: select All ICMP - IPv4.
Source: leave it on Custom and enter the CIDR block of VPC-01 (10.0.0.0/24) in the text box. This is the network where instancia-01-requester lives, the instance we're running ping from, so the instance protected by this Security Group (instancia-02-accepter) will accept its requests and be able to answer them.
Click Save rules.

What did we just do? We told the Security Group of instancia-02-accepter to allow diagnostic packets (ping) in, as long as they come from a resource located inside the VPC-01 network.
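
In API terms, the "All ICMP - IPv4" rule is a permission with protocol icmp and both port fields set to -1. The sketch below builds that permission and applies it with boto3; the Security Group ID in the usage comment is a placeholder, and the rule description text is my own wording.

```python
def icmp_permission(source_cidr: str) -> dict:
    """Build the permission that matches the console's 'All ICMP - IPv4' rule type."""
    return {
        "IpProtocol": "icmp",
        "FromPort": -1,  # -1 means every ICMP type
        "ToPort": -1,    # -1 means every ICMP code
        "IpRanges": [{"CidrIp": source_cidr, "Description": "ping from peered VPC"}],
    }


def allow_icmp_from(ec2, security_group_id: str, source_cidr: str) -> None:
    """Add an inbound rule allowing ICMP from source_cidr to the given Security Group."""
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[icmp_permission(source_cidr)],
    )


# Usage sketch (placeholder ID): let VPC-01 ping the instance behind sg-instance-02.
#   allow_icmp_from(ec2, "sg-INSTANCE02_ID", "10.0.0.0/24")
```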

The second ping

With the right rules in the Security Group, we're ready to run our command in the terminal again:

ping -c 4 <IP_PRIVADA_DE_INSTANCIA_02>

And the result is... the same again! 4 packets transmitted, 0 received, 100% packet loss, time 3100ms.

You must be wondering: "Let's see, the peering is active, the outbound route is ready, and the Security Group now allows ping... What on earth is going on? Javier, are you trying to stress me out?"

The truth is no, but in network architecture the best way to learn is to fail, understand why, and fix it. What we're experiencing here is a vital concept: routing in AWS is not bidirectional by default.

For a ping to succeed, the packet needs a way there and a way back. Let's analyze what is happening behind the scenes right now:

The outbound trip: the packet leaves instancia-01, the VPC-01 router sees the route to the peering, the packet crosses the bridge successfully, arrives at VPC-02, the Security Group validates that it is an allowed ICMP packet, and hands it to instancia-02. The outbound trip works perfectly!
The return trip: instancia-02 receives the packet and, being polite, generates a response (Echo Reply) addressed to the IP in VPC-01.
The problem: when this return packet reaches the VPC-02 router, the router checks its own route table. Since we haven't touched the VPC-02 route table, the router only sees its local rule (11.0.0.0/24). With no explicit instruction on how to get back to the 10.0.0.0/24 network, the router doesn't know what to do and throws the response away.

In short: instancia-02 does receive the message, but its replies stay trapped in its own network. instancia-01 keeps waiting forever for an echo that will never come back.

Configuring the return path

So what do we need to do to rescue the trapped packet? Exactly: go to the route table rt-02-privada (associated with VPC-02) and repeat the procedure we followed earlier. This time, we'll add a rule specifying that all traffic addressed to the VPC-01 network (10.0.0.0/24) must leave through our peering connection.

Save the changes and, now for real, let's go back to the terminal of our first instance and run the command again:

ping -c 4 <IP_PRIVADA_DE_INSTANCIA_02>

Victory! A fully satisfying response: 4 packets transmitted, 4 received, 0% packet loss, time 3126ms. Officially, this is the moment to play We Are the Champions by Queen in the background. Ha, ha!
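
The failed ping and the successful one can be replayed with a tiny model: a ping works only if the request finds a route on the way out and the reply finds a route on the way back. This sketch looks at routing only (it assumes the Security Group already allows ICMP), and the two IP addresses are made-up examples.

```python
import ipaddress


def has_route(route_table, destination_ip):
    """True if any route in the table covers the destination address."""
    address = ipaddress.ip_address(destination_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in route_table)


def ping_succeeds(source_rt, dest_rt, source_ip, dest_ip):
    request_delivered = has_route(source_rt, dest_ip)  # Echo Request: source -> destination
    reply_delivered = has_route(dest_rt, source_ip)    # Echo Reply: destination -> source
    return request_delivered and reply_delivered


rt_01 = {"10.0.0.0/24": "local", "11.0.0.0/24": "pc-vpc1-to-vpc2"}
rt_02 = {"11.0.0.0/24": "local"}  # return route not added yet

print(ping_succeeds(rt_01, rt_02, "10.0.0.20", "11.0.0.20"))  # False

rt_02["10.0.0.0/24"] = "pc-vpc1-to-vpc2"  # add the return route to rt-02-privada
print(ping_succeeds(rt_01, rt_02, "10.0.0.20", "11.0.0.20"))  # True
```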

Testing bidirectionality

With routes configured in both directions, the network bridge is fully established both ways, no matter which resource starts the communication. This means that if we connect to instancia-02-accepter in VPC-02, we could ping instancia-01-requester in VPC-01 in return.

That said... I hope you remembered the vital detail we just learned about Security Groups (firewall). For instancia-01 to process that request, its own Security Group must allow it.

Go to the Security Group sg-instance-01 (the one protecting the instance in VPC-01) and add the corresponding rule to its Inbound rules:

Type: All ICMP - IPv4
Source: the CIDR of VPC-02 (11.0.0.0/24)

Now connect to the instance in VPC-02 (using its EC2 Instance Connect tab, as we did in Step 2) and run the ping pointing back at the origin:

ping -c 4 <IP_PRIVADA_DE_INSTANCIA_01>

Flawless results! 4 packets transmitted, 4 received, 0% packet loss, time 3114ms.

We did it! We established smooth, private, and secure communication between resources that live in fully isolated networks, thanks to VPC Peering and proper management of routing and security.

When to use (and when to avoid) VPC Peering?

VPC Peering is the ideal tool when you need a direct, ONE-to-ONE connection between two networks. However, there is one golden rule at the infrastructure level that you must remember: peering connections are NOT transitive.

What does this mean in practice? Imagine the following scenario:

You have a peering that connects VPC-A with VPC-B.

You have another peering that connects VPC-B with VPC-C.

Beginners commonly assume that, because VPC-B sits in the middle, network A could reach network C by using it as a bridge. In AWS this is not possible. Because routing is not transitive, for VPC-A and VPC-C to talk to each other you would have to create a third, direct peering between them.
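
One way to picture the rule: reachability over peering is a lookup for a direct link, never a path search through intermediate VPCs. The sketch below is only a model of that rule, using the VPC-A, VPC-B, and VPC-C names from the scenario.

```python
def can_communicate(peerings, vpc_a, vpc_b):
    """Two VPCs can exchange traffic only over a direct peering between them."""
    return frozenset((vpc_a, vpc_b)) in peerings


peerings = {
    frozenset(("VPC-A", "VPC-B")),
    frozenset(("VPC-B", "VPC-C")),
}

print(can_communicate(peerings, "VPC-A", "VPC-B"))  # True
print(can_communicate(peerings, "VPC-A", "VPC-C"))  # False: VPC-B is not a bridge

peerings.add(frozenset(("VPC-A", "VPC-C")))         # the third, direct peering
print(can_communicate(peerings, "VPC-A", "VPC-C"))  # True
```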

The scaling problem:

For this very reason, VPC Peering is recommended only when you manage a small number of networks. If your infrastructure grows and you need to interconnect 10, 20, or 50 VPCs with each other, configuring one-to-one connections creates an unmanageable web of links and route tables that becomes an administrative nightmare.
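
To put numbers on that web: connecting every VPC directly to every other one takes n(n-1)/2 peerings for n VPCs. A quick calculation:

```python
def full_mesh_peerings(vpc_count: int) -> int:
    """Number of one-to-one peerings needed so every VPC reaches every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


for count in (3, 10, 20, 50):
    print(f"{count} VPCs -> {full_mesh_peerings(count)} peerings")
# 3 VPCs -> 3 peerings
# 10 VPCs -> 45 peerings
# 20 VPCs -> 190 peerings
# 50 VPCs -> 1225 peerings
```

And every one of those peerings needs routes on both sides, as we saw in Step 4.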

Architect's note: when you face a scenario where you need to interconnect many networks at scale, VPC Peering is no longer the answer; at that point you should move up to a centralized routing service such as AWS Transit Gateway which, spoiler alert, will be our next workshop.

Final step: Deleting the resources (don't skip this step!)

After you've explored and tested everything we just built, it's essential to delete the resources to avoid unnecessary charges on your AWS account. Follow this specific order to ensure a clean teardown:

VPC Peering: go to the VPC service, select Peering connections in the left column, find the peering we created (pc-vpc1-to-vpc2), select it, and click Delete.
CloudFormation Stack: once the peering is deleted, go to the CloudFormation service, select the Stack we deployed at the start, and click Delete. AWS takes care of deleting the instances, VPCs, route tables, and Security Groups automatically.

¡Con esto habremos terminado! Espero sinceramente que hayas aprendido algo nuevo el día de hoy sobre enrutamiento y seguridad en nubes privadas virtuales.

Si tienes alguna opinión, feedback o duda, no olvides dejarla en la sección de comentarios. Además, te invito a compartir este contenido técnico; ¡podría ser de gran ayuda para otras personas en su camino de aprendizaje!

Nos vemos en el próximo workshop, donde abordaremos el siguiente nivel: AWS Transit Gateway.

Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Tue, 19 May 2026 09:10:53 +0000

TL;DR

We shipped the same FSx for ONTAP audit logs to three backends simultaneously — Datadog, Grafana Cloud, and Honeycomb — without changing a single line of Lambda code. The OpenTelemetry Collector sits between our Lambda and the backends as a routing layer. Adding or removing a backend is a YAML config change, not a code deployment.

Same audit logs → 3 backends simultaneously
Zero Lambda code changes between backends (SHA-256 verified)
OTel Collector as the vendor-neutral routing layer
All 3 event sources work: FSx audit logs via S3 Access Point, EMS webhooks, FPolicy file operations

What We're Building

In Part 2, we built a Lambda that speaks Datadog's API directly. It works great — but what happens when your security team wants Splunk, your SRE team wants Grafana, and your platform team is evaluating Honeycomb?

You'd need three separate Lambdas, each with vendor-specific formatting, auth, and retry logic. That's vendor lock-in expressed as infrastructure.

The Problem: Vendor-Specific APIs = Lock-in

Every observability vendor has their own wire format:

Vendor	Auth Header	Payload Format	Endpoint Pattern
Datadog	`DD-API-KEY: <key>`	Custom JSON schema	`https://http-intake.logs.{site}/api/v2/logs`
Splunk	`Authorization: Splunk <token>`	HEC `event` wrapper	`https://<host>:8088/services/collector/event`
Grafana Cloud	`Authorization: Basic <b64>`	OTLP	`https://otlp-gateway-prod-<region>.grafana.net/otlp`
Honeycomb	`x-honeycomb-team: <key>`	OTLP	`https://api.honeycomb.io`

If your Lambda speaks Datadog's API, switching to Grafana Cloud means rewriting your Lambda. That's the lock-in.

The Solution: OTLP as the Producer-to-Collector Contract

OpenTelemetry Protocol (OTLP) is the vendor-neutral producer-to-Collector contract. Our Lambda speaks OTLP — period. The OTel Collector handles routing, processing, and backend-specific export.

┌─────────────────────────────────────────────────────────────────────┐
│ AWS Account                                                         │
│                                                                     │
│  ┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐  │
│  │ Audit Logs   │────▶│ Lambda           │     │ OTel Collector  │  │
│  │ (via S3 AP)  │────▶│ (OTLP Shipper)   │────▶│ (Docker/Fargate)│  │
│  │ EMS/FPolicy  │────▶│                  │     │                 │  │
│  └──────────────┘     └──────────────────┘     └─┬──────┬──────┬─┘  │
│                                                  │      │      │    │
└──────────────────────────────────────────────────┼──────┼──────┼────┘
                                                   │      │      │
                                                   ▼      ▼      ▼
                                              Datadog  Grafana Honeycomb
                                               (AP1)    Cloud

The Lambda sends OTLP/HTTP to the Collector. The Collector fans out to any combination of backends. Adding Honeycomb? Add 5 lines of YAML. Dropping Datadog? Remove 4 lines. No Lambda redeployment.

Prerequisites

Before starting, you need:

FSx for ONTAP with audit logging configured (see Part 2 for setup)
Docker installed locally (Colima works — see troubleshooting for compose compatibility)
At least one backend account:
- Datadog: API key + site (e.g., ap1.datadoghq.com)
- Grafana Cloud: Instance ID + API token (Cloud Portal → OTLP)
- Honeycomb: Ingest API key (starts with hcaik_)
AWS account with Lambda deployment capability
Parts 1–4 context (recommended but not required — this integration works standalone)

FSx for ONTAP S3 Access Point note: The Lambda reads audit logs through an S3 Access Point attached to the FSx for ONTAP volume. Data remains on the FSx file system — it is not copied to a separate S3 bucket. S3 API throughput via FSx depends on the file system's provisioned throughput capacity, not standard S3 scaling. Validate FSx read throughput separately from Collector and backend ingest throughput.

The OTel Collector Configuration

The Collector config is the heart of this pattern. Here's the full verified configuration for multi-backend delivery:

# otel-collector-config.yaml
# ✅ VERIFIED WORKING (2026-05-18)
# Image: otel/opentelemetry-collector-contrib:0.152.0
# Backends: Grafana Cloud (ap-northeast-0) + Honeycomb

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  # memory_limiter:        # Recommended for production
  #   check_interval: 1s
  #   limit_mib: 512
  #   spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  otlp_http/grafana:
    endpoint: ${env:GRAFANA_OTLP_ENDPOINT}
    headers:
      Authorization: "Basic ${env:GRAFANA_BASIC_AUTH}"

  otlp_http/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
      x-honeycomb-dataset: ${env:HONEYCOMB_DATASET}

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp_http/grafana, otlp_http/honeycomb]

Depending on your Honeycomb environment and dataset model, x-honeycomb-dataset may be optional or handled differently. Refer to your Honeycomb OTLP setup page for the recommended configuration.

This article uses otlp_http (the forward-compatible component name). If your Collector version does not recognize it, use the older otlphttp alias or upgrade the Collector.

Section Breakdown

Section	Purpose	Key Settings
`receivers.otlp`	Accepts OTLP/HTTP from Lambda	Port 4318 (OTLP standard)
`processors.batch`	Buffers logs before export	5s timeout OR 1000 records (whichever first)
`exporters.otlp_http/*`	Sends to each backend	Per-backend auth headers
`extensions.health_check`	Liveness probe	Port 13133 for `curl -f` checks
`service.pipelines`	Wires components together	logs: receiver → processor → exporters

Production note: This configuration is suitable for development and validation. For production, add retry_on_failure and sending_queue settings to exporters, configure memory_limiter processor, and consider persistent storage extensions. Without persistent buffering, telemetry in the Collector's in-memory batch can be lost during Collector restarts.

Adding Datadog as a Third Backend

To send to all three simultaneously, add the Datadog exporter:

exporters:
  # ... existing grafana + honeycomb exporters ...

  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: ${env:DD_SITE}

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp_http/grafana, otlp_http/honeycomb, datadog]

That's it. Restart the Collector. Same Lambda, same OTLP payload, now three destinations.

For Datadog, this example uses the Collector's dedicated datadog exporter rather than generic otlp_http, because it handles Datadog-specific intake behavior, metadata mapping, and host tagging.

The Lambda Handler (OTLP Shipper)

Key Design Decisions

Why OTLP? — It gives the Lambda a single producer-to-Collector contract. The Collector then handles each backend's supported exporter or intake path. One format to maintain, not three.
Why no vendor SDK? — SDKs add cold start latency, dependency management, and vendor coupling. Pure urllib3 + JSON keeps the Lambda lean.
Why AUTH_MODE? — Different Collectors may need different auth. The Lambda supports none, basic, and bearer modes without code changes.

Field Mapping: FSx ONTAP → OTLP Attributes

The Lambda maps FSx ONTAP audit fields to semantic OTLP attribute keys:

FSx ONTAP Field	OTLP Attribute Key	Example Value
`EventID`	`event.type`	`4663`
`UserName`	`user.name`	`admin@corp.local`
`ClientIP`	`client.address`	`10.0.1.50`
`Operation`	`fsxn.operation`	`ReadData`
`ObjectName`	`fsxn.path`	`/vol/data/reports/q4.xlsx`
`Result`	`fsxn.result`	`Success`
`SVMName`	`fsxn.svm`	`svm-prod-01`

The examples above focus on S3 audit logs because they are the highest-volume path. The same OTLP shipper pattern is reused for EMS webhook events and FPolicy file operations using source-specific field mappers (ems_handler.py, fpolicy_handler.py), while preserving the same Collector-facing OTLP contract. For EMS and FPolicy, source-specific service names are used (fsxn-ems, fsxn-fpolicy) to distinguish event sources in the backend.
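As a rough illustration of how the field mapping table could be applied, here is a minimal sketch of a per-record mapper. It is not the repository's actual map_log_record: the `FIELD_MAP` constant, the body format, and the use of the current time as the timestamp are assumptions made for this example (a real handler would presumably use the event's own timestamp and also set severity).

```python
import time
from typing import Any

# Field mapping copied from the table above: FSx ONTAP field -> OTLP attribute key
FIELD_MAP = {
    "EventID": "event.type",
    "UserName": "user.name",
    "ClientIP": "client.address",
    "Operation": "fsxn.operation",
    "ObjectName": "fsxn.path",
    "Result": "fsxn.result",
    "SVMName": "fsxn.svm",
}

def map_log_record(log: dict[str, Any]) -> dict[str, Any]:
    """Convert one audit event into an OTLP/JSON logRecord (sketch)."""
    attributes = [
        {"key": otlp_key, "value": {"stringValue": str(log[field])}}
        for field, otlp_key in FIELD_MAP.items()
        if log.get(field) is not None
    ]
    # Assumed body format: "<Operation> <ObjectName>"
    body = f"{log.get('Operation', 'unknown')} {log.get('ObjectName', '')}".strip()
    return {
        # OTLP/JSON encodes 64-bit integers as decimal strings
        "timeUnixNano": str(time.time_ns()),
        "body": {"stringValue": body},
        "attributes": attributes,
    }

record = map_log_record({
    "EventID": 4663,
    "UserName": "admin@corp.local",
    "ClientIP": "10.0.1.50",
    "Operation": "ReadData",
    "ObjectName": "/vol/data/reports/q4.xlsx",
    "Result": "Success",
    "SVMName": "svm-prod-01",
})
print([a["key"] for a in record["attributes"]])
```

Keeping the mapping in a single dictionary means a new source field becomes a one-line change, and fields missing from an event are skipped rather than sent as empty attributes.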

Resource-level attributes (set once per payload, not per log record):

Attribute	Value	Purpose
`service.name`	`fsxn-audit`	Service identification
`cloud.provider`	`aws`	Cloud context
`cloud.platform`	`aws_fsx`	Platform context

cloud.platform=aws_fsx is a project-specific value used to identify FSx for ONTAP as the data source. It is not part of the OpenTelemetry semantic conventions standard cloud.platform values (which include aws_ec2, aws_ecs, aws_eks, aws_lambda, etc.).

Severity Determination Logic

The Lambda determines OTLP severity from the Result field:

from typing import Optional

# Substrings in the Result field that indicate a failed or denied operation
WARN_KEYWORDS = ("fail", "denied", "error")

def determine_severity(result: Optional[str]) -> tuple[int, str]:
    """Determine OTLP severity from FSx ONTAP Result field."""
    if not result:
        return (9, "INFO")  # a missing Result is treated as informational
    lower = result.lower()
    for keyword in WARN_KEYWORDS:
        if keyword in lower:
            return (13, "WARN")
    return (9, "INFO")

This means failed access attempts (Result: "Failure") automatically get severityNumber: 13 (WARN), making them easy to filter in any backend.

The Lambda sets both severityNumber and severityText according to the OpenTelemetry Logs Data Model severity level definitions.

OTLP Payload Construction

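from typing import Any

def build_otlp_payload(
    logs: list[dict[str, Any]],
    service_name: str,
    source_key: str,
) -> dict[str, Any]:
    """Build OTLP Log Data Model payload."""
    # map_log_record converts one audit event into an OTLP logRecord
    # using the field mapping shown earlier.
    log_records = [map_log_record(log) for log in logs]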

    return {
        "resourceLogs": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                    {"key": "cloud.provider", "value": {"stringValue": "aws"}},
                    {"key": "cloud.platform", "value": {"stringValue": "aws_fsx"}},
                ]
            },
            "scopeLogs": [{
                "scope": {"name": "fsxn-otel-shipper", "version": "1.0.0"},
                "logRecords": log_records,
            }],
        }]
    }

No vendor SDK. No vendor-specific formatting. Just the OTLP Log Data Model.

Retry with Exponential Backoff

import json
import random
import time

import urllib3

MAX_RETRIES = 3
BASE_INTERVAL = 2  # seconds

# Module-level pool so warm Lambda invocations reuse connections
http = urllib3.PoolManager()

def _send_otlp_payload(payload, endpoint, auth_headers=None) -> bool:
    """Send OTLP payload via HTTP POST with retry logic.

    Retries on HTTP 429 and 5xx. Does not retry on 4xx (except 429).
    Exponential backoff between attempts: 2s, 4s, each with up to 1s of jitter.
    """
    url = f"{endpoint}/v1/logs"
    headers = {"Content-Type": "application/json"}
    if auth_headers:
        headers.update(auth_headers)

    json_body = json.dumps(payload).encode("utf-8")

    for attempt in range(MAX_RETRIES):
        response = http.request("POST", url, body=json_body, headers=headers, timeout=30.0)

        if response.status < 300:
            return True
        if response.status == 429 or response.status >= 500:
            # Skip the sleep after the final attempt: nothing is left to retry
            if attempt < MAX_RETRIES - 1:
                wait_time = BASE_INTERVAL * (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            continue
        # Client error (4xx except 429): don't retry
        return False
    return False

AUTH_MODE Support

The Lambda supports three authentication modes via the AUTH_MODE environment variable:

AUTH_MODE	Behavior	Use Case
`none`	No auth headers sent	Local Collector (no auth needed)
`basic`	`Authorization: Basic <base64(token)>`	Grafana Cloud direct
`bearer`	`Authorization: Bearer <token>`	Generic OTLP endpoints

When using the Collector pattern, set AUTH_MODE=none on the Lambda and let the Collector handle backend credentials through its own config.

The direct auth modes (basic, bearer) are useful for testing or for bypassing the Collector.
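The three modes in the table can be sketched as a small header builder. This is an illustration under assumptions, not the handler's actual code: the function name `build_auth_headers` and its `token` parameter are hypothetical, and the sketch simply mirrors the Behavior column above.

```python
import base64

def build_auth_headers(auth_mode: str, token: str = "") -> dict[str, str]:
    """Return the HTTP auth headers for a given AUTH_MODE (sketch)."""
    mode = auth_mode.lower()
    if mode == "none":
        return {}
    if mode == "basic":
        encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
        return {"Authorization": f"Basic {encoded}"}
    if mode == "bearer":
        return {"Authorization": f"Bearer {token}"}
    # Failing loudly on an unknown mode avoids silently sending unauthenticated requests
    raise ValueError(f"Unsupported AUTH_MODE: {auth_mode!r}")

print(build_auth_headers("none"))
print(build_auth_headers("bearer", "example-token"))
```

Because the mode is resolved at runtime, switching between a local Collector and a direct endpoint only requires changing an environment variable.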

Deployment

Local Development: Docker Run

# 1. Configure credentials
cd integrations/otel-collector
cp .env.example .env
# Edit .env with your backend credentials:
#   GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
#   GRAFANA_BASIC_AUTH=<base64(instanceId:apiToken)>
#   HONEYCOMB_API_KEY=hcaik_<your-ingest-key>
#   HONEYCOMB_DATASET=fsxn-audit

# 2. Start OTel Collector
docker run -d --name otel-collector \
  -p 4318:4318 -p 13133:13133 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  --env-file .env \
  otel/opentelemetry-collector-contrib:0.152.0

# 3. Verify health
curl -f http://localhost:13133/
# Expected: HTTP 200 — {"status":"Server available", ...}

The health_check extension confirms the Collector process is available; it does not guarantee that each backend exporter is successfully delivering logs. Monitor exporter errors separately using the Collector's internal telemetry metrics if enabled and exposed.

# 4. Send a test payload
bash scripts/generate-otlp-payload.sh --output /tmp/payload.json
curl -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d @/tmp/payload.json

Colima users: the docker compose v2 plugin is not bundled with Colima and may be missing from your setup. All scripts in this repo detect this and fall back to docker run. If you see "docker compose: command not found", this is expected behavior.

First Success Path

If you're trying this for the first time, start small:

Run the Collector locally with one backend.
Send one fresh OTLP payload.
Confirm the event appears in that backend.
Add the second exporter.
Only then move to multi-backend or AWS deployment.

This keeps the first validation focused on the producer-to-Collector contract before introducing backend parity and production networking.

AWS Deployment: CloudFormation

aws cloudformation deploy \
  --template-file integrations/otel-collector/template.yaml \
  --stack-name fsxn-otel-integration \
  --parameter-overrides \
    S3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    OtlpEndpoint=http://<your-collector-endpoint>:4318 \
    ApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-otel-key-XXXXXX \
    AuthMode=none \
  --capabilities CAPABILITY_IAM \
  --region ap-northeast-1

This template deploys the Lambda-side OTLP shipper. The Collector endpoint must already be reachable from the Lambda: for example, a local Collector for development, an EC2-hosted Collector, or an ECS/Fargate-based Collector in the same VPC. If the Lambda is in a VPC, ensure its security group allows outbound TCP 4318 to the Collector and that the Collector's security group allows the matching inbound traffic. See the repository's VPC Deployment Guide and Security Hardening Guide for production Collector deployment.

When the Collector handles auth, set AuthMode=none on the Lambda. The Collector config contains the per-backend credentials via environment variables (sourced from .env or Secrets Manager in production).

Environment Variables

Variable	Lambda	Collector	Description
`OTLP_ENDPOINT`	✅	—	Collector URL (e.g., `http://collector:4318`)
`AUTH_MODE`	✅	—	`none` / `basic` / `bearer`
`SERVICE_NAME`	✅	—	OTLP `service.name` attribute
`GRAFANA_OTLP_ENDPOINT`	—	✅	Grafana Cloud OTLP gateway URL
`GRAFANA_BASIC_AUTH`	—	✅	base64(instanceId:apiToken)
`HONEYCOMB_API_KEY`	—	✅	Ingest key (hcaik_...)
`HONEYCOMB_DATASET`	—	✅	Dataset name
`DD_API_KEY`	—	✅	Datadog API key
`DD_SITE`	—	✅	Datadog site (`datadoghq.com`, `datadoghq.eu`, `ap1.datadoghq.com`, etc.)

Verified Results

All backends were tested on 2026-05-18 using otel/opentelemetry-collector-contrib:0.152.0:

Backend	Region/Site	Status	Event Sources	Auth Method
Datadog	ap1.datadoghq.com	✅ Verified	S3 audit + EMS + FPolicy	Datadog exporter (`DD-API-KEY`)
Grafana Cloud	ap-northeast-0	✅ Verified	S3 audit + EMS + FPolicy	Basic Auth via `otlp_http`
Honeycomb	—	✅ Verified	S3 audit + EMS + FPolicy	`x-honeycomb-team` via `otlp_http`
Multi-Backend	Grafana + Honeycomb	✅ Verified	Simultaneous delivery	Both auth methods
Multi-Backend	Datadog + Grafana + Honeycomb	✅ Verified	Simultaneous 3-way delivery	All three exporters

All three backends received the same structured attributes:

event.type, user.name, client.address
fsxn.operation, fsxn.path, fsxn.result, fsxn.svm
cloud.provider=aws, cloud.platform=aws_fsx

OTLP standardizes the producer-to-Collector contract, but backend-specific indexing, query semantics, and retention behavior still need to be validated per destination. OpenTelemetry is not a backend — it defines APIs, protocols, and Collector components for telemetry generation, collection, processing, and export. Storage, visualization, and alerting are handled by the backends themselves. See the Backend Parity Matrix and PoC Checklist for backend-specific validation details.

The Proof: Zero Code Changes

Here's the key evidence. The Lambda handler's SHA-256 hash is identical regardless of which backend receives the logs:

$ shasum -a 256 integrations/otel-collector/lambda/handler.py
# Same hash whether targeting Datadog, Grafana Cloud, or Honeycomb
# The file never changes — only the Collector config does

What changes between backends? Only the OTel Collector config file.
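If you want to script the same check, here is a minimal sketch using only the standard library. The helper names are illustrative; the idea is to record the digest before changing the Collector config and compare it again afterwards.

```python
import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, like `shasum -a 256`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large files do not need to fit in memory
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def handler_unchanged(path: str, recorded_digest: str) -> bool:
    """Compare the current digest with one recorded before the backend switch."""
    return sha256_of(path) == recorded_digest
```

For example, call sha256_of("integrations/otel-collector/lambda/handler.py") before adding an exporter, store the result, and pass it to handler_unchanged after the Collector restart.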

Demonstration: Adding a Backend

Starting state: Grafana Cloud only.

# Before: single backend
service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana]

Adding Honeycomb:

# After: add 5 lines to exporters section + update pipeline
exporters:
  otlp_http/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
      x-honeycomb-dataset: ${env:HONEYCOMB_DATASET}

service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana, otlp_http/honeycomb]

Restart the Collector. Done. No Lambda redeployment, no code review, no CI/CD pipeline for the shipper.

Demonstration: Removing a Backend

Dropping Datadog during a migration to Grafana Cloud:

# Remove from exporters list — that's it
service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana]  # removed: datadog

Troubleshooting

Timestamp Rejection / Static Payload Gotcha

Datadog documents that logs older than 18 hours are dropped at intake (Datadog Logs API docs). Other backends may also reject or hide events with timestamps outside their accepted windows. In my testing, future timestamps also caused ingestion issues on some backends. When testing with static payloads, always generate fresh timestamps.

Fix: Use the payload generator to create fresh timestamps:

bash scripts/generate-otlp-payload.sh --output /tmp/fresh-payload.json
curl -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d @/tmp/fresh-payload.json
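If you prefer to build the test payload yourself, the idea can be sketched in Python. This is a stand-in for illustration, not the repository's generate-otlp-payload.sh: the function names are hypothetical, and the 18-hour constant only reflects the Datadog limit mentioned above (other backends may use different windows).

```python
import json
import time
from typing import Optional

MAX_AGE_SECONDS = 18 * 60 * 60  # Datadog's documented intake window

def fresh_payload(message: str, service_name: str = "fsxn-audit") -> dict:
    """Build a one-record OTLP/JSON logs payload stamped with the current time."""
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeLogs": [{
                "logRecords": [{
                    # OTLP/JSON encodes 64-bit integers as decimal strings
                    "timeUnixNano": str(time.time_ns()),
                    "severityNumber": 9,
                    "severityText": "INFO",
                    "body": {"stringValue": message},
                }],
            }],
        }]
    }

def is_within_window(time_unix_nano: str, now_ns: Optional[int] = None) -> bool:
    """True if the timestamp is not in the future and is under 18 hours old."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    age_seconds = (now_ns - int(time_unix_nano)) / 1e9
    return 0 <= age_seconds < MAX_AGE_SECONDS

payload = fresh_payload("test event")
print(json.dumps(payload)[:100])
```

Running is_within_window against a saved payload before sending it is a quick way to tell a stale test file apart from a real delivery problem.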

Grafana Cloud Auth Format

The loki exporter is NOT the correct approach for OTLP → Grafana Cloud.

❌ loki exporter with Loki push API
✅ otlp_http/grafana with OTLP gateway endpoint

The Basic Auth value must be base64(instanceId:apiToken):

# Generate the auth value
echo -n "<your-instance-id>:<your-grafana-cloud-api-token>" | base64

The instance ID is your numeric Grafana Cloud instance ID (found in Cloud Portal → OTLP configuration).
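The same value can be produced in Python. One reason to consider it: GNU base64 wraps its output at 76 characters by default, so a long token can yield a multi-line value, whereas this sketch always returns a single line. The function name and the credentials shown are placeholders.

```python
import base64

def grafana_basic_auth(instance_id: str, api_token: str) -> str:
    """Return base64(instanceId:apiToken) for GRAFANA_BASIC_AUTH."""
    raw = f"{instance_id}:{api_token}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

# Placeholder values, not real credentials
value = grafana_basic_auth("123456", "example-token")
print(value)
```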

Honeycomb Key Types

Honeycomb has two key types. Only ingest keys work for data ingestion:

Key Prefix	Type	Works for OTLP?
`hcaik_`	Ingest API key	✅ Yes
`hcxik_`	Environment key	❌ No

If you see 401 Unauthorized from Honeycomb, check your key prefix.

Colima Docker Compose Compatibility

The docker compose v2 plugin is not bundled with Colima, so it may be missing in Colima environments. All scripts in this repository detect this automatically and fall back to docker run. This is expected, not an error.

If you need compose-like orchestration on Colima, use the explicit docker run commands shown in the Deployment section.

Common Mistake: loki Exporter vs otlp_http

A frequent misconfiguration when targeting Grafana Cloud:

# ❌ WRONG — loki exporter uses Loki-specific push API
exporters:
  loki:
    endpoint: https://logs-prod-<region>.grafana.net/loki/api/v1/push

# ✅ CORRECT — otlp_http uses the OTLP gateway
exporters:
  otlp_http/grafana:
    endpoint: https://otlp-gateway-prod-<region>.grafana.net/otlp

The OTLP gateway is Grafana Cloud's native OTLP ingestion endpoint. It handles logs, metrics, and traces through a single URL.

Cost Model: How to Think About It

Lambda Cost (OTLP Path vs Direct Send)

In my validation, the OTLP Lambda was simpler and shorter-lived than the vendor-specific direct-send path. Your duration will vary depending on batching, payload size, network path, and backend response time.

Component	Direct Send (Part 2)	OTLP + Collector
Lambda complexity	Vendor formatting + HTTP + retry	OTLP POST to nearby Collector
Lambda memory	256MB	256MB
Vendor SDK deps	Yes (adds cold start)	None
Retry complexity	Per-vendor	Delegated to Collector

OTel Collector Cost

The Collector introduces a fixed infrastructure cost that is independent of event volume:

Deployment	Best For
Docker on local machine	Development, testing
Docker on EC2 Spot (t3.small)	Low-volume production
ECS Fargate (0.5 vCPU, 1GB)	Production (no OS management)
ECS Fargate + NAT Gateway	VPC-internal production

When to Use Each Pattern

Scenario	Recommendation
Single vendor, low volume	Direct Send (Part 2 pattern) — no Collector overhead
Single vendor, high volume	Collector (buffering + backpressure benefits)
Multi-vendor evaluation	Collector (add/remove exporters freely)
Vendor migration in progress	Collector (parallel delivery during cutover)
Compliance: logs in multiple systems	Collector (fan-out is a config change)

The Collector has fixed infrastructure costs regardless of volume. As volume increases or vendors multiply, the Collector path becomes more cost-effective because it processes once and fans out. The Collector path centralizes fan-out outside the Lambda. Direct-send can also fan out within one Lambda, but that pushes vendor-specific formatting, retry behavior, and failure isolation back into application code.

Important: Backend ingest/retention costs are not included in these AWS-side estimates. Datadog, Grafana Cloud, and Honeycomb each have their own pricing models that can become the dominant cost at scale.

When to Use This Pattern

Multi-Vendor Evaluation

Want to try Honeycomb for a month alongside your existing Datadog setup? Add one exporter to the Collector config. No Lambda redeployment. No risk to your existing pipeline.

Compliance: Logs in Multiple Systems

Some organizations require audit logs in multiple systems — security team uses Splunk, dev team uses Datadog, compliance team needs a cold archive. The Collector fans out to all simultaneously from a single OTLP stream.

Migration Between Vendors

Moving from Datadog to Grafana Cloud? Run both exporters in parallel during migration. Verify data parity in the new system. Remove the old exporter when satisfied. Zero-downtime vendor migration.

Cost Optimization: Route by Volume

Use the Collector's processor pipeline to route high-volume noisy logs (read operations) to a cheaper backend while keeping security-critical events (deletes, permission changes) on a premium platform with alerting.

What's Next

For production hardening, the repository includes guides covering VPC deployment, health monitoring, persistent buffering, security hardening, and benchmarking. Auto-scaling and Multi-AZ deployment are natural next steps for production Collector operations.

For production and partner-led deployments, the repository includes:

Architecture Decision Record
VPC Deployment Guide — private networking, security groups, and Collector reachability from Lambda
Config Governance Guide
Security Hardening Guide
Operations Guide
Cost Model
PoC Checklist
Routing and Filtering Examples
Compliance Evidence Note
Migration Guide — zero-downtime migration from direct-send to the Collector path
OTel Semantic Mapping Guide — standard vs project-specific attributes, schema evolution, and what OTLP does not solve
Backend Parity Matrix — visibility and query behavior across Datadog, Grafana Cloud, and Honeycomb
Glossary / 用語集 — English/Japanese OTel terminology used in this project
Enterprise Workload Addendum — SAP, VMware, and mission-critical workload considerations
Storage Service Selection Note — when to use FSx for ONTAP, Amazon S3, Amazon EFS, and Amazon EBS

Key Takeaways

OTLP is the stable producer contract. Your Lambda speaks one protocol; the Collector handles backend-specific exporters.
OTel Collector is the routing and processing layer that decouples log producers from observability backends.
Zero Lambda code changes when switching or adding backends — verified with SHA-256 hash comparison.
Multi-backend delivery is a config change, not a code change. Add 5 lines of YAML, restart the Collector.
All three FSx ONTAP event sources work: FSx audit logs via S3 Access Point (Part 2), EMS webhooks (Part 3), and FPolicy file operations (Part 4).
Collector economics improve as volume increases or vendors multiply — fixed Collector cost is amortized across all destinations.
Start with direct send (Part 2) for simplicity. Graduate to the Collector when you need multi-backend, vendor migration, or volume-based routing.

Series Navigation

Part 1: Why Your FSx for ONTAP Logs Deserve Better
Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
Part 5: Escape Vendor Lock-in with OTel Collector (this post)

Questions about the OTel Collector pattern or multi-backend delivery? Drop a comment below.

Previous: Part 4 — FPolicy File Activity Pipeline

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

CTF Event Report: Security-JAWS 10th Anniversary Day 2 — All 27 AWS Security Challenges Solved

TOMOAKI ishihara — Mon, 18 May 2026 13:13:37 +0000

Introduction

I participated in the CTF held on Day 2 of "Security-JAWS DAYS ~10th Anniversary Event~", organized by Security-JAWS, a Japanese AWS user community focused on cloud security.

The CTF was themed around a fictional SaaS company called "TechVault", and the scenario had us conducting a penetration investigation — starting from their employee portal and ultimately uncovering evidence of fraudulent transactions. It was an exceptionally well-crafted CTF with a cohesive narrative running through all challenges.

Event Overview

Duration: 13:00–17:00 / 4 hours
Total challenges: 27
- Tutorial: 6 / 290 pt
- Mainline: 12 / 2,300 pt
- Bonus: 5 / 1,400 pt
- Advanced: 2 / 700 pt
- Blue Team: 1 / 300 pt
- Finale: 1 / 600 pt
Setting: AWS environment of a fictional SaaS company "TechVault"

The story begins with an intrusion investigation of TechVault's portal service and culminates in gathering evidence of someone's fraudulent transactions. The level of polish in the scenario design was remarkable.

Results

Challenges solved: 27 out of 27
Score: 5,310 pt (max: ~5,590–5,650 pt)
Time to complete: 2 hours 48 minutes 36 seconds
Final ranking: 12th out of 125 participants
What went well:
- Working hands-on with knowledge and CLI tools I don't normally touch in daily work gave me a much deeper understanding of each technique.
What I'd improve:
- I should have set up my environment beforehand. I normally use devContainers, so my host machine only had the minimum: AWS CLI and Python. Docker, OpenSSL, Boto3, and similar tools were missing, which cost me more time than necessary.

Challenge Structure

Tutorial (290 pt)

A step-by-step introduction to web reconnaissance and AWS CLI basics.

ID	Title	Description
T1	Web Recon · robots.txt	Discover hidden paths from `robots.txt` Disallow directives
T2	The Unlocked Warehouse · Public S3	Retrieve files directly from a publicly exposed S3 bucket
T3	Behind the Page · HTML Source	Investigate credentials buried in HTML comments
T4	First Steps with curl	Check information embedded in HTTP response headers
T5	Leaked Config · .env File	Find a `.env` file mistakenly placed in the web root
T6	First Steps with AWS CLI	Use the key found in `.env` to run `sts get-caller-identity`

The T5→T6 flow was clever. You grab a key from .env and immediately use it with the AWS CLI — a hands-on demonstration of how a web vulnerability becomes an AWS entry point.

Mainline (2,300 pt)

The core of the CTF: an attack chain that follows the path of intrusion → privilege escalation → evidence collection, starting from Stage 0.

ID	Title	Description
Stage 0	The Forgotten Debug Mode	Extract AWS keys from debug output left in an auth API error response
Stage 1A	Flip the Bucket	Find a file hidden under a `.hidden/` prefix in an S3 bucket
Stage 1B	Who Am I?	Read IAM policy metadata to understand the compromised user's permissions
Stage 1C	The Past Never Disappears	Recover AWS keys left in a Git repository's commit history
Stage 1D	Ask the AI	Prompt injection against an AI assistant embedded in the employee portal
Stage 2A	The Deleted File	Recover a deleted file using S3 object versioning
Stage 2B	The Permission Map	Use `sts:AssumeRole` to pivot laterally into the DataAnalystRole
Stage 2D	The Function's Secret	Retrieve sensitive data stored in Lambda environment variables
Stage 2E	The Parameter Labyrinth	Navigate SSM Parameter Store paths to collect secrets
Stage 2G	The AI's Permissions	Extract S3 data via an over-privileged Bedrock agent
Stage 3A	The Vault Key	Retrieve the ZIP decryption password from Secrets Manager
Final	Consolidate the Evidence	Decrypt the evidence file using all collected information to expose the CEO's fraud

The Bedrock agent challenge (Stage 2G) was fresh. The agent was configured with direct S3 access, so data from a bucket I couldn't read directly could be pulled out simply by asking the agent "show me the project metadata." It drove home how important permission design is when integrating AI into your stack.

Bonus / Advanced / Blue Team / Finale (3,000 pt)

Additional challenges branching off the mainline, each requiring deeper technical knowledge.

ID	Title	Category	Description
Stage 2C	The Server's Shadow	bonus	Flag stored in EC2 instance tags
Stage 2F	Find It Automatically	bonus	Scan all branches for secrets using `gitleaks`
Stage 3B	The Invisible Voice	bonus	SSRF to IMDSv1 to steal EC2 role temporary credentials
Stage 3C	The False Face	advanced	Self-declare `custom:role=admin` during Cognito sign-up
Stage 3D	The Truth Inside the Image	bonus	Recover files deleted by `RUN rm` from Docker image layers
Stage 3E	The Neighbor's Vault	advanced	Read another tenant's data via wildcard permissions on S3 Vectors
Stage 4	Follow the Trail	blueteam	Identify attacker operation timestamps from CloudTrail logs
Stage 5	Suspicious Activity	finale	Decrypt CTO complicity evidence by tracing late-night activity in CloudTrail
Stage 5B	Combined Attack Surface	bonus	Call an internal API by combining intelligence from Stage 3D and Stage 3E

Stage 3D was by far the most time-consuming for me — because Docker wasn't installed in my CTF environment. Instead of using docker history (which would have shown it in seconds), I had to query the ECR API directly to fetch the image manifest, download each layer, and extract the tarballs manually. Painful on the clock, but I ended up with a much deeper understanding of how Docker image layers actually work.

Stage 4 and Stage 5 involved parsing large CloudTrail log files with jq to reconstruct the attacker's footsteps — a great taste of what SOC/incident response work feels like. Stage 5 in particular required chaining multiple steps: find the suspicious late-night (JST) operations in the logs, track down the Secrets Manager path they pointed to, and decrypt the encrypted evidence file with OpenSSL. OpenSSL wasn't available either, so I ended up implementing the decryption in Python.
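
The late-night filtering step can also be sketched in plain Python instead of jq. This is a minimal sketch, assuming the standard CloudTrail log layout (a top-level `Records` list with UTC `eventTime` values); the 00:00 to 05:00 JST window, the sample records, and the function name are illustrative assumptions, not details from the CTF.

```python
import json
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))  # Japan Standard Time is UTC+9 with no DST


def late_night_events(trail: dict, start_hour: int = 0, end_hour: int = 5) -> list[dict]:
    """Return records whose eventTime falls in [start_hour, end_hour) JST."""
    hits = []
    for record in trail.get("Records", []):
        # CloudTrail timestamps are UTC, e.g. "2026-05-10T16:42:07Z"
        utc_time = datetime.fromisoformat(record["eventTime"].replace("Z", "+00:00"))
        local_time = utc_time.astimezone(JST)
        if start_hour <= local_time.hour < end_hour:
            hits.append({
                "time_jst": local_time.isoformat(),
                "event": record.get("eventName"),
                "who": record.get("userIdentity", {}).get("arn"),
            })
    return sorted(hits, key=lambda h: h["time_jst"])


# Made-up sample records for illustration
sample = {
    "Records": [
        {"eventTime": "2026-05-10T16:42:07Z", "eventName": "GetSecretValue",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"}},
        {"eventTime": "2026-05-10T03:15:00Z", "eventName": "ListBuckets",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"}},
    ]
}
suspicious = late_night_events(sample)
print(json.dumps(suspicious, indent=2))
```

Only the first record survives the filter: 16:42 UTC is 01:42 the next day in JST, while 03:15 UTC is just after noon.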

The Full Attack Chain

Each challenge looks independent, but they're all connected as a single story.

Obtain AWS keys from debug API response (Stage 0)
        ↓
IAM recon reveals an AssumeRole-able role (Stage 1B)
        ↓
Pivot laterally into DataAnalystRole (Stage 2B)
        ↓
┌─────────────────────────────────────────────────────┐
│  Collect intelligence across multiple parallel paths │
│  · EC2 tags (Stage 2C)                              │
│  · S3 versioning — recover deleted files (Stage 2A) │
│  · Lambda environment variables (Stage 2D)          │
│  · SSM Parameter Store (Stage 2E)                   │
│  · SSRF → IMDSv1 (Stage 3B)                        │
│  · ECR Docker layer analysis (Stage 3D)             │
│  · S3 Vectors cross-tenant leak (Stage 3E)          │
└─────────────────────────────────────────────────────┘
        ↓
Retrieve password from Secrets Manager (Stage 3A)
        ↓
Decrypt ZIP to obtain CEO fraud evidence (Final)
        ↓
Trace CTO complicity via CloudTrail (Stage 4 → 5)

Each individual vulnerability might look limited in isolation, but chaining them together produces a critical breach. Stage 5B is the perfect example: an internal API only reachable by combining intelligence gathered from two separate advanced stages.

Key Takeaways and Mitigations

Information Leakage (Debug, Headers, etc.)

Debug output that's convenient during development can leak AWS keys if left enabled in production. HTML comments, response headers, and robots.txt are all reconnaissance vectors attackers regularly check.

Mitigations:

Disable debug mode in production environments.
Remove unnecessary response headers like X-Powered-By.
Never place secrets in front-end source code.

S3 Misconfiguration

Three distinct S3 issues appeared: public access enabled, a .hidden/ prefix used as security-by-obscurity, and deleted files recoverable via versioning. All three stem from treating S3 like a traditional filesystem.

Mitigations:

Enable Block Public Access on all buckets.
Prefixes are not access controls.
If versioning is enabled, add lifecycle rules that expire noncurrent versions and clean up expired delete markers.

Secrets in Git History

Even after deleting a .env file and committing the removal, git log -p surfaces it instantly. Tools like gitleaks can scan every branch and every commit in seconds.

Mitigations:

Integrate git-secrets or gitleaks as a pre-commit hook.
If a secret was already committed, rewrite history with git filter-repo and rotate the key immediately.
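
To make the idea behind such scanners concrete, here is a toy sketch that looks for AWS access key IDs in text such as the output of `git log -p --all`. It is not how gitleaks is implemented; the regex covers only the `AKIA`/`ASIA` key ID prefixes, and the function name and sample diff are assumptions for illustration.

```python
import re

# AWS access key IDs are 20 characters: a prefix such as AKIA (long-term)
# or ASIA (temporary) followed by 16 uppercase letters or digits.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")


def find_access_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line_number, key_id) for every access key ID found in text."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            findings.append((line_no, match.group()))
    return findings


# A removal commit still carries the secret on its "-" lines.
# The key below is the placeholder used in AWS documentation.
diff = """\
-AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
-AWS_REGION=ap-northeast-1
+# credentials moved to Secrets Manager
"""
findings = find_access_key_ids(diff)
print(findings)
```

The point of the example is the "-" line: deleting the file does not delete the diff that records its old contents.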

Prompt Injection

A single sentence — "ignore previous instructions" — was enough to extract the contents of the system prompt. Using the system prompt as a "hidden" information store is not a security boundary.

Mitigations:

Never put sensitive information in system prompts.
Validate both inputs and outputs. Make the boundary between user input and system instructions explicit.

Overly Broad IAM Permissions and AssumeRole

A principal allowed to call sts:AssumeRole on a role can obtain temporary credentials for it and act with that role's permissions. In this CTF, flags were embedded in IAM policy descriptions and EC2 tags for challenge purposes, but in the real world, metadata fields are an underappreciated place for sensitive data to accumulate.

Mitigations:

Apply the principle of least privilege rigorously.
When granting sts:AssumeRole, restrict the target resources.
Use Condition keys in trust policies to restrict callers.
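
The last two mitigations can be illustrated with a pair of policy documents, written here as Python dictionaries. This is a sketch only: the account ID, the `PortalAppRole` caller, and the external ID value are placeholders, and `sts:ExternalId` is just one of several condition keys that could be used.

```python
import json

ACCOUNT_ID = "111122223333"  # placeholder account ID
TARGET_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/DataAnalystRole"
CALLER_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/PortalAppRole"  # hypothetical caller

# Identity policy attached to the caller: it may assume exactly one role,
# rather than "Resource": "*".
caller_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": TARGET_ROLE_ARN,
    }],
}

# Trust policy attached to the target role: only the named caller is trusted,
# and only when the request satisfies the condition.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": CALLER_ROLE_ARN},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "example-external-id"}},
    }],
}

print(json.dumps(trust_policy, indent=2))
```

Both sides matter: the identity policy limits what the caller may ask for, and the trust policy limits who the role will accept.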

Poor Secret Management

Three storage locations appeared: Lambda environment variables, SSM Parameter Store, and Secrets Manager. Even Secrets Manager provides no protection if the IAM permissions granting GetSecretValue are too broad.

Mitigations:

Manage secrets in Secrets Manager.
Scope the Resource of any policy that grants secretsmanager:GetSecretValue to the specific secret ARN.

SSRF × IMDSv1

A URL preview feature in the dashboard was fetching external URLs server-side — and there was no filtering to block requests to http://169.254.169.254. IMDSv1 requires no token, so SSRF access to the link-local address yields EC2 role temporary credentials directly.

Mitigations:

Enforce IMDSv2 (HttpTokens: required) on all EC2 instances.
URL-fetching features should use an allowlist, and must block private IP ranges and link-local addresses.
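
A URL check along these lines might look like the sketch below, which uses only the Python standard library. The allowlist contents and function names are assumptions for illustration; a real deployment would also need to resolve hostnames and re-check the resolved IP, and to refuse or re-validate redirects.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"example.com", "docs.example.com"}  # hypothetical allowlist


def is_internal_address(host: str) -> bool:
    """True if host is an IP literal in a range a URL fetcher should never reach."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; hostnames are handled by the allowlist
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose host is on the allowlist and not internal."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"}:
        return False
    host = (parts.hostname or "").lower()
    if not host or is_internal_address(host):
        return False
    return host in ALLOWED_HOSTS


print(is_url_allowed("http://169.254.169.254/latest/meta-data/"))
print(is_url_allowed("https://example.com/page"))
```

The allowlist does most of the work here; the IP-range check is a second line of defence in case an internal address ever ends up on the list.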

Secrets Persisted in Docker Image Layers

A Dockerfile pattern like COPY secret.txt . → RUN python setup.py → RUN rm secret.txt produces a final image where secret.txt is not visible at runtime. However, downloading the image layers directly from the ECR API reveals secret.txt intact in a prior layer's tarball.

Mitigations:

Use multi-stage builds; never copy secrets into build contexts.
Retrieve secrets from Secrets Manager at runtime instead.
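
Why the file survives is easier to see with a small simulation. The sketch below builds three in-memory tar archives standing in for image layers and searches them the way I had to search the downloaded layer tarballs. The file names and contents are made up; the `.wh.` whiteout marker follows the OCI image layer convention for recording deletions.

```python
import io
import tarfile


def make_layer(files: dict[str, bytes]) -> bytes:
    """Build an in-memory tar archive standing in for one image layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def search_layers(layers: list[bytes], filename: str) -> list[tuple[int, bytes]]:
    """Return (layer_index, content) for every layer that still holds filename."""
    found = []
    for index, layer in enumerate(layers):
        # mode "r" also handles gzip-compressed layers transparently
        with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as tar:
            for member in tar.getmembers():
                if member.isfile() and member.name == filename:
                    found.append((index, tar.extractfile(member).read()))
    return found


layers = [
    make_layer({"app/secret.txt": b"s3cr3t"}),   # COPY secret.txt .
    make_layer({"app/build.log": b"ok"}),         # RUN python setup.py
    make_layer({"app/.wh.secret.txt": b""}),      # RUN rm secret.txt (whiteout marker)
]
recovered = search_layers(layers, "app/secret.txt")
print(recovered)
```

The `rm` step only adds a marker in a new layer that hides the file at runtime. The earlier layer, and the secret inside it, is shipped unchanged.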

Broken Multi-Tenant Permission Design

The S3 Vectors resource policy was set to Resource: "*", allowing a role scoped to one tenant to query another tenant's vector data. Tenant isolation in SaaS demands rigorous permission separation.

Mitigations:

Constrain the Resource and Condition in resource policies to tenant-specific identifiers.
If sharing a vector bucket, scope queries and metadata access by tenant at the API level.

Cognito Authorization Design Flaw

Passing custom:role=admin in --user-attributes during aws cognito-idp sign-up was enough to self-declare administrator status, which the application then trusted for authorization decisions.

Mitigations:

Control role assignment server-side (e.g., in a Pre Sign-up Lambda Trigger).
Never use attributes that external parties can set as the basis for authorization.
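
As one possible shape for the server-side control, the sketch below is a Pre Sign-up trigger handler that rejects any sign-up request carrying a protected attribute. It assumes the documented trigger event layout (`event["request"]["userAttributes"]`); the attribute set and error message are illustrative, and restricting which attributes the app client may write is a complementary control worth configuring as well.

```python
# Attributes the application uses for authorization; clients must never set them.
PROTECTED_ATTRIBUTES = {"custom:role"}


def lambda_handler(event: dict, context: object) -> dict:
    """Pre sign-up trigger: reject sign-ups that try to set a protected attribute."""
    attributes = event.get("request", {}).get("userAttributes", {})
    supplied = PROTECTED_ATTRIBUTES.intersection(attributes)
    if supplied:
        # Raising an exception makes Cognito fail the sign-up request
        raise ValueError(f"Attributes not allowed at sign-up: {sorted(supplied)}")
    return event  # Cognito expects the event object back


# Minimal example events (most fields omitted for brevity)
ok_event = {"request": {"userAttributes": {"email": "user@example.com"}}}
bad_event = {"request": {"userAttributes": {"email": "user@example.com",
                                            "custom:role": "admin"}}}
print(lambda_handler(ok_event, None) is ok_event)
```

Roles would then be assigned by a trusted backend path, for example an administrator action, rather than by anything the signing-up user controls.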

AI Agent Over-Privilege

The Bedrock agent's IAM role had access to S3 buckets that the DataAnalystRole itself could not read. By asking the agent a natural-language question, data from otherwise-inaccessible buckets was pulled out indirectly.

Mitigations:

Apply the principle of least privilege to AI agent roles as well.
Explicitly enumerate and restrict the resources an agent is permitted to access.

CloudTrail for Evidence Preservation

With CloudTrail logs in place, "what happened, when, and by whom" can be reconstructed almost completely. A handful of jq filters were enough to trace the attacker's full activity.

Mitigations:

Enable CloudTrail in all regions and ship logs to S3.
Pair with GuardDuty for real-time detection.
Apply Object Lock to the log bucket to prevent tampering.

Closing Thoughts

All 27 challenges together gave me a visceral sense of how AWS misconfigurations cascade. Each individual problem represented a realistic "seen-in-the-wild" vulnerability — but what made this CTF special was that they were all woven into a single coherent story.

AWS certifications don't teach you why something is dangerous. Solving these challenges hands-on — making mistakes, working around missing tools, figuring out the low-level APIs when Docker wasn't available — built an intuition that studying documentation alone never could.

Highly recommend participating if a similar opportunity comes around. And if you're building on AWS, I hope this report serves as a useful checklist of things worth double-checking in your own environment.

I Built an ML Churn Predictor in Minutes: Here's How Kiro Made It Possible

Adeline Makokha [AWS Hero] — Mon, 18 May 2026 12:25:13 +0000

Customer churn is one of the most expensive problems in the telecom industry. Acquiring a new customer costs 5–10× more than retaining an existing one, yet most companies only discover a customer has churned after they've already left. The goal of this project is to flip that: give analysts a tool to identify at-risk customers before they churn, so retention teams can act proactively.

What would normally take days of planning, scaffolding, and wiring together took a fraction of the time, because I built it with Kiro, an AI-powered development environment that thinks in specs, not just code completions.

In this article I'll walk through building a complete churn prediction web application from scratch using Python, Flask, scikit-learn, and Plotly. By the end you'll have a working app that:

Accepts CSV uploads of customer data
Runs a Random Forest churn prediction model
Visualises results with three interactive charts
Lets you browse, filter, and sort at-risk customers
Exports results to CSV for downstream use

How Kiro Accelerated This Build

Before diving into the code, it's worth explaining why this came together so fast.

Most AI coding tools are reactive: you write code, they autocomplete. Kiro works differently. It starts with a spec-driven workflow where you describe what you want to build, and Kiro helps you think through requirements, design, and implementation tasks before a single line of code is written.

Here's exactly how this project unfolded:

1. Requirements in minutes, not hours

I described the project in plain English ("a telecom customer churn prediction website using Python"), and Kiro generated a full requirements document covering 7 requirement areas with precise, testable acceptance criteria in EARS format. Things like:

"IF a Dataset contains up to 10,000 Customer Records, THEN THE Predictor SHALL complete prediction within 30 seconds."

No ambiguity. No back-and-forth. Edge cases I hadn't even thought about, like what happens when tenure = 0, or when a CSV is valid but contains zero data rows, were already covered.

2. Technical design with 15 correctness properties

From the requirements, Kiro produced a full technical design document: component interfaces with Python signatures, data models, an architecture diagram, a Flask route table, and 15 formal correctness properties to be verified with property-based tests. For example:

"For any array of churn scores and any threshold in [0.0, 1.0], compute_churn_rate SHALL return exactly round((count of scores >= threshold / total count) * 100, 2)."

This is the kind of rigour that usually only happens on large teams with dedicated QA. Kiro baked it in from the start.

3. Implementation tasks, automatically sequenced

Kiro then broke the design into a dependency-ordered task list: 13 top-level tasks across 8 parallel waves, from project scaffolding through to integration tests. Each task referenced specific requirements for traceability.

4. Code generation that actually matches the spec

With the spec in place, Kiro generated all the Python modules, Flask routes, Jinja2 templates, JavaScript, and sample data, and the code matched the design document precisely. No hallucinated APIs, no mismatched interfaces.

The result is a production-quality app with 146 passing tests (unit, integration, and property-based) generated from a single plain English description.

The takeaway: Kiro doesn't just write code faster. It helps you build the right thing by front-loading the thinking. The spec becomes the source of truth, and the code follows from it.

What We're Building

Here's the full feature set at a glance:

Feature	Details
CSV Upload	Validates format, size (≤50 MB), required columns, and row-level data quality
Churn Prediction	Random Forest model, configurable threshold (default 0.5)
Dashboard	Summary stats + 3 Plotly charts
At-Risk Table	Paginated (25/page), sortable, filterable
Export	Download results as CSV
Model Info	Displays model name, version, and training date

Architecture Overview

The app follows a clean separation of concerns. Each responsibility lives in its own module.

Request flows:

Upload → Validator checks the file → valid rows stored in AppState
Predict → Predictor scores every row → Visualizer builds chart specs → stored in AppState
Export → Exporter serialises results → streamed as file download
Startup → ModelLoader loads model.joblib once; if it fails, prediction is disabled

Project Structure

telecom-churn-app/
├── app.py               # Flask routes and AppState
├── validator.py         # CSV upload validation
├── predictor.py         # Churn scoring
├── visualizer.py        # Plotly chart builders
├── exporter.py          # CSV export
├── model_loader.py      # joblib model loading
├── table_helpers.py     # Pagination, sort, filter
├── generate_model.py    # One-time model training script
├── requirements.txt
├── data/
│   ├── sample_customers.csv   # 200 rows
│   └── sample_small.csv       # 20 rows for quick testing
├── templates/
│   ├── base.html
│   └── dashboard.html
└── static/
    └── app.js

Getting Started

Prerequisites

Python 3.11+
pip

Install dependencies

pip install -r requirements.txt

requirements.txt:

Flask==3.0.3
pandas==2.2.2
numpy==1.26.4
scikit-learn==1.5.0
joblib==1.4.2
plotly==5.22.0
pytest==8.2.2
hypothesis==6.103.1

Generate the model

python generate_model.py
# → Model saved to model.joblib

Run the app

python app.py
# → Running on http://localhost:5000

The Data

The app expects a CSV with these five columns:

Column	Type	Rules
`customer_id`	string	Required
`tenure`	numeric	0 – 999 (months)
`monthly_charges`	numeric	> 0
`total_charges`	numeric	> 0
`contract_type`	string	"Month-to-month", "One year", "Two year"

Sample rows from data/sample_small.csv:

customer_id,tenure,monthly_charges,total_charges,contract_type
CUST0001,69,113.04,7560.84,One year
CUST0004,41,59.11,2325.40,Month-to-month
CUST0007,68,113.62,7797.64,Two year

Step 1: Training the Model (`generate_model.py`)

We generate 1,000 rows of synthetic training data where churn probability is driven by three realistic signals: short tenure, high monthly charges, and month-to-month contracts.

def generate_training_data(n: int = 1000, seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)

    tenure = rng.integers(0, 73, size=n)
    monthly_charges = rng.uniform(20.0, 120.0, size=n).round(2)
    total_charges = (tenure * monthly_charges * rng.uniform(0.95, 1.05, size=n)).round(2)

    contract_type = rng.choice(
        ["Month-to-month", "One year", "Two year"],
        size=n,
        p=[0.5, 0.3, 0.2]
    )

    # Churn probability: higher for short tenure, month-to-month, high charges
    churn_prob = (
        0.4 * (1 - tenure / 72)
        + 0.3 * (monthly_charges - 20) / 100
        + 0.3 * (contract_type == "Month-to-month").astype(float)
    )
    churn_prob = np.clip(churn_prob, 0.05, 0.95)
    churn = rng.binomial(1, churn_prob, size=n)
    ...
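
To see how the three signals combine, the same formula can be applied to single customers. This is a scalar restatement of the expression above for illustration only; the function name and the three example customers are assumptions.

```python
def churn_probability(tenure: int, monthly_charges: float, contract_type: str) -> float:
    """Scalar version of the synthetic churn formula, clipped to [0.05, 0.95]."""
    raw = (
        0.4 * (1 - tenure / 72)                    # shorter tenure -> higher risk
        + 0.3 * (monthly_charges - 20) / 100       # higher charges -> higher risk
        + 0.3 * (1.0 if contract_type == "Month-to-month" else 0.0)
    )
    return min(max(raw, 0.05), 0.95)


new_customer = churn_probability(2, 110.0, "Month-to-month")   # raw ~0.959, clipped
mid_customer = churn_probability(36, 70.0, "One year")         # 0.20 + 0.15 + 0.00
loyal_customer = churn_probability(70, 30.0, "Two year")       # raw ~0.041, clipped

print(new_customer, round(mid_customer, 2), loyal_customer)
```

The clipping matters at both ends: without it, a brand-new month-to-month customer on the most expensive plan would churn with certainty, and a long-tenured one on a cheap two-year contract almost never would.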

After training a RandomForestClassifier, we attach metadata directly to the model object before saving with joblib:

model.metadata = {
    "name": "RandomForestChurnModel",
    "version": "1.0.0",
    "training_date": "2024-01-15",
}
joblib.dump(model, "model.joblib")

This keeps the model and its metadata in a single file, with no separate config needed.

Step 2: Loading the Model (`model_loader.py`)

The model is loaded once at startup. If the file is missing or corrupt, the app enters a degraded state where prediction is disabled but everything else still works.

@dataclass
class ModelMetadata:
    name: str
    version: str
    training_date: date  # displayed as ISO 8601 YYYY-MM-DD

@dataclass
class LoadedModel:
    model: object
    metadata: ModelMetadata

class ModelLoadError(Exception):
    pass

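def load_model(path: str) -> LoadedModel:
    try:
        model = joblib.load(path)
    except FileNotFoundError as e:
        raise ModelLoadError(f"Model file not found: {path}") from e
    except Exception as e:
        raise ModelLoadError(f"Failed to load model: {e}") from e

    # A model saved without metadata, or with a malformed date, should also
    # surface as ModelLoadError so the app can enter its degraded state.
    try:
        raw_meta = model.metadata
        metadata = ModelMetadata(
            name=raw_meta["name"],
            version=raw_meta["version"],
            training_date=date.fromisoformat(raw_meta["training_date"]),
        )
    except (AttributeError, KeyError, TypeError, ValueError) as e:
        raise ModelLoadError(f"Invalid model metadata: {e}") from e

    return LoadedModel(model=model, metadata=metadata)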

In app.py, this runs before the first request is served:

app_state = AppState()

def _load_model_on_startup():
    try:
        app_state.loaded_model = load_model(MODEL_PATH)
    except ModelLoadError as e:
        app_state.model_load_error = str(e)

_load_model_on_startup()

Step 3: Validating Uploads (`validator.py`)

The validator runs a multi-stage pipeline. Each stage can fail fast with a clear error message.

The row-level rules are:

monthly_charges and total_charges must be numeric and > 0
tenure must be numeric, ≥ 0, and ≤ 999 (tenure = 0 is valid)

def _validate_rows(df: pd.DataFrame) -> tuple[pd.DataFrame, int]:
    work_df = df.copy()
    work_df["_tenure_num"]  = pd.to_numeric(work_df["tenure"],          errors="coerce")
    work_df["_monthly_num"] = pd.to_numeric(work_df["monthly_charges"], errors="coerce")
    work_df["_total_num"]   = pd.to_numeric(work_df["total_charges"],   errors="coerce")

    valid_mask = (
        work_df["_tenure_num"].notna()
        & (work_df["_tenure_num"] >= 0)
        & (work_df["_tenure_num"] <= 999)
        & work_df["_monthly_num"].notna()
        & (work_df["_monthly_num"] > 0)
        & work_df["_total_num"].notna()
        & (work_df["_total_num"] > 0)
    )

    valid_df = df[valid_mask].copy()
    invalid_count = int((~valid_mask).sum())
    return valid_df, invalid_count

If some rows are invalid but at least one is valid, the app warns the user and proceeds with the clean rows. If all rows are invalid, prediction is blocked.

The ValidationResult dataclass carries everything the route handler needs:

@dataclass
class ValidationResult:
    success: bool
    error_message: str | None = None
    warning_message: str | None = None
    dataframe: pd.DataFrame | None = None
    total_rows: int = 0
    valid_rows: int = 0
    invalid_rows: int = 0

Step 4: Running Predictions (`predictor.py`)

The predictor one-hot encodes contract_type to match the training feature set, then calls predict_proba to get churn probabilities:

def predict(df: pd.DataFrame, model, threshold: float = 0.5) -> PredictionResult:
    try:
        feature_df = df[["tenure", "monthly_charges", "total_charges", "contract_type"]].copy()
        feature_df = pd.get_dummies(feature_df, columns=["contract_type"])

        # Ensure all contract type columns exist even if not in this batch
        for col in ["contract_type_Month-to-month", "contract_type_One year", "contract_type_Two year"]:
            if col not in feature_df.columns:
                feature_df[col] = 0

        feature_cols = ["tenure", "monthly_charges", "total_charges",
                        "contract_type_Month-to-month", "contract_type_One year", "contract_type_Two year"]
        X = feature_df[feature_cols].astype(float)

        probas = model.predict_proba(X)
        scores = probas[:, 1]  # probability of churn

        return PredictionResult(success=True, scores=scores,
                                customer_ids=df["customer_id"].tolist(),
                                threshold=threshold)
    except Exception as e:
        return PredictionResult(success=False, error_message=str(e), threshold=threshold)

The churn rate formula is explicit and deterministic:

def compute_churn_rate(scores: np.ndarray, threshold: float) -> float:
    at_risk_count = int(np.sum(scores >= threshold))
    return round((at_risk_count / len(scores)) * 100, 2)
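
A small worked example shows two details of this formula: the comparison is inclusive, so a score exactly at the threshold counts as at-risk, and an empty input needs a decision. The sketch below is a pure-Python restatement; returning 0.0 for an empty list is my assumption, not something the original function does.

```python
def churn_rate(scores: list[float], threshold: float) -> float:
    """Percentage of scores at or above the threshold, rounded to 2 decimals."""
    if not scores:
        return 0.0  # assumption: report 0% instead of dividing by zero
    at_risk = sum(1 for score in scores if score >= threshold)
    return round(at_risk / len(scores) * 100, 2)


scores = [0.2, 0.5, 0.5, 0.9, 0.1, 0.7]
print(churn_rate(scores, 0.5))    # four of six scores are >= 0.5
print(churn_rate(scores, 0.51))   # the two 0.5 scores drop out
```

Moving the threshold from 0.5 to 0.51 halves the reported rate here, which is why the boundary behaviour is worth pinning down in a test.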

Step 5: Visualising Results (`visualizer.py`)

Three Plotly charts are built server-side and serialised to JSON, then rendered client-side with Plotly.newPlot. This keeps the server stateless with respect to chart rendering.

Chart 1: At-Risk vs Non-At-Risk bar chart

At-Risk  ████████████████  87
Non-Risk ████████████████████████████████  113

Chart 2: Churn Score Distribution (histogram)

Exactly 10 bins of width 0.1 spanning [0.0, 1.0]:

def build_score_histogram(scores: np.ndarray) -> dict:
    bin_edges = np.linspace(0.0, 1.0, 11)  # 11 edges = 10 bins
    counts, _ = np.histogram(scores, bins=bin_edges)
    bin_centers = [(bin_edges[i] + bin_edges[i+1]) / 2 for i in range(10)]

    fig = go.Figure(data=[go.Bar(x=bin_centers, y=counts.tolist(), width=0.09)])
    fig.update_layout(title="Churn Score Distribution", ...)
    return fig.to_dict()

Chart 3: Churn Rate by Contract Type

Month-to-month  ████████████████████████  62.4%
One year        ████████  21.3%
Two year        ████  10.1%

Step 6: The Flask Application (`app.py`)

All mutable state lives in a single AppState dataclass — a simple singleton for single-user deployments:

@dataclass
class AppState:
    dataset: pd.DataFrame | None = None
    prediction_result: PredictionResult | None = None
    threshold: float = 0.5
    loaded_model: LoadedModel | None = None
    model_load_error: str | None = None
    chart_specs: dict | None = None

The six routes map cleanly to user actions:

Route	Method	Action
`GET /`	GET	Redirect to dashboard
`GET /dashboard`	GET	Render main page
`POST /upload`	POST	Validate and store CSV
`POST /predict`	POST	Run prediction
`POST /threshold`	POST	Update churn threshold
`GET /export`	GET	Stream CSV download

The upload route shows the validation pipeline in action:

@app.route("/upload", methods=["POST"])
def upload():
    uploaded_file = request.files["file"]
    file_bytes = uploaded_file.read()
    result = validate_upload(file_bytes, uploaded_file.filename, len(file_bytes))

    if not result.success:
        flash(result.error_message, "error")
        return redirect(url_for("dashboard"))

    # Clear previous results when new data is uploaded
    app_state.dataset = result.dataframe
    app_state.prediction_result = None
    app_state.chart_specs = None

    if result.warning_message:
        flash(result.warning_message, "warning")

    flash(f"File uploaded successfully. {result.valid_rows} customer record(s) loaded.", "success")
    return redirect(url_for("dashboard"))

The threshold route validates the range before accepting the new value:

@app.route("/threshold", methods=["POST"])
def update_threshold():
    raw_value = request.form.get("threshold", "")
    try:
        threshold_val = float(raw_value)
    except ValueError:
        # Empty or non-numeric input would otherwise raise and return a 500
        flash(f"Threshold {raw_value!r} is not a number. Valid range is [0.0, 1.0].", "error")
        return redirect(url_for("dashboard"))

    if not validate_threshold(threshold_val):
        flash(f"Threshold {threshold_val} is out of range. Valid range is [0.0, 1.0].", "error")
        return redirect(url_for("dashboard"))

    app_state.threshold = threshold_val
    # Rebuild charts immediately if results exist
    if app_state.prediction_result and app_state.prediction_result.success:
        app_state.prediction_result.threshold = threshold_val
        app_state.chart_specs = _build_chart_specs(app_state.dataset, app_state.prediction_result)

    flash(f"Threshold updated to {threshold_val}.", "success")
    return redirect(url_for("dashboard"))

Step 7: The Dashboard UI

The dashboard uses a two-column Bootstrap 5 layout: a narrow left sidebar for controls, and a wide right panel for results.

Chart data is injected into the page as JSON and rendered by Plotly client-side:

<!-- In dashboard.html -->
{% if has_results %}
<script id="chart-data" type="application/json">{{ chart_data_json | safe }}</script>
<script id="table-data" type="application/json">{{ at_risk_table_json | safe }}</script>
{% endif %}

// In app.js
function renderCharts() {
  const chartData = JSON.parse(document.getElementById('chart-data').textContent);
  const config = { responsive: true, displayModeBar: false };

  Plotly.newPlot('chart-at-risk',   chartData.at_risk_bar.data,    chartData.at_risk_bar.layout,    config);
  Plotly.newPlot('chart-histogram', chartData.score_histogram.data, chartData.score_histogram.layout, config);
  Plotly.newPlot('chart-contract',  chartData.contract_type.data,   chartData.contract_type.layout,   config);
}

Step 8: At-Risk Table: Pagination, Sort, Filter

The table helpers are pure Python functions, independently testable and reusable:

# table_helpers.py

def paginate(records: list, page: int, page_size: int = 25) -> list:
    start = (page - 1) * page_size
    return records[start : start + page_size]

def sort_records(records: list, column: str, direction: str) -> list:
    reverse = direction.lower() == "desc"
    return sorted(records, key=lambda r: (r.get(column) is None, r.get(column)), reverse=reverse)

def filter_records(records: list, search_term: str) -> list:
    if not search_term:
        return records
    term = search_term.lower()
    return [r for r in records if term in str(r.get("customer_id", "")).lower()]
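
One gap in helpers like these is an out-of-range page number: with the slicing above, page 0 or a page past the end yields an empty or unexpected slice. A hedged sketch of two companion helpers is shown below; `page_count` and `clamp_page` are hypothetical names and are not part of the project.

```python
import math


def page_count(total_records: int, page_size: int = 25) -> int:
    """Number of pages needed; an empty table still has one (empty) page."""
    return max(1, math.ceil(total_records / page_size))


def clamp_page(page: int, total_records: int, page_size: int = 25) -> int:
    """Force a requested page number into the valid range [1, page_count]."""
    return min(max(page, 1), page_count(total_records, page_size))


# With 87 at-risk customers and 25 per page there are 4 pages
print(page_count(87), clamp_page(0, 87), clamp_page(9, 87))
```

Clamping before calling paginate keeps a stale or hand-edited page parameter from producing a blank table.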

The client-side JavaScript mirrors this logic for instant interactivity without round-trips:

// Sort on column header click
document.querySelectorAll('#atRiskTable thead th[data-col]').forEach(th => {
  th.addEventListener('click', function () {
    const col = this.getAttribute('data-col');
    sortDirection = (sortColumn === col && sortDirection === 'asc') ? 'desc' : 'asc';
    sortColumn = col;
    currentPage = 1;
    renderTable();
  });
});

// Filter on search input
document.getElementById('tableSearch').addEventListener('input', function () {
  const term = this.value.toLowerCase();
  filteredRecords = allRecords.filter(r =>
    String(r.customer_id || '').toLowerCase().includes(term)
  );
  currentPage = 1;
  renderTable();
});

Step 9: Exporting Results (`exporter.py`)

The export produces a clean CSV with a fixed column order. One detail worth noting: pandas serialises Python booleans as True/False (capitalised) by default, but the spec requires lowercase true/false. We handle this explicitly:

EXPORT_COLUMNS = ["customer_id", "churn_score", "is_at_risk",
                  "contract_type", "tenure", "monthly_charges"]

def build_export_dataframe(df, scores, at_risk_flags) -> pd.DataFrame:
    return pd.DataFrame({
        "customer_id":     df["customer_id"].values,
        "churn_score":     np.round(scores, 4),       # 4 decimal places
        "is_at_risk":      at_risk_flags.astype(bool),
        "contract_type":   df["contract_type"].values,
        "tenure":          df["tenure"].values,
        "monthly_charges": df["monthly_charges"].values,
    })[EXPORT_COLUMNS]

def to_csv_bytes(export_df: pd.DataFrame) -> bytes:
    out_df = export_df.copy()
    out_df["is_at_risk"] = out_df["is_at_risk"].map({True: "true", False: "false"})
    return out_df.to_csv(index=False).encode("utf-8")

Sample export output:

customer_id,churn_score,is_at_risk,contract_type,tenure,monthly_charges
CUST0004,0.8231,true,Month-to-month,41,59.11
CUST0005,0.7654,true,Month-to-month,12,47.03
CUST0009,0.1203,false,One year,70,97.46

Testing Strategy

The project uses two complementary testing approaches.

Example-based tests (pytest)

These cover specific scenarios and exact error messages:

# tests/unit/test_validator.py
def test_rejects_non_csv_file():
    result = validate_upload(b"some data", "data.xlsx", 100)
    assert result.success is False
    assert "xlsx" in result.error_message

def test_tenure_zero_is_valid():
    csv = b"customer_id,tenure,monthly_charges,total_charges,contract_type\n"
    csv += b"C001,0,50.0,0.01,Month-to-month\n"
    result = validate_upload(csv, "test.csv", len(csv))
    assert result.success is True
    assert result.valid_rows == 1

Property-based tests (Hypothesis)

These verify universal correctness properties across thousands of generated inputs:

# tests/property/test_predictor_properties.py
import numpy as np
from hypothesis import given, settings
import hypothesis.strategies as st

@given(
    scores=st.lists(st.floats(0.0, 1.0), min_size=1).map(np.array),
    threshold=st.floats(0.0, 1.0)
)
@settings(max_examples=200)
def test_classify_at_risk_consistency(scores, threshold):
    """
    Property 2: For any scores and threshold, classify_at_risk returns
    True iff score >= threshold — including identical scores.
    """
    flags = classify_at_risk(scores, threshold)
    for score, flag in zip(scores, flags):
        assert flag == (score >= threshold)

@given(st.integers(min_value=0))
def test_file_size_boundary(size):
    """
    Property 9: _check_file_size returns True iff size <= 52,428,800.
    """
    result = _check_file_size(size)
    assert result == (size <= 52_428_800)

Key Design Decisions

Why a singleton AppState instead of Flask sessions?
Flask's default sessions live in a signed cookie, which browsers cap at roughly 4 KB, so they can't hold DataFrames. For a single-user analytics tool, a module-level singleton is simpler and more practical than a database or Redis cache.

Why Plotly JSON instead of server-rendered images?
Plotly charts are interactive: users can hover, zoom, and pan. Serialising chart specs as JSON and rendering client-side means the server doesn't need a headless browser or image generation library.

Why separate table_helpers.py?
Keeping pagination, sort, and filter as pure functions makes them trivially testable without spinning up a Flask test client. The JavaScript mirrors the same logic for instant client-side interactivity.

Why one-hot encode at prediction time?
The uploaded CSV may not contain all three contract types. Encoding at prediction time and filling missing columns with 0 ensures the feature vector always matches what the model was trained on.
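
The fixed-width idea can be shown without pandas. The sketch below encodes each contract type against a fixed list of known categories, so a batch that happens to contain only two of them still produces three columns. The helper name is hypothetical, and it illustrates the principle rather than the get_dummies-based code used in predictor.py.

```python
CONTRACT_TYPES = ["Month-to-month", "One year", "Two year"]  # training-time categories


def encode_contract_type(value: str) -> list[float]:
    """One-hot encode against the fixed category list, in training order."""
    return [1.0 if value == known else 0.0 for known in CONTRACT_TYPES]


# This batch contains no "Month-to-month" customers at all
batch = ["One year", "One year", "Two year"]
encoded = [encode_contract_type(value) for value in batch]
print(encoded)
```

Every row has three entries, with the first always 0.0, which is the same effect the column-filling loop achieves in the pandas version.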

Running the Full App

# 1. Install dependencies
pip install -r requirements.txt

# 2. Train and save the model
python generate_model.py

# 3. Start the server
python app.py

Open http://localhost:5000, upload data/sample_customers.csv, and click Predict Churn.

You should see:

Summary stats (total customers, at-risk count, churn rate %)
Three interactive Plotly charts
A paginated, sortable, filterable table of at-risk customers
An export button to download the full results as CSV

What's Next

A few natural extensions from here:

User authentication - add Flask-Login for multi-user support with per-user state
Model retraining - add an admin route to upload new training data and retrain in-place
Scheduled batch jobs - use Celery + Redis to run predictions on a schedule and email results
Database persistence - swap the in-memory AppState for SQLAlchemy + PostgreSQL to persist results across restarts
SHAP explanations - add feature importance explanations per customer using the shap library

Source Code

The full source is available on GitHub: Agentic AI Kiro

The project includes:

All Python modules with docstrings
Sample CSV data (200 rows)
generate_model.py to reproduce the model
Unit, integration, and property-based tests

Try Kiro Yourself

If you want to build something like this, or anything else, Kiro is worth trying. The spec-driven workflow changes how you approach a project. Instead of diving straight into code and figuring out the design as you go, you start with a clear picture of what you're building and why. The code becomes the easy part.

The entire requirements document, technical design, task list, and implementation for this project came from a single prompt. That's the difference.

Built with Flask, pandas, scikit-learn, and Plotly. Spec-driven development powered by Kiro. Tested with pytest and Hypothesis.

Stop Using Lambda for ML at This Scale (Benchmark + Cost Analysis)

Matia Rašetina — Mon, 18 May 2026 07:00:00 +0000

As a CTO, your job during a Proof of Concept (POC) is deceptively simple: don’t over-engineer, and don’t overspend.

You don’t need the perfect ML infrastructure—you need the cheapest architecture that works well enough.

Here’s the pipeline we built for our ML POC:

Audio file → S3 → Compute → Prediction → DynamoDB

The real question isn’t how to run inference—it’s:

At what point does Lambda stop being the smartest choice?

In this post, we compare three serverless ways of running inference with an already trained machine learning model:

AWS Lambda with standard configuration
AWS Lambda with SnapStart enabled
AWS Lambda used as a proxy in front of an Amazon SageMaker endpoint

To access the full project code, you can click the link here.

Experiment setup

The architecture across the board is very similar — all compute resources (Lambdas and SageMaker instance) have the same 4GB RAM configuration.

There is a subtle difference in vCPU allocation. Lambda allocates CPU in proportion to memory: at 1,769 MB a function gets the equivalent of one vCPU, so our 4 GB (4,096 MB) Lambdas get roughly 2.31 vCPUs (based on AWS docs here), while our SageMaker instance (ml.t2.medium) has 2 vCPUs.
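The vCPU figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes CPU scales linearly with configured memory and that 1,769 MB corresponds to one vCPU; `lambda_vcpus` is a name chosen here for illustration.

```python
# Approximate vCPU share of a Lambda function, assuming CPU scales
# linearly with memory and 1,769 MB corresponds to one vCPU.
MB_PER_VCPU = 1769


def lambda_vcpus(memory_mb: int) -> float:
    """Return the approximate vCPU equivalent for a given memory size."""
    return memory_mb / MB_PER_VCPU


# The two Lambda sizes used in this experiment.
for name, memory_mb in [("predictor (4 GB)", 4096), ("SageMaker proxy (128 MB)", 128)]:
    print(f"{name}: ~{lambda_vcpus(memory_mb):.3f} vCPU")
```

The 4 GB functions land slightly above the 2 vCPUs of the ml.t2.medium instance, while the 128 MB proxy gets only a small fraction of a vCPU, which is acceptable because it only forwards requests.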

In addition, the SageMaker stack has a proxy Lambda with 128MB of RAM, which reads the uploaded file's details from S3, forwards them to SageMaker, and saves the results to DynamoDB.

None of the stacks use GPU instances, which keeps the playing field as level as possible.

Here are some other experiment choices made to keep the benchmark fair:

Same CPU architecture everywhere (x86_64): Lambda functions use the x86_64 architecture, dependencies are bundled with the SAM x86_64 Python 3.12 image, and the SageMaker container image is built for linux/amd64 so ONNX and wheels behave the same across paths.
Same language runtime: All Lambda handlers run Python 3.12 with the same packaged lambda_src layout (only the handler and SnapStart wiring differ).
Same model; the container vs zip trade-off is intentional: One shared ONNX artifact from S3; the standard and SnapStart Lambdas load it inside the function, while SageMaker serves it from a dedicated container behind an endpoint.

To keep the benchmark fair, SageMaker serverless was intentionally excluded. The reason for this is to keep the costs of running the ML model as low as possible and to keep the performance fair across the board.

The architecture diagram for this benchmark can be seen in the following picture:

Here is an overview of all stacks in this experiment:

Option	Setup	Cost Model	Pros	Cons
Lambda (4GB)	Model runs directly inside Lambda, ~2.31 vCPU, 4GB of RAM	Pay-per-request	Scales to zero, no idle cost, fast per request	High memory cost, not ideal at very high traffic
Lambda with SnapStart Enabled	Model runs directly inside Lambda, ~2.31 vCPU, 4GB of RAM	Pay-per-request	Predictable performance, cost-efficient at scale, SnapStart helping in cold starts	High memory cost, not ideal at very high traffic, additional SnapStart cost if traffic is sporadic
SageMaker Endpoint	Model hosted on ml.t2.medium (2 vCPU with 4GB of RAM), invoked via 128MB Lambda	Fixed monthly	Predictable performance, cost-efficient at scale	Always-on, pays even when idle, slightly higher latency

Configuration in CDK code

Here are the code snippets of all compute resources used in this experiment. All Lambdas are created with the following method, to reduce code duplication.

def create_python_function(
    *,
    scope: Construct,
    function_name: str,
    handler: str,
    environment: dict[str, str] | None = None,
    timeout: Duration = Duration.seconds(60),
    memory_size: int = 4096,
    architecture: _lambda.Architecture = _lambda.Architecture.X86_64,
    snapstart: _lambda.SnapStartConf | None = None,
) -> _lambda.Function:
    runtime_environment: dict[str, str] = {**_DEFAULT_RUNTIME_ENV}
    if environment:
        runtime_environment.update(environment)

    return _lambda.Function(
        scope,
        function_name,
        function_name=function_name,
        runtime=_lambda.Runtime.PYTHON_3_12,
        architecture=architecture,
        handler=handler,
        code=bundled_lambda_code(),
        timeout=timeout,
        memory_size=memory_size,
        environment=runtime_environment,
        snap_start=snapstart,
    )

Standard Lambda:

# Initialize a Lambda with the standard configuration
standard_lambda = create_python_function(
    scope=self,
    function_name="standard-predictor",
    handler="standard_handler.handler",
    timeout=Duration.seconds(90),
    environment={
        "PREDICTIONS_TABLE": self.predictions_table.table_name,
        "PREDICTOR": "standard",
        "MODEL_S3_URI": f"s3://{self.model_asset.s3_bucket_name}/{self.model_asset.s3_object_key}",
    },
)

SnapStart Lambda:

# Initialize the SnapStart Lambda
snapstart_lambda = create_python_function(
    scope=self,
    function_name="snapstart-predictor",
    handler="snapstart_handler.handler",
    timeout=Duration.seconds(90),
    snapstart=_lambda.SnapStartConf.ON_PUBLISHED_VERSIONS, # Very important to configure this parameter!
    environment={
        "PREDICTIONS_TABLE": predictions_table.table_name,
        "PREDICTOR": "snapstart",
        "MODEL_S3_URI": f"s3://{model_asset.s3_bucket_name}/{model_asset.s3_object_key}",
    },
)

# Initializing an Alias, as SnapStart doesn't work without it
live_alias = _lambda.Alias(
    self,
    "SnapStartLiveAlias",
    alias_name="live",
    version=snapstart_lambda.current_version,
)
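The handlers themselves are not shown in this post. As a minimal sketch of why SnapStart can shorten cold starts, assume the model is loaded at module import time, so the loaded model is part of the snapshot taken when a version is published. `load_model` is a hypothetical stand-in for fetching the ONNX artifact and building an inference session; the real handler in the project may be organized differently.

```python
import time

LOAD_COUNT = 0  # counts how often the expensive load runs (for illustration only)


def load_model() -> dict:
    """Hypothetical stand-in for downloading the ONNX artifact and creating a session."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # pretend this is the slow part of a cold start
    return {"name": "audio-classifier"}


# Module-level initialization: runs once per execution environment.
# With SnapStart, this work happens before the snapshot is taken, so a
# restored environment starts with MODEL already in memory.
MODEL = load_model()


def handler(event: dict, context: object) -> dict:
    """Per-request work only; the already loaded model is reused."""
    return {"model": MODEL["name"], "key": event.get("key")}


# Two invocations in the same environment trigger only one model load.
first = handler({"key": "a.wav"}, None)
second = handler({"key": "b.wav"}, None)
print(first, second, "loads:", LOAD_COUNT)
```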

SageMaker endpoint + Lambda proxy:

# Configure the SageMaker endpoint
endpoint_config = SageMaker.CfnEndpointConfig(
    self,
    "AudioPredictorEndpointConfig",
    production_variants=[
        SageMaker.CfnEndpointConfig.ProductionVariantProperty(
            variant_name="AllTraffic",
            model_name=model.attr_model_name,
            initial_instance_count=1,
            instance_type="ml.t2.medium",
            initial_variant_weight=1.0,
        )
    ],
)

# Define the endpoint
endpoint = SageMaker.CfnEndpoint(
    self,
    "AudioPredictorEndpoint",
    endpoint_config_name=endpoint_config.attr_endpoint_config_name,
)

# Initialize the SageMaker Lambda proxy
SageMaker_trigger = create_python_function(
    scope=self,
    function_name="SageMaker-predictor",
    handler="SageMaker_trigger_handler.handler",
    timeout=Duration.seconds(90),
    memory_size=128,  # the proxy only forwards requests; 128MB as described in the experiment setup
    environment={
        "PREDICTIONS_TABLE": predictions_table.table_name,
        "PREDICTOR": "SageMaker",
        "MODEL_S3_URI": f"s3://{model_asset.s3_bucket_name}/{model_asset.s3_object_key}",
        "ENDPOINT_NAME": endpoint.attr_endpoint_name,
    },
)

Results

In the following image, you can see the execution duration of all the stacks (note: the SnapStart Lambda was run once beforehand to save the snapshot, then left idle for 10 minutes so that it would cold start again):

Method	Mean Latency	Median	Stability (Std)
Standard Lambda	280.72 ms	127.65 ms	443.29 ms
SnapStart	178.60 ms	124.69 ms	166.35 ms
SageMaker	339.18 ms	226.91 ms	350.76 ms

From the graph, we can see that the Lambdas execute faster than the SageMaker endpoint, with typical invocations staying under the 200ms mark. The circles represent the cold starts, and you can see that the SnapStart Lambda was at least 2x faster than the other resources, thanks to SnapStart. The SageMaker stack performed the worst, but not by a lot: most of its invocations landed just above the 200ms mark, and its cold start took almost 1.4 seconds.

Cost Breakdown

Lambda Cost (per request)

Formula: Cost = Duration (seconds) × Memory (GB) × $0.0000166667 per GB-second

Duration: ~200 ms
Memory: 4 GB

Cost per request: ~$0.0000133

Cost per 1M requests: ~$13.80
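The formula can be written out as a small function. This is a sketch of the compute-only part of the bill, using the duration, memory, and GB-second price quoted above; `lambda_compute_cost` is a name chosen here for illustration.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # GB-second price used in the formula above


def lambda_compute_cost(duration_ms: float, memory_gb: float) -> float:
    """Compute-only cost of one invocation (excludes any per-request charge)."""
    return (duration_ms / 1000) * memory_gb * PRICE_PER_GB_SECOND


per_request = lambda_compute_cost(200, 4)
per_million = per_request * 1_000_000
print(f"per request: ${per_request:.7f}, per 1M requests: ${per_million:.2f}")
```

With exactly 200 ms, the compute-only figure comes out slightly below the ~$13.80 per million quoted above. The post does not break down the gap; per-request fees and a measured duration slightly above 200 ms are plausible contributors.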

SageMaker Cost (fixed)

ml.t2.medium ≈ $40.88/month (the fixed monthly cost used in the table below)
Runs 24/7, even when idle

Takeaway:

Lambda has variable costs that scale with usage. SageMaker has fixed costs, making the tradeoff clear when requests grow.

The main question is:

When does SageMaker become the better option?

I’ve done the math.

SageMaker becomes a better option at ~72 requests per minute — take a look at the following graph:

At low traffic, Lambda is clearly cheaper: it scales to zero, while the SageMaker endpoint carries a fixed price whether or not it is used. As traffic grows, SageMaker handles each additional request more cheaply.

You can notice that the green line, representing the SageMaker endpoint, starts going up as well. That is expected, since every request still goes through a Lambda invocation; however, the cost stays manageable because the proxy Lambda mentioned earlier uses the lowest memory configuration.

Here is a broader view of the expected cost, based on the latest pricing and the traffic you can expect.

Traffic Volume	Standard Lambda (4GB)	SnapStart Lambda (4GB)	SageMaker (ml.t2.medium + 128MB Caller)
Price per 1M Req (Variable)	$13.80	$16.82	$0.81
Fixed Monthly Cost	$0.00	$0.00	$40.88
Total: 10 RPM (~438k req/mo)	$6.05	$7.37	$41.24
Total: 50 RPM (~2.1M req/mo)	$30.24	$36.87	$42.66
Total: 72 RPM (~3.1M req/mo)	$43.51	$53.05	$43.43 (Crossover point)
Total: 200 RPM (~8.7M req/mo)	$120.84	$147.32	$47.97
Total: 1000 RPM (~43.8M req/mo)	$604.22	$736.62	$76.38
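The crossover point can be reproduced from the first two rows of the table. The sketch below assumes a 730-hour month (43,800 minutes), which matches the "10 RPM ≈ 438k req/mo" row; the function names are chosen here for illustration.

```python
MINUTES_PER_MONTH = 730 * 60  # 43,800 minutes, consistent with the table's request counts


def monthly_cost(rpm: float, price_per_million: float, fixed: float = 0.0) -> float:
    """Fixed monthly cost plus variable cost at a steady requests-per-minute rate."""
    requests = rpm * MINUTES_PER_MONTH
    return fixed + requests / 1_000_000 * price_per_million


def break_even_rpm(fixed: float, lambda_per_million: float, sagemaker_per_million: float) -> float:
    """RPM at which SageMaker's fixed + variable cost equals Lambda's variable cost."""
    requests = fixed / (lambda_per_million - sagemaker_per_million) * 1_000_000
    return requests / MINUTES_PER_MONTH


# Inputs taken from the "Price per 1M Req" and "Fixed Monthly Cost" rows.
standard = break_even_rpm(40.88, 13.80, 0.81)
snapstart = break_even_rpm(40.88, 16.82, 0.81)
print(f"standard Lambda crossover: {standard:.1f} RPM")
print(f"SnapStart Lambda crossover: {snapstart:.1f} RPM")
print(f"SageMaker at 1000 RPM: ${monthly_cost(1000, 0.81, fixed=40.88):.2f}")
```

With these inputs the standard Lambda crossover comes out near 72 RPM, in line with the figure above. The SnapStart crossover is lower because its per-request price is higher; that number follows from the table's inputs and is not a result reported in the post.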

References:

SageMaker pricing - link
Lambda pricing - link

CTO Verdict: A Decision Framework

Think in thresholds, not services.

Use Standard Lambda when:

You’re in POC or early stage
Traffic is low or unpredictable
You want zero idle cost

Use Lambda with SnapStart when:

Traffic is low and sporadic
You are willing to pay for the SnapStart snapshot restoration
You also want zero idle cost

Use SageMaker when:

You consistently exceed the ~72 requests/minute crossover point
Traffic is steady
You want predictable cost

Final Rule:

Lambda is the default
SageMaker is the optimization

FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Mon, 18 May 2026 02:31:34 +0000

TL;DR

ONTAP FPolicy pushes file operation notifications over a persistent TCP connection. We run a lightweight Python server on ECS Fargate that receives these events, normalizes them, and forwards them to SQS → Lambda → Datadog. In my validation environment, create events reached Datadog in about 6 seconds. Rename/delete behavior depends on FPolicy mode, protocol, and ONTAP/FSx behavior, so this post documents both the working path and the limitations observed.

Why FPolicy Needs Fargate

In Part 3, we showed how EMS webhooks deliver ARP alerts via API Gateway → Lambda. That works because EMS uses standard HTTPS.

FPolicy is different. ONTAP's FPolicy subsystem uses a proprietary binary protocol over persistent TCP connections. ONTAP initiates the connection to the FPolicy server and maintains it with periodic KeepAlive messages. This means:

❌ Lambda — No persistent TCP connections, max 15-minute timeout
❌ API Gateway — HTTP/HTTPS only, no raw TCP
✅ ECS Fargate — Persistent TCP listener, private IP, auto-restart

Why I Did Not Use an NLB in This Validation

I tested an NLB-based approach, but it did not work reliably in my validation. The issue was not that NLB cannot forward binary TCP traffic; it can. The challenge was FPolicy's stateful session negotiation and ONTAP's expectation of configured FPolicy server IPs. Health checks and connection behavior introduced additional complexity. For this validation, the simplest reliable path was to let ONTAP connect directly to the Fargate task's private IP and automate external-engine IP updates on task restart.

The Fargate task runs a Python server that:

Listens on TCP:9898
Handles FPolicy protocol negotiation (version handshake)
Receives KeepAlive messages (connection health)
Parses file operation notifications
Forwards structured events to SQS

Architecture

SMB/NFS Client
    │ file create/write/rename/delete
    ▼
FSx for ONTAP (FPolicy enabled)
    │ proprietary TCP protocol
    ▼
ECS Fargate (TCP:9898)
    │ parse → normalize → forward
    ▼
SQS Queue
    │ event source mapping
    ▼
Lambda (fpolicy_handler)
    │ format → ship
    ▼
Datadog Logs API v2 (source:fsxn-fpolicy)

Key design decisions:

ONTAP connects TO Fargate — the Fargate task must be reachable on a private IP. Because that IP can change on task restart, the ONTAP external engine must be updated automatically or operationally.
SQS decouples the TCP server from the shipping logic — if Datadog is slow, events buffer in SQS
Lambda handles Datadog shipping — retry logic, batch formatting, API key management
No NLB — ONTAP connects directly to the Fargate task's private IP
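To make the "batch formatting" step concrete, here is a minimal sketch of how a shipping Lambda could turn an SQS-triggered event into one Datadog Logs API v2 request. It is an illustration, not the repository's handler: it assumes each SQS message body is a JSON object, nests the event under an `attributes` key so that queries such as `@attributes.operation_type` resolve, and only builds the request without sending it. The function name and the `service` value are assumptions.

```python
import json


def build_datadog_request(sqs_event: dict, site: str, api_key: str) -> dict:
    """Build one batched Logs API v2 request from an SQS-triggered Lambda event."""
    logs = []
    for record in sqs_event.get("Records", []):
        fpolicy_event = json.loads(record["body"])  # assumed: body is a JSON object
        logs.append({
            "ddsource": "fsxn-fpolicy",   # matches the source used in this post's queries
            "service": "fsxn-fpolicy",    # assumption: service name chosen for illustration
            "message": json.dumps(fpolicy_event),
            "attributes": fpolicy_event,  # exposes fields as @attributes.<name>
        })
    return {
        "url": f"https://http-intake.logs.{site}/api/v2/logs",
        "headers": {"Content-Type": "application/json", "DD-API-KEY": api_key},
        "body": json.dumps(logs),
    }


sample_event = {
    "Records": [
        {"body": json.dumps({"operation_type": "create", "file_path": "/vol1/a.txt"})}
    ]
}
request = build_datadog_request(sample_event, "ap1.datadoghq.com", "dummy-key")
print(request["url"])
```

Because SQS can deliver several records per invocation, building a single array per invocation keeps the number of HTTP calls to Datadog low.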

Deployment

Prerequisites

FSx for ONTAP file system with a CIFS-enabled SVM
VPC with private subnets (same as FSx for ONTAP)
ECR repository with the FPolicy server image
Private subnet egress for Fargate: either a NAT Gateway or VPC endpoints for ECR image pull, CloudWatch Logs, and SQS access

Step 1: Deploy the Fargate Stack

aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId=<your-vpc-id> \
    SubnetIds=<your-private-subnet> \
    FsxnSvmSecurityGroupId=<fsx-sg-id> \
    ContainerImage=<account>.dkr.ecr.<region>.amazonaws.com/fsxn-fpolicy-server:latest \
  --capabilities CAPABILITY_NAMED_IAM

This creates:

ECS Cluster + Fargate Service (1 task)
SQS Queue for FPolicy events
Security Group (inbound TCP:9898 from FSx SG)
CloudWatch Log Group

Step 2: Deploy the Datadog Shipping Lambda

The template accepts the SQS queue ARN as a parameter and automatically creates the event source mapping:

# Get the SQS queue ARN from Step 1 outputs
SQS_ARN=$(aws cloudformation describe-stacks \
  --stack-name fsxn-fpolicy-server \
  --query "Stacks[0].Outputs[?OutputKey=='FPolicyQueueArn'].OutputValue" \
  --output text)

aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn=<secret-arn> \
    DatadogSite=ap1.datadoghq.com \
    FPolicySqsQueueArn=${SQS_ARN} \
  --capabilities CAPABILITY_NAMED_IAM

This creates the Lambda function with an SQS event source mapping — no manual create-event-source-mapping needed.

Step 3: Get the Fargate Task IP

TASK_ARN=$(aws ecs list-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --service-name fsxn-fpolicy-server-service \
  --query "taskArns[0]" --output text)

aws ecs describe-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --tasks $TASK_ARN \
  --query "tasks[0].containers[0].networkInterfaces[0].privateIpv4Address" \
  --output text
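If you want the same lookup inside automation rather than the CLI, the response can be walked in Python. This sketch operates on a plain dictionary shaped like the `describe-tasks` output queried above; in a real script the dictionary would come from boto3's `ecs.describe_tasks(cluster=..., tasks=[...])`. The helper name is an assumption.

```python
from typing import Optional


def task_private_ip(describe_tasks_response: dict) -> Optional[str]:
    """Return the first private IPv4 address found on a task's containers, or None."""
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            for eni in container.get("networkInterfaces", []):
                ip = eni.get("privateIpv4Address")
                if ip:
                    return ip
    return None


# Trimmed-down example response with the same nesting as the --query path above.
sample_response = {
    "tasks": [
        {"containers": [{"networkInterfaces": [{"privateIpv4Address": "10.0.1.25"}]}]}
    ]
}
print(task_private_ip(sample_response))
```

Returning `None` instead of raising makes it easy to retry while the task is still provisioning and has no network interface yet.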

ONTAP FPolicy Configuration

CLI note: Some ONTAP versions show these commands under vserver fpolicy ..., while newer CLI contexts may allow shortened forms. Use the command form supported by your ONTAP version. The examples below use the form validated in my environment (FSx for ONTAP 9.17.1). See NetApp CLI reference for the full command syntax.

FPolicy requires three components: an External Engine (where to send events), an Event (what to monitor), and a Policy (linking them together).

Create the External Engine

vserver fpolicy policy external-engine create -vserver <svm-name> \
  -engine-name fpolicy_aws_engine \
  -primary-servers <fargate-task-ip> \
  -port 9898 \
  -extern-engine-type asynchronous \
  -ssl-option no-auth

Production note: For production deployments, evaluate server-auth or mutual-auth instead of no-auth, and validate certificate handling between ONTAP and the FPolicy server. See NetApp FPolicy external engine documentation.

Create the FPolicy Event

vserver fpolicy policy event create -vserver <svm-name> \
  -event-name cifs_file_events \
  -protocol cifs \
  -file-operations create,write,rename,delete

Tip: For write-heavy workloads, review the protocol-specific FPolicy filters supported by your ONTAP version and protocol. Where supported, use close/modify-oriented filters to reduce duplicate or noisy write events.

Create and Enable the Policy

vserver fpolicy policy create -vserver <svm-name> \
  -policy-name fpolicy_aws \
  -events cifs_file_events \
  -engine fpolicy_aws_engine \
  -is-mandatory false

vserver fpolicy enable -vserver <svm-name> \
  -policy-name fpolicy_aws \
  -sequence-number 1

This example uses an asynchronous, non-mandatory policy so client file operations are not blocked by FPolicy server processing or Datadog delivery. If the FPolicy server is unavailable, file operations continue unimpeded — but notifications may be buffered or lost depending on your ONTAP version and configuration.

Verify Connection

vserver fpolicy show-engine -vserver <svm-name> -engine-name fpolicy_aws_engine

You should see connected status. In the ECS logs, KeepAlive messages confirm the connection:

[INFO] fpolicy-server: [+] Connection from ('10.0.x.x', 44107)
[INFO] fpolicy-server: [Handshake] Policy=fpolicy_aws | Session=... | VsUUID=...
[INFO] fpolicy-server: [Send] NEGO_RESP | Version=1.2 | Policy=fpolicy_aws
[INFO] fpolicy-server: [KeepAlive] Received — connection healthy

E2E Validation Results

File operations on the SMB share produce events that flow through the entire pipeline:

Operation	ECS Log	SQS	Lambda	Datadog	Latency
create `blog_demo_create.txt`	✅	✅	✅ shipped:1	✅	~6 seconds
create `blog_demo_write.txt`	✅	✅	✅ shipped:1	✅	~6 seconds
create `confidential_report_2026.xlsx`	✅	✅	✅ shipped:1	✅	~6 seconds

ECS Fargate Logs — Connection Lifecycle

The FPolicy server logs show the complete lifecycle: server start → ONTAP connection → protocol handshake → KeepAlive → file events → SQS delivery.

Lambda CloudWatch Logs — Event Processing

Each SQS message triggers a Lambda invocation. Processing time is typically 30-50ms per event.

Datadog Log Explorer

Query: source:fsxn-fpolicy

Each event contains structured attributes:

operation_type: The file operation (create, write, rename, delete)
file_path: The file that was operated on
client_ip: The client that performed the operation
volume_name: The ONTAP volume
svm: The ONTAP SVM name (may show "unknown" if not resolved from handshake context)
timestamp: When the operation occurred

Correlating FPolicy with ARP

The real power emerges when you combine FPolicy file activity with ARP ransomware detection from Part 3:

source:(fsxn-fpolicy OR fsxn-ems) @attributes.svm:svm-prod-01

This correlation query shows:

ARP alert (from EMS): "Ransomware detected on volume X"
File operations (from FPolicy): Which user, from which IP, created/renamed which files

Together they answer the critical incident response questions: What happened, who did it, and from where?

Security Use Case: Detecting Suspicious File Creation Bursts

With FPolicy create events in Datadog, you can create a Monitor that fires when a single client creates more than 50 files in 5 minutes — a potential indicator of ransomware encryption or unauthorized bulk operations:

Datadog Monitor query:

logs("source:fsxn-fpolicy @attributes.operation_type:create").rollup("count").by("@attributes.client_ip").last("5m") > 50

Alert message:

🚨 Suspicious file creation burst detected on FSx for ONTAP

Client IP: {{@attributes.client_ip}}
Volume: {{@attributes.volume_name}}
Count: {{value}} file creations in 5 minutes

Investigate immediately — check if this is authorized batch processing or potential ransomware activity.
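The monitor's logic is a sliding-window count per client. For local testing of thresholds before creating the monitor, the same rule can be sketched in Python. This is an illustration of the counting logic only, not how Datadog evaluates monitors internally; `BurstDetector` is a name chosen here.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flags clients that exceed `threshold` create events within `window_sec`."""

    def __init__(self, threshold: int = 50, window_sec: int = 300):
        self.threshold = threshold
        self.window_sec = window_sec
        self._creates = defaultdict(deque)  # client_ip -> timestamps of recent creates

    def observe(self, client_ip: str, operation_type: str, ts: float) -> bool:
        """Record one event; return True if the client is now over the threshold."""
        if operation_type != "create":
            return False
        window = self._creates[client_ip]
        window.append(ts)
        # Drop creates that have fallen out of the sliding window.
        while window and window[0] <= ts - self.window_sec:
            window.popleft()
        return len(window) > self.threshold


detector = BurstDetector()
# 51 creates from one client within 51 seconds: only the last one trips the alert.
alerts = [detector.observe("10.0.1.50", "create", float(t)) for t in range(51)]
print(alerts.count(True), alerts[-1])
```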

Note on delete monitoring: If your FPolicy configuration and ONTAP version reliably deliver delete events (e.g., synchronous mode or a future ONTAP release), you can extend this pattern to bulk deletion detection. In my async-mode validation, delete notifications were not reliably delivered — I recommend using audit logs from Part 2 for delete-event completeness.

This is difficult to achieve with traditional audit log polling, which depends on rotation and scheduler intervals. FPolicy's event-driven delivery makes sub-minute detection possible for the operations it reliably captures.

Operational Considerations

Fargate Task IP Changes

When a Fargate task restarts (deployment, crash, scaling), it gets a new private IP. ONTAP's External Engine must be updated with the new IP. Options:

Manual update: vserver fpolicy policy external-engine modify -vserver <svm-name> -engine-name fpolicy_aws_engine -primary-servers <new-ip>
Automated: Lambda triggered by ECS task state change → ONTAP REST API update

The repository includes a helper script (shared/scripts/fpolicy-update-engine-ip.sh --auto) that detects the current task IP and updates the ONTAP engine. For full automation, wire an EventBridge rule on ECS task state changes to an update Lambda — this is not included in the base stack but is straightforward to add. Automated updates require network reachability to the ONTAP management endpoint and credentials (stored in Secrets Manager) with permission to modify the FPolicy external engine.
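The automated path could look roughly like the following. This is a sketch under stated assumptions, not code from the repository: it assumes the EventBridge ECS Task State Change event lists the ENI's address under `detail.attachments[].details[]` with the name `privateIPv4Address`, and it assumes the ONTAP REST API exposes the engine at `/api/protocols/fpolicy/{svm.uuid}/engines/{name}` with a `primary_servers` field. Verify both against your ECS events and ONTAP version. The sketch only builds the request description and does not call ONTAP.

```python
from typing import Optional


def private_ip_from_task_event(event: dict) -> Optional[str]:
    """Extract the ENI private IP from a task state change event if the task is RUNNING."""
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "RUNNING":
        return None
    for attachment in detail.get("attachments", []):
        for item in attachment.get("details", []):
            if item.get("name") == "privateIPv4Address":
                return item.get("value")
    return None


def engine_update_request(svm_uuid: str, engine_name: str, new_ip: str) -> dict:
    """Describe the ONTAP REST call to make (assumed endpoint; verify for your version)."""
    return {
        "method": "PATCH",
        "path": f"/api/protocols/fpolicy/{svm_uuid}/engines/{engine_name}",
        "json": {"primary_servers": [new_ip]},
    }


sample_event = {
    "detail": {
        "lastStatus": "RUNNING",
        "attachments": [
            {"details": [{"name": "privateIPv4Address", "value": "10.0.1.77"}]}
        ],
    }
}
ip = private_ip_from_task_event(sample_event)
print(engine_update_request("<svm-uuid>", "fpolicy_aws_engine", ip))
```

Filtering on the RUNNING status avoids pushing an address to ONTAP for a task that is still provisioning or already stopping.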

Restart Resilience — Validated

I tested the full restart cycle to confirm the pipeline recovers gracefully:

Step	Result	Time
Stop Fargate (scale to 0)	Task stopped	~30s
Restart Fargate (scale to 1)	New task, new IP	~45s
Update ONTAP Engine IP	Reconnection	~20s
File operation after restart	Event delivered to Datadog	~6s
Total recovery time		~2 minutes

The Lambda's retry logic also proved itself: on the first request after reconnection, a transient RemoteDisconnected error occurred. The exponential backoff retry succeeded on the second attempt — exactly the behavior we designed for.

[WARNING] HTTP error shipping to Datadog (attempt 1/3): RemoteDisconnected
[INFO]    Processing complete: {"statusCode": 200, "body": {"shipped": 1}}
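The retry behavior seen in these logs can be sketched as follows. This is a generic exponential backoff loop, not the repository's implementation; the function name, the three-attempt limit mirroring the "attempt 1/3" log line, and the base delay are assumptions. `http.client.RemoteDisconnected` is a subclass of `ConnectionError`, so catching `ConnectionError` covers it.

```python
import time


def ship_with_retry(send, payload, max_attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call send(payload); on ConnectionError wait base, 2*base, 4*base, ... and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up so the message can be retried or dead-lettered upstream
            sleep(base_delay * 2 ** (attempt - 1))


# Simulate one transient disconnect followed by a successful delivery.
calls = {"n": 0}
delays = []


def flaky_send(payload):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionResetError("remote end closed connection")
    return {"shipped": len(payload)}


result = ship_with_retry(flaky_send, [{"operation_type": "create"}], sleep=delays.append)
print(result, "delays:", delays)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.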

Cost Profile

Component	Monthly Cost (estimate)
Fargate (0.25 vCPU, 0.5 GB)	~$10
SQS (low volume)	< $1
Lambda (event-driven)	< $1
CloudWatch Logs	~$2
Total	~$14/month

Compare this to an always-on EC2-based collector, plus OS patching, agent management, and HA considerations. Exact EC2 costs vary by region and instance type.

This is an AWS-side estimate and excludes Datadog ingest/retention costs, NAT Gateway or VPC endpoint charges, ECR storage, and high-volume CloudWatch Logs.

Scaling

A single Fargate task is sufficient for the low-volume validation scenarios in this post. The architecture can scale by tuning Fargate CPU/memory, SQS buffering, and Lambda concurrency, but you should benchmark your own workload before assuming a specific events/sec capacity.

Monitoring

Key CloudWatch metrics to watch:

ECS/CPUUtilization — Fargate task health
SQS/ApproximateNumberOfMessagesVisible — Queue depth (should stay near 0)
Lambda/Errors — Shipping failures
Lambda/Duration — Processing time per batch

The FPolicy Server

The FPolicy server (shared/fpolicy-server/fpolicy_server.py) implements:

Protocol negotiation: Responds to ONTAP's version handshake
KeepAlive handling: Acknowledges connection health checks
Event parsing: Extracts file path, operation, user, client IP from binary frames
SQS forwarding: Sends normalized JSON events to the queue
Write coalescing: Configurable delay to batch rapid write events (default: 5 seconds)

The server runs in realtime mode — events are forwarded as they arrive, with optional write-complete delay to avoid duplicate notifications for multi-write operations.
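Write coalescing can be pictured as a per-path buffer with a quiet period. The sketch below is an illustration of that idea, not the code in `fpolicy_server.py`: it keeps only the latest write event per file path and releases it once no newer write has arrived for `delay_sec` seconds. The class name and the explicit `now` argument are assumptions made to keep the example deterministic.

```python
class WriteCoalescer:
    """Holds write events per path and releases one event after a quiet period."""

    def __init__(self, delay_sec: float = 5.0):
        self.delay_sec = delay_sec
        self._pending = {}  # file_path -> (last_seen_ts, latest_event)

    def add(self, event: dict, now: float) -> None:
        # A newer write to the same path replaces the older one and resets the timer.
        self._pending[event["file_path"]] = (now, event)

    def flush(self, now: float) -> list:
        """Return events whose path has been quiet for at least delay_sec."""
        ready = [path for path, (ts, _) in self._pending.items() if now - ts >= self.delay_sec]
        return [self._pending.pop(path)[1] for path in ready]


coalescer = WriteCoalescer(delay_sec=5.0)
coalescer.add({"file_path": "/vol1/a.bin", "seq": 1}, now=0.0)
coalescer.add({"file_path": "/vol1/b.bin", "seq": 1}, now=0.0)
coalescer.add({"file_path": "/vol1/a.bin", "seq": 2}, now=2.0)  # second write to a.bin

first_flush = coalescer.flush(now=5.0)   # b.bin has been quiet for 5s, a.bin only for 3s
second_flush = coalescer.flush(now=7.0)  # a.bin is released once, with the latest event
print(first_flush, second_flush)
```

The trade-off is the one described above: each coalesced write is delayed by up to the configured quiet period in exchange for fewer duplicate notifications.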

Limitations and Future Work

Rename/Delete Events Not Delivered in Async Mode

In my E2E testing, ONTAP did not deliver rename or delete notifications to the FPolicy server in asynchronous mode — even though these operations are configured in the FPolicy event definition. Only create events were reliably delivered. This appears to be a limitation of FSx for ONTAP's FPolicy implementation in async mode for certain operation types.

Workaround options:

Use synchronous mode (adds latency to file operations — not recommended for production)
Combine FPolicy (event-driven create) with audit log polling (catches rename/delete in EVTX)
Accept create-only monitoring for event-driven alerting, use audit logs for forensic completeness

NFS Protocol Support

Protocol	FPolicy Support	Notes
SMB/CIFS	✅ Verified	Primary validation protocol
NFSv3	✅ Supported	Requires explicit `vers=3` mount option
NFSv4.0	✅ Supported	Requires explicit `vers=4.0`
NFSv4.1	✅ Supported	Requires ONTAP 9.15.1+, explicit `vers=4.1`
NFSv4.2	❌ Not supported	ONTAP FPolicy does not monitor NFSv4.2 operations

For protocol support details, verify your ONTAP version. NetApp documents that FPolicy does not currently support NFSv4.2; supported NFS protocols include NFSv3, NFSv4.0, and NFSv4.1 (ONTAP 9.15.1+).

Critical gotcha: mount -o vers=4 on modern Linux negotiates to NFSv4.2, which ONTAP FPolicy does not support. Always use explicit version: mount -o vers=4.1 or vers=3.

NFS + FPolicy latency: NFSv3 lacks close semantics, so the FPolicy server cannot know when a write is complete. The server uses a configurable WRITE_COMPLETE_DELAY_SEC (default: 5s) to wait before forwarding the event. This adds latency but prevents premature processing of incomplete files.

NFS write hang (observed): In some configurations, NFS write operations may hang when FPolicy is enabled — even with is-mandatory=false. This is a known ONTAP behavior related to FPolicy notification processing. If you experience this, verify your ONTAP version and consider limiting FPolicy scope to specific volumes.

User Identity

In the current implementation, the user field may be empty for some operations depending on ONTAP's FPolicy notification content. The FPolicy binary frame includes user identity in extended attributes that require additional parsing logic. Future versions will extract this from the NOTI_REQ body.

Event Durability During Restarts

In my validation, events generated while the Fargate server was disconnected were not observed downstream in Datadog after reconnection. Treat FPolicy delivery during server outages as something you must validate in your own environment.

ONTAP documentation describes buffering behavior for asynchronous notifications — notifications generated during a network outage are stored on the storage node and can be fetched when the server comes back online. Beginning with ONTAP 9.14.1, FPolicy persistent store support is available for asynchronous non-mandatory policies. If you cannot tolerate event loss during FPolicy server restarts, evaluate persistent store and validate the behavior on your FSx for ONTAP version.

Try It Yourself

# Clone the repository
git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git

# Deploy prerequisites (if not already done)
aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId=<your-vpc> \
    SubnetIds=<your-subnet> \
    FsxnSvmSecurityGroupId=<fsx-sg> \
    ContainerImage=<your-ecr-image> \
  --capabilities CAPABILITY_NAMED_IAM

# Configure ONTAP FPolicy (see ONTAP section above)
# Create a file on the SMB share
# Check Datadog: source:fsxn-fpolicy

Where FPolicy Fits in ONTAP Telemetry

This series covers three ONTAP telemetry sources. Each serves a different purpose:

Use Case	Best Source	Latency	Coverage
Compliance audit trail	Audit logs (Part 2)	Minutes (scheduler interval)	Complete historical record
Ransomware detection	ARP via EMS (Part 3)	~30 seconds (webhook)	ML-based pattern detection
Event-driven file activity signal	FPolicy (this post)	~6 seconds (TCP)	Create events validated; other operations depend on mode/version
Forensic investigation	Audit logs + FPolicy correlation	Combined	Timeline reconstruction

FPolicy is not a replacement for audit logs. It provides an event-driven signal for detection and alerting. Audit logs provide the authoritative, complete historical record for compliance and forensics. Use them together.

Key Takeaways

Use Fargate for FPolicy TCP listener — Lambda cannot maintain persistent TCP connections. Fargate provides the long-running listener without OS management.
Use SQS to decouple ingestion from shipping — If Datadog is slow or Lambda is throttled, events buffer safely in SQS.
Validate operation coverage in your environment — Async mode reliably delivered create events in my testing. Rename/delete behavior varies by ONTAP version and mode.
Use audit logs for forensic completeness — FPolicy provides event-driven signal for detection; audit logs (Part 2) provide the complete historical record.
Treat FPolicy as event-driven alerting, not full audit replacement — The two are complementary, not interchangeable.

Production Considerations Beyond This Validation

This post validates the end-to-end path. For production deployments, the following topics warrant additional design work:

Topic	Key Questions
HA / Multi-AZ	ONTAP external engine supports `primary-servers` and `secondary-servers`. How to run multiple Fargate tasks across AZs?
Scope Design	Which volumes, operations, and protocols to monitor? How to avoid noisy workloads?
Security Hardening	TLS/mTLS for FPolicy, ECR image scanning, VPC Flow Logs, task role least-privilege
Cost Model	FPolicy generates events per file operation — Datadog ingest can become the dominant cost at scale
Operations Runbook	Task restart, engine disconnected, SQS backlog, Datadog missing events, NFS hang
Stable Endpoint	Auto-update Lambda for engine IP, or primary/secondary server design for zero-downtime restarts

These topics are documented in the repository:

Production Architecture Patterns — Single task, primary/secondary, auto-update, multi-AZ patterns with failure mode matrix
Operational Guide — 4-layer health model, runbooks, IP reconciliation, synthetic health check
PoC Checklist — Preconditions, scope, validation steps, success criteria, go/no-go

Contributions and questions are welcome.

Series Navigation

Part 1: Why Your FSx for ONTAP Logs Deserve Better
Part 2: Shipping FSx for ONTAP Logs to Datadog, The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline (this post)

Coming next:

Splunk: Replacing EC2 + Universal Forwarder with Lambda + HEC
OpenTelemetry: The vendor-neutral escape hatch

Questions about FPolicy or the Fargate architecture? Drop a comment below.

Previous: Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

Diary of a Builder: The Road to Orchestrating Two Worlds

Diana Castro — Mon, 18 May 2026 00:38:32 +0000

Learning a second cloud without starting from scratch

In technology there is an uncomfortable but also liberating truth: we never finish mastering a subject completely. What you knew yesterday can be obsolete tomorrow and, in the world of public clouds, where services evolve constantly, it is practically impossible to know every detail of every tool.

Rather than aspiring to know everything, the real focus is on understanding the fundamentals and specializing in certain domains. It is about recognizing which services exist, what they were designed for, and in which scenarios they add value. That way, when you face a real problem, you are not starting from zero: you know what to look for and where to find support.

The challenge of learning another cloud

Rather than mastering one cloud in its entirety, the real focus is on continuous learning and on developing the technical judgment to understand how services work and when to use them.

And through one of those opportunities that life gives you (and for which I am enormously grateful), I ended up facing a new challenge: learning a second cloud.

A challenge that commands respect.

One that can even create some uncertainty.

But one that also expands the way we think about architecture.

The question then was:

How do you approach this challenge without starting completely from scratch?

The answer was to reuse my foundational knowledge.

Instead of learning from a blank page, I started looking for patterns, equivalences, and analogies:

This service resembles that other one.
This solution solves a similar problem in another cloud.
This concept changes its name, but not necessarily its purpose.

And yes, that approach works… until it stops working.

When equivalences are no longer enough

The first impulse when learning a second cloud is to look for direct translations between services. That is natural. We need familiar references to orient ourselves.

But eventually the important differences show up.

You discover that:

Region Pairs in Azure approach Disaster Recovery in a different way.
The identity model does not map 1:1 to AWS.
Assumptions about automatic failover can be completely inverted.
The organization of resources follows different philosophies.
Even the way you operate and navigate the platform changes.

And that is where something interesting happens: you stop trying to translate one cloud into the other and start to understand how each provider thinks.

That is usually the point where the learning really begins.

The shared responsibility model

(AWS Shared Responsibility Model & Azure Shared Responsibility Model)

The shared responsibility model is conceptually the same in AWS and Azure: the provider secures the cloud infrastructure, while the customer is responsible for configuration, data, and access.

However, although the principle is equivalent, its implementation varies with the service's level of abstraction and with each provider's philosophy.

At first glance it can look like a simple concept… until you get to the details.

Default values, initial configurations, and the way each cloud applies its controls are not identical. And, as so often in technology, the devil is in the details.

We can think of the classic house analogy:

The provider builds the structure.
It guarantees that the infrastructure is secure.
But you decide who comes in, what permissions they have, and how you protect what you keep inside.

The problem is that not all houses come configured the same way.

Some platforms enable more controls from the start.

Others require the customer to define them explicitly.

And that is where it becomes evident that, although the model is the same in theory, the implementation changes significantly in practice.

Because in multi-cloud it is not enough to understand what you are responsible for protecting.

You also need to understand:

how each provider interprets that responsibility,
which controls are enabled by default,
which configurations require manual intervention,
and which security assumptions you are inheriting without realizing it.

That is usually one of the first moments when you discover that learning a second cloud is not about memorizing services… but about adjusting the way you think about security.

Cloud structure

It would be a mistake to try to define equivalences between services without first understanding how each cloud is organized. Before talking about services, networks, or security, we need to understand the foundation everything is built on.

Because although AWS and Azure share many concepts, the way they structure their infrastructure reflects quite different philosophies.

This walkthrough does not aim to be exhaustive.

The idea is to build a quick mental map that helps show where the similarities begin… and where the important differences appear.

Global organization

At the global level, Azure and AWS adopt different strategies to organize and isolate their infrastructure.

In Azure, global organization is based on Geographies, which group multiple regions within a single boundary oriented mainly toward:

regulatory compliance,
data residency,
and latency.

These geographies are part of a highly interconnected environment where services, identity, and governance are managed in a relatively unified way.

AWS, in contrast, structures its global organization through Partitions, which represent much sharper isolation boundaries at both the technical and the regulatory level.

Each partition works practically as an independent environment:

separate services,
distinct endpoints,
its own controls,
and even IAM isolation.

That approach means AWS puts much more weight on decoupling between global environments.
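One place where partitions show up in day-to-day work is the ARN, whose second field names the partition. The following is a minimal sketch, assuming the standard `arn:partition:service:region:account-id:resource` layout. The helper names `partition_of` and `same_partition`, the bucket and role names, and the account ID are made up for illustration.

```python
def partition_of(arn: str) -> str:
    """Return the partition field of an ARN (hypothetical helper)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[1]


def same_partition(arn_a: str, arn_b: str) -> bool:
    """Resources in different partitions live behind separate IAM and endpoints."""
    return partition_of(arn_a) == partition_of(arn_b)


# Illustrative ARNs only; the names and the account ID are placeholders.
commercial = "arn:aws:s3:::example-bucket"
china = "arn:aws-cn:iam::123456789012:role/example-role"

print(partition_of(commercial), partition_of(china), same_partition(commercial, china))
```

A check like this can be handy in tooling that must refuse to wire together resources from different partitions, since credentials and endpoints do not cross that boundary.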

Regions and Availability Zones

At this level, the organization of AWS and Azure becomes much more comparable, although important differences remain.

Both providers operate globally distributed regions, generally composed of multiple Availability Zones (AZs) designed to offer high availability and resilience.

However, the implementation differs quite a bit.

One of the most relevant differences is that Azure works with the concept of Region Pairs, where most regions have a defined counterpart for disaster recovery scenarios.

This allows Microsoft to:

coordinate updates,
prioritize recovery,
and maintain more structured continuity strategies.

In AWS there is no automatic equivalent.

Multi-region strategies must be designed explicitly by the architect.

That gives more flexibility, but also more responsibility.

At the AZ level there are also relevant differences.

AWS maintains fairly consistent coverage: its regions are built with multiple Availability Zones, typically three or more.

In Azure, although many modern regions do have multiple AZs, not all regions offer full Availability Zone support, which can affect architecture decisions depending on the location you choose.

Datacenters and low-latency extensions

At the lowest level of infrastructure, both providers operate on physical datacenters.

In both Azure and AWS, these datacenters are part of a higher abstraction: Availability Zones, which group one or more physical facilities to reduce single points of failure.

In Azure, although the datacenter is not exposed directly as a resource, there are important concepts such as Fault Domains and Update Domains.

These let you distribute virtual machines so that physical failures or scheduled maintenance have minimal impact.

AWS does not expose exactly the same granularity.

Instead, it relies on mechanisms such as Placement Groups, distribution across AZs, and resilience design at the regional level.

Local Zones and edge computing

Beyond the traditional datacenter, both providers have extended their infrastructure to locations closer to the end user to reduce latency.

In AWS, this takes the form of Local Zones, which extend a region into specific metropolitan areas and let you run workloads with very low latency without deploying a full region.

Azure offers similar initiatives such as Azure Extended Zones and Azure Stack Edge, although their availability is currently more limited and the approach usually combines low latency with hybrid integration.

Comparative summary

Concept	Azure	AWS
Level 1: Global	Geography (`US`, `Europe`, `Asia Pacific`) • Groups multiple regions • Defines data residency • Compliance boundary • Unified environment	Partition (`aws`, `aws-cn`, `aws-us-gov`) • Groups multiple regions • Complete isolation of IAM, services, and endpoints • Legal and regulatory boundary • Independent environments
Level 2: Regional	Region (`East US`, `West Europe`) • Multiple global regions • Each region can have multiple AZs • Defined Region Pairs • Coordinated updates • Recovery prioritization	Region (`us-east-1`, `eu-west-1`) • Multiple global regions • Each region has multiple AZs • No automatic pairing • Multi-region strategy defined by the architect
Level 3: Availability Zones	Availability Zone (AZ) • 3 or more AZs in supported regions • Physically separate datacenters • Low latency between AZs • Not all regions have AZs	Availability Zone (AZ) • Most regions have multiple AZs • Physically separate datacenters • Low latency between AZs • More consistent coverage
Level 4: Datacenter	Datacenter (not exposed to the user) • One or more datacenters per AZ • Fault Domains • Update Domains • Abstraction managed by the platform	Datacenter (not exposed to the user) • One or more datacenters per AZ • Placement Groups • Distribution managed through architecture • No direct equivalent to Update Domains
Local extensions	Azure Extended Zones / Azure Stack Edge • Low latency • Hybrid scenarios • More limited availability	Local Zones / Wavelength Zones • Metropolitan extension of regions • Ultra-low latency • 5G and edge computing integration

💡 Pro Tip

The similarities between AWS and Azure make learning easier, but the differences in implementation are what really define a good architecture.

Designing correctly means adapting patterns, not translating them literally.

How the clouds are organized

One of my first mental shocks in the multi-cloud process was realizing that AWS and Azure do not organize their resources in the same way. It looks like an unimportant administrative detail… until the conversations about environments, permissions, billing, governance, or workload separation begin. That is when you quickly understand that each cloud's organizational structure has far more impact than you imagined at the start.

In fact, this has probably been one of the hardest topics both to understand and to explain when I talk with colleagues who come mainly from working with a single cloud.

In AWS, the mental model revolves around the account. From my point of view, that is where the first major organizational separation is normally established. For example, if someone says:

“I want to separate environments”

The natural answer is usually to create separate accounts for production, development, security, or logging, which is very much in line with AWS best practices.

On top of those accounts, organizational structures are built with AWS Organizations, which lets you group them for administrative and control purposes. From there come concepts such as Organizational Units (OU), Service Control Policies (SCP), and centralized identities that help establish common rules across multiple accounts.

In Azure, the approach feels much more hierarchical and integrated from the start. The model is usually understood like this:

Tenant → Subscription → Resource Group → Resource

Each level serves a different purpose related to organization, billing, permissions, and administration. The subscription does not represent the same level of operational separation as an AWS account; it often works more like an administrative container inside a larger hierarchy controlled by the tenant.

From my perspective, AWS prioritizes separation through accounts more explicitly, while Azure approaches organization through a hierarchy that is deeply integrated into the platform's operating model. And note that this does not mean AWS lacks hierarchies or organizational structures; the account simply tends to become the main element around which many architectural decisions are designed.

Let's look at each element in more detail from each provider's perspective.

The Azure approach

Tenant

It is the highest level. It represents the entire organization in Azure and is associated with an instance of Microsoft Entra ID (formerly Azure Active Directory). When a company signs up for Azure, a tenant is created. Everything else lives inside it.

Management Group

It is optional, but very useful in large organizations. It lets you group subscriptions to apply policies and permissions centrally.

For example, you can have one Management Group for all production subscriptions and another for development, applying different rules without configuring each subscription individually. You could also have a Management Group that groups all of the organization's subscriptions solely for governance and compliance.

Subscription

It is the main administrative and financial container. Every resource created in Azure lives inside a subscription. It is also where quotas are applied and where billing is consolidated.

Many organizations use separate subscriptions for production, development, or business units, more for administration and financial control than for technical separation between environments.

An important detail, and a frequent source of confusion, is that even though the subscription is an administrative container, you cannot mix resources from different subscriptions within the same Resource Group.

Resource Group

It is a logical container inside a subscription that groups the resources related to a workload: App Services, databases, Cosmos DB, networks, and so on.

As long as resources belong to the same administrative scope, they can be grouped in a Resource Group. Besides organizing resources, it lets you apply permissions through RBAC and manage the full lifecycle of a solution: if you delete the Resource Group, you delete everything it contains.

Personally, this is one of the elements that helped me most during my Azure adoption process.

Resource

It is the concrete resource: a VM, a Storage Account, a NAT Gateway, or a database. It represents the smallest unit of infrastructure or service within Azure.

The AWS approach

Root Account

It is the initial account created when an organization starts using AWS. It has full, unrestricted access to all of its resources and services.

The general recommendation is not to use it for daily work, to protect it with MFA, and to reserve it for very specific administrative tasks.

AWS Organizations

It is the structure that lets you govern multiple AWS accounts from a central point. It is enabled from the Root Account, which then becomes the organization's Management Account.

From there you can create child accounts, group them, and apply common policies.

Organizational Unit (OU)

It is a container inside AWS Organizations that groups accounts with a common purpose.

For example, you can have one OU for production, another for development, and another for security, nesting OUs where needed (AWS supports up to five levels of nesting).

Policies applied to an OU are inherited by all the accounts contained in it, which lets you govern at scale without configuring each account individually.

Service Control Policy (SCP)

It is a control mechanism applied to OUs or accounts.

It defines the maximum set of actions permitted inside an account. Even if a user has broad permissions through IAM, if an SCP restricts an action, the restriction prevails.

SCPs do not grant permissions by themselves; they only set limits.
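The idea that an SCP sets a ceiling but grants nothing can be sketched as a set intersection. This is a deliberately simplified model, assuming plain action names with no wildcards, conditions, or resource scoping. The function `effective_permissions` and the action sets are hypothetical, and this is not a description of how AWS evaluates policies internally.

```python
from __future__ import annotations


def effective_permissions(iam_allowed: set[str], scp_allowed: set[str],
                          scp_denied: set[str]) -> set[str]:
    """Actions an identity can actually perform inside an account.

    An action must be allowed by IAM *and* fall inside the SCP ceiling,
    and an explicit SCP deny always wins.
    """
    return (iam_allowed & scp_allowed) - scp_denied


# Illustrative inputs only.
iam_allowed = {"s3:GetObject", "s3:DeleteBucket", "ec2:RunInstances"}
scp_allowed = {"s3:GetObject", "s3:DeleteBucket", "ec2:RunInstances",
               "rds:CreateDBInstance"}
scp_denied = {"s3:DeleteBucket"}

result = effective_permissions(iam_allowed, scp_allowed, scp_denied)
print(sorted(result))
```

Note that `rds:CreateDBInstance` does not appear in the result: the SCP allows it, but no IAM policy grants it, which mirrors the "limits only" behavior described above.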

AWS account

It is probably the most important organizational unit in the AWS model.

Each account has its own resources, networks, billing, and service limits. Access between accounts does not happen automatically; it normally requires explicit configuration through IAM, networking, or shared services.

It is the closest conceptual equivalent to an Azure Subscription, although with a much sharper operational separation built into the platform's design.

Conceptual equivalences

Azure level	Conceptual AWS equivalent	Key note
Tenant	AWS Organizations / Root Context	In Azure everything lives inside a tenant associated with Entra ID; in AWS the organizational context is usually built around Organizations and the root account
Management Group	Organizational Unit (OU)	Both let you group child containers to apply policies and centralized governance
Subscription	AWS account	Both work as administrative and financial containers, although the AWS account usually represents a sharper operational separation
Resource Group	No direct equivalent	AWS uses tags, stacks, and organizational conventions to group resources, but there is no container with the same operational weight and lifecycle as a Resource Group
Resource	Resource	The smallest consumable unit of infrastructure or service in both clouds

And this brings us to billing, which also reflects each cloud's organizational philosophy quite well.

In Azure, the subscription carries a lot of administrative and financial weight; many governance, limit, and cost-control strategies are built around it.

In AWS, although the account is still a key financial element, granular cost analysis tends to rely much more on tagging strategies and on consolidation through AWS Organizations.

My personal impression is that Azure encourages hierarchical segmentation from the organizational structure itself, while AWS favors account-based separation complemented by detailed tagging models for financial and operational governance.

Let's look at a practical example

Imagine a research and development organization that is starting its cloud adoption and needs to build an orderly, secure, and scalable structure in both AWS and Azure.

The organization wants to clearly separate its environments:

Development
Testing
Pre-production
Production

It also wants to implement well-defined controls for:

permissions and access
billing and cost control
governance
compliance
shared networking
centralized security services

At first glance, the goal looks identical in both clouds: organize resources, separate environments, and apply policies. However, once we start designing the structure, important differences in each provider's organizational philosophy appear quickly.

In AWS, the design usually leans toward separation by accounts, where each environment lives in an independent account managed through AWS Organizations and Organizational Units (OU).

In Azure, the approach is normally built around an organizational hierarchy based on:

Tenant → Management Groups → Subscriptions → Resource Groups

where governance and administration are deeply integrated into the platform's hierarchical structure.

The following diagram shows how this same scenario could be modeled in both clouds and helps visualize why, even though the goals are similar, the way of thinking about and organizing infrastructure changes considerably between AWS and Azure.

Identity: where everything begins

You can replicate infrastructure across clouds, but if you don't understand how identity works, you cannot govern them. And this is perhaps one of the most complex particularities when you are moving between two worlds.

Personally, I struggled a bit with this topic. Both environments solve the same need in similar ways, but (and here is the key point) similar is not the same.

My biggest confusion came from this:

AWS gives you fine-grained control from the start, while Azure offers you an initial abstraction layer and then lets you go deeper.

Let's analyze it in more detail.

AWS: identity and permissions in a single system

In AWS, identity and permissions are defined within a single system: AWS Identity and Access Management (IAM).

Here you have granular control through policies, where you define exactly what each identity can do on each resource.

I see it like this:

Users / Groups / Roles
Policies (JSON)
Permissions on services and resources

Assignments are highly granular.

That fine-grained control lets you apply the principle of least privilege from the start, although it can be more complex and, at times, a little dry in the beginning.
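To make the "Policies (JSON)" layer concrete, here is a minimal sketch of a least-privilege policy document built as a Python dictionary and serialized to JSON. The document keys (`Version`, `Statement`, `Effect`, `Action`, `Resource`) follow the IAM policy grammar. The function name `read_only_bucket_policy`, the `Sid`, and the bucket name are assumptions made for this example.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build a policy document that allows reading objects in one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsInOneBucket",  # illustrative statement ID
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


# "example-research-data" is a placeholder bucket name.
policy = read_only_bucket_policy("example-research-data")
print(json.dumps(policy, indent=2))
```

Everything not listed is implicitly denied, so the identity this policy is attached to can read objects in that one bucket and do nothing else.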

Azure: identity and authorization as separate layers

In Azure, by contrast, the model is split into two well-defined layers.

On one side is identity, managed in Microsoft Entra ID:

Users
Groups
Applications / Service Principals

This is where you define who you are.

On the other side is authorization, managed through Azure Role-Based Access Control (RBAC):

Roles: Owner, Contributor, Reader (and many more)

Assignments at the level of:
- Subscription
- Resource Group
- Specific resource

This is where you define what that identity can do.
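In Azure RBAC, an assignment made at a broader scope is inherited by the scopes beneath it. That behavior can be sketched with resource ID prefixes, assuming the usual `/subscriptions/.../resourceGroups/.../providers/...` ID layout. The `RoleAssignment` class, the `applies_to` helper, the all-zero subscription ID, and the resource names are hypothetical, and management group scopes are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # who: a user, group, or service principal in Entra ID
    role: str       # what: Owner, Contributor, Reader, ...
    scope: str      # where: subscription, resource group, or resource ID


def applies_to(assignment: RoleAssignment, resource_id: str) -> bool:
    """True if the scope is the resource itself or one of its parents."""
    scope = assignment.scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    return target == scope or target.startswith(scope + "/")


# Placeholder identifiers for illustration only.
SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"
RG = f"{SUB}/resourceGroups/rg-research-dev"
VM = f"{RG}/providers/Microsoft.Compute/virtualMachines/vm-01"

reader_on_rg = RoleAssignment("ana@example.com", "Reader", RG)
print(applies_to(reader_on_rg, VM), applies_to(reader_on_rg, SUB))
```

The identity (who) and the assignment (what, where) stay as two separate objects, which is the two-layer split described above.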

The important difference

This separation is key to understanding Azure.

While in AWS everything lives in a single system, in Azure you have to think in two dimensions:

identity
permissions

And although both models end up solving the same problem, the way you get there changes quite a bit between platforms.

How resources communicate: Networking

This is where the strong philosophical differences between the two clouds really begin. And to be very honest, networking is not my strong suit. AWS and Azure look quite similar on the surface, but I think the mental design changes a little, so I will share my “Rosetta Stone” to try to ease the process of adapting to another cloud, along with some reflections on the most notable networking elements.

VPC vs VNet

Conceptually, both services serve the same goal: creating logical private networks inside the cloud to isolate and connect resources securely.

Both AWS and Azure let you:

define CIDR ranges,
segment with subnets,
control traffic,
connect on-premises environments,
and even other clouds.

Up to this point, it seems we are talking about exactly the same thing. But once again, the model can look similar while the philosophy behind the design changes quite a bit.

In AWS, the VPC feels very explicit about isolation. The architect very consciously defines how the network is segmented, how traffic is routed, and which components allow traffic out to or in from the Internet. I come from software, so that has always been hard for me.

Many elements must be declared explicitly:

Internet Gateways
Route Tables
NAT Gateways
subnet associations

From the start there is a lot of control and awareness of what is and is not allowed, and of course plenty of headaches when you cannot reach a resource.

In Azure, the VNet feels more integrated into the general ecosystem of the subscription and the region. The model tends to feel more abstracted and connected to Azure's operational design.

Although route tables, gateways, and segmentation also exist, several behaviors come more integrated into the platform's model.

One of the most important details is the relationship between subnets and availability zones.

In AWS, a subnet belongs to a specific Availability Zone.
In Azure, subnets live at the regional level, and it is the resources that are later distributed across zones when the service supports it.

It is a small detail that changes the way you think about resilience and network design quite a bit.
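The subnet difference can be illustrated with Python's standard `ipaddress` module. This is a minimal sketch, assuming a `10.0.0.0/16` address space and three zones. The function `one_subnet_per_az`, the zone names, and the subnet name `snet-app` are illustrative choices, not values taken from any real deployment.

```python
from __future__ import annotations

import ipaddress


def one_subnet_per_az(vpc_cidr: str, az_names: list[str],
                      new_prefix: int) -> dict[str, str]:
    """Carve a VPC CIDR into equally sized subnets and pin one to each AZ."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = network.subnets(new_prefix=new_prefix)  # generator of child networks
    return {az: str(next(subnets)) for az in az_names}


# AWS-style layout: one subnet per Availability Zone for the same tier.
aws_layout = one_subnet_per_az(
    "10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"], 24
)

# Azure-style layout: a single regional subnet for the tier; zone placement
# is then chosen per resource, where the service supports zones.
azure_layout = {"snet-app": "10.0.0.0/24"}

print(aws_layout)
print(azure_layout)
```

The practical consequence is that a zone-resilient tier in AWS usually means several subnets and route table associations, while in Azure the same tier can sit in one subnet with the zone decided on each resource.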

At the time of writing this article, one region had only a single AZ.

NSG vs Security Groups: how similar are they?

At first, Azure Network Security Groups (NSG) and AWS Security Groups look practically the same, but don't be fooled. In the beginning it is just that false feeling of:

“I know this”.

Both let you control inbound and outbound traffic for resources inside the network. However, as you dig deeper, important differences in philosophy and behavior appear.

In AWS, Security Groups are stateful and focus mainly on protecting specific workloads or network interfaces.

They work only through ALLOW rules; if traffic is not explicitly allowed, it is implicitly denied.

There are no DENY rules.

AWS also separates out another component called the Network ACL (NACL), which works at the subnet level.

NACLs:

are stateless,
support ALLOW rules,
support DENY rules.

This creates a fairly clear separation between subnet-level controls and workload-level controls.

In Azure, NSGs consolidate part of both concepts.

They are also stateful, but they can be applied both to:

subnets,
and directly to NICs.

Unlike AWS Security Groups, NSGs do support explicit DENY rules.

That small detail changes the mental approach quite a bit.

AWS separates the layers of network security more explicitly.
Azure tends to integrate more functionality into a single component.

Pro Tip

While AWS works with separate layers of control (NACLs for subnets and Security Groups at the service level), Azure consolidates the model in the NSG.

This hints at the philosophical difference: AWS tends to separate components, while Azure consolidates functionality.

As promised: my “Rosetta Stone”

Azure VNet	AWS VPC	Key differences
Regional private virtual network	Regional private virtual network	Azure integrates the VNet more visibly into the subscription and Resource Group model, while AWS treats the VPC as a more explicit and decoupled isolation boundary
Regional subnets	Subnets tied to a specific AZ	In Azure, subnets belong to the regional VNet; in AWS, each subnet lives inside a specific Availability Zone
NSG applicable to subnet or NIC	Security Groups applied to interfaces/instances	Azure lets you apply controls at both the subnet and the NIC level and supports Allow and Deny rules; in AWS, Security Groups focus mainly on interfaces and workloads and support only Allow rules, while the NACL has no standalone counterpart in Azure
User Defined Routes (UDR)	Route Tables	Azure handles routing in a way that is more integrated into the platform; in AWS, the associations between subnets and Route Tables tend to be more explicit
VPN Gateway	Site to Site VPN	Both services let you connect on-premises networks to the cloud through IPsec tunnels, supporting hybrid scenarios and dynamic routing with BGP. However, Azure exposes traditional networking concepts more explicitly, such as VPN types (route-based and policy-based), SKUs, active-active configurations, and advanced options, from the initial deployment process. In AWS, although these capabilities also exist, the managed service abstracts away more of the operational complexity and the flow tends to feel more guided during implementation
ExpressRoute	Direct Connect	Both Azure ExpressRoute and AWS Direct Connect usually require carriers or specialized partners to establish the physical connectivity. Both services aim to reduce dependence on the public Internet and offer more stable and predictable connections. However, ExpressRoute has historically been more tightly oriented toward the Microsoft ecosystem, with different peering models that allow private connectivity not only to VNets but also to Microsoft services and associated SaaS platforms. Direct Connect, for its part, is usually perceived as more focused on dedicated connectivity to specific VPCs, networks, and workloads within AWS
Service Endpoints / Private Endpoints	VPC Endpoints	Azure distinguishes two explicit approaches: Service Endpoints restrict access to the service to authorized VNets without creating additional network interfaces, while Private Endpoints assign a private IP inside the VNet and allow resolution through private DNS, with the option of disabling public access to the service. AWS groups these patterns under the concept of VPC Endpoints, distinguishing internally between Gateway Endpoints (integrated through route tables and limited mainly to S3 and DynamoDB) and Interface Endpoints, which create an ENI with a private IP and allow private connectivity to a wide variety of AWS services and PrivateLink-compatible services, even in hybrid scenarios over VPN or Direct Connect
NAT Gateway	NAT Gateway	Both clouds use a NAT Gateway so that resources in private subnets can reach the Internet without exposing their IP directly. In Azure it is enough to associate it with the subnet, without touching route tables. In AWS the process is more explicit: it requires an Internet Gateway, a public subnet where the NAT Gateway resides, and a manual entry in the route table of each private subnet, which gives more control but also more room for error, especially in multi-zone architectures
Public IP	Elastic IP	Azure treats the public IP as an independent resource that can be associated with components such as NICs, Load Balancers, or NAT Gateways. Although the IP exists as a separate resource, operationally it is usually created and managed together with the associated service. To keep it, you use static allocation and disassociate it without deleting the resource, which lets you reuse it later. In AWS the mental model is somewhat different: Elastic IPs are the main mechanism for persistent public addresses. They are reserved explicitly within the account and can be associated with or moved between instances and services independently. Both clouds charge for unassociated static public IPs; the difference is that AWS makes explicit reassignment a natural part of the operating model, while Azure tends to tie IP management more closely to the lifecycle of the resource that consumes it

Interacting with the cloud

I could not close this first part without talking about something that also changes a great deal between providers: the way we interact with the cloud day to day.

Both platforms offer:

a web console,
APIs,
SDKs,
Infrastructure as Code,
and a CLI.

However, once again the philosophy behind the design feels quite different.

In Azure, Azure Resource Manager (ARM) works as a unified management layer for deployments, permissions, policies, and resource organization. That integration makes many operations feel more centralized and consistent with the hierarchical structure highlighted earlier.

In AWS, the experience tends to feel more oriented toward individual services.

Although there are unifying mechanisms such as:

Organizations,
CloudFormation,
or Control Tower,

daily interaction often involves navigating between services that are relatively decoupled from one another.

That offers a great deal of control and flexibility, but it can also require a better understanding of how multiple components interact in order to operate fluently.

I don't consider one approach “better” than the other; rather, they highlight the difference in philosophy between the two clouds.

Final reflections

This is just a first look at the challenge of becoming a multi-cloud architect.

At a time when more and more organizations are leaving behind the idea of depending on a single provider, we need to develop the ability to understand the strengths, limitations, and operating philosophy of each platform.

Being multi-cloud does not only mean learning equivalent services between AWS and Azure. It also means understanding how each ecosystem thinks, how it organizes its resources, how it governs its infrastructure, and how it makes operational decisions.

In the end, the real challenge is knowing which piece to adjust in each environment to build solutions that are:

sustainable,
efficient,
and financially responsible.

I am still learning in this process, and later on I also want to share my experiences and strategies around AI in both cloud worlds.

Operational Hardening — Guardrails, Secrets Rotation & SLO — FSx ONTAP S3AP Phase 12

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 17 May 2026 18:21:39 +0000

TL;DR

Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.

Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.

This is Phase 12 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10 and Phase 11, Phase 12 delivers:

Capacity Guardrails: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics
Secrets Rotation: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on a 90-day interval
Synthetic Monitoring: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)
Capacity Forecasting: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule
Data Lineage Tracking: DynamoDB table with GSI for processing history and opt-in integration
Protobuf TCP Framing: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader
SLO Definition: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection
FPolicy Pipeline E2E: NFS file creation → FPolicy → SQS delivery confirmed
Persistent Store Replay: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios
Property-Based Testing: 16 Hypothesis properties, 53 tests, 3 bugs discovered
S3 Access Point Deep Dive: Multi-layer authorization, IAM ARN format, VPC network constraints

Key metrics: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS

The problem

FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.

The solution

A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:

graph LR
    A[Auto-Expand Request] --> B{GuardrailMode?}
    B -->|DRY_RUN| C[Log + Allow<br/>fail-open on DDB error]
    B -->|ENFORCE| D[Check + Block<br/>fail-closed on DDB error]
    B -->|BREAK_GLASS| E[Bypass All Checks<br/>SNS Alert + Audit Log]
    C --> F[DynamoDB Tracking]
    D --> F
    E --> F
    F --> G[CloudWatch EMF Metrics]

Mode	Behavior on Check Failure	Behavior on DynamoDB Error
`DRY_RUN`	Log warning, allow action	Fail-open (allow)
`ENFORCE`	Block action, emit metric	Fail-closed (deny)
`BREAK_GLASS`	Skip all checks	SNS alert + audit log
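
The table above can be read as a small decision function. The following is a minimal sketch of that reading, not the project's `shared.guardrails` code: the local `GuardrailMode` enum, the `decide` function, and the note strings are all hypothetical stand-ins, and `checks_passed=None` is an assumed way to represent a DynamoDB error.

```python
from __future__ import annotations

from enum import Enum


class GuardrailMode(Enum):
    # Local stand-in for the project's enum (assumption, not the real class).
    DRY_RUN = "DRY_RUN"
    ENFORCE = "ENFORCE"
    BREAK_GLASS = "BREAK_GLASS"


def decide(mode: GuardrailMode, checks_passed: bool | None) -> tuple[bool, str]:
    """Return (allowed, note) for one guardrail evaluation.

    checks_passed is None when the tracking store could not be read,
    which models the "DynamoDB error" column of the table.
    """
    if mode is GuardrailMode.BREAK_GLASS:
        # Checks are skipped; the caller is expected to alert and audit.
        return True, "bypass: alert and audit"
    if checks_passed is None:
        # Tracking store unavailable: DRY_RUN fails open, ENFORCE fails closed.
        if mode is GuardrailMode.DRY_RUN:
            return True, "fail-open"
        return False, "fail-closed"
    if checks_passed:
        return True, "ok"
    # A check failed: DRY_RUN only logs, ENFORCE blocks.
    if mode is GuardrailMode.DRY_RUN:
        return True, "would-block (logged)"
    return False, "blocked"
```

Keeping the decision pure (no I/O) makes the fail-open versus fail-closed behavior easy to cover with unit or property tests before wiring in DynamoDB and SNS.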

Core implementation

from shared.guardrails import CapacityGuardrail, GuardrailMode

guardrail = CapacityGuardrail()  # Mode from GUARDRAIL_MODE env var

result = guardrail.check_and_execute(
    action_type="volume_grow",
    requested_gb=50.0,
    execute_fn=my_grow_function,
    volume_id="vol-abc123",
)

if result.allowed:
    print(f"Action executed: {result.action_id}")
else:
    print(f"Action denied: {result.reason}")
    # Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active

Three safety checks (ENFORCE mode)

Rate limit: Max 10 actions per day per action type
Daily cap: Max 500 GB cumulative expansion per day
Cooldown: 300-second minimum interval between actions

All thresholds are configurable via environment variables (GUARDRAIL_RATE_LIMIT, GUARDRAIL_DAILY_CAP_GB, GUARDRAIL_COOLDOWN_SECONDS).
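
As a rough illustration of how the three checks and their environment-variable thresholds could fit together, here is a minimal sketch. The `DailyState` dataclass, the `first_violation` function, and the order in which the checks run are assumptions for illustration; only the environment variable names, default values, and reason strings come from the text above.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class DailyState:
    # Hypothetical view of today's tracking row.
    action_count: int                 # actions already taken today
    daily_total_gb: float             # GB already expanded today
    seconds_since_last: float | None  # None if there was no previous action


def first_violation(
    state: DailyState,
    requested_gb: float,
    env: Mapping[str, str] = os.environ,
) -> str | None:
    """Return the first failed check's reason, or None if all checks pass."""
    rate_limit = int(env.get("GUARDRAIL_RATE_LIMIT", "10"))
    daily_cap_gb = float(env.get("GUARDRAIL_DAILY_CAP_GB", "500"))
    cooldown_s = float(env.get("GUARDRAIL_COOLDOWN_SECONDS", "300"))

    if state.action_count >= rate_limit:
        return "rate_limit_exceeded"
    if state.daily_total_gb + requested_gb > daily_cap_gb:
        return "daily_cap_exceeded"
    if state.seconds_since_last is not None and state.seconds_since_last < cooldown_s:
        return "cooldown_active"
    return None
```

Passing `env` explicitly keeps the function testable without mutating the process environment.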

DynamoDB tracking schema

Attribute	Type	Description
`pk`	String	Action type (e.g., `volume_grow`)
`sk`	String	Date (`YYYY-MM-DD`)
`daily_total_gb`	Number	Cumulative GB expanded today
`action_count`	Number	Number of actions today
`last_action_ts`	String	ISO timestamp of last action
`actions`	List	Audit trail of all actions
`ttl`	Number	30-day auto-expiry

BREAK_GLASS production considerations

In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.

2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation

The problem

ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.

The solution

A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:

sequenceDiagram
    participant SM as Secrets Manager
    participant Lambda as Rotation Lambda (VPC)
    participant ONTAP as FSx ONTAP REST API

    SM->>Lambda: Step 1: createSecret
    Lambda->>SM: Generate new password, store as AWSPENDING

    SM->>Lambda: Step 2: setSecret
    Lambda->>ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
    ONTAP-->>Lambda: 200 OK

    SM->>Lambda: Step 3: testSecret
    Lambda->>ONTAP: GET /api/cluster (using new password)
    ONTAP-->>Lambda: 200 OK (cluster UUID returned)

    SM->>Lambda: Step 4: finishSecret
    Lambda->>SM: Promote AWSPENDING → AWSCURRENT

Key design decisions

VPC deployment: Lambda must be in the same VPC as the ONTAP management LIF
90-day interval: Configurable via CloudFormation parameter
Validation: Step 3 (testSecret) verifies the new password works by calling the ONTAP cluster API
Rollback safety: If testSecret fails, the old password remains as AWSCURRENT
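
The four-step flow can be sketched as a dispatcher on the `Step` field that Secrets Manager includes in the rotation event (alongside `SecretId` and `ClientRequestToken`). This is a minimal sketch, not the project's Lambda: `make_handler`, `simulate_rotation`, and the injected step callables are hypothetical, and the real steps would call Secrets Manager and the ONTAP REST API.

```python
from __future__ import annotations

from typing import Callable

# Step names that Secrets Manager passes in the "Step" field of the rotation event.
STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")

StepFn = Callable[[str, str], None]  # (secret_id, client_request_token)


def make_handler(steps: dict[str, StepFn]) -> Callable[[dict, object], None]:
    """Build a Lambda-style handler that dispatches on event["Step"]."""
    missing = [name for name in STEPS if name not in steps]
    if missing:
        raise ValueError(f"missing step implementations: {missing}")

    def handler(event: dict, context: object = None) -> None:
        step = event["Step"]
        if step not in steps:
            raise ValueError(f"unknown rotation step: {step}")
        # An exception here fails the rotation attempt, so later steps
        # (including the AWSPENDING -> AWSCURRENT promotion) are not reached.
        steps[step](event["SecretId"], event["ClientRequestToken"])

    return handler


def simulate_rotation(handler, secret_id: str, token: str) -> list[str]:
    """Local stand-in for Secrets Manager: call steps in order, stop on failure."""
    completed: list[str] = []
    for step in STEPS:
        try:
            handler({"SecretId": secret_id, "ClientRequestToken": token, "Step": step}, None)
        except Exception:
            break
        completed.append(step)
    return completed
```

The simulation helper exists only for local testing. It shows the rollback-safety property from the list above: if the `testSecret` callable raises, `finishSecret` is never invoked and the current secret version stays in place.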

Bugs discovered during live testing

Three bugs were found and fixed during the actual rotation execution:

AWSPENDING empty check: createSecret must handle the case where get_secret_value(VersionStage='AWSPENDING') raises ResourceNotFoundException
management_ip fallback: The Lambda must support both management_ip (new) and ontap_mgmt_ip (legacy) keys in the secret JSON
Cluster UUID validation: testSecret now validates the response contains a valid uuid field, not just HTTP 200
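
The second and third fixes are easy to express as small pure functions. The sketch below is illustrative only: `management_ip` and `has_valid_cluster_uuid` are hypothetical helper names, and the response body is assumed to be a JSON object with a top-level `uuid` field, as described in the list above.

```python
from __future__ import annotations

import json
import uuid


def management_ip(secret_json: str) -> str:
    """Read the management address, accepting the new and the legacy key."""
    secret = json.loads(secret_json)
    ip = secret.get("management_ip") or secret.get("ontap_mgmt_ip")
    if not ip:
        raise KeyError("secret has neither management_ip nor ontap_mgmt_ip")
    return ip


def has_valid_cluster_uuid(status_code: int, body: str) -> bool:
    """True only if the call returned 200 and the body carries a parseable UUID."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
        uuid.UUID(str(payload["uuid"]))
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, missing field, non-object body, or malformed UUID.
        return False
    return True
```

Checking the body as well as the status code guards against a proxy or error page that answers 200 without actually authenticating against the cluster.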

Verification result

Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret):    ✅ ONTAP password changed via REST API
Step 3 (testSecret):   ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT

Operational note

Rotating fsxadmin affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default, so ensure the rotation Lambda's urllib3 or requests configuration handles certificate verification appropriately (see shared/ontap_client.py for the pattern used in this project).

For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing fsxadmin across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.

3. Synthetic Monitoring — CloudWatch Synthetics Canary

The problem

The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.

The solution

A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:

ONTAP Health Check: REST API call to the management endpoint (VPC-internal)
S3 Access Point Check: ListObjectsV2 against the S3AP alias

Critical finding: network-origin and endpoint configuration matter

During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.

This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS documents support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.

In this Phase 12 environment (Internet-origin S3 AP), the two checks behaved differently from the initial VPC Canary:

Check	Observed requirement in this environment	Result
ONTAP REST API	VPC-internal access to management LIF	✅ Works
S3AP health check	Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy	⚠️ Timed out from the initial VPC Canary configuration

Solution: Split into two monitoring paths:

ONTAP health: VPC-internal Canary (confirmed working, 88ms response)
S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)

This is documented as a critical constraint in docs/guides/s3ap-fsxn-specification.md.

Canary runtime version lesson

The template initially specified syn-python-selenium-3.0, which was deprecated on 2026-02-03; it now uses syn-python-selenium-11.0. CloudWatch Synthetics runtimes are deprecated frequently, so parameterize the version or keep defaults current.

AWS builder lesson: VPC placement is a design choice

A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is connected to a VPC, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.

4. Capacity Forecasting — Linear Regression with stdlib Only

The problem

Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.

The solution

A Lambda function running on a daily EventBridge schedule:

Fetches 30 days of FSx StorageUsed metrics from CloudWatch
Performs linear regression using only the Python standard library (zero external dependencies)
Publishes DaysUntilFull as a CloudWatch custom metric
Sends SNS alert when forecast drops below threshold (default: 30 days)

Linear regression implementation (stdlib only)

def linear_regression(data_points: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares linear regression using only math module."""
    n = len(data_points)
    if n < 2:
        raise ValueError("Need at least 2 data points for regression")

    sum_x = sum_y = sum_xy = sum_x2 = 0.0
    for x, y in data_points:
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x

    denominator = n * sum_x2 - sum_x * sum_x
    if abs(denominator) < 1e-10:
        return (0.0, sum_y / n)

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return (slope, intercept)

Edge cases handled

Scenario	DaysUntilFull	Behavior
< 2 data points	-1	Insufficient data, no prediction
slope ≤ 0 (shrinking/flat)	-1	Never fills up
Already over capacity	0	Immediate alert
Very low usage (0.03%)	169,374	Normal — far future prediction
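
The edge cases in the table can be sketched as one small function. This is an illustrative reading, not the project's Lambda code: the function name and parameters are hypothetical, rounding up with `math.ceil` is an assumption, and so is checking "already over capacity" before the slope.

```python
import math


def days_until_full(
    n_points: int,
    slope_gb_per_day: float,
    used_gb: float,
    capacity_gb: float,
) -> int:
    """Map a fitted growth rate to a DaysUntilFull value.

    Returns -1 when no prediction is possible, 0 when already full.
    """
    if n_points < 2:
        return -1  # insufficient data for a regression
    if used_gb >= capacity_gb:
        return 0   # already at or over capacity: alert immediately
    if slope_gb_per_day <= 0:
        return -1  # flat or shrinking usage never fills the volume
    return math.ceil((capacity_gb - used_gb) / slope_gb_per_day)
```

For example, 10 GB used on a 100 GB volume growing at 2 GB per day leaves 90 GB of headroom, which gives 45 days.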

Live verification

{
  "days_until_full": 169374,
  "current_usage_pct": 0.03,
  "total_capacity_gb": 1024.0,
  "growth_rate_gb_per_day": 0.006,
  "forecast_date": "2490-02-06T06:26:42Z"
}

The test environment has 0.03% usage — the prediction of 169,374 days is correct behavior. The alert threshold (30 days) ensures notifications only fire when action is genuinely needed.

This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat DaysUntilFull as an early-warning signal, not an exact prediction.

5. Data Lineage Tracking — DynamoDB with GSI

The problem

When a file is processed through the pipeline, operators need to trace: which UC processed it, when, what outputs were generated, and whether it succeeded or failed.

The solution

A DynamoDB table with a Global Secondary Index (GSI) provides three query patterns:

graph TD
    subgraph "DynamoDB: fsxn-s3ap-data-lineage"
        PK[PK: source_file_key<br/>SK: processing_timestamp]
        GSI[GSI: uc_id-timestamp-index<br/>PK: uc_id, SK: processing_timestamp]
    end

    Q1[Query by file] -->|PK lookup| PK
    Q2[Query by UC + time range] -->|GSI query| GSI
    Q3[Query by execution ARN] -->|Scan + filter| PK

For high-volume environments, consider adding a dedicated GSI on step_functions_execution_arn. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.

Integration helper (opt-in)

from shared.lineage import LineageTracker, LineageRecord

tracker = LineageTracker()
record = LineageRecord(
    source_file_key="/vol1/legal/contracts/deal-001.pdf",
    processing_timestamp="2026-05-16T14:30:45.123Z",
    step_functions_execution_arn="arn:aws:states:...:execution:...",
    uc_id="legal-compliance",
    output_keys=["s3://output-bucket/legal/reports/deal-001-analysis.json"],
    status="success",
    duration_ms=4523,
)
lineage_id = tracker.record(record)

Design principles

Non-blocking: Write failures emit a warning log but never interrupt the main processing pipeline
TTL: 365-day auto-expiry via DynamoDB TTL (configurable via LINEAGE_TTL_DAYS environment variable; regulated environments may require 7+ years — disable TTL and use S3 export for long-term retention)
Opt-in: UCs integrate by importing the helper — no mandatory coupling
PAY_PER_REQUEST: No capacity planning needed for variable workloads
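
The non-blocking and TTL principles can be sketched without any AWS dependency. In this minimal sketch, `ttl_epoch` and `record_non_blocking` are hypothetical names, and the `write` callable stands in for whatever performs the actual DynamoDB put. DynamoDB TTL expects a Number attribute holding a Unix epoch time in seconds, which is what `ttl_epoch` returns.

```python
from __future__ import annotations

import logging
import os
import time
from typing import Callable, Mapping

logger = logging.getLogger("lineage")


def ttl_epoch(now: float | None = None, env: Mapping[str, str] = os.environ) -> int:
    """Expiry time in Unix epoch seconds, LINEAGE_TTL_DAYS from now."""
    days = int(env.get("LINEAGE_TTL_DAYS", "365"))
    base = time.time() if now is None else now
    return int(base) + days * 86400


def record_non_blocking(write: Callable[[dict], None], item: dict) -> bool:
    """Try to persist a lineage item; never let a failure reach the caller."""
    try:
        write(item)
        return True
    except Exception as exc:  # broad on purpose: lineage must not break processing
        logger.warning("lineage write failed: %s", exc)
        return False
```

Returning a boolean instead of raising lets a UC decide whether to count failed lineage writes in a metric while the main processing path continues.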

Future: compliance-grade lineage (v2)

For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future LineageRecord v2:

Field	Purpose
`input_checksum`	SHA-256 of source file for integrity verification
`output_checksum`	SHA-256 of generated output
`fpolicy_sequence_number`	ONTAP-assigned sequence for ordering
`policy_version`	FPolicy policy configuration version
`uc_template_version`	UC CloudFormation template version
`guardrail_mode`	Active guardrail mode at processing time
`retention_profile`	Retention class for compliance tiering

For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.

6. Protobuf TCP Framing — Adaptive Reader

The problem

Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing read_fpolicy_message() assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.

The solution

An adaptive ProtobufFrameReader that supports three framing modes:

graph TD
    A[Incoming TCP Stream] --> B{FramingMode}
    B -->|AUTO_DETECT| C[Probe first 4 bytes]
    C -->|Valid uint32 length| D[LENGTH_PREFIXED]
    C -->|Otherwise| E[FRAMELESS]
    B -->|LENGTH_PREFIXED| D
    B -->|FRAMELESS| E
    D --> F[4-byte big-endian header → payload]
    E --> G[varint-delimited → payload]
    F --> H[Decoded Message]
    G --> H

Three modes

Mode	Wire Format	Use Case
`LENGTH_PREFIXED`	4-byte big-endian length + payload	XML mode (legacy)
`FRAMELESS`	varint-delimited protobuf	Protobuf mode (ONTAP 9.15.1+)
`AUTO_DETECT`	Probe first bytes, then lock mode	Unknown/mixed environments
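
To make the two wire formats concrete, here is a minimal sketch of both encodings over plain bytes. It uses the standard protobuf base-128 varint encoding; whether ONTAP's protobuf mode actually frames messages this way is still to be confirmed in Phase 13, and all function names here are hypothetical rather than taken from `ProtobufFrameReader`.

```python
from __future__ import annotations

import struct


def encode_length_prefixed(payload: bytes) -> bytes:
    """4-byte big-endian length header followed by the payload."""
    return struct.pack("!I", len(payload)) + payload


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def split_varint_delimited(buf: bytes) -> list[bytes]:
    """Split a buffer of varint-length-delimited messages into payloads."""
    messages = []
    pos = 0
    while pos < len(buf):
        length, pos = decode_varint(buf, pos)
        if pos + length > len(buf):
            raise ValueError("truncated payload")
        messages.append(buf[pos:pos + length])
        pos += length
    return messages
```

A short varint-delimited message read as a 4-byte big-endian integer usually yields a very large number, which is why a maximum message size is a workable signal for telling the two formats apart.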

Auto-detection heuristic

async def _auto_detect_and_read(self) -> bytes | None:
    """Probe first 4 bytes to determine framing mode."""
    peek = await self._reader.readexactly(4)
    candidate_length = struct.unpack("!I", peek)[0]

    if 0 < candidate_length <= self._max_message_size:
        # Valid length header → LENGTH_PREFIXED
        self._detected_mode = FramingMode.LENGTH_PREFIXED
        payload = await self._reader.readexactly(candidate_length)
        return payload
    else:
        # Not a valid length → FRAMELESS (varint-delimited)
        self._detected_mode = FramingMode.FRAMELESS
        self._buffer = peek
        return await self._read_varint_delimited()

Safety features

Max message size enforcement (default 1 MB): Prevents DoS via malformed length headers
FramingError exception: Structured error with offset and raw data for debugging
Graceful EOF handling: Returns None on connection close without raising

Integration with existing FPolicy server

from shared.integrations.protobuf_integration import create_fpolicy_reader, read_fpolicy_message_v2

# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
reader = create_fpolicy_reader(stream)
message = await read_fpolicy_message_v2(reader or stream)

Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.

Phase 13 protobuf validation scope

The following questions will be confirmed with NetApp support during live wire validation:

Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)
Message boundary behavior under high throughput
Keep-alive behavior in protobuf mode vs XML mode
Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?
Mixed-mode migration path (XML → protobuf transition without event loss)
Maximum message size guidance from ONTAP side

7. SLO Definition — 4 Targets with CloudWatch Dashboard

The problem

Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.

The solution

Four SLO targets covering the critical path of the event-driven pipeline:

SLO	Metric	Target	SLO met when
Event Ingestion Latency	`EventIngestionLatency_ms`	P99 < 5,000 ms	LessThanThreshold
Processing Success Rate	`ProcessingSuccessRate_pct`	> 99.5%	GreaterThanThreshold
Reconnect Time	`FPolicyReconnectTime_sec`	< 30 sec	LessThanThreshold
Replay Completion Time	`ReplayCompletionTime_sec`	< 300 sec (5 min)	LessThanThreshold

For success rate, the CloudWatch Alarm fires when the metric drops below 99.5% (ComparisonOperator: LessThanThreshold), even though the SLO target is expressed as "> 99.5%".
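
The "SLO met when" column can be sketched as a comparison helper. This is an illustrative sketch, not the `shared.slo` module: the `Slo` dataclass and `is_met` are hypothetical, and returning `None` for missing data is an assumption about how no-data periods might be represented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    threshold: float
    met_when: str  # "LessThanThreshold" or "GreaterThanThreshold"


# Targets copied from the table above.
SLOS = [
    Slo("EventIngestionLatency_ms", 5000.0, "LessThanThreshold"),
    Slo("ProcessingSuccessRate_pct", 99.5, "GreaterThanThreshold"),
    Slo("FPolicyReconnectTime_sec", 30.0, "LessThanThreshold"),
    Slo("ReplayCompletionTime_sec", 300.0, "LessThanThreshold"),
]


def is_met(slo: Slo, value: float | None) -> bool | None:
    """True/False when a value exists; None when there is no data to judge."""
    if value is None:
        return None
    if slo.met_when == "LessThanThreshold":
        return value < slo.threshold
    if slo.met_when == "GreaterThanThreshold":
        return value > slo.threshold
    raise ValueError(f"unsupported comparison: {slo.met_when}")
```

Keeping "no data" distinct from "violated" avoids paging on a quiet pipeline, at the cost of needing a separate check that metrics are still being emitted.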

CloudWatch Dashboard

The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):

from shared.slo import SLO_TARGETS, evaluate_slos, generate_dashboard_widgets

# Evaluate all SLOs programmatically
results = evaluate_slos(cloudwatch_client)
for r in results:
    status = "MET" if r.met else "VIOLATED"
    print(f"{r.slo_name}: {status} (value={r.value}, threshold={r.threshold})")

# Generate dashboard widget JSON for CloudFormation
widgets = generate_dashboard_widgets(region="ap-northeast-1")

Alarm-based violation detection

Each SLO has a corresponding CloudWatch Alarm:

Alarm Name	State	Evaluation
`fsxn-s3ap-slo-ingestion-latency`	OK	3 consecutive periods
`fsxn-s3ap-slo-success-rate`	OK	3 consecutive periods
`fsxn-s3ap-slo-reconnect-time`	OK	3 consecutive periods
`fsxn-s3ap-slo-replay-completion`	OK	3 consecutive periods

All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.

8. FPolicy Pipeline E2E Verification

The problem

Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.

The verification

sequenceDiagram
    participant NFS as NFS Client (Bastion)
    participant ONTAP as FSx for ONTAP
    participant FP as FPolicy Server (Fargate)
    participant SQS as SQS Queue

    NFS->>ONTAP: echo "test" > /mnt/fpolicy_vol/test.txt
    ONTAP->>FP: NOTI_REQ (FILE_CREATE event)
    FP->>FP: Parse event, extract metadata
    FP->>SQS: SendMessage (JSON payload)
    SQS-->>SQS: Message available for consumers

Timeline (actual observed)

Time	Event	Detail
T+0s	TCP connection test	ONTAP → Fargate IP (10.0.128.98:9898)
T+10s	Session established	NEGO_REQ → NEGO_RESP handshake
T+12s	KEEP_ALIVE starts	2-minute interval
T+30s	NFS file created	`echo "test" > /mnt/fpolicy_vol/test_fpolicy_event.txt`
T+31s	NOTI_REQ received	FPolicy server receives file creation event
T+32s	SQS delivery	Event sent to SQS queue (FPolicy_Q)

SQS message format

{
  "event_type": "FILE_CREATE",
  "svm_name": "FSxN_OnPre",
  "volume_name": "vol1",
  "file_path": "/vol1/test_fpolicy_event.txt",
  "client_ip": "10.0.128.98",
  "timestamp": "2026-05-16T08:45:32Z",
  "session_id": 1,
  "sequence_number": 1
}

IAM issue discovered and fixed

The ECS task role's SQS policy used a Resource ARN pattern arn:aws:sqs:...:fsxn-fpolicy-* that didn't match the actual queue name FPolicy_Q. Fix: use explicit ARN or * wildcard in the template.

Lesson: when a queue name doesn't match the IAM resource pattern in the template, the stack still deploys cleanly and the mismatch only surfaces at runtime. Either parameterize the queue ARN or use a broader resource pattern.

Event contract assumptions

The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:

Duplicate events can occur (especially during Persistent Store replay)
Delivery order is not guaranteed (confirmed in Section 9)
Consumers must be idempotent
file_path + timestamp + sequence_number serves as an idempotency key candidate
Replay events may arrive after newer events
Schema versioning should be introduced before multi-UC production rollout
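
A consumer that follows these assumptions might look like the following minimal sketch. The `idempotency_key` function and `DedupingConsumer` class are hypothetical, and the in-memory set is only for illustration: a real consumer would need a durable store (for example a conditional write to a table) so that deduplication survives restarts.

```python
from __future__ import annotations

import hashlib
import json
from typing import Callable


def idempotency_key(event: dict) -> str:
    """Derive a stable key from file_path + timestamp + sequence_number."""
    parts = [event["file_path"], event["timestamp"], str(event["sequence_number"])]
    # A separator that cannot appear in the fields avoids ambiguous joins.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class DedupingConsumer:
    """Process each distinct event once, even if it is delivered repeatedly."""

    def __init__(self, handle: Callable[[dict], None]) -> None:
        self._handle = handle
        self._seen: set[str] = set()  # in-memory only (illustration)

    def process(self, body: str) -> bool:
        """Return True if the event was handled, False if it was a duplicate."""
        event = json.loads(body)
        key = idempotency_key(event)
        if key in self._seen:
            return False
        self._handle(event)
        # Mark as seen only after the handler succeeds, so failures are retried.
        self._seen.add(key)
        return True
```

Marking the key after the handler runs keeps at-least-once semantics: a crash mid-handler leads to a retry rather than a silently dropped event, which is why the handler itself should also be safe to repeat.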

9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios

The problem

Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.

Important prerequisite: FPolicy Persistent Store is available for asynchronous non-mandatory policies only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have only one Persistent Store, and the same store can be used by multiple policies within that SVM.

The test procedure

Stop Fargate task (ECS stop-task)
Create 5 files via NFS during downtime (replay-test-1.txt through replay-test-5.txt)
Wait for ECS service auto-recovery (new task launch)
Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)
Verify all 5 events arrive in SQS

Results

Metric	Value
Events generated during downtime	5
Events replayed to SQS	5
Lost events	0
Replay delivery order	3, 1, 2, 5, 4 (non-sequential)
Replay completion time	~30 seconds

Key observation: Out-of-order replay

Persistent Store replays events in a non-sequential order — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:

Idempotency: Deduplicate by file path + timestamp
Timestamp-based ordering: Sort by event timestamp, not arrival order
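
Timestamp-based ordering can be sketched as a sort on the event's own fields. In this minimal sketch the helper names are hypothetical, and "out-of-order distance" is defined here, as an assumption, as the largest gap between an event's arrival position and its position in event order.

```python
from __future__ import annotations

from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z".
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def in_event_order(events: list[dict]) -> list[dict]:
    """Sort by event timestamp, using sequence_number to break ties."""
    return sorted(
        events,
        key=lambda e: (parse_ts(e["timestamp"]), e.get("sequence_number", 0)),
    )


def max_out_of_order_distance(events: list[dict]) -> int:
    """Largest |arrival index - ordered index| across all events."""
    ordered = in_event_order(events)
    rank = {id(e): i for i, e in enumerate(ordered)}
    return max((abs(rank[id(e)] - i) for i, e in enumerate(events)), default=0)
```

Applied to an arrival order like the 3, 1, 2, 5, 4 sequence observed in the replay test, the distance under this definition is 2, which gives a first estimate for how large a reordering buffer downstream would need to be.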

20-file burst validation

Additionally, a 20-file burst test confirmed zero event loss under higher load:

Test	Files Created	Events Delivered	Loss
Replay (5 files)	5	5	0
Burst (20 files)	20	20	0

Phase 13 replay storm metrics

The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:

Metric	Purpose
Persistent Store volume usage before/after replay	Capacity planning for the store volume
Events queued vs events replayed	Completeness verification
Replay throughput (events/sec)	Performance baseline
Replay duration	SLO calibration
Out-of-order distance	Downstream buffer sizing
Duplicate events	Idempotency requirement validation
ONTAP EMS logs around disconnect/reconnect	Root cause correlation

Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.

Operational framing: event durability as RPO/RTO

Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while ReplayCompletionTime_sec provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.

Phase 12 validation scope

Scope	Phase 12 Assumption	Production Consideration
SVM	Single SVM validation	Multi-SVM needs per-SVM policy and Persistent Store planning
Volume	Test volume	Production volumes should be grouped by UC/event profile
Protocol	NFS-based E2E test	NFSv3/NFSv4.1/SMB replay validation remains Phase 13
Event types	File create	Modify/delete/rename validation remains Phase 13
FPolicy mode	Async non-mandatory	Required for Persistent Store (NetApp docs)

10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests

The problem

Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need broader, randomized exploration of the input space than hand-written examples provide.

The approach

Using Python's Hypothesis library, we defined 16 properties across the Phase 12 modules:

Property Group	Properties	Tests	Bugs Found
Protobuf Frame Reader	5 (round-trip, max size, EOF, multi-message, auto-detect)	18	1
Capacity Guardrails	4 (mode behavior, rate limit, daily cap, cooldown)	14	1
Data Lineage	3 (record/query round-trip, GSI consistency, TTL)	9	0
SLO Evaluation	2 (threshold comparison, no-data handling)	6	1
Capacity Forecast	2 (regression accuracy, edge cases)	6	0
Total	16	53	3

Bugs discovered

Protobuf reader: AUTO_DETECT mode failed when the first 4 bytes happened to form a valid-looking length that exceeded max_message_size. Fix: treat oversized candidate lengths as FRAMELESS indicator.
Guardrails: BREAK_GLASS mode didn't emit the GuardrailBypass metric when DynamoDB tracking update failed. Fix: move metric emission before the tracking update call.
SLO evaluation: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), max(datapoints, key=lambda dp: dp["Timestamp"]) was non-deterministic. Fix: add secondary sort by value.

Example property test

@given(messages=st.lists(
    st.binary(min_size=1, max_size=1000),
    min_size=1, max_size=10,
))
@settings(max_examples=200)
def test_length_prefixed_round_trip(self, messages: list[bytes]):
    """Property: LENGTH_PREFIXED encode → decode preserves all messages."""
    stream_data = _make_length_prefixed_stream(messages)
    reader = _make_stream_reader(stream_data)
    frame_reader = ProtobufFrameReader(
        reader=reader,
        mode=FramingMode.LENGTH_PREFIXED,
        max_message_size=max(len(m) for m in messages) + 1,
    )

    decoded = []
    for _ in range(len(messages)):
        msg = asyncio.run(frame_reader.read_message())
        assert msg is not None
        decoded.append(msg)

    assert decoded == messages  # Round-trip property

11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints

The critical finding

FSx for ONTAP S3 Access Points are not standard S3 endpoints. They use the FSx data plane, which has different network routing characteristics than standard S3.

In this pattern library, FSx for ONTAP S3 Access Points serve as an AWS service integration boundary: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.

Multi-layer authorization model

graph TD
    Client[S3 API Client] --> IAM{Layer 1: IAM Policy}
    IAM -->|identity-based policy| AP{Layer 2: AP Resource Policy}
    AP -->|resource policy| FS{Layer 3: File System Identity}
    FS -->|UNIX UID or AD user| Volume[ONTAP Volume]

    IAM -.->|❌ Denied| Block1[Access Denied]
    AP -.->|❌ Denied| Block2[Access Denied]
    FS -.->|❌ No permission| Block3[Access Denied]

AWS documents this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.

Correct IAM ARN format

[
  {
    "Effect": "Allow",
    "Action": ["s3:ListBucket"],
    "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap"
  },
  {
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/fsxn-eda-s3ap/object/*"
  }
]

Common mistake: Using the S3AP alias (xxx-ext-s3alias) as a bucket ARN. The alias is only valid as the Bucket parameter in boto3 calls — IAM policies require the full access point ARN.
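
One way to avoid hand-typing these ARNs is to build them from their parts. The helpers below are a hypothetical sketch that follows the ARN shapes shown above; they hard-code the standard `aws` partition, and the account ID in the usage note is a placeholder.

```python
from __future__ import annotations


def access_point_arn(region: str, account_id: str, name: str) -> str:
    """ARN of the access point itself (used for s3:ListBucket)."""
    return f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"


def access_point_object_arn(
    region: str, account_id: str, name: str, key_pattern: str = "*"
) -> str:
    """ARN for objects behind the access point (used for s3:GetObject)."""
    return f"{access_point_arn(region, account_id, name)}/object/{key_pattern}"


def read_only_statements(region: str, account_id: str, name: str) -> list[dict]:
    """The two Allow statements from the example, as Python dicts."""
    return [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": access_point_arn(region, account_id, name),
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": access_point_object_arn(region, account_id, name),
        },
    ]
```

Calling `read_only_statements("ap-northeast-1", "123456789012", "fsxn-eda-s3ap")` yields statements with the same shape as the JSON above, which can be serialized with `json.dumps` into a policy document.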

VPC network constraint (environment-specific observation)

Access Pattern	Observed Result	Notes
VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint)	⚠️ Timeout in this config	Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment
Internet → S3 AP (NetworkOrigin=Internet)	✅	Routes correctly with valid IAM credentials
VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC)	Supported per AWS docs; not verified in Phase 12	Requires VPC-origin AP and matching endpoint policy
VPC Lambda → ONTAP REST API	✅	Direct management LIF access

Important: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS documents that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.

Architectural implication for this pattern: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must either:

Run outside VPC (with Internet access)
Use NAT Gateway for outbound routing
Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions

Write support and practical constraints

FSx ONTAP S3 Access Points support PutObject, DeleteObject, multipart uploads (CreateMultipartUpload, UploadPart, CompleteMultipartUpload), and other write operations — they are not read-only. The access point compatibility table documents the full list of supported S3 API operations.

However, S3 Access Points are not full S3 buckets. Key constraints include:

Maximum upload size: 5 GB
Only FSX_ONTAP storage class
Only SSE-FSX encryption
No ACLs (except bucket-owner-full-control), no Object Versioning, no Object Lock, no presigned URLs

All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.

12. Cross-Project Feedback — Template Hardening

During Phase 12, the companion project fsxn-observability-integrations reviewed our CloudFormation templates and provided actionable feedback. All items were applied:

Security Group: SourceSecurityGroupId over CIDR

Before (broad):

SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: 9898
    ToPort: 9898
    CidrIp: "10.0.0.0/8"

After (precise):

SecurityGroupIngress:
  - IpProtocol: tcp
    FromPort: !Ref FPolicyPort
    ToPort: !Ref FPolicyPort
    SourceSecurityGroupId: !Ref FsxnSvmSecurityGroupId
    Description: FPolicy TCP from FSxN SVM Security Group

This limits inbound traffic to only the FSxN SVM's security group rather than the entire VPC CIDR — a significant security improvement for production deployments.

ONTAP CLI: Deprecated `vserver` prefix

ONTAP 9.11+ deprecates the vserver prefix on FPolicy commands. Updated all templates and documentation (8 languages) to use the recommended format:

# Deprecated (still works for backward compatibility)
vserver fpolicy policy external-engine create -vserver FSxN_OnPre ...

# Recommended (ONTAP 9.11+)
fpolicy policy external-engine create -vserver FSxN_OnPre ...

KMS Decrypt: When it's needed (and when it's not)

Added documentation clarifying SQS encryption behavior:

SqsManagedSseEnabled: true → kms:Decrypt is NOT needed (transparent)
KmsMasterKeyId: alias/aws/sqs → kms:Decrypt IS needed

Our templates use SqsManagedSseEnabled: true, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.

EC2 AMI: Removed redundant Docker install

ECS-optimized AMIs ({{resolve:ssm:/aws/service/ecs/optimized-ami/...}}) already include Docker. Removed the unnecessary yum install -y docker from UserData scripts.

Cpu/Memory: String type is intentional

Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with AllowedValues provides better validation than Number type for this constrained parameter space.
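
One caveat: AllowedValues validates each parameter on its own, so a pairing such as 256 CPU with 4096 memory still passes parameter validation. A cross-check like the following could run in a pre-deploy script. It is a sketch under assumptions: the function name is hypothetical, and the table covers only task sizes up to 4 vCPU (larger sizes are omitted).

```python
# Valid Fargate task sizes: CPU units -> allowed memory values in MiB.
# Only sizes up to 4096 CPU units (4 vCPU) are listed in this sketch.
FARGATE_MEMORY_BY_CPU = {
    "256": [512, 1024, 2048],
    "512": list(range(1024, 4096 + 1, 1024)),
    "1024": list(range(2048, 8192 + 1, 1024)),
    "2048": list(range(4096, 16384 + 1, 1024)),
    "4096": list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """Check a CPU/Memory pair given as strings, as the template passes them."""
    allowed = FARGATE_MEMORY_BY_CPU.get(cpu)
    return allowed is not None and memory.isdigit() and int(memory) in allowed
```

Keeping the values as strings mirrors the String parameter type used in the templates.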

13. What's Next — Phase 13 Outlook

Phase 12 completes the operational hardening layer. The pipeline now has the production hardening baseline:

✅ Capacity guardrails preventing runaway auto-scaling
✅ Automated secrets rotation on 90-day cycle
✅ Proactive capacity forecasting with daily predictions
✅ SLO-based observability with alarm-driven alerting
✅ Data lineage tracking for audit and debugging
✅ Validated zero-event-loss replay under Fargate restarts in tested 5-event and 20-event scenarios
✅ Property-based testing catching real bugs

Ownership boundary

Layer	Primary Owner	Examples
Shared event platform	Platform / storage team	FPolicy server, SQS queue, EventBridge bus, Persistent Store
ONTAP operations	Storage team	SVM, volume, FPolicy policy, Persistent Store capacity
Security operations	Security / platform team	Secrets rotation, BREAK_GLASS approval, IAM policies
Workload UC	Application / data team	Step Functions, UC routing rules, output destinations
Observability	Platform + workload teams	SLO dashboard, UC-specific alarms, runbooks

Production Readiness Matrix

Capability	Phase 12 Status	Remaining Work
Capacity Guardrails	Verified (DRY_RUN/ENFORCE/BREAK_GLASS)	Approval workflow optional
Secrets Rotation	4-step rotation verified	Ensure all clients read from Secrets Manager
SLO Dashboard	Deployed, 4 alarms active	Runbooks and alarm response automation in Phase 13
Persistent Store Replay	5-event + 20-event scenarios verified	1000+ replay storm testing
S3AP Monitoring	ONTAP health path verified	Split S3AP health check (VPC-external)
Protobuf Framing	Property/integration tested	Live ONTAP protobuf wire validation
Multi-account OAM	Stack deployed conditionally	Second-account validation
Production UC E2E	Pipeline verified to SQS delivery	Full TriggerMode=EVENT_DRIVEN UC flow
Cost Dashboard	Not yet deployed	Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation

Phase 13 candidates

Operational readiness:

Canary S3AP check separation: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)
SLO violation runbooks: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)
Replay storm testing: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior

Enterprise deployment:

Multi-account OAM validation: Deploy workload-account-oam-link.yaml in a second AWS account
Shared platform vs workload boundary: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)
Production UC end-to-end: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify the complete flow from NFS file creation through Step Functions execution to output generation

Protocol and cost:

Protobuf live wire validation: Confirm protobuf TCP framing with NetApp support and validate AUTO_DETECT mode against real ONTAP protobuf traffic
Cost optimization dashboard: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics

Decision trees and operational guides:

Decision trees: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)
NetApp Partner Delivery Checklist: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover

Cost model awareness

While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:

Category	Cost Type	Driver
FPolicy server (Fargate/EC2)	Fixed baseline	Always-on listener
NAT Gateway	Fixed + per-GB	Required if VPC Lambda needs Internet-origin S3AP access
CloudWatch Synthetics	Per-canary-run	5-minute interval = 8,640 runs/month
CloudWatch custom metrics + Logs	Per-metric + per-GB ingested	SLO metrics, FPolicy server logs
DynamoDB (lineage + guardrails)	Per-request (PAY_PER_REQUEST)	Event volume dependent
SQS / EventBridge	Per-message / per-event	Event volume dependent
Persistent Store volume	Per-GB provisioned	Sized for max queued events during downtime
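
The Synthetics figure in the table is simple arithmetic, sketched below so that other intervals can be compared. The function name is hypothetical and a 30-day month is assumed.

```python
def canary_runs_per_month(interval_minutes: int, days_in_month: int = 30) -> int:
    """Number of canary runs for a fixed interval, assuming a 30-day month."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * days_in_month


print(canary_runs_per_month(5))  # 8640, matching the table above
```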

Design decision for new deployments: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).

NetworkOrigin decision table

Based on AWS documentation, the following decision criteria apply:

Choose VPC-origin when:

All consumers are Lambda/ECS/EC2 inside the same VPC
Private connectivity is mandatory (no internet-routed path allowed)
VPC endpoint policy is part of the security boundary
Network restriction is built-in (cannot be accidentally misconfigured)

Choose Internet-origin when:

External accounts or on-premises clients need access
Consumers are outside the bound VPC
Internet-routed access with IAM controls is acceptable
Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC

Factor	VPC-origin	Internet-origin
Network enforcement	Built-in explicit Deny for non-VPC traffic	Policy-based only
VPC endpoint required	Yes (Gateway or Interface in bound VPC)	Only if using `aws:SourceVpc` conditions
Multi-VPC access	Via Interface endpoint + peering/TGW to bound VPC	Via policy conditions
Change access scope	Must recreate access point	Update policy
On-premises access	Via Interface endpoint in bound VPC	Direct with IAM credentials
Cost implication	VPC endpoint (Gateway=free, Interface=hourly)	NAT Gateway if VPC Lambda needs access

Critical: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.
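
The criteria above can be condensed into a small helper for design reviews. This is only a sketch of the decision table as described here: the function and parameter names are hypothetical, and the return values mirror the two NetworkOrigin settings.

```python
def choose_network_origin(
    all_consumers_in_bound_vpc: bool,
    private_connectivity_mandatory: bool = False,
    needs_direct_external_access: bool = False,
) -> str:
    """Return "VPC" or "Internet" following the decision criteria in the text."""
    if private_connectivity_mandatory:
        # No internet-routed path is allowed, so off-VPC clients would have
        # to come in through an Interface endpoint in the bound VPC.
        return "VPC"
    if needs_direct_external_access or not all_consumers_in_bound_vpc:
        return "Internet"
    return "VPC"
```

Because the setting is immutable, it is worth running this kind of check against the production consumer list, not only the PoC one.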

Phase 12 readiness by workload type

Workload	Phase 12 Ready?	Notes
Controlled PoC / single-account	✅ Ready	All core components verified
Low/moderate event volume (< 100 events/day)	✅ Ready	20-event burst validated
DRY_RUN guardrail validation	✅ Ready	Safe to deploy immediately
Secrets rotation validation	✅ Ready	4-step rotation verified
High-volume replay storm (1000+ events)	⏳ Phase 13	Throughput curve and store capacity not yet measured
Multi-account production	⏳ Phase 13	OAM link deployed but second-account validation pending
Strict SLO operations requiring runbooks	⏳ Phase 13	Dashboard deployed, runbooks not yet written
Live protobuf production mode	⏳ Phase 13	Wire validation with NetApp support pending
Full EVENT_DRIVEN UC end-to-end	⏳ Phase 13	Pipeline verified to SQS, Step Functions flow pending

Phase 13 runbook scope: first-response diagnostic bundle

For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:

# FPolicy status
fpolicy show -vserver <SVM> -fields policy-name,status
fpolicy policy external-engine show -vserver <SVM>
fpolicy persistent-store show -vserver <SVM>

# Connection and event state
fpolicy show-engine -vserver <SVM>
fpolicy show-passthrough-read-connection -vserver <SVM>

# EMS logs for FPolicy events
event log show -message-name *fpolicy*

Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.

Deployed Infrastructure

7 CloudFormation stacks deployed and verified:

Stack	Status	Purpose
`fsxn-phase12-guardrails-table`	CREATE_COMPLETE	DynamoDB tracking table
`fsxn-phase12-lineage-table`	CREATE_COMPLETE	Data lineage DynamoDB + GSI
`fsxn-phase12-slo-dashboard`	CREATE_COMPLETE	CloudWatch dashboard + 4 alarms
`fsxn-phase12-oam-link`	CREATE_COMPLETE	Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13)
`fsxn-phase12-capacity-forecast`	CREATE_COMPLETE	Lambda + EventBridge schedule
`fsxn-phase12-secrets-rotation`	CREATE_COMPLETE	VPC Lambda + rotation config
`fsxn-phase12-synthetic-monitoring`	CREATE_COMPLETE	Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13

Test Results Summary

Category	Count	Type	Result
Unit Tests	116	Local (CI-reproducible)	✅ All pass
Property Tests (Hypothesis)	53	Local (CI-reproducible)	✅ All pass
CloudFormation Deployments	7 stacks	AWS integration	✅ All CREATE_COMPLETE
Lambda Invocations	2 (forecast + rotation)	AWS integration	✅ Successful
FPolicy E2E	1 pipeline test	AWS manual verification	✅ Event delivered
Replay E2E	5 events	AWS manual verification	✅ Zero loss
20-file burst	20 events	AWS manual verification	✅ Zero loss
Bugs found (property testing)	3	Local (CI-reproducible)	✅ All fixed

NetApp-Specific Takeaways

For NetApp users and partners evaluating this pattern:

FPolicy Persistent Store works as the durability layer for asynchronous non-mandatory FPolicy policies (NetApp docs), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).
S3 Access Points for FSx for ONTAP are not standard S3 buckets: they support selected S3 API operations including write operations (PutObject, DeleteObject, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).
NetworkOrigin is a design-time decision. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.
ONTAP-common vs AWS-specific: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.
Operational readiness requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.

The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.

Conclusion

Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.

The property-based testing investment paid immediate dividends: 3 real bugs discovered in 53 tests that example-based testing missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.

With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests, and validated end-to-end event delivery, Phase 12 delivers the operational maturity required for enterprise production workloads on FSx for ONTAP.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10 · Phase 11

Everything is Under Control

mgbec — Sun, 17 May 2026 16:43:44 +0000

I’m a control enthusiast, not a control freak. And control is part of my job description, so no apologies. As an enterprise, with all the new AI tools entering the atmosphere every day, we want to enable innovation and efficiency. We also need to have governance over these tools and their usage. Organizations want to make sure they minimize any potential risks, and of course, have observability into everything that is happening.

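I wanted to test an AgentCore Gateway workflow with multiple control mechanisms: https://github.com/mgbec/CEDAR-plus-interceptor. There are three pieces I put into play: OAuth 2.1, Cedar policy, and Lambda interceptors (one on the request path and one on the response path).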

OAuth 2.1 (via Cognito) — “Who are you?”

The problem it solves: Identity and authentication. Before the gateway can make any access decisions, it needs to know who’s making the request and verify they’re legitimate.

What it does in this scenario:

- The agent (or user) authenticates against Cognito with their email/password.

- Cognito issues a JWT containing the user’s identity (sub) and group memberships (cognito:groups: ["engineering"]).

- The gateway’s CUSTOM_JWT authorizer validates the token signature, expiry, audience, and issuer against Cognito’s OIDC discovery endpoint.

- If the token is invalid or missing → 401 immediately, nothing else runs.

What it can’t do: It has no opinion on what the authenticated user is allowed to do. A valid token from a marketing user looks the same as one from an admin at this layer — both pass authentication.

One detail here confused me at first. Cognito returns both an ID Token and an Access Token. The ID Token tells the client application who the user is, while the Access Token tells the gateway about the application client and the scopes it has been granted. The Access Token does not authorize the user to do anything beyond reaching the gateway, however. Its scope claim only gets the request past the gateway’s front door. It’s a binary check: “does this token have a valid scope?”

Real-world analogy: The badge reader at the building entrance. It confirms you’re an employee, but doesn’t know which floors you’re allowed on.
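
To make the claims less abstract, here is a minimal sketch of how the payload of a JWT can be read to see the sub and cognito:groups values. It only decodes the token and does not verify the signature, which is the gateway authorizer’s job in this setup. The function names are my own and are not part of the repository.

```python
import base64
import json


def read_unverified_claims(jwt: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url-encoded with the padding stripped.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def groups_from_claims(claims: dict) -> list:
    """Return the Cognito groups in the claims, or an empty list."""
    return list(claims.get("cognito:groups", []))
```

This is handy for debugging which groups a test user actually has, but it should never be used to make an access decision.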

Cedar Policy — “Are you allowed to do this?”

The problem it solves: Authorization. Given a verified identity with known group memberships, should this specific tool invocation be permitted?

What it does in this scenario:

- Reads the cognito:groups claim from the validated JWT to determine the principal

- Evaluates Cedar rules: “Is Group::"engineering" permitted Action::"InvokeTool" on Tool::"DatabaseTools___delete_records"?”

- Returns allow or deny based purely on the static policy set

- The forbid on delete_records for engineers is absolute: no other rule can override it

What it can’t do:

- It can’t count how many times you’ve called a tool today

- It can’t call an external service to check something

- It can’t modify the request or response

- It can’t make decisions based on the request body content (e.g., “only allow SELECT queries, not DELETE queries”)

Real-world analogy: The access control list on each floor. Engineering badges open the lab doors but not the server room. Marketing badges only open the conference rooms.
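
The “forbid is absolute” behaviour is easier to see in a toy model. The sketch below is plain Python, not Cedar and not the AgentCore policy engine. It only imitates two rules of Cedar’s evaluation: nothing is allowed unless a permit matches, and any matching forbid wins. The Rule class and the group/tool matching are simplifications I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    effect: str  # "permit" or "forbid"
    group: str   # group the rule applies to
    tool: str    # exact tool name, or "*" for any tool


def is_allowed(rules: list, groups: list, tool: str) -> bool:
    """Default deny; a matching forbid overrides any matching permit."""
    matching = [r for r in rules if r.group in groups and r.tool in ("*", tool)]
    if any(r.effect == "forbid" for r in matching):
        return False
    return any(r.effect == "permit" for r in matching)


rules = [
    Rule("permit", "engineering", "*"),
    Rule("forbid", "engineering", "DatabaseTools___delete_records"),
]
print(is_allowed(rules, ["engineering"], "DatabaseTools___run_query"))       # True
print(is_allowed(rules, ["engineering"], "DatabaseTools___delete_records"))  # False
```

Note that the decision depends only on the rules, the groups, and the tool name, which is exactly why counting or inspecting the request body is out of reach at this layer.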

Request Interceptor (Rate Limiter Lambda) - “Should we let this through right now?”

https://github.com/mgbec/CEDAR-plus-interceptor/tree/main/lambdas/rate-limiter

The problem it solves: Runtime enforcement that requires state, external lookups, or data transformation — things that can’t be expressed as static allow/deny rules.

What it does in this scenario:

-Runs only after OAuth and Cedar have both passed (no point rate-limiting a request that would be denied anyway)

-Reads the user ID and group from the request context

-Queries DynamoDB: “How many requests has this user made in the current hour?”

-Compares against the role-based quota (admins: 100, engineering: 50, marketing: 20)

-Either passes the request through or returns 429

Real-world analogy: The security guard who checks if the parking lot is full before letting your car in, even though your badge is valid and you’re allowed on that floor.
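
The counting logic can be sketched without any AWS dependencies. In the version below, an in-memory dictionary stands in for the DynamoDB table, so it shows the fixed one-hour window and the role-based quotas but none of the persistence or atomic updates a real Lambda would need. The class name and the injectable clock are my own assumptions; the quotas are the ones listed above.

```python
import time

QUOTAS = {"admins": 100, "engineering": 50, "marketing": 20}


class HourlyRateLimiter:
    """Fixed one-hour window counter; a dict stands in for DynamoDB."""

    def __init__(self, quotas=QUOTAS, clock=time.time):
        self._quotas = dict(quotas)
        self._clock = clock  # injectable so tests can control time
        self._counts = {}

    def check(self, user_id: str, group: str) -> int:
        """Return 200 if the request may pass, 429 if the quota is used up."""
        window = int(self._clock() // 3600)  # index of the current hour
        key = (user_id, window)
        limit = self._quotas.get(group, 0)   # unknown groups get no quota
        used = self._counts.get(key, 0)
        if used >= limit:
            return 429
        self._counts[key] = used + 1
        return 200
```

Because the key includes the hour index, counters from a previous hour simply stop being consulted, which matches the “wait for the next hour” behaviour described later in the troubleshooting notes.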

Response Interceptor (PII Redactor Lambda) - “Is this role allowed to view PII?”

https://github.com/mgbec/CEDAR-plus-interceptor/tree/main/lambdas/pii-redactor

This Lambda reads the user’s Cognito group and determines whether they are allowed to see PII based on that group membership. Mine is a pretty simple PII detector that covers only SSNs, credit card numbers, email addresses, and phone numbers. In production you would want something more robust.

The PII is redacted from responses before they reach the agent, depending on the group they are in.

Static access control like Cedar is not a good fit for this step, because it can’t modify the response. The reverse is also true: you could implement role-based permissions in a Lambda, but it’d be harder to audit, version, and reason about than Cedar policies.

Real-world analogy: On the way out of the building, the guard would check you for contraband items being removed from company premises.
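
Here is a minimal sketch of what a regex-based redactor for those four PII types could look like. It is not the code from the repository: the patterns are deliberately simple (US-style formats only), the placeholder text is made up, and I am assuming that only the admins group may see PII.

```python
import re

# Simple illustrative patterns; real PII detection needs far more care.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

PII_ALLOWED_GROUPS = {"admins"}  # assumption for this sketch


def redact_for_groups(text: str, groups: list) -> str:
    """Return text unchanged for privileged groups, redacted otherwise."""
    if PII_ALLOWED_GROUPS & set(groups):
        return text
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```

Patterns this simple will miss plenty of real-world formats, which is the reason for the “something more robust” advice above.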

Building (and Troubleshooting)

There was quite a bit of troubleshooting involved for me to build this out. I tried both CDK and Terraform. Terraform seemed to work better, but some resources were still problematic. Kiro was incredibly helpful with debugging, and part of the trouble may have been user error. The issues that appeared to be real are:

Rate limit counters persist across tests — DynamoDB counters use a 1-hour window. If you test marketing (limit 20) and then test again in the same hour, the counter is already at 20+ and everything gets blocked immediately. Clear the table between test runs or wait for the next hour.

UpdateGateway replaces everything - The UpdateGateway API is a full replacement, not a patch. If you call it to attach the policy engine but don’t include interceptorConfigurations, the interceptor gets wiped, so every update must pass through ALL existing fields (https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_UpdateGateway.html). This caused my interceptors to disappear multiple times.

Cedar Policy Entity Types - AgentCore::Group doesn’t exist. The valid principal type is AgentCore::OAuthUser. Group membership is checked via tags: principal.hasTag("cognito:groups") && principal.getTag("cognito:groups") like "*engineering*". https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-understanding-cedar.html

Tool-specific policies require the exact gateway ARN. You can’t use “resource is AgentCore::Gateway” for tool-scoped policies — the API rejects it. And when the gateway gets recreated (new ID), all policies become stale and need to be recreated with the new ARN.

Gateway recreation breaks policy references - When Terraform recreates the gateway (e.g., terraform apply -replace), it gets a new ID and ARN. All Cedar policies that reference the old gateway ARN stop matching (default-deny kicks in). You have to delete and recreate the policies with the new ARN.

From my understanding, the gateway ARN coupling is by design (security isolation between gateways). The best practice is to treat the gateway as a long-lived resource and avoid recreating it.

Using a combination of scripts and Terraform seemed to work best for me, as long as I remembered the correct order of operations. The danger zone is when either tool updates the gateway, because it can wipe what the other one set. The safest workflow is:

terraform apply (creates/updates gateway shell)
create-policies.sh (attaches policy engine + interceptor, preserving existing config)
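
The “preserving existing config” part is the important bit, given that UpdateGateway is a full replacement. The idea is read, merge, then write, which can be sketched with plain dictionaries. This is not the script from the repository: the function name is mine, and apart from interceptorConfigurations the field names in the example are hypothetical. A real call would also need to drop any read-only fields returned by the read call, which is not shown here.

```python
import copy


def build_full_update(current: dict, changes: dict) -> dict:
    """Overlay changes on a deep copy of the current gateway configuration.

    The result carries every existing field forward, so sending it as a
    full replacement does not silently drop settings such as interceptors.
    """
    merged = copy.deepcopy(current)
    merged.update(copy.deepcopy(changes))
    return merged


current = {
    "name": "demo-gateway",
    "interceptorConfigurations": [{"interceptor": "rate-limiter"}],
}
changes = {"policyEngine": {"arn": "example-arn"}}  # hypothetical field name
update = build_full_update(current, changes)
print(sorted(update))  # ['interceptorConfigurations', 'name', 'policyEngine']
```

The merge is shallow at the top level on purpose: a key present in changes replaces the whole existing value, so partial edits to a nested list still need to be assembled by the caller.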

Observability (and Troubleshooting)

My first test was a bit of a failure. There is a small amount of observability built into the output of the tests, so we can at least see that things did not go as planned.

However, one of the best things about AgentCore is all of the detailed observability baked into the components.

We can even dig down into the trace level to watch our policies in action.

We can look at the bigger picture of our gateway performance with metrics like denied and allowed policy decisions.

One important thing to note for observability of the PII redactor response interceptor:

The traces and logs capture the response from the Lambda target, which contains the full unredacted PII. The PII redactor runs after that, as the last step before the client receives the response. The observability system records what the Lambda returned, not what the client ultimately saw.

The flow is:

Lambda returns full PII

│

├──→ CloudWatch logs/traces capture THIS (unredacted)

│

▼

PII Redactor intercepts

│

▼

Client receives redacted response

This is actually correct from a security audit perspective — you want the logs to show the full data so that security teams can audit what data was accessed. You can verify the redactor is working by comparing logs versus client response. To quickly see what is returned to the client, you can manually set the token and Gateway URL and then test with curl.

TOKEN=$(./scripts/get-token.sh engineer@example.com 2>/dev/null)
GATEWAY_URL=$(terraform -chdir=terraform output -raw gateway_url)

curl -s -X POST "$GATEWAY_URL" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "DatabaseTools___run_query", "arguments": {"sql": "SELECT * FROM users", "database": "analytics"}}}' \
  | jq -r '.result.content[0].text' | python3 -m json.tool

Next try with an admin user, which should receive unredacted data.

TOKEN=$(./scripts/get-token.sh admin@example.com 2>/dev/null)
GATEWAY_URL=$(terraform -chdir=terraform output -raw gateway_url)

curl -s -X POST "$GATEWAY_URL" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "DatabaseTools___run_query", "arguments": {"sql": "SELECT * FROM users", "database": "analytics"}}}' \
  | jq -r '.result.content[0].text' | python3 -m json.tool

Final Thoughts

So, do I feel like I have things completely under control? Not really, on many levels, but that may be a personal issue. AgentCore Gateway, together with OAuth 2.1, Cedar policies, and Lambda interceptors, helps us with constraints and oversight, and gives us some assistance with governance. Again, as we have heard over and over, this is such a dynamic field. I’m looking forward to the evolution of our GenAI and cybersecurity fields and the technological transformations we will see. Thanks for reading!

Event-Driven Ransomware Detection with ONTAP ARP + Datadog

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 17 May 2026 09:16:54 +0000

TL;DR

ONTAP's Autonomous Ransomware Protection (ARP) detects encryption patterns at the storage layer. When ARP fires, an EMS event is pushed via webhook to API Gateway → Lambda → Datadog. In my validation environment, end-to-end latency was around 30 seconds. This post shows how to wire it up, what the alert looks like, and how to respond.

The Threat Model

Ransomware can encrypt hundreds or thousands of files per minute. Traditional detection (antivirus signatures, host-based EDR) often catches it only after significant damage is done.

What if your storage could detect the encryption pattern before the host-based tools react?

That's exactly what ONTAP Autonomous Ransomware Protection (ARP) does. It runs ML-based entropy analysis at the storage layer, detecting:

Sudden spikes in file entropy (encryption)
Mass file extension changes (.docx → .encrypted)
Abnormal write patterns inconsistent with normal workload behavior

When ARP detects an attack, it changes the volume state to attack-detected and fires an EMS event. Our job is to get that event to the security team in seconds, not hours.

The Detection Pipeline

In Part 2, we built the audit log pipeline and showed Datadog search queries for file access events. Now we turn those patterns into event-driven security alerting — starting with ONTAP's most powerful detection signal: Autonomous Ransomware Protection.

ONTAP ARP detects encryption behavior
    │
    ▼ EMS event: arw.volume.state (severity: alert)
ONTAP EMS Webhook (HTTPS POST)
    │
    ▼
API Gateway (REST endpoint)
    │
    ▼
Lambda (EMS handler)
    │
    ▼ normalize → format → ship
Datadog Logs API v2 (source:fsxn-ems)
    │
    ▼
Datadog Monitor → PagerDuty / Slack / Email

End-to-end latency: around 30 seconds in my validation environment (ap-northeast-1). Your latency will vary depending on ONTAP event delivery, API Gateway/Lambda behavior, Datadog ingest latency, and notification routing.

Compare this to the audit log path (Part 2), which depends on rotation interval + scheduler frequency. EMS webhooks are event-driven rather than scheduled, delivering alerts within seconds rather than minutes.

Deploying the EMS Integration

The EMS Lambda is deployed alongside the FPolicy shipping Lambda in a single stack. Note that the FPolicy TCP listener itself remains a separate ECS Fargate-based path (as described in Part 1) because ONTAP FPolicy requires a persistent TCP connection.

aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
    DatadogSite=ap1.datadoghq.com \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-northeast-1

What Gets Created

Resource	Purpose
EMS Lambda	Receives EMS webhooks, normalizes, ships to Datadog
FPolicy Lambda	Receives FPolicy events from SQS, ships to Datadog
API Gateway (from shared EMS webhook stack)	HTTPS endpoint for ONTAP EMS webhooks
IAM Roles	Least-privilege for each Lambda
CloudWatch Log Groups	Execution logs

Webhook Security

For production, do not expose an unauthenticated webhook endpoint. ONTAP EMS webhook destinations support HTTPS and mutual authentication options. Use HTTPS for the API Gateway endpoint, restrict access where possible, and consider validating a shared secret or header in the Lambda handler.
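
As one way to do the shared-secret check mentioned above, the sketch below compares a request header against an expected value using a constant-time comparison. It rests on assumptions: the header name x-webhook-secret is hypothetical, how ONTAP would be configured to send it is outside this sketch, and in a real deployment the expected value would be loaded from Secrets Manager rather than passed in directly.

```python
import hmac

SECRET_HEADER = "x-webhook-secret"  # hypothetical header name


def is_authorized(event: dict, expected_secret: str) -> bool:
    """Check the shared-secret header of an API Gateway proxy event."""
    headers = event.get("headers") or {}
    # Header names are case-insensitive, so normalize before matching.
    supplied = next(
        (value or "" for name, value in headers.items() if name.lower() == SECRET_HEADER),
        "",
    )
    if not expected_secret:
        return False  # refuse to run with an empty secret configured
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

A handler could call this first and return a 401 or 403 response before parsing the body.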

ONTAP EMS Configuration

After deployment, configure ONTAP EMS to forward ARP-related events to the API Gateway endpoint. At minimum, include arw.volume.state and other arw.* events you want to monitor. Refer to the NetApp EMS webhook documentation for destination and filter configuration.

The EMS Lambda Handler

The handler receives an API Gateway proxy event containing the EMS webhook payload:

import logging
from typing import Any

logger = logging.getLogger(__name__)

def lambda_handler(event: dict, context: Any) -> dict:
    """Process EMS webhook from ONTAP via API Gateway."""
    api_key = get_api_key()
    request_id = _get_request_id(event)

    logger.info("EMS handler invoked: requestId=%s", request_id)

    # Extract EMS events from webhook body
    ems_events = _extract_ems_events(event)
    logger.info("Parsed %d EMS event(s)", len(ems_events))

    # Normalize to common schema
    normalized = _normalize_ems_events(ems_events)

    # Format for Datadog
    dd_logs = _format_for_datadog(normalized)

    # Ship to Datadog
    shipped = _ship_to_datadog(dd_logs, api_key)

    return _api_response(200, {
        "message": "EMS events processed",
        "total_events": len(ems_events),
        "shipped": shipped,
    })

EMS Event Normalization

ONTAP EMS events arrive with fields like messageName, severity, node, svmName, parameters. The handler normalizes them:

import json
from datetime import datetime, timezone

def _normalize_ems_events(events: list[dict]) -> list[dict]:
    """Normalize raw EMS events to internal schema."""
    normalized = []
    for event in events:
        normalized.append({
            "event_name": event.get("messageName", "unknown"),
            "severity": event.get("severity", "info"),
            "source_node": event.get("node", ""),
            "svm": event.get("svmName", ""),
            "message": event.get("message", json.dumps(event)),
            "parameters": event.get("parameters", {}),
            "timestamp": event.get("time", datetime.now(timezone.utc).isoformat()),
        })
    return normalized

Datadog Formatting (source:fsxn-ems)

def _format_for_datadog(events: list[dict]) -> list[dict]:
    """Format normalized EMS events for Datadog Logs API v2."""
    dd_logs = []
    for event in events:
        dd_logs.append({
            "ddsource": "fsxn-ems",
            "ddtags": f"source:fsxn-ems,service:{DD_SERVICE},env:{DD_ENV}",
            "hostname": event["source_node"],
            "service": DD_SERVICE,
            "message": event["message"],
            "date": event["timestamp"],
            "attributes": {
                "event_name": event["event_name"],
                "severity": event["severity"],
                "source_node": event["source_node"],
                "svm": event["svm"],
                "parameters": event["parameters"],
            },
        })
    return dd_logs
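
The shipping step (_ship_to_datadog) is not shown in this post. As a rough sketch of what the request could look like, the function below builds an HTTP POST to the Datadog Logs v2 intake for a given site using only the standard library. The function name is hypothetical, and batching, retries, and compression are left out.

```python
import json
import urllib.request


def build_datadog_request(
    dd_logs: list, api_key: str, site: str = "datadoghq.com"
) -> urllib.request.Request:
    """Build (but do not send) a Logs API v2 request for the given site."""
    url = f"https://http-intake.logs.{site}/api/v2/logs"
    return urllib.request.Request(
        url,
        data=json.dumps(dd_logs).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
```

Sending it is then a call such as urllib.request.urlopen(req, timeout=10). Keeping request construction separate from the network call makes the formatting easy to unit test without a Datadog account.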

ARP Event Payload (Normalized by Lambda)

ONTAP EMS webhooks deliver event notifications to the API Gateway endpoint. The Lambda's _extract_ems_events() function parses the incoming API Gateway proxy event body, then _normalize_ems_events() produces the following internal schema:

{
  "event_name": "arw.volume.state",
  "severity": "alert",
  "source_node": "fsxn-node-01",
  "svm": "svm-prod-01",
  "timestamp": "2026-05-17T01:04:22Z",
  "message": "Anti-ransomware: Volume vol_data state changed to attack-detected",
  "parameters": {
    "volume_name": "vol_data",
    "state": "attack-detected"
  }
}

In Datadog, this arrives as:

source:fsxn-ems
host:fsxn-node-01
service:fsxn-ontap
@attributes.event_name:arw.volume.state
@attributes.severity:alert
@attributes.svm:svm-prod-01
@attributes.parameters.volume_name:vol_data
@attributes.parameters.state:attack-detected

Setting Up the Datadog Monitor

Create a Monitor that triggers on any ARP alert:

Monitor Configuration

Log Explorer search query:

source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected

Datadog Monitor API JSON:

{
  "name": "🚨 FSx for ONTAP: Ransomware Detected (ARP)",
  "type": "log alert",
  "query": "logs(\"source:fsxn-ems @attributes.event_name:arw.volume.state @attributes.parameters.state:attack-detected\").index(\"*\").rollup(\"count\").last(\"5m\") > 0",
  "message": "🚨 ONTAP Autonomous Ransomware Protection detected suspicious activity.\n\n**Volume**: {{attributes.parameters.volume_name}}\n**SVM**: {{attributes.svm}}\n**Node**: {{host}}\n**Time**: {{date}}\n\n## Recommended Actions\n1. Verify the ARP event in ONTAP and Datadog.\n2. Check FPolicy/audit logs for user/client IP correlation.\n3. Follow the approved storage incident response runbook for snapshot, access restriction, or recovery actions.\n\n@pagerduty @slack-security-alerts",
  "options": {
    "thresholds": { "critical": 0 },
    "notify_no_data": false,
    "evaluation_delay": 0
  }
}

What This Monitor Does

Triggers on: Any arw.volume.state event with state:attack-detected
Threshold: Critical when count > 0 in a 5-minute window
Notification: PagerDuty + Slack with volume name, SVM, and response steps
No-data handling: Disabled (absence of ARP events is normal)

Adjust template variables ({{attributes.*}}, {{host}}, {{date}}) based on how your Datadog site renders log attributes in monitor notifications. Test with a simulated event before relying on production alerts.

FPolicy: The Complementary Signal

While ARP detects the encryption pattern, FPolicy provides the file-level detail. Together they answer:

Question	Source
Is ransomware active?	ARP (EMS)
Which files are affected?	FPolicy
Who is doing it?	FPolicy (`user` field)
From where?	FPolicy (`client_ip` field)
What operations?	FPolicy (`operation`: create, write, rename, delete)

FPolicy Event in Datadog

source:fsxn-fpolicy
@attributes.operation:create
@attributes.file_path:/vol/data/finance/confidential_report.xlsx
@attributes.user:suspicious_user@corp.local
@attributes.client_ip:10.0.1.55
@attributes.protocol:cifs

Correlation Query

After an ARP alert, investigate with FPolicy data:

source:fsxn-fpolicy @attributes.svm:svm-prod-01 @attributes.operation:(create OR write OR rename)

This shows all file modifications on the affected SVM, helping identify the responsible user and client.
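
If responders build this query often, a small helper can fill it in from the SVM named in the ARP alert. This is only a sketch: `build_fpolicy_query` is a hypothetical name, and it assumes the attribute names shown in the FPolicy event above.

```python
def build_fpolicy_query(svm: str, operations: list[str]) -> str:
    """Return a Log Explorer query for FPolicy events on one SVM."""
    if not operations:
        raise ValueError("at least one operation is required")
    ops = " OR ".join(operations)
    return f"source:fsxn-fpolicy @attributes.svm:{svm} @attributes.operation:({ops})"


# Example: the SVM value would come from the ARP alert's @attributes.svm field.
print(build_fpolicy_query("svm-prod-01", ["create", "write", "rename"]))
```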

Incident Response Workflow

1. ARP fires → EMS webhook → Datadog alert (around 30 seconds)
     │
2. Responder receives PagerDuty/Slack notification
     │
3. Verify in Datadog and ONTAP:
   - source:fsxn-ems → confirm ARP event details
   - source:fsxn-fpolicy → identify user, IP, affected files
   - ONTAP: security anti-ransomware volume show
     │
4. Correlate and assess:
   - Is this a true positive or legitimate bulk operation?
   - What is the blast radius (volumes, files, users)?
     │
5. Containment (only after verification, per approved runbook):
   - Create snapshot (preserve recovery point)
   - Restrict volume access if confirmed malicious
   - Review ARP suspect list
     │
6. Recovery:
   - Restore from snapshot (pre-attack state)
   - Re-enable access after containment
   - Update audit policies if gaps found

Important: ARP alerts are high-confidence signals, but false positives can occur (e.g., legitimate backup encryption, bulk file operations). Always verify before applying disruptive containment actions such as restricting volume access. Follow your organization's incident response process.

For a more detailed role-based runbook, see the repository's ARP Incident Response Guide.

Beyond ARP: Other EMS Use Cases

The same EMS webhook pipeline handles other critical ONTAP events:

EMS Event	Severity	Use Case
`arw.volume.state`	alert	Ransomware detection
`wafl.quota.softlimit.exceeded`	warning	Capacity planning
`wafl.quota.hardlimit.exceeded`	error	Immediate capacity action
`cf.fsm.takeover`	alert	HA failover notification
`sms.vol.full`	error	Volume full — data at risk
`net.linkDown`	warning	Network connectivity issue

All arrive in Datadog as source:fsxn-ems with the event name in @attributes.event_name, enabling targeted Monitors for each scenario. For the full cross-vendor field mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.
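
To illustrate how one pipeline can feed several targeted Monitors, the sketch below turns the table above into Log Explorer queries grouped by severity. The dictionary mirrors the table; the function names are hypothetical.

```python
# Event name -> severity, copied from the table above.
EMS_EVENTS = {
    "arw.volume.state": "alert",
    "wafl.quota.softlimit.exceeded": "warning",
    "wafl.quota.hardlimit.exceeded": "error",
    "cf.fsm.takeover": "alert",
    "sms.vol.full": "error",
    "net.linkDown": "warning",
}


def ems_query(event_name: str) -> str:
    """Return the Log Explorer query for a single EMS event name."""
    if event_name not in EMS_EVENTS:
        raise KeyError(f"unknown EMS event: {event_name}")
    return f"source:fsxn-ems @attributes.event_name:{event_name}"


def queries_for_severity(severity: str) -> list[str]:
    """Return queries for every listed event with the given severity."""
    return [ems_query(name) for name, sev in EMS_EVENTS.items() if sev == severity]


for query in queries_for_severity("alert"):
    print(query)
```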

Validation Results

This integration was validated end-to-end:

Test	Result	Latency
ARP event → Datadog	✅ Arrived	~30 seconds
Quota exceeded → Datadog	✅ Arrived	~30 seconds
FPolicy file create → Datadog	✅ Arrived (via SQS → Lambda path)	~30 seconds
Lambda error handling	✅ DLQ capture	—
API key from Secrets Manager	✅ Cached	—

Validation performed in ap-northeast-1 with the deployed fsxn-datadog-ems-fpolicy stack.

Design Considerations for Security Teams

Webhook security: Use HTTPS for EMS webhook delivery. Do not expose an unauthenticated API Gateway endpoint in production. Validate a shared secret, header, or mTLS identity where possible.
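
A minimal sketch of the shared-secret check, assuming an API Gateway proxy event and a hypothetical header name (`x-ems-shared-secret`). Whether your EMS webhook destination can send such a header depends on your ONTAP configuration, so treat this as one possible shape rather than the repository's implementation.

```python
import hmac

# Hypothetical header name; use whatever your webhook sender is configured to send.
SECRET_HEADER = "x-ems-shared-secret"


def is_authorized(event: dict, expected_secret: str) -> bool:
    """Return True only if the request carries the expected shared secret."""
    if not expected_secret:
        return False  # never accept requests when no secret is configured
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    provided = headers.get(SECRET_HEADER) or ""
    # compare_digest takes constant time, so timing does not leak the secret.
    return hmac.compare_digest(provided.encode("utf-8"), expected_secret.encode("utf-8"))


def handler(event: dict, context: object, expected_secret: str = "change-me") -> dict:
    """Reject unauthenticated webhook calls before any processing happens."""
    if not is_authorized(event, expected_secret):
        return {"statusCode": 401, "body": "unauthorized"}
    return {"statusCode": 200, "body": "accepted"}
```

In a real deployment the expected secret would be loaded from Secrets Manager rather than passed as a default argument.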

Detection latency: EMS webhooks are event-driven. ARP detection itself depends on ONTAP's ML model — it typically fires within seconds of detecting the pattern, not after a fixed interval. End-to-end latency from ARP detection to Datadog visibility depends on webhook delivery, Lambda processing, and Datadog ingest.

False positives: ARP can trigger on legitimate bulk encryption operations (e.g., backup software encrypting files). Design your response workflow to include a verification step before disruptive actions like restricting volume access.

Coverage: ARP behavior depends on your ONTAP version, volume type, and whether ARP/AI is available. Older NAS FlexVol configurations may start in learning mode before active detection, while newer ONTAP versions (9.16.1+ with ARP/AI) can become active immediately for supported volumes. Always check the output of security anti-ransomware volume show before relying on alerts.

Audit trail: The EMS event in Datadog serves as the detection timestamp for incident timelines. FPolicy events provide the forensic detail. Together they form a complete audit trail from detection to response.

Cost profile: EMS events are usually low-volume and alert-oriented, while FPolicy can be high-volume depending on policy scope. Treat their Datadog ingest and alerting cost profiles separately.

Try It Yourself

If you want the shortest path to a first successful ARP alert test, see the repository's minimum quick start.

The following simulated event exercises the Lambda normalization and Datadog shipping path. Your actual ONTAP EMS webhook payload may differ depending on EMS webhook configuration, so validate with a real EMS event before production use.

# Deploy EMS + FPolicy integration
aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn=<your-secret-arn> \
    DatadogSite=ap1.datadoghq.com \
  --capabilities CAPABILITY_NAMED_IAM

# Create a test event file
cat > arp-test-event.json <<EOF
{
  "body": "{\"messageName\":\"arw.volume.state\",\"severity\":\"alert\",\"node\":\"fsxn-node-01\",\"svmName\":\"svm-prod-01\",\"time\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"message\":\"Anti-ransomware: Volume vol_data state changed to attack-detected\",\"parameters\":{\"volume_name\":\"vol_data\",\"state\":\"attack-detected\"}}",
  "requestContext": {"requestId": "test"}
}
EOF

# Invoke Lambda with the test event
aws lambda invoke \
  --function-name fsxn-datadog-ems-fpolicy-ems \
  --payload file://arp-test-event.json \
  --cli-binary-format raw-in-base64-out \
  --region ap-northeast-1 \
  arp-test-output.json

# Check Datadog: source:fsxn-ems @attributes.event_name:arw.volume.state
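
For orientation, the normalization step that this test event exercises could look roughly like the following. This is a sketch, not the repository's handler: `normalize_ems_event` is a hypothetical name, and the input keys (`messageName`, `severity`, `node`, `svmName`, `parameters`) are taken from the simulated payload above, so a real EMS webhook payload may need different handling.

```python
import json


def normalize_ems_event(event: dict) -> dict:
    """Map an API Gateway proxy event carrying an EMS payload to one Datadog log entry."""
    payload = json.loads(event["body"])
    return {
        "ddsource": "fsxn-ems",
        "service": "fsxn-ontap",
        "hostname": payload.get("node", "unknown"),
        "message": payload.get("message", ""),
        "attributes": {
            "event_name": payload.get("messageName"),
            "severity": payload.get("severity"),
            "svm": payload.get("svmName"),
            "parameters": payload.get("parameters", {}),
        },
    }


sample = {
    "body": json.dumps({
        "messageName": "arw.volume.state",
        "severity": "alert",
        "node": "fsxn-node-01",
        "svmName": "svm-prod-01",
        "message": "Anti-ransomware: Volume vol_data state changed to attack-detected",
        "parameters": {"volume_name": "vol_data", "state": "attack-detected"},
    })
}
print(json.dumps(normalize_ems_event(sample), indent=2))
```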

What's Next

This completes the Datadog series:

Part 1: Architecture and project introduction
Part 2: Audit log pipeline implementation
Part 3: Event-driven ransomware detection (this post)

Coming up next in the series:

Splunk: Replacing EC2 + Universal Forwarder with Lambda + HEC
OpenTelemetry: The vendor-neutral escape hatch
Grafana Cloud: Loki Push API with label cardinality guidance

Each will follow the same pattern: deploy, validate, document the gotchas.

Have questions about ARP detection or the EMS pipeline? Drop a comment below.

Previous: Part 2 — Shipping FSx for ONTAP Logs to Datadog, The Serverless Way

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

Shipping FSx for ONTAP Logs to Datadog — The Serverless Way

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 17 May 2026 09:16:31 +0000

TL;DR

Deploy a CloudFormation stack, configure ONTAP audit logging, and see structured file access events in Datadog Log Explorer within minutes — no EC2, no NFS mounts, no agents. This post walks through the full implementation: CloudFormation template, Lambda handler code, Datadog field mapping, and operational validation.

What We're Building

In Part 1, I introduced the architecture: FSx for ONTAP audit volume → S3 Access Point → EventBridge Scheduler → Lambda → Datadog. Now let's build it.

By the end of this post, you'll have:

A deployed CloudFormation stack with Lambda, Scheduler, DLQ, and alarms
ONTAP audit events flowing into Datadog Log Explorer
Structured attributes (@attributes.svm, @attributes.user, @attributes.operation, @attributes.path, @attributes.client_ip, @attributes.result) ready for search, filtering, and Datadog facet creation
An operational CloudWatch dashboard monitoring pipeline health

Prerequisites

Before deploying, you need:

FSx for ONTAP file system with an SVM configured for audit logging
FSx for ONTAP S3 Access Point attached to the audit volume
Datadog account (free trial works) with an API Key
API Key in Secrets Manager:

aws secretsmanager create-secret \
  --name fsxn-datadog-api-key \
  --secret-string '{"api_key":"<your-dd-api-key>"}' \
  --region ap-northeast-1

ONTAP audit logging enabled:

# Time-based rotation for quick validation
vserver audit create -vserver <svm-name> -destination /audit_log \
  -events file-ops \
  -format evtx \
  -rotate-schedule-minute 0,5,10,15,20,25,30,35,40,45,50,55
vserver audit enable -vserver <svm-name>

For quick validation, use time-based rotation. If you only use -rotate-size, low-volume environments may not produce rotated audit files within the expected validation window. Adjust the -events list based on what you want to audit.

Important: Enabling vserver audit is only one part of file access auditing. Make sure the target SMB folders have SACLs configured, or NFSv4 ACL audit flags are set for NFS workloads. Otherwise, the audit pipeline may be healthy but no file access events will be generated.

For detailed ONTAP-side setup, including audit volume sizing, SACL/NFSv4 ACL examples, and source health checks, see the repository's ONTAP Audit Setup Guide and Operational Guide.

Verify how audit files appear via S3 API (to set AuditLogPrefix correctly):

aws s3api list-objects-v2 \
  --bucket <fsx-s3-access-point-arn-or-alias> \
  --max-keys 10 \
  --region ap-northeast-1

Set AuditLogPrefix to match the key prefix you see. If the access point is attached directly to the audit volume root, this may be empty.

Note: /audit_log is the ONTAP namespace path. The S3 object key prefix can differ depending on the access point attachment, so always verify with list-objects-v2.
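
Once you have the keys from list-objects-v2, a small helper can derive the shared prefix. This is a sketch with a hypothetical function name; whether the template expects a trailing slash in AuditLogPrefix is an assumption to check against the template itself.

```python
import os


def common_key_prefix(keys: list[str]) -> str:
    """Return the shared 'directory' prefix of S3 object keys ('' if they sit at the root)."""
    if not keys:
        return ""
    shared = os.path.commonprefix(keys)
    # Cut back to the last '/' so a partial file name is never treated as a prefix.
    return shared[: shared.rfind("/") + 1]


keys = [
    "audit_log/audit_svm_20260517.evtx",
    "audit_log/audit_svm_20260518.evtx",
]
print(repr(common_key_prefix(keys)))
```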

The CloudFormation Stack

The Datadog integration deploys as a single self-contained stack:

aws cloudformation deploy \
  --template-file integrations/datadog/template.yaml \
  --stack-name fsxn-datadog-integration \
  --parameter-overrides \
    FsxS3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    DatadogApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-datadog-api-key \
    DatadogSite=ap1.datadoghq.com \
    AuditLogPrefix=<prefix-from-list-objects-v2> \
    ScheduleRate="rate(5 minutes)" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-northeast-1

What Gets Created

Resource	Purpose
Lambda Function	Reads audit logs from S3 AP, parses EVTX/XML, ships to Datadog
EventBridge Scheduler	Invokes Lambda every 5 minutes
Scheduler IAM Role	Allows Scheduler to invoke Lambda
Lambda Execution Role	S3 AP read, Secrets Manager read, CloudWatch Logs, DLQ send permissions
Dead Letter Queue (SQS)	Captures failed events for replay
CloudWatch Alarms (3)	Errors, throttles, DLQ depth
CloudWatch Dashboard	Operational health: errors, duration, invocations, DLQ
CloudWatch Log Group	Lambda execution logs (30-day retention)

Key Parameters

Parameter	Required	Description
`FsxS3AccessPointArn`	✅	FSx for ONTAP S3 Access Point ARN
`DatadogApiKeySecretArn`	✅	Secrets Manager ARN for the API key
`DatadogSite`	❌	Datadog site (default: `ap1.datadoghq.com`)
`ScheduleRate`	❌	Processing frequency (default: `rate(5 minutes)`)
`AuditLogPrefix`	❌	Object key prefix as seen via S3 API. Leave empty if audit files appear at the access point root.
`VpcEnabled`	❌	Enable VPC config — requires NAT Gateway

The Lambda Handler

The handler follows a straightforward flow:

Scheduled invocation
  → List objects from FSx for ONTAP S3 AP (via S3 ListObjectsV2)
  → Filter by checkpoint (skip already-processed files)
  → For each new file:
      → Read via S3 GetObject
      → Detect format (EVTX magic bytes or XML declaration)
      → Parse into normalized events
      → Format for Datadog Logs API v2
      → Batch (≤5MB, ≤1000 items per request)
      → Ship with exponential backoff (max 3 attempts)
  → Update checkpoint
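
The format-detection step in this flow can be sketched as follows. The function name is hypothetical; the EVTX signature (`ElfFile` followed by a NUL byte) is the standard Windows event log file header, and the XML check assumes a UTF-8 or ASCII declaration, so UTF-16 encoded XML would need extra handling.

```python
EVTX_MAGIC = b"ElfFile\x00"  # signature at the start of Windows EVTX files
UTF8_BOM = b"\xef\xbb\xbf"


def detect_format(data: bytes) -> str:
    """Classify audit file content as 'evtx', 'xml', or 'unknown'."""
    if data.startswith(EVTX_MAGIC):
        return "evtx"
    head = data[:256]
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


print(detect_format(b"ElfFile\x00" + b"\x00" * 16))
print(detect_format(b'<?xml version="1.0"?><Events/>'))
```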

Datadog API Limits

The Datadog Logs API v2 enforces the following per-request limits (docs):

Maximum payload size (uncompressed): 5MB
Maximum size for a single log: 1MB (larger logs are truncated, not rejected)
Maximum array size: 1000 entries

The shipper batches conservatively below these limits.
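
A minimal sketch of batching under both limits. The post references a `_create_batches` helper without showing it, so the version below is a hypothetical stand-in, and the 4.5MB ceiling is an assumed safety margin rather than a documented value.

```python
import json
from typing import Iterator

MAX_BATCH_BYTES = 4_500_000  # assumed margin below the 5MB payload limit
MAX_BATCH_ITEMS = 1000       # maximum array size per request


def create_batches(logs: list[dict]) -> Iterator[list[dict]]:
    """Yield batches whose JSON-array encoding stays under both limits."""
    batch: list[dict] = []
    batch_bytes = 2  # the enclosing "[" and "]"
    for log in logs:
        # +2 covers the ", " separator that json.dumps puts between array items.
        size = len(json.dumps(log).encode("utf-8")) + 2
        if batch and (len(batch) >= MAX_BATCH_ITEMS or batch_bytes + size > MAX_BATCH_BYTES):
            yield batch
            batch, batch_bytes = [], 2
        batch.append(log)
        batch_bytes += size
    if batch:
        yield batch


small = [{"message": "x"}] * 2500
print([len(b) for b in create_batches(small)])
```

A single log larger than the byte ceiling still goes out alone in its own batch; truncating or dropping oversized logs is left to the caller.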

Core Shipping Logic

def _ship_to_datadog(logs: list[dict], api_key: str) -> int:
    """Ship normalized logs to Datadog Logs Intake API v2.

    If any batch fails after retries, raise an exception so the Lambda
    invocation is treated as failed and the checkpoint is not advanced.
    """
    shipped = 0
    failed_batches = 0

    for batch in _create_batches(logs):
        if _send_batch(batch, api_key):
            shipped += len(batch)
        else:
            failed_batches += 1

    if failed_batches:
        raise RuntimeError(f"{failed_batches} batch(es) failed after retries")

    return shipped

Checkpoint Semantics

The checkpoint is advanced only after all batches for an audit log file are successfully delivered to Datadog. If any batch fails after retries, the Lambda invocation fails (raises an exception) and the checkpoint is not updated.

This makes the pipeline at-least-once: the same audit file may be retried on the next scheduled invocation, so downstream queries should tolerate duplicate events. For production, consider adding a deterministic event ID derived from the audit file key and event record offset to support deduplication where your observability platform supports it.

Because EventBridge Scheduler invokes Lambda asynchronously, a failed invocation (unhandled exception) triggers Lambda's built-in retry behavior (up to 2 retries by default). After all retries are exhausted, the event payload is sent to the configured DLQ.
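
The deterministic event ID suggested above could be derived like this. It is a sketch: the function name, the separator, and the 32-character truncation are illustrative choices, not something the repository defines.

```python
import hashlib


def event_id(file_key: str, record_offset: int) -> str:
    """Derive a stable ID from the audit file key and the record's position in it."""
    raw = f"{file_key}:{record_offset}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


# The same file and offset always produce the same ID, so a retried file
# yields events that a downstream system can recognize as duplicates.
first = event_id("audit_log/audit_svm_20260517.evtx", 42)
retry = event_id("audit_log/audit_svm_20260517.evtx", 42)
print(first == retry)
```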

Retry with Exponential Backoff

def _send_batch(batch: list[dict], api_key: str) -> bool:
    """Send a single batch with retry on 429/5xx, up to MAX_RETRIES attempts."""
    for attempt in range(MAX_RETRIES):
        response = http.request(
            "POST",
            DATADOG_LOGS_URL,
            body=json.dumps(batch).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "DD-API-KEY": api_key,
            },
        )
        if response.status < 300:
            return True
        if response.status == 429 or response.status >= 500:
            if attempt < MAX_RETRIES - 1:  # skip the sleep after the final attempt
                time.sleep(2 ** attempt + random.uniform(0, 1))  # jitter
            continue
        # Client error (4xx) — don't retry
        return False
    return False

The implementation uses exponential backoff with jitter (2^attempt + random) to avoid synchronized retries when multiple Lambda invocations hit vendor-side throttling simultaneously. Note that MAX_RETRIES in the code represents the total number of attempts, not retries after an initial attempt.

API Key Caching

The API key is fetched from Secrets Manager once per Lambda execution context (cold start) and cached in a module-level variable. This avoids per-invocation Secrets Manager calls:

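_api_key_cache: str | None = None

def get_api_key() -> str:
    """Return the Datadog API key, fetching it from Secrets Manager once per execution context."""
    global _api_key_cache
    if _api_key_cache:
        return _api_key_cache
    response = secrets_client.get_secret_value(SecretId=API_KEY_SECRET_ARN)
    raw = response["SecretString"]
    try:
        secret = json.loads(raw)
    except json.JSONDecodeError:
        secret = None  # plain-text secret rather than a JSON document
    if isinstance(secret, dict):
        # Accept either key name; fall back to the raw string if neither is present
        _api_key_cache = secret.get("api_key", secret.get("dd_api_key", raw))
    else:
        _api_key_cache = raw
    return _api_key_cache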

Datadog Field Mapping

Every audit event arrives in Datadog with structured attributes. The Lambda sends these via the Datadog Logs API v2 payload fields (ddsource, hostname, service, message) and custom attributes nested under attributes:

Datadog Log Explorer	Payload Field	ONTAP Source	Example
`source`	`ddsource`	Configured	`fsxn`
`service`	`service`	Configured	`fsxn-ontap`
`host`	`hostname`	SVM name	`svm-prod-01`
`@attributes.svm`	`attributes.svm`	SVMName / Computer	`svm-prod-01`
`@attributes.user`	`attributes.user`	UserName / SubjectUserName	`admin@corp.local`
`@attributes.client_ip`	`attributes.client_ip`	ClientIP / IpAddress	`10.0.1.50`
`@attributes.operation`	`attributes.operation`	Operation / ObjectType	`ReadData`
`@attributes.path`	`attributes.path`	ObjectName	`/vol/data/reports/q4.xlsx`
`@attributes.result`	`attributes.result`	Result / Keywords	`Success`
`@attributes.event_type`	`attributes.event_type`	EventID	`4663`
`@attributes._pipeline.processed_at`	`attributes._pipeline.processed_at`	Lambda timestamp	`2026-05-17T01:30:00Z`
`@attributes._pipeline.source_file`	`attributes._pipeline.source_file`	S3 object key	`audit_log/audit_svm_20260517.evtx`

Set DatadogSite to your Datadog site, such as datadoghq.com (US1), datadoghq.eu (EU1), or ap1.datadoghq.com (AP1/Tokyo). The site determines the API endpoint.

For the full cross-vendor mapping (Datadog, Splunk, Elastic, OpenTelemetry), see the Normalized Event Schema.
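
To make the mapping table concrete, here is a sketch of how one normalized audit event might be turned into a Datadog Logs API v2 entry. The function name, the input dictionary keys, and the `message` format are assumptions for illustration; only the payload field names come from the table above.

```python
from datetime import datetime, timezone
from typing import Optional

AUDIT_FIELDS = ("svm", "user", "client_ip", "operation", "path", "result", "event_type")


def to_datadog_log(event: dict, source_file: str, now: Optional[datetime] = None) -> dict:
    """Build one Datadog log entry from a normalized audit event."""
    processed_at = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    attributes = {field: event.get(field) for field in AUDIT_FIELDS}
    attributes["_pipeline"] = {"processed_at": processed_at, "source_file": source_file}
    return {
        "ddsource": "fsxn",
        "service": "fsxn-ontap",
        "hostname": event.get("svm", "unknown"),  # the SVM name doubles as the host
        # Assumed human-readable summary; the real message format may differ.
        "message": f"{event.get('user')} {event.get('operation')} {event.get('path')}",
        "attributes": attributes,
    }


example = {
    "svm": "svm-prod-01",
    "user": "admin@corp.local",
    "client_ip": "10.0.1.50",
    "operation": "ReadData",
    "path": "/vol/data/reports/q4.xlsx",
    "result": "Success",
    "event_type": "4663",
}
print(to_datadog_log(example, "audit_log/audit_svm_20260517.evtx"))
```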

Datadog Search Queries

# All FSx for ONTAP audit events
source:fsxn

# Failed access attempts
source:fsxn @attributes.result:Failure

# Specific user activity
source:fsxn @attributes.user:"admin@corp.local"

# Delete operations on sensitive paths
source:fsxn @attributes.operation:delete @attributes.path:"/vol/data/confidential/*"

# Pipeline processing metadata
source:fsxn @attributes._pipeline.source_file:*

In Part 3, we'll turn these queries into Datadog Monitors for ARP ransomware detection and suspicious file activity alerting.

Investigation Query Starters

When investigating an incident, start with these patterns:

Question	Search query	Then group by
What did this user do?	`source:fsxn @attributes.user:"suspect@corp.local"`	`@attributes.operation` or `@attributes.path`
Who accessed this file?	`source:fsxn @attributes.path:"/vol/data/secret.pdf"`	`@attributes.user`
Which clients generated failures?	`source:fsxn @attributes.result:Failure`	`@attributes.client_ip`
Where are deletes concentrated?	`source:fsxn @attributes.operation:delete`	`@attributes.path` or a path prefix
What happened on this SVM in the last hour?	`source:fsxn @attributes.svm:svm-prod-01`	`@attributes.operation`

For high-volume environments, avoid grouping by full file path unless needed. Consider deriving a lower-cardinality field such as a path prefix or data area classification.
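
One way to derive such a lower-cardinality field is to keep only the first few path components. This is a sketch with a hypothetical helper name and an arbitrary default depth.

```python
def path_prefix(path: str, depth: int = 3) -> str:
    """Keep only the first `depth` components of an absolute path."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])


print(path_prefix("/vol/data/reports/q4.xlsx"))           # /vol/data/reports
print(path_prefix("/vol/data/reports/q4.xlsx", depth=2))  # /vol/data
```

Paths shorter than `depth` are returned whole, file name included, so choose a depth that matches how your volumes are laid out.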

Operational Validation

Quick Validation (5–10 minutes)

With a 5-minute audit rotation and 5-minute Scheduler interval, the first events typically appear within a few minutes, but allow up to 10 minutes depending on timing.

Before waiting for logs, generate a test file operation on the audited SMB/NFS share — such as creating and deleting a small test file — to ensure ONTAP produces an audit event.

# 0. Get stack outputs (log group name, DLQ URL, etc.)
aws cloudformation describe-stacks \
  --stack-name fsxn-datadog-integration \
  --query 'Stacks[0].Outputs' \
  --region ap-northeast-1

# 1. Confirm Scheduler is invoking Lambda
aws logs filter-log-events \
  --log-group-name <LambdaLogGroupName from outputs> \
  --start-time $(python3 -c "import time; print(int((time.time()-300)*1000))") \
  --region ap-northeast-1

# 2. Confirm DLQ is empty
aws sqs get-queue-attributes \
  --queue-url <dlq-url> \
  --attribute-names All \
  --query 'Attributes.ApproximateNumberOfMessages'

# 3. Search in Datadog
#    source:fsxn

CloudWatch Dashboard

The stack includes a pre-built dashboard (fsxn-datadog-integration-health) with:

Lambda Errors & Throttles
Lambda Duration (avg/max)
Lambda Invocations
DLQ Depth

For production, consider publishing custom metrics such as files processed, events shipped, batch failures, and checkpoint lag to gain deeper pipeline observability beyond Lambda-level metrics.

What to Watch For

Symptom	Likely Cause	Fix
No logs in Datadog	Scheduler not running, or no new audit files	Check CloudWatch Logs for Lambda invocations
Logs arrive but fields are empty	EVTX/XML parsing issue	Check `@attributes.event_type` — if "unknown", parser needs tuning
DLQ messages appearing	Datadog API rejection	Check API key validity, site configuration, timestamp age
Lambda timeout	S3 AP unreachable from a VPC that has only an S3 Gateway Endpoint	Add a NAT Gateway or deploy Lambda outside the VPC

Troubleshooting

Old Timestamps May Not Appear in Log Explorer

The Datadog Logs API accepts log events with timestamps up to 18 hours in the past. If your audit files are rotated or processed too late, older events may not appear as expected in Log Explorer.

Fix: Use a time-based ONTAP audit rotation schedule and a Scheduler frequency that keeps processing well within the 18-hour window.
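
A small guard can flag events that are already too old before they are shipped. This is a sketch: the function name is hypothetical, and the 18-hour figure is the limit stated above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_EVENT_AGE = timedelta(hours=18)


def is_within_intake_window(event_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if a timezone-aware event timestamp is no older than 18 hours."""
    now = now or datetime.now(timezone.utc)
    return now - event_time <= MAX_EVENT_AGE


reference = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)
print(is_within_intake_window(reference - timedelta(hours=2), now=reference))   # True
print(is_within_intake_window(reference - timedelta(hours=19), now=reference))  # False
```

Logging a warning or emitting a metric for events that fail this check makes late processing visible instead of silently losing data in Log Explorer.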

Gzip Compression Issue (AP1 Site)

During E2E validation, gzip-compressed payloads were accepted (HTTP 202) but not indexed on the AP1 site. The ENABLE_GZIP parameter defaults to false for this reason.

S3 Access Point Timeout in VPC

If Lambda runs in a VPC that has only an S3 Gateway Endpoint, reads from FSx for ONTAP S3 Access Points will time out. Add a NAT Gateway or deploy Lambda outside the VPC.

Day-2 Operations

DLQ Replay

This stack uses an SQS queue as the Lambda asynchronous invocation DLQ. Because the DLQ is attached to Lambda (not an SQS source queue), sqs start-message-move-task cannot redrive messages automatically.

For replay, inspect the DLQ message, identify the failed invocation payload, and re-invoke Lambda manually:

# Inspect failed messages
aws sqs receive-message \
  --queue-url <dlq-url> \
  --max-number-of-messages 1 \
  --attribute-names All \
  --message-attribute-names All

After fixing the root cause (e.g., expired API key, Datadog site misconfiguration), re-run the scheduled processor:

aws lambda invoke \
  --function-name <lambda-function-name> \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  --region ap-northeast-1 \
  replay-output.json

In this pattern, replay usually means re-running the scheduled processor after fixing the root cause. Because the checkpoint is not advanced on failed delivery, the same audit file remains eligible for processing on the next invocation. This does not re-submit the DLQ message itself — it re-runs the processor so files whose checkpoints were not advanced can be picked up again.

For production, consider adding a dedicated replay Lambda that reads DLQ messages, validates the payload, and re-submits failed processing requests in a controlled way.

Checkpoint Reset (Reprocess All Files)

⚠️ Warning: Resetting the checkpoint causes previously processed audit files to be eligible for reprocessing. This can generate duplicate logs in Datadog. Use only for controlled replay or testing.

aws dynamodb delete-item \
  --table-name fsxn-observability-audit-checkpoint \
  --key '{"svm_name": {"S": "svm-prod-01"}, "file_key": {"S": "LATEST"}}'

Teardown

aws cloudformation delete-stack \
  --stack-name fsxn-datadog-integration \
  --region ap-northeast-1

Deleting the stack does not affect ONTAP audit logging or data on the FSx for ONTAP volume.

Cost Estimate

For a typical deployment (1 SVM, 100MB audit logs/day, 5-minute schedule):

Component	Monthly Cost
Lambda (288 invocations/day × 5s avg)	~$0.50
EventBridge Scheduler	~$0.01
DynamoDB (checkpoint)	~$0.01
Secrets Manager	~$0.40
CloudWatch Logs (30-day)	~$1.00
NAT Gateway (if VPC)	Region-dependent hourly + per-GB
Total (no VPC)	~$2/month
Total (with VPC/NAT)	~$30–50+/month depending on Region

Cost numbers are illustrative. They assume a 5-minute schedule, a 5-second average runtime, and 100MB/day of audit logs. NAT Gateway pricing is regional and includes hourly charges plus per-GB data processing charges. Check the AWS Pricing Calculator for your target Region.
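
The arithmetic behind the no-VPC total can be checked with a few lines. The per-component figures are copied from the table above, and the 30-day month is an assumption.

```python
INVOCATIONS_PER_DAY = 24 * 60 // 5  # one run every 5 minutes -> 288
DAYS_PER_MONTH = 30                 # assumed month length
AVG_RUNTIME_SECONDS = 5

compute_seconds = INVOCATIONS_PER_DAY * DAYS_PER_MONTH * AVG_RUNTIME_SECONDS  # 43,200

# Approximate monthly costs in USD, taken from the table (no VPC).
monthly_costs = {
    "Lambda": 0.50,
    "EventBridge Scheduler": 0.01,
    "DynamoDB (checkpoint)": 0.01,
    "Secrets Manager": 0.40,
    "CloudWatch Logs": 1.00,
}
total = round(sum(monthly_costs.values()), 2)

print(f"{INVOCATIONS_PER_DAY} invocations/day, {compute_seconds} compute seconds/month")
print(f"Estimated total: ${total:.2f}/month")  # $1.92, which the table rounds to ~$2
```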

Important: Datadog ingest and retention costs are not included in this AWS-side estimate and can become the dominant cost driver for high-volume audit policies, especially when read auditing is enabled.

Evidence retention: This pipeline optimizes search and alerting via normalized events in Datadog. If you need audit evidence retention for compliance, design raw EVTX/XML retention separately on the audit volume or in an archive path.

Cost control: For high-volume environments, consider a tiered strategy: send security-relevant operations such as deletes, permission changes, and failed access to indexed logs; reduce, archive, or exclude noisy read events only if your audit and compliance requirements allow it.

Compare this to an always-on EC2 collector instance, plus EBS, patching labor, and agent licensing.

What's Next

In Part 3, we'll add event-driven security alerting:

ONTAP Autonomous Ransomware Protection (ARP) detection
EMS webhook → API Gateway → Lambda → Datadog
Datadog Monitor configuration for instant alerts
Incident response workflow

Datadog is the first E2E-verified integration in this pattern library; the same structure will be used for the remaining vendor integrations as they are validated.

Questions about the Datadog integration? Drop a comment below.

Previous: Part 1 — Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
Next: Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog
