<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guatu</title>
    <description>The latest articles on DEV Community by Guatu (@futhgar).</description>
    <link>https://dev.to/futhgar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847021%2F5aa46faa-d8e6-4023-ad78-5a335f875d69.png</url>
      <title>DEV Community: Guatu</title>
      <link>https://dev.to/futhgar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/futhgar"/>
    <language>en</language>
    <item>
      <title>Proxmox Cluster Quorum: How Many Nodes Do You Actually Need</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Mon, 18 May 2026 16:15:57 +0000</pubDate>
      <link>https://dev.to/futhgar/proxmox-cluster-quorum-how-many-nodes-do-you-actually-need-3jpc</link>
      <guid>https://dev.to/futhgar/proxmox-cluster-quorum-how-many-nodes-do-you-actually-need-3jpc</guid>
      <description>&lt;p&gt;I woke up to a cluster that had effectively turned itself into a read-only museum. My VMs were running, but I couldn't start a new one, I couldn't migrate a workload, and the Proxmox GUI was throwing "Cluster not ready - no quorum" errors across the board. I had a two-node setup, one node had rebooted for a kernel update, and the remaining node decided that since it didn't have a majority, it no longer had the right to make decisions.&lt;/p&gt;

&lt;p&gt;If you're building a Proxmox cluster, quorum is the one concept that will either be completely invisible or the primary reason your entire infrastructure freezes. Most people treat it as a checkbox during the cluster creation wizard, but in a home lab, the math of quorum often clashes with the reality of how many physical servers you can actually fit in your rack.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I tried first
&lt;/h3&gt;

&lt;p&gt;My initial instinct was that "Cluster" simply meant "nodes that can talk to each other." I assumed that as long as one node was alive, the cluster was alive. I set up two beefy nodes, linked them together, and felt confident. &lt;/p&gt;

&lt;p&gt;Then I hit the "split-brain" wall. In a two-node cluster, the quorum requirement is &lt;code&gt;(n/2) + 1&lt;/code&gt;. For two nodes, that means you need two votes to have a majority. If one node goes down, the remaining node has one vote. One is not greater than one. The remaining node loses quorum and enters a protective state. It stops allowing configuration changes to prevent a scenario where both nodes think they are the master and start writing conflicting data to shared storage, which is a great way to corrupt your VM disks.&lt;/p&gt;

&lt;p&gt;I tried to "fix" this by manually forcing quorum on the surviving node using &lt;code&gt;pvecm expected 1&lt;/code&gt;. It worked for a few minutes, but it's a manual band-aid. Every time a node rebooted or a network cable acted up, I was back in the CLI fighting with the cluster manager. I realized I was fighting the fundamental design of Corosync, and the only way out was to change the voting math.&lt;/p&gt;

&lt;h3&gt;
  
  
  The actual solution
&lt;/h3&gt;

&lt;p&gt;You have three real options depending on your hardware budget and your tolerance for manual intervention.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: The Three-Node Standard
&lt;/h4&gt;

&lt;p&gt;The cleanest way to solve quorum is to just add a third node. With three nodes, quorum is two votes. If one node dies, two remain. You still have a majority, and HA (High Availability) actually works as intended.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 2: The QDevice (The "Cheap" Vote)
&lt;/h4&gt;

&lt;p&gt;If you can't justify a third full-sized server, you use a Quorum Device (QDevice). A QDevice is a lightweight external voter. It doesn't run VMs; it just tells the cluster "Yes, I see Node A." You can run this on a Raspberry Pi, a tiny VM on a separate host, or even a cheap VPS.&lt;/p&gt;

&lt;p&gt;To set up a QDevice on a separate Debian/Ubuntu machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the QDevice server (the voter)&lt;/span&gt;
apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install &lt;/span&gt;corosync-qnetd

&lt;span class="c"&gt;# On all Proxmox nodes&lt;/span&gt;
apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install &lt;/span&gt;corosync-qdevice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the software is installed, you initialize the device from one of the Proxmox nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run this on one PVE node&lt;/span&gt;
pvecm qdevice setup &amp;lt;IP-OF-QDEVICE-SERVER&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This adds a third vote to the cluster without requiring a third Proxmox node. Now, if one PVE node fails, the other PVE node and the QDevice provide the two votes needed to maintain quorum.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 3: Monitoring and API Integration
&lt;/h4&gt;

&lt;p&gt;If you're running a larger setup, you shouldn't be checking quorum by clicking through the GUI. I integrated &lt;code&gt;pve_exporter&lt;/code&gt; with Prometheus to get alerts the second a node loses its vote.&lt;/p&gt;

&lt;p&gt;Since I'm using token-based authentication to avoid the security risks of root passwords in plain text (see my post on &lt;a href="https://guatulabs.dev/posts/proxmox-api-tokens-bash-history-expansion-and-the-character/" rel="noopener noreferrer"&gt;Proxmox API Tokens&lt;/a&gt;), the setup looks like this.&lt;/p&gt;

&lt;p&gt;First, create a restricted user for the exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create user with PVEAuditor role&lt;/span&gt;
pveum user add prometheus@pve &lt;span class="nt"&gt;--realm&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="nt"&gt;--password&lt;/span&gt; sEcr3T! &lt;span class="nt"&gt;--groups&lt;/span&gt; PVEAuditors

&lt;span class="c"&gt;# Create API token for prometheus@pve&lt;/span&gt;
pveum token add prometheus@pve prometheus &lt;span class="nt"&gt;--privsep&lt;/span&gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, configure the &lt;code&gt;pve_exporter&lt;/code&gt; YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;token_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
  &lt;span class="na"&gt;token_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus@pve!prometheus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the Prometheus scrape config to target the nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;proxmox'&lt;/span&gt;
  &lt;span class="na"&gt;metrics_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/pve&lt;/span&gt;
  &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
  &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__address__&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;^(10\.0\.0\.\d+)$'&lt;/span&gt;
      &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;__param_target&lt;/span&gt;
      &lt;span class="na"&gt;replacement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$1&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10.0.0.x:9221'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works
&lt;/h3&gt;

&lt;p&gt;Proxmox uses Corosync for cluster membership and quorum. Corosync is designed for absolute consistency over availability (the "C" in the CAP theorem). It assumes that if you can't reach a majority of your peers, you are the one who is isolated, not them.&lt;/p&gt;

&lt;p&gt;In a two-node cluster, there is no way to distinguish between "Node B is dead" and "The network cable between Node A and Node B is unplugged." If Node A decided to stay "active" while Node B also stayed "active," and both tried to modify the same shared storage (like a Ceph pool or an NFS share), you'd end up with a corrupted filesystem.&lt;/p&gt;

&lt;p&gt;By adding a third vote (either a node or a QDevice), you break the tie. The node that can still talk to the QDevice knows it is part of the majority. The node that is isolated knows it's alone and gracefully steps back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons learned
&lt;/h3&gt;

&lt;p&gt;The biggest lesson here is that High Availability (HA) is a lie if you don't have a proper quorum strategy. I spent a week thinking I had "HA" because I had two nodes and shared storage. In reality, I had a system that would freeze the moment I tried to update a BIOS or swap a NIC.&lt;/p&gt;

&lt;p&gt;If you're running a two-node cluster, do not rely on &lt;code&gt;pvecm expected 1&lt;/code&gt;. It's a temporary fix for recovery, not a configuration. Get a QDevice. Even a $35 Raspberry Pi is better than a cluster that goes read-only during a midnight update.&lt;/p&gt;

&lt;p&gt;I also found that hardware stability plays a huge role in quorum health. If you're seeing random "Node lost" messages in your logs but the server is still pingable, check your kernel settings. I've dealt with &lt;a href="https://guatulabs.dev/posts/amd-ryzen-c-state-freezes-the-processor-max-cstate-1-fix/" rel="noopener noreferrer"&gt;AMD Ryzen C-State freezes&lt;/a&gt; that looked like network failures but were actually the CPU dropping into a sleep state so deep the NIC stopped responding for a few milliseconds, triggering a Corosync timeout.&lt;/p&gt;

&lt;p&gt;A few final caveats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;QDevice Placement&lt;/strong&gt;: Don't run your QDevice as a VM on the same cluster it's voting for. That's circular logic. If the cluster loses quorum and the VM stops, the QDevice disappears, and you're stuck. Put it on a separate physical box or a different hypervisor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Latency&lt;/strong&gt;: Corosync is extremely sensitive to latency. If you're putting your QDevice in the cloud or on a slow Wi-Fi link, you'll see "flapping" where the cluster constantly gains and loses quorum. Use a wired connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Expected" Trap&lt;/strong&gt;: When you manually change &lt;code&gt;pvecm expected&lt;/code&gt;, you are telling the cluster to ignore the safety rules. Only do this when you are performing maintenance on a known-down node and need to regain control of the surviving one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're scaling this into a production-grade environment, this is where the gap between a "homelab" and "infrastructure" becomes clear. For those needing professional help architecting these systems for zero-downtime, I provide &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;infrastructure consulting&lt;/a&gt; to handle the messy parts of bare-metal orchestration.&lt;/p&gt;

</description>
      <category>proxmox</category>
      <category>quorum</category>
      <category>highavailability</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Kyverno Admission Controllers: Policy-as-Code That Actually Works</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Mon, 18 May 2026 02:15:57 +0000</pubDate>
      <link>https://dev.to/futhgar/kyverno-admission-controllers-policy-as-code-that-actually-works-1l9d</link>
      <guid>https://dev.to/futhgar/kyverno-admission-controllers-policy-as-code-that-actually-works-1l9d</guid>
      <description>&lt;p&gt;I spent an entire Saturday afternoon debugging why my CloudNativePG (CNPG) database cluster refused to initialize, only to find out my own security policies were killing the initdb jobs. I had a "require-resource-limits" policy active across the cluster. It sounded like a great idea: no pod enters the cluster without explicit CPU and memory limits. The documentation makes this look like a five-minute win for cluster stability.&lt;/p&gt;

&lt;p&gt;What the docs don't tell you is that many Kubernetes Operators, including CNPG, spawn temporary Jobs or Pods that don't always inherit the limits you've defined in the primary custom resource. The admission controller saw a pod without limits, deemed it "illegal," and blocked it. The operator just kept retrying, and I kept wondering why my database was stuck in a pending state with no obvious error in the operator logs.&lt;/p&gt;

&lt;p&gt;This is the gap between "Policy-as-Code" as a concept and Policy-as-Code in a real production environment. If you've ever tried to enforce standards across a multi-node cluster, you've probably looked at OPA Gatekeeper or Kyverno. I've used both. One requires you to learn a specialized language (Rego) that feels like a full-time job, and the other uses YAML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you'd choose a Policy Engine
&lt;/h2&gt;

&lt;p&gt;You reach this decision point when your cluster grows beyond a few hand-rolled manifests. Once you're using &lt;a href="https://guatulabs.dev/posts/gitops-for-homelabs-argocd-app-of-apps/" rel="noopener noreferrer"&gt;ArgoCD to scale your apps&lt;/a&gt;, you stop caring about individual pods and start caring about invariants.&lt;/p&gt;

&lt;p&gt;These invariants usually fall into a few buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No one runs a container as root.&lt;/li&gt;
&lt;li&gt;Every deployment has a specific set of labels for monitoring.&lt;/li&gt;
&lt;li&gt;Resource limits are enforced so one runaway AI agent doesn't starve the rest of the node.&lt;/li&gt;
&lt;li&gt;Sidecars are automatically injected without manually editing every deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can do some of this with Pod Security Admissions (PSA), but PSA is a blunt instrument. It's a "yes or no" switch. A real admission controller allows you to mutate the request on the fly. If a developer forgets a security context, the controller doesn't just reject the pod; it injects the correct one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option A: OPA Gatekeeper
&lt;/h2&gt;

&lt;p&gt;Gatekeeper is the industry standard for large-scale enterprises. It's built on Open Policy Agent (OPA), and its primary strength is its absolute precision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;br&gt;
The logic is decoupled from the Kubernetes API. Because it uses Rego, you can write incredibly complex queries. If you need a policy that says "Allow this pod only if the user is in the 'dev' group AND the time is between 9 AM and 5 PM AND the image comes from a specific signed registry," Gatekeeper can do it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;br&gt;
The learning curve is a cliff. Rego is a declarative query language, and if you've never used it, you'll spend more time fighting the syntax than actually securing your cluster. Debugging a failing Rego policy is a nightmare because the error messages are often opaque.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it shines&lt;/strong&gt;&lt;br&gt;
Gatekeeper is for environments where compliance is a legal requirement. If you're in a highly regulated industry where you need a mathematical proof of your security posture, the overhead of Rego is worth it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Option B: Kyverno
&lt;/h2&gt;

&lt;p&gt;Kyverno is the choice for those of us who just want things to work without learning a new language. It uses YAML for everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;br&gt;
It's native to Kubernetes. If you can write a Pod manifest, you can write a Kyverno policy. It handles mutation, validation, and generation. The "generation" part is a killer feature: you can tell Kyverno that whenever a new namespace is created, it should automatically generate a NetworkPolicy and a LimitRange for that namespace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;br&gt;
YAML has limits. While Kyverno is powerful, it can't match the raw computational logic of Rego for extremely complex edge cases. It's also easier to accidentally create "mutation loops" where a policy changes a resource, which triggers the policy again, ad infinitum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it shines&lt;/strong&gt;&lt;br&gt;
It's perfect for the GitOps-driven homelab or mid-sized production environment. It integrates cleanly with &lt;a href="https://guatulabs.dev/posts/kubernetes-manifest-validation-catching-errors-before-merge/" rel="noopener noreferrer"&gt;manifest validation pipelines&lt;/a&gt; and doesn't require a dedicated "Policy Engineer" to maintain.&lt;/p&gt;
&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;OPA Gatekeeper&lt;/th&gt;
&lt;th&gt;Kyverno&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rego (Specialized)&lt;/td&gt;
&lt;td&gt;YAML (K8s Native)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning Curve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Steep&lt;/td&gt;
&lt;td&gt;Shallow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mutation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Possible, but complex&lt;/td&gt;
&lt;td&gt;First-class citizen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extremely high&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ConstraintTemplates&lt;/td&gt;
&lt;td&gt;ClusterPolicies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ideal User&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compliance/Security Teams&lt;/td&gt;
&lt;td&gt;DevOps/Platform Engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  My Pick and Why
&lt;/h2&gt;

&lt;p&gt;I use Kyverno. I've tried the "right way" with OPA, but in a lean environment, the cognitive load of Rego is a liability. I'd rather spend my time optimizing my &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;AI agent orchestration&lt;/a&gt; than debugging a query language.&lt;/p&gt;

&lt;p&gt;However, using Kyverno without a strategy is a fast track to a broken cluster. To make it actually work, you have to move away from the "happy path" and account for infrastructure overhead.&lt;/p&gt;
&lt;h3&gt;
  
  
  The "Infrastructure Exclusion" Pattern
&lt;/h3&gt;

&lt;p&gt;The biggest mistake I made early on was applying policies globally. I had a policy that required all pods to have a specific security context. Suddenly, my Traefik ingress and ArgoCD controllers started crashing because they needed specific capabilities (like &lt;code&gt;NET_ADMIN&lt;/code&gt;) that my policy explicitly forbade.&lt;/p&gt;

&lt;p&gt;The fix is to implement a strict exclusion list. You cannot treat your infrastructure components the same way you treat your application workloads. I now use a combination of namespace exclusions and label-based filters to ensure that the "plumbing" of the cluster stays functional.&lt;/p&gt;

&lt;p&gt;Here is how I handled the CNPG issue. Instead of a blanket "require limits" policy that blocks everything, I added an exclusion for any resource tagged by the CNPG operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-resource-limits&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-resource-limits&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
      &lt;span class="na"&gt;generate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LimitRange&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-limit-range&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$(metadata.namespace)&lt;/span&gt;
        &lt;span class="na"&gt;applyTo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container&lt;/span&gt;
              &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cnpg.io/cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy ensures that most pods get a default limit range, but it stays out of the way of the database operator's internal jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Security Contexts without Breaking the Cluster
&lt;/h3&gt;

&lt;p&gt;Another common pitfall is forcing security contexts on pods that actually need to run as root to perform system-level tasks. I've seen this happen with storage drivers and network plugins.&lt;/p&gt;

&lt;p&gt;I prefer a "mutate-then-validate" approach. I use Kyverno to inject a sane default security context for everything, and then I create a small set of exceptions for the system namespaces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-security-context&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;set-default-security-context&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
      &lt;span class="c1"&gt;# I use mutate here instead of generate to ensure the pod &lt;/span&gt;
      &lt;span class="c1"&gt;# spec is modified before it hits the scheduler&lt;/span&gt;
      &lt;span class="na"&gt;mutate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;patchStrategicMerge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;runAsUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
              &lt;span class="na"&gt;runAsGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
              &lt;span class="na"&gt;fsGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
              &lt;span class="na"&gt;supplementalGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;2001&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you apply this globally, you'll likely break your CNI or your CSI driver. You must exclude &lt;code&gt;kube-system&lt;/code&gt; and any namespace where you've deployed low-level infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Danger of &lt;code&gt;synchronize: true&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Kyverno has a setting called &lt;code&gt;synchronize&lt;/code&gt;. When set to true, Kyverno will automatically update the generated resource if the policy changes. This sounds great in theory, but in practice, it can create a synchronization nightmare.&lt;/p&gt;

&lt;p&gt;I once had a policy generating NetworkPolicies for every new namespace. I changed the policy to add a new rule, and Kyverno attempted to update every single NetworkPolicy in the cluster simultaneously. This caused a spike in API server latency and, for a few minutes, left some of my internal services unreachable because the policies were in a state of flux.&lt;/p&gt;

&lt;p&gt;My rule of thumb now is to avoid &lt;code&gt;synchronize: true&lt;/code&gt; for high-churn resources. If you need to update a generated resource across the cluster, it's safer to trigger a rolling update via your GitOps pipeline than to let the admission controller try to rewrite the cluster state on the fly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Orphaned Resources and the Cleanup Gap
&lt;/h3&gt;

&lt;p&gt;Policy engines are great at creating things, but they're often bad at cleaning them up. I ran into this with a dashboard app called Homarr. I had a policy that generated certain config maps for the dashboard. When I deleted the application via the API, the generated resources stayed behind.&lt;/p&gt;

&lt;p&gt;This led to "phantom" items appearing in my dashboard UI. The application was gone, but the configuration lived on in the etcd store. Kyverno doesn't always track the lifecycle of generated resources perfectly.&lt;/p&gt;

&lt;p&gt;If you find yourself with orphaned records in your database or config stores, you might have to go in manually. For Homarr, I had to run a few SQL queries to purge the dead references:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Clean up orphaned item_layout and item records&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;item_layout&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;itemId&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;app_id&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a reminder that while "Policy-as-Code" automates the deployment, it doesn't always automate the decommissioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration with the Wider Stack
&lt;/h3&gt;

&lt;p&gt;A policy engine shouldn't exist in a vacuum. I've found that the most stable setups link Kyverno with other infrastructure tools. For example, I use it to ensure that any ingress resource created in the cluster has the correct annotations for &lt;a href="https://guatulabs.dev/posts/cert-manager-cloudflare-dns-01-automated-tls-for-everything/" rel="noopener noreferrer"&gt;cert-manager and Cloudflare DNS-01&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Instead of reminding every developer to add the &lt;code&gt;cert-manager.io/cluster-issuer&lt;/code&gt; annotation, I wrote a mutation policy that adds it automatically if the ingress is in a production namespace. This removes the human element from the TLS chain.&lt;/p&gt;

&lt;p&gt;Similarly, I use Kyverno to enforce that all &lt;a href="https://guatulabs.dev/posts/sealedsecrets-key-backup-don-t-lose-your-encryption-keys/" rel="noopener noreferrer"&gt;SealedSecrets&lt;/a&gt; are tagged with an owner label. This makes it significantly easier to track who owns which secret when I'm auditing the cluster for old, unused credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway from my time with admission controllers is that the "happy path" is a lie. The documentation shows you how to block a pod, but it doesn't show you the three hours of debugging you'll do when a system-critical operator gets blocked by that same policy.&lt;/p&gt;

&lt;p&gt;I've learned to follow three strict rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test in a sandbox.&lt;/strong&gt; Never apply a new &lt;code&gt;ClusterPolicy&lt;/code&gt; to a production cluster without running it in &lt;code&gt;audit&lt;/code&gt; mode first. Kyverno's audit mode lets you see what would have been blocked without actually blocking it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclude the plumbing.&lt;/strong&gt; Your infrastructure (Traefik, ArgoCD, CNPG, etc.) should almost always be exempt from general application policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep it simple.&lt;/strong&gt; If a policy requires more than a few lines of complex YAML logic, it's probably time to ask if that constraint should be handled at the CI/CD level rather than the admission level.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I've moved toward using &lt;a href="https://guatulabs.dev/posts/kubernetes-manifest-validation-catching-errors-before-merge/" rel="noopener noreferrer"&gt;manifest validation in CI&lt;/a&gt; to catch the obvious errors before they ever hit the API server. This reduces the load on the admission controller and provides faster feedback to the person writing the YAML.&lt;/p&gt;

&lt;p&gt;If you're building out your own infrastructure and need help designing a secure, automated pipeline for AI agents or industrial systems, you can check out my &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;services&lt;/a&gt;. I focus on the gap between the documentation and the actual production reality, which is usually where the most expensive bugs live.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kyverno</category>
      <category>policyascode</category>
      <category>security</category>
    </item>
    <item>
      <title>Privacy-Routed LLM Inference: Keeping Sensitive Data Out of the Cloud</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Fri, 15 May 2026 16:15:32 +0000</pubDate>
      <link>https://dev.to/futhgar/privacy-routed-llm-inference-keeping-sensitive-data-out-of-the-cloud-166</link>
      <guid>https://dev.to/futhgar/privacy-routed-llm-inference-keeping-sensitive-data-out-of-the-cloud-166</guid>
      <description>&lt;p&gt;I spent three hours debugging a "hallucination" in my agent's daily briefing only to realize the agent wasn't hallucinating at all. It had simply failed to access my local financial spreadsheets because of a tool denylist I'd configured for security, and instead of admitting it couldn't see the data, it had tried to "guess" based on a few fragments it had previously cached in a cloud-based session. Even worse, I discovered that a fallback trigger in my orchestration layer had sent a summarized snippet of my private data to a cloud API because the local inference node had a momentary timeout.&lt;/p&gt;

&lt;p&gt;If you're building AI agents that touch real-world data, the "happy path" is usually just a prompt and an API key. The reality is a minefield of data leaks, prompt injections, and silent failures that send your private keys or bank statements to a third-party server because a local GPU pod decided to restart.&lt;/p&gt;

&lt;p&gt;This is a problem for anyone running autonomous agents that have &lt;code&gt;read&lt;/code&gt; or &lt;code&gt;write&lt;/code&gt; access to a local filesystem. If your routing logic is flawed, your privacy isn't a policy; it's a coin flip.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wrong Way: Trusting the Orchestrator
&lt;/h3&gt;

&lt;p&gt;My first attempt at "privacy" was naive. I used a simple conditional in my agent's logic: if the query contained words like "bank," "password," or "private," route it to a local Ollama instance. Otherwise, send it to GPT-4o.&lt;/p&gt;

&lt;p&gt;This failed immediately for three reasons. First, keyword filtering is a joke. A user (or a prompt injection) can easily bypass "bank" by asking about "financial liquidity instruments." Second, I assumed the orchestrator was a neutral party. In reality, the orchestrator often handles the context window, meaning the sensitive data is already in the prompt before the routing decision is even made. Third, I had no fail-safe. When the local model timed out, the system defaulted to the cloud provider to ensure "high availability." In a privacy-first system, unavailability is better than exposure.&lt;/p&gt;

&lt;p&gt;I also hit a wall with tool access. I had disabled &lt;code&gt;sandbox.mode&lt;/code&gt; to let my agents actually do work, but I quickly found that built-in tools like &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;edit&lt;/code&gt; can be manipulated to bypass &lt;code&gt;exec&lt;/code&gt; allowlists. I saw a specific instance where a prompt injection convinced the agent to use a &lt;code&gt;read-chunk&lt;/code&gt; command (a hidden utility in some knowledge base scripts) to dump raw data from a file that should have been summarized first.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Actual Solution: Two-Tier Privacy Routing
&lt;/h3&gt;

&lt;p&gt;The only way to actually guarantee privacy is to move the routing logic as close to the data as possible and treat the cloud LLM as an untrusted guest. I implemented a two-tier architecture: a local "Privacy Gate" and a reference-only knowledge base.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Reference-Only Knowledge Base
&lt;/h4&gt;

&lt;p&gt;Instead of feeding raw files to the LLM, I use a system where the LLM never sees the original document. I use &lt;code&gt;poppler-utils&lt;/code&gt; for PDF extraction and a local embedding model to populate a Qdrant vector store. The agent queries the vector store, but the results are filtered through a local script before being sent to any inference engine.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. The Privacy Gate (Routing Layer)
&lt;/h4&gt;

&lt;p&gt;I wrote a wrapper, &lt;code&gt;knowledge.sh&lt;/code&gt;, that handles the routing. It doesn't rely on keywords. It relies on the data source. If the data comes from a "Sensitive" tagged volume in my cluster, the request is hard-pinned to the local GPU node.&lt;/p&gt;

&lt;p&gt;Here is a simplified version of how I handle a private query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# knowledge.sh query - Local-first routing&lt;/span&gt;

&lt;span class="nv"&gt;QUERY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"qwen2.5:14b"&lt;/span&gt;
&lt;span class="c"&gt;# The local endpoint is a dedicated GPU node in my K8s cluster&lt;/span&gt;
&lt;span class="nv"&gt;LOCAL_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://ollama-gpu-node.internal/v1/chat/completions"&lt;/span&gt;

&lt;span class="c"&gt;# Check if the query requires sensitive data access&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$QUERY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;"--private"&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Routing to local inference..."&lt;/span&gt;
    &lt;span class="c"&gt;# We use a local model and a local endpoint. No cloud fallback.&lt;/span&gt;
    curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_ENDPOINT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{
           &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$MODEL&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
           &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;messages&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: [{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;role&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;content&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$QUERY&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}],
           &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;stream&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: false
         }"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="c"&gt;# Non-sensitive queries can go to the cloud orchestrator&lt;/span&gt;
    ./route-to-cloud.sh &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$QUERY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Hardening the Execution
&lt;/h4&gt;

&lt;p&gt;To prevent the "hallucination via missing data" problem, I stopped letting the LLM handle the final delivery of sensitive reports. I use a pattern where the LLM generates a &lt;em&gt;template&lt;/em&gt; or a &lt;em&gt;summary&lt;/em&gt;, but a local Python script handles the actual data insertion and delivery.&lt;/p&gt;

&lt;p&gt;For my daily briefings, I use a wrapper script that ensures the data collection is isolated from the cloud inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# life-briefing-run.sh&lt;/span&gt;

&lt;span class="c"&gt;# 1. Collect raw data locally (Private)&lt;/span&gt;
./daily-briefing.sh &lt;span class="nt"&gt;--collect-only&lt;/span&gt;

&lt;span class="c"&gt;# 2. Format the data using a local script (No LLM involved here)&lt;/span&gt;
&lt;span class="c"&gt;# This prevents the LLM from accidentally leaking raw data in its output&lt;/span&gt;
python3 /opt/scripts/format-and-send-briefing.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the Python script handles the delivery via a secure API (like Telegram) without ever sending the raw content to a third-party LLM for "polishing":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_telegram_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Tokens are managed via SealedSecrets in K8s
&lt;/span&gt;    &lt;span class="n"&gt;bot_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ANONYMIZED_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ANONYMIZED_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.telegram.org/bot&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bot_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/sendMessage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chat_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_mode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Markdown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the locally generated briefing
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/briefing.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;send_telegram_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;This approach works because it removes the "decision" from the LLM. If you ask an LLM "Should I send this to the cloud?", it will eventually say yes. By moving the routing to a bash wrapper and a Python script, the logic is deterministic.&lt;/p&gt;

&lt;p&gt;The use of a local model like &lt;code&gt;qwen2.5:14b&lt;/code&gt; via Ollama provides enough reasoning capability to summarize private data without needing the massive parameter counts of GPT-4. I've found that for most RAG (Retrieval-Augmented Generation) tasks, a 14B model is the sweet spot between performance and the VRAM limits of my GPU nodes.&lt;/p&gt;

&lt;p&gt;By separating the &lt;em&gt;synthesis&lt;/em&gt; (LLM) from the &lt;em&gt;delivery&lt;/em&gt; (Python script), I've created a circuit breaker. Even if the LLM is compromised via prompt injection, it cannot "leak" the data to the cloud because it doesn't have the API keys for the cloud provider; those are held by the orchestrator, which is gated by the &lt;code&gt;knowledge.sh&lt;/code&gt; script.&lt;/p&gt;

&lt;p&gt;For those managing the underlying hardware, ensuring these local models stay performant requires a stable infrastructure. I've written about &lt;a href="https://dev.to/posts/gpu-passthrough-on-proxmox-gotcha-guide/"&gt;how I handle GPU passthrough on Proxmox&lt;/a&gt; and why the &lt;a href="https://dev.to/posts/nvidia-container-toolkit-why-the-default-runtime-matters/"&gt;NVIDIA Container Toolkit is non-negotiable&lt;/a&gt; for this to work in a Kubernetes environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons Learned
&lt;/h3&gt;

&lt;p&gt;The biggest surprise was how often "convenience" features in agent frameworks are actually security holes. For example, I found that &lt;code&gt;sessionKey&lt;/code&gt; in some cron-job implementations is often misunderstood. I assumed it provided hard isolation, but it turns out it's often just a routing hint. To get actual isolation, you have to explicitly set the session to &lt;code&gt;isolated&lt;/code&gt;, or you risk your private data bleeding into the "main" session context, which might be shared with a cloud-connected agent.&lt;/p&gt;

&lt;p&gt;Another gotcha was the Qdrant MCP. I hit several "Not existing vector name" errors during the rollout. This wasn't a bug in my code but a version mismatch between the MCP server and the Qdrant instance. In a bare-metal K8s setup, pinning your versions is the only way to avoid waking up to a broken pipeline.&lt;/p&gt;

&lt;p&gt;If I were to do this again, I'd implement a more formal "Taint and Toleration" system in Kubernetes. I'd taint my GPU nodes with &lt;code&gt;privacy=high&lt;/code&gt; and only allow pods with the corresponding toleration to run there. This would prevent a non-private, cloud-connected pod from ever being scheduled on the same physical hardware where my sensitive local models are processing data in memory.&lt;/p&gt;

&lt;p&gt;For those looking to scale this into a professional environment, this kind of architecture is a core part of what I do in &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;AI agent and infrastructure consulting&lt;/a&gt;. Moving from a "it works on my machine" script to a production-grade, privacy-routed pipeline is where most of the complexity lives.&lt;/p&gt;

&lt;p&gt;The takeaway is simple: if the data is sensitive, the cloud is a liability. Build your gate, pin your models, and never let your LLM decide where your data goes.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>localllm</category>
      <category>privacy</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Tailscale Subnet Routers: Accessing Your LAN Without the VPN Headache</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Thu, 14 May 2026 02:15:07 +0000</pubDate>
      <link>https://dev.to/futhgar/tailscale-subnet-routers-accessing-your-lan-without-the-vpn-headache-3m70</link>
      <guid>https://dev.to/futhgar/tailscale-subnet-routers-accessing-your-lan-without-the-vpn-headache-3m70</guid>
      <description>&lt;p&gt;I spent three hours trying to SSH into a legacy industrial gateway from a coffee shop, only to realize I'd forgotten to install the Tailscale agent on that specific piece of hardware. The device was a locked-down firmware image where "installing a binary" isn't an option. That's the moment I stopped trying to put Tailscale on every single node and instead shifted to a dedicated subnet router.&lt;/p&gt;

&lt;p&gt;If you have a multi-node Proxmox cluster, a rack of IoT sensors, or a bunch of "dumb" switches and PDUs, you can't possibly install a client on everything. You need a way to tell your Tailnet: "If you're looking for anything in the 10.0.0.x range, just send the traffic to this specific Linux box, and it'll handle the rest."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Concept: Routing vs. Agent-based Access
&lt;/h2&gt;

&lt;p&gt;Standard Tailscale is a mesh. Every device is a peer. This is great for your laptop and your primary workstation, but it's a nightmare for infrastructure. A subnet router turns a single node into a gateway. It acts as a bridge between the encrypted WireGuard mesh and your local physical network.&lt;/p&gt;

&lt;p&gt;The magic here is that the devices on your LAN don't even know Tailscale exists. They just see traffic coming from the subnet router's local IP. You get the security of a private mesh without having to touch the network configuration of your legacy gear or your Kubernetes pods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: The "Happy Path" and the Reality
&lt;/h2&gt;

&lt;p&gt;The official docs make this look like a one-line command. While that's technically true, if you're running this on a production-grade homelab or a bare-metal node, there are a few kernel-level requirements that usually get glossed over.&lt;/p&gt;

&lt;p&gt;First, you have to enable IP forwarding. If the Linux kernel isn't allowed to pass packets between interfaces, your subnet router is just a fancy wall.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable IPv4 forwarding immediately&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;sysctl &lt;span class="nt"&gt;-w&lt;/span&gt; net.ipv4.ip_forward&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;sudo &lt;/span&gt;sysctl &lt;span class="nt"&gt;-w&lt;/span&gt; net.ipv6.conf.all.forwarding&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Make it persist across reboots&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"net.ipv4.ip_forward = 1"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/sysctl.d/99-tailscale.conf
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"net.ipv6.conf.all.forwarding = 1"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/sysctl.d/99-tailscale.conf
&lt;span class="nb"&gt;sudo &lt;/span&gt;sysctl &lt;span class="nt"&gt;-p&lt;/span&gt; /etc/sysctl.d/99-tailscale.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the kernel is ready, you bring Tailscale up. I've found that explicitly forcing the kernel TUN interface is safer than relying on the default, especially if you've experimented with userspace networking in the past. Userspace networking (&lt;code&gt;--tun=userspace-networking&lt;/code&gt;) is a death sentence for subnet routing; it simply won't work because the OS doesn't see the interface as a routable device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Tailscale as a subnet router for a specific range&lt;/span&gt;
&lt;span class="c"&gt;# Replace 10.0.0.0/24 with your actual local subnet&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;tailscale up &lt;span class="nt"&gt;--tun&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kernel &lt;span class="nt"&gt;--advertise-routes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.0.0.0/24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this, you still aren't connected. You have to go into the Tailscale Admin Console and manually approve the routes. This is a security feature to prevent a compromised node from suddenly hijacking all traffic for your entire network.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Kubernetes and Gateway Trap
&lt;/h2&gt;

&lt;p&gt;If you're running your subnet router inside a container or as part of a larger orchestration layer, you'll likely run into the &lt;code&gt;gateway.bind&lt;/code&gt; issue. I hit this while integrating a gateway with some Kubernetes services.&lt;/p&gt;

&lt;p&gt;When using tools like OpenClaw or custom wrappers, the default binding often fails because the application tries to bind to an interface that isn't actually the LAN. If your config looks like a generic default, you'll see the node is "online" in the dashboard, but you can't ping anything on the local subnet.&lt;/p&gt;

&lt;p&gt;You need to explicitly tell the gateway to bind to the LAN interface. In the JSON config, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gateway"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lan"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, the traffic often loops back or hits a dead end in the container network. This is similar to the networking headaches I've dealt with regarding DNS resolution, like the &lt;a href="https://dev.to/posts/wildcard-dns-ndots-5-the-tls-nightmare-and-how-to-fix-it"&gt;Wildcard DNS and ndots:5 nightmare&lt;/a&gt;, where the system thinks it knows where to go but the underlying routing logic is fundamentally flawed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning it into an Exit Node
&lt;/h2&gt;

&lt;p&gt;A subnet router lets you reach your home from the outside. An exit node lets you send all your internet traffic through your home from the outside. It's the difference between "I want to see my Proxmox UI" and "I'm on public WiFi and I want to pretend I'm at home for security."&lt;/p&gt;

&lt;p&gt;To do this, add the &lt;code&gt;--advertise-exit-node&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tailscale up &lt;span class="nt"&gt;--advertise-routes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.0.0.0/24 &lt;span class="nt"&gt;--advertise-exit-node&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is where most people get stuck. You'll enable the exit node, select it in your client, and then realize you have zero internet access. The packets are reaching your router, but they aren't being NAT'd back out to the web. You need a MASQUERADE rule in your iptables to handle the translation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace eth0 with your actual primary network interface&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-A&lt;/span&gt; POSTROUTING &lt;span class="nt"&gt;-o&lt;/span&gt; eth0 &lt;span class="nt"&gt;-j&lt;/span&gt; MASQUERADE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using a modern distro with &lt;code&gt;nftables&lt;/code&gt;, you'll need the equivalent rule there. If you don't do this, the return packets from the internet don't know how to get back to the Tailscale client because they're coming from a virtual IP range the rest of your network doesn't recognize.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Advanced Headache: NAT-PMP and Network Namespaces
&lt;/h2&gt;

&lt;p&gt;In some complex environments, specifically when dealing with certain industrial gateways or strict NAT setups, you'll encounter NAT-PMP misrouting. This happens when Tailscale tries to be too smart about the local network and accidentally routes requests to a remote subnet router instead of the local one.&lt;/p&gt;

&lt;p&gt;The fix is ugly, but it works: isolate the Tailscale traffic into its own network namespace (&lt;code&gt;netns&lt;/code&gt;). This prevents the daemon from interfering with the host's primary routing table in ways that cause loops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a dedicated namespace for Tailscale&lt;/span&gt;
ip netns add tailscale
ip &lt;span class="nb"&gt;link &lt;/span&gt;add veth0 &lt;span class="nb"&gt;type &lt;/span&gt;veth peer name veth1
ip &lt;span class="nb"&gt;link set &lt;/span&gt;veth0 netns tailscale

&lt;span class="c"&gt;# Assign an IP to the virtual interface&lt;/span&gt;
ip addr add 10.0.0.2/24 dev veth1
ip &lt;span class="nb"&gt;link set &lt;/span&gt;veth1 up

&lt;span class="c"&gt;# Run the daemon inside the namespace&lt;/span&gt;
ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;tailscale tailscaled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is overkill for 90% of homelabbers, but if you're building out &lt;a href="https://dev.to/posts/automating-infrastructure-with-opentofu-and-github-actions"&gt;automated infrastructure with OpenTofu&lt;/a&gt; and deploying these routers across multiple sites, you'll want this kind of isolation to ensure stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: Subnet Routers vs. Traditional VPNs
&lt;/h2&gt;

&lt;p&gt;I've run OpenVPN and WireGuard (manual) for years. Here is the honest breakdown of why I switched to Tailscale subnet routers for my remote access.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Traditional VPN (OpenVPN/WireGuard)&lt;/th&gt;
&lt;th&gt;Tailscale Subnet Router&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual certs, port forwarding, firewall rules&lt;/td&gt;
&lt;td&gt;Zero-config NAT traversal, OAuth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Client Mgmt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributing &lt;code&gt;.ovpn&lt;/code&gt; or &lt;code&gt;.conf&lt;/code&gt; files&lt;/td&gt;
&lt;td&gt;Log in with SSO/Identity provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Routing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual static routes on clients&lt;/td&gt;
&lt;td&gt;Centralized route management in console&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Updating keys, managing IP pools&lt;/td&gt;
&lt;td&gt;Automatic key rotation, managed IPs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (if tuned correctly)&lt;/td&gt;
&lt;td&gt;High (WireGuard based)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff is the "phone home" aspect. Tailscale's coordination server knows which nodes are online. For most of us, that's a fair price to pay to avoid spending a Saturday morning debugging why a UDP port isn't opening on a residential ISP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotchas and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;If you're setting this up, watch out for these three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The "Double-Hop" Latency:&lt;/strong&gt; If you use a subnet router and then an exit node on a different machine, your traffic is bouncing across your network multiple times. It's fine for SSH, but terrible for VoIP or gaming. Keep your subnet router and exit node on the same high-performance machine if possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS Leaks:&lt;/strong&gt; Just because you can route to &lt;code&gt;10.0.0.x&lt;/code&gt; doesn't mean your DNS is working. You'll still be typing IPs unless you configure "MagicDNS" or set up a global nameserver in the Tailscale admin panel that points to your internal DNS (like AdGuard Home).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail-Closed Policies:&lt;/strong&gt; By default, if your subnet router goes down, you lose access to the entire LAN. If this is for a production environment, I highly recommend setting up two subnet routers in different failure domains. Tailscale doesn't do "automatic failover" in the traditional sense, but you can have multiple nodes advertising the same route.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The subnet router is the only sane way to manage remote access to a complex lab. It separates the "connectivity" layer from the "device" layer. You don't need to care if your old NAS doesn't support WireGuard or if your industrial PLC has a proprietary OS. You just need one stable Linux box with &lt;code&gt;ip_forwarding&lt;/code&gt; enabled and a couple of iptables rules.&lt;/p&gt;

&lt;p&gt;Reach for this technique the moment you find yourself saying, "I wish I could just SSH into this thing without having to install a client on it." If you're looking to scale this into a larger professional setup, feel free to check out my &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;infrastructure consulting services&lt;/a&gt; for help with AI agent orchestration or bare-metal networking.&lt;/p&gt;

</description>
      <category>tailscale</category>
      <category>networking</category>
      <category>infrastructure</category>
      <category>homelab</category>
    </item>
    <item>
      <title>PCIe Device Passthrough: NIC Name Instability and MAC Pinning</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Fri, 08 May 2026 04:15:19 +0000</pubDate>
      <link>https://dev.to/futhgar/pcie-device-passthrough-nic-name-instability-and-mac-pinning-4di7</link>
      <guid>https://dev.to/futhgar/pcie-device-passthrough-nic-name-instability-and-mac-pinning-4di7</guid>
      <description>&lt;p&gt;My Proxmox node rebooted, and suddenly the host was unreachable via SSH. I had to plug in a physical monitor and keyboard only to find that my primary network interface, which had been &lt;code&gt;enp4s0&lt;/code&gt; for months, had decided to rename itself to &lt;code&gt;enp5s0&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Because my &lt;code&gt;/etc/network/interfaces&lt;/code&gt; file was explicitly tied to &lt;code&gt;enp4s0&lt;/code&gt;, the bridge didn't come up, the IP wasn't assigned, and I was locked out of my own hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I expected
&lt;/h2&gt;

&lt;p&gt;I expected the Linux kernel to consistently enumerate my PCIe devices. In a static hardware environment where nothing has moved, the PCI bus address should be deterministic. If the NIC is plugged into the same slot and the BIOS hasn't changed, &lt;code&gt;enp4s0&lt;/code&gt; should stay &lt;code&gt;enp4s0&lt;/code&gt; forever. This is the "happy path" most documentation assumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happened
&lt;/h2&gt;

&lt;p&gt;The reality is that PCIe enumeration is not always a constant. I'm using a mix of onboard NICs and a PCIe expansion card. I also have a GPU passed through to a VM. &lt;/p&gt;

&lt;p&gt;The surprise here is how the kernel's predictable network interface naming (systemd-udevd) interacts with the PCIe topology. When I added a new PCIe device and tweaked some BIOS settings for IOMMU, the way the kernel mapped the physical slots to the virtual naming changed. A slight shift in how the PCIe switch reported the devices caused the index to jump.&lt;/p&gt;

&lt;p&gt;This isn't just a "one-time fluke." If you're running a multi-node cluster or using GPUs that might move addresses (something I've documented before in &lt;a href="https://guatulabs.dev/posts/gpu-pci-address-instability-when-your-card-moves-between-reboots/" rel="noopener noreferrer"&gt;GPU PCI Address Instability&lt;/a&gt;), you'll find that the kernel is surprisingly flexible with where it puts things. &lt;/p&gt;

&lt;p&gt;The root cause is that &lt;code&gt;enp4s0&lt;/code&gt; is a name derived from the PCI location. If the location changes—even by one digit—the name changes. If your network config depends on that name, your system is one reboot away from a blackout.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: MAC Pinning
&lt;/h2&gt;

&lt;p&gt;The only way to stop this is to stop relying on the PCI slot location and start relying on the hardware's unique identifier: the MAC address. &lt;/p&gt;

&lt;p&gt;I decided to use systemd &lt;code&gt;.link&lt;/code&gt; files. This allows me to tell the kernel: "I don't care where this device is on the PCIe bus; if it has this MAC address, call it &lt;code&gt;eth0&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Identify the MAC address
&lt;/h3&gt;

&lt;p&gt;First, I had to find the actual MAC of the problematic NIC while I had console access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip &lt;span class="nb"&gt;link &lt;/span&gt;show
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I looked for the interface that was currently named &lt;code&gt;enp5s0&lt;/code&gt; (the "wrong" name) and copied the &lt;code&gt;link/ether&lt;/code&gt; value.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create the .link file
&lt;/h3&gt;

&lt;p&gt;I created a custom link file in &lt;code&gt;/etc/systemd/network/&lt;/code&gt;. I chose the name &lt;code&gt;10-lan.link&lt;/code&gt; to ensure it loads early in the boot process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/network/10-lan.link
&lt;/span&gt;&lt;span class="nn"&gt;[Match]&lt;/span&gt;
&lt;span class="py"&gt;MACAddress&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;00:11:22:33:44:55&lt;/span&gt;

&lt;span class="nn"&gt;[Link]&lt;/span&gt;
&lt;span class="py"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;eth0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Note: I've anonymized the MAC address above. Use your actual hardware MAC here.)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Update the network configuration
&lt;/h3&gt;

&lt;p&gt;Once the interface is pinned to &lt;code&gt;eth0&lt;/code&gt;, I had to update the Proxmox network configuration to match. I edited &lt;code&gt;/etc/network/interfaces&lt;/code&gt; to replace the volatile &lt;code&gt;enp4s0&lt;/code&gt; with the stable &lt;code&gt;eth0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example snippet from /etc/network/interfaces&lt;/span&gt;
auto eth0
iface eth0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.x/24
    gateway 10.0.0.1
    bridge-ports eth0
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Apply and verify
&lt;/h3&gt;

&lt;p&gt;I ran &lt;code&gt;systemd-networkd-restart&lt;/code&gt; (or just rebooted, since I was already at the console) and verified the name with &lt;code&gt;ip a&lt;/code&gt;. The NIC was now consistently &lt;code&gt;eth0&lt;/code&gt;, regardless of whether the PCIe bus shifted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;If you're just running a single VM on a desktop, this is a minor annoyance. But if you're building a &lt;a href="https://guatulabs.dev/posts/building-production-homelab/" rel="noopener noreferrer"&gt;production-grade homelab&lt;/a&gt;, this is a critical failure point.&lt;/p&gt;

&lt;p&gt;You'll hit this specifically in these scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adding/Removing PCIe Hardware:&lt;/strong&gt; Adding a new NVMe drive or a GPU can shift the enumeration of other devices on the same root complex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BIOS Updates:&lt;/strong&gt; A BIOS update often resets PCIe lane bifurcation or IOMMU settings, which can completely reorder how the kernel sees your NICs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using PCIe Switches:&lt;/strong&gt; Some high-end motherboards or riser cables use PCIe switches that can report different topologies depending on the power state of the devices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Tradeoff
&lt;/h3&gt;

&lt;p&gt;The tradeoff here is that you're moving away from the "modern" predictable naming convention back to the "old" &lt;code&gt;ethX&lt;/code&gt; style. Some people find &lt;code&gt;eth0&lt;/code&gt; ugly or outdated, but in a headless server environment, "ugly" is better than "unreachable."&lt;/p&gt;

&lt;p&gt;I've also seen people try to fix this using udev rules in &lt;code&gt;/etc/udev/rules.d/&lt;/code&gt;. While that works, &lt;code&gt;.link&lt;/code&gt; files are the native systemd way to handle this and are generally cleaner to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The biggest lesson here is that documentation for Proxmox and Debian assumes your hardware topology is a constant. It isn't. &lt;/p&gt;

&lt;p&gt;When you're doing complex things like PCIe passthrough—which I've detailed in my &lt;a href="https://guatulabs.dev/posts/gpu-passthrough-on-proxmox-gotcha-guide/" rel="noopener noreferrer"&gt;GPU Passthrough Gotcha Guide&lt;/a&gt;—you are intentionally messing with the PCI bus. You're telling the host kernel to ignore certain devices so the VM can claim them. This volatility is a side effect of that power.&lt;/p&gt;

&lt;p&gt;If you are passing through NICs or GPUs, do not trust the default interface names. Pin your critical management interfaces to their MAC addresses immediately. It takes five minutes to set up and saves you from a midnight trip to the server rack because a reboot decided your network card now lives at &lt;code&gt;enp6s0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For those of you managing larger fleets or complex AI agent infrastructure, this kind of hardware-level stability is the foundation. You can't build a reliable &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;multi-agent AI pipeline&lt;/a&gt; if the underlying Kubernetes worker nodes are randomly losing their network identity.&lt;/p&gt;

&lt;p&gt;Next time you're configuring a new node, don't just copy the &lt;code&gt;enpXsX&lt;/code&gt; name from the GUI. Take the extra step to pin it. Your future self will thank you when the next BIOS update doesn't break your entire cluster.&lt;/p&gt;

</description>
      <category>proxmox</category>
      <category>pciepassthrough</category>
      <category>networking</category>
      <category>homelab</category>
    </item>
    <item>
      <title>GPU PCI Address Instability: When Your Card Moves Between Reboots</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Thu, 07 May 2026 00:15:04 +0000</pubDate>
      <link>https://dev.to/futhgar/gpu-pci-address-instability-when-your-card-moves-between-reboots-56mj</link>
      <guid>https://dev.to/futhgar/gpu-pci-address-instability-when-your-card-moves-between-reboots-56mj</guid>
      <description>&lt;p&gt;I spent an entire afternoon debugging a VM that refused to boot, only to find out my GPU had decided to change its PCI address. One reboot and the device that lived at &lt;code&gt;01:00.0&lt;/code&gt; suddenly migrated to &lt;code&gt;02:00.0&lt;/code&gt;. Because my Proxmox VM configuration was pinned to the old address, the VM crashed with a QEMU assertion error, and the GPU simply vanished from the guest.&lt;/p&gt;

&lt;p&gt;This usually happens because of how the BIOS handles PCIe enumeration during POST. If you have multiple PCIe devices or a complex motherboard topology, the bus numbering isn't always deterministic. This is compounded by AMD Ryzen C-states or weird UMA frame buffer settings that can delay device initialization, causing the kernel to assign addresses in a different order than the previous boot. If you've already dealt with &lt;a href="https://guatulabs.dev/posts/amd-igpu-stealing-your-ram-uma-frame-buffer-on-headless-servers/" rel="noopener noreferrer"&gt;AMD iGPU RAM theft&lt;/a&gt;, you know how sensitive these BIOS settings are.&lt;/p&gt;

&lt;p&gt;If you're on Proxmox 8.4+, the "happy path" is to use the &lt;code&gt;q35&lt;/code&gt; machine type. The older &lt;code&gt;i440fx&lt;/code&gt; is more prone to these PCI mapping failures and IRQ conflicts. I also found that preventing the card from entering deep power states helps avoid the "zombie GPU" scenario where the card is physically there but logically dead.&lt;/p&gt;

&lt;p&gt;To stabilize this, I switched the VM to &lt;code&gt;q35&lt;/code&gt; and explicitly enabled PCIe mode for the passthrough device. I also added a kernel parameter to stop the CPU from entering deep sleep states, which I've found reduces the randomness of the PCIe bus scan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Change VM to q35 machine type for better PCIe support&lt;/span&gt;
qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;VMID&amp;gt; &lt;span class="nt"&gt;--machine&lt;/span&gt; q35

&lt;span class="c"&gt;# 2. Pass through the GPU with pcie=1 to ensure it's treated as a PCIe device&lt;/span&gt;
&lt;span class="c"&gt;# Replace &amp;lt;PCI_ADDRESS&amp;gt; with your current address (e.g., 0000:01:00.0)&lt;/span&gt;
qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;VMID&amp;gt; &lt;span class="nt"&gt;-hostpci0&lt;/span&gt; &amp;lt;PCI_ADDRESS&amp;gt;,pcie&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# 3. To stop the GPU from entering D3cold (which can cause boot-time instability)&lt;/span&gt;
&lt;span class="c"&gt;# Run this on the Proxmox host&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;0 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/bus/pci/devices/0000:&amp;lt;PCI_BUS&amp;gt;:&amp;lt;PCI_SLOT&amp;gt;.0/d3cold_allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the addresses keep shifting despite these changes, you're fighting your motherboard's firmware. At that point, I stopped fighting the VM abstraction and moved the NVIDIA drivers directly onto the Proxmox host. I then used the &lt;a href="https://guatulabs.dev/posts/nvidia-container-toolkit-why-the-default-runtime-matters" rel="noopener noreferrer"&gt;NVIDIA Container Toolkit&lt;/a&gt; to expose the GPU to my Kubernetes worker. It removes the PCI address fragility entirely because the host driver handles the hardware mapping, and the containers just see the device.&lt;/p&gt;

&lt;p&gt;The lesson here is that PCI addresses are not constants; they are suggestions. If your workload requires 100% uptime and you can't guarantee a static PCI map, stop using VM passthrough and move the driver to the host.&lt;/p&gt;

</description>
      <category>proxmox</category>
      <category>gpupassthrough</category>
      <category>pcie</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Cognitive Memory for Agents: Vector Search vs Activation-Based Recall</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Wed, 06 May 2026 22:15:04 +0000</pubDate>
      <link>https://dev.to/futhgar/cognitive-memory-for-agents-vector-search-vs-activation-based-recall-52lh</link>
      <guid>https://dev.to/futhgar/cognitive-memory-for-agents-vector-search-vs-activation-based-recall-52lh</guid>
      <description>&lt;p&gt;I spent a few weeks trying to build an agent that could remember specific user preferences across sessions without bloating the context window to a point where latency became unbearable. The standard advice is always "just use a vector database." But as the memory store grew, I noticed a weird gap: the agent could find a document about "user prefers dark mode" via cosine similarity, but it couldn't "recall" the immediate emotional state or the nuance of the last three turns of conversation unless they were explicitly mirrored in the embedding.&lt;/p&gt;

&lt;p&gt;The problem is that vector search is a retrieval mechanism, not a cognitive memory system. When you move from simple RAG to actual agentic memory, you have to choose between external vector search and internal activation-based recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Point
&lt;/h3&gt;

&lt;p&gt;You face this choice when your agent's "short-term" memory (the context window) is full, and your "long-term" memory (the database) is returning results that are mathematically similar but contextually irrelevant. &lt;/p&gt;

&lt;p&gt;If you need your agent to remember a 500-page technical manual, you need a vector store. If you need your agent to exhibit a consistent "personality" or recall a specific pattern of behavior that isn't easily summarized into a string of text for an embedding model, you need something closer to activation-based recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Vector Search (The External Archive)
&lt;/h3&gt;

&lt;p&gt;Vector search is the industry standard for a reason: it's easy to scale and the tooling is mature. You turn a piece of text into a vector using an embedding model (like &lt;code&gt;text-embedding-3-small&lt;/code&gt;), shove it into a store like FAISS or Milvus, and query it with another vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt; You can store billions of vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold Storage:&lt;/strong&gt; It doesn't eat VRAM. It lives on disk or in a dedicated database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interpretability:&lt;/strong&gt; I can literally query the database and see exactly which chunk of text was retrieved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The "Semantic Gap":&lt;/strong&gt; Cosine similarity is a blunt instrument. If a user says "That's not what I meant," a vector search might retrieve a passage about "meaning" or "intent" rather than understanding the correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; You have to embed the query, hit the DB, and then stuff the results into the prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a basic implementation using FAISS. I use this for the "knowledge base" layer of my agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Dimension depends on your embedding model (e.g., 1536 for OpenAI)
&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt; 
&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  &lt;span class="c1"&gt;# number of memory chunks
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# Mocking embeddings of agent experiences
&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# Querying for the top 4 most similar memories
&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved memory indices: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option B: Activation-Based Recall (The Internal Intuition)
&lt;/h3&gt;

&lt;p&gt;Activation-based recall is more akin to how biological memory works. Instead of searching a database, the "memory" is stored in the weights or the hidden states of the model. In modern agent architectures, this often involves using activation hooks or specialized memory layers (like Memory Transformers) that allow the model to trigger a recall based on the current internal state of the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; There is no external API call or DB lookup. The recall happens during the forward pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nuance:&lt;/strong&gt; It captures "how" something was said, not just "what" was said. It's an associative trigger rather than a keyword search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Black Box:&lt;/strong&gt; Debugging this is a nightmare. You can't just "look" at the database to see why the agent recalled a specific memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM Pressure:&lt;/strong&gt; Storing these activations or maintaining a dynamic memory network consumes precious GPU memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've experimented with simple activation hooks in PyTorch to track which "states" trigger certain behaviors. It's not a full-blown Memory Transformer, but it's a start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# In a real system, this would be a specific layer's activation
&lt;/span&gt;        &lt;span class="c1"&gt;# that represents a 'concept' or 'state'
&lt;/span&gt;        &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

        &lt;span class="c1"&gt;# Store the activation state for later recall/analysis
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detach&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;input_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored state vector: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Vector Search&lt;/th&gt;
&lt;th&gt;Activation-Based Recall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive (TB+)&lt;/td&gt;
&lt;td&gt;Small (MB to GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds (Network/Disk)&lt;/td&gt;
&lt;td&gt;Microseconds (GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic/Keyword&lt;/td&gt;
&lt;td&gt;Associative/Pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy (Query the DB)&lt;/td&gt;
&lt;td&gt;Hard (Analyze Tensors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU/Disk/API&lt;/td&gt;
&lt;td&gt;VRAM/Compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  My Pick and Why
&lt;/h3&gt;

&lt;p&gt;I don't pick one. I use a hybrid. &lt;/p&gt;

&lt;p&gt;If you're building a production agent, relying solely on vector search leads to that "robotic" feeling where the agent repeats the same retrieved snippet regardless of the conversation flow. Relying solely on activations is a recipe for a system you can't debug when it starts hallucinating.&lt;/p&gt;

&lt;p&gt;I implement a tiered system. I use a vector store for the "Library" (hard facts, documentation) and a sliding window of activations for the "Working Memory" (current mood, immediate goals, recent corrections). This mirrors the &lt;a href="https://dev.to/posts/six-layer-memory-architecture-for-claude-code"&gt;6-layer memory architecture&lt;/a&gt; I've used for my own tools.&lt;/p&gt;

&lt;p&gt;For those building multi-agent systems, I recommend offloading the vector search to a shared service and keeping the activation-based recall local to the agent's specific instance. This prevents the "shared memory" from becoming a noisy mess of conflicting embeddings. You can see how this fits into larger patterns in my post on &lt;a href="https://dev.to/posts/multi-agent-ai-systems-architecture-patterns"&gt;multi-agent architecture patterns&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're still struggling with agents that forget things every five minutes, you might be hitting a safety loop. I've written about &lt;a href="https://dev.to/posts/three-layer-safety-autonomous-agents"&gt;three-layer safety for autonomous agents&lt;/a&gt; which often solves the "infinite loop" problem that people mistake for a memory issue.&lt;/p&gt;

&lt;p&gt;If you need help designing a memory architecture that doesn't melt your GPU or your budget, check out my &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;AI agent consulting services&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons learned:&lt;/strong&gt; &lt;br&gt;
The docs for vector DBs make it sound like they replace the need for cognitive memory. They don't. They replace the need for a filing cabinet. If you want an agent that actually "feels" like it's learning from a conversation in real-time, you have to move closer to the activations.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>vectordatabases</category>
      <category>llmmemory</category>
      <category>cognitivearchitecture</category>
    </item>
    <item>
      <title>Vibration Monitoring Architecture: From Sensor to Dashboard</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Wed, 06 May 2026 16:15:04 +0000</pubDate>
      <link>https://dev.to/futhgar/vibration-monitoring-architecture-from-sensor-to-dashboard-26ib</link>
      <guid>https://dev.to/futhgar/vibration-monitoring-architecture-from-sensor-to-dashboard-26ib</guid>
      <description>&lt;p&gt;The first time I tried to stream raw vibration data to a dashboard, I managed to crash my MQTT broker in under ten minutes. I had a high-frequency accelerometer spitting out samples at 5kHz, and I thought I'd just wrap those values in JSON and send them over the wire. The result wasn't a pretty graph; it was a series of &lt;code&gt;Connection refused&lt;/code&gt; errors and a broker that had completely locked up under the weight of thousands of tiny packets per second.&lt;/p&gt;

&lt;p&gt;If you're building a vibration monitoring system, you're not just dealing with "IoT data." You're dealing with signal processing. There is a massive difference between reporting a temperature every 30 seconds and capturing the harmonic frequencies of a motor bearing. If you treat vibration data like any other telemetry, your network will choke, your database will bloat, and your dashboards will be useless.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I tried first (The wrong way)
&lt;/h3&gt;

&lt;p&gt;My initial assumption was that the "modern stack" (Sensor $\rightarrow$ MQTT $\rightarrow$ Time Series DB $\rightarrow$ Grafana) would handle everything. I used a cheap industrial sensor that output raw voltage via a 4-20mA loop, fed into a PLC, which then pushed data to a Python script on a Raspberry Pi.&lt;/p&gt;

&lt;p&gt;I wrote a simple loop that read the sensor and published to a topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# DO NOT DO THIS
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;factory/machine1/vibration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I quickly hit three walls:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Network Saturation:&lt;/strong&gt; Sending one MQTT packet per sample is an architectural sin. The overhead of the TCP/IP stack and MQTT headers is larger than the actual payload. I was spending 90% of my bandwidth on headers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Explosion:&lt;/strong&gt; InfluxDB is great, but inserting 5,000 points per second per sensor is a recipe for a disk space crisis. My cardinality exploded, and queries that should have taken milliseconds started taking 30 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Noise" Problem:&lt;/strong&gt; The raw data was a jagged mess. I couldn't see the actual vibration patterns because the high-frequency electrical noise from the nearby VFDs (Variable Frequency Drives) was masking the mechanical signal.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I realized that the gap between the sensor and the dashboard isn't a straight line. It's a funnel. You have to aggressively reduce the data volume at the edge before it ever touches the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Actual Solution: The Edge-Heavy Pipeline
&lt;/h3&gt;

&lt;p&gt;To make this work, I shifted the intelligence to the edge. The goal is to move from "streaming raw samples" to "streaming features." Instead of sending every single point, I calculate the RMS (Root Mean Square), Peak-to-Peak, and FFT (Fast Fourier Transform) bins locally.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Signal Conditioning and Edge Processing
&lt;/h4&gt;

&lt;p&gt;I moved the processing to a dedicated edge gateway. I used a Python-based service that buffers samples in memory, applies a digital filter to remove electrical noise, and calculates the metrics.&lt;/p&gt;

&lt;p&gt;Here is the implementation of the signal conditioning and feature extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.signal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;butter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filtfilt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;paho.mqtt.client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mqtt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration for a 10kHz sampling rate
&lt;/span&gt;&lt;span class="n"&gt;FS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; 
&lt;span class="n"&gt;CUTOFF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="c1"&gt;# Remove noise above 2kHz
&lt;/span&gt;&lt;span class="n"&gt;ORDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;butter_lowpass_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nyq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;
    &lt;span class="n"&gt;normal_cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;nyq&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;butter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normal_cutoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;btype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analog&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;filtfilt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Filter the raw signal to remove high-frequency noise
&lt;/span&gt;    &lt;span class="n"&gt;filtered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;butter_lowpass_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUTOFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ORDER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate RMS - the primary indicator of overall vibration level
&lt;/span&gt;    &lt;span class="n"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate Peak-to-Peak
&lt;/span&gt;    &lt;span class="n"&gt;ptp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ptp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform FFT to find the dominant frequency
&lt;/span&gt;    &lt;span class="n"&gt;fft_vals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfftfreq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;FS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dominant_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fft_vals&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rms&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ptp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dom_freq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dominant_freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Main loop: Buffer 1000 samples, then send 1 summary packet
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mqtt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mqtt-broker.example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1883&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_sensor_raw&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Mock function for ADC read
&lt;/span&gt;    &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Send summary instead of 1000 raw points
&lt;/span&gt;        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iiot/machine1/vibration/features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="c1"&gt;# Clear buffer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. The Transport Layer (MQTT 5.0)
&lt;/h4&gt;

&lt;p&gt;For the broker, I shifted from a basic Mosquitto setup to a more controlled configuration. Since vibration data is critical for predictive maintenance, I needed to ensure that the "heartbeat" of the machine was always known.&lt;/p&gt;

&lt;p&gt;I used MQTT 5.0 "Will Messages" to detect if a gateway went offline. If the gateway crashes, the broker immediately publishes a "disconnected" status to the health topic, so the dashboard doesn't just show a flat line (which could be mistaken for a stopped machine).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# mosquitto.conf snippet&lt;/span&gt;
&lt;span class="s"&gt;listener &lt;/span&gt;&lt;span class="m"&gt;1883&lt;/span&gt;
&lt;span class="s"&gt;allow_anonymous &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="s"&gt;password_file /etc/mosquitto/passwd&lt;/span&gt;
&lt;span class="c1"&gt;# Prevent the broker from being overwhelmed by slow consumers&lt;/span&gt;
&lt;span class="s"&gt;max_queued_messages &lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've written more about choosing the right broker in my &lt;a href="https://guatulabs.dev/posts/mqtt-broker-selection-hivemq-vs-mosquitto-for-industrial-use/" rel="noopener noreferrer"&gt;MQTT Broker Selection&lt;/a&gt; post, but for vibration, the priority is low latency and high reliability over massive scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Storage and Visualization
&lt;/h4&gt;

&lt;p&gt;I used InfluxDB 2.x for storage because of its native handling of time-series data. Instead of storing the raw waveform, I store the calculated features. This reduces the storage requirement by 1000x.&lt;/p&gt;

&lt;p&gt;In Grafana, I set up a dashboard that monitors the RMS value. However, looking at a raw line graph of vibration is usually useless for operators. They don't know if 0.5g is "bad" or "normal." &lt;/p&gt;

&lt;p&gt;I integrated this with a health scoring system. I used a Flux query in InfluxDB to compare the current RMS against a baseline (the average of the last 7 days).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;InfluxDB&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt; &lt;span class="n"&gt;Query&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;Relative&lt;/span&gt; &lt;span class="n"&gt;Vibration&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"iiot_data"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"_measurement"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nv"&gt;"vibration_sensor"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"_field"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nv"&gt;"rms"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;aggregateWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;every&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_value&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;Normalize&lt;/span&gt; &lt;span class="n"&gt;against&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feeds directly into the concept of &lt;a href="https://guatulabs.dev/posts/equipment-health-scoring-one-number-your-operators-actually-check/" rel="noopener noreferrer"&gt;Equipment Health Scoring&lt;/a&gt;, where the goal is to give the operator a single "Health %" rather than a complex spectrum analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this architecture works
&lt;/h3&gt;

&lt;p&gt;The reason this works is that it respects the laws of physics and networking. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Nyquist-Shannon Theorem&lt;/strong&gt; tells us we need to sample at twice the frequency of the signal we want to capture. If you want to detect a bearing fault at 2kHz, you must sample at 4kHz+. Trying to do this over WiFi or Ethernet using standard JSON-over-MQTT is impossible because the packet overhead kills the throughput.&lt;/p&gt;

&lt;p&gt;By calculating the RMS and FFT at the edge, we are performing &lt;strong&gt;Data Reduction&lt;/strong&gt;. We transform a high-bandwidth signal (time domain) into a low-bandwidth set of descriptors (frequency domain). &lt;/p&gt;

&lt;p&gt;The edge processing also acts as a mechanical filter. By using a Butterworth low-pass filter, I can strip out the 60Hz hum from the power lines and the high-frequency spikes from the VFDs. If you do this in the cloud, you've already wasted the bandwidth sending noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons learned and caveats
&lt;/h3&gt;

&lt;p&gt;If I had to build this again, I'd change a few things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Hardware-level filtering:&lt;/strong&gt; I spent too much time in Python trying to fix signal noise. In a real industrial environment, you should use an analog anti-aliasing filter (a physical capacitor/resistor circuit) before the signal ever hits the ADC. Software filters are great, but they can't fix aliasing if the signal was already corrupted during sampling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The "Buffer" Trap:&lt;/strong&gt; My Python script used a simple list for the buffer. At very high sampling rates, Python's list appending becomes slow. I had to switch to &lt;code&gt;numpy&lt;/code&gt; arrays with pre-allocated memory to avoid garbage collection pauses that caused gaps in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Provisioning the Edge:&lt;/strong&gt; Managing these Python scripts across five different gateways was a nightmare. I eventually moved the deployment to a GitOps flow, using &lt;a href="https://guatulabs.dev/posts/automating-infrastructure-with-opentofu-and-github-actions/" rel="noopener noreferrer"&gt;OpenTofu and GitHub Actions&lt;/a&gt; to manage the underlying VM configurations on my Proxmox cluster, ensuring every gateway had the exact same version of &lt;code&gt;scipy&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Dashboard Paradox:&lt;/strong&gt; The more data I put on the dashboard, the less the operators used it. The final version of the system only shows three things: a Green/Yellow/Red light for health, the current RMS value, and a "Time to Maintenance" estimate. Everything else (the FFT bins, the raw waveforms) is hidden in a "Deep Dive" tab that only the reliability engineer ever opens.&lt;/p&gt;

&lt;p&gt;Vibration monitoring is a classic example of where "more data" is actually "less information." The value isn't in the sensor; it's in the reduction process that happens between the sensor and the screen.&lt;/p&gt;

</description>
      <category>iiot</category>
      <category>vibrationanalysis</category>
      <category>mqtt</category>
      <category>influxdb</category>
    </item>
    <item>
      <title>Unprivileged LXC + Docker: The runc Sysctl Permission Trap</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Tue, 05 May 2026 00:15:20 +0000</pubDate>
      <link>https://dev.to/futhgar/unprivileged-lxc-docker-the-runc-sysctl-permission-trap-fb5</link>
      <guid>https://dev.to/futhgar/unprivileged-lxc-docker-the-runc-sysctl-permission-trap-fb5</guid>
      <description>&lt;p&gt;&lt;code&gt;sysctl: setting key "net.ipv4.ip_local_port_range": Permission denied&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I saw this error while trying to tune the network stack for a high-concurrency service running in Docker, which itself was hosted inside an unprivileged LXC container on Proxmox. The weird part? I was root inside the container.&lt;/p&gt;

&lt;p&gt;I expected that since I had already enabled &lt;code&gt;nesting=1&lt;/code&gt; and &lt;code&gt;keyctl=1&lt;/code&gt; in the LXC configuration, Docker would have the necessary permissions to modify kernel parameters via &lt;code&gt;runc&lt;/code&gt;. In a standard VM, this is trivial. In a privileged container, it just works. But in an unprivileged container, the user namespace mapping creates a wall that &lt;code&gt;runc&lt;/code&gt; cannot climb.&lt;/p&gt;

&lt;p&gt;What actually happened is a collision between &lt;code&gt;systemd&lt;/code&gt; (v243+), &lt;code&gt;runc&lt;/code&gt;, and the Linux kernel's security model for unprivileged user namespaces. When you run an unprivileged LXC, the root user inside the container is actually a non-privileged user on the Proxmox host (usually UID 100000). &lt;/p&gt;

&lt;p&gt;The kernel prevents these mapped users from modifying &lt;code&gt;sysctl&lt;/code&gt; settings because those settings are often global or namespace-specific in ways that could allow a container to crash the host or leak information. &lt;code&gt;runc&lt;/code&gt;, the runtime Docker uses, tries to apply these settings during container creation, but the kernel returns a permission denied error. Because of how some Docker versions handle this, the error is sometimes swallowed, and your app just runs with the wrong defaults.&lt;/p&gt;

&lt;p&gt;If you're building a production-grade homelab, you probably don't want to just switch to a privileged container. That's a security nightmare. Instead, you have to move the configuration "up" the chain.&lt;/p&gt;

&lt;p&gt;The fix is to apply the &lt;code&gt;sysctl&lt;/code&gt; settings at the LXC level before the container fully initializes, or directly on the host if the parameter isn't namespaced. Since we want to keep the host clean, using an LXC pre-start hook is the cleanest way to inject these settings.&lt;/p&gt;

&lt;p&gt;On the Proxmox host, you can add a hook to the container's configuration file (usually in &lt;code&gt;/etc/pve/lxc/ID.conf&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add this to your LXC .conf file on the Proxmox host&lt;/span&gt;
lxc.hook.pre-start &lt;span class="o"&gt;=&lt;/span&gt; /usr/bin/echo &lt;span class="s2"&gt;"net.ipv4.ip_local_port_range = 1024 65535"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sysctl.d/99-lxc.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, for most users, the most reliable method is to define the parameter in the host's &lt;code&gt;sysctl.conf&lt;/code&gt; if it's a global setting, or use the &lt;code&gt;lxc.sysctl&lt;/code&gt; directive in the config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example Proxmox LXC config snippet&lt;/span&gt;
&lt;span class="na"&gt;arch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amd64&lt;/span&gt;
&lt;span class="na"&gt;cores&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
&lt;span class="na"&gt;net0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;name=eth0,bridge=vmbr0,ip=10.0.0.x/24,gw=10.0.0.1&lt;/span&gt;
&lt;span class="na"&gt;ostype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt;
&lt;span class="na"&gt;unprivileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;features&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nesting=1,keyctl=1&lt;/span&gt;
&lt;span class="c1"&gt;# Inject the sysctl here&lt;/span&gt;
&lt;span class="s"&gt;lxc.sysctl.net.ipv4.ip_local_port_range = 1024 &lt;/span&gt;&lt;span class="m"&gt;65535&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding this, you have to restart the container. If you just restart the Docker daemon inside the LXC, the kernel parameter won't update because the LXC boundary is where the restriction lives.&lt;/p&gt;

&lt;p&gt;This trap is common when you're trying to optimize networking or memory management (like &lt;code&gt;vm.max_map_count&lt;/code&gt; for Elasticsearch) inside a nested environment. If you've dealt with the headache of &lt;a href="https://guatulabs.dev/posts/gpu-passthrough-on-proxmox-gotcha-guide/" rel="noopener noreferrer"&gt;GPU passthrough on Proxmox&lt;/a&gt;, you know that the gap between "it's a container" and "it's an unprivileged container" is where most of the pain lives.&lt;/p&gt;

&lt;p&gt;One last thing to watch out for: UID shifts. If you're mounting NFS shares into these containers to provide storage for your Docker volumes, you'll hit the UID mismatch. The container thinks it's root (UID 0), but the host sees UID 100000. I've spent hours debugging "Permission Denied" on volumes only to realize I needed to &lt;code&gt;chmod 0777&lt;/code&gt; the host directory or properly map the IDs in the &lt;code&gt;.conf&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;If you're scaling this into a larger cluster, I highly recommend moving these workloads to bare-metal Kubernetes. I wrote about my experience with &lt;a href="https://guatulabs.dev/posts/kubernetes-storage-on-bare-metal-longhorn-in-practice/" rel="noopener noreferrer"&gt;Longhorn for bare-metal storage&lt;/a&gt;, and while the initial setup is heavier than an LXC, you stop fighting the Proxmox container permission war and start dealing with standard K8s primitives.&lt;/p&gt;

</description>
      <category>proxmox</category>
      <category>lxc</category>
      <category>docker</category>
      <category>sysctl</category>
    </item>
    <item>
      <title>AdGuard Home: Network-Wide DNS Filtering with Failover</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Mon, 04 May 2026 22:15:20 +0000</pubDate>
      <link>https://dev.to/futhgar/adguard-home-network-wide-dns-filtering-with-failover-1i0</link>
      <guid>https://dev.to/futhgar/adguard-home-network-wide-dns-filtering-with-failover-1i0</guid>
      <description>&lt;p&gt;DNS is the single point of failure that makes everyone in the house complain that "the internet is down" when, in reality, your DNS container just crashed. I've spent too much time as the sole admin of my network having to manually flip DNS settings on my router because a single AdGuard Home instance decided to stop responding. If you're running this in a homelab, you can't just set it and forget it. You need a failover strategy that doesn't require you to touch a CLI while your family is staring at you.&lt;/p&gt;

&lt;p&gt;The mistake most people make is trusting the default upstream behavior. They add three upstream servers and assume AdGuard Home will magically route around a dead one instantly. In practice, depending on your version and config, you can still hit timeouts that feel like a total outage. I've moved my setup to a Kubernetes deployment using MetalLB to give it a static IP, but the real win is the explicit failover logic in the &lt;code&gt;adguard-home.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I prefer using a combination of Cloudflare and Quad9 for the primary upstreams, with a dedicated fallback. This ensures that if my primary DNS providers have a routing issue, the system pivots to a tertiary option without dropping the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# adguard-home.yaml snippets&lt;/span&gt;
&lt;span class="na"&gt;upstream_dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.1.1.1"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0.1"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9.9.9.9"&lt;/span&gt;

&lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Use parallel requests to find the fastest response&lt;/span&gt;
  &lt;span class="na"&gt;upstream_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parallel&lt;/span&gt; 

&lt;span class="na"&gt;failover&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;health_check_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;health_check_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;fallback_upstream&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8.8.8.8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those running this on K8s, don't skimp on memory limits. I initially set my memory request too low and saw the OOM killer terminate the pod every time I updated a large blocklist. I now pin my resources to ensure stability, especially when integrated with &lt;a href="https://guatulabs.dev/posts/cert-manager-cloudflare-dns-01-automated-tls-for-everything/" rel="noopener noreferrer"&gt;cert-manager for automated TLS&lt;/a&gt; to secure the dashboard.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;adguard-home k8s-at-home/adguard-home &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; network &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; image.tag&lt;span class="o"&gt;=&lt;/span&gt;latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; resources.limits.memory&lt;span class="o"&gt;=&lt;/span&gt;1Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; resources.requests.memory&lt;span class="o"&gt;=&lt;/span&gt;256Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The biggest lesson here is that "high availability" for DNS isn't just about having two pods. It's about how the system handles the gap between a server being "up" and a server actually returning a valid record. If you're building out larger infrastructure, I've found that combining this with a strict &lt;a href="https://guatulabs.dev/posts/kubernetes-manifest-validation-catching-errors-before-merge/" rel="noopener noreferrer"&gt;manifest validation pipeline&lt;/a&gt; prevents the kind of YAML typos that can take your entire network offline.&lt;/p&gt;

&lt;p&gt;Keep your upstreams diverse and your memory limits realistic.&lt;/p&gt;

</description>
      <category>dns</category>
      <category>adguardhome</category>
      <category>infrastructure</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Three-Layer Safety for Autonomous Agents: Stopping the Infinite Loop</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Thu, 30 Apr 2026 22:15:29 +0000</pubDate>
      <link>https://dev.to/futhgar/three-layer-safety-for-autonomous-agents-stopping-the-infinite-loop-3go5</link>
      <guid>https://dev.to/futhgar/three-layer-safety-for-autonomous-agents-stopping-the-infinite-loop-3go5</guid>
      <description>&lt;p&gt;I watched an autonomous agent spend three hours and 40,000 tokens trying to close a GitHub issue that had an open dependency, only to fail because it kept hallucinating a &lt;code&gt;force_close&lt;/code&gt; flag that didn't exist in the API. It didn't just fail; it entered a perfect infinite loop: it would call the tool, get a 400 error, interpret the error as a "temporary network glitch," and try again with the exact same payload.&lt;/p&gt;

&lt;p&gt;If you've built agents that actually touch production systems, you know this feeling. Prompting the agent to "be careful" or "follow the schema" is a placebo. When you move from a chat window to an autonomous loop, the gap between the LLM's intent and the system's reality becomes a canyon where agents go to die (and burn through your API credits).&lt;/p&gt;

&lt;p&gt;For anyone running agent orchestration in a homelab or production environment, you need a safety architecture that doesn't rely on the model's "good behavior." I've moved to a three-layer safety model: Token-Level Enforcement, Pre-Execution Gates, and Execution Isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tried first
&lt;/h2&gt;

&lt;p&gt;My first instinct was to lean heavily on PydanticAI. The idea of using Pydantic for type-safe tool calling seemed like the silver bullet. I spent a week building out complex schemas, thinking that if the code validated the output, the agent would simply "learn" to provide the correct format.&lt;/p&gt;

&lt;p&gt;I was wrong. I hit a wall where the agent would produce a JSON object that was &lt;em&gt;almost&lt;/em&gt; correct, but it would miss a closing brace or add a trailing comma. Pydantic would throw a &lt;code&gt;ValidationError&lt;/code&gt;, the agent would see that error in its history, and then it would attempt to "fix" the JSON by adding even more commentary around the code block. This created a feedback loop of &lt;code&gt;ValidationError&lt;/code&gt; $\rightarrow$ &lt;code&gt;Apology&lt;/code&gt; $\rightarrow$ &lt;code&gt;Broken JSON&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then I tried adding a "supervisor" agent to review the actions of the "worker" agent. This just doubled my latency and doubled my token cost without actually solving the root cause. The supervisor often hallucinated the same API capabilities as the worker because they were using the same base model.&lt;/p&gt;

&lt;p&gt;The real problem wasn't the logic; it was the lack of deterministic boundaries. I was treating the LLM as a reliable software component when it's actually a probabilistic engine. To make it safe, I had to stop trying to "convince" the model to be safe and start forcing it to be safe at the infrastructure level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Token-Level Schema Enforcement
&lt;/h2&gt;

&lt;p&gt;The first layer of safety happens before the agent even finishes its sentence. If you're using Ollama v0.5.0 or newer, you can stop relying on the model to "try its best" with JSON.&lt;/p&gt;

&lt;p&gt;Most people use the OpenAI-compatible API layer provided by frameworks, but that often just wraps the prompt in "Please return JSON." Ollama now supports a native &lt;code&gt;format&lt;/code&gt; parameter that enforces the schema at the token-sampling level. This means the model physically cannot sample a token that violates the JSON schema.&lt;/p&gt;

&lt;p&gt;Here is how I implemented this for my homelab health reports using &lt;code&gt;qwen2.5:14b-instruct&lt;/code&gt;. I switched from the 32B model to the 14B variant because the 32B was causing 502 timeouts on my Tesla P40s due to VRAM pressure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="c1"&gt;# Define the strict structure we want
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HomelabHealthReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;node_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;critical_alerts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;storage_utilization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Percentage 0-100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Extract the JSON schema for Ollama
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HomelabHealthReport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_safe_report&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# We bypass the high-level wrappers and hit the API directly
&lt;/span&gt;    &lt;span class="c1"&gt;# to ensure the 'format' parameter is actually passed.
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://ollama:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5:14b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# This is the magic: token-level enforcement
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a health report for the homelab based on current metrics.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Result is guaranteed to be valid JSON matching HomelabHealthReport
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By moving the constraint to the sampler, I eliminated the &lt;code&gt;ValidationError&lt;/code&gt; loops entirely. The model no longer "guesses" the JSON; it is constrained by the grammar of the schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: The Pre-Execution Gate (ActionGate)
&lt;/h2&gt;

&lt;p&gt;Even with perfect JSON, an agent can still decide to do something stupid. Token-level safety ensures the &lt;em&gt;format&lt;/em&gt; is right, but it doesn't ensure the &lt;em&gt;intent&lt;/em&gt; is safe.&lt;/p&gt;

&lt;p&gt;I implemented an &lt;code&gt;ActionGate&lt;/code&gt;. This is a deterministic middleware layer that sits between the agent's tool-call and the actual execution. It doesn't use an LLM. It uses hard-coded business logic and state checks.&lt;/p&gt;

&lt;p&gt;If an agent tries to close a ticket, the &lt;code&gt;ActionGate&lt;/code&gt; checks if there are open dependencies. If it tries to reboot a node, it checks if that node is currently the only one running a critical service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SafetyException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_action_safety&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Deterministic safety check. 
    No LLMs allowed here.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Prevent closing issues that have blocking dependencies
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close_issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issue_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;issue_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_has_dependency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SafetyException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Safety Violation: Cannot close issue &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;issue_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; while dependencies are open.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prevent destructive actions on production nodes during peak hours
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reboot_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;peak_hours&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SafetyException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Safety Violation: Reboot of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; forbidden during peak hours.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="c1"&gt;# Usage in the agent loop
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;check_action_safety&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;SafetyException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# We feed the specific error back to the agent so it can pivot
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action rejected by Safety Gate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the "infinite loop of failure" I mentioned earlier. Instead of the agent getting a generic 400 error from an API and thinking it's a network glitch, it gets a clear, human-readable explanation: "You cannot do this because X." This forces the agent to change its strategy rather than just retrying the same failed request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: Execution Isolation and Shell Safety
&lt;/h2&gt;

&lt;p&gt;The final layer is where the rubber meets the road. I've spent too many hours debugging "quoting hell." &lt;/p&gt;

&lt;p&gt;When you have an agent generating a command that needs to run over SSH, inside a Proxmox container (&lt;code&gt;pct exec&lt;/code&gt;), as a specific user (&lt;code&gt;su&lt;/code&gt;), and then executing a Python script, you have four layers of shell interpretation. If you use f-strings to build these commands, a single single-quote in the agent's output will break the entire pipeline.&lt;/p&gt;

&lt;p&gt;I saw this happen when an agent tried to pass a complex JSON string as an argument to a script. The shell interpreted the quotes, the &lt;code&gt;su&lt;/code&gt; command stripped another layer, and by the time it hit Python, the syntax was mangled.&lt;/p&gt;

&lt;p&gt;The fix is to stop passing code as shell arguments. Instead, pipe the code directly into the stdin of the remote process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wrong way (prone to quoting errors):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This will break the moment the agent adds a ' or " to the payload&lt;/span&gt;
ssh node-a &lt;span class="s2"&gt;"pct exec 101 -- su - user -c 'python3 -c &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;print(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Hello World&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The right way (Shell-safe piping):&lt;/strong&gt;&lt;br&gt;
I wrote a helper that writes the agent's intended Python logic to a temporary file or pipes it directly. This avoids the shell's interpretation of the string entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# We pipe the actual script content into the remote shell&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; ~/bin/helpers/scout-ideas-helper.py | &lt;span class="se"&gt;\&lt;/span&gt;
  ssh node-a &lt;span class="s2"&gt;"pct exec 101 -- su - user -c 'python3 -'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup, &lt;code&gt;python3 -&lt;/code&gt; tells Python to execute the code coming from stdin. The shell only sees the command to start Python, not the code itself. This completely eliminates the quoting nightmare.&lt;/p&gt;

&lt;p&gt;To manage the tools themselves, I've moved away from custom boilerplate and started using FastMCP. It allows me to wrap my MSAM (Multi-Agent System Architecture) tools into a standardized server that the agents can discover and use without me having to manually update the tool definitions every time I add a new function. I've detailed the setup for this in my post on &lt;a href="https://guatulabs.dev/posts/building-mcp-servers-with-fastmcp/" rel="noopener noreferrer"&gt;Building MCP Servers with FastMCP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works
&lt;/h2&gt;

&lt;p&gt;This architecture works because it acknowledges that the LLM is the most unreliable part of the system. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token-level enforcement&lt;/strong&gt; removes the "formatting" problem. The agent can no longer fail because it forgot a comma.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ActionGate&lt;/strong&gt; removes the "logic" problem. The agent can no longer perform an action that is fundamentally unsafe, regardless of how confident it is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Isolation&lt;/strong&gt; removes the "infrastructure" problem. The agent's output is treated as data (stdin) rather than as a command (shell argument).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you combine these, you move from a system that is "mostly working" to one that is "predictably bounded."&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The biggest surprise was how much the &lt;code&gt;format&lt;/code&gt; parameter in Ollama reduced the need for complex prompt engineering. I spent weeks refining a "System Prompt" to ensure JSON compliance, only to find that a single API parameter did the job better than 500 words of instructions.&lt;/p&gt;

&lt;p&gt;If I were to do this over again, I would have implemented the &lt;code&gt;ActionGate&lt;/code&gt; much sooner. I spent too much time trying to make the agent "smarter" when I should have just made the environment "stricter."&lt;/p&gt;

&lt;p&gt;A few caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Each layer adds a small amount of overhead. The &lt;code&gt;ActionGate&lt;/code&gt; is negligible (milliseconds), but the token-level enforcement can slightly increase the time to first token because the sampler has to do more work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM&lt;/strong&gt;: As I noted, model size matters. Qwen 2.5 14B is the sweet spot for my hardware. If you're running on limited VRAM, don't chase the 32B or 70B models just for the sake of "intelligence" if it leads to 502 timeouts and unstable inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Drift&lt;/strong&gt;: Ensure your agent's memory is cleaned up. I use a &lt;a href="https://guatulabs.dev/posts/six-layer-memory-architecture-for-claude-code/" rel="noopener noreferrer"&gt;six-layer memory architecture&lt;/a&gt; to prevent the agent from getting confused by outdated context, which is often the root cause of why it tries to perform unsafe actions in the first place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building autonomous agents isn't about finding the perfect model; it's about building the perfect cage for that model to operate in.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>llmops</category>
      <category>mcpservers</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Stop Merging Broken YAML: Kubernetes Manifest Validation in CI</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Sat, 25 Apr 2026 22:15:35 +0000</pubDate>
      <link>https://dev.to/futhgar/stop-merging-broken-yaml-kubernetes-manifest-validation-in-ci-52g9</link>
      <guid>https://dev.to/futhgar/stop-merging-broken-yaml-kubernetes-manifest-validation-in-ci-52g9</guid>
      <description>&lt;p&gt;Pushing a broken manifest to your main branch is a rite of passage, but it's one that becomes significantly more painful when you're running a GitOps workflow with ArgoCD. I've spent far too many late nights staring at a "Sync Failed" status in ArgoCD, only to realize I had a typo in a Traefik IngressRoute or a missing resource limit that Kyverno was blocking. The problem isn't just the error itself; it's the feedback loop. If the error only surfaces during deployment, your CI pipeline has failed its primary job.&lt;/p&gt;

&lt;p&gt;The goal is to move validation as far left as possible. I started integrating &lt;code&gt;kubeconform&lt;/code&gt; into my GitHub Actions workflow to catch structural errors—like invalid API versions or malike fields—before the code even reaches a pull request review. However, structural validation is only half the battle. You also have to deal with policy enforcement. I recently ran into a situation where a Kyverno policy enforcing resource limits on all Jobs was breaking my CloudNativePG (CNPG) deployments. The CNPG operator creates Jobs that don't always follow the standard resource pattern, and because the policy was too broad, the cluster refused to provision the primary.&lt;/p&gt;

&lt;p&gt;The fix involves two parts: using &lt;code&gt;kubeconform&lt;/code&gt; for schema validation in CI and using targeted exclusions in your Kyverno policies. For the CI side, you don't need a complex setup. A simple action step can scan your entire manifests directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Action snippet for manifest validation&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validate-manifests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Validate Kubernetes manifests&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yannh/kubernetes-manifest-validate@v1.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;manifests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;kubernetes/workloads/**/*.yaml&lt;/span&gt;
            &lt;span class="s"&gt;kubernetes/infrastructure/**/*.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the cluster side, when you have a legitimate reason to bypass a policy—like the CNPG example—don't just disable the policy globally. Use labels to create an exclusion scope. This keeps your &lt;a href="https://guatulabs.dev/posts/gitops-for-homelabs-argocd-app-of-apps/" rel="noopener noreferrer"&gt;GitOps for Homelabs&lt;/a&gt; workflow clean without sacrificing security for the rest of your workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Policy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-resource-limits&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enforce-limits-on-jobs&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
      &lt;span class="c1"&gt;# Exclude CNPG clusters so the operator can manage its own jobs&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cnpg.io/cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;containers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;have&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limits&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;defined."&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
                        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Validating at the PR stage catches the "dumb" mistakes, while smart policy exclusions prevent the "smart" tools from breaking your legitimate infrastructure.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>cicd</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
