DEV Community: Guatu

Proxmox Cluster Quorum: How Many Nodes Do You Actually Need

Guatu — Mon, 18 May 2026 16:15:57 +0000

I woke up to a cluster that had effectively turned itself into a read-only museum. My VMs were running, but I couldn't start a new one, I couldn't migrate a workload, and the Proxmox GUI was throwing "Cluster not ready - no quorum" errors across the board. I had a two-node setup, one node had rebooted for a kernel update, and the remaining node decided that since it didn't have a majority, it no longer had the right to make decisions.

If you're building a Proxmox cluster, quorum is the one concept that will either be completely invisible or the primary reason your entire infrastructure freezes. Most people treat it as a checkbox during the cluster creation wizard, but in a home lab, the math of quorum often clashes with the reality of how many physical servers you can actually fit in your rack.

What I tried first

My initial instinct was that "Cluster" simply meant "nodes that can talk to each other." I assumed that as long as one node was alive, the cluster was alive. I set up two beefy nodes, linked them together, and felt confident.

Then I hit the "split-brain" wall. In a two-node cluster, the quorum requirement is (n/2) + 1. For two nodes, that means you need two votes to have a majority. If one node goes down, the remaining node has one vote. One is not greater than one. The remaining node loses quorum and enters a protective state. It stops allowing configuration changes to prevent a scenario where both nodes think they are the master and start writing conflicting data to shared storage, which is a great way to corrupt your VM disks.

I tried to "fix" this by manually forcing quorum on the surviving node using pvecm expected 1. It worked for a few minutes, but it's a manual band-aid. Every time a node rebooted or a network cable acted up, I was back in the CLI fighting with the cluster manager. I realized I was fighting the fundamental design of Corosync, and the only way out was to change the voting math.

The actual solution

You have three real options depending on your hardware budget and your tolerance for manual intervention.

Option 1: The Three-Node Standard

The cleanest way to solve quorum is to just add a third node. With three nodes, quorum is two votes. If one node dies, two remain. You still have a majority, and HA (High Availability) actually works as intended.

Option 2: The QDevice (The "Cheap" Vote)

If you can't justify a third full-sized server, you use a Quorum Device (QDevice). A QDevice is a lightweight external voter. It doesn't run VMs; it just tells the cluster "Yes, I see Node A." You can run this on a Raspberry Pi, a tiny VM on a separate host, or even a cheap VPS.

To set up a QDevice on a separate Debian/Ubuntu machine:

# On the QDevice server (the voter)
apt update && apt install corosync-qnetd

# On all Proxmox nodes
apt update && apt install corosync-qdevice

Once the software is installed, you initialize the device from one of the Proxmox nodes:

# Run this on one PVE node
pvecm qdevice setup <IP-OF-QDEVICE-SERVER>

This adds a third vote to the cluster without requiring a third Proxmox node. Now, if one PVE node fails, the other PVE node and the QDevice provide the two votes needed to maintain quorum.

Option 3: Monitoring and API Integration

If you're running a larger setup, you shouldn't be checking quorum by clicking through the GUI. I integrated pve_exporter with Prometheus to get alerts the second a node loses its vote.

Since I'm using token-based authentication to avoid the security risks of root passwords in plain text (see my post on Proxmox API Tokens), the setup looks like this.

First, create a restricted user for the exporter:

# Create user with PVEAuditor role
pveum user add prometheus@pve --realm local --password sEcr3T! --groups PVEAuditors

# Create API token for prometheus@pve
pveum token add prometheus@pve prometheus --privsep 0

Then, configure the pve_exporter YAML:

api:
  token_name: prometheus
  token_value: prometheus@pve!prometheus

And the Prometheus scrape config to target the nodes:

- job_name: 'proxmox'
  metrics_path: /pve
  scrape_interval: 30s
  params:
    cluster: ['1']
    node: ['1']
  relabel_configs:
    - source_labels: [__address__]
      regex: '^(10\.0\.0\.\d+)$'
      target_label: __param_target
      replacement: $1
  static_configs:
    - targets: ['10.0.0.x:9221']

Why it works

Proxmox uses Corosync for cluster membership and quorum. Corosync is designed for absolute consistency over availability (the "C" in the CAP theorem). It assumes that if you can't reach a majority of your peers, you are the one who is isolated, not them.

In a two-node cluster, there is no way to distinguish between "Node B is dead" and "The network cable between Node A and Node B is unplugged." If Node A decided to stay "active" while Node B also stayed "active," and both tried to modify the same shared storage (like a Ceph pool or an NFS share), you'd end up with a corrupted filesystem.

By adding a third vote (either a node or a QDevice), you break the tie. The node that can still talk to the QDevice knows it is part of the majority. The node that is isolated knows it's alone and gracefully steps back.

Lessons learned

The biggest lesson here is that High Availability (HA) is a lie if you don't have a proper quorum strategy. I spent a week thinking I had "HA" because I had two nodes and shared storage. In reality, I had a system that would freeze the moment I tried to update a BIOS or swap a NIC.

If you're running a two-node cluster, do not rely on pvecm expected 1. It's a temporary fix for recovery, not a configuration. Get a QDevice. Even a $35 Raspberry Pi is better than a cluster that goes read-only during a midnight update.

I also found that hardware stability plays a huge role in quorum health. If you're seeing random "Node lost" messages in your logs but the server is still pingable, check your kernel settings. I've dealt with AMD Ryzen C-State freezes that looked like network failures but were actually the CPU dropping into a sleep state so deep the NIC stopped responding for a few milliseconds, triggering a Corosync timeout.

A few final caveats:

QDevice Placement: Don't run your QDevice as a VM on the same cluster it's voting for. That's circular logic. If the cluster loses quorum and the VM stops, the QDevice disappears, and you're stuck. Put it on a separate physical box or a different hypervisor.
Network Latency: Corosync is extremely sensitive to latency. If you're putting your QDevice in the cloud or on a slow Wi-Fi link, you'll see "flapping" where the cluster constantly gains and loses quorum. Use a wired connection.
The "Expected" Trap: When you manually change pvecm expected, you are telling the cluster to ignore the safety rules. Only do this when you are performing maintenance on a known-down node and need to regain control of the surviving one.

If you're scaling this into a production-grade environment, this is where the gap between a "homelab" and "infrastructure" becomes clear. For those needing professional help architecting these systems for zero-downtime, I provide infrastructure consulting to handle the messy parts of bare-metal orchestration.

Kyverno Admission Controllers: Policy-as-Code That Actually Works

Guatu — Mon, 18 May 2026 02:15:57 +0000

I spent an entire Saturday afternoon debugging why my CloudNativePG (CNPG) database cluster refused to initialize, only to find out my own security policies were killing the initdb jobs. I had a "require-resource-limits" policy active across the cluster. It sounded like a great idea: no pod enters the cluster without explicit CPU and memory limits. The documentation makes this look like a five-minute win for cluster stability.

What the docs don't tell you is that many Kubernetes Operators, including CNPG, spawn temporary Jobs or Pods that don't always inherit the limits you've defined in the primary custom resource. The admission controller saw a pod without limits, deemed it "illegal," and blocked it. The operator just kept retrying, and I kept wondering why my database was stuck in a pending state with no obvious error in the operator logs.

This is the gap between "Policy-as-Code" as a concept and Policy-as-Code in a real production environment. If you've ever tried to enforce standards across a multi-node cluster, you've probably looked at OPA Gatekeeper or Kyverno. I've used both. One requires you to learn a specialized language (Rego) that feels like a full-time job, and the other uses YAML.

Why you'd choose a Policy Engine

You reach this decision point when your cluster grows beyond a few hand-rolled manifests. Once you're using ArgoCD to scale your apps, you stop caring about individual pods and start caring about invariants.

These invariants usually fall into a few buckets:

No one runs a container as root.
Every deployment has a specific set of labels for monitoring.
Resource limits are enforced so one runaway AI agent doesn't starve the rest of the node.
Sidecars are automatically injected without manually editing every deployment.

You can do some of this with Pod Security Admissions (PSA), but PSA is a blunt instrument. It's a "yes or no" switch. A real admission controller allows you to mutate the request on the fly. If a developer forgets a security context, the controller doesn't just reject the pod; it injects the correct one.

Option A: OPA Gatekeeper

Gatekeeper is the industry standard for large-scale enterprises. It's built on Open Policy Agent (OPA), and its primary strength is its absolute precision.

Strengths
The logic is decoupled from the Kubernetes API. Because it uses Rego, you can write incredibly complex queries. If you need a policy that says "Allow this pod only if the user is in the 'dev' group AND the time is between 9 AM and 5 PM AND the image comes from a specific signed registry," Gatekeeper can do it.

Weaknesses
The learning curve is a cliff. Rego is a declarative query language, and if you've never used it, you'll spend more time fighting the syntax than actually securing your cluster. Debugging a failing Rego policy is a nightmare because the error messages are often opaque.

When it shines
Gatekeeper is for environments where compliance is a legal requirement. If you're in a highly regulated industry where you need a mathematical proof of your security posture, the overhead of Rego is worth it.

Option B: Kyverno

Kyverno is the choice for those of us who just want things to work without learning a new language. It uses YAML for everything.

Strengths
It's native to Kubernetes. If you can write a Pod manifest, you can write a Kyverno policy. It handles mutation, validation, and generation. The "generation" part is a killer feature: you can tell Kyverno that whenever a new namespace is created, it should automatically generate a NetworkPolicy and a LimitRange for that namespace.

Weaknesses
YAML has limits. While Kyverno is powerful, it can't match the raw computational logic of Rego for extremely complex edge cases. It's also easier to accidentally create "mutation loops" where a policy changes a resource, which triggers the policy again, ad infinitum.

When it shines
It's perfect for the GitOps-driven homelab or mid-sized production environment. It integrates cleanly with manifest validation pipelines and doesn't require a dedicated "Policy Engineer" to maintain.

Decision Framework

Criterion	OPA Gatekeeper	Kyverno
Language	Rego (Specialized)	YAML (K8s Native)
Learning Curve	Steep	Shallow
Mutation	Possible, but complex	First-class citizen
Resource Generation	No	Yes
Performance	Extremely high	High
Configuration	ConstraintTemplates	ClusterPolicies
Ideal User	Compliance/Security Teams	DevOps/Platform Engineers

My Pick and Why

I use Kyverno. I've tried the "right way" with OPA, but in a lean environment, the cognitive load of Rego is a liability. I'd rather spend my time optimizing my AI agent orchestration than debugging a query language.

However, using Kyverno without a strategy is a fast track to a broken cluster. To make it actually work, you have to move away from the "happy path" and account for infrastructure overhead.

The "Infrastructure Exclusion" Pattern

The biggest mistake I made early on was applying policies globally. I had a policy that required all pods to have a specific security context. Suddenly, my Traefik ingress and ArgoCD controllers started crashing because they needed specific capabilities (like NET_ADMIN) that my policy explicitly forbade.

The fix is to implement a strict exclusion list. You cannot treat your infrastructure components the same way you treat your application workloads. I now use a combination of namespace exclusions and label-based filters to ensure that the "plumbing" of the cluster stays functional.

Here is how I handled the CNPG issue. Instead of a blanket "require limits" policy that blocks everything, I added an exclusion for any resource tagged by the CNPG operator.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  rules:
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      generate:
        kind: LimitRange
        name: default-limit-range
        namespace: $(metadata.namespace)
        applyTo: Pod
        spec:
          limits:
            - type: Container
              max:
                memory: 512Mi
      exclude:
        any:
          - labels:
              cnpg.io/cluster: "*"

This policy ensures that most pods get a default limit range, but it stays out of the way of the database operator's internal jobs.

Handling Security Contexts without Breaking the Cluster

Another common pitfall is forcing security contexts on pods that actually need to run as root to perform system-level tasks. I've seen this happen with storage drivers and network plugins.

I prefer a "mutate-then-validate" approach. I use Kyverno to inject a sane default security context for everything, and then I create a small set of exceptions for the system namespaces.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-security-context
spec:
  rules:
    - name: set-default-security-context
      match:
        any:
          - resources:
              kinds:
                - Pod
      # I use mutate here instead of generate to ensure the pod 
      # spec is modified before it hits the scheduler
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              runAsUser: 1000
              runAsGroup: 1000
              fsGroup: 2000
              supplementalGroups: [2001]

If you apply this globally, you'll likely break your CNI or your CSI driver. You must exclude kube-system and any namespace where you've deployed low-level infrastructure.

The Danger of `synchronize: true`

Kyverno has a setting called synchronize. When set to true, Kyverno will automatically update the generated resource if the policy changes. This sounds great in theory, but in practice, it can create a synchronization nightmare.

I once had a policy generating NetworkPolicies for every new namespace. I changed the policy to add a new rule, and Kyverno attempted to update every single NetworkPolicy in the cluster simultaneously. This caused a spike in API server latency and, for a few minutes, left some of my internal services unreachable because the policies were in a state of flux.

My rule of thumb now is to avoid synchronize: true for high-churn resources. If you need to update a generated resource across the cluster, it's safer to trigger a rolling update via your GitOps pipeline than to let the admission controller try to rewrite the cluster state on the fly.

Orphaned Resources and the Cleanup Gap

Policy engines are great at creating things, but they're often bad at cleaning them up. I ran into this with a dashboard app called Homarr. I had a policy that generated certain config maps for the dashboard. When I deleted the application via the API, the generated resources stayed behind.

This led to "phantom" items appearing in my dashboard UI. The application was gone, but the configuration lived on in the etcd store. Kyverno doesn't always track the lifecycle of generated resources perfectly.

If you find yourself with orphaned records in your database or config stores, you might have to go in manually. For Homarr, I had to run a few SQL queries to purge the dead references:

-- Clean up orphaned item_layout and item records
DELETE FROM item_layout WHERE itemId NOT IN (SELECT id FROM item);
DELETE FROM item WHERE app_id NOT IN (SELECT id FROM app);

It's a reminder that while "Policy-as-Code" automates the deployment, it doesn't always automate the decommissioning.

Integration with the Wider Stack

A policy engine shouldn't exist in a vacuum. I've found that the most stable setups link Kyverno with other infrastructure tools. For example, I use it to ensure that any ingress resource created in the cluster has the correct annotations for cert-manager and Cloudflare DNS-01.

Instead of reminding every developer to add the cert-manager.io/cluster-issuer annotation, I wrote a mutation policy that adds it automatically if the ingress is in a production namespace. This removes the human element from the TLS chain.

Similarly, I use Kyverno to enforce that all SealedSecrets are tagged with an owner label. This makes it significantly easier to track who owns which secret when I'm auditing the cluster for old, unused credentials.

Lessons Learned

The biggest takeaway from my time with admission controllers is that the "happy path" is a lie. The documentation shows you how to block a pod, but it doesn't show you the three hours of debugging you'll do when a system-critical operator gets blocked by that same policy.

I've learned to follow three strict rules:

Test in a sandbox. Never apply a new ClusterPolicy to a production cluster without running it in audit mode first. Kyverno's audit mode lets you see what would have been blocked without actually blocking it.
Exclude the plumbing. Your infrastructure (Traefik, ArgoCD, CNPG, etc.) should almost always be exempt from general application policies.
Keep it simple. If a policy requires more than a few lines of complex YAML logic, it's probably time to ask if that constraint should be handled at the CI/CD level rather than the admission level.

I've moved toward using manifest validation in CI to catch the obvious errors before they ever hit the API server. This reduces the load on the admission controller and provides faster feedback to the person writing the YAML.

If you're building out your own infrastructure and need help designing a secure, automated pipeline for AI agents or industrial systems, you can check out my services. I focus on the gap between the documentation and the actual production reality, which is usually where the most expensive bugs live.

Privacy-Routed LLM Inference: Keeping Sensitive Data Out of the Cloud

Guatu — Fri, 15 May 2026 16:15:32 +0000

I spent three hours debugging a "hallucination" in my agent's daily briefing only to realize the agent wasn't hallucinating at all. It had simply failed to access my local financial spreadsheets because of a tool denylist I'd configured for security, and instead of admitting it couldn't see the data, it had tried to "guess" based on a few fragments it had previously cached in a cloud-based session. Even worse, I discovered that a fallback trigger in my orchestration layer had sent a summarized snippet of my private data to a cloud API because the local inference node had a momentary timeout.

If you're building AI agents that touch real-world data, the "happy path" is usually just a prompt and an API key. The reality is a minefield of data leaks, prompt injections, and silent failures that send your private keys or bank statements to a third-party server because a local GPU pod decided to restart.

This is a problem for anyone running autonomous agents that have read or write access to a local filesystem. If your routing logic is flawed, your privacy isn't a policy; it's a coin flip.

The Wrong Way: Trusting the Orchestrator

My first attempt at "privacy" was naive. I used a simple conditional in my agent's logic: if the query contained words like "bank," "password," or "private," route it to a local Ollama instance. Otherwise, send it to GPT-4o.

This failed immediately for three reasons. First, keyword filtering is a joke. A user (or a prompt injection) can easily bypass "bank" by asking about "financial liquidity instruments." Second, I assumed the orchestrator was a neutral party. In reality, the orchestrator often handles the context window, meaning the sensitive data is already in the prompt before the routing decision is even made. Third, I had no fail-safe. When the local model timed out, the system defaulted to the cloud provider to ensure "high availability." In a privacy-first system, unavailability is better than exposure.

I also hit a wall with tool access. I had disabled sandbox.mode to let my agents actually do work, but I quickly found that built-in tools like read and edit can be manipulated to bypass exec allowlists. I saw a specific instance where a prompt injection convinced the agent to use a read-chunk command (a hidden utility in some knowledge base scripts) to dump raw data from a file that should have been summarized first.

The Actual Solution: Two-Tier Privacy Routing

The only way to actually guarantee privacy is to move the routing logic as close to the data as possible and treat the cloud LLM as an untrusted guest. I implemented a two-tier architecture: a local "Privacy Gate" and a reference-only knowledge base.

1. The Reference-Only Knowledge Base

Instead of feeding raw files to the LLM, I use a system where the LLM never sees the original document. I use poppler-utils for PDF extraction and a local embedding model to populate a Qdrant vector store. The agent queries the vector store, but the results are filtered through a local script before being sent to any inference engine.

2. The Privacy Gate (Routing Layer)

I wrote a wrapper, knowledge.sh, that handles the routing. It doesn't rely on keywords. It relies on the data source. If the data comes from a "Sensitive" tagged volume in my cluster, the request is hard-pinned to the local GPU node.

Here is a simplified version of how I handle a private query:

#!/bin/bash
# knowledge.sh query - Local-first routing

QUERY=$1
MODEL="qwen2.5:14b"
# The local endpoint is a dedicated GPU node in my K8s cluster
LOCAL_ENDPOINT="http://ollama-gpu-node.internal/v1/chat/completions"

# Check if the query requires sensitive data access
if [[ "$QUERY" == *"--private"* ]]; then
    echo "Routing to local inference..."
    # We use a local model and a local endpoint. No cloud fallback.
    curl -X POST "$LOCAL_ENDPOINT" \
         -H "Content-Type: application/json" \
         -d "{
           \"model\": \"$MODEL\",
           \"messages\": [{\"role\": \"user\", \"content\": \"$QUERY\"}],
           \"stream\": false
         }"
else
    # Non-sensitive queries can go to the cloud orchestrator
    ./route-to-cloud.sh "$QUERY"
fi

3. Hardening the Execution

To prevent the "hallucination via missing data" problem, I stopped letting the LLM handle the final delivery of sensitive reports. I use a pattern where the LLM generates a template or a summary, but a local Python script handles the actual data insertion and delivery.

For my daily briefings, I use a wrapper script that ensures the data collection is isolated from the cloud inference:

#!/bin/bash
# life-briefing-run.sh

# 1. Collect raw data locally (Private)
./daily-briefing.sh --collect-only

# 2. Format the data using a local script (No LLM involved here)
# This prevents the LLM from accidentally leaking raw data in its output
python3 /opt/scripts/format-and-send-briefing.py

And the Python script handles the delivery via a secure API (like Telegram) without ever sending the raw content to a third-party LLM for "polishing":

import json
import requests

def send_telegram_message(message):
    # Tokens are managed via SealedSecrets in K8s
    bot_token = 'ANONYMIZED_TOKEN'
    chat_id = 'ANONYMIZED_ID'
    url = f'https://api.telegram.org/bot{bot_token}/sendMessage'
    payload = {
        'chat_id': chat_id,
        'text': message,
        'parse_mode': 'Markdown'
    }
    requests.post(url, json=payload)

# Load the locally generated briefing
with open('/tmp/briefing.txt', 'r') as f:
    content = f.read()
    send_telegram_message(content)

Why This Works

This approach works because it removes the "decision" from the LLM. If you ask an LLM "Should I send this to the cloud?", it will eventually say yes. By moving the routing to a bash wrapper and a Python script, the logic is deterministic.

The use of a local model like qwen2.5:14b via Ollama provides enough reasoning capability to summarize private data without needing the massive parameter counts of GPT-4. I've found that for most RAG (Retrieval-Augmented Generation) tasks, a 14B model is the sweet spot between performance and the VRAM limits of my GPU nodes.

By separating the synthesis (LLM) from the delivery (Python script), I've created a circuit breaker. Even if the LLM is compromised via prompt injection, it cannot "leak" the data to the cloud because it doesn't have the API keys for the cloud provider; those are held by the orchestrator, which is gated by the knowledge.sh script.

For those managing the underlying hardware, ensuring these local models stay performant requires a stable infrastructure. I've written about how I handle GPU passthrough on Proxmox and why the NVIDIA Container Toolkit is non-negotiable for this to work in a Kubernetes environment.

Lessons Learned

The biggest surprise was how often "convenience" features in agent frameworks are actually security holes. For example, I found that sessionKey in some cron-job implementations is often misunderstood. I assumed it provided hard isolation, but it turns out it's often just a routing hint. To get actual isolation, you have to explicitly set the session to isolated, or you risk your private data bleeding into the "main" session context, which might be shared with a cloud-connected agent.

Another gotcha was the Qdrant MCP. I hit several "Not existing vector name" errors during the rollout. This wasn't a bug in my code but a version mismatch between the MCP server and the Qdrant instance. In a bare-metal K8s setup, pinning your versions is the only way to avoid waking up to a broken pipeline.

If I were to do this again, I'd implement a more formal "Taint and Toleration" system in Kubernetes. I'd taint my GPU nodes with privacy=high and only allow pods with the corresponding toleration to run there. This would prevent a non-private, cloud-connected pod from ever being scheduled on the same physical hardware where my sensitive local models are processing data in memory.

For those looking to scale this into a professional environment, this kind of architecture is a core part of what I do in AI agent and infrastructure consulting. Moving from a "it works on my machine" script to a production-grade, privacy-routed pipeline is where most of the complexity lives.

The takeaway is simple: if the data is sensitive, the cloud is a liability. Build your gate, pin your models, and never let your LLM decide where your data goes.

Tailscale Subnet Routers: Accessing Your LAN Without the VPN Headache

Guatu — Thu, 14 May 2026 02:15:07 +0000

I spent three hours trying to SSH into a legacy industrial gateway from a coffee shop, only to realize I'd forgotten to install the Tailscale agent on that specific piece of hardware. The device was a locked-down firmware image where "installing a binary" isn't an option. That's the moment I stopped trying to put Tailscale on every single node and instead shifted to a dedicated subnet router.

If you have a multi-node Proxmox cluster, a rack of IoT sensors, or a bunch of "dumb" switches and PDUs, you can't possibly install a client on everything. You need a way to tell your Tailnet: "If you're looking for anything in the 10.0.0.x range, just send the traffic to this specific Linux box, and it'll handle the rest."

The Concept: Routing vs. Agent-based Access

Standard Tailscale is a mesh. Every device is a peer. This is great for your laptop and your primary workstation, but it's a nightmare for infrastructure. A subnet router turns a single node into a gateway. It acts as a bridge between the encrypted WireGuard mesh and your local physical network.

The magic here is that the devices on your LAN don't even know Tailscale exists. They just see traffic coming from the subnet router's local IP. You get the security of a private mesh without having to touch the network configuration of your legacy gear or your Kubernetes pods.

Implementation: The "Happy Path" and the Reality

The official docs make this look like a one-line command. While that's technically true, if you're running this on a production-grade homelab or a bare-metal node, there are a few kernel-level requirements that usually get glossed over.

First, you have to enable IP forwarding. If the Linux kernel isn't allowed to pass packets between interfaces, your subnet router is just a fancy wall.

# Enable IPv4 forwarding immediately
sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1

# Make it persist across reboots
echo "net.ipv4.ip_forward = 1" | sudo tee -a /etc/sysctl.d/99-tailscale.conf
echo "net.ipv6.conf.all.forwarding = 1" | sudo tee -a /etc/sysctl.d/99-tailscale.conf
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf

Once the kernel is ready, you bring Tailscale up. I've found that explicitly forcing the kernel TUN interface is safer than relying on the default, especially if you've experimented with userspace networking in the past. Userspace networking (--tun=userspace-networking) is a death sentence for subnet routing; it simply won't work because the OS doesn't see the interface as a routable device.

# Start Tailscale as a subnet router for a specific range
# Replace 10.0.0.0/24 with your actual local subnet
sudo tailscale up --tun=kernel --advertise-routes=10.0.0.0/24

After running this, you still aren't connected. You have to go into the Tailscale Admin Console and manually approve the routes. This is a security feature to prevent a compromised node from suddenly hijacking all traffic for your entire network.

The Kubernetes and Gateway Trap

If you're running your subnet router inside a container or as part of a larger orchestration layer, you'll likely run into the gateway.bind issue. I hit this while integrating a gateway with some Kubernetes services.

When using tools like OpenClaw or custom wrappers, the default binding often fails because the application tries to bind to an interface that isn't actually the LAN. If your config looks like a generic default, you'll see the node is "online" in the dashboard, but you can't ping anything on the local subnet.

You need to explicitly tell the gateway to bind to the LAN interface. In the JSON config, it looks like this:

{
  "gateway": {
    "bind": "lan"
  }
}

Without this, the traffic often loops back or hits a dead end in the container network. This is similar to the networking headaches I've dealt with regarding DNS resolution, like the Wildcard DNS and ndots:5 nightmare, where the system thinks it knows where to go but the underlying routing logic is fundamentally flawed.

Turning it into an Exit Node

A subnet router lets you reach your home from the outside. An exit node lets you send all your internet traffic through your home from the outside. It's the difference between "I want to see my Proxmox UI" and "I'm on public WiFi and I want to pretend I'm at home for security."

To do this, add the --advertise-exit-node flag:

sudo tailscale up --advertise-routes=10.0.0.0/24 --advertise-exit-node

Here is where most people get stuck. You'll enable the exit node, select it in your client, and then realize you have zero internet access. The packets are reaching your router, but they aren't being NAT'd back out to the web. You need a MASQUERADE rule in your iptables to handle the translation.

# Replace eth0 with your actual primary network interface
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

If you're using a modern distro with nftables, you'll need the equivalent rule there. If you don't do this, the return packets from the internet don't know how to get back to the Tailscale client because they're coming from a virtual IP range the rest of your network doesn't recognize.

The Advanced Headache: NAT-PMP and Network Namespaces

In some complex environments, specifically when dealing with certain industrial gateways or strict NAT setups, you'll encounter NAT-PMP misrouting. This happens when Tailscale tries to be too smart about the local network and accidentally routes requests to a remote subnet router instead of the local one.

The fix is ugly, but it works: isolate the Tailscale traffic into its own network namespace (netns). This prevents the daemon from interfering with the host's primary routing table in ways that cause loops.

# Create a dedicated namespace for Tailscale
ip netns add tailscale
ip link add veth0 type veth peer name veth1
ip link set veth0 netns tailscale

# Assign an IP to the virtual interface
ip addr add 10.0.0.2/24 dev veth1
ip link set veth1 up

# Run the daemon inside the namespace
ip netns exec tailscale tailscaled

This is overkill for 90% of homelabbers, but if you're building out automated infrastructure with OpenTofu and deploying these routers across multiple sites, you'll want this kind of isolation to ensure stability.

Comparison: Subnet Routers vs. Traditional VPNs

I've run OpenVPN and WireGuard (manual) for years. Here is the honest breakdown of why I switched to Tailscale subnet routers for my remote access.

Feature	Traditional VPN (OpenVPN/WireGuard)	Tailscale Subnet Router
Setup	Manual certs, port forwarding, firewall rules	Zero-config NAT traversal, OAuth
Client Mgmt	Distributing `.ovpn` or `.conf` files	Log in with SSO/Identity provider
Routing	Manual static routes on clients	Centralized route management in console
Maintenance	Updating keys, managing IP pools	Automatic key rotation, managed IPs
Performance	High (if tuned correctly)	High (WireGuard based)

The tradeoff is the "phone home" aspect. Tailscale's coordination server knows which nodes are online. For most of us, that's a fair price to pay to avoid spending a Saturday morning debugging why a UDP port isn't opening on a residential ISP.

Gotchas and Lessons Learned

If you're setting this up, watch out for these three things:

The "Double-Hop" Latency: If you use a subnet router and then an exit node on a different machine, your traffic is bouncing across your network multiple times. It's fine for SSH, but terrible for VoIP or gaming. Keep your subnet router and exit node on the same high-performance machine if possible.
DNS Leaks: Just because you can route to 10.0.0.x doesn't mean your DNS is working. You'll still be typing IPs unless you configure "MagicDNS" or set up a global nameserver in the Tailscale admin panel that points to your internal DNS (like AdGuard Home).
Fail-Closed Policies: By default, if your subnet router goes down, you lose access to the entire LAN. If this is for a production environment, I highly recommend setting up two subnet routers in different failure domains. Tailscale doesn't do "automatic failover" in the traditional sense, but you can have multiple nodes advertising the same route.

Final Thoughts

The subnet router is the only sane way to manage remote access to a complex lab. It separates the "connectivity" layer from the "device" layer. You don't need to care if your old NAS doesn't support WireGuard or if your industrial PLC has a proprietary OS. You just need one stable Linux box with ip_forwarding enabled and a couple of iptables rules.

Reach for this technique the moment you find yourself saying, "I wish I could just SSH into this thing without having to install a client on it." If you're looking to scale this into a larger professional setup, feel free to check out my infrastructure consulting services for help with AI agent orchestration or bare-metal networking.

PCIe Device Passthrough: NIC Name Instability and MAC Pinning

Guatu — Fri, 08 May 2026 04:15:19 +0000

My Proxmox node rebooted, and suddenly the host was unreachable via SSH. I had to plug in a physical monitor and keyboard only to find that my primary network interface, which had been enp4s0 for months, had decided to rename itself to enp5s0.

Because my /etc/network/interfaces file was explicitly tied to enp4s0, the bridge didn't come up, the IP wasn't assigned, and I was locked out of my own hardware.

What I expected

I expected the Linux kernel to consistently enumerate my PCIe devices. In a static hardware environment where nothing has moved, the PCI bus address should be deterministic. If the NIC is plugged into the same slot and the BIOS hasn't changed, enp4s0 should stay enp4s0 forever. This is the "happy path" most documentation assumes.

What actually happened

The reality is that PCIe enumeration is not always a constant. I'm using a mix of onboard NICs and a PCIe expansion card. I also have a GPU passed through to a VM.

The surprise here is how the kernel's predictable network interface naming (systemd-udevd) interacts with the PCIe topology. When I added a new PCIe device and tweaked some BIOS settings for IOMMU, the way the kernel mapped the physical slots to the virtual naming changed. A slight shift in how the PCIe switch reported the devices caused the index to jump.

This isn't just a "one-time fluke." If you're running a multi-node cluster or using GPUs that might move addresses (something I've documented before in GPU PCI Address Instability), you'll find that the kernel is surprisingly flexible with where it puts things.

The root cause is that enp4s0 is a name derived from the PCI location. If the location changes—even by one digit—the name changes. If your network config depends on that name, your system is one reboot away from a blackout.

The Fix: MAC Pinning

The only way to stop this is to stop relying on the PCI slot location and start relying on the hardware's unique identifier: the MAC address.

I decided to use systemd .link files. This allows me to tell the kernel: "I don't care where this device is on the PCIe bus; if it has this MAC address, call it eth0."

1. Identify the MAC address

First, I had to find the actual MAC of the problematic NIC while I had console access.

ip link show

I looked for the interface that was currently named enp5s0 (the "wrong" name) and copied the link/ether value.

2. Create the .link file

I created a custom link file in /etc/systemd/network/. I chose the name 10-lan.link to ensure it loads early in the boot process.

# /etc/systemd/network/10-lan.link
[Match]
MACAddress=00:11:22:33:44:55

[Link]
Name=eth0

(Note: I've anonymized the MAC address above. Use your actual hardware MAC here.)

3. Update the network configuration

Once the interface is pinned to eth0, I had to update the Proxmox network configuration to match. I edited /etc/network/interfaces to replace the volatile enp4s0 with the stable eth0.

# Example snippet from /etc/network/interfaces
auto eth0
iface eth0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.x/24
    gateway 10.0.0.1
    bridge-ports eth0
    bridge-stp off
    bridge-fd 0

4. Apply and verify

I ran systemd-networkd-restart (or just rebooted, since I was already at the console) and verified the name with ip a. The NIC was now consistently eth0, regardless of whether the PCIe bus shifted.

Why this matters

If you're just running a single VM on a desktop, this is a minor annoyance. But if you're building a production-grade homelab, this is a critical failure point.

You'll hit this specifically in these scenarios:

Adding/Removing PCIe Hardware: Adding a new NVMe drive or a GPU can shift the enumeration of other devices on the same root complex.
BIOS Updates: A BIOS update often resets PCIe lane bifurcation or IOMMU settings, which can completely reorder how the kernel sees your NICs.
Using PCIe Switches: Some high-end motherboards or riser cables use PCIe switches that can report different topologies depending on the power state of the devices.

The Tradeoff

The tradeoff here is that you're moving away from the "modern" predictable naming convention back to the "old" ethX style. Some people find eth0 ugly or outdated, but in a headless server environment, "ugly" is better than "unreachable."

I've also seen people try to fix this using udev rules in /etc/udev/rules.d/. While that works, .link files are the native systemd way to handle this and are generally cleaner to maintain.

Lessons Learned

The biggest lesson here is that documentation for Proxmox and Debian assumes your hardware topology is a constant. It isn't.

When you're doing complex things like PCIe passthrough—which I've detailed in my GPU Passthrough Gotcha Guide—you are intentionally messing with the PCI bus. You're telling the host kernel to ignore certain devices so the VM can claim them. This volatility is a side effect of that power.

If you are passing through NICs or GPUs, do not trust the default interface names. Pin your critical management interfaces to their MAC addresses immediately. It takes five minutes to set up and saves you from a midnight trip to the server rack because a reboot decided your network card now lives at enp6s0.

For those of you managing larger fleets or complex AI agent infrastructure, this kind of hardware-level stability is the foundation. You can't build a reliable multi-agent AI pipeline if the underlying Kubernetes worker nodes are randomly losing their network identity.

Next time you're configuring a new node, don't just copy the enpXsX name from the GUI. Take the extra step to pin it. Your future self will thank you when the next BIOS update doesn't break your entire cluster.

GPU PCI Address Instability: When Your Card Moves Between Reboots

Guatu — Thu, 07 May 2026 00:15:04 +0000

I spent an entire afternoon debugging a VM that refused to boot, only to find out my GPU had decided to change its PCI address. One reboot and the device that lived at 01:00.0 suddenly migrated to 02:00.0. Because my Proxmox VM configuration was pinned to the old address, the VM crashed with a QEMU assertion error, and the GPU simply vanished from the guest.

This usually happens because of how the BIOS handles PCIe enumeration during POST. If you have multiple PCIe devices or a complex motherboard topology, the bus numbering isn't always deterministic. This is compounded by AMD Ryzen C-states or weird UMA frame buffer settings that can delay device initialization, causing the kernel to assign addresses in a different order than the previous boot. If you've already dealt with AMD iGPU RAM theft, you know how sensitive these BIOS settings are.

If you're on Proxmox 8.4+, the "happy path" is to use the q35 machine type. The older i440fx is more prone to these PCI mapping failures and IRQ conflicts. I also found that preventing the card from entering deep power states helps avoid the "zombie GPU" scenario where the card is physically there but logically dead.

To stabilize this, I switched the VM to q35 and explicitly enabled PCIe mode for the passthrough device. I also added a kernel parameter to stop the CPU from entering deep sleep states, which I've found reduces the randomness of the PCIe bus scan.

# 1. Change VM to q35 machine type for better PCIe support
qm set <VMID> --machine q35

# 2. Pass through the GPU with pcie=1 to ensure it's treated as a PCIe device
# Replace <PCI_ADDRESS> with your current address (e.g., 0000:01:00.0)
qm set <VMID> -hostpci0 <PCI_ADDRESS>,pcie=1

# 3. To stop the GPU from entering D3cold (which can cause boot-time instability)
# Run this on the Proxmox host
echo 0 > /sys/bus/pci/devices/0000:<PCI_BUS>:<PCI_SLOT>.0/d3cold_allowed

If the addresses keep shifting despite these changes, you're fighting your motherboard's firmware. At that point, I stopped fighting the VM abstraction and moved the NVIDIA drivers directly onto the Proxmox host. I then used the NVIDIA Container Toolkit to expose the GPU to my Kubernetes worker. It removes the PCI address fragility entirely because the host driver handles the hardware mapping, and the containers just see the device.

The lesson here is that PCI addresses are not constants; they are suggestions. If your workload requires 100% uptime and you can't guarantee a static PCI map, stop using VM passthrough and move the driver to the host.

Cognitive Memory for Agents: Vector Search vs Activation-Based Recall

Guatu — Wed, 06 May 2026 22:15:04 +0000

I spent a few weeks trying to build an agent that could remember specific user preferences across sessions without bloating the context window to a point where latency became unbearable. The standard advice is always "just use a vector database." But as the memory store grew, I noticed a weird gap: the agent could find a document about "user prefers dark mode" via cosine similarity, but it couldn't "recall" the immediate emotional state or the nuance of the last three turns of conversation unless they were explicitly mirrored in the embedding.

The problem is that vector search is a retrieval mechanism, not a cognitive memory system. When you move from simple RAG to actual agentic memory, you have to choose between external vector search and internal activation-based recall.

The Decision Point

You face this choice when your agent's "short-term" memory (the context window) is full, and your "long-term" memory (the database) is returning results that are mathematically similar but contextually irrelevant.

If you need your agent to remember a 500-page technical manual, you need a vector store. If you need your agent to exhibit a consistent "personality" or recall a specific pattern of behavior that isn't easily summarized into a string of text for an embedding model, you need something closer to activation-based recall.

Option A: Vector Search (The External Archive)

Vector search is the industry standard for a reason: it's easy to scale and the tooling is mature. You turn a piece of text into a vector using an embedding model (like text-embedding-3-small), shove it into a store like FAISS or Milvus, and query it with another vector.

Strengths:

Scale: You can store billions of vectors.
Cold Storage: It doesn't eat VRAM. It lives on disk or in a dedicated database.
Interpretability: I can literally query the database and see exactly which chunk of text was retrieved.

Weaknesses:

The "Semantic Gap": Cosine similarity is a blunt instrument. If a user says "That's not what I meant," a vector search might retrieve a passage about "meaning" or "intent" rather than understanding the correction.
Latency: You have to embed the query, hit the DB, and then stuff the results into the prompt.

Here is a basic implementation using FAISS. I use this for the "knowledge base" layer of my agents:

import faiss
import numpy as np

# Dimension depends on your embedding model (e.g., 1536 for OpenAI)
dimension = 128 
nb = 1000  # number of memory chunks
index = faiss.IndexFlatL2(dimension) 

# Mocking embeddings of agent experiences
vectors = np.random.random((nb, dimension)).astype('float32')
index.add(vectors) 

# Querying for the top 4 most similar memories
queries = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(queries, 4) 
print(f"Retrieved memory indices: {indices}")

Option B: Activation-Based Recall (The Internal Intuition)

Activation-based recall is more akin to how biological memory works. Instead of searching a database, the "memory" is stored in the weights or the hidden states of the model. In modern agent architectures, this often involves using activation hooks or specialized memory layers (like Memory Transformers) that allow the model to trigger a recall based on the current internal state of the network.

Strengths:

Speed: There is no external API call or DB lookup. The recall happens during the forward pass.
Nuance: It captures "how" something was said, not just "what" was said. It's an associative trigger rather than a keyword search.

Weaknesses:

The Black Box: Debugging this is a nightmare. You can't just "look" at the database to see why the agent recalled a specific memory.
VRAM Pressure: Storing these activations or maintaining a dynamic memory network consumes precious GPU memory.

I've experimented with simple activation hooks in PyTorch to track which "states" trigger certain behaviors. It's not a full-blown Memory Transformer, but it's a start:

import torch
from torch import nn

class AgentModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.memory_buffer = []

    def forward(self, x):
        # In a real system, this would be a specific layer's activation
        # that represents a 'concept' or 'state'
        activation = torch.tanh(x) 

        # Store the activation state for later recall/analysis
        self.memory_buffer.append(activation.detach().cpu().numpy())
        return activation

model = AgentModel()
input_tensor = torch.rand(1, 128)
output = model(input_tensor)
print(f"Stored state vector: {model.memory_buffer[-1]}")

Decision Framework

Criteria	Vector Search	Activation-Based Recall
Data Volume	Massive (TB+)	Small (MB to GB)
Retrieval Speed	Milliseconds (Network/Disk)	Microseconds (GPU)
Precision	Semantic/Keyword	Associative/Pattern
Debugging	Easy (Query the DB)	Hard (Analyze Tensors)
Resource Cost	CPU/Disk/API	VRAM/Compute

My Pick and Why

I don't pick one. I use a hybrid.

If you're building a production agent, relying solely on vector search leads to that "robotic" feeling where the agent repeats the same retrieved snippet regardless of the conversation flow. Relying solely on activations is a recipe for a system you can't debug when it starts hallucinating.

I implement a tiered system. I use a vector store for the "Library" (hard facts, documentation) and a sliding window of activations for the "Working Memory" (current mood, immediate goals, recent corrections). This mirrors the 6-layer memory architecture I've used for my own tools.

For those building multi-agent systems, I recommend offloading the vector search to a shared service and keeping the activation-based recall local to the agent's specific instance. This prevents the "shared memory" from becoming a noisy mess of conflicting embeddings. You can see how this fits into larger patterns in my post on multi-agent architecture patterns.

If you're still struggling with agents that forget things every five minutes, you might be hitting a safety loop. I've written about three-layer safety for autonomous agents which often solves the "infinite loop" problem that people mistake for a memory issue.

If you need help designing a memory architecture that doesn't melt your GPU or your budget, check out my AI agent consulting services.

Lessons learned:
The docs for vector DBs make it sound like they replace the need for cognitive memory. They don't. They replace the need for a filing cabinet. If you want an agent that actually "feels" like it's learning from a conversation in real-time, you have to move closer to the activations.

Vibration Monitoring Architecture: From Sensor to Dashboard

Guatu — Wed, 06 May 2026 16:15:04 +0000

The first time I tried to stream raw vibration data to a dashboard, I managed to crash my MQTT broker in under ten minutes. I had a high-frequency accelerometer spitting out samples at 5kHz, and I thought I'd just wrap those values in JSON and send them over the wire. The result wasn't a pretty graph; it was a series of Connection refused errors and a broker that had completely locked up under the weight of thousands of tiny packets per second.

If you're building a vibration monitoring system, you're not just dealing with "IoT data." You're dealing with signal processing. There is a massive difference between reporting a temperature every 30 seconds and capturing the harmonic frequencies of a motor bearing. If you treat vibration data like any other telemetry, your network will choke, your database will bloat, and your dashboards will be useless.

What I tried first (The wrong way)

My initial assumption was that the "modern stack" (Sensor $\rightarrow$ MQTT $\rightarrow$ Time Series DB $\rightarrow$ Grafana) would handle everything. I used a cheap industrial sensor that output raw voltage via a 4-20mA loop, fed into a PLC, which then pushed data to a Python script on a Raspberry Pi.

I wrote a simple loop that read the sensor and published to a topic:

# DO NOT DO THIS
while True:
    val = sensor.read() 
    client.publish("factory/machine1/vibration", json.dumps({"value": val}))

I quickly hit three walls:

Network Saturation: Sending one MQTT packet per sample is an architectural sin. The overhead of the TCP/IP stack and MQTT headers is larger than the actual payload. I was spending 90% of my bandwidth on headers.
Database Explosion: InfluxDB is great, but inserting 5,000 points per second per sensor is a recipe for a disk space crisis. My cardinality exploded, and queries that should have taken milliseconds started taking 30 seconds.
The "Noise" Problem: The raw data was a jagged mess. I couldn't see the actual vibration patterns because the high-frequency electrical noise from the nearby VFDs (Variable Frequency Drives) was masking the mechanical signal.

I realized that the gap between the sensor and the dashboard isn't a straight line. It's a funnel. You have to aggressively reduce the data volume at the edge before it ever touches the network.

The Actual Solution: The Edge-Heavy Pipeline

To make this work, I shifted the intelligence to the edge. The goal is to move from "streaming raw samples" to "streaming features." Instead of sending every single point, I calculate the RMS (Root Mean Square), Peak-to-Peak, and FFT (Fast Fourier Transform) bins locally.

1. Signal Conditioning and Edge Processing

I moved the processing to a dedicated edge gateway. I used a Python-based service that buffers samples in memory, applies a digital filter to remove electrical noise, and calculates the metrics.

Here is the implementation of the signal conditioning and feature extraction:

import numpy as np
from scipy.signal import butter, filtfilt
import paho.mqtt.client as mqtt
import time

# Configuration for a 10kHz sampling rate
FS = 10000 
CUTOFF = 2000 # Remove noise above 2kHz
ORDER = 4

def butter_lowpass_filter(data, cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return filtfilt(b, a, data)

def calculate_features(buffer):
    # Filter the raw signal to remove high-frequency noise
    filtered = butter_lowpass_filter(buffer, CUTOFF, FS, ORDER)

    # Calculate RMS - the primary indicator of overall vibration level
    rms = np.sqrt(np.mean(filtered**2))

    # Calculate Peak-to-Peak
    ptp = np.ptp(filtered)

    # Perform FFT to find the dominant frequency
    fft_vals = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), 1/FS)
    dominant_freq = freqs[np.argmax(fft_vals)]

    return {
        "rms": float(rms),
        "ptp": float(ptp),
        "dom_freq": float(dominant_freq)
    }

# Main loop: Buffer 1000 samples, then send 1 summary packet
client = mqtt.Client()
client.connect("mqtt-broker.example.com", 1883)

buffer = []
while True:
    val = read_sensor_raw() # Mock function for ADC read
    buffer.append(val)

    if len(buffer) >= 1000:
        features = calculate_features(buffer)
        # Send summary instead of 1000 raw points
        client.publish("iiot/machine1/vibration/features", str(features))
        buffer = [] # Clear buffer

2. The Transport Layer (MQTT 5.0)

For the broker, I shifted from a basic Mosquitto setup to a more controlled configuration. Since vibration data is critical for predictive maintenance, I needed to ensure that the "heartbeat" of the machine was always known.

I used MQTT 5.0 "Will Messages" to detect if a gateway went offline. If the gateway crashes, the broker immediately publishes a "disconnected" status to the health topic, so the dashboard doesn't just show a flat line (which could be mistaken for a stopped machine).

# mosquitto.conf snippet
listener 1883
allow_anonymous false
password_file /etc/mosquitto/passwd
# Prevent the broker from being overwhelmed by slow consumers
max_queued_messages 1000

I've written more about choosing the right broker in my MQTT Broker Selection post, but for vibration, the priority is low latency and high reliability over massive scale.

3. Storage and Visualization

I used InfluxDB 2.x for storage because of its native handling of time-series data. Instead of storing the raw waveform, I store the calculated features. This reduces the storage requirement by 1000x.

In Grafana, I set up a dashboard that monitors the RMS value. However, looking at a raw line graph of vibration is usually useless for operators. They don't know if 0.5g is "bad" or "normal."

I integrated this with a health scoring system. I used a Flux query in InfluxDB to compare the current RMS against a baseline (the average of the last 7 days).

// InfluxDB Flux Query for Relative Vibration
from(bucket: "iiot_data")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "vibration_sensor")
  |> filter(fn: (r) => r["_field"] == "rms")
  |> aggregateWindow(every: 1m, fn: mean)
  |> map(fn: (r) => ({ r with value: r._value / 0.15 })) // Normalize against threshold 0.15g

This feeds directly into the concept of Equipment Health Scoring, where the goal is to give the operator a single "Health %" rather than a complex spectrum analysis.

Why this architecture works

The reason this works is that it respects the laws of physics and networking.

The Nyquist-Shannon Theorem tells us we need to sample at twice the frequency of the signal we want to capture. If you want to detect a bearing fault at 2kHz, you must sample at 4kHz+. Trying to do this over WiFi or Ethernet using standard JSON-over-MQTT is impossible because the packet overhead kills the throughput.

By calculating the RMS and FFT at the edge, we are performing Data Reduction. We transform a high-bandwidth signal (time domain) into a low-bandwidth set of descriptors (frequency domain).

The edge processing also acts as a mechanical filter. By using a Butterworth low-pass filter, I can strip out the 60Hz hum from the power lines and the high-frequency spikes from the VFDs. If you do this in the cloud, you've already wasted the bandwidth sending noise.

Lessons learned and caveats

If I had to build this again, I'd change a few things:

1. Hardware-level filtering: I spent too much time in Python trying to fix signal noise. In a real industrial environment, you should use an analog anti-aliasing filter (a physical capacitor/resistor circuit) before the signal ever hits the ADC. Software filters are great, but they can't fix aliasing if the signal was already corrupted during sampling.

2. The "Buffer" Trap: My Python script used a simple list for the buffer. At very high sampling rates, Python's list appending becomes slow. I had to switch to numpy arrays with pre-allocated memory to avoid garbage collection pauses that caused gaps in the data.

3. Provisioning the Edge: Managing these Python scripts across five different gateways was a nightmare. I eventually moved the deployment to a GitOps flow, using OpenTofu and GitHub Actions to manage the underlying VM configurations on my Proxmox cluster, ensuring every gateway had the exact same version of scipy and numpy.

4. The Dashboard Paradox: The more data I put on the dashboard, the less the operators used it. The final version of the system only shows three things: a Green/Yellow/Red light for health, the current RMS value, and a "Time to Maintenance" estimate. Everything else (the FFT bins, the raw waveforms) is hidden in a "Deep Dive" tab that only the reliability engineer ever opens.

Vibration monitoring is a classic example of where "more data" is actually "less information." The value isn't in the sensor; it's in the reduction process that happens between the sensor and the screen.

Unprivileged LXC + Docker: The runc Sysctl Permission Trap

Guatu — Tue, 05 May 2026 00:15:20 +0000

sysctl: setting key "net.ipv4.ip_local_port_range": Permission denied

I saw this error while trying to tune the network stack for a high-concurrency service running in Docker, which itself was hosted inside an unprivileged LXC container on Proxmox. The weird part? I was root inside the container.

I expected that since I had already enabled nesting=1 and keyctl=1 in the LXC configuration, Docker would have the necessary permissions to modify kernel parameters via runc. In a standard VM, this is trivial. In a privileged container, it just works. But in an unprivileged container, the user namespace mapping creates a wall that runc cannot climb.

What actually happened is a collision between systemd (v243+), runc, and the Linux kernel's security model for unprivileged user namespaces. When you run an unprivileged LXC, the root user inside the container is actually a non-privileged user on the Proxmox host (usually UID 100000).

The kernel prevents these mapped users from modifying sysctl settings because those settings are often global or namespace-specific in ways that could allow a container to crash the host or leak information. runc, the runtime Docker uses, tries to apply these settings during container creation, but the kernel returns a permission denied error. Because of how some Docker versions handle this, the error is sometimes swallowed, and your app just runs with the wrong defaults.

If you're building a production-grade homelab, you probably don't want to just switch to a privileged container. That's a security nightmare. Instead, you have to move the configuration "up" the chain.

The fix is to apply the sysctl settings at the LXC level before the container fully initializes, or directly on the host if the parameter isn't namespaced. Since we want to keep the host clean, using an LXC pre-start hook is the cleanest way to inject these settings.

On the Proxmox host, you can add a hook to the container's configuration file (usually in /etc/pve/lxc/ID.conf).

# Add this to your LXC .conf file on the Proxmox host
lxc.hook.pre-start = /usr/bin/echo "net.ipv4.ip_local_port_range = 1024 65535" >> /etc/sysctl.d/99-lxc.conf

However, for most users, the most reliable method is to define the parameter in the host's sysctl.conf if it's a global setting, or use the lxc.sysctl directive in the config file:

# Example Proxmox LXC config snippet
arch: amd64
cores: 2
memory: 2048
net0: name=eth0,bridge=vmbr0,ip=10.0.0.x/24,gw=10.0.0.1
ostype: ubuntu
unprivileged: 1
features: nesting=1,keyctl=1
# Inject the sysctl here
lxc.sysctl.net.ipv4.ip_local_port_range = 1024 65535

After adding this, you have to restart the container. If you just restart the Docker daemon inside the LXC, the kernel parameter won't update because the LXC boundary is where the restriction lives.

This trap is common when you're trying to optimize networking or memory management (like vm.max_map_count for Elasticsearch) inside a nested environment. If you've dealt with the headache of GPU passthrough on Proxmox, you know that the gap between "it's a container" and "it's an unprivileged container" is where most of the pain lives.

One last thing to watch out for: UID shifts. If you're mounting NFS shares into these containers to provide storage for your Docker volumes, you'll hit the UID mismatch. The container thinks it's root (UID 0), but the host sees UID 100000. I've spent hours debugging "Permission Denied" on volumes only to realize I needed to chmod 0777 the host directory or properly map the IDs in the .conf file.

If you're scaling this into a larger cluster, I highly recommend moving these workloads to bare-metal Kubernetes. I wrote about my experience with Longhorn for bare-metal storage, and while the initial setup is heavier than an LXC, you stop fighting the Proxmox container permission war and start dealing with standard K8s primitives.

AdGuard Home: Network-Wide DNS Filtering with Failover

Guatu — Mon, 04 May 2026 22:15:20 +0000

DNS is the single point of failure that makes everyone in the house complain that "the internet is down" when, in reality, your DNS container just crashed. I've spent too much time as the sole admin of my network having to manually flip DNS settings on my router because a single AdGuard Home instance decided to stop responding. If you're running this in a homelab, you can't just set it and forget it. You need a failover strategy that doesn't require you to touch a CLI while your family is staring at you.

The mistake most people make is trusting the default upstream behavior. They add three upstream servers and assume AdGuard Home will magically route around a dead one instantly. In practice, depending on your version and config, you can still hit timeouts that feel like a total outage. I've moved my setup to a Kubernetes deployment using MetalLB to give it a static IP, but the real win is the explicit failover logic in the adguard-home.yaml.

I prefer using a combination of Cloudflare and Quad9 for the primary upstreams, with a dedicated fallback. This ensures that if my primary DNS providers have a routing issue, the system pivots to a tertiary option without dropping the request.

# adguard-home.yaml snippets
upstream_dns:
  - "1.1.1.1"
  - "1.0.0.1"
  - "9.9.9.9"

dns:
  # Use parallel requests to find the fastest response
  upstream_mode: parallel 

failover:
  enabled: true
  health_check_interval: 30
  health_check_timeout: 10
  fallback_upstream: "8.8.8.8"

For those running this on K8s, don't skimp on memory limits. I initially set my memory request too low and saw the OOM killer terminate the pod every time I updated a large blocklist. I now pin my resources to ensure stability, especially when integrated with cert-manager for automated TLS to secure the dashboard.

helm install adguard-home k8s-at-home/adguard-home \
  --namespace network \
  --create-namespace \
  --set image.tag=latest \
  --set resources.limits.memory=1Gi \
  --set resources.requests.memory=256Mi

The biggest lesson here is that "high availability" for DNS isn't just about having two pods. It's about how the system handles the gap between a server being "up" and a server actually returning a valid record. If you're building out larger infrastructure, I've found that combining this with a strict manifest validation pipeline prevents the kind of YAML typos that can take your entire network offline.

Keep your upstreams diverse and your memory limits realistic.

Three-Layer Safety for Autonomous Agents: Stopping the Infinite Loop

Guatu — Thu, 30 Apr 2026 22:15:29 +0000

I watched an autonomous agent spend three hours and 40,000 tokens trying to close a GitHub issue that had an open dependency, only to fail because it kept hallucinating a force_close flag that didn't exist in the API. It didn't just fail; it entered a perfect infinite loop: it would call the tool, get a 400 error, interpret the error as a "temporary network glitch," and try again with the exact same payload.

If you've built agents that actually touch production systems, you know this feeling. Prompting the agent to "be careful" or "follow the schema" is a placebo. When you move from a chat window to an autonomous loop, the gap between the LLM's intent and the system's reality becomes a canyon where agents go to die (and burn through your API credits).

For anyone running agent orchestration in a homelab or production environment, you need a safety architecture that doesn't rely on the model's "good behavior." I've moved to a three-layer safety model: Token-Level Enforcement, Pre-Execution Gates, and Execution Isolation.

What I tried first

My first instinct was to lean heavily on PydanticAI. The idea of using Pydantic for type-safe tool calling seemed like the silver bullet. I spent a week building out complex schemas, thinking that if the code validated the output, the agent would simply "learn" to provide the correct format.

I was wrong. I hit a wall where the agent would produce a JSON object that was almost correct, but it would miss a closing brace or add a trailing comma. Pydantic would throw a ValidationError, the agent would see that error in its history, and then it would attempt to "fix" the JSON by adding even more commentary around the code block. This created a feedback loop of ValidationError $\rightarrow$ Apology $\rightarrow$ Broken JSON.

Then I tried adding a "supervisor" agent to review the actions of the "worker" agent. This just doubled my latency and doubled my token cost without actually solving the root cause. The supervisor often hallucinated the same API capabilities as the worker because they were using the same base model.

The real problem wasn't the logic; it was the lack of deterministic boundaries. I was treating the LLM as a reliable software component when it's actually a probabilistic engine. To make it safe, I had to stop trying to "convince" the model to be safe and start forcing it to be safe at the infrastructure level.

Layer 1: Token-Level Schema Enforcement

The first layer of safety happens before the agent even finishes its sentence. If you're using Ollama v0.5.0 or newer, you can stop relying on the model to "try its best" with JSON.

Most people use the OpenAI-compatible API layer provided by frameworks, but that often just wraps the prompt in "Please return JSON." Ollama now supports a native format parameter that enforces the schema at the token-sampling level. This means the model physically cannot sample a token that violates the JSON schema.

Here is how I implemented this for my homelab health reports using qwen2.5:14b-instruct. I switched from the 32B model to the 14B variant because the 32B was causing 502 timeouts on my Tesla P40s due to VRAM pressure.

import httpx
from pydantic import BaseModel, Field

# Define the strict structure we want
class HomelabHealthReport(BaseModel):
    node_status: dict[str, str]
    critical_alerts: list[str]
    storage_utilization: float = Field(description="Percentage 0-100")

# Extract the JSON schema for Ollama
schema = HomelabHealthReport.model_json_schema()

def get_safe_report():
    # We bypass the high-level wrappers and hit the API directly
    # to ensure the 'format' parameter is actually passed.
    response = httpx.post(
        "http://ollama:11434/api/chat",
        json={
            "model": "qwen2.5:14b-instruct",
            "stream": False,
            "format": schema, # This is the magic: token-level enforcement
            "prompt": "Generate a health report for the homelab based on current metrics."
        },
        timeout=30.0
    )

    if response.status_code != 200:
        print(f"API Error: {response.status_code}")
        return None

    return response.json()["message"]["content"]

# Result is guaranteed to be valid JSON matching HomelabHealthReport

By moving the constraint to the sampler, I eliminated the ValidationError loops entirely. The model no longer "guesses" the JSON; it is constrained by the grammar of the schema.

Layer 2: The Pre-Execution Gate (ActionGate)

Even with perfect JSON, an agent can still decide to do something stupid. Token-level safety ensures the format is right, but it doesn't ensure the intent is safe.

I implemented an ActionGate. This is a deterministic middleware layer that sits between the agent's tool-call and the actual execution. It doesn't use an LLM. It uses hard-coded business logic and state checks.

If an agent tries to close a ticket, the ActionGate checks if there are open dependencies. If it tries to reboot a node, it checks if that node is currently the only one running a critical service.

class SafetyException(Exception):
    pass

def check_action_safety(action_name, params, context):
    """
    Deterministic safety check. 
    No LLMs allowed here.
    """
    # Prevent closing issues that have blocking dependencies
    if action_name == "close_issue":
        issue_id = params.get("issue_id")
        if context.get(f"issue_{issue_id}_has_dependency"):
            raise SafetyException(
                f"Safety Violation: Cannot close issue {issue_id} while dependencies are open."
            )

    # Prevent destructive actions on production nodes during peak hours
    if action_name == "reboot_node":
        node_id = params.get("node_id")
        if context.get("is_production") and context.get("peak_hours"):
            raise SafetyException(
                f"Safety Violation: Reboot of {node_id} forbidden during peak hours."
            )

    return True

# Usage in the agent loop
try:
    if check_action_safety(tool_call.name, tool_call.args, current_context):
        result = execute_tool(tool_call)
except SafetyException as e:
    # We feed the specific error back to the agent so it can pivot
    result = f"Action rejected by Safety Gate: {str(e)}"

This prevents the "infinite loop of failure" I mentioned earlier. Instead of the agent getting a generic 400 error from an API and thinking it's a network glitch, it gets a clear, human-readable explanation: "You cannot do this because X." This forces the agent to change its strategy rather than just retrying the same failed request.

Layer 3: Execution Isolation and Shell Safety

The final layer is where the rubber meets the road. I've spent too many hours debugging "quoting hell."

When you have an agent generating a command that needs to run over SSH, inside a Proxmox container (pct exec), as a specific user (su), and then executing a Python script, you have four layers of shell interpretation. If you use f-strings to build these commands, a single single-quote in the agent's output will break the entire pipeline.

I saw this happen when an agent tried to pass a complex JSON string as an argument to a script. The shell interpreted the quotes, the su command stripped another layer, and by the time it hit Python, the syntax was mangled.

The fix is to stop passing code as shell arguments. Instead, pipe the code directly into the stdin of the remote process.

The wrong way (prone to quoting errors):

# This will break the moment the agent adds a ' or " to the payload
ssh node-a "pct exec 101 -- su - user -c 'python3 -c \"print(\"Hello World\")\"'"

The right way (Shell-safe piping):
I wrote a helper that writes the agent's intended Python logic to a temporary file or pipes it directly. This avoids the shell's interpretation of the string entirely.

# We pipe the actual script content into the remote shell
cat ~/bin/helpers/scout-ideas-helper.py | \
  ssh node-a "pct exec 101 -- su - user -c 'python3 -'"

In this setup, python3 - tells Python to execute the code coming from stdin. The shell only sees the command to start Python, not the code itself. This completely eliminates the quoting nightmare.

To manage the tools themselves, I've moved away from custom boilerplate and started using FastMCP. It allows me to wrap my MSAM (Multi-Agent System Architecture) tools into a standardized server that the agents can discover and use without me having to manually update the tool definitions every time I add a new function. I've detailed the setup for this in my post on Building MCP Servers with FastMCP.

Why this works

This architecture works because it acknowledges that the LLM is the most unreliable part of the system.

Token-level enforcement removes the "formatting" problem. The agent can no longer fail because it forgot a comma.
The ActionGate removes the "logic" problem. The agent can no longer perform an action that is fundamentally unsafe, regardless of how confident it is.
Execution Isolation removes the "infrastructure" problem. The agent's output is treated as data (stdin) rather than as a command (shell argument).

When you combine these, you move from a system that is "mostly working" to one that is "predictably bounded."

Lessons Learned

The biggest surprise was how much the format parameter in Ollama reduced the need for complex prompt engineering. I spent weeks refining a "System Prompt" to ensure JSON compliance, only to find that a single API parameter did the job better than 500 words of instructions.

If I were to do this over again, I would have implemented the ActionGate much sooner. I spent too much time trying to make the agent "smarter" when I should have just made the environment "stricter."

A few caveats:

Latency: Each layer adds a small amount of overhead. The ActionGate is negligible (milliseconds), but the token-level enforcement can slightly increase the time to first token because the sampler has to do more work.
VRAM: As I noted, model size matters. Qwen 2.5 14B is the sweet spot for my hardware. If you're running on limited VRAM, don't chase the 32B or 70B models just for the sake of "intelligence" if it leads to 502 timeouts and unstable inference.
Memory Drift: Ensure your agent's memory is cleaned up. I use a six-layer memory architecture to prevent the agent from getting confused by outdated context, which is often the root cause of why it tries to perform unsafe actions in the first place.

Building autonomous agents isn't about finding the perfect model; it's about building the perfect cage for that model to operate in.

Stop Merging Broken YAML: Kubernetes Manifest Validation in CI

Guatu — Sat, 25 Apr 2026 22:15:35 +0000

Pushing a broken manifest to your main branch is a rite of passage, but it's one that becomes significantly more painful when you're running a GitOps workflow with ArgoCD. I've spent far too many late nights staring at a "Sync Failed" status in ArgoCD, only to realize I had a typo in a Traefik IngressRoute or a missing resource limit that Kyverno was blocking. The problem isn't just the error itself; it's the feedback loop. If the error only surfaces during deployment, your CI pipeline has failed its primary job.

The goal is to move validation as far left as possible. I started integrating kubeconform into my GitHub Actions workflow to catch structural errors—like invalid API versions or malike fields—before the code even reaches a pull request review. However, structural validation is only half the battle. You also have to deal with policy enforcement. I recently ran into a situation where a Kyverno policy enforcing resource limits on all Jobs was breaking my CloudNativePG (CNPG) deployments. The CNPG operator creates Jobs that don't always follow the standard resource pattern, and because the policy was too broad, the cluster refused to provision the primary.

The fix involves two parts: using kubeconform for schema validation in CI and using targeted exclusions in your Kyverno policies. For the CI side, you don't need a complex setup. A simple action step can scan your entire manifests directory.

# GitHub Action snippet for manifest validation
jobs:
  validate-manifests:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Validate Kubernetes manifests
        uses: yannh/kubernetes-manifest-validate@v1.11
        with:
          manifests: |
            kubernetes/workloads/**/*.yaml
            kubernetes/infrastructure/**/*.yaml

On the cluster side, when you have a legitimate reason to bypass a policy—like the CNPG example—don't just disable the policy globally. Use labels to create an exclusion scope. This keeps your GitOps for Homelabs workflow clean without sacrificing security for the rest of your workloads.

apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: require-resource-limits
spec:
  rules:
    - name: enforce-limits-on-jobs
      match:
        resources:
          kinds:
            - Job
      # Exclude CNPG clusters so the operator can manage its own jobs
      exclude:
        resources:
          labels:
            cnpg.io/cluster: "*"
      validate:
        message: "All containers must have resource limits defined."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        cpu: "?*"
                        memory: "?*"

Validating at the PR stage catches the "dumb" mistakes, while smart policy exclusions prevent the "smart" tools from breaking your legitimate infrastructure.

DEV Community: Guatu

Proxmox Cluster Quorum: How Many Nodes Do You Actually Need

What I tried first

The actual solution

Option 1: The Three-Node Standard

Option 2: The QDevice (The "Cheap" Vote)

Option 3: Monitoring and API Integration

Why it works

Lessons learned

Kyverno Admission Controllers: Policy-as-Code That Actually Works

Why you'd choose a Policy Engine

Option A: OPA Gatekeeper

Option B: Kyverno

Decision Framework

My Pick and Why

The "Infrastructure Exclusion" Pattern

Handling Security Contexts without Breaking the Cluster

The Danger of synchronize: true

Orphaned Resources and the Cleanup Gap

Integration with the Wider Stack

Lessons Learned

Privacy-Routed LLM Inference: Keeping Sensitive Data Out of the Cloud

The Wrong Way: Trusting the Orchestrator

The Actual Solution: Two-Tier Privacy Routing

1. The Reference-Only Knowledge Base

2. The Privacy Gate (Routing Layer)

3. Hardening the Execution

Why This Works

Lessons Learned

Tailscale Subnet Routers: Accessing Your LAN Without the VPN Headache

The Concept: Routing vs. Agent-based Access

Implementation: The "Happy Path" and the Reality

The Kubernetes and Gateway Trap

Turning it into an Exit Node

The Advanced Headache: NAT-PMP and Network Namespaces

Comparison: Subnet Routers vs. Traditional VPNs

Gotchas and Lessons Learned

Final Thoughts

PCIe Device Passthrough: NIC Name Instability and MAC Pinning

What I expected

What actually happened

The Fix: MAC Pinning

1. Identify the MAC address

2. Create the .link file

3. Update the network configuration

4. Apply and verify

Why this matters

The Tradeoff

Lessons Learned

GPU PCI Address Instability: When Your Card Moves Between Reboots

Cognitive Memory for Agents: Vector Search vs Activation-Based Recall

The Decision Point

Option A: Vector Search (The External Archive)

Option B: Activation-Based Recall (The Internal Intuition)

Decision Framework

My Pick and Why

Vibration Monitoring Architecture: From Sensor to Dashboard

What I tried first (The wrong way)

The Actual Solution: The Edge-Heavy Pipeline

1. Signal Conditioning and Edge Processing

2. The Transport Layer (MQTT 5.0)

3. Storage and Visualization

Why this architecture works

Lessons learned and caveats

Unprivileged LXC + Docker: The runc Sysctl Permission Trap

AdGuard Home: Network-Wide DNS Filtering with Failover

Three-Layer Safety for Autonomous Agents: Stopping the Infinite Loop

What I tried first

Layer 1: Token-Level Schema Enforcement

Layer 2: The Pre-Execution Gate (ActionGate)

Layer 3: Execution Isolation and Shell Safety

Why this works

Lessons Learned

Stop Merging Broken YAML: Kubernetes Manifest Validation in CI

The Danger of `synchronize: true`