Failure Modes – Zero-UI Home Automation (2025)

Purpose

This document defines expected failure modes and required system behavior when components fail, degrade, or become unavailable.

It exists to ensure:

  • Failures are predictable
  • The house remains calm and usable
  • Trust is preserved with non-technical users
  • AI-assisted changes do not introduce fragile dependencies

This file is authoritative for error handling and resilience design.


Core Failure Philosophy

Failure is inevitable. Confusion is optional.

When something breaks:

  • The system must not surprise users
  • The system must not spam notifications
  • The system must not “thrash” (rapid on/off, loops, oscillations)

The preferred failure response is: do nothing, stay quiet, and recover automatically.


Failure Classification

Failures are grouped by impact, not by technology.

Severity Levels

Level  Name        Description
L0     Invisible   User never notices
L1     Minor       Slight degradation, no action required
L2     Actionable  User intervention required
L3     Safety      Immediate awareness required

Only L2 and L3 failures may generate notifications.
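
The notification rule above can be expressed as a single gate that every alert path passes through. The following is a minimal sketch; the names `Severity` and `may_notify` are illustrative assumptions, not part of any existing codebase.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels as defined in this document."""
    L0_INVISIBLE = 0
    L1_MINOR = 1
    L2_ACTIONABLE = 2
    L3_SAFETY = 3


def may_notify(severity: Severity) -> bool:
    """Only L2 and L3 failures are allowed to generate notifications."""
    return severity >= Severity.L2_ACTIONABLE
```

Keeping the check in one place makes it harder for a new automation to send an L0 or L1 notification by accident.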


Global Failure Rules (Apply Everywhere)

  1. No Cascading Failures
    One subsystem failing must not break unrelated subsystems.

  2. No Repeated Alerts
    Every alert must have rate limiting and suppression.

  3. No “Unknown” Automation Actions
    If required state is missing or ambiguous → do nothing.

  4. Automatic Recovery Is Mandatory
    Services must restart cleanly without manual intervention.

  5. Physical Controls Remain Functional
    A software failure must never disable lights, doors, or safety systems.
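
Rule 2 (rate limiting and suppression) could be sketched as a per-key cooldown. This is an illustration only: `AlertLimiter`, the alert key strings, and the cooldown value are hypothetical, and the clock is injected so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Dict


class AlertLimiter:
    """Suppress repeats of the same alert key within a cooldown window."""

    def __init__(self, cooldown_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown_s:
            return False  # still inside the cooldown: stay quiet
        self._last_sent[key] = now
        return True
```

A caller would check `should_send("camera_offline")` before emitting an alert and drop the alert silently when it returns False.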


Platform-Level Failures

Home Assistant Restart / Crash

Expected Causes

  • Host reboot
  • Container restart
  • Upgrade

Required Behavior

  • Lutron switches continue to function normally
  • Hardwired sensors continue reporting once HA is back
  • Automations resume without manual reset
  • No notifications unless downtime exceeds a defined threshold

Severity

  • Short outage: L0
  • Extended outage (> X minutes): L2 (operator-only)

MQTT Broker Failure

Expected Causes

  • Container restart
  • Disk issue
  • Misconfiguration

Required Behavior

  • Sensor state may freeze temporarily
  • No automations fire on stale data
  • System resumes cleanly when broker returns

Explicit Rule

Automations must not assume MQTT messages are always fresh.
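
One way to honor this rule is to timestamp each message on receipt and treat anything older than a limit as missing. A minimal sketch follows; `Reading`, `fresh_value`, and `max_age_s` are hypothetical names, and the right age limit depends on how often each sensor publishes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    value: str
    received_at: float  # seconds, on the same clock as `now`


def fresh_value(reading: Optional[Reading], now: float,
                max_age_s: float) -> Optional[str]:
    """Return the value only if it is recent enough; None means no-op."""
    if reading is None:
        return None
    age = now - reading.received_at
    if age < 0 or age > max_age_s:
        return None  # stale or clock-skewed data is treated as missing
    return reading.value
```

An automation that receives None here should do nothing, in line with the global rule on missing or ambiguous state.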

Severity

  • Short outage: L1
  • Extended outage: L2 (operator-only)

Database Failure (PostgreSQL / InfluxDB)

Expected Causes

  • Disk full
  • Container issue
  • Corruption

Required Behavior

  • Real-time automations continue operating
  • Loss of history is acceptable
  • UI degradation is acceptable
  • No family-visible impact
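
To keep real-time automations independent of the database, history writes can be made best-effort. The sketch below assumes a hypothetical `write` callable supplied by the storage layer; it is not the API of PostgreSQL, InfluxDB, or Home Assistant.

```python
import logging
from typing import Callable

log = logging.getLogger("history")


def record_best_effort(write: Callable[[dict], None], event: dict) -> bool:
    """Try to persist an event; never let a database error reach the caller."""
    try:
        write(event)
        return True
    except Exception as exc:
        # Loss of history is acceptable, so log quietly and move on.
        log.debug("history write skipped: %s", exc)
        return False
```

The automation calls this after it has already acted, so a full disk or a corrupted table costs a history row and nothing else.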

Severity

  • L1 (operator concern only)

Sensor Failures

Hardwired Sensor Offline (Konnected / ESPHome)

Expected Causes

  • Board power loss
  • Wiring fault
  • Firmware issue

Required Behavior

  • Automations relying on that sensor must no-op
  • No “phantom occupancy”
  • No aggressive off actions
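
The no-op requirement is easiest to meet when sensor state is modeled as three-valued rather than boolean. A minimal sketch, assuming a hypothetical `occupancy_action` helper where None stands for an offline or unknown sensor:

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    LIGHT_ON = "light_on"
    LIGHT_OFF = "light_off"


def occupancy_action(occupied: Optional[bool]) -> Action:
    """Map a sensor reading to an action; None means offline or unknown."""
    if occupied is None:
        return Action.NONE  # no phantom occupancy, no aggressive off
    return Action.LIGHT_ON if occupied else Action.LIGHT_OFF
```

Collapsing "unknown" into False is what produces aggressive off actions, so the third state is kept explicit all the way to the decision.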

Notification Policy

  • Only if sensor remains offline beyond a defined duration
  • Operator-focused notification only

Severity

  • Single sensor: L1
  • Multiple critical sensors: L2

Wireless / Supplemental Sensor Failure

Expected Causes

  • Battery depletion
  • RF interference

Required Behavior

  • System ignores missing signal
  • No fallback to unsafe assumptions
  • No notifications unless safety-related

Severity

  • L0–L1

Lighting Failures

Automation Failure

Scenario

  • Lighting automation does not trigger

Required Behavior

  • Physical switches always work
  • No retries that cause flicker
  • No notifications

Severity

  • L0

Light Fails to Turn Off

Scenario

  • Missed occupancy clear
  • Sensor ambiguity

Required Behavior

  • Prefer leaving light on
  • User can manually turn it off
  • Automation backs off after manual action
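
The back-off after a manual action could be sketched as a simple time check. The function name and the `backoff_s` parameter are assumptions for illustration; the document does not specify a back-off duration.

```python
from typing import Optional


def automation_may_act(now: float, last_manual_at: Optional[float],
                       backoff_s: float) -> bool:
    """Return False while a recent manual action should still be respected."""
    if last_manual_at is None:
        return True  # no manual action on record
    return now - last_manual_at >= backoff_s
```

With `backoff_s=600`, an automation would leave a light alone for ten minutes after someone touches the switch.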

Severity

  • L0–L1

Camera & Vision Failures

Required Behavior

  • No alerts unless explicitly safety-related
  • No repeated “camera unavailable” spam
  • Camera absence must not affect lighting or occupancy

Severity

  • Single camera: L1
  • Multiple critical cameras: L2 (operator-only)

AI Detection Failure (Frigate / TPU)

Scenarios

  • False negatives
  • False positives
  • AI service offline

Required Behavior

  • Default to silence
  • No motion-only alerts
  • Do not escalate uncertainty

Severity

  • L0–L1

Energy & Power Failures

Utility Power Outage

Required Behavior

  • If generator starts normally → no notification
  • If generator fails → actionable alert
  • Avoid repeated alerts during extended outage
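
These three requirements can be sketched as a small state machine that alerts at most once per outage. `OutageAlerter` and its boolean inputs are hypothetical; how utility and generator status are actually sensed is outside this sketch.

```python
class OutageAlerter:
    """Alert at most once per outage, and only if the generator failed."""

    def __init__(self) -> None:
        self._alerted = False

    def update(self, utility_ok: bool, generator_ok: bool) -> bool:
        """Return True when an alert should be sent for this update."""
        if utility_ok:
            self._alerted = False  # outage over: re-arm for the next one
            return False
        if generator_ok or self._alerted:
            return False  # generator is covering, or we already alerted
        self._alerted = True
        return True
```

Re-arming only when utility power returns is what prevents repeated alerts during a long outage.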

Severity

  • Generator success: L0
  • Generator failure: L3

Solar / Inverter Data Loss

Required Behavior

  • Energy-based automations pause
  • Comfort is preserved
  • No notifications unless outage is abnormal

Severity

  • L1

Vehicle Integration Failures

Cloud API Failure (Tesla / Rivian)

Required Behavior

  • Vehicle state freezes gracefully
  • No retries that drain vehicle battery
  • No notifications
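
One common way to avoid aggressive retries is capped exponential backoff on the polling interval. This is a sketch under assumed values: `next_poll_delay`, the 5-minute base, and the 1-hour cap are illustrative and not taken from any vehicle API.

```python
def next_poll_delay(consecutive_failures: int, base_s: float = 300.0,
                    max_s: float = 3600.0) -> float:
    """Double the delay after each failure, capped at max_s."""
    if consecutive_failures <= 0:
        return base_s
    exponent = min(consecutive_failures, 16)  # bound the exponent
    return min(max_s, base_s * (2 ** exponent))
```

While the cloud API is failing, the last known vehicle state stays frozen and polling slows down instead of speeding up.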

Explicit Rule Vehicles are never mission-critical.

Severity

  • L0–L1

Network Failures

Internet Outage

Required Behavior

  • All local automations continue
  • Lighting, sensors, cameras remain functional
  • No notifications unless safety-related

Severity

  • L0

LAN / Wi-Fi Degradation

Required Behavior

  • Wired systems continue unaffected
  • Wireless sensors may degrade silently
  • No cascading logic failures

Severity

  • L1

Ingress / Reverse Proxy Failures

Nginx Proxy Manager Down or Misconfigured

Expected Causes

  • Container crash, upgrade, or bad host config
  • Router forwarding to wrong internal port (e.g., 3000 instead of 80/443)

Required Behavior

  • Private services remain isolated (no host port exposure)
  • Operator can access HA directly via its LAN IP on port 8123 if needed
  • Restore by ensuring WAN ports 80/443 forward to NPM and that proxied services are attached to the nginx-proxy_default network

Notification Policy

  • Operator-only alert (once), with clear remediation steps

Severity

  • L1

Open WebUI Accidentally Exposed on Host Port

Scenario

  • Container port 8080 is published on host port 3000, and the router forwards external port 80 to host port 3000

Required Behavior

  • Remove host port mapping and attach container to NPM network only
  • Validate reachability from NPM container by name and internal port
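
The reachability check could be a plain TCP connect run from inside the NPM container. A minimal sketch follows; `can_reach` is a hypothetical helper, and the container name and port in the usage note are assumptions.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, `can_reach("open-webui", 8080)` would confirm that the proxy can resolve the container by name and connect on its internal port, assuming the container is named `open-webui`.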

Severity

  • L1

Notification Failure Modes

Notification Delivery Failure

Required Behavior

  • Do not retry aggressively
  • Log locally
  • Avoid escalation unless safety-critical
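
A bounded, spaced retry with local logging might look like the sketch below. The `send` callable, the attempt count, and the delay are all assumptions; no specific notification service is implied.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("notify")


def deliver(send: Callable[[str], None], message: str,
            max_attempts: int = 2, retry_delay_s: float = 30.0,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try a few spaced attempts, then give up quietly and log locally."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return True
        except Exception as exc:
            log.warning("notification attempt %d/%d failed: %s",
                        attempt, max_attempts, exc)
            if attempt < max_attempts:
                sleep(retry_delay_s)  # wait before the next attempt
    return False
```

A False return is recorded in the log and nothing more; only a safety-critical caller would choose to escalate through another channel.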

Severity

  • L1

AI Coding Guardrails (Critical)

Any AI-generated change must:

  • Read and align with the Framework, Rules, Intent, Integrations, and this Failure Modes file before emitting code
  • Prefer deterministic, reversible automations; default to silence on uncertainty and notify only when actionable
  • Avoid adding new cloud dependencies, UI requirements, or exposed ports/credentials
  • Document one-sentence intent, failure mode, and cooldown/override for every automation; avoid hidden coupling
  • Ship incrementally (observe → assist → full), with safe rollbacks and operator-visible diffs

If a proposed change violates these guardrails, it must be rejected or redesigned.
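
The documentation guardrail could be checked mechanically before a change is accepted. The following is a sketch only: `AutomationDoc` and `missing_fields` are hypothetical, and the field names simply mirror the wording of the guardrail.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AutomationDoc:
    name: str
    intent: str                # one-sentence intent
    failure_mode: str          # behavior when inputs are missing or wrong
    cooldown_or_override: str  # how the automation is throttled or overridden


def missing_fields(doc: AutomationDoc) -> List[str]:
    """Return the names of required documentation fields that are blank."""
    required = ("intent", "failure_mode", "cooldown_or_override")
    return [f for f in required if not getattr(doc, f).strip()]
```

A non-empty result is a reason to reject or redesign the change.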