Failure Modes – Zero-UI Home Automation (2025)
Purpose
This document defines expected failure modes and required system behavior when components fail, degrade, or become unavailable.
It exists to ensure:
- Failures are predictable
- The house remains calm and usable
- Trust is preserved with non-technical users
- AI-assisted changes do not introduce fragile dependencies
This file is authoritative for error handling and resilience design.
Core Failure Philosophy
Failure is inevitable. Confusion is optional.
When something breaks:
- The system must not surprise users
- The system must not spam notifications
- The system must not “thrash” (rapid on/off, loops, oscillations)
The preferred failure response is: do nothing, stay quiet, and recover automatically.
Failure Classification
Failures are grouped by impact, not by technology.
Severity Levels
| Level | Name | Description |
|---|---|---|
| L0 | Invisible | User never notices |
| L1 | Minor | Slight degradation, no action required |
| L2 | Actionable | User intervention required |
| L3 | Safety | Immediate awareness required |
Only L2 and L3 failures may generate notifications.
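As a minimal sketch of the rule above, the four levels can be modeled as an ordered enum so that the notification gate is a single comparison. The names `Severity` and `may_notify` are illustrative assumptions, not part of any existing codebase.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table above, ordered by impact."""
    L0 = 0  # Invisible
    L1 = 1  # Minor
    L2 = 2  # Actionable
    L3 = 3  # Safety


def may_notify(severity: Severity) -> bool:
    """Only L2 and L3 failures may generate notifications."""
    return severity >= Severity.L2
```

Routing every notification path through one gate like this keeps L0 and L1 failures silent by construction.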
Global Failure Rules (Apply Everywhere)
- No Cascading Failures: One subsystem failing must not break unrelated subsystems.
- No Repeated Alerts: Every alert must have rate limiting and suppression.
- No “Unknown” Automation Actions: If required state is missing or ambiguous → do nothing.
- Automatic Recovery Is Mandatory: Services must restart cleanly without manual intervention.
- Physical Controls Remain Functional: A software failure must never disable lights, doors, or safety systems.
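Two of the global rules lend themselves to small helpers: alert suppression and the "do nothing on unknown state" guard. The sketch below is one possible shape, assuming timestamps are passed in as seconds. `AlertLimiter`, `should_act`, and the cooldown value are hypothetical names and numbers chosen for illustration. The strings `unknown` and `unavailable` are the states Home Assistant reports for entities without a usable value.

```python
from typing import Dict, Mapping, Optional

# Home Assistant reports these strings for entities without a usable state.
UNUSABLE_STATES = {"unknown", "unavailable"}


class AlertLimiter:
    """Allow at most one alert per key within a cooldown window."""

    def __init__(self, cooldown_s: float) -> None:
        self.cooldown_s = cooldown_s
        self._last_sent: Dict[str, float] = {}

    def allow(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # suppressed: still inside the cooldown window
        self._last_sent[key] = now
        return True


def should_act(required_states: Mapping[str, Optional[str]]) -> bool:
    """Return False (no-op) if any required state is missing or unusable."""
    return all(
        state is not None and state not in UNUSABLE_STATES
        for state in required_states.values()
    )
```

For example, `AlertLimiter(cooldown_s=3600)` lets a "camera offline" alert through once and suppresses repeats for the next hour, while alerts under a different key are unaffected.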
Platform-Level Failures
Home Assistant Restart / Crash
Expected Causes
- Host reboot
- Container restart
- Upgrade
Required Behavior
- Lutron switches continue to function normally
- Hardwired sensors continue reporting once HA is back
- Automations resume without manual reset
- No notifications unless downtime exceeds a defined threshold
Severity
- Short outage: L0
- Extended outage (> X minutes): L2 (operator-only)
MQTT Broker Failure
Expected Causes
- Container restart
- Disk issue
- Misconfiguration
Required Behavior
- Sensor state may freeze temporarily
- No automations fire on stale data
- System resumes cleanly when broker returns
Explicit Rule: Automations must not assume MQTT messages are always fresh.
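One way to enforce the freshness rule is to check the age of the last message before acting on it. This is a sketch under assumptions: `is_fresh`, `motion_action`, the `"ON"` payload, the action name, and the 120-second default are all hypothetical and would need to match the real sensors.

```python
from typing import Optional


def is_fresh(last_update_ts: Optional[float], now: float, max_age_s: float) -> bool:
    """True only if a message was received and is no older than max_age_s."""
    if last_update_ts is None:
        return False  # never received anything: treat as stale
    age = now - last_update_ts
    # A negative age (timestamp in the future) is also treated as not fresh.
    return 0 <= age <= max_age_s


def motion_action(payload: str, last_update_ts: Optional[float],
                  now: float, max_age_s: float = 120.0) -> Optional[str]:
    """Return an action name, or None to do nothing."""
    if not is_fresh(last_update_ts, now, max_age_s):
        return None  # stale or missing data: no-op
    return "turn_on_light" if payload == "ON" else None
```

With this guard, a frozen "ON" value left over from before a broker outage produces no action once it ages past the limit.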
Severity
- Short outage: L1
- Extended outage: L2 (operator-only)
Database Failure (PostgreSQL / InfluxDB)
Expected Causes
- Disk full
- Container issue
- Corruption
Required Behavior
- Real-time automations continue operating
- Loss of history is acceptable
- UI degradation is acceptable
- No family-visible impact
Severity
- L1 (operator concern only)
Sensor Failures
Hardwired Sensor Offline (Konnected / ESPHome)
Expected Causes
- Board power loss
- Wiring fault
- Firmware issue
Required Behavior
- Automations relying on that sensor must no-op
- No “phantom occupancy”
- No aggressive off actions
Notification Policy
- Only if sensor remains offline beyond a defined duration
- Operator-focused notification only
Severity
- Single sensor: L1
- Multiple critical sensors: L2
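The duration gate and the single-versus-multiple distinction above could be expressed as one classification function. This sketch reads "multiple" as two or more, which is an interpretation rather than something the policy states. `classify_offline`, its parameters, and the sensor names in the usage note are hypothetical.

```python
from typing import Mapping, Optional, Set


def classify_offline(offline_for_s: Mapping[str, float],
                     critical: Set[str],
                     min_offline_s: float) -> Optional[str]:
    """Return "L2", "L1", or None if nothing has been offline long enough."""
    # Only sensors offline beyond the defined duration count at all.
    persistent = {
        name for name, secs in offline_for_s.items() if secs >= min_offline_s
    }
    if not persistent:
        return None
    if len(persistent & critical) >= 2:
        return "L2"  # multiple critical sensors (assumed to mean two or more)
    return "L1"
```

For instance, with `critical = {"front_door", "garage_door"}` and a 600-second threshold, a single door sensor offline for 900 seconds classifies as `"L1"`, and both offline classifies as `"L2"`.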
Wireless / Supplemental Sensor Failure
Expected Causes
- Battery depletion
- RF interference
Required Behavior
- System ignores missing signal
- No fallback to unsafe assumptions
- No notifications unless safety-related
Severity
- L0–L1
Lighting Failures
Automation Failure
Scenario
- Lighting automation does not trigger
Required Behavior
- Physical switches always work
- No retries that cause flicker
- No notifications
Severity
- L0
Light Fails to Turn Off
Scenario
- Missed occupancy clear
- Sensor ambiguity
Required Behavior
- Prefer leaving light on
- User can manually turn it off
- Automation backs off after manual action
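The "backs off after manual action" behavior can be sketched as a per-entity hold-off window: once a person touches a switch, automation leaves that light alone for a while. `ManualOverride`, the entity IDs, and the back-off length are illustrative assumptions.

```python
from typing import Dict


class ManualOverride:
    """Tracks a per-entity window during which automation must not act."""

    def __init__(self, backoff_s: float) -> None:
        self.backoff_s = backoff_s
        self._until: Dict[str, float] = {}

    def record_manual_action(self, entity_id: str, now: float) -> None:
        # A person used the physical control: start (or extend) the window.
        self._until[entity_id] = now + self.backoff_s

    def automation_allowed(self, entity_id: str, now: float) -> bool:
        # Entities never touched manually are always allowed.
        return now >= self._until.get(entity_id, float("-inf"))
```

An automation would call `automation_allowed("light.kitchen", now)` before every action and silently skip the action when it returns `False`.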
Severity
- L0–L1
Camera & Vision Failures
Camera Offline (PoE / Reolink)
Required Behavior
- No alerts unless explicitly safety-related
- No repeated “camera unavailable” spam
- Camera absence must not affect lighting or occupancy
Severity
- Single camera: L1
- Multiple critical cameras: L2 (operator-only)
AI Detection Failure (Frigate / TPU)
Scenarios
- False negatives
- False positives
- AI service offline
Required Behavior
- Default to silence
- No motion-only alerts
- Do not escalate uncertainty
Severity
- L0–L1
Energy & Power Failures
Utility Power Outage
Required Behavior
- If generator starts normally → no notification
- If generator fails → actionable alert
- Avoid repeated alerts during extended outage
Severity
- Generator success: L0
- Generator failure: L3
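The outage rules above reduce to a small decision function. The sketch below assumes both inputs are already known with confidence and that the caller tracks whether an alert was sent for the current outage. `outage_alert` and the message text are hypothetical.

```python
from typing import Optional


def outage_alert(utility_up: bool, generator_running: bool,
                 already_alerted: bool) -> Optional[str]:
    """Return an alert message, or None to stay quiet."""
    if utility_up or generator_running:
        return None  # no outage, or the generator carried the load (L0)
    if already_alerted:
        return None  # one alert per outage; avoid repeats
    return "L3: utility power is out and the generator did not start"
```

The caller would reset `already_alerted` once utility power returns, so that a later outage can alert again.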
Solar / Inverter Data Loss
Required Behavior
- Energy-based automations pause
- Comfort is preserved
- No notifications unless outage is abnormal
Severity
- L1
Vehicle Integration Failures
Cloud API Failure (Tesla / Rivian)
Required Behavior
- Vehicle state freezes gracefully
- No retries that drain vehicle battery
- No notifications
Explicit Rule: Vehicles are never mission-critical.
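One common way to avoid aggressive retries against a failing cloud API is capped exponential backoff. This is a generic sketch, not a description of how any vehicle integration behaves. `next_poll_delay_s` and its default intervals are assumptions, and whether a given poll wakes the vehicle depends on the vendor API.

```python
def next_poll_delay_s(consecutive_failures: int,
                      base_s: float = 300.0,
                      max_s: float = 3600.0) -> float:
    """Seconds to wait before the next cloud poll; doubles per failure, capped."""
    if consecutive_failures < 0:
        raise ValueError("consecutive_failures must be >= 0")
    # Cap the exponent so very long outages cannot produce huge numbers.
    exponent = min(consecutive_failures, 32)
    return min(base_s * (2 ** exponent), max_s)
```

With the defaults, the delay grows from 5 minutes to 10, 20, and 40 minutes, then holds at one hour for as long as the API keeps failing.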
Severity
- L0–L1
Network Failures
Internet Outage
Required Behavior
- All local automations continue
- Lighting, sensors, cameras remain functional
- No notifications unless safety-related
Severity
- L0
LAN / Wi-Fi Degradation
Required Behavior
- Wired systems continue unaffected
- Wireless sensors may degrade silently
- No cascading logic failures
Severity
- L1
Ingress / Reverse Proxy Failures
Nginx Proxy Manager Down or Misconfigured
Expected Causes
- Container crash, upgrade, or bad host config
- Router forwarding to wrong internal port (e.g., 3000 instead of 80/443)
Required Behavior
- Private services remain isolated (no host port exposure)
- Operator can access HA via LAN IP + 8123 if needed
- Restore by ensuring WAN 80/443 forwards to NPM and services are on the nginx-proxy_default network
Notification Policy
- Operator-only alert (once), with clear remediation steps
Severity
- L1
Open WebUI Accidentally Exposed on Host Port
Scenario
- Container mapped 8080→3000 and router forwards external 80→3000
Required Behavior
- Remove host port mapping and attach container to NPM network only
- Validate reachability from NPM container by name and internal port
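The reachability check could be scripted as a plain TCP connect test, run from any container attached to the same network as the proxy. `can_reach` is a hypothetical helper, and the service name and port in the usage note are placeholders for the real values.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_reach("open-webui", 8080)` would confirm that the container name resolves and the internal port accepts connections. It only proves TCP reachability, not that the service behind the port is healthy.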
Severity
- L1
Notification Failure Modes
Notification Delivery Failure
Required Behavior
- Do not retry aggressively
- Log locally
- Avoid escalation unless safety-critical
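The delivery rules above might look like the following bounded-retry sketch: a single attempt for ordinary messages, a small number of spaced attempts for safety-critical ones, and local logging on failure. `deliver`, the injected `send` and `sleep` callables, and the default delay and attempt count are all assumptions.

```python
import logging
from typing import Callable

log = logging.getLogger("notifications")


def deliver(send: Callable[[str], bool], message: str, safety_critical: bool,
            sleep: Callable[[float], None], retry_delay_s: float = 60.0,
            max_safety_attempts: int = 3) -> bool:
    """Try to send a message; retry only when it is safety-critical."""
    attempts = max_safety_attempts if safety_critical else 1
    for attempt in range(1, attempts + 1):
        try:
            ok = send(message)
        except Exception:
            log.exception("notification send raised (attempt %d)", attempt)
            ok = False
        if ok:
            return True
        # Log locally instead of escalating.
        log.warning("notification not delivered (attempt %d): %s", attempt, message)
        if attempt < attempts:
            sleep(retry_delay_s)  # space out retries; never hammer the service
    return False
```

In production `sleep` would be `time.sleep` or a scheduler hook. Injecting it keeps the function testable without real delays.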
Severity
- L1
AI Coding Guardrails (Critical)
Any AI-generated change must:
- Read and align with the Framework, Rules, Intent, Integrations, and this Failure Modes file before emitting code
- Prefer deterministic, reversible automations; default to silence on uncertainty and notify only when actionable
- Avoid adding new cloud dependencies, UI requirements, or exposed ports/credentials
- Document one-sentence intent, failure mode, and cooldown/override for every automation; avoid hidden coupling
- Ship incrementally (observe → assist → full), with safe rollbacks and operator-visible diffs
If a proposed change violates these guardrails, it must be rejected or redesigned.