Failure Modes – Zero-UI Home Automation (2025)

Purpose

This document defines expected failure modes and required system behavior when components fail, degrade, or become unavailable.

It exists to ensure:

Failures are predictable
The house remains calm and usable
Trust is preserved with non-technical users
AI-assisted changes do not introduce fragile dependencies

This file is authoritative for error handling and resilience design.

Core Failure Philosophy

Failure is inevitable. Confusion is optional.

When something breaks:

The system must not surprise users
The system must not spam notifications
The system must not “thrash” (rapid on/off, loops, oscillations)

The preferred failure response is: do nothing, stay quiet, and recover automatically.
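
One way to rule out thrashing is a minimum dwell time between state changes. The sketch below is illustrative only: the class name `DwellGuard`, the `min_dwell_s` parameter, and the caller-supplied timestamps are assumptions, not part of this specification.

```python
from typing import Optional


class DwellGuard:
    """Refuse state changes that arrive too soon after the previous change."""

    def __init__(self, min_dwell_s: float):
        self.min_dwell_s = min_dwell_s          # hypothetical tuning value
        self._state: Optional[bool] = None      # None until the first change
        self._changed_at: Optional[float] = None

    def allow_change(self, new_state: bool, now: float) -> bool:
        if new_state == self._state:
            return False  # already in that state: nothing to do
        if self._changed_at is not None and now - self._changed_at < self.min_dwell_s:
            return False  # too soon after the last change: stay quiet
        self._state = new_state
        self._changed_at = now
        return True
```

A rejected change is simply dropped rather than queued, which matches the "do nothing, stay quiet" preference.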

Failure Classification

Failures are grouped by impact, not by technology.

Severity Levels

Level	Name	Description
L0	Invisible	User never notices
L1	Minor	Slight degradation, no action required
L2	Actionable	User intervention required
L3	Safety	Immediate awareness required

Only L2 and L3 failures may generate notifications.
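
The notification rule can be expressed as a single check. This is a minimal sketch; the enum and function names are hypothetical, and only the four levels and the L2/L3 rule come from this document.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table above."""
    L0_INVISIBLE = 0   # user never notices
    L1_MINOR = 1       # slight degradation, no action required
    L2_ACTIONABLE = 2  # user intervention required
    L3_SAFETY = 3      # immediate awareness required


def may_notify(severity: Severity) -> bool:
    """Only L2 and L3 failures may generate notifications."""
    return severity >= Severity.L2_ACTIONABLE
```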

Global Failure Rules (Apply Everywhere)

No Cascading Failures
One subsystem failing must not break unrelated subsystems.
No Repeated Alerts
Every alert must have rate limiting and suppression.
No “Unknown” Automation Actions
If required state is missing or ambiguous → do nothing.
Automatic Recovery Is Mandatory
Services must restart cleanly without manual intervention.
Physical Controls Remain Functional
A software failure must never disable lights, doors, or safety systems.
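
As an illustration of the "No Repeated Alerts" rule, the sketch below suppresses any alert whose key has fired within a cooldown window. The class name, the per-key design, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Dict


class AlertLimiter:
    """Allow at most one alert per key within a cooldown window."""

    def __init__(self, cooldown_s: float, clock: Callable[[], float] = time.monotonic):
        self.cooldown_s = cooldown_s  # hypothetical value, chosen per alert type
        self.clock = clock            # injectable so the limiter can be tested
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # suppressed: the same alert fired too recently
        self._last_sent[key] = now
        return True
```

Suppressed attempts do not extend the window, so a persistent fault still produces one alert per cooldown period rather than none at all.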

Platform-Level Failures

Home Assistant Restart / Crash

Expected Causes

Host reboot
Container restart
Upgrade

Required Behavior

Lutron switches continue to function normally
Hardwired sensors resume reporting once HA is back
Automations resume without manual reset
No notifications unless downtime exceeds a defined threshold

Severity

Short outage: L0
Extended outage (> X minutes): L2 (operator-only)
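
The severity split above reduces to one comparison. In this sketch the threshold is a parameter because this document leaves X undefined; the function name is hypothetical.

```python
def ha_outage_severity(downtime_min: float, threshold_min: float) -> str:
    """Return "L0" for a short outage, "L2" once downtime exceeds the threshold."""
    # threshold_min stands in for the unspecified "X minutes" above.
    return "L2" if downtime_min > threshold_min else "L0"
```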

MQTT Broker Failure

Expected Causes

Container restart
Disk issue
Misconfiguration

Required Behavior

Sensor state may freeze temporarily
No automations fire on stale data
System resumes cleanly when broker returns

Explicit Rule: Automations must not assume MQTT messages are always fresh.
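
A freshness check along these lines keeps automations from acting on stale data. It is a sketch under assumptions: the `Reading` type, the `max_age_s` limit, and the idea of stamping each message on receipt are illustrative and not defined by this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    value: str
    received_at: float  # seconds, from the same clock as `now`


def fresh_value(reading: Optional[Reading], now: float, max_age_s: float) -> Optional[str]:
    """Return the value only if it is recent enough; None tells the caller to no-op."""
    if reading is None:
        return None  # nothing received yet
    if now - reading.received_at > max_age_s:
        return None  # stale: the broker or the sensor may be down
    return reading.value
```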

Severity

Short outage: L1
Extended outage: L2 (operator-only)

Database Failure (PostgreSQL / InfluxDB)

Expected Causes

Disk full
Container issue
Corruption

Required Behavior

Real-time automations continue operating
Loss of history is acceptable
UI degradation is acceptable
No family-visible impact

Severity

L1 (operator concern only)

Sensor Failures

Hardwired Sensor Offline (Konnected / ESPHome)

Expected Causes

Board power loss
Wiring fault
Firmware issue

Required Behavior

Automations relying on that sensor must no-op
No “phantom occupancy”
No aggressive off actions
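
The no-op requirement can be sketched by treating an offline sensor as a third state rather than as "not occupied". The names below are hypothetical; `None` stands for offline or unknown.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    TURN_ON = "turn_on"
    TURN_OFF = "turn_off"
    NO_OP = "no_op"


def lighting_action(occupied: Optional[bool]) -> Action:
    """Map a sensor reading to an action; None means offline or unknown."""
    if occupied is None:
        # Missing state: no phantom occupancy and no aggressive off action.
        return Action.NO_OP
    return Action.TURN_ON if occupied else Action.TURN_OFF
```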

Notification Policy

Only if sensor remains offline beyond a defined duration
Operator-focused notification only

Severity

Single sensor: L1
Multiple critical sensors: L2

Wireless / Supplemental Sensor Failure

Expected Causes

Battery depletion
RF interference

Required Behavior

System ignores missing signal
No fallback to unsafe assumptions
No notifications unless safety-related

Severity

L0–L1

Lighting Failures

Automation Failure

Scenario

Lighting automation does not trigger

Required Behavior

Physical switches always work
No retries that cause flicker
No notifications

Severity

Light Fails to Turn Off

Scenario

Missed occupancy clear
Sensor ambiguity

Required Behavior

Prefer leaving light on
User can manually turn it off
Automation backs off after manual action
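
One possible shape for "backs off after manual action" is a hold window during which automation leaves the light alone. The class name and the `hold_s` duration are assumptions for illustration.

```python
from typing import Optional


class ManualOverride:
    """Pause automation for hold_s seconds after a manual switch action."""

    def __init__(self, hold_s: float):
        self.hold_s = hold_s  # hypothetical back-off duration
        self._manual_at: Optional[float] = None

    def record_manual_action(self, now: float) -> None:
        self._manual_at = now

    def automation_allowed(self, now: float) -> bool:
        if self._manual_at is None:
            return True  # no manual action seen yet
        return now - self._manual_at >= self.hold_s
```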

Severity

L0–L1

Camera & Vision Failures

Camera Offline (PoE / Reolink)

Required Behavior

No alerts unless explicitly safety-related
No repeated “camera unavailable” spam
Camera absence must not affect lighting or occupancy

Severity

Single camera: L1
Multiple critical cameras: L2 (operator-only)

AI Detection Failure (Frigate / TPU)

Scenarios

False negatives
False positives
AI service offline

Required Behavior

Default to silence
No motion-only alerts
Do not escalate uncertainty

Severity

L0–L1

Energy & Power Failures

Utility Power Outage

Required Behavior

If generator starts normally → no notification
If generator fails → actionable alert
Avoid repeated alerts during extended outage

Severity

Generator success: L0
Generator failure: L3
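
The outage rules can be sketched as a single decision function. The function and parameter names are hypothetical, and a real implementation would likely also allow the generator a start-up grace period, which is omitted here.

```python
def power_alert_needed(utility_ok: bool, generator_ok: bool, already_alerted: bool) -> bool:
    """Alert only when both sources are down and no alert has been sent yet."""
    if utility_ok or generator_ok:
        return False  # the house has power: stay silent (L0)
    return not already_alerted  # generator failure (L3), alerted once per outage
```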

Solar / Inverter Data Loss

Required Behavior

Energy-based automations pause
Comfort is preserved
No notifications unless outage is abnormal

Severity

Vehicle Integration Failures

Cloud API Failure (Tesla / Rivian)

Required Behavior

Vehicle state freezes gracefully
No retries that drain vehicle battery
No notifications

Explicit Rule: Vehicles are never mission-critical.

Severity

L0–L1

Network Failures

Internet Outage

Required Behavior

All local automations continue
Lighting, sensors, cameras remain functional
No notifications unless safety-related

Severity

LAN / Wi-Fi Degradation

Required Behavior

Wired systems continue unaffected
Wireless sensors may degrade silently
No cascading logic failures

Severity

Ingress / Reverse Proxy Failures

Nginx Proxy Manager Down or Misconfigured

Expected Causes

Container crash, upgrade, or bad host config
Router forwarding to wrong internal port (e.g., 3000 instead of 80/443)

Required Behavior

Private services remain isolated (no host port exposure)
Operator can access HA directly via its LAN IP on port 8123 if needed
Restore by ensuring the router forwards WAN 80/443 to NPM and that proxied services are attached to the nginx-proxy_default network

Notification Policy

Operator-only alert (once), with clear remediation steps

Severity

Open WebUI Accidentally Exposed on Host Port

Scenario

Container mapped 8080→3000 and router forwards external 80→3000

Required Behavior

Remove host port mapping and attach container to NPM network only
Validate reachability from NPM container by name and internal port

Severity

Finance Stack Failures

SimpleFIN / Institution Sync

Behavior

Re-auth required; institution outage; throttling; partial updates

Required

Staggered polling; backoff; per-connection freshness timestamps; operator-only notifications
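
Staggered polling and backoff could look like the following sketch. Both helper names and all parameters are assumptions; doubling the delay per consecutive failure is one common choice, not something this document prescribes.

```python
from typing import List


def next_poll_delay(base_s: float, failures: int, max_s: float) -> float:
    """Double the delay for each consecutive failure, capped at max_s."""
    return min(max_s, base_s * (2 ** failures))


def stagger_offsets(connections: int, interval_s: float) -> List[float]:
    """Spread connections evenly across one polling interval."""
    return [i * interval_s / connections for i in range(connections)]
```

For example, four connections on a one-hour interval would start 15 minutes apart instead of all polling at once.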

Data Correctness

Behavior

Duplicates; payee normalization conflicts; posting date inconsistencies

Required

Dedupe rules; safe revert; store posted/cleared semantics

Infra

Behavior

DB/volume corruption; disk full; container restart; secrets exposure

Required

Nightly backups + retention; storage monitoring; healthchecks; secrets outside git; rotation

Insights (Non-critical)

Behavior

Analytics delays; LLM uncertainty

Required

Read-only; cite numbers; degrade silently

Notification Failure Modes

Notification Delivery Failure

Required Behavior

Do not retry aggressively
Log locally
Avoid escalation unless safety-critical
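
A bounded, delayed retry that logs locally is one way to meet these requirements. This is a sketch only: `send` is a hypothetical callable that returns True on success, and the attempt count and delay are illustrative defaults.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("notify")


def deliver(send: Callable[[], bool], max_attempts: int = 2,
            delay_s: float = 30.0, sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try a few times with a fixed pause, log failures locally, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            if send():
                return True
            log.warning("notification attempt %d was not accepted", attempt)
        except Exception as exc:
            log.warning("notification attempt %d failed: %s", attempt, exc)
        if attempt < max_attempts:
            sleep(delay_s)  # pause between attempts instead of retrying in a tight loop
    log.error("notification dropped after %d attempts", max_attempts)
    return False
```

Giving up after a small number of attempts avoids turning a delivery failure into a retry storm; the caller decides whether a safety-critical alert warrants another channel.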

Severity

AI Coding Guardrails (Critical)

Any AI-generated change must:

Read and align with the Framework, Rules, Intent, Integrations, and this Failure Modes file before emitting code
Prefer deterministic, reversible automations; default to silence on uncertainty and notify only when actionable
Avoid adding new cloud dependencies, UI requirements, or exposed ports/credentials
Document one-sentence intent, failure mode, and cooldown/override for every automation; avoid hidden coupling
Ship incrementally (observe → assist → full), with safe rollbacks and operator-visible diffs

If a proposed change violates these guardrails, it must be rejected or redesigned.