ObjAlert is the core alert engine for the Axion platform. It runs a
continuous check loop against configurable TriggerSql queries, tracks
results in the database, auto-creates incidents for high-severity events,
dispatches notifications through ObjNotify, and maintains a full
lifecycle from detection through escalation to recovery.
Two classes live in the same module:

- ObjAlert — the main class; inherits from ObjData for database access
- AlertMetrics — in-memory performance counters shared across all ObjAlert instances in a process (total checks, trigger rate, error counts)

```
ServeWorkflow / cron
        │
        ▼
ObjAlert.Service()              ← infinite loop, 20s sleep
        │
        ├─ check()              ← evaluate all active alerts
        │    ├─ is_maintenance_window()
        │    ├─ should_suppress_alert()
        │    ├─ should_throttle_alert()
        │    ├─ query_get_value(TriggerSql)
        │    └─ alert_track()
        │         ├─ insert track_alert row
        │         ├─ create_incident()   [HIGH / CRITICAL only]
        │         └─ ObjNotify.Run()
        │
        ├─ check_escalations()
        ├─ check_recoveries()
        └─ check_predictive_alerts()
```
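The loop's shape, as a sketch built from the diagram above (the real Service() also records metrics per run and handles errors; the helper name service_loop is illustrative):

```python
import time

def service_loop(alert_obj):
    """Sketch of Service(): the shipped method adds error handling,
    def_alert_metric recording, and per-alert cron gating."""
    while True:                           # runs until the host process stops it
        alert_obj.check()                 # evaluate all active alerts
        alert_obj.check_escalations()     # escalate unacknowledged triggers
        alert_obj.check_recoveries()      # auto-resolve alerts that now PASS
        alert_obj.check_predictive_alerts()
        time.sleep(20)                    # 20-second sleep between cycles
```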
### def_alert — alert definitions

| Column | Type | Description |
|---|---|---|
| Alert | char(255) | Alert name (PK with Package) |
| Package | char(255) | Package scope |
| TriggerSql | text | SQL that returns 1/Y to fire |
| AlertValueSql | text | SQL to retrieve the alert value |
| AlertNote | text | Message template for notifications |
| TriggerCooldown | int | Minimum seconds between triggers |
| ActiveCron | char(2) | Cron expression for scheduled checks |
| Severity | enum | CRITICAL, HIGH, MEDIUM, LOW, INFO |
| MaxTriggersPerHour | int | Rate limit (default 10) |
| ExponentialBackoff | char(1) | Enable backoff between retries |
| BackoffMultiplier | decimal | Backoff factor (default 2.0) |
| EscalationEnabled | char(1) | Auto-escalate unacknowledged alerts |
| EscalationMinutes | int | Minutes before first escalation |
| EscalationLevel1/2/3 | varchar | Escalation targets per level |
| RunbookUrl | varchar(500) | Link to runbook file |
| ImpactArea | varchar(100) | System area impacted |
| Workflow | char(255) | Workflow to trigger on fire |
| RemoteConnection | char(255) | Remote DB connection for TriggerSql |
| Active | char(2) | Y/N |
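For orientation, a hypothetical definition expressed as a column/value mapping. Every value here is illustrative (alert name, SQL, thresholds); only the column names come from the table above:

```python
# Hypothetical def_alert row -- values are examples, not shipped data.
example_alert = {
    "Alert": "DB_CONNECTIONS_HIGH",        # PK together with Package
    "Package": "demo",
    # TriggerSql must return 1/Y for the alert to fire
    "TriggerSql": "SELECT IF(COUNT(*) > 90, 1, 0) FROM information_schema.PROCESSLIST",
    "AlertValueSql": "SELECT COUNT(*) FROM information_schema.PROCESSLIST",
    "AlertNote": "DB connection count is high",
    "TriggerCooldown": 300,                # at most one trigger per 5 minutes
    "Severity": "HIGH",                    # HIGH/CRITICAL auto-create incidents
    "MaxTriggersPerHour": 10,
    "ExponentialBackoff": "Y",
    "BackoffMultiplier": 2.0,
    "EscalationEnabled": "Y",
    "EscalationMinutes": 15,
    "Active": "Y",
}
```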
### track_alert — alert trigger history

| Column | Type | Description |
|---|---|---|
| Guid | char(50) | Row identifier (PK) |
| Alert | char(255) | Alert name |
| Package | char(255) | Package |
| AlertNote | text | Trigger message or "PASS" |
| AlertTriggerTime | datetime | When the alert fired |
| Severity | enum | Severity at trigger time |
| AcknowledgedBy | varchar | User who acknowledged |
| AcknowledgedAt | datetime | Acknowledgment time |
| TimeToAcknowledge | int | Seconds to acknowledge |
| ResolvedBy | varchar | User or "SYSTEM" |
| ResolvedAt | datetime | Resolution time |
| TimeToResolve | int | Seconds to resolve |
| ResolutionNotes | text | Resolution details |
| EscalationLevel | int | Current escalation level (0–3) |
### def_alert_metric — check run performance

| Column | Type | Description |
|---|---|---|
| Package | varchar | Package |
| CheckTime | datetime | When the check ran |
| CheckGuid | varchar(36) | Unique check run ID |
| AlertsChecked | int | Alerts evaluated |
| AlertsTriggered | int | Alerts that fired |
| AlertsPassed | int | Alerts that passed |
| ElapsedSeconds | decimal | Total check duration |
| AvgTriggerSqlTime | decimal | Average TriggerSql time |
| MaxTriggerSqlTime | decimal | Slowest TriggerSql time |
| MemoryUsageMB | decimal | Process memory at check time |
| CheckSuccess | char(1) | Y/N |
### def_alert_maintenance — maintenance windows

Suppresses alerts during scheduled downtime. Windows are pattern-matched on alert name; a NULL pattern suppresses all alerts for the package.
### def_alert_dependency — parent/child suppression

When a parent alert is active, child alerts listed in this table are automatically suppressed to reduce noise.
### Incident lifecycle walkthrough

The following walks through a complete real-world incident, from detection to closure.
#### check() runs every 20 seconds

Service() calls check() in a loop. For each active alert in def_alert, the TriggerSql is executed against the configured database connection. A result of 1 or Y means the alert fires.

Before executing, three suppression checks run:

- is_maintenance_window() queries def_alert_maintenance for an active window matching this alert.
- should_suppress_alert() checks def_alert_dependency for a currently firing parent alert.
- should_throttle_alert() counts triggers in the last hour against MaxTriggersPerHour; exponential backoff is applied when ExponentialBackoff = 'Y' (see the sketch after this list).

If any check passes, the alert is skipped for this cycle.
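A sketch of the throttling rule. The exact backoff formula is an assumption; only the column names (MaxTriggersPerHour, ExponentialBackoff, BackoffMultiplier, TriggerCooldown) come from the schema above:

```python
def should_throttle_alert(triggers_last_hour, max_per_hour,
                          backoff_enabled, backoff_multiplier,
                          base_cooldown, seconds_since_last_trigger):
    """Illustrative throttle check: hard hourly rate limit, plus an
    optional exponential backoff between repeated triggers."""
    if triggers_last_hour >= max_per_hour:
        return True  # MaxTriggersPerHour reached -- always throttle
    if backoff_enabled == "Y" and triggers_last_hour > 0:
        # Assumed formula: required gap grows by BackoffMultiplier
        # for each trigger already seen this hour.
        required_gap = base_cooldown * (backoff_multiplier ** triggers_last_hour)
        return seconds_since_last_trigger < required_gap
    return False
```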
#### alert_track()

On a positive trigger result, alert_track() is called with the alert name, package, message, value, severity, and notify mode. It:

- inserts a row into track_alert with AlertNote = the trigger message
- updates LastTrigger on def_alert
- for HIGH or CRITICAL severity, auto-creates an incident via ObjAlertIncident.create_incident() and links the track_alert row
- calls ObjNotify.Run() to dispatch notifications — Slack, SMS, email — with the incident reference appended (e.g. [Ref: INC-A3F2])

Notification is skipped if alert_note == "PASS" or notify_mode == AlertNotifyMode.NONE (used internally to prevent re-notification loops from MQTT unpack).
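An illustrative call matching the parameters listed above. The keyword names are assumptions; check the actual method signature:

```python
# Hypothetical invocation -- argument names are illustrative.
alert_obj.alert_track(
    alert="DB_CONNECTIONS_HIGH",            # alert name
    package="demo",
    alert_note="DB connection count is 97", # trigger message ("PASS" skips notify)
    alert_value=97,
    severity="HIGH",                        # HIGH/CRITICAL also creates an incident
    notify_mode="DEFAULT",                  # AlertNotifyMode.NONE suppresses dispatch
)
```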
#### check_escalations()

At the end of every check() run, check_escalations() queries track_alert for unacknowledged trigger rows that have passed EscalationMinutes. For each:

- EscalationLevel is incremented (max 3)
- escalate_alert() triggers the workflow ALERT_ESCALATION_L{n} with the targets configured for that level (EscalationLevel1, EscalationLevel2, EscalationLevel3); a sketch of this pass follows the acknowledgment example below

#### acknowledge_alert()

A team member acknowledges via the dashboard or workflow:
```python
alert_obj.acknowledge_alert(alert_guid="...", user="engineer1")
```
This stamps AcknowledgedBy, AcknowledgedAt, and
TimeToAcknowledge on the track_alert row. Escalation stops for
this trigger instance.
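The escalation pass referenced above, sketched over track_alert rows as dicts. The row shape and the trigger_workflow callable are assumptions; the level cap, the cutoff rule, and the ALERT_ESCALATION_L{n} workflow names come from the text:

```python
from datetime import datetime, timedelta

def run_escalation_pass(rows, escalation_minutes, trigger_workflow):
    """Illustrative escalation sweep: bump unacknowledged triggers
    that have sat longer than EscalationMinutes, to a max level of 3."""
    cutoff = datetime.now() - timedelta(minutes=escalation_minutes)
    for row in rows:
        if row["AcknowledgedAt"] is None and row["AlertTriggerTime"] < cutoff:
            if row["EscalationLevel"] < 3:        # cap at level 3
                row["EscalationLevel"] += 1
                level = row["EscalationLevel"]
                # Targets come from EscalationLevel1/2/3 on def_alert
                trigger_workflow(f"ALERT_ESCALATION_L{level}", row)
```

Acknowledging a trigger stops this sweep for that row, since AcknowledgedAt is no longer NULL.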
Notes are added to the linked incident via ObjAlertIncident:
```python
incident_obj.add_incident_notes(
    incident_guid="...",
    note_type="ROOT_CAUSE",
    notes="Long-running report query consuming all DB connections",
    user="engineer1",
)
incident_obj.update_incident_status(incident_guid, "INVESTIGATING")
```
Each status change and note appends a timestamped entry to the incident
timeline.
#### check_recoveries()

Also runs at the end of every check(). It queries track_alert for alerts where a PASS result has appeared within the last 30 minutes after a prior trigger. For each recovered alert:

- send_recovery_notification() triggers the ALERT_RECOVERY workflow
- resolve_alert() is called with user="SYSTEM", stamping ResolvedAt, TimeToResolve, and ResolutionNotes

The linked incident is then closed with resolution details:

```python
incident_obj.close_incident(
    incident_guid="...",
    resolution="Killed long-running query, added query timeout",
    root_cause="Reporting query missing WHERE index",
    lessons_learned="Add query timeout to all reporting connections",
    user="engineer1",
)
```
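Back on the detection side, a sketch of the 30-minute PASS-window scan. The row structure mirrors the track_alert columns above; the scan logic is an assumption:

```python
from datetime import datetime, timedelta

def find_recovered(rows):
    """Illustrative recovery scan: a PASS row in the last 30 minutes
    that follows an unresolved trigger means the alert has recovered."""
    window_start = datetime.now() - timedelta(minutes=30)
    recovered, latest_trigger = [], {}
    for row in sorted(rows, key=lambda r: r["AlertTriggerTime"]):
        key = (row["Alert"], row["Package"])
        if row["AlertNote"] == "PASS":
            prior = latest_trigger.get(key)
            if (prior and prior["ResolvedAt"] is None
                    and row["AlertTriggerTime"] >= window_start):
                recovered.append(prior)  # resolve_alert(user="SYSTEM") runs on these
        else:
            latest_trigger[key] = row    # most recent real trigger for this alert
    return recovered
```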
Status moves through CREATED → ACKNOWLEDGED → INVESTIGATING → RESOLVED → CLOSED. Each transition is written to the incident timeline.
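The transition order can be enforced with a simple guard. The mapping below is a sketch of the flow just described, not the shipped implementation:

```python
# Allowed incident status transitions, per the lifecycle above.
ALLOWED_TRANSITIONS = {
    "CREATED": {"ACKNOWLEDGED"},
    "ACKNOWLEDGED": {"INVESTIGATING"},
    "INVESTIGATING": {"RESOLVED"},
    "RESOLVED": {"CLOSED"},
    "CLOSED": set(),                     # terminal state
}

def can_transition(current, target):
    return target in ALLOWED_TRANSITIONS.get(current, set())
```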
### Runbooks

Runbook stubs can be generated from a template and stored alongside package-specific documentation.
```bash
# Generate runbook for an alert
python factory.core/ObjAlert.py generate-runbook API_RESPONSE_TIME

# Generate for all alerts without a runbook
python factory.core/ObjAlert.py generate-all-runbooks

# Display an existing runbook
python factory.core/ObjAlert.py show-runbook API_RESPONSE_TIME

# List all available runbooks
python factory.core/ObjAlert.py list-runbooks
```
Runbooks are stored at:

- local.documents/package.{package}/runbooks/{Alert}.md
- local.documents/base/runbooks/{Alert}.md

The RunbookUrl column on def_alert is updated automatically when a stub is generated. The runbook path is included in alert context via enrich_alert_context().
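A sketch of the lookup implied by the two locations, assuming the package-specific file takes precedence over the base file (the helper name is illustrative; path handling only, the real enrich_alert_context() adds more fields):

```python
from pathlib import Path

def resolve_runbook_path(alert, package):
    """Package-specific runbook wins; fall back to the base runbook."""
    candidates = [
        Path(f"local.documents/package.{package}/runbooks/{alert}.md"),
        Path(f"local.documents/base/runbooks/{alert}.md"),
    ]
    for path in candidates:
        if path.exists():
            return path
    return None  # no runbook generated yet
```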
### Metrics and anomaly detection

Every check() run writes a row to def_alert_metric recording timing, counts, and memory. These feed into:

- detect_anomalies() — Z-score analysis against a rolling baseline; flags slow check runs (SLOW_CHECK_EXECUTION), slow trigger SQL (SLOW_TRIGGER_SQL), alert storms (ALERT_STORM), and memory growth (MEMORY_INCREASE). A sketch of the Z-score test follows this list.
- check_predictive_alerts() — fires synthetic alerts (PREDICTIVE_ALERT_STORM, PREDICTIVE_DB_DEGRADATION, etc.) based on these metrics.
- AlertMetrics — in-process counters exposed via ObjAlert._metrics.format_stats() for live visibility.

Metric cleanup runs every 50 check cycles, deleting rows older than 90 days in batches to avoid table locks.
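The Z-score test referenced above, as a minimal sketch over a rolling baseline of recent values (e.g. ElapsedSeconds). The threshold of 3.0 is illustrative:

```python
import statistics

def is_anomalous(value, baseline, z_threshold=3.0):
    """Flag `value` when it sits more than z_threshold standard
    deviations from the rolling baseline mean."""
    if len(baseline) < 2:
        return False                     # not enough history to judge
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return value != mean             # flat baseline: any change is anomalous
    return abs(value - mean) / stdev > z_threshold

# e.g. a SLOW_CHECK_EXECUTION candidate:
#   is_anomalous(elapsed_seconds, last_100_elapsed_values)
```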
### Alert correlation

These methods analyse historical track_alert data to surface patterns:

| Method | Purpose |
|---|---|
| analyze_alert_correlations() | Scores how often alert pairs fire within a time window |
| detect_alert_sequences() | Finds A-then-B temporal patterns with average delay |
| suggest_dependencies() | Recommends parent/child relationships to add to def_alert_dependency |
| group_correlated_alerts() | Clusters alerts into groups using graph DFS |
| get_correlation_matrix() | Full N×N correlation score matrix for dashboard display |
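How DFS clustering of the kind group_correlated_alerts() performs can work, sketched over a dict of pairwise scores. The input shape and the 0.5 threshold are assumptions; the graph-DFS technique is named in the table above:

```python
def group_correlated(scores, threshold=0.5):
    """Cluster alerts whose pairwise correlation exceeds `threshold`,
    using iterative depth-first search over the induced graph."""
    graph = {}
    for (a, b), score in scores.items():       # scores: {(alert_a, alert_b): float}
        if score >= threshold:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, group = [node], []
        while stack:                            # DFS over one connected component
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            group.append(cur)
            stack.extend(graph[cur] - seen)
        groups.append(sorted(group))
    return groups
```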
### Dashboard JSON

get_alert_dashboard_json() returns a single dict suitable for a monitoring dashboard:

```python
alert_obj = ObjAlert()
dashboard = alert_obj.get_alert_dashboard_json(package="homechoice")
# {
#   "overview": { active_alerts, health_status, ... },
#   "top_alerts_24h": [...],
#   "mttr": { mttr_minutes, sample_size, ... },
#   "mttd": { mttd_minutes, ... },
#   "sla_compliance": { overall: { compliance_pct }, by_severity: {...} },
#   "trend_7d": [{ date, total_triggers, critical, high, ... }]
# }
```
SLA targets are hardcoded in calculate_sla_compliance():
| Severity | Acknowledge | Resolve |
|---|---|---|
| CRITICAL | 15 min | 4 hours |
| HIGH | 30 min | 8 hours |
| MEDIUM | 2 hours | 24 hours |
| LOW | 8 hours | 72 hours |
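A sketch of the per-row compliance test against these targets. The target values mirror the table exactly; the function and field names are assumptions:

```python
# Targets from the table above, converted to seconds.
SLA_TARGETS = {
    "CRITICAL": {"ack": 15 * 60,  "resolve": 4 * 3600},
    "HIGH":     {"ack": 30 * 60,  "resolve": 8 * 3600},
    "MEDIUM":   {"ack": 2 * 3600, "resolve": 24 * 3600},
    "LOW":      {"ack": 8 * 3600, "resolve": 72 * 3600},
}

def row_meets_sla(severity, time_to_acknowledge, time_to_resolve):
    """A track_alert row is compliant when both the acknowledge and
    resolve times beat the targets for its severity."""
    target = SLA_TARGETS.get(severity)
    if target is None:
        return True  # e.g. INFO carries no SLA target
    return (time_to_acknowledge <= target["ack"]
            and time_to_resolve <= target["resolve"])
```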
### CLI reference

```bash
python factory.core/ObjAlert.py check [--package PKG] [--archetype ARC] [--remote-check]
python factory.core/ObjAlert.py remote
python factory.core/ObjAlert.py service
python factory.core/ObjAlert.py generate-runbook ALERT_NAME [--package PKG] [--force]
python factory.core/ObjAlert.py generate-all-runbooks [--package PKG] [--force]
python factory.core/ObjAlert.py show-runbook ALERT_NAME [--package PKG]
python factory.core/ObjAlert.py list-runbooks [--package PKG]
```
### Related modules

- ObjAlertIncident — incident lifecycle management
- ObjNotify — multi-channel notification delivery
- ObjMonitor — hardware/system metrics collection (used by check() when remote_check=True)