The Axion alerts and incidents system provides comprehensive monitoring,
incident management, and communication tracking capabilities. This guide
covers how to work with alerts, incidents, and their integration with
notification systems.
Alerts are automated notifications triggered by monitoring systems when
specific conditions are met. They represent individual events that require
attention, such as:
Incidents are managed containers that group related alerts and track the
investigation and resolution process. An incident represents a problem
that requires human intervention and follows a defined lifecycle from
creation to resolution.
Key Difference: Alerts are automated event notifications, while
incidents are managed work items with tracking, notes, assignments, and
resolution workflows.
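As a rough illustration of that distinction, the sketch below models an alert as a small immutable event record and an incident as a mutable work item that groups alerts and accumulates notes. The class and field names here are hypothetical and are not the Axion data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Alert:
    """An automated event notification (hypothetical shape)."""
    alert_guid: str
    message: str


@dataclass
class Incident:
    """A managed work item that groups related alerts (hypothetical shape)."""
    title: str
    severity: str
    status: str = "CREATED"
    assigned_to: str = ""
    alerts: List[Alert] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


# Two alerts raised by monitoring are grouped under a single incident.
a1 = Alert("alert-guid-1", "Connection pool at 100%")
a2 = Alert("alert-guid-2", "Login requests timing out")
incident = Incident(
    title="Database Connection Pool Exhausted",
    severity="CRITICAL",
    assigned_to="dba-team",
    alerts=[a1, a2],
)
incident.notes.append("Investigating a possible connection leak")
print(len(incident.alerts), incident.status)
```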
The main incident tracking table (data_incident):
Key Columns:
IncidentGuid: Unique identifier (UUID format)
Title: Short description of the incident
Severity: CRITICAL, HIGH, MEDIUM, LOW
Status: CREATED, INVESTIGATING, MONITORING, CLOSED
Package: System package/module affected
Description: Detailed incident description
AssignedTo: User responsible for resolution
CreatedAt: Incident creation timestamp
ClosedAt: Resolution timestamp

Links alerts to incidents (many-to-many relationship):
Key Columns:
IncidentGuid: Reference to data_incident
AlertGuid: Alert identifier
LinkedAt: When alert was linked to incident

Stores investigation notes and documentation:
Key Columns:
IncidentGuid: Reference to data_incident
NoteType: ROOT_CAUSE, RESOLUTION, INVESTIGATION, MITIGATION, GENERAL
Content: Note text
CreatedBy: User who created the note
CreatedAt: Note timestamp

Tracks all incident events and state changes:
Key Columns:
IncidentGuid: Reference to data_incident
EventType: STATUS_CHANGE, ALERT_LINKED, NOTE_ADDED, etc.
Description: Event description
CreatedBy: User who triggered the event
Timestamp: When event occurred

Tracks all notifications sent about incidents:
Key Columns:
CommunicationGuid: Unique identifier
IncidentGuid: Reference to data_incident
ShortRef: Human-readable reference (REF-YYYYMMDD-XXX)
Channel: PAGERDUTY, SLACK, DISCORD, SMS, WHATSAPP, DBUS, MQTT
Direction: OUTBOUND, INBOUND
Status: SENT, FAILED, PENDING
Message: Message content
Recipient: Notification recipient
ExternalId: External system reference (e.g., PagerDuty dedup_key)
Timestamp: Send timestamp
Metadata: Additional data (JSON format)

from factory.core import ObjAlertIncident
# Initialize the incident manager
incident_mgr = ObjAlertIncident.ObjAlertIncident()
# Create a new incident
incident_guid = incident_mgr.create_incident(
    title="Database Connection Pool Exhausted",
    severity="CRITICAL",
    package="DATABASE",
    description="All connection pool slots occupied, new requests failing",
    assigned_to="dba-team",
    alert_links=["alert-guid-1", "alert-guid-2"]  # Optional
)
print(f"Created incident: {incident_guid}")
# Update incident status
incident_mgr.update_incident_status(
    incident_guid=incident_guid,
    new_status="INVESTIGATING",
    updated_by="admin"
)
Valid Status Transitions:
# Add root cause analysis
incident_mgr.add_incident_note(
    incident_guid=incident_guid,
    note_type="ROOT_CAUSE",
    content="Connection leak in UserService.authenticate() method",
    created_by="admin"
)
# Add resolution steps
incident_mgr.add_incident_note(
    incident_guid=incident_guid,
    note_type="RESOLUTION",
    content="Deployed fix to properly close connections in finally block",
    created_by="admin"
)
Note Types:
ROOT_CAUSE: What caused the incident
RESOLUTION: How it was fixed
INVESTIGATION: Findings during investigation
MITIGATION: Temporary workarounds applied
GENERAL: Other notes

# Close with full details
incident_mgr.close_incident(
    incident_guid=incident_guid,
    resolution_summary="Fixed connection leak and restarted service",
    root_cause="Missing try-finally block in authentication code",
    closed_by="admin"
)
# Link additional alerts to existing incident
incident_mgr.link_alert_to_incident(
    incident_guid=incident_guid,
    alert_guid="alert-guid-3"
)
# Get all alerts for an incident
alerts = incident_mgr.get_incident_alerts(incident_guid)
for alert in alerts:
    print(f"Alert: {alert['AlertGuid']} linked at {alert['LinkedAt']}")
When multiple incidents are created for the same issue:
# Merge incident B into incident A
incident_mgr.merge_incidents(
    source_incident_guid="incident-B-guid",
    target_incident_guid="incident-A-guid",
    merged_by="admin"
)
# All alerts, notes, and timeline from B are moved to A
# Incident B is closed with status indicating merge
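The merge behavior described in the comments above can be pictured with a small in-memory sketch. This is not the `merge_incidents` implementation; the dictionary layout, the `merged_into` field, and the de-duplication of items are assumptions made for illustration.

```python
def merge_incidents(incidents, source_guid, target_guid, merged_by):
    """Move alerts, notes, and timeline from source to target, then close source."""
    source = incidents[source_guid]
    target = incidents[target_guid]
    for key in ("alerts", "notes", "timeline"):
        for item in source[key]:
            # Skip items the target already has (e.g. an alert linked to both).
            if item not in target[key]:
                target[key].append(item)
        source[key] = []
    source["status"] = "CLOSED"
    source["merged_into"] = target_guid  # hypothetical marker for the merge
    target["timeline"].append(f"MERGED {source_guid} by {merged_by}")
    return target


incidents = {
    "incident-A-guid": {"status": "INVESTIGATING", "alerts": ["alert-1"],
                        "notes": ["A: pool exhausted"], "timeline": ["A created"]},
    "incident-B-guid": {"status": "CREATED", "alerts": ["alert-1", "alert-2"],
                        "notes": ["B: login failures"], "timeline": ["B created"]},
}
merged = merge_incidents(incidents, "incident-B-guid", "incident-A-guid", "admin")
print(merged["alerts"])  # ['alert-1', 'alert-2']
```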
The incident system integrates with ObjNotify to send notifications across
multiple channels:
# Send critical incident notification
stats = incident_mgr.notify_incident(
    incident_guid=incident_guid,
    notify_code="INCIDENT_CRITICAL",
    sent_by="SYSTEM"
)
print(f"Notifications sent: {stats['sent']}, failed: {stats['failed']}")
Notification Features:
Every notification receives a unique, human-readable reference:
Format: REF-YYYYMMDD-XXX
REF: Prefix
YYYYMMDD: Date (e.g., 20251229)
XXX: 3-character random code (A-Z, 0-9)
Example: REF-20251229-A7K
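A reference in this format could be generated and validated along the following lines. This is a minimal sketch, not the system's own generator: the helper names `make_short_ref` and `is_valid_short_ref` are hypothetical, and the choice of the `secrets` module for the random code is an assumption.

```python
import re
import secrets
import string
from datetime import date

ALPHABET = string.ascii_uppercase + string.digits  # A-Z, 0-9
SHORT_REF_PATTERN = re.compile(r"REF-\d{8}-[A-Z0-9]{3}")


def make_short_ref(day: date) -> str:
    """Build a REF-YYYYMMDD-XXX reference for the given date (hypothetical helper)."""
    code = "".join(secrets.choice(ALPHABET) for _ in range(3))
    return f"REF-{day:%Y%m%d}-{code}"


def is_valid_short_ref(ref: str) -> bool:
    """Check that a string matches the documented format exactly."""
    return SHORT_REF_PATTERN.fullmatch(ref) is not None


ref = make_short_ref(date(2025, 12, 29))
print(ref)  # the 3-character suffix is random
print(is_valid_short_ref("REF-20251229-A7K"))
```

With 36 possible characters in each of the 3 positions there are 46,656 codes per day, so a generator along these lines would need to retry on a collision if references must be unique.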
# Get all communications for an incident
comms = incident_mgr.get_incident_communications(
    incident_guid=incident_guid,
    channel="SLACK",  # Optional filter
    limit=50  # Optional pagination
)
for comm in comms:
    print(f"{comm['ShortRef']}: {comm['Channel']} - {comm['Status']}")
# Look up specific communication by reference
comm = incident_mgr.get_communication_by_ref("REF-20251229-A7K")
if comm:
    print(f"Message sent via {comm['Channel']} to {comm['Recipient']}")
    print(f"Incident: {comm['IncidentTitle']} ({comm['IncidentSeverity']})")
# Manually log communication (e.g., phone call, manual action)
comm_guid, short_ref = incident_mgr.log_communication(
    incident_guid=incident_guid,
    channel="PHONE",
    direction="OUTBOUND",
    message="Called on-call engineer to escalate",
    recipient="+1-555-0123",
    status="SENT",
    metadata={"duration_seconds": 180, "answered": True}
)
print(f"Logged communication: {short_ref}")
Use placeholders in notification messages:
template = """
INCIDENT ALERT
==============
Title: {incident_title}
Severity: {incident_severity}
Status: {incident_status}
Package: {incident_package}
Description:
{incident_description}
Created: {incident_created_at}
Reference: {incident_guid}
"""
stats = incident_mgr.notify_incident(
    incident_guid=incident_guid,
    notify_code="INCIDENT_CUSTOM",
    message_template=template
)
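How the placeholders are substituted internally is not described here. One plausible approach, sketched below, uses Python's `str.format_map` with a dictionary that leaves unknown placeholders untouched. The `render_template` helper and the sample incident record are hypothetical.

```python
class _KeepUnknown(dict):
    """Dictionary that leaves unrecognized placeholders in the output as-is."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_template(template: str, incident: dict) -> str:
    """Fill {placeholder} fields from an incident record (hypothetical helper)."""
    return template.format_map(_KeepUnknown(incident))


incident = {  # sample values for illustration only
    "incident_title": "Database Down",
    "incident_severity": "CRITICAL",
    "incident_status": "INVESTIGATING",
}
message = render_template(
    "[{incident_severity}] {incident_title} ({incident_status}) {not_a_field}",
    incident,
)
print(message)  # [CRITICAL] Database Down (INVESTIGATING) {not_a_field}
```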
Available Placeholders:
{incident_guid}
{incident_title}
{incident_severity}
{incident_status}
{incident_package}
{incident_description}
{incident_created_at}

The incident system provides CLI commands for common operations:
python factory.core/ObjAlertIncident.py create-incident \
    --title "Database Down" \
    --severity CRITICAL \
    --package DATABASE \
    --description "Primary database not responding" \
    --assigned-to dba-team
# List all incidents
python factory.core/ObjAlertIncident.py list-incidents
# Filter by status
python factory.core/ObjAlertIncident.py list-incidents --status INVESTIGATING
# Filter by severity
python factory.core/ObjAlertIncident.py list-incidents --severity CRITICAL
python factory.core/ObjAlertIncident.py get-incident <incident-guid>
python factory.core/ObjAlertIncident.py notify-incident \
    <incident-guid> \
    INCIDENT_CRITICAL
# All communications for incident
python factory.core/ObjAlertIncident.py list-communications <incident-guid>
# Filter by channel
python factory.core/ObjAlertIncident.py list-communications \
    <incident-guid> \
    --channel PAGERDUTY
python factory.core/ObjAlertIncident.py lookup-ref REF-20251229-A7K
# Calculate incident metrics
metrics = incident_mgr.get_incident_metrics(
    package="DATABASE",
    start_date="2025-01-01",
    end_date="2025-01-31"
)
print(f"Total incidents: {metrics['total_incidents']}")
print(f"Average resolution time: {metrics['avg_resolution_time_hours']} hours")
print(f"By severity: {metrics['by_severity']}")
print(f"By status: {metrics['by_status']}")
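To make the returned fields concrete, the sketch below computes the same four values from a small in-memory list of incident rows. How `get_incident_metrics` actually calculates them is not stated; in particular, excluding open incidents from the average resolution time is an assumption, and the sample rows are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Sample rows using the CreatedAt / ClosedAt columns of the incident table.
rows = [
    {"Severity": "CRITICAL", "Status": "CLOSED",
     "CreatedAt": datetime(2025, 1, 3, 9, 0), "ClosedAt": datetime(2025, 1, 3, 13, 0)},
    {"Severity": "HIGH", "Status": "CLOSED",
     "CreatedAt": datetime(2025, 1, 10, 8, 0), "ClosedAt": datetime(2025, 1, 10, 10, 0)},
    {"Severity": "HIGH", "Status": "INVESTIGATING",
     "CreatedAt": datetime(2025, 1, 20, 15, 0), "ClosedAt": None},
]


def compute_metrics(rows):
    """Aggregate counts and mean time-to-close in hours (closed incidents only)."""
    hours = [
        (r["ClosedAt"] - r["CreatedAt"]).total_seconds() / 3600
        for r in rows
        if r["ClosedAt"] is not None
    ]
    return {
        "total_incidents": len(rows),
        "avg_resolution_time_hours": sum(hours) / len(hours) if hours else None,
        "by_severity": dict(Counter(r["Severity"] for r in rows)),
        "by_status": dict(Counter(r["Status"] for r in rows)),
    }


sample_metrics = compute_metrics(rows)
print(sample_metrics["avg_resolution_time_hours"])  # 3.0
```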
# Generate text report
report = incident_mgr.generate_incident_report(
    incident_guid=incident_guid,
    format="text"  # or "markdown", "html"
)
print(report)
# Email incident report
incident_mgr.email_incident_report(
    incident_guid=incident_guid,
    format="html",
    additional_recipients=["manager@example.com"]
)
Report Includes:
The system supports multiple notification channels configured via ObjNotify: PAGERDUTY, SLACK, DISCORD, SMS, WHATSAPP, DBUS, and MQTT.
Channels are configured in the def_notify table, where each notify code
defines which channels to use for a given incident type:
-- Example: Critical incident notification to multiple channels
INSERT INTO def_notify (NotifyCode, NotifyName, NotifyChannels)
VALUES (
    'INCIDENT_CRITICAL',
    'Critical Incident Alert',
    'PAGERDUTY,SLACK,SMS'
);
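A notify code stored this way can be resolved to a channel list by splitting the comma-separated NotifyChannels value. The sketch below demonstrates the idea against an in-memory SQLite table; the column types and the `channels_for` helper are assumptions, and ObjNotify may resolve channels differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for def_notify; the real column types are not documented here.
conn.execute(
    "CREATE TABLE def_notify ("
    "NotifyCode TEXT PRIMARY KEY, NotifyName TEXT, NotifyChannels TEXT)"
)
conn.execute(
    "INSERT INTO def_notify (NotifyCode, NotifyName, NotifyChannels) VALUES (?, ?, ?)",
    ("INCIDENT_CRITICAL", "Critical Incident Alert", "PAGERDUTY,SLACK,SMS"),
)


def channels_for(conn, notify_code):
    """Return the channel names configured for a notify code (hypothetical helper)."""
    row = conn.execute(
        "SELECT NotifyChannels FROM def_notify WHERE NotifyCode = ?",
        (notify_code,),
    ).fetchone()
    if row is None:
        return []
    return [name.strip() for name in row[0].split(",") if name.strip()]


print(channels_for(conn, "INCIDENT_CRITICAL"))  # ['PAGERDUTY', 'SLACK', 'SMS']
```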
The timeline automatically tracks incident events such as status changes, linked alerts, and added notes.
Use get_incident_timeline() to review the complete incident history.
Typical Flow:
Merge When:
from factory.core import ObjAlertIncident
# Initialize
incident_mgr = ObjAlertIncident.ObjAlertIncident()
# 1. Create incident
incident_guid = incident_mgr.create_incident(
    title="API Response Time Degraded",
    severity="HIGH",
    package="WEBAPI",
    description="95th percentile latency exceeded 5 seconds",
    assigned_to="backend-team"
)
# 2. Send initial notification
incident_mgr.notify_incident(
    incident_guid=incident_guid,
    notify_code="INCIDENT_HIGH_PRIORITY",
    sent_by="MONITORING_SYSTEM"
)
# 3. Update status as investigation begins
incident_mgr.update_incident_status(
    incident_guid=incident_guid,
    new_status="INVESTIGATING",
    updated_by="engineer1"
)
# 4. Add investigation notes
incident_mgr.add_incident_note(
    incident_guid=incident_guid,
    note_type="INVESTIGATION",
    content="Database query logs show slow query on orders table",
    created_by="engineer1"
)
# 5. Document root cause
incident_mgr.add_incident_note(
    incident_guid=incident_guid,
    note_type="ROOT_CAUSE",
    content="Missing index on orders.created_at column causing full table scans",
    created_by="engineer1"
)
# 6. Apply mitigation
incident_mgr.add_incident_note(
    incident_guid=incident_guid,
    note_type="MITIGATION",
    content="Added index: CREATE INDEX idx_orders_created_at ON orders(created_at)",
    created_by="dba1"
)
# 7. Monitor for stability
incident_mgr.update_incident_status(
    incident_guid=incident_guid,
    new_status="MONITORING",
    updated_by="engineer1"
)
# 8. Close incident
incident_mgr.close_incident(
    incident_guid=incident_guid,
    resolution_summary="Added database index, latency returned to normal",
    root_cause="Missing index on frequently queried column",
    closed_by="engineer1"
)
# 9. Send resolution notification
incident_mgr.notify_incident(
    incident_guid=incident_guid,
    notify_code="INCIDENT_RESOLVED",
    sent_by="engineer1"
)
# 10. Generate post-incident report
report = incident_mgr.generate_incident_report(
    incident_guid=incident_guid,
    format="markdown"
)
# Email to stakeholders
incident_mgr.email_incident_report(
    incident_guid=incident_guid,
    format="html",
    additional_recipients=["team-lead@example.com", "product-owner@example.com"]
)
Check that:
data_incident_communication table
notify_incident() is being called (not direct ObjNotify)

Check:
def_notify

CREATE TABLE data_incident_communication (
    CommunicationGuid VARCHAR(255) PRIMARY KEY,
    IncidentGuid VARCHAR(255) NOT NULL,
    ShortRef VARCHAR(20) UNIQUE NOT NULL,
    Channel VARCHAR(50),
    Direction ENUM('OUTBOUND', 'INBOUND') DEFAULT 'OUTBOUND',
    Status VARCHAR(50) DEFAULT 'SENT',
    Message TEXT,
    Recipient TEXT,
    ExternalId VARCHAR(255),
    ErrorMessage TEXT,
    Timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    Metadata JSON,
    FOREIGN KEY (IncidentGuid) REFERENCES data_incident(IncidentGuid)
        ON DELETE CASCADE,
    INDEX idx_incident_guid (IncidentGuid),
    INDEX idx_short_ref (ShortRef),
    INDEX idx_channel (Channel),
    INDEX idx_timestamp (Timestamp)
);
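The foreign key to data_incident is what allows a lookup by ShortRef to return incident details alongside the communication record. The sketch below shows such a join against a reduced SQLite version of both tables. The SQLite column types, the sample rows, and the `lookup_ref` helper are assumptions, and the real lookup may be implemented differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Reduced, SQLite-flavored versions of the two tables (illustration only).
conn.executescript("""
CREATE TABLE data_incident (
    IncidentGuid TEXT PRIMARY KEY,
    Title TEXT,
    Severity TEXT
);
CREATE TABLE data_incident_communication (
    CommunicationGuid TEXT PRIMARY KEY,
    IncidentGuid TEXT NOT NULL REFERENCES data_incident(IncidentGuid),
    ShortRef TEXT UNIQUE NOT NULL,
    Channel TEXT,
    Status TEXT DEFAULT 'SENT',
    Recipient TEXT
);
INSERT INTO data_incident VALUES ('inc-1', 'Database Down', 'CRITICAL');
INSERT INTO data_incident_communication
    (CommunicationGuid, IncidentGuid, ShortRef, Channel, Recipient)
VALUES ('comm-1', 'inc-1', 'REF-20251229-A7K', 'SLACK', '#ops');
""")


def lookup_ref(conn, short_ref):
    """Return one communication joined with its incident, or None (hypothetical helper)."""
    row = conn.execute(
        """SELECT c.ShortRef, c.Channel, c.Status, c.Recipient,
                  i.Title AS IncidentTitle, i.Severity AS IncidentSeverity
           FROM data_incident_communication AS c
           JOIN data_incident AS i ON i.IncidentGuid = c.IncidentGuid
           WHERE c.ShortRef = ?""",
        (short_ref,),
    ).fetchone()
    return dict(row) if row else None


found = lookup_ref(conn, "REF-20251229-A7K")
print(found["Channel"], found["IncidentTitle"], found["IncidentSeverity"])
```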
The alerts and incidents system provides:
Use incidents to organize and track response to alerts, document
investigations, and maintain communication history across all channels.