¶ Alert and Incident Management
Incident Management System - NEW
Comprehensive documentation for the incident management module including:
- Incident lifecycle management (CREATED → INVESTIGATING → MONITORING → CLOSED)
- Investigation notes and documentation (ROOT_CAUSE, RESOLUTION, INVESTIGATION, MITIGATION)
- Communication tracking with unique short references (REF-YYYYMMDD-XXX)
- Multi-channel notification integration
- Alert linking and incident merging
- Metrics and reporting capabilities
- Complete API reference with examples
- CLI command reference
- Integration patterns
Key Features:
- 66 total tests (48 unit + 18 integration)
- Automatic communication logging
- Short reference lookup system
- Timeline and audit trail
- HTML/Markdown/Text report generation
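The lifecycle listed above (CREATED → INVESTIGATING → MONITORING → CLOSED) implies some form of status validation. The following is a minimal sketch of how such a check could look, not the module's actual implementation. `IncidentStatus`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical names, and the sketch assumes only forward, single-step transitions are allowed; the real module may be more permissive (for example, by allowing an incident to be reopened).

```python
from enum import Enum


class IncidentStatus(Enum):
    """Lifecycle states named in this document (hypothetical enum)."""
    CREATED = "CREATED"
    INVESTIGATING = "INVESTIGATING"
    MONITORING = "MONITORING"
    CLOSED = "CLOSED"


# Assumption: each state may only advance to the next one in the chain.
ALLOWED_TRANSITIONS = {
    IncidentStatus.CREATED: {IncidentStatus.INVESTIGATING},
    IncidentStatus.INVESTIGATING: {IncidentStatus.MONITORING},
    IncidentStatus.MONITORING: {IncidentStatus.CLOSED},
    IncidentStatus.CLOSED: set(),
}


def can_transition(current: IncidentStatus, target: IncidentStatus) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS[current]
```

Keeping the allowed transitions in a single table makes it easy to adjust the rules without touching the check itself.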
Alert Management System - UPDATED
Documentation for alert generation and tracking:
- Alert lifecycle (ACTIVE → ACKNOWLEDGED → RESOLVED → SUPPRESSED)
- Alert types and severity levels
- Integration with incident management
- Alert aggregation and suppression
- Metrics and trend analysis
- Best practices for alert design and response
- Example workflows
Alert Types:
- RESOURCE_THRESHOLD (CPU, memory, disk)
- SERVICE_DOWN
- ERROR_RATE
- LATENCY
- SECURITY
- DATABASE
- NETWORK
- CUSTOM
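To illustrate how the alert types above might interact with alert aggregation and suppression, here is a minimal sketch. It is not taken from ObjAlert: `AlertType`, `AlertSuppressor`, `should_emit`, and the five-minute window are all assumptions made for the example. It suppresses repeats of the same (type, source) pair inside a fixed time window.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class AlertType(Enum):
    """Alert types listed in this document (hypothetical enum)."""
    RESOURCE_THRESHOLD = "RESOURCE_THRESHOLD"
    SERVICE_DOWN = "SERVICE_DOWN"
    ERROR_RATE = "ERROR_RATE"
    LATENCY = "LATENCY"
    SECURITY = "SECURITY"
    DATABASE = "DATABASE"
    NETWORK = "NETWORK"
    CUSTOM = "CUSTOM"


@dataclass
class AlertSuppressor:
    """Suppress repeats of the same (type, source) within a time window."""
    window: timedelta = timedelta(minutes=5)  # assumed default, not from the source
    _last_emitted: dict = field(default_factory=dict)

    def should_emit(self, alert_type: AlertType, source: str, now: datetime) -> bool:
        key = (alert_type, source)
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: suppress
        # Only emitted alerts reset the window; suppressed ones do not.
        self._last_emitted[key] = now
        return True
```

A caller would ask `should_emit(...)` before creating an alert record, so that a flapping check produces one alert per window rather than one per evaluation.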
Notification System - UPDATED
Multi-channel notification system with incident integration.
Comprehensive Alerts & Incidents Guide - NEW
Complete guide covering:
- Conceptual overview (alerts vs incidents)
- Database schema reference
- Python API usage with examples
- Communication system and short references
- CLI commands
- Metrics and reporting
- Best practices
- Complete workflow examples
- Troubleshooting guide
Sections:
- Overview - What are alerts and incidents
- Database Tables - Complete schema
- Working with Incidents - Python API
- Incident Communication System - Notifications and tracking
- CLI Commands - Command-line interface
- Metrics and Reporting - Analytics
- Best Practices - Guidelines
- Example Workflows - Real-world scenarios
- Troubleshooting - Common issues
- factory.core/ObjAlertIncident.md (690 lines)
  - Complete API reference
  - Database schema documentation
  - Integration examples
  - CLI reference
  - Testing information
- factory.core/ObjAlert.md (483 lines)
  - Replaced minimal placeholder content
  - Comprehensive alert management guide
  - Integration with incidents
  - Best practices
- resource.notes/howto/howto_alerts_and_incidents.md (665 lines)
  - User-focused guide
  - Step-by-step examples
  - Complete workflows
  - Troubleshooting tips
- factory.core/ObjNotify.md (600 lines total, +224 new)
  - Added "Integration with Incident Management" section
  - Short reference system documentation
  - Automatic communication logging
  - Communication query examples
  - Best practices for incident notifications
Getting Started:
API Reference:
Workflows:
On-Call Engineer:
- Alert Response Best Practices
- Incident Lifecycle
- CLI Commands
- Communication Tracking
Developer:
- Python API Usage
- ObjAlertIncident Methods
- Integration Patterns
- Testing
System Administrator:
- Database Schema
- Channel Configuration
- Metrics and Reporting
- Troubleshooting
- ✅ Complete lifecycle tracking (create → investigate → resolve)
- ✅ Investigation notes with categorization
- ✅ Timeline and audit trail
- ✅ Alert linking and incident merging
- ✅ Status validation and transitions
- ✅ Metrics and trend analysis
- ✅ Report generation (text/markdown/HTML)
- ✅ Email delivery
- ✅ Unique short reference system (REF-YYYYMMDD-XXX)
- ✅ Automatic logging for all channels
- ✅ Reference embedding in messages
- ✅ Multi-channel support (7 channels)
- ✅ Communication history and filtering
- ✅ External ID tracking (PagerDuty, Slack, etc.)
- ✅ Metadata support (JSON format)
- ✅ Delivery statistics
- ✅ Multi-channel delivery
- ✅ Message templating with placeholders
- ✅ Asynchronous and direct flows
- ✅ Channel-specific wrappers
- ✅ Configuration via database
- ✅ Test commands for all channels
- ✅ Incident context integration
- ✅ 48 unit tests for ObjAlertIncident
- ✅ 18 integration tests for communication system
- ✅ Comprehensive test coverage
- ✅ Test documentation and examples
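The checklist above mentions message templating with placeholders and reference embedding in messages. As a rough illustration only, the sketch below uses Python's standard `string.Template`. The template text, the placeholder names, and `render_incident_message` are hypothetical and do not reflect the actual ObjNotify template syntax.

```python
from string import Template

# Hypothetical template; the $name placeholders are illustrative only.
INCIDENT_TEMPLATE = Template(
    "[$severity] $title (status: $status)\nReference: $ref"
)


def render_incident_message(title: str, severity: str, status: str, ref: str) -> str:
    """Fill the template; substitute() raises KeyError if a placeholder is missing."""
    return INCIDENT_TEMPLATE.substitute(
        title=title, severity=severity, status=status, ref=ref
    )


message = render_incident_message(
    title="Database Connection Pool Exhausted",
    severity="CRITICAL",
    status="INVESTIGATING",
    ref="REF-20251229-A7K",
)
print(message)
```

Embedding the short reference in the message body gives the recipient something to quote back, which is what makes a later lookup by reference possible.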
¶ Quick Start: Create Incident and Notify
from factory.core import ObjAlertIncident

# Create incident manager
incident = ObjAlertIncident.ObjAlertIncident()

# Create incident
guid = incident.create_incident(
    title="Database Connection Pool Exhausted",
    severity="CRITICAL",
    package="DATABASE",
    description="All connection slots occupied",
    assigned_to="dba-team"
)

# Send notifications
stats = incident.notify_incident(
    incident_guid=guid,
    notify_code="INCIDENT_CRITICAL"
)
print(f"Sent: {stats['sent']}, Failed: {stats['failed']}")

# User provides reference from notification: REF-20251229-A7K
comm = incident.get_communication_by_ref("REF-20251229-A7K")
if comm:
    print(f"Incident: {comm['IncidentTitle']}")
    print(f"Severity: {comm['IncidentSeverity']}")
    print(f"Status: {comm['IncidentStatus']}")
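Because users supply the short reference by hand, it can help to check its shape before calling `get_communication_by_ref`. The sketch below is an assumption-laden example, not part of ObjAlertIncident. `REF_PATTERN` and `parse_short_ref` are hypothetical, and the suffix is assumed to be three uppercase alphanumeric characters, based only on the sample `REF-20251229-A7K`.

```python
import re
from datetime import date, datetime
from typing import Optional, Tuple

# Assumption: three uppercase alphanumeric characters, as in REF-20251229-A7K.
REF_PATTERN = re.compile(r"^REF-(\d{8})-([A-Z0-9]{3})$")


def parse_short_ref(ref: str) -> Optional[Tuple[date, str]]:
    """Return (date, suffix) for a well-formed reference, else None."""
    match = REF_PATTERN.match(ref.strip().upper())
    if match is None:
        return None
    try:
        day = datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        return None  # eight digits that are not a real calendar date
    return day, match.group(2)
```

Normalizing case and whitespace first means a reference pasted from an email or chat message still parses, while malformed input is rejected before any database lookup.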
¶ Documentation Standards
All documentation follows these standards:
- Structure
  - Clear overview section
  - Class/module diagram where applicable
  - Database schema tables
  - Method documentation with parameters and returns
  - Code examples for all major features
  - Related documentation links
- Examples
  - Real-world scenarios
  - Complete, runnable code
  - Expected output included
  - Best practices highlighted
- API Documentation
  - All public methods documented
  - Parameter types and descriptions
  - Return value specifications
  - Usage examples
- Maintenance
  - All docs updated when code changes
  - Test coverage documented
  - Version/date information where needed
- Issues: Report documentation issues on GitHub
- Questions: Contact development team
- Updates: Documentation updated with each feature release
Last Updated: 2025-12-29
Documentation Version: 1.0
Code Version: Aligns with feat/tensorflow branch