Dead Letter Queue (DLQ) System Guide
Overview
The AgMission DLQ system provides robust error handling, monitoring, and archival for failed partner integration tasks. Built on RabbitMQ's Dead Letter Exchange (DLX) mechanism, it automatically captures failed messages, enriches them with diagnostic metadata, and provides configurable retention with smart alerting.
Key Features:
- Automatic message enrichment with error categorization
- Configurable retention (default: 365 days) with TTL-based auto-archival
- Smart admin email alerts based on thresholds (warning @ 20, critical @ 50 messages)
- Real-time dashboard monitoring at `/dlq-monitor.html`
- Health check integration for infrastructure monitoring
- Complete audit trail preservation in organized filesystem archives
Architecture
Message Flow
flowchart LR
Task[Partner Task] --> Main[Main Queue<br/>partner_tasks]
Main -->|Processing| Success[✓ Success]
Main -->|On Failure| DLQ[Dead Letter Queue<br/>partner_tasks_failed]
DLQ -->|After DLQ_RETENTION_DAYS| Archive[Archive Queue<br/>partner_tasks_archive]
Archive --> Filesystem[Filesystem Archive<br/>./dlq_archives/YYYY/MM/DD/]
Components
- Main Worker (`workers/partner_sync_worker.js`)
  - Processes partner tasks from the `partner_tasks` queue
  - Enriches messages with error metadata before DLQ routing
  - Implements retry logic with max attempts (default: 5)
- DLQ API (`routes/dlq.js`, `controllers/dlq.js`)
  - Global queue-native endpoints at `/api/dlq/:queueName/*`
  - Direct RabbitMQ operations (no MongoDB coupling)
  - Supports all queue types (partner_tasks, jobs, etc.)
  - Provides retry operations: retryAll, retryByPosition, retryByHeader
- Archival Worker (`workers/dlq_archival_worker.js`)
  - Consumes TTL-expired messages from the archive queue
  - Writes JSON files organized by date: `YYYY/MM/DD/timestamp_tasktype_filename.json`
  - Preserves full message content, headers, and properties for compliance
- Alert Worker (`workers/dlq_alert_worker.js`)
  - Monitors DLQ message counts for all queues
  - Sends email alerts when thresholds are exceeded (warning @ 20, critical @ 50)
  - Smart throttling prevents alert spam (1-hour minimum between similar alerts)
  - Configurable via `DLQ_ALERT_*` environment variables
- Health Check (`controllers/health.js`)
  - GET `/api/health` includes DLQ component status
  - Status levels: healthy (<20 msgs), degraded (20-49), unhealthy (≥50)
  - Exposes: messageCount, threshold, critical, retentionDays, consumerEnabled
- Dashboard (`public/dlq-monitor.html`)
  - Real-time monitoring UI with auto-refresh
  - Visual alert indicators (green/yellow/red) based on thresholds
  - Manual actions: retry, archive, process, purge
  - Authentication via Bearer token
Configuration
Environment Variables
Add to your .env file or environment:
# DLQ Retention & Archival
DLQ_RETENTION_DAYS=365 # Days to keep messages in DLQ before auto-archive (default: 365)
DLQ_ARCHIVE_PATH=./dlq_archives # Where to store archived messages (default: ./dlq_archives)
# Alert Configuration
DLQ_ALERT_ENABLED=true # Enable admin email alerts (default: true)
DLQ_ALERT_THRESHOLD=20 # Warning threshold for message count (default: 20)
DLQ_ALERT_CRITICAL=50 # Critical threshold for escalated alerts (default: 50)
DLQ_ALERT_INTERVAL_MS=300000 # Check interval in milliseconds (default: 300000 = 5 min)
# DLQ Consumer Control
DLQ_CONSUMER_ENABLED=false # Enable DLQ message reprocessing (default: false, manual control)
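The variables above could be parsed into a typed config object along these lines. This is an illustrative sketch only — the function name `loadDlqConfig` is hypothetical, but the defaults match the table above.

```javascript
// Illustrative sketch: parse the DLQ_* environment variables with the
// documented defaults. loadDlqConfig is a hypothetical helper name.
function loadDlqConfig(env = process.env) {
  return {
    retentionDays: parseInt(env.DLQ_RETENTION_DAYS, 10) || 365,
    archivePath: env.DLQ_ARCHIVE_PATH || './dlq_archives',
    alertEnabled: env.DLQ_ALERT_ENABLED !== 'false',      // default: true
    alertThreshold: parseInt(env.DLQ_ALERT_THRESHOLD, 10) || 20,
    alertCritical: parseInt(env.DLQ_ALERT_CRITICAL, 10) || 50,
    alertIntervalMs: parseInt(env.DLQ_ALERT_INTERVAL_MS, 10) || 300000,
    consumerEnabled: env.DLQ_CONSUMER_ENABLED === 'true', // default: false
  };
}
```

Note that the boolean defaults differ: alerts are on unless explicitly disabled, while the consumer is off unless explicitly enabled.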
RabbitMQ Queue Configuration
Automatically configured by partner_sync_worker.js on startup:
- Main Queue (`partner_tasks` or `dev_partner_tasks`)
  - Durable: yes
  - DLX: default exchange
  - DLX routing key: `{queue_name}_failed`
- DLQ (`partner_tasks_failed`)
  - Durable: yes
  - TTL: `DLQ_RETENTION_DAYS * 86400000` ms
  - DLX: `dlq_archive` exchange
  - DLX routing key: `archive`
- Archive Queue (`partner_tasks_archive`)
  - Durable: yes
  - Consumed by `dlq_archival_worker.js`
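The three-queue topology above can be expressed as amqplib-style `assertQueue` arguments (`x-message-ttl`, `x-dead-letter-exchange`, `x-dead-letter-routing-key`). This is a sketch under the assumptions in this section, not the actual worker code; `dlqTopology` is a hypothetical helper.

```javascript
// Illustrative: assertQueue options for the three-queue DLQ topology.
// Queue names and the dlq_archive exchange come from this guide;
// the builder function itself is hypothetical.
const DAY_MS = 86400000;

function dlqTopology(queueName, retentionDays = 365) {
  return {
    main: {
      name: queueName,
      options: {
        durable: true,
        arguments: {
          'x-dead-letter-exchange': '', // default exchange
          'x-dead-letter-routing-key': `${queueName}_failed`,
        },
      },
    },
    dlq: {
      name: `${queueName}_failed`,
      options: {
        durable: true,
        arguments: {
          'x-message-ttl': retentionDays * DAY_MS,
          'x-dead-letter-exchange': 'dlq_archive',
          'x-dead-letter-routing-key': 'archive',
        },
      },
    },
    archive: {
      name: `${queueName}_archive`,
      options: { durable: true },
    },
  };
}
```

With the 365-day default, the DLQ's `x-message-ttl` works out to 31536000000 ms, which matches the `expiration` value shown in the archive file format below.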
Message Headers
Messages sent to DLQ include headers for filtering and diagnostics. All other data is in the message body for reprocessing.
Diagnostic Headers (All Queue Types)
| Header | Purpose | Example Values |
|---|---|---|
| `x-error-category` | Error classification | transient, validation, processing, infrastructure, partner_api, unknown |
| `x-error-reason` | Error message | "Connection timeout", "Invalid file format" |
| `x-task-type` | Task being processed | PROCESS_PARTNER_LOG, UPLOAD_JOB, SEND_NOTIFICATION |
| `x-severity` | Alert severity | low, medium, high, critical |
| `x-first-death-time` | Timestamp of failure | 1734567890123 |
Context Headers (partner_tasks queue only)
| Header | Purpose | Example Values |
|---|---|---|
| `x-partner-code` | Partner identifier (for filtering) | SATLOC, AGIDRONEX |
| `x-customer-id` | Customer ObjectId (for filtering) | 507f1f77bcf86cd799439011 |
Note: Headers are for filtering/identification only. Job IDs, user IDs, filenames, etc. are in the message body where they belong.
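Header-based operations such as retryByHeader can be thought of as a simple predicate over the headers listed above. A minimal sketch of that idea — `matchesHeaders` is a hypothetical helper, not the actual controller code:

```javascript
// Illustrative: does a message's DLQ headers match a filter such as
// { 'x-partner-code': 'SATLOC', 'x-error-category': 'transient' }?
// Every key in the criteria must match exactly; extra headers are ignored.
function matchesHeaders(messageHeaders = {}, criteria = {}) {
  return Object.entries(criteria).every(
    ([key, value]) => messageHeaders[key] === value
  );
}
```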
Operations
Starting Workers
# Start main partner sync worker (includes DLQ routing)
node workers/partner_sync_worker.js
# Start archival worker (consumes expired DLQ messages)
node workers/dlq_archival_worker.js
# Start alert worker (monitors DLQs and sends email alerts)
node workers/dlq_alert_worker.js
# Or use start_workers.js to manage all workers
node start_workers.js
Manual DLQ Operations
Web Dashboard (Recommended)
Navigate to: http://localhost:4100/dlq-monitor.html
- Enter admin Bearer token (from login)
- Select queue type (partner_tasks, jobs, etc.)
- View statistics and messages
- One-click retry operations
API Operations
# Check DLQ statistics
curl http://localhost:4100/api/dlq/partner_tasks/stats \
-H "Authorization: Bearer $TOKEN"
# Retry all messages
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"maxMessages": 100}'
View Dashboard Details
- Enter admin Bearer token (from login)
- View real-time statistics
- Actions available:
- Refresh: Update stats immediately
- Process DLQ: Retry eligible messages
- Dry Run: Analyze without processing
- Purge DLQ: Delete all messages (requires confirmation)
Monitoring
Health Check Integration
curl http://localhost:4100/api/health
Response includes DLQ component:
{
"timestamp": "2024-12-18T10:30:00.000Z",
"overall": "healthy",
"components": {
"dlq": {
"status": "healthy",
"message": "DLQ operating normally",
"messageCount": 5,
"threshold": 20,
"critical": 50,
"consumerEnabled": false,
"retentionDays": 365,
"queueName": "partner_tasks_failed"
}
}
}
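The `status` field in the response above follows the thresholds listed in the Components section. A minimal sketch of that mapping (the function name `dlqStatus` is illustrative, not the actual `controllers/health.js` code):

```javascript
// Illustrative mapping of DLQ message count to the health levels used
// by GET /api/health: healthy (<20), degraded (20-49), unhealthy (>=50).
function dlqStatus(messageCount, threshold = 20, critical = 50) {
  if (messageCount >= critical) return 'unhealthy';
  if (messageCount >= threshold) return 'degraded';
  return 'healthy';
}
```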
Alert Email Format
When DLQ message count exceeds thresholds, admin receives:
Subject: [AgMission][WARNING] DLQ Alert: 25 failed partner tasks
Body:
Dead Letter Queue Alert - Partner Tasks
========================================
Status: WARNING
Current DLQ Message Count: 25
Alert Threshold: 20
Critical Threshold: 50
Time: 2024-12-18T10:30:00.000Z
Error Breakdown:
- transient: 12 (48.0%)
- partner_api: 8 (32.0%)
- validation: 5 (20.0%)
Recommended Actions:
1. Monitor DLQ dashboard at /dlq-monitor.html
2. Review error patterns and plan fixes
3. Messages will auto-archive after 365 days
Configuration:
- DLQ Retention: 365 days
- Archive Path: ./dlq_archives
- Max Retries: 5
To disable these alerts, set DLQ_ALERT_ENABLED=false in environment config.
Throttling: 1 email per hour per severity level to prevent floods.
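The one-email-per-hour-per-severity throttle can be sketched as a timestamp map keyed by severity. This only illustrates the idea; the real `dlq_alert_worker.js` implementation may differ, and `shouldSendAlert` is a hypothetical name.

```javascript
// Illustrative sketch of the "1 email per hour per severity" throttle.
const HOUR_MS = 60 * 60 * 1000;
const lastSentAt = new Map(); // severity -> timestamp of last email sent

function shouldSendAlert(severity, now = Date.now()) {
  const last = lastSentAt.get(severity);
  if (last !== undefined && now - last < HOUR_MS) {
    return false; // throttled: a similar alert went out within the hour
  }
  lastSentAt.set(severity, now);
  return true;
}
```

A WARNING and a CRITICAL alert are throttled independently, so an escalation to critical still goes out even if a warning was sent minutes earlier.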
Error Categorization
Based on DLQ Best Practices:
Transient Errors
Auto-retry within 2 hours
Patterns: timeout, connection, network, ECONNREFUSED, DNS
Action: May resolve automatically, suitable for retry
Validation Errors
Archive immediately
Patterns: validation, invalid, malformed, missing required, 400
Action: Requires code/data fix, not retryable
Processing Errors
Manual review
Patterns: processing, calculation, parse, transform
Action: Business logic issue, investigate root cause
Infrastructure Errors
Escalate
Patterns: database, mongo, transaction, filesystem, disk
Action: System-level problem, may affect multiple tasks
Partner API Errors
Extended backoff
Patterns: partner api, authentication, 401, 403, 5xx
Action: External service issue, retry with longer delays
Unknown Errors
Admin notification
Patterns: Anything not matching above
Action: New error type, update categorization logic
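The six categories above amount to an ordered pattern match over the error message. A hedged sketch of such a categorizer — the real worker's pattern lists may be longer, and `categorizeError` is an illustrative name:

```javascript
// Illustrative pattern-based categorizer mirroring the categories above.
// Order matters: the first matching category wins, and anything
// unmatched falls through to "unknown".
const CATEGORY_PATTERNS = [
  ['transient',      /timeout|connection|network|ECONNREFUSED|DNS/i],
  ['validation',     /validation|invalid|malformed|missing required|\b400\b/i],
  ['processing',     /processing|calculation|parse|transform/i],
  ['infrastructure', /database|mongo|transaction|filesystem|disk/i],
  ['partner_api',    /partner api|authentication|\b401\b|\b403\b|\b5\d\d\b/i],
];

function categorizeError(message = '') {
  for (const [category, pattern] of CATEGORY_PATTERNS) {
    if (pattern.test(message)) return category;
  }
  return 'unknown';
}
```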
Archive Structure
Messages are archived to filesystem with this structure:
dlq_archives/
├── 2024/
│ ├── 12/
│ │ ├── 18/
│ │ │ ├── 1734567890123_PROCESS_PARTNER_LOG_satloc_flight_456.json
│ │ │ ├── 1734567891234_UPLOAD_PARTNER_JOB_job_789.json
│ │ │ └── ...
│ │ ├── 19/
│ │ └── ...
│ └── ...
└── ...
Archive File Format
{
"archived_at": "2024-12-18T10:30:00.000Z",
"timestamp": 1734567890123,
"queue_name": "partner_tasks",
"dlq_name": "partner_tasks_failed",
"task_type": "PROCESS_PARTNER_LOG",
"error_category": "transient",
"severity": "medium",
"headers": {
"x-error-category": "transient",
"x-error-reason": "Connection timeout after 30s",
"x-task-type": "PROCESS_PARTNER_LOG",
"x-severity": "medium",
"x-first-death-time": 1734567890000,
"x-partner-code": "SATLOC",
"x-customer-id": "507f1f77bcf86cd799439011",
"x-log-filename": "2024-12-18_aircraft_456.log"
},
"properties": {
"contentType": "application/json",
"deliveryMode": 2,
"timestamp": 1734567890000,
"expiration": "31536000000"
},
"message": {
"type": "PROCESS_PARTNER_LOG",
"customerId": "507f1f77bcf86cd799439011",
"partnerCode": "SATLOC",
"aircraftId": "aircraft_456",
"logFileName": "2024-12-18_aircraft_456.log",
"logId": "log_123",
"retryCount": 5,
"lastError": "Connection timeout after 30s",
"failedAt": "2024-12-18T10:25:00.000Z"
}
}
PartnerLogTracker vs DLQ
Separation of Concerns
PartnerLogTracker (MongoDB) - Business Intelligence Layer:
- Duplicate prevention via unique compound index (`logId`, `partnerCode`)
- Job matching audit trail with confidence scores
- Processing analytics (processTime, success rate)
- Customer reporting and dashboard queries
- File lifecycle tracking (download → process → complete)
- Timeout detection for stuck tasks
DLQ System (RabbitMQ + Filesystem) - Error Handling Layer:
- Failed message capture and retry orchestration
- Error categorization and severity assessment
- Configurable retention with TTL-based archival
- Admin alerting for operational issues
- Audit compliance with immutable archive logs
Why Keep Both?
- Different Query Patterns: PartnerLogTracker optimized for customer/job lookups, DLQ for error analysis
- Data Lifecycle: PartnerLogTracker persists forever (business records), DLQ archives after 365 days
- Performance: DLQ operations don't impact business query performance
- Compliance: Archive files provide immutable audit trail separate from operational database
Integration Points
- DLQ messages include `x-partner-code` and `x-customer-id` headers for filtering
- Original message body contains full task data (logFileName, jobId, etc.) for reprocessing
- PartnerLogTracker.status field reflects processing state independent of DLQ
- Health check queries both systems for comprehensive status
Troubleshooting
DLQ Buildup
Symptom: Message count > 20
Causes:
- Partner API downtime → Check partner service status
- Network connectivity issues → Review infrastructure health
- Database connection problems → Check MongoDB connection pool
- Code bugs in processing logic → Review recent deployments
Resolution:
1. Identify the root cause via the error breakdown in the dashboard
2. Fix the underlying issue
3. Enable the DLQ consumer: `DLQ_CONSUMER_ENABLED=true`
4. Monitor processing until the queue drains
5. Disable the consumer: `DLQ_CONSUMER_ENABLED=false`
Archive Worker Not Running
Symptom: DLQ messages not archiving after TTL expiry
Check:
ps aux | grep dlq_archival_worker
Fix:
node workers/dlq_archival_worker.js &
No Alert Emails
Symptom: DLQ > threshold but no email received
Checks:
- `DLQ_ALERT_ENABLED=true` in env?
- `NO_EMAIL_MODE=false` in env?
- SMTP credentials configured?
- Was a similar alert sent within the last hour (alerts are throttled to 1/hour)?
Test Email:
// Run inside an async function (or a Node REPL with top-level await):
const mailer = require('./helpers/mailer');
await mailer.sendAdminNotification('Test', 'Testing DLQ alerts');
High Memory Usage
Symptom: Archival worker consuming excessive memory
Cause: Large message backlog in archive queue
Fix:
- Keep `prefetch` in the archival worker low (currently 1)
- Process the archive queue in batches
- Add memory monitoring alerts
Performance Tuning
Adjust Check Interval
Reduce alert latency:
DLQ_ALERT_INTERVAL_MS=60000 # Check every minute instead of 5
Optimize Archive Storage
Compress archived files:
// In dlq_archival_worker.js, add gzip compression:
const zlib = require('zlib');
const fs = require('fs/promises');
const compressed = zlib.gzipSync(JSON.stringify(archiveRecord));
await fs.writeFile(filepath + '.gz', compressed);
Adjust Retention
For high-volume systems:
DLQ_RETENTION_DAYS=30 # Reduce to 30 days
API Reference
GET /api/health
Returns overall system health including DLQ component.
Response:
{
"components": {
"dlq": {
"status": "healthy|degraded|unhealthy",
"messageCount": 5,
"threshold": 20,
"critical": 50
}
}
}
GET /api/dlq/partner_tasks/stats
Get comprehensive DLQ statistics.
Response:
{
"dlq": {
"messageCount": 25,
"consumerCount": 0,
"queueName": "partner_tasks_failed"
},
"trackers": {
"failed": 25,
"processing": 3,
"downloaded": 10,
"processed": 1523,
"archived": 45
},
"recentFailures": [...]
}
GET /api/dlq/partner_tasks/messages?limit=50
Peek at DLQ messages without consuming.
POST /api/dlq/:queueName/retryAll
Retry all messages currently in the DLQ.
POST /api/dlq/:queueName/retryByPosition
Retry messages by position range.
POST /api/dlq/:queueName/retryByHeader
Retry messages matching specific header values.
POST /api/dlq/:queueName/process
Process all eligible DLQ messages with intelligent categorization.
DELETE /api/dlq/:queueName/purge
Purge entire DLQ (requires confirmation).
Best Practices
- Never Enable DLQ Consumer Permanently
  - Set `DLQ_CONSUMER_ENABLED=true` only during active recovery
  - Return to `false` after the queue drains to prevent blind retries
- Monitor Error Categories
  - Review the dashboard regularly for patterns
  - High validation errors → data quality issues
  - High transient errors → infrastructure problems
- Adjust Thresholds Per Environment
  - Production: `DLQ_ALERT_THRESHOLD=20`, `DLQ_ALERT_CRITICAL=50`
  - Staging: higher thresholds acceptable
  - Development: consider disabling alerts (`DLQ_ALERT_ENABLED=false`)
- Archive Retention Planning
  - Default 365 days suitable for most compliance needs
  - Consider a lifecycle policy for the `dlq_archives/` directory
  - Old archives can be compressed or moved to cold storage
- Correlate with PartnerLogTracker
  - Query PartnerLogTracker by customerId and partnerCode
  - Check the message body for logFileName and other task details
  - Match against tracker records using filter criteria
  - Cross-reference for complete failure analysis
  - Update PartnerLogTracker.errorMessage for business reporting
- Regular Archive Cleanup
  - Set up a cron job to delete or compress old archives
  - Example (keep 2 years, then delete): `find ./dlq_archives -type f -mtime +730 -delete`
- Test Alert System
  - Manually trigger test alerts during setup
  - Verify email delivery and formatting
  - Confirm throttling behavior
- Health Check Integration
  - Add DLQ health to monitoring dashboards (Grafana, Datadog)
  - Alert on `status: "unhealthy"` in the CI/CD pipeline
  - Include in uptime checks
Migration Guide
From Old DLQ System
If migrating from PartnerLogTracker-based retry to queue-native DLQ:
- Deploy New System:

  # Update env with DLQ settings
  DLQ_RETENTION_DAYS=365
  DLQ_ALERT_ENABLED=true
  DLQ_CONSUMER_ENABLED=false

  # Restart workers
  pm2 restart partner_sync_worker
  pm2 start workers/dlq_archival_worker.js --name dlq-archival

- Monitor Both Systems:
  - Old DLQ code still handles in-flight messages
  - New system captures new failures
  - Gradually phase out old retry logic
- Cleanup:
  - Remove old DLQ retry code from controllers
  - Keep PartnerLogTracker for BI purposes
  - Update documentation
Support
For issues or questions:
- Check logs: `workers/*.rlog` files
- Review RabbitMQ management console: `http://localhost:15672`
- Check archive directory: `ls -lah dlq_archives/`
- Contact: trungh@agnav.com
See Also
- 📖 DLQ Index - Documentation overview
- 🚀 Quick Start Guide - Get started quickly
- 📚 API Reference - Complete API documentation
- 🔧 Operations Guide - Advanced operations
🔗 Related Resources
- Web Dashboard - Monitoring interface
Last Updated: December 18, 2025
System Version: 1.0.0
Author: AgMission Platform Team