
Dead Letter Queue (DLQ) System Guide



Overview

The AgMission DLQ system provides robust error handling, monitoring, and archival for failed partner integration tasks. Built on RabbitMQ's Dead Letter Exchange (DLX) mechanism, it automatically captures failed messages, enriches them with diagnostic metadata, and provides configurable retention with smart alerting.

Key Features:

  • Automatic message enrichment with error categorization
  • Configurable retention (default: 365 days) with TTL-based auto-archival
  • Smart admin email alerts based on thresholds (warning @ 20, critical @ 50 messages)
  • Real-time dashboard monitoring at /dlq-monitor.html
  • Health check integration for infrastructure monitoring
  • Complete audit trail preservation in organized filesystem archives

Architecture

Message Flow

```mermaid
flowchart LR
    Task[Partner Task] --> Main[Main Queue<br/>partner_tasks]
    Main -->|Processing| Success[✓ Success]
    Main -->|On Failure| DLQ[Dead Letter Queue<br/>partner_tasks_failed]
    DLQ -->|After DLQ_RETENTION_DAYS| Archive[Archive Queue<br/>partner_tasks_archive]
    Archive --> Filesystem[Filesystem Archive<br/>./dlq_archives/YYYY/MM/DD/]
```

Components

  1. Main Worker (workers/partner_sync_worker.js)

    • Processes partner tasks from partner_tasks queue
    • Enriches messages with error metadata before DLQ routing
    • Implements retry logic with max attempts (default: 5)
  2. DLQ API (routes/dlq.js, controllers/dlq.js)

    • Global queue-native endpoints at /api/dlq/:queueName/*
    • Direct RabbitMQ operations (no MongoDB coupling)
    • Supports all queue types (partner_tasks, jobs, etc.)
    • Provides retry operations: retryAll, retryByPosition, retryByHeader
  3. Archival Worker (workers/dlq_archival_worker.js)

    • Consumes TTL-expired messages from archive queue
    • Writes JSON files organized by date: YYYY/MM/DD/timestamp_tasktype_filename.json
    • Preserves full message content + headers + properties for compliance
  4. Alert Worker (workers/dlq_alert_worker.js)

    • Monitors DLQ message counts for all queues
    • Sends email alerts when thresholds exceeded (warning @ 20, critical @ 50)
    • Smart throttling prevents alert spam (1 hour minimum between similar alerts)
    • Configurable via DLQ_ALERT_* environment variables
  5. Health Check (controllers/health.js)

    • GET /api/health includes DLQ component status
    • Status levels: healthy (<20 msgs), degraded (20-49), unhealthy (≥50)
    • Exposes: messageCount, threshold, critical, retentionDays, consumerEnabled
  6. Dashboard (public/dlq-monitor.html)

    • Real-time monitoring UI with auto-refresh
    • Visual alert indicators (green/yellow/red) based on thresholds
    • Manual actions: retry, archive, process, purge
    • Authentication via Bearer token

Configuration

Environment Variables

Add to your .env file or environment:

```bash
# DLQ Retention & Archival
DLQ_RETENTION_DAYS=365              # Days to keep messages in DLQ before auto-archive (default: 365)
DLQ_ARCHIVE_PATH=./dlq_archives     # Where to store archived messages (default: ./dlq_archives)

# Alert Configuration
DLQ_ALERT_ENABLED=true              # Enable admin email alerts (default: true)
DLQ_ALERT_THRESHOLD=20              # Warning threshold for message count (default: 20)
DLQ_ALERT_CRITICAL=50               # Critical threshold for escalated alerts (default: 50)
DLQ_ALERT_INTERVAL_MS=300000        # Check interval in milliseconds (default: 300000 = 5 min)

# DLQ Consumer Control
DLQ_CONSUMER_ENABLED=false          # Enable DLQ message reprocessing (default: false, manual control)
```

RabbitMQ Queue Configuration

Automatically configured by partner_sync_worker.js on startup:

  1. Main Queue (partner_tasks or dev_partner_tasks)

    • Durable: yes
    • DLX: default exchange
    • DLX routing key: {queue_name}_failed
  2. DLQ (partner_tasks_failed)

    • Durable: yes
    • TTL: DLQ_RETENTION_DAYS * 86400000 ms
    • DLX: dlq_archive exchange
    • DLX routing key: archive
  3. Archive Queue (partner_tasks_archive)

    • Durable: yes
    • Consumed by dlq_archival_worker.js
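
The topology above can be sketched as amqplib-style `assertQueue` options. This is an illustrative reconstruction, not the literal code from `partner_sync_worker.js`; queue and exchange names mirror the defaults described in this guide:

```javascript
// Illustrative queue options mirroring the DLQ topology described above.
// The actual arguments in partner_sync_worker.js may differ.
const RETENTION_DAYS = 365; // default; configurable via DLQ_RETENTION_DAYS

// 1. Main queue: failures dead-letter to the default exchange
//    with a {queue_name}_failed routing key
const mainQueueOpts = {
  durable: true,
  arguments: {
    'x-dead-letter-exchange': '', // default exchange
    'x-dead-letter-routing-key': 'partner_tasks_failed',
  },
};

// 2. DLQ: TTL-expired messages dead-letter to the dlq_archive exchange
const dlqOpts = {
  durable: true,
  arguments: {
    'x-message-ttl': RETENTION_DAYS * 86400000, // days -> ms
    'x-dead-letter-exchange': 'dlq_archive',
    'x-dead-letter-routing-key': 'archive',
  },
};

// 3. Archive queue: plain durable queue drained by dlq_archival_worker.js
const archiveQueueOpts = { durable: true };

console.log(dlqOpts.arguments['x-message-ttl']); // 31536000000 for 365 days
```

With a live connection, each options object would be passed as `channel.assertQueue(name, opts)`.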

Message Headers

Messages sent to DLQ include headers for filtering and diagnostics. All other data is in the message body for reprocessing.

Diagnostic Headers (All Queue Types)

| Header | Purpose | Example Values |
| --- | --- | --- |
| `x-error-category` | Error classification | `transient`, `validation`, `processing`, `infrastructure`, `partner_api`, `unknown` |
| `x-error-reason` | Error message | `"Connection timeout"`, `"Invalid file format"` |
| `x-task-type` | Task being processed | `PROCESS_PARTNER_LOG`, `UPLOAD_JOB`, `SEND_NOTIFICATION` |
| `x-severity` | Alert severity | `low`, `medium`, `high`, `critical` |
| `x-first-death-time` | Timestamp of first failure (epoch ms) | `1734567890123` |

Context Headers (partner_tasks queue only)

| Header | Purpose | Example Values |
| --- | --- | --- |
| `x-partner-code` | Partner identifier (for filtering) | `SATLOC`, `AGIDRONEX` |
| `x-customer-id` | Customer ObjectId (for filtering) | `507f1f77bcf86cd799439011` |

Note: Headers are for filtering/identification only. Job IDs, user IDs, filenames, etc. are in the message body where they belong.
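
As a hypothetical sketch, a worker could assemble these headers like so before routing a message to the DLQ (`buildDlqHeaders` and its parameters are illustrative, not the actual function in `partner_sync_worker.js`):

```javascript
// Illustrative header enrichment before DLQ routing.
// Only filtering/diagnostic data goes in headers; the task body is untouched.
function buildDlqHeaders(task, err, category, severity) {
  const headers = {
    'x-error-category': category,
    'x-error-reason': err.message,
    'x-task-type': task.type,
    'x-severity': severity,
    'x-first-death-time': Date.now(),
  };
  // Context headers exist only for partner_tasks messages
  if (task.partnerCode) headers['x-partner-code'] = task.partnerCode;
  if (task.customerId) headers['x-customer-id'] = task.customerId;
  return headers;
}

const h = buildDlqHeaders(
  { type: 'PROCESS_PARTNER_LOG', partnerCode: 'SATLOC', customerId: '507f1f77bcf86cd799439011' },
  new Error('Connection timeout'),
  'transient',
  'medium'
);
console.log(h['x-partner-code']); // SATLOC
```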


Operations

Starting Workers

```bash
# Start main partner sync worker (includes DLQ routing)
node workers/partner_sync_worker.js

# Start archival worker (consumes expired DLQ messages)
node workers/dlq_archival_worker.js

# Start alert worker (monitors DLQs and sends email alerts)
node workers/dlq_alert_worker.js

# Or use start_workers.js to manage all workers
node start_workers.js
```

Manual DLQ Operations

Navigate to: http://localhost:4100/dlq-monitor.html

  1. Enter admin Bearer token (from login)
  2. Select queue type (partner_tasks, jobs, etc.)
  3. View statistics and messages
  4. One-click retry operations

API Operations

```bash
# Check DLQ statistics
curl http://localhost:4100/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer $TOKEN"

# Retry all messages
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'
```

View Dashboard Details

  1. Enter admin Bearer token (from login)
  2. View real-time statistics
  3. Actions available:
    • Refresh: Update stats immediately
    • Process DLQ: Retry eligible messages
    • Dry Run: Analyze without processing
    • Purge DLQ: Delete all messages (requires confirmation)

Monitoring

Health Check Integration

```bash
curl http://localhost:4100/api/health
```

Response includes DLQ component:

```json
{
  "timestamp": "2024-12-18T10:30:00.000Z",
  "overall": "healthy",
  "components": {
    "dlq": {
      "status": "healthy",
      "message": "DLQ operating normally",
      "messageCount": 5,
      "threshold": 20,
      "critical": 50,
      "consumerEnabled": false,
      "retentionDays": 365,
      "queueName": "partner_tasks_failed"
    }
  }
}
```
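
The status levels map directly onto the thresholds. A minimal sketch of that mapping (the real logic lives in `controllers/health.js`):

```javascript
// Threshold-based DLQ status: healthy (<20), degraded (20-49), unhealthy (>=50).
// Defaults are illustrative; production values come from DLQ_ALERT_* env vars.
function dlqStatus(messageCount, threshold = 20, critical = 50) {
  if (messageCount >= critical) return 'unhealthy';
  if (messageCount >= threshold) return 'degraded';
  return 'healthy';
}

console.log(dlqStatus(5), dlqStatus(25), dlqStatus(50)); // healthy degraded unhealthy
```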

Alert Email Format

When DLQ message count exceeds thresholds, admin receives:

Subject: [AgMission][WARNING] DLQ Alert: 25 failed partner tasks

Body:

```
Dead Letter Queue Alert - Partner Tasks
========================================

Status: WARNING
Current DLQ Message Count: 25
Alert Threshold: 20
Critical Threshold: 50
Time: 2024-12-18T10:30:00.000Z

Error Breakdown:
  - transient: 12 (48.0%)
  - partner_api: 8 (32.0%)
  - validation: 5 (20.0%)

Recommended Actions:
1. Monitor DLQ dashboard at /dlq-monitor.html
2. Review error patterns and plan fixes
3. Messages will auto-archive after 365 days

Configuration:
- DLQ Retention: 365 days
- Archive Path: ./dlq_archives
- Max Retries: 5

To disable these alerts, set DLQ_ALERT_ENABLED=false in environment config.
```

Throttling: 1 email per hour per severity level to prevent floods.
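
A minimal sketch of that per-severity throttle, assuming an in-memory map of last-send times (the actual bookkeeping in `dlq_alert_worker.js` may differ):

```javascript
// Per-severity alert throttling: at most one email per severity per hour.
const THROTTLE_MS = 60 * 60 * 1000;
const lastSent = new Map(); // severity -> timestamp of last alert

function shouldSendAlert(severity, now = Date.now()) {
  const prev = lastSent.get(severity) || 0;
  if (now - prev < THROTTLE_MS) return false; // still inside throttle window
  lastSent.set(severity, now);
  return true;
}

const t0 = Date.now();
console.log(shouldSendAlert('warning', t0));                  // true
console.log(shouldSendAlert('warning', t0 + 5 * 60 * 1000));  // false (within the hour)
console.log(shouldSendAlert('critical', t0 + 5 * 60 * 1000)); // true (different severity)
```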


Error Categorization

Each failed message is classified by matching its error text against known patterns; the category determines the recommended handling strategy:

Transient Errors

Strategy: Auto-retry within 2 hours
Patterns: timeout, connection, network, ECONNREFUSED, DNS
Action: May resolve automatically, suitable for retry

Validation Errors

Strategy: Archive immediately
Patterns: validation, invalid, malformed, missing required, 400
Action: Requires code/data fix, not retryable

Processing Errors

Strategy: Manual review
Patterns: processing, calculation, parse, transform
Action: Business logic issue, investigate root cause

Infrastructure Errors

Strategy: Escalate
Patterns: database, mongo, transaction, filesystem, disk
Action: System-level problem, may affect multiple tasks

Partner API Errors

Strategy: Extended backoff
Patterns: partner api, authentication, 401, 403, 5xx
Action: External service issue, retry with longer delays

Unknown Errors

Strategy: Admin notification
Patterns: Anything not matching the above
Action: New error type, update categorization logic
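
A minimal sketch of this categorization as first-match pattern tests (the actual pattern lists in the worker are likely broader; order matters, since an error text can match several categories):

```javascript
// First-match error categorization. Patterns are illustrative, taken from
// the category descriptions above; the worker's real lists may be broader.
const CATEGORY_PATTERNS = [
  ['transient',      /timeout|connection|network|ECONNREFUSED|DNS/i],
  ['validation',     /validation|invalid|malformed|missing required|\b400\b/i],
  ['processing',     /processing|calculation|parse|transform/i],
  ['infrastructure', /database|mongo|transaction|filesystem|disk/i],
  ['partner_api',    /partner api|authentication|\b401\b|\b403\b|\b5\d\d\b/i],
];

function categorizeError(message) {
  for (const [category, pattern] of CATEGORY_PATTERNS) {
    if (pattern.test(message)) return category;
  }
  return 'unknown'; // new error type: update categorization logic
}

console.log(categorizeError('Connection timeout after 30s')); // transient
console.log(categorizeError('Invalid file format'));          // validation
console.log(categorizeError('Something odd happened'));       // unknown
```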


Archive Structure

Messages are archived to filesystem with this structure:

```
dlq_archives/
├── 2024/
│   ├── 12/
│   │   ├── 18/
│   │   │   ├── 1734567890123_PROCESS_PARTNER_LOG_satloc_flight_456.json
│   │   │   ├── 1734567891234_UPLOAD_PARTNER_JOB_job_789.json
│   │   │   └── ...
│   │   ├── 19/
│   │   └── ...
│   └── ...
└── ...
```

Archive File Format

```json
{
  "archived_at": "2024-12-18T10:30:00.000Z",
  "timestamp": 1734567890123,
  "queue_name": "partner_tasks",
  "dlq_name": "partner_tasks_failed",
  "task_type": "PROCESS_PARTNER_LOG",
  "error_category": "transient",
  "severity": "medium",
  "headers": {
    "x-error-category": "transient",
    "x-error-reason": "Connection timeout after 30s",
    "x-task-type": "PROCESS_PARTNER_LOG",
    "x-severity": "medium",
    "x-first-death-time": 1734567890000,
    "x-partner-code": "SATLOC",
    "x-customer-id": "507f1f77bcf86cd799439011",
    "x-log-filename": "2024-12-18_aircraft_456.log"
  },
  "properties": {
    "contentType": "application/json",
    "deliveryMode": 2,
    "timestamp": 1734567890000,
    "expiration": "31536000000"
  },
  "message": {
    "type": "PROCESS_PARTNER_LOG",
    "customerId": "507f1f77bcf86cd799439011",
    "partnerCode": "SATLOC",
    "aircraftId": "aircraft_456",
    "logFileName": "2024-12-18_aircraft_456.log",
    "logId": "log_123",
    "retryCount": 5,
    "lastError": "Connection timeout after 30s",
    "failedAt": "2024-12-18T10:25:00.000Z"
  }
}
```

PartnerLogTracker vs DLQ

Separation of Concerns

PartnerLogTracker (MongoDB) - Business Intelligence Layer:

  • Duplicate prevention via unique compound index (logId, partnerCode)
  • Job matching audit trail with confidence scores
  • Processing analytics (processTime, success rate)
  • Customer reporting and dashboard queries
  • File lifecycle tracking (download → process → complete)
  • Timeout detection for stuck tasks

DLQ System (RabbitMQ + Filesystem) - Error Handling Layer:

  • Failed message capture and retry orchestration
  • Error categorization and severity assessment
  • Configurable retention with TTL-based archival
  • Admin alerting for operational issues
  • Audit compliance with immutable archive logs

Why Keep Both?

  1. Different Query Patterns: PartnerLogTracker optimized for customer/job lookups, DLQ for error analysis
  2. Data Lifecycle: PartnerLogTracker persists forever (business records), DLQ archives after 365 days
  3. Performance: DLQ operations don't impact business query performance
  4. Compliance: Archive files provide immutable audit trail separate from operational database

Integration Points

  • DLQ messages include x-partner-code and x-customer-id headers for filtering
  • Original message body contains full task data (logFileName, jobId, etc.) for reprocessing
  • PartnerLogTracker.status field reflects processing state independent of DLQ
  • Health check queries both systems for comprehensive status

Troubleshooting

DLQ Buildup

Symptom: Message count > 20
Causes:

  1. Partner API downtime → Check partner service status
  2. Network connectivity issues → Review infrastructure health
  3. Database connection problems → Check MongoDB connection pool
  4. Code bugs in processing logic → Review recent deployments

Resolution:

  1. Identify root cause via error breakdown in dashboard
  2. Fix underlying issue
  3. Enable DLQ consumer: DLQ_CONSUMER_ENABLED=true
  4. Monitor processing until queue drains
  5. Disable consumer: DLQ_CONSUMER_ENABLED=false

Archive Worker Not Running

Symptom: DLQ messages not archiving after TTL expiry
Check:

```bash
ps aux | grep dlq_archival_worker
```

Fix:

```bash
node workers/dlq_archival_worker.js &
```

No Alert Emails

Symptom: DLQ > threshold but no email received
Checks:

  1. DLQ_ALERT_ENABLED=true in env?
  2. NO_EMAIL_MODE=false in env?
  3. SMTP credentials configured?
  4. Check last alert time (throttled to 1/hour)?

Test Email:

```javascript
const mailer = require('./helpers/mailer');
await mailer.sendAdminNotification('Test', 'Testing DLQ alerts');
```

High Memory Usage

Symptom: Archival worker consuming excessive memory
Cause: Large message backlog in archive queue
Fix:

  1. Keep prefetch at 1 in the archival worker (higher prefetch buffers more messages in memory)
  2. Process archive queue in batches
  3. Add memory monitoring alerts

Performance Tuning

Adjust Check Interval

Reduce alert latency:

```bash
DLQ_ALERT_INTERVAL_MS=60000  # Check every minute instead of every 5 minutes
```

Optimize Archive Storage

Compress archived files:

```javascript
// In dlq_archival_worker.js, add gzip compression:
const zlib = require('zlib');
const fs = require('fs').promises;

const compressed = zlib.gzipSync(JSON.stringify(archiveRecord));
await fs.writeFile(filepath + '.gz', compressed);
```

Adjust Retention

For high-volume systems:

```bash
DLQ_RETENTION_DAYS=30  # Reduce to 30 days
```

API Reference

GET /api/health

Returns overall system health including DLQ component.

Response:

```json
{
  "components": {
    "dlq": {
      "status": "healthy|degraded|unhealthy",
      "messageCount": 5,
      "threshold": 20,
      "critical": 50
    }
  }
}
```

GET /api/dlq/partner_tasks/stats

Get comprehensive DLQ statistics.

Response:

```json
{
  "dlq": {
    "messageCount": 25,
    "consumerCount": 0,
    "queueName": "partner_tasks_failed"
  },
  "trackers": {
    "failed": 25,
    "processing": 3,
    "downloaded": 10,
    "processed": 1523,
    "archived": 45
  },
  "recentFailures": [...]
}
```

GET /api/dlq/partner_tasks/messages?limit=50

Peek at DLQ messages without consuming.

POST /api/dlq/:queueName/retryAll

Retry all messages currently in the DLQ.

POST /api/dlq/:queueName/retryByPosition

Retry messages by position range.

POST /api/dlq/:queueName/retryByHeader

Retry messages matching specific header values.

POST /api/dlq/:queueName/process

Process all eligible DLQ messages with intelligent categorization.

DELETE /api/dlq/:queueName/purge

Purge entire DLQ (requires confirmation).


Best Practices

  1. Never Enable DLQ Consumer Permanently

    • Set DLQ_CONSUMER_ENABLED=true only during active recovery
    • Return to false after queue drains to prevent blind retries
  2. Monitor Error Categories

    • Review dashboard regularly for patterns
    • High validation errors → Data quality issues
    • High transient errors → Infrastructure problems
  3. Adjust Thresholds Per Environment

    • Production: DLQ_ALERT_THRESHOLD=20, DLQ_ALERT_CRITICAL=50
    • Staging: Higher thresholds acceptable
    • Development: Consider disabling alerts (DLQ_ALERT_ENABLED=false)
  4. Archive Retention Planning

    • Default 365 days suitable for most compliance needs
    • Consider lifecycle policy for dlq_archives/ directory
    • Old archives can be compressed or moved to cold storage
  5. Correlate with PartnerLogTracker

    • Query PartnerLogTracker by customerId and partnerCode
    • Check message body for logFileName and other task details
    • Match against tracker records using filter criteria
    • Cross-reference for complete failure analysis
    • Update PartnerLogTracker.errorMessage for business reporting
  6. Regular Archive Cleanup

    • Setup cron job to delete/compress old archives
    • Example: Keep 2 years, then delete
    ```bash
    find ./dlq_archives -type f -mtime +730 -delete
    ```
    
  7. Test Alert System

    • Manually trigger test alerts during setup
    • Verify email delivery and formatting
    • Confirm throttling behavior
  8. Health Check Integration

    • Add DLQ health to monitoring dashboards (Grafana, Datadog)
    • Alert on status: "unhealthy" in CI/CD pipeline
    • Include in uptime checks

Migration Guide

From Old DLQ System

If migrating from PartnerLogTracker-based retry to queue-native DLQ:

  1. Deploy New System:

    ```bash
    # Update env with DLQ settings
    DLQ_RETENTION_DAYS=365
    DLQ_ALERT_ENABLED=true
    DLQ_CONSUMER_ENABLED=false

    # Restart workers
    pm2 restart partner_sync_worker
    pm2 start workers/dlq_archival_worker.js --name dlq-archival
    ```
    
  2. Monitor Both Systems:

    • Old DLQ code still handles in-flight messages
    • New system captures new failures
    • Gradually phase out old retry logic
  3. Cleanup:

    • Remove old DLQ retry code from controllers
    • Keep PartnerLogTracker for BI purposes
    • Update documentation

Support

For issues or questions:

  • Check logs: workers/*.rlog files
  • Review RabbitMQ management console: http://localhost:15672
  • Check archive directory: ls -lah dlq_archives/
  • Contact: trungh@agnav.com



Last Updated: December 18, 2025
System Version: 1.0.0
Author: AgMission Platform Team