DLQ Operations Guide
Comprehensive guide for managing Dead Letter Queues across all queue types.
Overview
The DLQ system provides queue-native tools for monitoring and managing failed tasks across all queue types:
- Partner tasks (partner_tasks)
- Job processing (jobs)
- Future queue types (notifications, analytics, etc.)
Key Benefits:
- Direct RabbitMQ operations (no MongoDB coupling)
- Supports multiple queue types
- Preserves original message content and headers
- Works with any task type
Architecture
Components
- Workers - Process tasks and send failures to the DLQ
  - workers/partner_sync_worker.js
  - workers/job_worker.js
  - Future workers for other queue types
- DLQ Routes - Global API endpoints
  - routes/dlq.js
  - Mounted at /api/dlq/:queueName/*
- DLQ Controller - Queue operation logic
  - controllers/dlq.js
  - Handles all queue types generically
- Monitoring Tools
  - Web dashboard: public/dlq-monitor.html
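Because the routes are mounted once at /api/dlq/:queueName/*, every DLQ call is the same base URL plus a queue name and an operation segment. A minimal sketch of a shell helper that builds those URLs, assuming the local development server used throughout this guide; `dlq_url` and `DLQ_API_BASE` are hypothetical names, not part of the codebase:

```shell
# Hypothetical helper: build a DLQ API URL from a queue name and an operation.
# DLQ_API_BASE is an assumption; it defaults to the local server used in this guide.
DLQ_API_BASE="${DLQ_API_BASE:-http://localhost:4100/api/dlq}"

dlq_url() {
  local queue="${1:-}"  # e.g. partner_tasks, dev_jobs, notifications
  local op="${2:-}"     # e.g. messages, stats, retryAll, retryByPosition, purge
  if [ -z "$queue" ] || [ -z "$op" ]; then
    echo "usage: dlq_url <queueName> <operation>" >&2
    return 1
  fi
  printf '%s/%s/%s\n' "$DLQ_API_BASE" "$queue" "$op"
}
```

For example, `curl -s "$(dlq_url partner_tasks stats)" -H "Authorization: Bearer $TOKEN"` requests the same statistics endpoint used by the monitoring script in Alert Thresholds.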
Message Flow
flowchart LR
A[Worker] --> B[Main Queue]
B --> C{Processing}
C -->|Success ✓| D[Complete]
C -->|Failure<br/>max retries| E[DLQ]
E --> F{Action}
F -->|Retry| B
F -->|Archive| G[Archive Storage]
F -->|Purge| H[Delete]
Queue-Native Operations
Retry Operations
Retry All Messages (Recommended)
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"maxMessages": 50}'
Retry by Position Range (0-based index)
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByPosition \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"startPosition": 0, "endPosition": 10}'
Retry by Header Match (Custom filtering)
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByHeader \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"headerKey": "x-retry-count", "headerValue": "1"}'
Benefits:
- No MongoDB coupling
- Preserves original message content
- Supports multiple queue types
- Direct RabbitMQ operations
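When scripting retries, building the request bodies in one place keeps field-name typos and unescaped quotes away from the API. A sketch, assuming bash and the field names from the retry examples above; the helper names are hypothetical, and the position check only rejects obviously invalid input (this guide does not say whether `endPosition` is inclusive):

```shell
# Hypothetical helpers that build the JSON bodies used by the retry endpoints.
# Field names come from the examples above; helper names are illustrative.

# Succeeds for non-negative integers without leading zeros (valid JSON numbers).
is_uint() {
  case "${1:-}" in
    ''|*[!0-9]*|0?*) return 1 ;;
    *) return 0 ;;
  esac
}

# Escape backslashes and double quotes for use inside a JSON string.
# Control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

retry_all_body() {
  is_uint "${1:-}" || { echo "maxMessages must be a non-negative integer" >&2; return 1; }
  printf '{"maxMessages": %s}\n' "$1"
}

retry_by_position_body() {
  if ! is_uint "${1:-}" || ! is_uint "${2:-}"; then
    echo "positions must be non-negative integers" >&2
    return 1
  fi
  if [ "$1" -gt "$2" ]; then
    echo "startPosition must not be greater than endPosition" >&2
    return 1
  fi
  printf '{"startPosition": %s, "endPosition": %s}\n' "$1" "$2"
}

retry_by_header_body() {
  printf '{"headerKey": "%s", "headerValue": "%s"}\n' \
    "$(json_escape "${1:-}")" "$(json_escape "${2:-}")"
}
```

Usage: pass the output to curl with `-d "$(retry_by_position_body 0 10)"` alongside the same Authorization and Content-Type headers as the examples above.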
Monitoring
Web Dashboard
Access at http://localhost:4100/dlq-monitor.html
Features:
- Real-time statistics
- Message list with error details
- One-click retry operations
- Queue selection dropdown
- Auto-refresh every 30 seconds
Manual Recovery Procedures
Clear Stuck Processing Tasks
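If partner tasks have been stuck in "processing" status for more than 90 minutes, mark them as failed. The example uses mongosh; on MongoDB versions before 6.0 the legacy mongo shell accepts the same script. Quoting the heredoc delimiter ('EOF') stops the shell from expanding $lt and $set, so no backslash escapes are needed, and the database is already selected by the connection string.
mongosh mongodb://localhost:27017/agmission << 'EOF'
db.partner_log_trackers.updateMany(
  {
    status: 'processing',
    // started more than 90 minutes ago
    processingStartedAt: { $lt: new Date(Date.now() - 90*60*1000) }
  },
  {
    $set: {
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF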
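Before resetting anything, it is safer to count the matching documents first. A hypothetical helper that prints either a count script or the reset script for a configurable age threshold (defaulting to the 90 minutes used above), so the dry run and the reset share exactly one filter; `stuck_tasks_script` is an illustrative name, and its output is meant to be piped to mongosh:

```shell
# Hypothetical helper: print a mongosh script that either counts or resets
# partner_log_trackers stuck in "processing" longer than N minutes (default 90).
# Both modes share one filter, so the dry run matches what the reset changes.
stuck_tasks_script() {
  local mode="${1:-}" minutes="${2:-90}" filter
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  filter="{ status: 'processing', processingStartedAt: { \$lt: new Date(Date.now() - ${minutes}*60*1000) } }"
  case "$mode" in
    count)
      printf 'db.partner_log_trackers.countDocuments(%s)\n' "$filter" ;;
    reset)
      printf "db.partner_log_trackers.updateMany(%s, { \$set: { status: 'failed', errorMessage: 'Manually reset - stuck processing' } })\n" "$filter" ;;
    *)
      echo "usage: stuck_tasks_script count|reset [minutes]" >&2; return 1 ;;
  esac
}

# Usage: check the count first, then reset with the same threshold.
# stuck_tasks_script count 90 | mongosh --quiet mongodb://localhost:27017/agmission
# stuck_tasks_script reset 90 | mongosh --quiet mongodb://localhost:27017/agmission
```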
Purge DLQ (Dangerous!)
⚠️ Warning: This permanently deletes all DLQ messages.
curl -X DELETE http://localhost:4100/api/dlq/partner_tasks/purge \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"confirm": true}'
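Because a purge cannot be undone, an interactive wrapper can require the operator to re-type the queue name before the request is sent. A minimal sketch; `confirm_purge` is a hypothetical helper that prompts on stderr and reads the answer from stdin:

```shell
# Hypothetical guard: succeed only if the operator re-types the queue name.
# The prompt goes to stderr and the answer is read from stdin.
confirm_purge() {
  local queue="${1:-}" answer=""
  [ -n "$queue" ] || { echo "usage: confirm_purge <queueName>" >&2; return 1; }
  printf 'Type "%s" to permanently purge its DLQ: ' "$queue" >&2
  # Accept a final line without a trailing newline; fail on empty input.
  IFS= read -r answer || [ -n "$answer" ] || return 1
  [ "$answer" = "$queue" ]
}

# Usage: send the purge request only after confirmation.
# confirm_purge partner_tasks && curl -X DELETE http://localhost:4100/api/dlq/partner_tasks/purge \
#   -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d '{"confirm": true}'
```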
Multi-Queue Operations
Partner Queue
# View messages
curl http://localhost:4100/api/dlq/partner_tasks/messages \
-H "Authorization: Bearer $TOKEN"
# Retry all
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"maxMessages": 100}'
Job Queue
# View messages
curl http://localhost:4100/api/dlq/dev_jobs/messages \
-H "Authorization: Bearer $TOKEN"
# Retry all
curl -X POST http://localhost:4100/api/dlq/dev_jobs/retryAll \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"maxMessages": 100}'
Future Queues
No code changes are needed; call the same endpoints with the new queue name:
curl -X POST http://localhost:4100/api/dlq/notifications/retryAll \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"maxMessages": 50}'
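Since every queue uses the same endpoints, a loop can retry several DLQs in one pass. A sketch, assuming the queue names from this section and a `TOKEN` variable in the environment; `retry_all_queues` is a hypothetical helper. `--fail` makes curl exit non-zero on HTTP error responses, so a failing queue is reported without stopping the others:

```shell
# Hypothetical helper: run retryAll for each listed queue, continuing past failures.
# Assumes TOKEN holds a valid bearer token and the API runs on localhost:4100.
retry_all_queues() {
  if [ "$#" -lt 2 ]; then
    echo "usage: retry_all_queues <maxMessages> <queueName>..." >&2
    return 1
  fi
  if [ -z "${TOKEN:-}" ]; then
    echo "TOKEN is not set" >&2
    return 1
  fi
  local max="$1" queue status=0
  shift
  case "$max" in ''|*[!0-9]*) echo "maxMessages must be an integer" >&2; return 1 ;; esac
  for queue in "$@"; do
    echo "Retrying up to $max messages from the $queue DLQ" >&2
    # --fail: non-zero exit on HTTP >= 400; -w '\n' puts each response on its own line
    if ! curl -sS --fail -w '\n' -X POST "http://localhost:4100/api/dlq/$queue/retryAll" \
        -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/json" \
        -d "{\"maxMessages\": $max}"; then
      echo "retryAll failed for $queue" >&2
      status=1
    fi
  done
  return "$status"
}
```

Usage: `retry_all_queues 50 partner_tasks dev_jobs`. Starting with a small `maxMessages` follows the "test retry with small batches" practice below.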
Alert Thresholds
Recommended Monitoring
# Check DLQ count (treat a missing field as 0)
DLQ_COUNT=$(curl -s http://localhost:4100/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount // 0')
# Alert thresholds
if [ "$DLQ_COUNT" -gt 100 ]; then
  echo "EMERGENCY: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 50 ]; then
  echo "CRITICAL: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 20 ]; then
  echo "WARNING: DLQ has $DLQ_COUNT messages"
fi
Thresholds:
- Warning: DLQ > 20 messages
- Critical: DLQ > 50 messages
- Emergency: DLQ > 100 messages OR age > 6 hours
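The shell check above covers only the count thresholds. A sketch of a classifier that also applies the 6-hour age rule, assuming the caller already knows the age of the oldest DLQ message in seconds (this guide does not show an API field for it); `dlq_severity` is a hypothetical name:

```shell
# Hypothetical classifier for the thresholds listed above.
# $1 = number of messages in the DLQ
# $2 = age of the oldest DLQ message in seconds (optional; its source is an assumption)
dlq_severity() {
  local count="${1:-0}" age="${2:-0}"
  case "$count$age" in
    *[!0-9]*) echo "count and age must be non-negative integers" >&2; return 1 ;;
  esac
  if [ "$count" -gt 100 ] || [ "$age" -gt 21600 ]; then  # 21600 s = 6 hours
    echo EMERGENCY
  elif [ "$count" -gt 50 ]; then
    echo CRITICAL
  elif [ "$count" -gt 20 ]; then
    echo WARNING
  else
    echo OK
  fi
}
```

`dlq_severity "$DLQ_COUNT"` prints OK, WARNING, CRITICAL or EMERGENCY, which can feed whatever alerting channel is in use.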
Error Categories
Common error patterns and recovery strategies:
Transient Errors
- Network timeouts
- Connection failures
- Temporary API unavailability
Action: Auto-retry (usually succeeds)
Validation Errors
- Invalid file format
- Missing required fields
- Data type mismatches
Action: Fix source data, then retry
Infrastructure Errors
- Database connection failures
- Disk space issues
- Memory errors
Action: Fix infrastructure, then retry all
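For a first pass over many failed messages, a rough keyword match can sort error text into these categories. A sketch only: `classify_error` is a hypothetical helper and its patterns are illustrative assumptions (Node.js error codes such as ETIMEDOUT and ENOSPC plus generic wording), not the messages the workers are known to produce. Infrastructure patterns are checked first so that a database connection failure is not mistaken for a transient network error:

```shell
# Hypothetical triage helper: map an error message to one of the categories above.
# Patterns are illustrative assumptions; adjust them to what your workers log.
# Infrastructure is matched first so database connection errors are not treated as transient.
classify_error() {
  local msg
  msg="$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')"
  case "$msg" in
    *mongo*|*database*|*enospc*|*"no space left"*|*enomem*|*"out of memory"*)
      echo infrastructure ;;  # fix infrastructure, then retry all
    *etimedout*|*timeout*|*econnreset*|*econnrefused*|*"service unavailable"*)
      echo transient ;;       # auto-retry usually succeeds
    *invalid*|*missing*|*required*|*mismatch*|*"cast to"*)
      echo validation ;;      # fix source data, then retry
    *)
      echo unknown ;;
  esac
}
```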
Integration with Monitoring Systems
Prometheus Metrics (Future)
# DLQ message count gauge
dlq_messages_total{queue="partner_tasks"} 5
dlq_messages_total{queue="jobs"} 2
# Retry success rate
dlq_retry_success_rate{queue="partner_tasks"} 0.85
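Until these metrics exist in the application, one stopgap is to write the same gauge to a file read by node_exporter's textfile collector. A sketch, assuming the counts have already been fetched (for example from the stats endpoint) and that the collector directory is /var/lib/node_exporter, a hypothetical path standing in for whatever `--collector.textfile.directory` points to; `dlq_metrics` is a hypothetical helper:

```shell
# Hypothetical exporter: print DLQ counts in the Prometheus text exposition format.
# Arguments are queue=count pairs; the metric name matches the example above.
dlq_metrics() {
  local pair
  echo '# HELP dlq_messages_total Number of messages currently in the DLQ.'
  echo '# TYPE dlq_messages_total gauge'
  for pair in "$@"; do
    printf 'dlq_messages_total{queue="%s"} %s\n' "${pair%%=*}" "${pair#*=}"
  done
}

# Usage (directory is hypothetical). Write a temp file and rename it so the
# collector, which only reads *.prom files, never sees a half-written file:
# dir=/var/lib/node_exporter
# dlq_metrics partner_tasks=5 jobs=2 > "$dir/dlq.prom.$$" && mv "$dir/dlq.prom.$$" "$dir/dlq.prom"
```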
Prometheus Alerting Rules
groups:
  - name: dlq_alerts
    rules:
      - alert: HighDLQCount
        expr: dlq_messages_total > 50
        for: 30m
        annotations:
          summary: "High DLQ message count"
Best Practices
- Regular Monitoring: Check DLQ counts at least daily
- Investigate Patterns: Multiple similar failures indicate systemic issues
- Timely Retry: Retry or triage messages well before they reach the 6-hour age that counts as an emergency
- Use Position Retry: For targeted retry of specific ranges
- Document Failures: Track patterns for future prevention
- Test Retry: Use small batches first to verify fixes
Troubleshooting
Cannot Connect to RabbitMQ
Check connection settings in environment.env:
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agm
QUEUE_PWD=***
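To rule out configuration problems, first confirm that these variables are actually set in the shell that runs the tools. A sketch, assuming bash (it uses `${!var}` indirect expansion); `check_queue_env` is a hypothetical helper:

```shell
# Hypothetical pre-flight check: report which RabbitMQ settings are missing or empty.
# Uses bash indirect expansion (${!var}), so it needs bash rather than plain sh.
check_queue_env() {
  local var missing=0
  for var in QUEUE_HOST QUEUE_PORT QUEUE_USR QUEUE_PWD; do
    if [ -z "${!var:-}" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumes environment.env contains plain KEY=value lines):
# set -a; . ./environment.env; set +a
# check_queue_env && nc -z "$QUEUE_HOST" "$QUEUE_PORT" && echo "RabbitMQ port is reachable"
```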
Messages Not Retrying
- Check the worker is running:
  ps aux | grep partner_sync_worker
- Check the main queue exists:
  curl -u agm:*** http://localhost:15672/api/queues/%2F/dev_partner_tasks
- Check the message format is valid
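The management API expects the virtual host to be percent-encoded, which is why the default vhost / appears as %2F in the queue URL. A sketch of a helper that builds the URL for any vhost and queue, assuming the default management port 15672; `urlencode` and `mgmt_queue_url` are hypothetical names, and the encoder handles ASCII input only:

```shell
# Hypothetical helpers: percent-encode a path segment (ASCII only) and build the
# RabbitMQ management API URL for a queue, e.g. vhost "/" becomes %2F.
urlencode() {
  local s="${1:-}" out="" c i=0
  while [ "$i" -lt "${#s}" ]; do
    c="${s:i:1}"
    case "$c" in
      [[:alnum:]._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;  # "'c" yields the character code
    esac
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

mgmt_queue_url() {
  local vhost="${1:-/}" queue="${2:-}" base="${3:-http://localhost:15672}"
  printf '%s/api/queues/%s/%s\n' "$base" "$(urlencode "$vhost")" "$(urlencode "$queue")"
}

# Usage: a "consumers" value of 0 means no worker is attached to the queue.
# curl -s -u "agm:$QUEUE_PWD" "$(mgmt_queue_url / dev_partner_tasks)" | jq '{messages, consumers}'
```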
High Failure Rate
- Review recent error messages
- Check worker logs for patterns
- Verify external services are available
- Review worker configuration
Related Documentation
📚 DLQ Documentation
- 📖 DLQ Index - Documentation overview
- 🚀 Quick Start - Get started quickly
- 📚 API Reference - Complete API docs
- 🏗️ System Guide - Architecture details
🔗 Additional Resources
- Worker Configuration - Worker setup
- Global DLQ Refactoring - Architecture changes
- Web Dashboard - Monitoring interface