# DLQ Operations Guide

**Navigation:** [📖 Index](DLQ_INDEX.md) | [🚀 Quick Start](DLQ_QUICKSTART.md) | [📚 API Reference](DLQ_API_REFERENCE.md) | [🔧 Operations](DLQ_OPERATIONS.md) | [🏗️ System Guide](DLQ_SYSTEM_GUIDE.md)

---

Comprehensive guide for managing Dead Letter Queues across all queue types.

## Overview

The DLQ system provides queue-native tools for monitoring and managing failed tasks across **all queue types**:
- Partner tasks (`partner_tasks`)
- Job processing (`jobs`)
- Future queue types (notifications, analytics, etc.)

**Key Benefits:**
- Direct RabbitMQ operations (no MongoDB coupling)
- Supports multiple queue types
- Preserves original message content and headers
- Works with any task type

---

## Architecture

### Components

1. **Workers** - Process tasks, send failures to DLQ
   - `workers/partner_sync_worker.js`
   - `workers/job_worker.js`
   - Future workers for other queue types

2. **DLQ Routes** - Global API endpoints
   - `routes/dlq.js`
   - Mounted at `/api/dlq/:queueName/*`

3. **DLQ Controller** - Queue operations logic
   - `controllers/dlq.js`
   - Handles all queue types generically

4. **Monitoring Tools**
   - Web dashboard: `public/dlq-monitor.html`

### Message Flow

```mermaid
flowchart LR
    A[Worker] --> B[Main Queue]
    B --> C{Processing}
    C -->|Success ✓| D[Complete]
    C -->|Failure<br/>max retries| E[DLQ]
    E --> F{Action}
    F -->|Retry| B
    F -->|Archive| G[Archive Storage]
    F -->|Purge| H[Delete]
```
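
The failure branch of the flow above can be sketched as a small routing function. This is a minimal illustration, not the workers' actual code: the function name `routeFailedMessage`, the `MAX_RETRIES` value of 3, and the use of the `x-retry-count` header as the attempt counter are all assumptions made for the example.

```javascript
// Hypothetical retry limit; this guide does not state the real value.
const MAX_RETRIES = 3;

/**
 * Decide where a message goes after a processing failure.
 * Assumption: the `x-retry-count` header holds the number of retries so far.
 * Returns the target ('main' or 'dlq') and the headers to publish with.
 */
function routeFailedMessage(headers = {}, maxRetries = MAX_RETRIES) {
  const parsed = Number(headers['x-retry-count']);
  // Treat a missing or malformed counter as "no retries yet".
  const retryCount = Number.isFinite(parsed) && parsed >= 0 ? parsed : 0;

  if (retryCount >= maxRetries) {
    // Retries exhausted: dead-letter the message with its headers preserved.
    return { target: 'dlq', headers: { ...headers, 'x-retry-count': retryCount } };
  }
  // Retries remain: requeue on the main queue with an incremented counter.
  return { target: 'main', headers: { ...headers, 'x-retry-count': retryCount + 1 } };
}
```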

---

## Queue-Native Operations

### Retry Operations

**Retry All Messages (Recommended)**
```bash
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'
```

**Retry by Position Range (0-based index)**
```bash
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'
```

**Retry by Header Match (Custom filtering)**
```bash
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'
```
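
To make the position and header selection modes concrete, here is a hedged sketch of how such filters could behave over an in-memory list of messages. The helpers `selectByPosition` and `selectByHeader` are hypothetical, and two behaviors are assumptions rather than documented facts: that `endPosition` is inclusive, and that header values are compared as strings. Check the API reference for the controller's actual semantics.

```javascript
/**
 * Select messages by 0-based position.
 * Assumption: endPosition is inclusive.
 */
function selectByPosition(messages, startPosition, endPosition) {
  return messages.filter(
    (_, index) => index >= startPosition && index <= endPosition
  );
}

/**
 * Select messages whose header `headerKey` equals `headerValue`.
 * Assumption: values are compared as strings, so 1 matches "1".
 */
function selectByHeader(messages, headerKey, headerValue) {
  return messages.filter((message) => {
    const headers = message.headers || {};
    // Skip messages that do not carry the header at all.
    if (!Object.prototype.hasOwnProperty.call(headers, headerKey)) return false;
    return String(headers[headerKey]) === String(headerValue);
  });
}

// Hypothetical DLQ contents used only for illustration.
const sampleMessages = [
  { id: 'a', headers: { 'x-retry-count': 1 } },
  { id: 'b', headers: { 'x-retry-count': 2 } },
  { id: 'c', headers: {} },
];
```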

**Benefits:**
- No MongoDB coupling
- Preserves original message content
- Supports multiple queue types
- Direct RabbitMQ operations

---

## Monitoring

### Web Dashboard

Access at `http://localhost:4100/dlq-monitor.html`

Features:
- Real-time statistics
- Message list with error details
- One-click retry operations
- Queue selection dropdown
- Auto-refresh every 30 seconds

---

## Manual Recovery Procedures

### Clear Stuck Processing Tasks

If tasks are stuck in `processing` status, mark them as failed. The query below treats any task that started processing more than 90 minutes ago as stuck. It uses the legacy `mongo` shell; on installations that ship only `mongosh`, use that command instead.

```bash
mongo mongodb://localhost:27017/agmission << EOF
use agmission
db.partner_log_trackers.updateMany(
  {
    status: 'processing',
    processingStartedAt: { \$lt: new Date(Date.now() - 90*60*1000) }
  },
  {
    \$set: {
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF
```

### Purge DLQ (Dangerous!)

⚠️ **Warning**: This permanently deletes all messages in the named queue's DLQ and cannot be undone.

```bash
curl -X DELETE http://localhost:4100/api/dlq/partner_tasks/purge \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"confirm": true}'
```

---

## Multi-Queue Operations

### Partner Queue
```bash
# View messages
curl http://localhost:4100/api/dlq/partner_tasks/messages \
  -H "Authorization: Bearer $TOKEN"

# Retry all
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'
```

### Job Queue

These examples use the queue name `dev_jobs`; substitute the queue name used in your environment.

```bash
# View messages
curl http://localhost:4100/api/dlq/dev_jobs/messages \
  -H "Authorization: Bearer $TOKEN"

# Retry all
curl -X POST http://localhost:4100/api/dlq/dev_jobs/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'
```

### Future Queues

No code changes are needed for new queue types; pass the new queue name in the URL:
```bash
curl -X POST http://localhost:4100/api/dlq/notifications/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'
```
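
Because the queue name is just a path segment, a single client helper can serve every queue type. The sketch below builds the request for any queue and action using only the endpoints and HTTP methods that appear in this guide's curl examples. The helper name `buildDlqRequest` and its return shape are hypothetical; the result is meant to be passed to `fetch(url, options)`, which is built into Node.js 18 and later.

```javascript
// HTTP method per action, taken from the curl examples in this guide.
const DLQ_METHODS = {
  messages: 'GET',
  stats: 'GET',
  retryAll: 'POST',
  retryByPosition: 'POST',
  retryByHeader: 'POST',
  purge: 'DELETE',
};

/**
 * Build the URL and fetch options for a DLQ API call (hypothetical helper).
 * `body` is optional and is sent as JSON when provided.
 */
function buildDlqRequest(baseUrl, queueName, action, token, body) {
  if (!Object.prototype.hasOwnProperty.call(DLQ_METHODS, action)) {
    throw new Error(`Unknown DLQ action: ${action}`);
  }
  // Drop trailing slashes so the path is not doubled up.
  const root = baseUrl.replace(/\/+$/, '');
  const url = `${root}/api/dlq/${encodeURIComponent(queueName)}/${action}`;

  const options = {
    method: DLQ_METHODS[action],
    headers: { Authorization: `Bearer ${token}` },
  };
  if (body !== undefined) {
    options.headers['Content-Type'] = 'application/json';
    options.body = JSON.stringify(body);
  }
  return { url, options };
}
```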

---

## Alert Thresholds

### Recommended Monitoring

```bash
# Check DLQ count
DLQ_COUNT=$(curl -s http://localhost:4100/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount')

# Alert thresholds (count only; message age is not checked here)
if [ "$DLQ_COUNT" -gt 100 ]; then
  echo "EMERGENCY: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 50 ]; then
  echo "CRITICAL: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 20 ]; then
  echo "WARNING: DLQ has $DLQ_COUNT messages"
fi
```

**Thresholds:**
- Warning: DLQ > 20 messages
- Critical: DLQ > 50 messages
- Emergency: DLQ > 100 messages OR message age > 6 hours
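
The thresholds listed above can also be expressed as a single classification function, which makes the age condition explicit. This is a sketch under assumptions: `classifyDlq` is a hypothetical name, and "age" is interpreted as the age in hours of the oldest message in the DLQ.

```javascript
// Threshold values copied from the list above.
const DLQ_THRESHOLDS = { warning: 20, critical: 50, emergency: 100, maxAgeHours: 6 };

/**
 * Map a DLQ message count (and, optionally, the age of the oldest
 * message in hours) to a severity level.
 */
function classifyDlq(messageCount, oldestAgeHours = 0) {
  if (messageCount > DLQ_THRESHOLDS.emergency || oldestAgeHours > DLQ_THRESHOLDS.maxAgeHours) {
    return 'emergency';
  }
  if (messageCount > DLQ_THRESHOLDS.critical) return 'critical';
  if (messageCount > DLQ_THRESHOLDS.warning) return 'warning';
  return 'ok';
}
```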

---

## Error Categories

Common error patterns and recovery strategies:

### Transient Errors
- Network timeouts
- Connection failures
- Temporary API unavailability

**Action**: Auto-retry (usually succeeds)

### Validation Errors
- Invalid file format
- Missing required fields
- Data type mismatches

**Action**: Fix source data, then retry

### Infrastructure Errors
- Database connection failures
- Disk space issues
- Memory errors

**Action**: Fix infrastructure, then retry all
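
As a rough illustration of how the three categories above could be applied to DLQ error messages during triage, the sketch below matches error text against a few patterns. The function `classifyError` and every regular expression in it are illustrative assumptions, not the system's actual error strings; adapt them to the messages your workers really produce.

```javascript
// Rules are checked in order. Infrastructure comes first so that
// "database connection" errors are not mistaken for transient ones.
const ERROR_RULES = [
  {
    category: 'infrastructure',
    action: 'Fix infrastructure, then retry all',
    pattern: /ENOSPC|no space left|out of memory|database/i,
  },
  {
    category: 'validation',
    action: 'Fix source data, then retry',
    pattern: /invalid|missing required|mismatch/i,
  },
  {
    category: 'transient',
    action: 'Auto-retry',
    pattern: /timeout|timed out|ECONNRESET|ECONNREFUSED|unavailable/i,
  },
];

/** Return the first matching category and its recommended action. */
function classifyError(errorMessage) {
  const text = String(errorMessage || '');
  const rule = ERROR_RULES.find((r) => r.pattern.test(text));
  if (!rule) {
    // Nothing matched: leave the decision to a human.
    return { category: 'unknown', action: 'Investigate manually' };
  }
  return { category: rule.category, action: rule.action };
}
```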

---

## Integration with Monitoring Systems

### Prometheus Metrics (Future)

```text
# DLQ message count gauge
dlq_messages_total{queue="partner_tasks"} 5
dlq_messages_total{queue="jobs"} 2

# Retry success rate
dlq_retry_success_rate{queue="partner_tasks"} 0.85
```
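
Since these metrics are not implemented yet, the following is only a sketch of how per-queue counts could be rendered in the Prometheus text exposition format shown above. The function `renderDlqMetrics` and the shape of its input (an object mapping queue name to `{ messageCount }`) are hypothetical.

```javascript
/** Escape a label value per the Prometheus text format rules. */
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

/**
 * Render DLQ message counts as Prometheus exposition text.
 * Input shape is an assumption: { queueName: { messageCount: number } }.
 */
function renderDlqMetrics(statsByQueue) {
  const lines = ['# TYPE dlq_messages_total gauge'];
  for (const [queue, stats] of Object.entries(statsByQueue)) {
    lines.push(`dlq_messages_total{queue="${escapeLabelValue(queue)}"} ${stats.messageCount}`);
  }
  // The exposition format expects a trailing newline.
  return lines.join('\n') + '\n';
}
```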

### Alert Manager Rules

The rule below uses the Prometheus alerting rule format; Alertmanager then routes the alerts that fire.

```yaml
groups:
  - name: dlq_alerts
    rules:
      - alert: HighDLQCount
        expr: dlq_messages_total > 50
        for: 30m
        annotations:
          summary: "High DLQ message count"
```

---

## Best Practices

1. **Regular Monitoring**: Check DLQ counts at least daily
2. **Investigate Patterns**: Multiple similar failures indicate systemic issues
3. **Timely Retry**: Retry before messages approach the 6-hour emergency age threshold
4. **Use Position Retry**: Use `retryByPosition` to retry a specific range of messages
5. **Document Failures**: Track patterns for future prevention
6. **Test Retry**: Retry a small batch first to verify a fix before retrying everything

---

## Troubleshooting

### Cannot Connect to RabbitMQ

Check connection settings in `environment.env`:
```env
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agm
QUEUE_PWD=***
```

### Messages Not Retrying

1. Check that the worker is running:
   ```bash
   ps aux | grep partner_sync_worker
   ```

2. Check that the main queue exists (RabbitMQ management API):
   ```bash
   curl http://localhost:15672/api/queues/%2F/dev_partner_tasks \
     -u agm:***
   ```

3. Check that the message format is valid

### High Failure Rate

1. Review recent error messages
2. Check worker logs for patterns
3. Verify external services are available
4. Review worker configuration

---

## Related Documentation

### 📚 DLQ Documentation

- **[📖 DLQ Index](DLQ_INDEX.md)** - Documentation overview
- **[🚀 Quick Start](DLQ_QUICKSTART.md)** - Get started quickly
- **[📚 API Reference](DLQ_API_REFERENCE.md)** - Complete API docs
- **[🏗️ System Guide](DLQ_SYSTEM_GUIDE.md)** - Architecture details

### 🔗 Additional Resources

- [Worker Configuration](../README.md#workers) - Worker setup
- [Global DLQ Refactoring](../GLOBAL_DLQ_REFACTORING_COMPLETE.md) - Architecture changes
- [Web Dashboard](../public/dlq-monitor.html) - Monitoring interface