agmission/Development/server/docs/DLQ_OPERATIONS.md
Devin Major df31b2080d
2026-04-24 09:05:55 -04:00


# DLQ Operations Guide
**Navigation:** [📖 Index](DLQ_INDEX.md) | [🚀 Quick Start](DLQ_QUICKSTART.md) | [📚 API Reference](DLQ_API_REFERENCE.md) | [🔧 Operations](DLQ_OPERATIONS.md) | [🏗️ System Guide](DLQ_SYSTEM_GUIDE.md)

---

Comprehensive guide for managing Dead Letter Queues across all queue types.
## Overview
The DLQ system provides queue-native tools for monitoring and managing failed tasks across **all queue types**:
- Partner tasks (`partner_tasks`)
- Job processing (`jobs`)
- Future queue types (notifications, analytics, etc.)
**Key Benefits:**
- Direct RabbitMQ operations (no MongoDB coupling)
- Supports multiple queue types
- Preserves original message content and headers
- Works with any task type
---
## Architecture
### Components
1. **Workers** - Process tasks, send failures to DLQ
   - `workers/partner_sync_worker.js`
   - `workers/job_worker.js`
   - Future workers for other queue types
2. **DLQ Routes** - Global API endpoints
   - `routes/dlq.js`
   - Mounted at `/api/dlq/:queueName/*`
3. **DLQ Controller** - Queue operations logic
   - `controllers/dlq.js`
   - Handles all queue types generically
4. **Monitoring Tools**
   - Web dashboard: `public/dlq-monitor.html`
### Message Flow
```mermaid
flowchart LR
A[Worker] --> B[Main Queue]
B --> C{Processing}
C -->|Success ✓| D[Complete]
C -->|Failure<br/>max retries| E[DLQ]
E --> F{Action}
F -->|Retry| B
F -->|Archive| G[Archive Storage]
F -->|Purge| H[Delete]
```
---
## Queue-Native Operations
### Retry Operations
**Retry All Messages (Recommended)**
```bash
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'
```
**Retry by Position Range (0-based index)**
```bash
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'
```
**Retry by Header Match (Custom filtering)**
```bash
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'
```
**Benefits:**
- No MongoDB coupling
- Preserves original message content
- Supports multiple queue types
- Direct RabbitMQ operations
---
## Monitoring
### Web Dashboard
Access at `https://localhost:4100/dlq-monitor.html`
Features:
- Real-time statistics
- Message list with error details
- One-click retry operations
- Queue selection dropdown
- Auto-refresh every 30 seconds
---
## Manual Recovery Procedures
### Clear Stuck Processing Tasks
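If tasks have been stuck in `processing` status for more than 90 minutes, reset them to `failed`:
```bash
# The quoted 'EOF' stops the shell from expanding $lt and $set.
# On MongoDB 6.0+ the legacy `mongo` shell is gone; use `mongosh` instead.
mongo mongodb://localhost:27017/agmission << 'EOF'
db.partner_log_trackers.updateMany(
  {
    status: 'processing',
    // Started more than 90 minutes ago
    processingStartedAt: { $lt: new Date(Date.now() - 90*60*1000) }
  },
  {
    $set: {
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF
```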
### Purge DLQ (Dangerous!)
⚠️ **Warning**: This permanently deletes all DLQ messages.
```bash
curl -X DELETE http://localhost:4100/api/dlq/partner_tasks/purge \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"confirm": true}'
```
---
## Multi-Queue Operations
### Partner Queue
```bash
# View messages
curl http://localhost:4100/api/dlq/partner_tasks/messages \
  -H "Authorization: Bearer $TOKEN"

# Retry all (Content-Type is required for the JSON body)
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'
```
### Job Queue
```bash
# View messages
curl http://localhost:4100/api/dlq/dev_jobs/messages \
  -H "Authorization: Bearer $TOKEN"

# Retry all (Content-Type is required for the JSON body)
curl -X POST http://localhost:4100/api/dlq/dev_jobs/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'
```
### Future Queues
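New queue types use the same endpoints with no code changes; only the queue name in the path differs:
```bash
curl -X POST http://localhost:4100/api/dlq/notifications/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'
```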
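
Because only the queue name changes between these calls, the request can be wrapped in a small shell function. The following is a sketch rather than anything shipped in the repository: `dlq_retry_all`, `DLQ_BASE_URL`, and `DLQ_DRY_RUN` are hypothetical names introduced here, and the dry-run mode simply prints the request so a batch size can be checked before anything is sent.

```shell
# Hypothetical wrapper around the retryAll endpoint.
# DLQ_BASE_URL and DLQ_DRY_RUN are made-up names for this sketch.
DLQ_BASE_URL=${DLQ_BASE_URL:-http://localhost:4100}

# Usage: dlq_retry_all QUEUE_NAME [MAX_MESSAGES]
dlq_retry_all() {
  queue=$1
  max=${2:-50}
  # Reject a batch size that is empty or contains non-digits.
  case "$max" in
    ''|*[!0-9]*) echo "maxMessages must be a whole number" >&2; return 1 ;;
  esac
  url="$DLQ_BASE_URL/api/dlq/$queue/retryAll"
  body="{\"maxMessages\": $max}"
  if [ "${DLQ_DRY_RUN:-0}" = "1" ]; then
    # Print the request instead of sending it.
    echo "POST $url $body"
    return 0
  fi
  curl -X POST "$url" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$body"
}

# Dry run: show what would be sent for the job queue.
DLQ_DRY_RUN=1
dlq_retry_all dev_jobs 10
```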
---
## Alert Thresholds
### Recommended Monitoring
```bash
# Check DLQ count
DLQ_COUNT=$(curl -s http://localhost:4100/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount')

# Alert thresholds (message age is not checked here)
if [ "$DLQ_COUNT" -gt 100 ]; then
  echo "EMERGENCY: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 50 ]; then
  echo "CRITICAL: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 20 ]; then
  echo "WARNING: DLQ has $DLQ_COUNT messages"
fi
```
**Thresholds:**
- Warning: DLQ > 20 messages
- Critical: DLQ > 50 messages
- Emergency: DLQ > 100 messages OR age > 6 hours
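
The emergency level also has an age condition, which a count check alone does not cover. The helper below is a minimal sketch that maps a message count and the age of the oldest message to the levels listed above. The function name `dlq_severity` is hypothetical, and how the oldest message's age is obtained is left open, since this guide does not say whether the stats endpoint reports it.

```shell
# Hypothetical helper: map a DLQ message count and the age (in whole hours)
# of the oldest message to a severity level.
# Usage: dlq_severity COUNT [OLDEST_AGE_HOURS]
dlq_severity() {
  count=$1
  age_hours=${2:-0}
  if [ "$count" -gt 100 ] || [ "$age_hours" -gt 6 ]; then
    echo "EMERGENCY"
  elif [ "$count" -gt 50 ]; then
    echo "CRITICAL"
  elif [ "$count" -gt 20 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Example with literal values: 75 messages, oldest is 2 hours old.
dlq_severity 75 2
```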
---
## Error Categories
Common error patterns and recovery strategies:
### Transient Errors
- Network timeouts
- Connection failures
- Temporary API unavailability

**Action**: Auto-retry (usually succeeds)

### Validation Errors
- Invalid file format
- Missing required fields
- Data type mismatches

**Action**: Fix source data, then retry

### Infrastructure Errors
- Database connection failures
- Disk space issues
- Memory errors

**Action**: Fix infrastructure, then retry all

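
When triaging many failed messages, it can help to sort error text into these three categories before choosing an action. The sketch below shows one way to do that with shell pattern matching. The function name `classify_dlq_error` is hypothetical and the match patterns are illustrative guesses, not strings taken from the workers, so they would need adjusting to the errors actually seen in the DLQ.

```shell
# Hypothetical helper: map an error message to one of the three categories.
# The patterns are illustrative only.
classify_dlq_error() {
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    # Checked first so "database connection timeout" is not treated as transient.
    *"no space left"*|*"out of memory"*|*"database connection"*)
      echo "infrastructure" ;;
    *timeout*|*"timed out"*|*econnreset*|*econnrefused*|*"service unavailable"*)
      echo "transient" ;;
    *invalid*|*"missing required"*|*"type mismatch"*)
      echo "validation" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example: a network timeout falls into the transient category.
classify_dlq_error "ETIMEDOUT: request timeout after 30000ms"
```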
---
## Integration with Monitoring Systems
### Prometheus Metrics (Future)
```text
# DLQ message count gauge
dlq_messages_total{queue="partner_tasks"} 5
dlq_messages_total{queue="jobs"} 2

# Retry success rate
dlq_retry_success_rate{queue="partner_tasks"} 0.85
```
### Alert Manager Rules
```yaml
groups:
  - name: dlq_alerts
    rules:
      - alert: HighDLQCount
        expr: dlq_messages_total > 50
        for: 30m
        annotations:
          summary: "High DLQ message count"
```
---
## Best Practices
1. **Regular Monitoring**: Check DLQ counts at least daily
2. **Investigate Patterns**: Multiple similar failures indicate systemic issues
3. **Timely Retry**: Retry or archive messages before they reach the 6-hour age threshold
4. **Use Position Retry**: For targeted retry of specific ranges
5. **Document Failures**: Track patterns for future prevention
6. **Test Retry**: Use small batches first to verify fixes
---
## Troubleshooting
### Cannot Connect to RabbitMQ
Check connection settings in `environment.env`:
```env
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agm
QUEUE_PWD=***
```
### Messages Not Retrying
1. Check worker is running:
   ```bash
   ps aux | grep partner_sync_worker
   ```
2. Check main queue exists:
   ```bash
   curl http://localhost:15672/api/queues/%2F/dev_partner_tasks \
     -u agm:***
   ```
3. Check message format is valid
### High Failure Rate
1. Review recent error messages
2. Check worker logs for patterns
3. Verify external services are available
4. Review worker configuration
---
## Related Documentation
### 📚 DLQ Documentation
- **[📖 DLQ Index](DLQ_INDEX.md)** - Documentation overview
- **[🚀 Quick Start](DLQ_QUICKSTART.md)** - Get started quickly
- **[📚 API Reference](DLQ_API_REFERENCE.md)** - Complete API docs
- **[🏗️ System Guide](DLQ_SYSTEM_GUIDE.md)** - Architecture details
### 🔗 Additional Resources
- [Worker Configuration](../README.md#workers) - Worker setup
- [Global DLQ Refactoring](../GLOBAL_DLQ_REFACTORING_COMPLETE.md) - Architecture changes
- [Web Dashboard](../public/dlq-monitor.html) - Monitoring interface