# Partner DLQ (Dead Letter Queue) Handling System

## Overview

The Partner DLQ Handling System provides automatic and manual management of failed partner processing tasks. It categorizes failures, automatically retries transient errors, archives non-recoverable tasks, and provides monitoring tools for administrators.

## Architecture

### Components

1. **Partner Sync Worker** (`workers/partner_sync_worker.js`)
   - Primary task processor
   - Sends failed tasks to DLQ after max retries
   - Implements circuit breaker for problematic files

2. **DLQ Handler** (`workers/partner_dlq_handler.js`)
   - Monitors and processes DLQ messages
   - Categorizes errors and makes retry/archive decisions
   - Provides programmatic DLQ management

3. **DLQ Monitor** (`scripts/monitor_partner_dlq.js`)
   - Interactive dashboard for DLQ monitoring
   - Manual operations and statistics

### Message Flow

```mermaid
flowchart TD
    A[Polling Worker<br/>Enqueues Task] --> B[Partner Queue<br/>Main Queue]
    B -->|Processing| C[Sync Worker<br/>Processing]
    B -->|Max Retries<br/>Exceeded| D[Dead Letter Queue<br/>DLQ]
    D --> E[DLQ Handler<br/>Analysis]
    E --> F[Retry<br/>Queue]
    E --> G[Archive<br/>DB]
    E --> H[Manual<br/>Review]
```

## Error Categories

### 1. Transient Errors
- Network timeouts
- Temporary connection issues
- Database connection failures
- **Action**: Auto-retry within 2-hour window

### 2. Validation Errors
- Invalid file format
- Missing required fields
- Data validation failures
- **Action**: Archive immediately, notify admin

### 3. Processing Errors
- Calculation errors
- Parse errors
- Logic errors
- **Action**: Keep for manual review

### 4. Infrastructure Errors
- Database errors
- Filesystem errors
- Transaction failures
- **Action**: Retry with exponential backoff

### 5. Partner API Errors
- API authentication failures
- Rate limiting
- Partner service unavailable
- **Action**: Retry with longer delay

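
The category-to-action mapping above could be sketched as a pattern lookup on the error message. This is illustrative only: `CATEGORY_PATTERNS`, `ACTIONS`, `categorizeError`, and `decideAction` are hypothetical names, the regular expressions are guesses at typical messages, and routing unknown errors to manual review is an assumption. The actual rules live in `workers/partner_dlq_handler.js`.

```javascript
// Hypothetical sketch: the patterns and action names below are assumptions,
// not the handler's real rules. Order matters: first match wins.
const CATEGORY_PATTERNS = [
  { category: 'validation', pattern: /invalid (file )?format|missing required|validation/i },
  { category: 'partner_api', pattern: /rate limit|unauthorized|authentication|service unavailable/i },
  { category: 'transient', pattern: /timeout|timed out|ECONNRESET|ECONNREFUSED/i },
  { category: 'infrastructure', pattern: /filesystem|ENOSPC|transaction/i },
  { category: 'processing', pattern: /parse|calculation/i },
];

// One action per documented category; 'unknown' -> manual review is assumed.
const ACTIONS = {
  transient: 'retry',
  validation: 'archive',
  processing: 'manual_review',
  infrastructure: 'retry_with_backoff',
  partner_api: 'retry_with_delay',
  unknown: 'manual_review',
};

// Return the first category whose pattern matches the message.
function categorizeError(message) {
  const hit = CATEGORY_PATTERNS.find(({ pattern }) => pattern.test(String(message)));
  return hit ? hit.category : 'unknown';
}

// Combine category and action into a single decision object.
function decideAction(message) {
  const category = categorizeError(message);
  return { category, action: ACTIONS[category] };
}
```

For example, `decideAction('Invalid file format')` yields the `validation` category with the `archive` action, so the task is never retried.
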
## Configuration

### Environment Variables

```bash
# Queue Configuration
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agmuser
QUEUE_PWD=<password>
QUEUE_NAME_PARTNER=partner_tasks  # Base name, auto-prefixes 'dev_' when PRODUCTION=false

# Retry Configuration
PARTNER_MAX_RETRIES=5             # Max retries before DLQ
PARTNER_RETRY_DELAY=10000         # Base retry delay (ms)

# DLQ Configuration
DLQ_CHECK_INTERVAL=300000         # Check DLQ every 5 minutes
MAX_DLQ_AGE_MS=86400000           # Archive after 24 hours
AUTO_RETRY_WINDOW_MS=7200000      # Auto-retry within 2 hours
```

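
As a rough illustration of how these variables might drive retry timing and age-based DLQ decisions, consider the sketch below. The function names (`retryDelay`, `decideByAge`), the doubling backoff factor, and the `manual_review` fallback for messages between the two windows are all assumptions. Only the variable names and default values come from the configuration above.

```javascript
// Defaults mirror the documented values; reading them with Number() is an assumption.
const config = {
  maxRetries: Number(process.env.PARTNER_MAX_RETRIES ?? 5),
  retryDelayMs: Number(process.env.PARTNER_RETRY_DELAY ?? 10000),
  maxDlqAgeMs: Number(process.env.MAX_DLQ_AGE_MS ?? 86400000),
  autoRetryWindowMs: Number(process.env.AUTO_RETRY_WINDOW_MS ?? 7200000),
};

// Exponential backoff: the base delay doubles per attempt (attempt starts at 0).
// The factor of 2 is an assumption; the source only names a base delay.
function retryDelay(attempt, cfg = config) {
  return cfg.retryDelayMs * 2 ** attempt;
}

// Age-based decision for a message sitting in the DLQ.
function decideByAge(ageMs, cfg = config) {
  if (ageMs <= cfg.autoRetryWindowMs) return 'retry';   // still inside the auto-retry window
  if (ageMs >= cfg.maxDlqAgeMs) return 'archive';       // too old, archive it
  return 'manual_review';                               // assumed fallback in between
}
```

With the documented defaults, the fourth attempt (`attempt = 3`) would wait 80 seconds, and a message that has been in the DLQ for 5 hours would fall between the two windows.
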
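### Worker Constants

```javascript
// Assumption: production is signalled by PRODUCTION=true; adjust to match the worker.
const IS_PROD = process.env.PRODUCTION === 'true';

// Circuit Breaker
const MAX_FILE_ATTEMPTS = IS_PROD ? 3 : 10;                  // 10 (dev) / 3 (prod)
const FILE_BLOCK_DURATION = (IS_PROD ? 60 : 5) * 60 * 1000;  // 5 min (dev) / 1 hour (prod), assuming ms

// Timeouts
const PROCESSING_TIMEOUT_MS = 90 * 60 * 1000;  // 90 minutes
const TASK_TIMEOUT_MS = 90 * 60 * 1000;        // 90 minutes
const CLEANUP_INTERVAL_MS = 15 * 60 * 1000;    // 15 minutes
```
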
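
To make the circuit breaker constants concrete, here is a minimal in-memory sketch. The `problematicFiles` map name matches the one referenced in the troubleshooting section, but the entry shape and the `recordFailure` / `isBlocked` helpers are hypothetical, and the constants are fixed to the production values for brevity.

```javascript
// Hypothetical in-memory circuit breaker keyed by file identifier.
const MAX_FILE_ATTEMPTS = 3;                 // documented production value
const FILE_BLOCK_DURATION = 60 * 60 * 1000;  // 1 hour, assuming milliseconds

const problematicFiles = new Map(); // fileId -> { attempts, blockedUntil }

// Count a failed attempt and block the file once the limit is reached.
function recordFailure(fileId, now = Date.now()) {
  const entry = problematicFiles.get(fileId) ?? { attempts: 0, blockedUntil: 0 };
  entry.attempts += 1;
  if (entry.attempts >= MAX_FILE_ATTEMPTS) {
    entry.blockedUntil = now + FILE_BLOCK_DURATION;
  }
  problematicFiles.set(fileId, entry);
  return entry;
}

// A file is blocked while its block window is still open.
function isBlocked(fileId, now = Date.now()) {
  const entry = problematicFiles.get(fileId);
  if (!entry) return false;
  if (entry.blockedUntil > now) return true;
  // Block expired: forget the file so it gets a fresh set of attempts.
  if (entry.blockedUntil !== 0) problematicFiles.delete(fileId);
  return false;
}
```

Because the state lives in a `Map` inside the process, restarting the worker clears every block, which is consistent with the reset advice in the troubleshooting section.
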
## Usage

### 1. Start DLQ Handler (Automatic Mode)

```bash
# Run as background service
node workers/partner_dlq_handler.js monitor &

# Or with PM2
pm2 start workers/partner_dlq_handler.js --name partner-dlq-handler -- monitor
```

### 2. Manual DLQ Operations

```bash
# Show DLQ statistics
node workers/partner_dlq_handler.js stats

# Process DLQ messages once
node workers/partner_dlq_handler.js process
```

### 3. Interactive Dashboard

```bash
# Launch monitoring dashboard
node scripts/monitor_partner_dlq.js
```

Dashboard commands:
- `r` - Refresh dashboard
- `p` - Process DLQ now
- `s` - Show detailed statistics
- `c` - Clear archived tasks (older than 7 days)
- `q` - Quit

## Monitoring

### DLQ Statistics

```javascript
{
  messageCount: 5,      // Messages in DLQ
  consumerCount: 0,     // Active consumers
  queueName: 'partner_tasks_failed'
}
```

### Tracker Statistics

```javascript
{
  failed: 12,           // Failed tasks
  processing: 3,        // Currently processing
  downloaded: 8,        // Downloaded, waiting
  processed: 245,       // Successfully processed
  archived: 7           // Archived from DLQ
}
```

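
A processing success rate can be derived from the tracker statistics. The sketch below uses the sample values shown above. Treating `failed` and `archived` as disjoint terminal states is an assumption, and `successRate` is a hypothetical helper, not part of the handler.

```javascript
// Hypothetical helper: share of finished tasks that were processed successfully.
// Assumes `failed` and `archived` do not overlap; in-flight states are ignored.
function successRate(stats) {
  const finished = stats.processed + stats.failed + stats.archived;
  return finished === 0 ? null : stats.processed / finished;
}

// Sample values from the tracker statistics in this document.
const stats = { failed: 12, processing: 3, downloaded: 8, processed: 245, archived: 7 };
console.log(`${(successRate(stats) * 100).toFixed(1)}%`); // prints "92.8%"
```
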
## Queue-Native Operations

### Retry Operations (Recommended - Direct RabbitMQ)

```bash
# Retry all messages in queue
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'

# Retry by position range (0-based index)
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'

# Retry by header match
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'
```

**Benefits**:
- No MongoDB coupling
- Preserves original message content
- Supports multiple queue types
- Direct RabbitMQ operations

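
The same retry endpoints can be called from Node.js. The sketch below only builds the request so it can be inspected or passed to `fetch`. The helper name `buildRetryRequest` is hypothetical, while the base URL, paths, headers, and body fields are taken from the `curl` examples above.

```javascript
// Base URL as used in the curl examples; change it for other environments.
const BASE_URL = 'http://localhost:3000';

// Hypothetical helper: assemble URL and fetch options for a DLQ retry call.
// `operation` is one of 'retryAll', 'retryByPosition', 'retryByHeader'.
function buildRetryRequest(queue, operation, body, token) {
  return {
    url: `${BASE_URL}/api/dlq/${encodeURIComponent(queue)}/${operation}`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

// Usage (Node 18+ ships a global fetch):
//   const { url, options } = buildRetryRequest(
//     'partner_tasks', 'retryByPosition',
//     { startPosition: 0, endPosition: 10 }, process.env.TOKEN);
//   const res = await fetch(url, options);
```
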
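### Legacy Manual Recovery (Programmatic)

For advanced debugging scenarios only:

```javascript
// Assumption: the module exports the class directly; adjust to the actual export shape.
const PartnerDLQHandler = require('./workers/partner_dlq_handler');

(async () => {
  const handler = new PartnerDLQHandler();
  await handler.start();

  // Get one message from the DLQ (with amqplib, `get` resolves to false when the queue is empty)
  const msg = await handler.channel.get('partner_tasks_failed');
  if (!msg) return;

  // Retry it programmatically; the message content is a Buffer, so decode it first
  await handler.retryMessage(msg, JSON.parse(msg.content.toString()));
})();
```
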
### Clear Stuck Tasks

```bash
# Reset stuck processing tasks
mongo mongodb://localhost:27017/agmission << EOF
use agmission
db.partnerlogtrackers.updateMany(
  {
    status: 'processing',
    processingStartedAt: { \$lt: new Date(Date.now() - 90*60*1000) }
  },
  {
    \$set: {
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF
```

### Purge DLQ

```bash
# WARNING: This deletes all DLQ messages
rabbitmqadmin purge queue name=partner_tasks_failed
```

## Troubleshooting

### High DLQ Message Count

1. Check error patterns:
   ```bash
   node scripts/monitor_partner_dlq.js
   # Press 's' for detailed stats
   ```

2. Identify root cause:
   - **Validation errors**: Fix data source or add validation
   - **Transient errors**: Check infrastructure (network, DB, partner API)
   - **Processing errors**: Review logs, fix code bugs

3. Take action:
   - Fix root cause
   - Process DLQ: `node workers/partner_dlq_handler.js process`
   - Monitor results

### Memory Issues

1. Check worker memory:
   ```bash
   ps aux | grep partner_sync_worker
   ```

2. If memory usage is high:
   - Reduce `batchSize` in processor options
   - Increase `PROCESSING_TIMEOUT_MS`
   - Expose manual garbage collection and raise the heap limit: `node --expose-gc --max-old-space-size=2048`

### Circuit Breaker Blocking Files

1. Check blocked files:
   ```javascript
   // In partner_sync_worker.js, add logging:
   setInterval(() => {
     console.log('Blocked files:', Array.from(problematicFiles.keys()));
   }, 60000);
   ```

2. Reset circuit breaker:
   - Restart worker (circuit breaker is in-memory)
   - Or adjust thresholds for development

## Best Practices

### 1. Monitoring
- Set up alerts for when the DLQ message count exceeds 50
- Monitor DLQ age (messages older than 1 hour need attention)
- Track processing success rate

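
The two alert thresholds above could be checked with a small function like the one below. This is a sketch under assumptions: `checkDlqHealth` and the `oldestMessageAgeMs` input are hypothetical (the DLQ statistics shown earlier expose `messageCount` but no age field), and how alerts are delivered is left open.

```javascript
// Thresholds taken from the monitoring guidance: > 50 messages, > 1 hour old.
const ALERT_THRESHOLDS = { maxMessages: 50, maxAgeMs: 60 * 60 * 1000 };

// Hypothetical check: returns a list of human-readable alerts (empty when healthy).
// `oldestMessageAgeMs` is an assumed input that the caller must compute.
function checkDlqHealth({ messageCount, oldestMessageAgeMs }, thresholds = ALERT_THRESHOLDS) {
  const alerts = [];
  if (messageCount > thresholds.maxMessages) {
    alerts.push(`DLQ holds ${messageCount} messages (limit ${thresholds.maxMessages})`);
  }
  if (oldestMessageAgeMs > thresholds.maxAgeMs) {
    const minutes = Math.round(oldestMessageAgeMs / 60000);
    alerts.push(`Oldest DLQ message is ${minutes} minutes old`);
  }
  return alerts;
}
```
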
### 2. Retry Strategy
- Use exponential backoff for retries
- Categorize errors properly
- Don't retry validation errors

### 3. Circuit Breaker
- Use lenient settings in development
- Use strict settings in production
- Monitor blocked files regularly

### 4. Database Cleanup
- Archive failed tasks older than 30 days
- Keep archived DLQ tasks for 7 days for audit purposes
- Regularly check for stuck processing tasks

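
The retention windows above translate into cutoff dates that a cleanup job could use. The sketch below is hypothetical: `cleanupCutoffs` and `stuckTaskFilter` are illustrative names, and the 90-minute stuck threshold is borrowed from the processing timeout and the manual reset query earlier in this document.

```javascript
const MINUTE = 60 * 1000;
const DAY = 24 * 60 * MINUTE;

// Hypothetical helper: compute the cutoff dates implied by the retention rules.
function cleanupCutoffs(now = new Date()) {
  const t = now.getTime();
  return {
    stuckProcessingBefore: new Date(t - 90 * MINUTE), // tasks processing longer than 90 minutes
    dlqArchiveBefore: new Date(t - 7 * DAY),          // archived DLQ tasks kept for 7 days
    failedTaskBefore: new Date(t - 30 * DAY),         // failed tasks archived after 30 days
  };
}

// MongoDB-style filter matching the manual "stuck processing" reset query.
function stuckTaskFilter(now = new Date()) {
  return {
    status: 'processing',
    processingStartedAt: { $lt: cleanupCutoffs(now).stuckProcessingBefore },
  };
}
```
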
## API Reference

### PartnerDLQHandler

```javascript
const handler = new PartnerDLQHandler();

// Start handler
await handler.start();

// Process DLQ
await handler.processDLQ();

// Get statistics
const stats = await handler.getStatistics();

// Stop handler
await handler.stop();
```

### Error Categories

```javascript
const ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
};
```

## Future Enhancements

1. **Email/Slack Notifications**
   - Alert admins on critical failures
   - Daily DLQ summary reports

2. **Advanced Analytics**
   - Failure trend analysis
   - Automatic root cause detection
   - Performance metrics

3. **Automatic Recovery**
   - Smart retry scheduling
   - Self-healing for known issues
   - Predictive failure prevention

4. **Web Dashboard**
   - Real-time DLQ visualization
   - One-click retry/archive
   - Historical analysis

## Related Documentation

- [Partner Integration Architecture](./PARTNER_INTEGRATION_ARCHITECTURE.md)
- [SatLoc Implementation Summary](./SATLOC_IMPLEMENTATION_SUMMARY.md)
- [Worker Responsibilities Update](./WORKER_RESPONSIBILITIES_UPDATE.md)