# Partner DLQ (Dead Letter Queue) Handling System
## Overview
The Partner DLQ Handling System provides automatic and manual management of failed partner processing tasks. It categorizes failures, automatically retries transient errors, archives non-recoverable tasks, and provides monitoring tools for administrators.
## Architecture
### Components
1. **Partner Sync Worker** (`workers/partner_sync_worker.js`)
   - Primary task processor
   - Sends failed tasks to DLQ after max retries
   - Implements circuit breaker for problematic files
2. **DLQ Handler** (`workers/partner_dlq_handler.js`)
   - Monitors and processes DLQ messages
   - Categorizes errors and makes retry/archive decisions
   - Provides programmatic DLQ management
3. **DLQ Monitor** (`scripts/monitor_partner_dlq.js`)
   - Interactive dashboard for DLQ monitoring
   - Manual operations and statistics
### Message Flow
```mermaid
flowchart TD
    A[Polling Worker<br/>Enqueues Task] --> B[Partner Queue<br/>Main Queue]
    B -->|Processing| C[Sync Worker<br/>Processing]
    B -->|Max Retries<br/>Exceeded| D[Dead Letter Queue<br/>DLQ]
    D --> E[DLQ Handler<br/>Analysis]
    E --> F[Retry<br/>Queue]
    E --> G[Archive<br/>DB]
    E --> H[Manual<br/>Review]
```
## Error Categories
### 1. Transient Errors
- Network timeouts
- Temporary connection issues
- Database connection failures
- **Action**: Auto-retry within 2-hour window
### 2. Validation Errors
- Invalid file format
- Missing required fields
- Data validation failures
- **Action**: Archive immediately, notify admin
### 3. Processing Errors
- Calculation errors
- Parse errors
- Logic errors
- **Action**: Keep for manual review
### 4. Infrastructure Errors
- Database errors
- Filesystem errors
- Transaction failures
- **Action**: Retry with exponential backoff
### 5. Partner API Errors
- API authentication failures
- Rate limiting
- Partner service unavailable
- **Action**: Retry with longer delay
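
The category-to-action mapping above can be sketched as a small lookup. This is a minimal illustration, not the handler's actual code: `categorizeError`, `actionFor`, the message patterns, and the action names are hypothetical, and the action for `unknown` errors is an assumption because the list above does not define one. Only the category names come from `ERROR_CATEGORIES`.

```javascript
// Category names mirror ERROR_CATEGORIES in the API reference.
const ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
};

// Hypothetical message patterns, checked in order; the first match wins.
const CATEGORY_PATTERNS = [
  [ERROR_CATEGORIES.PARTNER_API, /rate limit|authentication|service unavailable/i],
  [ERROR_CATEGORIES.TRANSIENT, /timeout|timed out|ECONNRESET|ECONNREFUSED/i],
  [ERROR_CATEGORIES.VALIDATION, /invalid|missing required|validation/i],
  [ERROR_CATEGORIES.INFRASTRUCTURE, /database|filesystem|transaction/i],
  [ERROR_CATEGORIES.PROCESSING, /parse|calculation/i]
];

// Actions as listed per category above; the 'unknown' entry is an assumption.
const CATEGORY_ACTIONS = {
  [ERROR_CATEGORIES.TRANSIENT]: 'auto_retry',
  [ERROR_CATEGORIES.VALIDATION]: 'archive',
  [ERROR_CATEGORIES.PROCESSING]: 'manual_review',
  [ERROR_CATEGORIES.INFRASTRUCTURE]: 'retry_with_backoff',
  [ERROR_CATEGORIES.PARTNER_API]: 'retry_with_longer_delay',
  [ERROR_CATEGORIES.UNKNOWN]: 'manual_review'
};

// Map an error message to one of the categories.
function categorizeError(message) {
  const text = String(message || '');
  const match = CATEGORY_PATTERNS.find(([, pattern]) => pattern.test(text));
  return match ? match[0] : ERROR_CATEGORIES.UNKNOWN;
}

// Look up the action for an error message.
function actionFor(message) {
  return CATEGORY_ACTIONS[categorizeError(message)];
}
```

Pattern order matters in a scheme like this: a message such as "database connection timeout" matches the transient pattern before the infrastructure one, which is consistent with the lists above.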
## Configuration
### Environment Variables
```bash
# Queue Configuration
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agmuser
QUEUE_PWD=
QUEUE_NAME_PARTNER=partner_tasks # Base name, auto-prefixes 'dev_' when PRODUCTION=false
# Retry Configuration
PARTNER_MAX_RETRIES=5 # Max retries before DLQ
PARTNER_RETRY_DELAY=10000 # Base retry delay (ms)
# DLQ Configuration
DLQ_CHECK_INTERVAL=300000 # Check DLQ every 5 minutes
MAX_DLQ_AGE_MS=86400000 # Archive after 24 hours
AUTO_RETRY_WINDOW_MS=7200000 # Auto-retry within 2 hours
```
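As a rough sketch of how these variables might be consumed, the helpers below read them with the defaults shown above. The names `loadDlqConfig`, `resolveQueueName`, and `classifyByAge` are hypothetical. The exact `PRODUCTION` comparison and the way the two age windows interact are assumptions based only on the comments in the block above.

```javascript
// Parse an integer env value, falling back when it is unset or malformed.
function toInt(value, fallback) {
  const parsed = Number.parseInt(value, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Read retry and DLQ settings; defaults mirror the values documented above.
function loadDlqConfig(env = process.env) {
  return {
    maxRetries: toInt(env.PARTNER_MAX_RETRIES, 5),
    retryDelayMs: toInt(env.PARTNER_RETRY_DELAY, 10000),
    checkIntervalMs: toInt(env.DLQ_CHECK_INTERVAL, 300000),
    maxDlqAgeMs: toInt(env.MAX_DLQ_AGE_MS, 86400000),
    autoRetryWindowMs: toInt(env.AUTO_RETRY_WINDOW_MS, 7200000)
  };
}

// Assumption: the 'dev_' prefix applies only when PRODUCTION is the string 'false'.
function resolveQueueName(env = process.env) {
  const base = env.QUEUE_NAME_PARTNER || 'partner_tasks';
  return env.PRODUCTION === 'false' ? `dev_${base}` : base;
}

// Assumed reading of the age windows: young messages may be auto-retried,
// old ones are archived, and anything in between waits for a decision.
function classifyByAge(ageMs, config) {
  if (ageMs >= config.maxDlqAgeMs) return 'archive';
  if (ageMs <= config.autoRetryWindowMs) return 'auto_retry_eligible';
  return 'hold';
}
```

Passing `env` as a parameter keeps the helpers easy to exercise with a plain object instead of the real process environment.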
### Worker Constants
```javascript
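// Summary of the worker's constants. The isProd check and the millisecond unit
// of FILE_BLOCK_DURATION are assumptions; see workers/partner_sync_worker.js.
const isProd = process.env.PRODUCTION === 'true';
const MINUTE_MS = 60 * 1000;
// Circuit Breaker
const MAX_FILE_ATTEMPTS = isProd ? 3 : 10;                 // 10 (dev) / 3 (prod)
const FILE_BLOCK_DURATION = (isProd ? 60 : 5) * MINUTE_MS; // 5 min (dev) / 1 hour (prod)
// Timeouts
const PROCESSING_TIMEOUT_MS = 90 * MINUTE_MS; // 90 minutes
const TASK_TIMEOUT_MS = 90 * MINUTE_MS;       // 90 minutes
const CLEANUP_INTERVAL_MS = 15 * MINUTE_MS;   // 15 minutes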
```
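To illustrate how `MAX_FILE_ATTEMPTS` and `FILE_BLOCK_DURATION` could drive an in-memory circuit breaker, here is a minimal sketch. The `FileCircuitBreaker` class, its method names, the injectable clock, and the millisecond unit are all hypothetical; only the `problematicFiles` map name is borrowed from the troubleshooting section, and the worker's real logic may differ.

```javascript
class FileCircuitBreaker {
  constructor({ maxAttempts, blockDurationMs, now = Date.now }) {
    this.maxAttempts = maxAttempts;
    this.blockDurationMs = blockDurationMs;
    this.now = now; // injectable clock, useful for testing
    this.problematicFiles = new Map(); // fileId -> { attempts, blockedUntil }
  }

  // Count a failure and block the file once the attempt limit is reached.
  recordFailure(fileId) {
    const entry = this.problematicFiles.get(fileId) || { attempts: 0, blockedUntil: 0 };
    entry.attempts += 1;
    if (entry.attempts >= this.maxAttempts) {
      entry.blockedUntil = this.now() + this.blockDurationMs;
    }
    this.problematicFiles.set(fileId, entry);
  }

  // A file is blocked until its block expires; expiry clears its history.
  isBlocked(fileId) {
    const entry = this.problematicFiles.get(fileId);
    if (!entry || entry.blockedUntil === 0) return false;
    if (this.now() >= entry.blockedUntil) {
      this.problematicFiles.delete(fileId);
      return false;
    }
    return true;
  }
}

// Example with the production values listed above (3 attempts, 1 hour block).
const breaker = new FileCircuitBreaker({ maxAttempts: 3, blockDurationMs: 60 * 60 * 1000 });
```

Because the state lives in a `Map` inside the process, restarting the process clears every block, which matches the reset advice in the troubleshooting section.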
## Usage
### 1. Start DLQ Handler (Automatic Mode)
```bash
# Run as background service
node workers/partner_dlq_handler.js monitor &
# Or with PM2
pm2 start workers/partner_dlq_handler.js --name partner-dlq-handler -- monitor
```
### 2. Manual DLQ Operations
```bash
# Show DLQ statistics
node workers/partner_dlq_handler.js stats
# Process DLQ messages once
node workers/partner_dlq_handler.js process
```
### 3. Interactive Dashboard
```bash
# Launch monitoring dashboard
node scripts/monitor_partner_dlq.js
```
Dashboard commands:
- `r` - Refresh dashboard
- `p` - Process DLQ now
- `s` - Show detailed statistics
- `c` - Clear archived tasks (> 7 days old)
- `q` - Quit
## Monitoring
### DLQ Statistics
```javascript
{
  messageCount: 5,   // Messages in DLQ
  consumerCount: 0,  // Active consumers
  queueName: 'partner_tasks_failed'
}
```
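One plausible way to obtain these numbers, assuming the handler talks to RabbitMQ through an amqplib channel, is `channel.checkQueue`, which resolves to an object with `queue`, `messageCount`, and `consumerCount`. The `getDlqStats` helper below is a hypothetical sketch, not the handler's `getStatistics` implementation.

```javascript
// Read queue depth and consumer count for the DLQ from an amqplib-style channel.
async function getDlqStats(channel, queueName = 'partner_tasks_failed') {
  const info = await channel.checkQueue(queueName);
  return {
    messageCount: info.messageCount,
    consumerCount: info.consumerCount,
    queueName: info.queue
  };
}
```

Note that `checkQueue` fails the channel if the queue does not exist, so a caller would typically use a dedicated channel for this kind of probe.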
### Tracker Statistics
```javascript
{
  failed: 12,      // Failed tasks
  processing: 3,   // Currently processing
  downloaded: 8,   // Downloaded, waiting
  processed: 245,  // Successfully processed
  archived: 7      // Archived from DLQ
}
```
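A processing success rate can be derived from these counters. The sketch below assumes that `failed` and `archived` are disjoint sets of finished tasks, which this document does not confirm; `successRate` is a hypothetical helper name.

```javascript
// Share of finished tasks that were processed successfully.
// Returns null when no task has finished yet.
function successRate(stats) {
  const finished = stats.processed + stats.failed + stats.archived;
  return finished === 0 ? null : stats.processed / finished;
}

// With the sample values above: 245 / (245 + 12 + 7), roughly 0.93.
const rate = successRate({ failed: 12, processing: 3, downloaded: 8, processed: 245, archived: 7 });
```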
## Queue-Native Operations
### Retry Operations (Recommended - Direct RabbitMQ)
```bash
# Retry all messages in queue
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'
# Retry by position range (0-based index)
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'
# Retry by header match
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'
```
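The same calls can be made from a Node.js script. The sketch below mirrors the curl examples and relies on the global `fetch` available in Node.js 18 and later. The helper names `buildRetryRequest` and `retryDlq` are hypothetical, and because the response format of these endpoints is not documented here, the raw response is returned unparsed.

```javascript
// Build the URL and fetch() options for one of the DLQ retry endpoints.
function buildRetryRequest({ baseUrl = 'http://localhost:3000', queue, operation, body, token }) {
  return {
    url: `${baseUrl}/api/dlq/${encodeURIComponent(queue)}/${operation}`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(body)
    }
  };
}

// Send the request and fail loudly on a non-2xx status.
async function retryDlq(params) {
  const { url, options } = buildRetryRequest(params);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`DLQ retry failed with HTTP ${response.status}`);
  }
  return response;
}

// Usage (equivalent to the retryByPosition curl call):
// await retryDlq({
//   queue: 'partner_tasks',
//   operation: 'retryByPosition',
//   body: { startPosition: 0, endPosition: 10 },
//   token: process.env.TOKEN
// });
```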
**Benefits**:
- No MongoDB coupling
- Preserves original message content
- Supports multiple queue types
- Direct RabbitMQ operations
### Legacy Manual Recovery (Programmatic)
For advanced debugging scenarios only:
```javascript
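const handler = new PartnerDLQHandler();
await handler.start();
// Get one message from the DLQ (with an amqplib channel, get() resolves to false when the queue is empty)
const msg = await handler.channel.get('partner_tasks_failed');
if (msg) {
  // Retry it programmatically; msg.content is a Buffer, so decode it before parsing
  await handler.retryMessage(msg, JSON.parse(msg.content.toString()));
}
await handler.stop();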
```
### Clear Stuck Tasks
```bash
# Reset tasks stuck in 'processing' for more than 90 minutes
# (the legacy mongo shell was removed in MongoDB 6.0; use mongosh there)
mongo mongodb://localhost:27017/agmission << EOF
use agmission
db.partnerlogtrackers.updateMany(
  {
    status: 'processing',
    processingStartedAt: { \$lt: new Date(Date.now() - 90*60*1000) }
  },
  {
    \$set: {
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF
```
### Purge DLQ
```bash
# WARNING: This deletes all DLQ messages
rabbitmqadmin purge queue name=partner_tasks_failed
```
## Troubleshooting
### High DLQ Message Count
1. Check error patterns:
   ```bash
   node scripts/monitor_partner_dlq.js
   # Press 's' for detailed stats
   ```
2. Identify root cause:
   - **Validation errors**: Fix data source or add validation
   - **Transient errors**: Check infrastructure (network, DB, partner API)
   - **Processing errors**: Review logs, fix code bugs
3. Take action:
   - Fix root cause
   - Process DLQ: `node workers/partner_dlq_handler.js process`
   - Monitor results
### Memory Issues
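1. Check worker memory:
   ```bash
   ps aux | grep partner_sync_worker
   ```
2. If memory usage is high:
   - Reduce `batchSize` in processor options
   - Increase `PROCESSING_TIMEOUT_MS` if smaller batches lengthen processing time
   - Raise the heap limit: `node --max-old-space-size=2048` (2 GB). Adding `--expose-gc` only exposes `global.gc()` for manual collection; it does not change automatic garbage collection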
### Circuit Breaker Blocking Files
1. Check blocked files:
   ```javascript
   // In partner_sync_worker.js, add logging:
   setInterval(() => {
     console.log('Blocked files:', Array.from(problematicFiles.keys()));
   }, 60000);
   ```
2. Reset circuit breaker:
   - Restart worker (circuit breaker is in-memory)
   - Or adjust thresholds for development
## Best Practices
### 1. Monitoring
- Set up alerts for high DLQ message count (> 50)
- Monitor DLQ age (messages older than 1 hour need attention)
- Track processing success rate
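
The count and age thresholds above can be checked with a small function. This is a sketch under stated assumptions: `dlqAlerts` and the `oldestMessageAgeMs` input are hypothetical, and how the age of the oldest message is obtained is left to the caller.

```javascript
// Thresholds taken from the monitoring guidance above.
const ALERT_THRESHOLDS = {
  maxMessageCount: 50,
  maxMessageAgeMs: 60 * 60 * 1000 // 1 hour
};

// Return a list of human-readable alerts; an empty list means the DLQ looks healthy.
function dlqAlerts({ messageCount, oldestMessageAgeMs }, thresholds = ALERT_THRESHOLDS) {
  const alerts = [];
  if (messageCount > thresholds.maxMessageCount) {
    alerts.push(`DLQ holds ${messageCount} messages (threshold ${thresholds.maxMessageCount})`);
  }
  if (oldestMessageAgeMs > thresholds.maxMessageAgeMs) {
    const minutes = Math.round(oldestMessageAgeMs / 60000);
    alerts.push(`Oldest DLQ message is ${minutes} minutes old`);
  }
  return alerts;
}
```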
### 2. Retry Strategy
- Use exponential backoff for retries
- Categorize errors properly
- Don't retry validation errors
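
Exponential backoff doubles the wait after each failed attempt. The sketch below starts from the `PARTNER_RETRY_DELAY` default of 10000 ms; the function name `retryDelayMs` and the one-hour cap are assumptions added for illustration.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs.
function retryDelayMs(attempt, baseDelayMs = 10000, maxDelayMs = 60 * 60 * 1000) {
  const delay = baseDelayMs * 2 ** attempt;
  return Math.min(delay, maxDelayMs);
}

// Delays for PARTNER_MAX_RETRIES=5: 10 s, 20 s, 40 s, 80 s, 160 s.
const delays = [0, 1, 2, 3, 4].map((attempt) => retryDelayMs(attempt));
```

Adding a small random jitter to each delay is a common refinement that keeps many failed tasks from retrying at the same instant.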
### 3. Circuit Breaker
- Use lenient settings in development
- Use strict settings in production
- Monitor blocked files regularly
### 4. Database Cleanup
- Archive old failed tasks (> 30 days)
- Keep DLQ archived tasks for audit (7 days)
- Regularly check for stuck processing tasks
## API Reference
### PartnerDLQHandler
```javascript
const handler = new PartnerDLQHandler();
// Start handler
await handler.start();
// Process DLQ
await handler.processDLQ();
// Get statistics
const stats = await handler.getStatistics();
// Stop handler
await handler.stop();
```
### Error Categories
```javascript
const ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
};
```
## Future Enhancements
1. **Email/Slack Notifications**
   - Alert admins on critical failures
   - Daily DLQ summary reports
2. **Advanced Analytics**
   - Failure trend analysis
   - Automatic root cause detection
   - Performance metrics
3. **Automatic Recovery**
   - Smart retry scheduling
   - Self-healing for known issues
   - Predictive failure prevention
4. **Web Dashboard**
   - Real-time DLQ visualization
   - One-click retry/archive
   - Historical analysis
## Related Documentation
- [Partner Integration Architecture](./PARTNER_INTEGRATION_ARCHITECTURE.md)
- [SatLoc Implementation Summary](./SATLOC_IMPLEMENTATION_SUMMARY.md)
- [Worker Responsibilities Update](./WORKER_RESPONSIBILITIES_UPDATE.md)