# Partner DLQ (Dead Letter Queue) Handling System

## Overview

The Partner DLQ Handling System provides automatic and manual management of failed partner processing tasks. It categorizes failures, automatically retries transient errors, archives non-recoverable tasks, and provides monitoring tools for administrators.
## Architecture

### Components

1. **Partner Sync Worker** (`workers/partner_sync_worker.js`)
   - Primary task processor
   - Sends failed tasks to DLQ after max retries
   - Implements a circuit breaker for problematic files

2. **DLQ Handler** (`workers/partner_dlq_handler.js`)
   - Monitors and processes DLQ messages
   - Categorizes errors and makes retry/archive decisions
   - Provides programmatic DLQ management

3. **DLQ Monitor** (`scripts/monitor_partner_dlq.js`)
   - Interactive dashboard for DLQ monitoring
   - Manual operations and statistics
### Message Flow

```mermaid
flowchart TD
    A[Polling Worker<br/>Enqueues Task] --> B[Partner Queue<br/>Main Queue]
    B -->|Processing| C[Sync Worker<br/>Processing]
    B -->|Max Retries<br/>Exceeded| D[Dead Letter Queue<br/>DLQ]
    D --> E[DLQ Handler<br/>Analysis]
    E --> F[Retry<br/>Queue]
    E --> G[Archive<br/>DB]
    E --> H[Manual<br/>Review]
```
## Error Categories

### 1. Transient Errors
- Network timeouts
- Temporary connection issues
- Database connection failures
- **Action:** Auto-retry within 2-hour window

### 2. Validation Errors
- Invalid file format
- Missing required fields
- Data validation failures
- **Action:** Archive immediately, notify admin

### 3. Processing Errors
- Calculation errors
- Parse errors
- Logic errors
- **Action:** Keep for manual review

### 4. Infrastructure Errors
- Database errors
- Filesystem errors
- Transaction failures
- **Action:** Retry with exponential backoff

### 5. Partner API Errors
- API authentication failures
- Rate limiting
- Partner service unavailable
- **Action:** Retry with longer delay
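A minimal sketch of how failures could be mapped to these categories. The `categorizeError` helper and the regex patterns below are illustrative assumptions, not the handler's actual matching rules:

```javascript
// Illustrative sketch only: the real DLQ handler's matching rules may differ.
const ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
};

// Hypothetical pattern table: each regex is tested against the error message.
const CATEGORY_PATTERNS = [
  [/timeout|ETIMEDOUT|ECONNRESET/i, ERROR_CATEGORIES.TRANSIENT],
  [/invalid file|missing required|validation/i, ERROR_CATEGORIES.VALIDATION],
  [/parse|calculation/i, ERROR_CATEGORIES.PROCESSING],
  [/transaction|filesystem|ENOSPC/i, ERROR_CATEGORIES.INFRASTRUCTURE],
  [/rate limit|service unavailable|authentication/i, ERROR_CATEGORIES.PARTNER_API]
];

// Return the first matching category, or UNKNOWN if nothing matches.
function categorizeError(err) {
  const message = err && err.message ? err.message : String(err);
  for (const [pattern, category] of CATEGORY_PATTERNS) {
    if (pattern.test(message)) return category;
  }
  return ERROR_CATEGORIES.UNKNOWN;
}
```

Pattern order matters: a message matching several patterns gets the first category, so the most specific rules should come earliest.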
## Configuration

### Environment Variables

```bash
# Queue Configuration
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agmuser
QUEUE_PWD=<password>
QUEUE_NAME_PARTNER=partner_tasks   # Base name; auto-prefixed with 'dev_' when PRODUCTION=false

# Retry Configuration
PARTNER_MAX_RETRIES=5              # Max retries before DLQ
PARTNER_RETRY_DELAY=10000          # Base retry delay (ms)

# DLQ Configuration
DLQ_CHECK_INTERVAL=300000          # Check DLQ every 5 minutes
MAX_DLQ_AGE_MS=86400000            # Archive after 24 hours
AUTO_RETRY_WINDOW_MS=7200000       # Auto-retry within 2 hours
```
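The two age thresholds drive the retry/archive decision. A minimal sketch of that logic, assuming each DLQ message carries a `failedAt` timestamp (the `decideByAge` helper is illustrative, not the handler's actual code):

```javascript
// Thresholds mirror the environment defaults above (in ms).
const AUTO_RETRY_WINDOW_MS = 7200000;  // 2 hours
const MAX_DLQ_AGE_MS = 86400000;       // 24 hours

// Hypothetical helper: decide what to do with a DLQ message based on its age.
function decideByAge(failedAt, now = Date.now()) {
  const age = now - failedAt;
  if (age <= AUTO_RETRY_WINDOW_MS) return 'retry';   // fresh enough to auto-retry
  if (age >= MAX_DLQ_AGE_MS) return 'archive';       // too old: archive to DB
  return 'manual-review';                            // in between: hold for an operator
}
```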
### Worker Constants

```js
// Circuit Breaker
MAX_FILE_ATTEMPTS = 10 (dev) / 3 (prod)
FILE_BLOCK_DURATION = 5 min (dev) / 1 hour (prod)

// Timeouts
PROCESSING_TIMEOUT_MS = 90 minutes
TASK_TIMEOUT_MS = 90 minutes
CLEANUP_INTERVAL_MS = 15 minutes
```
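The circuit breaker lives in worker memory. A minimal sketch of the idea behind those constants, tracking per-file attempt counts in a `Map` (the class and field names are illustrative assumptions, not the worker's actual code):

```javascript
// Illustrative in-memory circuit breaker keyed by file name.
class FileCircuitBreaker {
  constructor(maxAttempts = 3, blockDurationMs = 60 * 60 * 1000) {
    this.maxAttempts = maxAttempts;         // e.g. 3 in prod, 10 in dev
    this.blockDurationMs = blockDurationMs; // e.g. 1 hour in prod, 5 min in dev
    this.problematicFiles = new Map();      // file -> { attempts, blockedUntil }
  }

  // Count a failure; trip the breaker once the attempt limit is reached.
  recordFailure(file, now = Date.now()) {
    const entry = this.problematicFiles.get(file) || { attempts: 0, blockedUntil: 0 };
    entry.attempts += 1;
    if (entry.attempts >= this.maxAttempts) {
      entry.blockedUntil = now + this.blockDurationMs;
    }
    this.problematicFiles.set(file, entry);
  }

  // A file is blocked while its block window has not yet expired.
  isBlocked(file, now = Date.now()) {
    const entry = this.problematicFiles.get(file);
    return !!entry && entry.blockedUntil > now;
  }
}
```

Because the state is a plain in-process `Map`, restarting the worker resets all blocks, which matches the reset procedure described under Troubleshooting.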
## Usage

### 1. Start DLQ Handler (Automatic Mode)

```bash
# Run as background service
node workers/partner_dlq_handler.js monitor &

# Or with PM2
pm2 start workers/partner_dlq_handler.js --name partner-dlq-handler -- monitor
```

### 2. Manual DLQ Operations

```bash
# Show DLQ statistics
node workers/partner_dlq_handler.js stats

# Process DLQ messages once
node workers/partner_dlq_handler.js process
```

### 3. Interactive Dashboard

```bash
# Launch monitoring dashboard
node scripts/monitor_partner_dlq.js
```
Dashboard commands:
- `r` - Refresh dashboard
- `p` - Process DLQ now
- `s` - Show detailed statistics
- `c` - Clear archived tasks (> 7 days old)
- `q` - Quit
## Monitoring

### DLQ Statistics

```js
{
  messageCount: 5,    // Messages in DLQ
  consumerCount: 0,   // Active consumers
  queueName: 'partner_tasks_failed'
}
```

### Tracker Statistics

```js
{
  failed: 12,       // Failed tasks
  processing: 3,    // Currently processing
  downloaded: 8,    // Downloaded, waiting
  processed: 245,   // Successfully processed
  archived: 7       // Archived from DLQ
}
```
## Queue-Native Operations

### Retry Operations (Recommended - Direct RabbitMQ)

```bash
# Retry all messages in queue
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'

# Retry by position range (0-based index)
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'

# Retry by header match
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'
```

Benefits:
- No MongoDB coupling
- Preserves original message content
- Supports multiple queue types
- Direct RabbitMQ operations
### Legacy Manual Recovery (Programmatic)

For advanced debugging scenarios only:

```js
const handler = new PartnerDLQHandler();
await handler.start();

// Get a message from the DLQ (channel.get returns false when the queue is empty)
const msg = await handler.channel.get('partner_tasks_failed');
if (msg) {
  // Retry it programmatically
  await handler.retryMessage(msg, JSON.parse(msg.content));
}
```
### Clear Stuck Tasks

```bash
# Reset stuck processing tasks (older than the 90-minute processing timeout)
mongo mongodb://localhost:27017/agmission << EOF
use agmission
db.partnerlogtrackers.updateMany(
  {
    status: 'processing',
    processingStartedAt: { \$lt: new Date(Date.now() - 90*60*1000) }
  },
  {
    \$set: {
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF
```
### Purge DLQ

```bash
# WARNING: This deletes all DLQ messages
rabbitmqadmin purge queue name=partner_tasks_failed
```
## Troubleshooting

### High DLQ Message Count

1. Check error patterns:

   ```bash
   node scripts/monitor_partner_dlq.js   # Press 's' for detailed stats
   ```

2. Identify root cause:
   - Validation errors: Fix data source or add validation
   - Transient errors: Check infrastructure (network, DB, partner API)
   - Processing errors: Review logs, fix code bugs

3. Take action:
   - Fix root cause
   - Process DLQ: `node workers/partner_dlq_handler.js process`
   - Monitor results
### Memory Issues

1. Check worker memory:

   ```bash
   ps aux | grep partner_sync_worker
   ```

2. If memory usage is high:
   - Reduce `batchSize` in processor options
   - Increase `PROCESSING_TIMEOUT_MS`
   - Enable garbage collection: `node --expose-gc --max-old-space-size=2048`
### Circuit Breaker Blocking Files

1. Check blocked files:

   ```js
   // In partner_sync_worker.js, add logging:
   setInterval(() => {
     console.log('Blocked files:', Array.from(problematicFiles.keys()));
   }, 60000);
   ```

2. Reset circuit breaker:
   - Restart worker (circuit breaker is in-memory)
   - Or adjust thresholds for development
## Best Practices

### 1. Monitoring
- Set up alerts for high DLQ message count (> 50)
- Monitor DLQ age (messages > 1 hour need attention)
- Track processing success rate

### 2. Retry Strategy
- Use exponential backoff for retries
- Categorize errors properly
- Don't retry validation errors

### 3. Circuit Breaker
- Use lenient settings in development
- Use strict settings in production
- Monitor blocked files regularly

### 4. Database Cleanup
- Archive old failed tasks (> 30 days)
- Keep DLQ archived tasks for audit (7 days)
- Regularly check for stuck processing tasks
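The exponential backoff recommended above can be sketched as follows, using `PARTNER_RETRY_DELAY` (10000 ms) as the base delay; the 10-minute cap is an illustrative choice, not a documented setting:

```javascript
// Base delay from PARTNER_RETRY_DELAY (ms); the cap keeps waits bounded.
const BASE_DELAY_MS = 10000;
const MAX_DELAY_MS = 10 * 60 * 1000; // 10-minute cap (illustrative)

// Delay doubles per attempt: 10s, 20s, 40s, ... up to the cap.
function backoffDelay(attempt) {
  const delay = BASE_DELAY_MS * 2 ** (attempt - 1);
  return Math.min(delay, MAX_DELAY_MS);
}
```

In practice a small random jitter is often added to each delay so that many failed tasks do not retry in lockstep.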
## API Reference

### PartnerDLQHandler

```js
const handler = new PartnerDLQHandler();

// Start handler
await handler.start();

// Process DLQ
await handler.processDLQ();

// Get statistics
const stats = await handler.getStatistics();

// Stop handler
await handler.stop();
```

### Error Categories

```js
ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
}
```
## Future Enhancements

1. **Email/Slack Notifications**
   - Alert admins on critical failures
   - Daily DLQ summary reports

2. **Advanced Analytics**
   - Failure trend analysis
   - Automatic root cause detection
   - Performance metrics

3. **Automatic Recovery**
   - Smart retry scheduling
   - Self-healing for known issues
   - Predictive failure prevention

4. **Web Dashboard**
   - Real-time DLQ visualization
   - One-click retry/archive
   - Historical analysis