# Partner DLQ (Dead Letter Queue) Handling System

## Overview

The Partner DLQ Handling System provides automatic and manual management of failed partner processing tasks. It categorizes failures, automatically retries transient errors, archives non-recoverable tasks, and provides monitoring tools for administrators.

## Architecture

### Components

1. **Partner Sync Worker** (`workers/partner_sync_worker.js`)
   - Primary task processor
   - Sends failed tasks to DLQ after max retries
   - Implements circuit breaker for problematic files

2. **DLQ Handler** (`workers/partner_dlq_handler.js`)
   - Monitors and processes DLQ messages
   - Categorizes errors and makes retry/archive decisions
   - Provides programmatic DLQ management

3. **DLQ Monitor** (`scripts/monitor_partner_dlq.js`)
   - Interactive dashboard for DLQ monitoring
   - Manual operations and statistics

### Message Flow

```mermaid
flowchart TD
    A[Polling Worker<br/>Enqueues Task] --> B[Partner Queue<br/>Main Queue]
    B -->|Processing| C[Sync Worker<br/>Processing]
    B -->|Max Retries<br/>Exceeded| D[Dead Letter Queue<br/>DLQ]
    D --> E[DLQ Handler<br/>Analysis]
    E --> F[Retry<br/>Queue]
    E --> G[Archive<br/>DB]
    E --> H[Manual<br/>Review]
```

## Error Categories

### 1. Transient Errors

- Network timeouts
- Temporary connection issues
- Database connection failures
- **Action**: Auto-retry within 2-hour window

### 2. Validation Errors

- Invalid file format
- Missing required fields
- Data validation failures
- **Action**: Archive immediately, notify admin

### 3. Processing Errors

- Calculation errors
- Parse errors
- Logic errors
- **Action**: Keep for manual review

### 4. Infrastructure Errors

- Database errors
- Filesystem errors
- Transaction failures
- **Action**: Retry with exponential backoff

### 5. Partner API Errors

- API authentication failures
- Rate limiting
- Partner service unavailable
- **Action**: Retry with longer delay

## Configuration

### Environment Variables

```bash
# Queue Configuration
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agmuser
QUEUE_PWD=
QUEUE_NAME_PARTNER=partner_tasks  # Base name, auto-prefixes 'dev_' when PRODUCTION=false

# Retry Configuration
PARTNER_MAX_RETRIES=5             # Max retries before DLQ
PARTNER_RETRY_DELAY=10000         # Base retry delay (ms)

# DLQ Configuration
DLQ_CHECK_INTERVAL=300000         # Check DLQ every 5 minutes
MAX_DLQ_AGE_MS=86400000           # Archive after 24 hours
AUTO_RETRY_WINDOW_MS=7200000      # Auto-retry within 2 hours
```

### Worker Constants

```javascript
// Circuit Breaker
MAX_FILE_ATTEMPTS = 10 (dev) / 3 (prod)
FILE_BLOCK_DURATION = 5 min (dev) / 1 hour (prod)

// Timeouts
PROCESSING_TIMEOUT_MS = 90 minutes
TASK_TIMEOUT_MS = 90 minutes
CLEANUP_INTERVAL_MS = 15 minutes
```

## Usage

### 1. Start DLQ Handler (Automatic Mode)

```bash
# Run as background service
node workers/partner_dlq_handler.js monitor &

# Or with PM2
pm2 start workers/partner_dlq_handler.js --name partner-dlq-handler -- monitor
```

### 2. Manual DLQ Operations

```bash
# Show DLQ statistics
node workers/partner_dlq_handler.js stats

# Process DLQ messages once
node workers/partner_dlq_handler.js process
```

### 3. Interactive Dashboard

```bash
# Launch monitoring dashboard
node scripts/monitor_partner_dlq.js
```

Dashboard commands:

- `r` - Refresh dashboard
- `p` - Process DLQ now
- `s` - Show detailed statistics
- `c` - Clear archived tasks (> 7 days old)
- `q` - Quit

## Monitoring

### DLQ Statistics

```javascript
{
  messageCount: 5,      // Messages in DLQ
  consumerCount: 0,     // Active consumers
  queueName: 'partner_tasks_failed'
}
```

### Tracker Statistics

```javascript
{
  failed: 12,           // Failed tasks
  processing: 3,        // Currently processing
  downloaded: 8,        // Downloaded, waiting
  processed: 245,       // Successfully processed
  archived: 7           // Archived from DLQ
}
```

## Queue-Native Operations

### Retry Operations (Recommended - Direct RabbitMQ)

```bash
# Retry all messages in queue
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'

# Retry by position range (0-based index)
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'

# Retry by header match
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'
```

**Benefits**:

- No MongoDB coupling
- Preserves original message content
- Supports multiple queue types
- Direct RabbitMQ operations

### Legacy Manual Recovery (Programmatic)

For advanced debugging scenarios only:

```javascript
const handler = new PartnerDLQHandler();
await handler.start();

// Get message from DLQ
const msg = await handler.channel.get('partner_tasks_failed');

// Retry it programmatically
await handler.retryMessage(msg, JSON.parse(msg.content));
```

### Clear Stuck Tasks

```bash
# Reset stuck processing tasks
mongo mongodb://localhost:27017/agmission << EOF
use agmission
db.partnerlogtrackers.updateMany(
  {
    status: 'processing',
    processingStartedAt: { \$lt: new Date(Date.now() - 90*60*1000) }
  },
  {
    \$set: {
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF
```

### Purge DLQ

```bash
# WARNING: This deletes all DLQ messages
rabbitmqadmin purge queue name=partner_tasks_failed
```

## Troubleshooting

### High DLQ Message Count

1. Check error patterns:
   ```bash
   node scripts/monitor_partner_dlq.js
   # Press 's' for detailed stats
   ```

2. Identify root cause:
   - **Validation errors**: Fix data source or add validation
   - **Transient errors**: Check infrastructure (network, DB, partner API)
   - **Processing errors**: Review logs, fix code bugs

3. Take action:
   - Fix root cause
   - Process DLQ: `node workers/partner_dlq_handler.js process`
   - Monitor results

### Memory Issues

1. Check worker memory:
   ```bash
   ps aux | grep partner_sync_worker
   ```

2. If high memory usage:
   - Reduce `batchSize` in processor options
   - Increase `PROCESSING_TIMEOUT_MS`
   - Enable garbage collection: `node --expose-gc --max-old-space-size=2048`

### Circuit Breaker Blocking Files

1. Check blocked files:
   ```javascript
   // In partner_sync_worker.js, add logging:
   setInterval(() => {
     console.log('Blocked files:', Array.from(problematicFiles.keys()));
   }, 60000);
   ```

2. Reset circuit breaker:
   - Restart worker (circuit breaker is in-memory)
   - Or adjust thresholds for development

## Best Practices

### 1. Monitoring

- Set up alerts for high DLQ message count (> 50)
- Monitor DLQ age (messages > 1 hour need attention)
- Track processing success rate

### 2. Retry Strategy

- Use exponential backoff for retries
- Categorize errors properly
- Don't retry validation errors

### 3. Circuit Breaker

- Use lenient settings in development
- Use strict settings in production
- Monitor blocked files regularly

### 4. Database Cleanup

- Archive old failed tasks (> 30 days)
- Keep DLQ archived tasks for audit (7 days)
- Regularly check for stuck processing tasks

## API Reference

### PartnerDLQHandler

```javascript
const handler = new PartnerDLQHandler();

// Start handler
await handler.start();

// Process DLQ
await handler.processDLQ();

// Get statistics
const stats = await handler.getStatistics();

// Stop handler
await handler.stop();
```

### Error Categories

```javascript
ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
}
```

## Future Enhancements

1. **Email/Slack Notifications**
   - Alert admins on critical failures
   - Daily DLQ summary reports

2. **Advanced Analytics**
   - Failure trend analysis
   - Automatic root cause detection
   - Performance metrics

3. **Automatic Recovery**
   - Smart retry scheduling
   - Self-healing for known issues
   - Predictive failure prevention

4. **Web Dashboard**
   - Real-time DLQ visualization
   - One-click retry/archive
   - Historical analysis

## Related Documentation

- [Partner Integration Architecture](./PARTNER_INTEGRATION_ARCHITECTURE.md)
- [SatLoc Implementation Summary](./SATLOC_IMPLEMENTATION_SUMMARY.md)
- [Worker Responsibilities Update](./WORKER_RESPONSIBILITIES_UPDATE.md)
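
## Appendix: Retry Decision Sketch

The error categories and their actions, together with the `AUTO_RETRY_WINDOW_MS`, `MAX_DLQ_AGE_MS`, and `PARTNER_RETRY_DELAY` settings, suggest a decision routine along the following lines. This is a minimal sketch, not the code in `workers/partner_dlq_handler.js`: the function names (`categorizeError`, `backoffDelay`, `decideAction`), the keyword patterns, the one-hour backoff cap, the 6x multiplier for partner API errors, and the choice to send expired transient errors to manual review are all assumptions made for illustration.

```javascript
// Hypothetical sketch; names and thresholds below are illustrative assumptions.

// Values copied from the documented defaults in "Environment Variables".
const AUTO_RETRY_WINDOW_MS = 7200000; // 2 hours
const MAX_DLQ_AGE_MS = 86400000;      // 24 hours
const PARTNER_RETRY_DELAY = 10000;    // base retry delay (ms)

const ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
};

// Assumed keyword patterns. Order matters: the first match wins, so
// "Database connection failed" is treated as transient, not infrastructure.
const CATEGORY_PATTERNS = [
  [ERROR_CATEGORIES.VALIDATION, /invalid|missing required|validation/i],
  [ERROR_CATEGORIES.PARTNER_API, /rate limit|authentication|service unavailable/i],
  [ERROR_CATEGORIES.TRANSIENT, /timeout|timed out|connection/i],
  [ERROR_CATEGORIES.INFRASTRUCTURE, /database|filesystem|transaction/i],
  [ERROR_CATEGORIES.PROCESSING, /parse|calculation/i]
];

function categorizeError(errorMessage) {
  const match = CATEGORY_PATTERNS.find(([, pattern]) => pattern.test(errorMessage));
  return match ? match[0] : ERROR_CATEGORIES.UNKNOWN;
}

// Exponential backoff: base * 2^retryCount, capped (cap is an assumption).
function backoffDelay(retryCount, baseMs = PARTNER_RETRY_DELAY, capMs = 3600000) {
  return Math.min(capMs, baseMs * 2 ** retryCount);
}

function decideAction(category, ageMs, retryCount = 0) {
  // Anything older than MAX_DLQ_AGE_MS is archived regardless of category.
  if (ageMs > MAX_DLQ_AGE_MS) return { action: 'archive' };

  switch (category) {
    case ERROR_CATEGORIES.VALIDATION:
      return { action: 'archive', notifyAdmin: true };
    case ERROR_CATEGORIES.TRANSIENT:
      return ageMs <= AUTO_RETRY_WINDOW_MS
        ? { action: 'retry', delayMs: PARTNER_RETRY_DELAY }
        : { action: 'manual_review' }; // assumption: outside the window, escalate
    case ERROR_CATEGORIES.INFRASTRUCTURE:
      return { action: 'retry', delayMs: backoffDelay(retryCount) };
    case ERROR_CATEGORIES.PARTNER_API:
      // "Longer delay" is not quantified in this document; 6x is a placeholder.
      return { action: 'retry', delayMs: PARTNER_RETRY_DELAY * 6 };
    default:
      // Processing and unknown errors are kept for manual review.
      return { action: 'manual_review' };
  }
}

// Example usage: a 10-minute-old timeout is retried after the base delay.
const category = categorizeError('Network timeout while fetching file');
console.log(category, decideAction(category, 10 * 60 * 1000));
```

Keeping the categorization and the decision as separate pure functions makes each one easy to unit test without a RabbitMQ connection, and lets the pattern table be tuned as new failure messages appear.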