Partner DLQ (Dead Letter Queue) Handling System

Overview

The Partner DLQ Handling System provides automatic and manual management of failed partner processing tasks. It categorizes failures, automatically retries transient errors, archives non-recoverable tasks, and provides monitoring tools for administrators.

Architecture

Components

  1. Partner Sync Worker (workers/partner_sync_worker.js)

    • Primary task processor
    • Sends failed tasks to DLQ after max retries
    • Implements circuit breaker for problematic files
  2. DLQ Handler (workers/partner_dlq_handler.js)

    • Monitors and processes DLQ messages
    • Categorizes errors and makes retry/archive decisions
    • Provides programmatic DLQ management
  3. DLQ Monitor (scripts/monitor_partner_dlq.js)

    • Interactive dashboard for DLQ monitoring
    • Manual operations and statistics

Message Flow

flowchart TD
    A[Polling Worker<br/>Enqueues Task] --> B[Partner Queue<br/>Main Queue]
    B -->|Processing| C[Sync Worker<br/>Processing]
    B -->|Max Retries<br/>Exceeded| D[Dead Letter Queue<br/>DLQ]
    D --> E[DLQ Handler<br/>Analysis]
    E --> F[Retry<br/>Queue]
    E --> G[Archive<br/>DB]
    E --> H[Manual<br/>Review]

Error Categories

1. Transient Errors

  • Network timeouts
  • Temporary connection issues
  • Database connection failures
  • Action: Auto-retry within 2-hour window

2. Validation Errors

  • Invalid file format
  • Missing required fields
  • Data validation failures
  • Action: Archive immediately, notify admin

3. Processing Errors

  • Calculation errors
  • Parse errors
  • Logic errors
  • Action: Keep for manual review

4. Infrastructure Errors

  • Database errors
  • Filesystem errors
  • Transaction failures
  • Action: Retry with exponential backoff

5. Partner API Errors

  • API authentication failures
  • Rate limiting
  • Partner service unavailable
  • Action: Retry with longer delay
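
The categorization above can be sketched as a keyword-based classifier. This is an illustrative sketch only: the regex patterns and the categorizeError name are assumptions, not the exact rules inside partner_dlq_handler.js.

```javascript
// Illustrative keyword-based error categorization (patterns are examples,
// not the handler's actual matching rules).
const ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown',
};

const PATTERNS = [
  [ERROR_CATEGORIES.TRANSIENT, /timeout|ETIMEDOUT|ECONNRESET|connection (refused|reset)/i],
  [ERROR_CATEGORIES.VALIDATION, /invalid (file )?format|missing required|validation failed/i],
  [ERROR_CATEGORIES.PARTNER_API, /rate limit|authentication failed|service unavailable/i],
  [ERROR_CATEGORIES.INFRASTRUCTURE, /transaction|filesystem|ENOSPC|mongo/i],
  [ERROR_CATEGORIES.PROCESSING, /parse error|calculation|logic error/i],
];

function categorizeError(message) {
  // First matching pattern wins; anything unmatched is UNKNOWN.
  for (const [category, pattern] of PATTERNS) {
    if (pattern.test(message)) return category;
  }
  return ERROR_CATEGORIES.UNKNOWN;
}
```

The first-match-wins ordering matters: transient patterns are checked before processing patterns so that, for example, a timeout inside a parse step is still retried.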

Configuration

Environment Variables

# Queue Configuration
QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agmuser
QUEUE_PWD=<password>
QUEUE_NAME_PARTNER=partner_tasks  # Base name, auto-prefixes 'dev_' when PRODUCTION=false

# Retry Configuration
PARTNER_MAX_RETRIES=5              # Max retries before DLQ
PARTNER_RETRY_DELAY=10000          # Base retry delay (ms)

# DLQ Configuration
DLQ_CHECK_INTERVAL=300000          # Check DLQ every 5 minutes
MAX_DLQ_AGE_MS=86400000           # Archive after 24 hours
AUTO_RETRY_WINDOW_MS=7200000      # Auto-retry within 2 hours
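
The two age thresholds above drive the handler's retry/archive decision. A minimal sketch of that decision, assuming the category/age rules described in the Error Categories section (the decideAction name and exact precedence are illustrative):

```javascript
// Sketch of the per-message decision the DLQ handler makes.
// Constants mirror the env vars above; rule precedence is an assumption.
const AUTO_RETRY_WINDOW_MS = 7200000;  // auto-retry within 2 hours
const MAX_DLQ_AGE_MS = 86400000;       // archive after 24 hours

function decideAction(category, failedAt, now = Date.now()) {
  const age = now - failedAt;
  if (category === 'validation') return 'archive';  // never retried
  if (category === 'transient' && age <= AUTO_RETRY_WINDOW_MS) return 'retry';
  if (age >= MAX_DLQ_AGE_MS) return 'archive';      // too old, give up
  return 'manual_review';
}
```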

Worker Constants

// Circuit Breaker
MAX_FILE_ATTEMPTS = 10 (dev) / 3 (prod)
FILE_BLOCK_DURATION = 5 min (dev) / 1 hour (prod)

// Timeouts
PROCESSING_TIMEOUT_MS = 90 minutes
TASK_TIMEOUT_MS = 90 minutes
CLEANUP_INTERVAL_MS = 15 minutes
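
The circuit breaker mentioned for the sync worker can be approximated as an in-memory map keyed by file. The problematicFiles name matches the one referenced in the Troubleshooting section; the bookkeeping below is a sketch, not the worker's actual code.

```javascript
// Illustrative in-memory circuit breaker for repeatedly failing files.
// Thresholds use the dev values from the constants above.
const MAX_FILE_ATTEMPTS = 10;                  // 3 in prod
const FILE_BLOCK_DURATION_MS = 5 * 60 * 1000;  // 1 hour in prod

const problematicFiles = new Map(); // fileId -> { attempts, blockedUntil }

function recordFailure(fileId, now = Date.now()) {
  const entry = problematicFiles.get(fileId) || { attempts: 0, blockedUntil: 0 };
  entry.attempts += 1;
  if (entry.attempts >= MAX_FILE_ATTEMPTS) {
    entry.blockedUntil = now + FILE_BLOCK_DURATION_MS;
    entry.attempts = 0; // fresh counting window once the block expires
  }
  problematicFiles.set(fileId, entry);
}

function isBlocked(fileId, now = Date.now()) {
  const entry = problematicFiles.get(fileId);
  return !!entry && entry.blockedUntil > now;
}
```

Because the state lives in a Map, restarting the worker clears all blocks, which is why the Troubleshooting section recommends a restart to reset the breaker.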

Usage

1. Start DLQ Handler (Automatic Mode)

# Run as background service
node workers/partner_dlq_handler.js monitor &

# Or with PM2
pm2 start workers/partner_dlq_handler.js --name partner-dlq-handler -- monitor

2. Manual DLQ Operations

# Show DLQ statistics
node workers/partner_dlq_handler.js stats

# Process DLQ messages once
node workers/partner_dlq_handler.js process

3. Interactive Dashboard

# Launch monitoring dashboard
node scripts/monitor_partner_dlq.js

Dashboard commands:

  • r - Refresh dashboard
  • p - Process DLQ now
  • s - Show detailed statistics
  • c - Clear archived tasks (> 7 days old)
  • q - Quit

Monitoring

DLQ Statistics

{
  messageCount: 5,        // Messages in DLQ
  consumerCount: 0,       // Active consumers
  queueName: 'partner_tasks_failed'
}

Tracker Statistics

{
  failed: 12,            // Failed tasks
  processing: 3,         // Currently processing
  downloaded: 8,         // Downloaded, waiting
  processed: 245,        // Successfully processed
  archived: 7            // Archived from DLQ
}
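
These tracker statistics are a tally of the status field on tracker documents. A pure-JS sketch of that tally (in practice this would likely be an aggregation over the partnerlogtrackers collection; the trackerStats name is illustrative):

```javascript
// Tally tracker documents by status into the stats shape shown above.
// Unknown statuses are ignored rather than counted.
function trackerStats(docs) {
  const stats = { failed: 0, processing: 0, downloaded: 0, processed: 0, archived: 0 };
  for (const doc of docs) {
    if (stats[doc.status] !== undefined) stats[doc.status] += 1;
  }
  return stats;
}
```

The MongoDB equivalent is a single $group stage: db.partnerlogtrackers.aggregate([{ $group: { _id: '$status', count: { $sum: 1 } } }]).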

Queue-Native Operations

# Retry all messages in queue
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'

# Retry by position range (0-based index)
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'

# Retry by header match
curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'

Benefits:

  • No MongoDB coupling
  • Preserves original message content
  • Supports multiple queue types
  • Direct RabbitMQ operations
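
The selection semantics behind retryByPosition and retryByHeader can be sketched as filters over a drained snapshot of the DLQ. Both helper names are illustrative, and the sketch assumes the position range is inclusive at both ends; verify against the actual route before relying on that.

```javascript
// Sketch of queue-native selection over a snapshot of DLQ messages.
// Assumes inclusive startPosition..endPosition (0-based) - an assumption,
// not confirmed behavior of the API.
function selectByPosition(messages, startPosition, endPosition) {
  return messages.filter((_, i) => i >= startPosition && i <= endPosition);
}

function selectByHeader(messages, headerKey, headerValue) {
  // Header values may be numbers on the wire; compare as strings.
  return messages.filter(
    (m) => String((m.headers || {})[headerKey]) === String(headerValue)
  );
}
```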

Legacy Manual Recovery (Programmatic)

For advanced debugging scenarios only:

const handler = new PartnerDLQHandler();
await handler.start();

// Get message from DLQ
const msg = await handler.channel.get('partner_tasks_failed');

// Retry it programmatically
await handler.retryMessage(msg, JSON.parse(msg.content));
// Note: channel.get returns false when the queue is empty, so guard
// before retrying, and decode the Buffer payload explicitly:
// if (msg) await handler.retryMessage(msg, JSON.parse(msg.content.toString()));

Clear Stuck Tasks

# Reset stuck processing tasks
mongo mongodb://localhost:27017/agmission << EOF
use agmission
db.partnerlogtrackers.updateMany(
  { 
    status: 'processing',
    processingStartedAt: { \$lt: new Date(Date.now() - 90*60*1000) }
  },
  { 
    \$set: { 
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF

Purge DLQ

# WARNING: This deletes all DLQ messages
rabbitmqadmin purge queue name=partner_tasks_failed

Troubleshooting

High DLQ Message Count

  1. Check error patterns:

    node scripts/monitor_partner_dlq.js
    # Press 's' for detailed stats
    
  2. Identify root cause:

    • Validation errors: Fix data source or add validation
    • Transient errors: Check infrastructure (network, DB, partner API)
    • Processing errors: Review logs, fix code bugs
  3. Take action:

    • Fix root cause
    • Process DLQ: node workers/partner_dlq_handler.js process
    • Monitor results

Memory Issues

  1. Check worker memory:

    ps aux | grep partner_sync_worker
    
  2. If high memory usage:

    • Reduce batchSize in processor options
    • Increase PROCESSING_TIMEOUT_MS
    • Enable manual GC and raise the heap limit: node --expose-gc --max-old-space-size=2048 workers/partner_sync_worker.js

Circuit Breaker Blocking Files

  1. Check blocked files:

    // In partner_sync_worker.js, add logging:
    setInterval(() => {
      console.log('Blocked files:', Array.from(problematicFiles.keys()));
    }, 60000);
    
  2. Reset circuit breaker:

    • Restart worker (circuit breaker is in-memory)
    • Or adjust thresholds for development

Best Practices

1. Monitoring

  • Set up alerts for high DLQ message count (> 50)
  • Monitor DLQ age (messages > 1 hour need attention)
  • Track processing success rate

2. Retry Strategy

  • Use exponential backoff for retries
  • Categorize errors properly
  • Don't retry validation errors
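
The exponential-backoff recommendation can be sketched with "full jitter", using the PARTNER_RETRY_DELAY base from the configuration section. The retryDelay name and the 1-hour cap are illustrative choices, not values from the codebase.

```javascript
// Exponential backoff with full jitter: delay grows as base * 2^attempt,
// capped, then a uniform random fraction of that is used to spread retries.
const BASE_DELAY_MS = 10000;   // PARTNER_RETRY_DELAY
const MAX_DELAY_MS = 3600000;  // illustrative 1-hour cap

function retryDelay(attempt, random = Math.random) {
  const exp = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  return Math.floor(random() * exp); // uniform in [0, exp)
}
```

Full jitter avoids the "thundering herd" of many failed tasks retrying at the same instant after a shared outage.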

3. Circuit Breaker

  • Use lenient settings in development
  • Use strict settings in production
  • Monitor blocked files regularly

4. Database Cleanup

  • Archive old failed tasks (> 30 days)
  • Keep DLQ archived tasks for audit (7 days)
  • Regularly check for stuck processing tasks

API Reference

PartnerDLQHandler

const handler = new PartnerDLQHandler();

// Start handler
await handler.start();

// Process DLQ
await handler.processDLQ();

// Get statistics
const stats = await handler.getStatistics();

// Stop handler
await handler.stop();

Error Categories

const ERROR_CATEGORIES = {
  TRANSIENT: 'transient',
  VALIDATION: 'validation',
  PROCESSING: 'processing',
  INFRASTRUCTURE: 'infrastructure',
  PARTNER_API: 'partner_api',
  UNKNOWN: 'unknown'
};

Future Enhancements

  1. Email/Slack Notifications

    • Alert admins on critical failures
    • Daily DLQ summary reports
  2. Advanced Analytics

    • Failure trend analysis
    • Automatic root cause detection
    • Performance metrics
  3. Automatic Recovery

    • Smart retry scheduling
    • Self-healing for known issues
    • Predictive failure prevention
  4. Web Dashboard

    • Real-time DLQ visualization
    • One-click retry/archive
    • Historical analysis