agmission/Development/server/docs/DLQ_OPERATIONS.md
Devin Major df31b2080d
2026-04-24 09:05:55 -04:00


DLQ Operations Guide



Comprehensive guide for managing Dead Letter Queues across all queue types.

Overview

The DLQ system provides queue-native tools for monitoring and managing failed tasks across all queue types:

  • Partner tasks (partner_tasks)
  • Job processing (jobs)
  • Future queue types (notifications, analytics, etc.)

Key Benefits:

  • Direct RabbitMQ operations (no MongoDB coupling)
  • Supports multiple queue types
  • Preserves original message content and headers
  • Works with any task type

Architecture

Components

  1. Workers - Process tasks, send failures to DLQ

    • workers/partner_sync_worker.js
    • workers/job_worker.js
    • Future workers for other queue types
  2. DLQ Routes - Global API endpoints

    • routes/dlq.js
    • Mounted at /api/dlq/:queueName/*
  3. DLQ Controller - Queue operations logic

    • controllers/dlq.js
    • Handles all queue types generically
  4. Monitoring Tools

    • Web dashboard: public/dlq-monitor.html

Message Flow

flowchart LR
    A[Worker] --> B[Main Queue]
    B --> C{Processing}
    C -->|Success ✓| D[Complete]
    C -->|Failure<br/>max retries| E[DLQ]
    E --> F{Action}
    F -->|Retry| B
    F -->|Archive| G[Archive Storage]
    F -->|Purge| H[Delete]
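
The failure branch of the flow above can be sketched as a small decision function. This is only an illustration, not the workers' actual code: the name `route_failed_task` and the retry limit parameter are hypothetical, and it assumes the number of failed attempts is available to the worker (for example from an `x-retry-count` header).

```bash
# Hypothetical sketch: decide where a failed task goes next.
# $1 = attempts that have already failed, $2 = assumed retry limit.
route_failed_task() {
  local retry_count="$1" max_retries="$2"
  if [ "$retry_count" -ge "$max_retries" ]; then
    echo "dlq"       # retries exhausted: dead-letter the message
  else
    echo "requeue"   # attempts remain: send back to the main queue
  fi
}
```

For example, `route_failed_task 3 3` prints `dlq`, while `route_failed_task 1 3` prints `requeue`.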

Queue-Native Operations

Retry Operations

Retry All Messages (Recommended)

curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'

Retry by Position Range (0-based index)

curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByPosition \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startPosition": 0, "endPosition": 10}'
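
Because positions are 0-based, it can help to validate a range before sending it. The helper below is a sketch under assumptions: `build_position_body` and `is_position` are hypothetical names, and the only rules enforced are the ones stated here (non-negative integers, start not greater than end).

```bash
# Hypothetical helper: accept plain non-negative integers without leading zeros.
is_position() {
  case "$1" in
    ''|*[!0-9]*|0?*) return 1 ;;
    *) return 0 ;;
  esac
}

# Validate a 0-based range and print the JSON body for retryByPosition.
build_position_body() {
  local start="$1" end="$2"
  if ! is_position "$start" || ! is_position "$end"; then
    echo "positions must be non-negative integers" >&2
    return 1
  fi
  if [ "$start" -gt "$end" ]; then
    echo "startPosition must not exceed endPosition" >&2
    return 1
  fi
  printf '{"startPosition": %s, "endPosition": %s}\n' "$start" "$end"
}
```

The output can be passed straight to curl, for example `-d "$(build_position_body 0 10)"`.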

Retry by Header Match (Custom filtering)

curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryByHeader \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"headerKey": "x-retry-count", "headerValue": "1"}'

Benefits:

  • No MongoDB coupling
  • Preserves original message content
  • Supports multiple queue types
  • Direct RabbitMQ operations

Monitoring

Web Dashboard

Access at https://localhost:4100/dlq-monitor.html

Features:

  • Real-time statistics
  • Message list with error details
  • One-click retry operations
  • Queue selection dropdown
  • Auto-refresh every 30 seconds
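
When the dashboard is not reachable, a similar periodic check can be approximated from a terminal. This sketch assumes the `stats` endpoint and the `.dlq.messageCount` field used in the Alert Thresholds section; the function names are hypothetical.

```bash
# Print the DLQ message count for one queue ($1 = queue name).
fetch_dlq_count() {
  curl -s "http://localhost:4100/api/dlq/$1/stats" \
    -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount'
}

# Run a fetch command repeatedly: $1 = command, $2 = seconds between checks,
# $3 = number of checks. Remaining arguments are passed to the command.
watch_dlq_count() {
  local fetch_cmd="$1" interval="$2" rounds="$3" i=1 count
  shift 3
  while [ "$i" -le "$rounds" ]; do
    count=$("$fetch_cmd" "$@") || count=""
    echo "check $i: ${count:-unavailable}"
    if [ "$i" -lt "$rounds" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}
```

To mirror the dashboard's 30-second refresh for ten checks, one could run `watch_dlq_count fetch_dlq_count 30 10 partner_tasks`.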

Manual Recovery Procedures

Clear Stuck Processing Tasks

If tasks are stuck in "processing" status:

mongo mongodb://localhost:27017/agmission << EOF
use agmission
db.partner_log_trackers.updateMany(
  { 
    status: 'processing',
    processingStartedAt: { \$lt: new Date(Date.now() - 90*60*1000) }
  },
  { 
    \$set: { 
      status: 'failed',
      errorMessage: 'Manually reset - stuck processing'
    }
  }
)
EOF
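
Before running the update, it may be worth previewing how many trackers it would touch. The sketch below reuses the same filter and only counts matches; the function names are hypothetical, and it assumes a `mongo` shell version that provides `countDocuments`.

```bash
# Build the same filter used by the update above; $1 = age limit in minutes.
stuck_filter() {
  local minutes="${1:-90}"
  printf '{ status: "processing", processingStartedAt: { $lt: new Date(Date.now() - %s*60*1000) } }\n' "$minutes"
}

# Print how many trackers match, without modifying anything.
count_stuck_tasks() {
  mongo --quiet \
    --eval "print(db.partner_log_trackers.countDocuments($(stuck_filter "${1:-90}")))" \
    mongodb://localhost:27017/agmission
}
```

Calling `count_stuck_tasks 90` prints the number of tasks that have been in "processing" for more than 90 minutes.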

Purge DLQ (Dangerous!)

⚠️ Warning: This permanently deletes all DLQ messages.

curl -X DELETE http://localhost:4100/api/dlq/partner_tasks/purge \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"confirm": true}'
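
Since a purge cannot be undone, one option is to wrap the call in a confirmation prompt. This is a sketch rather than part of the DLQ tooling: `purge_dlq` and `confirm_matches` are hypothetical names, and the request itself is the same as the one above.

```bash
# Succeeds only when the typed text is non-empty and equals the queue name.
confirm_matches() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Ask the operator to type the queue name before sending the purge request.
purge_dlq() {
  local queue="$1" typed=""
  printf 'Type the queue name (%s) to purge its DLQ: ' "$queue" >&2
  read -r typed || true
  if ! confirm_matches "$queue" "$typed"; then
    echo "Aborted: confirmation did not match." >&2
    return 1
  fi
  curl -X DELETE "http://localhost:4100/api/dlq/$queue/purge" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"confirm": true}'
}
```

Running `purge_dlq partner_tasks` sends the request only after `partner_tasks` has been typed back exactly.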

Multi-Queue Operations

Partner Queue

# View messages
curl http://localhost:4100/api/dlq/partner_tasks/messages \
  -H "Authorization: Bearer $TOKEN"

# Retry all
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'

Job Queue

# View messages
curl http://localhost:4100/api/dlq/dev_jobs/messages \
  -H "Authorization: Bearer $TOKEN"

# Retry all
curl -X POST http://localhost:4100/api/dlq/dev_jobs/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'

Future Queues

No code changes needed:

curl -X POST http://localhost:4100/api/dlq/notifications/retryAll \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 50}'
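
Because the endpoints differ only in the queue name, the same call can be looped over several queues. The sketch below assumes the `stats` endpoint and `.dlq.messageCount` field shown under Alert Thresholds; `DLQ_BASE_URL`, `dlq_url`, and `print_dlq_counts` are hypothetical names introduced for illustration.

```bash
# Base URL is an assumption; override DLQ_BASE_URL for other environments.
DLQ_BASE_URL="${DLQ_BASE_URL:-http://localhost:4100}"

# Build an endpoint URL: $1 = queue name, $2 = action (stats, messages, retryAll).
dlq_url() {
  printf '%s/api/dlq/%s/%s\n' "$DLQ_BASE_URL" "$1" "$2"
}

# Print the DLQ message count for each queue name given as an argument.
print_dlq_counts() {
  local queue
  for queue in "$@"; do
    printf '%s: ' "$queue"
    curl -s "$(dlq_url "$queue" stats)" \
      -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount'
  done
}
```

For example, `print_dlq_counts partner_tasks dev_jobs` prints one count per queue.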

Alert Thresholds

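# Check DLQ count (falls back to 0 if the field is missing or null)
DLQ_COUNT=$(curl -s http://localhost:4100/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount // 0')

# Alert thresholds (count only; message age must be checked separately)
if [ "$DLQ_COUNT" -gt 100 ]; then
  echo "EMERGENCY: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 50 ]; then
  echo "CRITICAL: DLQ has $DLQ_COUNT messages"
elif [ "$DLQ_COUNT" -gt 20 ]; then
  echo "WARNING: DLQ has $DLQ_COUNT messages"
fi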

Thresholds:

  • Warning: DLQ > 20 messages
  • Critical: DLQ > 50 messages
  • Emergency: DLQ > 100 messages OR age > 6 hours
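
The three levels, including the age condition, can be expressed as one function. This is a minimal sketch: `dlq_severity` is a hypothetical name, and how the age of the oldest message is obtained is left as an assumption (it is passed in as whole hours).

```bash
# Map a DLQ count and the oldest message age to a severity level.
# $1 = message count, $2 = age of the oldest message in whole hours (default 0).
dlq_severity() {
  local count="$1" age_hours="${2:-0}"
  if [ "$count" -gt 100 ] || [ "$age_hours" -gt 6 ]; then
    echo "EMERGENCY"
  elif [ "$count" -gt 50 ]; then
    echo "CRITICAL"
  elif [ "$count" -gt 20 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}
```

For example, `dlq_severity 60` prints `CRITICAL`, and `dlq_severity 3 7` prints `EMERGENCY` because the oldest message is older than 6 hours.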

Error Categories

Common error patterns and recovery strategies:

Transient Errors

  • Network timeouts
  • Connection failures
  • Temporary API unavailability

Action: Retry as-is (usually succeeds once the dependency recovers)

Validation Errors

  • Invalid file format
  • Missing required fields
  • Data type mismatches

Action: Fix source data, then retry

Infrastructure Errors

  • Database connection failures
  • Disk space issues
  • Memory errors

Action: Fix infrastructure, then retry all
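
A rough triage of error messages into these categories can be sketched with pattern matching. The patterns below are illustrative guesses, not an authoritative list of the errors the workers emit, and `classify_dlq_error` is a hypothetical name. Infrastructure patterns are checked first so that a database connection failure is not mistaken for a transient network error.

```bash
# Classify an error message ($1) as infrastructure, validation, transient, or unknown.
classify_dlq_error() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *enospc*|*"no space"*|*"out of memory"*|*mongo*|*database*)
      echo "infrastructure" ;;
    *invalid*|*missing*|*required*|*mismatch*)
      echo "validation" ;;
    *timeout*|*"timed out"*|*etimedout*|*econnreset*|*econnrefused*|*unavailable*)
      echo "transient" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Messages that come back as `unknown` are good candidates for manual review before any retry.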


Integration with Monitoring Systems

Prometheus Metrics (Future)

# DLQ message count gauge
dlq_messages_total{queue="partner_tasks"} 5
dlq_messages_total{queue="jobs"} 2

# Retry success rate
dlq_retry_success_rate{queue="partner_tasks"} 0.85

Alert Manager Rules

groups:
  - name: dlq_alerts
    rules:
      - alert: HighDLQCount
        expr: dlq_messages_total > 50
        for: 30m
        annotations:
          summary: "High DLQ message count"

Best Practices

  1. Regular Monitoring: Check DLQ counts at least daily
  2. Investigate Patterns: Multiple similar failures indicate systemic issues
  3. Timely Retry: Retry before messages reach the 6-hour Emergency age threshold
  4. Use Position Retry: For targeted retry of specific ranges
  5. Document Failures: Track patterns for future prevention
  6. Test Retry: Use small batches first to verify fixes
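
The "small batches first" practice can be sketched as a staged retry: retry a trial batch, wait, and continue only if the DLQ shrank. All function names here are hypothetical, the endpoints are the `stats` and `retryAll` calls shown earlier, and the wait must be long enough for retried messages that fail again to land back in the DLQ.

```bash
# Read the current DLQ message count for a queue ($1).
dlq_count() {
  curl -s "http://localhost:4100/api/dlq/$1/stats" \
    -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount'
}

# Retry up to $2 messages from the DLQ of queue $1.
retry_batch() {
  curl -s -X POST "http://localhost:4100/api/dlq/$1/retryAll" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"maxMessages\": $2}"
}

# The trial batch helped if the DLQ shrank: $1 = count before, $2 = count after.
trial_batch_helped() {
  [ "$2" -lt "$1" ]
}

# $1 = queue, $2 = trial size, $3 = follow-up size, $4 = seconds to wait.
staged_retry() {
  local queue="$1" trial="$2" rest="$3" wait_s="$4" before after
  before=$(dlq_count "$queue")
  retry_batch "$queue" "$trial"
  sleep "$wait_s"
  after=$(dlq_count "$queue")
  if trial_batch_helped "$before" "$after"; then
    retry_batch "$queue" "$rest"
  else
    echo "Trial batch did not reduce the DLQ ($before -> $after); stopping." >&2
    return 1
  fi
}
```

A call such as `staged_retry partner_tasks 5 100 300` would retry 5 messages, wait five minutes, and retry up to 100 more only if the count dropped.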

Troubleshooting

Cannot Connect to RabbitMQ

Check connection settings in environment.env:

QUEUE_HOST=localhost
QUEUE_PORT=5672
QUEUE_USR=agm
QUEUE_PWD=***
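
A quick reachability check against these settings can rule out basic network problems. This sketch assumes `environment.env` uses plain `KEY=value` lines as shown above and that a netcat supporting `-z` is installed; `env_value` and `check_rabbitmq_port` are hypothetical names.

```bash
# Print the value of a key ($1) from a KEY=value file ($2); the last match wins.
env_value() {
  sed -n "s/^$1=//p" "$2" | tail -n 1
}

# Test whether the configured RabbitMQ host and port accept TCP connections.
# $1 = path to the env file (defaults to environment.env).
check_rabbitmq_port() {
  local env_file="${1:-environment.env}" host port
  host=$(env_value QUEUE_HOST "$env_file")
  port=$(env_value QUEUE_PORT "$env_file")
  if nc -z -w 3 "$host" "$port"; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port" >&2
    return 1
  fi
}
```

A successful port check only shows that the broker is listening; credentials (`QUEUE_USR`, `QUEUE_PWD`) still need to be verified separately.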

Messages Not Retrying

  1. Check worker is running:

    # The [p] bracket keeps grep from matching its own process
    ps aux | grep '[p]artner_sync_worker'
    
  2. Check main queue exists:

    curl http://localhost:15672/api/queues/%2F/dev_partner_tasks \
      -u agm:***
    
  3. Check message format is valid

High Failure Rate

  1. Review recent error messages
  2. Check worker logs for patterns
  3. Verify external services are available
  4. Review worker configuration
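
For the first two steps, grouping identical error messages makes patterns easier to spot. The helper below is a sketch that works on any list of error strings, one per line; `top_errors` is a hypothetical name, and how the strings are extracted (from worker logs or from the `/messages` response) depends on formats this guide does not specify.

```bash
# Count identical lines on stdin and print "count<TAB>message", most frequent first.
top_errors() {
  awk '{ seen[$0]++ } END { for (m in seen) print seen[m] "\t" m }' | sort -rn
}
```

For example, `grep -i error worker.log | top_errors` would list the most common error lines in a hypothetical `worker.log`.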
