Partner DLQ API - Quick Start Guide

📋 Overview

The Partner DLQ (Dead Letter Queue) API provides queue-native tools for monitoring and managing failed partner processing tasks. All operations work directly on RabbitMQ queues, with no MongoDB coupling, and support multiple queue types and task categories. The toolset includes REST API endpoints, a web dashboard, CLI tools, and automated processing.

🚀 Quick Start

1. Web Dashboard (Easiest)

Open your browser and navigate to:

http://localhost:3000/dlq-monitor.html

Features:

  • Real-time statistics (auto-refresh every 30s)
  • Visual error categorization
  • One-click retry/archive operations
  • Recent failures display with full details

2. API Endpoints

All endpoints require admin authentication.

Get Statistics

curl -X GET http://localhost:3000/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer YOUR_TOKEN"

Process DLQ (Dry Run)

curl -X POST http://localhost:3000/api/dlq/partner_tasks/process \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dryRun": true}'

Retry All DLQ Messages (Queue-Native)

curl -X POST http://localhost:3000/api/dlq/partner_tasks/retryAll \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'
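The same calls can be made from Node.js. The sketch below is a minimal, hypothetical client: `buildDlqRequest` and `callDlqApi` are illustrative names, not part of the codebase, and the routes and request bodies simply mirror the curl examples above. It assumes Node.js 18+ for the global `fetch` and that `stats` is the only GET action.

```javascript
// Hypothetical helper: builds the URL and fetch options for a DLQ API call.
// Routes follow the curl examples: /api/dlq/<queueName>/<action>.
function buildDlqRequest({ baseUrl, queueName, action, token, body }) {
  // Assumption: "stats" is read-only (GET); every other action is a POST.
  const method = action === "stats" ? "GET" : "POST";
  const headers = { Authorization: `Bearer ${token}` };
  const options = { method, headers };
  if (method === "POST") {
    headers["Content-Type"] = "application/json";
    options.body = JSON.stringify(body ?? {});
  }
  const url = `${baseUrl}/api/dlq/${encodeURIComponent(queueName)}/${action}`;
  return { url, options };
}

// Sends the request and returns the parsed JSON response.
async function callDlqApi(params) {
  const { url, options } = buildDlqRequest(params);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`DLQ API ${params.action} failed: HTTP ${response.status}`);
  }
  return response.json();
}

// Example (not executed here): preview what processing would do.
// callDlqApi({
//   baseUrl: "http://localhost:3000",
//   queueName: "partner_tasks",
//   action: "process",
//   token: process.env.AUTH_TOKEN,
//   body: { dryRun: true },
// }).then(console.log);
```

Keeping request construction separate from the network call makes the routing logic easy to check without a running server.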

3. CLI Monitoring Tool

node scripts/monitor_partner_dlq.js

Interactive commands:

  • r - Refresh dashboard
  • p - Process DLQ now
  • s - Show detailed statistics
  • c - Clear archived tasks (> 7 days old)
  • q - Quit

4. Automated Background Processing

Start the DLQ handler as a background service:

# Using Node.js
node workers/partner_dlq_handler.js monitor &

# Using PM2 (recommended)
pm2 start workers/partner_dlq_handler.js --name partner-dlq-handler -- monitor

Or schedule periodic processing with cron:

# Edit crontab
crontab -e

# Add line to process DLQ every 4 hours
0 */4 * * * cd /path/to/server && node workers/partner_dlq_handler.js process >> /var/log/dlq-processing.log 2>&1

📚 Available Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/partners/dlq/stats | GET | Get DLQ statistics |
| /api/partners/dlq/messages | GET | View DLQ messages (peek mode) |
| /api/partners/dlq/process | POST | Process DLQ with auto retry/archive |
| /api/dlq/:queueName/retryAll | POST | Retry all DLQ messages |
| /api/dlq/:queueName/retryByPosition | POST | Retry messages by position |
| /api/dlq/:queueName/retryByHeader | POST | Retry messages by header |
| /api/partners/dlq/purge | DELETE | Purge all DLQ messages ⚠️ |

🔍 Error Categories

Messages are automatically categorized:

  • 🔵 Transient: Network timeouts, connection issues → Auto-retry within 2h
  • 🔴 Validation: Invalid data, missing fields → Archive immediately
  • 🟠 Processing: Parse errors, calculation errors → Keep for review
  • Infrastructure: Database errors, filesystem errors → Retry with backoff
  • 🟣 Partner API: API auth failures, rate limiting → Retry with delay
  • Unknown: Unclassified errors → Keep for review
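To make the category-to-action mapping concrete, here is a minimal sketch of a rule-based classifier. The regular expressions, the category identifiers, and the `categorizeError` name are assumptions chosen for illustration; the actual handler may match on different fields or patterns.

```javascript
// Illustrative rules only. Order matters: the first matching rule wins.
const CATEGORY_RULES = [
  { category: "validation", action: "archive", pattern: /invalid|missing (required )?field|validation/i },
  { category: "partner_api", action: "retry_with_delay", pattern: /\b(401|403|429)\b|unauthorized|rate limit/i },
  { category: "transient", action: "auto_retry", pattern: /timeout|timed out|ECONNRESET|ECONNREFUSED|socket hang up/i },
  { category: "infrastructure", action: "retry_with_backoff", pattern: /mongo|database|ENOSPC|EACCES|filesystem/i },
  { category: "processing", action: "keep_for_review", pattern: /parse|calculation|\bNaN\b/i },
];

// Maps an error message to a category and the action listed for it above.
function categorizeError(message) {
  const text = String(message ?? "");
  for (const rule of CATEGORY_RULES) {
    if (rule.pattern.test(text)) {
      return { category: rule.category, action: rule.action };
    }
  }
  // Anything unmatched is kept for manual review.
  return { category: "unknown", action: "keep_for_review" };
}
```

A first-match rule list keeps precedence explicit: validation failures are checked first so that a permanently bad message is archived rather than retried.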

🧪 Testing

Run Test Suite

# Set your auth token
export AUTH_TOKEN="your_token_here"

# Run tests
./scripts/test_dlq_api.sh

Import Postman Collection

Import docs/Partner_DLQ_API.postman_collection.json into Postman for interactive testing.

🔐 Authentication

All endpoints require admin authentication. Include your bearer token:

Authorization: Bearer YOUR_TOKEN

To obtain a token, authenticate through the regular login endpoint.

⚙️ Configuration

Environment variables:

# Queue Configuration
QUEUE_NAME_PARTNER=partner_tasks          # Main queue name (auto-prefixes 'dev_' in development)
PARTNER_MAX_RETRIES=5                     # Max retries before DLQ
DLQ_CHECK_INTERVAL=300000                 # DLQ check interval in ms (5 min)

# Processing Rules
MAX_DLQ_AGE_MS=86400000                   # Archive after 24 hours
AUTO_RETRY_WINDOW_MS=7200000              # Auto-retry within 2 hours
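As a sketch of how these variables might be consumed, the snippet below reads them with the defaults shown above and applies the two age thresholds. The function names are hypothetical. The way the thresholds interact (retry inside the window, archive past the maximum age, keep for review in between) is an assumption, not confirmed handler behavior.

```javascript
// Reads a positive integer from the environment, falling back to the documented default.
function readIntEnv(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Collects the DLQ settings listed above into one object.
function loadDlqConfig(env = process.env) {
  return {
    maxRetries: readIntEnv(env, "PARTNER_MAX_RETRIES", 5),
    checkIntervalMs: readIntEnv(env, "DLQ_CHECK_INTERVAL", 300000),
    maxDlqAgeMs: readIntEnv(env, "MAX_DLQ_AGE_MS", 86400000),
    autoRetryWindowMs: readIntEnv(env, "AUTO_RETRY_WINDOW_MS", 7200000),
  };
}

// Assumed interaction of the two age thresholds (illustrative only).
function decideByAge(ageMs, config) {
  if (ageMs <= config.autoRetryWindowMs) return "retry";
  if (ageMs >= config.maxDlqAgeMs) return "archive";
  return "keep";
}
```

Invalid or missing values fall back to the defaults, so a typo in the environment cannot silently disable archiving.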

📊 Monitoring

Key Metrics to Watch

  1. DLQ Message Count - Should stay < 20 under normal operation
  2. Failed Task Rate - Sudden spikes indicate issues
  3. Error Category Distribution - Patterns indicate root causes
  4. Archive Rate - High rate may indicate data quality issues

Alert Thresholds

  • ⚠️ Warning: DLQ > 20 messages
  • 🚨 Critical: DLQ > 50 messages
  • 🔥 Emergency: DLQ > 100 messages, or oldest message older than 6 hours
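These thresholds translate directly into a small severity function that a monitoring script could call. This is a sketch: `dlqAlertLevel` is a hypothetical name, and it assumes the 6-hour limit refers to the age of the oldest message in the DLQ.

```javascript
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// Maps the DLQ message count and oldest-message age to an alert level.
// Thresholds are strict (">"), matching the list above.
function dlqAlertLevel(messageCount, oldestMessageAgeMs = 0) {
  if (messageCount > 100 || oldestMessageAgeMs > SIX_HOURS_MS) return "emergency";
  if (messageCount > 50) return "critical";
  if (messageCount > 20) return "warning";
  return "ok";
}
```

Checking the most severe condition first means a small but very old backlog still escalates to the emergency level.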

🛠️ Common Operations

Check DLQ Health

curl -s http://localhost:3000/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer $TOKEN" | jq '.dlq.messageCount'

Process All Failed Messages

curl -X POST http://localhost:3000/api/dlq/partner_tasks/process \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"maxMessages": 100}'

Find Recent Failures

curl -s http://localhost:3000/api/dlq/partner_tasks/stats \
  -H "Authorization: Bearer $TOKEN" | jq '.recentFailures[0:5]'

🐛 Troubleshooting

High DLQ Count

  1. Check error categories in the dashboard
  2. Identify patterns in error messages
  3. Fix root cause (network, data, code)
  4. Process DLQ to retry recoverable tasks

Stuck Processing Tasks

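# Check MongoDB for tasks stuck in "processing" for more than 90 minutes
# (mongosh replaces the legacy mongo shell, which was removed in MongoDB 6.0)
mongosh agmission --eval '
  db.partnerlogtrackers.find({
    status: "processing",
    processingStartedAt: { $lt: new Date(Date.now() - 90*60*1000) }
  }).pretty()
'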

RabbitMQ Connection Issues

# Check RabbitMQ status
rabbitmqctl status

# Check queue stats
rabbitmqctl list_queues name messages consumers

🎯 Best Practices

  1. Monitor Daily: Check DLQ stats every day
  2. Process Regularly: Run DLQ processing every 4-6 hours
  3. Review Archives: Audit archived tasks weekly
  4. Document Patterns: Keep track of recurring errors
  5. Alert Early: Set up alerts at warning thresholds
  6. Test Changes: Always do a dry run first

💡 Tips

  • Use dry run mode before processing to preview actions
  • Check the web dashboard for a visual overview
  • Use the CLI tool for detailed statistics
  • Set up automated processing for hands-off operation
  • Review error categories to identify systemic issues

🚨 Emergency Procedures

DLQ is Full (>100 messages)

  1. Stop new task ingestion temporarily
  2. Identify root cause from error patterns
  3. Fix the root cause
  4. Process DLQ in batches
  5. Monitor recovery
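Step 4 can be sketched as a bounded loop that stops as soon as the queue is empty or a batch makes no progress. Everything here is illustrative: `drainInBatches` is a hypothetical helper, and `processBatch` is injected by the caller and assumed to resolve to `{ processed, remaining }`, which may not match the real response shape of the process endpoint.

```javascript
// Drains the DLQ in bounded batches.
// processBatch(batchSize) is supplied by the caller, for example a wrapper
// around POST /api/dlq/partner_tasks/process with {"maxMessages": batchSize}.
async function drainInBatches({ processBatch, batchSize = 25, maxBatches = 10, pauseMs = 0 }) {
  let totalProcessed = 0;
  let remaining = null;
  let batches = 0;
  while (batches < maxBatches) {
    const result = await processBatch(batchSize);
    batches += 1;
    totalProcessed += result.processed;
    remaining = result.remaining;
    // Stop when the queue is empty, or when nothing was processed
    // (which suggests the root cause is not fixed yet).
    if (remaining === 0 || result.processed === 0) break;
    // Optional pause so downstream systems can recover between batches.
    if (pauseMs > 0) await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return { totalProcessed, remaining, batches };
}
```

The `maxBatches` cap keeps a misbehaving queue from being retried indefinitely, and the returned summary gives step 5 something concrete to monitor.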

Accidental Purge

Purged messages cannot be recovered. To prevent accidental purges:

  • Always require confirmation in the UI
  • Log all purge operations
  • Back up the tracker database regularly

🔄 Updates and Maintenance

Regular Maintenance Tasks

  1. Daily: Check DLQ stats
  2. Weekly: Review archived tasks
  3. Monthly: Clean up old archived records
  4. Quarterly: Review error patterns and optimize

Version History

  • v1.0.0 (Oct 2025) - Initial implementation
    • REST API endpoints
    • Web dashboard
    • CLI monitoring tool
    • Automated processing

Ready to start? Open the web dashboard or run the test script to verify everything is working!

# Quick health check
curl http://localhost:3000/api/dlq/partner_tasks/stats -H "Authorization: Bearer $TOKEN"

# Or open the dashboard (macOS; use xdg-open on Linux)
open http://localhost:3000/dlq-monitor.html