# Dead Letter Queue (DLQ) System Guide
**Navigation:** [📖 Index](DLQ_INDEX.md) | [🚀 Quick Start](DLQ_QUICKSTART.md) | [📚 API Reference](DLQ_API_REFERENCE.md) | [🔧 Operations](DLQ_OPERATIONS.md) | [🏗️ System Guide](DLQ_SYSTEM_GUIDE.md)
---
## Overview
The AgMission DLQ system provides robust error handling, monitoring, and archival for failed partner integration tasks. Built on RabbitMQ's Dead Letter Exchange (DLX) mechanism, it automatically captures failed messages, enriches them with diagnostic metadata, and provides configurable retention with smart alerting.
**Key Features:**
- Automatic message enrichment with error categorization
- Configurable retention (default: 365 days) with TTL-based auto-archival
- Smart admin email alerts based on thresholds (warning @ 20, critical @ 50 messages)
- Real-time dashboard monitoring at `/dlq-monitor.html`
- Health check integration for infrastructure monitoring
- Complete audit trail preservation in organized filesystem archives
---
## Architecture
### Message Flow
```mermaid
flowchart LR
Task[Partner Task] --> Main[Main Queue<br/>partner_tasks]
Main -->|Processing| Success[✓ Success]
Main -->|On Failure| DLQ[Dead Letter Queue<br/>partner_tasks_failed]
DLQ -->|After DLQ_RETENTION_DAYS| Archive[Archive Queue<br/>partner_tasks_archive]
Archive --> Filesystem[Filesystem Archive<br/>./dlq_archives/YYYY/MM/DD/]
```
### Components
1. **Main Worker** (`workers/partner_sync_worker.js`)
- Processes partner tasks from `partner_tasks` queue
- Enriches messages with error metadata before DLQ routing
- Implements retry logic with max attempts (default: 5)
2. **DLQ API** (`routes/dlq.js`, `controllers/dlq.js`)
- Global queue-native endpoints at `/api/dlq/:queueName/*`
- Direct RabbitMQ operations (no MongoDB coupling)
- Supports all queue types (partner_tasks, jobs, etc.)
- Provides retry operations: retryAll, retryByPosition, retryByHeader
3. **Archival Worker** (`workers/dlq_archival_worker.js`)
- Consumes TTL-expired messages from archive queue
- Writes JSON files organized by date: `YYYY/MM/DD/timestamp_tasktype_filename.json`
- Preserves full message content + headers + properties for compliance
4. **Alert Worker** (`workers/dlq_alert_worker.js`)
- Monitors DLQ message counts for all queues
- Sends email alerts when thresholds are exceeded (warning @ 20, critical @ 50)
- Smart throttling prevents alert spam (1 hour minimum between similar alerts)
- Configurable via DLQ_ALERT_* environment variables
5. **Health Check** (`controllers/health.js`)
- GET `/api/health` includes DLQ component status
- Status levels: healthy (<20 msgs), degraded (20-49), unhealthy (≥50)
- Exposes: messageCount, threshold, critical, retentionDays, consumerEnabled
6. **Dashboard** (`public/dlq-monitor.html`)
- Real-time monitoring UI with auto-refresh
- Visual alert indicators (green/yellow/red) based on thresholds
- Manual actions: retry, archive, process, purge
- Authentication via Bearer token
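The health check and dashboard share the same threshold-to-status mapping. A minimal sketch of that mapping (the function name is illustrative; threshold values are the documented defaults):

```javascript
// Map a DLQ message count to a health status, per the documented
// thresholds: healthy (<20), degraded (20-49), unhealthy (>=50).
function dlqStatus(messageCount, threshold = 20, critical = 50) {
  if (messageCount >= critical) return 'unhealthy';
  if (messageCount >= threshold) return 'degraded';
  return 'healthy';
}
```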
---
## Configuration
### Environment Variables
Add to your `.env` file or environment:
```bash
# DLQ Retention & Archival
DLQ_RETENTION_DAYS=365 # Days to keep messages in DLQ before auto-archive (default: 365)
DLQ_ARCHIVE_PATH=./dlq_archives # Where to store archived messages (default: ./dlq_archives)
# Alert Configuration
DLQ_ALERT_ENABLED=true # Enable admin email alerts (default: true)
DLQ_ALERT_THRESHOLD=20 # Warning threshold for message count (default: 20)
DLQ_ALERT_CRITICAL=50 # Critical threshold for escalated alerts (default: 50)
DLQ_ALERT_INTERVAL_MS=300000 # Check interval in milliseconds (default: 300000 = 5 min)
# DLQ Consumer Control
DLQ_CONSUMER_ENABLED=false # Enable DLQ message reprocessing (default: false, manual control)
```
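The workers read these variables at startup. One plausible way to centralize the parsing, with defaults mirroring the comments above (the helper name `loadDlqConfig` is illustrative, not the actual code):

```javascript
// Parse DLQ settings from an environment map, falling back to the
// documented defaults. Booleans default to the values shown above.
function loadDlqConfig(env = process.env) {
  return {
    retentionDays: Number(env.DLQ_RETENTION_DAYS || 365),
    archivePath: env.DLQ_ARCHIVE_PATH || './dlq_archives',
    alertEnabled: env.DLQ_ALERT_ENABLED !== 'false',   // default: true
    alertThreshold: Number(env.DLQ_ALERT_THRESHOLD || 20),
    alertCritical: Number(env.DLQ_ALERT_CRITICAL || 50),
    alertIntervalMs: Number(env.DLQ_ALERT_INTERVAL_MS || 300000),
    consumerEnabled: env.DLQ_CONSUMER_ENABLED === 'true', // default: false
  };
}
```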
### RabbitMQ Queue Configuration
Automatically configured by `partner_sync_worker.js` on startup:
1. **Main Queue** (`partner_tasks` or `dev_partner_tasks`)
- Durable: yes
- DLX: default exchange
- DLX routing key: `{queue_name}_failed`
2. **DLQ** (`partner_tasks_failed`)
- Durable: yes
- TTL: `DLQ_RETENTION_DAYS * 86400000` ms
- DLX: `dlq_archive` exchange
- DLX routing key: `archive`
3. **Archive Queue** (`partner_tasks_archive`)
- Durable: yes
- Consumed by `dlq_archival_worker.js`
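The topology above can be sketched as the queue arguments an `amqplib`-style `assertQueue` call would receive. This is an illustrative helper, not the worker's actual code:

```javascript
// Build the three-queue DLQ topology for a given base queue name.
// Names and arguments follow the configuration documented above.
function queueTopology(queueName, retentionDays = 365) {
  return {
    main: {
      name: queueName,
      options: {
        durable: true,
        arguments: {
          'x-dead-letter-exchange': '',                 // default exchange
          'x-dead-letter-routing-key': `${queueName}_failed`,
        },
      },
    },
    dlq: {
      name: `${queueName}_failed`,
      options: {
        durable: true,
        arguments: {
          'x-message-ttl': retentionDays * 86400000,    // DLQ_RETENTION_DAYS in ms
          'x-dead-letter-exchange': 'dlq_archive',
          'x-dead-letter-routing-key': 'archive',
        },
      },
    },
    archive: {
      name: `${queueName}_archive`,
      options: { durable: true },
    },
  };
}
```

Each entry would then be asserted with something like `channel.assertQueue(t.dlq.name, t.dlq.options)`.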
### Message Headers
Messages sent to DLQ include headers for filtering and diagnostics. **All other data is in the message body** for reprocessing.
#### Diagnostic Headers (All Queue Types)
| Header | Purpose | Example Values |
|--------|---------|----------------|
| `x-error-category` | Error classification | transient, validation, processing, infrastructure, partner_api, unknown |
| `x-error-reason` | Error message | "Connection timeout", "Invalid file format" |
| `x-task-type` | Task being processed | PROCESS_PARTNER_LOG, UPLOAD_JOB, SEND_NOTIFICATION |
| `x-severity` | Alert severity | low, medium, high, critical |
| `x-first-death-time` | Timestamp of failure | 1734567890123 |
#### Context Headers (partner_tasks queue only)
| Header | Purpose | Example Values |
|--------|---------|----------------|
| `x-partner-code` | Partner identifier (for filtering) | SATLOC, AGIDRONEX |
| `x-customer-id` | Customer ObjectId (for filtering) | 507f1f77bcf86cd799439011 |
**Note**: Headers are for filtering/identification only. Job IDs, user IDs, filenames, etc. are in the message body where they belong.
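A hypothetical sketch of how a worker might assemble these headers before republishing a failed task to the DLQ (field names follow the tables above; the function itself is illustrative):

```javascript
// Build the diagnostic + context headers for a message headed to the DLQ.
// Context headers (partner/customer) are only attached when present.
function dlqHeaders(task, err, category, severity) {
  const headers = {
    'x-error-category': category,          // e.g. transient, validation, ...
    'x-error-reason': err.message,
    'x-task-type': task.type,              // e.g. PROCESS_PARTNER_LOG
    'x-severity': severity,                // low | medium | high | critical
    'x-first-death-time': Date.now(),
  };
  if (task.partnerCode) headers['x-partner-code'] = task.partnerCode;
  if (task.customerId) headers['x-customer-id'] = String(task.customerId);
  return headers;
}
```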
---
## Operations
### Starting Workers
```bash
# Start main partner sync worker (includes DLQ routing)
node workers/partner_sync_worker.js
# Start archival worker (consumes expired DLQ messages)
node workers/dlq_archival_worker.js
# Start alert worker (monitors DLQs and sends email alerts)
node workers/dlq_alert_worker.js
# Or use start_workers.js to manage all workers
node start_workers.js
```
### Manual DLQ Operations
#### Web Dashboard (Recommended)
Navigate to: `https://localhost:4100/dlq-monitor.html`
1. Enter admin Bearer token (from login)
2. Select queue type (partner_tasks, jobs, etc.)
3. View statistics and messages
4. One-click retry operations
#### API Operations
```bash
# Check DLQ statistics
curl http://localhost:4100/api/dlq/partner_tasks/stats \
-H "Authorization: Bearer $TOKEN"
# Retry all messages
curl -X POST http://localhost:4100/api/dlq/partner_tasks/retryAll \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"maxMessages": 100}'
```
#### View Dashboard Details
1. Enter admin Bearer token (from login)
2. View real-time statistics
3. Actions available:
- **Refresh**: Update stats immediately
- **Process DLQ**: Retry eligible messages
- **Dry Run**: Analyze without processing
- **Purge DLQ**: Delete all messages (requires confirmation)
---
## Monitoring
### Health Check Integration
```bash
curl http://localhost:4100/api/health
```
Response includes DLQ component:
```json
{
  "timestamp": "2024-12-18T10:30:00.000Z",
  "overall": "healthy",
  "components": {
    "dlq": {
      "status": "healthy",
      "message": "DLQ operating normally",
      "messageCount": 5,
      "threshold": 20,
      "critical": 50,
      "consumerEnabled": false,
      "retentionDays": 365,
      "queueName": "partner_tasks_failed"
    }
  }
}
```
### Alert Email Format
When DLQ message count exceeds thresholds, admin receives:
**Subject:** `[AgMission][WARNING] DLQ Alert: 25 failed partner tasks`
**Body:**
```
Dead Letter Queue Alert - Partner Tasks
========================================
Status: WARNING
Current DLQ Message Count: 25
Alert Threshold: 20
Critical Threshold: 50
Time: 2024-12-18T10:30:00.000Z
Error Breakdown:
- transient: 12 (48.0%)
- partner_api: 8 (32.0%)
- validation: 5 (20.0%)
Recommended Actions:
1. Monitor DLQ dashboard at /dlq-monitor.html
2. Review error patterns and plan fixes
3. Messages will auto-archive after 365 days
Configuration:
- DLQ Retention: 365 days
- Archive Path: ./dlq_archives
- Max Retries: 5
To disable these alerts, set DLQ_ALERT_ENABLED=false in environment config.
```
**Throttling:** 1 email per hour per severity level to prevent floods.
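The throttling rule can be sketched as a small in-memory gate, one timestamp per severity level (illustrative; the alert worker's actual state handling may differ):

```javascript
// Allow at most one alert per severity level per hour.
const THROTTLE_MS = 60 * 60 * 1000;
const lastSent = new Map(); // severity -> last alert timestamp (ms)

function shouldSendAlert(severity, now = Date.now()) {
  const prev = lastSent.has(severity) ? lastSent.get(severity) : -Infinity;
  if (now - prev < THROTTLE_MS) return false; // still inside the window
  lastSent.set(severity, now);
  return true;
}
```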
---
## Error Categorization
Based on [DLQ Best Practices](https://rashadansari.medium.com/strategies-for-successful-dead-letter-queue-event-handling-e354f7dfbb3e):
### Transient Errors
**Auto-retry within 2 hours**
Patterns: `timeout`, `connection`, `network`, `ECONNREFUSED`, `DNS`
Action: May resolve automatically, suitable for retry
### Validation Errors
**Archive immediately**
Patterns: `validation`, `invalid`, `malformed`, `missing required`, `400`
Action: Requires code/data fix, not retryable
### Processing Errors
**Manual review**
Patterns: `processing`, `calculation`, `parse`, `transform`
Action: Business logic issue, investigate root cause
### Infrastructure Errors
**Escalate**
Patterns: `database`, `mongo`, `transaction`, `filesystem`, `disk`
Action: System-level problem, may affect multiple tasks
### Partner API Errors
**Extended backoff**
Patterns: `partner api`, `authentication`, `401`, `403`, `5xx`
Action: External service issue, retry with longer delays
### Unknown Errors
**Admin notification**
Patterns: Anything not matching above
Action: New error type, update categorization logic
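The categorization above amounts to first-match pattern tests with `unknown` as the fallback. An illustrative matcher (the real worker's rules may differ in order and detail):

```javascript
// Ordered [category, pattern] pairs mirroring the documented patterns.
// First match wins; anything unmatched is 'unknown'.
const CATEGORY_PATTERNS = [
  ['transient', /timeout|connection|network|ECONNREFUSED|DNS/i],
  ['validation', /validation|invalid|malformed|missing required|\b400\b/i],
  ['processing', /processing|calculation|parse|transform/i],
  ['infrastructure', /database|mongo|transaction|filesystem|disk/i],
  ['partner_api', /partner api|authentication|\b401\b|\b403\b|\b5\d\d\b/i],
];

function categorizeError(message) {
  for (const [category, pattern] of CATEGORY_PATTERNS) {
    if (pattern.test(message)) return category;
  }
  return 'unknown';
}
```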
---
## Archive Structure
Messages are archived to filesystem with this structure:
```
dlq_archives/
├── 2024/
│   ├── 12/
│   │   ├── 18/
│   │   │   ├── 1734567890123_PROCESS_PARTNER_LOG_satloc_flight_456.json
│   │   │   ├── 1734567891234_UPLOAD_PARTNER_JOB_job_789.json
│   │   │   └── ...
│   │   ├── 19/
│   │   └── ...
│   └── ...
└── ...
```
### Archive File Format
```json
{
  "archived_at": "2024-12-18T10:30:00.000Z",
  "timestamp": 1734567890123,
  "queue_name": "partner_tasks",
  "dlq_name": "partner_tasks_failed",
  "task_type": "PROCESS_PARTNER_LOG",
  "error_category": "transient",
  "severity": "medium",
  "headers": {
    "x-error-category": "transient",
    "x-error-reason": "Connection timeout after 30s",
    "x-task-type": "PROCESS_PARTNER_LOG",
    "x-severity": "medium",
    "x-first-death-time": 1734567890000,
    "x-partner-code": "SATLOC",
    "x-customer-id": "507f1f77bcf86cd799439011",
    "x-log-filename": "2024-12-18_aircraft_456.log"
  },
  "properties": {
    "contentType": "application/json",
    "deliveryMode": 2,
    "timestamp": 1734567890000,
    "expiration": "31536000000"
  },
  "message": {
    "type": "PROCESS_PARTNER_LOG",
    "customerId": "507f1f77bcf86cd799439011",
    "partnerCode": "SATLOC",
    "aircraftId": "aircraft_456",
    "logFileName": "2024-12-18_aircraft_456.log",
    "logId": "log_123",
    "retryCount": 5,
    "lastError": "Connection timeout after 30s",
    "failedAt": "2024-12-18T10:25:00.000Z"
  }
}
```
---
## PartnerLogTracker vs DLQ
### Separation of Concerns
**PartnerLogTracker** (MongoDB) - Business Intelligence Layer:
- Duplicate prevention via unique compound index (`logId`, `partnerCode`)
- Job matching audit trail with confidence scores
- Processing analytics (processTime, success rate)
- Customer reporting and dashboard queries
- File lifecycle tracking (download → process → complete)
- Timeout detection for stuck tasks
**DLQ System** (RabbitMQ + Filesystem) - Error Handling Layer:
- Failed message capture and retry orchestration
- Error categorization and severity assessment
- Configurable retention with TTL-based archival
- Admin alerting for operational issues
- Audit compliance with immutable archive logs
### Why Keep Both?
1. **Different Query Patterns**: PartnerLogTracker optimized for customer/job lookups, DLQ for error analysis
2. **Data Lifecycle**: PartnerLogTracker persists forever (business records), DLQ archives after 365 days
3. **Performance**: DLQ operations don't impact business query performance
4. **Compliance**: Archive files provide immutable audit trail separate from operational database
### Integration Points
- DLQ messages include `x-partner-code` and `x-customer-id` headers for filtering
- Original message body contains full task data (logFileName, jobId, etc.) for reprocessing
- PartnerLogTracker.status field reflects processing state independent of DLQ
- Health check queries both systems for comprehensive status
---
## Troubleshooting
### DLQ Buildup
**Symptom:** Message count > 20
**Causes:**
1. Partner API downtime → Check partner service status
2. Network connectivity issues → Review infrastructure health
3. Database connection problems → Check MongoDB connection pool
4. Code bugs in processing logic → Review recent deployments
**Resolution:**
1. Identify root cause via error breakdown in dashboard
2. Fix underlying issue
3. Enable DLQ consumer: `DLQ_CONSUMER_ENABLED=true`
4. Monitor processing until queue drains
5. Disable consumer: `DLQ_CONSUMER_ENABLED=false`
### Archive Worker Not Running
**Symptom:** DLQ messages not archiving after TTL expiry
**Check:**
```bash
ps aux | grep dlq_archival_worker
```
**Fix:**
```bash
node workers/dlq_archival_worker.js &
```
### No Alert Emails
**Symptom:** DLQ > threshold but no email received
**Checks:**
1. `DLQ_ALERT_ENABLED=true` in env?
2. `NO_EMAIL_MODE=false` in env?
3. SMTP credentials configured?
4. Check last alert time (throttled to 1/hour)?
**Test Email:**
```javascript
// Run from the server root; wrap in an async context since
// sendAdminNotification returns a promise.
const mailer = require('./helpers/mailer');
(async () => {
  await mailer.sendAdminNotification('Test', 'Testing DLQ alerts');
})();
```
### High Memory Usage
**Symptom:** Archival worker consuming excessive memory
**Cause:** Large message backlog in archive queue
**Fix:**
1. Verify `prefetch` in the archival worker is kept low (currently 1)
2. Process archive queue in batches
3. Add memory monitoring alerts
---
## Performance Tuning
### Adjust Check Interval
Reduce alert latency:
```bash
DLQ_ALERT_INTERVAL_MS=60000 # Check every minute instead of every 5 minutes
```
### Optimize Archive Storage
Compress archived files:
```javascript
// In dlq_archival_worker.js, add gzip compression:
const zlib = require('zlib');
const fs = require('fs/promises');

const compressed = zlib.gzipSync(JSON.stringify(archiveRecord));
await fs.writeFile(filepath + '.gz', compressed);
```
### Adjust Retention
For high-volume systems:
```bash
DLQ_RETENTION_DAYS=30 # Reduce to 30 days
```
---
## API Reference
### GET /api/health
Returns overall system health including DLQ component.
**Response:**
```json
{
  "components": {
    "dlq": {
      "status": "healthy|degraded|unhealthy",
      "messageCount": 5,
      "threshold": 20,
      "critical": 50
    }
  }
}
```
### GET /api/dlq/partner_tasks/stats
Get comprehensive DLQ statistics.
**Response:**
```json
{
  "dlq": {
    "messageCount": 25,
    "consumerCount": 0,
    "queueName": "partner_tasks_failed"
  },
  "trackers": {
    "failed": 25,
    "processing": 3,
    "downloaded": 10,
    "processed": 1523,
    "archived": 45
  },
  "recentFailures": [...]
}
```
### GET /api/dlq/partner_tasks/messages?limit=50
Peek at DLQ messages without consuming.
### POST /api/dlq/:queueName/retryAll
Retry all messages currently in the DLQ.
### POST /api/dlq/:queueName/retryByPosition
Retry messages by position range.
### POST /api/dlq/:queueName/retryByHeader
Retry messages matching specific header values.
### POST /api/dlq/:queueName/process
Process all eligible DLQ messages with intelligent categorization.
### DELETE /api/dlq/:queueName/purge
Purge entire DLQ (requires confirmation).
---
## Best Practices
1. **Never Enable DLQ Consumer Permanently**
- Set `DLQ_CONSUMER_ENABLED=true` only during active recovery
- Return to `false` after queue drains to prevent blind retries
2. **Monitor Error Categories**
- Review dashboard regularly for patterns
- High validation errors → Data quality issues
- High transient errors → Infrastructure problems
3. **Adjust Thresholds Per Environment**
- Production: `DLQ_ALERT_THRESHOLD=20`, `DLQ_ALERT_CRITICAL=50`
- Staging: Higher thresholds acceptable
- Development: Consider disabling alerts (`DLQ_ALERT_ENABLED=false`)
4. **Archive Retention Planning**
- Default 365 days suitable for most compliance needs
- Consider lifecycle policy for `dlq_archives/` directory
- Old archives can be compressed or moved to cold storage
5. **Correlate with PartnerLogTracker**
- Query PartnerLogTracker by customerId and partnerCode
- Check message body for logFileName and other task details
- Match against tracker records using filter criteria
- Cross-reference for complete failure analysis
- Update PartnerLogTracker.errorMessage for business reporting
6. **Regular Archive Cleanup**
- Setup cron job to delete/compress old archives
- Example: Keep 2 years, then delete
```bash
find ./dlq_archives -type f -mtime +730 -delete
```
7. **Test Alert System**
- Manually trigger test alerts during setup
- Verify email delivery and formatting
- Confirm throttling behavior
8. **Health Check Integration**
- Add DLQ health to monitoring dashboards (Grafana, Datadog)
- Alert on `status: "unhealthy"` in CI/CD pipeline
- Include in uptime checks
---
## Migration Guide
### From Old DLQ System
If migrating from PartnerLogTracker-based retry to queue-native DLQ:
1. **Deploy New System:**
```bash
# Update env with DLQ settings
DLQ_RETENTION_DAYS=365
DLQ_ALERT_ENABLED=true
DLQ_CONSUMER_ENABLED=false
# Restart workers
pm2 restart partner_sync_worker
pm2 start workers/dlq_archival_worker.js --name dlq-archival
```
2. **Monitor Both Systems:**
- Old DLQ code still handles in-flight messages
- New system captures new failures
- Gradually phase out old retry logic
3. **Cleanup:**
- Remove old DLQ retry code from controllers
- Keep PartnerLogTracker for BI purposes
- Update documentation
---
## Support
For issues or questions:
- Check logs: `workers/*.rlog` files
- Review RabbitMQ management console: `http://localhost:15672`
- Check archive directory: `ls -lah dlq_archives/`
- Contact: trungh@agnav.com
---
## See Also
- **[📖 DLQ Index](DLQ_INDEX.md)** - Documentation overview
- **[🚀 Quick Start Guide](DLQ_QUICKSTART.md)** - Get started quickly
- **[📚 API Reference](DLQ_API_REFERENCE.md)** - Complete API documentation
- **[🔧 Operations Guide](DLQ_OPERATIONS.md)** - Advanced operations
### 🔗 Related Resources
- [Web Dashboard](../public/dlq-monitor.html) - Monitoring interface
---
**Last Updated:** December 18, 2025
**System Version:** 1.0.0
**Author:** AgMission Platform Team