# Fatal Error Handling Architecture

## Overview

The AgMission server and workers use a shared, robust fatal error handling system that:

- Writes a last-fatal JSON report atomically (preventing corruption)
- Archives corrupt log files automatically
- Throttles duplicate errors to avoid email floods
- Optionally sends admin notifications via email
- Offers configurable process exit behavior for clean restarts
## Why This Solution?

### Problem with Previous Approach

The original error-handler package used key-file-storage, which performed read-modify-write operations:

- **Corruption risk**: Under concurrent errors or abrupt termination, JSON could be left malformed
- **Crash loops**: If the `.rlog` file was corrupt, the next error would fail to parse it and crash again
- **Forced exit**: Always called `process.exit(1)`, even for benign client disconnect errors
### Current Solution Benefits

- **Atomic writes**: Uses `fs.writeFile` + `fs.rename` to ensure valid JSON or no write at all
- **Corruption recovery**: Detects bad JSON, archives it, and continues with a fresh log
- **Flexible exit**: Opt-in via `FATAL_EXIT_ON_ERROR` with a configurable delay for graceful cleanup
- **Intelligent ignore**: The server filters HTTP stream errors (client disconnects, finalhandler cleanups)
- **Shared helper**: A single implementation (`helpers/process_fatal_handlers.js`) for consistency
## Architecture

### Core Components

#### 1. `helpers/fatal_error_reporter.js`

Low-level module responsible for:

- Safe read/write of `.rlog` files
- Atomic JSON persistence
- Throttling duplicate errors
- Optional email via the existing `mailer`

**Key Functions:**
```javascript
await reportFatal({
  filePath: './agm_server.rlog',
  kind: 'server:uncaughtException',
  error: err,
  message: err.stack,
  throttleMs: 120000,
  emailEnabled: true,
  emailTo: 'admin@example.com',
  mailer,
});
```
#### 2. `helpers/process_fatal_handlers.js`

High-level registration helper that:

- Attaches `uncaughtException` and `unhandledRejection` handlers
- Applies ignore filters (e.g., HTTP stream errors for the server)
- Calls `reportFatal` when appropriate
- Optionally exits the process after a delay
**Key Functions:**

```javascript
const { registerFatalHandlers, createServerIgnore } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env,
  debug,
  kindPrefix: 'server',
  reportFilePath: env.FATAL_REPORT_FILE,
  ignore: createServerIgnore(), // Server-specific filter
});
```
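A plausible sketch of the kind of predicate `createServerIgnore()` returns is shown below; the specific error codes and message patterns are assumptions for illustration, not the actual filter's rules:

```javascript
// Hypothetical ignore predicate: returns true for errors that look like
// benign HTTP stream failures (client disconnects), which the fatal
// handler should skip instead of reporting. Codes/patterns are examples.
function createServerIgnoreSketch() {
  const ignoredCodes = new Set(['ECONNRESET', 'EPIPE', 'ERR_STREAM_PREMATURE_CLOSE']);
  return (err) =>
    Boolean(
      err &&
        (ignoredCodes.has(err.code) ||
          /premature close|write after end/i.test(err.message || ''))
    );
}
```

Keeping the filter a pure predicate makes it easy to unit-test separately from the process handlers.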
### Applied In

- **Server**: `server.js` (with HTTP stream ignore filter)
- **Job Worker**: `workers/job_worker.js`
- **Partner Sync Worker**: `workers/partner_sync_worker.js`
- **Partner Polling Worker**: `workers/partner_data_polling_worker.js`
- **Cleanup Worker**: `workers/cleanup_worker.js`
- **Invoice Worker**: `workers/invoice_worker.js`
- **Obstacle Worker**: `workers/obstacle_worker.js`

Each worker logs to its own `.rlog` file (e.g., `job_worker.rlog`, `partner_sync_worker.rlog`).
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `FATAL_REPORT_ENABLED` | `false` | Enable `.rlog` + email reporting |
| `FATAL_REPORT_FILE` | `./agm_server.rlog` | Path to the last-fatal JSON log |
| `FATAL_REPORT_EMAIL_ENABLED` | `false` | Send email to admin on fatal error |
| `FATAL_REPORT_EMAIL_TO` | `AGM_ADM_EMAIL` | Admin email address |
| `FATAL_EXIT_ON_ERROR` | `true` | Exit process after fatal error (production) |
| `FATAL_EXIT_DELAY_MS` | `1500` | Delay before exit (for graceful cleanup) |
| `FATAL_THROTTLE_MS` | `120000` | Minimum time between identical error reports (2 min) |
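The `FATAL_THROTTLE_MS` behavior can be illustrated with a small sketch; keying duplicates on kind plus message is an assumption about how the reporter identifies them, and `shouldReport` is not the module's actual function name:

```javascript
// Hypothetical duplicate-suppression check: a report is skipped if the
// same kind and message were already recorded within the throttle window.
function shouldReport(lastReport, kind, message, throttleMs, now = Date.now()) {
  if (!lastReport) return true; // nothing recorded yet
  const isDuplicate = lastReport.kind === kind && lastReport.message === message;
  return !isDuplicate || now - lastReport.timestamp >= throttleMs;
}
```

A different error (or the same error after the window elapses) always gets through, so genuine new failures are never silenced.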
### Development vs Production

**Development** (`environment.env`):

```ini
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=false
FATAL_EXIT_ON_ERROR=false  # Don't crash-loop during dev
```

**Production** (`environment_prod.env`):

```ini
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=true
FATAL_EXIT_ON_ERROR=true  # Let PM2/systemd restart cleanly
```
## HTTP/2 and Express Compatibility

### The Issue

Express expects HTTP/1.1-style `IncomingMessage` and `ServerResponse` objects. When Node's native `http2.createSecureServer()` advertises `h2` via ALPN, browsers negotiate HTTP/2 and send native HTTP/2 streams, which Express cannot handle properly. This causes:

- Hanging requests (spinning browser)
- "Cannot read properties of undefined (reading 'readable')" errors
- Finalhandler cleanup exceptions
### The Solution

- **Default to HTTPS/1.1**: `HTTP2_ENABLED=false` (recommended)
- **If using an HTTP/2 server**: Set `HTTP2_ADVERTISE_H2=false` to prevent browser negotiation
- **Production setup**: Terminate HTTP/2 at nginx and proxy HTTP/1.1 to Node/Express

**Env Configuration:**

```ini
# Let nginx handle HTTP/2, Node stays HTTP/1.1
HTTP2_ENABLED=false
HTTP2_ADVERTISE_H2=false
```
Nginx Config Example:
server {
listen 443 ssl http2;
server_name app.example.com;
location / {
proxy_pass http://127.0.0.1:4100; # Node server (HTTP/1.1)
proxy_http_version 1.1;
proxy_set_header Connection "";
}
}
## Testing

### Automated Tests

Run the test suite:

```shell
cd /path/to/server
node test_fatal_error_reporter.js
```
Tests cover:
- Atomic write (no corruption)
- Corrupt JSON recovery (archives bad files)
- Throttling (duplicate errors within window)
- Different errors not throttled
- Process handler integration + ignore filters
### Manual Testing

#### Test Fatal Reporting

```javascript
// In a server or worker
throw new Error('Test fatal error');
```

Check:

- `.rlog` file created with valid JSON
- Email sent (if `FATAL_REPORT_EMAIL_ENABLED=true`)
- Process exits after the delay (if `FATAL_EXIT_ON_ERROR=true`)
#### Test Corruption Recovery

```shell
# Corrupt the log file
echo '{ "broken": json }' > agm_server.rlog

# Trigger an error (will archive the corrupt file)
node -e "throw new Error('test')"

# Check for the archived file
ls -l agm_server.rlog.corrupt.*
```
#### Test Throttling

```javascript
// Trigger the same error twice within the throttle window
// (e.g., run this snippet, then run it again in under 2 minutes):
const err = new Error('Duplicate test');
err.code = 'DUP_TEST';
throw err;
// The second occurrence should be throttled
// (no new timestamp in the .rlog file).
```
## Troubleshooting

### `.rlog` File is Corrupt

**Symptom**: Server/worker crashes on startup with a JSON parse error

**Solution**: Automatic; the reporter will archive corrupt files and create a fresh log
### Email Not Sending

Check:

- `FATAL_REPORT_EMAIL_ENABLED=true`
- `FATAL_REPORT_EMAIL_TO` set or `AGM_ADM_EMAIL` configured
- SMTP settings in `environment.env` are correct
- `NO_EMAIL_MODE=false`
### Process Not Exiting After Fatal Error

Check:

- `FATAL_EXIT_ON_ERROR=true` (production)
- `FATAL_EXIT_DELAY_MS` allows time for cleanup (default 1500 ms)
- The worker may have cleanup handlers (SIGINT/SIGTERM) preventing immediate exit
### HTTP Stream Errors Still Appearing

These are expected and ignored when `createServerIgnore()` is used. They indicate client disconnects and don't trigger reports or exits. Look for:

```
agm:server HTTP stream error (ignored - likely client disconnect): ...
```
## Migration Notes

### From the error-handler Package

If you see:

```javascript
const errorHandler = require('error-handler').errorHandler;
errorHandler.registerUnCaughtProcessErrorsHandler(process, logPath);
```

Replace with:

```javascript
const { registerFatalHandlers } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env: require('./helpers/env'),
  debug: require('debug')('your:namespace'),
  kindPrefix: 'your_worker_name',
  reportFilePath: path.join(__dirname, 'your_worker.rlog'),
});
```
### Clean Up Old .rlog Files

The new system writes JSON only (not the binary format from key-file-storage). Old `.rlog` files can be removed or archived.
## Best Practices

- **Always enable reporting in production**: Set `FATAL_REPORT_ENABLED=true`
- **Enable email for critical services**: Set `FATAL_REPORT_EMAIL_ENABLED=true` for the server and partner workers
- **Don't exit on error in dev**: Set `FATAL_EXIT_ON_ERROR=false` to avoid crash loops during debugging
- **Use a descriptive `kindPrefix`**: Helps identify which worker/service crashed
- **Monitor `.rlog` files**: Set up log rotation or alerts; the presence of a file indicates a recent crash
- **Test email notifications**: Verify admin emails arrive before deploying to production
## Future Enhancements
- Add structured logging integration (JSON output to stdout)
- Expose metrics endpoint for fatal error counts
- Support multiple admin email recipients
- Add webhook notification option (e.g., Slack, PagerDuty)
- Centralized fatal log aggregation for distributed workers
## Related Documentation

- `PARTNER_INTEGRATION_ARCHITECTURE.md` - Partner integration architecture
- `WORKER_RESPONSIBILITIES_UPDATE.md` - Worker responsibilities (archived)
- `DLQ_INDEX.md` - DLQ and error recovery
## Support

For issues or questions:

- Check the test suite: `node tests/test_fatal_error_reporter.js`
- Review `.rlog` files in the server root and the `workers/` directory
- Verify env variables in `environment.env` or `environment_prod.env`
- Check PM2 logs: `pm2 logs agm-server --lines 100`