# Fatal Error Handling Architecture

## Overview

The AgMission server and workers use a shared, robust fatal error handling system that:

- Writes a last-fatal JSON report atomically (preventing corruption)
- Archives corrupt log files automatically
- Throttles duplicate errors to avoid email floods
- Optionally sends admin notifications via email
- Supports configurable process-exit behavior for clean restarts

## Why This Solution?

### Problem with Previous Approach

The original `error-handler` package used `key-file-storage`, which performed read-modify-write operations:

1. **Corruption risk**: Under concurrent errors or abrupt termination, the JSON could be left malformed
2. **Crash loops**: If the `.rlog` was corrupt, the next error would fail to parse it and crash again
3. **Forced exit**: Always called `process.exit(1)`, even for benign client-disconnect errors

### Current Solution Benefits

- **Atomic writes**: Uses `fs.writeFile` + `fs.rename` to ensure valid JSON or no write at all
- **Corruption recovery**: Detects bad JSON, archives it, and continues with a fresh log
- **Flexible exit**: Opt-in via `FATAL_EXIT_ON_ERROR`, with a configurable delay for graceful cleanup
- **Intelligent ignore**: The server filters HTTP stream errors (client disconnects, finalhandler cleanups)
- **Shared helper**: A single implementation (`helpers/process_fatal_handlers.js`) for consistency

## Architecture

### Core Components

#### 1. `helpers/fatal_error_reporter.js`

Low-level module responsible for:

- Safe read/write of `.rlog` files
- Atomic JSON persistence
- Throttling duplicate errors
- Optional email via the existing `mailer`

**Key Functions**:

```javascript
await reportFatal({
  filePath: './agm_server.rlog',
  kind: 'server:uncaughtException',
  error: err,
  message: err.stack,
  throttleMs: 120000,
  emailEnabled: true,
  emailTo: 'admin@example.com',
  mailer,
});
```

#### 2. `helpers/process_fatal_handlers.js`

High-level registration helper that:

- Attaches `uncaughtException` and `unhandledRejection` handlers
- Applies ignore filters (e.g., HTTP stream errors for the server)
- Calls `reportFatal` when appropriate
- Optionally exits the process after a delay

**Key Functions**:

```javascript
const { registerFatalHandlers, createServerIgnore } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env,
  debug,
  kindPrefix: 'server',
  reportFilePath: env.FATAL_REPORT_FILE,
  ignore: createServerIgnore(), // Server-specific filter
});
```

### Applied In

- **Server**: `server.js` (with HTTP stream ignore filter)
- **Job Worker**: `workers/job_worker.js`
- **Partner Sync Worker**: `workers/partner_sync_worker.js`
- **Partner Polling Worker**: `workers/partner_data_polling_worker.js`
- **Cleanup Worker**: `workers/cleanup_worker.js`
- **Invoice Worker**: `workers/invoice_worker.js`
- **Obstacle Worker**: `workers/obstacle_worker.js`

Each worker logs to its own `.rlog` file (e.g., `job_worker.rlog`, `partner_sync_worker.rlog`).
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `FATAL_REPORT_ENABLED` | `false` | Enable `.rlog` + email reporting |
| `FATAL_REPORT_FILE` | `./agm_server.rlog` | Path to the last-fatal JSON log |
| `FATAL_REPORT_EMAIL_ENABLED` | `false` | Send email to the admin on fatal error |
| `FATAL_REPORT_EMAIL_TO` | `AGM_ADM_EMAIL` | Admin email address |
| `FATAL_EXIT_ON_ERROR` | `true` | Exit the process after a fatal error (production) |
| `FATAL_EXIT_DELAY_MS` | `1500` | Delay before exit (for graceful cleanup) |
| `FATAL_THROTTLE_MS` | `120000` | Minimum time between identical error reports (2 min) |

### Development vs Production

**Development** (`environment.env`):

```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=false
FATAL_EXIT_ON_ERROR=false  # Don't crash-loop during dev
```

**Production** (`environment_prod.env`):

```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=true
FATAL_EXIT_ON_ERROR=true  # Let PM2/systemd restart cleanly
```

## HTTP/2 and Express Compatibility

### The Issue

Express expects HTTP/1.1-style `IncomingMessage` and `ServerResponse` objects. When Node's native `http2.createSecureServer()` advertises `h2` via ALPN, browsers negotiate HTTP/2 and send native HTTP/2 streams, which Express cannot handle properly. This causes:

- Hanging requests (spinning browser)
- "Cannot read properties of undefined (reading 'readable')" errors
- Finalhandler cleanup exceptions

### The Solution

1. **Default to HTTP/1.1 over TLS**: `HTTP2_ENABLED=false` (recommended)
2. **If using the HTTP/2 server**: Set `HTTP2_ADVERTISE_H2=false` to prevent browser negotiation
3. **Production setup**: Terminate HTTP/2 at nginx and proxy HTTP/1.1 to Node/Express

**Env Configuration**:

```env
# Let nginx handle HTTP/2; Node stays HTTP/1.1
HTTP2_ENABLED=false
HTTP2_ADVERTISE_H2=false
```

**Nginx Config Example**:

```nginx
server {
    listen 443 ssl http2;
    server_name app.example.com;

    location / {
        proxy_pass http://127.0.0.1:4100;  # Node server (HTTP/1.1)
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

## Testing

### Automated Tests

Run the test suite:

```bash
cd /path/to/server
node test_fatal_error_reporter.js
```

**Tests cover**:

1. Atomic write (no corruption)
2. Corrupt JSON recovery (archives bad files)
3. Throttling (duplicate errors within the window)
4. Different errors not throttled
5. Process handler integration + ignore filters

### Manual Testing

#### Test Fatal Reporting

```javascript
// In server or worker
throw new Error('Test fatal error');
```

Check:

- `.rlog` file created with valid JSON
- Email sent (if `FATAL_REPORT_EMAIL_ENABLED=true`)
- Process exits after delay (if `FATAL_EXIT_ON_ERROR=true`)

#### Test Corruption Recovery

```bash
# Corrupt the log file
echo '{ "broken": json }' > agm_server.rlog

# Trigger an error (will archive the corrupt file)
node -e "throw new Error('test')"

# Check for the archived file
ls -l agm_server.rlog.corrupt.*
```

#### Test Throttling

```javascript
// Trigger the same error twice quickly
const err = new Error('Duplicate test');
err.code = 'DUP_TEST';
throw err;
// Wait < 2 minutes, trigger again
throw err; // Should be throttled (no new timestamp in .rlog)
```

## Troubleshooting

### `.rlog` File is Corrupt

**Symptom**: Server/worker crashes on startup with a JSON parse error

**Solution**: Automatic: the reporter archives corrupt files and creates a fresh log

### Email Not Sending

**Check**:

1. `FATAL_REPORT_EMAIL_ENABLED=true`
2. `FATAL_REPORT_EMAIL_TO` set or `AGM_ADM_EMAIL` configured
3. SMTP settings in `environment.env` are correct
4. `NO_EMAIL_MODE=false`

### Process Not Exiting After Fatal Error

**Check**:

1. `FATAL_EXIT_ON_ERROR=true` (production)
2. `FATAL_EXIT_DELAY_MS` allows time for cleanup (default 1500 ms)
3. The worker may have cleanup handlers (SIGINT/SIGTERM) preventing immediate exit

### HTTP Stream Errors Still Appearing

These are **expected and ignored** when `createServerIgnore()` is used. They indicate client disconnects and don't trigger reports or exits. Look for:

```
agm:server HTTP stream error (ignored - likely client disconnect): ...
```

## Migration Notes

### From `error-handler` Package

If you see:

```javascript
const errorHandler = require('error-handler').errorHandler;
errorHandler.registerUnCaughtProcessErrorsHandler(process, logPath);
```

Replace it with:

```javascript
const { registerFatalHandlers } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env: require('./helpers/env'),
  debug: require('debug')('your:namespace'),
  kindPrefix: 'your_worker_name',
  reportFilePath: path.join(__dirname, 'your_worker.rlog'),
});
```

### Cleanup Old `.rlog` Files

The new system writes JSON only (not the binary format from `key-file-storage`). Old `.rlog` files can be removed or archived.

## Best Practices

1. **Always enable reporting in production**: Set `FATAL_REPORT_ENABLED=true`
2. **Enable email for critical services**: Set `FATAL_REPORT_EMAIL_ENABLED=true` for the server and partner workers
3. **Don't exit on error in dev**: Set `FATAL_EXIT_ON_ERROR=false` to avoid crash loops during debugging
4. **Use a descriptive `kindPrefix`**: Helps identify which worker/service crashed
5. **Monitor `.rlog` files**: Set up log rotation or alerts when the files exist (their presence indicates recent crashes)
6. **Test email notifications**: Verify that admin emails arrive before deploying to production

## Future Enhancements

- [ ] Add structured logging integration (JSON output to stdout)
- [ ] Expose a metrics endpoint for fatal error counts
- [ ] Support multiple admin email recipients
- [ ] Add a webhook notification option (e.g., Slack, PagerDuty)
- [ ] Centralized fatal log aggregation for distributed workers

## Related Documentation

- [PARTNER_INTEGRATION_ARCHITECTURE.md](./PARTNER_INTEGRATION_ARCHITECTURE.md) - Partner integration architecture
- [WORKER_RESPONSIBILITIES_UPDATE.md](./archived/WORKER_RESPONSIBILITIES_UPDATE.md) - Worker responsibilities (archived)
- [DLQ_INDEX.md](./DLQ_INDEX.md) - DLQ and error recovery

## Support

For issues or questions:

1. Check the test suite: `node tests/test_fatal_error_reporter.js`
2. Review `.rlog` files in the server root and `workers/` directory
3. Verify env variables in `environment.env` or `environment_prod.env`
4. Check PM2 logs: `pm2 logs agm-server --lines 100`