Fatal Error Handling Architecture

Overview

The AgMission server and workers use a shared, robust fatal error handling system that:

  • Writes a last-fatal JSON report atomically (preventing corruption)
  • Archives corrupt log files automatically
  • Throttles duplicate errors to avoid email floods
  • Optionally sends admin notifications via email
  • Supports configurable process exit behavior for clean restarts

Why This Solution?

Problem with Previous Approach

The original error-handler package used key-file-storage, which performed read-modify-write operations:

  1. Corruption risk: Under concurrent errors or abrupt termination, JSON could be left malformed
  2. Crash loops: If .rlog was corrupt, the next error would fail to parse it and crash again
  3. Forced exit: Always called process.exit(1), even for benign client disconnect errors

Current Solution Benefits

  • Atomic writes: Uses fs.writeFile + fs.rename to ensure valid JSON or no write at all
  • Corruption recovery: Detects bad JSON, archives it, and continues with a fresh log
  • Flexible exit: Opt-in via FATAL_EXIT_ON_ERROR with configurable delay for graceful cleanup
  • Intelligent ignore: Server filters HTTP stream errors (client disconnects, finalhandler cleanups)
  • Shared helper: Single implementation (helpers/process_fatal_handlers.js) for consistency

Architecture

Core Components

1. helpers/fatal_error_reporter.js

Low-level module responsible for:

  • Safe read/write of .rlog files
  • Atomic JSON persistence
  • Throttling duplicate errors
  • Optional email via existing mailer

Key Functions:

await reportFatal({
  filePath: './agm_server.rlog',
  kind: 'server:uncaughtException',
  error: err,
  message: err.stack,
  throttleMs: 120000,
  emailEnabled: true,
  emailTo: 'admin@example.com',
  mailer,
});

2. helpers/process_fatal_handlers.js

High-level registration helper that:

  • Attaches uncaughtException and unhandledRejection handlers
  • Applies ignore filters (e.g., HTTP stream errors for server)
  • Calls reportFatal when appropriate
  • Optionally exits process after delay

Key Functions:

const { registerFatalHandlers, createServerIgnore } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env,
  debug,
  kindPrefix: 'server',
  reportFilePath: env.FATAL_REPORT_FILE,
  ignore: createServerIgnore(), // Server-specific filter
});
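As a rough sketch, an ignore filter like createServerIgnore() could match the HTTP stream error signatures described later in this document. The predicate below is an assumption for illustration (hence the Sketch suffix); the real filter in helpers/process_fatal_handlers.js may check different fields.

```javascript
// Hypothetical sketch of an ignore predicate for benign HTTP stream
// errors (client disconnects, finalhandler cleanups). Returns true
// when the error should NOT trigger a fatal report or exit.
function createServerIgnoreSketch() {
  const ignoredCodes = new Set(['ECONNRESET', 'EPIPE', 'ERR_STREAM_PREMATURE_CLOSE']);
  return function shouldIgnore(err) {
    if (!err) return false;
    if (ignoredCodes.has(err.code)) return true;
    // finalhandler cleanup crash seen with HTTP/2 streams under Express
    return /reading 'readable'/.test(err.message || '');
  };
}
```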

Applied In

  • Server: server.js (with HTTP stream ignore filter)
  • Job Worker: workers/job_worker.js
  • Partner Sync Worker: workers/partner_sync_worker.js
  • Partner Polling Worker: workers/partner_data_polling_worker.js
  • Cleanup Worker: workers/cleanup_worker.js
  • Invoice Worker: workers/invoice_worker.js
  • Obstacle Worker: workers/obstacle_worker.js

Each worker logs to its own .rlog file (e.g., job_worker.rlog, partner_sync_worker.rlog).

Configuration

Environment Variables

Variable                     Default            Description
FATAL_REPORT_ENABLED         false              Enable .rlog + email reporting
FATAL_REPORT_FILE            ./agm_server.rlog  Path to last-fatal JSON log
FATAL_REPORT_EMAIL_ENABLED   false              Send email to admin on fatal error
FATAL_REPORT_EMAIL_TO        AGM_ADM_EMAIL      Admin email address
FATAL_EXIT_ON_ERROR          true               Exit process after fatal error (production)
FATAL_EXIT_DELAY_MS          1500               Delay before exit (for graceful cleanup)
FATAL_THROTTLE_MS            120000             Minimum time between identical error reports (2 min)
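Resolving these variables with their documented defaults might look like the sketch below. The variable names and defaults come from the table above; the helper readFatalConfig itself is an assumption, not the project's actual env module.

```javascript
// Sketch: resolve fatal-handling config from process.env with the
// documented defaults. Boolean envs count as true only for the exact
// string 'true'.
function readFatalConfig(env = process.env) {
  const bool = (v, dflt) => (v === undefined ? dflt : v === 'true');
  const int = (v, dflt) => (v === undefined ? dflt : parseInt(v, 10));
  return {
    enabled: bool(env.FATAL_REPORT_ENABLED, false),
    reportFile: env.FATAL_REPORT_FILE || './agm_server.rlog',
    emailEnabled: bool(env.FATAL_REPORT_EMAIL_ENABLED, false),
    emailTo: env.FATAL_REPORT_EMAIL_TO || env.AGM_ADM_EMAIL,
    exitOnError: bool(env.FATAL_EXIT_ON_ERROR, true),
    exitDelayMs: int(env.FATAL_EXIT_DELAY_MS, 1500),
    throttleMs: int(env.FATAL_THROTTLE_MS, 120000),
  };
}
```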

Development vs Production

Development (environment.env):

FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=false
FATAL_EXIT_ON_ERROR=false  # Don't crash-loop during dev

Production (environment_prod.env):

FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=true
FATAL_EXIT_ON_ERROR=true   # Let PM2/systemd restart cleanly

HTTP/2 and Express Compatibility

The Issue

Express expects HTTP/1.1-style IncomingMessage and ServerResponse objects. When Node's native http2.createSecureServer() advertises h2 via ALPN, browsers negotiate HTTP/2 and send native HTTP/2 streams, which Express cannot handle properly. This causes:

  • Hanging requests (spinning browser)
  • "Cannot read properties of undefined (reading 'readable')" errors
  • Finalhandler cleanup exceptions

The Solution

  1. Default to HTTPS/1.1: HTTP2_ENABLED=false (recommended)
  2. If using HTTP/2 server: Set HTTP2_ADVERTISE_H2=false to prevent browser negotiation
  3. Production setup: Terminate HTTP/2 at nginx, proxy HTTP/1.1 to Node/Express
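The decision the flags drive can be sketched as follows. The function resolveServerMode is hypothetical, but the ALPN behavior it models is standard: an HTTPS server only offers http/1.1, while an HTTP/2 server's ALPNProtocols list controls whether browsers may negotiate h2.

```javascript
// Sketch: pick the server module and ALPN protocol list from the
// HTTP2_* flags described above. With HTTP2_ENABLED=false the app
// serves plain HTTPS/1.1 and lets nginx terminate HTTP/2.
function resolveServerMode(env) {
  if (env.HTTP2_ENABLED !== 'true') {
    return { module: 'https', ALPNProtocols: ['http/1.1'] };
  }
  // HTTP/2 server, but optionally refuse to advertise h2 so browsers
  // stay on HTTP/1.1 (which Express can handle).
  const advertiseH2 = env.HTTP2_ADVERTISE_H2 !== 'false';
  return {
    module: 'http2',
    ALPNProtocols: advertiseH2 ? ['h2', 'http/1.1'] : ['http/1.1'],
  };
}
```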

Env Configuration:

# Let nginx handle HTTP/2, Node stays HTTP/1.1
HTTP2_ENABLED=false
HTTP2_ADVERTISE_H2=false

Nginx Config Example:

server {
  listen 443 ssl http2;
  server_name app.example.com;

  location / {
    proxy_pass http://127.0.0.1:4100;  # Node server (HTTP/1.1)
    proxy_http_version 1.1;
    proxy_set_header Connection "";
  }
}

Testing

Automated Tests

Run the test suite:

cd /path/to/server
node test_fatal_error_reporter.js

Tests cover:

  1. Atomic write (no corruption)
  2. Corrupt JSON recovery (archives bad files)
  3. Throttling (duplicate errors within window)
  4. Different errors not throttled
  5. Process handler integration + ignore filters

Manual Testing

Test Fatal Reporting

// In server or worker
throw new Error('Test fatal error');

Check:

  • .rlog file created with valid JSON
  • Email sent (if FATAL_REPORT_EMAIL_ENABLED=true)
  • Process exit after delay (if FATAL_EXIT_ON_ERROR=true)

Test Corruption Recovery

# Corrupt the log file
echo '{ "broken": json }' > agm_server.rlog

# Trigger a fatal error in a process that has the fatal handlers
# registered (a bare `node -e "throw ..."` has no handlers attached);
# the reporter will archive the corrupt file on the next report

# Check for archived file
ls -l agm_server.rlog.corrupt.*

Test Throttling

// Trigger the same error twice within the throttle window. Note that a
// plain synchronous double-throw never reaches the second throw, so use
// two async callbacks (each becomes an uncaughtException):
const err = new Error('Duplicate test');
err.code = 'DUP_TEST';
setTimeout(() => { throw err; }, 0);
setTimeout(() => { throw err; }, 1000); // Should be throttled (no new timestamp in .rlog)
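The throttle check being exercised here can be sketched as a timestamp comparison keyed on the error identity. This is a simplification; shouldReport and its field names are assumptions, not the reporter's actual API.

```javascript
// Sketch: suppress a report if an identical error (same kind + message)
// was already reported within throttleMs. lastReport is the previously
// persisted entry ({ kind, message, ts }) or null.
function shouldReport(lastReport, { kind, message, throttleMs, now = Date.now() }) {
  if (!lastReport) return true;                 // nothing reported yet
  const sameError = lastReport.kind === kind && lastReport.message === message;
  if (!sameError) return true;                  // different errors never throttle
  return now - lastReport.ts >= throttleMs;     // identical error: respect the window
}
```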

Troubleshooting

.rlog File is Corrupt

Symptom: Server/worker crashes on startup with JSON parse error

Solution: Handled automatically; the reporter archives the corrupt file and continues with a fresh log

Email Not Sending

Check:

  1. FATAL_REPORT_EMAIL_ENABLED=true
  2. FATAL_REPORT_EMAIL_TO set or AGM_ADM_EMAIL configured
  3. SMTP settings in environment.env are correct
  4. NO_EMAIL_MODE=false

Process Not Exiting After Fatal Error

Check:

  1. FATAL_EXIT_ON_ERROR=true (production)
  2. FATAL_EXIT_DELAY_MS allows time for cleanup (default 1500ms)
  3. Worker may have cleanup handlers (SIGINT/SIGTERM) preventing immediate exit

HTTP Stream Errors Still Appearing

These are expected and ignored when createServerIgnore() is used. They indicate client disconnects and don't trigger reports or exits. Look for:

agm:server HTTP stream error (ignored - likely client disconnect): ...

Migration Notes

From error-handler Package

If you see:

const errorHandler = require('error-handler').errorHandler;
errorHandler.registerUnCaughtProcessErrorsHandler(process, logPath);

Replace with:

const { registerFatalHandlers } = require('./helpers/process_fatal_handlers');
registerFatalHandlers(process, {
  env: require('./helpers/env'),
  debug: require('debug')('your:namespace'),
  kindPrefix: 'your_worker_name',
  reportFilePath: path.join(__dirname, 'your_worker.rlog'),
});

Cleanup Old .rlog Files

The new system writes JSON only (not the binary format from key-file-storage). Old .rlog files can be removed or archived.

Best Practices

  1. Always enable reporting in production: Set FATAL_REPORT_ENABLED=true
  2. Enable email for critical services: Set FATAL_REPORT_EMAIL_ENABLED=true for server and partner workers
  3. Don't exit on error in dev: Set FATAL_EXIT_ON_ERROR=false to avoid crash loops during debugging
  4. Use descriptive kindPrefix: Helps identify which worker/service crashed
  5. Monitor .rlog files: Set up log rotation or alerts if files exist (indicates recent crashes)
  6. Test email notifications: Verify admin emails arrive before deploying to production

Future Enhancements

  • Add structured logging integration (JSON output to stdout)
  • Expose metrics endpoint for fatal error counts
  • Support multiple admin email recipients
  • Add webhook notification option (e.g., Slack, PagerDuty)
  • Centralized fatal log aggregation for distributed workers

Support

For issues or questions:

  1. Check test suite: node tests/test_fatal_error_reporter.js
  2. Review .rlog files in server root and workers/ directory
  3. Verify env variables in environment.env or environment_prod.env
  4. Check PM2 logs: pm2 logs agm-server --lines 100