# Fatal Error Handling Architecture

## Overview

The AgMission server and workers use a shared, robust fatal error handling system that:

- Writes a last-fatal JSON report atomically (preventing corruption)
- Archives corrupt log files automatically
- Throttles duplicate errors to avoid email floods
- Optionally sends admin notifications via email
- Offers configurable process exit behavior for clean restarts
## Why This Solution?

### Problem with Previous Approach

The original error-handler package used key-file-storage, which performed read-modify-write operations:

- **Corruption risk**: Under concurrent errors or abrupt termination, JSON could be left malformed
- **Crash loops**: If the `.rlog` file was corrupt, the next error would fail to parse it and crash again
- **Forced exit**: Always called `process.exit(1)`, even for benign client disconnect errors
### Current Solution Benefits

- **Atomic writes**: Uses `fs.writeFile` + `fs.rename` to ensure valid JSON or no write at all
- **Corruption recovery**: Detects bad JSON, archives it, and continues with a fresh log
- **Flexible exit**: Opt-in via `FATAL_EXIT_ON_ERROR` with a configurable delay for graceful cleanup
- **Intelligent ignore**: The server filters HTTP stream errors (client disconnects, finalhandler cleanups)
- **Shared helper**: A single implementation (`helpers/process_fatal_handlers.js`) for consistency
## Architecture

### Core Components

#### 1. `helpers/fatal_error_reporter.js`

Low-level module responsible for:

- Safe read/write of `.rlog` files
- Atomic JSON persistence
- Throttling duplicate errors
- Optional email via the existing `mailer`

**Key Functions:**
```javascript
await reportFatal({
  filePath: './agm_server.rlog',
  kind: 'server:uncaughtException',
  error: err,
  message: err.stack,
  throttleMs: 120000,
  emailEnabled: true,
  emailTo: 'admin@example.com',
  mailer,
});
```
#### 2. `helpers/process_fatal_handlers.js`

High-level registration helper that:

- Attaches `uncaughtException` and `unhandledRejection` handlers
- Applies ignore filters (e.g., HTTP stream errors for the server)
- Calls `reportFatal` when appropriate
- Optionally exits the process after a delay
**Key Functions:**

```javascript
const { registerFatalHandlers, createServerIgnore } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env,
  debug,
  kindPrefix: 'server',
  reportFilePath: env.FATAL_REPORT_FILE,
  ignore: createServerIgnore(), // Server-specific filter
});
```
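A plausible sketch of the kind of predicate `createServerIgnore()` returns is shown below; the specific error codes and message patterns are assumptions for illustration, not the actual filter's rules:

```javascript
// Hypothetical ignore predicate: returns true for errors that look like
// benign HTTP stream failures (client disconnects), which the fatal
// handler should skip instead of reporting. Codes/patterns are examples.
function createServerIgnoreSketch() {
  const ignoredCodes = new Set(['ECONNRESET', 'EPIPE', 'ERR_STREAM_PREMATURE_CLOSE']);
  return (err) =>
    Boolean(
      err &&
        (ignoredCodes.has(err.code) ||
          /premature close|write after end/i.test(err.message || ''))
    );
}
```

Keeping the filter a pure predicate makes it easy to unit-test separately from the process handlers.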
### Applied In

- **Server**: `server.js` (with HTTP stream ignore filter)
- **Job Worker**: `workers/job_worker.js`
- **Partner Sync Worker**: `workers/partner_sync_worker.js`
- **Partner Polling Worker**: `workers/partner_data_polling_worker.js`
- **Cleanup Worker**: `workers/cleanup_worker.js`
- **Invoice Worker**: `workers/invoice_worker.js`
- **Obstacle Worker**: `workers/obstacle_worker.js`

Each worker logs to its own `.rlog` file (e.g., `job_worker.rlog`, `partner_sync_worker.rlog`).
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `FATAL_REPORT_ENABLED` | `false` | Enable `.rlog` + email reporting |
| `FATAL_REPORT_FILE` | `./agm_server.rlog` | Path to the last-fatal JSON log |
| `FATAL_REPORT_EMAIL_ENABLED` | `false` | Send email to admin on fatal error |
| `FATAL_REPORT_EMAIL_TO` | `AGM_ADM_EMAIL` | Admin email address |
| `FATAL_EXIT_ON_ERROR` | `true` | Exit process after fatal error (production) |
| `FATAL_EXIT_DELAY_MS` | `1500` | Delay before exit (for graceful cleanup) |
| `FATAL_THROTTLE_MS` | `120000` | Minimum time between identical error reports (2 min) |
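The `FATAL_THROTTLE_MS` behavior can be illustrated with a small sketch; keying duplicates on kind plus message is an assumption about how the reporter identifies them, and `shouldReport` is not the module's actual function name:

```javascript
// Hypothetical duplicate-suppression check: a report is skipped if the
// same kind and message were already recorded within the throttle window.
function shouldReport(lastReport, kind, message, throttleMs, now = Date.now()) {
  if (!lastReport) return true; // nothing recorded yet
  const isDuplicate = lastReport.kind === kind && lastReport.message === message;
  return !isDuplicate || now - lastReport.timestamp >= throttleMs;
}
```

A different error (or the same error after the window elapses) always gets through, so genuine new failures are never silenced.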
### Development vs Production

**Development** (`environment.env`):

```ini
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=false
FATAL_EXIT_ON_ERROR=false  # Don't crash-loop during dev
```

**Production** (`environment_prod.env`):

```ini
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=true
FATAL_EXIT_ON_ERROR=true  # Let PM2/systemd restart cleanly
```
## HTTP/2 and Express Compatibility

### The Issue

Express expects HTTP/1.1-style `IncomingMessage` and `ServerResponse` objects. When Node's native `http2.createSecureServer()` advertises `h2` via ALPN, browsers negotiate HTTP/2 and send native HTTP/2 streams, which Express cannot handle properly. This causes:

- Hanging requests (spinning browser)
- "Cannot read properties of undefined (reading 'readable')" errors
- Finalhandler cleanup exceptions
### The Solution

- **Default to HTTPS/1.1**: `HTTP2_ENABLED=false` (recommended)
- **If using an HTTP/2 server**: Set `HTTP2_ADVERTISE_H2=false` to prevent browser negotiation
- **Production setup**: Terminate HTTP/2 at nginx and proxy HTTP/1.1 to Node/Express

**Env Configuration:**

```ini
# Let nginx handle HTTP/2, Node stays HTTP/1.1
HTTP2_ENABLED=false
HTTP2_ADVERTISE_H2=false
```
Nginx Config Example:
server {
listen 443 ssl http2;
server_name app.example.com;
location / {
proxy_pass http://127.0.0.1:4100; # Node server (HTTP/1.1)
proxy_http_version 1.1;
proxy_set_header Connection "";
}
}
## Testing

### Automated Tests

Run the test suite:

```shell
cd /path/to/server
node test_fatal_error_reporter.js
```
Tests cover:
- Atomic write (no corruption)
- Corrupt JSON recovery (archives bad files)
- Throttling (duplicate errors within window)
- Different errors not throttled
- Process handler integration + ignore filters
### Manual Testing

#### Test Fatal Reporting

```javascript
// In a server or worker
throw new Error('Test fatal error');
```

Check:

- `.rlog` file created with valid JSON
- Email sent (if `FATAL_REPORT_EMAIL_ENABLED=true`)
- Process exits after the delay (if `FATAL_EXIT_ON_ERROR=true`)
#### Test Corruption Recovery

```shell
# Corrupt the log file
echo '{ "broken": json }' > agm_server.rlog

# Trigger an error (will archive the corrupt file)
node -e "throw new Error('test')"

# Check for the archived file
ls -l agm_server.rlog.corrupt.*
```
#### Test Throttling

```javascript
// Trigger the same error twice within the throttle window
// (e.g., run this snippet, then run it again in under 2 minutes):
const err = new Error('Duplicate test');
err.code = 'DUP_TEST';
throw err;
// The second occurrence should be throttled
// (no new timestamp in the .rlog file).
```
## Troubleshooting

### `.rlog` File is Corrupt

**Symptom**: Server/worker crashes on startup with a JSON parse error

**Solution**: Automatic; the reporter will archive corrupt files and create a fresh log
### Email Not Sending

Check:

- `FATAL_REPORT_EMAIL_ENABLED=true`
- `FATAL_REPORT_EMAIL_TO` set or `AGM_ADM_EMAIL` configured
- SMTP settings in `environment.env` are correct
- `NO_EMAIL_MODE=false`
### Process Not Exiting After Fatal Error

Check:

- `FATAL_EXIT_ON_ERROR=true` (production)
- `FATAL_EXIT_DELAY_MS` allows time for cleanup (default 1500 ms)
- The worker may have cleanup handlers (SIGINT/SIGTERM) preventing immediate exit
### HTTP Stream Errors Still Appearing

These are expected and ignored when `createServerIgnore()` is used. They indicate client disconnects and don't trigger reports or exits. Look for:

```
agm:server HTTP stream error (ignored - likely client disconnect): ...
```
## Migration Notes

### From the error-handler Package

If you see:

```javascript
const errorHandler = require('error-handler').errorHandler;
errorHandler.registerUnCaughtProcessErrorsHandler(process, logPath);
```

Replace with:

```javascript
const { registerFatalHandlers } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env: require('./helpers/env'),
  debug: require('debug')('your:namespace'),
  kindPrefix: 'your_worker_name',
  reportFilePath: path.join(__dirname, 'your_worker.rlog'),
});
```
### Clean Up Old .rlog Files

The new system writes JSON only (not the binary format from key-file-storage). Old `.rlog` files can be removed or archived.
## Best Practices

- **Always enable reporting in production**: Set `FATAL_REPORT_ENABLED=true`
- **Enable email for critical services**: Set `FATAL_REPORT_EMAIL_ENABLED=true` for the server and partner workers
- **Don't exit on error in dev**: Set `FATAL_EXIT_ON_ERROR=false` to avoid crash loops during debugging
- **Use a descriptive `kindPrefix`**: Helps identify which worker/service crashed
- **Monitor `.rlog` files**: Set up log rotation or alerts; the presence of a file indicates a recent crash
- **Test email notifications**: Verify admin emails arrive before deploying to production
## Future Enhancements
- Add structured logging integration (JSON output to stdout)
- Expose metrics endpoint for fatal error counts
- Support multiple admin email recipients
- Add webhook notification option (e.g., Slack, PagerDuty)
- Centralized fatal log aggregation for distributed workers
## Related Documentation

- `PARTNER_INTEGRATION_ARCHITECTURE.md` - Partner integration architecture
- `WORKER_RESPONSIBILITIES_UPDATE.md` - Worker responsibilities (archived)
- `DLQ_INDEX.md` - DLQ and error recovery
## Support

For issues or questions:

- Check the test suite: `node tests/test_fatal_error_reporter.js`
- Review `.rlog` files in the server root and the `workers/` directory
- Verify env variables in `environment.env` or `environment_prod.env`
- Check PM2 logs: `pm2 logs agm-server --lines 100`