# Fatal Error Handling Architecture

## Overview

The AgMission server and workers use a shared, robust fatal error handling system that:

- Writes a last-fatal JSON report atomically (preventing corruption)
- Archives corrupt log files automatically
- Throttles duplicate errors to avoid email floods
- Optionally sends admin notifications via email
- Offers configurable process exit behavior for clean restarts
## Why This Solution?

### Problem with Previous Approach

The original `error-handler` package used `key-file-storage`, which performed read-modify-write operations on the log file. This caused several problems:

1. **Corruption risk**: Under concurrent errors or abrupt termination, the JSON could be left malformed
2. **Crash loops**: If the `.rlog` file was corrupt, the next error would fail to parse it and crash again
3. **Forced exit**: The handler always called `process.exit(1)`, even for benign client disconnect errors
### Current Solution Benefits

- **Atomic writes**: Uses `fs.writeFile` + `fs.rename` to guarantee either a valid JSON file or no write at all
- **Corruption recovery**: Detects bad JSON, archives it, and continues with a fresh log
- **Flexible exit**: Opt-in via `FATAL_EXIT_ON_ERROR`, with a configurable delay for graceful cleanup
- **Intelligent ignore**: The server filters HTTP stream errors (client disconnects, finalhandler cleanups)
- **Shared helper**: A single implementation (`helpers/process_fatal_handlers.js`) keeps behavior consistent
## Architecture

### Core Components

#### 1. `helpers/fatal_error_reporter.js`

Low-level module responsible for:

- Safe read/write of `.rlog` files
- Atomic JSON persistence
- Throttling duplicate errors
- Optional email via the existing `mailer`
**Key Functions**:

```javascript
await reportFatal({
  filePath: './agm_server.rlog',
  kind: 'server:uncaughtException',
  error: err,
  message: err.stack,
  throttleMs: 120000,
  emailEnabled: true,
  emailTo: 'admin@example.com',
  mailer,
});
```
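One way the `throttleMs` duplicate-suppression could work is sketched below. This is an assumption about the mechanism, not the reporter's actual code; the real implementation may key errors differently:

```javascript
// Throttle duplicate reports: identical errors reported within
// `throttleMs` of the last report are suppressed. Keyed by kind + message.
function createThrottle(throttleMs) {
  const lastSeen = new Map(); // key -> timestamp of last report

  return function shouldReport(kind, message, now = Date.now()) {
    const key = `${kind}|${message}`;
    const prev = lastSeen.get(key);
    if (prev !== undefined && now - prev < throttleMs) {
      return false; // duplicate within the window: throttled
    }
    lastSeen.set(key, now);
    return true;
  };
}
```

Note that a throttled call does not refresh the timestamp, so a persistently recurring error still produces a report once per window.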
#### 2. `helpers/process_fatal_handlers.js`

High-level registration helper that:

- Attaches `uncaughtException` and `unhandledRejection` handlers
- Applies ignore filters (e.g., HTTP stream errors for the server)
- Calls `reportFatal` when appropriate
- Optionally exits the process after a delay
**Key Functions**:

```javascript
const { registerFatalHandlers, createServerIgnore } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env,
  debug,
  kindPrefix: 'server',
  reportFilePath: env.FATAL_REPORT_FILE,
  ignore: createServerIgnore(), // Server-specific filter
});
```
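An ignore filter in the spirit of `createServerIgnore()` might look roughly like this. The error codes below are common Node.js client-disconnect codes chosen for illustration; the actual set checked by the helper is not shown in this document:

```javascript
// Sketch of an ignore filter for benign HTTP stream errors.
// The codes listed are typical client-disconnect errors; the real
// createServerIgnore() may check a different set.
function createServerIgnoreSketch() {
  const IGNORED_CODES = new Set([
    'ECONNRESET',                 // client dropped the TCP connection
    'EPIPE',                      // write after the client closed
    'ERR_STREAM_PREMATURE_CLOSE', // stream ended before finishing
  ]);

  return function shouldIgnore(err) {
    return Boolean(err && IGNORED_CODES.has(err.code));
  };
}
```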
### Applied In

- **Server**: `server.js` (with HTTP stream ignore filter)
- **Job Worker**: `workers/job_worker.js`
- **Partner Sync Worker**: `workers/partner_sync_worker.js`
- **Partner Polling Worker**: `workers/partner_data_polling_worker.js`
- **Cleanup Worker**: `workers/cleanup_worker.js`
- **Invoice Worker**: `workers/invoice_worker.js`
- **Obstacle Worker**: `workers/obstacle_worker.js`

Each worker logs to its own `.rlog` file (e.g., `job_worker.rlog`, `partner_sync_worker.rlog`).
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `FATAL_REPORT_ENABLED` | `false` | Enable `.rlog` + email reporting |
| `FATAL_REPORT_FILE` | `./agm_server.rlog` | Path to the last-fatal JSON log |
| `FATAL_REPORT_EMAIL_ENABLED` | `false` | Send email to admin on fatal error |
| `FATAL_REPORT_EMAIL_TO` | `AGM_ADM_EMAIL` | Admin email address |
| `FATAL_EXIT_ON_ERROR` | `true` | Exit process after fatal error (production) |
| `FATAL_EXIT_DELAY_MS` | `1500` | Delay before exit (for graceful cleanup) |
| `FATAL_THROTTLE_MS` | `120000` | Minimum time between identical error reports (2 min) |
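Reading these variables with the defaults from the table could be sketched like this. The `readFatalConfig` helper is purely illustrative; the project's actual `helpers/env` module may parse differently:

```javascript
// Parse the fatal-handling env vars with the defaults from the table above.
function readFatalConfig(env = process.env) {
  const bool = (v, dflt) => (v === undefined ? dflt : v === 'true');
  const int = (v, dflt) => (v === undefined ? dflt : parseInt(v, 10));

  return {
    reportEnabled: bool(env.FATAL_REPORT_ENABLED, false),
    reportFile: env.FATAL_REPORT_FILE || './agm_server.rlog',
    emailEnabled: bool(env.FATAL_REPORT_EMAIL_ENABLED, false),
    emailTo: env.FATAL_REPORT_EMAIL_TO || env.AGM_ADM_EMAIL,
    exitOnError: bool(env.FATAL_EXIT_ON_ERROR, true),
    exitDelayMs: int(env.FATAL_EXIT_DELAY_MS, 1500),
    throttleMs: int(env.FATAL_THROTTLE_MS, 120000),
  };
}
```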
### Development vs Production

**Development** (`environment.env`):

```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=false
FATAL_EXIT_ON_ERROR=false # Don't crash-loop during dev
```

**Production** (`environment_prod.env`):

```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=true
FATAL_EXIT_ON_ERROR=true # Let PM2/systemd restart cleanly
```
## HTTP/2 and Express Compatibility

### The Issue

Express expects HTTP/1.1-style `IncomingMessage` and `ServerResponse` objects. When Node's native `http2.createSecureServer()` advertises `h2` via ALPN, browsers negotiate HTTP/2 and send native HTTP/2 streams, which Express cannot handle properly. This causes:

- Hanging requests (spinning browser)
- "Cannot read properties of undefined (reading 'readable')" errors
- Finalhandler cleanup exceptions

### The Solution

1. **Default to HTTPS/1.1**: `HTTP2_ENABLED=false` (recommended)
2. **If using the HTTP/2 server**: Set `HTTP2_ADVERTISE_H2=false` to prevent browser negotiation
3. **Production setup**: Terminate HTTP/2 at nginx and proxy HTTP/1.1 to Node/Express
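The ALPN decision behind options 1 and 2 can be sketched as a small helper. This is a hypothetical illustration, not the server's actual startup code; it assumes the HTTP/2 server is created with `allowHTTP1: true` so HTTP/1.1 fallback connections still work:

```javascript
// Decide which ALPN protocols to advertise based on the env flags.
// When 'h2' is not advertised, browsers negotiate HTTP/1.1, which
// Express can handle (assuming allowHTTP1: true on the h2 server).
function alpnProtocols(env = process.env) {
  const http2Enabled = env.HTTP2_ENABLED === 'true';
  const advertiseH2 = env.HTTP2_ADVERTISE_H2 === 'true';
  if (http2Enabled && advertiseH2) return ['h2', 'http/1.1'];
  return ['http/1.1']; // Express-safe default
}
```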
**Env Configuration**:

```env
# Let nginx handle HTTP/2, Node stays HTTP/1.1
HTTP2_ENABLED=false
HTTP2_ADVERTISE_H2=false
```

**Nginx Config Example**:

```nginx
server {
    listen 443 ssl http2;
    server_name app.example.com;

    location / {
        proxy_pass http://127.0.0.1:4100; # Node server (HTTP/1.1)
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```
## Testing

### Automated Tests

Run the test suite:

```bash
cd /path/to/server
node test_fatal_error_reporter.js
```

**Tests cover**:

1. Atomic write (no corruption)
2. Corrupt JSON recovery (archives bad files)
3. Throttling (duplicate errors within the window)
4. Different errors are not throttled
5. Process handler integration + ignore filters
### Manual Testing

#### Test Fatal Reporting

```javascript
// In a server or worker, trigger an uncaught exception:
throw new Error('Test fatal error');
```

Check:

- `.rlog` file created with valid JSON
- Email sent (if `FATAL_REPORT_EMAIL_ENABLED=true`)
- Process exits after the delay (if `FATAL_EXIT_ON_ERROR=true`)

#### Test Corruption Recovery

```bash
# Corrupt the log file
echo '{ "broken": json }' > agm_server.rlog

# Trigger a fatal error so the reporter runs. The error must be raised
# inside the server/worker where the handlers are registered; a bare
# `node -e "throw new Error('test')"` has no handlers attached.

# Check for the archived file
ls -l agm_server.rlog.corrupt.*
```
#### Test Throttling

```javascript
// Trigger the same error (matching kind/message) twice within the
// throttle window, e.g. by restarting and rethrowing within 2 minutes:
const err = new Error('Duplicate test');
err.code = 'DUP_TEST';
throw err;
// The second occurrence should be throttled (no new timestamp in .rlog)
```
## Troubleshooting

### `.rlog` File is Corrupt

**Symptom**: Server/worker crashes on startup with a JSON parse error

**Solution**: Automatic; the reporter archives the corrupt file and creates a fresh log
### Email Not Sending

**Check**:

1. `FATAL_REPORT_EMAIL_ENABLED=true`
2. `FATAL_REPORT_EMAIL_TO` set or `AGM_ADM_EMAIL` configured
3. SMTP settings in `environment.env` are correct
4. `NO_EMAIL_MODE=false`
### Process Not Exiting After Fatal Error

**Check**:

1. `FATAL_EXIT_ON_ERROR=true` (production)
2. `FATAL_EXIT_DELAY_MS` allows time for cleanup (default 1500 ms)
3. The worker may have cleanup handlers (SIGINT/SIGTERM) preventing immediate exit

### HTTP Stream Errors Still Appearing

These are **expected and ignored** when `createServerIgnore()` is used. They indicate client disconnects and don't trigger reports or exits. Look for:

```
agm:server HTTP stream error (ignored - likely client disconnect): ...
```
## Migration Notes

### From `error-handler` Package

If you see:

```javascript
const errorHandler = require('error-handler').errorHandler;
errorHandler.registerUnCaughtProcessErrorsHandler(process, logPath);
```

Replace with:

```javascript
const path = require('path');
const { registerFatalHandlers } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env: require('./helpers/env'),
  debug: require('debug')('your:namespace'),
  kindPrefix: 'your_worker_name',
  reportFilePath: path.join(__dirname, 'your_worker.rlog'),
});
```

### Cleanup Old `.rlog` Files

The new system writes JSON only (not the binary format from `key-file-storage`). Old `.rlog` files can be removed or archived.
## Best Practices

1. **Always enable reporting in production**: Set `FATAL_REPORT_ENABLED=true`
2. **Enable email for critical services**: Set `FATAL_REPORT_EMAIL_ENABLED=true` for the server and partner workers
3. **Don't exit on error in dev**: Set `FATAL_EXIT_ON_ERROR=false` to avoid crash loops during debugging
4. **Use a descriptive `kindPrefix`**: Helps identify which worker/service crashed
5. **Monitor `.rlog` files**: Set up log rotation or alerts when the files exist (their presence indicates recent crashes)
6. **Test email notifications**: Verify admin emails arrive before deploying to production
## Future Enhancements

- [ ] Add structured logging integration (JSON output to stdout)
- [ ] Expose a metrics endpoint for fatal error counts
- [ ] Support multiple admin email recipients
- [ ] Add a webhook notification option (e.g., Slack, PagerDuty)
- [ ] Centralized fatal log aggregation for distributed workers

## Related Documentation

- [PARTNER_INTEGRATION_ARCHITECTURE.md](./PARTNER_INTEGRATION_ARCHITECTURE.md) - Partner integration architecture
- [WORKER_RESPONSIBILITIES_UPDATE.md](./archived/WORKER_RESPONSIBILITIES_UPDATE.md) - Worker responsibilities (archived)
- [DLQ_INDEX.md](./DLQ_INDEX.md) - DLQ and error recovery
## Support

For issues or questions:

1. Check the test suite: `node tests/test_fatal_error_reporter.js`
2. Review `.rlog` files in the server root and `workers/` directory
3. Verify env variables in `environment.env` or `environment_prod.env`
4. Check PM2 logs: `pm2 logs agm-server --lines 100`