# Fatal Error Handling Architecture
## Overview
The AgMission server and workers use a shared, robust fatal error handling system that:
- Writes a last-fatal JSON report atomically (preventing corruption)
- Archives corrupt log files automatically
- Throttles duplicate errors to avoid email floods
- Optionally sends admin notifications via email
- Supports configurable process exit behavior for clean restarts
## Why This Solution?
### Problem with Previous Approach
The original `error-handler` package used `key-file-storage`, which performed read-modify-write operations on the log file. This caused three problems:
1. **Corruption risk**: Under concurrent errors or abrupt termination, JSON could be left malformed
2. **Crash loops**: If `.rlog` was corrupt, the next error would fail to parse it and crash again
3. **Forced exit**: Always called `process.exit(1)`, even for benign client disconnect errors
### Current Solution Benefits
- **Atomic writes**: Uses `fs.writeFile` + `fs.rename` to ensure valid JSON or no write at all
- **Corruption recovery**: Detects bad JSON, archives it, and continues with a fresh log
- **Flexible exit**: Opt-in via `FATAL_EXIT_ON_ERROR` with configurable delay for graceful cleanup
- **Intelligent ignore**: Server filters HTTP stream errors (client disconnects, finalhandler cleanups)
- **Shared helper**: Single implementation (`helpers/process_fatal_handlers.js`) for consistency
## Architecture
### Core Components
#### 1. `helpers/fatal_error_reporter.js`
Low-level module responsible for:
- Safe read/write of `.rlog` files
- Atomic JSON persistence
- Throttling duplicate errors
- Optional email via existing `mailer`
**Key Functions**:
```javascript
await reportFatal({
  filePath: './agm_server.rlog',
  kind: 'server:uncaughtException',
  error: err,
  message: err.stack,
  throttleMs: 120000,
  emailEnabled: true,
  emailTo: 'admin@example.com',
  mailer,
});
```
#### 2. `helpers/process_fatal_handlers.js`
High-level registration helper that:
- Attaches `uncaughtException` and `unhandledRejection` handlers
- Applies ignore filters (e.g., HTTP stream errors for server)
- Calls `reportFatal` when appropriate
- Optionally exits process after delay
**Key Functions**:
```javascript
const { registerFatalHandlers, createServerIgnore } = require('./helpers/process_fatal_handlers');
registerFatalHandlers(process, {
  env,
  debug,
  kindPrefix: 'server',
  reportFilePath: env.FATAL_REPORT_FILE,
  ignore: createServerIgnore(), // Server-specific filter
});
```
### Applied In
- **Server**: `server.js` (with HTTP stream ignore filter)
- **Job Worker**: `workers/job_worker.js`
- **Partner Sync Worker**: `workers/partner_sync_worker.js`
- **Partner Polling Worker**: `workers/partner_data_polling_worker.js`
- **Cleanup Worker**: `workers/cleanup_worker.js`
- **Invoice Worker**: `workers/invoice_worker.js`
- **Obstacle Worker**: `workers/obstacle_worker.js`
Each worker logs to its own `.rlog` file (e.g., `job_worker.rlog`, `partner_sync_worker.rlog`).
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `FATAL_REPORT_ENABLED` | `false` | Enable `.rlog` + email reporting |
| `FATAL_REPORT_FILE` | `./agm_server.rlog` | Path to last-fatal JSON log |
| `FATAL_REPORT_EMAIL_ENABLED` | `false` | Send email to admin on fatal error |
| `FATAL_REPORT_EMAIL_TO` | `AGM_ADM_EMAIL` | Admin email address |
| `FATAL_EXIT_ON_ERROR` | `true` | Exit process after fatal error (production) |
| `FATAL_EXIT_DELAY_MS` | `1500` | Delay before exit (for graceful cleanup) |
| `FATAL_THROTTLE_MS` | `120000` | Minimum time between identical error reports (2 min) |
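The defaults in the table could be resolved along these lines (an illustrative sketch; `resolveFatalConfig` and the parsing helpers are hypothetical names, not the project's actual `env` module):

```javascript
// Parse an env string as a boolean, falling back to a default when unset.
function parseBool(value, fallback) {
  if (value === undefined || value === '') return fallback;
  return String(value).toLowerCase() === 'true';
}

// Parse an env string as an integer, falling back to a default when invalid.
function parseIntOr(value, fallback) {
  const n = parseInt(value, 10);
  return Number.isFinite(n) ? n : fallback;
}

// Resolve the fatal-handling config with the defaults from the table above.
function resolveFatalConfig(env = process.env) {
  return {
    enabled: parseBool(env.FATAL_REPORT_ENABLED, false),
    filePath: env.FATAL_REPORT_FILE || './agm_server.rlog',
    emailEnabled: parseBool(env.FATAL_REPORT_EMAIL_ENABLED, false),
    emailTo: env.FATAL_REPORT_EMAIL_TO || env.AGM_ADM_EMAIL,
    exitOnError: parseBool(env.FATAL_EXIT_ON_ERROR, true),
    exitDelayMs: parseIntOr(env.FATAL_EXIT_DELAY_MS, 1500),
    throttleMs: parseIntOr(env.FATAL_THROTTLE_MS, 120000),
  };
}
```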
### Development vs Production
**Development** (`environment.env`):
```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=false
FATAL_EXIT_ON_ERROR=false # Don't crash-loop during dev
```
**Production** (`environment_prod.env`):
```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=true
FATAL_EXIT_ON_ERROR=true # Let PM2/systemd restart cleanly
```
## HTTP/2 and Express Compatibility
### The Issue
Express expects HTTP/1.1-style `IncomingMessage` and `ServerResponse` objects. When Node's native `http2.createSecureServer()` advertises `h2` via ALPN, browsers negotiate HTTP/2 and send native HTTP/2 streams, which Express cannot handle properly. This causes:
- Hanging requests (spinning browser)
- "Cannot read properties of undefined (reading 'readable')" errors
- Finalhandler cleanup exceptions
### The Solution
1. **Default to HTTPS/1.1**: `HTTP2_ENABLED=false` (recommended)
2. **If using HTTP/2 server**: Set `HTTP2_ADVERTISE_H2=false` to prevent browser negotiation
3. **Production setup**: Terminate HTTP/2 at nginx, proxy HTTP/1.1 to Node/Express
**Env Configuration**:
```env
# Let nginx handle HTTP/2, Node stays HTTP/1.1
HTTP2_ENABLED=false
HTTP2_ADVERTISE_H2=false
```
**Nginx Config Example**:
```nginx
server {
    listen 443 ssl http2;
    server_name app.example.com;

    location / {
        proxy_pass http://127.0.0.1:4100;  # Node server (HTTP/1.1)
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```
## Testing
### Automated Tests
Run the test suite:
```bash
cd /path/to/server
node test_fatal_error_reporter.js
```
**Tests cover**:
1. Atomic write (no corruption)
2. Corrupt JSON recovery (archives bad files)
3. Throttling (duplicate errors within window)
4. Different errors not throttled
5. Process handler integration + ignore filters
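The throttling behavior under test (cases 3 and 4) boils down to a keyed time-window check, roughly like this sketch (`createThrottle` is illustrative, not the reporter's internal API):

```javascript
// Return a predicate that allows a report for a given error key only when
// the previous report for that key is older than the throttle window.
function createThrottle(windowMs) {
  const lastSeen = new Map(); // error key -> timestamp of last report
  return function shouldReport(key, now = Date.now()) {
    const prev = lastSeen.get(key);
    if (prev !== undefined && now - prev < windowMs) return false;
    lastSeen.set(key, now);
    return true;
  };
}
```

Identical errors inside the window are suppressed; different errors (different keys) always go through.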
### Manual Testing
#### Test Fatal Reporting
```javascript
// In server or worker
throw new Error('Test fatal error');
```
Check:
- `.rlog` file created with valid JSON
- Email sent (if `FATAL_REPORT_EMAIL_ENABLED=true`)
- Process exit after delay (if `FATAL_EXIT_ON_ERROR=true`)
#### Test Corruption Recovery
```bash
# Corrupt the log file
echo '{ "broken": json }' > agm_server.rlog
# Trigger a fatal error in the running server/worker; the reporter
# archives the corrupt file before writing a fresh report
# Check for archived file
ls -l agm_server.rlog.corrupt.*
```
#### Test Throttling
```javascript
// Trigger the same error twice inside the throttle window (default 2 min).
// With handlers registered and FATAL_EXIT_ON_ERROR=false, each throw
// reaches the uncaughtException handler without killing the process.
const err = new Error('Duplicate test');
err.code = 'DUP_TEST';
setTimeout(() => { throw err; }, 0);    // First report: written to .rlog
setTimeout(() => { throw err; }, 1000); // Should be throttled (no new timestamp in .rlog)
```
## Troubleshooting
### `.rlog` File is Corrupt
**Symptom**: Server/worker crashes on startup with JSON parse error
**Solution**: Automatic. The reporter archives the corrupt file and continues with a fresh log
### Email Not Sending
**Check**:
1. `FATAL_REPORT_EMAIL_ENABLED=true`
2. `FATAL_REPORT_EMAIL_TO` set or `AGM_ADM_EMAIL` configured
3. SMTP settings in `environment.env` are correct
4. `NO_EMAIL_MODE=false`
### Process Not Exiting After Fatal Error
**Check**:
1. `FATAL_EXIT_ON_ERROR=true` (production)
2. `FATAL_EXIT_DELAY_MS` allows time for cleanup (default 1500ms)
3. Worker may have cleanup handlers (SIGINT/SIGTERM) preventing immediate exit
### HTTP Stream Errors Still Appearing
These are **expected and ignored** when `createServerIgnore()` is used. They indicate client disconnects and don't trigger reports or exits. Look for:
```
agm:server HTTP stream error (ignored - likely client disconnect): ...
```
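For reference, an ignore filter in the spirit of `createServerIgnore()` might look like the sketch below. The specific error codes and message checks are illustrative assumptions, not the real implementation:

```javascript
// Return a predicate that classifies benign HTTP stream errors
// (client disconnects, finalhandler cleanup) as ignorable.
function createServerIgnore() {
  const benignCodes = new Set(['ECONNRESET', 'EPIPE', 'ERR_STREAM_PREMATURE_CLOSE']);
  return function shouldIgnore(err) {
    if (!err) return false;
    if (benignCodes.has(err.code)) return true;
    // finalhandler cleanup on an already-destroyed response
    const msg = String(err.message || '');
    return msg.includes("reading 'readable'");
  };
}
```

Errors matching the filter are logged via `debug` but never reported to `.rlog`, emailed, or allowed to exit the process.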
## Migration Notes
### From `error-handler` Package
If you see:
```javascript
const errorHandler = require('error-handler').errorHandler;
errorHandler.registerUnCaughtProcessErrorsHandler(process, logPath);
```
Replace with:
```javascript
const path = require('path');
const { registerFatalHandlers } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env: require('./helpers/env'),
  debug: require('debug')('your:namespace'),
  kindPrefix: 'your_worker_name',
  reportFilePath: path.join(__dirname, 'your_worker.rlog'),
});
```
### Cleanup Old `.rlog` Files
The new system writes JSON only (not the binary format from `key-file-storage`). Old `.rlog` files can be removed or archived.
## Best Practices
1. **Always enable reporting in production**: Set `FATAL_REPORT_ENABLED=true`
2. **Enable email for critical services**: Set `FATAL_REPORT_EMAIL_ENABLED=true` for server and partner workers
3. **Don't exit on error in dev**: Set `FATAL_EXIT_ON_ERROR=false` to avoid crash loops during debugging
4. **Use descriptive `kindPrefix`**: Helps identify which worker/service crashed
5. **Monitor `.rlog` files**: Set up log rotation or alerts if files exist (indicates recent crashes)
6. **Test email notifications**: Verify admin emails arrive before deploying to production
## Future Enhancements
- [ ] Add structured logging integration (JSON output to stdout)
- [ ] Expose metrics endpoint for fatal error counts
- [ ] Support multiple admin email recipients
- [ ] Add webhook notification option (e.g., Slack, PagerDuty)
- [ ] Centralized fatal log aggregation for distributed workers
## Related Documentation
- [PARTNER_INTEGRATION_ARCHITECTURE.md](./PARTNER_INTEGRATION_ARCHITECTURE.md) - Partner integration architecture
- [WORKER_RESPONSIBILITIES_UPDATE.md](./archived/WORKER_RESPONSIBILITIES_UPDATE.md) - Worker responsibilities (archived)
- [DLQ_INDEX.md](./DLQ_INDEX.md) - DLQ and error recovery
## Support
For issues or questions:
1. Check test suite: `node tests/test_fatal_error_reporter.js`
2. Review `.rlog` files in server root and `workers/` directory
3. Verify env variables in `environment.env` or `environment_prod.env`
4. Check PM2 logs: `pm2 logs agm-server --lines 100`