# Fatal Error Handling Architecture
## Overview
The AgMission server and workers use a shared, robust fatal error handling system that:
- Writes a last-fatal JSON report atomically (preventing corruption)
- Archives corrupt log files automatically
- Throttles duplicate errors to avoid email floods
- Optionally sends admin notifications via email
- Supports configurable process exit behavior for clean restarts
## Why This Solution?
### Problem with Previous Approach
The original `error-handler` package used `key-file-storage`, which performed read-modify-write operations on the log file. This caused three problems:
1. **Corruption risk**: Under concurrent errors or abrupt termination, JSON could be left malformed
2. **Crash loops**: If `.rlog` was corrupt, the next error would fail to parse it and crash again
3. **Forced exit**: Always called `process.exit(1)`, even for benign client disconnect errors
### Current Solution Benefits
- **Atomic writes**: Uses `fs.writeFile` + `fs.rename` to ensure valid JSON or no write at all
- **Corruption recovery**: Detects bad JSON, archives it, and continues with a fresh log
- **Flexible exit**: Opt-in via `FATAL_EXIT_ON_ERROR` with configurable delay for graceful cleanup
- **Intelligent ignore**: Server filters HTTP stream errors (client disconnects, finalhandler cleanups)
- **Shared helper**: Single implementation (`helpers/process_fatal_handlers.js`) for consistency
## Architecture
### Core Components
#### 1. `helpers/fatal_error_reporter.js`
Low-level module responsible for:
- Safe read/write of `.rlog` files
- Atomic JSON persistence
- Throttling duplicate errors
- Optional email via existing `mailer`
**Key Functions**:
```javascript
await reportFatal({
  filePath: './agm_server.rlog',
  kind: 'server:uncaughtException',
  error: err,
  message: err.stack,
  throttleMs: 120000,
  emailEnabled: true,
  emailTo: 'admin@example.com',
  mailer,
});
```
#### 2. `helpers/process_fatal_handlers.js`
High-level registration helper that:
- Attaches `uncaughtException` and `unhandledRejection` handlers
- Applies ignore filters (e.g., HTTP stream errors for server)
- Calls `reportFatal` when appropriate
- Optionally exits process after delay
**Key Functions**:
```javascript
const { registerFatalHandlers, createServerIgnore } = require('./helpers/process_fatal_handlers');
registerFatalHandlers(process, {
  env,
  debug,
  kindPrefix: 'server',
  reportFilePath: env.FATAL_REPORT_FILE,
  ignore: createServerIgnore(), // Server-specific filter
});
```
### Applied In
- **Server**: `server.js` (with HTTP stream ignore filter)
- **Job Worker**: `workers/job_worker.js`
- **Partner Sync Worker**: `workers/partner_sync_worker.js`
- **Partner Polling Worker**: `workers/partner_data_polling_worker.js`
- **Cleanup Worker**: `workers/cleanup_worker.js`
- **Invoice Worker**: `workers/invoice_worker.js`
- **Obstacle Worker**: `workers/obstacle_worker.js`
Each worker logs to its own `.rlog` file (e.g., `job_worker.rlog`, `partner_sync_worker.rlog`).
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `FATAL_REPORT_ENABLED` | `false` | Enable `.rlog` + email reporting |
| `FATAL_REPORT_FILE` | `./agm_server.rlog` | Path to last-fatal JSON log |
| `FATAL_REPORT_EMAIL_ENABLED` | `false` | Send email to admin on fatal error |
| `FATAL_REPORT_EMAIL_TO` | `AGM_ADM_EMAIL` | Admin email address |
| `FATAL_EXIT_ON_ERROR` | `true` | Exit process after fatal error (production) |
| `FATAL_EXIT_DELAY_MS` | `1500` | Delay before exit (for graceful cleanup) |
| `FATAL_THROTTLE_MS` | `120000` | Minimum time between identical error reports (2 min) |
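The defaults in the table could be resolved along these lines (an illustrative sketch; `resolveFatalConfig` and the parsing helpers are hypothetical names, not the project's actual `env` module):

```javascript
// Parse an env string as a boolean, falling back to a default when unset.
function parseBool(value, fallback) {
  if (value === undefined || value === '') return fallback;
  return String(value).toLowerCase() === 'true';
}

// Parse an env string as an integer, falling back to a default when invalid.
function parseIntOr(value, fallback) {
  const n = parseInt(value, 10);
  return Number.isFinite(n) ? n : fallback;
}

// Resolve the fatal-handling config with the defaults from the table above.
function resolveFatalConfig(env = process.env) {
  return {
    enabled: parseBool(env.FATAL_REPORT_ENABLED, false),
    filePath: env.FATAL_REPORT_FILE || './agm_server.rlog',
    emailEnabled: parseBool(env.FATAL_REPORT_EMAIL_ENABLED, false),
    emailTo: env.FATAL_REPORT_EMAIL_TO || env.AGM_ADM_EMAIL,
    exitOnError: parseBool(env.FATAL_EXIT_ON_ERROR, true),
    exitDelayMs: parseIntOr(env.FATAL_EXIT_DELAY_MS, 1500),
    throttleMs: parseIntOr(env.FATAL_THROTTLE_MS, 120000),
  };
}
```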
### Development vs Production
**Development** (`environment.env`):
```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=false
FATAL_EXIT_ON_ERROR=false # Don't crash-loop during dev
```
**Production** (`environment_prod.env`):
```env
FATAL_REPORT_ENABLED=true
FATAL_REPORT_EMAIL_ENABLED=true
FATAL_EXIT_ON_ERROR=true # Let PM2/systemd restart cleanly
```
## HTTP/2 and Express Compatibility
### The Issue
Express expects HTTP/1.1-style `IncomingMessage` and `ServerResponse` objects. When Node's native `http2.createSecureServer()` advertises `h2` via ALPN, browsers negotiate HTTP/2 and send native HTTP/2 streams, which Express cannot handle properly. This causes:
- Hanging requests (spinning browser)
- "Cannot read properties of undefined (reading 'readable')" errors
- Finalhandler cleanup exceptions
### The Solution
1. **Default to HTTPS/1.1**: `HTTP2_ENABLED=false` (recommended)
2. **If using HTTP/2 server**: Set `HTTP2_ADVERTISE_H2=false` to prevent browser negotiation
3. **Production setup**: Terminate HTTP/2 at nginx, proxy HTTP/1.1 to Node/Express
**Env Configuration**:
```env
# Let nginx handle HTTP/2, Node stays HTTP/1.1
HTTP2_ENABLED=false
HTTP2_ADVERTISE_H2=false
```
**Nginx Config Example**:
```nginx
server {
    listen 443 ssl http2;
    server_name app.example.com;

    location / {
        proxy_pass http://127.0.0.1:4100;  # Node server (HTTP/1.1)
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```
## Testing
### Automated Tests
Run the test suite:
```bash
cd /path/to/server
node test_fatal_error_reporter.js
```
**Tests cover**:
1. Atomic write (no corruption)
2. Corrupt JSON recovery (archives bad files)
3. Throttling (duplicate errors within window)
4. Different errors not throttled
5. Process handler integration + ignore filters
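The throttling behavior under test (cases 3 and 4) boils down to a keyed time-window check, roughly like this sketch (`createThrottle` is illustrative, not the reporter's internal API):

```javascript
// Return a predicate that allows a report for a given error key only when
// the previous report for that key is older than the throttle window.
function createThrottle(windowMs) {
  const lastSeen = new Map(); // error key -> timestamp of last report
  return function shouldReport(key, now = Date.now()) {
    const prev = lastSeen.get(key);
    if (prev !== undefined && now - prev < windowMs) return false;
    lastSeen.set(key, now);
    return true;
  };
}
```

Identical errors inside the window are suppressed; different errors (different keys) always go through.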
### Manual Testing
#### Test Fatal Reporting
```javascript
// In server or worker
throw new Error('Test fatal error');
```
Check:
- `.rlog` file created with valid JSON
- Email sent (if `FATAL_REPORT_EMAIL_ENABLED=true`)
- Process exit after delay (if `FATAL_EXIT_ON_ERROR=true`)
#### Test Corruption Recovery
```bash
# Corrupt the log file
echo '{ "broken": json }' > agm_server.rlog
# Trigger a fatal error in the running server/worker; the reporter
# archives the corrupt file before writing a fresh report
# Check for archived file
ls -l agm_server.rlog.corrupt.*
```
#### Test Throttling
```javascript
// Trigger the same error twice inside the throttle window (default 2 min).
// With handlers registered and FATAL_EXIT_ON_ERROR=false, each throw
// reaches the uncaughtException handler without killing the process.
const err = new Error('Duplicate test');
err.code = 'DUP_TEST';
setTimeout(() => { throw err; }, 0);    // First report: written to .rlog
setTimeout(() => { throw err; }, 1000); // Should be throttled (no new timestamp in .rlog)
```
## Troubleshooting
### `.rlog` File is Corrupt
**Symptom**: Server/worker crashes on startup with JSON parse error
**Solution**: Automatic. The reporter archives the corrupt file and continues with a fresh log
### Email Not Sending
**Check**:
1. `FATAL_REPORT_EMAIL_ENABLED=true`
2. `FATAL_REPORT_EMAIL_TO` set or `AGM_ADM_EMAIL` configured
3. SMTP settings in `environment.env` are correct
4. `NO_EMAIL_MODE=false`
### Process Not Exiting After Fatal Error
**Check**:
1. `FATAL_EXIT_ON_ERROR=true` (production)
2. `FATAL_EXIT_DELAY_MS` allows time for cleanup (default 1500ms)
3. Worker may have cleanup handlers (SIGINT/SIGTERM) preventing immediate exit
### HTTP Stream Errors Still Appearing
These are **expected and ignored** when `createServerIgnore()` is used. They indicate client disconnects and don't trigger reports or exits. Look for:
```
agm:server HTTP stream error (ignored - likely client disconnect): ...
```
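For reference, an ignore filter in the spirit of `createServerIgnore()` might look like the sketch below. The specific error codes and message checks are illustrative assumptions, not the real implementation:

```javascript
// Return a predicate that classifies benign HTTP stream errors
// (client disconnects, finalhandler cleanup) as ignorable.
function createServerIgnore() {
  const benignCodes = new Set(['ECONNRESET', 'EPIPE', 'ERR_STREAM_PREMATURE_CLOSE']);
  return function shouldIgnore(err) {
    if (!err) return false;
    if (benignCodes.has(err.code)) return true;
    // finalhandler cleanup on an already-destroyed response
    const msg = String(err.message || '');
    return msg.includes("reading 'readable'");
  };
}
```

Errors matching the filter are logged via `debug` but never reported to `.rlog`, emailed, or allowed to exit the process.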
## Migration Notes
### From `error-handler` Package
If you see:
```javascript
const errorHandler = require('error-handler').errorHandler;
errorHandler.registerUnCaughtProcessErrorsHandler(process, logPath);
```
Replace with:
```javascript
const path = require('path');
const { registerFatalHandlers } = require('./helpers/process_fatal_handlers');

registerFatalHandlers(process, {
  env: require('./helpers/env'),
  debug: require('debug')('your:namespace'),
  kindPrefix: 'your_worker_name',
  reportFilePath: path.join(__dirname, 'your_worker.rlog'),
});
```
### Cleanup Old `.rlog` Files
The new system writes JSON only (not the binary format from `key-file-storage`). Old `.rlog` files can be removed or archived.
## Best Practices
1. **Always enable reporting in production**: Set `FATAL_REPORT_ENABLED=true`
2. **Enable email for critical services**: Set `FATAL_REPORT_EMAIL_ENABLED=true` for server and partner workers
3. **Don't exit on error in dev**: Set `FATAL_EXIT_ON_ERROR=false` to avoid crash loops during debugging
4. **Use descriptive `kindPrefix`**: Helps identify which worker/service crashed
5. **Monitor `.rlog` files**: Set up log rotation or alerts if files exist (indicates recent crashes)
6. **Test email notifications**: Verify admin emails arrive before deploying to production
## Future Enhancements
- [ ] Add structured logging integration (JSON output to stdout)
- [ ] Expose metrics endpoint for fatal error counts
- [ ] Support multiple admin email recipients
- [ ] Add webhook notification option (e.g., Slack, PagerDuty)
- [ ] Centralized fatal log aggregation for distributed workers
## Related Documentation
- [PARTNER_INTEGRATION_ARCHITECTURE.md](./PARTNER_INTEGRATION_ARCHITECTURE.md) - Partner integration architecture
- [WORKER_RESPONSIBILITIES_UPDATE.md](./archived/WORKER_RESPONSIBILITIES_UPDATE.md) - Worker responsibilities (archived)
- [DLQ_INDEX.md](./DLQ_INDEX.md) - DLQ and error recovery
## Support
For issues or questions:
1. Check test suite: `node tests/test_fatal_error_reporter.js`
2. Review `.rlog` files in server root and `workers/` directory
3. Verify env variables in `environment.env` or `environment_prod.env`
4. Check PM2 logs: `pm2 logs agm-server --lines 100`