# Operations

This page covers running `daycry/jobs` v3 in production: keeping long-running workers alive under a
process supervisor, shutting them down cleanly, scaling out to many workers, running the periodic
reaper, the operational behaviour of the circuit breaker and rate limiter, the dead-letter queue,
and observability via the metrics collector.

It assumes you understand the queue model from [Queues & Backends](QUEUES.md) and the worker
commands from [CLI Commands](COMMANDS.md).

## Running long-running workers

In production you run one or more `jobs:queue:work` processes per queue, each kept alive by a process
supervisor so it restarts automatically on exit (crash, deploy, OOM, or a graceful stop). The worker
itself runs an unbounded loop when invoked without `--once`/`--max`.

### Supervisor

A typical `supervisord` program section, where `numprocs` runs several identical workers against the
same queue (see [Scaling out](#scaling-out)):

```ini
[program:jobs-reports]
command=php /var/www/app/spark jobs:queue:work reports --backend redis
directory=/var/www/app
user=www-data
numprocs=4
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
startsecs=3
stopsignal=TERM
stopwaitsecs=3600
stdout_logfile=/var/log/jobs/reports.out.log
stderr_logfile=/var/log/jobs/reports.err.log
```

> **Warning:** Set `stopwaitsecs` (Supervisor) **higher than your longest job runtime**. The worker
> finishes the in-flight job before exiting on `SIGTERM`; if the supervisor force-kills it first
> (`SIGKILL`) the job is interrupted and will be redelivered after its visibility timeout.

### systemd

A templated unit (`jobs-worker@.service`) so you can start one instance per queue with
`systemctl start jobs-worker@reports`:

```ini
[Unit]
Description=Jobs queue worker (%i)
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/var/www/app
ExecStart=/usr/bin/php /var/www/app/spark jobs:queue:work %i --backend redis
Restart=always
RestartSec=3
# Allow the in-flight job to finish before SIGKILL on stop/restart.
TimeoutStopSec=3600
KillSignal=SIGTERM

[Install]
WantedBy=multi-user.target
```

```bash
systemctl daemon-reload
systemctl enable --now jobs-worker@reports
systemctl enable --now jobs-worker@emails
```

> **Note:** Restart your workers on every deploy. A long-running PHP process keeps the old code (and
> a warm `Config\Jobs`) in memory until it restarts, so new handler code or config changes are not
> picked up by a worker that keeps running across the deploy.

## Graceful shutdown and signals

The worker installs handlers for `SIGTERM` and `SIGINT` (POSIX, requires the `pcntl` extension).
On receipt it:

1. Prints `stop signal received, finishing current cycle...`.
2. Sets an internal stop flag (checked at the top of every loop iteration).
3. Finishes the **current** cycle — an in-flight job is run to completion and settled normally.
4. Prints `graceful shutdown complete.` and exits with `SUCCESS`.

This means a deploy/restart never aborts a running job mid-flight; the worker simply stops pulling
new work and exits.
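
The steps above can be sketched in plain PHP. This is a minimal illustration of the stop-flag
pattern, not the library's worker code: `StopFlag` and `runWorkerLoop()` are hypothetical names, and
the signal handlers are only installed when `pcntl` is available.

```php
<?php
declare(strict_types=1);

// Illustrative sketch only; StopFlag and runWorkerLoop() are hypothetical names.
final class StopFlag
{
    private bool $requested = false;

    public function request(): void
    {
        $this->requested = true;
    }

    public function isRequested(): bool
    {
        return $this->requested;
    }
}

/**
 * Runs $cycle until a stop is requested or $maxCycles is reached.
 * Returns the number of completed cycles.
 */
function runWorkerLoop(StopFlag $flag, callable $cycle, int $maxCycles = PHP_INT_MAX): int
{
    if (function_exists('pcntl_async_signals')) {
        // Dispatch signal handlers without needing declare(ticks=1).
        pcntl_async_signals(true);
        $onStop = static function (int $signal) use ($flag): void {
            // Only record the request; the running cycle resumes afterwards.
            $flag->request();
        };
        pcntl_signal(SIGTERM, $onStop);
        pcntl_signal(SIGINT, $onStop);
    }

    $completed = 0;
    // The flag is checked at the top of every iteration, never mid-cycle.
    while (! $flag->isRequested() && $completed < $maxCycles) {
        $cycle();
        $completed++;
    }

    return $completed;
}
```

Because the handler only sets a flag, a cycle that is already running always completes before the
loop exits, which is the behaviour the supervisor's stop-grace time has to allow for.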

> **Warning:** On platforms without `pcntl` (notably Windows), the worker cannot trap signals.
> Bound such runs with `--once` or `--max N` and re-invoke from a scheduler, or stop the process
> externally between cycles. Where signals are available, set the supervisor's stop-grace time
> (`stopwaitsecs` / `TimeoutStopSec`) longer than the longest job runtime.

## Scaling out

Because the queue contract is lease-based and claims are atomic, you scale throughput simply by
running **more workers** against the same queue:

- The **database** backend claims rows with `FOR UPDATE SKIP LOCKED` (optimistic fallback for SQLite),
  so concurrent workers never grab the same row.
- The **redis** backend moves messages atomically with `RPOPLPUSH` into a per-message processing
  entry, so a message is leased by exactly one worker.
- **beanstalk** and **serviceBus** reserve/peek-lock each message server-side.

Run dedicated worker pools per queue so a slow queue does not starve a fast one:

```bash
# 4 workers on 'emails', 2 on 'reports'
php spark jobs:queue:work emails  --backend redis   # x4 under the supervisor
php spark jobs:queue:work reports --backend redis   # x2 under the supervisor
```

> **Warning:** Delivery is **at-least-once**, so a message can be processed more than once (after a
> crash and reaper recovery, or a lease expiry), and running several workers makes this more likely
> to surface. Make handlers idempotent, and use `idempotencyKey()` for built-in de-duplication. See
> [Idempotency](advanced.md#idempotency-in-depth).

## The periodic reaper

A worker that crashes between `fetch()` and `ack()` leaves its message leased and invisible. Run
`jobs:queue:reap <queue>` periodically (every minute is typical): it returns messages whose
visibility timeout has elapsed to the ready state. This is required for the **database** and
**redis** backends; beanstalk and Service Bus recover natively.

```bash
# System cron, once a minute per queue
* * * * * cd /var/www/app && php spark jobs:queue:reap reports >> /dev/null 2>&1
* * * * * cd /var/www/app && php spark jobs:queue:reap emails --backend redis >> /dev/null 2>&1
```
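
The decision the reaper makes on each run can be sketched as follows. This is a simplified model
under stated assumptions: `reapExpiredLeases()` and the array shape are hypothetical, the real
command works against the backend's own storage, and treating a lease as expired exactly at the
timeout boundary is an assumption.

```php
<?php
declare(strict_types=1);

// Illustrative sketch only; reapExpiredLeases() is a hypothetical helper.

/**
 * @param array<string, int> $leasedAt message id => Unix time the lease was taken
 *
 * @return array{ready: list<string>, leased: array<string, int>}
 */
function reapExpiredLeases(array $leasedAt, int $visibilityTimeout, int $now): array
{
    $ready  = [];
    $leased = [];

    foreach ($leasedAt as $id => $takenAt) {
        if ($now - $takenAt >= $visibilityTimeout) {
            // Lease expired: the worker is presumed crashed, so redeliver.
            $ready[] = (string) $id;
        } else {
            // Lease still valid: leave the message with its worker.
            $leased[(string) $id] = $takenAt;
        }
    }

    return ['ready' => $ready, 'leased' => $leased];
}

// 'a' has been leased for 300s and goes back to ready; 'b' (10s) stays leased.
$result = reapExpiredLeases(['a' => 1000, 'b' => 1290], 300, 1300);
```

The sketch also shows why a timeout shorter than the real runtime causes duplicates: the check looks
only at elapsed time, not at whether the worker is still alive.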

The visibility timeout used is `redisProcessingVisibilityTimeout` for the redis backend and
`databaseVisibilityTimeout` otherwise (both default 300s).

> **Warning — visibility timeout must exceed runtime.** If a job's real runtime can exceed the
> visibility timeout, the reaper (or the broker, for beanstalk TTR / Service Bus lock) will treat the
> still-running worker as crashed and redeliver the message, causing a duplicate execution. Always
> set the visibility timeout (and beanstalk TTR / `serviceBusLockTimeout`) **greater than your
> longest expected job runtime**, with headroom. For redis, a long-running worker can also extend its
> lease by calling `RedisBackend::renewLease()`.

## Circuit breaker

The worker wraps each cycle in a per-queue `CircuitBreaker` (cache-backed, so state persists across
worker restarts). It protects an unhealthy backend from being hammered:

- **Closed** (normal): failures are counted. After `Config\Jobs::$circuitBreakerThreshold`
  consecutive backend errors the circuit **opens**.
- **Open**: cycles are skipped for `Config\Jobs::$circuitBreakerCooldown` seconds (the worker logs
  `[Circuit Open] ...` and idles `pollInterval`). After the cooldown it allows one probe (half-open).
- **Half-open**: a successful cycle closes the circuit; a failed probe re-opens it.

```php
// Config\Jobs
public int $circuitBreakerThreshold = 5;   // consecutive failures before opening
public int $circuitBreakerCooldown  = 60;  // seconds the circuit stays open
```
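
The three states can be modelled with a failure counter and an "opened at" timestamp. The sketch
below is an in-process illustration only: `SketchBreaker` and its methods are hypothetical, and the
library's `CircuitBreaker` keeps its state in the cache and may differ in detail.

```php
<?php
declare(strict_types=1);

// Illustrative sketch only; SketchBreaker is a hypothetical, in-process model.
final class SketchBreaker
{
    private int $failures = 0;
    private ?int $openedAt = null; // null means the circuit is closed

    public function __construct(
        private int $threshold = 5,
        private int $cooldown = 60,
    ) {
    }

    /** True when a cycle may run: closed, or open with the cooldown elapsed (probe). */
    public function allows(int $now): bool
    {
        return $this->openedAt === null || $now - $this->openedAt >= $this->cooldown;
    }

    /** A successful cycle (or probe) closes the circuit and resets the count. */
    public function recordSuccess(): void
    {
        $this->failures = 0;
        $this->openedAt = null;
    }

    /** A backend error opens the circuit at the threshold, or re-opens it after a failed probe. */
    public function recordFailure(int $now): void
    {
        $wasProbing = $this->openedAt !== null;
        $this->failures++;

        if ($wasProbing || $this->failures >= $this->threshold) {
            $this->openedAt = $now;
        }
    }
}
```

A worker loop would call `allows()` before each cycle and then exactly one of `recordSuccess()` or
`recordFailure()`, so a single worker sends one probe per cooldown period.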

> **Note:** The breaker reacts to **thrown backend errors** during a cycle (e.g. the broker is
> unreachable), not to ordinary job failures — a job that runs and fails is nacked/abandoned by the
> pipeline and counts as a *successful* backend cycle for the breaker.

## Rate limiting

Cap how many jobs a queue processes per minute with `Config\Jobs::$queueRateLimits` (jobs/minute,
`0` = unlimited). The worker checks the limit before each cycle and, when throttled, logs
`[Rate Limited] ...` and idles for `pollInterval`.

```php
// Config\Jobs
public array $queueRateLimits = [
    'emails'  => 100, // at most 100 email jobs/minute
    'reports' => 10,
];
```

The limiter (`Daycry\Jobs\Libraries\RateLimiter`) uses a cache-based, per-minute token bucket.
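
One simple way to enforce a per-minute cap is a fixed-window counter keyed by queue and minute. The
sketch below uses that technique with an in-process array; it is not the library's `RateLimiter`,
`SketchMinuteLimiter` is a hypothetical name, and a production limiter would replace the array with
an atomic cache increment plus a TTL.

```php
<?php
declare(strict_types=1);

// Illustrative sketch only; SketchMinuteLimiter is hypothetical and keeps its
// counters in process memory (old windows are never evicted here).
final class SketchMinuteLimiter
{
    /** @var array<string, int> window key => jobs counted in that minute */
    private array $counts = [];

    /** @param array<string, int> $limits queue => jobs per minute (0 = unlimited) */
    public function __construct(private array $limits)
    {
    }

    public function allow(string $queue, int $now): bool
    {
        $limit = $this->limits[$queue] ?? 0;
        if ($limit <= 0) {
            return true; // unlimited or unconfigured queue
        }

        // One counter per queue per minute; increment first, then compare,
        // mirroring how an atomic cache increment would be used.
        $key   = $queue . ':' . intdiv($now, 60);
        $count = ($this->counts[$key] ?? 0) + 1;

        $this->counts[$key] = $count;

        return $count <= $limit;
    }
}
```

With a non-atomic store, the read and the write in `allow()` can interleave between workers, which
is where an overshoot of one job per racing worker comes from.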

> **Note:** Use an **atomic cache driver (Redis or Memcached)** in production. With those, the
> increment is server-side atomic and the cap is enforced precisely. The file/dummy fallback is
> best-effort and may overshoot by one per racing worker.

## Dead-letter queue

The DLQ holds jobs that have permanently failed so they can be inspected or replayed instead of being
lost. Configure a queue name:

```php
// Config\Jobs
public ?string $deadLetterQueue = 'dead-letter'; // null disables the DLQ helper
```

Routing is provided by `Daycry\Jobs\Libraries\DeadLetterQueue::store($payload, $handler, $reason, $attempts)`,
which enqueues the failed payload (annotated with `_dlq_metadata`: reason, timestamp, attempts) onto
the configured queue using the default backend, and returns `false` when the DLQ is unconfigured or
the enqueue fails.

```php
use Daycry\Jobs\Libraries\DeadLetterQueue;

$stored = (new DeadLetterQueue())->store(
    payload: $failedPayload,
    handler: 'command',
    reason:  'connection timeout',
    attempts: 4,
);

if (! $stored) {
    // DLQ disabled or enqueue failed — decide whether to drop or requeue; never silently lose work.
}
```

> **Warning:** In the current worker pipeline, retry exhaustion calls the backend's `abandon()`
> directly — which routes to a **native** dead-letter facility where the backend has one (beanstalk
> `bury`, Service Bus dead-letter after `MaxDeliveryCount`) and otherwise marks the message `failed`
> (database) or drops it (redis). The `DeadLetterQueue` helper and `$deadLetterQueue` config are an
> **opt-in application-level** facility you invoke yourself; they are not automatically called by the
> worker on abandon. For redis, in particular, configure your own DLQ handling (or rely on
> inspection) so permanently-failed messages are not lost. See also
> [Retries & Backoff](RETRIES.md#dead-letter-queue).

## Observability and metrics

The worker emits counters through a pluggable `Daycry\Jobs\Metrics\MetricsCollectorInterface`,
resolved from `Config\Jobs::$metricsCollector`:

```php
// Config\Jobs
// InMemoryMetricsCollector (default) is fine for dev; null disables all metrics.
public ?string $metricsCollector = InMemoryMetricsCollector::class;
```

The interface is small:

```php
interface MetricsCollectorInterface
{
    public function increment(string $counter, int $value = 1, array $labels = []): void;
    public function observe(string $metric, float $value, array $labels = []): void;
    public function getSnapshot(): array;
}
```
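
A custom collector only needs to implement these three methods. The sketch below formats each event
as a StatsD-style line with DogStatsD-style tags and hands it to a sink. Everything in it is an
assumption for illustration: `StatsdLineCollector` is a hypothetical class, the interface is
redeclared locally so the file runs standalone, the empty snapshot shape is a guess, and this page
does not say how the library constructs the collector, so the constructor argument is optional.

```php
<?php
declare(strict_types=1);

// Local copy of the interface so the sketch runs standalone. In an application,
// implement Daycry\Jobs\Metrics\MetricsCollectorInterface instead.
interface MetricsCollectorInterface
{
    public function increment(string $counter, int $value = 1, array $labels = []): void;
    public function observe(string $metric, float $value, array $labels = []): void;
    public function getSnapshot(): array;
}

// Hypothetical collector: formats lines and passes them to a sink callable.
final class StatsdLineCollector implements MetricsCollectorInterface
{
    private Closure $sink;

    public function __construct(?Closure $sink = null)
    {
        // Default sink discards lines; replace it with a UDP or Redis writer.
        $this->sink = $sink ?? static function (string $line): void {};
    }

    public function increment(string $counter, int $value = 1, array $labels = []): void
    {
        ($this->sink)($counter . ':' . $value . '|c' . $this->tags($labels));
    }

    public function observe(string $metric, float $value, array $labels = []): void
    {
        ($this->sink)($metric . ':' . $value . '|h' . $this->tags($labels));
    }

    public function getSnapshot(): array
    {
        // Nothing is aggregated locally; values live in the external store.
        return ['counters' => [], 'histograms' => []];
    }

    /** Renders labels as "|#name:value,name:value", or "" when there are none. */
    private function tags(array $labels): string
    {
        if ($labels === []) {
            return '';
        }

        $pairs = [];
        foreach ($labels as $name => $value) {
            $pairs[] = $name . ':' . $value;
        }

        return '|#' . implode(',', $pairs);
    }
}
```

Pushing each event to an external store keeps the worker stateless with respect to metrics, so
counters from several worker processes can be combined by whatever aggregates the store.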

### Counters emitted by the worker

Every counter carries a `queue` label.

| Counter | Incremented when |
|---------|------------------|
| `jobs_fetched` | A message was leased from the backend. |
| `jobs_rejected_signature` | A message failed HMAC signature verification (then abandoned). |
| `jobs_skipped_idempotent` | A message was skipped because its idempotency key was already processed. |
| `jobs_succeeded` | A job ran successfully and was acked. |
| `jobs_failed` | A job attempt failed (before deciding requeue vs dead-letter). |
| `jobs_requeued` | A failed job had retries left and was nacked with backoff. |
| `jobs_failed_permanently` | A failed job exhausted its retries and was abandoned. |

### Reading metrics

The default `InMemoryMetricsCollector` aggregates counters/histograms in process memory (with a
cardinality cap and FIFO eviction so a long-running worker cannot grow unbounded). Read a snapshot:

```php
use Daycry\Jobs\Metrics\Metrics;

$snapshot = Metrics::get()?->getSnapshot();
// ['counters' => ['jobs_succeeded|queue=reports' => 42, ...], 'histograms' => [...]]
```

> **Note:** In-memory metrics live only for the lifetime of one worker process and are not scraped
> across processes. For production monitoring (e.g. Prometheus), implement
> `MetricsCollectorInterface` with an exporter that writes to a shared, scrapeable store — for
> example a Redis/StatsD-backed collector or a Prometheus pushgateway client — and point
> `Config\Jobs::$metricsCollector` at it. Set the config to `null` to disable metrics entirely (all
> `increment`/`observe` calls become no-ops).

In addition to metrics, the worker logs operational events through CodeIgniter's logger: rejected
signatures and retry exhaustion are logged at `critical`, and backend errors surface as CLI error
output. Aggregate these logs centrally and alert on retry exhaustion and signature rejections, the
same events counted by `jobs_failed_permanently` and `jobs_rejected_signature`.

## See also

- [Queues & Backends](QUEUES.md) — backend semantics and recovery model.
- [CLI Commands](COMMANDS.md) — `jobs:queue:work`, `jobs:queue:reap`, `jobs:queue:purge`.
- [Retries & Backoff](RETRIES.md) — retry budget and the dead-letter relationship.
- [Configuration](CONFIGURATION.md) — every operational setting referenced here.
- [Scheduling](scheduling.md) — the cron runner that feeds queued work.