// LIVE DIAGNOSIS

See the failure explained, not just flagged

When a monitored job fails, most cron monitors tell you only that it failed. CronShield's paid tiers read the failing run's logs and put the last log line plus a specific likely cause in the alert. Below are three common cron failures, each diagnosed from its raw log.

The diagnoses below are the actual output of the same model (claude-opus-4-8) the paid tiers use, run over the exact prompt the app builds. The logs are realistic reproductions of common failures; the analysis is the model's real work. None of the analysis is fabricated.

exit 1

Nightly analytics ETL

Python exception (exit 1) — a KeyError deep in a data pipeline

Raw run log

[2026-07-04 02:00:04] INFO  starting nightly ETL
[2026-07-04 02:00:05] INFO  connecting to warehouse
[2026-07-04 02:00:07] INFO  loaded 41,882 rows from orders
Traceback (most recent call last):
  File "/opt/etl/run.py", line 118, in <module>
    main()
  File "/opt/etl/run.py", line 96, in main
    revenue = totals["net_revenue"] / totals["order_count"]
KeyError: 'order_count'

CronShield diagnosis

Nightly analytics ETL crashed at run.py:96 with KeyError: 'order_count' while computing revenue.

The revenue calculation totals["net_revenue"] / totals["order_count"] failed because the order_count key is absent from totals — and note it is specifically order_count, not net_revenue, that's missing (the numerator resolved). The 41,882 orders loaded fine, so the row read isn't the problem; the aggregation that should populate order_count didn't emit that key this run. Most likely the aggregation groups/filters produced no order_count field (a renamed column, a SELECT that dropped COUNT(*), or an all-NULL/empty group). Fix: make the code defensive with totals.get("order_count", 0) and guard the divide-by-zero, and verify the upstream query still returns an order_count column — a schema/column rename upstream is the usual trigger.

Grounded on this log evidence

KeyError: 'order_count'
  File "/opt/etl/run.py", line 96, in main
    revenue = totals["net_revenue"] / totals["order_count"]

Why this is useful: It doesn't just say "KeyError" — it isolates which key is missing versus present, rules out the row-load step using the "loaded 41,882 rows" line, and points at an upstream column/aggregation contract change plus a divide-by-zero guard. More on exit code 1.
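The defensive fix suggested in the diagnosis can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function name `revenue_per_order` is hypothetical, and returning `None` for a missing or zero count is an assumed policy (a real pipeline might prefer to fail the run with a clearer error).

```python
def revenue_per_order(totals):
    """Return net revenue per order, or None if order_count is missing or zero.

    Hypothetical helper: `totals` is assumed to be a dict shaped like the one
    in the traceback above.
    """
    # .get() avoids the KeyError when the upstream query drops the column.
    order_count = totals.get("order_count", 0)
    if not order_count:
        # Covers both a missing key and a zero count (divide-by-zero guard).
        return None
    return totals["net_revenue"] / order_count


print(revenue_per_order({"net_revenue": 1200.0, "order_count": 48}))  # 25.0
print(revenue_per_order({"net_revenue": 1200.0}))  # None
```

Swallowing the missing key keeps the job alive but can hide the underlying schema change, so it is worth pairing with the upstream column check the diagnosis recommends.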

exit 137

Image thumbnailer batch

OOM kill (exit 137) — the kernel killed the process, not a code bug

Raw run log

[batch] processing 12,400 images at 4096px
[batch] worker-3 resizing IMG_88213.tiff (287 MB decoded)
[batch] worker-3 resizing IMG_88214.tiff (301 MB decoded)
[ 4823.551] Out of memory: Killed process 20417 (python) total-vm:9812004kB, anon-rss:7994512kB, file-rss:0kB
/var/spool/cron/run.sh: line 12: 20417 Killed                  python thumbnail.py --size 4096

CronShield diagnosis

Image thumbnailer batch was OOM-killed (exit 137 = SIGKILL) by the kernel while worker-3 decoded a ~300 MB TIFF.

Exit 137 is 128+9 (SIGKILL), and the log shows the Linux OOM-killer — not your code — terminated it: Out of memory: Killed process 20417 (python) with anon-rss ~7.99 GB. The process ballooned to ~8 GB decoding 4096px TIFFs (IMG_88214.tiff (301 MB decoded)) — full-resolution TIFF decodes are many times their on-disk size and you're holding several in memory at once across workers. This is a resource ceiling, so a plain retry will just be killed again. Fix: cap worker concurrency, process images in a streaming/tiled fashion or downscale before full decode, and/or raise the box/container memory limit; if this runs in a cgroup, its memory.limit is the hard ceiling to lift.

Grounded on this log evidence

[ 4823.551] Out of memory: Killed process 20417 (python) total-vm:9812004kB, anon-rss:7994512kB, file-rss:0kB
/var/spool/cron/run.sh: line 12: 20417 Killed                  python thumbnail.py --size 4096

Why this is useful: It traces exit 137 to SIGKILL and on to the OOM-killer (not an app bug), quantifies the ~8 GB RSS, ties it to full-resolution TIFF decoding, and warns that a retry won't help, which points to an actionable memory-side fix. More on exit code 137.
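The 128+9 arithmetic in the diagnosis follows the common shell convention that a process killed by signal N is reported as exit status 128+N. A minimal sketch of that decoding, using Python's standard `signal` module, is shown here; the function name `describe_exit_status` is hypothetical, and the mapping is a convention rather than a guarantee (a program can also choose to exit with a status above 128 on its own).

```python
import signal


def describe_exit_status(status):
    """Give a short description of a shell-style exit status (0-255)."""
    if status == 0:
        return "success"
    if status == 126:
        return "command found but not executable"
    if status == 127:
        return "command not found"
    if status > 128:
        # Shells report death-by-signal as 128 + the signal number.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with this number on the current platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {status}"


print(describe_exit_status(137))  # on Linux: killed by SIGKILL
```

The exit status alone only says the process received SIGKILL; it is the `Out of memory: Killed process` line in the log that attributes the kill to the OOM-killer.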

exit 127

DB backup to S3

command not found (exit 127) — a cron PATH gap that falsely reported success

Raw run log

+ export PGPASSWORD
+ pg_dump -h db.internal -U backup app_prod
+ gzip -9
+ aws s3 cp - s3://acme-backups/app_prod/2026-07-04.sql.gz
/usr/local/bin/backup.sh: line 8: aws: command not found
+ echo 'backup finished'
backup finished

CronShield diagnosis

DB backup to S3 failed at backup.sh:8 — aws: command not found (exit 127) — and, dangerously, it still printed 'backup finished'.

Exit 127 means 'command not found', and line 8 confirms it: aws: command not found. The pg_dump | gzip produced output but the upload step never ran because the aws CLI isn't on PATH in the cron environment — cron runs with a minimal PATH (often just /usr/bin:/bin) that excludes /usr/local/bin where the AWS CLI usually lives. Critically, the script continued and printed 'backup finished', so a check that only greps for that string would falsely report success while NO backup was uploaded. Fix: call the CLI by absolute path (e.g. /usr/local/bin/aws) or set PATH at the top of the script, and add set -euo pipefail so the pipeline fails loudly instead of printing a false success.

Grounded on this log evidence

/usr/local/bin/backup.sh: line 8: aws: command not found
+ echo 'backup finished'
backup finished

Why this is useful: The standout insight is that the job reported success falsely (it printed 'backup finished' after the failed upload) — exactly the silent-failure class CronShield exists to catch. It names exit 127, the cron-PATH root cause, and the absolute-path / set -e fix. More on exit code 127.
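One way to catch this class of failure before the job runs is a pre-flight check that resolves each required command against the PATH cron will actually use. The sketch below relies on the standard library's `shutil.which`; the `CRON_PATH` value is an assumption based on the minimal PATH mentioned in the diagnosis, and `missing_commands` is a hypothetical helper, not part of CronShield.

```python
import shutil

# Assumption: a typical minimal cron PATH; check your own crontab or system.
CRON_PATH = "/usr/bin:/bin"


def missing_commands(commands, path=CRON_PATH):
    """Return the commands that cannot be resolved on the given PATH."""
    return [cmd for cmd in commands if shutil.which(cmd, path=path) is None]


# Commands taken from the backup script's trace above.
missing = missing_commands(["pg_dump", "gzip", "aws"])
if missing:
    print("not on cron PATH:", ", ".join(missing))
```

A check like this complements, rather than replaces, the diagnosis's advice: the script should still stop on the first failed step instead of going on to print a success message.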

Get this on your own jobs

The free tier alerts you the moment a job misses its expected run. Paid tiers add the log-aware diagnosis you see above — the last log line and a likely cause, in the alert. The heartbeat receiver ships in an upcoming release; see the plans for what each tier adds.
