What grace period should I use for a daily ETL job?

Set the grace period above the maximum expected pipeline duration. If the job normally takes 30 minutes but occasionally takes up to 2 hours, use a 2.5-hour grace period. Tune it down once you have a few weeks of runtime data to know the true maximum.
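
As a rough illustration of the sizing rule above, the sketch below derives a grace period from a list of observed runtimes. The function name, the 25% margin, and the 5-minute floor are assumptions chosen for the example, not CronShield settings.

```python
from datetime import timedelta

def suggest_grace_period(runtimes_minutes, margin=0.25, floor_minutes=5.0):
    """Return a grace period somewhat above the longest observed runtime."""
    if not runtimes_minutes:
        raise ValueError("need at least one observed runtime")
    longest = max(runtimes_minutes)
    # Headroom above the observed maximum (assumed: 25%, at least 5 minutes).
    headroom = max(longest * margin, floor_minutes)
    return timedelta(minutes=longest + headroom)

# Mostly 30-minute runs with one 2-hour outlier.
history = [28, 31, 30, 33, 120, 29]
print(suggest_grace_period(history))  # 2:30:00
```

Re-running this over a few weeks of real runtimes shows whether the 2-hour outlier is typical or whether the grace period can be tightened.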

Should I monitor each ETL stage separately?

For complex pipelines, yes. Each stage can have its own heartbeat so you know which step failed, not just that the pipeline failed. Use the start+success pattern on each stage and a final success ping after the load step. This gives you per-stage failure visibility without rebuilding the orchestration.
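
One possible shape for per-stage heartbeats is sketched below. It assumes each stage has its own check URL and that the /start and /fail sub-paths behave as described elsewhere on this page. The helper names and the example.invalid URLs are hypothetical, and the standard library's urllib is used only to keep the sketch dependency-free.

```python
import urllib.request

def http_ping(url):
    """Best-effort GET: a monitoring outage should never fail the pipeline."""
    try:
        urllib.request.urlopen(url, timeout=10).close()
    except Exception:
        pass

def run_stage(func, check_url, ping=http_ping):
    """Run one stage wrapped in start, success, and fail pings."""
    ping(f"{check_url}/start")
    try:
        result = func()
    except Exception:
        ping(f"{check_url}/fail")
        raise
    ping(check_url)
    return result

def run_pipeline(stages, pipeline_url, ping=http_ping):
    """stages is a list of (callable, check_url) pairs, run in order."""
    for func, check_url in stages:
        run_stage(func, check_url, ping)
    # Final success ping: every stage, including the load, completed.
    ping(pipeline_url)

# Dry run with a recording ping so nothing is sent over the network.
sent = []

def broken_transform():
    raise RuntimeError("bad row")

stages = [
    (lambda: None, "https://example.invalid/extract"),
    (broken_transform, "https://example.invalid/transform"),
    (lambda: None, "https://example.invalid/load"),
]
try:
    run_pipeline(stages, "https://example.invalid/pipeline", ping=sent.append)
except RuntimeError:
    pass
print(sent[-1])  # https://example.invalid/transform/fail
```

In the dry run the last recorded ping is the transform stage's failure signal, and the pipeline-level success ping is never sent, so both monitors point at the same step.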

How to alert when a scheduled ETL job is late

To alert when a scheduled ETL job is late, use dead man's switch monitoring: have the ETL job send an HTTP ping to a monitor every time it completes successfully, and the monitor alerts you when the expected ping doesn't arrive within the grace period. For pipeline-duration awareness, also send a start signal at the beginning so the monitor can measure elapsed time and alert if the job is still running past its expected maximum duration.

What counts as a late ETL job?

An ETL job is late when any of the following are true:

The job didn't finish by the expected completion time (success ping is overdue).
The job started but is still running past its maximum expected duration.
The job was never triggered — the schedule is broken, the machine is down, or the cron entry was removed.
The job finished but reported a failure (a non-success ping or an error exit code).
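
The four conditions above can be expressed as a small classifier. This is a sketch of how a monitor might reason about a single run, not CronShield's actual logic. The function name, parameters, and returned labels are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

def lateness_reason(
    now: datetime,
    deadline: datetime,
    max_duration: timedelta,
    started_at: Optional[datetime] = None,
    succeeded_at: Optional[datetime] = None,
    failed_at: Optional[datetime] = None,
) -> Optional[str]:
    """Return why the current run counts as late, or None if it looks healthy."""
    if failed_at is not None:
        return "reported failure"
    if succeeded_at is not None:
        return None  # a success ping arrived for this run
    if started_at is None:
        # No start signal at all: only a problem once the deadline has passed.
        return "never triggered" if now > deadline else None
    if now - started_at > max_duration:
        return "running past maximum duration"
    if now > deadline:
        return "success ping overdue"
    return None

deadline = datetime(2024, 1, 1, 3, 0)
limit = timedelta(hours=2)
print(lateness_reason(datetime(2024, 1, 1, 3, 30), deadline, limit))
# never triggered
print(lateness_reason(datetime(2024, 1, 1, 2, 45), deadline, limit,
                      started_at=datetime(2024, 1, 1, 0, 30)))
# running past maximum duration
```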

How do I send a heartbeat from an ETL pipeline?

etl_pipeline.py

import httpx

PING_URL = "https://ping.cronshield.com/<your-check-id>"

def ping(suffix=""):
    # Best-effort: a monitoring outage must never break the pipeline itself.
    try:
        httpx.get(f"{PING_URL}{suffix}", timeout=10)
    except httpx.HTTPError:
        pass

def run_etl():
    # Signal the start of the pipeline.
    ping("/start")

    try:
        # extract(), transform(), and load() are your own pipeline stages.
        extract()
        transform()
        load()
    except Exception:
        # Signal failure explicitly so the monitor records it, then re-raise.
        ping("/fail")
        raise

    # Signal success: the full pipeline completed.
    ping()

if __name__ == "__main__":
    run_etl()

Configure the monitor's period to match the ETL schedule (hourly, daily, etc.) and set the grace period to the longest delay you can tolerate before stale data warrants an alert. If the pipeline normally takes 20 minutes, a grace period of 45 minutes catches a slow run before it becomes a problem.
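
To make the period and grace arithmetic concrete, here is a minimal sketch. It assumes the monitor alerts once the last success ping is older than period plus grace; that is a common dead man's switch rule, but it is an assumption here, not documented CronShield behavior. The function names are illustrative.

```python
from datetime import datetime, timedelta

def alert_time(last_success, period, grace):
    """Moment the alert would fire if no further success ping arrives."""
    return last_success + period + grace

def is_overdue(now, last_success, period, grace):
    return now > alert_time(last_success, period, grace)

# Daily job whose last success ping arrived at 02:20, with a 45-minute grace.
last = datetime(2024, 1, 1, 2, 20)
fires = alert_time(last, timedelta(days=1), timedelta(minutes=45))
print(fires)  # 2024-01-02 03:05:00
```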

How do I alert on stale data, not just a missed run?

A heartbeat confirms the ETL ran; it doesn't confirm the data is fresh. Add a data-freshness check: after the pipeline loads, write a watermark timestamp to the destination (a dedicated table row or a metadata field), and alert when that watermark is older than the acceptable staleness window. To catch a pipeline that ran but loaded stale source data, set the watermark to the newest source-record timestamp in the load rather than the wall-clock time of the run; a load-time watermark only proves that the job ran.
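
A watermark check along these lines might look like the sketch below. It uses SQLite only so the example is self-contained. The etl_watermark table, its columns, and the function names are hypothetical, and the watermark is assumed to hold the newest source-record timestamp covered by the load.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def init_watermark_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_watermark ("
        "pipeline TEXT PRIMARY KEY, loaded_through TEXT NOT NULL)"
    )

def write_watermark(conn, pipeline, loaded_through):
    """Record the newest source timestamp covered by the latest load."""
    conn.execute(
        "INSERT OR REPLACE INTO etl_watermark (pipeline, loaded_through) "
        "VALUES (?, ?)",
        (pipeline, loaded_through.isoformat()),
    )
    conn.commit()

def is_stale(conn, pipeline, max_age, now=None):
    """True when the watermark is missing or older than max_age."""
    now = now or datetime.now(timezone.utc)
    row = conn.execute(
        "SELECT loaded_through FROM etl_watermark WHERE pipeline = ?",
        (pipeline,),
    ).fetchone()
    if row is None:
        return True
    return now - datetime.fromisoformat(row[0]) > max_age

conn = sqlite3.connect(":memory:")
init_watermark_table(conn)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

write_watermark(conn, "daily_sales", now - timedelta(hours=3))
print(is_stale(conn, "daily_sales", timedelta(hours=2), now=now))  # True

write_watermark(conn, "daily_sales", now - timedelta(minutes=30))
print(is_stale(conn, "daily_sales", timedelta(hours=2), now=now))  # False
```

A separate scheduled check can call is_stale and raise an alert, so freshness is verified independently of whether the pipeline itself reported success.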

In the etl_pipeline.py example, PING_URL is a placeholder for the endpoint you get when you create a monitor. The /start and /fail sub-paths are a convention supported by heartbeat monitors like CronShield; the CronShield receiver ships in an upcoming release.

Add a missed-run alert to this job

The free tier gives you a heartbeat endpoint and an email alert when an expected ping doesn't arrive. Paid tiers add log-aware diagnosis: the alert includes the last log line and a likely cause. The heartbeat receiver ships in an upcoming release; see the plans to learn what each tier adds.
