What does 'unhealthy' mean in the Airflow health endpoint?

The scheduler reports unhealthy when its last heartbeat is older than scheduler_health_check_threshold (30 seconds by default). An unhealthy scheduler means no new task instances are being dispatched, so DAGs that were due to run will queue but never start.

Does Airflow retry failed tasks automatically?

Yes, if you set retries on a task (or globally in default_args). Each retry is a separate task instance. On_failure_callback fires after all retries are exhausted.

How to monitor Apache Airflow DAG schedules

Monitor Airflow DAG schedules by watching the scheduler's heartbeat via the /api/v2/monitor/health endpoint, which flags the scheduler unhealthy if its last heartbeat is more than 30 seconds old by default. For individual DAG run monitoring, use the on_failure_callback on a task or DAG to send an alert when a task fails, and check data_interval_end against the actual execution time to catch late or missed runs. The scheduler's heartbeat is the single point of failure — if it stops, all DAG runs stop with it.

How do I check if the Airflow scheduler is healthy?

Airflow exposes a health endpoint that reports the scheduler's last heartbeat. If the heartbeat is stale, the scheduler has stopped dispatching runs:

curl -s https://<your-airflow-host>/api/v2/monitor/health | python3 -m json.tool
# Look for: "scheduler": {"status": "healthy", "latest_scheduler_heartbeat": "..."}
# "status": "unhealthy" means the scheduler is not running or is hung.

The scheduler is considered unhealthy when its last heartbeat is more than 30 seconds old (the default scheduler_health_check_threshold). Poll this endpoint from an external monitor and alert when status is not 'healthy'.

How do I get alerted when a task or DAG run fails?

Use on_failure_callback on the DAG or individual tasks to run a function when a task fails. A common pattern sends a notification to Slack, email, or a webhook:

dags/nightly_report.py

import httpx
from airflow.sdk import DAG, task
from datetime import datetime

def on_failure(context):
    task_id = context["task"].task_id
    dag_id = context["dag"].dag_id
    httpx.post("https://hooks.slack.com/your-webhook", json={
        "text": f"Airflow task failed: {dag_id}/{task_id}"
    }, timeout=10)

with DAG(
    "nightly_report",
    schedule="0 3 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    on_failure_callback=on_failure,
) as dag:
    @task
    def run_report():
        # job logic here
        pass
    run_report()

How do I detect a missed or late DAG run?

Airflow's DAG view shows each run's data_interval_end (when the run was scheduled to cover data through) and the actual execution start. A run that is far past its data_interval_end without starting is late or missed. You can query this via the Airflow REST API and alert when a run in 'queued' or 'scheduled' state is overdue:

# List the latest DAG run and its state
curl -s "https://<host>/api/v2/dags/nightly_report/dagRuns?limit=1&order_by=-start_date" \
  -u airflow:airflow | python3 -m json.tool

catchup=False is important for production DAGs — without it, Airflow attempts to backfill every missed run since start_date, which can flood your workers on a restart.

Add a missed-run alert to this job

The free tier gives you a heartbeat endpoint and an email alert when an expected ping doesn't arrive. Paid tiers add the log-aware diagnosis — the last log line and a likely cause in the alert. The heartbeat receiver ships in an upcoming release; see the plans to learn what each tier adds.

Start free View pricing

Frequently asked questions

What does 'unhealthy' mean in the Airflow health endpoint?: The scheduler reports unhealthy when its last heartbeat is older than scheduler_health_check_threshold (30 seconds by default). An unhealthy scheduler means no new task instances are being dispatched, so DAGs that were due to run will queue but never start.
Does Airflow retry failed tasks automatically?: Yes, if you set retries on a task (or globally in default_args). Each retry is a separate task instance. On_failure_callback fires after all retries are exhausted.