# Real-Time vs. Scheduled Data Sync: Webhooks vs. Polling
A technical breakdown of Webhooks and Polling for data synchronisation, covering implementation details, error handling and cost implications.
## TL;DR Matrix
| Dimension | Webhooks (Push) | Polling (Pull) |
|---|---|---|
| Data Freshness | Real-time | Delayed (by up to the polling interval) |
| Server Load (Provider) | Low. Event-driven. | High. Constant requests. |
| Server Load (Consumer) | Bursty. Scales with event volume. | Predictable. Based on schedule. |
| Implementation | More complex. Requires a public endpoint, signature verification. | Simpler. Standard client-side logic. |
| Network Traffic | Efficient. Data sent only on change. | Inefficient. Many requests return no new data. |
| State Management | Minimal. Provider manages retry state. | Critical. Client must track last fetch point (e.g., timestamp, ID). |
## Use Cases
### Webhooks (Real-Time)
Choose webhooks when your system needs to react immediately to events happening in another system.
- Shopify order creation notifications to a fulfilment service.
- GitHub push events triggering a CI/CD build on Jenkins.
- Stripe `payment_intent.succeeded` events updating an invoice status.
### Polling (Scheduled)
Choose polling when real-time data isn't critical, or the source system doesn't offer webhooks.
- A dashboard fetching analytics data from Google Analytics every hour.
- A marketing tool synchronising user lists from a CRM nightly.
- Checking the status of a long-running export job from an ERP system every five minutes.
## Technical Analysis
### Webhooks (Push Model)
A webhook is an HTTP callback. When an event occurs in the source system (the provider), it sends an HTTP POST request to a pre-configured URL in your system (the consumer).
```mermaid
sequenceDiagram
    participant Provider as Provider System
    participant Consumer as Consumer System
    Note over Provider,Consumer: One-time setup: Consumer registers a URL
    Consumer->>Provider: `POST /webhook/register`
    Provider-->>Consumer: `200 OK`
    Note over Provider,Consumer: Event-driven flow
    Provider->>Provider: Event occurs (e.g., Order Created)
    Provider->>Consumer: `POST /webhooks/orders` with JSON payload
    Consumer-->>Provider: `202 Accepted`
```
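The registration step is usually done through the provider's API. As a sketch, GitHub lets you create a repository webhook by POSTing a payload like the one built below to `POST /repos/{owner}/{repo}/hooks`; the callback URL and secret here are placeholders.

```python
# Sketch: building the registration payload for GitHub's
# "create a repository webhook" endpoint (POST /repos/{owner}/{repo}/hooks).
# The callback URL and secret below are illustrative placeholders.
def build_webhook_registration(callback_url, secret, events=("push",)):
    return {
        "name": "web",           # "web" is the value for API-managed hooks
        "active": True,
        "events": list(events),  # which event types to deliver
        "config": {
            "url": callback_url,       # your public HTTPS endpoint
            "content_type": "json",    # deliver payloads as application/json
            "secret": secret,          # used to sign each delivery
        },
    }

payload = build_webhook_registration(
    "https://example.com/webhooks/github", "s3cr3t",
    events=("push", "pull_request"),
)
# This dict would be POSTed (with an auth token) to the provider's API.
```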
A typical webhook payload from GitHub for a push event looks like this:
```json
{
  "ref": "refs/heads/main",
  "before": "a1b2c3d4e5f6...",
  "after": "f6e5d4c3b2a1...",
  "repository": {
    "id": 12345678,
    "name": "my-repo",
    "full_name": "my-org/my-repo"
  },
  "pusher": {
    "name": "octocat",
    "email": "[email protected]"
  },
  "commits": [
    {
      "id": "f6e5d4c3b2a1...",
      "message": "Fix: Corrected a typo in the README.",
      "timestamp": "2023-10-27T10:00:00Z",
      "author": {
        "name": "Octocat",
        "email": "[email protected]"
      }
    }
  ]
}
```
Your consumer application needs an exposed HTTP endpoint to receive this data.
```javascript
// Node.js with Express
const express = require('express')
const crypto = require('crypto')
const app = express()

// Keep the raw body around for signature verification
app.use(express.json({ verify: (req, res, buf) => { req.rawBody = buf } }))

const GITHUB_SECRET = process.env.GITHUB_WEBHOOK_SECRET

app.post('/webhooks/github', (req, res) => {
  // The bit most guides skip: always verify the signature.
  const signature = req.headers['x-hub-signature-256']
  const hmac = crypto.createHmac('sha256', GITHUB_SECRET)
  const digest = 'sha256=' + hmac.update(req.rawBody).digest('hex')
  // timingSafeEqual throws if the buffers differ in length, so check that first
  if (!signature || signature.length !== digest.length ||
      !crypto.timingSafeEqual(Buffer.from(digest), Buffer.from(signature))) {
    return res.status(401).send('Invalid signature')
  }

  // Signature is valid, process the event
  const { ref, commits } = req.body
  console.log(`Push to ${ref} with ${commits.length} commit(s).`)

  // Acknowledge receipt immediately
  res.status(202).send('Accepted')
})

app.listen(3000, () => console.log('Webhook listener running on port 3000'))
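Acknowledging quickly matters because providers typically time out deliveries after a few seconds and treat the timeout as a failure. A common pattern is to enqueue the payload and run the business logic out of band. A minimal in-process sketch of that idea follows; a real deployment would use a durable broker (SQS, Pub/Sub, Redis) rather than an in-memory queue.

```python
import queue
import threading

# Sketch of the "acknowledge fast, process later" pattern.
# In production the queue would be a durable broker, not in-memory.
event_queue = queue.Queue()
processed = []

def handle_delivery(payload):
    """Called by the HTTP handler: enqueue and return immediately."""
    event_queue.put(payload)
    return 202  # acknowledge before any business logic runs

def worker():
    while True:
        payload = event_queue.get()
        if payload is None:  # sentinel to stop the worker
            break
        processed.append(payload["ref"])  # stand-in for real business logic
        event_queue.task_done()

t = threading.Thread(target=worker)
t.start()
status = handle_delivery({"ref": "refs/heads/main"})
event_queue.put(None)
t.join()
```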
### Polling (Pull Model)
Polling involves your system (the client) making scheduled HTTP GET requests to an API endpoint on the source system to ask for new data.
```mermaid
sequenceDiagram
    participant Client as Client System
    participant Provider as Provider API
    loop Every 5 minutes
        Client->>Provider: `GET /api/orders?updated_since=2023-10-27T09:55:00Z`
        Provider-->>Client: `200 OK` with JSON array of new/updated orders
        Client->>Client: Process orders and store new timestamp `2023-10-27T10:00:00Z`
    end
```
To avoid re-fetching all data, the client must track the last time it successfully fetched data. This is usually done with a timestamp or a sequential ID.
Request:
```
GET /api/v1/orders?updated_since=2023-10-27T09:55:00Z&sort=asc
```
Response:
```json
{
  "data": [
    {
      "id": 9876,
      "status": "paid",
      "total": 42.00,
      "updated_at": "2023-10-27T09:58:12Z"
    }
  ],
  "has_more": false
}
```
A polling client can be a simple scheduled script.
```python
# Python with requests and schedule
import os
import time

import requests
import schedule

API_ENDPOINT = "https://api.example.com/v1/orders"
API_KEY = os.environ.get("API_KEY")
STATE_FILE = "last_sync_timestamp.txt"

def get_last_sync_time():
    try:
        with open(STATE_FILE, 'r') as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01T00:00:00Z"  # Default to epoch on first run

def set_last_sync_time(timestamp):
    with open(STATE_FILE, 'w') as f:
        f.write(timestamp)

def fetch_new_orders():
    # The bit most guides skip: state must be managed reliably.
    last_sync = get_last_sync_time()
    print(f"Fetching orders updated since {last_sync}")
    try:
        response = requests.get(
            API_ENDPOINT,
            # Ascending sort guarantees the last item holds the newest timestamp
            params={"updated_since": last_sync, "sort": "asc"},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        response.raise_for_status()  # Raises HTTPError for 4xx or 5xx responses
        orders = response.json().get('data', [])
        if not orders:
            print("No new orders found.")
            return
        for order in orders:
            print(f"Processing order ID: {order['id']}")
            # ... business logic ...
        # Update state only after successful processing
        latest_timestamp = orders[-1]['updated_at']
        set_last_sync_time(latest_timestamp)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")

schedule.every(5).minutes.do(fetch_new_orders)

while True:
    schedule.run_pending()
    time.sleep(1)
```
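The sample response earlier includes a `has_more` flag, which this script ignores. If a single poll can return more rows than fit in one page, the client should keep requesting until `has_more` is false, advancing the cursor each page. A sketch against a stubbed fetch function (the real call would be `requests.get` as above):

```python
def fetch_all_pages(fetch_page, initial_since):
    """Drain all pages; fetch_page(since) returns a dict shaped like the
    API response above: {"data": [...], "has_more": bool}."""
    orders, since = [], initial_since
    while True:
        page = fetch_page(since)
        orders.extend(page["data"])
        if not page["has_more"] or not page["data"]:
            break
        # Advance the cursor to the newest timestamp seen so far
        since = page["data"][-1]["updated_at"]
    return orders

# Stub standing in for two pages of API results
pages = iter([
    {"data": [{"id": 1, "updated_at": "2023-10-27T09:58:12Z"}], "has_more": True},
    {"data": [{"id": 2, "updated_at": "2023-10-27T09:59:30Z"}], "has_more": False},
])
result = fetch_all_pages(lambda since: next(pages), "2023-10-27T09:55:00Z")
```

Note that a timestamp cursor can re-fetch rows that share the boundary timestamp, which is another reason processing must be idempotent.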
## Error Handling
### Webhooks
Error handling is a shared responsibility, but the consumer's behaviour is key.
- Consumer Downtime (`5xx` errors): If your endpoint returns a `500` or `503`, the provider should interpret this as a temporary failure and retry. Most services use an exponential backoff schedule (e.g., retry after 1, 2, 4, 8 minutes). If your endpoint is down for an extended period, the provider will eventually give up, and you'll lose that event. Many providers offer a dead-letter queue (DLQ) for failed events.
- Consumer Overload (`429` errors): If your service can't process events as fast as they arrive, you should return a `429 Too Many Requests` status. This signals to a well-behaved provider to slow down its delivery rate.
- Permanent Errors (`4xx` errors): Returning a `400 Bad Request` or `401 Unauthorized` tells the provider the request is invalid and should not be retried. Some providers will automatically disable a webhook after several consecutive `4xx` failures.
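Because providers retry on failure, the same event can arrive more than once, so the consumer should deduplicate. GitHub, for example, includes a unique delivery ID in the `X-GitHub-Delivery` header. A minimal in-memory sketch of exactly-once processing per delivery ID (production would use a database table or a cache entry with a TTL):

```python
processed_ids = set()  # in production: a database table or Redis set with a TTL

def handle_event(delivery_id, payload, handled):
    """Process an event at most once per delivery ID."""
    if delivery_id in processed_ids:
        return "duplicate"  # a retry of something already handled
    processed_ids.add(delivery_id)
    handled.append(payload)  # stand-in for real business logic
    return "processed"

handled = []
first = handle_event("abc-123", {"ref": "refs/heads/main"}, handled)
retry = handle_event("abc-123", {"ref": "refs/heads/main"}, handled)
```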
### Polling
Error handling is entirely the client's responsibility.
- Provider Downtime (`5xx` errors): If the API you're polling is down, your client script should catch the exception and implement its own retry logic, perhaps with a simple backoff. Because it's on a schedule, the next polling attempt will naturally retry the request anyway.
- Rate Limiting (`429` errors): It's common to hit API rate limits with frequent polling. A well-designed client must inspect the response for a `429` status code and respect the `Retry-After` header if present. If not, implement a client-side exponential backoff.
- State Corruption: If your polling script fails after fetching data but before saving the new `updated_since` timestamp, it will re-process the same data on its next run. Your processing logic must be idempotent to handle this gracefully.
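The `Retry-After` logic above can be captured in a small helper: trust the header when the server supplies one, otherwise fall back to exponential backoff with a cap. The header can also be an HTTP date rather than a number of seconds; this sketch falls back to backoff in that case for brevity.

```python
def next_delay(attempt, retry_after_header=None, base=1.0, cap=300.0):
    """Seconds to wait before the next poll after a 429.

    Prefers the server's Retry-After value (delta-seconds form only);
    otherwise doubles the delay each attempt, capped at `cap`.
    """
    if retry_after_header is not None:
        try:
            return float(retry_after_header)
        except ValueError:
            pass  # HTTP-date form: fall through to backoff
    return min(base * (2 ** attempt), cap)

delay = next_delay(2)            # no header: exponential backoff -> 4.0 seconds
delay_hdr = next_delay(2, "30")  # header wins -> 30.0 seconds
```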
## Cost & Scalability
### Webhooks
- Compute Cost: For the consumer, costs are directly proportional to the number of events. This can be unpredictable. Using serverless functions (e.g., AWS Lambda, Google Cloud Functions) is a cost-effective way to handle this, as you only pay for compute time when an event is being processed.
- Scalability: Webhooks scale with event volume. A sudden spike in events can overwhelm a small server. Auto-scaling infrastructure or a serverless architecture is essential for high-volume webhook consumers.
### Polling
- Compute Cost: For the client, costs are predictable and fixed based on the polling schedule, not the data volume. A cron job on a single small virtual machine can run for a very low, fixed monthly cost.
- API Cost: Polling can be expensive if the provider charges per API call. Thousands of polls per day that return no new data are wasted calls that may still incur costs or count against a rate limit quota.
- Scalability: Polling doesn't scale well from the provider's perspective. It creates a high, constant load on their infrastructure to serve requests that often yield nothing new. This is why providers heavily rate-limit polling-heavy endpoints.
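The call volume behind that quota pressure is easy to underestimate. A quick back-of-envelope calculation, assuming a five-minute interval and an illustrative fleet of 50 independently polled resources:

```python
# Back-of-envelope: API calls generated by a fixed polling schedule
interval_minutes = 5
resources = 50  # illustrative: 50 tenants or endpoints polled independently

calls_per_day_per_resource = (24 * 60) // interval_minutes
total_calls_per_day = calls_per_day_per_resource * resources

print(calls_per_day_per_resource)  # 288
print(total_calls_per_day)         # 14400
```

If most of those 14,400 daily calls return an empty `data` array, nearly all of that spend and quota is buying nothing.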