# Real-Time vs. Scheduled Data Sync: Webhooks vs. Polling
A technical breakdown of Webhooks and Polling for data synchronisation, covering implementation details, error handling and cost implications.
## TL;DR Matrix
| Dimension | Webhooks (Push) | Polling (Pull) |
|---|---|---|
| Data Freshness | Real-time | Delayed (by up to the polling interval) |
| Server Load (Provider) | Low. Event-driven. | High. Constant requests. |
| Server Load (Consumer) | Bursty. Scales with event volume. | Predictable. Based on schedule. |
| Implementation | More complex. Requires a public endpoint, signature verification. | Simpler. Standard client-side logic. |
| Network Traffic | Efficient. Data sent only on change. | Inefficient. Many requests return no new data. |
| State Management | Minimal. Provider manages retry state. | Critical. Client must track last fetch point (e.g., timestamp, ID). |
## Use Cases
### Webhooks (Real-Time)
Choose webhooks when your system needs to react immediately to events happening in another system.
- Shopify order creation notifications to a fulfilment service.
- GitHub push events triggering a CI/CD build on Jenkins.
- Stripe `payment_intent.succeeded` events updating an invoice status.
### Polling (Scheduled)
Choose polling when real-time data isn't critical, or the source system doesn't offer webhooks.
- A dashboard fetching analytics data from Google Analytics every hour.
- A marketing tool synchronising user lists from a CRM nightly.
- Checking the status of a long-running export job from an ERP system every five minutes.
## Technical Analysis
### Webhooks (Push Model)
A webhook is an HTTP callback. When an event occurs in the source system (the provider), it sends an HTTP POST request to a pre-configured URL in your system (the consumer).
```mermaid
sequenceDiagram
    participant Provider as Provider System
    participant Consumer as Consumer System
    Note over Provider,Consumer: One-time setup: Consumer registers a URL
    Consumer->>Provider: `POST /webhook/register`
    Provider-->>Consumer: `200 OK`
    Note over Provider,Consumer: Event-driven flow
    Provider->>Provider: Event occurs (e.g., Order Created)
    Provider->>Consumer: `POST /webhooks/orders` with JSON payload
    Consumer-->>Provider: `202 Accepted`
```
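The registration step is usually done through the provider's API. As a sketch, GitHub lets you create a repository webhook by POSTing a payload like the one built below to `POST /repos/{owner}/{repo}/hooks`; the callback URL and secret here are placeholders.

```python
# Sketch: building the registration payload for GitHub's
# "create a repository webhook" endpoint (POST /repos/{owner}/{repo}/hooks).
# The callback URL and secret below are illustrative placeholders.
def build_webhook_registration(callback_url, secret, events=("push",)):
    return {
        "name": "web",           # "web" is the value for API-managed hooks
        "active": True,
        "events": list(events),  # which event types to deliver
        "config": {
            "url": callback_url,       # your public HTTPS endpoint
            "content_type": "json",    # deliver payloads as application/json
            "secret": secret,          # used to sign each delivery
        },
    }

payload = build_webhook_registration(
    "https://example.com/webhooks/github", "s3cr3t",
    events=("push", "pull_request"),
)
# This dict would be POSTed (with an auth token) to the provider's API.
```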
A typical webhook payload from GitHub for a push event looks like this:
```json
{
  "ref": "refs/heads/main",
  "before": "a1b2c3d4e5f6...",
  "after": "f6e5d4c3b2a1...",
  "repository": {
    "id": 12345678,
    "name": "my-repo",
    "full_name": "my-org/my-repo"
  },
  "pusher": {
    "name": "octocat",
    "email": "[email protected]"
  },
  "commits": [
    {
      "id": "f6e5d4c3b2a1...",
      "message": "Fix: Corrected a typo in the README.",
      "timestamp": "2023-10-27T10:00:00Z",
      "author": {
        "name": "Octocat",
        "email": "[email protected]"
      }
    }
  ]
}
```
Your consumer application needs an exposed HTTP endpoint to receive this data.
```javascript
// Node.js with Express
const express = require('express')
const crypto = require('crypto')
const app = express()

// Keep the raw body around for signature verification
app.use(express.json({ verify: (req, res, buf) => { req.rawBody = buf } }))

const GITHUB_SECRET = process.env.GITHUB_WEBHOOK_SECRET

app.post('/webhooks/github', (req, res) => {
  // The bit most guides skip: always verify the signature.
  const signature = req.headers['x-hub-signature-256']
  const hmac = crypto.createHmac('sha256', GITHUB_SECRET)
  const digest = 'sha256=' + hmac.update(req.rawBody).digest('hex')
  // timingSafeEqual throws if the buffers differ in length, so check that first
  if (!signature || signature.length !== digest.length ||
      !crypto.timingSafeEqual(Buffer.from(digest), Buffer.from(signature))) {
    return res.status(401).send('Invalid signature')
  }

  // Signature is valid, process the event
  const { ref, commits } = req.body
  console.log(`Push to ${ref} with ${commits.length} commit(s).`)

  // Acknowledge receipt immediately
  res.status(202).send('Accepted')
})

app.listen(3000, () => console.log('Webhook listener running on port 3000'))
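Acknowledging quickly matters because providers typically time out deliveries after a few seconds and treat the timeout as a failure. A common pattern is to enqueue the payload and run the business logic out of band. A minimal in-process sketch of that idea follows; a real deployment would use a durable broker (SQS, Pub/Sub, Redis) rather than an in-memory queue.

```python
import queue
import threading

# Sketch of the "acknowledge fast, process later" pattern.
# In production the queue would be a durable broker, not in-memory.
event_queue = queue.Queue()
processed = []

def handle_delivery(payload):
    """Called by the HTTP handler: enqueue and return immediately."""
    event_queue.put(payload)
    return 202  # acknowledge before any business logic runs

def worker():
    while True:
        payload = event_queue.get()
        if payload is None:  # sentinel to stop the worker
            break
        processed.append(payload["ref"])  # stand-in for real business logic
        event_queue.task_done()

t = threading.Thread(target=worker)
t.start()
status = handle_delivery({"ref": "refs/heads/main"})
event_queue.put(None)
t.join()
```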
### Polling (Pull Model)
Polling involves your system (the client) making scheduled HTTP GET requests to an API endpoint on the source system to ask for new data.
```mermaid
sequenceDiagram
    participant Client as Client System
    participant Provider as Provider API
    loop Every 5 minutes
        Client->>Provider: `GET /api/orders?updated_since=2023-10-27T09:55:00Z`
        Provider-->>Client: `200 OK` with JSON array of new/updated orders
        Client->>Client: Process orders and store new timestamp `2023-10-27T10:00:00Z`
    end
```
To avoid re-fetching all data, the client must track the last time it successfully fetched data. This is usually done with a timestamp or a sequential ID.
Request:
```
GET /api/v1/orders?updated_since=2023-10-27T09:55:00Z&sort=asc
```
Response:
```json
{
  "data": [
    {
      "id": 9876,
      "status": "paid",
      "total": 42.00,
      "updated_at": "2023-10-27T09:58:12Z"
    }
  ],
  "has_more": false
}
```
A polling client can be a simple scheduled script.
```python
# Python with requests and schedule
import os
import time

import requests
import schedule

API_ENDPOINT = "https://api.example.com/v1/orders"
API_KEY = os.environ.get("API_KEY")
STATE_FILE = "last_sync_timestamp.txt"

def get_last_sync_time():
    try:
        with open(STATE_FILE, 'r') as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01T00:00:00Z"  # Default to epoch on first run

def set_last_sync_time(timestamp):
    with open(STATE_FILE, 'w') as f:
        f.write(timestamp)

def fetch_new_orders():
    # The bit most guides skip: state must be managed reliably.
    last_sync = get_last_sync_time()
    print(f"Fetching orders updated since {last_sync}")
    try:
        response = requests.get(
            API_ENDPOINT,
            # Ascending sort guarantees the last item holds the newest timestamp
            params={"updated_since": last_sync, "sort": "asc"},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        response.raise_for_status()  # Raises HTTPError for 4xx or 5xx responses
        orders = response.json().get('data', [])
        if not orders:
            print("No new orders found.")
            return
        for order in orders:
            print(f"Processing order ID: {order['id']}")
            # ... business logic ...
        # Update state only after successful processing
        latest_timestamp = orders[-1]['updated_at']
        set_last_sync_time(latest_timestamp)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")

schedule.every(5).minutes.do(fetch_new_orders)

while True:
    schedule.run_pending()
    time.sleep(1)
```
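The sample response earlier includes a `has_more` flag, which this script ignores. If a single poll can return more rows than fit in one page, the client should keep requesting until `has_more` is false, advancing the cursor each page. A sketch against a stubbed fetch function (the real call would be `requests.get` as above):

```python
def fetch_all_pages(fetch_page, initial_since):
    """Drain all pages; fetch_page(since) returns a dict shaped like the
    API response above: {"data": [...], "has_more": bool}."""
    orders, since = [], initial_since
    while True:
        page = fetch_page(since)
        orders.extend(page["data"])
        if not page["has_more"] or not page["data"]:
            break
        # Advance the cursor to the newest timestamp seen so far
        since = page["data"][-1]["updated_at"]
    return orders

# Stub standing in for two pages of API results
pages = iter([
    {"data": [{"id": 1, "updated_at": "2023-10-27T09:58:12Z"}], "has_more": True},
    {"data": [{"id": 2, "updated_at": "2023-10-27T09:59:30Z"}], "has_more": False},
])
result = fetch_all_pages(lambda since: next(pages), "2023-10-27T09:55:00Z")
```

Note that a timestamp cursor can re-fetch rows that share the boundary timestamp, which is another reason processing must be idempotent.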
## Error Handling
### Webhooks
Error handling is a shared responsibility, but the consumer's behaviour is key.
- Consumer Downtime (`5xx` errors): If your endpoint returns a `500` or `503`, the provider should interpret this as a temporary failure and retry. Most services use an exponential backoff schedule (e.g., retry after 1, 2, 4, 8 minutes). If your endpoint is down for an extended period, the provider will eventually give up, and you'll lose that event. Many providers offer a dead-letter queue (DLQ) for failed events.
- Consumer Overload (`429` errors): If your service can't process events as fast as they arrive, you should return a `429 Too Many Requests` status. This signals to a well-behaved provider to slow down its delivery rate.
- Permanent Errors (`4xx` errors): Returning a `400 Bad Request` or `401 Unauthorized` tells the provider the request is invalid and should not be retried. Some providers will automatically disable a webhook after several consecutive `4xx` failures.
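Because providers retry on failure, the same event can arrive more than once, so the consumer should deduplicate. GitHub, for example, includes a unique delivery ID in the `X-GitHub-Delivery` header. A minimal in-memory sketch of exactly-once processing per delivery ID (production would use a database table or a cache entry with a TTL):

```python
processed_ids = set()  # in production: a database table or Redis set with a TTL

def handle_event(delivery_id, payload, handled):
    """Process an event at most once per delivery ID."""
    if delivery_id in processed_ids:
        return "duplicate"  # a retry of something already handled
    processed_ids.add(delivery_id)
    handled.append(payload)  # stand-in for real business logic
    return "processed"

handled = []
first = handle_event("abc-123", {"ref": "refs/heads/main"}, handled)
retry = handle_event("abc-123", {"ref": "refs/heads/main"}, handled)
```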
### Polling
Error handling is entirely the client's responsibility.
- Provider Downtime (`5xx` errors): If the API you're polling is down, your client script should catch the exception and implement its own retry logic, perhaps with a simple backoff. Because it's on a schedule, the next polling attempt will naturally retry the request anyway.
- Rate Limiting (`429` errors): It's common to hit API rate limits with frequent polling. A well-designed client must inspect the response for a `429` status code and respect the `Retry-After` header if present. If not, implement a client-side exponential backoff.
- State Corruption: If your polling script fails after fetching data but before saving the new `updated_since` timestamp, it will re-process the same data on its next run. Your processing logic must be idempotent to handle this gracefully.
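The `Retry-After` logic above can be captured in a small helper: trust the header when the server supplies one, otherwise fall back to exponential backoff with a cap. The header can also be an HTTP date rather than a number of seconds; this sketch falls back to backoff in that case for brevity.

```python
def next_delay(attempt, retry_after_header=None, base=1.0, cap=300.0):
    """Seconds to wait before the next poll after a 429.

    Prefers the server's Retry-After value (delta-seconds form only);
    otherwise doubles the delay each attempt, capped at `cap`.
    """
    if retry_after_header is not None:
        try:
            return float(retry_after_header)
        except ValueError:
            pass  # HTTP-date form: fall through to backoff
    return min(base * (2 ** attempt), cap)

delay = next_delay(2)            # no header: exponential backoff -> 4.0 seconds
delay_hdr = next_delay(2, "30")  # header wins -> 30.0 seconds
```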
## Cost & Scalability
### Webhooks
- Compute Cost: For the consumer, costs are directly proportional to the number of events. This can be unpredictable. Using serverless functions (e.g., AWS Lambda, Google Cloud Functions) is a cost-effective way to handle this, as you only pay for compute time when an event is being processed.
- Scalability: Webhooks scale with event volume. A sudden spike in events can overwhelm a small server. Auto-scaling infrastructure or a serverless architecture is essential for high-volume webhook consumers.
### Polling
- Compute Cost: For the client, costs are predictable and fixed based on the polling schedule, not the data volume. A cron job on a single small virtual machine can run for a very low, fixed monthly cost.
- API Cost: Polling can be expensive if the provider charges per API call. Thousands of polls per day that return no new data are wasted calls that may still incur costs or count against a rate limit quota.
- Scalability: Polling doesn't scale well from the provider's perspective. It creates a high, constant load on their infrastructure to serve requests that often yield nothing new. This is why providers heavily rate-limit polling-heavy endpoints.
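The call volume behind that quota pressure is easy to underestimate. A quick back-of-envelope calculation, assuming a five-minute interval and an illustrative fleet of 50 independently polled resources:

```python
# Back-of-envelope: API calls generated by a fixed polling schedule
interval_minutes = 5
resources = 50  # illustrative: 50 tenants or endpoints polled independently

calls_per_day_per_resource = (24 * 60) // interval_minutes
total_calls_per_day = calls_per_day_per_resource * resources

print(calls_per_day_per_resource)  # 288
print(total_calls_per_day)         # 14400
```

If most of those 14,400 daily calls return an empty `data` array, nearly all of that spend and quota is buying nothing.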