Shared Health Check State Across Pods
This feature enables coordination of health checks across multiple LiteLLM proxy pods to avoid duplicate health checks and reduce costs.
Overview
When running multiple LiteLLM proxy pods (e.g., in Kubernetes), each pod typically runs its own independent health checks on every model. This can result in:
- Duplicate health checks across pods
- Increased costs for expensive models (e.g., Gemini 2.5 Pro)
- Redundant monitoring/logging noise
- Inefficient resource usage
The shared health check state feature solves this by:
- Coordinating health checks across pods using Redis
- Caching results with configurable TTL
- Using distributed locks to ensure only one pod runs health checks at a time
- Allowing other pods to read cached results instead of running redundant checks
How It Works
1. Lock Acquisition
When a pod needs to run health checks:
- It attempts to acquire a Redis lock
- If successful, it runs the health checks
- If another pod already holds the lock, it waits briefly and then checks for cached results
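The lock step can be sketched as follows. This is a minimal illustration, not LiteLLM's implementation: `InMemoryLockStore` is a hypothetical stand-in that mimics the "set if absent, with expiry" behavior Redis provides, and `LOCK_KEY` and `try_run_health_checks` are assumed names.

```python
import time


class InMemoryLockStore:
    """Hypothetical stand-in for Redis: set a key only if absent, with a TTL."""

    def __init__(self):
        self._data = {}  # key -> (owner, expires_at)

    def set_if_absent(self, key, owner, ttl_seconds, now=None):
        now = time.time() if now is None else now
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # lock is held and has not expired yet
        self._data[key] = (owner, now + ttl_seconds)
        return True

    def release(self, key, owner):
        # Only the owner may release, so a slow pod cannot drop another pod's lock.
        current = self._data.get(key)
        if current is not None and current[0] == owner:
            del self._data[key]
            return True
        return False


LOCK_KEY = "health_check_lock"  # assumed key name, for illustration only


def try_run_health_checks(store, pod_id, run_checks, lock_ttl=60, now=None):
    """Run checks only if this pod wins the lock; otherwise return None."""
    if not store.set_if_absent(LOCK_KEY, pod_id, lock_ttl, now=now):
        return None  # another pod holds the lock; the caller should read the cache
    try:
        return run_checks()
    finally:
        store.release(LOCK_KEY, pod_id)
```

The TTL on the lock is what keeps a crashed pod from blocking the others: once the expiry passes, the next pod to ask acquires the lock.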
2. Result Caching
After running health checks:
- Results are cached in Redis with a configurable TTL
- Other pods can read these cached results
- Each cache entry includes a timestamp and the ID of the pod that ran the checks
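As a rough sketch of what such a cache entry might look like, the snippet below serializes results with a timestamp and pod ID and rejects entries older than the TTL. The field names (`checked_by`, `timestamp`, and so on) and both helper functions are assumptions for illustration; the actual stored format may differ.

```python
import json
import time


def build_cache_entry(healthy, unhealthy, pod_id, now=None):
    """Serialize results together with tracking metadata."""
    return json.dumps({
        "healthy_endpoints": healthy,
        "unhealthy_endpoints": unhealthy,
        "checked_by": pod_id,  # assumed field name
        "timestamp": time.time() if now is None else now,
    })


def read_cache_entry(raw, ttl_seconds=300, now=None):
    """Return (entry, age_seconds), or None if the entry is missing or expired."""
    if raw is None:
        return None
    entry = json.loads(raw)
    now = time.time() if now is None else now
    age = now - entry["timestamp"]
    if age > ttl_seconds:
        return None
    return entry, age
```

In a real deployment Redis would normally expire the key itself; the explicit age check here only makes the TTL rule visible.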
3. Fallback Behavior
If Redis is unavailable or the cache has expired:
- Pods fall back to running health checks locally
- The system continues to function normally
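The fallback rule amounts to "prefer the cache, otherwise check locally". A minimal sketch, assuming hypothetical callables `read_cache` and `run_local_checks`, and using the built-in `ConnectionError` as a stand-in for whatever exception a real Redis client raises:

```python
def get_health_results(read_cache, run_local_checks):
    """Return (results, source), preferring cached results when available."""
    try:
        cached = read_cache()
    except ConnectionError:
        cached = None  # Redis unavailable: treat it like a cache miss
    if cached is not None:
        return cached, "cache"
    # Cache missing, expired, or unreachable: run the checks on this pod.
    return run_local_checks(), "local"
```

Treating an unreachable Redis the same as a cache miss is what keeps a Redis outage from becoming a health check outage.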
Configuration
Enable Shared Health Check
Add to your proxy_config.yaml:
general_settings:
  # Enable background health checks (required)
  background_health_checks: true
  # Enable shared health check state across pods
  use_shared_health_check: true
  # Health check interval (seconds)
  health_check_interval: 300 # 5 minutes
# Redis configuration (required for shared health check)
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: your-redis-host
    port: 6379
    password: your-redis-password
Environment Variables
You can also configure using environment variables:
# Enable shared health check
export USE_SHARED_HEALTH_CHECK=true
# Health check TTL (seconds)
export DEFAULT_SHARED_HEALTH_CHECK_TTL=300
# Lock TTL (seconds)
export DEFAULT_SHARED_HEALTH_CHECK_LOCK_TTL=60
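To sanity-check these variables before deploying, something like the following could be used. It is only a sketch: `load_shared_health_check_settings` is a hypothetical helper, the fallback values of 300 and 60 are taken from the example above rather than from the proxy's source, and the proxy itself may parse the variables differently.

```python
import os


def load_shared_health_check_settings(env=None):
    """Read the three variables and reject non-positive TTLs."""
    env = os.environ if env is None else env
    enabled = env.get("USE_SHARED_HEALTH_CHECK", "false").strip().lower() == "true"
    cache_ttl = int(env.get("DEFAULT_SHARED_HEALTH_CHECK_TTL", "300"))
    lock_ttl = int(env.get("DEFAULT_SHARED_HEALTH_CHECK_LOCK_TTL", "60"))
    if cache_ttl <= 0 or lock_ttl <= 0:
        raise ValueError("TTL values must be positive integers (seconds)")
    return {"enabled": enabled, "cache_ttl": cache_ttl, "lock_ttl": lock_ttl}
```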
Requirements
- Redis: Required for shared state coordination
- Background Health Checks: Must be enabled (background_health_checks: true)
- Multiple Pods: Most beneficial with 2+ proxy instances
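The first two requirements can be checked mechanically against a parsed proxy_config.yaml. The helper below is a hypothetical sketch that operates on an already-loaded dictionary; it only inspects the YAML keys shown on this page and does not account for settings supplied through environment variables.

```python
def missing_requirements(config):
    """Return a list of human-readable problems; empty means the config looks complete."""
    problems = []
    general = config.get("general_settings") or {}
    litellm_settings = config.get("litellm_settings") or {}
    cache_params = litellm_settings.get("cache_params") or {}

    if general.get("background_health_checks") is not True:
        problems.append("background_health_checks must be true")
    if general.get("use_shared_health_check") is not True:
        problems.append("use_shared_health_check must be true")
    if litellm_settings.get("cache") is not True or cache_params.get("type") != "redis":
        problems.append("a Redis cache must be configured under litellm_settings")
    return problems
```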
API Endpoints
Check Shared Health Check Status
GET /health/shared-status
Returns information about the shared health check coordination:
{
  "shared_health_check_enabled": true,
  "status": {
    "pod_id": "pod_1703123456789",
    "redis_available": true,
    "lock_ttl": 60,
    "cache_ttl": 300,
    "lock_owner": "pod_1703123456788",
    "lock_in_progress": true,
    "cache_available": true,
    "cache_age_seconds": 45.2,
    "last_checked_by": "pod_1703123456788"
  }
}
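A response like this can be condensed into a one-line summary for dashboards or alerts. The function below is an illustrative sketch that uses only the fields shown in the example response; `summarize_shared_status` is a hypothetical name, and reading "time until expiry" as `cache_ttl` minus `cache_age_seconds` is an assumption.

```python
import json


def summarize_shared_status(body):
    """Turn a /health/shared-status JSON body into a short summary string."""
    data = json.loads(body)
    if not data.get("shared_health_check_enabled"):
        return "shared health check disabled"
    status = data.get("status", {})
    if not status.get("redis_available"):
        return "redis unavailable, pods run checks locally"
    parts = []
    if status.get("lock_in_progress"):
        parts.append(f"lock held by {status.get('lock_owner')}")
    if status.get("cache_available"):
        age = status.get("cache_age_seconds", 0)
        remaining = status.get("cache_ttl", 0) - age
        parts.append(f"cache age {age:.1f}s, about {remaining:.1f}s until expiry")
    else:
        parts.append("no cached results")
    return "; ".join(parts)
```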
Monitoring
Health Check Status
Monitor the shared health check status to ensure proper coordination:
curl -H "Authorization: Bearer your-api-key" \
  http://your-proxy-host/health/shared-status
Logs
Look for these log messages:
INFO: Initialized shared health check manager
INFO: Pod pod_123 acquired health check lock
INFO: Pod pod_123 released health check lock
INFO: Cached health check results for 5 healthy and 0 unhealthy endpoints
DEBUG: Using cached health check results
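When scanning collected logs, one useful check is whether every lock acquisition is followed by a release. The sketch below assumes the message wording shown above and a hypothetical helper name, `pods_holding_lock`; if the proxy's log format differs, the patterns need adjusting.

```python
import re

ACQUIRED = re.compile(r"Pod (\S+) acquired health check lock")
RELEASED = re.compile(r"Pod (\S+) released health check lock")


def pods_holding_lock(log_lines):
    """Return pod IDs that logged an acquire with no later release."""
    holding = []
    for line in log_lines:
        match = ACQUIRED.search(line)
        if match:
            holding.append(match.group(1))
            continue
        match = RELEASED.search(line)
        if match and match.group(1) in holding:
            holding.remove(match.group(1))
    return holding
```

A pod that appears in the result may simply still be running its checks; it only indicates a problem if it stays there well beyond the lock TTL.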
Troubleshooting
Common Issues
1. Shared Health Check Not Working
Symptoms: Each pod still runs independent health checks
Solutions:
- Verify Redis is configured and accessible
- Check that use_shared_health_check: true is set
- Ensure background_health_checks: true is enabled
- Check Redis connectivity in logs
2. Redis Connection Issues
Symptoms: Health checks fall back to local execution
Solutions:
- Verify Redis host, port, and credentials
- Check network connectivity between pods and Redis
- Monitor Redis server logs for errors
3. Lock Not Released
Symptoms: One pod holds the lock indefinitely
Solutions:
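- The lock expires automatically after its TTL (default 60 seconds), so wait at least that long before investigating further
- Check pod logs for lock release messages
- Verify that DEFAULT_SHARED_HEALTH_CHECK_LOCK_TTL is set to a reasonable value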
Debug Mode
Enable debug logging to see detailed coordination:
general_settings:
  set_verbose: true
Performance Impact
Benefits
- Reduced API calls: Only one pod runs health checks per interval
- Lower costs: Especially significant for expensive models
- Better resource utilization: Less redundant work across pods
- Cleaner monitoring: Reduced noise in logs and metrics
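To put a number on the reduction, the following back-of-the-envelope calculation compares independent and shared checks. It is an idealized estimate that assumes exactly one pod runs the checks per interval when sharing is enabled and ignores fallback runs; `daily_health_check_calls` is a hypothetical helper.

```python
def daily_health_check_calls(num_pods, num_models, interval_seconds, shared):
    """Estimate model calls per day spent on health checks."""
    checks_per_day = 86_400 // interval_seconds
    runners = 1 if shared else num_pods  # shared mode: one pod per interval
    return runners * num_models * checks_per_day


independent = daily_health_check_calls(3, 5, 300, shared=False)  # 4320
shared = daily_health_check_calls(3, 5, 300, shared=True)        # 1440
```

With 3 pods, 5 models, and a 300-second interval, the estimate drops from 4320 to 1440 calls per day, a reduction proportional to the number of pods.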
Overhead
- Redis operations: Minimal overhead for lock/cache operations
- Network latency: Small delay for Redis communication
- Memory usage: Negligible additional memory usage
Best Practices
1. Redis Configuration
- Use Redis with persistence enabled
- Configure appropriate memory limits
- Set up Redis monitoring and alerts
2. TTL Settings
- Set health_check_interval to your desired check frequency
- Use default TTL values unless you have specific requirements
- Consider model-specific timeouts for expensive models
3. Monitoring
- Monitor shared health check status endpoint
- Set up alerts for Redis connectivity issues
- Track health check costs and frequency
4. Scaling
- The feature works with any number of pods
- The more pods you run, the more duplicate health checks are avoided
- Consider a Redis cluster for high availability
Example Configuration
Complete Example
# proxy_config.yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      health_check_timeout: 30 # 30 second timeout for health checks
general_settings:
  # Enable background health checks
  background_health_checks: true
  # Enable shared health check coordination
  use_shared_health_check: true
  # Health check interval (5 minutes)
  health_check_interval: 300
  # Health check details
  health_check_details: true
litellm_settings:
  # Redis configuration
  cache: true
  cache_params:
    type: redis
    host: redis-cluster.example.com
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ssl: true
Kubernetes Example
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 3 # Multiple pods for coordination
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value below is illustrative
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm-proxy
          image: ghcr.io/berriai/litellm:latest
          env:
            - name: USE_SHARED_HEALTH_CHECK
              value: "true"
            - name: REDIS_HOST
              value: "redis-service"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: password
Migration
From Independent Health Checks
- Enable Redis: Ensure Redis is configured and accessible
- Enable Background Health Checks: Set background_health_checks: true
- Enable Shared Health Check: Set use_shared_health_check: true
- Deploy: Update your proxy configuration
- Monitor: Check the /health/shared-status endpoint
Rollback
To disable shared health check:
general_settings:
  use_shared_health_check: false
  # background_health_checks can remain true for independent checks