Building a Real-Time Infrastructure Monitoring System with Telegram Alerts
How I built an advanced monitoring system that sends real-time alerts to Telegram for 56+ containers running across my infrastructure
Managing a complex infrastructure with 56 containers running 36+ services across multiple domains requires robust monitoring. Here's how I built a comprehensive alerting system that keeps me informed about the health of my entire stack via Telegram.
The Challenge
When you're running critical services like:
- Media servers (Jellyfin, Plex alternatives)
- Development tools (VS Code Server, Jenkins CI/CD)
- Cloud storage (Nextcloud, file browsers)
- Monitoring dashboards (Homarr, metrics)
- Security infrastructure (Vault, authentication services)
You need to know immediately when something goes wrong. Email alerts get lost, SMS costs money, and checking dashboards manually isn't scalable.
The Solution: Telegram Bot Integration
I built an enhanced Telegram bot with the following capabilities.
Real-Time Alerts
- Service health monitoring - Continuous checks for all 56+ containers
- Docker container status - Tracks container restarts and failures
- Network connectivity - Monitors Tailscale VPN and proxy connectivity
- Resource usage - CPU, memory, and disk space warnings
Smart Notifications
# Example alert structure
{
    "service": "jellyfin.jay739.dev",
    "status": "down",
    "timestamp": "2024-10-22T14:30:00Z",
    "severity": "critical",
    "action_required": "Container restart needed"
}
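As a rough illustration of how an alert like the one above could become a Telegram message, here is a minimal sketch. It calls the Bot API's `sendMessage` method directly with the standard library to stay self-contained, whereas the real bot uses python-telegram-bot. The `format_alert` and `send_telegram_message` names, the message layout, and the severity-to-emoji mapping are all assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

# Emoji per severity level; this mapping is an illustrative assumption.
SEVERITY_ICONS = {"critical": "🔴", "warning": "⚠️", "info": "ℹ️"}


def format_alert(alert: dict) -> str:
    """Render an alert dict (same keys as the example above) as message text."""
    icon = SEVERITY_ICONS.get(alert.get("severity", ""), "❔")
    return (
        f"{icon} {alert['service']} is {alert['status'].upper()}\n"
        f"Severity: {alert['severity']}\n"
        f"Time: {alert['timestamp']}\n"
        f"Action: {alert['action_required']}"
    )


def send_telegram_message(bot_token: str, chat_id: str, text: str) -> dict:
    """POST the text to the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return json.load(resp)
```

Keeping the formatting separate from the network call makes the message layout easy to test without touching Telegram.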
Advanced Features
- Prometheus integration - Exports metrics for visualization
- Flask web dashboard - Browser-based status overview
- Network analytics - Traffic patterns and anomaly detection
- Backup verification - Ensures critical data is being backed up
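To give a feel for the Prometheus side, here is a small sketch that renders per-service status in the Prometheus text exposition format using only the standard library. The metric name `service_up` and the `render_service_up` helper are hypothetical; in practice a client library such as `prometheus_client` would typically handle this.

```python
def _escape_label(value: str) -> str:
    """Escape backslashes, quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_service_up(statuses: dict) -> str:
    """Render {service_name: is_up} as a gauge in Prometheus text format."""
    lines = [
        "# HELP service_up Whether the last health check succeeded (1) or failed (0).",
        "# TYPE service_up gauge",
    ]
    for service, is_up in sorted(statuses.items()):
        label = _escape_label(service)
        lines.append(f'service_up{{service="{label}"}} {int(is_up)}')
    return "\n".join(lines) + "\n"
```

Serving that string from a `/metrics` route is enough for Prometheus to scrape it.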
Architecture
The system runs as a containerized service:
services:
  batcave-webhook-listener:
    image: batcave-notifications_webhook-listener
    container_name: batcave-webhook-listener
    restart: unless-stopped
    environment:
      - TELEGRAM_BOT_TOKEN=${BOT_TOKEN}
      - CHAT_ID=${CHAT_ID}
      - MONITOR_INTERVAL=60
    ports:
      - "5000:5000"  # Flask dashboard
      - "9090:9090"  # Prometheus metrics
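On the Python side, those environment variables have to be read and validated at startup. The sketch below shows one way to do that. The `MonitorConfig` and `load_config` names are hypothetical, and it assumes `MONITOR_INTERVAL` is expressed in seconds, which the Compose file does not state.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorConfig:
    bot_token: str
    chat_id: str
    interval_seconds: int  # assumption: MONITOR_INTERVAL is in seconds


def load_config(env=None) -> MonitorConfig:
    """Read the settings passed in by the Compose file's environment block."""
    env = os.environ if env is None else env
    missing = [k for k in ("TELEGRAM_BOT_TOKEN", "CHAT_ID") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    interval = int(env.get("MONITOR_INTERVAL", "60"))
    if interval <= 0:
        raise ValueError("MONITOR_INTERVAL must be a positive number")
    return MonitorConfig(env["TELEGRAM_BOT_TOKEN"], env["CHAT_ID"], interval)
```

Failing fast on a missing token beats discovering it when the first alert silently fails to send.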
Monitoring Coverage
Infrastructure Layer
- Reverse Proxy - Nginx Proxy Manager health
- DNS Resolution - Availability of all 40+ subdomains
- SSL Certificates - Expiration warnings
- Tailscale Network - VPN connectivity status
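The SSL expiration check is a good example of how small these probes can be. Here is a minimal sketch using only the standard library. The function names and the 14-day warning window are assumptions, not the bot's actual values.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a certificate's notAfter string into days remaining."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field.

    With the default context, an already-expired certificate fails the
    handshake and raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def check_certificate(hostname: str, warn_days: int = 14) -> Optional[str]:
    """Return a warning message if the certificate expires within warn_days."""
    remaining = days_until_expiry(fetch_not_after(hostname))
    if remaining < warn_days:
        return f"⚠️ Certificate for {hostname} expires in {remaining:.0f} days"
    return None
```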
Application Layer
- Container Health Checks - Docker service status
- HTTP Endpoints - Response time monitoring
- Database Connections - PostgreSQL, Redis availability
- Storage Systems - Cloud storage accessibility
System Layer
- CPU Usage - Alerts at >80% sustained usage
- Memory Pressure - Warnings before OOM situations
- Disk Space - Critical alerts at >90% capacity
- Network I/O - Unusual traffic pattern detection
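The word "sustained" in the CPU rule matters, since a single spike shouldn't page anyone. One way to express it is a sliding window that only fires when every recent sample is over the limit. The sketch below uses a hypothetical `SustainedThreshold` class, and the window size is an assumption. Disk usage comes from the standard library's `shutil.disk_usage`.

```python
import shutil
from collections import deque


class SustainedThreshold:
    """Fire only when every sample in a full window exceeds the limit."""

    def __init__(self, limit: float, window: int):
        self.limit = limit
        self.samples = deque(maxlen=window)

    def update(self, value: float) -> bool:
        """Record a sample and report whether the breach is sustained."""
        self.samples.append(value)
        return (
            len(self.samples) == self.samples.maxlen
            and all(sample > self.limit for sample in self.samples)
        )


def disk_percent_used(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100
```

For example, `SustainedThreshold(80, 3)` stays quiet through a single dip below 80% and only fires after three consecutive samples above it.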
Deployment Strategy
Using Docker Compose for easy deployment:
cd /root/enhanced-telegram-bot
docker-compose -f docker-compose.fixed.yml up -d --build
The bot includes:
- Health checks - Self-monitoring and auto-recovery
- Persistent storage - Logs and metrics retention
- Network isolation - Secure communication channels
- Rolling updates - Zero-downtime deployments
Benefits Realized
Since implementing this system:
- 99.9% Uptime - Immediate notification of issues
- 5-minute MTTR - Mean time to recovery drastically reduced
- Proactive Maintenance - Trend analysis prevents failures
- Peace of Mind - 24/7 monitoring without manual checks
Security Considerations
- Encrypted communications - TLS for all external connections
- Token management - Secrets stored in Vault
- Rate limiting - Prevents alert spam and abuse
- Access control - Restricted to authorized chat IDs
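Rate limiting and access control can both be very small pieces of code. The following sketch shows a per-key cooldown and a chat ID allowlist. The class and function names, and the idea of keying the limiter by service name, are assumptions for illustration.

```python
import time


class AlertRateLimiter:
    """Allow at most one alert per key within a cooldown window."""

    def __init__(self, cooldown_seconds: float, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._last_sent = {}

    def allow(self, key: str) -> bool:
        """Return True and record the send time if the key is off cooldown."""
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[key] = now
        return True


def is_authorized(chat_id: int, allowed_chat_ids: frozenset) -> bool:
    """Only respond to commands from chats on the allowlist."""
    return chat_id in allowed_chat_ids
```

Suppressed alerts don't reset the cooldown, so a service that stays down still produces a reminder once per window.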
Lessons Learned
What Worked Well
- Telegram's API is incredibly reliable
- Docker makes deployment and updates trivial
- Prometheus metrics provide excellent visualization
- Python's async capabilities handle concurrent monitoring efficiently
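To illustrate the async point, here is a sketch of fanning out checks with `asyncio.gather` while capping concurrency with a semaphore. The `check_all` name and the default limit of 10 are assumptions. `check` stands in for any coroutine that takes a URL and returns a status dict.

```python
import asyncio


async def check_all(urls, check, max_concurrency: int = 10) -> dict:
    """Run `check(url)` for every URL concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await check(url)
            except Exception as exc:
                # One failing probe must not cancel the others.
                return url, {"status": "error", "detail": repr(exc)}

    pairs = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(pairs)
```

With 56+ containers, the cap keeps one monitoring cycle from opening dozens of connections at once.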
Challenges Faced
- Alert fatigue - Had to fine-tune thresholds to reduce noise
- Network complexity - Tailscale routing required special handling
- False positives - Implemented retry logic before alerting
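The retry logic for false positives can be sketched as follows: re-run the check a few times and only alert if every attempt fails. The `confirm_failure` name, the three attempts, and the two-second delay are assumptions. `check` is any coroutine returning a dict with a `status` key.

```python
import asyncio


async def confirm_failure(check, attempts: int = 3, delay: float = 2.0) -> bool:
    """Return True only if `check()` reports unhealthy on every attempt."""
    for attempt in range(attempts):
        result = await check()
        if result.get("status") == "healthy":
            return False  # recovered, so treat the earlier failure as a blip
        if attempt < attempts - 1:
            await asyncio.sleep(delay)  # give the service time to recover
    return True
```

The trade-off is a slightly later alert in exchange for far fewer spurious ones.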
Future Enhancements
Planning to add:
- AI-powered anomaly detection - Machine learning for pattern recognition
- Automated remediation - Self-healing for common issues
- Mobile app integration - Native push notifications
- Team collaboration - Multi-user support with role-based access
Tech Stack
- Python 3.11 - Core application logic
- python-telegram-bot - Telegram API wrapper
- Flask - Web dashboard
- Prometheus - Metrics collection and export
- Docker - Containerization
- Tailscale - Secure networking
Code Example: Service Health Check
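import asyncio
import time

import aiohttp


async def check_service_health(service_url):
    """Probe one service URL; alert on non-200 responses, timeouts, and connection errors."""
    start = time.monotonic()
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                service_url,
                timeout=aiohttp.ClientTimeout(total=5)
            ) as response:
                if response.status == 200:
                    # aiohttp responses have no `elapsed` attribute, so time the request here
                    latency = time.monotonic() - start
                    return {"status": "healthy", "latency": latency}
                else:
                    # send_alert is the bot's Telegram notification helper, defined elsewhere
                    await send_alert(
                        f"⚠️ Service {service_url} returned {response.status}"
                    )
                    return {"status": "degraded"}
    except asyncio.TimeoutError:
        await send_alert(f"🔴 Service {service_url} timed out")
        return {"status": "down"}
    except aiohttp.ClientError as exc:
        # Connection refused, DNS failure, and similar errors also mean the service is down
        await send_alert(f"🔴 Service {service_url} unreachable: {exc}")
        return {"status": "down"}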
Conclusion
Building a comprehensive monitoring system doesn't have to be expensive or complicated. With open-source tools and a bit of Python scripting, you can achieve enterprise-grade monitoring for your home lab or production infrastructure.
The key is to:
- Start simple and iterate
- Focus on actionable alerts, not noise
- Make it reliable - your monitoring system must be more reliable than what it monitors
- Document everything for future you
Questions or want to implement something similar? Feel free to reach out! I'm always happy to discuss infrastructure monitoring and DevOps best practices.