Building a Real-Time Infrastructure Monitoring System with Telegram Alerts

Managing a complex infrastructure with 56 containers running 36+ services across multiple domains requires robust monitoring. Here's how I built a comprehensive alerting system that keeps me informed about the health of my entire stack via Telegram.

The Challenge

My stack includes critical services such as:

Media servers (Jellyfin, Plex alternatives)
Development tools (VS Code Server, Jenkins CI/CD)
Cloud storage (Nextcloud, file browsers)
Monitoring dashboards (Homarr, metrics)
Security infrastructure (Vault, authentication services)

When any of these goes down, I need to know immediately. Email alerts get lost, SMS costs money, and checking dashboards manually doesn't scale.

The Solution: Telegram Bot Integration

I built an enhanced Telegram bot that provides the following.

Real-Time Alerts

Service health monitoring - Continuous checks for all 56 containers
Docker container status - Tracks container restarts and failures
Network connectivity - Monitors Tailscale VPN and proxy connectivity
Resource usage - CPU, memory, and disk space warnings
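
As a rough illustration of the container status check, here is a minimal sketch that assumes the bot shells out to the Docker CLI and parses `docker ps` output. The real bot may well use the Docker SDK instead, and the function names `parse_docker_ps`, `unhealthy_containers`, and `current_statuses` are hypothetical.

```python
import subprocess

# Ask the Docker CLI for one "name<TAB>status" line per container, stopped ones included.
DOCKER_PS_CMD = ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"]


def parse_docker_ps(output: str) -> dict:
    """Turn `docker ps` output into a {container_name: status_text} mapping."""
    statuses = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        statuses[name.strip()] = status.strip()
    return statuses


def unhealthy_containers(statuses: dict) -> list:
    """Return names that are not 'Up' or whose health check reports '(unhealthy)'."""
    return sorted(
        name
        for name, status in statuses.items()
        if not status.startswith("Up") or "(unhealthy)" in status
    )


def current_statuses() -> dict:
    """Run the Docker CLI; this needs access to the Docker socket."""
    result = subprocess.run(DOCKER_PS_CMD, capture_output=True, text=True, check=True)
    return parse_docker_ps(result.stdout)
```

Comparing the result of `unhealthy_containers(current_statuses())` between two polling rounds is one simple way to notice a container that has just stopped or started restarting.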

Smart Notifications

# Example alert structure
{
    "service": "jellyfin.jay739.dev",
    "status": "down",
    "timestamp": "2024-10-22T14:30:00Z",
    "severity": "critical",
    "action_required": "Container restart needed"
}
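
To show how an alert like this could become a Telegram message, here is a minimal sketch using only the standard library and the Bot API's `sendMessage` method. The bot itself uses python-telegram-bot, so treat this as an illustration and not the actual code; the helper names and the severity-to-icon mapping are assumptions.

```python
import json
import urllib.request

# Assumed mapping from severity to icon; only "critical" appears in the example alert.
SEVERITY_ICONS = {"critical": "🔴", "warning": "⚠️"}


def format_alert(alert: dict) -> str:
    """Render an alert dict as a short plain-text message."""
    icon = SEVERITY_ICONS.get(alert.get("severity", ""), "❔")
    return (
        f"{icon} {alert['service']} is {alert['status']}\n"
        f"Severity: {alert.get('severity', 'unknown')}\n"
        f"Time: {alert['timestamp']}\n"
        f"Action: {alert.get('action_required', 'none')}"
    )


def build_send_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a Bot API sendMessage request with a JSON body."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def send_alert_sync(token: str, chat_id: str, alert: dict) -> int:
    """Send the alert and return the HTTP status code of the API response."""
    request = build_send_request(token, chat_id, format_alert(alert))
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the formatting separate from the sending makes the message layout easy to test without touching the network.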

Advanced Features

Prometheus integration - Exports metrics for visualization
Flask web dashboard - Browser-based status overview
Network analytics - Traffic patterns and anomaly detection
Backup verification - Ensures critical data is being backed up
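
For the Prometheus side, the metrics endpoint ultimately serves plain text in the Prometheus exposition format. The sketch below renders that text by hand to show what a scrape would see; in practice a client library such as `prometheus_client` would normally do this. The metric name `service_up` and the function name are assumptions.

```python
def render_metrics(service_up: dict) -> str:
    """Render service availability as Prometheus text-format gauge samples."""
    lines = [
        "# HELP service_up Whether the last health check succeeded (1) or failed (0).",
        "# TYPE service_up gauge",
    ]
    for service in sorted(service_up):
        # Label values must escape backslashes, double quotes, and newlines.
        label = (
            service.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'service_up{{service="{label}"}} {int(service_up[service])}')
    return "\n".join(lines) + "\n"
```

A gauge fits here because the value goes up and down with each check, unlike a counter, which only increases.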

Architecture

The system runs as a containerized service:

services:
  batcave-webhook-listener:
    image: batcave-notifications_webhook-listener
    container_name: batcave-webhook-listener
    restart: unless-stopped
    environment:
      - TELEGRAM_BOT_TOKEN=${BOT_TOKEN}
      - CHAT_ID=${CHAT_ID}
      - MONITOR_INTERVAL=60
    ports:
      - "5000:5000"  # Flask dashboard
      - "9090:9090"  # Prometheus metrics

Monitoring Coverage

Infrastructure Layer

Reverse Proxy - Nginx Proxy Manager health
DNS Resolution - Availability of all 40+ subdomains
SSL Certificates - Expiration warnings
Tailscale Network - VPN connectivity status
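
The certificate expiration warning can be sketched with Python's `ssl` module alone. This is an illustrative version, not the bot's actual code: the function names and the 14-day warning window are assumptions.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a certificate 'notAfter' string into days remaining."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a verified TLS connection and return the certificate's 'notAfter' field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def certificate_warning(hostname: str, warn_days: int = 14) -> Optional[str]:
    """Return a warning message if the certificate expires within warn_days."""
    remaining = days_until_expiry(fetch_not_after(hostname))
    if remaining < warn_days:
        return f"⚠️ Certificate for {hostname} expires in {remaining:.0f} days"
    return None
```

Because the default context verifies the certificate, an already expired certificate makes the handshake raise `ssl.SSLCertVerificationError`, so a caller would want to treat that exception as an alert in its own right.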

Application Layer

Container Health Checks - Docker service status
HTTP Endpoints - Response time monitoring
Database Connections - PostgreSQL, Redis availability
Storage Systems - Cloud storage accessibility

System Layer

CPU Usage - Alerts at >80% sustained usage
Memory Pressure - Warnings before OOM situations
Disk Space - Critical alerts at >90% capacity
Network I/O - Unusual traffic pattern detection
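
The word "sustained" matters for the CPU rule: a single spike above 80% should not page anyone. One way to express that, sketched here with hypothetical names, is to alert only when every sample in a sliding window exceeds the limit. The disk helper uses `shutil.disk_usage` from the standard library.

```python
import shutil
from collections import deque


class SustainedThreshold:
    """Fire only when each of the last `window` samples exceeds `limit`."""

    def __init__(self, limit: float, window: int):
        self.limit = limit
        self.samples = deque(maxlen=window)

    def update(self, value: float) -> bool:
        """Record a sample and report whether the threshold is breached."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(sample > self.limit for sample in self.samples)


def disk_percent_used(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100
```

With a 60-second check interval and `window=3`, for example, a CPU alert would fire only after three consecutive readings above the limit. The window size is a tuning choice and is not taken from the original setup.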

Deployment Strategy

Using Docker Compose for easy deployment:

cd /root/enhanced-telegram-bot
docker-compose -f docker-compose.fixed.yml up -d --build

The bot includes:

Health checks - Self-monitoring and auto-recovery
Persistent storage - Logs and metrics retention
Network isolation - Secure communication channels
Rolling updates - Zero-downtime deployments

Benefits Realized

Since implementing this system:

99.9% Uptime - Immediate notification of issues
5-minute MTTR - Mean time to recovery drastically reduced
Proactive Maintenance - Trend analysis prevents failures
Peace of Mind - 24/7 monitoring without manual checks

Security Considerations

Encrypted communications - TLS for all external connections
Token management - Secrets stored in Vault
Rate limiting - Prevents alert spam and abuse
Access control - Restricted to authorized chat IDs
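
Rate limiting and access control are both small pieces of logic. Here is a minimal sketch, assuming a per-service cooldown and an allowlist of chat IDs; the `AlertGate` and `is_authorized` names and the cooldown value used in the example are illustrative.

```python
import time


class AlertGate:
    """Allow at most one alert per key within `cooldown` seconds."""

    def __init__(self, cooldown: float, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock  # injectable so the gate can be tested without waiting
        self.last_sent = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still cooling down, so suppress the duplicate
        self.last_sent[key] = now
        return True


def is_authorized(chat_id: int, allowed_chat_ids: set) -> bool:
    """Accept commands only from explicitly allowlisted chats."""
    return chat_id in allowed_chat_ids
```

Keying the gate by service name means a flapping service cannot flood the chat, while an unrelated outage still gets through immediately.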

Lessons Learned

What Worked Well

Telegram's API is incredibly reliable
Docker makes deployment and updates trivial
Prometheus metrics provide excellent visualization
Python's async capabilities handle concurrent monitoring efficiently

Challenges Faced

Alert fatigue - Had to fine-tune thresholds to reduce noise
Network complexity - Tailscale routing required special handling
False positives - Implemented retry logic before alerting
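
The retry logic that cut down false positives can be sketched as a small wrapper that only confirms a failure once several consecutive checks have failed. The name `confirm_failure` and the default attempt count and delay are assumptions, not the values used in the bot.

```python
import asyncio


async def confirm_failure(check, attempts: int = 3, delay: float = 5.0) -> bool:
    """Return True only if `check()` reports unhealthy on every attempt.

    `check` is any zero-argument coroutine function returning True when healthy.
    """
    for attempt in range(attempts):
        if await check():
            return False  # the service recovered, so no alert is needed
        if attempt < attempts - 1:
            await asyncio.sleep(delay)  # wait before re-checking
    return True
```

The trade-off is latency: with three attempts five seconds apart, a real outage is reported about ten seconds later than it would be without retries.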

Future Enhancements

I'm planning to add:

AI-powered anomaly detection - Machine learning for pattern recognition
Automated remediation - Self-healing for common issues
Mobile app integration - Native push notifications
Team collaboration - Multi-user support with role-based access

Tech Stack

Python 3.11 - Core application logic
python-telegram-bot - Telegram API wrapper
Flask - Web dashboard
Prometheus - Metrics collection and export
Docker - Containerization
Tailscale - Secure networking

Code Example: Service Health Check

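import asyncio
import time

import aiohttp

# send_alert() is the bot's Telegram notification coroutine, defined elsewhere.


async def check_service_health(service_url):
    """Probe one URL and alert on non-200 responses, timeouts, or connection errors."""
    start = time.monotonic()
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                service_url,
                timeout=aiohttp.ClientTimeout(total=5)
            ) as response:
                # aiohttp responses have no .elapsed attribute, so time the request manually
                latency = time.monotonic() - start
                if response.status == 200:
                    return {"status": "healthy", "latency": latency}
                await send_alert(
                    f"⚠️ Service {service_url} returned {response.status}"
                )
                return {"status": "degraded"}
    except asyncio.TimeoutError:
        await send_alert(f"🔴 Service {service_url} timeout")
        return {"status": "down"}
    except aiohttp.ClientError as exc:
        # Connection refused, DNS failure, and similar transport errors
        await send_alert(f"🔴 Service {service_url} unreachable: {exc}")
        return {"status": "down"}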
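
A single check like this becomes useful once it runs across every service at the same time. The following sketch shows one way to do that with `asyncio.gather`; the `check_all` name and the concurrency cap are assumptions, and the check function is passed in so the runner does not depend on any particular implementation.

```python
import asyncio


async def check_all(urls, check, max_concurrent: int = 10) -> dict:
    """Run `check(url)` for every URL concurrently, capped by a semaphore."""
    urls = list(urls)
    limit = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with limit:
            return await check(url)

    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
    # An unexpected exception in one check is recorded as "error"
    # instead of aborting the whole round.
    return {
        url: {"status": "error", "detail": repr(result)}
        if isinstance(result, Exception)
        else result
        for url, result in zip(urls, results)
    }
```

A monitoring round would then look like `asyncio.run(check_all(service_urls, check_service_health))`, where `service_urls` is a list of the endpoints to probe.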

Conclusion

Building a comprehensive monitoring system doesn't have to be expensive or complicated. With open-source tools and a bit of Python scripting, you can achieve enterprise-grade monitoring for your home lab or production infrastructure.

The key is to:

Start simple and iterate
Focus on actionable alerts, not noise
Make it reliable - your monitoring system must be more reliable than what it monitors
Document everything for future you

Questions or want to implement something similar? Feel free to reach out! I'm always happy to discuss infrastructure monitoring and DevOps best practices.
