October 22, 2024

Building a Real-Time Infrastructure Monitoring System with Telegram Alerts

How I built an advanced monitoring system that sends real-time alerts to Telegram for 56+ containers running across my infrastructure

#DevOps #Monitoring #Telegram #Docker #Infrastructure #Python

Managing a complex infrastructure with 56 containers running 36+ services across multiple domains requires robust monitoring. Here's how I built a comprehensive alerting system that keeps me informed about the health of my entire stack via Telegram.

The Challenge

My stack includes critical services such as:

  • Media servers (Jellyfin, Plex alternatives)
  • Development tools (VS Code Server, Jenkins CI/CD)
  • Cloud storage (Nextcloud, file browsers)
  • Monitoring dashboards (Homarr, metrics)
  • Security infrastructure (Vault, authentication services)

You need to know immediately when something goes wrong. Email alerts get lost, SMS costs money, and checking dashboards manually isn't scalable.

The Solution: Telegram Bot Integration

I built a Telegram bot that provides the following.

Real-Time Alerts

  • Service health monitoring - Continuous checks for all 56+ containers
  • Docker container status - Tracks container restarts and failures
  • Network connectivity - Monitors Tailscale VPN and proxy connectivity
  • Resource usage - CPU, memory, and disk space warnings

Smart Notifications

# Example alert structure
{
    "service": "jellyfin.jay739.dev",
    "status": "down",
    "timestamp": "2024-10-22T14:30:00Z",
    "severity": "critical",
    "action_required": "Container restart needed"
}
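
As a rough illustration of how an alert shaped like the one above could be turned into a readable Telegram message, here is a minimal sketch. The `format_alert` helper and the `SEVERITY_ICONS` mapping are hypothetical names introduced for this example; the post does not show the bot's actual formatting code.

```python
from datetime import datetime

# Hypothetical severity-to-icon mapping; the real bot may use different symbols.
SEVERITY_ICONS = {"critical": "🔴", "warning": "⚠️", "info": "ℹ️"}


def format_alert(alert: dict) -> str:
    """Render an alert dict (same keys as the example above) as plain text."""
    icon = SEVERITY_ICONS.get(alert["severity"], "❔")
    # Replace the trailing "Z" so fromisoformat() also works before Python 3.11.
    ts = datetime.fromisoformat(alert["timestamp"].replace("Z", "+00:00"))
    return "\n".join([
        f"{icon} {alert['service']} is {alert['status'].upper()}",
        f"Severity: {alert['severity']}",
        f"Time: {ts:%Y-%m-%d %H:%M} UTC",
        f"Action: {alert['action_required']}",
    ])


example = {
    "service": "jellyfin.jay739.dev",
    "status": "down",
    "timestamp": "2024-10-22T14:30:00Z",
    "severity": "critical",
    "action_required": "Container restart needed",
}
message = format_alert(example)
```

Putting the service name and status on the first line means the most important information is visible in the phone's notification preview.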

Advanced Features

  • Prometheus integration - Exports metrics for visualization
  • Flask web dashboard - Browser-based status overview
  • Network analytics - Traffic patterns and anomaly detection
  • Backup verification - Ensures critical data is being backed up

Architecture

The system runs as a containerized service:

services:
  batcave-webhook-listener:
    image: batcave-notifications_webhook-listener
    container_name: batcave-webhook-listener
    restart: unless-stopped
    environment:
      - TELEGRAM_BOT_TOKEN=${BOT_TOKEN}
      - CHAT_ID=${CHAT_ID}
      - MONITOR_INTERVAL=60
    ports:
      - "5000:5000"  # Flask dashboard
      - "9090:9090"  # Prometheus metrics
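
Inside the container, the service has to read those environment variables at startup. The sketch below shows one way that might look; the `BotConfig` dataclass and `load_config` function are assumptions for illustration, and only the variable names (`TELEGRAM_BOT_TOKEN`, `CHAT_ID`, `MONITOR_INTERVAL`) come from the compose file above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BotConfig:
    token: str
    chat_id: str
    monitor_interval: int  # seconds between monitoring passes


def load_config(env=os.environ) -> BotConfig:
    """Read settings from the environment and fail fast if a secret is missing."""
    missing = [key for key in ("TELEGRAM_BOT_TOKEN", "CHAT_ID") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return BotConfig(
        token=env["TELEGRAM_BOT_TOKEN"],
        chat_id=env["CHAT_ID"],
        # Fall back to 60 seconds, matching the value in the compose file.
        monitor_interval=int(env.get("MONITOR_INTERVAL", "60")),
    )
```

Failing at startup when a token is missing is usually preferable to discovering it the first time an alert needs to go out.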

Monitoring Coverage

Infrastructure Layer

  • Reverse Proxy - Nginx Proxy Manager health
  • DNS Resolution - Availability of all 40+ subdomains
  • SSL Certificates - Expiration warnings
  • Tailscale Network - VPN connectivity status
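
For the SSL certificate item, a check along these lines is one possibility, using only the standard library. The function names and the 14-day warning window are assumptions for this sketch, not details taken from the actual bot.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 14  # hypothetical warning window


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Convert a notAfter string such as 'Jan  5 09:34:43 2018 GMT' to days remaining."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal_warning(not_after: str, now: datetime) -> bool:
    return days_until_expiry(not_after, now) < WARN_DAYS
```

In use, the result of `fetch_not_after("example.com")` would be passed to `needs_renewal_warning` together with `datetime.now(timezone.utc)`. Keeping the date arithmetic separate from the network call makes the logic easy to test offline.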

Application Layer

  • Container Health Checks - Docker service status
  • HTTP Endpoints - Response time monitoring
  • Database Connections - PostgreSQL, Redis availability
  • Storage Systems - Cloud storage accessibility

System Layer

  • CPU Usage - Alerts at >80% sustained usage
  • Memory Pressure - Warnings before OOM situations
  • Disk Space - Critical alerts at >90% capacity
  • Network I/O - Unusual traffic pattern detection
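
The "sustained" qualifier on the CPU threshold matters: a single spike should not page anyone. One way to express that, sketched here with hypothetical names and a sample window chosen for illustration, is to alert only when every sample in a sliding window exceeds the limit.

```python
from collections import deque

CPU_THRESHOLD = 80.0   # percent, from the list above
DISK_THRESHOLD = 90.0  # percent, from the list above


class SustainedThreshold:
    """Report True only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def update(self, value: float) -> bool:
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


def disk_is_critical(used_bytes: int, total_bytes: int) -> bool:
    """True when usage is strictly above DISK_THRESHOLD percent."""
    # Cross-multiply instead of dividing to avoid float rounding at the boundary.
    return used_bytes * 100 > DISK_THRESHOLD * total_bytes


cpu = SustainedThreshold(CPU_THRESHOLD, window=3)
readings = [95, 85, 70, 90, 91, 92]
fired = [cpu.update(r) for r in readings]
```

With a 60-second monitoring interval, a window of three samples would correspond to roughly three minutes of sustained load. The actual values for real CPU and disk readings could come from a library such as psutil or from `shutil.disk_usage`.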

Deployment Strategy

Using Docker Compose for easy deployment:

cd /root/enhanced-telegram-bot
docker-compose -f docker-compose.fixed.yml up -d --build

The bot includes:

  • Health checks - Self-monitoring and auto-recovery
  • Persistent storage - Logs and metrics retention
  • Network isolation - Secure communication channels
  • Rolling updates - Zero-downtime deployments

Benefits Realized

Since implementing this system:

  1. 99.9% Uptime - Immediate notification of issues
  2. 5-minute MTTR - Mean time to recovery drastically reduced
  3. Proactive Maintenance - Trend analysis prevents failures
  4. Peace of Mind - 24/7 monitoring without manual checks

Security Considerations

  • Encrypted communications - TLS for all external connections
  • Token management - Secrets stored in Vault
  • Rate limiting - Prevents alert spam and abuse
  • Access control - Restricted to authorized chat IDs
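
The rate-limiting and access-control points can be sketched in a few lines. Everything named here (`AUTHORIZED_CHAT_IDS`, `AlertRateLimiter`, the 300-second cooldown) is a hypothetical illustration rather than the bot's real code.

```python
import time

# Placeholder ID for illustration; the real list would come from configuration.
AUTHORIZED_CHAT_IDS = {123456789}


def is_authorized(chat_id: int) -> bool:
    """Ignore commands from chats that are not explicitly allowed."""
    return chat_id in AUTHORIZED_CHAT_IDS


class AlertRateLimiter:
    """Allow at most one alert per key within `cooldown` seconds."""

    def __init__(self, cooldown: float, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self._last_sent = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # same alert was sent recently; suppress it
        self._last_sent[key] = now
        return True


limiter = AlertRateLimiter(cooldown=300)
```

Keying the limiter on something like `"jellyfin:down"` suppresses repeats of the same problem while still letting a different failure through immediately.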

Lessons Learned

What Worked Well

  • Telegram's API is incredibly reliable
  • Docker makes deployment and updates trivial
  • Prometheus metrics provide excellent visualization
  • Python's async capabilities handle concurrent monitoring efficiently
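
To illustrate the last point, here is a minimal sketch of running many checks concurrently with `asyncio.gather`. The `run_checks` function and the convention that each check is a no-argument coroutine returning `True` when healthy are assumptions made for this example.

```python
import asyncio


async def run_checks(checks: dict) -> dict:
    """Run all checks concurrently and map each name to 'healthy' or 'down'."""
    names = list(checks)
    # return_exceptions=True keeps one failing check from cancelling the rest.
    results = await asyncio.gather(
        *(checks[name]() for name in names), return_exceptions=True
    )
    return {
        name: "down" if isinstance(result, Exception) or not result else "healthy"
        for name, result in zip(names, results)
    }
```

Because the checks spend most of their time waiting on the network, running them concurrently means a full pass takes about as long as the slowest single check rather than the sum of all of them.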

Challenges Faced

  • Alert fatigue - Had to fine-tune thresholds to reduce noise
  • Network complexity - Tailscale routing required special handling
  • False positives - Implemented retry logic before alerting

Future Enhancements

Planning to add:

  • AI-powered anomaly detection - Machine learning for pattern recognition
  • Automated remediation - Self-healing for common issues
  • Mobile app integration - Native push notifications
  • Team collaboration - Multi-user support with role-based access

Tech Stack

  • Python 3.11 - Core application logic
  • python-telegram-bot - Telegram API wrapper
  • Flask - Web dashboard
  • Prometheus - Metrics collection and export
  • Docker - Containerization
  • Tailscale - Secure networking

Code Example: Service Health Check

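import asyncio
import time

import aiohttp


async def check_service_health(service_url):
    """Probe a service once and alert if it is unhealthy or unreachable."""
    # send_alert() is defined elsewhere in the bot
    start = time.monotonic()
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                service_url,
                timeout=aiohttp.ClientTimeout(total=5)
            ) as response:
                # aiohttp responses have no `elapsed` attribute, so time the request manually
                latency = time.monotonic() - start
                if response.status == 200:
                    return {"status": "healthy", "latency": latency}
                else:
                    await send_alert(
                        f"⚠️ Service {service_url} returned {response.status}"
                    )
                    return {"status": "degraded"}
    except asyncio.TimeoutError:
        await send_alert(f"🔴 Service {service_url} timeout")
        return {"status": "down"}
    except aiohttp.ClientError as exc:
        # Connection refused, DNS failure, and similar errors also mean the service is down
        await send_alert(f"🔴 Service {service_url} unreachable: {exc}")
        return {"status": "down"}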
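
The health check calls `send_alert`, which is not shown in this post. The bot itself uses python-telegram-bot; as a dependency-free stand-in, the sketch below posts directly to the Telegram Bot API's `sendMessage` method with the standard library. The `build_send_message_request` helper is a hypothetical name, and reading the token and chat ID from the environment is an assumption based on the compose file shown earlier.

```python
import asyncio
import json
import os
import urllib.request


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for the Bot API sendMessage method."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


async def send_alert(text: str) -> None:
    """Send `text` to the configured chat without blocking the event loop."""
    request = build_send_message_request(
        os.environ["TELEGRAM_BOT_TOKEN"], os.environ["CHAT_ID"], text
    )

    def _post() -> None:
        with urllib.request.urlopen(request, timeout=10) as response:
            response.read()

    # urlopen is blocking, so run it in a worker thread (Python 3.9+).
    await asyncio.to_thread(_post)
```

Separating request construction from sending keeps the network call in one place and lets the payload be inspected in tests without contacting Telegram.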

Conclusion

Building a comprehensive monitoring system doesn't have to be expensive or complicated. With open-source tools and a bit of Python scripting, you can achieve enterprise-grade monitoring for your home lab or production infrastructure.

The key is to:

  1. Start simple and iterate
  2. Focus on actionable alerts, not noise
  3. Make it reliable - your monitoring system must be more reliable than what it monitors
  4. Document everything for future you

Questions or want to implement something similar? Feel free to reach out! I'm always happy to discuss infrastructure monitoring and DevOps best practices.
