
TSM Backup Monitoring - Technical Documentation

Table of Contents

  1. Architecture Details
  2. API Reference
  3. Advanced Configuration
  4. Development Guide
  5. Performance Optimization
  6. Security
  7. Integration
  8. Best Practices

1. Architecture Details

1.1 Component Overview

Agent Plugin (tsm_backups)

Location: /usr/lib/check_mk_agent/plugins/tsm_backups

Tasks:

  • Read the CSV files from /mnt/CMK_TSM
  • Normalize node names (strip RRZ*/NFRZ* prefixes)
  • Aggregate backup data per node
  • Generate the JSON output for the CheckMK agent

Execution:

  • Runs on every agent invocation
  • Default interval: 60 seconds (CheckMK default)
  • Can be configured asynchronously (see Async Agent Plugin)

Output format:

{
  "SERVER_MSSQL": {
    "statuses": ["Completed", "Completed", "Failed"],
    "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
    "last": 1736693420,
    "count": 3
  },
  "DATABASE_HANA": {
    "statuses": ["Completed"],
    "schedules": ["00-00-00_FULL"],
    "last": 1736690000,
    "count": 1
  }
}
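
For orientation, here is a minimal sketch of how the plugin's entry point can emit this section. It is illustrative only: the TSMParser class and its methods are the ones described in this document, and the *.CSV glob pattern is an assumption.

#!/usr/bin/env python3
# Sketch: emit the tsm_backups agent section.
# sep(0) means CheckMK passes each line through unsplit, so the whole
# JSON document can be printed on a single line.
import json
from pathlib import Path

CSV_DIR = Path("/mnt/CMK_TSM")


def main():
    parser = TSMParser()  # parser class described in this document
    for csv_file in sorted(CSV_DIR.glob("*.CSV")):
        parser.parse_csv(csv_file)
    print("<<<tsm_backups:sep(0)>>>")
    print(json.dumps(parser.aggregate()))


if __name__ == "__main__":
    main()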

Check Plugin (tsm_backups.py)

Location: /omd/sites/<site>/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py

Tasks:

  • Parse the JSON delivered by the agent
  • Discover services with labels
  • Evaluate the backup status
  • Generate metrics
  • Check thresholds

CheckMK API version: v2 (cmk.agent_based.v2)

1.2 Data Flow Diagram

┌────────────────────────────────────────────────────────────────────
│ TSM Server
│
│  SELECT
│    DATE_TIME, ENTITY, NODE_NAME, SCHEDULE_NAME, RESULT
│  FROM ACTLOG
│  WHERE TIMESTAMP > CURRENT_TIMESTAMP - 24 HOURS
│
│  ↓ Export as CSV
│  /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
└────────────────────────────────────────────────────────────────────
                            │
                            │ NFS/SCP/Rsync
                            ▼
┌────────────────────────────────────────────────────────────────────
│ Host: /mnt/CMK_TSM/
│  ├── TSM_BACKUP_SCHED_24H.CSV
│  ├── TSM_DB_SCHED_24H.CSV
│  └── TSM_FILE_SCHED_24H.CSV
└────────────────────────────────────────────────────────────────────
                            │
                            ▼
┌────────────────────────────────────────────────────────────────────
│ Agent plugin: /usr/lib/check_mk_agent/plugins/tsm_backups
│
│  1. List the CSV files in /mnt/CMK_TSM
│  2. Parse each line:
│     - Extract: timestamp, node, schedule, status
│     - Validate the node (length, MAINTENANCE)
│     - Normalize the node name:
│       RRZ01_SERVER_MSSQL → _SERVER_MSSQL → SERVER_MSSQL
│  3. Aggregate per node:
│     - Collect all statuses
│     - Collect all schedules
│     - Find the most recent timestamp
│     - Count the jobs
│  4. Generate the JSON output
│
│  Output: <<<tsm_backups:sep(0)>>>
│          {"SERVER_MSSQL": {...}, ...}
└────────────────────────────────────────────────────────────────────
                            │
                            │ CheckMK agent protocol
                            ▼
┌────────────────────────────────────────────────────────────────────
│ CheckMK server: agent section parser
│
│  parse_tsm_backups(string_table):
│    - Extract the JSON string from string_table[0][0]
│    - Parse JSON → Python dict
│    - Return: {node: data, ...}
└────────────────────────────────────────────────────────────────────
                            │
                            ▼
┌────────────────────────────────────────────────────────────────────
│ Service discovery
│
│  discover_tsm_backups(section):
│    FOR EACH node IN section:
│      1. Extract metadata:
│         - backup_type = extract_backup_type(node)
│         - backup_level = extract_backup_level(schedules)
│         - frequency = extract_frequency(schedules)
│         - error_handling = get_error_handling(backup_type)
│         - category = get_backup_category(backup_type)
│      2. Create a service with labels:
│         Service(
│           item=node,
│           labels=[
│             ServiceLabel("backup_type", backup_type),
│             ServiceLabel("backup_category", category),
│             ...
│           ]
│         )
└────────────────────────────────────────────────────────────────────
                            │
                            ▼
┌────────────────────────────────────────────────────────────────────
│ Service check execution
│
│  check_tsm_backups(item, section):
│    1. Load the node data from section[item]
│    2. Extract metadata (as during discovery)
│    3. Calculate the state:
│       - calculate_state() → (State, status_text)
│    4. Calculate the backup age:
│       - age = now - last_timestamp
│    5. Fetch the thresholds:
│       - thresholds = get_thresholds(type, level)
│    6. Compare the age against the thresholds
│    7. Generate the output:
│       - Result(state, summary)
│       - Metric("backup_age", age, levels)
│       - Metric("backup_jobs", count)
└────────────────────────────────────────────────────────────────────
                            │
                            ▼
┌────────────────────────────────────────────────────────────────────
│ CheckMK service
│
│  Name: TSM Backup SERVER_MSSQL
│  State: OK
│  Summary: Type=MSSQL (database), Level=FULL, Freq=daily,
│           Status=Completed, Last=3h 15m, Jobs=3
│  Metrics:
│    - backup_age: 11700s (warn: 93600s, crit: 172800s)
│    - backup_jobs: 3
│  Labels:
│    - backup_type: mssql
│    - backup_category: database
│    - frequency: daily
│    - backup_level: full
│    - error_handling: strict
└────────────────────────────────────────────────────────────────────

1.3 Node Normalization in Detail

Purpose: TSM environments with redundant servers (e.g. RRZ01, RRZ02, NFRZ01) should be monitored as a single logical node.

Algorithm (the regex substitutions below follow the documented steps; the final strip of leftover separator underscores is an assumption so that suffix cases such as SERVER_FILE_RRZ01 match the examples that follow):

import re


def normalize_node_name(node):
    """
    Input: "RRZ01_MYSERVER_MSSQL"

    Step 1: Remove the RRZ*/NFRZ*/RZ* prefix, keeping its underscore
            Pattern: r'(RRZ|NFRZ|RZ)\d+(_)'
            Result: "_MYSERVER_MSSQL"

    Step 2: Remove the leading underscore
            Result: "MYSERVER_MSSQL"

    Step 3: Remove a trailing RRZ*/NFRZ*/RZ* suffix
            Pattern: r'(RRZ|NFRZ|RZ)\d+$'
            Result: "MYSERVER_MSSQL"

    Output: "MYSERVER_MSSQL"
    """
    node = re.sub(r'(RRZ|NFRZ|RZ)\d+(_)', r'\2', node)  # step 1
    node = node.lstrip('_')                              # step 2
    node = re.sub(r'(RRZ|NFRZ|RZ)\d+$', '', node)        # step 3
    return node.rstrip('_')                              # drop a leftover separator

Examples:

| Original node        | Normalized      | Note      |
|----------------------|-----------------|-----------|
| RRZ01_SERVER_MSSQL   | SERVER_MSSQL    |           |
| RRZ02_SERVER_MSSQL   | SERVER_MSSQL    | Merged    |
| NFRZ01_DATABASE_HANA | DATABASE_HANA   |           |
| SERVER_FILE_RRZ01    | SERVER_FILE     |           |
| MYSERVER_ORACLE      | MYSERVER_ORACLE | Unchanged |

2. API Reference

2.1 Agent Plugin Functions

TSMParser.normalize_node_name(node: str) -> str

Normalizes TSM node names for the redundancy logic.

Parameters:

  • node (str): original TSM node name

Returns:

  • str: normalized node name

Example:

parser = TSMParser()
normalized = parser.normalize_node_name("RRZ01_SERVER_MSSQL")
# normalized == "SERVER_MSSQL"

TSMParser.is_valid_node(node: str, status: str) -> bool

Checks whether a node is valid for monitoring.

Parameters:

  • node (str): node name
  • status (str): backup status

Returns:

  • bool: True if valid, False otherwise

Validation rules:

  • The node must exist (not empty)
  • The node must be at least 3 characters long
  • The status must exist
  • The node must not contain "MAINTENANCE"

Example:

parser.is_valid_node("AB", "Completed")  # False (too short)
parser.is_valid_node("SERVER_MSSQL", "Completed")  # True
parser.is_valid_node("SERVER_MAINTENANCE", "Completed")  # False

TSMParser.parse_csv(csv_file: Path) -> None

Parses a TSM CSV file and collects the backup information.

Parameters:

  • csv_file (Path): path to the CSV file

CSV format:

TIMESTAMP,FIELD,NODE_NAME,SCHEDULE_NAME,STATUS
2026-01-12 08:00:00,SOMETHING,SERVER_MSSQL,DAILY_FULL,Completed

Side effects:

  • Appends the parsed backups to self.backups
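
A sketch of what such a parse method can look like, based on the column order and date format shown above (illustrative; the real implementation may differ):

# Sketch of TSMParser.parse_csv(); columns and date format as documented above.
import csv
from datetime import datetime
from pathlib import Path


def parse_csv(self, csv_file: Path) -> None:
    with open(csv_file, "r", encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            if len(row) < 5:
                continue  # skip malformed lines
            time_str, _field, node, schedule, status = (c.strip() for c in row[:5])
            if not self.is_valid_node(node, status):
                continue
            try:
                ts = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
            except ValueError:
                continue  # unparsable timestamp
            self.backups.append({
                "node": self.normalize_node_name(node),
                "schedule": schedule,
                "status": status,
                "time": int(ts.timestamp()),
            })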

TSMParser.aggregate() -> dict

Aggregates the backup data per normalized node.

Returns:

{
    "SERVER_MSSQL": {
        "statuses": ["Completed", "Completed", "Failed"],
        "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
        "last": 1736693420,  # Unix timestamp
        "count": 3
    }
}

2.2 Check Plugin Functions

extract_backup_type(node: str) -> str

Extracts the backup type from the node name based on a list of known types.

Parameters:

  • node (str): normalized node name

Returns:

  • str: backup type in lowercase, or "unknown"

Known types:

  • Databases: MSSQL, HANA, Oracle, DB2, MySQL
  • Virtualization: Virtual
  • File systems: FILE, SCALE, DM, Datacenter
  • Applications: Mail

Algorithm:

  1. Split the node name at underscores
  2. Take the last segment
  3. If the last segment is numeric → take the second-to-last segment
  4. Check against the list of known types
  5. Return lowercase, or "unknown"

Examples:

extract_backup_type("SERVER_MSSQL")           # → "mssql"
extract_backup_type("DATABASE_HANA_01")       # → "hana"
extract_backup_type("FILESERVER_FILE")        # → "file"
extract_backup_type("VM_HYPERV_123")          # → "hyperv"
extract_backup_type("APP_UNKNOWN")            # → "unknown"

extract_backup_level(schedules: list[str]) -> str

Extracts the backup level from the schedule names.

Parameters:

  • schedules (list[str]): list of schedule names

Returns:

  • str: "log", "full", "incremental", "differential"

Priority: log > full > differential > incremental

Detection patterns:

  • _LOG or LOG → log
  • _FULL or FULL → full
  • _INCR or INCREMENTAL → incremental
  • _DIFF or DIFFERENTIAL → differential

Examples:

extract_backup_level(["DAILY_FULL"])                    # → "full"
extract_backup_level(["HOURLY_LOG", "DAILY_FULL"])     # → "log"
extract_backup_level(["00-00-00_FULL"])                 # → "full"

extract_frequency(schedules: list[str]) -> str

Extracts the backup frequency from the schedule names.

Parameters:

  • schedules (list[str]): list of schedule names

Returns:

  • str: "hourly", "daily", "weekly", "monthly", "unknown"

Priority: hourly > daily > weekly > monthly

Detection patterns:

  • HOURLY → hourly
  • DAILY → daily
  • WEEKLY → weekly
  • MONTHLY → monthly
  • HH-MM-SS_*LOG → hourly (time-based schedule with LOG)
  • 00-00-00_* → daily (midnight)

Examples:

extract_frequency(["DAILY_FULL"])               # → "daily"
extract_frequency(["00-00-00_FULL"])            # → "daily"
extract_frequency(["08-00-00_LOG"])             # → "hourly"
extract_frequency(["WEEKLY_FULL", "DAILY_DIFF"]) # → "daily"

get_error_handling(backup_type: str) -> str

Determines the error-handling strategy based on the backup type.

Parameters:

  • backup_type (str): backup type

Returns:

  • str: "tolerant" or "strict"

Logic:

if backup_type in TOLERANT_TYPES:
    return "tolerant"  # Failed → WARN
else:
    return "strict"    # Failed → CRIT

Tolerant types:

  • file, virtual, scale, dm, datacenter
  • vmware, hyperv, mail, exchange

Strict:

  • all database types (mssql, hana, oracle, db2, ...)
  • all other types

get_backup_category(backup_type: str) -> str

Maps a backup type to a top-level category.

Parameters:

  • backup_type (str): backup type

Returns:

  • str: "database", "virtualization", "filesystem", "application", "other"

Categories:

| Category       | Types                                                               |
|----------------|---------------------------------------------------------------------|
| database       | mssql, hana, oracle, db2, mysql, postgres, mariadb, sybase, mongodb |
| virtualization | virtual, vmware, hyperv, kvm, xen                                   |
| filesystem     | file, scale, dm, datacenter                                         |
| application    | mail, exchange                                                      |
| other          | everything else                                                     |
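
The table translates directly into a set lookup, for example:

# Category lookup built from the table above.
DATABASE_TYPES = {"mssql", "hana", "oracle", "db2", "mysql",
                  "postgres", "mariadb", "sybase", "mongodb"}
VIRTUALIZATION_TYPES = {"virtual", "vmware", "hyperv", "kvm", "xen"}
FILESYSTEM_TYPES = {"file", "scale", "dm", "datacenter"}
APPLICATION_TYPES = {"mail", "exchange"}


def get_backup_category(backup_type: str) -> str:
    if backup_type in DATABASE_TYPES:
        return "database"
    if backup_type in VIRTUALIZATION_TYPES:
        return "virtualization"
    if backup_type in FILESYSTEM_TYPES:
        return "filesystem"
    if backup_type in APPLICATION_TYPES:
        return "application"
    return "other"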

get_thresholds(backup_type: str, backup_level: str) -> dict

Returns type- and level-specific thresholds.

Parameters:

  • backup_type (str): backup type
  • backup_level (str): backup level

Returns:

{
    "warn": 93600,   # seconds
    "crit": 172800   # seconds
}

Priority:

  1. If backup_level == "log" → LOG thresholds (4h/8h)
  2. If backup_type is in THRESHOLDS → type-specific thresholds
  3. Otherwise → default thresholds (26h/48h)

Examples:

get_thresholds("mssql", "log")   # → {"warn": 14400, "crit": 28800}
get_thresholds("mssql", "full")  # → {"warn": 93600, "crit": 172800}
get_thresholds("newtype", "full") # → {"warn": 93600, "crit": 172800}

calculate_state(statuses: list[str], last_time: int, backup_type: str, error_handling: str) -> tuple[State, str]

Derives the CheckMK state from the backup statuses.

Parameters:

  • statuses (list[str]): list of all backup statuses
  • last_time (int): Unix timestamp of the most recent backup
  • backup_type (str): backup type
  • error_handling (str): "tolerant" or "strict"

Returns:

  • tuple: (State, status_text)

State logic:

| Condition                | Age     | Error handling | State | Text               |
|--------------------------|---------|----------------|-------|--------------------|
| ≥1x "completed"          | -       | -              | OK    | "Completed"        |
| Only "pending"/"started" | <2h     | -              | OK    | "Pending/Started"  |
| Only "pending"/"started" | >2h     | -              | WARN  | "Pending (>2h)"    |
| Only "pending"/"started" | unknown | -              | WARN  | "Pending"          |
| "failed"/"missed"        | -       | tolerant       | WARN  | "Failed (partial)" |
| "failed"/"missed"        | -       | strict         | CRIT  | "Failed/Missed"    |
| Anything else            | -       | -              | CRIT  | "Unknown State"    |

Examples:

calculate_state(["Completed", "Completed"], 1736690000, "mssql", "strict")
# → (State.OK, "Completed")


calculate_state(["Failed"], 1736690000, "file", "tolerant")
# → (State.WARN, "Failed (partial)")


calculate_state(["Failed"], 1736690000, "mssql", "strict")
# → (State.CRIT, "Failed/Missed")
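
A sketch of the table above as code (illustrative; in particular, how the age for the 2-hour pending window is derived is an assumption):

import time

from cmk.agent_based.v2 import State


def calculate_state(statuses, last_time, backup_type, error_handling):
    # backup_type is part of the documented signature; this sketch does not need it
    lowered = [s.lower() for s in statuses]

    # at least one successful run wins
    if any("completed" in s for s in lowered):
        return State.OK, "Completed"

    # only pending/started jobs: tolerate up to two hours
    if lowered and all(s in ("pending", "started") for s in lowered):
        if not last_time:
            return State.WARN, "Pending"
        if time.time() - last_time <= 2 * 3600:
            return State.OK, "Pending/Started"
        return State.WARN, "Pending (>2h)"

    # failed or missed jobs
    if any(s in ("failed", "missed") for s in lowered):
        if error_handling == "tolerant":
            return State.WARN, "Failed (partial)"
        return State.CRIT, "Failed/Missed"

    return State.CRIT, "Unknown State"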

2.3 CheckMK API v2 Objects

Service

Defines a CheckMK service during discovery.

from cmk.agent_based.v2 import Service, ServiceLabel


Service(
    item="SERVER_MSSQL",
    labels=[
        ServiceLabel("backup_type", "mssql"),
        ServiceLabel("frequency", "daily"),
    ]
)

Result

Represents a check result.

from cmk.agent_based.v2 import Result, State


Result(
    state=State.OK,
    summary="Type=MSSQL, Status=Completed, Last=3h"
)


Result(
    state=State.OK,
    notice="Detailed information for details page"
)

Metric

Defines a performance metric.

from cmk.agent_based.v2 import Metric


Metric(
    name="backup_age",
    value=11700,                    # current value
    levels=(93600, 172800),         # (warn, crit)
    boundaries=(0, None),            # (min, max)
)
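
For completeness, this is how the parse, discovery, and check functions from section 1.2 are typically wired together with the v2 API (a sketch assuming CheckMK 2.3+; the function names are the ones used throughout this document):

from cmk.agent_based.v2 import AgentSection, CheckPlugin

agent_section_tsm_backups = AgentSection(
    name="tsm_backups",
    parse_function=parse_tsm_backups,
)

check_plugin_tsm_backups = CheckPlugin(
    name="tsm_backups",
    service_name="TSM Backup %s",
    discovery_function=discover_tsm_backups,
    check_function=check_tsm_backups,
)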

3. Advanced Configuration

3.1 Custom Backup Types

Scenario: a new backup type "SAPASE" (SAP ASE database) should be monitored.

Step 1: Add the type to the known_types list

# in tsm_backups.py, extract_backup_type()
known_types = [
    'MSSQL', 'HANA', 'FILE', 'ORACLE', 'DB2', 'SCALE', 'DM',
    'DATACENTER', 'VIRTUAL', 'MAIL', 'MYSQL',
    'SAPASE',  # NEW
]

Step 2: Define thresholds (optional)

THRESHOLDS = {
    # ... existing entries ...
    "sapase": {"warn": 26 * 3600, "crit": 48 * 3600},
}

Step 3: Add the type to the matching category

DATABASE_TYPES = {
    'mssql', 'hana', 'db2', 'oracle', 'mysql',
    'sapase',  # NEW
}

Step 4: Choose the error handling (optional)

If tolerant behavior is desired:

TOLERANT_TYPES = {
    'file', 'virtual', 'scale', 'dm', 'datacenter',
    'vmware', 'hyperv', 'mail', 'exchange',
    'sapase',  # NEW (only if tolerant behavior is desired)
}

Step 5: Reload the plugin

cmk -R
cmk -II --all

Result:

  • Nodes such as SERVER_SAPASE are detected automatically
  • Type label: backup_type=sapase
  • Category label: backup_category=database
  • Thresholds: 26h/48h

3.2 Async Agent Plugin

In large TSM environments, parsing the CSV files can take a noticeable amount of time. Asynchronous plugins run decoupled from the agent interval and their output is cached.

Configuration (the Linux agent executes a plugin asynchronously when it lives in a subdirectory named after the desired interval in seconds):

# As root on the host: run tsm_backups every 5 minutes, cached
mkdir -p /usr/lib/check_mk_agent/plugins/300
mv /usr/lib/check_mk_agent/plugins/tsm_backups /usr/lib/check_mk_agent/plugins/300/

Or via a CheckMK Bakery rule:

Setup > Agents > Agent rules > Asynchronous execution of plugins (Windows, Linux)

Settings:

  • Plugin: tsm_backups
  • Execution interval: 300 seconds (5 minutes)
  • Cache age: 600 seconds (10 minutes)

3.3 CSV Export Automation

Option A: NFS mount (recommended)

# /etc/fstab
tsm-server.example.com:/exports/backup-stats  /mnt/CMK_TSM  nfs  defaults,ro  0  0


# test the mount
mount -a
ls /mnt/CMK_TSM/

Option B: rsync via cron

# root's crontab
*/15 * * * * rsync -az --delete tsm-server:/path/to/csv/ /mnt/CMK_TSM/

Option C: SCP with an SSH key

# set up the SSH key
ssh-keygen -t ed25519 -f ~/.ssh/tsm_backup_key -N ""
ssh-copy-id -i ~/.ssh/tsm_backup_key.pub tsm-server


# crontab
*/15 * * * * scp -i ~/.ssh/tsm_backup_key tsm-server:/path/*.CSV /mnt/CMK_TSM/

3.4 Rule-Based Service Creation

CheckMK rules for automatic service labels:

Setup > Services > Discovery rules > Host labels

Example rule:

conditions:
  service_labels:
    backup_category: database
 
actions:
  add_labels:
    criticality: high
    team: dba

3.5 Custom Views

View: all critical database backups

Setup > General > Custom views > Create new view


Name: Critical Database Backups
Datasource: All services


Filters:
- Service state: CRIT
- Service labels: backup_category = database


Columns:
- Host
- Service description
- Service state
- Service output
- Service labels: backup_type
- Service labels: frequency
- Perf-O-Meter

3.6 Custom Notifications

Notification rule: escalate only strict failed backups

Setup > Notifications > Add rule


Conditions:
- Service labels: error_handling = strict
- Service state: CRIT
- Service state type: HARD


Contact selection:
- Specify users: dba-team


Notification method:
- Email
- PagerDuty

4. Development Guide

4.1 Setting Up a Development Environment

# CheckMK site for development
omd create dev
omd start dev
su - dev


# Git repository
cd ~/local/lib/python3/cmk_addons/plugins/
git init
git add .
git commit -m "Initial commit"


# development workflow
vim tsm/agent_based/tsm_backups.py
cmk -R
cmk -vv --debug test-host | grep "TSM Backup"

4.2 Writing Unit Tests

Test file: test_tsm_backups.py

#!/usr/bin/env python3
import pytest
from tsm_backups import (
    extract_backup_type,
    extract_backup_level,
    calculate_state,
)
from cmk.agent_based.v2 import State


def test_extract_backup_type():
    assert extract_backup_type("SERVER_MSSQL") == "mssql"
    assert extract_backup_type("DATABASE_HANA_01") == "hana"
    assert extract_backup_type("NEWTYPE_CUSTOM") == "custom"


def test_extract_backup_level():
    assert extract_backup_level(["DAILY_FULL"]) == "full"
    assert extract_backup_level(["HOURLY_LOG", "DAILY_FULL"]) == "log"


def test_calculate_state_completed():
    state, text = calculate_state(
        ["Completed", "Completed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.OK
    assert text == "Completed"


def test_calculate_state_failed_strict():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.CRIT
    assert text == "Failed/Missed"


def test_calculate_state_failed_tolerant():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "file",
        "tolerant"
    )
    assert state == State.WARN
    assert text == "Failed (partial)"

Run the tests:

pytest test_tsm_backups.py -v

4.3 Code Style

PEP 8 Compliance:

pip install black flake8 mypy


# auto-formatting
black tsm_backups.py


# Linting
flake8 tsm_backups.py


# Type Checking
mypy tsm_backups.py

4.4 Debugging

Debugging the agent plugin

# direct invocation with traceback
python3 /usr/lib/check_mk_agent/plugins/tsm_backups


# with the debugger
python3 -m pdb /usr/lib/check_mk_agent/plugins/tsm_backups

Debugging the check plugin

# verbose check with debug output
cmk -vv --debug hostname | less


# only the TSM services
cmk -vv --debug hostname | grep -A 20 "TSM Backup"


# Python debugger inside the plugin
import pdb; pdb.set_trace()

4.5 Performance Profiling

# In tsm_backups.py
import cProfile
import pstats


def main():
    profiler = cProfile.Profile()
    profiler.enable()
   
    # ... existing code ...
   
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)

5. Performance Optimization

5.1 Speeding Up the CSV Parsing

Problem: large CSV files (>100 MB) slow down the agent

Solution 1: parse only the relevant rows

def parse_csv_optimized(self, csv_file):
    # only the last 24 hours are relevant
    cutoff_time = datetime.now() - timedelta(hours=24)

    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            try:
                time_str = row[0].strip()
                timestamp = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")

                # skip old entries
                if timestamp < cutoff_time:
                    continue

                # ... remaining processing ...
            except (ValueError, IndexError):
                # skip malformed rows
                continue

Solution 2: pandas for large files

import pandas as pd


def parse_csv_pandas(csv_file):
    df = pd.read_csv(
        csv_file,
        names=['timestamp', 'field', 'node', 'schedule', 'status'],
        parse_dates=['timestamp'],
    )

    # keep only the last 24 hours
    cutoff = pd.Timestamp.now() - pd.Timedelta(hours=24)
    df = df[df['timestamp'] > cutoff]

    # aggregation per node (named aggregation, so the grouping column
    # does not have to be referenced again)
    grouped = df.groupby('node').agg(
        statuses=('status', list),
        schedules=('schedule', list),
        last=('timestamp', 'max'),
        count=('status', 'count'),
    )

    return grouped.to_dict(orient='index')

5.2 Caching

Problem: the CSV files only change every 15-30 minutes

Solution: a cache with a timestamp check

import json
from pathlib import Path
import time


CACHE_FILE = Path("/tmp/tsm_backups_cache.json")
CACHE_TTL = 300  # 5 minutes


def get_cached_or_parse():
    if CACHE_FILE.exists():
        cache_age = time.time() - CACHE_FILE.stat().st_mtime
        if cache_age < CACHE_TTL:
            with open(CACHE_FILE, 'r') as f:
                return json.load(f)
   
    # Parse fresh
    parser = TSMParser()
    # ... parse logic ...
    result = parser.aggregate()
   
    # write the cache
    with open(CACHE_FILE, 'w') as f:
        json.dump(result, f)
   
    return result

5.3 Memory Optimization

Problem: large lists of status/schedule strings

Solution: store only unique values

from collections import defaultdict


def aggregate_optimized(self):
    nodes = defaultdict(lambda: {
        "statuses": set(),       # set instead of list
        "schedules": set(),
        "last": None,
        "count": 0,
    })

    for b in self.backups:
        node = b["node"]
        nodes[node]["count"] += 1
        nodes[node]["statuses"].add(b["status"])  # automatically unique
        nodes[node]["schedules"].add(b["schedule"])
        # ... rest ...

    # convert the sets to lists for JSON
    for node in nodes:
        nodes[node]["statuses"] = list(nodes[node]["statuses"])
        nodes[node]["schedules"] = list(nodes[node]["schedules"])

    return nodes

6. Security

6.1 File Permissions

# agent plugin
chown root:root /usr/lib/check_mk_agent/plugins/tsm_backups
chmod 755 /usr/lib/check_mk_agent/plugins/tsm_backups


# CSV directory
chown root:root /mnt/CMK_TSM
chmod 755 /mnt/CMK_TSM
chmod 644 /mnt/CMK_TSM/*.CSV


# check plugin
chown <site>:<site> $OMD_ROOT/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
chmod 644 $OMD_ROOT/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py

6.2 Input Validation

Agent plugin:

def is_valid_node(self, node, status):
    # check the length
    if not node or len(node) < 3 or len(node) > 200:
        return False

    # disallowed characters
    if not re.match(r'^[A-Za-z0-9_-]+$', node):
        return False
   
    # Status whitelist
    valid_statuses = ['Completed', 'Failed', 'Missed', 'Pending', 'Started']
    if status not in valid_statuses:
        return False
   
    return True

6.3 Safe CSV Processing

def parse_csv_safe(self, csv_file):
    try:
        # check the file size (max. 500 MB)
        if csv_file.stat().st_size > 500 * 1024 * 1024:
            return

        with open(csv_file, 'r', encoding='utf-8') as f:
            reader = csv.reader(f)

            line_count = 0
            for row in reader:
                line_count += 1

                # at most 1 million lines
                if line_count > 1000000:
                    break

                # ... processing ...
    except Exception:
        # log instead of crashing
        pass

7. Integration

7.1 Grafana Dashboard

InfluxDB Query:

SELECT
  mean("backup_age") AS "avg_age",
  max("backup_age") AS "max_age"
FROM "tsm_backups"
WHERE
  "backup_category" = 'database'
  AND time > now() - 7d
GROUP BY
  time(1h),
  "node_name"

Panels:

  • Backup Age Heatmap (per node)
  • Status Distribution (Pie Chart)
  • Backup Jobs Timeline
  • Alert History

7.2 Prometheus Exporter

Configure the CheckMK Prometheus exporter:

Setup > Exporter > Prometheus


Metrics:
- cmk_tsm_backups_backup_age_seconds
- cmk_tsm_backups_backup_jobs_total


Labels:
- backup_type
- backup_category
- frequency

7.3 REST API Access

import requests


# CheckMK REST API
url = "https://checkmk.example.com/site/check_mk/api/1.0"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Accept": "application/json"
}


# query all TSM services
response = requests.get(
    f"{url}/domain-types/service/collections/all",
    headers=headers,
    params={
        "query": '{"op": "and", "expr": [{"op": "~", "left": "description", "right": "TSM Backup"}]}'
    }
)


services = response.json()

8. Best Practices

8.1 Naming Conventions

Node names:

✅ RECOMMENDED:
- SERVER_MSSQL
- APP_ORACLE_01
- FILESERVER_BACKUP


❌ AVOID:
- MSSQL (too generic)
- SERVER-PROD (hyphens can cause problems)
- very_long_name_that_is_too_descriptive_mssql_backup_node (>50 characters)

Schedule names:

✅ RECOMMENDED:
- DAILY_FULL
- HOURLY_LOG
- WEEKLY_FULL


❌ AVOID:
- PROD_BACKUP (frequency not recognizable)
- BACKUP01 (carries no information)

8.2 Monitoring Strategy

Alert escalation:

  1. Level 1 (INFO): backup started/pending
  2. Level 2 (WARN):
     - backup age > WARN threshold
     - Failed (tolerant types)
     - Pending > 2h
  3. Level 3 (CRIT):
     - backup age > CRIT threshold
     - Failed/Missed (strict types)

Notification delays:

Setup > Notifications > Rules


WARN: notify after 15 minutes (allow recovery)
CRIT: notify immediately

8.3 Maintenance Windows

Pausing backup services during maintenance:

Setup > Services > Service monitoring rules > Disabled checks


Conditions:
- Service labels: backup_system = tsm
- Timeperiod: maintenance_window


Action: Disable active checks

8.4 Documentation

Document for each installation:

  1. CSV export source: which TSM server, which queries
  2. CSV transfer method: NFS/SCP/rsync plus schedule
  3. Custom types: list of all added backup types
  4. Adjusted thresholds: rationale for any deviations
  5. Contacts: who is responsible for the TSM backups

8.5 Regular Maintenance

Monthly:

  • Clean up the CSV directory (delete old files)
  • Verify that all expected nodes are found
  • Analyze the alert history: any false positives?

Quarterly:

  • Review thresholds and adjust if necessary
  • Document new backup types
  • Check the check plugin for updates

Yearly:

  • Test compatibility with CheckMK upgrades
  • Performance review (agent runtime, check duration)
  • Architecture review (is the solution still a good fit?)

Appendix

A. Glossary

| Term          | Description                                                           |
|---------------|-----------------------------------------------------------------------|
| Agent plugin  | Script on the monitored host that delivers data to CheckMK            |
| Check plugin  | Code on the CheckMK server that creates services and evaluates status |
| Service label | Key-value pair attached to a service (filtering/reporting)            |
| Discovery     | Process by which CheckMK creates services automatically               |
| Threshold     | WARN/CRIT limit for a metric                                          |
| Node          | TSM term for a backup client                                          |
| Schedule      | TSM term for a scheduled backup job                                   |

B. Error Reference

| Error                    | Cause                                                          | Resolution                                                   |
|--------------------------|----------------------------------------------------------------|--------------------------------------------------------------|
| Backup not found in data | Node exists from discovery but not in the current agent output | Check the CSV files, re-run discovery if necessary           |
| Empty agent section      | The agent delivers no data                                     | Check the agent plugin execution and the CSV directory       |
| JSON decode error        | The agent output is not valid JSON                             | Run the agent plugin manually and look for errors            |
| Unknown State            | Unexpected status from TSM                                     | Inspect the agent output, extend calculate_state() if needed |

C. TSM Query for the CSV Export

Example query for the TSM server (dsmadmc):

SELECT
  DATE(END_TIME) || ' ' || TIME(END_TIME) AS DATETIME,
  ENTITY,
  NODE_NAME,
  SCHEDULE_NAME,
  RESULT
FROM ACTLOG
WHERE
  SCHEDULE_NAME IS NOT NULL
  AND SCHEDULE_NAME != ''
  AND TIMESTAMPDIFF(4, CHAR(CURRENT_TIMESTAMP - END_TIME)) <= 24
ORDER BY END_TIME DESC

Export as CSV:

dsmadmc -id=admin -pa=password -comma \
  "SELECT ... FROM ACTLOG ..." \
  > /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV

End of the technical documentation

Last updated: 2026-01-12   Version: 4.1   Author: Marius Gielnik