# TSM Backup Monitoring - Technische Dokumentation
## Table of Contents
1. [Architecture Details](#architecture-details)
2. [API Reference](#api-reference)
3. [Advanced Configuration](#advanced-configuration)
4. [Development Guide](#development-guide)
5. [Performance Optimization](#performance-optimization)
6. [Security](#security)
7. [Integration](#integration)
8. [Best Practices](#best-practices)
---
## 1. Architecture Details
### 1.1 Component Overview
#### Agent Plugin (`tsm_backups`)
**Location:** `/usr/lib/check_mk_agent/plugins/tsm_backups`
**Tasks:**
- Read CSV files from `/mnt/CMK_TSM`
- Normalize node names (strip RRZ*/NFRZ* prefixes)
- Aggregate backup data per node
- Generate JSON output for the CheckMK agent
**Execution:**
- Runs on every agent call
- Default interval: 60 seconds (CheckMK default)
- Can be configured to run asynchronously (see [Async Agent Plugin](#async-agent-plugin))
**Output format:**
```json
{
  "SERVER_MSSQL": {
    "statuses": ["Completed", "Completed", "Failed"],
    "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
    "last": 1736693420,
    "count": 3
  },
  "DATABASE_HANA": {
    "statuses": ["Completed"],
    "schedules": ["00-00-00_FULL"],
    "last": 1736690000,
    "count": 1
  }
}
```
#### Check Plugin (`tsm_backups.py`)
**Location:** `/omd/sites/<site>/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py`
**Tasks:**
- Parse the JSON from the agent
- Discover services with labels
- Evaluate backup status
- Generate metrics
- Check thresholds
**CheckMK API version:** v2 (cmk.agent_based.v2)
### 1.2 Data Flow Diagram
```
┌──────────────────────────────────────────────────────────────────┐
│ TSM Server
│
│  SELECT
│    DATE_TIME, ENTITY, NODE_NAME, SCHEDULE_NAME, RESULT
│  FROM ACTLOG
│  WHERE TIMESTAMP > CURRENT_TIMESTAMP - 24 HOURS
│
│  ↓ Export as CSV
│  /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
└──────────────────────────────────────────────────────────────────┘
                           │
                           │ NFS/SCP/Rsync
                           ↓
┌──────────────────────────────────────────────────────────────────┐
│ Host: /mnt/CMK_TSM/
│  ├── TSM_BACKUP_SCHED_24H.CSV
│  ├── TSM_DB_SCHED_24H.CSV
│  └── TSM_FILE_SCHED_24H.CSV
└──────────────────────────────────────────────────────────────────┘
                           │
                           ↓
┌──────────────────────────────────────────────────────────────────┐
│ Agent plugin: /usr/lib/check_mk_agent/plugins/tsm_backups
│
│  1. List the CSV files in /mnt/CMK_TSM
│  2. Parse each line:
│     - Extract: timestamp, node, schedule, status
│     - Validate the node (length, MAINTENANCE)
│     - Normalize the node name:
│       RRZ01_SERVER_MSSQL → _SERVER_MSSQL → SERVER_MSSQL
│  3. Aggregate per node:
│     - Collect all statuses
│     - Collect all schedules
│     - Find the latest timestamp
│     - Count the jobs
│  4. Generate the JSON output
│
│  Output: <<<tsm_backups:sep(0)>>>
│          {"SERVER_MSSQL": {...}, ...}
└──────────────────────────────────────────────────────────────────┘
                           │
                           │ CheckMK agent protocol
                           ↓
┌──────────────────────────────────────────────────────────────────┐
│ CheckMK server: agent section parser
│
│  parse_tsm_backups(string_table):
│    - Extract the JSON string from string_table[0][0]
│    - Parse JSON → Python dict
│    - Return: {node: data, ...}
└──────────────────────────────────────────────────────────────────┘
                           │
                           ↓
┌──────────────────────────────────────────────────────────────────┐
│ Service discovery
│
│  discover_tsm_backups(section):
│    FOR EACH node IN section:
│      1. Extract metadata:
│         - backup_type = extract_backup_type(node)
│         - backup_level = extract_backup_level(schedules)
│         - frequency = extract_frequency(schedules)
│         - error_handling = get_error_handling(backup_type)
│         - category = get_backup_category(backup_type)
│      2. Create a service with labels:
│         Service(
│           item=node,
│           labels=[
│             ServiceLabel("backup_type", backup_type),
│             ServiceLabel("backup_category", category),
│             ...
│           ]
│         )
└──────────────────────────────────────────────────────────────────┘
                           │
                           ↓
┌──────────────────────────────────────────────────────────────────┐
│ Service check execution
│
│  check_tsm_backups(item, section):
│    1. Load the node data from section[item]
│    2. Extract metadata (as during discovery)
│    3. Compute the state:
│       - calculate_state() → (State, status_text)
│    4. Compute the backup age:
│       - age = now - last_timestamp
│    5. Fetch the thresholds:
│       - thresholds = get_thresholds(type, level)
│    6. Check the age against the thresholds
│    7. Generate the output:
│       - Result(state, summary)
│       - Metric("backup_age", age, levels)
│       - Metric("backup_jobs", count)
└──────────────────────────────────────────────────────────────────┘
                           │
                           ↓
┌──────────────────────────────────────────────────────────────────┐
│ CheckMK service
│
│  Name: TSM Backup SERVER_MSSQL
│  State: OK
│  Summary: Type=MSSQL (database), Level=FULL, Freq=daily,
│           Status=Completed, Last=3h 15m, Jobs=3
│  Metrics:
│    - backup_age: 11700s (warn: 93600s, crit: 172800s)
│    - backup_jobs: 3
│  Labels:
│    - backup_type: mssql
│    - backup_category: database
│    - frequency: daily
│    - backup_level: full
│    - error_handling: strict
└──────────────────────────────────────────────────────────────────┘
```
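The agent section parser in the diagram is small enough to sketch directly. This standalone version assumes, as stated above, that the agent emits the whole JSON document on a single line:

```python
import json

def parse_tsm_backups(string_table):
    """Parse the agent section: one line containing the whole JSON document."""
    if not string_table or not string_table[0]:
        return {}
    return json.loads(string_table[0][0])
```

In the real plugin this function is registered via `AgentSection(name="tsm_backups", parse_function=parse_tsm_backups)` from `cmk.agent_based.v2`.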
### 1.3 Node Normalization in Detail
**Purpose:** TSM environments with redundant servers (e.g. RRZ01, RRZ02, NFRZ01) should be monitored as a single logical node.
**Algorithm:**
```python
def normalize_node_name(node):
    """
    Input: "RRZ01_MYSERVER_MSSQL"

    Step 1: Remove an RRZ*/NFRZ*/RZ* prefix together with its underscore
            Pattern: r'(RRZ|NFRZ|RZ)\d+(_)'
            Result: "_MYSERVER_MSSQL"

    Step 2: Remove the leading underscore
            Result: "MYSERVER_MSSQL"

    Step 3: Remove an RRZ*/NFRZ*/RZ* suffix
            Pattern: r'(RRZ|NFRZ|RZ)\d+$'
            Result: "MYSERVER_MSSQL"

    Output: "MYSERVER_MSSQL"
    """
```
**Examples:**
| Original node | Normalized | Result |
|---------------|------------|--------|
| `RRZ01_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ |
| `RRZ02_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ Merged |
| `NFRZ01_DATABASE_HANA` | `DATABASE_HANA` | ✅ |
| `SERVER_FILE_RRZ01` | `SERVER_FILE` | ✅ |
| `MYSERVER_ORACLE` | `MYSERVER_ORACLE` | ✅ Unchanged |
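The steps above collapse into two substitutions; a minimal runnable sketch (not the plugin's exact code):

```python
import re

def normalize_node_name(node: str) -> str:
    """Strip redundancy prefixes/suffixes such as RRZ01_, NFRZ02_, _RRZ01."""
    # Steps 1+2: remove an RRZ*/NFRZ*/RZ* prefix together with its underscore
    node = re.sub(r'^(RRZ|NFRZ|RZ)\d+_', '', node)
    # Step 3: remove a trailing RRZ*/NFRZ*/RZ* suffix (and its separator)
    node = re.sub(r'_?(RRZ|NFRZ|RZ)\d+$', '', node)
    return node
```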
---
## 2. API Reference
### 2.1 Agent Plugin Functions
#### `TSMParser.normalize_node_name(node: str) -> str`
Normalizes TSM node names for the redundancy logic.
**Parameters:**
- `node` (str): Original TSM node name
**Returns:**
- `str`: Normalized node name
**Example:**
```python
parser = TSMParser()
normalized = parser.normalize_node_name("RRZ01_SERVER_MSSQL")
# normalized == "SERVER_MSSQL"
```
---
#### `TSMParser.is_valid_node(node: str, status: str) -> bool`
Checks whether a node is valid for monitoring.
**Parameters:**
- `node` (str): Node name
- `status` (str): Backup status
**Returns:**
- `bool`: True if valid, False otherwise
**Validation rules:**
- Node must be present (not empty)
- Node must be at least 3 characters long
- Status must be present
- Node must not contain "MAINTENANCE"
**Example:**
```python
parser.is_valid_node("SE", "Completed")  # False (too short)
parser.is_valid_node("SERVER_MSSQL", "Completed")  # True
parser.is_valid_node("SERVER_MAINTENANCE", "Completed")  # False
```
---
#### `TSMParser.parse_csv(csv_file: Path) -> None`
Parses a TSM CSV file and collects backup information.
**Parameters:**
- `csv_file` (Path): Path to the CSV file
**CSV format:**
```
TIMESTAMP,FIELD,NODE_NAME,SCHEDULE_NAME,STATUS
2026-01-12 08:00:00,SOMETHING,SERVER_MSSQL,DAILY_FULL,Completed
```
**Side effects:**
- Appends the parsed backups to `self.backups`
---
#### `TSMParser.aggregate() -> dict`
Aggregates backup data per normalized node.
**Returns:**
```python
{
    "SERVER_MSSQL": {
        "statuses": ["Completed", "Completed", "Failed"],
        "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
        "last": 1736693420,  # Unix timestamp
        "count": 3
    }
}
```
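The aggregation can be sketched as a standalone function. The real method works on `self.backups`; here the backup list is passed in, with assumed per-entry keys `node`, `status`, `schedule`, and `time`:

```python
from collections import defaultdict

def aggregate(backups):
    nodes = defaultdict(lambda: {"statuses": [], "schedules": [], "last": None, "count": 0})
    for b in backups:
        entry = nodes[b["node"]]
        entry["statuses"].append(b["status"])
        entry["schedules"].append(b["schedule"])
        entry["count"] += 1
        # keep the most recent Unix timestamp
        if entry["last"] is None or b["time"] > entry["last"]:
            entry["last"] = b["time"]
    return dict(nodes)
```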
---
### 2.2 Check Plugin Functions
#### `extract_backup_type(node: str) -> str`
Extracts the backup type from the node name by matching against known types.
**Parameters:**
- `node` (str): Normalized node name
**Returns:**
- `str`: Backup type in lowercase, or "unknown"
**Known types:**
- Databases: MSSQL, HANA, Oracle, DB2, MySQL
- Virtualization: Virtual
- File systems: FILE, SCALE, DM, Datacenter
- Applications: Mail
**Algorithm:**
1. Split the node name on underscores
2. Take the last segment
3. If the last segment is numeric, take the second-to-last segment
4. Match against the list of known types
5. Return the lowercase type, or "unknown"
**Examples:**
```python
extract_backup_type("SERVER_MSSQL")           # → "mssql"
extract_backup_type("DATABASE_HANA_01")       # → "hana"
extract_backup_type("FILESERVER_FILE")        # → "file"
extract_backup_type("VM_HYPERV_123")          # → "hyperv"
extract_backup_type("APP_UNKNOWN")            # → "unknown"
```
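The algorithm above can be sketched as follows. The `KNOWN_TYPES` set here is illustrative, not the plugin's exact list (it is extended with HYPERV/VMWARE so the virtualization example resolves):

```python
# Illustrative known-type set; the plugin's actual list may differ.
KNOWN_TYPES = {
    'MSSQL', 'HANA', 'ORACLE', 'DB2', 'MYSQL',
    'VIRTUAL', 'HYPERV', 'VMWARE',
    'FILE', 'SCALE', 'DM', 'DATACENTER', 'MAIL',
}

def extract_backup_type(node: str) -> str:
    parts = node.split('_')
    candidate = parts[-1]
    # numeric suffix such as "_01": look one segment further left
    if candidate.isdigit() and len(parts) >= 2:
        candidate = parts[-2]
    return candidate.lower() if candidate.upper() in KNOWN_TYPES else "unknown"
```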
---
#### `extract_backup_level(schedules: list[str]) -> str`
Extracts the backup level from the schedule names.
**Parameters:**
- `schedules` (list[str]): List of schedule names
**Returns:**
- `str`: `"log"`, `"full"`, `"incremental"`, or `"differential"`
**Priority:** log > full > differential > incremental
**Detection patterns:**
- `_LOG` or `LOG` → log
- `_FULL` or `FULL` → full
- `_INCR` or `INCREMENTAL` → incremental
- `_DIFF` or `DIFFERENTIAL` → differential
**Examples:**
```python
extract_backup_level(["DAILY_FULL"])                    # → "full"
extract_backup_level(["HOURLY_LOG", "DAILY_FULL"])     # → "log"
extract_backup_level(["00-00-00_FULL"])                 # → "full"
```
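A minimal sketch of the priority logic; the fallback value when no pattern matches is an assumption, since the text does not specify it:

```python
def extract_backup_level(schedules):
    joined = " ".join(s.upper() for s in schedules)
    # priority: log > full > differential > incremental
    if 'LOG' in joined:
        return "log"
    if 'FULL' in joined:
        return "full"
    if 'DIFF' in joined:
        return "differential"
    if 'INCR' in joined:
        return "incremental"
    return "incremental"  # assumed fallback when nothing matches
```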
---
#### `extract_frequency(schedules: list[str]) -> str`
Extracts the backup frequency from the schedule names.
**Parameters:**
- `schedules` (list[str]): List of schedule names
**Returns:**
- `str`: `"hourly"`, `"daily"`, `"weekly"`, `"monthly"`, or `"unknown"`
**Priority:** hourly > daily > weekly > monthly
**Detection patterns:**
- `HOURLY` → hourly
- `DAILY` → daily
- `WEEKLY` → weekly
- `MONTHLY` → monthly
- `HH-MM-SS_*LOG` → hourly (time-based with LOG)
- `00-00-00_*` → daily (midnight)
**Examples:**
```python
extract_frequency(["DAILY_FULL"])               # → "daily"
extract_frequency(["00-00-00_FULL"])            # → "daily"
extract_frequency(["08-00-00_LOG"])             # → "hourly"
extract_frequency(["WEEKLY_FULL", "DAILY_DIFF"]) # → "daily"
```
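The detection patterns and priority order can be sketched like this (a simplified stand-in, not the plugin's exact code):

```python
import re

def extract_frequency(schedules):
    freqs = set()
    for s in schedules:
        u = s.upper()
        if 'HOURLY' in u:
            freqs.add("hourly")
        elif 'DAILY' in u:
            freqs.add("daily")
        elif 'WEEKLY' in u:
            freqs.add("weekly")
        elif 'MONTHLY' in u:
            freqs.add("monthly")
        elif re.match(r'\d{2}-\d{2}-\d{2}_.*LOG', u):
            freqs.add("hourly")   # time-based schedule with LOG
        elif u.startswith('00-00-00_'):
            freqs.add("daily")    # midnight schedule
    # priority: hourly > daily > weekly > monthly
    for f in ("hourly", "daily", "weekly", "monthly"):
        if f in freqs:
            return f
    return "unknown"
```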
---
#### `get_error_handling(backup_type: str) -> str`
Determines the error-handling strategy based on the backup type.
**Parameters:**
- `backup_type` (str): Backup type
**Returns:**
- `str`: `"tolerant"` or `"strict"`
**Logic:**
```python
if backup_type in TOLERANT_TYPES:
    return "tolerant"  # Failed → WARN
else:
    return "strict"    # Failed → CRIT
```
**Tolerant types:**
- file, virtual, scale, dm, datacenter
- vmware, hyperv, mail, exchange
**Strict:**
- All database types (mssql, hana, oracle, db2, ...)
- All other types
---
#### `get_backup_category(backup_type: str) -> str`
Maps a backup type to a top-level category.
**Parameters:**
- `backup_type` (str): Backup type
**Returns:**
- `str`: `"database"`, `"virtualization"`, `"filesystem"`, `"application"`, or `"other"`
**Categories:**
| Category | Types |
|-----------|-------|
| `database` | mssql, hana, oracle, db2, mysql, postgres, mariadb, sybase, mongodb |
| `virtualization` | virtual, vmware, hyperv, kvm, xen |
| `filesystem` | file, scale, dm, datacenter |
| `application` | mail, exchange |
| `other` | Everything else |
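The table above translates directly into set lookups:

```python
# Category sets as listed in the table above
DATABASE_TYPES = {'mssql', 'hana', 'oracle', 'db2', 'mysql',
                  'postgres', 'mariadb', 'sybase', 'mongodb'}
VIRTUALIZATION_TYPES = {'virtual', 'vmware', 'hyperv', 'kvm', 'xen'}
FILESYSTEM_TYPES = {'file', 'scale', 'dm', 'datacenter'}
APPLICATION_TYPES = {'mail', 'exchange'}

def get_backup_category(backup_type: str) -> str:
    if backup_type in DATABASE_TYPES:
        return "database"
    if backup_type in VIRTUALIZATION_TYPES:
        return "virtualization"
    if backup_type in FILESYSTEM_TYPES:
        return "filesystem"
    if backup_type in APPLICATION_TYPES:
        return "application"
    return "other"
```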
---
#### `get_thresholds(backup_type: str, backup_level: str) -> dict`
Returns type- and level-specific thresholds.
**Parameters:**
- `backup_type` (str): Backup type
- `backup_level` (str): Backup level
**Returns:**
```python
{
    "warn": 93600,   # seconds
    "crit": 172800   # seconds
}
```
**Priority:**
1. If `backup_level == "log"` → LOG thresholds (4h/8h)
2. Else if `backup_type` is in THRESHOLDS → type-specific thresholds
3. Otherwise → default thresholds (26h/48h)
**Examples:**
```python
get_thresholds("mssql", "log")   # → {"warn": 14400, "crit": 28800}
get_thresholds("mssql", "full")  # → {"warn": 93600, "crit": 172800}
get_thresholds("newtype", "full") # → {"warn": 93600, "crit": 172800}
```
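The lookup order can be sketched like this; only the LOG and default values come from the text above, and the type-specific `THRESHOLDS` table is abbreviated:

```python
LOG_THRESHOLDS = {"warn": 4 * 3600, "crit": 8 * 3600}        # 4h/8h
DEFAULT_THRESHOLDS = {"warn": 26 * 3600, "crit": 48 * 3600}  # 26h/48h
THRESHOLDS = {
    # type-specific entries, abbreviated here
}

def get_thresholds(backup_type: str, backup_level: str) -> dict:
    # 1. LOG backups always use the tight LOG thresholds
    if backup_level == "log":
        return LOG_THRESHOLDS
    # 2./3. type-specific entry, else the defaults
    return THRESHOLDS.get(backup_type, DEFAULT_THRESHOLDS)
```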
---
#### `calculate_state(statuses: list[str], last_time: int, backup_type: str, error_handling: str) -> tuple[State, str]`
Computes the CheckMK state from the backup statuses.
**Parameters:**
- `statuses` (list[str]): List of all backup statuses
- `last_time` (int): Unix timestamp of the last backup
- `backup_type` (str): Backup type
- `error_handling` (str): "tolerant" or "strict"
**Returns:**
- `tuple`: `(State, status_text)`
**State logic table:**
| Condition | Age | Error handling | State | Text |
|-----------|-----|----------------|-------|------|
| ≥1x "completed" | - | - | OK | "Completed" |
| Only "pending"/"started" | <2h | - | OK | "Pending/Started" |
| Only "pending"/"started" | >2h | - | WARN | "Pending (>2h)" |
| Only "pending"/"started" | unknown | - | WARN | "Pending" |
| "failed"/"missed" | - | tolerant | WARN | "Failed (partial)" |
| "failed"/"missed" | - | strict | CRIT | "Failed/Missed" |
| Anything else | - | - | CRIT | "Unknown State" |
**Examples:**
```python
calculate_state(["Completed", "Completed"], 1736690000, "mssql", "strict")
# → (State.OK, "Completed")
calculate_state(["Failed"], 1736690000, "file", "tolerant")
# → (State.WARN, "Failed (partial)")
calculate_state(["Failed"], 1736690000, "mssql", "strict")
# → (State.CRIT, "Failed/Missed")
```
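The state table can be sketched as a plain function. A stand-in `State` enum is used here so the sketch runs outside a CheckMK site; the real plugin imports `State` from `cmk.agent_based.v2`:

```python
import time
from enum import Enum

class State(Enum):  # stand-in for cmk.agent_based.v2.State
    OK = 0
    WARN = 1
    CRIT = 2

def calculate_state(statuses, last_time, backup_type, error_handling):
    lowered = [s.lower() for s in statuses]
    # at least one completed backup wins
    if "completed" in lowered:
        return State.OK, "Completed"
    # only pending/started: judge by backup age
    if lowered and all(s in ("pending", "started") for s in lowered):
        if last_time is None:
            return State.WARN, "Pending"
        if time.time() - last_time < 2 * 3600:
            return State.OK, "Pending/Started"
        return State.WARN, "Pending (>2h)"
    if any(s in ("failed", "missed") for s in lowered):
        if error_handling == "tolerant":
            return State.WARN, "Failed (partial)"
        return State.CRIT, "Failed/Missed"
    return State.CRIT, "Unknown State"
```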
---
### 2.3 CheckMK API v2 Objects
#### `Service`
Defines a CheckMK service during discovery.
```python
from cmk.agent_based.v2 import Service, ServiceLabel
Service(
    item="SERVER_MSSQL",
    labels=[
        ServiceLabel("backup_type", "mssql"),
        ServiceLabel("frequency", "daily"),
    ]
)
```
---
#### `Result`
Represents a check result.
```python
from cmk.agent_based.v2 import Result, State
Result(
    state=State.OK,
    summary="Type=MSSQL, Status=Completed, Last=3h"
)
Result(
    state=State.OK,
    notice="Detailed information for the details page"
)
```
---
#### `Metric`
Defines a performance metric.
```python
from cmk.agent_based.v2 import Metric
Metric(
    name="backup_age",
    value=11700,                    # current value
    levels=(93600, 172800),         # (warn, crit)
    boundaries=(0, None),           # (min, max)
)
```
---
## 3. Advanced Configuration
### 3.1 Custom Backup Types
**Scenario:** A new backup type "SAPASE" (SAP ASE database) should be monitored.
**Step 1: Add the type to the known_types list**
```python
# In tsm_backups.py, extract_backup_type() function
known_types = [
    'MSSQL', 'HANA', 'FILE', 'ORACLE', 'DB2', 'SCALE', 'DM',
    'DATACENTER', 'VIRTUAL', 'MAIL', 'MYSQL',
    'SAPASE',  # NEW
]
```
**Step 2: Define thresholds (optional)**
```python
THRESHOLDS = {
    # ... existing entries ...
    "sapase": {"warn": 26 * 3600, "crit": 48 * 3600},
}
```
**Step 3: Add the type to the matching category**
```python
DATABASE_TYPES = {
    'mssql', 'hana', 'db2', 'oracle', 'mysql',
    'sapase',  # NEW
}
```
**Step 4: Choose the error handling (optional)**
If tolerant handling is desired:
```python
TOLERANT_TYPES = {
    'file', 'virtual', 'scale', 'dm', 'datacenter',
    'vmware', 'hyperv', 'mail', 'exchange',
    'sapase',  # NEW (if tolerant handling is desired)
}
```
**Step 5: Reload the plugin**
```bash
cmk -R
cmk -II --all
```
**Result:**
- Nodes such as `SERVER_SAPASE` are detected automatically
- Type label: `backup_type=sapase`
- Category label: `backup_category=database`
- Thresholds: 26h/48h
---
### 3.2 Async Agent Plugin
In large TSM environments the CSV parsing can take a while. Async plugins run independently of the agent interval.
**Configuration:**
```bash
# As root on the host: the Linux agent runs a plugin cached every N seconds
# when it lives in a subdirectory named after the interval in seconds
mkdir -p /usr/lib/check_mk_agent/plugins/300
mv /usr/lib/check_mk_agent/plugins/tsm_backups \
   /usr/lib/check_mk_agent/plugins/300/tsm_backups
```
**Or via a CheckMK Bakery rule:**
```
Setup > Agents > Agent Rules > Asynchronous execution of plugins (Windows, Linux)
```
**Settings:**
- Plugin: `tsm_backups`
- Execution interval: `300` seconds (5 minutes)
- Cache age: `600` seconds (10 minutes)
---
### 3.3 CSV Export Automation
#### Option A: NFS mount (recommended)
```bash
# /etc/fstab
tsm-server.example.com:/exports/backup-stats  /mnt/CMK_TSM  nfs  defaults,ro  0  0
# Test the mount
mount -a
ls /mnt/CMK_TSM/
```
#### Option B: Rsync via cron
```bash
# Crontab for root
*/15 * * * * rsync -az --delete tsm-server:/path/to/csv/ /mnt/CMK_TSM/
```
#### Option C: SCP with an SSH key
```bash
# Set up the SSH key
ssh-keygen -t ed25519 -f ~/.ssh/tsm_backup_key -N ""
ssh-copy-id -i ~/.ssh/tsm_backup_key.pub tsm-server
# Crontab
*/15 * * * * scp -i ~/.ssh/tsm_backup_key tsm-server:/path/*.CSV /mnt/CMK_TSM/
```
---
### 3.4 Rule-Based Service Creation
**CheckMK rules for automatic service labels:**
```
Setup > Services > Discovery rules > Host labels
```
**Example rule:**
```yaml
conditions:
  service_labels:
    backup_category: database

actions:
  add_labels:
    criticality: high
    team: dba
```
---
### 3.5 Custom Views
#### View: All critical database backups
```
Setup > General > Custom views > Create new view
Name: Critical Database Backups
Datasource: All services
Filters:
- Service state: CRIT
- Service labels: backup_category = database
Columns:
- Host
- Service description
- Service state
- Service output
- Service labels: backup_type
- Service labels: frequency
- Perf-O-Meter
```
---
### 3.6 Custom Notifications
**Notification rule: escalate only strict failed backups**
```
Setup > Notifications > Add rule
Conditions:
- Service labels: error_handling = strict
- Service state: CRIT
- Service state type: HARD
Contact selection:
- Specify users: dba-team
Notification method:
- Email
- PagerDuty
```
---
## 4. Development Guide
### 4.1 Setting Up a Development Environment
```bash
# CheckMK site for development
omd create dev
omd start dev
su - dev
# Git repository
cd ~/local/lib/python3/cmk_addons/plugins/
git init
git add .
git commit -m "Initial commit"
# Development workflow
vim tsm/agent_based/tsm_backups.py
cmk -R
cmk -vv --debug test-host | grep "TSM Backup"
```
---
### 4.2 Writing Unit Tests
**Test file:** `test_tsm_backups.py`
```python
#!/usr/bin/env python3
import pytest
from tsm_backups import (
    extract_backup_type,
    extract_backup_level,
    calculate_state,
)
from cmk.agent_based.v2 import State
def test_extract_backup_type():
    assert extract_backup_type("SERVER_MSSQL") == "mssql"
    assert extract_backup_type("DATABASE_HANA_01") == "hana"
    assert extract_backup_type("NEWTYPE_CUSTOM") == "unknown"
def test_extract_backup_level():
    assert extract_backup_level(["DAILY_FULL"]) == "full"
    assert extract_backup_level(["HOURLY_LOG", "DAILY_FULL"]) == "log"
def test_calculate_state_completed():
    state, text = calculate_state(
        ["Completed", "Completed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.OK
    assert text == "Completed"
def test_calculate_state_failed_strict():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.CRIT
    assert text == "Failed/Missed"
def test_calculate_state_failed_tolerant():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "file",
        "tolerant"
    )
    assert state == State.WARN
    assert text == "Failed (partial)"
```
**Run the tests:**
```bash
pytest test_tsm_backups.py -v
```
---
### 4.3 Code Style
**PEP 8 compliance:**
```bash
pip install black flake8 mypy
# Auto-formatting
black tsm_backups.py
# Linting
flake8 tsm_backups.py
# Type checking
mypy tsm_backups.py
```
---
### 4.4 Debugging
#### Debugging the agent plugin
```bash
# Direct invocation with traceback
python3 /usr/lib/check_mk_agent/plugins/tsm_backups
# With the debugger
python3 -m pdb /usr/lib/check_mk_agent/plugins/tsm_backups
```
#### Debugging the check plugin
```bash
# Verbose check with debug output
cmk -vv --debug hostname | less
# Only the TSM services
cmk -vv --debug hostname | grep -A 20 "TSM Backup"
# Python debugger inside the plugin (add this line in the code)
import pdb; pdb.set_trace()
```
---
### 4.5 Performance Profiling
```python
# In tsm_backups.py
import cProfile
import pstats
def main():
    profiler = cProfile.Profile()
    profiler.enable()

    # ... existing code ...

    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)
```
---
## 5. Performance Optimization
### 5.1 Speeding Up CSV Parsing
**Problem:** Large CSV files (>100 MB) slow down the agent.
**Solution 1: Parse only relevant rows**
```python
import csv
from datetime import datetime, timedelta

def parse_csv_optimized(self, csv_file):
    # Only the last 24h are relevant
    cutoff_time = datetime.now() - timedelta(hours=24)

    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            try:
                time_str = row[0].strip()
                timestamp = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")

                # Skip old entries
                if timestamp < cutoff_time:
                    continue

                # ... remaining processing ...
            except (ValueError, IndexError):
                continue
```
**Solution 2: Pandas for large files**
```python
import pandas as pd

def parse_csv_pandas(csv_file):
    df = pd.read_csv(
        csv_file,
        names=['timestamp', 'field', 'node', 'schedule', 'status'],
        parse_dates=['timestamp'],
    )

    # Keep only the last 24h
    cutoff = pd.Timestamp.now() - pd.Timedelta(hours=24)
    df = df[df['timestamp'] > cutoff]

    # Aggregation (named aggregation avoids re-using the groupby key)
    grouped = df.groupby('node').agg(
        statuses=('status', list),
        schedules=('schedule', list),
        last=('timestamp', 'max'),
        count=('status', 'size'),
    )

    return grouped.to_dict('index')
```
---
### 5.2 Caching
**Problem:** The CSV files only change every 15-30 minutes.
**Solution: a cache with a timestamp check**
```python
import json
from pathlib import Path
import time

CACHE_FILE = Path("/tmp/tsm_backups_cache.json")
CACHE_TTL = 300  # 5 minutes

def get_cached_or_parse():
    if CACHE_FILE.exists():
        cache_age = time.time() - CACHE_FILE.stat().st_mtime
        if cache_age < CACHE_TTL:
            with open(CACHE_FILE, 'r') as f:
                return json.load(f)

    # Parse fresh
    parser = TSMParser()
    # ... parse logic ...
    result = parser.aggregate()

    # Write the cache
    with open(CACHE_FILE, 'w') as f:
        json.dump(result, f)

    return result
```
---
### 5.3 Memory Optimization
**Problem:** Large lists of status/schedule strings.
**Solution: store only unique values**
```python
from collections import defaultdict

def aggregate_optimized(self):
    nodes = defaultdict(lambda: {
        "statuses": set(),       # set instead of list
        "schedules": set(),
        "last": None,
        "count": 0,
    })

    for b in self.backups:
        node = b["node"]
        nodes[node]["count"] += 1
        nodes[node]["statuses"].add(b["status"])  # automatically unique
        nodes[node]["schedules"].add(b["schedule"])
        # ... rest ...

    # Convert sets to lists for JSON
    for node in nodes:
        nodes[node]["statuses"] = list(nodes[node]["statuses"])
        nodes[node]["schedules"] = list(nodes[node]["schedules"])

    return nodes
```
---
## 6. Security
### 6.1 File Permissions
```bash
# Agent plugin
chown root:root /usr/lib/check_mk_agent/plugins/tsm_backups
chmod 755 /usr/lib/check_mk_agent/plugins/tsm_backups
# CSV directory
chown root:root /mnt/CMK_TSM
chmod 755 /mnt/CMK_TSM
chmod 644 /mnt/CMK_TSM/*.CSV
# Check plugin
chown <site>:<site> $OMD_ROOT/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
chmod 644 $OMD_ROOT/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
```
---
### 6.2 Input Validation
**Agent plugin:**
```python
def is_valid_node(self, node, status):
    # Check the length
    if not node or len(node) < 3 or len(node) > 200:
        return False

    # Disallowed characters
    if not re.match(r'^[A-Za-z0-9_-]+$', node):
        return False

    # Status whitelist
    valid_statuses = ['Completed', 'Failed', 'Missed', 'Pending', 'Started']
    if status not in valid_statuses:
        return False

    return True
```
---
### 6.3 Safe CSV Processing
```python
def parse_csv_safe(self, csv_file):
    try:
        # Check the file size (max 500 MB)
        if csv_file.stat().st_size > 500 * 1024 * 1024:
            return

        with open(csv_file, 'r', encoding='utf-8') as f:
            reader = csv.reader(f)

            line_count = 0
            for row in reader:
                line_count += 1

                # At most 1 million lines
                if line_count > 1000000:
                    break

                # ... processing ...
    except Exception:
        # Log instead of crashing
        pass
```
---
## 7. Integration
### 7.1 Grafana Dashboard
**InfluxDB query:**
```sql
SELECT
  mean("backup_age") AS "avg_age",
  max("backup_age") AS "max_age"
FROM "tsm_backups"
WHERE
  "backup_category" = 'database'
  AND time > now() - 7d
GROUP BY
  time(1h),
  "node_name"
```
**Panels:**
- Backup age heatmap (per node)
- Status distribution (pie chart)
- Backup jobs timeline
- Alert history
---
### 7.2 Prometheus Exporter
**Configure the CheckMK Prometheus exporter:**
```
Setup > Exporter > Prometheus
Metrics:
- cmk_tsm_backups_backup_age_seconds
- cmk_tsm_backups_backup_jobs_total
Labels:
- backup_type
- backup_category
- frequency
```
---
### 7.3 REST API Access
```python
import requests

# CheckMK REST API
url = "https://checkmk.example.com/site/check_mk/api/1.0"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Accept": "application/json"
}
# Query all TSM services
response = requests.get(
    f"{url}/domain-types/service/collections/all",
    headers=headers,
    params={
        "query": '{"op": "and", "expr": [{"op": "~", "left": "description", "right": "TSM Backup"}]}'
    }
)
services = response.json()
```
---
## 8. Best Practices
### 8.1 Naming Conventions
**Node names:**
```
✅ RECOMMENDED:
- SERVER_MSSQL
- APP_ORACLE_01
- FILESERVER_BACKUP
❌ AVOID:
- MSSQL (too generic)
- SERVER-PROD (hyphens can cause problems)
- very_long_name_that_is_too_descriptive_mssql_backup_node (>50 characters)
```
**Schedule names:**
```
✅ RECOMMENDED:
- DAILY_FULL
- HOURLY_LOG
- WEEKLY_FULL
❌ AVOID:
- PROD_BACKUP (no recognizable frequency)
- BACKUP01 (no information)
```
---
### 8.2 Monitoring Strategy
**Alert escalation:**
1. **Level 1 (INFO):** Backup started/pending
2. **Level 2 (WARN):**
   - Backup age > WARN threshold
   - Failed (tolerant types)
   - Pending > 2h
3. **Level 3 (CRIT):**
   - Backup age > CRIT threshold
   - Failed/Missed (strict types)
**Notification delays:**
```
Setup > Notifications > Rules
WARN: Notify after 15 minutes (allow recovery)
CRIT: Notify immediately
```
---
### 8.3 Maintenance Windows
**Pause backup services during maintenance:**
```
Setup > Services > Service monitoring rules > Disabled checks
Conditions:
- Service labels: backup_system = tsm
- Timeperiod: maintenance_window
Action: Disable active checks
```
---
### 8.4 Documentation
**Document per installation:**
1. **CSV export source:** which TSM server, which queries
2. **CSV transfer method:** NFS/SCP/Rsync plus schedule
3. **Custom types:** list of all added backup types
4. **Adjusted thresholds:** rationale for any deviations
5. **Contacts:** who is responsible for the TSM backups
---
### 8.5 Regular Maintenance
**Monthly:**
- Clean up the CSV directory (delete old files)
- Verify that all expected nodes are found
- Analyze the alert history: any false positives?
**Quarterly:**
- Review thresholds and adjust if necessary
- Document new backup types
- Check the check plugin for updates
**Yearly:**
- Test CheckMK upgrade compatibility
- Performance review (agent runtime, check duration)
- Architecture review (is the solution still a good fit?)
---
## Appendix
### A. Glossary
| Term | Description |
|---------|--------------|
| **Agent plugin** | Script on the monitored host that delivers data to CheckMK |
| **Check plugin** | Code on the CheckMK server that creates services and evaluates their state |
| **Service label** | Key-value pair attached to a service (filtering/reporting) |
| **Discovery** | Process by which CheckMK creates services automatically |
| **Threshold** | WARN/CRIT limit for a metric |
| **Node** | TSM term for a backup client |
| **Schedule** | TSM term for a scheduled backup job |
---
### B. Error Reference
| Error | Cause | Solution |
|--------|---------|--------|
| `Backup not found in data` | Node exists in discovery but not in the current agent output | Check the CSV files; re-discover if necessary |
| `Empty agent section` | Agent delivers no data | Check the agent plugin execution and the CSV directory |
| `JSON decode error` | Agent output is not valid JSON | Test the agent plugin manually and look for errors in the output |
| `Unknown State` | Unexpected status from TSM | Check the agent output; extend `calculate_state()` if necessary |
---
### C. TSM Query for the CSV Export
**Example query for the TSM server (dsmadmc):**
```sql
SELECT
  DATE(END_TIME) || ' ' || TIME(END_TIME) AS DATETIME,
  ENTITY,
  NODE_NAME,
  SCHEDULE_NAME,
  RESULT
FROM ACTLOG
WHERE
  SCHEDULE_NAME IS NOT NULL
  AND SCHEDULE_NAME != ''
  AND TIMESTAMPDIFF(4, CHAR(CURRENT_TIMESTAMP - END_TIME)) <= 24
ORDER BY END_TIME DESC
```
**Export as CSV:**
```bash
dsmadmc -id=admin -pa=password -comma \
  "SELECT ... FROM ACTLOG ..." \
  > /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
```
---
**End of the technical documentation**
**Last updated:** 2026-01-12  
**Version:** 4.1  
**Author:** Marius Gielnik