# TSM Backup Monitoring - Technical Documentation
## Table of Contents
1. [Architecture Details](#1-architecture-details)
2. [API Reference](#2-api-reference)
3. [Advanced Configuration](#3-advanced-configuration)
4. [Development Guide](#4-development-guide)
5. [Performance Optimization](#5-performance-optimization)
6. [Security](#6-security)
7. [Integration](#7-integration)
8. [Best Practices](#8-best-practices)
---
## 1. Architecture Details
### 1.1 Component Overview
#### Agent Plugin (`tsm_backups`)
**Location:** `/usr/lib/check_mk_agent/plugins/tsm_backups`
**Tasks:**
- read the CSV files from `/mnt/CMK_TSM`
- normalize node names (strip RRZ*/NFRZ* prefixes)
- aggregate the backup data per node
- generate JSON output for the CheckMK agent
**Execution:**
- runs on every agent call
- default interval: 60 seconds (CheckMK default)
- can be configured to run asynchronously (see [Async Agent Plugin](#32-async-agent-plugin))
**Output format:**
```json
{
  "SERVER_MSSQL": {
    "statuses": ["Completed", "Completed", "Failed"],
    "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
    "last": 1736693420,
    "count": 3
  },
  "DATABASE_HANA": {
    "statuses": ["Completed"],
    "schedules": ["00-00-00_FULL"],
    "last": 1736690000,
    "count": 1
  }
}
```
#### Check Plugin (`tsm_backups.py`)
**Location:** `/omd/sites/<site>/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py`
**Tasks:**
- parse the JSON from the agent
- discover services with labels
- evaluate the backup status
- generate metrics
- check thresholds
**CheckMK API version:** v2 (cmk.agent_based.v2)
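The section parser itself is tiny. A minimal sketch, assuming the agent emits the whole JSON blob on a single line (which is what the `sep(0)` section header implies):

```python
import json

def parse_tsm_backups(string_table):
    """Parse the <<<tsm_backups>>> section: one line holding a JSON blob."""
    try:
        return json.loads(string_table[0][0])
    except (IndexError, ValueError):
        # An empty or garbled section yields no parsed data
        # (and therefore no services).
        return None

section = parse_tsm_backups([['{"SERVER_MSSQL": {"count": 3}}']])
```

Returning `None` for a broken section lets discovery and checking skip the host gracefully instead of crashing.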
### 1.2 Data Flow Diagram
```
┌──────────────────────────────────────────────────────────────────
│ TSM Server
│
│   SELECT
│     DATE_TIME, ENTITY, NODE_NAME, SCHEDULE_NAME, RESULT
│   FROM ACTLOG
│   WHERE TIMESTAMP > CURRENT_TIMESTAMP - 24 HOURS
│
│   ↓ export as CSV
│   /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
└──────────────────────────────────────────────────────────────────
                            │
                            │ NFS/SCP/rsync
                            ▼
┌──────────────────────────────────────────────────────────────────
│ Host: /mnt/CMK_TSM/
│   ├── TSM_BACKUP_SCHED_24H.CSV
│   ├── TSM_DB_SCHED_24H.CSV
│   └── TSM_FILE_SCHED_24H.CSV
└──────────────────────────────────────────────────────────────────
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────
│ Agent plugin: /usr/lib/check_mk_agent/plugins/tsm_backups
│
│   1. List the CSV files in /mnt/CMK_TSM
│   2. Parse each line:
│      - extract: timestamp, node, schedule, status
│      - validate the node (length, MAINTENANCE)
│      - normalize the node name:
│        RRZ01_SERVER_MSSQL → _SERVER_MSSQL → SERVER_MSSQL
│   3. Aggregate per node:
│      - collect all statuses
│      - collect all schedules
│      - find the latest timestamp
│      - count the jobs
│   4. Generate the JSON output
│
│   Output: <<<tsm_backups:sep(0)>>>
│           {"SERVER_MSSQL": {...}, ...}
└──────────────────────────────────────────────────────────────────
                            │
                            │ CheckMK agent protocol
                            ▼
┌──────────────────────────────────────────────────────────────────
│ CheckMK server: agent section parser
│
│   parse_tsm_backups(string_table):
│     - extract the JSON string from string_table[0][0]
│     - parse the JSON → Python dict
│     - return: {node: data, ...}
└──────────────────────────────────────────────────────────────────
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────
│ Service discovery
│
│   discover_tsm_backups(section):
│     FOR EACH node IN section:
│       1. Extract the metadata:
│          - backup_type = extract_backup_type(node)
│          - backup_level = extract_backup_level(schedules)
│          - frequency = extract_frequency(schedules)
│          - error_handling = get_error_handling(backup_type)
│          - category = get_backup_category(backup_type)
│       2. Create a service with labels:
│          Service(
│            item=node,
│            labels=[
│              ServiceLabel("backup_type", backup_type),
│              ServiceLabel("backup_category", category),
│              ...
│            ]
│          )
└──────────────────────────────────────────────────────────────────
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────
│ Service check execution
│
│   check_tsm_backups(item, section):
│     1. Load the node data from section[item]
│     2. Extract the metadata (as during discovery)
│     3. Compute the state:
│        - calculate_state() → (State, status_text)
│     4. Compute the backup age:
│        - age = now - last_timestamp
│     5. Fetch the thresholds:
│        - thresholds = get_thresholds(type, level)
│     6. Check the age against the thresholds
│     7. Generate the output:
│        - Result(state, summary)
│        - Metric("backup_age", age, levels)
│        - Metric("backup_jobs", count)
└──────────────────────────────────────────────────────────────────
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────
│ CheckMK service
│
│   Name: TSM Backup SERVER_MSSQL
│   State: OK
│   Summary: Type=MSSQL (database), Level=FULL, Freq=daily,
│            Status=Completed, Last=3h 15m, Jobs=3
│   Metrics:
│     - backup_age: 11700s (warn: 93600s, crit: 172800s)
│     - backup_jobs: 3
│   Labels:
│     - backup_type: mssql
│     - backup_category: database
│     - frequency: daily
│     - backup_level: full
│     - error_handling: strict
└──────────────────────────────────────────────────────────────────
```
### 1.3 Node Normalization in Detail
**Purpose:** TSM environments with redundant servers (e.g. RRZ01, RRZ02, NFRZ01) should be monitored as a single logical node.
**Algorithm:** (the body below is a sketch that follows the steps in the docstring)
```python
import re

def normalize_node_name(node):
    """
    Input: "RRZ01_MYSERVER_MSSQL"

    Step 1: strip an RRZ*/NFRZ*/RZ* prefix followed by an underscore
            pattern: r'(RRZ|NFRZ|RZ)\d+(_)'
            result: "_MYSERVER_MSSQL"

    Step 2: strip the leading underscore
            result: "MYSERVER_MSSQL"

    Step 3: strip an RRZ*/NFRZ*/RZ* suffix at the end of the name
            pattern: r'(RRZ|NFRZ|RZ)\d+$'
            result: "MYSERVER_MSSQL"

    Output: "MYSERVER_MSSQL"
    """
    node = re.sub(r'(RRZ|NFRZ|RZ)\d+(_)', r'\2', node)  # step 1
    node = node.lstrip('_')                              # step 2
    # Step 3 also eats a preceding underscore, so "SERVER_FILE_RRZ01"
    # collapses to "SERVER_FILE" (see the table below).
    node = re.sub(r'_?(RRZ|NFRZ|RZ)\d+$', '', node)      # step 3
    return node
```
**Examples:**
| Original node | Normalized | Result |
|---------------|------------|--------|
| `RRZ01_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ |
| `RRZ02_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ merged |
| `NFRZ01_DATABASE_HANA` | `DATABASE_HANA` | ✅ |
| `SERVER_FILE_RRZ01` | `SERVER_FILE` | ✅ |
| `MYSERVER_ORACLE` | `MYSERVER_ORACLE` | ✅ unchanged |
---
## 2. API Reference
### 2.1 Agent Plugin Functions
#### `TSMParser.normalize_node_name(node: str) -> str`
Normalizes TSM node names for the redundancy logic.
**Parameters:**
- `node` (str): original TSM node name
**Returns:**
- `str`: normalized node name
**Example:**
```python
parser = TSMParser()
normalized = parser.normalize_node_name("RRZ01_SERVER_MSSQL")
# normalized == "SERVER_MSSQL"
```
---
#### `TSMParser.is_valid_node(node: str, status: str) -> bool`
Checks whether a node is valid for monitoring.
**Parameters:**
- `node` (str): node name
- `status` (str): backup status
**Returns:**
- `bool`: True if valid, False otherwise
**Validation rules:**
- the node must exist (not empty)
- the node must be at least 3 characters long
- the status must exist
- the node must not contain "MAINTENANCE"
**Example:**
```python
parser.is_valid_node("AB", "Completed")  # False (too short)
parser.is_valid_node("SERVER_MSSQL", "Completed")  # True
parser.is_valid_node("SERVER_MAINTENANCE", "Completed")  # False
```
---
#### `TSMParser.parse_csv(csv_file: Path) -> None`
Parses a TSM CSV file and collects the backup information.
**Parameters:**
- `csv_file` (Path): path to the CSV file
**CSV format:**
```
TIMESTAMP,FIELD,NODE_NAME,SCHEDULE_NAME,STATUS
2026-01-12 08:00:00,SOMETHING,SERVER_MSSQL,DAILY_FULL,Completed
```
**Side effects:**
- appends the parsed backups to `self.backups`
---
#### `TSMParser.aggregate() -> dict`
Aggregates the backup data per normalized node.
**Returns:**
```python
{
    "SERVER_MSSQL": {
        "statuses": ["Completed", "Completed", "Failed"],
        "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
        "last": 1736693420,  # Unix timestamp
        "count": 3
    }
}
```
---
### 2.2 Check Plugin Functions
#### `extract_backup_type(node: str) -> str`
Extracts the backup type from the node name based on a list of known types.
**Parameters:**
- `node` (str): normalized node name
**Returns:**
- `str`: backup type in lowercase, or "unknown"
**Known types:**
- databases: MSSQL, HANA, Oracle, DB2, MySQL
- virtualization: Virtual
- filesystems: FILE, SCALE, DM, Datacenter
- applications: Mail
**Algorithm:**
1. Split the node name at underscores
2. Take the last segment
3. If the last segment is numeric, take the second-to-last segment
4. Match it against the list of known types
5. Return the lowercase type, or "unknown"
**Examples:**
```python
extract_backup_type("SERVER_MSSQL")           # → "mssql"
extract_backup_type("DATABASE_HANA_01")       # → "hana"
extract_backup_type("FILESERVER_FILE")        # → "file"
extract_backup_type("VM_HYPERV_123")          # → "hyperv"
extract_backup_type("APP_UNKNOWN")            # → "unknown"
```
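The steps above can be sketched as follows. The `KNOWN_TYPES` set here is illustrative (it includes `HYPERV` so that the example above resolves); the authoritative list lives in `extract_backup_type()` in `tsm_backups.py`:

```python
# Illustrative known-types set; the real plugin defines its own list.
KNOWN_TYPES = {
    'MSSQL', 'HANA', 'ORACLE', 'DB2', 'MYSQL',   # databases
    'VIRTUAL', 'HYPERV', 'VMWARE',               # virtualization
    'FILE', 'SCALE', 'DM', 'DATACENTER',         # filesystems
    'MAIL',                                      # applications
}

def extract_backup_type(node: str) -> str:
    """Return the backup type encoded in the last node-name segment."""
    parts = node.split('_')
    segment = parts[-1]
    # A trailing numeric segment (e.g. "_01") is an instance counter,
    # so fall back to the segment before it.
    if segment.isdigit() and len(parts) > 1:
        segment = parts[-2]
    return segment.lower() if segment.upper() in KNOWN_TYPES else "unknown"
```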
---
#### `extract_backup_level(schedules: list[str]) -> str`
Extracts the backup level from the schedule names.
**Parameters:**
- `schedules` (list[str]): list of schedule names
**Returns:**
- `str`: `"log"`, `"full"`, `"incremental"`, or `"differential"`
**Priority:** log > full > differential > incremental
**Detection patterns:**
- `_LOG` or `LOG` → log
- `_FULL` or `FULL` → full
- `_INCR` or `INCREMENTAL` → incremental
- `_DIFF` or `DIFFERENTIAL` → differential
**Examples:**
```python
extract_backup_level(["DAILY_FULL"])                    # → "full"
extract_backup_level(["HOURLY_LOG", "DAILY_FULL"])     # → "log"
extract_backup_level(["00-00-00_FULL"])                 # → "full"
```
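The priority rule can be implemented as a single ordered scan. This is a sketch; the fallback level when no marker matches is an assumption (the documented behavior above does not specify one):

```python
def extract_backup_level(schedules: list[str]) -> str:
    """Pick the highest-priority level found across all schedule names."""
    joined = [s.upper() for s in schedules]
    # Priority order from the docs: log > full > differential > incremental.
    for level, markers in (
        ("log", ("LOG",)),
        ("full", ("FULL",)),
        ("differential", ("DIFF",)),       # "DIFF" also matches DIFFERENTIAL
        ("incremental", ("INCR",)),        # "INCR" also matches INCREMENTAL
    ):
        if any(m in s for s in joined for m in markers):
            return level
    # Fallback when no marker matches (assumption, not specified above).
    return "incremental"
```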
---
#### `extract_frequency(schedules: list[str]) -> str`
Extracts the backup frequency from the schedule names.
**Parameters:**
- `schedules` (list[str]): list of schedule names
**Returns:**
- `str`: `"hourly"`, `"daily"`, `"weekly"`, `"monthly"`, or `"unknown"`
**Priority:** hourly > daily > weekly > monthly
**Detection patterns:**
- `HOURLY` → hourly
- `DAILY` → daily
- `WEEKLY` → weekly
- `MONTHLY` → monthly
- `HH-MM-SS_*LOG` → hourly (time-based with LOG)
- `00-00-00_*` → daily (midnight)
**Examples:**
```python
extract_frequency(["DAILY_FULL"])               # → "daily"
extract_frequency(["00-00-00_FULL"])            # → "daily"
extract_frequency(["08-00-00_LOG"])             # → "hourly"
extract_frequency(["WEEKLY_FULL", "DAILY_DIFF"]) # → "daily"
```
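A sketch of the detection logic, assuming time-prefixed schedule names always start with an `HH-MM-SS` token as shown in the pattern list above:

```python
import re

def extract_frequency(schedules: list[str]) -> str:
    """Return the highest-priority frequency found across all schedules."""
    found = set()
    for s in schedules:
        u = s.upper()
        for word in ("HOURLY", "DAILY", "WEEKLY", "MONTHLY"):
            if word in u:
                found.add(word.lower())
        # Time-prefixed schedules: "HH-MM-SS_..." containing LOG runs
        # hourly; a midnight prefix "00-00-00_..." implies a daily job.
        if re.match(r'^\d{2}-\d{2}-\d{2}', u):
            if "LOG" in u:
                found.add("hourly")
            elif u.startswith("00-00-00"):
                found.add("daily")
    # Priority order from the docs: hourly > daily > weekly > monthly.
    for freq in ("hourly", "daily", "weekly", "monthly"):
        if freq in found:
            return freq
    return "unknown"
```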
---
#### `get_error_handling(backup_type: str) -> str`
Determines the error-handling strategy for a backup type.
**Parameters:**
- `backup_type` (str): backup type
**Returns:**
- `str`: `"tolerant"` or `"strict"`
**Logic:**
```python
if backup_type in TOLERANT_TYPES:
    return "tolerant"  # Failed → WARN
else:
    return "strict"    # Failed → CRIT
```
**Tolerant types:**
- file, virtual, scale, dm, datacenter
- vmware, hyperv, mail, exchange
**Strict:**
- all database types (mssql, hana, oracle, db2, ...)
- all other types
---
#### `get_backup_category(backup_type: str) -> str`
Categorizes a backup type into a top-level category.
**Parameters:**
- `backup_type` (str): backup type
**Returns:**
- `str`: `"database"`, `"virtualization"`, `"filesystem"`, `"application"`, or `"other"`
**Categories:**
| Category | Types |
|----------|-------|
| `database` | mssql, hana, oracle, db2, mysql, postgres, mariadb, sybase, mongodb |
| `virtualization` | virtual, vmware, hyperv, kvm, xen |
| `filesystem` | file, scale, dm, datacenter |
| `application` | mail, exchange |
| `other` | all others |
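The mapping above boils down to a few set lookups; a minimal sketch:

```python
# Category sets taken from the table above.
DATABASE_TYPES = {'mssql', 'hana', 'oracle', 'db2', 'mysql',
                  'postgres', 'mariadb', 'sybase', 'mongodb'}
VIRTUALIZATION_TYPES = {'virtual', 'vmware', 'hyperv', 'kvm', 'xen'}
FILESYSTEM_TYPES = {'file', 'scale', 'dm', 'datacenter'}
APPLICATION_TYPES = {'mail', 'exchange'}

def get_backup_category(backup_type: str) -> str:
    """Map a backup type to its top-level category."""
    for category, types in (
        ("database", DATABASE_TYPES),
        ("virtualization", VIRTUALIZATION_TYPES),
        ("filesystem", FILESYSTEM_TYPES),
        ("application", APPLICATION_TYPES),
    ):
        if backup_type in types:
            return category
    return "other"
```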
---
#### `get_thresholds(backup_type: str, backup_level: str) -> dict`
Returns type- and level-specific thresholds.
**Parameters:**
- `backup_type` (str): backup type
- `backup_level` (str): backup level
**Returns:**
```python
{
    "warn": 93600,   # seconds
    "crit": 172800   # seconds
}
```
**Priority:**
1. If `backup_level == "log"` → LOG thresholds (4h/8h)
2. If `backup_type` is in THRESHOLDS → type-specific thresholds
3. Otherwise → default thresholds (26h/48h)
**Examples:**
```python
get_thresholds("mssql", "log")   # → {"warn": 14400, "crit": 28800}
get_thresholds("mssql", "full")  # → {"warn": 93600, "crit": 172800}
get_thresholds("newtype", "full") # → {"warn": 93600, "crit": 172800}
```
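The priority chain maps directly onto three early returns. The tables here are illustrative (only `mssql` is shown; the real `THRESHOLDS` dict in `tsm_backups.py` has one entry per known type):

```python
HOUR = 3600

# Illustrative threshold tables (see section 3.1 for the real ones).
LOG_THRESHOLDS = {"warn": 4 * HOUR, "crit": 8 * HOUR}
DEFAULT_THRESHOLDS = {"warn": 26 * HOUR, "crit": 48 * HOUR}
THRESHOLDS = {
    "mssql": {"warn": 26 * HOUR, "crit": 48 * HOUR},
}

def get_thresholds(backup_type: str, backup_level: str) -> dict:
    # 1. Log backups run far more often, so they age out much faster.
    if backup_level == "log":
        return LOG_THRESHOLDS
    # 2. Type-specific thresholds, if configured.
    if backup_type in THRESHOLDS:
        return THRESHOLDS[backup_type]
    # 3. Fallback for unknown types.
    return DEFAULT_THRESHOLDS
```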
---
#### `calculate_state(statuses: list[str], last_time: int, backup_type: str, error_handling: str) -> tuple[State, str]`
Computes the CheckMK state from the backup statuses.
**Parameters:**
- `statuses` (list[str]): list of all backup statuses
- `last_time` (int): Unix timestamp of the last backup
- `backup_type` (str): backup type
- `error_handling` (str): "tolerant" or "strict"
**Returns:**
- `tuple`: `(State, status_text)`
**State logic table:**
| Condition | Age | Error handling | State | Text |
|-----------|-----|----------------|-------|------|
| ≥1x "completed" | - | - | OK | "Completed" |
| only "pending"/"started" | <2h | - | OK | "Pending/Started" |
| only "pending"/"started" | >2h | - | WARN | "Pending (>2h)" |
| only "pending"/"started" | unknown | - | WARN | "Pending" |
| "failed"/"missed" | - | tolerant | WARN | "Failed (partial)" |
| "failed"/"missed" | - | strict | CRIT | "Failed/Missed" |
| other | - | - | CRIT | "Unknown State" |
**Examples:**
```python
calculate_state(["Completed", "Completed"], 1736690000, "mssql", "strict")
# → (State.OK, "Completed")
calculate_state(["Failed"], 1736690000, "file", "tolerant")
# → (State.WARN, "Failed (partial)")
calculate_state(["Failed"], 1736690000, "mssql", "strict")
# → (State.CRIT, "Failed/Missed")
```
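The logic table translates into a short decision cascade. This sketch uses a stand-in `State` enum so it runs outside a CheckMK site (the real plugin imports `State` from `cmk.agent_based.v2`); `backup_type` is kept only for signature parity, as the table above does not use it:

```python
import time
from enum import Enum

# Stand-in for cmk.agent_based.v2.State so the sketch runs standalone.
class State(Enum):
    OK = 0
    WARN = 1
    CRIT = 2

PENDING_GRACE = 2 * 3600  # pending/started jobs get a 2 h grace period

def calculate_state(statuses, last_time, backup_type, error_handling):
    lowered = {s.lower() for s in statuses}
    # At least one completed run wins.
    if "completed" in lowered:
        return State.OK, "Completed"
    # Only pending/started jobs: OK within the grace period.
    if lowered and lowered <= {"pending", "started"}:
        if last_time is None:
            return State.WARN, "Pending"
        if time.time() - last_time < PENDING_GRACE:
            return State.OK, "Pending/Started"
        return State.WARN, "Pending (>2h)"
    # Failures: severity depends on the error-handling strategy.
    if lowered & {"failed", "missed"}:
        if error_handling == "tolerant":
            return State.WARN, "Failed (partial)"
        return State.CRIT, "Failed/Missed"
    return State.CRIT, "Unknown State"
```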
---
### 2.3 CheckMK API v2 Objects
#### `Service`
Defines a CheckMK service during discovery.
```python
from cmk.agent_based.v2 import Service, ServiceLabel
Service(
    item="SERVER_MSSQL",
    labels=[
        ServiceLabel("backup_type", "mssql"),
        ServiceLabel("frequency", "daily"),
    ]
)
```
---
#### `Result`
Represents a check result.
```python
from cmk.agent_based.v2 import Result, State
Result(
    state=State.OK,
    summary="Type=MSSQL, Status=Completed, Last=3h"
)
Result(
    state=State.OK,
    notice="Detailed information for details page"
)
```
---
#### `Metric`
Defines a performance metric.
```python
from cmk.agent_based.v2 import Metric
Metric(
    name="backup_age",
    value=11700,                    # current value
    levels=(93600, 172800),         # (warn, crit)
    boundaries=(0, None),           # (min, max)
)
```
---
## 3. Advanced Configuration
### 3.1 Custom Backup Types
**Scenario:** a new backup type "SAPASE" (SAP ASE database) is to be monitored.
**Step 1: add the type to the known_types list**
```python
# In tsm_backups.py, extract_backup_type() function
known_types = [
    'MSSQL', 'HANA', 'FILE', 'ORACLE', 'DB2', 'SCALE', 'DM',
    'DATACENTER', 'VIRTUAL', 'MAIL', 'MYSQL',
    'SAPASE',  # NEW
]
```
**Step 2: define thresholds (optional)**
```python
THRESHOLDS = {
    # ... existing entries ...
    "sapase": {"warn": 26 * 3600, "crit": 48 * 3600},
}
```
**Step 3: add the type to the matching category**
```python
DATABASE_TYPES = {
    'mssql', 'hana', 'db2', 'oracle', 'mysql',
    'sapase',  # NEW
}
```
**Step 4: choose the error handling (optional)**
If tolerant handling is desired:
```python
TOLERANT_TYPES = {
    'file', 'virtual', 'scale', 'dm', 'datacenter',
    'vmware', 'hyperv', 'mail', 'exchange',
    'sapase',  # NEW (if tolerant handling is desired)
}
```
**Step 5: reload the plugin**
```bash
cmk -R
cmk -II --all
```
**Result:**
- nodes such as `SERVER_SAPASE` are detected automatically
- type label: `backup_type=sapase`
- category label: `backup_category=database`
- thresholds: 26h/48h
---
### 3.2 Async Agent Plugin
In large TSM environments the CSV parsing can take a while. Async plugins run independently of the agent interval.
**Configuration:**
```bash
# As root on the host: move the plugin into a numbered subdirectory.
# The directory name is the execution interval in seconds (here: 5 minutes);
# the agent caches the output and reuses it between runs.
mkdir -p /usr/lib/check_mk_agent/plugins/300
mv /usr/lib/check_mk_agent/plugins/tsm_backups \
   /usr/lib/check_mk_agent/plugins/300/tsm_backups
```
**Or via a CheckMK Bakery rule:**
```
Setup > Agents > Agent Rules > Asynchronous execution of plugins (Windows, Linux)
```
**Settings:**
- plugin: `tsm_backups`
- execution interval: `300` seconds (5 minutes)
- cache age: `600` seconds (10 minutes)
---
### 3.3 CSV Export Automation
#### Option A: NFS mount (recommended)
```bash
# /etc/fstab
tsm-server.example.com:/exports/backup-stats  /mnt/CMK_TSM  nfs  defaults,ro  0  0
# Test the mount
mount -a
ls /mnt/CMK_TSM/
```
#### Option B: rsync via cron
```bash
# Crontab for root
*/15 * * * * rsync -az --delete tsm-server:/path/to/csv/ /mnt/CMK_TSM/
```
#### Option C: scp with an SSH key
```bash
# Set up the SSH key
ssh-keygen -t ed25519 -f ~/.ssh/tsm_backup_key -N ""
ssh-copy-id -i ~/.ssh/tsm_backup_key.pub tsm-server
# Crontab
*/15 * * * * scp -i ~/.ssh/tsm_backup_key tsm-server:/path/*.CSV /mnt/CMK_TSM/
```
---
### 3.4 Rule-based Service Creation
**CheckMK rules for automatic service labels:**
```
Setup > Services > Discovery rules > Host labels
```
**Example rule:**
```yaml
conditions:
  service_labels:
    backup_category: database
 
actions:
  add_labels:
    criticality: high
    team: dba
```
---
### 3.5 Custom Views
#### View: All Critical Database Backups
```
Setup > General > Custom views > Create new view
Name: Critical Database Backups
Datasource: All services
Filters:
- Service state: CRIT
- Service labels: backup_category = database
Columns:
- Host
- Service description
- Service state
- Service output
- Service labels: backup_type
- Service labels: frequency
- Perf-O-Meter
```
---
### 3.6 Custom Notifications
**Notification rule: escalate only strict failed backups**
```
Setup > Notifications > Add rule
Conditions:
- Service labels: error_handling = strict
- Service state: CRIT
- Service state type: HARD
Contact selection:
- Specify users: dba-team
Notification method:
- Email
- PagerDuty
```
---
## 4. Development Guide
### 4.1 Setting up the Development Environment
```bash
# CheckMK site for development
omd create dev
omd start dev
su - dev
# Git repository
cd ~/local/lib/python3/cmk_addons/plugins/
git init
git add .
git commit -m "Initial commit"
# Development workflow
vim tsm/agent_based/tsm_backups.py
cmk -R
cmk -vv --debug test-host | grep "TSM Backup"
```
---
### 4.2 Writing Unit Tests
**Test file:** `test_tsm_backups.py`
```python
#!/usr/bin/env python3
import pytest
from tsm_backups import (
    extract_backup_type,
    extract_backup_level,
    calculate_state,
)
from cmk.agent_based.v2 import State
def test_extract_backup_type():
    assert extract_backup_type("SERVER_MSSQL") == "mssql"
    assert extract_backup_type("DATABASE_HANA_01") == "hana"
    assert extract_backup_type("NEWTYPE_CUSTOM") == "custom"
def test_extract_backup_level():
    assert extract_backup_level(["DAILY_FULL"]) == "full"
    assert extract_backup_level(["HOURLY_LOG", "DAILY_FULL"]) == "log"
def test_calculate_state_completed():
    state, text = calculate_state(
        ["Completed", "Completed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.OK
    assert text == "Completed"
def test_calculate_state_failed_strict():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.CRIT
    assert text == "Failed/Missed"
def test_calculate_state_failed_tolerant():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "file",
        "tolerant"
    )
    assert state == State.WARN
    assert text == "Failed (partial)"
```
**Running the tests:**
```bash
pytest test_tsm_backups.py -v
```
---
### 4.3 Code Style
**PEP 8 compliance:**
```bash
pip install black flake8 mypy
# Auto-formatting
black tsm_backups.py
# Linting
flake8 tsm_backups.py
# Type checking
mypy tsm_backups.py
```
---
### 4.4 Debugging
#### Debugging the agent plugin
```bash
# Direct invocation with traceback
python3 /usr/lib/check_mk_agent/plugins/tsm_backups
# With the debugger
python3 -m pdb /usr/lib/check_mk_agent/plugins/tsm_backups
```
#### Debugging the check plugin
```bash
# Verbose check with debug output
cmk -vv --debug hostname | less
# Only the TSM services
cmk -vv --debug hostname | grep -A 20 "TSM Backup"
# Python debugger inside the plugin (add this line to the code)
import pdb; pdb.set_trace()
```
---
### 4.5 Performance Profiling
```python
# In tsm_backups.py
import cProfile
import pstats
def main():
    profiler = cProfile.Profile()
    profiler.enable()

    # ... existing code ...

    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)
```
---
## 5. Performance Optimization
### 5.1 Speeding up CSV Parsing
**Problem:** large CSV files (>100 MB) slow down the agent.
**Solution 1: parse only the relevant lines**
```python
def parse_csv_optimized(self, csv_file):
    # Only the last 24 h are relevant
    cutoff_time = datetime.now() - timedelta(hours=24)

    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            try:
                time_str = row[0].strip()
                timestamp = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")

                # Skip old entries
                if timestamp < cutoff_time:
                    continue

                # ... remaining processing ...
            except (ValueError, IndexError):
                # Skip malformed lines instead of aborting
                continue
```
**Solution 2: pandas for large files**
```python
import pandas as pd

def parse_csv_pandas(csv_file):
    df = pd.read_csv(
        csv_file,
        names=['timestamp', 'field', 'node', 'schedule', 'status'],
        parse_dates=['timestamp'],
    )

    # Keep only the last 24 h
    cutoff = pd.Timestamp.now() - pd.Timedelta(hours=24)
    df = df[df['timestamp'] > cutoff]

    # Aggregation (named aggregation; the 'node' column is the group key
    # and must not itself appear in the aggregation spec)
    grouped = df.groupby('node').agg(
        statuses=('status', list),
        schedules=('schedule', list),
        last=('timestamp', 'max'),
        count=('status', 'size'),
    )

    return grouped.to_dict(orient='index')
```
---
### 5.2 Caching
**Problem:** the CSV files change only every 15-30 minutes.
**Solution: a cache with a timestamp check**
```python
import json
from pathlib import Path
import time
CACHE_FILE = Path("/tmp/tsm_backups_cache.json")
CACHE_TTL = 300  # 5 minutes
def get_cached_or_parse():
    if CACHE_FILE.exists():
        cache_age = time.time() - CACHE_FILE.stat().st_mtime
        if cache_age < CACHE_TTL:
            with open(CACHE_FILE, 'r') as f:
                return json.load(f)

    # Parse fresh
    parser = TSMParser()
    # ... parse logic ...
    result = parser.aggregate()

    # Write the cache
    with open(CACHE_FILE, 'w') as f:
        json.dump(result, f)

    return result
```
---
### 5.3 Memory Optimization
**Problem:** large lists of status/schedule strings.
**Solution: store only unique values**
```python
def aggregate_optimized(self):
    nodes = defaultdict(lambda: {
        "statuses": set(),       # set instead of list
        "schedules": set(),
        "last": None,
        "count": 0,
    })

    for b in self.backups:
        node = b["node"]
        nodes[node]["count"] += 1
        nodes[node]["statuses"].add(b["status"])  # automatically unique
        nodes[node]["schedules"].add(b["schedule"])
        # ... rest ...

    # Convert the sets to lists for JSON
    for node in nodes:
        nodes[node]["statuses"] = list(nodes[node]["statuses"])
        nodes[node]["schedules"] = list(nodes[node]["schedules"])

    return nodes
```
---
## 6. Security
### 6.1 File Permissions
```bash
# Agent plugin
chown root:root /usr/lib/check_mk_agent/plugins/tsm_backups
chmod 755 /usr/lib/check_mk_agent/plugins/tsm_backups
# CSV directory
chown root:root /mnt/CMK_TSM
chmod 755 /mnt/CMK_TSM
chmod 644 /mnt/CMK_TSM/*.CSV
# Check plugin (as the site user; $OMD_ROOT points at the site directory)
chown <site>:<site> $OMD_ROOT/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
chmod 644 $OMD_ROOT/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
```
---
### 6.2 Input Validation
**Agent plugin:**
```python
def is_valid_node(self, node, status):
    # Check the length
    if not node or len(node) < 3 or len(node) > 200:
        return False

    # Disallowed characters
    if not re.match(r'^[A-Za-z0-9_-]+$', node):
        return False

    # Status whitelist
    valid_statuses = ['Completed', 'Failed', 'Missed', 'Pending', 'Started']
    if status not in valid_statuses:
        return False

    return True
```
---
### 6.3 Safe CSV Processing
```python
def parse_csv_safe(self, csv_file):
    try:
        # Check the file size (max. 500 MB)
        if csv_file.stat().st_size > 500 * 1024 * 1024:
            return

        with open(csv_file, 'r', encoding='utf-8') as f:
            reader = csv.reader(f)

            line_count = 0
            for row in reader:
                line_count += 1

                # At most 1 million lines
                if line_count > 1000000:
                    break

                # ... processing ...
    except Exception:
        # Log instead of crashing
        pass
```
---
## 7. Integration
### 7.1 Grafana Dashboard
**InfluxDB query:**
```sql
SELECT
  mean("backup_age") AS "avg_age",
  max("backup_age") AS "max_age"
FROM "tsm_backups"
WHERE
  "backup_category" = 'database'
  AND time > now() - 7d
GROUP BY
  time(1h),
  "node_name"
```
**Panels:**
- backup age heatmap (per node)
- status distribution (pie chart)
- backup jobs timeline
- alert history
---
### 7.2 Prometheus Exporter
**Configuring the CheckMK Prometheus exporter:**
```
Setup > Exporter > Prometheus
Metrics:
- cmk_tsm_backups_backup_age_seconds
- cmk_tsm_backups_backup_jobs_total
Labels:
- backup_type
- backup_category
- frequency
```
---
### 7.3 REST API Access
```python
import requests
# CheckMK REST API
url = "https://checkmk.example.com/site/check_mk/api/1.0"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Accept": "application/json"
}
# Query all TSM services
response = requests.get(
    f"{url}/domain-types/service/collections/all",
    headers=headers,
    params={
        "query": '{"op": "and", "expr": [{"op": "~", "left": "description", "right": "TSM Backup"}]}'
    }
)
services = response.json()
```
---
## 8. Best Practices
### 8.1 Naming Conventions
**Node names:**
```
✅ RECOMMENDED:
- SERVER_MSSQL
- APP_ORACLE_01
- FILESERVER_BACKUP
❌ AVOID:
- MSSQL (too generic)
- SERVER-PROD (hyphens can cause problems)
- very_long_name_that_is_too_descriptive_mssql_backup_node (>50 characters)
```
**Schedule names:**
```
✅ RECOMMENDED:
- DAILY_FULL
- HOURLY_LOG
- WEEKLY_FULL
❌ AVOID:
- PROD_BACKUP (no recognizable frequency)
- BACKUP01 (no information)
```
---
### 8.2 Monitoring Strategy
**Alert escalation:**
1. **Level 1 (INFO):** backup started/pending
2. **Level 2 (WARN):**
   - backup age > WARN threshold
   - failed (tolerant types)
   - pending > 2h
3. **Level 3 (CRIT):**
   - backup age > CRIT threshold
   - failed/missed (strict types)
**Notification delays:**
```
Setup > Notifications > Rules
WARN: notify after 15 minutes (allow recovery)
CRIT: notify immediately
```
---
### 8.3 Maintenance Windows
**Pausing backup services during maintenance:**
```
Setup > Services > Service monitoring rules > Disabled checks
Conditions:
- Service labels: backup_system = tsm
- Timeperiod: maintenance_window
Action: Disable active checks
```
---
### 8.4 Documentation
**Document per installation:**
1. **CSV export source:** which TSM server, which queries
2. **CSV transfer method:** NFS/SCP/rsync + schedule
3. **Custom types:** list of all added backup types
4. **Adjusted thresholds:** rationale for any deviations
5. **Contacts:** who is responsible for the TSM backups
---
### 8.5 Regular Maintenance
**Monthly:**
- clean up the CSV directory (delete old files)
- verify that all expected nodes are found
- analyze the alert history: any false positives?
**Quarterly:**
- review the thresholds and adjust them if necessary
- document new backup types
- check the check plugin for updates
**Yearly:**
- test CheckMK upgrade compatibility
- performance review (agent runtime, check duration)
- architecture review (is the solution still a good fit?)
---
## Appendix
### A. Glossary
| Term | Description |
|------|-------------|
| **Agent plugin** | script on the monitored host that delivers data to CheckMK |
| **Check plugin** | code on the CheckMK server that creates services and evaluates their state |
| **Service label** | key-value pair attached to a service (filtering/reporting) |
| **Discovery** | process by which CheckMK creates services automatically |
| **Threshold** | WARN/CRIT limit for a metric |
| **Node** | TSM term for a backup client |
| **Schedule** | TSM term for a scheduled backup job |
---
### B. Error Reference
| Error | Cause | Solution |
|-------|-------|----------|
| `Backup not found in data` | the node exists in the discovery but not in the current agent output | check the CSV files, re-run discovery if needed |
| `Empty agent section` | the agent delivers no data | check the agent plugin execution and the CSV directory |
| `JSON decode error` | the agent output is not valid JSON | test the agent plugin manually and look for errors in the output |
| `Unknown State` | unexpected status from TSM | check the agent output, extend `calculate_state()` if needed |
---
### C. TSM Query for the CSV Export
**Example query for the TSM server (dsmadmc):**
```sql
SELECT
  DATE(END_TIME) || ' ' || TIME(END_TIME) AS DATETIME,
  ENTITY,
  NODE_NAME,
  SCHEDULE_NAME,
  RESULT
FROM ACTLOG
WHERE
  SCHEDULE_NAME IS NOT NULL
  AND SCHEDULE_NAME != ''
  AND TIMESTAMPDIFF(4, CHAR(CURRENT_TIMESTAMP - END_TIME)) <= 24
ORDER BY END_TIME DESC
```
**Export as CSV:**
```bash
dsmadmc -id=admin -pa=password -comma \
  "SELECT ... FROM ACTLOG ..." \
  > /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
```
---
**End of the technical documentation**
**Last updated:** 2026-01-12  
**Version:** 4.1  
**Author:** Marius Gielnik