# TSM Backup Monitoring - Technical Documentation

## Table of Contents

1. [Architecture Details](#1-architecture-details)
2. [API Reference](#2-api-reference)
3. [Advanced Configuration](#3-advanced-configuration)
4. [Development Guide](#4-development-guide)
5. [Performance Optimization](#5-performance-optimization)
6. [Security](#6-security)
7. [Integration](#7-integration)
8. [Best Practices](#8-best-practices)

---

## 1. Architecture Details

### 1.1 Component Overview

#### Agent Plugin (`tsm_backups`)

**Location:** `/usr/lib/check_mk_agent/plugins/tsm_backups`

**Responsibilities:**
- Read the CSV files from `/mnt/CMK_TSM`
- Normalize node names (strip RRZ*/NFRZ* prefixes)
- Aggregate backup data per node
- Generate JSON output for the CheckMK agent

**Execution:**
- Runs on every agent invocation
- Default interval: 60 seconds (CheckMK default)
- Can be configured to run asynchronously (see [Async Agent Plugin](#32-async-agent-plugin))

**Output format:**
```json
{
  "SERVER_MSSQL": {
    "statuses": ["Completed", "Completed", "Failed"],
    "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
    "last": 1736693420,
    "count": 3
  },
  "DATABASE_HANA": {
    "statuses": ["Completed"],
    "schedules": ["00-00-00_FULL"],
    "last": 1736690000,
    "count": 1
  }
}
```

#### Check Plugin (`tsm_backups.py`)

**Location:** `/omd/sites/<site>/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py`

**Responsibilities:**
- Parse the JSON delivered by the agent
- Discover services with labels
- Evaluate the backup status
- Generate metrics
- Check thresholds

**CheckMK API version:** v2 (`cmk.agent_based.v2`)

### 1.2 Data Flow Diagram

```
┌──────────────────────────────────────────────────────────────────┐
│  TSM Server                                                      │
│                                                                  │
│  SELECT                                                          │
│    DATE_TIME, ENTITY, NODE_NAME, SCHEDULE_NAME, RESULT           │
│  FROM ACTLOG                                                     │
│  WHERE TIMESTAMP > CURRENT_TIMESTAMP - 24 HOURS                  │
│                                                                  │
│  ↓ Export as CSV                                                 │
│  /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV                  │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 │ NFS/SCP/Rsync
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Host: /mnt/CMK_TSM/                                             │
│  ├── TSM_BACKUP_SCHED_24H.CSV                                    │
│  ├── TSM_DB_SCHED_24H.CSV                                        │
│  └── TSM_FILE_SCHED_24H.CSV                                      │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Agent Plugin: /usr/lib/check_mk_agent/plugins/tsm_backups       │
│                                                                  │
│  1. List CSV files in /mnt/CMK_TSM                               │
│  2. Parse each line:                                             │
│     - Extract: timestamp, node, schedule, status                 │
│     - Validate node (length, MAINTENANCE)                        │
│     - Normalize node name:                                       │
│       RRZ01_SERVER_MSSQL → _SERVER_MSSQL → SERVER_MSSQL          │
│  3. Aggregate per node:                                          │
│     - Collect all statuses                                       │
│     - Collect all schedules                                      │
│     - Find latest timestamp                                      │
│     - Count jobs                                                 │
│  4. Generate JSON output                                         │
│                                                                  │
│  Output: <<<tsm_backups:sep(0)>>>                                │
│          {"SERVER_MSSQL": {...}, ...}                            │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 │ CheckMK Agent Protocol
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  CheckMK Server: Agent Section Parser                            │
│                                                                  │
│  parse_tsm_backups(string_table):                                │
│    - Extract the JSON string from string_table[0][0]             │
│    - Parse JSON → Python dict                                    │
│    - Return: {node: data, ...}                                   │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Service Discovery                                               │
│                                                                  │
│  discover_tsm_backups(section):                                  │
│    FOR EACH node IN section:                                     │
│      1. Extract metadata:                                        │
│         - backup_type = extract_backup_type(node)                │
│         - backup_level = extract_backup_level(schedules)         │
│         - frequency = extract_frequency(schedules)               │
│         - error_handling = get_error_handling(backup_type)       │
│         - category = get_backup_category(backup_type)            │
│      2. Create service with labels:                              │
│         Service(                                                 │
│             item=node,                                           │
│             labels=[                                             │
│                 ServiceLabel("backup_type", backup_type),        │
│                 ServiceLabel("backup_category", category),       │
│                 ...                                              │
│             ]                                                    │
│         )                                                        │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  Service Check Execution                                         │
│                                                                  │
│  check_tsm_backups(item, section):                               │
│    1. Load node data from section[item]                          │
│    2. Extract metadata (as in discovery)                         │
│    3. Compute state:                                             │
│       - calculate_state() → (State, status_text)                 │
│    4. Compute backup age:                                        │
│       - age = now - last_timestamp                               │
│    5. Fetch thresholds:                                          │
│       - thresholds = get_thresholds(type, level)                 │
│    6. Check age against thresholds                               │
│    7. Generate output:                                           │
│       - Result(state, summary)                                   │
│       - Metric("backup_age", age, levels)                        │
│       - Metric("backup_jobs", count)                             │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  CheckMK Service                                                 │
│                                                                  │
│  Name: TSM Backup SERVER_MSSQL                                   │
│  State: OK                                                       │
│  Summary: Type=MSSQL (database), Level=FULL, Freq=daily,         │
│           Status=Completed, Last=3h 15m, Jobs=3                  │
│  Metrics:                                                        │
│    - backup_age: 11700s (warn: 93600s, crit: 172800s)            │
│    - backup_jobs: 3                                              │
│  Labels:                                                         │
│    - backup_type: mssql                                          │
│    - backup_category: database                                   │
│    - frequency: daily                                            │
│    - backup_level: full                                          │
│    - error_handling: strict                                      │
└──────────────────────────────────────────────────────────────────┘
```

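The section-parser step of the diagram can be sketched as follows. This is a minimal sketch, assuming the agent prints the whole JSON document on a single line so that `string_table[0][0]` contains it. The empty-dict fallback for missing or malformed input is an assumption for illustration, not confirmed behavior of the real plugin.

```python
import json


def parse_tsm_backups(string_table):
    """Turn the raw agent section into {node: data} (illustrative sketch)."""
    # With sep(0), every agent output line arrives as a one-element list.
    if not string_table or not string_table[0]:
        return {}
    try:
        section = json.loads(string_table[0][0])
    except json.JSONDecodeError:
        # Assumption: malformed agent output is treated as "no data".
        return {}
    return section if isinstance(section, dict) else {}


example = [[
    '{"SERVER_MSSQL": {"statuses": ["Completed"], '
    '"schedules": ["DAILY_FULL"], "last": 1736693420, "count": 1}}'
]]
section = parse_tsm_backups(example)
print(sorted(section))  # ['SERVER_MSSQL']
```

Returning an empty dict keeps discovery and check functions simple, because they can iterate over the section without a separate `None` check.
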
### 1.3 Node Normalization in Detail

**Purpose:** TSM environments with redundant servers (e.g. RRZ01, RRZ02, NFRZ01) should be monitored as a single logical node.

**Algorithm:**

```python
def normalize_node_name(node):
    r"""
    Input: "RRZ01_MYSERVER_MSSQL"

    Step 1: Remove an RRZ*/NFRZ*/RZ* prefix that is followed by an underscore
            Pattern: r'(RRZ|NFRZ|RZ)\d+(_)'
            Result: "_MYSERVER_MSSQL"

    Step 2: Remove the leading underscore
            Result: "MYSERVER_MSSQL"

    Step 3: Remove an RRZ*/NFRZ*/RZ* suffix without underscore
            Pattern: r'(RRZ|NFRZ|RZ)\d+$'
            Result: "MYSERVER_MSSQL"

    Output: "MYSERVER_MSSQL"
    """
```

**Examples:**

| Original node | Normalized | Result |
|---------------|------------|--------|
| `RRZ01_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ |
| `RRZ02_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ Merged |
| `NFRZ01_DATABASE_HANA` | `DATABASE_HANA` | ✅ |
| `SERVER_FILE_RRZ01` | `SERVER_FILE` | ✅ |
| `MYSERVER_ORACLE` | `MYSERVER_ORACLE` | ✅ Unchanged |

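The three steps can be condensed into a short runnable sketch. It is not the plugin's actual code: it assumes the prefix is anchored at the start of the name, and it strips any underscore left over at either end, which is what makes `SERVER_FILE_RRZ01` come out as `SERVER_FILE`.

```python
import re

# Prefix such as "RRZ01_" (assumed to sit at the very start of the name).
_PREFIX = re.compile(r"^(RRZ|NFRZ|RZ)\d+_")
# Suffix such as "RRZ01" at the very end of the name.
_SUFFIX = re.compile(r"(RRZ|NFRZ|RZ)\d+$")


def normalize_node_name(node: str) -> str:
    """Map redundant-server node names onto one logical node name."""
    name = _PREFIX.sub("", node)
    name = _SUFFIX.sub("", name)
    # Assumption: underscores left dangling by the removal are dropped.
    return name.strip("_")


for raw in ("RRZ01_SERVER_MSSQL", "NFRZ01_DATABASE_HANA", "SERVER_FILE_RRZ01"):
    print(raw, "->", normalize_node_name(raw))
```
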
---

## 2. API Reference

### 2.1 Agent Plugin Functions

#### `TSMParser.normalize_node_name(node: str) -> str`

Normalizes TSM node names so that redundant servers map to the same logical node.

**Parameters:**
- `node` (str): Original TSM node name

**Returns:**
- `str`: Normalized node name

**Example:**
```python
parser = TSMParser()
normalized = parser.normalize_node_name("RRZ01_SERVER_MSSQL")
# normalized == "SERVER_MSSQL"
```

---

#### `TSMParser.is_valid_node(node: str, status: str) -> bool`

Checks whether a node is valid for monitoring.

**Parameters:**
- `node` (str): Node name
- `status` (str): Backup status

**Returns:**
- `bool`: True if valid, False otherwise

**Validation rules:**
- The node must exist (not empty)
- The node must be at least 3 characters long
- The status must exist
- The node must not contain "MAINTENANCE"

**Example:**
```python
parser.is_valid_node("AB", "Completed")                  # False (too short)
parser.is_valid_node("SERVER_MSSQL", "Completed")        # True
parser.is_valid_node("SERVER_MAINTENANCE", "Completed")  # False
```

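The four validation rules translate almost line by line into code. The following is a minimal sketch written as a free function rather than a `TSMParser` method; treating the `MAINTENANCE` check as case-sensitive is an assumption.

```python
def is_valid_node(node: str, status: str) -> bool:
    """Apply the documented validation rules to one CSV entry."""
    # Rule 1 and 3: node and status must both be present.
    if not node or not status:
        return False
    # Rule 2: the node name needs at least 3 characters.
    if len(node) < 3:
        return False
    # Rule 4 (assumed case-sensitive): maintenance nodes are ignored.
    return "MAINTENANCE" not in node


print(is_valid_node("SERVER_MSSQL", "Completed"))
```
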
---

#### `TSMParser.parse_csv(csv_file: Path) -> None`

Parses a TSM CSV file and collects the backup information it contains.

**Parameters:**
- `csv_file` (Path): Path to the CSV file

**CSV format:**
```
TIMESTAMP,FIELD,NODE_NAME,SCHEDULE_NAME,STATUS
2026-01-12 08:00:00,SOMETHING,SERVER_MSSQL,DAILY_FULL,Completed
```

**Side effects:**
- Appends the parsed backups to `self.backups`

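To illustrate the CSV format, here is a small sketch that parses rows of this shape into tuples. The function name `parse_rows`, the tuple layout, and the decision to skip lines whose first column is not a timestamp (for example a header line) are assumptions for illustration; the real method stores its results in `self.backups`.

```python
import csv
import io
from datetime import datetime


def parse_rows(text: str) -> list[tuple[int, str, str, str]]:
    """Return (unix_timestamp, node, schedule, status) for each usable row."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 5:
            continue  # incomplete line
        try:
            ts = datetime.strptime(row[0].strip(), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # header line or malformed timestamp
        # Column 1 (FIELD) is not needed for monitoring and is dropped.
        rows.append((int(ts.timestamp()), row[2].strip(), row[3].strip(), row[4].strip()))
    return rows


sample = (
    "TIMESTAMP,FIELD,NODE_NAME,SCHEDULE_NAME,STATUS\n"
    "2026-01-12 08:00:00,SOMETHING,SERVER_MSSQL,DAILY_FULL,Completed\n"
)
print(parse_rows(sample))
```
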
---

#### `TSMParser.aggregate() -> dict`

Aggregates the backup data per normalized node.

**Returns:**
```python
{
    "SERVER_MSSQL": {
        "statuses": ["Completed", "Completed", "Failed"],
        "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
        "last": 1736693420,  # Unix timestamp
        "count": 3
    }
}
```

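A possible shape of the aggregation, as a hedged sketch: it takes an iterable of `(unix_timestamp, node, schedule, status)` tuples, which is an assumed intermediate format, and builds the dictionary shown above. The real method reads from `self.backups` instead.

```python
from collections import defaultdict


def aggregate(backups):
    """Group backup entries by (already normalized) node name."""
    result = defaultdict(
        lambda: {"statuses": [], "schedules": [], "last": 0, "count": 0}
    )
    for ts, node, schedule, status in backups:
        entry = result[node]
        entry["statuses"].append(status)
        entry["schedules"].append(schedule)
        entry["last"] = max(entry["last"], ts)  # keep the newest timestamp
        entry["count"] += 1
    return dict(result)


entries = [
    (1736690000, "SERVER_MSSQL", "DAILY_FULL", "Completed"),
    (1736693420, "SERVER_MSSQL", "HOURLY_LOG", "Failed"),
    (1736690000, "DATABASE_HANA", "00-00-00_FULL", "Completed"),
]
print(aggregate(entries)["SERVER_MSSQL"]["last"])  # 1736693420
```
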
---

### 2.2 Check Plugin Functions

#### `extract_backup_type(node: str) -> str`

Extracts the backup type from the node name by matching against a list of known types.

**Parameters:**
- `node` (str): Normalized node name

**Returns:**
- `str`: Backup type in lowercase, or "unknown"

**Known types:**
- Databases: MSSQL, HANA, Oracle, DB2, MySQL
- Virtualization: Virtual
- File systems: FILE, SCALE, DM, Datacenter
- Applications: Mail

**Algorithm:**
1. Split the node name at underscores
2. Take the last segment
3. If the last segment is numeric → take the second-to-last segment
4. Check it against the list of known types
5. Return the lowercase type or "unknown"

**Examples:**
```python
extract_backup_type("SERVER_MSSQL")      # → "mssql"
extract_backup_type("DATABASE_HANA_01")  # → "hana"
extract_backup_type("FILESERVER_FILE")   # → "file"
extract_backup_type("VM_HYPERV_123")     # → "hyperv"
extract_backup_type("APP_UNKNOWN")       # → "unknown"
```

---

#### `extract_backup_level(schedules: list[str]) -> str`

Extracts the backup level from the schedule names.

**Parameters:**
- `schedules` (list[str]): List of schedule names

**Returns:**
- `str`: `"log"`, `"full"`, `"incremental"`, `"differential"`

**Priority:** log > full > differential > incremental

**Detection patterns:**
- `_LOG` or `LOG` → log
- `_FULL` or `FULL` → full
- `_INCR` or `INCREMENTAL` → incremental
- `_DIFF` or `DIFFERENTIAL` → differential

**Examples:**
```python
extract_backup_level(["DAILY_FULL"])                # → "full"
extract_backup_level(["HOURLY_LOG", "DAILY_FULL"])  # → "log"
extract_backup_level(["00-00-00_FULL"])             # → "full"
```

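The priority order and the detection patterns can be combined into one ordered lookup. This is a sketch under assumptions: matching is done by case-insensitive substring search, and the `default` parameter (used when no pattern matches) is hypothetical, since the reference above does not say what the real function returns in that case.

```python
# Ordered by priority: the first level whose marker occurs in any schedule wins.
_LEVEL_MARKERS = (
    ("log", ("LOG",)),
    ("full", ("FULL",)),
    ("differential", ("DIFF",)),   # also covers DIFFERENTIAL
    ("incremental", ("INCR",)),    # also covers INCREMENTAL
)


def extract_backup_level(schedules, default="full"):
    """Return the highest-priority backup level found in the schedule names."""
    names = [s.upper() for s in schedules]
    for level, markers in _LEVEL_MARKERS:
        if any(marker in name for name in names for marker in markers):
            return level
    # Assumption: fall back to a caller-supplied default when nothing matches.
    return default


print(extract_backup_level(["HOURLY_LOG", "DAILY_FULL"]))  # log
```

Keeping the markers in a tuple ordered by priority means that a change to the priority rule only requires reordering the table.
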
---

#### `extract_frequency(schedules: list[str]) -> str`

Extracts the backup frequency from the schedule names.

**Parameters:**
- `schedules` (list[str]): List of schedule names

**Returns:**
- `str`: `"hourly"`, `"daily"`, `"weekly"`, `"monthly"`, `"unknown"`

**Priority:** hourly > daily > weekly > monthly

**Detection patterns:**
- `HOURLY` → hourly
- `DAILY` → daily
- `WEEKLY` → weekly
- `MONTHLY` → monthly
- `HH-MM-SS_*LOG` → hourly (time-based with LOG)
- `00-00-00_*` → daily (midnight)

**Examples:**
```python
extract_frequency(["DAILY_FULL"])                 # → "daily"
extract_frequency(["00-00-00_FULL"])              # → "daily"
extract_frequency(["08-00-00_LOG"])               # → "hourly"
extract_frequency(["WEEKLY_FULL", "DAILY_DIFF"])  # → "daily"
```

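One way to implement these rules is to classify each schedule on its own and then pick the highest-priority result. The sketch below makes two assumptions that the reference does not state: keyword matching is a case-insensitive substring search, and a time-prefixed schedule that is neither a LOG schedule nor a midnight schedule yields no frequency.

```python
import re

_PRIORITY = ("hourly", "daily", "weekly", "monthly")
_TIME_PREFIX = re.compile(r"^\d{2}-\d{2}-\d{2}_")


def _frequency_of(schedule: str):
    """Classify a single schedule name, or return None if nothing matches."""
    name = schedule.upper()
    for freq in _PRIORITY:
        if freq.upper() in name:
            return freq
    if _TIME_PREFIX.match(name):
        if name.endswith("LOG"):
            return "hourly"          # HH-MM-SS_*LOG
        if name.startswith("00-00-00_"):
            return "daily"           # midnight schedule
    return None


def extract_frequency(schedules):
    """Return the highest-priority frequency across all schedules."""
    found = {_frequency_of(s) for s in schedules}
    for freq in _PRIORITY:
        if freq in found:
            return freq
    return "unknown"


print(extract_frequency(["WEEKLY_FULL", "DAILY_DIFF"]))  # daily
```
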
---

#### `get_error_handling(backup_type: str) -> str`

Determines the error-handling strategy based on the backup type.

**Parameters:**
- `backup_type` (str): Backup type

**Returns:**
- `str`: `"tolerant"` or `"strict"`

**Logic:**
```python
def get_error_handling(backup_type: str) -> str:
    if backup_type in TOLERANT_TYPES:
        return "tolerant"  # Failed → WARN
    return "strict"  # Failed → CRIT
```

**Tolerant types:**
- file, virtual, scale, dm, datacenter
- vmware, hyperv, mail, exchange

**Strict types:**
- All database types (mssql, hana, oracle, db2, ...)
- All other types

---

#### `get_backup_category(backup_type: str) -> str`

Maps a backup type to a top-level category.

**Parameters:**
- `backup_type` (str): Backup type

**Returns:**
- `str`: `"database"`, `"virtualization"`, `"filesystem"`, `"application"`, `"other"`

**Categories:**

| Category | Types |
|----------|-------|
| `database` | mssql, hana, oracle, db2, mysql, postgres, mariadb, sybase, mongodb |
| `virtualization` | virtual, vmware, hyperv, kvm, xen |
| `filesystem` | file, scale, dm, datacenter |
| `application` | mail, exchange |
| `other` | All others |

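The category table maps directly onto a dictionary of sets. The sketch below copies the table contents; the name `_CATEGORIES` and the case-insensitive lookup are assumptions for illustration.

```python
_CATEGORIES = {
    "database": {"mssql", "hana", "oracle", "db2", "mysql",
                 "postgres", "mariadb", "sybase", "mongodb"},
    "virtualization": {"virtual", "vmware", "hyperv", "kvm", "xen"},
    "filesystem": {"file", "scale", "dm", "datacenter"},
    "application": {"mail", "exchange"},
}


def get_backup_category(backup_type: str) -> str:
    """Return the category for a backup type, or "other" if it is not listed."""
    key = backup_type.lower()  # assumption: lookup ignores case
    for category, types in _CATEGORIES.items():
        if key in types:
            return category
    return "other"


print(get_backup_category("hana"))  # database
```
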
---

#### `get_thresholds(backup_type: str, backup_level: str) -> dict`

Returns the thresholds for a given backup type and level.

**Parameters:**
- `backup_type` (str): Backup type
- `backup_level` (str): Backup level

**Returns:**
```python
{
    "warn": 93600,   # seconds
    "crit": 172800   # seconds
}
```

**Priority:**
1. If `backup_level == "log"` → LOG thresholds (4h/8h)
2. If `backup_type` is in THRESHOLDS → type-specific thresholds
3. Otherwise → default thresholds (26h/48h)

**Examples:**
```python
get_thresholds("mssql", "log")     # → {"warn": 14400, "crit": 28800}
get_thresholds("mssql", "full")    # → {"warn": 93600, "crit": 172800}
get_thresholds("newtype", "full")  # → {"warn": 93600, "crit": 172800}
```

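The three-step priority can be written as a short lookup chain. In this sketch the constant names `LOG_THRESHOLDS` and `DEFAULT_THRESHOLDS` are hypothetical, and the single `THRESHOLDS` entry is a placeholder derived from the examples above; the plugin's real per-type table is not shown in this document.

```python
HOUR = 3600

LOG_THRESHOLDS = {"warn": 4 * HOUR, "crit": 8 * HOUR}        # 4h / 8h
DEFAULT_THRESHOLDS = {"warn": 26 * HOUR, "crit": 48 * HOUR}  # 26h / 48h

# Placeholder table: only the mssql values are known from the examples.
THRESHOLDS = {
    "mssql": {"warn": 26 * HOUR, "crit": 48 * HOUR},
}


def get_thresholds(backup_type: str, backup_level: str) -> dict:
    """Resolve thresholds: log level first, then type, then default."""
    if backup_level == "log":
        return dict(LOG_THRESHOLDS)
    # Return a copy so callers cannot modify the shared tables.
    return dict(THRESHOLDS.get(backup_type, DEFAULT_THRESHOLDS))


print(get_thresholds("mssql", "log"))  # {'warn': 14400, 'crit': 28800}
```
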
---

#### `calculate_state(statuses: list[str], last_time: int, backup_type: str, error_handling: str) -> tuple[State, str]`

Computes the CheckMK state from the backup statuses.

**Parameters:**
- `statuses` (list[str]): List of all backup statuses
- `last_time` (int): Unix timestamp of the most recent backup
- `backup_type` (str): Backup type
- `error_handling` (str): "tolerant" or "strict"

**Returns:**
- `tuple`: `(State, status_text)`

**State logic table:**

| Condition | Age | Error handling | State | Text |
|-----------|-----|----------------|-------|------|
| ≥1x "completed" | - | - | OK | "Completed" |
| Only "pending"/"started" | <2h | - | OK | "Pending/Started" |
| Only "pending"/"started" | >2h | - | WARN | "Pending (>2h)" |
| Only "pending"/"started" | unknown | - | WARN | "Pending" |
| "failed"/"missed" | - | tolerant | WARN | "Failed (partial)" |
| "failed"/"missed" | - | strict | CRIT | "Failed/Missed" |
| Other | - | - | CRIT | "Unknown State" |

**Examples:**
```python
calculate_state(["Completed", "Completed"], 1736690000, "mssql", "strict")
# → (State.OK, "Completed")

calculate_state(["Failed"], 1736690000, "file", "tolerant")
# → (State.WARN, "Failed (partial)")

calculate_state(["Failed"], 1736690000, "mssql", "strict")
# → (State.CRIT, "Failed/Missed")
```

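Read from top to bottom, the table describes a chain of checks. The sketch below follows that reading under several assumptions: statuses are compared case-insensitively, the rows are evaluated in table order, an age of exactly two hours counts as overdue, and a missing or zero `last_time` means the age is unknown. The `State` enum is a stand-in for `cmk.agent_based.v2.State` so that the example runs without CheckMK, and the `now` parameter is a hypothetical addition that makes the age check testable.

```python
import time
from enum import Enum


class State(Enum):
    """Stand-in for cmk.agent_based.v2.State (illustration only)."""
    OK = 0
    WARN = 1
    CRIT = 2


PENDING_GRACE = 2 * 3600  # pending/started jobs are fine for up to 2 hours


def calculate_state(statuses, last_time, backup_type, error_handling, now=None):
    # backup_type is unused here; it is kept to mirror the documented signature.
    seen = {s.lower() for s in statuses}
    if "completed" in seen:
        return State.OK, "Completed"
    if seen and seen <= {"pending", "started"}:
        if not last_time:
            return State.WARN, "Pending"  # age unknown
        age = (time.time() if now is None else now) - last_time
        if age < PENDING_GRACE:
            return State.OK, "Pending/Started"
        return State.WARN, "Pending (>2h)"
    if seen & {"failed", "missed"}:
        if error_handling == "tolerant":
            return State.WARN, "Failed (partial)"
        return State.CRIT, "Failed/Missed"
    return State.CRIT, "Unknown State"


print(calculate_state(["Failed"], 1736690000, "file", "tolerant"))
```
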
---

### 2.3 CheckMK API v2 Objects

#### `Service`

Defines a CheckMK service during discovery.

```python
from cmk.agent_based.v2 import Service, ServiceLabel

Service(
    item="SERVER_MSSQL",
    labels=[
        ServiceLabel("backup_type", "mssql"),
        ServiceLabel("frequency", "daily"),
    ]
)
```

---

#### `Result`

Represents a check result.

```python
from cmk.agent_based.v2 import Result, State

Result(
    state=State.OK,
    summary="Type=MSSQL, Status=Completed, Last=3h"
)

Result(
    state=State.OK,
    notice="Detailed information for details page"
)
```

---

#### `Metric`

Defines a performance metric.

```python
from cmk.agent_based.v2 import Metric

Metric(
    name="backup_age",
    value=11700,              # current value
    levels=(93600, 172800),   # (warn, crit)
    boundaries=(0, None),     # (min, max)
)
```

---

## 3. Advanced Configuration

### 3.1 Custom Backup Types

**Scenario:** A new backup type "SAPASE" (SAP ASE database) should be monitored.

**Step 1: Add the type to the known_types list**

```python
# In tsm_backups.py, inside extract_backup_type()
known_types = [
    'MSSQL', 'HANA', 'FILE', 'ORACLE', 'DB2', 'SCALE', 'DM',
    'DATACENTER', 'VIRTUAL', 'MAIL', 'MYSQL',
    'SAPASE',  # NEW
]
```

**Step 2: Define thresholds (optional)**

```python
THRESHOLDS = {
    # ... existing entries ...
    "sapase": {"warn": 26 * 3600, "crit": 48 * 3600},
}
```

**Step 3: Add the type to the matching category**

```python
DATABASE_TYPES = {
    'mssql', 'hana', 'db2', 'oracle', 'mysql',
    'sapase',  # NEW
}
```

**Step 4: Set the error handling (optional)**

If tolerant handling is desired:
```python
TOLERANT_TYPES = {
    'file', 'virtual', 'scale', 'dm', 'datacenter',
    'vmware', 'hyperv', 'mail', 'exchange',
    'sapase',  # NEW (only if tolerant handling is desired)
}
```

**Step 5: Reload the plugin**

```bash
cmk -R
cmk -II --all
```

**Result:**
- Nodes such as `SERVER_SAPASE` are detected automatically
- Type label: `backup_type=sapase`
- Category label: `backup_category=database`
- Thresholds: 26h/48h

---

### 3.2 Async Agent Plugin

In large TSM environments, parsing the CSV files can take a noticeable amount of time. Asynchronous plugins run independently of the agent interval, and the agent delivers their cached output.

**Configuration:**

```bash
# As root on the host: place the plugin in a subdirectory named after the
# execution interval in seconds (here 300 = every 5 minutes)
mkdir -p /usr/lib/check_mk_agent/plugins/300
mv /usr/lib/check_mk_agent/plugins/tsm_backups /usr/lib/check_mk_agent/plugins/300/
```

**Or with the CheckMK Bakery (rule):**

```
Setup > Agents > Agent Rules > Asynchronous execution of plugins (Windows, Linux)
```

**Settings:**
- Plugin: `tsm_backups`
- Execution interval: `300` seconds (5 minutes)
- Cache age: `600` seconds (10 minutes)

---

### 3.3 CSV Export Automation

#### Option A: NFS Mount (recommended)

```bash
# /etc/fstab
tsm-server.example.com:/exports/backup-stats /mnt/CMK_TSM nfs defaults,ro 0 0

# Test the mount
mount -a
ls /mnt/CMK_TSM/
```

#### Option B: Rsync via Cron

```bash
# Crontab for root
*/15 * * * * rsync -az --delete tsm-server:/path/to/csv/ /mnt/CMK_TSM/
```

#### Option C: SCP with SSH Key

```bash
# Set up the SSH key
ssh-keygen -t ed25519 -f ~/.ssh/tsm_backup_key -N ""
ssh-copy-id -i ~/.ssh/tsm_backup_key.pub tsm-server

# Crontab
*/15 * * * * scp -i ~/.ssh/tsm_backup_key tsm-server:/path/*.CSV /mnt/CMK_TSM/
```

---

### 3.4 Rule-Based Service Creation

**CheckMK rules for automatic service labels:**

```
Setup > Services > Discovery rules > Host labels
```

**Example rule:**
```yaml
conditions:
  service_labels:
    backup_category: database

actions:
  add_labels:
    criticality: high
    team: dba
```

---

### 3.5 Custom Views

#### View: All Critical Database Backups

```
Setup > General > Custom views > Create new view

Name: Critical Database Backups
Datasource: All services

Filters:
  - Service state: CRIT
  - Service labels: backup_category = database

Columns:
  - Host
  - Service description
  - Service state
  - Service output
  - Service labels: backup_type
  - Service labels: frequency
  - Perf-O-Meter
```

---

### 3.6 Custom Notifications

**Notification rule: escalate only failed backups with strict error handling**

```
Setup > Notifications > Add rule

Conditions:
  - Service labels: error_handling = strict
  - Service state: CRIT
  - Service state type: HARD

Contact selection:
  - Specify users: dba-team

Notification method:
  - Email
  - PagerDuty
```

---

## 4. Development Guide

### 4.1 Setting Up the Development Environment

```bash
# CheckMK site for development
omd create dev
omd start dev
su - dev

# Git repository
cd ~/local/lib/python3/cmk_addons/plugins/
git init
git add .
git commit -m "Initial commit"

# Development workflow
vim tsm/agent_based/tsm_backups.py
cmk -R
cmk -vv --debug test-host | grep "TSM Backup"
```

---

### 4.2 Writing Unit Tests

**Test file:** `test_tsm_backups.py`

```python
#!/usr/bin/env python3
import pytest
from tsm_backups import (
    extract_backup_type,
    extract_backup_level,
    calculate_state,
)
from cmk.agent_based.v2 import State


def test_extract_backup_type():
    assert extract_backup_type("SERVER_MSSQL") == "mssql"
    assert extract_backup_type("DATABASE_HANA_01") == "hana"
    assert extract_backup_type("NEWTYPE_CUSTOM") == "custom"


def test_extract_backup_level():
    assert extract_backup_level(["DAILY_FULL"]) == "full"
    assert extract_backup_level(["HOURLY_LOG", "DAILY_FULL"]) == "log"


def test_calculate_state_completed():
    state, text = calculate_state(
        ["Completed", "Completed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.OK
    assert text == "Completed"


def test_calculate_state_failed_strict():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "mssql",
        "strict"
    )
    assert state == State.CRIT
    assert text == "Failed/Missed"


def test_calculate_state_failed_tolerant():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "file",
        "tolerant"
    )
    assert state == State.WARN
    assert text == "Failed (partial)"
```

**Run the tests:**
```bash
pytest test_tsm_backups.py -v
```

---

### 4.3 Code Style

**PEP 8 compliance:**
```bash
pip install black flake8 mypy

# Auto-formatting
black tsm_backups.py

# Linting
flake8 tsm_backups.py

# Type checking
mypy tsm_backups.py
```

---

### 4.4 Debugging

#### Debugging the Agent Plugin

```bash
# Direct invocation with traceback
python3 /usr/lib/check_mk_agent/plugins/tsm_backups

# With the debugger
python3 -m pdb /usr/lib/check_mk_agent/plugins/tsm_backups
```

#### Debugging the Check Plugin

```bash
# Verbose check with debug output
cmk -vv --debug hostname | less

# TSM services only
cmk -vv --debug hostname | grep -A 20 "TSM Backup"

# Python debugger inside the plugin: add this line to the plugin code
#   import pdb; pdb.set_trace()
```

---

### 4.5 Performance Profiling

```python
# In tsm_backups.py
import cProfile
import pstats


def main():
    profiler = cProfile.Profile()
    profiler.enable()

    # ... existing code ...

    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)
```

---

## 5. Performance Optimization

### 5.1 Speeding Up CSV Parsing

**Problem:** Large CSV files (>100 MB) slow down the agent.

**Solution 1: Parse only the relevant rows**

```python
import csv
from datetime import datetime, timedelta

def parse_csv_optimized(self, csv_file):
    # Only the last 24 hours are relevant
    cutoff_time = datetime.now() - timedelta(hours=24)

    with open(csv_file, 'r', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            try:
                time_str = row[0].strip()
                timestamp = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
            except (IndexError, ValueError):
                # Skip empty rows and rows without a parsable timestamp
                continue

            # Skip old entries
            if timestamp < cutoff_time:
                continue

            # ... remaining processing ...
```

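If the export is sorted newest-first (the example query in Appendix C uses `ORDER BY END_TIME DESC`, but whether every CSV really arrives in that order is an assumption), the loop can stop at the first row that is too old instead of scanning the rest of the file. A minimal sketch with a hypothetical generator `recent_rows`:

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def recent_rows(rows, now, hours=24):
    """Yield rows newer than the cutoff and stop at the first older row.

    Assumes `rows` is sorted newest-first. Rows without a parsable
    timestamp in column 0 are skipped.
    """
    cutoff = now - timedelta(hours=hours)
    for row in rows:
        try:
            timestamp = datetime.strptime(row[0].strip(), TIME_FORMAT)
        except (IndexError, ValueError):
            continue
        if timestamp < cutoff:
            break  # every following row is older still
        yield row
```

With unsorted input this would drop valid rows, so the `continue` variant remains the safe default.
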
**Solution 2: pandas for large files**

```python
import pandas as pd

def parse_csv_pandas(csv_file):
    df = pd.read_csv(
        csv_file,
        names=['timestamp', 'field', 'node', 'schedule', 'status'],
        parse_dates=['timestamp'],
    )

    # Keep only the last 24 hours
    cutoff = pd.Timestamp.now() - pd.Timedelta(hours=24)
    df = df[df['timestamp'] > cutoff]

    # Aggregate per node. Named aggregation is used because the grouping
    # column 'node' is not available as an aggregation target.
    grouped = df.groupby('node').agg(
        statuses=('status', list),
        schedules=('schedule', list),
        last=('timestamp', 'max'),  # pandas Timestamp; convert before JSON output
        count=('status', 'size'),
    )

    # One dict per node: {node: {"statuses": [...], "schedules": [...], ...}}
    return grouped.to_dict(orient='index')
```

---

### 5.2 Caching

**Problem:** The CSV files change only every 15-30 minutes, so parsing them again on every agent run is wasted work.

**Solution: Cache with timestamp check**

```python
import json
from pathlib import Path
import time

CACHE_FILE = Path("/tmp/tsm_backups_cache.json")
CACHE_TTL = 300  # 5 minutes

def get_cached_or_parse():
    # Reuse the cache while it is younger than CACHE_TTL
    if CACHE_FILE.exists():
        cache_age = time.time() - CACHE_FILE.stat().st_mtime
        if cache_age < CACHE_TTL:
            with open(CACHE_FILE, 'r') as f:
                return json.load(f)

    # Parse fresh
    parser = TSMParser()
    # ... parse logic ...
    result = parser.aggregate()

    # Write the cache
    with open(CACHE_FILE, 'w') as f:
        json.dump(result, f)

    return result
```

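A fixed TTL can serve stale data for up to five minutes after a new export arrives. An alternative is to treat the cache as valid only while it is newer than every CSV file. The sketch below also writes the cache through a temporary file and a rename, so a concurrent reader never sees a half-written file. Both helper names are hypothetical. A cache directory that only the agent user can write to would be a safer location than the world-writable `/tmp`.

```python
import json
import os
import tempfile
from pathlib import Path

def cache_is_fresh(cache_file: Path, csv_dir: Path) -> bool:
    """True if the cache exists and is at least as new as every *.CSV in csv_dir."""
    if not cache_file.exists():
        return False
    cache_mtime = cache_file.stat().st_mtime
    return all(p.stat().st_mtime <= cache_mtime for p in csv_dir.glob("*.CSV"))

def write_cache_atomically(cache_file: Path, data) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=str(cache_file.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp_name, cache_file)  # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_name)
        raise
```
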
---

### 5.3 Memory Optimization

**Problem:** The lists of status and schedule strings can grow large.

**Solution: Store only unique values**

```python
from collections import defaultdict

def aggregate_optimized(self):
    nodes = defaultdict(lambda: {
        "statuses": set(),  # set instead of list
        "schedules": set(),
        "last": None,
        "count": 0,
    })

    for b in self.backups:
        node = b["node"]
        nodes[node]["count"] += 1
        nodes[node]["statuses"].add(b["status"])  # automatically unique
        nodes[node]["schedules"].add(b["schedule"])
        # ... rest ...

    # Convert the sets to sorted lists: sets are not JSON-serializable,
    # and sorting keeps the output stable between runs
    for node in nodes:
        nodes[node]["statuses"] = sorted(nodes[node]["statuses"])
        nodes[node]["schedules"] = sorted(nodes[node]["schedules"])

    return nodes
```

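Storing only unique values discards how often each status occurred, for example whether one job out of twenty failed or all of them. If the check needs that information, a counter per node stays compact while keeping the frequencies. A minimal sketch with a hypothetical function `count_statuses`, assuming backup records are dicts with `node` and `status` keys as in the code above:

```python
from collections import Counter, defaultdict

def count_statuses(backups):
    """Return {node: {status: occurrences}} for an iterable of backup dicts."""
    per_node = defaultdict(Counter)
    for b in backups:
        per_node[b["node"]][b["status"]] += 1
    # Plain dicts serialize directly with json.dumps
    return {node: dict(counts) for node, counts in per_node.items()}
```

The size of the result depends on the number of distinct statuses per node, not on the number of jobs.
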
---

## 6. Security

### 6.1 File Permissions

```bash
# Agent plugin
chown root:root /usr/lib/check_mk_agent/plugins/tsm_backups
chmod 755 /usr/lib/check_mk_agent/plugins/tsm_backups

# CSV directory
chown root:root /mnt/CMK_TSM
chmod 755 /mnt/CMK_TSM
chmod 644 /mnt/CMK_TSM/*.CSV

# Check plugin
chown <site>:<site> $OM/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
chmod 644 $OM/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
```

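Permissions drift over time, for example after a restore or a manual edit. The settings above can be verified with a short script. This is a sketch, with the hypothetical helper `permission_problems` comparing permission bits and owner UID against expected values:

```python
import os
import stat

def permission_problems(path, expected_mode, expected_uid=0):
    """Compare owner and permission bits of `path` with the expected values."""
    st = os.stat(path)
    problems = []
    mode = stat.S_IMODE(st.st_mode)  # permission bits only, without the file type
    if mode != expected_mode:
        problems.append(f"mode is {mode:o}, expected {expected_mode:o}")
    if st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    return problems

# Example: the agent plugin should be owned by root (uid 0) with mode 755
# permission_problems("/usr/lib/check_mk_agent/plugins/tsm_backups", 0o755)
```
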
---

### 6.2 Input Validation

**Agent plugin:**
```python
import re

def is_valid_node(self, node, status):
    # Check the length
    if not node or len(node) < 3 or len(node) > 200:
        return False

    # Reject disallowed characters; fullmatch also rejects a trailing
    # newline, which '^...$' with re.match would accept
    if not re.fullmatch(r'[A-Za-z0-9_-]+', node):
        return False

    # Status whitelist
    valid_statuses = ['Completed', 'Failed', 'Missed', 'Pending', 'Started']
    if status not in valid_statuses:
        return False

    return True
```

---

### 6.3 Safe CSV Processing

```python
import csv
import sys

def parse_csv_safe(self, csv_file):
    try:
        # Check the file size (max. 500 MB)
        if csv_file.stat().st_size > 500 * 1024 * 1024:
            return

        with open(csv_file, 'r', encoding='utf-8') as f:
            reader = csv.reader(f)

            line_count = 0
            for row in reader:
                line_count += 1

                # At most 1 million rows
                if line_count > 1000000:
                    break

                # ... processing ...
    except Exception as e:
        # Log instead of crashing; stderr keeps the agent output on stdout clean
        print(f"tsm_backups: skipping {csv_file}: {e}", file=sys.stderr)
```

---

## 7. Integration

### 7.1 Grafana Dashboard

**InfluxDB query:**
```sql
SELECT
  mean("backup_age") AS "avg_age",
  max("backup_age") AS "max_age"
FROM "tsm_backups"
WHERE
  "backup_category" = 'database'
  AND time > now() - 7d
GROUP BY
  time(1h),
  "node_name"
```

**Panels:**
- Backup Age Heatmap (per node)
- Status Distribution (pie chart)
- Backup Jobs Timeline
- Alert History

---

### 7.2 Prometheus Exporter

**Configuring the CheckMK Prometheus exporter:**

```
Setup > Exporter > Prometheus

Metrics:
  - cmk_tsm_backups_backup_age_seconds
  - cmk_tsm_backups_backup_jobs_total

Labels:
  - backup_type
  - backup_category
  - frequency
```

---

### 7.3 REST API Access

```python
import requests

# CheckMK REST API
url = "https://checkmk.example.com/site/check_mk/api/1.0"
headers = {
    # The Checkmk REST API expects "Bearer <user> <secret>",
    # typically an automation user and its secret
    "Authorization": "Bearer USERNAME SECRET",
    "Accept": "application/json"
}

# Query all TSM services
response = requests.get(
    f"{url}/domain-types/service/collections/all",
    headers=headers,
    params={
        "query": '{"op": "and", "expr": [{"op": "~", "left": "description", "right": "TSM Backup"}]}'
    },
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors

services = response.json()
```

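A hand-written JSON string for the `query` parameter breaks as soon as the search pattern contains a quote or a backslash. Building the filter as a Python structure and serializing it with `json.dumps` avoids that. This is a minimal sketch; `description_filter` is a hypothetical helper, and the filter layout is copied from the request above.

```python
import json

def description_filter(pattern: str) -> str:
    """Return the JSON text for a 'description matches pattern' query."""
    query = {
        "op": "and",
        "expr": [{"op": "~", "left": "description", "right": pattern}],
    }
    return json.dumps(query)  # handles quoting and escaping
```

The result can be passed directly to the request, for example `params={"query": description_filter("TSM Backup")}`.
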
---

## 8. Best Practices

### 8.1 Naming Conventions

**Node names:**
```
✅ RECOMMENDED:
- SERVER_MSSQL
- APP_ORACLE_01
- FILESERVER_BACKUP

❌ AVOID:
- MSSQL (too generic)
- SERVER-PROD (hyphens can cause problems)
- very_long_name_that_is_too_descriptive_mssql_backup_node (>50 characters)
```

**Schedule names:**
```
✅ RECOMMENDED:
- DAILY_FULL
- HOURLY_LOG
- WEEKLY_FULL

❌ AVOID:
- PROD_BACKUP (no frequency recognizable)
- BACKUP01 (no information)
```

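Some of these conventions can be checked mechanically, for example before new nodes or schedules are created. The sketch below covers only the rules that can be tested without judgment (length, hyphens, a frequency token). The token list is an assumption derived from the recommended examples, and both function names are hypothetical.

```python
MAX_NODE_LENGTH = 50                               # limit from the list above
FREQUENCY_TOKENS = ("HOURLY", "DAILY", "WEEKLY")   # assumed from the examples

def node_name_issues(name):
    """Return the convention violations found in a node name."""
    issues = []
    if len(name) > MAX_NODE_LENGTH:
        issues.append("longer than 50 characters")
    if "-" in name:
        issues.append("contains a hyphen")
    return issues

def schedule_has_frequency(name):
    """True if any underscore-separated part is a known frequency token."""
    return any(part in FREQUENCY_TOKENS for part in name.upper().split("_"))
```

Whether a name is "too generic" remains a human decision and is deliberately left out.
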
---

### 8.2 Monitoring Strategy

**Alert escalation:**

1. **Level 1 (INFO):** Backup Started/Pending
2. **Level 2 (WARN):**
   - Backup age > WARN threshold
   - Failed (tolerant types)
   - Pending > 2h
3. **Level 3 (CRIT):**
   - Backup age > CRIT threshold
   - Failed/Missed (strict types)

**Notification delays:**
```
Setup > Notifications > Rules

WARN: Notify after 15 minutes (allow recovery)
CRIT: Notify immediately
```

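The three escalation levels in this section can be written down as a small decision function. This is an illustrative sketch, not the plugin's own `calculate_state()`. The function name and parameters are hypothetical, and treating `Missed` on a tolerant type as WARN is an assumption, since the list does not cover that case.

```python
OK, WARN, CRIT = 0, 1, 2
PENDING_LIMIT_S = 2 * 3600  # "Pending > 2h"

def escalation_level(status, age_s, warn_s, crit_s, strict, pending_s=0):
    """Map one backup result to OK/WARN/CRIT following the three levels."""
    # Level 3: too old, or a failure on a strict backup type
    if age_s > crit_s or (strict and status in ("Failed", "Missed")):
        return CRIT
    # Level 2: age above the WARN threshold
    if age_s > warn_s:
        return WARN
    # Level 2: failure on a tolerant type (Missed here is an assumption)
    if status in ("Failed", "Missed"):
        return WARN
    # Level 2: job has been pending for too long
    if status == "Pending" and pending_s > PENDING_LIMIT_S:
        return WARN
    # Level 1: Started/Pending and everything else is informational
    return OK
```

CRIT conditions are evaluated first so that a strict failure is never downgraded by a fresh backup age.
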
---

### 8.3 Maintenance Windows

**Pausing backup services during maintenance:**

```
Setup > Services > Service monitoring rules > Disabled checks

Conditions:
  - Service labels: backup_system = tsm
  - Timeperiod: maintenance_window

Action: Disable active checks
```

---

### 8.4 Documentation

**Document for each installation:**

1. **CSV export source:** Which TSM server, which queries
2. **CSV transfer method:** NFS/SCP/rsync + schedule
3. **Custom types:** List of all added backup types
4. **Adjusted thresholds:** Justification for deviations
5. **Contacts:** Who is responsible for the TSM backups

---

### 8.5 Regular Maintenance

**Monthly:**
- Clean up the CSV directory (delete old files)
- Verify that all expected nodes are found
- Analyze the alert history for false positives

**Quarterly:**
- Review the thresholds and adjust them if necessary
- Document new backup types
- Check the check plugin for updates

**Annually:**
- Test compatibility with CheckMK upgrades
- Performance review (agent runtime, check duration)
- Architecture review (is the solution still a good fit?)

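The monthly CSV cleanup is easy to script. The sketch below only lists candidates, so the result can be reviewed before anything is deleted. The 30-day limit and the helper name `old_csv_files` are assumptions, not values from this documentation.

```python
import time
from pathlib import Path

def old_csv_files(csv_dir, max_age_days=30, now=None):
    """Return the *.CSV files in csv_dir whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(p for p in Path(csv_dir).glob("*.CSV") if p.stat().st_mtime < cutoff)

# Dry run: print the candidates first
# for path in old_csv_files("/mnt/CMK_TSM"):
#     print(path)  # replace with path.unlink() once the list has been reviewed
```
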
---

## Appendix

### A. Glossary

| Term | Description |
|------|-------------|
| **Agent Plugin** | Script on the monitored host that delivers data to CheckMK |
| **Check Plugin** | Code on the CheckMK server that creates services and evaluates their state |
| **Service Label** | Key-value pair assigned to a service (filtering/reporting) |
| **Discovery** | Process in which CheckMK creates services automatically |
| **Threshold** | Limit value (WARN/CRIT) for a metric |
| **Node** | TSM term for a backup client |
| **Schedule** | TSM term for a scheduled backup job |

---

### B. Error Code Reference

| Error | Cause | Solution |
|-------|-------|----------|
| `Backup not found in data` | Node exists in the discovery but not in the current agent output | Check the CSV files; re-run the discovery if necessary |
| `Empty agent section` | Agent delivers no data | Check that the agent plugin runs; check the CSV directory |
| `JSON decode error` | Agent output is not valid JSON | Test the agent plugin manually and look for errors in its output |
| `Unknown State` | Unexpected status from TSM | Check the agent output; extend `calculate_state()` if necessary |

---

### C. TSM Query for the CSV Export

**Example query for the TSM server (dsmadmc):**

```sql
SELECT
  DATE(END_TIME) || ' ' || TIME(END_TIME) AS DATETIME,
  ENTITY,
  NODE_NAME,
  SCHEDULE_NAME,
  RESULT
FROM ACTLOG
WHERE
  SCHEDULE_NAME IS NOT NULL
  AND SCHEDULE_NAME != ''
  -- TIMESTAMPDIFF interval code 8 = hours (code 4 would count minutes)
  AND TIMESTAMPDIFF(8, CHAR(CURRENT_TIMESTAMP - END_TIME)) <= 24
ORDER BY END_TIME DESC
```

**Export as CSV:**
```bash
dsmadmc -id=admin -pa=password -comma \
  "SELECT ... FROM ACTLOG ..." \
  > /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
```

---

**End of the technical documentation**

**Last updated:** 2026-01-12
**Version:** 4.1
**Author:** Marius Gielnik