# TSM Backup Monitoring - Technical Documentation

## Table of Contents

1. [Architecture Details](#architecture-details)
2. [API Reference](#api-reference)
3. [Advanced Configuration](#advanced-configuration)
4. [Development Guide](#development-guide)
5. [Performance Optimization](#performance-optimization)
6. [Security](#security)
7. [Integration](#integration)
8. [Best Practices](#best-practices)

---
## 1. Architecture Details

### 1.1 Component Overview

#### Agent Plugin (`tsm_backups`)

**Location:** `/usr/lib/check_mk_agent/plugins/tsm_backups`

**Responsibilities:**
- Read the CSV files from `/mnt/CMK_TSM`
- Normalize node names (strip RRZ*/NFRZ* prefixes)
- Aggregate backup data per node
- Generate JSON output for the CheckMK agent

**Execution:**
- Runs on every agent invocation
- Default interval: 60 seconds (CheckMK default)
- Can be configured to run asynchronously (see [Async Plugins](#async-plugins))
**Output format:**
```json
{
  "SERVER_MSSQL": {
    "statuses": ["Completed", "Completed", "Failed"],
    "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
    "last": 1736693420,
    "count": 3
  },
  "DATABASE_HANA": {
    "statuses": ["Completed"],
    "schedules": ["00-00-00_FULL"],
    "last": 1736690000,
    "count": 1
  }
}
```
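
The agent emits this JSON as a single line below a CheckMK section header; `sep(0)` tells the parser to treat each line as a single column. A minimal sketch of the emitting side (the function name `emit_section` is illustrative, not taken from the plugin):

```python
import json

def emit_section(aggregated: dict) -> None:
    # Section header; sep(0) makes each following line one column,
    # so the JSON document survives unsplit.
    print("<<<tsm_backups:sep(0)>>>")
    print(json.dumps(aggregated))
```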

#### Check Plugin (`tsm_backups.py`)

**Location:** `/omd/sites/<site>/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py`

**Responsibilities:**
- Parse the JSON from the agent
- Discover services with labels
- Evaluate the backup status
- Generate metrics
- Check thresholds

**CheckMK API version:** v2 (`cmk.agent_based.v2`)
### 1.2 Data Flow Diagram

```
┌─ TSM Server ─────────────────────────────────────────────────────
│
│  SELECT
│    DATE_TIME, ENTITY, NODE_NAME, SCHEDULE_NAME, RESULT
│  FROM ACTLOG
│  WHERE TIMESTAMP > CURRENT_TIMESTAMP - 24 HOURS
│
│      ↓ export as CSV
│  /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
└──────────────────────────────────────────────────────────────────
                 │
                 │ NFS/SCP/Rsync
                 ▼
┌─ Host: /mnt/CMK_TSM/ ────────────────────────────────────────────
│  ├── TSM_BACKUP_SCHED_24H.CSV
│  ├── TSM_DB_SCHED_24H.CSV
│  └── TSM_FILE_SCHED_24H.CSV
└──────────────────────────────────────────────────────────────────
                 │
                 ▼
┌─ Agent Plugin: /usr/lib/check_mk_agent/plugins/tsm_backups ──────
│
│  1. List the CSV files in /mnt/CMK_TSM
│  2. Parse each line:
│     - extract: timestamp, node, schedule, status
│     - validate the node (length, MAINTENANCE)
│     - normalize the node name:
│       RRZ01_SERVER_MSSQL → _SERVER_MSSQL → SERVER_MSSQL
│  3. Aggregate per node:
│     - collect all statuses
│     - collect all schedules
│     - find the latest timestamp
│     - count the jobs
│  4. Generate the JSON output
│
│  Output: <<<tsm_backups:sep(0)>>>
│          {"SERVER_MSSQL": {...}, ...}
└──────────────────────────────────────────────────────────────────
                 │
                 │ CheckMK agent protocol
                 ▼
┌─ CheckMK Server: Agent Section Parser ───────────────────────────
│
│  parse_tsm_backups(string_table):
│    - extract the JSON string from string_table[0][0]
│    - parse JSON → Python dict
│    - return: {node: data, ...}
└──────────────────────────────────────────────────────────────────
                 │
                 ▼
┌─ Service Discovery ──────────────────────────────────────────────
│
│  discover_tsm_backups(section):
│    FOR EACH node IN section:
│      1. Extract metadata:
│         - backup_type = extract_backup_type(node)
│         - backup_level = extract_backup_level(schedules)
│         - frequency = extract_frequency(schedules)
│         - error_handling = get_error_handling(backup_type)
│         - category = get_backup_category(backup_type)
│      2. Create a service with labels:
│         Service(
│             item=node,
│             labels=[
│                 ServiceLabel("backup_type", backup_type),
│                 ServiceLabel("backup_category", category),
│                 ...
│             ]
│         )
└──────────────────────────────────────────────────────────────────
                 │
                 ▼
┌─ Service Check Execution ────────────────────────────────────────
│
│  check_tsm_backups(item, section):
│    1. Load the node data from section[item]
│    2. Extract metadata (same as during discovery)
│    3. Compute the state:
│       - calculate_state() → (State, status_text)
│    4. Compute the backup age:
│       - age = now - last_timestamp
│    5. Fetch the thresholds:
│       - thresholds = get_thresholds(type, level)
│    6. Check the age against the thresholds
│    7. Generate the output:
│       - Result(state, summary)
│       - Metric("backup_age", age, levels)
│       - Metric("backup_jobs", count)
└──────────────────────────────────────────────────────────────────
                 │
                 ▼
┌─ CheckMK Service ────────────────────────────────────────────────
│
│  Name:    TSM Backup SERVER_MSSQL
│  State:   OK
│  Summary: Type=MSSQL (database), Level=FULL, Freq=daily,
│           Status=Completed, Last=3h 15m, Jobs=3
│  Metrics:
│    - backup_age: 11700s (warn: 93600s, crit: 172800s)
│    - backup_jobs: 3
│  Labels:
│    - backup_type: mssql
│    - backup_category: database
│    - frequency: daily
│    - backup_level: full
│    - error_handling: strict
└──────────────────────────────────────────────────────────────────
```
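
The parser step in the middle of this pipeline is small enough to sketch in full. Assuming the agent emits the JSON on a single line under `sep(0)`, it reduces to:

```python
import json

def parse_tsm_backups(string_table):
    # With sep(0), the agent's single JSON line arrives as string_table[0][0].
    if not string_table or not string_table[0]:
        return {}
    return json.loads(string_table[0][0])
```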

### 1.3 Node Normalization in Detail

**Purpose:** TSM environments with redundant servers (e.g. RRZ01, RRZ02, NFRZ01) should be monitored as a single logical node.

**Algorithm:**

```python
def normalize_node_name(node):
    """
    Input: "RRZ01_MYSERVER_MSSQL"

    Step 1: Strip an RRZ*/NFRZ*/RZ* prefix together with its underscore
    Pattern: r'(RRZ|NFRZ|RZ)\d+(_)'
    Result: "_MYSERVER_MSSQL"

    Step 2: Strip the leading underscore
    Result: "MYSERVER_MSSQL"

    Step 3: Strip an RRZ*/NFRZ*/RZ* suffix without an underscore
    Pattern: r'(RRZ|NFRZ|RZ)\d+$'
    Result: "MYSERVER_MSSQL"

    Output: "MYSERVER_MSSQL"
    """
```
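
A runnable sketch of these three steps. The optional `_?` in front of the suffix pattern is an assumption: it is needed so that suffixed names such as `SERVER_FILE_RRZ01` do not keep a trailing underscore after normalization.

```python
import re

def normalize_node_name(node: str) -> str:
    # Steps 1 + 2: strip an RRZ*/NFRZ*/RZ* prefix and the leading underscore
    node = re.sub(r'^(RRZ|NFRZ|RZ)\d+_', '', node).lstrip('_')
    # Step 3: strip an RRZ*/NFRZ*/RZ* suffix (and a preceding underscore)
    node = re.sub(r'_?(RRZ|NFRZ|RZ)\d+$', '', node)
    return node
```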

**Examples:**

| Original node | Normalized | Result |
|---------------|------------|--------|
| `RRZ01_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ |
| `RRZ02_SERVER_MSSQL` | `SERVER_MSSQL` | ✅ merged |
| `NFRZ01_DATABASE_HANA` | `DATABASE_HANA` | ✅ |
| `SERVER_FILE_RRZ01` | `SERVER_FILE` | ✅ |
| `MYSERVER_ORACLE` | `MYSERVER_ORACLE` | ✅ unchanged |

---

## 2. API Reference

### 2.1 Agent Plugin Functions

#### `TSMParser.normalize_node_name(node: str) -> str`

Normalizes TSM node names for the redundancy logic.

**Parameters:**
- `node` (str): original TSM node name

**Returns:**
- `str`: normalized node name

**Example:**
```python
parser = TSMParser()
normalized = parser.normalize_node_name("RRZ01_SERVER_MSSQL")
# normalized == "SERVER_MSSQL"
```

---

#### `TSMParser.is_valid_node(node: str, status: str) -> bool`

Checks whether a node is valid for monitoring.

**Parameters:**
- `node` (str): node name
- `status` (str): backup status

**Returns:**
- `bool`: True if valid, False otherwise

**Validation rules:**
- The node must exist (not empty)
- The node must be at least 3 characters long
- The status must exist
- The node must not contain "MAINTENANCE"

**Example:**
```python
parser.is_valid_node("SE", "Completed")                  # False (too short)
parser.is_valid_node("SERVER_MSSQL", "Completed")        # True
parser.is_valid_node("SERVER_MAINTENANCE", "Completed")  # False
```

---

#### `TSMParser.parse_csv(csv_file: Path) -> None`

Parses a TSM CSV file and collects backup information.

**Parameters:**
- `csv_file` (Path): path to the CSV file

**CSV format:**
```
TIMESTAMP,FIELD,NODE_NAME,SCHEDULE_NAME,STATUS
2026-01-12 08:00:00,SOMETHING,SERVER_MSSQL,DAILY_FULL,Completed
```

**Side effects:**
- Appends the parsed backups to `self.backups`
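
A minimal sketch of such a parser for the format above. The internal record layout (`time`, `node`, `schedule`, `status`) is an illustrative assumption, not the plugin's actual structure:

```python
import csv
from datetime import datetime
from pathlib import Path

class TSMParser:
    def __init__(self):
        self.backups = []

    def parse_csv(self, csv_file: Path) -> None:
        # Expected columns: TIMESTAMP,FIELD,NODE_NAME,SCHEDULE_NAME,STATUS
        with open(csv_file, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) < 5:
                    continue  # skip short or malformed lines
                time_str, _, node, schedule, status = (c.strip() for c in row[:5])
                try:
                    ts = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
                except ValueError:
                    continue  # skip header rows and unparsable timestamps
                self.backups.append({
                    "time": int(ts.timestamp()),
                    "node": node,
                    "schedule": schedule,
                    "status": status,
                })
```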

---

#### `TSMParser.aggregate() -> dict`

Aggregates the backup data per normalized node.

**Returns:**
```python
{
    "SERVER_MSSQL": {
        "statuses": ["Completed", "Completed", "Failed"],
        "schedules": ["DAILY_FULL", "DAILY_DIFF", "HOURLY_LOG"],
        "last": 1736693420,  # Unix timestamp
        "count": 3
    }
}
```

---
### 2.2 Check Plugin Functions

#### `extract_backup_type(node: str) -> str`

Extracts the backup type from the node name, based on a list of known types.

**Parameters:**
- `node` (str): normalized node name

**Returns:**
- `str`: backup type in lowercase, or "unknown"

**Known types:**
- Databases: MSSQL, HANA, Oracle, DB2, MySQL
- Virtualization: Virtual
- File systems: FILE, SCALE, DM, Datacenter
- Applications: Mail

**Algorithm:**
1. Split the node name at underscores
2. Take the last segment
3. If the last segment is numeric → take the second-to-last segment
4. Check against the list of known types
5. Return lowercase, or "unknown"

**Examples:**
```python
extract_backup_type("SERVER_MSSQL")      # → "mssql"
extract_backup_type("DATABASE_HANA_01")  # → "hana"
extract_backup_type("FILESERVER_FILE")   # → "file"
extract_backup_type("VM_HYPERV_123")     # → "hyperv"
extract_backup_type("APP_UNKNOWN")       # → "unknown"
```
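
The five steps can be sketched as follows. The set of known types is taken from this document, with `HYPERV` added as an assumption so that the `VM_HYPERV_123` example above resolves:

```python
# Known types from this document; HYPERV is an assumed addition.
KNOWN_TYPES = {'MSSQL', 'HANA', 'ORACLE', 'DB2', 'MYSQL', 'VIRTUAL',
               'FILE', 'SCALE', 'DM', 'DATACENTER', 'MAIL', 'HYPERV'}

def extract_backup_type(node: str) -> str:
    parts = node.split('_')
    segment = parts[-1]
    # Numeric suffixes such as instance numbers are skipped
    if segment.isdigit() and len(parts) > 1:
        segment = parts[-2]
    return segment.lower() if segment.upper() in KNOWN_TYPES else "unknown"
```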

---

#### `extract_backup_level(schedules: list[str]) -> str`

Extracts the backup level from the schedule names.

**Parameters:**
- `schedules` (list[str]): list of schedule names

**Returns:**
- `str`: `"log"`, `"full"`, `"incremental"`, or `"differential"`

**Priority:** log > full > differential > incremental

**Detection patterns:**
- `_LOG` or `LOG` → log
- `_FULL` or `FULL` → full
- `_INCR` or `INCREMENTAL` → incremental
- `_DIFF` or `DIFFERENTIAL` → differential

**Examples:**
```python
extract_backup_level(["DAILY_FULL"])                # → "full"
extract_backup_level(["HOURLY_LOG", "DAILY_FULL"])  # → "log"
extract_backup_level(["00-00-00_FULL"])             # → "full"
```
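
A sketch of the priority cascade. The fallback to `"incremental"` when no pattern matches is an assumption; the document does not state what the function returns in that case:

```python
def extract_backup_level(schedules: list[str]) -> str:
    joined = " ".join(s.upper() for s in schedules)
    # Priority: log > full > differential > incremental
    if "LOG" in joined:
        return "log"
    if "FULL" in joined:
        return "full"
    if "DIFF" in joined:
        return "differential"
    # Fallback (assumption): treat everything else as incremental
    return "incremental"
```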

---

#### `extract_frequency(schedules: list[str]) -> str`

Extracts the backup frequency from the schedule names.

**Parameters:**
- `schedules` (list[str]): list of schedule names

**Returns:**
- `str`: `"hourly"`, `"daily"`, `"weekly"`, `"monthly"`, or `"unknown"`

**Priority:** hourly > daily > weekly > monthly

**Detection patterns:**
- `HOURLY` → hourly
- `DAILY` → daily
- `WEEKLY` → weekly
- `MONTHLY` → monthly
- `HH-MM-SS_*LOG` → hourly (time-based with LOG)
- `00-00-00_*` → daily (midnight)

**Examples:**
```python
extract_frequency(["DAILY_FULL"])                 # → "daily"
extract_frequency(["00-00-00_FULL"])              # → "daily"
extract_frequency(["08-00-00_LOG"])               # → "hourly"
extract_frequency(["WEEKLY_FULL", "DAILY_DIFF"])  # → "daily"
```
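
The patterns and the priority order combine into the following sketch, which first classifies each schedule and then returns the highest-priority frequency found:

```python
import re

def extract_frequency(schedules: list[str]) -> str:
    found = set()
    for schedule in schedules:
        s = schedule.upper()
        # Time-based schedules with LOG count as hourly
        if "HOURLY" in s or re.match(r'\d{2}-\d{2}-\d{2}_.*LOG', s):
            found.add("hourly")
        # 00-00-00_* is a midnight schedule, i.e. daily
        elif "DAILY" in s or s.startswith("00-00-00_"):
            found.add("daily")
        elif "WEEKLY" in s:
            found.add("weekly")
        elif "MONTHLY" in s:
            found.add("monthly")
    # Priority: hourly > daily > weekly > monthly
    for freq in ("hourly", "daily", "weekly", "monthly"):
        if freq in found:
            return freq
    return "unknown"
```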

---

#### `get_error_handling(backup_type: str) -> str`

Determines the error-handling strategy based on the backup type.

**Parameters:**
- `backup_type` (str): backup type

**Returns:**
- `str`: `"tolerant"` or `"strict"`

**Logic:**
```python
if backup_type in TOLERANT_TYPES:
    return "tolerant"  # Failed → WARN
else:
    return "strict"    # Failed → CRIT
```

**Tolerant types:**
- file, virtual, scale, dm, datacenter
- vmware, hyperv, mail, exchange

**Strict:**
- All database types (mssql, hana, oracle, db2, ...)
- All other types

---

#### `get_backup_category(backup_type: str) -> str`

Categorizes the backup type into top-level categories.

**Parameters:**
- `backup_type` (str): backup type

**Returns:**
- `str`: `"database"`, `"virtualization"`, `"filesystem"`, `"application"`, or `"other"`

**Categories:**

| Category | Types |
|----------|-------|
| `database` | mssql, hana, oracle, db2, mysql, postgres, mariadb, sybase, mongodb |
| `virtualization` | virtual, vmware, hyperv, kvm, xen |
| `filesystem` | file, scale, dm, datacenter |
| `application` | mail, exchange |
| `other` | everything else |
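
The table maps directly onto a dictionary lookup; a sketch:

```python
# Category table from this document
CATEGORY_MAP = {
    "database": {"mssql", "hana", "oracle", "db2", "mysql", "postgres",
                 "mariadb", "sybase", "mongodb"},
    "virtualization": {"virtual", "vmware", "hyperv", "kvm", "xen"},
    "filesystem": {"file", "scale", "dm", "datacenter"},
    "application": {"mail", "exchange"},
}

def get_backup_category(backup_type: str) -> str:
    for category, types in CATEGORY_MAP.items():
        if backup_type in types:
            return category
    return "other"
```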

---

#### `get_thresholds(backup_type: str, backup_level: str) -> dict`

Returns type- and level-specific thresholds.

**Parameters:**
- `backup_type` (str): backup type
- `backup_level` (str): backup level

**Returns:**
```python
{
    "warn": 93600,   # seconds
    "crit": 172800   # seconds
}
```

**Priority:**
1. If `backup_level == "log"` → LOG thresholds (4h/8h)
2. If `backup_type` is in THRESHOLDS → type-specific thresholds
3. Otherwise → default thresholds (26h/48h)

**Examples:**
```python
get_thresholds("mssql", "log")     # → {"warn": 14400, "crit": 28800}
get_thresholds("mssql", "full")    # → {"warn": 93600, "crit": 172800}
get_thresholds("newtype", "full")  # → {"warn": 93600, "crit": 172800}
```
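
The lookup order can be sketched as follows. The single `THRESHOLDS` entry shown here is illustrative (it happens to equal the default); the real per-type entries live in the plugin:

```python
LOG_THRESHOLDS = {"warn": 4 * 3600, "crit": 8 * 3600}        # 4h/8h
DEFAULT_THRESHOLDS = {"warn": 26 * 3600, "crit": 48 * 3600}  # 26h/48h
THRESHOLDS = {
    # Illustrative per-type entry
    "mssql": {"warn": 26 * 3600, "crit": 48 * 3600},
}

def get_thresholds(backup_type: str, backup_level: str) -> dict:
    # 1. Log backups are always judged against the tight LOG thresholds
    if backup_level == "log":
        return LOG_THRESHOLDS
    # 2. Type-specific thresholds, 3. otherwise the defaults
    return THRESHOLDS.get(backup_type, DEFAULT_THRESHOLDS)
```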

---

#### `calculate_state(statuses: list[str], last_time: int, backup_type: str, error_handling: str) -> tuple[State, str]`

Computes the CheckMK state from the backup statuses.

**Parameters:**
- `statuses` (list[str]): list of all backup statuses
- `last_time` (int): Unix timestamp of the last backup
- `backup_type` (str): backup type
- `error_handling` (str): "tolerant" or "strict"

**Returns:**
- `tuple`: `(State, status_text)`

**State logic table:**

| Condition | Age | Error handling | State | Text |
|-----------|-----|----------------|-------|------|
| ≥1x "completed" | - | - | OK | "Completed" |
| Only "pending"/"started" | <2h | - | OK | "Pending/Started" |
| Only "pending"/"started" | >2h | - | WARN | "Pending (>2h)" |
| Only "pending"/"started" | unknown | - | WARN | "Pending" |
| "failed"/"missed" | - | tolerant | WARN | "Failed (partial)" |
| "failed"/"missed" | - | strict | CRIT | "Failed/Missed" |
| Other | - | - | CRIT | "Unknown State" |

**Examples:**
```python
calculate_state(["Completed", "Completed"], 1736690000, "mssql", "strict")
# → (State.OK, "Completed")

calculate_state(["Failed"], 1736690000, "file", "tolerant")
# → (State.WARN, "Failed (partial)")

calculate_state(["Failed"], 1736690000, "mssql", "strict")
# → (State.CRIT, "Failed/Missed")
```
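
The state table translates into a short cascade. This sketch uses a stand-in `State` enum so it runs outside a CheckMK site; `backup_type` is accepted for signature compatibility but is not consulted by the table:

```python
import time
from enum import Enum

class State(Enum):
    # Stand-in for cmk.agent_based.v2.State, so the sketch is self-contained
    OK = 0
    WARN = 1
    CRIT = 2

def calculate_state(statuses, last_time, backup_type, error_handling):
    lowered = [s.lower() for s in statuses]
    if "completed" in lowered:
        return State.OK, "Completed"
    if lowered and all(s in ("pending", "started") for s in lowered):
        if not last_time:
            return State.WARN, "Pending"  # age unknown
        if time.time() - last_time < 2 * 3600:
            return State.OK, "Pending/Started"
        return State.WARN, "Pending (>2h)"
    if any(s in ("failed", "missed") for s in lowered):
        if error_handling == "tolerant":
            return State.WARN, "Failed (partial)"
        return State.CRIT, "Failed/Missed"
    return State.CRIT, "Unknown State"
```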

---

### 2.3 CheckMK API v2 Objects

#### `Service`

Defines a CheckMK service during discovery.

```python
from cmk.agent_based.v2 import Service, ServiceLabel

Service(
    item="SERVER_MSSQL",
    labels=[
        ServiceLabel("backup_type", "mssql"),
        ServiceLabel("frequency", "daily"),
    ]
)
```

---

#### `Result`

Represents a check result.

```python
from cmk.agent_based.v2 import Result, State

Result(
    state=State.OK,
    summary="Type=MSSQL, Status=Completed, Last=3h"
)

Result(
    state=State.OK,
    notice="Detailed information for the details page"
)
```

---

#### `Metric`

Defines a performance metric.

```python
from cmk.agent_based.v2 import Metric

Metric(
    name="backup_age",
    value=11700,             # current value
    levels=(93600, 172800),  # (warn, crit)
    boundaries=(0, None),    # (min, max)
)
```

---
## 3. Advanced Configuration

### 3.1 Custom Backup Types

**Scenario:** a new backup type "SAPASE" (SAP ASE database) should be monitored.

**Step 1: Add the type to the known_types list**

```python
# In tsm_backups.py, extract_backup_type() function
known_types = [
    'MSSQL', 'HANA', 'FILE', 'ORACLE', 'DB2', 'SCALE', 'DM',
    'DATACENTER', 'VIRTUAL', 'MAIL', 'MYSQL',
    'SAPASE',  # NEW
]
```

**Step 2: Define thresholds (optional)**

```python
THRESHOLDS = {
    # ... existing entries ...
    "sapase": {"warn": 26 * 3600, "crit": 48 * 3600},
}
```

**Step 3: Add the type to the matching category**

```python
DATABASE_TYPES = {
    'mssql', 'hana', 'db2', 'oracle', 'mysql',
    'sapase',  # NEW
}
```

**Step 4: Set the error handling (optional)**

If tolerant handling is desired:
```python
TOLERANT_TYPES = {
    'file', 'virtual', 'scale', 'dm', 'datacenter',
    'vmware', 'hyperv', 'mail', 'exchange',
    'sapase',  # NEW (if tolerant handling is desired)
}
```

**Step 5: Reload the plugin**

```bash
cmk -R
cmk -II --all
```

**Result:**
- Nodes such as `SERVER_SAPASE` are detected automatically
- Type label: `backup_type=sapase`
- Category label: `backup_category=database`
- Thresholds: 26h/48h

---

### 3.2 Async Agent Plugin

In large TSM environments, parsing the CSV files can take a while. Async plugins run independently of the agent interval.

**Configuration:**

The Linux agent executes plugins placed in a subdirectory named after an interval in seconds asynchronously and caches their output:

```bash
# As root on the host: run the plugin every 5 minutes, cached
mkdir -p /usr/lib/check_mk_agent/plugins/300
mv /usr/lib/check_mk_agent/plugins/tsm_backups /usr/lib/check_mk_agent/plugins/300/
```

**Or via the CheckMK Bakery (rule):**

```
Setup > Agents > Agent Rules > Asynchronous execution of plugins (Windows, Linux)
```

**Settings:**
- Plugin: `tsm_backups`
- Execution interval: `300` seconds (5 minutes)
- Cache age: `600` seconds (10 minutes)

---

### 3.3 CSV Export Automation

#### Option A: NFS mount (recommended)

```bash
# /etc/fstab
tsm-server.example.com:/exports/backup-stats /mnt/CMK_TSM nfs defaults,ro 0 0

# Test the mount
mount -a
ls /mnt/CMK_TSM/
```

#### Option B: Rsync via cron

```bash
# Crontab for root
*/15 * * * * rsync -az --delete tsm-server:/path/to/csv/ /mnt/CMK_TSM/
```

#### Option C: SCP with an SSH key

```bash
# Set up the SSH key
ssh-keygen -t ed25519 -f ~/.ssh/tsm_backup_key -N ""
ssh-copy-id -i ~/.ssh/tsm_backup_key.pub tsm-server

# Crontab
*/15 * * * * scp -i ~/.ssh/tsm_backup_key tsm-server:/path/*.CSV /mnt/CMK_TSM/
```

---

### 3.4 Rule-Based Service Creation

**CheckMK rules for automatic service labels:**

```
Setup > Services > Discovery rules > Host labels
```

**Example rule:**
```yaml
conditions:
  service_labels:
    backup_category: database

actions:
  add_labels:
    criticality: high
    team: dba
```

---

### 3.5 Custom Views

#### View: all critical database backups

```
Setup > General > Custom views > Create new view

Name: Critical Database Backups
Datasource: All services

Filters:
  - Service state: CRIT
  - Service labels: backup_category = database

Columns:
  - Host
  - Service description
  - Service state
  - Service output
  - Service labels: backup_type
  - Service labels: frequency
  - Perf-O-Meter
```

---

### 3.6 Custom Notifications

**Notification rule: escalate only strict failed backups**

```
Setup > Notifications > Add rule

Conditions:
  - Service labels: error_handling = strict
  - Service state: CRIT
  - Service state type: HARD

Contact selection:
  - Specify users: dba-team

Notification method:
  - Email
  - PagerDuty
```

---

## 4. Development Guide

### 4.1 Setting Up a Development Environment

```bash
# CheckMK site for development
omd create dev
omd start dev
su - dev

# Git repository
cd ~/local/lib/python3/cmk_addons/plugins/
git init
git add .
git commit -m "Initial commit"

# Development workflow
vim tsm/agent_based/tsm_backups.py
cmk -R
cmk -vv --debug test-host | grep "TSM Backup"
```

---

### 4.2 Writing Unit Tests

**Test file:** `test_tsm_backups.py`

```python
#!/usr/bin/env python3
from tsm_backups import (
    extract_backup_type,
    extract_backup_level,
    calculate_state,
)
from cmk.agent_based.v2 import State


def test_extract_backup_type():
    assert extract_backup_type("SERVER_MSSQL") == "mssql"
    assert extract_backup_type("DATABASE_HANA_01") == "hana"
    assert extract_backup_type("NEWTYPE_CUSTOM") == "unknown"  # unknown types fall back


def test_extract_backup_level():
    assert extract_backup_level(["DAILY_FULL"]) == "full"
    assert extract_backup_level(["HOURLY_LOG", "DAILY_FULL"]) == "log"


def test_calculate_state_completed():
    state, text = calculate_state(
        ["Completed", "Completed"],
        1736690000,
        "mssql",
        "strict",
    )
    assert state == State.OK
    assert text == "Completed"


def test_calculate_state_failed_strict():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "mssql",
        "strict",
    )
    assert state == State.CRIT
    assert text == "Failed/Missed"


def test_calculate_state_failed_tolerant():
    state, text = calculate_state(
        ["Failed"],
        1736690000,
        "file",
        "tolerant",
    )
    assert state == State.WARN
    assert text == "Failed (partial)"
```

**Running the tests:**
```bash
pytest test_tsm_backups.py -v
```

---

### 4.3 Code Style

**PEP 8 compliance:**
```bash
pip install black flake8 mypy

# Auto-formatting
black tsm_backups.py

# Linting
flake8 tsm_backups.py

# Type checking
mypy tsm_backups.py
```

---

### 4.4 Debugging

#### Debugging the agent plugin

```bash
# Direct invocation with traceback
python3 /usr/lib/check_mk_agent/plugins/tsm_backups

# With the debugger
python3 -m pdb /usr/lib/check_mk_agent/plugins/tsm_backups
```

#### Debugging the check plugin

```bash
# Verbose check with debug output
cmk -vv --debug hostname | less

# Only the TSM services
cmk -vv --debug hostname | grep -A 20 "TSM Backup"

# Python debugger inside the plugin (add this line in the code)
import pdb; pdb.set_trace()
```

---

### 4.5 Performance Profiling

```python
# In tsm_backups.py
import cProfile
import pstats


def main():
    profiler = cProfile.Profile()
    profiler.enable()

    # ... existing code ...

    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)
```

---

## 5. Performance Optimization

### 5.1 Speeding Up CSV Parsing

**Problem:** large CSV files (>100 MB) slow down the agent.

**Solution 1: parse only the relevant lines**

```python
def parse_csv_optimized(self, csv_file):
    # Only the last 24h are relevant
    cutoff_time = datetime.now() - timedelta(hours=24)

    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            try:
                time_str = row[0].strip()
                timestamp = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")

                # Skip old entries
                if timestamp < cutoff_time:
                    continue

                # ... remaining processing ...
            except (ValueError, IndexError):
                continue  # skip malformed rows
```

**Solution 2: pandas for large files**

```python
import pandas as pd


def parse_csv_pandas(csv_file):
    df = pd.read_csv(
        csv_file,
        names=['timestamp', 'field', 'node', 'schedule', 'status'],
        parse_dates=['timestamp'],
    )

    # Keep only the last 24h
    cutoff = pd.Timestamp.now() - pd.Timedelta(hours=24)
    df = df[df['timestamp'] > cutoff]

    # Aggregation (named aggregation; the grouping column itself
    # cannot be a target of the agg dict)
    grouped = df.groupby('node').agg(
        statuses=('status', list),
        schedules=('schedule', list),
        last=('timestamp', 'max'),
        count=('status', 'count'),
    )

    return grouped.to_dict(orient='index')
```

---

### 5.2 Caching

**Problem:** the CSV files only change every 15-30 minutes.

**Solution: cache with a timestamp check**

```python
import json
from pathlib import Path
import time


CACHE_FILE = Path("/tmp/tsm_backups_cache.json")
CACHE_TTL = 300  # 5 minutes


def get_cached_or_parse():
    if CACHE_FILE.exists():
        cache_age = time.time() - CACHE_FILE.stat().st_mtime
        if cache_age < CACHE_TTL:
            with open(CACHE_FILE, 'r') as f:
                return json.load(f)

    # Parse fresh
    parser = TSMParser()
    # ... parse logic ...
    result = parser.aggregate()

    # Write the cache
    with open(CACHE_FILE, 'w') as f:
        json.dump(result, f)

    return result
```

---

### 5.3 Memory Optimization

**Problem:** large lists of status/schedule strings.

**Solution: store only unique values**

```python
from collections import defaultdict


def aggregate_optimized(self):
    nodes = defaultdict(lambda: {
        "statuses": set(),  # set instead of list
        "schedules": set(),
        "last": None,
        "count": 0,
    })

    for b in self.backups:
        node = b["node"]
        nodes[node]["count"] += 1
        nodes[node]["statuses"].add(b["status"])  # automatically unique
        nodes[node]["schedules"].add(b["schedule"])
        # ... rest ...

    # Convert the sets to lists for JSON
    for node in nodes:
        nodes[node]["statuses"] = list(nodes[node]["statuses"])
        nodes[node]["schedules"] = list(nodes[node]["schedules"])

    return nodes
```

---

## 6. Security

### 6.1 File Permissions

```bash
# Agent plugin
chown root:root /usr/lib/check_mk_agent/plugins/tsm_backups
chmod 755 /usr/lib/check_mk_agent/plugins/tsm_backups

# CSV directory
chown root:root /mnt/CMK_TSM
chmod 755 /mnt/CMK_TSM
chmod 644 /mnt/CMK_TSM/*.CSV

# Check plugin
chown <site>:<site> $OM/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
chmod 644 $OM/local/lib/python3/cmk_addons/plugins/tsm/agent_based/tsm_backups.py
```

---

### 6.2 Input Validation

**Agent plugin:**
```python
import re


def is_valid_node(self, node, status):
    # Check the length
    if not node or len(node) < 3 or len(node) > 200:
        return False

    # Disallowed characters
    if not re.match(r'^[A-Za-z0-9_-]+$', node):
        return False

    # Status whitelist
    valid_statuses = ['Completed', 'Failed', 'Missed', 'Pending', 'Started']
    if status not in valid_statuses:
        return False

    return True
```

---

### 6.3 Safe CSV Processing

```python
import sys


def parse_csv_safe(self, csv_file):
    try:
        # Check the file size (max 500 MB)
        if csv_file.stat().st_size > 500 * 1024 * 1024:
            return

        with open(csv_file, 'r', encoding='utf-8') as f:
            reader = csv.reader(f)

            line_count = 0
            for row in reader:
                line_count += 1

                # At most 1 million lines
                if line_count > 1000000:
                    break

                # ... processing ...
    except Exception as e:
        # Log instead of crashing the agent
        print(f"tsm_backups: failed to parse {csv_file}: {e}", file=sys.stderr)
```

---

## 7. Integration

### 7.1 Grafana Dashboard

**InfluxDB query:**
```sql
SELECT
  mean("backup_age") AS "avg_age",
  max("backup_age") AS "max_age"
FROM "tsm_backups"
WHERE
  "backup_category" = 'database'
  AND time > now() - 7d
GROUP BY
  time(1h),
  "node_name"
```

**Panels:**
- Backup age heatmap (per node)
- Status distribution (pie chart)
- Backup jobs timeline
- Alert history

---

### 7.2 Prometheus Exporter

**Configuring the CheckMK Prometheus exporter:**

```
Setup > Exporter > Prometheus

Metrics:
  - cmk_tsm_backups_backup_age_seconds
  - cmk_tsm_backups_backup_jobs_total

Labels:
  - backup_type
  - backup_category
  - frequency
```

---
|
||
|
||
|
||
### 7.3 REST API Zugriff
|
||
|
||
|
||
```python
|
||
import requests
|
||
|
||
|
||
# CheckMK REST API
|
||
url = "https://checkmk.example.com/site/check_mk/api/1.0"
|
||
headers = {
|
||
"Authorization": "Bearer YOUR_API_KEY",
|
||
"Accept": "application/json"
|
||
}
|
||
|
||
|
||
# Alle TSM-Services abfragen
|
||
response = requests.get(
|
||
f"{url}/domain-types/service/collections/all",
|
||
headers=headers,
|
||
params={
|
||
"query": '{"op": "and", "expr": [{"op": "~", "left": "description", "right": "TSM Backup"}]}'
|
||
}
|
||
)
|
||
|
||
|
||
services = response.json()
|
||
```

---


## 8. Best Practices


### 8.1 Naming Conventions


**Node names:**
```
✅ RECOMMENDED:
- SERVER_MSSQL
- APP_ORACLE_01
- FILESERVER_BACKUP

❌ AVOID:
- MSSQL (too generic)
- SERVER-PROD (hyphens can cause problems)
- very_long_name_that_is_too_descriptive_mssql_backup_node (>50 characters)
```

**Schedule names:**
```
✅ RECOMMENDED:
- DAILY_FULL
- HOURLY_LOG
- WEEKLY_FULL

❌ AVOID:
- PROD_BACKUP (frequency not recognizable)
- BACKUP01 (carries no information)
```
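
These conventions can also be enforced mechanically, e.g. during CSV import or in a pre-commit hook. A minimal sketch; the function names are illustrative, and the regex and frequency prefixes simply encode the recommendations above:

```python
import re

# Uppercase letters, digits and underscores only, 1-50 characters,
# no hyphens (see the "AVOID" examples above)
NODE_NAME_RE = re.compile(r"^[A-Z][A-Z0-9_]{0,49}$")

# Schedule names should start with a recognizable frequency
FREQUENCY_PREFIXES = ("HOURLY_", "DAILY_", "WEEKLY_", "MONTHLY_")

def node_name_ok(name: str) -> bool:
    return bool(NODE_NAME_RE.match(name))

def schedule_name_ok(name: str) -> bool:
    return name.startswith(FREQUENCY_PREFIXES)

print(node_name_ok("SERVER_MSSQL"))    # True
print(node_name_ok("SERVER-PROD"))     # False: hyphen
print(schedule_name_ok("DAILY_FULL"))  # True
print(schedule_name_ok("BACKUP01"))    # False: no frequency prefix
```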

---


### 8.2 Monitoring strategy


**Alert escalation:**

1. **Level 1 (INFO):** backup started/pending
2. **Level 2 (WARN):**
   - backup age > WARN threshold
   - Failed (tolerant types)
   - Pending > 2 h
3. **Level 3 (CRIT):**
   - backup age > CRIT threshold
   - Failed/Missed (strict types)
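
The escalation ladder above can be expressed as a small lookup, along the lines of the `calculate_state()` helper mentioned in the error code reference. A sketch under stated assumptions: CheckMK state codes 0=OK, 1=WARN, 2=CRIT, a boolean `strict` flag distinguishing strict from tolerant backup types, and tolerant `Missed` treated like tolerant `Failed`:

```python
OK, WARN, CRIT = 0, 1, 2

def calculate_state(status: str, strict: bool) -> int:
    """Map a TSM schedule result to a monitoring state (sketch)."""
    if status in ("Completed", "Started", "Pending"):
        # Started/Pending are informational here; age thresholds and
        # "Pending > 2 h" are checked separately against WARN/CRIT limits
        return OK
    if status in ("Failed", "Missed"):
        # Strict types escalate straight to CRIT, tolerant ones to WARN
        return CRIT if strict else WARN
    # Unexpected status from TSM
    return CRIT

print(calculate_state("Failed", strict=True))   # 2 (CRIT)
print(calculate_state("Failed", strict=False))  # 1 (WARN)
```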

**Notification delays:**
```
Setup > Notifications > Rules

WARN: Notify after 15 minutes (allow recovery)
CRIT: Notify immediately
```

---


### 8.3 Maintenance Windows


**Pausing backup services during a maintenance window:**

```
Setup > Services > Service monitoring rules > Disabled checks

Conditions:
- Service labels: backup_system = tsm
- Timeperiod: maintenance_window

Action: Disable active checks
```

---


### 8.4 Documentation


**Document for each installation:**

1. **CSV export source:** which TSM server, which queries
2. **CSV transfer method:** NFS/SCP/rsync plus schedule
3. **Custom types:** list of all backup types that were added
4. **Adjusted thresholds:** rationale for any deviations
5. **Contacts:** who is responsible for the TSM backups

---


### 8.5 Regular maintenance


**Monthly:**
- Clean up the CSV directory (delete old files)
- Verify that all expected nodes are found
- Analyze the alert history: any false positives?
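
The CSV cleanup can be automated, e.g. from a monthly cron job. A minimal sketch; the mount point `/mnt/CMK_TSM` is the one from the agent plugin documentation, while the function name and the 30-day retention are illustrative assumptions:

```python
import time
from pathlib import Path

def cleanup_csv(directory: str, retention_days: int = 30) -> list[str]:
    """Delete *.CSV files older than retention_days; return deleted names."""
    cutoff = time.time() - retention_days * 86400
    deleted = []
    for csv_file in Path(directory).glob("*.CSV"):
        if csv_file.stat().st_mtime < cutoff:
            csv_file.unlink()
            deleted.append(csv_file.name)
    return deleted

# Example (e.g. called from a monthly cron job):
# cleanup_csv("/mnt/CMK_TSM", retention_days=30)
```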

**Quarterly:**
- Review thresholds and adjust them if necessary
- Document new backup types
- Check for check-plugin updates

**Yearly:**
- Test CheckMK upgrade compatibility
- Performance review (agent runtime, check duration)
- Architecture review (is the solution still a good fit?)

---


## Appendix


### A. Glossary


| Term | Description |
|------|-------------|
| **Agent plugin** | Script on the monitored host that delivers data to CheckMK |
| **Check plugin** | Code on the CheckMK server that creates services and evaluates their status |
| **Service label** | Key-value pair attached to a service (used for filtering/reporting) |
| **Discovery** | Process by which CheckMK automatically creates services |
| **Threshold** | WARN/CRIT limit for a metric |
| **Node** | TSM term for a backup client |
| **Schedule** | TSM term for a scheduled backup job |

---


### B. Error code reference


| Error | Cause | Solution |
|-------|-------|----------|
| `Backup not found in data` | Node exists in discovery but not in the current agent output | Check the CSV files; re-run discovery if needed |
| `Empty agent section` | Agent delivers no data | Check the agent plugin execution and the CSV directory |
| `JSON decode error` | Agent output is not valid JSON | Test the agent plugin manually and look for errors in its output |
| `Unknown State` | Unexpected status from TSM | Check the agent output; extend `calculate_state()` if needed |
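
For the `JSON decode error` case it helps to validate the agent output programmatically before digging deeper. A minimal sketch based on the JSON structure shown in section 1.1; the function name is illustrative, and in practice `raw` would be the output of running the agent plugin:

```python
import json

# Keys every node entry must carry, per the agent output format in 1.1
REQUIRED_KEYS = {"statuses", "schedules", "last", "count"}

def validate_agent_output(raw: str) -> list[str]:
    """Return a list of problems found in the agent plugin's JSON output."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"JSON decode error: {exc}"]
    for node, entry in data.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{node}: missing keys {sorted(missing)}")
        elif entry["count"] != len(entry["statuses"]):
            problems.append(f"{node}: count does not match statuses")
    return problems

sample = ('{"SERVER_MSSQL": {"statuses": ["Completed"], '
          '"schedules": ["DAILY_FULL"], "last": 1736693420, "count": 1}}')
print(validate_agent_output(sample))  # []
```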

---


### C. TSM query for the CSV export


**Example query for the TSM server (dsmadmc):**

```sql
SELECT
    DATE(END_TIME) || ' ' || TIME(END_TIME) AS DATETIME,
    ENTITY,
    NODE_NAME,
    SCHEDULE_NAME,
    RESULT
FROM ACTLOG
WHERE
    SCHEDULE_NAME IS NOT NULL
    AND SCHEDULE_NAME != ''
    -- DB2 interval code 8 = hours: only entries from the last 24 hours
    AND TIMESTAMPDIFF(8, CHAR(CURRENT_TIMESTAMP - END_TIME)) <= 24
ORDER BY END_TIME DESC
```

**Export as CSV:**
```bash
dsmadmc -id=admin -pa=password -comma \
    "SELECT ... FROM ACTLOG ..." \
    > /exports/backup-stats/TSM_BACKUP_SCHED_24H.CSV
```

---


**End of the technical documentation**


**Last updated:** 2026-01-12
**Version:** 4.1
**Author:** Marius Gielnik