Metadata-Version: 2.4
Name: prometheus-monitoring-scripts
Version: 0.0.0
Summary: prometheus-monitoring-scripts
Author-email: Jordan Tardif <jordan@dreamhost.com>
License: Apache Software License 2.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: prometheus-client
Requires-Dist: storable
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: coverage; extra == "test"

# Prometheus Monitoring Scripts

A collection of custom exporters for Prometheus monitoring, designed to collect metrics from various systems and services.

## Installation

Clone the repository and set up the development environment using the Makefile:

```bash
git clone https://git.dreamhost.com/dreamhost/infra/prometheus-monitoring-scripts.git
cd prometheus-monitoring-scripts
make setup
source env/bin/activate
```

This will create a virtual environment, install the package in development mode, and install all required dependencies.

> Note: The `make setup` command requires the `uv` tool, a modern Python package manager. If you don't have `uv` installed, you can install it following the instructions at [https://docs.astral.sh/uv/getting-started/installation/](https://docs.astral.sh/uv/getting-started/installation/).

## Development

The project includes several Makefile targets to help with development:

- `make setup` - Set up the development environment
- `make style` - Check code style using Ruff
- `make autopep` - Automatically fix code style issues using Ruff
- `make test` - Run functional tests
- `make test_smoke` - Run smoke tests

To run specific tests, use:

```bash
make test test=tests/path/to/test
```

## Usage

The monitoring scripts are organized as modules that can be called using the `custom_exporter` command:

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter <module_name> [arguments]
```

In a development environment, `custom_exporter` is on your `PATH`, so you can run it without the full path.

### Available Modules

- `disk.xfs` - XFS quota metrics for user disk usage
- `disk.podman` - Disk usage metrics for Podman containers
- `mailq.generic` - Postfix mail queue metrics
- `mailq.podman` - Postfix mail queue metrics for Podman containers
- `mailq.mailman` - Mailman queue monitor
- `backups.users` - User backup metrics
- `backups.vms` - VM backup metrics
- `service.podman` - Podman service metrics
- `dphactl.core` - Core dp-ha-ctl metrics for monitoring Redis connection and service status
- `dphactl.systemctl` - Systemctl-based checks for Podman socket and HAManager service
- `dphactl.logging_checks` - Log-related checks for Redis auth failures and timeouts
- `dphactl.containers` - Container difference checks between hosts
- `dphactl.verify` - Missing file verification checks from dp-ha-ctl

## Module Documentation

### disk.xfs

Collects XFS quota metrics for disk usage by user and filesystem.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter disk.xfs
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `xfs_disk_avail_bytes` | Gauge | Available disk space in bytes | `filesystem`, `user` |
| `xfs_disk_used_bytes` | Gauge | Used disk space in bytes | `filesystem`, `user` |

The exporter runs `xfs_quota` on the specified mount point (default: `/home`) to gather usage data.

### mailq.generic

Collects Postfix mail queue metrics by counting files in queue directories.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter mailq.generic
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `postfix_queue_size` | Gauge | Number of emails in queue | `queue` |

Monitors all standard Postfix queues: active, bounce, deferred, incoming, and maildrop.
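
The counting approach described above can be sketched with the standard library alone. This is a minimal illustration, not the exporter's actual code: the spool location `/var/spool/postfix`, the helper names `count_queue_files` and `render_queue_sizes`, and the recursive walk into subdirectories are all assumptions. Output uses the Prometheus text exposition format.

```python
import os

# Queue names taken from the list above.
QUEUES = ("active", "bounce", "deferred", "incoming", "maildrop")


def count_queue_files(spool_dir, queue):
    """Count non-directory entries under <spool_dir>/<queue>, recursively."""
    total = 0
    # os.walk yields nothing for a missing directory, so the count stays 0.
    for _root, _dirs, files in os.walk(os.path.join(spool_dir, queue)):
        total += len(files)
    return total


def render_queue_sizes(spool_dir="/var/spool/postfix"):
    """Return one exposition-format sample line per queue.

    The default spool_dir is an assumption, not taken from the exporter.
    """
    lines = ["# TYPE postfix_queue_size gauge"]
    for queue in QUEUES:
        size = count_queue_files(spool_dir, queue)
        lines.append(f'postfix_queue_size{{queue="{queue}"}} {size}')
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_queue_sizes())
```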

### mailq.podman

Extends mail queue monitoring to Postfix instances running in Podman containers.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter mailq.podman [container_filter]
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `postfix_queue_size` | Gauge | Number of emails in queue | `machine`, `queue` |
| `postfix_queue_errors` | Gauge | Incremented once for each error accessing queues | `machine` |

The optional `container_filter` argument restricts monitoring to specific containers.

### mailq.mailman

Monitors Mailman queue sizes by counting files in queue directories across multiple mailman instances.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter mailq.mailman [base_path]
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `mailman_queue_size` | Gauge | Number of files in queue directories | `service`, `queue` |

This exporter scans mailman instances in the specified base path (default: `/dh/mailman`) and counts files in both "in" and "out" queue directories. Each service (mailman instance directory) is tracked separately with the directory name as the service label.
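
As a rough sketch of the per-instance scan, the snippet below walks each instance directory and reports one sample per service and queue. The on-disk layout `<base_path>/<service>/<queue>/` and the helper names `mailman_queue_sizes` and `render` are hypothetical; the real exporter may locate the queue directories differently.

```python
from pathlib import Path

QUEUES = ("in", "out")


def mailman_queue_sizes(base_path="/dh/mailman"):
    """Map (service, queue) to a file count for each instance directory."""
    sizes = {}
    base = Path(base_path)
    if not base.is_dir():
        return sizes
    for instance in sorted(p for p in base.iterdir() if p.is_dir()):
        for queue in QUEUES:
            # Hypothetical layout: <base_path>/<service>/<queue>/
            queue_dir = instance / queue
            count = 0
            if queue_dir.is_dir():
                count = sum(1 for f in queue_dir.iterdir() if f.is_file())
            # The instance directory name becomes the `service` label.
            sizes[(instance.name, queue)] = count
    return sizes


def render(sizes):
    """Format the counts as Prometheus text exposition lines."""
    lines = ["# TYPE mailman_queue_size gauge"]
    for (service, queue), count in sorted(sizes.items()):
        lines.append(
            f'mailman_queue_size{{service="{service}",queue="{queue}"}} {count}'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(mailman_queue_sizes()))
```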

### backups.users

Collects metrics about user backup status and history.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter backups.users
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `backup_last_successful` | Gauge | Timestamp of last successful backup | `user`, `machine`, `vmhost` |
| `backup_last_rsync_exit_code` | Gauge | Exit code of last rsync operation | `user`, `machine`, `vmhost` |
| `backup_last_attempted_backup` | Gauge | Timestamp of last backup attempt | `user`, `machine`, `vmhost` |
| `backup_last_user_state` | Gauge | State of last backup (1=active state) | `user`, `machine`, `vmhost`, `state` |
| `backup_state_retrive_failed` | Gauge | Indicates backup state retrieval failed | `machine`, `vmhost` |
| `backup_status` | Gauge | Overall backup status | `machine`, `vmhost`, `status` |

Reads backup state from `/usr/local/dh/var/localdata/backup.state` and user information from `/usr/local/dh/etc/localdata/users.json`.

### backups.vms

Extends backup monitoring to virtual machines using the same metrics as user backups.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter backups.vms
```

Exports the same metrics as `backups.users`, applied to the VM guests defined in `/usr/local/dh/etc/localdata/guests.json`.

### service.podman

Monitors systemd services running within Podman containers.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter service.podman <services> [container_filter]
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `node_systemd_unit_state` | Gauge | State of systemd units (1=in this state) | `machine`, `name`, `state`, `type` |
| `podman_exec_errors` | Gauge | Incremented once for each failed exec into a container | `machine` |

The first argument is required and must be a comma-separated list of service names to monitor. The optional second argument filters which containers to check.
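
The "1=in this state" convention can be illustrated as follows: each unit gets one sample per possible state, and only the current state carries the value 1. This is a sketch under assumptions, not the module's code. The state list follows node_exporter's systemd collector, the helper names are hypothetical, and the `type` value `simple` and the machine and unit names are placeholders.

```python
# Unit states as used by node_exporter's systemd collector.
STATES = ("active", "activating", "deactivating", "failed", "inactive")


def parse_services(arg):
    """Split a comma-separated service list, dropping blank entries."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def state_samples(machine, unit, current_state, unit_type="simple"):
    """Return one sample per known state: 1 for the current state, else 0."""
    lines = []
    for state in STATES:
        value = 1 if state == current_state else 0
        lines.append(
            f'node_systemd_unit_state{{machine="{machine}",name="{unit}",'
            f'state="{state}",type="{unit_type}"}} {value}'
        )
    return lines


if __name__ == "__main__":
    # Placeholder machine and unit names, for illustration only.
    for unit in parse_services("postfix.service, dovecot.service"):
        print("\n".join(state_samples("mail01", unit, "active")))
```

Emitting a zero for every inactive state keeps the series set stable, so a query such as `node_systemd_unit_state{state="failed"} == 1` works without absent-series handling.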

### dphactl.core

Provides core metrics for the DP-HA-CTL system, focusing on basic functionality like Redis connection and service status.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter dphactl.core
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `dp_hactl_redis_connected` | Gauge | Status of Redis backend connection (1=connected) | |
| `dp_hactl_service_up` | Gauge | Status of dp-ha-ctl service (1=up) | |

The module implements timeouts for all commands and preserves subprocess exit codes for detailed error reporting.
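
One way to combine a timeout with exit code preservation is shown below. It is a minimal sketch: the helper name `run_with_timeout`, the 10-second default, and the choice to report a timeout as `None` and a missing binary as 127 are assumptions, not the module's documented behavior.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout=10):
    """Run cmd and return (exit_code, stdout); exit_code is None on timeout."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None, ""
    except FileNotFoundError:
        # Conventional shell exit code for "command not found".
        return 127, ""
    # Keep the real exit code so callers can distinguish failure modes.
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    code, out = run_with_timeout(
        [sys.executable, "-c", "import sys; sys.exit(3)"]
    )
    print(code)
```

Returning the exit code instead of collapsing it to a boolean lets the caller map distinct failures to distinct metric values or labels.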

### dphactl.systemctl

Monitors system services related to DP-HA-CTL functionality using systemctl.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter dphactl.systemctl
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `podman_socket_active` | Gauge | Status of podman.socket systemd unit (1=active) | |
| `hamanager_service_active` | Gauge | Status of hamanager.service systemd unit (1=active) | |

This module can be removed or replaced once node-exporter checks are enabled on the hosts.
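
A check of this kind can be sketched with `systemctl is-active`, which exits with status 0 when the unit is active. The helper name `unit_is_active`, the injectable `runner` parameter (included only to make the function testable), and the decision to report 0 when `systemctl` is missing or times out are assumptions for illustration.

```python
import subprocess


def unit_is_active(unit, runner=subprocess.run, timeout=5):
    """Return 1 if `systemctl is-active` exits 0 for the unit, else 0."""
    try:
        result = runner(
            ["systemctl", "is-active", "--quiet", unit],
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # systemctl missing or hung: treat the unit as not active.
        return 0
    return 1 if result.returncode == 0 else 0


def render():
    """Format both gauges as Prometheus text exposition lines."""
    return "\n".join([
        f"podman_socket_active {unit_is_active('podman.socket')}",
        f"hamanager_service_active {unit_is_active('hamanager.service')}",
    ])
```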

### dphactl.logging_checks

Examines log files to track Redis-related issues.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter dphactl.logging_checks
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `hamanager_redis_auth_failures` | Gauge | Count of Redis authentication failures in logs | |
| `hamanager_redis_timeouts` | Gauge | Count of Redis timeouts in logs | |

Scans the log files in `/var/log/hamanager/` for specific patterns related to Redis errors.
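
The log scan can be approximated as a per-line regular expression match. In this sketch the two patterns, the `*.log` file suffix, and the helper name `count_matches` are hypothetical; the real log messages and file names may differ.

```python
import re
from pathlib import Path

# Hypothetical patterns; the real log messages may be worded differently.
PATTERNS = {
    "hamanager_redis_auth_failures": re.compile(r"redis.*auth", re.IGNORECASE),
    "hamanager_redis_timeouts": re.compile(r"redis.*timed? ?out", re.IGNORECASE),
}


def count_matches(log_dir="/var/log/hamanager"):
    """Count lines matching each pattern across *.log files in log_dir."""
    counts = dict.fromkeys(PATTERNS, 0)
    for log_file in sorted(Path(log_dir).glob("*.log")):
        try:
            text = log_file.read_text(errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the scrape
        for line in text.splitlines():
            for metric, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[metric] += 1
    return counts


if __name__ == "__main__":
    for metric, value in count_matches().items():
        print(f"{metric} {value}")
```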

### dphactl.containers

Monitors container differences between hosts.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter dphactl.containers
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `dp_hactl_missing_containers_total` | Gauge | Count of missing containers | `type`, `problem` |

Reports missing standby and primary containers, with specific error labels for timeout or connection issues.

### dphactl.verify

Runs verification checks on the DP-HA-CTL system to ensure proper configuration.

```bash
/opt/prometheus-monitoring-scripts/bin/custom_exporter dphactl.verify
```

#### Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `dp_hactl_verify_problems_total` | Gauge | Total number of problems per container | `machine` |
| `dp_hactl_verify_problem_details` | Gauge | Detailed breakdown of problem types | `machine`, `problem` |

Runs the `dp-ha-ctl verify` command and categorizes detected problems for detailed monitoring.

All dphactl modules handle command failures explicitly and preserve subprocess exit codes, which supports detailed alerting and debugging.