1. Prometheus: introduction and installation

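This article draws on the monitoring template document written by my good friend Mr.pu.

The software used in this monitoring deployment is as follows: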

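prometheus          collects and stores metrics; provides the PromQL query language
alertmanager        manages alerts and sends notifications
node_exporter       collects basic host performance metrics
blackbox_exporter   collects probe metrics over http, https, tcp, and so on
redis_exporter      collects redis metrics
mysqld_exporter     collects mysql metrics
pushgateway         receives metrics pushed by jobs and exposes them for prometheus to scrape
PrometheusAlert     alert-forwarding system for operations, used together with alertmanager
grafana             dashboards for visualizing the monitoring data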

Service ports

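prometheus          listens on port 9090
alertmanager        listens on port 9093
PrometheusAlert     listens on port 8080
grafana             listens on port 3000
blackbox_exporter   listens on port 9115
node_exporter       listens on port 9100
redis_exporter      listens on port 9121
mysqld_exporter     listens on port 9104
pushgateway         listens on port 9091

For each of these services, running the binary with -h prints the help text, which shows the flag that sets the listening port.
The PrometheusAlert port is set in its configuration file instead.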
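
As a quick cross-check of the table above, the port assignments can be captured in a small shell helper. This is only a sketch: `port_for` and `check_port` are hypothetical helper names, the ports are the defaults listed above, and `check_port` assumes `bash` and `timeout` are available on the host.

```shell
# port_for NAME: print the default port of a service from the table above.
# Returns non-zero for a name that is not in the table.
port_for() {
  case "$1" in
    prometheus)        echo 9090 ;;
    alertmanager)      echo 9093 ;;
    PrometheusAlert)   echo 8080 ;;
    grafana)           echo 3000 ;;
    blackbox_exporter) echo 9115 ;;
    node_exporter)     echo 9100 ;;
    redis_exporter)    echo 9121 ;;
    mysqld_exporter)   echo 9104 ;;
    pushgateway)       echo 9091 ;;
    *) return 1 ;;
  esac
}

# check_port HOST PORT: succeed if a TCP connection to HOST:PORT can be
# opened within 2 seconds (uses bash's /dev/tcp pseudo-device).
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
```

For example, `check_port 127.0.0.1 "$(port_for grafana)" && echo up` reports whether grafana is reachable on the local host.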


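Host prometheus runs prometheus, alertmanager, PrometheusAlert, grafana, and blackbox_exporter. Here all five services run on one machine, but they can also be deployed separately.
Host node runs node_exporter, redis_exporter, pushgateway, and mysqld_exporter.

Services deployed on the prometheus host

Installing prometheus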

$ wget https://github.com/prometheus/prometheus/releases/download/v2.37.1/prometheus-2.37.1.linux-amd64.tar.gz
$ tar -xvf prometheus-2.37.1.linux-amd64.tar.gz -C /usr/local/
$ cd /usr/local && ln -s prometheus-2.37.1.linux-amd64/ prometheus
$ cat > /usr/lib/systemd/system/prometheus.service << EOF
[Unit]
Description=prometheus
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/usr/local/prometheus
ExecStart=/usr/local/prometheus/prometheus --web.enable-lifecycle --storage.tsdb.retention.time=90d --web.enable-admin-api --storage.tsdb.path=/usr/local/prometheus/data --web.external-url=http://prometheus-dd.aaaa.com
[Install]
WantedBy=multi-user.target
EOF


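# Startup flags:
--web.enable-lifecycle  allows changes to prometheus.yml to be hot-reloaded over HTTP (see the curl command below) instead of restarting the service
--storage.tsdb.retention.time=90d  keeps data for 90 days
--web.enable-admin-api  enables the admin API, which can be used to clean up (delete) data
--storage.tsdb.path=/usr/local/prometheus/data  directory where the data is stored
--web.external-url=http://prometheus-dd.aaaa.com  external domain name, used for the hyperlinks in alert messages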


$ systemctl daemon-reload
$ systemctl start prometheus.service
$ systemctl enable prometheus.service
- After editing prometheus.yml, hot-reload the configuration with:
$ curl -X POST http://localhost:9090/-/reload
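
A reload with a broken prometheus.yml is rejected, so it can help to validate the file first. The sketch below wraps the two steps; `safe_reload`, `PROMTOOL`, and `PROM_URL` are hypothetical names, and the paths assume the install layout used above (promtool ships in the same tarball as prometheus).

```shell
# Defaults follow the install steps above; override via the environment if yours differ.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# safe_reload [CONFIG]: validate CONFIG with promtool, then trigger a hot reload.
# Returns non-zero, without reloading, if validation fails.
safe_reload() {
  config="${1:-/usr/local/prometheus/prometheus.yml}"
  if ! "$PROMTOOL" check config "$config"; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  curl -fsS -X POST "$PROM_URL/-/reload"
}
```

Calling `safe_reload` with no arguments checks the default config path and reloads only when the check passes.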

Installing alertmanager

Package download page: https://github.com/prometheus/alertmanager/releases

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xvf alertmanager-0.24.0.linux-amd64.tar.gz
mv alertmanager-0.24.0.linux-amd64 /usr/local/alertmanager

cat > /usr/lib/systemd/system/alertmanager.service << EOF
[Unit]
Description=alertmanager
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/usr/local/alertmanager
ExecStart=/usr/local/alertmanager/alertmanager --web.external-url=http://alertmanager-dd.aaaa.com
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable alertmanager.service

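Startup flags:
--web.external-url=http://alertmanager-dd.aaaa.com  external domain name, used for the hyperlinks in alert messages

Installing PrometheusAlert

An alert-forwarding system for operations, used together with alertmanager.

The main job of PrometheusAlert is this: alertmanager forwards alert messages to it, and it delivers the notifications to multiple channels.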

Project page: https://github.com/feiyu563/PrometheusAlert
Installation and usage documentation: https://feiyu563.gitbook.io/prometheusalert/

cd /usr/local && wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.7/linux.zip && unzip linux.zip
mv linux PrometheusAlert
chmod 755 /usr/local/PrometheusAlert/PrometheusAlert

cat > /usr/lib/systemd/system/PrometheusAlert.service << EOF
[Unit]
Description=PrometheusAlert
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/usr/local/PrometheusAlert
ExecStart=/usr/local/PrometheusAlert/PrometheusAlert
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable PrometheusAlert.service

Installing blackbox_exporter

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.21.1/blackbox_exporter-0.21.1.linux-amd64.tar.gz
tar -zxvf blackbox_exporter-0.21.1.linux-amd64.tar.gz
mv blackbox_exporter-0.21.1.linux-amd64 /usr/local/blackbox_exporter
cat > /usr/lib/systemd/system/blackbox_exporter.service << EOF
[Unit]
Description=blackbox_exporter
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/usr/local/blackbox_exporter
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable blackbox_exporter.service
systemctl start blackbox_exporter.service

Installing grafana

- Package download page: https://grafana.com/grafana/download
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.1.6_amd64.deb
sudo dpkg -i grafana-enterprise_9.1.6_amd64.deb
systemctl start grafana-server && systemctl enable grafana-server.service
- The default username and password are both admin

Proxying the services with nginx

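prometheus, alertmanager, and PrometheusAlert have no user system of their own, so we use nginx for authentication.

The nginx setup is as follows:

grafana.aaaa.com  proxies to grafana on port 3000; no nginx authentication, since grafana has its own login (initial username and password: admin / admin)
prometheus.aaaa.com  proxies to prometheus on port 9090; protected by nginx authentication
prometheus-dd.aaaa.com  proxies to prometheus on port 9090 without nginx authentication. This domain matches the one in the service startup flags, but a forwarding rule applies: requests from the DingTalk app are served directly without authentication, and everything else is redirected to prometheus.aaaa.com
alert.aaaa.com  proxies to PrometheusAlert on port 8080; protected by nginx authentication
alertmanager.aaaa.com  proxies to alertmanager on port 9093; protected by nginx authentication
alertmanager-dd.aaaa.com  proxies to alertmanager on port 9093 without nginx authentication. This domain matches the one in the service startup flags, but a forwarding rule applies: requests from the DingTalk app are served directly without authentication, and everything else is redirected to alertmanager.aaaa.com

As you can see, prometheus-dd.aaaa.com and alertmanager-dd.aaaa.com are the URLs passed to --web.external-url in the service startup flags.
Both domains have a forwarding restriction in the nginx configuration: access from the DingTalk app skips nginx authentication, and any other access is redirected to the domain that requires authentication.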



yum -y install httpd-tools
htpasswd -bc /apps/nginx/auth/.htpasswd <username> <password>
############################ prometheus.yanghongtao.cn ############################
server {
    listen 80;
    server_name prometheus.yanghongtao.cn;
    rewrite ^(.*)$ https://$host$1 permanent;
}

server {
    listen 443 ssl;
    server_name prometheus.yanghongtao.cn;
    ssl_certificate /apps/nginx/cert/yanghongtao_cn.pem;
    ssl_certificate_key /apps/nginx/cert/yanghongtao_cn.key;
    ssl_session_timeout 5m;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP;
    ssl_prefer_server_ciphers on;

    auth_basic "Prometheus Auth";
    auth_basic_user_file /apps/nginx/auth/.htpasswd;
    location / {
        proxy_pass http://127.0.0.1:9090;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }
}

######################### grafana.yanghongtao.cn #########################
server {
    listen 80;
    server_name grafana.yanghongtao.cn;
    rewrite ^(.*)$ https://$host$1 permanent;
}

server {
    listen 443 ssl;
    server_name grafana.yanghongtao.cn;

    ssl_certificate /apps/nginx/cert/yanghongtao_cn.pem;
    ssl_certificate_key /apps/nginx/cert/yanghongtao_cn.key;
    ssl_session_timeout 5m;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP;
    ssl_prefer_server_ciphers on;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        access_log /apps/nginx/logs/grafana.log;
    }
}

####################################################################################


server {
    listen 80;
    server_name alert.aaaa.com;
    location / {
        auth_basic "Alert Auth";
        auth_basic_user_file /usr/local/nginx/prometheus.passwd;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_pass http://172.19.120.164:8080;
    }
    access_log /usr/local/nginx/logs/alert.log main;
}

server {
    listen 80;
    server_name alertmanager.aaaa.com;
    location / {
        auth_basic "Alert Auth";
        auth_basic_user_file /usr/local/nginx/prometheus.passwd;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_pass http://172.19.120.164:9093;
    }
    access_log /usr/local/nginx/logs/alertmanager.log main;
}

server {
    listen 80;
    server_name alertmanager-dd.aaaa.com;
    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
        if ($http_user_agent ~ "com.laiwang.DingTalk")
        #if ($remote_addr ~ "139.196.8.74")
        {
            proxy_pass http://172.19.120.164:9093;
            break;
        }
        rewrite ^/(.*) http://alertmanager.aaaa.com/$1 permanent;
    }
    access_log /usr/local/nginx/logs/alertmanager-dd.log main;
}
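
The user-agent rule in the last server block can be mirrored in a small shell function, which is handy for reasoning about which clients get proxied and which get redirected. This is a sketch only: `route_for_ua` is a hypothetical name and the sample user-agent strings are made up for illustration.

```shell
# route_for_ua USER_AGENT: print "proxy" if the user agent matches the same
# pattern as the nginx `if` condition, otherwise print "redirect".
route_for_ua() {
  if printf '%s' "$1" | grep -Eq 'com.laiwang.DingTalk'; then
    echo proxy
  else
    echo redirect
  fi
}

route_for_ua 'Mozilla/5.0 AliApp(DingTalk/6.5) com.laiwang.DingTalk/1.0'  # prints: proxy
route_for_ua 'curl/7.79.1'                                                 # prints: redirect
```

Against a live setup, the same check can be made with `curl -sI -A 'com.laiwang.DingTalk' http://alertmanager-dd.aaaa.com/`, which should be proxied, while the same request without `-A` should come back as a 301 redirect. Keep in mind that a user-agent header is trivial to spoof, so this rule is a convenience and not real access control.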

Services deployed on the node host

Installing node_exporter

$ wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0-rc.0/node_exporter-1.4.0-rc.0.linux-amd64.tar.gz
$ tar -xf node_exporter-1.4.0-rc.0.linux-amd64.tar.gz -C /usr/local
$ cd /usr/local && ln -s node_exporter-1.4.0-rc.0.linux-amd64/ node_exporter
$ cat > /usr/lib/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
[Service]
Restart=on-failure
WorkingDirectory=/usr/local/node_exporter
ExecStart=/usr/local/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
EOF
$ systemctl daemon-reload && systemctl enable --now node_exporter.service
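
Once node_exporter is running, its /metrics endpoint should return samples. The sketch below counts them; `count_samples`, `fetch_node_metrics`, and `NODE_ADDR` are hypothetical names, and 127.0.0.1:9100 assumes the default port on the local host.

```shell
# count_samples: read Prometheus text-format metrics on stdin and print the
# number of sample lines (lines starting with # and blank lines are skipped).
count_samples() {
  grep -v -e '^#' -e '^$' | wc -l | tr -d ' '
}

# fetch_node_metrics: fetch the raw metrics from a node_exporter instance.
fetch_node_metrics() {
  curl -fsS "http://${NODE_ADDR:-127.0.0.1:9100}/metrics"
}
```

Running `fetch_node_metrics | count_samples` on a healthy host should print a number well above zero.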

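Configuring alerts in PrometheusAlert

When alertmanager and PrometheusAlert were installed, they were only enabled at boot and not started; the other services are already running. Edit app.conf, then start PrometheusAlert.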

Edit the configuration file /usr/local/PrometheusAlert/conf/app.conf
prometheus_cst_time=1
open-dingding=1 # enable DingTalk alerts
ddurl=https://oapi.dingtalk.com/robot/send?access_token=****** # DingTalk robot webhook URL

How to obtain the DingTalk robot URL is documented at
https://feiyu563.gitbook.io/prometheusalert/gao-jing-jie-shou-mu-biao-pei-zhi/ding-ding-gao-jing-pei-zhi
Tip: create a DingTalk group, add a robot in the group settings, then remove everyone else so that only you remain. That way debugging does not disturb anyone.

Under the AlertTemplate menu, edit the prometheus-dd template and save it
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
### [Prometheus recovery]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
##### Project: {{$v.annotations.project}}
##### Started at: {{GetCSTtime $v.startsAt}}
##### Ended at: {{GetCSTtime $v.endsAt}}
##### Recovered host: {{$v.labels.instance}}
{{else}}
### [Prometheus alert]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
##### Project: {{$v.annotations.project}}
##### Started at: {{GetCSTtime $v.startsAt}}
##### Affected host: {{$v.labels.instance}}
##### Description: {{$v.annotations.description}}
{{end}}
{{ end }}

Editing the alertmanager configuration and starting the service

Configuration file: /usr/local/alertmanager/alertmanager.yml

Its contents are as follows
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://172.19.120.164:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=******'

systemctl start alertmanager.service

alertmanager also supports more advanced routing: alerts can be sent to different receivers according to their severity and project type.
See https://feiyu563.gitbook.io/prometheusalert/prometheusalert-gao-jing-yuan-pei-zhi/prometheus-pei-zhi
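
To check the whole chain (alertmanager, PrometheusAlert, DingTalk) without waiting for a real failure, a test alert can be posted to alertmanager's v2 HTTP API. The following is a minimal sketch: `build_test_alert`, `send_test_alert`, and `AM_URL` are hypothetical names, the label and annotation values are made up, and the address assumes alertmanager on its default port on the local host.

```shell
# build_test_alert NAME INSTANCE: print a one-element JSON array in the shape
# accepted by alertmanager's POST /api/v2/alerts endpoint. The annotations
# match the fields used by the prometheus-dd template (project, description).
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","instance":"%s"},"annotations":{"project":"test","description":"manual test alert"}}]\n' "$1" "$2"
}

# send_test_alert [NAME] [INSTANCE]: post the payload to alertmanager.
send_test_alert() {
  build_test_alert "${1:-ManualTest}" "${2:-localhost}" |
    curl -fsS -X POST -H 'Content-Type: application/json' --data @- \
      "${AM_URL:-http://127.0.0.1:9093}/api/v2/alerts"
}
```

After `send_test_alert`, the alert should appear in the alertmanager UI and, once `group_wait` has elapsed, in the DingTalk group.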

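Configuring email alerts (can be skipped)

If you use PrometheusAlert, you can skip the alert configuration below.

prometheus supports several ways of delivering alerts. What I tested here is sending alerts from an Aliyun mailbox to a QQ mailbox; the reverse works the same way. First edit the alertmanager configuration and add email alerting.

For reference: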

- Email alerting in the alertmanager configuration
[root@instance-1-web1 /usr/local/alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: 'yanghongtao@*****.com'
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_auth_username: '<email account>'
  smtp_auth_password: '<email password>'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: '1419946323@qq.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

- Alerting settings in the prometheus configuration
[root@instance-1-web1 /usr/local/prometheus]# cat prometheus.yml |egrep -v "^$|#"
global:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093
rule_files:
  - "/usr/local/prometheus/rules/*.rules"
scrape_configs:
  - job_name: 'linux-server'
    scrape_interval: 10s
    file_sd_configs:
      - files: ['/usr/local/prometheus/conf/linux.yml']

- Add an alerting rule under /usr/local/prometheus/rules/ to suit your own setup, then restart prometheus and alertmanager
[root@instance-1-web1 /usr/local/prometheus]# cat rules/node-up.rules
groups:
  - name: node-up
    rules:
      - alert: node-up
        expr: up{job="linux-server"} == 0
        for: 15s
        labels:
          severity: 1
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"

This produces the alert email, but the default format is not very readable, so the next step improves it.

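Configuring a custom email template in AlertManager

The default email template already contains all the core information, but the layout could be cleaner and easier to read. AlertManager supports custom email templates, so start by creating a template file named email.tmpl.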

$ mkdir -p /usr/local/alertmanager/alertmanager-tmpl/ && cd /usr/local/alertmanager/alertmanager-tmpl/
$ vim email.tmpl
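{{ define "email.from" }}yanghongtao@***.com{{ end }}
{{ define "email.to" }}1419946323@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
<b>=========start==========<br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
<b>=========end==========<br>
{{ end }}
{{ end }}

A brief explanation: the template file above defines three named templates, email.from, email.to, and email.to.html, which can be referenced directly in alertmanager.yml. email.to.html is the body of the email to send. Both HTML and text formats are supported; HTML is used here so the message displays nicely. {{ range .Alerts }} is loop syntax that iterates over the matching alerts. The fields shown are the same as in the default email, reduced to a few core values. Note that the argument to .StartsAt.Format must be Go's reference time, 2006-01-02 15:04:05, which Go uses as the layout pattern; any other date is not interpreted as a format. Next, add a templates section to alertmanager.yml as follows: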

global:
  resolve_timeout: 5m
  smtp_from: 'yanghongtao@*****.com'
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_auth_username: 'yanghongtao@*****.com'
  smtp_auth_password: '*******'
  smtp_require_tls: false
templates:
  - '/usr/local/alertmanager/alertmanager-tmpl/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: '{{ template "email.to" . }}'
        html: '{{ template "email.to.html" . }}'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

The template uses {{ .Annotations.description }}, but the earlier node-up.rules did not define that annotation, so no value would be found. We therefore update node-up.rules and restart the Prometheus service.

[root@instance-1-web1 /usr/local/prometheus/rules]# cat node-up.rules
groups:
  - name: node-up
    rules:
      - alert: node-up
        expr: up{job="linux-server"} == 0
        for: 15s
        labels:
          severity: 1
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
          description: "{{ $labels.instance }} stopped unexpectedly! Please pay close attention!"
[root@instance-1-web1 /usr/local/prometheus/rules]# pwd
/usr/local/prometheus/rules

After the restart, simulate the alert condition again (stop the node-exporter service). The templated email is sent correctly, and this time it has the style we wanted!

Configuring DingTalk alerts (can be skipped)

1. Download the plugin and unpack it
$ wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
$ tar -zxf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
$ mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 /usr/local/alertmanager


2. Write the systemd unit (replace the webhook URL and ding.profile with your own)
cat > /usr/lib/systemd/system/dingtalk.service << EOF
[Unit]
Description=dingtalk
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/usr/local/alertmanager/prometheus-webhook-dingtalk-1.4.0.linux-amd64/
ExecStart=/usr/local/alertmanager/prometheus-webhook-dingtalk-1.4.0.linux-amd64/prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=<DingTalk access token>"
[Install]
WantedBy=multi-user.target
EOF


systemctl daemon-reload && systemctl start dingtalk.service && systemctl enable dingtalk.service

Common prometheus monitoring

Basic day-to-day monitoring

groups:
  - name: CPU usage alerts
    rules:
      - alert: CPU usage alert
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle",job=~"(IDC-GPU|hw-nodes-prod-ES)"}[30m])) by (instance) > 0.9
        for: 1m
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} CPU load alert"
          description: "{{ $labels.instance }} CPU usage is above 90% (current value: {{ $value }})"

      - alert: Host disk usage above 80%
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
        for: 15s
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} disk usage is above 80%, current value: {{ $value }}%"
          description: "{{ $labels.instance }} disk usage is above 80%, current value: {{ $value }}%"

      - alert: Host disk usage above 90%
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 90
        for: 15s
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} disk usage is above 90%, current value: {{ $value }}%"
          description: "{{ $labels.instance }} disk usage is above 90%, current value: {{ $value }}%"

      - alert: Memory usage above 90%
        expr: (1-node_memory_MemAvailable_bytes{job!="IDC-GPU"} / node_memory_MemTotal_bytes{job!="IDC-GPU"}) * 100 > 90
        for: 1m
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} is low on available memory"
          description: "{{ $labels.instance }} memory usage is above 90% (current value: {{ $value }})"

      - alert: Node host liveness
        expr: up{job="linux-server"} == 0
        for: 15s
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
          description: "{{ $labels.instance }} has been down unexpectedly for more than 15s! Please pay close attention!"

      - alert: Server time synchronization
        # node_timex_offset_seconds is the offset between the local clock and its reference clock.
        # node_timex_tai_offset_seconds is the fixed TAI-UTC offset and does not measure clock drift.
        expr: abs(node_timex_offset_seconds) > 3
        for: 15s
        labels:
          level: warning
        annotations:
          description: "{{$labels.instance}} clock is off by more than 3 seconds"
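
Before putting an expression into a rule file, it can be tried against the Prometheus HTTP API to see which series currently match. This is a minimal sketch: `promql` and `PROM_URL` are hypothetical names, and the address assumes prometheus on its default port on the local host.

```shell
# promql EXPR: run an instant query against the Prometheus HTTP API
# (/api/v1/query) and print the raw JSON response.
promql() {
  curl -fsS -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `promql 'up{job="linux-server"} == 0'` returns an empty `result` array when every target is up, which means the liveness rule above would not fire.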

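Monitoring websites and certificate expiry

Web monitoring here means monitoring specific URLs or API endpoints. Through a few examples, we show how to configure Prometheus, blackbox_exporter, and grafana to monitor the following aspects of a site:

  1. Status code

  2. Response time

  3. Certificate expiry time

    Web monitoring with Prometheus relies on blackbox_exporter.

    blackbox_exporter can do far more than monitor websites: it can also probe TCP ports, DNS, ICMP, and so on.

The configuration breaks down into these steps:

  1. Install blackbox_exporter (see the installation steps above)
  2. Configure the target addresses
  3. Configure the alerting rules
  4. Configure the grafana dashboard

Add the following to prometheus.yml: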

###### blackbox: monitor HTTP status codes
  - job_name: 'blackbox_http_status'
    metrics_path: /probe
    params:
      module: [http_2xx] # the module name defined in the blackbox_exporter configuration
    file_sd_configs: # there are many addresses to monitor, so they are kept in a separate file
      # - refresh_interval: 15s
      - files: ['/usr/local/prometheus/blackbox/job_web.yaml']
        refresh_interval: 15s
    relabel_configs:
      - source_labels: [__address__] # the target's address, e.g. https://baidu.com when monitoring Baidu
        target_label: __param_target # __param_ is the prefix for URL parameters and target is the parameter name; this copies __address__ into the target parameter, so target=https://baidu.com
      - source_labels: [__param_target]
        target_label: instance # copies the value of __param_target into the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9115 # the original target is the site itself, but Prometheus does not request it directly; it asks blackbox_exporter to probe it, so the scrape address is replaced with the blackbox_exporter address


###### blackbox: ICMP ping variant
  - job_name: 'blackbox_icmp_status'
    metrics_path: /probe
    params:
      module: [icmp] # the module name defined in the blackbox_exporter configuration
    file_sd_configs:
      # - refresh_interval: 15s
      - files: ['/usr/local/prometheus/blackbox/icmp.yaml']
        refresh_interval: 15s
    relabel_configs:
      - source_labels: [__address__] # the target's address, for ICMP a hostname or IP
        target_label: __param_target # copies __address__ into the target URL parameter
      - source_labels: [__param_target]
        target_label: instance # copies the value of __param_target into the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9115 # Prometheus scrapes blackbox_exporter, which performs the probe, so the scrape address is replaced with the blackbox_exporter address
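
The relabeling above means that each scrape is an HTTP request to blackbox_exporter's /probe endpoint with `module` and `target` as query parameters. The same request can be made by hand, which is useful when a target shows as down. This is a minimal sketch, assuming blackbox_exporter listens on 127.0.0.1:9115 as in the config; `probe` and `BLACKBOX_ADDR` are hypothetical names.

```shell
# probe TARGET [MODULE]: ask blackbox_exporter to probe TARGET with MODULE
# (default http_2xx) and print the resulting metrics.
probe() {
  curl -fsS -G "http://${BLACKBOX_ADDR:-127.0.0.1:9115}/probe" \
    --data-urlencode "module=${2:-http_2xx}" \
    --data-urlencode "target=$1"
}
```

For example, `probe https://www.baidu.com/ | grep -E '^probe_(success|http_status_code)'` shows whether the probe succeeded and which status code came back.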

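Create the target files and add entries as needed. The labels are only there to make the alerts more informative. Once configured, restart Prometheus; if the web sites show up under Targets, the configuration is complete.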

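cat /usr/local/prometheus/blackbox/job_web.yaml
---
- targets:
    - https://www.baidu.com/
  labels:
    env: pro
    app: web
    project: Baidu
    desc: Baidu production
- targets:
    - https://blog.csdn.net/
  labels:
    env: test
    app: web
    project: CSDN
    desc: just a test
    not_200: "yes" # custom label marking addresses that do not return status 200 under normal conditions

Configure the alerting rules to suit your own setup; the goal is simply that an alert fires when expected.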

cat /usr/local/prometheus/rules/web.rules
groups:
  - name: web status alerts
    rules:
      - alert: Web access failure
        expr: probe_http_status_code != 200
        for: 3m
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} web access failure"
          description: "{{ $labels.instance }} web access failure"

      # - alert: Web response time > 3s
      #   expr: probe_duration_seconds >= 3
      #   for: 15s
      #   labels:
      #     level: warning
      #   annotations:
      #     summary: "Web response abnormal: {{ $labels.instance }}"
      #     description: "{{ $labels.instance }} web access failure"

      - alert: Certificate expires in less than 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 3600*24*30
        labels:
          level: warning
        annotations:
          summary: "Web certificate expires in less than 30 days: {{ $labels.instance }}"
          description: "{{ $labels.instance }} web certificate expires in less than 30 days"
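
The certificate rule compares two Unix timestamps: the expiry time reported by the probe and the current time. The arithmetic can be sketched in shell as follows; `cert_days_left` and `cert_expiring_soon` are hypothetical helper names, and the expiry value would come from the probe_ssl_earliest_cert_expiry metric.

```shell
# cert_days_left EXPIRY_EPOCH [NOW_EPOCH]: whole days between now and the
# certificate expiry timestamp.
cert_days_left() {
  now="${2:-$(date +%s)}"
  echo $(( ($1 - $now) / 86400 ))
}

# cert_expiring_soon EXPIRY_EPOCH [NOW_EPOCH]: succeed when the same condition
# as the rule holds, that is expiry - now < 3600*24*30.
cert_expiring_soon() {
  now="${2:-$(date +%s)}"
  [ $(( $1 - $now )) -lt $(( 3600 * 24 * 30 )) ]
}
```

With an expiry exactly 30 days ahead the condition is still false; the alert starts firing one second later, because the rule uses a strict less-than.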

Finally, import a dashboard template in grafana: click Import and enter 14603, and the dashboard is ready to use.