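1. Prometheus: Introduction and Installation
This article draws on the monitoring template document written by my good friend Mr.pu.
The software used in this monitoring deployment is listed below:

    prometheus          collects and stores metrics; provides PromQL queries
    alertmanager        manages alerts and sends notifications
    node_exporter       collects basic host performance metrics
    blackbox_exporter   collects HTTP, HTTPS, TCP and similar probe metrics
    redis_exporter      collects Redis metrics
    mysqld_exporter     collects MySQL metrics
    pushgateway         accepts pushed metrics for Prometheus to scrape
    PrometheusAlert     ops alert-forwarding system, used together with alertmanager
    grafana             dashboards for the monitoring data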
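Service ports

    prometheus          listens on port 9090
    alertmanager        listens on port 9093
    PrometheusAlert     listens on port 8080
    grafana             listens on port 3000
    blackbox_exporter   listens on port 9115
    node_exporter       listens on port 9100
    redis_exporter      listens on port 9121
    mysqld_exporter     listens on port 9104
    pushgateway         listens on port 9091

For each of these services, run the binary with -h to see the help output and find the flag that sets the listening port. The PrometheusAlert port is set in its configuration file instead.

Host "prometheus" runs prometheus, alertmanager, PrometheusAlert, grafana and blackbox_exporter. These five services share one machine here, but they can also be deployed separately.
Host "node" runs node_exporter, redis_exporter, pushgateway and mysqld_exporter.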
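Once the services are installed, a quick TCP connect check shows whether each one is listening on its port. The following is a minimal sketch using only the Python standard library. The `SERVICE_PORTS` mapping restates the default ports listed above, and the helper names `port_open` and `report` are illustrative, not part of any of these tools.

```python
import socket

# Default ports, restated from the list above.
SERVICE_PORTS = {
    "prometheus": 9090,
    "alertmanager": 9093,
    "PrometheusAlert": 8080,
    "grafana": 3000,
    "blackbox_exporter": 9115,
    "node_exporter": 9100,
    "redis_exporter": 9121,
    "mysqld_exporter": 9104,
    "pushgateway": 9091,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host, services=SERVICE_PORTS):
    """Map each service name to whether its port accepts connections on host."""
    return {name: port_open(host, port) for name, port in services.items()}
```

Calling `report("127.0.0.1")` on the prometheus host returns a dictionary of service names and booleans. Services deployed on the node host will show `False` there, which is expected.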
Services deployed on the Prometheus host
Installing prometheus

    $ wget https://github.com/prometheus/prometheus/releases/download/v2.37.1/prometheus-2.37.1.linux-amd64.tar.gz
    $ tar -xvf prometheus-2.37.1.linux-amd64.tar.gz -C /usr/local/
    $ cd /usr/local && ln -s prometheus-2.37.1.linux-amd64/ prometheus
    $ cat > /usr/lib/systemd/system/prometheus.service << EOF
    [Unit]
    Description=prometheus
    After=network.target
    [Service]
    Restart=on-failure
    WorkingDirectory=/usr/local/prometheus
    ExecStart=/usr/local/prometheus/prometheus --web.enable-lifecycle --storage.tsdb.retention.time=90d --web.enable-admin-api --storage.tsdb.path=/usr/local/prometheus/data --web.external-url=http://prometheus-dd.aaaa.com
    [Install]
    WantedBy=multi-user.target
    EOF

Startup flags:

    --web.enable-lifecycle                             after editing prometheus.yml, the configuration can be hot-reloaded (see below) without a restart
    --storage.tsdb.retention.time=90d                  keep data for 90 days
    --web.enable-admin-api                             enable the admin API, which allows data cleanup
    --storage.tsdb.path=/usr/local/prometheus/data     directory where data is stored
    --web.external-url=http://prometheus-dd.aaaa.com   external domain, used for the hyperlinks in alert messages

    $ systemctl daemon-reload
    $ systemctl start prometheus.service
    $ systemctl enable prometheus.service

After modifying prometheus.yml, hot-reload the configuration with:

    $ curl -X POST http://localhost:9090/-/reload
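The same hot reload can be triggered from a script, for example at the end of a deployment job. This is a minimal sketch, assuming Prometheus was started with `--web.enable-lifecycle` as above. The function names `reload_url` and `reload_prometheus` are made up for illustration.

```python
import urllib.request


def reload_url(base_url):
    """Build the lifecycle reload endpoint for a Prometheus base URL."""
    return base_url.rstrip("/") + "/-/reload"


def reload_prometheus(base_url="http://localhost:9090", timeout=5.0):
    """POST to /-/reload and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, for
    example when the lifecycle API is disabled or the new config is invalid.
    """
    request = urllib.request.Request(reload_url(base_url), data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A failed reload leaves the previously loaded configuration running, so it is worth checking the returned status or the raised error rather than assuming success.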
Installing alertmanager

Package releases: https://github.com/prometheus/alertmanager/releases

    wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
    tar xvf alertmanager-0.24.0.linux-amd64.tar.gz
    mv alertmanager-0.24.0.linux-amd64 /usr/local/alertmanager

    cat > /usr/lib/systemd/system/alertmanager.service << EOF
    [Unit]
    Description=alertmanager
    After=network.target
    [Service]
    Restart=on-failure
    WorkingDirectory=/usr/local/alertmanager
    ExecStart=/usr/local/alertmanager/alertmanager --web.external-url=http://alertmanager-dd.aaaa.com
    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable alertmanager.service

Startup flag:

    --web.external-url=http://alertmanager-dd.aaaa.com   external domain, used for the hyperlinks in alert messages
Installing PrometheusAlert
PrometheusAlert is an ops alert-forwarding system used together with alertmanager.
Its main job: alertmanager forwards alert messages to PrometheusAlert, which then delivers the notifications to multiple channels.

Source repository: https://github.com/feiyu563/PrometheusAlert
Installation and usage documentation: https://feiyu563.gitbook.io/prometheusalert/

    cd /usr/local && wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.7/linux.zip && unzip linux.zip
    mv linux PrometheusAlert
    chmod 755 /usr/local/PrometheusAlert/PrometheusAlert

    cat > /usr/lib/systemd/system/PrometheusAlert.service << EOF
    [Unit]
    Description=PrometheusAlert
    After=network.target
    [Service]
    Restart=on-failure
    WorkingDirectory=/usr/local/PrometheusAlert
    ExecStart=/usr/local/PrometheusAlert/PrometheusAlert
    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable PrometheusAlert.service
Installing blackbox_exporter

    wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.21.1/blackbox_exporter-0.21.1.linux-amd64.tar.gz
    tar -zxvf blackbox_exporter-0.21.1.linux-amd64.tar.gz
    mv blackbox_exporter-0.21.1.linux-amd64 /usr/local/blackbox_exporter
    cat > /usr/lib/systemd/system/blackbox_exporter.service << EOF
    [Unit]
    Description=blackbox_exporter
    After=network.target
    [Service]
    Restart=on-failure
    WorkingDirectory=/usr/local/blackbox_exporter
    ExecStart=/usr/local/blackbox_exporter/blackbox_exporter
    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable blackbox_exporter.service
    systemctl start blackbox_exporter.service
Installing grafana

Package downloads: https://grafana.com/grafana/download

    wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.1.6_amd64.deb
    sudo dpkg -i grafana-enterprise_9.1.6_amd64.deb
    systemctl start grafana-server && systemctl enable grafana-server.service

The default username and password are both admin.
Proxying the services with nginx

The prometheus, alertmanager and PrometheusAlert services have no user system of their own, so we use nginx for authentication.
The nginx configuration maps the domains as follows:

    grafana.aaaa.com           proxies to grafana on port 3000; no nginx authentication, because grafana has its own login (initial username and password: admin / admin)
    prometheus.aaaa.com        proxies to prometheus on port 9090; nginx authentication
    prometheus-dd.aaaa.com     proxies to prometheus on port 9090; no nginx authentication. This is the domain given in the service startup flag, but it has a forwarding rule: requests from the DingTalk app pass through without authentication, all others are redirected to prometheus.aaaa.com
    alert.aaaa.com             proxies to PrometheusAlert on port 8080; nginx authentication
    alertmanager.aaaa.com      proxies to alertmanager on port 9093; nginx authentication
    alertmanager-dd.aaaa.com   proxies to alertmanager on port 9093; no nginx authentication. This is the domain given in the service startup flag, but it has a forwarding rule: requests from the DingTalk app pass through without authentication, all others are redirected to alertmanager.aaaa.com

In other words, prometheus-dd.aaaa.com and alertmanager-dd.aaaa.com are the URLs passed to --web.external-url at service startup. Both domains carry a forwarding restriction in the nginx configuration: access from the DingTalk app skips nginx authentication, and every other kind of access is redirected to the domain that requires authentication.

Create the password file:

    yum -y install httpd-tools
    htpasswd -bc /apps/nginx/auth/.htpasswd <username> <password>

nginx server blocks:

    server {
        listen 80;
        server_name prometheus.yanghongtao.cn;
        rewrite ^(.*)$ https://$host$1 permanent;
    }

    server {
        listen 443 ssl;
        server_name prometheus.yanghongtao.cn;
        ssl_certificate /apps/nginx/cert/yanghongtao_cn.pem;
        ssl_certificate_key /apps/nginx/cert/yanghongtao_cn.key;
        ssl_session_timeout 5m;
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP;
        ssl_prefer_server_ciphers on;

        auth_basic "Prometheus Auth";
        auth_basic_user_file /apps/nginx/auth/.htpasswd;
        location / {
            proxy_pass http://127.0.0.1:9090;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection 'upgrade';
            proxy_set_header Host $host;
            proxy_cache_bypass $http_upgrade;
        }
    }

    server {
        listen 80;
        server_name grafana.yanghongtao.cn;
        rewrite ^(.*)$ https://$host$1 permanent;
    }

    server {
        listen 443 ssl;
        server_name grafana.yanghongtao.cn;

        ssl_certificate /apps/nginx/cert/yanghongtao_cn.pem;
        ssl_certificate_key /apps/nginx/cert/yanghongtao_cn.key;
        ssl_session_timeout 5m;
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP;
        ssl_prefer_server_ciphers on;

        location / {
            proxy_pass http://127.0.0.1:3000;
            proxy_redirect off;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            access_log /apps/nginx/logs/grafana.log;
        }
    }

    server {
        listen 80;
        server_name alert.aaaa.com;
        location / {
            auth_basic "Alert Auth";
            auth_basic_user_file /usr/local/nginx/prometheus.passwd;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_pass http://172.19.120.164:8080;
        }
        access_log /usr/local/nginx/logs/alert.log main;
    }

    server {
        listen 80;
        server_name alertmanager.aaaa.com;
        location / {
            auth_basic "Alert Auth";
            auth_basic_user_file /usr/local/nginx/prometheus.passwd;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_pass http://172.19.120.164:9093;
        }
        access_log /usr/local/nginx/logs/alertmanager.log main;
    }

    server {
        listen 80;
        server_name alertmanager-dd.aaaa.com;
        location / {
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $remote_addr;
            if ($http_user_agent ~ "com.laiwang.DingTalk") {
                proxy_pass http://172.19.120.164:9093;
                break;
            }
            rewrite ^/(.*) http://alertmanager.aaaa.com/$1 permanent;
        }
        access_log /usr/local/nginx/logs/alertmanager-dd.log main;
    }
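The decision made by the alertmanager-dd.aaaa.com server block can be modelled in a few lines, which may help when reasoning about who gets through without a password. This is only a sketch of the logic, not something nginx runs. The function name `route` and the sample User-Agent strings are invented for illustration, and the query string handling of the real rewrite is left out.

```python
import re

# Same pattern as the nginx test `$http_user_agent ~ "com.laiwang.DingTalk"`.
# As in nginx, `~` is a case-sensitive regex match, so "." matches any character.
DINGTALK_UA = re.compile(r"com.laiwang.DingTalk")

UPSTREAM = "http://172.19.120.164:9093"
AUTH_DOMAIN = "http://alertmanager.aaaa.com/"


def route(user_agent, path):
    """Return ("proxy", upstream) for DingTalk clients, else ("redirect", url)."""
    if DINGTALK_UA.search(user_agent or ""):
        return ("proxy", UPSTREAM)
    # Mirrors: rewrite ^/(.*) http://alertmanager.aaaa.com/$1 permanent;
    return ("redirect", AUTH_DOMAIN + path.lstrip("/"))
```

Because the check relies solely on the User-Agent header, any client that sends a matching header is treated as the DingTalk app. The unauthenticated domains are therefore a convenience, not a security boundary.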
Services deployed on the node host
Installing node_exporter

    $ wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0-rc.0/node_exporter-1.4.0-rc.0.linux-amd64.tar.gz
    $ tar -xf node_exporter-1.4.0-rc.0.linux-amd64.tar.gz -C /usr/local
    $ cd /usr/local && ln -s node_exporter-1.4.0-rc.0.linux-amd64/ node_exporter
    $ cat > /usr/lib/systemd/system/node_exporter.service << EOF
    [Unit]
    Description=node_exporter
    [Service]
    Restart=on-failure
    WorkingDirectory=/usr/local/node_exporter
    ExecStart=/usr/local/node_exporter/node_exporter
    [Install]
    WantedBy=multi-user.target
    EOF
    $ systemctl daemon-reload
    $ systemctl start node_exporter.service && systemctl enable --now node_exporter.service
Configuring alerting in PrometheusAlert
During installation, alertmanager and PrometheusAlert were only enabled at boot and not started, while the other services are already running. Edit app.conf first, then start PrometheusAlert (systemctl start PrometheusAlert.service).

Edit the configuration file /usr/local/PrometheusAlert/conf/app.conf:

    prometheus_cst_time=1
    open-dingding=1
    ddurl=https://oapi.dingtalk.com/robot/send?access_token=******

How to obtain the DingTalk robot URL: https://feiyu563.gitbook.io/prometheusalert/gao-jing-jie-shou-mu-biao-pei-zhi/ding-ding-gao-jing-pei-zhi
Tip: create a DingTalk group, add a robot in the group settings, then remove everyone else so that only you remain. That way debugging does not disturb other people.

Under the AlertTemplate menu, modify the prometheus-dd template and save it:

    {{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
    {{if eq $v.status "resolved"}}

    {{else}}

    {{end}}
    {{ end }}
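Before wiring the robot into PrometheusAlert, it can be useful to confirm that the webhook URL itself works. The sketch below posts a plain text message to a DingTalk custom robot using the standard library. The function names are illustrative, and the webhook URL must be your own (the token is elided in this article). If the robot is configured with a keyword filter, the message content has to contain that keyword.

```python
import json
import urllib.request


def text_message(content):
    """Encode a DingTalk robot text message as UTF-8 JSON bytes."""
    payload = {"msgtype": "text", "text": {"content": content}}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def send_text(webhook_url, content, timeout=5.0):
    """POST the message to the robot webhook and return the response body."""
    request = urllib.request.Request(
        webhook_url,
        data=text_message(content),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Running `send_text(ddurl, "monitoring test")` against the single-member debugging group described above keeps the test out of everyone else's chat.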
Editing the alertmanager configuration and starting the service

Configuration file: /usr/local/alertmanager/alertmanager.yml

Its content is as follows:

    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://172.19.120.164:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=******'

    systemctl start alertmanager.service

alertmanager also supports more advanced routing, sending alerts to different receivers based on severity and project type. See https://feiyu563.gitbook.io/prometheusalert/prometheusalert-gao-jing-yuan-pei-zhi/prometheus-pei-zhi
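To exercise the whole chain (alertmanager, then PrometheusAlert, then DingTalk) without waiting for a real failure, a hand-made alert can be posted to the Alertmanager v2 API. This is a minimal sketch. The helper names and the label and annotation values are invented for the test, and the base URL assumes alertmanager is reachable locally on port 9093.

```python
import json
import urllib.request


def build_test_alert(alertname, instance, level="warning"):
    """Build the JSON-serialisable body expected by POST /api/v2/alerts."""
    return [
        {
            "labels": {"alertname": alertname, "instance": instance, "level": level},
            "annotations": {
                "summary": instance + " test alert",
                "description": "manual test, please ignore",
            },
        }
    ]


def post_alerts(alerts, base_url="http://127.0.0.1:9093", timeout=5.0):
    """Send the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `post_alerts(build_test_alert("manual-test", "10.0.0.1:9100"))` should produce a notification after the `group_wait` interval configured in the route.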
Configuring email alerts (optional)
If you use PrometheusAlert, you can skip the alert configuration below.
Prometheus supports several ways of delivering alerts. Here I tested sending from an Alibaba mailbox to a QQ mailbox; the reverse direction works the same way. First edit the alertmanager configuration and add email alerting.
Reference configuration:

- Email alert settings in alertmanager

    [root@instance-1-web1 /usr/local/alertmanager]
    global:
      resolve_timeout: 5m
      smtp_from: 'yanghongtao@*****.com'
      smtp_smarthost: 'smtp.qiye.aliyun.com:465'
      smtp_auth_username: '<mail account>'
      smtp_auth_password: '<mail password>'
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 5m
      receiver: 'email'
    receivers:
    - name: 'email'
      email_configs:
      - to: '1419946323@qq.com'
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']

- Alerting settings in prometheus

    [root@instance-1-web1 /usr/local/prometheus]
    global:
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 127.0.0.1:9093
    rule_files:
      - "/usr/local/prometheus/rules/*.rules"
    scrape_configs:
      - job_name: 'linux-server'
        scrape_interval: 10s
        file_sd_configs:
          - files: ['/usr/local/prometheus/conf/linux.yml']

- Add an alert rule under /usr/local/prometheus/rules/, adjusted to your own setup, then restart prometheus and alertmanager

    [root@instance-1-web1 /usr/local/prometheus]
    groups:
    - name: node-up
      rules:
      - alert: node-up
        expr: up{job="linux-server"} == 0
        for: 15s
        labels:
          severity: 1
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
That produces the final alert email, but the format is not very readable, so let's improve it.
Configuring a custom email template in AlertManager
The default email template shown above contains all the core information, but the layout could be cleaner and easier to scan. AlertManager supports custom email templates, so start by creating a new template file, email.tmpl.

    $ mkdir -p /usr/local/alertmanager/alertmanager-tmpl/ && cd /usr/local/alertmanager/alertmanager-tmpl/
    $ vim email.tmpl
    {{ define "email.from" }}yanghongtao@***.com{{ end }}
    {{ define "email.to" }}1419946323@qq.com{{ end }}
    {{ define "email.to.html" }}
    {{ range .Alerts }}
    <b>=========start==========<br>
    Alert source: prometheus_alert <br>
    Alert level: level {{ .Labels.severity }} <br>
    Alert type: {{ .Labels.alertname }} <br>
    Affected host: {{ .Labels.instance }} <br>
    Alert summary: {{ .Annotations.summary }} <br>
    Alert details: {{ .Annotations.description }} <br>
    Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    <b>=========end==========<br>
    {{ end }}
    {{ end }}
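The `.StartsAt.Format` call in the template takes a Go time layout. Go layouts are written in terms of one fixed reference time, `2006-01-02 15:04:05`, and not an arbitrary sample date, which is easy to get wrong when coming from strftime-style formatting. The sketch below maps the six numeric components of that reference layout to their strftime equivalents. It is a deliberately partial, hypothetical helper covering only those six tokens, not a full Go layout parser.

```python
from datetime import datetime

# Numeric components of Go's reference time and their strftime equivalents.
GO_TO_STRFTIME = [
    ("2006", "%Y"),  # four-digit year
    ("01", "%m"),    # zero-padded month
    ("02", "%d"),    # zero-padded day
    ("15", "%H"),    # 24-hour clock hour
    ("04", "%M"),    # minute
    ("05", "%S"),    # second
]


def go_layout_to_strftime(layout):
    """Translate a Go layout built from the six tokens above into strftime form."""
    for go_token, directive in GO_TO_STRFTIME:
        layout = layout.replace(go_token, directive)
    return layout


def format_like_go(moment, layout):
    """Format a datetime using a (simple) Go layout string."""
    return moment.strftime(go_layout_to_strftime(layout))
```

With this mapping, the layout `2006-01-02 15:04:05` applied to an alert that started on 4 August 2019 at 16:58:15 renders as `2019-08-04 16:58:15`.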
A brief explanation: the template file above defines three templates, email.from, email.to and email.to.html, which can be referenced directly in alertmanager.yml. email.to.html is the body of the email to be sent. Both HTML and text formats are supported, and HTML is used here for a nicer display. {{ range .Alerts }} is loop syntax that iterates over the matched alerts. The fields are the same as in the default email, with only the key values extracted. Next, add a templates section to alertmanager.yml as follows:

    global:
      resolve_timeout: 5m
      smtp_from: 'yanghongtao@*****.com'
      smtp_smarthost: 'smtp.qiye.aliyun.com:465'
      smtp_auth_username: 'yanghongtao@*****.com'
      smtp_auth_password: '*******'
      smtp_require_tls: false
    templates:
      - '/usr/local/alertmanager/alertmanager-tmpl/email.tmpl'
    route:
      group_by: ['alertname']
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 5m
      receiver: 'email'
    receivers:
    - name: 'email'
      email_configs:
      - to: '{{ template "email.to" . }}'
        html: '{{ template "email.to.html" . }}'
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
The template above uses the {{ .Annotations.description }} variable, but the earlier node-up.rules does not set it, so no value would be rendered. We therefore modify node-up.rules and restart the Prometheus service.

    [root@instance-1-web1 /usr/local/prometheus/rules]
    groups:
    - name: node-up
      rules:
      - alert: node-up
        expr: up{job="linux-server"} == 0
        for: 15s
        labels:
          severity: 1
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
          description: "{{ $labels.instance }} stopped unexpectedly! Please pay close attention!"
    [root@instance-1-web1 /usr/local/prometheus/rules]
    /usr/local/prometheus/rules
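After the restart, simulate the alert condition again by stopping the node-exporter service. The templated email is sent correctly, and this time it has the style we wanted.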
Configuring DingTalk alerts (optional)

1. Download and unpack the plugin

    $ wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
    $ tar -zxf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
    $ mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 /usr/local/alertmanager

2. Create the systemd unit (replace the webhook URL and ding.profile with your own)

    cat > /usr/lib/systemd/system/dingtalk.service << EOF
    [Unit]
    Description=dingtalk
    After=network.target
    [Service]
    Restart=on-failure
    WorkingDirectory=/usr/local/alertmanager/prometheus-webhook-dingtalk-1.4.0.linux-amd64/
    ExecStart=/usr/local/alertmanager/prometheus-webhook-dingtalk-1.4.0.linux-amd64/prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=<DingTalk API token>"
    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload && systemctl start dingtalk.service && systemctl enable dingtalk.service
Common prometheus monitoring rules
Basic day-to-day monitoring

    groups:
    - name: CPU usage alert
      rules:
      - alert: CPU usage alert
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle",job=~"(IDC-GPU|hw-nodes-prod-ES)"}[30m])) by (instance) > 0.9
        for: 1m
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} CPU load alert"
          description: "{{ $labels.instance }} CPU usage is above 90% (current value: {{ $value }})"

      - alert: Host disk usage above 80%
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
        for: 15s
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} disk usage is above 80%, current value: {{ $value }}%"
          description: "{{ $labels.instance }} disk usage is above 80%, current value: {{ $value }}%"

      - alert: Host disk usage above 90%
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 90
        for: 15s
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} disk usage is above 90%, current value: {{ $value }}%"
          description: "{{ $labels.instance }} disk usage is above 90%, current value: {{ $value }}%"

      - alert: Memory usage above 90%
        expr: (1-node_memory_MemAvailable_bytes{job!="IDC-GPU"} / node_memory_MemTotal_bytes{job!="IDC-GPU"}) * 100 > 90
        for: 1m
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} is low on available memory"
          description: "{{ $labels.instance }} memory usage is above 90% (current value: {{ $value }})"

      - alert: Node host down
        expr: up{job="linux-server"} == 0
        for: 15s
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
          description: "{{ $labels.instance }} has been stopped unexpectedly for more than 15s! Please pay close attention!"

      - alert: Server time synchronization
        expr: abs(node_timex_offset_seconds) > 3
        for: 15s
        labels:
          level: warning
        annotations:
          description: "{{$labels.instance}} clock offset from the reference clock is greater than 3 seconds"
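The arithmetic behind the disk and memory expressions is easy to check by hand. The sketch below reproduces it in Python with made-up byte counts. The function names are illustrative, and the thresholds are the ones used in the rules above.

```python
def disk_used_percent(free_bytes, size_bytes):
    """Mirror of: 100 - node_filesystem_free_bytes / node_filesystem_size_bytes * 100."""
    return 100 - free_bytes / size_bytes * 100


def memory_used_percent(available_bytes, total_bytes):
    """Mirror of: (1 - MemAvailable / MemTotal) * 100."""
    return (1 - available_bytes / total_bytes) * 100


def firing_disk_levels(used_percent):
    """Return the levels of the disk rules whose threshold is exceeded.

    Both rules are evaluated independently, so above 90% the 80% rule
    still fires alongside the 90% rule.
    """
    levels = []
    if used_percent > 80:
        levels.append("warning")
    if used_percent > 90:
        levels.append("disaster")
    return levels
```

For example, a filesystem with 1 GiB free out of 16 GiB is 93.75% used, so both disk rules would be active at once. If that duplication is unwanted, an Alertmanager inhibit rule or an upper bound on the 80% expression are common ways to suppress the lower-level alert.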
Monitoring web sites and certificate expiry
Web monitoring here means monitoring specific URLs or API endpoints. Through a few examples, we show how to configure Prometheus, blackbox_exporter and grafana to monitor the following aspects of a site:
Status code
Response time
Certificate expiry time
Web monitoring in Prometheus relies on blackbox_exporter.
blackbox_exporter can do much more than monitor web sites: it can also probe TCP ports, DNS, ICMP and so on.
The configuration consists of roughly these steps:
- Install blackbox_exporter (see the installation steps above)
- Configure the monitoring targets
- Configure the alert rules
- Configure the grafana dashboard
Add the following to the prometheus.yml configuration file:

      - job_name: 'blackbox_http_status'
        metrics_path: /probe
        params:
          module: [http_2xx]
        file_sd_configs:
          - files: ['/usr/local/prometheus/blackbox/job_web.yaml']
            refresh_interval: 15s
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115

      - job_name: 'blackbox_icmp_status'
        metrics_path: /probe
        params:
          module: [icmp]
        file_sd_configs:
          - files: ['/usr/local/prometheus/blackbox/icmp.yaml']
            refresh_interval: 15s
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115
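The three relabel steps are what turn a site URL from the target file into a request against blackbox_exporter: the original address becomes the `target` query parameter, it is copied into the `instance` label, and the scrape address is replaced by the exporter at 127.0.0.1:9115. The sketch below builds the resulting probe URL, which is handy for testing a target by hand in a browser or with curl. The function name `probe_url` is illustrative.

```python
from urllib.parse import urlencode


def probe_url(target, module="http_2xx", exporter="127.0.0.1:9115"):
    """Build the URL Prometheus ends up scraping for one blackbox target."""
    # params.module supplies `module`; the relabelled __param_target supplies `target`.
    query = urlencode({"module": module, "target": target})
    return "http://" + exporter + "/probe?" + query
```

Opening the URL returned by `probe_url("https://www.baidu.com/")` on the prometheus host should show the `probe_*` metrics for that site, including `probe_success` and `probe_http_status_code`.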
Create the target file and add entries as needed. The labels are only there to make the alerts more informative. After the file is in place, restart Prometheus.
Once the web sites appear on the Targets page, the configuration is complete.

    cat /usr/local/prometheus/blackbox/job_web.yaml
    ---
    - targets:
      - https://www.baidu.com/
      labels:
        env: pro
        app: web
        project: Baidu
        desc: Baidu production
    - targets:
      - https://blog.csdn.net/
      labels:
        env: test
        app: web
        project: CSDN
        desc: just a test
        not_200: "yes"
Configure the alert rules according to your own setup, and verify that an alert eventually fires.

    cat /usr/local/prometheus/rules/web.rules
    groups:
    - name: Web status alerts
      rules:
      - alert: Web access error
        expr: probe_http_status_code != 200
        for: 3m
        labels:
          level: disaster
        annotations:
          summary: "{{ $labels.instance }} web access error"
          description: "{{ $labels.instance }} web access error"

      - alert: Certificate expires in less than 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 3600*24*30
        labels:
          level: warning
        annotations:
          summary: "Web certificate will expire within 30 days: {{ $labels.instance }}"
          description: "{{ $labels.instance }} web certificate will expire within 30 days"
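The certificate rule compares two Unix timestamps: `probe_ssl_earliest_cert_expiry` is the expiry time of the earliest-expiring certificate in the chain, and `time()` is the evaluation time, so their difference is the number of seconds left. A small sketch of the same comparison, with hypothetical function names and example timestamps:

```python
THIRTY_DAYS = 3600 * 24 * 30  # same constant as in the rule expression


def seconds_left(expiry_ts, now_ts):
    """Mirror of: probe_ssl_earliest_cert_expiry - time()."""
    return expiry_ts - now_ts


def cert_alert_fires(expiry_ts, now_ts, threshold=THIRTY_DAYS):
    """True when the remaining lifetime is below the threshold."""
    return seconds_left(expiry_ts, now_ts) < threshold


def days_left(expiry_ts, now_ts):
    """Remaining lifetime in days, useful for a human-readable annotation."""
    return seconds_left(expiry_ts, now_ts) / 86400
```

An already expired certificate gives a negative difference, which is also below the threshold, so the same rule covers both "expiring soon" and "expired".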
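Finally, import the dashboard template in grafana: click Import, enter 14603, and the dashboard is ready to use. The final result is shown in the figure below.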