
Add monit monitoring of sd card hiccups and time leaps

pull/134/head
KemoNine 1 week ago
parent
commit
cd55cd7511
Signed by: KemoNine <kemonine@lollipopcloud.solutions> GPG Key ID: 3BC2928798AE11AB
1 changed file with 92 additions and 0 deletions

hardware/pine64.md (+92, -0)

@@ -43,3 +43,95 @@ dkms install zfs/0.7.13
43 43
 systemctl enable zfs-import-cache zfs-import.target zfs-mount zfs-share zfs.target
44 44
 
45 45
 ```
46
+
47
+## Monitor For Common Problems
48
+
49
+The Pine64 and SOPine can suffer from "clock jumps" (i.e. the system clock jumping forward 95 years) due to kernel bugs. They can also suffer major I/O stalls when writing heavily to microSD cards, to the point that the board becomes essentially unresponsive for many minutes (upwards of 10).
50
+
51
+The Monit configuration below watches for both events and reboots the board if either occurs. Currently this seems to be the least-worst option for recovery.
52
+
53
+### Monit Install / Initial Config
54
+
55
+``` bash
56
+
57
+apt install monit
58
+nano -w /etc/monit/monitrc
59
+    set mail-format { from: user@domain.tld }
60
+    set alert admin@domain.tld
61
+    set mailserver mail.domain.tld port 587
62
+               username "user@domain.tld" password "apassword"
63
+               using tls
64
+    set httpd port 2812 and
65
+        allow admin:apassword
66
+        allow guest:guest readonly
67
+        #with ssl {            # enable SSL/TLS and set path to server certificate
68
+        #    pemfile: /etc/ssl/certs/monit.pem
69
+        #}
70
+
71
+```
72
+
73
+
74
+### Monit Monitor For Large Forward Clock Jumps
75
+
76
+Save the following script as ```/usr/local/bin/check_clock_jump.py```:
77
+
78
+``` python
79
+
80
+#!/usr/bin/env python3
81
+
82
+import datetime
83
+import sys
84
+
85
+FORMAT_STRING = '%Y-%m-%d %H:%M:%S'
86
+MAX_TIME_JUMP = datetime.timedelta(days=90)
87
+CACHE_FILE = '/var/cache/last_time.check'
88
+
89
+current_time = datetime.datetime.now()
90
+last_time = current_time
91
+
92
+try:
93
+    with open(CACHE_FILE, 'r') as f:
94
+        last_time = datetime.datetime.strptime(f.read().strip(), FORMAT_STRING)
95
+except (FileNotFoundError, ValueError):
96
+    pass  # first run, or an empty/corrupt cache file: nothing to compare against
97
+
98
+time_jump = current_time - last_time
99
+if time_jump > MAX_TIME_JUMP:
100
+    sys.exit(1)
101
+
102
+with open(CACHE_FILE, 'w') as f:
103
+    f.write(current_time.strftime(FORMAT_STRING))
104
+
105
+sys.exit(0)
106
+
107
+```
108
+
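The comparison at the heart of the script can be tried in isolation. The following is a minimal sketch, not part of the committed file: the helper name `clock_jumped` and the sample timestamps are assumptions made for illustration, while the 90 day threshold is the one used by the script.

```python
import datetime

# Same threshold as MAX_TIME_JUMP in check_clock_jump.py.
MAX_TIME_JUMP = datetime.timedelta(days=90)


def clock_jumped(last_time, current_time, max_jump=MAX_TIME_JUMP):
    """Return True when the clock moved forward by more than max_jump.

    Hypothetical helper: the real script performs this comparison inline.
    """
    return (current_time - last_time) > max_jump


# Arbitrary sample timestamp, chosen only for the demonstration.
last_seen = datetime.datetime(2019, 6, 1, 12, 0, 0)

# An ordinary gap between two checks: no reboot would be triggered.
print(clock_jumped(last_seen, last_seen + datetime.timedelta(minutes=2)))

# A jump of 95 years forward: the script would exit 1 and Monit would reboot.
print(clock_jumped(last_seen, last_seen.replace(year=2114)))

# A backward jump gives a negative difference and is not flagged.
print(clock_jumped(last_seen, last_seen - datetime.timedelta(days=365)))
```

Note that the check is one-directional: only forward jumps larger than the threshold cause a non-zero exit status. Also, a board that is legitimately powered off for more than 90 days would look like a jump on its first check after boot.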
109
+``` bash
110
+
111
+chmod a+x /usr/local/bin/check_clock_jump.py
112
+cat > /etc/monit/conf.d/check_clock_jump.conf <<EOF
113
+check program check_clock_jump with path /usr/local/bin/check_clock_jump.py
114
+    if status != 0
115
+    then exec "/bin/systemctl reboot"
116
+        as uid "root" and gid "root"
117
+EOF
118
+
119
+systemctl restart monit
120
+
121
+```
122
+
123
+### Monit Monitor For ```card_busy_detect status: 0xe00``` Kernel Errors
124
+
125
+``` bash
126
+
127
+cat > /etc/monit/conf.d/card_busy_detect.conf <<EOF
128
+# From docs: On startup the read position is set to the end of the file and Monit continues to scan to the end of the file on each cycle.
129
+check file kernel path /var/log/kern.log
130
+if content = ".*card_busy_detect status: 0xe00.*" 
131
+    then exec "/bin/systemctl reboot"
132
+        as uid "root" and gid "root"
133
+EOF
134
+
135
+systemctl restart monit
136
+
137
+```
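The comment quoted from the Monit docs means that only content appended after Monit starts is tested against the pattern. The following Python sketch illustrates that behaviour under stated assumptions; it is not how Monit is implemented. The function name `scan_new_content`, the temporary file, and the sample log lines are all hypothetical. Only the pattern is taken from the rule above. Monit evaluates it as a POSIX regular expression; for this pattern Python's `re` module should behave the same way.

```python
import os
import re
import tempfile

# Pattern copied from the Monit rule; everything else here is illustrative.
PATTERN = re.compile(r".*card_busy_detect status: 0xe00.*")


def scan_new_content(path, offset, pattern=PATTERN):
    """Return (matching_lines, new_offset) for data appended after offset."""
    with open(path, "rb") as log:
        log.seek(offset)
        data = log.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    matches = [line for line in lines if pattern.search(line)]
    return matches, offset + len(data)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "kern.log")

    # A matching line written before monitoring starts is never examined.
    with open(log_path, "w") as log:
        log.write("hypothetical: card_busy_detect status: 0xe00\n")

    # "On startup the read position is set to the end of the file."
    offset = os.path.getsize(log_path)

    # Lines appended afterwards are scanned on the next cycle.
    with open(log_path, "a") as log:
        log.write("hypothetical: unrelated kernel message\n")
        log.write("hypothetical: card_busy_detect status: 0xe00\n")

    hits, offset = scan_new_content(log_path, offset)
    print(hits)
```

One consequence of this behaviour is that an old error already present in `/var/log/kern.log` should not trigger another reboot once the board comes back up, since scanning resumes from the end of the file.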
