Discussion:
on a out of the blue reboot
(too old to reply)
Boris Dorestand
2021-04-23 11:32:01 UTC
Permalink
I'm going to describe a problem I faced yesterday with a system. This
is a reboot out of the blue that looks more like a turn-off and turn-on
of the power. I'll appreciate anything you can say about it. Thanks!

(*) System

%lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

%uname -a
Linux ip-172-26-11-188 5.4.0-1045-aws #47~18.04.1-Ubuntu SMP Tue Apr 13 15:58:14 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
%

(*) What happened?

When I logged in to see what happened, I found a reboot:

--8<---------------cut here---------------start------------->8---
reboot system boot 5.4.0-1045-aws Thu Apr 22 07:37 - still running
--8<---------------cut here---------------end--------------->8---

I then adjusted everything I wanted to adjust and went to see who or
what happened for the reboot. I didn't find anything of relevance
except for syslog.log and kern.log.

In kern.log, I found booting kernel messages at 7:37am showing indeed
the system rebooted. But I wanted to see what happeened before that and
kern.log didn't say anything (older logs didn't help either). So I went
to syslog, where I found interesting information:

%ls -l /var/log/syslog*
-rw-r----- 1 syslog adm 450751 Apr 22 16:43 /var/log/syslog
-rw-r----- 1 syslog adm 21824 Apr 22 06:25 /var/log/syslog.1
-rw-r----- 1 syslog adm 2062 Apr 21 06:25 /var/log/syslog.2.gz
-rw-r----- 1 syslog adm 2037 Apr 20 06:25 /var/log/syslog.3.gz
-rw-r----- 1 syslog adm 2027 Apr 19 06:25 /var/log/syslog.4.gz
-rw-r----- 1 syslog adm 2189 Apr 18 06:25 /var/log/syslog.5.gz
-rw-r----- 1 syslog adm 1988 Apr 17 06:25 /var/log/syslog.6.gz
-rw-r----- 1 syslog adm 2115 Apr 16 06:25 /var/log/syslog.7.gz
%

The file syslog.1,

-rw-r----- 1 syslog adm 21824 Apr 22 06:25 /var/log/syslog.1

despite having date of 06:25, begins with:

--8<---------------cut here---------------start------------->8---
Apr 22 10:30:01 ip-172-26-11-188 CRON[25135]: (boris) CMD (/home/boris/system1/update.sh)
Apr 22 10:36:15 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 10:36:15 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 10:48:32 ip-172-26-11-188 systemd[1]: Starting Daily apt upgrade and clean activities...
Apr 22 10:48:34 ip-172-26-11-188 systemd[1]: Started Daily apt upgrade and clean activities.
Apr 22 11:00:01 ip-172-26-11-188 CRON[25705]: (boris) CMD (/home/boris/system1/update.sh)
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Apr 22 07:37:25 ip-172-26-11-188 systemd-modules-load[440]: Inserted module 'iscsi_tcp'
Apr 22 07:37:25 ip-172-26-11-188 kernel: [ 0.000000] Linux version 5.4.0-1045-aws (***@lgw01-amd64-003) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #47~18.04.1-Ubuntu SMP Tue Apr 13 15:58:14 UTC 2021 (Ubuntu 5.4.0-1045.47~18.04.1-aws 5.4.101)
Apr 22 07:37:25 ip-172-26-11-188 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1045-aws root=UUID=bbf64c6d-bc15-4ae0-aa4c-608fd9820d95 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
--8<---------------cut here---------------end--------------->8---

That looks odd. It begins with Apr 22 10:30 and suddenly gets back to
Apr 22 07:37 where the reboot takes place. There is no sign of a
shutdown or a reboot command. I opened this file with emacs where we
can see there's binary data in the log. The binary data is a sequence
of zeroes --- I looked at it with hexlify-buffer.

--8<---------------cut here---------------start------------->8---
000003d0: 2e68 6f75 726c 7929 0a00 0000 0000 0000 .hourly)........
000003e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000003f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000400: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000410: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000420: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000430: 0000 0000 0000 0000 0000 0000 0000 0041 ...............A
00000440: 7072 2032 3220 3037 3a33 373a 3235 2069 pr 22 07:37:25 i
00000450: 702d 3137 322d 3236 2d31 312d 3138 3820 p-172-26-11-188
00000460: 7379 7374 656d 642d 6d6f 6475 6c65 732d systemd-modules-
00000470: 6c6f 6164 5b34 3430 5d3a 2049 6e73 6572 load[440]: Inser
00000480: 7465 6420 6d6f 6475 6c65 2027 6973 6373 ted module 'iscs
00000490: 695f 7463 7027 0a41 7072 2032 3220 3037 i_tcp'.Apr 22 07
--8<---------------cut here---------------end--------------->8---

Thsi is a Lightsail AWS machine. Isn't this a virtual machine? Is
losing power in such machines something that could actually happen with
non-negligible probability? I'm assuming the server rebooted out of the
blue due to either a kernel problem or a hardware problem. Wouldn't a
kernel problem most likely log some kernel panic in syslog?

I'm totally in the dark.
Paul
2021-04-23 13:19:52 UTC
Permalink
Post by Boris Dorestand
I'm going to describe a problem I faced yesterday with a system. This
is a reboot out of the blue that looks more like a turn-off and turn-on
of the power. I'll appreciate anything you can say about it. Thanks!
(*) System
%lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
%uname -a
Linux ip-172-26-11-188 5.4.0-1045-aws #47~18.04.1-Ubuntu SMP Tue Apr 13 15:58:14 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
%
(*) What happened?
--8<---------------cut here---------------start------------->8---
reboot system boot 5.4.0-1045-aws Thu Apr 22 07:37 - still running
--8<---------------cut here---------------end--------------->8---
I then adjusted everything I wanted to adjust and went to see who or
what happened for the reboot. I didn't find anything of relevance
except for syslog.log and kern.log.
In kern.log, I found booting kernel messages at 7:37am showing indeed
the system rebooted. But I wanted to see what happeened before that and
kern.log didn't say anything (older logs didn't help either). So I went
%ls -l /var/log/syslog*
-rw-r----- 1 syslog adm 450751 Apr 22 16:43 /var/log/syslog
-rw-r----- 1 syslog adm 21824 Apr 22 06:25 /var/log/syslog.1
-rw-r----- 1 syslog adm 2062 Apr 21 06:25 /var/log/syslog.2.gz
-rw-r----- 1 syslog adm 2037 Apr 20 06:25 /var/log/syslog.3.gz
-rw-r----- 1 syslog adm 2027 Apr 19 06:25 /var/log/syslog.4.gz
-rw-r----- 1 syslog adm 2189 Apr 18 06:25 /var/log/syslog.5.gz
-rw-r----- 1 syslog adm 1988 Apr 17 06:25 /var/log/syslog.6.gz
-rw-r----- 1 syslog adm 2115 Apr 16 06:25 /var/log/syslog.7.gz
%
The file syslog.1,
-rw-r----- 1 syslog adm 21824 Apr 22 06:25 /var/log/syslog.1
--8<---------------cut here---------------start------------->8---
Apr 22 10:30:01 ip-172-26-11-188 CRON[25135]: (boris) CMD (/home/boris/system1/update.sh)
Apr 22 10:36:15 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 10:36:15 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 10:48:32 ip-172-26-11-188 systemd[1]: Starting Daily apt upgrade and clean activities...
Apr 22 10:48:34 ip-172-26-11-188 systemd[1]: Started Daily apt upgrade and clean activities.
Apr 22 11:00:01 ip-172-26-11-188 CRON[25705]: (boris) CMD (/home/boris/system1/update.sh)
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 22 07:37:25 ip-172-26-11-188 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1045-aws root=UUID=bbf64c6d-bc15-4ae0-aa4c-608fd9820d95 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
--8<---------------cut here---------------end--------------->8---
That looks odd. It begins with Apr 22 10:30 and suddenly gets back to
Apr 22 07:37 where the reboot takes place. There is no sign of a
shutdown or a reboot command. I opened this file with emacs where we
can see there's binary data in the log. The binary data is a sequence
of zeroes --- I looked at it with hexlify-buffer.
--8<---------------cut here---------------start------------->8---
000003d0: 2e68 6f75 726c 7929 0a00 0000 0000 0000 .hourly)........
000003e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000003f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000400: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000410: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000420: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000430: 0000 0000 0000 0000 0000 0000 0000 0041 ...............A
00000440: 7072 2032 3220 3037 3a33 373a 3235 2069 pr 22 07:37:25 i
00000450: 702d 3137 322d 3236 2d31 312d 3138 3820 p-172-26-11-188
00000460: 7379 7374 656d 642d 6d6f 6475 6c65 732d systemd-modules-
00000470: 6c6f 6164 5b34 3430 5d3a 2049 6e73 6572 load[440]: Inser
00000480: 7465 6420 6d6f 6475 6c65 2027 6973 6373 ted module 'iscs
00000490: 695f 7463 7027 0a41 7072 2032 3220 3037 i_tcp'.Apr 22 07
--8<---------------cut here---------------end--------------->8---
Thsi is a Lightsail AWS machine. Isn't this a virtual machine? Is
losing power in such machines something that could actually happen with
non-negligible probability? I'm assuming the server rebooted out of the
blue due to either a kernel problem or a hardware problem. Wouldn't a
kernel problem most likely log some kernel panic in syslog?
I'm totally in the dark.
You know there's an iSCSI exploit, right ?

https://www.bleepingcomputer.com/news/security/15-year-old-linux-kernel-bugs-let-attackers-gain-root-privileges/

Is that piece of software being patched, receiving updates ?
How do you verify that ?

All that set me off here, was the word "iscsi", I don't know
anything more about it.

Paul
Richard Kettlewell
2021-04-23 13:33:33 UTC
Permalink
Post by Paul
You know there's an iSCSI exploit, right ?
https://www.bleepingcomputer.com/news/security/15-year-old-linux-kernel-bugs-let-attackers-gain-root-privileges/
Is that piece of software being patched, receiving updates ?
How do you verify that ?
Based on https://ubuntu.com/security/CVE-2021-27365 I think the OP
already has the fix for that one.
--
https://www.greenend.org.uk/rjk/
Rainer Weikusat
2021-04-23 16:31:48 UTC
Permalink
Boris Dorestand <***@example.com> writes:

[Spontaneous reboot]
Post by Boris Dorestand
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
The ^@ are a bunch of zero bytes. This is usually a sign of a system
rebooting because of a hardware problem (log file corrupted due to
incomplete write).
Boris Dorestand
2021-04-23 19:26:13 UTC
Permalink
Post by Rainer Weikusat
[Spontaneous reboot]
Post by Boris Dorestand
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network
configuration changed, trying to establish connection.
Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
22 07:37:25 ip-172-26-11-188 systemd-modules-load[440]: Inserted
module 'iscsi_tcp'
rebooting because of a hardware problem (log file corrupted due to
incomplete write).
Thank you for your thoughts on this. I appreciate it.
Casper H.S. Dik
2021-04-28 15:35:46 UTC
Permalink
Post by Rainer Weikusat
[Spontaneous reboot]
Post by Boris Dorestand
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
rebooting because of a hardware problem (log file corrupted due to
incomplete write).
Or a file system which has changed the size of a file because of a write but
has not yet written that particular block.

Casper
Rainer Weikusat
2021-04-28 15:54:31 UTC
Permalink
Post by Casper H.S. Dik
Post by Rainer Weikusat
[Spontaneous reboot]
Post by Boris Dorestand
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
22 07:37:25 ip-172-26-11-188 systemd-modules-load[440]: Inserted
module 'iscsi_tcp'
rebooting because of a hardware problem (log file corrupted due to
incomplete write).
Or a file system which has changed the size of a file because of a write but
has not yet written that particular block.
Yes, that's what I meant :-).
Kaz Kylheku
2021-04-28 17:39:02 UTC
Permalink
Post by Casper H.S. Dik
Post by Rainer Weikusat
[Spontaneous reboot]
Post by Boris Dorestand
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
rebooting because of a hardware problem (log file corrupted due to
incomplete write).
Or a file system which has changed the size of a file because of a write but
has not yet written that particular block.
That was my reaction when I read that report first. The size meta-data
got committed, but the most up-to-date write of the affected data block
didn't have a chance to complete.

E.g, original block at the end of the file: caret indicates length:

"how now brown0000000000000000" (commited to storage)
^

New write: comes in, adding " cow".

"how now brown cow000000000000" (this block is uncommited)
^

The length update is committed:

"how now brown0000000000000000" (commited to storage: length only)
^

At this point, the machine completely bites the dust, and so the block
doesn't get updated. The system boots up up again and starts logging
"Apr 22" at the end of the file indicated by the length, leaving
a block of zeros:

"how now brown0000Apr 22000000"
^
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Casper H.S. Dik
2021-04-29 13:17:03 UTC
Permalink
Post by Kaz Kylheku
Post by Casper H.S. Dik
Post by Rainer Weikusat
[Spontaneous reboot]
Post by Boris Dorestand
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Network configuration changed, trying to establish connection.
Apr 22 11:06:14 ip-172-26-11-188 systemd-timesyncd[9972]: Synchronized to time server 91.189.89.199:123 (ntp.ubuntu.com).
Apr 22 11:17:01 ip-172-26-11-188 CRON[26092]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
rebooting because of a hardware problem (log file corrupted due to
incomplete write).
Or a file system which has changed the size of a file because of a write but
has not yet written that particular block.
That was my reaction when I read that report first. The size meta-data
got committed, but the most up-to-date write of the affected data block
didn't have a chance to complete.
A proper file system, of course, would not have that issue :-)

Originally, we added "logging" to the "ufs" (which came from BSD as "fastfs")

This ade sure that the metadata was consistent (that is, fsck sees no errors)
but that does not mean that there might be data lost; particular we'd see
files having the data from another file that had been removed and the block
had been allocated to a different file but its data had not been written yet.

In ZFS, the meta data and the data are part of the same transaction. You'd
see the file with the new size + new data or you find it with old data and
the old size.


Casper

Loading...