This evening my 3B running PiAware 3.8.0 has suddenly started displaying unusual Overall CPU Utilization in graphs1090, and it’s a bit laggy when SSH’d in. It appears to be a real effect that’s being measured, not a graph artifact.
- me dropping the gain a couple of notches and rebooting
- me rebooting again after noticing this behaviour
- me setting the gain back to what it was and rebooting again in case that was relevant (it wasn't)
I use full-size graphs, which makes it easier to align events vertically across all the graphs. This behaviour started at 19:10 UTC. In this case no other graph shows anything unusual. There's nothing of note in syslog, piaware.log or dmesg at that or any other time. ps and top reveal nothing unexpected. Mono and tmux weren't running today, so they're not related.
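For anyone retracing this later, the checks were roughly along these lines (note journalctl only reaches back to the last boot unless persistent journalling is enabled, which a stock image may not have):

journalctl --since "19:00" --until "19:20"   # system journal around the event time
tail -n 100 /var/log/piaware.log             # recent PiAware activity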
I did an apt update and upgrade, which grabbed and installed a load of kernel updates (last done a couple of weeks ago), and rebooted. Oddly, the behaviour looks like it's returned to normal. I haven't physically touched the unit at all.
I saw these replies after I'd updated and discovered that the behaviour appears to be back to normal. The two Disk I/O graphs showed nothing of note at the point of change or during it, not that they necessarily would. I've grabbed iotop (thanks, I'd not heard of it) for any future investigations. It's a SanDisk card, a couple of years old, and it only gets low levels of use from PiAware.
Perhaps a power fluctuation corrupted something and that update happens to have refreshed it. Seems unlikely though.
Thanks @caius, that’s a really handy test to have around, I’ve saved it to use in future. Much appreciated.
It could be that some PiAware component crashed and left the platform in an unexpected state, the side-effect of which was this behaviour. But there’s nothing in piaware.log suggesting that and my stats and aircraft have been unaffected by this.
Looks like it’s going to be one of those annoying occasional computer-related mysteries which resolves itself and never reveals what went wrong!
EDIT – I found this article which has that test @caius posted.
Hi @chrislfa - I've been experiencing the exact same thing! SSH login is slow (sometimes up to 10-15 seconds to log in). I see the high CPU utilization for ~2 days and then it disappears for ~2 days.
Rebooting has no effect, and for me the gain isn't changing (that I know of!). I've been staring at top for a few days trying to determine the "why" (like watching paint dry!). No luck!
Note that io-wait ("wa" in top) is not "real" CPU load - it's time the CPU spends idle while a process that wants to run is blocked waiting for I/O on an external device (e.g. an SD card read or write) to complete. Look for processes in state D (shown in the "S" column in top).
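A quick generic way to list those (my own one-liner, nothing PiAware-specific):

# show any processes in uninterruptible sleep (state D); wchan hints at what they're blocked on
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'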
It's either sudden heavy I/O or your SD card is looking unhappy.
Note that a pristine PiAware image does very little I/O, but adding things like graphs / stats collection will increase it.
I'm starting to suspect some sort of caching issue hitting a limit and suddenly leaving the CPU idle, waiting on I/O. Possibly caused by the graphs, or by any of the tweaks done to 3.8.0 over time. Other than that I'm not losing any sleep over the card I/O from graphs. I'm looking forward to being able to start over with a nice clean 3.8.1, and ditch the dhcpcd patching and static IP workaround along the way.
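If that theory holds, the kernel's dirty-page counters ought to climb just before the iowait appears. A quick way to watch for that (standard /proc/meminfo fields; I haven't yet caught the problem in the act this way):

# dirty pages waiting to be flushed, and pages currently being written back
grep -E '^(Dirty|Writeback):' /proc/meminfo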
The card looks happy with a read test.
pi@piaware:~ $ mount | grep " / "
/dev/mmcblk0p2 on / type ext4 (rw,noatime)
pi@piaware:~ $ sudo badblocks -sv /dev/mmcblk0p2
Checking blocks 0 to 15382015
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
How odd, thanks for the extra data points @JonMu! Yes, the gain change and reboot here was just a coincidence, coming a little under 2 hours before it started. I've checked the 30 day graph and in fact I have had this before; it just didn't catch my attention until this evening. There was one last month too.
Are you also running the PiAware 3.8.0 image? Are you on a Pi 3B?
Next time it happens, try out @caius' command and check that link for other ways to identify the cause. Here's one: if you install the iotop mentioned by @wiedehopf, you can use the -a and -o switches to toggle accumulated reads/writes and only show active processes. You can also press 'a' and 'o' while it's running to toggle them interactively.
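e.g. (assuming stock Raspbian package names):

sudo apt install iotop
sudo iotop -ao   # -a: accumulated totals, -o: only show processes actually doing I/O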
Interestingly, the effect doesn't show on the graphs as disk I/O, while the badblocks read test did. On the graphs below you can see:

- The problem on the left side up to around 01:30, with nothing showing for disk I/O
- The apt upgrade, which shows both CPU activity and disk writes as expected, then a reboot (two in fact, I was testing shutdown vs reboot)
- The problem has then disappeared, which is a bit annoying…
- Another reboot around 04:50 (more testing), and then the badblocks read test at 05:00, which shows a similar CPU effect but also clearly registers disk read I/O
So what could permanently tie up CPU resource with I/O wait but apparently not do any disk read or write I/O?
I’ll keep an eye on it, but the card’s well within its lifespan given its usage so I’d really like a way to test that idea rather than assume it.
Do you mean “dying” as in it’s reached a stage where it’s physically becoming unable to store data despite wear-levelling? If so then that impacts the OS and I’d expect to be able to detect that using OS tools. The graph could be evidence of that but since we’re trying to understand the graph it doesn’t help here. I’ll try a non-destructive badblocks write test shortly and see what that says.
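For reference, the test I have in mind is badblocks' -n mode; it can't be run against a mounted filesystem, so the card will need to go in another machine or be checked from a rescue boot (device name here as it appears on the Pi):

# non-destructive read-write test: reads each block, writes a test pattern, then restores the original data
sudo badblocks -nsv /dev/mmcblk0p2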
“Tie up CPU resource with I/O wait” was my clumsy way of saying that if CPU resource (ie part of all available CPU) is idle waiting on a process that is waiting for I/O, then it’s tied up, unavailable to other processes.
Well, not exactly. Nothing about the I/O wait time makes the CPU unavailable to other processes. It's more a subset of idle time: the time when the scheduler was idle and there was at least one process blocked waiting for I/O to complete.
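You can watch both halves of that at once with vmstat (a standard procps tool): the "b" column counts processes blocked in uninterruptible sleep, and the "wa" column shows the iowait percentage.

vmstat 1   # one sample per second; watch the b and wa columns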
While searching the interwebs for ideas a few days ago I found anecdotal reports that bluetooth was implicated. Since it’s not needed on PiAware I’ve removed the entire stack and rebooted.
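From memory the removal went roughly like this (package and overlay names as on a stock Raspbian image for a 3B, so treat it as a sketch rather than a recipe):

sudo apt purge bluez pi-bluetooth                                # remove the userspace bluetooth stack
echo 'dtoverlay=pi3-disable-bt' | sudo tee -a /boot/config.txt   # stop the firmware bringing up the radio
sudo reboot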
I'll see if it happens again; if it does then bluetooth was likely not relevant. The next step will be a destructive SD card write test and a fresh PiAware install. Is there an ETA for PiAware 3.8.1? There are a couple of awkward hacks needed with 3.8.0 (the static IP bug and the dhcpcd segfault patch) which I might as well avoid by waiting, if it's close to release.
Where does the data used to produce the dark red "io" line on the "Overall CPU Utilization" graph come from? I'd like to be able to monitor this without graphs1090 installed, so obviously I can't use the graph to detect it; I need the underlying mechanism the graph would be using.
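From what I can tell, graphs1090 gathers this via collectd, and the raw numbers ultimately come from /proc/stat: the fifth value on the aggregate "cpu" line is cumulative iowait time in clock ticks (USER_HZ, usually 100 per second per core, summed across cores). So something like this should sample the same quantity (my own sketch, not how collectd literally does it):

# measure iowait ticks accumulated over a 10 second window
t0=$(awk '/^cpu /{print $6}' /proc/stat); sleep 10
t1=$(awk '/^cpu /{print $6}' /proc/stat)
echo "iowait ticks in the last 10s: $((t1 - t0))"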