Odd Overall CPU Utilization for no obvious reason

This evening my 3B running PiAware 3.8.0 has suddenly started displaying unusual Overall CPU Utilization in graphs1090, and it’s a bit laggy when SSH’d in. It appears to be a real effect that’s being measured, not a graph artifact.

The three gaps, in order, are:

  • me dropping the gain a couple of notches and rebooting
  • me rebooting again after noticing this behaviour
  • me setting the gain back to what it was and rebooting again in case that was relevant (it wasn’t)

I use full-size graphs which makes it easier to align events vertically across all the graphs. This behaviour started at 19:10 UTC. In this case no other graph shows anything unusual. There’s nothing of note in syslog or piaware.log or dmesg at that or any other time. ps and top reveal nothing unexpected. Mono and tmux weren’t running today so they’re not related.

Anyone else seen similar behaviour?

Haven’t seen anything like this.

Try checking htop or iotop?
It looks like it’s waiting on I/O (the SD card, presumably?).

Might be a card going bad, I suppose.
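
If they’re not already installed, both tools are in the standard Raspbian repositories; I believe the package names are simply htop and iotop:

$ sudo apt-get install htop iotop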

Try running this:

for x in $(seq 1 10); do ps -eo state,pid,cmd | grep "^D"; echo "----"; sleep 5; done

That will show processes stuck in state D (uninterruptible sleep), which usually means they’re waiting for I/O. Might help to narrow it down a bit.

I had similar graphs when planefinder crapped out. I had to manually stop pfclient. Otherwise not seen anything like this.

I did an apt update and upgrade, which grabbed and installed a load of kernel updates (last done a couple of weeks ago), and rebooted. Oddly, the behaviour looks like it’s returned to normal. I haven’t physically touched the unit at all.

[graph: Overall CPU Utilization back to normal]

I saw these replies after I’d updated and discovered that the behaviour appears to be back to normal. The two Disk I/O graphs showed nothing of note at the point of change or during it, not that they necessarily would. I’ve grabbed iotop (thanks, I’d not heard of it) for any future investigations. It’s a SanDisk card, a couple of years old, and it gets only light use from PiAware.

Perhaps a power fluctuation corrupted something and that update happens to have refreshed it. Seems unlikely though.

Thanks @caius, that’s a really handy test to have around; I’ve saved it to use in future. Much appreciated.

It could be that some PiAware component crashed and left the platform in an unexpected state, the side-effect of which was this behaviour. But there’s nothing in piaware.log suggesting that and my stats and aircraft have been unaffected by this.
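
One more place worth looking, assuming the standard systemd units on a PiAware setup (piaware and dump1090-fa): the journal records crashes and restarts even when piaware.log shows nothing, e.g.

$ systemctl status piaware dump1090-fa
$ journalctl -u piaware --since "2 hours ago"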

Looks like it’s going to be one of those annoying occasional computer-related mysteries which resolves itself and never reveals what went wrong!

EDIT: I found this article, which has that test @caius posted.

Troubleshooting High I/O Wait in Linux

Some more good diags in there. If this happens again I’m confident I can track it down now, and I’ll report back here with what’s going on.

Hi @chrislfa - I’ve been experiencing the exact same thing! SSH login is slow (sometimes taking 10-15 seconds). I see the high CPU utilization for ~2 days, and then it disappears for ~2 days.

[graph: Overall CPU Utilization over several days (times shown are Chicago time, CST)]

Rebooting has no effect. And for me the gain isn’t changing (that I know of!). I’ve been staring at the top command for a few days trying to determine the “why” (like watching paint dry!). No luck!

I agree. It seems to be I/O related…

EDIT: the issue ended near 17:30 CST. So the tests suggested above may not help (yet!)

Load averages are way down:

[screenshot: load averages]

Note that io-wait (“wa” in top) is not “real” CPU; it is the time spent by a process that wants to run, but which is waiting for I/O on an external device (e.g. an SD card read or write) to complete. Look for processes in state D (the “S” column in top).

It’s either sudden heavy I/O or your sdcard is looking unhappy.

Note that a pristine PiAware image does very little I/O, but adding things like graphs / stat collection will increase it.

This post is a good explanation.

I’m starting to suspect some sort of caching issue hitting a limit and running into a brick wall of idle CPU waiting for I/O. Possibly caused by the graphs or by any of the tweaks done to 3.8.0 over time. Other than that I’m not losing any sleep over the card I/O from graphs. I’m looking forward to being able to start over with a nice clean 3.8.1, and ditch the dhcpcd patching and static IP workaround along the way.
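
If a write-back cache really is backing up, that should show in the kernel’s own counters; a quick way to watch them (standard procfs, nothing to install):

$ watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'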

The card looks happy with a read test.

pi@piaware:~ $ mount | grep " / "
/dev/mmcblk0p2 on / type ext4 (rw,noatime)
pi@piaware:~ $ sudo badblocks -sv /dev/mmcblk0p2
Checking blocks 0 to 15382015
Checking for bad blocks (read-only test): done                                                 
Pass completed, 0 bad blocks found. (0/0/0 errors)

How odd, thanks for the extra data points @JonMu! Yes, the gain change and reboot here were just a coincidence, happening just under 2 hours before it started. I’ve checked the 30 day graph and in fact I have had this before; it just didn’t catch my attention until this evening. There was one last month too.

Are you also running the PiAware 3.8.0 image? Are you on a Pi 3B?

Next time it happens, try out @caius’ command and check that link for other ways to identify the cause. Here’s one: if you install the iotop mentioned by @wiedehopf, you can use the -a and -o switches to toggle accumulated reads/writes and to show only active processes. You can also press ‘a’ and ‘o’ while it’s running to toggle them interactively.

$ sudo iotop -ao

[screenshot: iotop -ao output]

Interestingly, the effect isn’t showing on the graphs as disk I/O, whereas the badblocks read test did. On the graphs below you can see:

  • The problem on the left side up to around 01:30, with nothing showing for disk I/O
  • The apt upgrade, which shows both CPU activity and disk writes as expected, then a reboot (two in fact; I was testing shutdown vs reboot)
  • The problem has then disappeared which is a bit annoying…
  • Another reboot around 04:50 (more testing) and then the badblocks read test at 05:00 which shows a similar CPU effect but also clearly registers disk read I/O

So what could permanently tie up CPU resource with I/O wait but apparently not do any disk read or write I/O?

Backoff-and-retry on I/O error? That’s why I suspected a dying sdcard.

(also to clarify again, it’s not tying up CPU resources; it’s saying the CPU is idle waiting on some other resource)
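
If the card is failing and the kernel is retrying I/O, it normally complains in the kernel log, so that theory is cheap to check:

$ dmesg | grep -iE 'mmc|i/o error'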

I’ll keep an eye on it, but the card’s well within its lifespan given its usage, so I’d really like a way to test that idea rather than assume it.

Do you mean “dying” as in it’s reached a stage where it’s physically becoming unable to store data despite wear-levelling? If so then that impacts the OS and I’d expect to be able to detect that using OS tools. The graph could be evidence of that but since we’re trying to understand the graph it doesn’t help here. I’ll try a non-destructive badblocks write test shortly and see what that says.
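
For reference, the non-destructive read-write mode is badblocks -n; it refuses to run against a mounted filesystem, so the root partition would need testing with the card in another machine (or from a rescue boot):

$ sudo badblocks -nsv /dev/mmcblk0p2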

“Tie up CPU resource with I/O wait” was my clumsy way of saying that if CPU resource (i.e. part of all available CPU) is idle waiting on a process that is waiting for I/O, then it’s tied up, unavailable to other processes.

Well, not exactly. Nothing about the I/O wait time makes the CPU unavailable to other processes. It is more of a subset of idle time; it’s the times when the scheduler was idle and there was at least one process blocked waiting for I/O to complete.

The CPU issues came back fast this time…

No. The current Buster with PiAware added on top. And I’m running an RPi 2B.

Here are the results, though they don’t mean much to me…

Just ran for about 30 seconds…

I noticed this happened again earlier today and then resolved itself a few hours later.

While searching the interwebs for ideas a few days ago, I found anecdotal reports that Bluetooth was implicated. Since it’s not needed on PiAware, I’ve removed the entire stack and rebooted.

$ sudo apt-get purge bluez -y
$ sudo apt-get autoremove -y

I’ll see if it happens again. If it does, then Bluetooth was likely not relevant. The next step will be a destructive SD card write test (sketched below) and a fresh PiAware install. Is there an ETA for PiAware 3.8.1? There are a couple of awkward hacks with 3.8.0 (the static IP bug and the dhcpcd segfault patch) which I might as well wait and avoid if it’s close to release.
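
For the record, the destructive variant is badblocks -w, which overwrites every block with test patterns and so wipes the card completely. It would be run with the card in a reader on another machine:

$ sudo badblocks -wsv /dev/sdX   # /dev/sdX is a placeholder; this destroys all data on that device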

Where does the data come from that’s used to produce the dark red “io” lines on the “Overall CPU Utilization” graph? I’d like to monitor this on a machine without graphs1090 installed, so I can’t use the graph itself and instead need the underlying mechanism the graph relies on.
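
As far as I can tell, graphs1090 feeds this graph from collectd, which ultimately reads the kernel counters in /proc/stat, so the same numbers can be sampled directly. A minimal sketch, assuming the usual /proc/stat layout where the fifth number after the “cpu” label is iowait:

#!/bin/bash
# Print iowait as a percentage of all CPU time over a 5-second window.
# First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 5
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
t1=$((u1+n1+s1+i1+w1+q1+sq1+st1))
t2=$((u2+n2+s2+i2+w2+q2+sq2+st2))
echo "iowait: $(( 100 * (w2 - w1) / (t2 - t1) ))%"

vmstat 5 reports the same figure in its “wa” column, if that’s easier.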

Looks like my SD card was unhappy.

Steps I took to move to a new SD card:

  1. Shut down FA
  2. Removed the FA SD card from the RasPi
  3. Plugged the old FA SD card into a card reader on a spare RasPi
  4. Ran dd and copied the SD card to an image file (I wanted a backup!)
    • /dev/sdb was my old SD card
      • Note: this may be different on your device!
    • /media/Images was a spare external hard drive - mounted
sudo dd bs=1M if=/dev/sdb of=/media/Images/sdCard.img status=progress && sync
  5. Removed the old SD card from the card reader
  6. Inserted a brand new SD card into a card reader on the spare RasPi
  7. Ran dd to copy the image to the new SD card
    • /dev/sdb was the new SD card
      • Note: this may be different on your device!
    • /media/Images was a spare external hard drive - mounted
sudo dd bs=1M of=/dev/sdb if=/media/Images/sdCard.img status=progress && sync
  8. Unplugged the new FA SD card from the spare RasPi
  9. Inserted the new FA SD card into the FA RasPi & powered up! (An optional verification step is sketched below.)
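
A sanity check worth doing before powering up (my addition, not part of the original steps): compare the freshly written card against the image. cmp -n limits the comparison to the image’s size, since the new card may be larger than the image:

$ sudo cmp -n "$(stat -c %s /media/Images/sdCard.img)" /dev/sdb /media/Images/sdCard.img && echo "copy verified"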

Everything has been working A-OK for 9 days. :grinning: