FlightAware Discussions

CPU hikes crash dump1090-fa

I have been using a Debian build by zenonp (Source packages?) for months. But this entire week, dump1090-fa has crashed nearly every day at various times. Before each crash, there was always a CPU hike. Once I got into the server before the crash and noticed that dump1090-fa was the cause of the CPU hike. What could this be?


The following excerpt from piaware.log covers the span from the previous start to the latest crash. It provides little clue to the cause of the crash.

-- A start job for unit dump1090-fa.service has finished successfully.
-- 
-- The job identifier is 15071.
Mar 13 11:32:21 fabian64 dump1090-fa[8899]: Sat Mar 13 11:32:21 2021 PST  dump1090-fa 4.0 starting up.
Mar 13 11:32:22 fabian64 dump1090-fa[8899]: rtlsdr: using device #0: Generic RTL2832U (Realtek, RTL2832U, SN 00001000)
Mar 13 11:32:22 fabian64 dump1090-fa[8899]: Found Rafael Micro R820T tuner
Mar 13 11:32:22 fabian64 dump1090-fa[8899]: rtlsdr: tuner gain set to 43.4 dB
Mar 13 11:32:22 fabian64 dump1090-fa[8899]: Allocating 4 zero-copy buffers
Mar 14 13:45:57 fabian64 dump1090-fa[8899]: Sun Mar 14 13:45:57 2021 PDT  Caught SIGTERM, shutting down..
Mar 14 13:45:57 fabian64 systemd[1]: Stopping dump1090 ADS-B receiver (FlightAware customization)...
-- Subject: A stop job for unit dump1090-fa.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A stop job for unit dump1090-fa.service has begun execution.
-- 
-- The job identifier is 17191.
Mar 14 13:47:28 fabian64 systemd[1]: dump1090-fa.service: State 'stop-sigterm' timed out. Killing.
Mar 14 13:47:28 fabian64 systemd[1]: dump1090-fa.service: Killing process 8899 (dump1090-fa) with signal SIGKILL.
Mar 14 13:47:28 fabian64 systemd[1]: dump1090-fa.service: Main process exited, code=killed, status=9/KILL
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- An ExecStart= process belonging to unit dump1090-fa.service has exited.
-- 
-- The process' exit code is 'killed' and its exit status is 9.
Mar 14 13:47:28 fabian64 systemd[1]: dump1090-fa.service: Failed with result 'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit dump1090-fa.service has entered the 'failed' state with result 'timeout'.
Mar 14 13:47:28 fabian64 systemd[1]: Stopped dump1090 ADS-B receiver (FlightAware customization).
-- Subject: A stop job for unit dump1090-fa.service has finished
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A stop job for unit dump1090-fa.service has finished.
-- 
-- The job identifier is 17191 and the job result is done.
Mar 14 13:47:28 fabian64 systemd[1]: Started dump1090 ADS-B receiver (FlightAware customization).
-- Subject: A start job for unit dump1090-fa.service has finished successfully

Please try 5.0 and see if the problem continues.

(it’s possibly the bug fixed by this commit)

Good point. I was reading about 5.0. There is no package in Debian yet. I remember the last time zenonp built one. I guess this work has to be repeated, or I'll need to compile from source.

Update: Just in time to see ABCD567’s

This is awkward. So I updated to 5.0 following ABCD567's instructions. The result? I got two message gaps in one day, but instead of CPU surges they corresponded to two CPU dips this time.


I suspect that something in the OS, or even the hardware, is wrong. Right before this whole thing started, this machine crashed on its own and I don't know why.

What do the dump1090-fa logs look like at the time?
(But yes, that does smell like a hardware issue)

I’d guess low SDR / USB voltage or the SDR is dying.
But checking the log like obj suggests is the way to go.

This time the logs are definitely different. The first onset shows a sudden disappearance of messages; then piaware forced a dump1090-fa restart after one hour (this explains the wider gap). Then everything looked normal until the next dip.

  1. From 152 msg/s to 0

    Mar 16 15:35:56 fabian64 piaware[20989]: mlat-client(21036): Receiver status: connected
    Mar 16 15:35:56 fabian64 piaware[20989]: mlat-client(21036): Server status: synchronized with 439 nearby receivers
    Mar 16 15:35:56 fabian64 piaware[20989]: mlat-client(21036): Receiver: 152.2 msg/s received 77.5 msg/s processed (51%)
    Mar 16 15:35:56 fabian64 piaware[20989]: mlat-client(21036): Server: 0.0 kB/s from server 0.0kB/s TCP to server 1.0kB/s UDP to server
    Mar 16 15:35:56 fabian64 piaware[20989]: mlat-client(21036): Aircraft: 2 of 7 Mode S, 31 of 32 ADS-B used
    Mar 16 15:36:05 fabian64 piaware[20989]: 227954 msgs recv’d from dump1090-fa (1969 in last 5m); 227954 msgs sent to FlightAware
    Mar 16 15:41:05 fabian64 piaware[20989]: 230095 msgs recv’d from dump1090-fa (2141 in last 5m); 230095 msgs sent to FlightAware
    Mar 16 15:46:05 fabian64 piaware[20989]: 230870 msgs recv’d from dump1090-fa (775 in last 5m); 230870 msgs sent to FlightAware
    Mar 16 15:50:56 fabian64 piaware[20989]: mlat-client(21036): Receiver status: connected
    Mar 16 15:50:56 fabian64 piaware[20989]: mlat-client(21036): Server status: not synchronized with any nearby receivers
    Mar 16 15:50:56 fabian64 piaware[20989]: mlat-client(21036): Receiver: 87.2 msg/s received 44.1 msg/s processed (51%)
    Mar 16 15:50:56 fabian64 piaware[20989]: mlat-client(21036): Server: 0.0 kB/s from server 0.0kB/s TCP to server 0.6kB/s UDP to server
    Mar 16 15:50:56 fabian64 piaware[20989]: mlat-client(21036): Aircraft: 0 of 0 Mode S, 0 of 0 ADS-B used
    Mar 16 15:51:05 fabian64 piaware[20989]: 230870 msgs recv’d from dump1090-fa (0 in last 5m); 230870 msgs sent to FlightAware
    Mar 16 15:56:05 fabian64 piaware[20989]: 230870 msgs recv’d from dump1090-fa (0 in last 5m); 230870 msgs sent to FlightAware
    Mar 16 16:01:05 fabian64 piaware[20989]: 230870 msgs recv’d from dump1090-fa (0 in last 5m); 230870 msgs sent to FlightAware
    Mar 16 16:05:57 fabian64 piaware[20989]: mlat-client(21036): Receiver status: connected
    Mar 16 16:05:57 fabian64 piaware[20989]: mlat-client(21036): Server status: not synchronized with any nearby receivers
    Mar 16 16:05:57 fabian64 piaware[20989]: mlat-client(21036): Receiver: 0.0 msg/s received 0.0 msg/s processed (0%)
    Mar 16 16:05:57 fabian64 piaware[20989]: mlat-client(21036): Server: 0.0 kB/s from server 0.0kB/s TCP to server 0.0kB/s UDP to server
    Mar 16 16:05:57 fabian64 piaware[20989]: mlat-client(21036): Aircraft: 0 of 0 Mode S, 0 of 0 ADS-B used
    Mar 16 16:06:05 fabian64 piaware[20989]: 230870 msgs recv’d from dump1090-fa (0 in last 5m); 230870 msgs sent to FlightAware

  2. Force restart

    Mar 16 16:45:35 fabian64 piaware[20989]: no new messages received in 3736 seconds, it might just be that there haven’t been any aircraft nearby but I’m going to try to restart everything, just in case…
    Mar 16 16:45:36 fabian64 piaware[20989]: faup1090 exited with SIG SIGHUP
    Mar 16 16:45:37 fabian64 piaware[20989]: attempting to restart dump1090…
    Mar 16 16:45:37 fabian64 piaware[20989]: attempting to restart dump1090-fa using ‘systemctl --no-block try-restart dump1090-fa.service < /dev/null’…
    Mar 16 16:45:37 fabian64 piaware[20989]: dump1090 restart appears to have been successful
    Mar 16 16:45:47 fabian64 piaware[20989]: ADS-B data program ‘dump1090-fa’ is listening on port 30005, so far so good

  3. 232 msg/s immediately after
    Mar 16 16:50:57 fabian64 piaware[20989]: mlat-client(21036): Aircraft: 7 of 15 Mode S, 31 of 37 ADS-B used
    Mar 16 16:51:06 fabian64 piaware[20989]: 232617 msgs recv’d from dump1090-fa (1747 in last 5m); 232617 msgs sent to FlightAware
    Mar 16 16:56:06 fabian64 piaware[20989]: 234341 msgs recv’d from dump1090-fa (1724 in last 5m); 234341 msgs sent to FlightAware
    Mar 16 17:01:06 fabian64 piaware[20989]: 236094 msgs recv’d from dump1090-fa (1753 in last 5m); 236094 msgs sent to FlightAware
    Mar 16 17:05:57 fabian64 piaware[20989]: mlat-client(21036): Receiver status: connected
    Mar 16 17:05:57 fabian64 piaware[20989]: mlat-client(21036): Server status: clock unstable
    Mar 16 17:05:57 fabian64 piaware[20989]: mlat-client(21036): Receiver: 232.4 msg/s received 96.2 msg/s processed (41%)

The fact that messages immediately started pouring in again makes it seem less like a hardware issue. I'm even more puzzled. These dips haven't recurred so far.

Ok, but what do the dump1090 logs say?

Where do I find the dump1090 logs? (I always thought they were in piaware.log.) I searched by .log and by dump1090; nothing obvious turned up. (BTW, dump1090-fa is not getting messages right now.)

Debug commands · wiedehopf/adsb-wiki Wiki · GitHub

Look there for dump1090-fa, put the log on pastebin.com and link it here.


Thanks for the pointer, @wiedehopf! I am so used to services not sending any usable info to STDOUT that I forgot to check journald. The actual logs, though, show events similar to what piaware.log records.

  1. Server start
    Mar 15 20:04:30 fabian64 systemd[1]: Started dump1090 ADS-B receiver (FlightAware customization).
    Mar 15 20:04:30 fabian64 dump1090-fa[14865]: Mon Mar 15 20:04:30 2021 PDT dump1090-fa 5.0 starting up.
    Mar 15 20:04:31 fabian64 dump1090-fa[14865]: rtlsdr: using device #0: Generic RTL2832U (Realtek, RTL2832U, SN 00001000)
    Mar 15 20:04:31 fabian64 dump1090-fa[14865]: Found Rafael Micro R820T tuner
    Mar 15 20:04:31 fabian64 dump1090-fa[14865]: rtlsdr: tuner gain set to 43.4 dB
    Mar 15 20:04:31 fabian64 dump1090-fa[14865]: Allocating 4 zero-copy buffers
  2. Data from SDR stopped
    Mar 16 15:42:50 fabian64 dump1090-fa[14865]: Tue Mar 16 15:42:50 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 15:43:51 fabian64 dump1090-fa[14865]: Tue Mar 16 15:43:51 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 15:44:52 fabian64 dump1090-fa[14865]: Tue Mar 16 15:44:52 2021 PDT No data received from the SDR for a long time, it may have wedged
  3. piaware forced restart

    Mar 16 16:44:32 fabian64 dump1090-fa[14865]: Tue Mar 16 16:44:32 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 16:45:32 fabian64 dump1090-fa[14865]: Tue Mar 16 16:45:32 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 16:45:37 fabian64 dump1090-fa[14865]: Tue Mar 16 16:45:37 2021 PDT Caught SIGTERM, shutting down…
    Mar 16 16:45:37 fabian64 systemd[1]: Stopping dump1090 ADS-B receiver (FlightAware customization)…
    Mar 16 16:45:37 fabian64 dump1090-fa[14865]: Tue Mar 16 16:45:37 2021 PDT Waiting for receive thread termination
    Mar 16 16:47:07 fabian64 systemd[1]: dump1090-fa.service: State ‘stop-sigterm’ timed out. Killing.
    Mar 16 16:47:07 fabian64 systemd[1]: dump1090-fa.service: Killing process 14865 (dump1090-fa) with signal SIGKILL.
    Mar 16 16:47:07 fabian64 systemd[1]: dump1090-fa.service: Main process exited, code=killed, status=9/KILL
    Mar 16 16:47:07 fabian64 systemd[1]: dump1090-fa.service: Failed with result ‘timeout’.
    Mar 16 16:47:07 fabian64 systemd[1]: Stopped dump1090 ADS-B receiver (FlightAware customization).
    Mar 16 16:47:07 fabian64 systemd[1]: Started dump1090 ADS-B receiver (FlightAware customization).
    Mar 16 16:47:07 fabian64 dump1090-fa[9972]: Tue Mar 16 16:47:07 2021 PDT dump1090-fa 5.0 starting up.
    Mar 16 16:47:07 fabian64 dump1090-fa[9972]: rtlsdr: using device #0: Generic RTL2832U (Realtek, RTL2832U, SN 00001000)
    Mar 16 16:47:07 fabian64 dump1090-fa[9972]: Found Rafael Micro R820T tuner
    Mar 16 16:47:08 fabian64 dump1090-fa[9972]: rtlsdr: tuner gain set to 43.4 dB
    Mar 16 16:47:08 fabian64 dump1090-fa[9972]: Allocating 4 zero-copy buffers
  4. Data stream died again
    Mar 16 19:24:59 fabian64 dump1090-fa[9972]: Tue Mar 16 19:24:59 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 19:26:00 fabian64 dump1090-fa[9972]: Tue Mar 16 19:26:00 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 19:27:00 fabian64 dump1090-fa[9972]: Tue Mar 16 19:27:00 2021 PDT No data received from the SDR for a long time, it may have wedged
  5. piaware forced restart again

    Mar 16 20:23:38 fabian64 dump1090-fa[9972]: Tue Mar 16 20:23:38 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 20:24:39 fabian64 dump1090-fa[9972]: Tue Mar 16 20:24:39 2021 PDT No data received from the SDR for a long time, it may have wedged
    Mar 16 20:25:38 fabian64 dump1090-fa[9972]: Tue Mar 16 20:25:38 2021 PDT Caught SIGTERM, shutting down…
    Mar 16 20:25:38 fabian64 systemd[1]: Stopping dump1090 ADS-B receiver (FlightAware customization)…
    Mar 16 20:25:38 fabian64 dump1090-fa[9972]: Tue Mar 16 20:25:38 2021 PDT Waiting for receive thread termination
    Mar 16 20:27:08 fabian64 systemd[1]: dump1090-fa.service: State ‘stop-sigterm’ timed out. Killing.
    Mar 16 20:27:08 fabian64 systemd[1]: dump1090-fa.service: Killing process 9972 (dump1090-fa) with signal SIGKILL.
    Mar 16 20:27:08 fabian64 systemd[1]: dump1090-fa.service: Main process exited, code=killed, status=9/KILL
    Mar 16 20:27:08 fabian64 systemd[1]: dump1090-fa.service: Failed with result ‘timeout’.
    Mar 16 20:27:08 fabian64 systemd[1]: Stopped dump1090 ADS-B receiver (FlightAware customization).
    Mar 16 20:27:08 fabian64 systemd[1]: Started dump1090 ADS-B receiver (FlightAware customization).
    Mar 16 20:27:08 fabian64 dump1090-fa[11167]: Tue Mar 16 20:27:08 2021 PDT dump1090-fa 5.0 starting up.
    Mar 16 20:27:09 fabian64 dump1090-fa[11167]: rtlsdr: using device #0: Generic RTL2832U (Realtek, RTL2832U, SN 00001000)
    Mar 16 20:27:09 fabian64 dump1090-fa[11167]: Found Rafael Micro R820T tuner
    Mar 16 20:27:09 fabian64 dump1090-fa[11167]: rtlsdr: tuner gain set to 43.4 dB
    Mar 16 20:27:09 fabian64 dump1090-fa[11167]: Allocating 4 zero-copy buffers

Not sure what would have "wedged" the SDR. Is it the SDR itself?

This means that the receive thread is stuck, which “shouldn’t happen” even if the SDR hardware is misbehaving.

If you can catch it in the act and are comfortable with either capturing strace output (use -f, it’s threaded) or attaching gdb to the running dump1090-fa and getting stack traces of all threads (or both), then that would help in debugging where exactly it’s hanging.

A little clarification would help here. Do you suspect the receive thread is stuck from the beginning of the message drops, not just when piaware sends SIGTERM? Catching the few minutes between SIGTERM and SIGKILL would be hard. I can try to attach strace between seeing the message drop and piaware killing it. (A one-hour window is easier.)

Yes, basically at the point where the main thread starts complaining it hasn’t received any data for a while.

I disagree; there are some code paths that can lead to a hang pretty easily, I think:

Assume the SDR is misbehaving: then we have to assume there are no more callbacks coming in.
The only place cancel_async is called is in the callback.
A good place to call it from might be sdrClose().

Also if you haven’t gotten data for 10 minutes, why not exit?

                if (--watchdogCounter <= 0) {
                    log_with_timestamp("No data received from the SDR for a long time, it may have wedged");
                    watchdogCounter = 600;
                }  

That is a natural point to call rtlsdr_cancel_async and also break out of the while loop, isn't it?

Now this might be sufficient, but if you don't want to take any chances, this is not a good call to issue:

        pthread_join(Modes.reader_thread,NULL);     // Wait on reader thread exit

Because if you assume that the library is misbehaving, this might never return:

void rtlsdrRun()
{
    if (!RTLSDR.dev) {
        return;
    }

    rtlsdr_read_async(RTLSDR.dev, rtlsdrCallback, NULL,
                      /* MODES_RTL_BUFFERS */ 4,
                      MODES_RTL_BUF_SIZE);
}

Even calling rtlsdr_cancel_async(RTLSDR.dev) is dangerous: if you don't trust librtlsdr, you don't know whether the call ever returns.
The same goes for rtlsdr_close(RTLSDR.dev);
Thus one could start it in a pthread and then apply a similar timeout to its termination.

So … possible code:

void *sdrClose(void *arg) {
    MODES_NOTUSED(arg);
    current_handler()->cancel();
    current_handler()->close();
    pthread_exit(NULL);
}
            // avoid calling synchronous ... we don't trust those rtl-sdr lib calls to not hang
            pthread_create(&Modes.sdrCloseThread, NULL, sdrClose, NULL);
            // Wait on sdrClose
            int err = 0;
            int count = 100;
            while (count-- > 0 && (err = pthread_tryjoin_np(Modes.sdrCloseThread, NULL))) {
                msleep(100);
            }
            if (err) {
                log_with_timestamp("sdrClose timed out, will raise SIGKILL on exit!");
                Modes.exit = SIGKILL;
            }                    

If you don't trust the library, the only solution I see is: only use rtlsdr_* calls in a separate thread that you then wait on using pthread_tryjoin_np.
There is a timed tryjoin, but that had compile issues for me … it seems to be less well supported than pthread_tryjoin_np, which has so far worked for me.
That theoretically even includes the sdrOpen code.
If that is done rigorously, you can detect whether the SDR code hangs and in that case raise a SIGKILL instead of calling exit in your main thread.
Edit: I’ve cleaned up my code just a little bit so it’s more readable:
readsb/sdr.c at sid · wiedehopf/readsb · GitHub

Assuming all SDR library calls are suspect per se and might never return is a bit of a paranoid approach, but really I don't see an alternative :confused:

But that’s just my 2 cents as always, good luck fixing the issue.

I’m not aware of any if we assume that librtlsdr/libusb are behaving themselves.
This is why I want to see where it’s actually getting stuck - if there’s a bug in dump1090 itself, that’s a (relatively) easy fix; if the bug is in librtlsdr/libusb then I have to think more about how to fix it, because starting to build/distribute our own versions of that is a larger effort. (I’m debating that anyway for librtlsdr, since the upstream library is mostly unmaintained, but…)

It’s 60 seconds, not 600 (loops are 100ms) but yeah, that’s probably a reasonable thing to do.

If we’re assuming that the problem lies in librtlsdr et al, then the only sensible thing to do at that point is abort(). No point in trying to do cleanup if you have a thread that is not at a safe cancellation point, all bets are off.

edit: refreshing my memory of the librtlsdr code, I think the problem is that if the hardware is stalled, the reader thread is looping within rtlsdr_read_async() waiting for USB activity or cancellation, and no callbacks to the dump1090 code will ever happen, so the signal handler’s change to Modes.exit is not noticed and rtlsdr_cancel_async never gets called. I think we need a sdrStop() handler that for librtlsdr just calls rtlsdr_cancel_async - we can call that before waiting for the reader thread. (Can’t safely call sdrClose before the reader thread has terminated, in general).

Arguably librtlsdr should handle no data arriving for an extended period of time as an error and bail out … but it doesn’t currently do that so we have to build our own mechanisms.

Yeah, I need to adjust my code a bit, you're correct. Thanks for the hint!

I'm not saying pthread_tryjoin is pretty … it's just that I write code that's not pretty anyhow, so I don't see the downside :wink:
Calling rtlsdr_cancel_async() and seeing if it terminates properly is surely the right thing to do.
But as you see with this issue, we have high CPU usage, which usually means some code in librtlsdr is spinning; dump1090-fa doesn't just spin freely, it has sleeps.
Thus I'm assuming the code that calls the callback is somehow in an endless loop, and if that's the case it's not a leap that rtlsdr_cancel_async() will get stuck trying to cancel that loop.
And once I had the try_join in the code and thought about it a bit … I don't see a reason to trust any library when I can do it like this.

I’m not sure - there are two sets of graphs above and they behave differently; the more recent set shows a large drop in CPU when it’s stalled, which is what I’d expect if librtlsdr is waiting on USB activity that never arrives.


From anecdotal experience: almost always the SDR doesn't have enough voltage.
In rare cases the SDR is dying, but I would expect more errors in that case, so I'd lean towards voltage.

Low SDR voltage is usually caused by USB extension leads that drop too much voltage / have too-thin copper, or just a dying / insufficient RPi power supply.

Might still be interesting to get that strace though :slight_smile: