Clock unstable

My stations have been working fine (other than the recent kernel/MLAT issue on one of them, which I fixed a few days ago) up until about two days ago…

Suddenly, MLAT is not working on either of them. In the log, I see “Server status: clock unstable”. Does that indicate there is an issue on the FA side?

My clock synchronization seems fine (I’m only 0.2 ms off a highly accurate reference), per ntpq -p:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.101.1   50.116.38.157    3 u   38   64  377    1.967    0.219   0.209
+clock.team-cymr 204.9.54.119     2 u   60   64  377   42.236    8.194   0.496
+209-133-217-165 71.40.128.146    2 u   54   64  377   55.758    4.938   0.951
+time-c.nist.gov .NIST.           1 u   61   64  377   28.240    9.420   0.460

My CPU is nowhere near saturated:
11:16:31 up 19 min, 1 user, load average: 0.38, 0.37, 0.28

and I’m sending every message coming out of dump1090-fa:
May 20 11:07:32 flightradar2 piaware[573]: 4938 msgs recv’d from dump1090-fa (2354 in last 5m); 4938 msgs sent to FlightAware

Any ideas?

-Matt

I have had the same problem on some of my receivers since 18 May, about 19:00 CET: a flatline, no MLAT at all.

Today I changed the location to be more exact (it was just 300 m away from the real location), and it has been running for the last two hours.

So I’ll see tomorrow if this is the solution.

But it also failed on several other receivers that already had the exact location, and nothing has been changed on them in the last few weeks.

Sometimes the stats-page is reporting mlat as down, but in skyview the receiver gets mlat… Strange.

So maybe it’s an FA issue.

Mlat doesn’t care what you do with your system clock so ntp or not is irrelevant. It cares about dropped samples on the USB bus.

The stats page reports a snapshot of what the state was, updated every 5 minutes. The current state might have changed since the snapshot.

There were some large internal changes deployed to the mlat system over the last few days which seem to have made the system more sensitive to unstable clocks. From the cases I’ve looked at, the clocks are unstable; it’s mostly a case of the trigger levels being set a little differently to throw out more noise.

FWIW, “clock unstable” means more than 25% of receivers you synchronized with in the last 30 seconds had unstable synchronization - the main cause being multiple outliers more than about 4us out causing a step in the pairing.
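As a rough illustration of the rule just described, here is a minimal Python sketch. It is not the mlat server's code: the `PeerSync` record, its field names, and the `clock_unstable` function are hypothetical, and only the 25% and 30 second figures come from the post above.

```python
from dataclasses import dataclass

UNSTABLE_FRACTION = 0.25  # "more than 25% of receivers" (from the post)
WINDOW_SECONDS = 30.0     # "in the last 30 seconds" (from the post)


@dataclass
class PeerSync:
    """Hypothetical record of one receiver pairing."""
    peer_id: int
    last_sync_age_s: float  # seconds since we last synchronized with this peer
    unstable: bool          # True if the pairing was judged unstable


def clock_unstable(peers):
    """Return True if more than 25% of recently synced peers are unstable."""
    recent = [p for p in peers if p.last_sync_age_s <= WINDOW_SECONDS]
    if not recent:
        return False  # nothing to judge against
    unstable = sum(1 for p in recent if p.unstable)
    return unstable / len(recent) > UNSTABLE_FRACTION


peers = [
    PeerSync(13, 5.0, True),
    PeerSync(48, 12.0, True),
    PeerSync(59, 20.0, False),
    PeerSync(130, 29.0, False),
    PeerSync(146, 90.0, True),  # older than 30 s, so ignored
]
print(clock_unstable(peers))  # 2 of 4 recent peers unstable
```

Because the test is a fraction of the peers you actually synchronized with, a station that sees many peers is not automatically more likely to trip it.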

(edit: I found a bug that would make the mlat servers use more stringent timing requirements than they should be using; also looking into another bug which may be producing bad estimates of average error; so you may see some improvements over the next couple of days)

Thanks for this info. Is there anything we can do on our side in regards to the trigger levels, etc.?

Also, I’m not sure if I’m more susceptible to this issue, but these two stations are high-profile enough that they typically sync to ~123 or so other stations (mountain top)…

-Matt

Things seem happier with those bugfixes in place, let me know if you see further problems.

I’m facing the same issue (in Buenos Aires, Argentina): my stats page is reporting mlat as down, but I see no clock issue on my side.

Your receiver just seems to be unstable. Check USB connections and cables.



Synchronization stats for receiver fAT-b827...#327:
 PeerID Distance State    Frequency Error    LastReset Expires
 13      290.3km Stepped   -73.8ppm    226ns     16.7s -      
 48      298.7km Stepped   -76.2ppm      0ns      0.0s -      
 59       47.5km Stepped   -75.4ppm    126ns     18.8s -      
 130       8.4km Stepped   -74.3ppm     87ns     17.5s -      
 146     319.9km New       -74.8ppm      0ns      0.0s -      
 175      24.1km Stepped   -73.7ppm     64ns     18.8s -      
 177     228.9km New       -73.2ppm     97ns      0.5s -      
 182      10.1km Stepped   -87.9ppm     69ns     20.3s -      
 206     225.1km Outliers  -74.0ppm   3672ns     32.6s    1.0s
 210     259.9km Stepped   -74.0ppm    208ns     17.4s -      
 249      19.5km Stepped   -74.5ppm    127ns      8.0s -      
 286     218.1km Stepped   -73.7ppm      0ns      0.0s -      
 292     221.4km Stepped   -73.7ppm      0ns      0.0s -      
Pairings: 13  Usable: 6  Unstable: 5
Inhibit: FULL  Instability: 43


(all peers moving to Stepped at the same time is a strong indication that your receiver just dropped some data)
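To check a dump like the one above for this pattern, a small Python sketch can count the State column. The parsing assumes the whitespace-separated layout shown in this thread, and the 70% threshold in `looks_like_local_drop` is an arbitrary illustrative value, not something the mlat server uses.

```python
from collections import Counter

SAMPLE = """\
 PeerID Distance State    Frequency Error    LastReset Expires
 13      290.3km Stepped   -73.8ppm    226ns     16.7s -
 48      298.7km Stepped   -76.2ppm      0ns      0.0s -
 59       47.5km Stepped   -75.4ppm    126ns     18.8s -
 130       8.4km Stepped   -74.3ppm     87ns     17.5s -
 146     319.9km New       -74.8ppm      0ns      0.0s -
 175      24.1km Stepped   -73.7ppm     64ns     18.8s -
 177     228.9km New       -73.2ppm     97ns      0.5s -
 182      10.1km Stepped   -87.9ppm     69ns     20.3s -
 206     225.1km Outliers  -74.0ppm   3672ns     32.6s    1.0s
 210     259.9km Stepped   -74.0ppm    208ns     17.4s -
 249      19.5km Stepped   -74.5ppm    127ns      8.0s -
 286     218.1km Stepped   -73.7ppm      0ns      0.0s -
 292     221.4km Stepped   -73.7ppm      0ns      0.0s -
Pairings: 13  Usable: 6  Unstable: 5
"""


def state_counts(table_text):
    """Count peers per State value in a synchronization stats dump."""
    counts = Counter()
    for line in table_text.splitlines():
        fields = line.split()
        # Peer rows start with a numeric PeerID; State is the third column.
        if len(fields) >= 3 and fields[0].isdigit():
            counts[fields[2]] += 1
    return counts


def looks_like_local_drop(counts, threshold=0.7):
    """Heuristic (assumed threshold): most peers Stepped at once."""
    total = sum(counts.values())
    return total > 0 and counts["Stepped"] / total >= threshold


counts = state_counts(SAMPLE)
print(dict(counts))
print(looks_like_local_drop(counts))
```

For this dump, 10 of the 13 peers are in the Stepped state, which is what points at the receiver itself rather than at any single peer.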

Things seem much happier now at my two stations. I will continue to monitor. Thanks very much for the bug fixes.

-Matt

It’s working again without any outages.

Thanks a lot!

This is still an issue here.
A restart of my station gets it in sync, but MLAT drops after an hour or less.

RPi2 with FA Pro stick, external filter,
official Raspberry PS, max_usb_current=1 set, power LED steady on,
setup stable for a year, MLAT was in sync with 125± neighbours.
Since noon on 19 May it drops out of sync with “clock unstable”.
It appears that my station sometimes regains sync when the traffic
decreases during night times.

Swapped the FA Pro stick with an FA Pro+ stick
(that was planned anyway to minimize connector loss),
refreshed the dump1090-mutability (not -fa) to current version,
reduced gain so that dump1090 keeps below 1000 msgs/s,
overclocked the Pi2 (44°C now, dump at 25% CPU, FA tasks at 6% CPU),
rtl_test sampling at 2.048 MS/s doesn’t report any drops over 30 mins of testing.

Last thing to do I see on my side is upgrading the Pi2 to Pi3, this will
happen tomorrow - I’m running out of options then.

I’m wondering why my station, during the short periods when MLAT is up,
now reports synchronization with 300+ stations from one day to the next,
way more than before and imho. way more chances to fail the sync with.

Are you running kernel 4.9 on the Pi 2? There have been reports that 4.9 causes USB problems leading to bad mlat sync on the Pi 2.

Try rtl_test at 2.4 MS/s, as that is what dump1090-fa runs at.

Indeed, kernel 4.9.24 here. rtl_test is running at 2.4 MS/s right now; no losses so far, but it has just started…
How’s the situation with the Pi3, is it also affected by this?

I have only heard of problems with Pi 2s. Doing some testing myself is on my todo list…

See post206538.html#p206538 for some more details.

Thanks, that doesn’t read too bad.
rtl_test at 2.4 MS/s reported one loss of 104 bytes within the last 20 mins. Not that much, but definitely a loss.
I can’t estimate if this is enough to drop the sync, but ok, the Pi3 is on its way. I’ll see how it behaves here.

With the caveat that rtl_test doesn’t behave identically to dump1090-fa …

104 bytes is about 22us; this is large enough to make you lose synchronization with all peers when it happens.
At worst, that will inhibit mlat for 2 minutes.
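The 22us figure can be reproduced with a short Python calculation. This is a sketch that assumes the usual rtl-sdr sample format of one 8-bit I byte and one 8-bit Q byte per sample, and the 2.4 MS/s rate mentioned earlier in the thread; the function name is made up for illustration.

```python
BYTES_PER_SAMPLE = 2  # assumed: one 8-bit I byte plus one 8-bit Q byte


def dropped_time_us(dropped_bytes, sample_rate_hz):
    """Convert a count of dropped bytes into the time gap it represents."""
    samples = dropped_bytes / BYTES_PER_SAMPLE
    return samples / sample_rate_hz * 1e6


gap_us = dropped_time_us(104, 2_400_000)
print(f"{gap_us:.1f} us")  # prints "21.7 us"
```

That is 52 samples at 2.4 MS/s, several times the roughly 4us outlier level mentioned earlier in the thread.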

Following the hints given in the thread you pointed out above, I downgraded the Pi2 to kernel 4.4
for the time being until I have the Pi3 here. That appears to be stable, even with the higher RF gain
I used to have configured back in place. So you may add +1 to the list of reported issues with Pi2, USB and kernel 4.9.
Thanks very much, best regards!

The internal changes mean we can run much larger mlat regions without running into load problems.

and imho. way more chances to fail the sync with.

The process that decides who is unstable already takes that into account.

Thank you very much, my Pi2 was running kernel 4.9 as well. It has been 12 hours since I downgraded the kernel to 4.5, and the clock unstable message seems to have gone from the logs.

I did the swap from Pi2 to Pi3 (and learnt how to spoof the eth0 MAC address).
Pi2 MLAT stable with kernel 4.4.50 but not 4.9.xx
Pi3 MLAT stable with kernel 4.9.29, too.

I’m not sure if this is on topic, but about the same time posters report they started getting loss of sync, I saw an apparent drop of MLAT position reports to less than 50% of before the change. I checked the stats of other users in my area and saw much the same change. Since I was running on a Raspberry Pi 2, I tried replacing it with a Raspberry Pi 3, but the percentage of MLAT positions processed remained about the same. Here are some recent log lines:

[2017-05-26 10:34 PDT] mlat-client(2157): Receiver status: connected
[2017-05-26 10:34 PDT] mlat-client(2157): Server status: synchronized with 126 nearby receivers
[2017-05-26 10:34 PDT] mlat-client(2157): Receiver: 205.9 msg/s received 85.9 msg/s processed (42%)
[2017-05-26 10:34 PDT] mlat-client(2157): Server: 0.2 kB/s from server 0.0kB/s TCP to server 1.1kB/s UDP to server
[2017-05-26 10:34 PDT] mlat-client(2157): Results: 81.1 positions/minute
[2017-05-26 10:34 PDT] mlat-client(2157): Aircraft: 18 of 34 Mode S, 23 of 29 ADS-B used
[2017-05-26 10:34 PDT] 55322 msgs recv’d from dump1090-fa (1164 in last 5m); 55322 msgs sent to FlightAware
[2017-05-26 10:39 PDT] 56724 msgs recv’d from dump1090-fa (1402 in last 5m); 56724 msgs sent to FlightAware
[2017-05-26 10:44 PDT] 58078 msgs recv’d from dump1090-fa (1354 in last 5m); 58078 msgs sent to FlightAware
[2017-05-26 10:49 PDT] mlat-client(2157): Receiver status: connected
[2017-05-26 10:49 PDT] mlat-client(2157): Server status: synchronized with 109 nearby receivers
[2017-05-26 10:49 PDT] mlat-client(2157): Receiver: 207.1 msg/s received 93.8 msg/s processed (45%)
[2017-05-26 10:49 PDT] mlat-client(2157): Server: 0.2 kB/s from server 0.0kB/s TCP to server 1.1kB/s UDP to server
[2017-05-26 10:49 PDT] mlat-client(2157): Results: 81.0 positions/minute
[2017-05-26 10:49 PDT] mlat-client(2157): Aircraft: 18 of 31 Mode S, 24 of 31 ADS-B used
[2017-05-26 10:49 PDT] 59383 msgs recv’d from dump1090-fa (1305 in last 5m); 59383 msgs sent to FlightAware
[2017-05-26 10:54 PDT] 60843 msgs recv’d from dump1090-fa (1460 in last 5m); 60843 msgs sent to FlightAware

Considering that one remark I read was that MLAT could increase with the recent changes, I have to wonder why I am seeing the opposite effect. Does anyone have any ideas? Should I start a new thread? Thanks!
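For anyone who wants to compare the processed share before and after a change, a small Python sketch can pull the numbers out of the mlat-client "Receiver:" lines. The regular expression is written against the log format shown in this post only, and the function name is hypothetical.

```python
import re

# Matches lines such as:
#   mlat-client(2157): Receiver: 205.9 msg/s received 85.9 msg/s processed (42%)
RECEIVER_RE = re.compile(
    r"Receiver:\s+([\d.]+) msg/s received\s+([\d.]+) msg/s processed \((\d+)%\)"
)


def processed_ratio(line):
    """Return processed/received as a fraction, or None if the line does not match."""
    m = RECEIVER_RE.search(line)
    if m is None:
        return None
    received = float(m.group(1))
    processed = float(m.group(2))
    return processed / received if received else None


log_lines = [
    "[2017-05-26 10:34 PDT] mlat-client(2157): Receiver status: connected",
    "[2017-05-26 10:34 PDT] mlat-client(2157): Receiver: 205.9 msg/s received 85.9 msg/s processed (42%)",
    "[2017-05-26 10:49 PDT] mlat-client(2157): Receiver: 207.1 msg/s received 93.8 msg/s processed (45%)",
]
ratios = [r for r in map(processed_ratio, log_lines) if r is not None]
print([round(r * 100) for r in ratios])  # prints [42, 45]
```

Collecting these values over several days would show whether the processed share really changed at the time of the server update.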