by Lanigiro » 11 Feb 2014 19:00
A few more thoughts on the "bistable" hypothesis.
First, the graph looks an awful lot like the behavior of a bistable system nudged out of an attractor state. It wanders chaotically for a short time and then falls into an attractor again, in this instance the other attractor.
Second, one of the likelier ways for such a bug to exist is a race condition that can wedge or crash Shareaza if an inbound hub-to-hub connection enters a particular phase of the handshake while another inbound connection is still in that phase. Adding the hub to the global data structure representing the totality of established hub connections is the likely place where two threads handling separate network connections might step on each other's toes in this case.
I'd go over the hub-to-hub connection establishment code with a fine-toothed comb looking for a place where there's unsynchronized access to a shared data structure -- the established-connections list is a likely culprit, as is the host cache (newly connected hubs propagate info about the hubs they know of, prompting updates). If a bug in that area is found, pushing out a 2.7.2.0 with the bug fixed could (eventually) fix this.
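To be concrete about the kind of bug I mean, here's a minimal sketch of the fix. This is not Shareaza's actual code; the names (HubConnection, g_lstHubs, g_csHubs, OnHubHandshakeComplete) are made up for illustration:

[code]
#include <list>
#include <mutex>
#include <string>

// Hypothetical stand-ins for Shareaza's real structures.
struct HubConnection
{
    std::string sAddress;   // remote hub address
    bool bInbound;          // true if the remote side connected to us
};

// The shared "totality of established hub connections".
static std::list<HubConnection> g_lstHubs;
// One lock guarding every read and write of g_lstHubs.
static std::mutex g_csHubs;

// Called on whatever thread happens to be servicing the socket when the
// hub-to-hub handshake reaches the "accepted" phase.  If two inbound
// handshakes hit this point at once and the lock below is missing, the
// list's internals can be corrupted -- exactly the sort of thing that
// wedges or crashes the node.
void OnHubHandshakeComplete(const HubConnection& oHub)
{
    std::lock_guard<std::mutex> oLock(g_csHubs);  // the fix: never touch the list unlocked
    g_lstHubs.push_back(oHub);
    // ...update hub counts, notify the host cache, etc., either under the
    // same lock or from a copy taken while holding it.
}
[/code]

The point is simply that every code path that touches the established-connections list, including the handshake-completion path, has to hold the same lock.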
The other thing I'd look at is the system for promoting leaves to hubs -- or demoting hubs to leaves. I noticed that the onset of the problem happened while the leaf count was at its daily low, or very close to that time, and that the previous daily low had been the lowest in a while. I'm wondering if a big enough daily swing in the number of leaves can trigger a destructive oscillation: a very low number of leaves per hub causes a lot of hubs to demote themselves to leaf, then all the leaves losing those hub connections try to make new connections, which triggers the remaining hubs to do whatever they do to try to elect some leaves to hubs, and the network gets stuck bouncing between "too many hubs so demote some" and "too few hubs so promote some leaves" without ever stabilizing at a reasonable number of hubs. That would explain why the hub count now surges and dips so much more than the nearly level hub count there used to be. The smoking gun for this scenario, an oscillation of hub and leaf counts on a timescale of minutes, isn't visible at the crawler site, but that may just be because the crawler's temporal resolution is too coarse to see it.
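To make the feedback loop concrete, here's a toy back-of-the-envelope model. Every number in it is invented and has nothing to do with Shareaza's real capacities or thresholds; the point it illustrates is that if each corrective step (mass promotion or mass demotion) overshoots the band of hub counts that's actually stable, the hub count never settles. With the numbers below it just bounces between 30 and 60 forever; widen the band or soften the response and it settles:

[code]
#include <cstdio>

int main()
{
    // All numbers here are invented for illustration -- they are NOT
    // Shareaza's real capacities or thresholds.
    const int nNodes    = 10000; // total nodes on the network (fixed for simplicity)
    const int nCapacity = 300;   // leaf slots per hub: above this, leaves get rejected
    const int nLowWater = 200;   // below this many leaves, a hub demotes itself

    int nHubs = 30;

    for (int nTick = 0; nTick < 20; ++nTick)
    {
        const int nLeaves       = nNodes - nHubs;
        const int nLeavesPerHub = nLeaves / nHubs;

        std::printf("tick %2d: hubs %3d, leaves/hub %3d\n",
                    nTick, nHubs, nLeavesPerHub);

        if (nLeavesPerHub > nCapacity)
            nHubs *= 2;          // model: every full hub pushes one leaf into hub mode
        else if (nLeavesPerHub < nLowWater)
            nHubs /= 2;          // model: about half the underused hubs give up at once

        if (nHubs < 1) nHubs = 1;
    }
    return 0;
}
[/code]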
The fix in this case has an obvious short-term solution as well as an easy long-term one. The short-term fix is just to get enough people to nail their Shareazas into either hub-only or leaf-only behavior rather than letting them be promoted/demoted -- preferably without ending up with too few hubs. The long-term fix is to push out an update designed to damp out oscillations caused by hub promotion/demotion.

One way to do that is to make promotion require a chronically high rate of "rejected because leaf slots are full" leaf rejections per minute, sustained over a longer span of time, before a hub does whatever it does to get one or more of its leaves to become hubs (or however exactly promotion works). The hub count would then climb more slowly and might level off rather than overshoot and start swinging wildly, even though the oscillations in leaf count are apparently wider than in the past for some reason, or at least reach lower low points.

Another option is to give hub-to-leaf demotion a refractory period: unless the node has been a hub for more than X hours, it won't demote itself to leaf. Randomizing X at the time of promotion would be a good idea too; otherwise the refractory period just broadens the peaks of hub count before it crashes and rebounds again, slowing but not halting the oscillations. Randomizing X makes the hub count drop more slowly from its peak, and hopefully level off instead of undershooting. If there's already a refractory period programmed in, randomizing it or broadening the distribution would help. Each time a leaf becomes a hub, it should roll a random number between, say, 1 and 6 and be incapable of being demoted to leaf (other than by explicit user action) until that many hours after it became a hub, even if it has a chronically low leaf count the whole while. The effect is that if there really are excessive hubs, the ones that rolled a 1 will drop off, and if that leaves an adequate but not excessive number of hubs, the ones that rolled a 2 will have enough leaves by the 2-hour mark not to demote, and the oscillation is halted. (Making it minutes, random between 60 and 360, would be even better, as the hubs would then start dropping off gradually after the 1-hour mark until an optimum number was reached.)
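Here's roughly what the refractory-period half of that would look like. None of these names exist in Shareaza; this is just the shape of the idea, using the 60-360 minute range from above:

[code]
#include <chrono>
#include <random>

// A sketch of the randomized refractory period described above.
class HubModeGuard
{
public:
    // Call when a leaf promotes itself to hub: roll a random refractory
    // period of 60..360 minutes during which self-demotion is forbidden.
    void OnPromotedToHub()
    {
        std::uniform_int_distribution<int> oDist(60, 360);
        m_tEarliestDemote = std::chrono::steady_clock::now()
                          + std::chrono::minutes(oDist(m_oRng));
    }

    // The demotion logic asks this before dropping back to leaf mode for
    // "too few leaves".  Explicit user action would bypass the check.
    bool CanSelfDemote() const
    {
        return std::chrono::steady_clock::now() >= m_tEarliestDemote;
    }

private:
    std::mt19937 m_oRng{ std::random_device{}() };
    std::chrono::steady_clock::time_point m_tEarliestDemote{};
};
[/code]

The promotion side (requiring a sustained rejection rate before promoting) would be a similar guard, keyed on a rolling rejection counter rather than a timer.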
There is a way to test each of the above bistability hypotheses.
Race condition: actually examine the code to see if the stuff involved in establishing hub connections touches a shared data structure without holding the mutex for that structure. Fix any instance where it does and deploy 2.7.2.0. Then wait a while to see if the problem goes away once 2.7.2.0 has broad enough uptake. If so, problem diagnosed and solved.
Deadlock and livelock: related to the above, the problem might instead be that the threading in Shareaza is prone to deadlock (which wedges the hub) or livelock (which temporarily jams at least some of its operations) while trying to acquire mutexes. That would happen if handling new inbound hub-to-hub connections acquires two particular mutexes and holds them simultaneously, while some other operation (maybe outbound hub-to-hub connection establishment) acquires the same mutexes in the reverse order and also holds them simultaneously. You get livelock instead of deadlock if there's some slow timeout or other mechanism that eventually gives up trying to get a mutex and aborts the associated operation. The fix is essentially the same: fix the bug (this time by making both code paths acquire the relevant mutexes in the same order) and push out 2.7.2.0, then see if the problem goes away.
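For illustration, this is the standard shape of that fix. The mutex names are hypothetical stand-ins for whatever two locks those code paths actually share (e.g. the connection list and the host cache):

[code]
#include <mutex>

// Hypothetical stand-ins for whatever two locks the inbound and outbound
// hub-connection paths both need.
static std::mutex g_csConnections;
static std::mutex g_csHostCache;

// Deadlock-prone pattern: one path locks A then B, the other B then A.
// The fix is to make every path take them in one agreed order -- or, in
// C++11 and later, let std::lock acquire both without deadlock:
void OnInboundHubAccepted()
{
    std::lock(g_csConnections, g_csHostCache);
    std::lock_guard<std::mutex> oLock1(g_csConnections, std::adopt_lock);
    std::lock_guard<std::mutex> oLock2(g_csHostCache,   std::adopt_lock);
    // ...add the hub to the connection list and update the host cache...
}

void OnOutboundHubEstablished()
{
    // Same order / same std::lock call here, so the two paths can never
    // end up each holding one mutex while waiting on the other.
    std::lock(g_csConnections, g_csHostCache);
    std::lock_guard<std::mutex> oLock1(g_csConnections, std::adopt_lock);
    std::lock_guard<std::mutex> oLock2(g_csHostCache,   std::adopt_lock);
    // ...record the new outbound hub connection...
}
[/code]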
Of course, it's possible that threading bugs like the above exist but aren't actually the cause of our present difficulties. In that case, the bugs will be found but fixing them won't fix G2. But it won't be a total waste, as bugs ought to be found and fixed regardless, and Shareaza and G2 will be more robust going forward.
Oscillatory hub election: the test for this is easy enough -- run some Shareazas (at least two) in "either hub or leaf" mode and see if some or all of them keep switching back and forth frequently and in something close to synchrony: say, most of the leaves become hubs within a few minutes, then after half an hour or so of little change most of the hubs become leaves within only a few minutes, then after another longer period of little change most of the leaves become hubs again, and so on.
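If it helps, something like this could chew through a hand-kept log of the test nodes' mode changes and show whether the flips cluster. The log format here is invented (one line per mode change); you'd produce it however is convenient:

[code]
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Reads lines of the form "<minutes-since-start> <node-id> <H-or-L>",
// e.g. "127 node3 H", buckets them into 10-minute windows, and prints how
// many distinct nodes changed mode in each window.  Near-synchronous flips
// of most nodes would show up as a few windows with high counts separated
// by quiet stretches.
int main()
{
    std::map<int, std::map<std::string, char>> mWindows;  // window -> node -> last mode seen

    std::string sLine;
    while (std::getline(std::cin, sLine))
    {
        std::istringstream oIn(sLine);
        int nMinute; std::string sNode; char cMode;
        if (oIn >> nMinute >> sNode >> cMode)
            mWindows[nMinute / 10][sNode] = cMode;
    }

    for (const auto& oWindow : mWindows)
        std::cout << "minutes " << oWindow.first * 10 << "-" << oWindow.first * 10 + 9
                  << ": " << oWindow.second.size() << " node(s) changed mode\n";
    return 0;
}
[/code]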
The fix in that case is threefold: nail the test machines into hub mode to help stabilize the network; alert everyone here, and wherever else you can, to the situation, and advise people with good long-lived broadband connections to go into hub-only mode until 2.7.2.0 is released and they've upgraded; and then build 2.7.2.0 with something like the random-duration-set-on-promotion hub-to-leaf demotion refractory period described above, test it for regressions, and release it.
If none of the above apply, then we're back to square one on this, though in that case some sort of DoS attack or network poisoning is the likeliest explanation.