2.7.7.0 outstanding bugs

Postby Lanigiro » 20 Sep 2014 22:48

The new version fixed some issues, and seems to use a bit less CPU and start up a bit faster, but it still has some showstopping bugs.

1. The hang that sometimes occurs while hashing files, with zero CPU use and 128B of I/O every tick, is still present. This is egregious. It seems limited to when there are active search tabs, and is worse when some of those have large numbers of results (> 100). It also seems particularly prone to happen when a download completes at the same time a bunch of other files are hashing, though it's not exclusive to such occasions. I'd tend to suspect that two or more threads are trying to commit transactions to the library db3 or some similar thing, resulting in a loop of retries -- except that at least one pending transaction should succeed rather than roll back each time, and instead it seems to get stuck in a "livelock" with nearly every attempt rolling back for up to TEN MINUTES before one goes through. This suggests the database you're using is badly designed. This bug REALLY NEEDS TO BE FIXED, LIKE, IMMEDIATELY. If the DB has problems with concurrent transactions touching the same tables, then the short-term fix is to serialize all DB writes (through one thread, or guarded by one mutex -- see the sketch below), which will make it slower but non-hangy, and the long-term fix is to transition to a better lightweight DB library.

2. The incompatibilities/impedance mismatches with GTK-Gnutella seem much improved, but are not completely gone. I am still seeing the occasional GTKG query hit with the following constellation of incorrect behaviors:

- Does not add idempotently to the downloads tab; instead it can be added multiple times, resulting in multiple copies of the same file with the same hash from the same source

- Does not search correctly when below Downloads.MaxSources, staying "pending" when it should be "searching"

- Has a reduced likelihood of downloading compared to a random file

- Have not verified whether this still happens, but if downloaded, the file is regarded as in the library for the purposes of the "you already have this file" prompt when selecting it to download from search results, but is regarded as NOT in the library by the "files you have already" filter

GTKG query hits exhibit the above anomalous behavior far less commonly but still occasionally, and when a query hit is affected, it is consistent -- the same query hit will continue to behave that way if cleared and added back.



Bug 1 here is a severe showstopper. Anything that causes minutes-long hangs of a program is unacceptable, period. This should be considered a blocker for the next non-beta version. I don't know why it wasn't considered a blocker for this version. It's been reported for years now.
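
To be concrete about the short-term fix for bug 1: funnel every library-DB write through a single mutex. A minimal sketch, assuming nothing about the actual code -- the class and method names here are mine, not Shareaza's:

[code]
// Sketch only: serialize all library-DB writes through one mutex.
// "SerializedLibraryDB" and "Write" are illustrative names, not
// actual Shareaza code.
#include <mutex>

class SerializedLibraryDB
{
public:
    // Run a write transaction while holding the single write mutex.
    // One writer at a time: slower, but no retry storms, no hangs.
    template <typename WriteOp>
    bool Write( WriteOp op )
    {
        std::lock_guard<std::mutex> guard( m_writeMutex );
        return op();
    }

private:
    std::mutex m_writeMutex;
};
[/code]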

Bug 2 is not a showstopper per se, but it is still nasty, as I suspect it puts some subset of files out of reach, making them undownloadable. I'm not 100% sure that a fixed subset of the affected query hits simply will never download successfully until the bug is fixed, nor 100% sure that the same subset of files is affected from one search to another, but if both are the case, then there is a subset of files on GTKG sources that cannot be downloaded from Shareaza until this is fixed, and that's bad.
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 21 Sep 2014 06:55

Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Since my previous post was ignored...

Postby Lanigiro » 21 Sep 2014 09:39

I need to point out that the bug causing the hangs when hashing files REALLY MUST BE FIXED IMMEDIATELY. Several design flaws in Shareaza converge to make it an especially enormous problem.

1. Files left in the download directory are vulnerable, both to seemingly random, rare disappearance if "clear completed downloads" is performed, and to being overwritten by newly downloaded files with the same name. The latter bug was introduced many years ago, has been complained about repeatedly, and is still not fixed, by the way. Can the overwrite behavior not at least be made an option?

2. Moving files *out* of the download directory to elsewhere in the library is needed to protect them from item 1 above, but (when there are active searches) is frequently "punished" by these lengthy and completely unnecessary hangs.

3. The hangs in turn seem to be "punished" by Shareaza screwing up its G2 connections, which STILL(!!!) need MANUAL NURSEMAIDING to reestablish, for NO GOOD REASON WHATSOEVER. This, by the way, has ALSO been reported repeatedly for years here, and nobody can apparently be arsed to do a f@$!ing thing about it.

4. Of course, one must periodically "clear completed downloads", clear out everything not remotely queued or downloading, and re-add everything to get files to download, because otherwise they get stuck counting down from sometimes very high numbers (an hour, sometimes much more) if a source is only intermittently available. One also must recover from overwrites by clearing completed downloads and re-adding the affected files -- thus invoking point 1 again, given the small chance "clear completed downloads" has of actually deleting the download (for, I might add again, no good reason whatsoever). (I haven't seen this sporadic deletion occur recently -- indeed, not since 2.7.5.0 came out -- but there was nothing about this bug being fixed in any commit descriptions or changelogs, so it's presumably still ticking like a time bomb somewhere in the code. Of course, I take such pains to avoid such deletions that it's hardly surprising I haven't seen any in a long time even if the bug remains present.)

5. Clearing clutter in both the download tab and the download directory is desirable anyways. Or would be if Shareaza weren't inexplicably programmed to subject you to harsh penalties for trying.

I grow weary of periodically posting (generally after each new version comes out) to remind you all of the bugs you keep forgetting to fix. Fix the damn things already, will you?

I want to see ALL of the following in the next version:

A) The hanging behavior is GONE GONE GONE GONE GONE. Never to be seen again. There is no excuse for it. None whatsoever.

B) G2 connects without manual nursemaiding. I've already told you EXACTLY how to f*#!ing fix this, down to the line of code. That you still haven't is also inexcusable, given that all you have to do is add one line at the start of one function -- the function for adding one IP to the G2 host cache -- along the lines of if (ChineseLocationSet.contains(getCountryCode(peer.ip))) return;. Just add that one line (and define ChineseLocationSet as CN, TW, HK if that isn't already done somewhere) and commit the change, and those Great Firewalled IPs will never make it into the G2 host cache to pollute it again, and that problem is solved. WHY HAVEN'T YOU DONE THIS ALREADY? IT WOULD TAKE LITERALLY TWO MINUTES OF YOUR TIME. (A fuller sketch follows after this list.)

C) GTK-Gnutella and Shareaza get along without the "wonky source" issue, where some small fraction of GTKG-originating query hits act broken in Shareaza's interface. (Oddly, some hits from a source can be affected while others from the same source aren't.) I don't know what's causing this, but there shouldn't be any incompatibilities between G2 clients, yet here there clearly is one. It's probably related to whatever was causing all those hub crashes early this year.

D) Downloads either never overwrite files of the same name in the download directory, period, or else there is an option in Tools that toggles between overwrite and rename behavior -- defaulting to rename for clean installs by new users, of course. (Rename-on-collision is also sketched after this list.)

E) Under no circumstances does "clear completed downloads" delete anything. (If this was already fixed, you neglected to actually note it in any changelog or commit description, and a formal acknowledgement of the swatting of this bug is still needed so users can breathe easier.)

F) It's still too quick to give up on busy push sources and forget them. Perhaps it should give up only after three two-minute-backed-off attempts to contact them, and only if there wasn't a successful contact with the same source to download a different file. Right now, if you want 10 files on a push source, you'll often get three while the others revert to "no sources" and need the source re-added to get another three, and again for another three, and again for the last one. This manual nursemaiding should not be necessary. I suggest that successful downloads suppress forgetting that source for the others yet to download, so it keeps trying them all until it has been six minutes and three tries per remaining file since it successfully downloaded ANY of the files from that source (sketched after this list). (It might also be smart to not even try to get more than three files at a time from any single source, waiting until one of the first three has either downloaded or failed before trying a fourth on the same source. The latter seems like it would be fairly simple and fast to implement, since most of the relevant code must already be there to handle ED2K A4AF behavior.)
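
To be equally concrete about B, D and F, here are minimal sketches. Every class, function and constant name below is illustrative, not an actual Shareaza identifier; the real code would use its own types.

For B, the one-line host-cache filter:

[code]
// Sketch for B: refuse to add Great-Firewalled IPs to the G2 host
// cache.  GetCountryCode() stands in for whatever GeoIP lookup
// Shareaza already has; AddToG2HostCache() for the existing path.
#include <set>
#include <string>

std::string GetCountryCode( const std::string& ip );       // assumed GeoIP lookup
void AddToG2HostCache( const std::string& ip, int port );  // assumed existing path

static const std::set<std::string> ChineseLocationSet = { "CN", "TW", "HK" };

void OnG2HostDiscovered( const std::string& ip, int port )
{
    if ( ChineseLocationSet.count( GetCountryCode( ip ) ) )
        return;                        // the IP never reaches the cache
    AddToG2HostCache( ip, port );
}
[/code]

For D, rename-on-collision:

[code]
// Sketch for D: pick a free name ("file (2).ext", "file (3).ext", ...)
// instead of overwriting an existing file in the download directory.
#include <filesystem>
#include <string>
namespace fs = std::filesystem;

fs::path UniqueTargetName( const fs::path& wanted )
{
    if ( ! fs::exists( wanted ) )
        return wanted;
    const std::string stem = wanted.stem().string();
    const std::string ext  = wanted.extension().string();
    for ( int n = 2; ; ++n )
    {
        fs::path candidate = wanted.parent_path()
                           / ( stem + " (" + std::to_string( n ) + ")" + ext );
        if ( ! fs::exists( candidate ) )
            return candidate;
    }
}
[/code]

For F, the forget decision:

[code]
// Sketch for F: only forget a push source when its retry budget is
// spent AND no file has downloaded from it within the last six minutes.
#include <ctime>

struct PushSourceState
{
    time_t lastSuccess;   // last completed download from this source
    int    failures;      // consecutive failed contact attempts
};

bool ShouldForgetSource( const PushSourceState& s, time_t now )
{
    const int    maxFailures   = 3;        // three two-minute-backed-off tries
    const time_t successWindow = 6 * 60;   // six minutes, per the proposal

    if ( now - s.lastSuccess < successWindow )
        return false;     // a sibling file just succeeded; keep the source
    return s.failures >= maxFailures;
}
[/code]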
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: Since my previous post was ignored...

Postby raspopov » 21 Sep 2014 09:52

raspopov
Project Admin
 
Posts: 945
Joined: 13 Jun 2009 12:30

Re: Since my previous post was ignored...

Postby Lanigiro » 21 Sep 2014 10:03

Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby raspopov » 21 Sep 2014 11:00

Does manual file rebuilding (via the Rebuild menu item) cause the same behavior?
raspopov
Project Admin
 
Posts: 945
Joined: 13 Jun 2009 12:30

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 21 Sep 2014 11:06

Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 23 Sep 2014 02:56

The hanging bug is continuing to bother me, and I don't get the feeling that anyone is urgently doing anything about it. Status update please...
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Hashing hang [new info]

Postby Lanigiro » 23 Sep 2014 06:58

I have a key piece of new information about the hashing hang. Not only does it depend on there being active searches, I suspect it also requires that a new query hit arrive at the same time it's hashing a file -- or more exactly, right after it's hashed the file, while it's updating the library database and/or the "files you have already" filter.

The evidence is this: under circumstances where moving a large folder full of library files (over 100) would be almost certain to provoke the hang (ten searches active, with a couple of them accumulating hundreds of results), physically unplugging the network and then moving the files results in no hang.

The trigger is a just-hashed file being integrated into the library (and the "files you have already" filter being updated) at the same time that a just-received query hit is being added to the search tabs. I'd look for a data race involving the "files you have already" filter first; it's almost certainly that filter that's responsible. Unfortunately it doesn't appear possible to verify that absolutely by turning the filter off, moving the files, and seeing if there's a hang, because any search busy enough to make the hang inevitable will quickly reach 300 hits and halt if that filter is switched off. (The only thing keeping such a search going ordinarily is that I already have most of the files it returns and am trying to download the last few.) And a halted search won't be able to cause the hang.
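
Schematically, the collision I suspect looks like this (locks and names purely illustrative -- the point is two threads taking the same pair of locks in opposite order):

[code]
// Purely illustrative, not Shareaza code.  Two threads acquire the
// same two locks in opposite order.  With blocking acquisition this
// is the classic deadlock; with timed try-locks plus retry loops it
// becomes a livelock instead.
#include <mutex>

std::mutex libraryLock;   // guards library + "files you have already" filter
std::mutex searchLock;    // guards search tabs / hit lists

void HasherThread()       // just finished hashing a file
{
    std::lock_guard<std::mutex> lib( libraryLock );
    // ...update the library, then refresh open searches against the
    // new filter, which needs the search tabs:
    std::lock_guard<std::mutex> srch( searchLock );
}

void NetworkThread()      // a query hit just arrived
{
    std::lock_guard<std::mutex> srch( searchLock );
    // ...add the hit to a tab, then consult the "already have" filter:
    std::lock_guard<std::mutex> lib( libraryLock );
}
[/code]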
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby raspopov » 23 Sep 2014 15:03

Can you close all existing search windows and reproduce this with a new one?
How many active downloads do you have?
raspopov
Project Admin
 
Posts: 945
Joined: 13 Jun 2009 12:30

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 24 Sep 2014 11:19

I can reproduce this any time I have an active-enough search window. Closing them all and starting a new search that produces a high rate of results makes it vulnerable again.

Please quit beating around the bush and examine the code I've indicated as likely harboring a race condition or related bug. I doubt it will take long to find now that it's pretty tightly localized to the interaction of the search tabs/Searches.tmp, the "files you have already" filter, newly arriving hits being added to the former, and newly hashed files being added to the latter.
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby raspopov » 24 Sep 2014 15:28

What number of search hits produces the freezing?
Which version of Shareaza do you use -- 32- or 64-bit?
raspopov
Project Admin
 
Posts: 945
Joined: 13 Jun 2009 12:30

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 25 Sep 2014 22:27

Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 25 Sep 2014 22:42

A couple of technical questions that could guide my investigation:

1. What locks does the UI event-pump thread ever grab? The livelock hangs the UI thread immediately, whereas the deadlock doesn't (though almost any UI action makes the event thread join the pileup, including trying to exit Shareaza gracefully). This suggests that both scenarios involve the transfer lock, and that the other lock involved in the livelock is one the UI thread grabs without specific user intervention (the UI thread obviously doesn't grab the transfer lock without specific user actions, or else in the deadlock case it would join the deadlock before you could see the "network core overloaded" error messages in Networks). On the other hand, the deadlock case must not involve that other lock, or the UI would wedge earlier in that case too.

2. CLibraryBuilder::HashFile makes two ten-iteration spin-lock attempts, one on the transfer lock and one on the library lock (these two locks are my prime suspects in the livelock case). The livelock clearly involves a very large number of failed attempts followed by an eventual success, with the failure likelihood rising sharply the more frequently query hits arrive. I'm somewhat puzzled because there seems to be no way for the hashing thread to loop for longer than a second without additional hash computations happening (which would raise CPU well above zero), and I'm unsure where the other site is that participates in the livelock. Logically, though, at most one of the sites grabs its second lock with no timeout (or it would be a deadlock instead), and at least one spins with potentially unbounded retries (or the livelock could never reach the ten-minute mark and beyond, as it sometimes does) but can end up relinquishing both locks (or we'd again have a deadlock instead of a livelock) before retrying.

So the second question is: where are all the places in the code that resemble the HashFile spin-retry loops -- places that try to acquire two locks, may hold the first for a while during each attempt to get the second, eventually drop the first and return a failure status if they can't, and are themselves called in a retry loop by their own caller, but a tighter and less CPU-intensive retry loop than the hash skip/retry one?
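
To make the shape I'm hunting for concrete (schematic only, not actual Shareaza code):

[code]
// Schematic of the suspect pattern: a timed two-lock acquisition that
// drops everything and retries on failure.  Two threads running this
// shape against the same locks in opposite order never deadlock, but
// under load they can starve each other for minutes -- a livelock.
#include <chrono>
#include <mutex>

bool TryBothLocks( std::timed_mutex& first, std::timed_mutex& second )
{
    using namespace std::chrono_literals;
    for ( int i = 0; i < 10; ++i )            // HashFile-style ten tries
    {
        std::unique_lock<std::timed_mutex> a( first, 100ms );
        if ( ! a.owns_lock() )
            continue;                          // couldn't get the first lock
        std::unique_lock<std::timed_mutex> b( second, 100ms );
        if ( b.owns_lock() )
            return true;                       // got both -- do the work
        // second lock busy: destructors drop the first, spin again
    }
    return false;    // caller may itself retry in a wider loop...
}
[/code]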
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 25 Sep 2014 23:53

Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 26 Sep 2014 00:03

Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 26 Sep 2014 00:11

Actually, it looks like any time it spends over 250ms in RunJobs, it sleeps 100ms in a wider loop in CNetwork::OnRun. So that might as well have been if ( bKeep ) return;. The actual problem might be that 100ms is not only too short, but also the same as many of the other lock-acquisition timeouts, so that OnRun ends up resonating with another thread: both back off for 100ms, then retry and almost always acquire their first culprit lock before either gets to its second. I'd try changing it to 150ms so that a) the other thread has longer to acquire its locks successfully before the query-hit jobs make another attempt, and b) the relative phases of the two threads keep shifting, likely breaking any logjam that still occurs much faster.
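
Better still than a fixed 150ms would be adding jitter, so the two threads can never stay phase-locked at all. A sketch (names mine):

[code]
// Sketch: replace the fixed 100ms sleep with a randomized back-off so
// contending threads cannot resonate at the same period indefinitely.
#include <chrono>
#include <random>
#include <thread>

void JitteredBackOff()
{
    static thread_local std::mt19937 rng{ std::random_device{}() };
    std::uniform_int_distribution<int> jitterMs( 100, 200 );
    std::this_thread::sleep_for( std::chrono::milliseconds( jitterMs( rng ) ) );
}
[/code]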
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 26 Sep 2014 03:04

Preliminary debug data indicates that my identification of the two culprit threads was correct. I managed to provoke a ten-second-long livelock by moving files around inside my library while a very busy search was active, and noted the bottommost message in the network tab (set to DEBUG detail level) during the freeze. When it "thawed" I copied the whole log and grepped it to find that location.

The very next two messages (likely generated in the instants before the livelock, but so shortly before it that the window didn't get repainted between them and the livelock's onset) were:

1. Hashing: $file path$

2. Processing query acknowledge from $IP$ (time adjust + 2 seconds): $blather$

Messages were still generated during the freeze, at an exponentially dropping rate as more threads joined the developing logjam. The end of the freeze can be estimated from my own sense of time and from the point where the message rate abruptly jumps from low-and-slowing back to a normal rate. The first messages at that point, following a three-second period with no messages:

1. Sending push request for download $unimportant$

2. Processing query acknowledge... (x2)

3. Got bogus hit packet... (x4)

4. Received a malformatted query hit packet...

5. Got bogus hit packet...

6. Received a malformatted query hit packet...

7. Sending query for ...

8. Hashing completed: $same file as before$

So: no hashing messages during the freeze, then a rash of CNetwork::OnRun thread messages and one hashing message instantly after it. And I checked -- there were no "processing query" anythings or "hit packet" anythings during the freeze either.

That's the smoking gun: the CNetwork jobs queue stalled and the hashing thread stalled, and they seem to be the threads that stalled first, before anything else stopped contributing to the log.

Now please for the love of Christ fix this bug already? :)
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 26 Sep 2014 04:59

More data seems to confirm this. I managed to set off another livelock, this one almost two minutes long. This time there was a "hashing..." message a few messages before the interface froze, and a "processing query acknowledge" almost instantly after that point (so probably between the last repaint before the freeze and the freeze proper). No "processing" anything, no "query hit" anything, and no "hashing" anything from that "processing query acknowledge" until two minutes later, when lots of all three start right as the message frequency (judged by timestamps) jumps back up to normal.

It seems the only thing I was wrong about was that it was processing query hits during the hang: the trigger is in processing query acknowledge events (inside the same loop that keeps grabbing the network lock). I'll have a closer look at the query-acknowledge code next, to try to find the locks likely to be the culprits. The network lock has moved up my suspect list, though I didn't see anything touching it in the hashing parts of the code...

That the hashing thread and the CNetwork::OnRun thread are the two threads that livelock seems beyond dispute now, however.
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 26 Sep 2014 06:54

I have circumstantial evidence that temporarily disconnecting G2 while moving a batch of library files suppresses the bug. This would be consistent with a query acknowledge being one of the two triggering events, as by inspection the code for processing those only runs while G2 is connected. The odd thing is that in my cursory examination of the relevant code I didn't see the query-acknowledge path holding any pair of locks simultaneously -- the network, G2 host-cache, and SearchManager locks are all held and released separately and sequentially.

I also now know that if the hash that is the other triggering event is for a just-downloaded file rather than from moving library files in Explorer, the "Successfully verified file" notification is generated during the livelock, late in the game, with recovery (if no other files are queued for hashing) seeming to occur a few seconds later, accompanied immediately by the "Hashed file:" notification for the same file. This means most of the lock contention happens before the various nested OnVerify calls complete, but some happens after.

The original two triggering threads may have begun working again at that point, but the logjam is still slowly unraveling; it's probably the multiple grabs of the transfer lock at various times during the post-hashing part of HashFile and its callees, plus heavy contention for that lock by pretty much the entire rest of Shareaza, that keep it in molasses mode for several seconds after recovery has technically begun. Once the backlogged tasks that want the transfer lock complete, the UI returns to responsiveness. Further evidence for this hypothesis: one of my sample freeze logs has a "network core overloaded" tucked in among the trickle of messages in the later phase of one instance, caused by an ED2K connect before the logjam had fully cleared.
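
For clarity, by "separately and sequentially" I mean the safe shape, where at most one lock is held at any moment (illustrative only):

[code]
// Illustrative only: sequential lock use -- at most one lock held at a
// time -- cannot by itself deadlock or livelock.  The trouble requires
// overlapping ownership of two locks somewhere.
#include <mutex>

std::mutex networkLock, hostCacheLock, searchManagerLock;

void ProcessQueryAck()
{
    { std::lock_guard<std::mutex> l( networkLock );       /* parse packet  */ }
    { std::lock_guard<std::mutex> l( hostCacheLock );     /* update hosts  */ }
    { std::lock_guard<std::mutex> l( searchManagerLock ); /* route the ack */ }
}
[/code]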
Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

Re: 2.7.7.0 outstanding bugs

Postby raspopov » 26 Sep 2014 14:51

You wrote too many words and I lost the thread of your reasoning... Can you draw a graph of the cross-locked threads around the suspected place?

P.S. At least the deadlock problems were solved in Shareaza :)
raspopov
Project Admin
 
Posts: 945
Joined: 13 Jun 2009 12:30

Re: 2.7.7.0 outstanding bugs

Postby Lanigiro » 27 Sep 2014 05:55

Lanigiro
 
Posts: 202
Joined: 10 Feb 2014 14:19

