Simple way to fix slow startup?


Simple way to fix slow startup?

Postby cgreevey » 14 Jan 2012 22:32

I've had Shareaza for years, and it's steadily gotten slower and slower to start up. I eventually decided to split my rather large library into several chunks and switch among them by swapping out versions of the AppData/Roaming/Shareaza directory and of a directory holding the incomplete, download, and library share folders. I could then share only one chunk at a time (and search for more material for it, if I wanted to avoid redownloading files I already had), but hopefully the smaller size of each chunk (1/10 of the whole) would make Shareaza start up faster and behave more reliably.

In the process, I examined the contents of the various directories and saw that the AppData subdirectory contained two extremely large individual files:

Shareaza.db3: 500 MB
TigerTree.dat: 1.2 GB

After backing up both the library and the AppData subdirectory, I ran Shareaza and removed nearly every library folder from the library. It hung with the "add/remove library folders" dialog still displayed, stayed that way for 72 hours, and then woke up again with the library pared down to about 4000 files. I then shut Shareaza down and noted that the two very large files listed above had not shrunk at all, though the 50 MB Library1.dat and Library2.dat files had shrunk to about 2 MB each.

The next step in my planned migration to a chunked library was to create a chunk template: a copy with only the bare-bones "always there" library subset. To that end I copied the AppData/Roaming/Shareaza directory to ShTemplate (named so it would stay near Shareaza in an alphabetized listing, and also to indicate exactly what it's a template of). After that, since those two huge files had not shrunk, I decided to try an experiment. Thinking that maybe they only ever ratcheted upward in size, that there was now much less to store in them, that their sheer size might be what was keeping startup so slow (~30 minutes), and having just backed them up, I decided to try nuking them. I expected Shareaza to start up quickly but have to re-hash the 4000 or so files in the stripped-down library.

The surprise was that it didn't. Re-hash the files, that is. It hashed 2 new files I'd moved into that part of the library while Shareaza was down, and that was that. It also took only about ten seconds to start up -- a reduction by a factor of 180.

So ... if TigerTree.dat, in particular, wasn't hosting all the library hashes, what was it doing? Shareaza seems to be functioning OK, but I'd like to know what the consequences of deleting those two files are. If they're just non-expiring caches, that suggests a simple cure for slow startup (though maybe not for other forms of degraded performance): just delete TigerTree.dat and Shareaza.db3 from time to time. On the other hand, splitting the library may still be a good idea.

Does anyone with more knowledge of Shareaza's under-the-hood technical stuff have anything to add to this? In particular, exactly what are the consequences likely to be of nuking those two files periodically? If there would be any unpleasant surprises, I'd like to know now so I can just restore those files from the backup and see how well Shareaza behaves with the smaller libraries but the bloated files. If there won't be, I suppose I now need to test Shareaza with the original large library but the two big files nuked... maybe the split isn't even necessary.
cgreevey
 
Posts: 41
Joined: 14 Jan 2012 22:06

Re: Simple way to fix slow startup?

Postby cyko_01 » 16 Jan 2012 02:49

As a workaround, you could set Shareaza to minimize to the tray when closed (Tools > Shareaza Settings > General > General); then, when you click the icon on the desktop (or Start menu), it should pop up instantly.
cyko_01
 
Posts: 938
Joined: 13 Jun 2009 15:51

Re: Simple way to fix slow startup?

Postby cyko_01 » 16 Jan 2012 03:21

Shareaza.db3 is a database containing a list of all your library files, their locations, file sizes, and the times it thinks they were last modified. I would imagine that TigerTree.dat contains the tiger hashes (only one of several types of hashes that Shareaza keeps) that match those files.

On startup Shareaza does a quick scan of your entire library to check for changes -- new files, missing files, changes to existing files -- so it knows what it needs to hash. Shareaza also watches your library folders for changes every few seconds while it is running. It may be beneficial for you to disable the "watch library folders" setting (Tools > Shareaza Settings > General > Library) and scan your library manually when necessary (not sure if it will still scan on startup). To scan manually, switch to "Folders" view in your library (top of the left pane on the library tab), right-click the folder in the left pane that you think has changed, and select "Scan" from the context menu. This may allow you to add all your files back to your library (although the first time you scan/hash them will still be a b1+ch).
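In rough C++, that startup change scan boils down to something like the following. This is a minimal sketch with made-up names, assuming each library entry stores the path, size, and last-modified time seen at the previous run; it is not Shareaza's actual code.

    // Minimal sketch of a startup change scan; made-up names, not Shareaza's code.
    #include <cstdint>
    #include <filesystem>
    #include <string>
    #include <unordered_map>
    #include <vector>

    namespace fs = std::filesystem;

    struct LibraryRecord {
        std::uintmax_t size;        // size recorded at the previous run
        fs::file_time_type mtime;   // last-modified time recorded then
    };

    // Returns the paths that need (re)hashing: new files and changed files.
    // (db entries whose paths no longer exist on disk are the missing files.)
    std::vector<fs::path> FindChangedFiles(
        const fs::path& libraryFolder,
        const std::unordered_map<std::string, LibraryRecord>& db)
    {
        std::vector<fs::path> toHash;
        for (const auto& entry : fs::recursive_directory_iterator(libraryFolder)) {
            if (!entry.is_regular_file())
                continue;
            auto it = db.find(entry.path().string());
            if (it == db.end() ||                              // new file
                it->second.size != entry.file_size() ||        // size changed
                it->second.mtime != entry.last_write_time())   // touched since
                toHash.push_back(entry.path());
        }
        return toHash;
    }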
cyko_01
 
Posts: 938
Joined: 13 Jun 2009 15:51

Re: Simple way to fix slow startup?

Postby cgreevey » 16 Jan 2012 04:05

cgreevey
 
Posts: 41
Joined: 14 Jan 2012 22:06

Re: Simple way to fix slow startup?

Postby cgreevey » 16 Jan 2012 08:11

cgreevey
 
Posts: 41
Joined: 14 Jan 2012 22:06

Re: Simple way to fix slow startup?

Postby cgreevey » 17 Jan 2012 05:11

cgreevey
 
Posts: 41
Joined: 14 Jan 2012 22:06

Re: Simple way to fix slow startup?

Postby cgreevey » 19 Jan 2012 11:32

Since they were going to need re-hashing anyway, I took the opportunity to do some sorting and reorganizing of Chunk 1 of the split library. I've just now started Shareaza hashing it all. Chunk 1 has about 11,500 files; adding the (already hashed) common subset of about 3500 files gives 15,000, roughly 1/10 the size of the original library. (There will be 18 chunks rather than 10, though, as there must be some overlap among the chunks for various reasons.)

I hope it will be done sometime tomorrow. I'll shut it down and restart it then and evaluate its startup speed. Then maybe do some searches for Chunk 1 subject matter and when something downloads see how fast it hashes.

(The common subset doesn't need rehashing because I keep a template of the Roaming/Shareaza directory with only those files in the library. I clone it to create an instance for each chunk, then add that chunk's specific files so that Shareaza hashes them and turns the instance into a full Shareaza directory for the chunk. To switch chunks in the future, I rename the instances into and out of the way, and correspondingly rearrange the library chunks in their directory. Complicated, but it should avoid having to hash everything again on every chunk change.)
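(The switch itself is then just a couple of directory renames done while Shareaza is shut down -- roughly the following, with my own hypothetical paths and ".chunkN" naming scheme:)

    #include <filesystem>
    #include <string>

    namespace fs = std::filesystem;

    // Park the live Shareaza directory under its chunk's name, then move the
    // next chunk's instance into place. Run only while Shareaza is closed;
    // the paths and naming are my own invention.
    void SwitchChunk(const fs::path& roaming, int activeChunk, int nextChunk) {
        const fs::path live = roaming / "Shareaza";
        fs::rename(live, roaming / ("Shareaza.chunk" + std::to_string(activeChunk)));
        fs::rename(roaming / ("Shareaza.chunk" + std::to_string(nextChunk)), live);
        // A brand-new chunk starts as a clone of the template instead:
        // fs::copy(roaming / "ShTemplate", live, fs::copy_options::recursive);
    }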
cgreevey
 
Posts: 41
Joined: 14 Jan 2012 22:06

Re: Simple way to fix slow startup?

Postby cgreevey » 19 Jan 2012 22:37

TL;DR: The Chunk 1 library gives a reasonable startup time and a reasonably responsive Shareaza during hashing. The next version of Shareaza should replace the data structure used to track the library with a hash table keyed by file path, or else replace the sort being used with mergesort. It probably also shouldn't bother generating tiger tree hashes, or tracking query-hit/upload statistics, for library files that are not being shared, since the only use Shareaza has for such files is excluding them from search results via the "files you have already" filter.

Long and detailed version:

After enumerating the Chunk 1 files, it started out last night hashing only one file every couple of minutes, but after about ten files, whatever else it had been doing finished. There was a spike of CPU use, after which it settled down to hashing two or three files a second -- slower than the 8 per second with only 3500-odd files in the library, but much faster than the roughly one per minute with the original giant library. In fact, it was much faster than the one file every 6 seconds that a 10x speedup (matching the 10x library-size reduction) would predict, and even a bit faster than 10x that speed again.

I conclude from this that hashing speed slows down quadratically as a function of library size, which suggests that Shareaza could use some algorithmic improvements. Quadratic behavior points to the use of a naive sort algorithm somewhere, perhaps bubble sort. Hashing speed was also independent of the TigerTree.dat and Shareaza.db3 file sizes, so it's probably the Library1.dat manipulation that's behaving badly here. The remarkable thing is just how badly.

I can think of three implementations of the library data structure, none of which would be quadratic for an insert/update. The smartest would be a hash table (keyed by hashes of the file paths, of course, since the files themselves aren't hashed yet when first added), which should be ~O(1) to insert into or to pull an entry out of for updating. A balanced binary tree by file path would be O(log N) and an almost-as-good choice. (If the tree wasn't forced to stay balanced and got badly unbalanced, updates and inserts could degrade to O(N).) Even a simple unsorted array, a linked list that has to be searched linearly for an entry, or a sorted array that has to be copied from the insertion point onward, is only O(N).

O(N^2) implies to me that it's doing some O(N) operation not just on the altered item but on all the others too. The simplest cause would be if, instead of a sorted insertion into a sorted list, it does an append-to-the-end followed by a complete re-sort of the list (which should be N log N), made worse because either a) the sort is bubble sort or b) it's quicksort and the worst-case behavior is triggered by the already-almost-sorted input. If so, the solution is: change to a hash table, or do a sorted insert, or, if the data structure library in use makes neither change trivial, at least use mergesort instead of bubble sort or quicksort.
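To make the suggestion concrete, here's the contrast between the append-then-sort pattern I suspect and the hash table I'd propose. Hypothetical types and names throughout; a sketch of the idea, not Shareaza's code.

    #include <algorithm>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct FileRecord {
        std::string path;
        std::string tigerTreeRoot;   // filled in once the file is hashed
    };

    // The pattern I suspect: append, then re-sort the whole list after every
    // insert. Even with a good sort one insert costs O(N log N), so building
    // a library of N files costs O(N^2 log N); bubble sort makes it worse.
    void InsertSlow(std::vector<FileRecord>& library, FileRecord rec) {
        library.push_back(std::move(rec));
        std::sort(library.begin(), library.end(),
                  [](const FileRecord& a, const FileRecord& b) {
                      return a.path < b.path;
                  });
    }

    // The proposed fix: a hash table keyed by file path. One insert or update
    // is ~O(1), so building the whole library is ~O(N).
    void InsertFast(std::unordered_map<std::string, FileRecord>& library,
                    FileRecord rec) {
        std::string key = rec.path;   // copy the key before moving the record
        library[key] = std::move(rec);
    }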

I estimate it took a total of about 2 and a half to 3 hours to hash the 11,617 files added to the library to build Chunk 1.

Anyway, on to the startup speed. When I restarted it just now, it took about three minutes, mostly spent at "Starting Database". Once the main window appeared, it maximized and became responsive almost immediately. Causing another file to be hashed resulted in a hashing time too brief to notice (a fraction of a second at most), consistent with its performance last night.

The "Starting Database" startup phase took about 30 minutes with the original giant TigerTree.dat and Shareaza.db3 files and was nearly instant with those deleted. The current Chunk 1 versions, post-hashing, are about 1/10 the sizes of their pre-split counterparts (which makes sense, as there are 1/10 as many files) and produced a 1/10-as-long startup phase of 3 minutes. This phase, then, is linear in the size of those two files. That's the best that can be expected if their contents are being loaded into RAM, but one possible improvement exists. The "files you have already" filter presumably relies on any hash matching a file, not specifically the tiger tree hash; every file you can search for has a non-tiger hash (SHA-1 usually, or ed2k for ed2k-only results); and unshared files don't get uploaded. So the tiger tree hash needn't be generated for files that are in the library but not being shared.
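In code, the policy I'm suggesting amounts to something like this. The names are hypothetical (the digest routines are stand-in stubs); this sketches the proposal, not Shareaza's actual hashing code.

    #include <string>

    // Stand-ins for the real digest routines; the real ones would read the
    // file and run the respective hash over its contents.
    static std::string ComputeSha1(const std::string&)      { return "sha1..."; }
    static std::string ComputeEd2k(const std::string&)      { return "ed2k..."; }
    static std::string ComputeTigerTree(const std::string&) { return "tt..."; }

    struct HashSet {
        std::string sha1;        // matches search results against the library
        std::string ed2k;        // for ed2k-only results
        std::string tigerTree;   // only needed to serve uploads
    };

    // Every library file gets the hashes the "files you have already" filter
    // needs; the tiger tree (the bulk of TigerTree.dat) is computed only for
    // files that are actually shared.
    HashSet HashLibraryFile(const std::string& path, bool isShared) {
        HashSet h;
        h.sha1 = ComputeSha1(path);
        h.ed2k = ComputeEd2k(path);
        if (isShared)
            h.tigerTree = ComputeTigerTree(path);
        return h;
    }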

The startup phase after that -- when memory use climbs and the window is displayed but unmaximized and unresponsive -- took 10 minutes with the original library whether or not Shareaza.db3 and TigerTree.dat were deleted, was nearly instant with the 3500-file base library, and took a few seconds with the 15,000-file combined base and Chunk 1 library. That's clearly nonlinear, and it depends only on Library1.dat size. I'd estimate it's quadratic again, probably for the same reason hashing is: an in-memory library data structure is created empty and inserted into one file at a time, using the same structure and algorithm used when updating after a file is hashed.

(Caveat: it's much faster to load 150,000 files -- or, for that matter, 15,000 -- than to hash the same number, by what seems to be a constant factor of about 1500. When they're mostly small files of a few tens of KB, the hashing computation itself can't be the explanation, especially since it would contribute a linear term while the factor of 1500 applies to the quadratic term in the duration equation. This may mean there's a separate cause of the quadratic behavior during library loading, or it may mean the comparison function used by the sort differs between loading and hashing and the one used when hashing takes 1500 times as long per comparison.)
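If loading really is re-sorting (or doing a full sorted-insert pass) per record, the standard fix is to bulk-load and sort exactly once at the end -- again a sketch with hypothetical types:

    #include <algorithm>
    #include <string>
    #include <vector>

    struct FileRecord { std::string path; /* sizes, times, hashes... */ };

    // Append every record first, then sort once: O(N log N) total, versus
    // O(N^2 log N) or worse when the list is re-sorted after each insert.
    std::vector<FileRecord> LoadLibrary(std::vector<FileRecord> onDisk) {
        std::vector<FileRecord> library;
        library.reserve(onDisk.size());          // one allocation up front
        for (auto& rec : onDisk)
            library.push_back(std::move(rec));   // O(1) amortized per record
        std::sort(library.begin(), library.end(),
                  [](const FileRecord& a, const FileRecord& b) {
                      return a.path < b.path;
                  });                            // single O(N log N) sort
        return library;
    }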

(One more data point in favor of hashing being slow due to the algorithm used for inserting/updating the data structure: the hashing popup has a little progress meter that presumably measures the actual computation of the hash. This part does not change speed for a given file size depending on library size; its speed, as I've eyeballed it, depends only on the size of the particular file being hashed. It's the pause between one file hashing and the next, during which presumably the previous file's hashes are being inserted into the in-memory library data structure, that grows quadratically with library size.)
cgreevey
 
Posts: 41
Joined: 14 Jan 2012 22:06

Re: Simple way to fix slow startup?

Postby raspopov » 29 Jan 2012 08:04

You cannot share hundreds of thousands of files without a penalty on any modern computer hardware. Please pack similar files into a few archives (zip, 7z, etc.) to reduce the total number of shared files in your library.
raspopov
Project Admin
 
Posts: 945
Joined: 13 Jun 2009 12:30

