Shareaza

by **old_death** » 25 Jul 2011 20:16

You're right. Such factors could be used to calculate a "spam probability rating" with a maximum of 100 points out of all available heuristic factors. There could be a setting somewhere to allow users to turn the filter threshold...
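A minimal sketch of how such a rating could be computed. The factor names, their weights, and the default threshold are hypothetical and only illustrate the idea; none of them come from Shareaza. The rating is the share of triggered weight scaled to 0..100, and a user-adjustable threshold decides whether a hit is hidden.

```python
# Hypothetical heuristic factors and weights (illustration only).
FACTOR_WEIGHTS = {
    "missing_metadata": 3,
    "suspicious_extension": 2,
    "known_spam_host": 4,
    "size_mismatch": 1,
}

TOTAL_WEIGHT = sum(FACTOR_WEIGHTS.values())


def spam_rating(triggered):
    """Return a 0..100 rating: triggered weight as a share of all weights."""
    # Using a set avoids counting the same factor twice; unknown names score 0.
    hit_weight = sum(FACTOR_WEIGHTS.get(name, 0) for name in set(triggered))
    return 100 * hit_weight // TOTAL_WEIGHT


def is_filtered(triggered, threshold=60):
    """Hide a search hit once its rating reaches the user-chosen threshold."""
    return spam_rating(triggered) >= threshold
```

With these assumed weights, a hit that triggers every factor reaches the full 100 points, and lowering `threshold` makes the filter stricter.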

by **cyko_01** » 28 Jul 2011 17:40

by **ailurophobe** » 29 Jul 2011 01:05

One method would be to separate security filtering and spam filtering. Security filtering would block addresses outright and essentially be an IP filter on connections. The spam filter would work on search results only and be adaptive: if a keyword filter blocks a result, both the address and the hashes would be marked as probable spam. A hash block would count against the address, and an address block against the hash. This way, starting from known spam, the filter could learn to filter new spam from its association with known spam.

For example, if an address is a source of spam matching a regexp filter, a learning filter would remember the address and filter results from it even after the spam no longer matches the regexp filter. Similarly, it would remember the hashes for the regexp hits and filter those files even if they came from clean addresses with new names. With a smart enough threshold system, it would even be possible to use these new addresses and hashes to learn even more addresses and hashes to filter.
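The association idea can be sketched roughly as below. This is a deliberately naive version: the class and method names are made up for illustration, and it marks an address or hash as bad after a single hit, with none of the threshold logic a usable filter would need.

```python
import re


class LearningFilter:
    """Naive sketch: remember addresses and hashes seen together with spam."""

    def __init__(self, patterns):
        # Keyword/regexp filters that define the initial "known spam".
        self.patterns = [re.compile(p, re.IGNORECASE) for p in patterns]
        self.bad_addresses = set()
        self.bad_hashes = set()

    def is_spam(self, name, address, file_hash):
        keyword_hit = any(p.search(name) for p in self.patterns)
        learned_hit = address in self.bad_addresses or file_hash in self.bad_hashes
        if keyword_hit or learned_hit:
            # A block on either side counts against the other side too.
            self.bad_addresses.add(address)
            self.bad_hashes.add(file_hash)
            return True
        return False
```

For example, once `LearningFilter([r"free\s+crack"])` has seen a matching name from some address, later results from that address are filtered even with an innocent name, and the same hash is filtered when it shows up from a new address. Because every hit spreads without limit, this version would over-block quickly, which is exactly why a threshold system is needed.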

by **cyko_01** » 29 Jul 2011 13:10

by **ailurophobe** » 29 Jul 2011 17:09

I am pretty sure that a learning spam filter would count as being "more intelligent"...

Seriously, I am not a big fan of the "one-click black box" spam filter. I've recently had some cases where I needed to uncheck spam filtering to see perfectly valid results for my query. It is not a big or common issue, but if the amount of filtering is increased, the number of false positives will go up. Basically, there is a limit on how many things you can filter automatically before it becomes inconvenient for users not to have finer control over the process. I doubt we are there yet, or will be even after your proposals, but it is something to keep in mind. At some point making structural changes will become a better option.

by **cyko_01** » 29 Jul 2011 18:43

by **ailurophobe** » 30 Jul 2011 22:53

Yes, obviously my description was nowhere near sufficient for a filter that is actually usable...

Scenario A is simple enough to fix by simply not filtering files in the library. Your library would essentially work as a white list for files.

In general, I was thinking along the lines of the Bayesian filters used for email filtering. So the filter would also need to count instances of "not associated with spam" for hashes and addresses, which would solve most of the problems. I didn't mention this because I honestly have no idea how to do it efficiently. I doubt it is impossible or even particularly difficult; I just happen not to know a good method right now.
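One possible way to keep such counts, sketched under assumptions: each address and hash is treated as a key with a "seen with spam" and a "seen without spam" counter, and the per-key estimates are combined naive-Bayes style. The class name, the key format, and the add-one smoothing are choices made for this illustration, not anything taken from Shareaza or from a particular email filter.

```python
import math
from collections import Counter


class BayesCounts:
    """Per-key spam / not-spam counters, combined as summed log-odds."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def observe(self, keys, is_spam):
        """Record one search hit; keys are e.g. its address and its hashes."""
        table = self.spam if is_spam else self.ham
        for key in keys:
            table[key] += 1

    def key_probability(self, key):
        """Smoothed spam estimate for one key; unseen keys give 0.5."""
        s, h = self.spam[key], self.ham[key]
        return (s + 1) / (s + h + 2)

    def spam_probability(self, keys):
        """Combine the keys, assuming (naively) that they are independent."""
        log_odds = 0.0
        for key in keys:
            p = self.key_probability(key)
            log_odds += math.log(p / (1 - p))
        return 1 / (1 + math.exp(-log_odds))
```

Each lookup and update is a dictionary operation, so the cost per search hit stays small; the open question in this sketch is how to bound the memory used by the counters over time.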

by **borsti67** » 31 Jul 2011 12:16

Regarding not-yet-deleted spam files: you could count the number of "known bad files" you've received in response to search requests in each session. This would give that host a penalty, increasing the spam likelihood of search results from it and eventually blocking it after a certain count.
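A rough sketch of that per-session counter. The class name, the `block_after` parameter, and its default are hypothetical; the set of known bad hashes is assumed to come from whatever list the client already keeps.

```python
from collections import Counter


class SessionHostPenalty:
    """Count known-bad files per host for one session; block after a limit."""

    def __init__(self, known_bad_hashes, block_after=3):
        self.known_bad_hashes = set(known_bad_hashes)
        self.block_after = block_after
        self.penalties = Counter()  # host -> number of known-bad hits so far

    def record_hit(self, host, file_hash):
        """Register one search result; return True if the host is now blocked."""
        if file_hash in self.known_bad_hashes:
            self.penalties[host] += 1
        return self.is_blocked(host)

    def is_blocked(self, host):
        return self.penalties[host] >= self.block_after
```

Creating a fresh `SessionHostPenalty` at startup gives the per-session behaviour, so a host is not punished forever for files it has since removed.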

by **cyko_01** » 01 Aug 2011 02:56

by **smokex** » 24 Aug 2011 22:42

if (result is type that should have metadata)
{
    if (result does not have metadata)
    {
        drop search hit;
    }
}

This one is not complicated: it would be damn simple to implement and would take care of 90% of the spam out there.
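The pseudocode above could look roughly like this in runnable form. The list of extensions that "should have metadata" and the function name are assumptions made for the example; the metadata is assumed to arrive as a dictionary (or `None`) per search hit.

```python
import os

# Hypothetical set of file types expected to carry metadata.
METADATA_EXPECTED = {".mp3", ".wma", ".avi", ".wmv", ".mov"}


def keep_hit(filename, metadata):
    """Return False for hits of a metadata-bearing type that have none."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in METADATA_EXPECTED and not metadata:
        return False  # drop search hit
    return True
```

Whether dropping such hits outright is safe depends on how many legitimate clients actually send metadata, so using the result as one input to a rating rather than as a hard rule may be the more cautious choice.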

by **cyko_01** » 25 Aug 2011 01:25

by **old_death** » 03 Sep 2011 12:10

Are G2 clients supposed to provide metadata? I mean, do the specs say this is required, or is it only optional?

by **brov** » 04 Sep 2011 12:39

Optional.

by **cyko_01** » 05 Sep 2011 15:26

by **biggestnoob** » 05 Sep 2011 16:16

I actually find very little spam with G2, DC++ and mule files. I mean 1 spam file out of 100.

The most spam I find is with G1, but even that doesn't bother me, as it's usually .mov, .qt or .wmv files and very few are actually .mp3 or .mp4.

So I don't see a big reason for additional filters or stuff. I'd just like DC++ to function better and return more results.

by **old_death** » 09 Sep 2011 07:18

Well, then you are a lucky exception... Spam is a much bigger problem for most other G2 network users...

more intelligent suspicious files filter
