Base: Speeding up ad-list
Feb. 19, 2009, 11:57 PM
Post: #1
Base: Speeding up ad-list
A short explanation of how lists work:

A list file is very similar to writing (word1|word2|word3|...), but lists tend to grow much larger. This post describes how I created an ad list for Proxomitron, starting from the well-known EasyList from Adblock Plus.

Looking at its keywords, most of them start with "http://", "/" or ".". Let's try to use this list (after some adaptation to Proxomitron, of course):

Suppose the URL to parse is in the variable \1 and is http://www.host.com/sub1/sub2.adbureau.example. If we use $TST(\1=$LST(adlist)*), the list is tested only once, and it matches only if a keyword matches the beginning of our URL.

One possible alternative would be $TST(\1=*$LST(adlist)*), but it would be really slow. It would check every word in the list against each of these:
Code:
http://www.host.com/sub1/sub2.adbureau.example
ttp://www.host.com/sub1/sub2.adbureau.example
tp://www.host.com/sub1/sub2.adbureau.example
p://www.host.com/sub1/sub2.adbureau.example
://www.host.com/sub1/sub2.adbureau.example
//www.host.com/sub1/sub2.adbureau.example
...
ub1/sub2.adbureau.example
...
ample
mple
ple
le
e
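
To put rough numbers on this, here is a minimal Python sketch (not Proxomitron code) that counts how many start positions each approach would try on the example URL. The function names are made up for illustration, and the boundary rule is a simplification of the filter further down: it only considers the start of the URL plus every "." or "/".

```python
URL = "http://www.host.com/sub1/sub2.adbureau.example"


def naive_starts(url):
    """Every offset is a start position, as with a leading * wildcard."""
    return list(range(len(url)))


def boundary_starts(url):
    """Offset 0 plus each offset where a '.' or '/' sits (simplified rule)."""
    return [0] + [i for i, ch in enumerate(url) if i > 0 and ch in "./"]


if __name__ == "__main__":
    print(len(naive_starts(URL)), "start positions with a leading wildcard")
    print(len(boundary_starts(URL)), "start positions at separators")
    # The substrings the list would actually be tested against:
    for i in boundary_starts(URL):
        print(URL[i:])
```

Every start position means one more lookup in the list, so fewer start positions should mean less work per URL.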


So after some days of research with pencil, paper and the log window, I arrived at something usable.

Copy this filter to the clipboard and import it, then go to the test window and test it with this code:
href=http://www.host.com/sub1/sub2.adbureau.ext
Code:
[Patterns]
Name = "<example> Parsing Adlist Release Candidate {ln}090220"
Active = FALSE
Limit = 256
Match = "href=$AV(\1)"
        ""
        "$LOG(!C\1)$TST(\1=("
        "(\w)\3|"
        "*((^(http|ftp)://|//)(http|.|/)\w)\3"
        ")$LOG(W\3)$TST(\3="
        "(.|/|)(\w)\9$LOG(w\9)prxfail"
        "))"

In the log window, the gray text shows which parts of the URL will be tested against the ad list.
Feel free to post suggestions or comment anything.
Feb. 20, 2009, 06:34 AM
Post: #2
RE: Base: Speeding up ad-list
I played with it for a while and tried to use the + operator to do this:

Code:
[Patterns]
Name = "New HTML filter"
Active = FALSE
Limit = 256
Match = "href=([^/.]+)\1((([/.]+[^/.]+[/.])\2$SET(a=$GET(a)\2)$LOG(!W\1$GET(A)))+|(([/.]+[^/.]+)\2$SET(b=$GET(b)\2)$LOG(!W\1$GET(b)))+)prxfail"

The log window output:
Code:
http://www.
http://www.host.
http://www.host.com/
http://www.host.com/sub1/
http://www.host.com/sub1/sub2.
http://www.host.com/sub1/sub2.adbureau.
http://www
http://www.host
http://www.host.com
http://www.host.com/sub1
http://www.host.com/sub1/sub2
http://www.host.com/sub1/sub2.adbureau
http://www.host.com/sub1/sub2.adbureau.ext

The output shows the characters consumed so far, while the remaining characters are matched against prxfail or $LST(adlist).

It's just a demonstration, and I know it still needs a lot of improvement. :)
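
For readers who want to see the two passes outside Proxomitron, here is a rough Python sketch. It is only an illustration with a hypothetical function name, and it uses ordinary regular expressions rather than Proxomitron's matching rules. It collects the prefixes of the test URL that end just after a separator ("/" or ".") and those that end just before one.

```python
import re

URL = "http://www.host.com/sub1/sub2.adbureau.ext"


def consumed_prefixes(url):
    """Return (prefixes ending with a separator, prefixes ending before one)."""
    head = re.match(r"[^/.]+", url)          # leading part, here "http:"
    start = head.end() if head else 0
    token = re.compile(r"[/.]+[^/.]+")       # separators, then one name
    with_sep, without_sep = [], []
    for m in token.finditer(url, start):
        end = m.end()
        without_sep.append(url[:end])
        if end < len(url) and url[end] in "/.":
            with_sep.append(url[:end + 1])
    return with_sep, without_sep


if __name__ == "__main__":
    first, second = consumed_prefixes(URL)
    print("\n".join(first + second))
```

Whatever follows each printed prefix is the part that would be handed to the list.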
Feb. 20, 2009, 06:53 AM
Post: #3
RE: Base: Speeding up ad-list
Regarding speed -- I don't know if this is of any relevance to the Base Config approach, or even applicable, but I thought I'd share:

When switching to Paul Rupe's 3-list approach (AdHosts, AdDomains, AdPaths), I got a significant speed boost, mainly because the list-invoking expression could be much better tailored to the actual list content.

There is a fourth list, "AdList", which acts as a hub for the other lists. I was messing around a lot with it to get it right. Now an (off-domain host testing) entry looks like:
Code:
http(s|):\\+/\\+/
  (^([^/]++.|)$TST(uDom)(^.))$TST(flag=(^*.adurl_l:[12].)(.)\9*|*)
  (($LST(AdHosts))\8(^[a-z0-9])$SET(9=AdH \8)
  |([^/]++.|)($LST(AdDomains))\8(^[a-z0-9])$SET(9=AdD \8)
  |[^/?]+*[/._?&;=-]($LST(AdPaths))\8(^[a-z0-9])$SET(9=AdP \8))

Same test with "ftp(s|):\\+/\\+/" and "//". I've never seen an FTP - let alone secure FTP - ad server. That test is only still in there to keep the list hashable (and - to a lesser extent - for completeness).

Most of above code isn't exactly interesting, but the tailored list invocation expressions work pretty well for me:
AdHosts: No wildcarding needed.
AdDomains: ([^/]++.|)
AdPaths: [^/?]+*[/._?&;=-]
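
As a rough illustration of why three tailored lists can be cheaper than one generic list, here is a Python sketch of the same split. It is not a translation of the filter above: the sample entries and the function name are invented, and the off-domain and flag tests are left out. Hosts are looked up exactly, domains are compared against the end of the host name, and path keywords must follow one of the delimiters and must not be followed by a letter or digit.

```python
import re
from urllib.parse import urlsplit

# Hypothetical sample entries; real lists are far longer.
AD_HOSTS = {"ads.example.net"}
AD_DOMAINS = {"adbureau.example"}
AD_PATHS = ["banner", "adframe"]

# Keyword after a delimiter, not followed by another letter or digit.
_PATH_RE = re.compile(
    r"[/._?&;=-](?:" + "|".join(map(re.escape, AD_PATHS)) + r")(?![a-z0-9])"
)


def classify(url):
    """Return 'AdH', 'AdD', 'AdP', or None if nothing matches."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in AD_HOSTS:                      # exact host, no wildcard
        return "AdH"
    if any(host == d or host.endswith("." + d) for d in AD_DOMAINS):
        return "AdD"                          # the domain or any subdomain
    rest = parts.path + ("?" + parts.query if parts.query else "")
    if _PATH_RE.search(rest.lower()):         # keyword somewhere in the path
        return "AdP"
    return None


if __name__ == "__main__":
    for u in ("http://ads.example.net/x.gif",
              "http://www.adbureau.example/",
              "http://www.host.com/img/banner.gif",
              "http://www.host.com/banners/x"):
        print(u, "->", classify(u))
```

The host and domain checks only ever look at the host name, so only the path list has to scan the rest of the URL.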
Feb. 20, 2009, 05:16 PM
Post: #4
RE: Base: Speeding up ad-list
Thanks, guys ;)

Whenever, it doesn't work as intended; take a look at the gray words in the log window:
Code:
[Patterns]
Name = "<example> Parsing Adlist {whenever}090220"
Active = FALSE
Limit = 256
Match = "href=([^/.]+)\1"
        "("
        "(([/.]+[^/.]+[/.])\2$SET(a=$GET(a)\2)$LOG(!W\1$GET(A)))+|"
        "(([/.]+[^/.]+)\2$SET(b=$GET(b)\2)$LOG(!W\1$GET(b)))+"
        ")"
        "(\w)\9$LOG(!w\9)prxfail"

Sidki, it seems very optimized, but those lists would need to be maintained by us, while EasyList is updated frequently, and I don't think we would notice a big difference in speed. I would like to take a look; I searched for the Paul Rupe config set and for info about the 3-list approach, but I didn't find anything.

If someone has the Paul Rupe config set, it would be nice to share it in the download section of our forum.
Feb. 20, 2009, 05:41 PM
Post: #5
RE: Base: Speeding up ad-list
He never published a complete config set.

There is a copy of the - now gone - original pages already on-site:
http://prxbx.com/other/paulrupe/
Relevant section: Blocklists
Feb. 20, 2009, 06:17 PM
Post: #6
RE: Base: Speeding up ad-list
Just a warning: the "WillemList.txt" link in the Blocklists section is NSFW (not safe for work). Seems like some squatters got the domain after it expired :(
EDIT: It is now SFW (safe for work) ;)
Feb. 20, 2009, 06:30 PM
Post: #7
RE: Base: Speeding up ad-list
Here it is! http://accs-net.com/smallfish/WillemList.txt
Feb. 20, 2009, 06:32 PM
Post: #8
RE: Base: Speeding up ad-list
Ah, thanks! I'll update that one link on that page. (going to save a local copy)
Feb. 20, 2009, 07:30 PM
Post: #9
RE: Base: Speeding up ad-list
(Feb. 20, 2009 06:17 PM) Kye-U wrote: Seems like some squatters got the domain after it expired :(

OT: The "<frameset>: Jump out of invisible Frames" filter in *the other* config set was missing that hijacked page. I've posted an update here.
Feb. 20, 2009, 07:37 PM
Post: #10
RE: Base: Speeding up ad-list
Please update these too:
AdDomainList.txt
AdPathList.txt
AdKeywordList.txt
CommentList.txt

Luckily, I found them here: http://homepage.usask.ca/cgi-bin/cgiwrap...klist.html


Attached File(s)
.rar  lists.rar (Size: 9.86 KB / Downloads: 813)
Feb. 21, 2009, 04:54 AM
Post: #11
RE: Base: Speeding up ad-list
Updated, thanks ;)
Feb. 21, 2009, 06:31 AM
Post: #12
RE: Base: Speeding up ad-list
This is probably irrelevant to what is being discussed here, but I improved the ad filtering speed of the Banner Blaster filter from Scott's default.cfg file by putting the keywords back into the filter and removing the reference to any external list. Since anchors are still the most common type of ads, and there can be many of them on a page, I found that the filter jumping first to the keyword list and from there into the ad host name list took too long, and slowed down page loading.
Feb. 21, 2009, 03:39 PM
Post: #13
RE: Base: Speeding up ad-list
Siamesecat, I did some tests and it works like the first code posted in this thread. The optimization we are looking for is precisely to test the list only at certain parts of the full URL. If you go to http://local.ptron/.pinfo/lists/AdKeys you will see that the list is not hashed. His code starts with \w, so it behaves like the first example posted here. Thanks for posting ;)
Mar. 15, 2009, 07:13 PM
Post: #14
RE: Base: Speeding up ad-list
Since most of the ads on a given page usually come from only 4 or 5 sites, I had the idea of creating an in-memory list that holds the most recently matched keywords from the ad list. In theory Proxomitron would be faster, because it would search the recently matched keywords first, before the full list. I know Proxomitron hashes lists very well, but I'm testing this "caching" concept... comments are welcome as always ;)
Code:
$TST(
\1=((\w)\0|*((^(http|ftp)://|//)(http|.|/)\w)\0)
$TST(\0=(.|/|)($LST(Mem-Adlist)|$LST(Adlist))\9*)
)
($TST(\9=$LST(Mem-Adlist))|$ADDLST(Mem-Adlist,$WESC(\9)))
$SET(4=Ad-Href)

Edit: The filter is the same one I posted at the start, just the working version rather than the forum version. I added a test against Mem-Adlist before the full ad list. When either list matches, it tests whether the matched keyword is already in Mem-Adlist; if it isn't, it adds it to Mem-Adlist.

Edit: added the $WESC. Thanks, Sidki.
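
The caching logic itself can be sketched in a few lines of Python. This is only an illustration of the control flow: the keyword data and names are hypothetical, and it uses a plain substring scan. Proxomitron's hashed lists work differently, so this says nothing about real timings.

```python
# Hypothetical keyword data; a real ad list has thousands of entries.
AD_LIST = ["adbureau", "doubleclick", "banner"]
mem_adlist = []  # recently matched keywords, checked before the full list


def find_keyword(url):
    """Return the first keyword found in url, trying the small cache first."""
    for kw in mem_adlist:
        if kw in url:
            return kw              # cache hit, full list is never touched
    for kw in AD_LIST:
        if kw in url:
            mem_adlist.append(kw)  # remember it for the following URLs
            return kw
    return None
```

Calling find_keyword() twice on URLs from the same ad server adds the keyword to mem_adlist on the first call and answers from mem_adlist on the second.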
Mar. 15, 2009, 08:35 PM
Post: #15
RE: Base: Speeding up ad-list
Very inventive idea! I'm curious about your results. :)

I don't know if that applies to your case, but in the situations where I'm using this $TST / $ADDLST routine, I had to do a $WESC when adding, as well as an end-of-string test, like:
Code:
(^$TST((\0\5)=$LST(Mem-ScriptSrc)))$ADDLST(Mem-ScriptSrc,$WESC(\0\5)(^?))
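
As an analogy only, here is the same precaution in Python with regular expressions, whose special characters differ from Proxomitron's wildcards. The helper names are hypothetical. The stored text is escaped so it is taken literally, and an end-of-string anchor keeps a stored entry from matching longer strings that merely start with it.

```python
import re


def add_entry(patterns, text):
    """Store text as a literal pattern that must match to the end of string."""
    patterns.append(re.compile(re.escape(text) + r"\Z"))


def is_listed(patterns, text):
    """True if text equals one of the stored entries."""
    return any(p.match(text) for p in patterns)


if __name__ == "__main__":
    entries = []
    add_entry(entries, "http://host.example/a.js?v=1")
    print(is_listed(entries, "http://host.example/a.js?v=1"))    # same string
    print(is_listed(entries, "http://host.example/a.js?v=12"))   # longer string
    # Without escaping, "." and "?" act as wildcards and match too much:
    print(bool(re.match(r"a.js?v=1", "aXjv=1")))
```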