# Proxomitron4 URL killfile
# by Paul Rupe
#
# NoAddURL
# NoHash
#
# $Header: E:/RCS/AdList.txt 1.11 2001/12/04 18:46:29 prupe Exp prupe $
#
# Get the latest version of this file from
#   http://www.geocities.com/u82011729/prox/blocklist.html
# Based on
#   Willem Irrwarr's blockfile at
#     http://www.eccentrix.com/computer/wirrwarr/
#   Stephen Martin's Hosts file at
#     http://www.smartin-designs.com/
#   Ryan Farmer's Hosts file at
#     http://www.geocities.com/ryanf86_2000/
#
# Input:
#   Some bounded string representing an absolute URL minus the protocol and
#   :// separator. Suitable for the URL-Killer header filter.
# Output:
#   On successful match, sets \9 to the keyword (domain, hostname, or path)
#   that caused the match.
#
# This is an attempt to make the ad site filter list more structured and
# easier to maintain. Looking over Willem's original list you will notice
# that most of the entries fall into one of three categories:
#
# (1) Domains - Entire domains like doubleclick that are blocked with an
#     expression like "([^/]++.|)doubleclick.".
# (2) Hostnames - Common host names like ads, adserver, banners that are
#     blocked regardless of the domain. These are represented by entries
#     such as "ads.*".
# (3) Path components - Paths like /ads/ that are blocked even if the
#     domain and hostname were ok. These show up as "*/ads/", for example.
#
# Each of these three types is pushed into a separate list. This means you
# must configure four lists instead of one:
#   AdList    - AdList.txt
#   AdDomains - AdDomainList.txt
#   AdHosts   - AdHostList.txt
#   AdPaths   - AdPathList.txt
# Of these, only AdList is useful directly in filters.
#
# The advantage to this is that it simplifies adding new entries. Do you
# want to block an entire domain? Add its name to the AdDomains list. There
# is no need to worry about the proper wildcards to put around it; that part
# is taken care of by AdList. If the boundaries for what constitutes a path
# component ever change, we merely update the top-level entry in AdList,
# not every single line in AdPaths.
#
# Another big advantage is efficiency. Since the sublists do not have
# wildcards at the beginning, Proxomitron can use its hashing algorithms on
# them. Also because of this hierarchy, Proxomitron does not have to keep
# evaluating ([^/]++.|) over and over. The result is that this version of
# AdList is several times faster than the original.
#
# The disadvantage, of course, is that you now have four lists to install
# rather than one.

# Blocked domains - Entries in this list are single words like doubleclick
# or possibly full domain names like dynamic.dol.ru. A given entry is
# optionally preceded by a hostname and followed by dot, slash, or
# end-of-string. The last is so that "http://dynamic.dol.ru" will match even
# without a trailing slash. The boundary also includes colon to take care of
# URLs with port numbers, and quote chars for URLs embedded in scripts.
([^/]++.|)($LST(AdDomains))\9([./:"']|(^?))

# Blocked hostnames - Entries in this list are single words like adserver or
# complete hostnames like ns.netsol.com. Each entry is checked at the
# beginning of the input string only. To match it must be followed by a
# boundary: a dot, slash, dash, colon, quote char, or end-of-string. The
# purpose of this is so that an AdHosts entry "ad" won't accidentally block
# "admin.something.com".
($LST(AdHosts))\9([./\-:"']|(^?))

# Blocked paths - Entries in this list are single words like ads. Each
# entry is matched against components in the path portion of the URL,
# separated by dot, slash, dash, underscore, question mark, ampersand,
# colon, or equals sign. A quote char is also accepted as the trailing
# boundary.
*[./\-_?&:=]($LST(AdPaths))\9[./\-_?&:="']
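#
# Example - A rough sketch of how the pieces fit together. The sublist
# entries below are illustrative only, taken from the examples named in the
# comments above; they are not a copy of the distributed lists, and
# "something.com" is a placeholder domain. Sublist entries are bare keywords,
# one per line, with no wildcards:
#   AdDomainList.txt:  doubleclick
#                      dynamic.dol.ru
#   AdHostList.txt:    ad
#                      adserver
#   AdPathList.txt:    ads
# Assuming only those entries, the three rules above would be expected to
# behave as follows:
#   dynamic.dol.ru               blocked by AdDomains (end-of-string boundary)
#   adserver.something.com/      blocked by AdHosts ("adserver" followed by dot)
#   admin.something.com/         not blocked ("ad" is followed by "m")
#   www.something.com/ads/x.gif  blocked by AdPaths ("ads" between two slashes)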