The Un-Official Proxomitron Forum
Cut: Chained Ad Path URLs - Printable Version


Cut: Chained Ad Path URLs - sidki3003 - Mar. 21, 2009 12:13 PM

One of the new filters in the 2009 configs is "<script>: Cut: Chained Ad Path URLs". It is required to deal with concatenated scripts, which are becoming more and more popular:
Code:
<script src="http://myserver.com/load?adscript.js,requiredscript.js,trackingscript.js"></script>

Example: http://www.spike.com/

So far so good. However, lately i also see concatenated offsite scripts:
Code:
<script src="http://myserver.com/load?http%3A//adserver.com/x.js,requiredscript.js,http%3A//trackingserver.com/y.js"></script>

Example: http://kirstenrokz.buzznet.com/user/


The filter below tests each chained component against the complete ad-list combo (AdHosts-J, AdDomains, etc.). I'm not sure whether the recursive expressions are correct and sufficiently robust, hence "WIP".
Code:
[Patterns]
Name = "<script>: Cut: Chained Ad Path URLs     9.03.20 (multi) [sd] (d.1)"
Active = TRUE
Multi = TRUE
URL = "$TYPE(htm)"
Bounds = "$NEST(<script\s,*src=$AV(*\?*,*)*,>)"
Limit = 1024
Match = "(*src=)\1$AVQ("
        "(*\?*\=)\2("
        "[^,="']+[,=]+"
        "&&"
        ",+((https+%3a)+//($LST(AdHosts-J))\8$SET(a=$GET(a) AdHj \8)|((^(^http))|(^http))("
        "$LST(AdList)$SET(a=$GET(a) \9)|(^$TST(keyword=*.a_track_s.*))"
        "((^http|[/.])|((https+%3a)+//[^/?]+)+*[/=_-])($LST(AdPaths-J)(^[a-z0-9]))\8$SET(a=$GET(a) AdPj \8)"
        "))(*(\&*)\#|*)"
        "|\#"
        ")+\#"
        ")\3"
        ""
        "&$TST(\8=*)"
        "$SET(eAdJS=$GET(eAdJS)"
        "%3Cspan class=%22ProxFly-Span%22>$GET(mHead) Chain URL:%3C/span>"
        "$ESC($GET(a))%3Cbr class=%22ProxFly-Br%22 />"
        ")"
        "($TST(volat=*.log:2*)$ADDLST(Log-Main,[$DTM(d T)]\tWEB JS_Chain_URL\t$GET(a) \t\u)|)"
        "($TST(volat=*.log:[12]*)$ADDLST(Log-Rare,WEB JS_Chain_URL\t$GET(a) \t\u)|)"
Replace = "\1\2\@\3$SET(a=)"
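
For readers who don't read Proxomitron patterns fluently, the general idea (look at each chained component separately, cut the ad ones, keep the rest) can be sketched in Python. This is only a rough illustration and not a translation of the filter above: AD_HOSTS and AD_PATHS are made-up stand-ins for the AdHosts-J / AdPaths-J lists, and the matching is a plain host/substring comparison.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical stand-ins for the AdHosts-J / AdPaths-J lists.
AD_HOSTS = {"adserver.com", "trackingserver.com"}
AD_PATHS = ("adscript", "trackingscript")


def is_ad_component(component):
    """Return True if one chained component looks like an ad/tracking script."""
    decoded = unquote(component)  # "http%3A//..." becomes "http://..."
    if decoded.lower().startswith(("http://", "https://")):
        # Offsite component: compare its host against the ad host list.
        host = urlsplit(decoded).hostname or ""
        return any(host == h or host.endswith("." + h) for h in AD_HOSTS)
    # Local component: compare its path against the ad path list.
    return any(p in decoded.lower() for p in AD_PATHS)


def cut_chained_ad_paths(src):
    """Drop ad components from a comma-chained script URL, keep the rest."""
    base, sep, query = src.partition("?")
    if not sep:
        return src  # no query string, nothing chained
    kept = [c for c in query.split(",") if c and not is_ad_component(c)]
    return base + "?" + ",".join(kept)


print(cut_chained_ad_paths(
    "http://myserver.com/load?adscript.js,requiredscript.js,trackingscript.js"))
print(cut_chained_ad_paths(
    "http://myserver.com/load?http%3A//adserver.com/x.js,requiredscript.js,"
    "http%3A//trackingserver.com/y.js"))
```

With these assumed lists, both example URLs from this post are reduced to the required script only.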



The benefit of extending the filter as described becomes especially obvious if you look at the second filter hit (as well as the resulting script) on the latter example page, after adding the entry below (found via Ghostery) to AdHosts-J:
Code:
# Ads - Lotame
[^/]++.crwdcntrl.net/$SET(7=var LOTCC={add:function(){},addAction:function()
  {},addBehavior:function(){},addInterest:function(){},addMedia:function(){},
  bcp:function(){}};)
  &&(($TST(volat=*.log:[12]*)\8&$ADDLST(Log-Rare,ALST AdHj \8 \t\u))|*)


edit: "WIP" flag removed.


RE: Cut: Chained Ad Path URLs - ProxRocks - Mar. 21, 2009 12:39 PM

the above post is showing this line:
Code:
{},addBehavior:function(){},addInterest:function(){},addMedia:function(){},
but if i view that line's source code, i see a &# 8203; (without the space) at the very end of that line...

is that 8203 supposed to be there?
i can't seem to find it in any HTML Code Table...


RE: Cut: Chained Ad Path URLs - sidki3003 - Mar. 21, 2009 12:50 PM

As long as the forum's code tag handles things correctly, all is fine. The real source code doesn't matter.
"&#8203;" usually triggers a word break ( http://www.quirksmode.org/oddsandends/wbr.html , 2nd para).


RE: Cut: Chained Ad Path URLs - ProxRocks - Mar. 21, 2009 01:11 PM

(Mar. 21, 2009 12:50 PM)sidki3003 Wrote:  As long as the forum's code tag handles things correctly, all is fine. The real source code doesn't matter.

that seems to depend upon your OS, or more specifically, your text editor...

i cut-and-pasted the above into Notepad and cannot save the file, because the paste puts a "square character" in place of that 8203, and i get a "This file contains characters in Unicode format which will be LOST if you save this file as an ANSI encoded text file" warning when i try to save...

so i cancel the save and track down "why" - it's that 8203...


RE: Cut: Chained Ad Path URLs - sidki3003 - Mar. 21, 2009 01:14 PM

Ahh okay, i didn't know that, thanks.

You should end up with a list entry that looks exactly as posted.
The line indents are especially important.
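
If a text editor does choke on the pasted character, one way to clean the entry is to strip it before saving. A minimal Python sketch, assuming the only stray character is the zero-width space (U+200B, which is what "&#8203;" encodes) and that "ANSI" means Windows code page 1252; the function name is made up for illustration:

```python
ZERO_WIDTH_SPACE = "\u200b"  # the character behind "&#8203;"


def clean_pasted_entry(text):
    """Remove zero-width spaces; everything else, indents included, stays."""
    return text.replace(ZERO_WIDTH_SPACE, "")


# The list entry as pasted, with the stray character at the end of line 2.
pasted = (
    "[^/]++.crwdcntrl.net/$SET(7=var LOTCC={add:function(){},addAction:function()\n"
    "  {},addBehavior:function(){},addInterest:function(){},addMedia:function(){},\u200b\n"
    "  bcp:function(){}};)\n"
)

cleaned = clean_pasted_entry(pasted)
# Encoding to cp1252 raises UnicodeEncodeError if U+200B is still present.
cleaned.encode("cp1252")
print(cleaned)
```

The two-space indents on the continuation lines are left untouched, since only the one character is replaced.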


RE: Cut: Chained Ad Path URLs - lnminente - Mar. 21, 2009 02:36 PM

Hi Sidki, i found this in my log file of long URLs:
http://mail.yimg.com/d/combo?/mg/5_1_20/intl/es/strings.js&/mg/5_1_20/js/msgr.js&ult/ylc_1.9.js&/mg/5_1_20/intl/es/fcue_strings.js&/mg/5_1_20/js/fcues.js


RE: Cut: Chained Ad Path URLs - sidki3003 - Mar. 21, 2009 03:01 PM

Oh - thanks. :)

I haven't seen this script concatenation with separate query params thus far, only with a comma, once or twice with a semicolon, always within the same query param.

If this method is also used for adscript/required-script mixtures, it would be interesting to see how it's embedded in the page ("&" or "&amp;", etc.). It's important that chained ad paths are intercepted in the page code (vs. headers), because otherwise other anti-adscript filters could be triggered by a chained ad path, which would also remove required components.
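
To illustrate the "&" vs. "&amp;" point, here is a small Python sketch that splits such a combo URL into its components either way. It is an assumption-laden illustration (the function name is made up, and only the "&amp;" spelling of the entity is handled), not something taken from the configs:

```python
def split_combo_url(src_attr):
    """Split a combo URL, as written in page code, into base and components."""
    # In page code the separator may be written as "&amp;" instead of "&".
    url = src_attr.replace("&amp;", "&")
    base, _, query = url.partition("?")
    return base, [part for part in query.split("&") if part]


raw = ("http://mail.yimg.com/d/combo?/mg/5_1_20/intl/es/strings.js"
       "&/mg/5_1_20/js/msgr.js&ult/ylc_1.9.js"
       "&/mg/5_1_20/intl/es/fcue_strings.js&/mg/5_1_20/js/fcues.js")

base, parts = split_combo_url(raw)
print(base)
for part in parts:
    print("  " + part)

# The entity-encoded spelling yields the same components.
assert split_combo_url(raw.replace("&", "&amp;")) == (base, parts)
```

Once the components are separated like this, each one could be tested against the ad lists on its own, as with the comma-chained form.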


RE: Cut: Chained Ad Path URLs - lnminente - Mar. 21, 2009 09:33 PM

Hi Sidki, i have a more general idea: i'm thinking we could create a filter which logs the names of the functions inside a script coming from an ad source. Later we could process that log file and create a list of functions to block.

That way it wouldn't matter how the script is served to us. Also, scripts programmed to break pages if they are not loaded could be fixed by us instead of blocking the full script file.

Let me know what you think of this idea...


RE: Cut: Chained Ad Path URLs - sidki3003 - Mar. 21, 2009 10:06 PM

Well, as far as sidki-configs are concerned, the approach is generally multi-layered where possible.
Regarding concatenated scripts:
1 - First try to cut ad/tracking paths.
2 - Then see if the individual modules have introductory comments which match a list of known ad/tracking comments.
3 - Then see if the contained function (or argument) names match an AdKeys-J entry.
4 - Then see if the function body contains ad strings.
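
The order of these four checks can be sketched in Python as follows. This is a simplified illustration only: all four lists are hypothetical placeholders (not the real sidki lists), matching is plain substring search, and point 4 scans the whole module rather than individual function bodies.

```python
import re

# All four lists are hypothetical placeholders, not the real sidki lists.
AD_PATHS = ("adscript", "/ads/")
AD_COMMENTS = ("ad server", "tracking")
AD_KEYS = ("showad", "trackclick")
AD_STRINGS = ("adserver.com",)


def first_matching_layer(url, body):
    """Return the number (1-4) of the first layer that flags a module, or None."""
    low = body.lower()
    # 1 - ad/tracking path in the module's URL
    if any(p in url.lower() for p in AD_PATHS):
        return 1
    # 2 - introductory comment matching a known ad/tracking comment
    intro = re.match(r"\s*/\*(.*?)\*/", low, re.S)
    if intro and any(c in intro.group(1) for c in AD_COMMENTS):
        return 2
    # 3 - function name matching an ad key
    names = re.findall(r"function\s+([\w$]+)", low)
    if any(k in n for n in names for k in AD_KEYS):
        return 3
    # 4 - ad strings anywhere in the module (coarse stand-in for "function body")
    if any(s in low for s in AD_STRINGS):
        return 4
    return None


print(first_matching_layer("http://x/adscript.js", ""))
print(first_matching_layer("http://x/a.js", "/* Tracking module */ function f(){}"))
print(first_matching_layer("http://x/a.js", "function showAdBanner(){}"))
print(first_matching_layer("http://x/a.js", "function f(){ load('http://adserver.com/x'); }"))
print(first_matching_layer("http://x/a.js", "function f(){}"))
```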

I don't see a way around point 1. The original reason why i wrote this filter was to prevent subsequent "block scripts by URL" filters from matching and blocking the whole enchilada, required modules included. Besides, i like that filter. :)

I assume that you have something like point 3 in mind. That's fine, but, personally, i doubt that it's sufficient.
Which reminds me... there's an updated version of this filter, too:
Code:
[Patterns]
Name = "Remove: Ad Functions I  - Names/Params     9.03.04 [jd sd] (d.2)"
Active = TRUE
URL = "($TYPE(htm)|$TYPE(js)|$TYPE(vbs))(^$TST(keyword=*.(a_ads|a_js|a_adjs|a_adfn1).*)|$TST(flag=*.adkey_j:[#*:0].*)|$TST(volat=*.clength:([#3:970]e|[#3:2400]).*))"
Limit = 32766
Match = "function$TST(script=([1s])\3*)"
        "("
        "\s(([^( ]++_|)$LST(AdKeys-J)([0-9_.:-][a-z0-9_.:-]+|))\8( $NEST(\(,\)))\4 {(^ })"
        "$SET(1=function \8\4 { return prxVoidV; /* PROX: Ad Function Blocked (Name) */ )"
        "$SET(2=Func Name)"
        "|"
        "((\s[^( ]+ |)\( )\5"
        "(([^(),]++_|)$LST(AdKeys-J)([0-9_.:-][a-z0-9_.:-]+|(^[a-z])))\8($INEST(\(,\)))\4\) $NEST({, ?*,})"
        "$SET(1=function\5\8\4\) { return prxVoidV; /* PROX: Ad Function Removed (Argument) */ })"
        "$SET(2=Func Arg )"
        ")"
        ""
        "|if \($TST(script=([1s])\3*)"
        " (([^()"']++[._]|)!+$LST(AdKeys-J)([0-9_.:-][a-z0-9_.:-]+|"|(^[a-z])))\8($INEST(\(,\)))\4\)( {+)\5"
        "$SET(1=if (0 /* PROX: Ad Routine Blocked (\8) */)\5)$SET(2=If Block )"
        ""
        "&($TST(volat=*.log:2*)$ADDLST(Log-Main,[$DTM(d T)]\tWEB JS_AdFunction I \2 \t\8\4 \t\u)|)"
        "($TYPE(htm)$SET(eAdJS=$GET(eAdJS)"
        "%3Cspan class=%22ProxFly-Span%22>$GET(mHead) \2:%3C/span>"
        " $ESC(\8\4)%3Cbr class=%22ProxFly-Br%22 />"
        ")|)"
Replace = "\1"
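
As a rough picture of what the "Func Name" branch of such a filter does (neutralize a function whose name matches an ad key by making it return immediately), here is a Python sketch. It is not equivalent to the filter above: AD_KEYS is a made-up stand-in for AdKeys-J, the regular expression only handles simple "function name(args) {" headers, and a plain "return;" is used in place of the config's own return value.

```python
import re

AD_KEYS = ("showad", "adbanner")  # hypothetical stand-ins for AdKeys-J entries

# Simple function header: name, flat argument list, opening brace.
FUNC_HEAD = re.compile(r"function\s+([\w$]+)\s*\(([^)]*)\)\s*\{")


def stub_ad_functions(js):
    """Insert an early return into functions whose name matches an ad key."""
    def replace(match):
        name, args = match.group(1), match.group(2)
        if any(key in name.lower() for key in AD_KEYS):
            return ("function %s(%s) { return; /* Ad function blocked (name) */ "
                    % (name, args))
        return match.group(0)  # not an ad function: leave the header as is
    return FUNC_HEAD.sub(replace, js)


source = "function showAdBox(a, b) { draw(a); }\nfunction initMenu() { open(); }"
print(stub_ad_functions(source))
```

The original body stays in place after the inserted return, so the function still exists for callers but no longer does anything.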



RE: Cut: Chained Ad Path URLs - lnminente - Mar. 21, 2009 10:24 PM

Nice!! That was exactly the idea, a list of forbidden functions. Veeeery good :)


RE: Cut: Chained Ad Path URLs - sidki3003 - Apr. 02, 2009 08:48 PM

Removing "WIP" flag from discussed filter...