$NEST command behavior
Jul. 13, 2004, 06:34 PM
Post: #1
 
Hi all,

I've noticed some odd behavior with the $NEST command.

When using the $NEST command in the bounds match, what is the difference in entering:
Code:
$NEST(<table,</table>)
    VS
$NEST(<table*>,</table>)

When I "test" a web filter, I don't see any apparent difference. Yet when loading pages, the filter matches differently depending on which bounds I use. Here is the filter in question:

Code:
[Patterns]
Name = "Small Table Killer1"
Active = TRUE
Bounds = "$NEST(<table,</table>)"
Limit = 1500
Match = "<(\w)\0*=$AV($LST(AdList)*)*"
Replace = "<proxo killed \0 with \9 />"

Name = "Small Table Killer2"
Active = TRUE
Bounds = "$NEST(<table*>,</table>)"
Limit = 1500
Match = "<(\w)\0*=$AV($LST(AdList)*)*"
Replace = "<proxo killed \0 with \9 />"

Here's a link to a page that does this: http://dir.yahoo.com/Computers_and_Internet/Internet/

If anybody can shed some light on this, I'd appreciate it.

Thanks
Mike
Jul. 13, 2004, 09:04 PM
Post: #2
 
I've once heard that a rule of thumb is to NOT have an * in the bounds parameter...

But I'm with you - waiting until someone has a better answer...
All I can say to this point is that the method without the * is the most common scheme...
Jul. 13, 2004, 11:53 PM
Post: #3
 
z12:

I must be doing something wrong - I can't get either filter to make a difference when I surf to your test site.

As for using an asterisk in the bounds field, it only makes sense that if you allow the boundary checker to continue, then it will match anything and everything until it finds the ending boundary. That will take up CPU cycles, if nothing else. Plus, what does it do to the position of the matching cursor when it comes time to start the match process? I have a feeling that the cursor won't be where you expect it to be, and essentially, your match will always fail.

At least, that's the results I got. When I disabled both of these filters, the page displayed in the exact same way, leading me to believe that one might not need these filters after all. But my mileage may be varying from that of other forum members. Any one else wanna chime in here?


Oddysey

I'm no longer in the rat race - the rats won't have me!
Jul. 14, 2004, 02:42 AM
Post: #4
 
Hi all,

Ok, AdList is a custom list of mine, that may be why you saw no difference.

To eliminate that difference for this test, replace the matching expression like so:
Code:
Match = "<(\w)\0*=$AV(*.overture.*)*"

For me, this is very repeatable. The Table I'm having issues with is the sponsored link for juno. Filter_1 always fails, Filter_2 will always match.

The thing is, I prefer Filter_1. I seem to have layout issues on other sites when I use the Filter_2 format for boundary matching.

It's a mystery to me.

Mike
Jul. 14, 2004, 07:43 AM
Post: #5
 
Mike;

OK, now I feel dumb. After adding your modification of *.overture.*, both filters work for me - they both remove the BS table on the far right. Not coincidentally, they also remove another much smaller table at the bottom.

To gain an understanding of why this was so, I checked the source code for that page. I think what happens is this: Your "asterisk" version of the filter effectively doesn't need a limit on the boundary - it's gonna match until the cows come home. However, the "asterisk-less" version is set to only 1500 bytes. The code for that table says.... 6970 bytes! I increased the limit to 8000, and everything started working correctly.

Now, what happens when Yahoo decides to no longer use overture?


Oddysey

Jul. 14, 2004, 01:03 PM
Post: #6
 
Hi Oddysey,

hmmm, that's interesting. For some reason, the bounds limit of 1500 doesn't seem to be working for you. When I check that page, I get 7 matches for my "small table killer" filter, and all the tables are less than 1500.

As for overture, since yahoo bought them, I think we'll be (not) seeing them for a while.

Attached is a pic of the filtered page:

Mike

Edit: I attached a jpg but that didn't seem to work.
Jul. 14, 2004, 01:42 PM
Post: #7
 
$NEST() is a specialized command. It's very fast, but it won't always match when standard bounds do.
And sometimes it fails.

You can boil down the problem that filter 1 doesn't catch all ad tables on that Yahoo page to this:
Why doesn't $NEST(<table,</table >) see the first closing tag in this string?

<table><tr><td>Juno's</td></tr></table><table><tr><td>Juno's</td></tr></table>

$NEST() does some sort of quote checking for other reasons (i think it had something to do with document.write strings, not sure) and ignores the </table> within the single quotes.
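The quote checking described above can be modelled in a few lines of Python. This is only a toy model, assuming a matcher that skips everything between a quote character and the next identical quote; `nest_end` and its `quote_aware` flag are hypothetical names made up for this illustration and say nothing about how Proxomitron is actually implemented.

```python
def nest_end(text, open_tag, close_tag, quote_aware=True):
    """Return the index just past the close tag that balances the first
    open tag, or -1 if no balanced close tag is found.

    Toy model only: when quote_aware is True, everything between a quote
    character and the next identical quote is skipped (assumption).
    """
    start = text.find(open_tag)
    if start < 0:
        return -1
    depth = 0
    quote = None  # the quote character we are currently inside, if any
    i = start
    while i < len(text):
        ch = text[i]
        if quote_aware and quote is not None:
            if ch == quote:
                quote = None  # closing quote found, resume tag checking
            i += 1
            continue
        if quote_aware and ch in "'\"":
            quote = ch  # opening quote, ignore tags until it closes
            i += 1
            continue
        if text.startswith(close_tag, i):
            depth -= 1
            i += len(close_tag)
            if depth == 0:
                return i
            continue
        if text.startswith(open_tag, i):
            depth += 1
            i += len(open_tag)
            continue
        i += 1
    return -1


one_table = "<table><tr><td>Juno's</td></tr></table>"
sample = one_table * 2  # the test string from this post

# Without quote checking the first </table> ends the match (39 characters).
print(nest_end(sample, "<table", "</table>", quote_aware=False))
# With quote checking the two apostrophes pair up, the first </table> is
# skipped, and the match runs to the end of the second table (78 characters).
print(nest_end(sample, "<table", "</table>", quote_aware=True))
```

In this model the apostrophe in the first "Juno's" opens a quoted run that only ends at the apostrophe in the second "Juno's", so the first closing tag is never seen.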

That filter 2 matches is a false hit, so to speak: the ">" that it matches is not the one that belongs to that table tag but to the next one. $NEST(<table[^>]+>,</table >) doesn't match.

$NEST() is not really intended to be used like filter 2; there were a couple of discussions with Scott about that a while back. If you need to match some content within the opening tag you can use $INEST(), which is a powerful command and far too rarely used imo.

<table[^>]++this=that$INEST()(<table,</table)</table >

A very nice thing about it is if you omit the closing tag, it will still match everything except the latter, thereby allowing other filters to match it, without setting the first filter to "multi".


sidki
Jul. 15, 2004, 02:00 AM
Post: #8
 
Hi sidki3003,

Sorry about the slow reply, but I had to think about what you said.

What you said about the $NEST command not always matching the closing tag seems to be the case here. It's somewhat of a relief to know that there are known issues with it...I thought that maybe I was losing my mind.

Since I'm just trying to remove un-nested tables with this filter, I've just modified my filter to ensure no other table tag is included in the bounds.
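The modified filter itself isn't shown in the thread. As a rough illustration of the idea only (bounds that refuse to cross another opening table tag), here is a sketch using Python's `re` module rather than Proxomitron syntax. The name `UNNESTED_TABLE` and the sample strings are assumptions made up for this example.

```python
import re

# A table whose body contains no further "<table": the negative lookahead
# stops the scan as soon as another opening table tag would be consumed.
UNNESTED_TABLE = re.compile(
    r"<table\b[^>]*>(?:(?!<table\b).)*?</table\s*>",
    re.IGNORECASE | re.DOTALL,
)

nested = "<table><tr><td><table>inner</table></td></tr></table>"
flat = "<table><tr><td>Juno's</td></tr></table>" * 2

# Only the innermost table qualifies in the nested sample.
print(UNNESTED_TABLE.findall(nested))
# Both tables qualify in the flat sample; the apostrophes play no role
# because this approach does no quote checking at all.
print(len(UNNESTED_TABLE.findall(flat)))
```

Because nothing here counts nesting or looks at quotes, stray apostrophes in the body text cannot hide a closing tag from it. The cost is that it can only ever match innermost tables.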

Normally I don't have a problem with $NEST, but with the table tag, I know I've had problems before. Oddly enough, I've never noticed the issue with td or div tags, which are other commonly nested tags. Have you ever heard of a problem with using $NEST with tags that shouldn't be nested, such as STYLE or A tags? Normally for tags that have opening & closing tags I use $NEST in my bounds check as shown in filter 1.

Mike
Jul. 15, 2004, 04:06 AM
Post: #9
 
Hi Mike,

Hmm... at some time at Arne's board everyone started to use $NEST(<a\s,</a>) instead of <a\s*</a> because of the speed gain. Scott said several times that that isn't a good idea. We didn't really understand why and didn't know about that quote problem either, but silently went back to the old bounds.

So i try to avoid using $NEST() for non-nested tags and hence don't know how problematic it is, but i could imagine that things like <a href="foo">Juno's</a><a href="foo">Juno's</a> aren't that rare.

sidki
Jul. 15, 2004, 11:03 AM
Post: #10
 
Hi sidki,

Ok, it looks like I have some filters to modify.

Thanks
Mike
Aug. 01, 2004, 08:33 PM
Post: #11
 
Just came across this, which may clarify things (the Yahoo ad tables are all in one huge line):
Quote:--- In prox-list@y..., Mona <...> wrote:

> Since this is OUTSIDE of the <a*</a> NEST bounds -- and that is confirmed
> through the resulting \2 variable and the placement of the new add-on
> material -- this seems like it might be a bug...?
>

Thanks for explaining. No it's not a bug, just a problem with trying to interpret quotes - one for which there's probably no good answer. Normally NEST ignores anything within quotes - this prevents tags in quoted strings from affecting it. Very useful for parsing through or around JavaScripts especially. Unfortunately, random unmatched quotes in the body text of a web page can cause trouble. As it stands, it'll ignore any quote if it can't find a match to it on the same line - this helps, but isn't perfect.

I could also only consider quotes within tags <...>, but this would fail for JavaScript strings, and they often have embedded tags. The question is how to tell a meaningful quote from random body text considering you may be starting at any point (even inside a script) and don't know the entire context of everything that came before in the page. Currently it does the best it can, but it's probably never going to be 100%.
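The same-line rule Scott describes ("it'll ignore any quote if it can't find a match to it on the same line") can be sketched as follows. This is a simplified Python model under that one stated rule, not Proxomitron's code; `quoted_spans` is a hypothetical helper name and the sample strings are invented.

```python
def quoted_spans(text):
    """Return (start, end) index pairs for the runs a same-line quote
    rule would treat as quoted. A quote character only opens a run if an
    identical quote character occurs later on the same line; otherwise
    it is ignored as ordinary text.
    """
    spans = []
    offset = 0  # index of the current line's first character in text
    for line in text.splitlines(keepends=True):
        i = 0
        while i < len(line):
            ch = line[i]
            if ch in "'\"":
                j = line.find(ch, i + 1)
                if j != -1:
                    spans.append((offset + i, offset + j + 1))
                    i = j + 1
                    continue
            i += 1
        offset += len(line)
    return spans


one_line = "<td>Juno's</td></table><table><td>Juno's</td>"
two_lines = "<td>Juno's</td></table>\n<table><td>Juno's</td>"

# On a single line the two apostrophes pair up and swallow the </table>.
for start, end in quoted_spans(one_line):
    print(one_line[start:end])
# Split across two lines, neither apostrophe finds a partner: no spans.
print(quoted_spans(two_lines))
```

Under this model the trouble only arises when both stray apostrophes land on the same line, which fits the observation that the Yahoo ad tables sit on one huge line.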

-Scott-
Aug. 02, 2004, 02:28 PM
Post: #12
 
Hi sidki,

Thanks for that info... that explains much.

Usually, when I had problems with $NEST some js was involved in the match.

I had been fooling around with $INEST a bit, as I wasn't sure if that had the same problem, to see if I could get that to do what I wanted.

Here is the last version of a sponsored link table killer I was fooling with:
Code:
[Patterns]
Name = "Sponsored Link Table Killer2"
Active = FALSE
Bounds = "<table*>$INEST(<table,</table>)</table>"
Limit = 8191
Match = "<(\w)\0(*>[^<a-z]+(Sponsored|Advertisement)[^<]+<"
"&&(^*<table)*)*"
Replace = "<proxo name="PDomTarget" title="\0:\1" />"

I think it was working ok, but I think the last issue I had was that the byte limit was too high.

Besides fooling with $INEST, the idea behind the filter was to "sneak up" on the table tag that was closest to ">sponsored link<", as I noticed that the $NEST command would sometimes match several table tags ahead of where I wanted it to.

I haven't been trying to improve this filter lately as I have been playing with using the "dom container killer" javascript to see if I can get it to do the same thing, and then some. So far, that is looking very promising.

Mike
Aug. 02, 2004, 02:57 PM
Post: #13
 
Hi Mike,

That's funny - i was playing with the same idea!
It works pretty well here, too. I make sure that the ad'ish string isn't more than six tags away from the opening table tag.
Code:
[Patterns]
Name = "<table> Remove: Ad Tables II     4.06.22 [s] (d.3)"
Active = TRUE
URL = "$TYPE(htm)(^$TST(keyword=*.(a_ads|a_adtab).*))"
Bounds = "<table(([^>]+(>)\#)++{3,6})\3[^a-z]+(advertisem|sponsor(ed|))\1 ([^<>]+)\2<$INEST(<table,</table)</table >"
Limit = 6000
Match = "*&"
"(^$TST(\3=*<(script|table)*))"
"(^$TST(script=*)|$TST(textarea=1)|$TST(comment=1)|$TST(noscript=1))$SET(table=)"
Replace = "<span id="prxTable" class="Prox" style="display:none">"
"<div id="prxTable" class="Prox" style="text-align:center">"
"? Ad Table II: \@\1\2</div></span>"
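The "no more than six tags away" check in the bounds above can be approximated with Python's `re` module. This is a loose translation of just the `{3,6}` part for illustration, assuming that each repetition runs up to the next ">"; it is not equivalent to the full filter, and `AD_NEAR_TABLE` and `ad_near_table_start` are hypothetical names.

```python
import re

# After "<table", allow between 3 and 6 runs that each end at the next
# ">", then require only non-letter characters before the ad keyword.
AD_NEAR_TABLE = re.compile(
    r"<table(?:[^>]+>){3,6}[^a-zA-Z]+(?:advertisem|sponsor)",
    re.IGNORECASE,
)


def ad_near_table_start(html):
    """True if an ad-like keyword follows within 3 to 6 tag ends of an
    opening table tag (rough model of the bounds above)."""
    return AD_NEAR_TABLE.search(html) is not None


near = '<table width="100"><tr><td> Sponsored Links</td></tr></table>'
far = (
    "<table border=0>"
    + "<tr><td>x</td></tr>" * 3
    + "<tr><td> Sponsored Links</td></tr></table>"
)

print(ad_near_table_start(near))  # keyword right after the third ">"
print(ad_near_table_start(far))   # keyword is more than six ">" away
```

Capping the repetition count is what keeps the match from latching onto a table tag far above the ad text.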

$INEST does the same quote checking as $NEST tho, but i run into problems with it very rarely.

I saw your DOM container post, pretty interesting stuff! Unfortunately i can't use it because i rely on making the kills visible if needed (toggling display none/inline). And on my slow machine the empty space is hidden with a considerable delay (apparently time to upgrade).
But it's good to know that a JS/DOM expert is around. I get lost there often enough. *lol*


sidki
Aug. 02, 2004, 04:57 PM
Post: #14
 
Hi sidki

I like the way that filter makes sure it's close to the "adish" tag. I've seen before where my filter matched to the closest table tag, but it wasn't close enough. Very nice.

Hmmm, that's an interesting idea about toggling visibility. I think I'll look into modifying the dom container killer to see if I can do that. It sure would make it easier to tweak it, as right now it's hard to see what is being removed. I'll have to think about that for a bit.

Oh, and I wouldn't consider myself a js/dom expert. I mostly just struggle through it.

Mike