
susa · Jan. 10, 2006, 01:46 PM

one of my favorite sites has a nagvertisement (asking for money) which is posted in cyrillic letters (russian) and english

I've unsuccessfully attempted to filter this section out, any tips appreciated

below where the * appears starts a half page long advertisement in cyrillic letters

$NEST(<table width="100%">
<tr>
<td class="block_title"><center>
<b><font color="blue">,*,
</tr>
</table>
</center>
<center>
<table width="100%">
<tr>
<td class="block_title"><center>)

Jan. 10, 2006, 06:41 PM

my experience with proxo is that you can't enter any cyrillic text into the filters as they become question marks like ???????

but I recognize you did not do this and instead tried to use a $NEST and exclude the block which apparently didn't work

sorry, I don't know either and have wondered the same thing

***JJoe*** · Jan. 10, 2006, 07:06 PM

The $NEST doesn't look correct.

Why not post a URL?
If it is a question of content, maybe you could upload complete HTML of the ad somewhere?

Kinda hard for us to help from what has been posted.

Jan. 10, 2006, 08:14 PM

the html section looks like this, but of course the cyrillic posts as question marks

the goal (I think), would be to allow anything to show but just kill the pieces that start the unwanted section

there is no way to filter based on a cyrillic word or word combination so the $NEST should work - but how?

take the section below as a "generic sample"

how could it be made to filter anything that comes after the
"<td class="content">" where the cyrillic text starts and then to stop at the "</td>" where the cyrillic text stops?

<table width="100%">
<tr>
<td class="block_title"><center>
<b><font color="blue">????? ??? ??????? ???????? !</font></b></center></td>

</tr>
</table>
<table width="100%">
<tr>
<td class="content">
??????????? ??? ????, ????????? ???????.

..lots of ??? deleted for brevity
there are 50 lines of ???? deleted for this post to show

</td>
</tr>
</table>
</center>
<center>

***Kye-U*** · Jan. 10, 2006, 10:05 PM

Is it all Cyrillic after <td class="content">? It would be much easier if there is some code or A-Z letter sequence.

The matching expression could then be something like:

$NEST(<td class="content">,*TEXT*,</td>)

***JJoe*** · Jan. 11, 2006, 05:17 AM

Guest wrote: there is no way to filter based on a cyrillic word or word combination

I wouldn't say that. ;)

(I hope the forum code doesn't screw this up)

Consider the word название, which I found on
http://www.google.com/search?q=proxomitron.ru

Quote: Vote and review this url название:

I then wrote:

Code:

[Patterns]
Name = "The Escape Filter"
Active = FALSE
URL = "$TYPE(htm)"
Limit = 400
Match = "Vote and review this url \1:"
Replace = "$STOP()$ESC(\1)"

The escape filter 'escaped' название to

Code:

%D0%BD%D0%B0%D0%B7%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5
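
The escaped string above matches the UTF-8 bytes of the word, each written as %XX. As a quick cross-check outside Proxomitron, here is a minimal Python sketch (assuming the page is served as UTF-8; the variable names are just for illustration) that reproduces the same escaping with the standard library:

```python
from urllib.parse import quote

word = "название"

# quote() percent-encodes the UTF-8 bytes of any non-ASCII character,
# which is the same form the escape filter produced for this word.
escaped = quote(word, encoding="utf-8")
print(escaped)
```

If the site used a different encoding (windows-1251, for example), the bytes and therefore the escaped string would be different, so it is worth checking the page's charset first.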

I then wrote:

Code:

[Patterns]
Name = "Removes The Word"
Active = FALSE
URL = "$TYPE(htm)"
Limit = 400
Match = "[%D0][%BD][%D0][%B0][%D0][%B7][%D0][%B2][%D0][%B0][%D0][%BD][%D0][%B8][%D0][%B5]"
Replace = "$STOP()********"

This filter removes the first occurrence of the word from the google page.
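
The Match line is just the escaped string with each %XX wrapped in brackets, so it can be generated instead of typed by hand. A rough Python sketch follows; the helper name `proxo_byte_pattern` is made up for this post and is not part of Proxomitron, and it assumes the page is UTF-8 unless told otherwise:

```python
def proxo_byte_pattern(word: str, encoding: str = "utf-8") -> str:
    """Build a Match string of the form [%XX][%XX]... for the given word.

    Hypothetical helper: it only formats the encoded bytes of the word
    the same way as the hand-written Match line in the filter above.
    """
    return "".join(f"[%{byte:02X}]" for byte in word.encode(encoding))


# Paste the printed string into the filter's Match field.
print(proxo_byte_pattern("название"))
```

Passing a different `encoding` (for instance "cp1251") gives the byte pattern for pages that are not UTF-8.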

Guest wrote: how could it be made to filter anything that comes after the
"<td class="content">" where the cyrillic text starts and then to stop at the "</td>" where the cyrillic text stops?

That could be as simple as:

Code:

[Patterns]
Name = "Test"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 2560
Match = "<td class="content">*</td>"

Just increase the Limit until the cell is removed.
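
To see why the Limit matters here, a rough Python analogy may help. This is only a sketch, not how Proxomitron's matcher actually works: it assumes the Limit caps how many bytes a single match may span, it only handles the first cell, and the function name `remove_cell` is invented for the example.

```python
OPEN = b'<td class="content">'
CLOSE = b"</td>"


def remove_cell(page: bytes, limit: int) -> bytes:
    """Drop the first <td class="content">...</td> cell if it fits in `limit` bytes."""
    start = page.find(OPEN)
    if start == -1:
        return page  # no such cell on the page
    end = page.find(CLOSE, start + len(OPEN))
    if end == -1:
        return page  # cell is never closed
    end += len(CLOSE)
    if end - start > limit:
        return page  # match would span more bytes than the limit allows
    return page[:start] + page[end:]


# Cyrillic letters take two bytes each in UTF-8, so the cell is longer than it looks.
page = '<tr><td class="content">привет</td></tr>'.encode("utf-8")
print(remove_cell(page, 36))  # limit too small: page comes back unchanged
print(remove_cell(page, 37))  # limit large enough: the cell is removed
```

A half page of Cyrillic text can easily run to a few thousand bytes, which is why the Limit may need to go well above what the visible character count suggests.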

Now some things to consider are:
Each letter of название is more than one byte.
It could be coded as
& #1085;& #1072;& #1079;& #1074;& #1072;& #1085;& #1080;& #1077;
minus the spaces or maybe ...
So you'll need a Limit of at least 16 to match it.
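
The byte counts can be checked with a few lines of Python. This is only a sketch, assuming UTF-8 for the raw form and decimal numeric character references for the entity form:

```python
word = "название"

# Raw UTF-8: every Cyrillic letter here is two bytes.
utf8_len = len(word.encode("utf-8"))

# Decimal numeric character references, one per letter.
entities = "".join(f"&#{ord(ch)};" for ch in word)

print(utf8_len)       # 16
print(len(entities))  # 56
```

So a page that uses the entity form would need a pattern written against the entities, and a Limit of at least 56 for this word rather than 16.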

I try to use $NEST only when there is or may be a nest.
More at
http://www.proxomitron.info/45/help/Matc....html#NEST
and
http://www.proxomitron.info/45/help/Matc...html#INEST

HTH