help in filtering cyrillic letters
Jan. 10, 2006, 01:46 PM
Post: #1
one of my favorite sites has a nagvertisement (asking for money) which is posted in Cyrillic letters (Russian) and English

I've tried to filter this section out without success; any tips appreciated

below, the * marks where a half-page-long advertisement in Cyrillic letters starts

$NEST(<table width="100%">
<tr>
<td class="block_title"><center>
<b><font color="blue">,*,
</tr>
</table>
</center>
<center>
<table width="100%">
<tr>
<td class="block_title"><center>)
Jan. 10, 2006, 06:41 PM
Post: #2
 
my experience with Proxo is that you can't enter any Cyrillic text into the filters, as it turns into question marks like ???????

but I recognize you did not do this and instead tried to use a $NEST to exclude the block, which apparently didn't work

sorry, I don't know either and have wondered the same thing
Jan. 10, 2006, 07:06 PM
Post: #3
 
The $NEST doesn't look correct.

Why not post a URL?
If it is a question of content, maybe you could upload complete HTML of the ad somewhere?

Kinda hard for us to help from what has been posted.
Jan. 10, 2006, 08:14 PM
Post: #4
 
the HTML section looks like this, but of course the Cyrillic posts as question marks

the goal (I think), would be to allow anything to show but just kill the pieces that start the unwanted section

there is no way to filter based on a cyrillic word or word combination so the $NEST should work - but how?

take the section below as a "generic sample"

how could it be made to filter anything that comes after the
"<td class="content">" where the cyrillic text starts and then to stop at the "</td>" where the cyrillic text stops?




<table width="100%">
<tr>
<td class="block_title"><center>
<b><font color="blue">????? ??? ??????? ???????? !</font></b></center></td>

</tr>
</table>
<table width="100%">
<tr>
<td class="content">
??????????? ??? ????, ????????? ???????.

..lots of ??? deleted for brevity
there are 50 lines of ???? deleted for this post to show

</td>
</tr>
</table>
</center>
<center>
Jan. 10, 2006, 10:05 PM
Post: #5
 
Is it all Cyrillic after <td class="content">? It would be much easier if there is some code or A-Z letter sequence.

The matching expression could then be something like:

$NEST(<td class="content">,*TEXT*,</td>)
Jan. 11, 2006, 05:17 AM
Post: #6
 
Guest Wrote: there is no way to filter based on a cyrillic word or word combination
I wouldn't say that. ;)
(I hope the forum code doesn't screw this up)

Consider the word название, which I found on
http://www.google.com/search?q=proxomitron.ru
Quote: Vote and review this url название:

I then wrote:
Code:
[Patterns]
Name = "The Escape Filter"
Active = FALSE
URL = "$TYPE(htm)"
Limit = 400
Match = "Vote and review this url \1:"
Replace = "$STOP()$ESC(\1)"

The escape filter 'escaped' название to
Code:
%D0%BD%D0%B0%D0%B7%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5
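
The escaped string can be cross-checked outside Proxomitron. Below is a minimal Python sketch, assuming the page is served as UTF-8; it rebuilds the word from its numeric character references and percent-escapes it with the standard library. The variable names are only illustrative.

```python
import html
from urllib.parse import quote, unquote

# The word written as numeric character references (pure ASCII source).
entity_form = "&#1085;&#1072;&#1079;&#1074;&#1072;&#1085;&#1080;&#1077;"
word = html.unescape(entity_form)  # decode the references into Cyrillic letters

# quote() percent-escapes the UTF-8 bytes of the word (UTF-8 is its default).
escaped = quote(word)
print(escaped)

# Round trip: unescaping gives the original word back.
print(unquote(escaped) == word)
```

If the printed string differs from the one the escape filter produced, the page is probably not UTF-8 and the byte values in the next filter would need to change accordingly.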

I then wrote:
Code:
[Patterns]
Name = "Removes The Word"
Active = FALSE
URL = "$TYPE(htm)"
Limit = 400
Match = "[%D0][%BD][%D0][%B0][%D0][%B7][%D0][%B2][%D0][%B0][%D0][%BD][%D0][%B8][%D0][%B5]"
Replace = "$STOP()********"

This filter removes the first occurrence of the word from the Google page.
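
Typing sixteen byte tokens by hand is error-prone, so here is a small Python sketch that builds the same kind of Match string for any word. It assumes UTF-8 and simply mirrors the [%XX] byte notation used in the filter above; `to_byte_pattern` is a hypothetical helper name, not part of Proxomitron.

```python
import html

def to_byte_pattern(word: str, encoding: str = "utf-8") -> str:
    """Return one [%XX] token per encoded byte of `word`."""
    return "".join(f"[%{byte:02X}]" for byte in word.encode(encoding))

# Rebuild the Cyrillic word from numeric character references (ASCII-only source).
word = html.unescape("&#1085;&#1072;&#1079;&#1074;&#1072;&#1085;&#1080;&#1077;")

pattern = to_byte_pattern(word)
print(pattern)
```

The printed line can be pasted into the Match field of a filter like "Removes The Word".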

Guest Wrote: how could it be made to filter anything that comes after the
"<td class="content">" where the cyrillic text starts and then to stop at the "</td>" where the cyrillic text stops?

That could be as simple as:
Code:
[Patterns]
Name = "Test"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 2560
Match = "<td class="content">*</td>"

Just increase the Limit until the cell is removed.
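
For anyone who wants to try the idea outside Proxomitron first, here is a rough Python analogy, not Proxomitron syntax. It assumes the cell holds no nested </td>, and it treats the limit as a cap on the characters between the two tags, which only approximates how Limit works. `strip_cell` and `sample` are made-up names for illustration.

```python
import re

def strip_cell(page: str, limit: int = 2560) -> str:
    """Remove each <td class="content">...</td> span whose body fits in `limit` characters."""
    # DOTALL lets "." cross line breaks; the lazy quantifier stops at the first </td>.
    cell = re.compile(r'<td class="content">.{0,%d}?</td>' % limit, re.DOTALL)
    return cell.sub("", page)

# A cut-down version of the posted sample, with ? standing in for the Cyrillic text.
sample = '<tr>\n<td class="content">\n???? ???\n</td>\n</tr>'

print(repr(strip_cell(sample)))               # cell removed
print(strip_cell(sample, limit=5) == sample)  # limit too small: page left unchanged
```

The second call shows why the cell survives when the limit is too low: the closing tag lies beyond the allowed span, so nothing matches.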

Now, some things to consider:
Each letter of название is more than one byte in UTF-8.
It could be coded as
& #1085;& #1072;& #1079;& #1074;& #1072;& #1085;& #1080;& #1077;
minus the spaces or maybe ...
So you'll need a Limit of at least 16 to match the UTF-8 form (8 letters, 2 bytes each), and more for the entity form.
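
The arithmetic behind that Limit can be checked with a few lines of Python. This is only a sketch covering the two codings mentioned here (UTF-8 bytes and numeric character references); other encodings would give other lengths.

```python
import html

# The word as numeric character references, without the spaces.
entity_form = "&#1085;&#1072;&#1079;&#1074;&#1072;&#1085;&#1080;&#1077;"
word = html.unescape(entity_form)

letters = len(word)                   # number of Cyrillic letters
utf8_len = len(word.encode("utf-8"))  # bytes on the wire if the page is UTF-8
entity_len = len(entity_form)         # bytes on the wire if sent as references

print(letters, utf8_len, entity_len)
```

So a filter matching the entity form needs a noticeably larger Limit than one matching the raw UTF-8 bytes.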

I try to use $NEST only when there is or may be a nest.
More at
http://www.proxomitron.info/45/help/Matc....html#NEST
and
http://www.proxomitron.info/45/help/Matc...html#INEST

HTH