The Un-Official Proxomitron Forum
Some text matching questions



Some text matching questions - Vendettta - Sep. 10, 2008 05:05 AM

What if you wanted to replace the word "pink" with the word "red" (and "pinks" with "reds") on all web pages, but didn't want to change words that contain "pink", such as "pinkie"? And at the same time, avoid breaking tags that contain things like "pink.gif" or "mypink.css"?

Another text oddity that I've been unable to solve is the exclamation point. What would be the best way to change all exclamation points in the page text to periods? A simple match on "\!" is the obvious approach, but it also destroys comments and JavaScript.

In both instances the problem seems to be separating page text from code. I know these may seem like strange examples, but I'm trying to learn how the matching language works.


RE: Some text matching questions - ProxRocks - Sep. 10, 2008 04:21 PM

how 'bout:

Match = "pink(s|)\1 "
Replace = "red\1 "

INCLUDE the trailing SPACE in the search...


RE: Some text matching questions - Vendettta - Sep. 10, 2008 07:52 PM

It seems to handle the singular/plural correctly, but it replaces "pinkie" with "red ie" and, most importantly, breaks links. For example, the link
http://www.codepink.com
becomes
http://www.codered .com

Both the space and the replaced text cause the link to be invalid. I tested it without the space (in the replace category) and it solved the space issue, but the problem still seems to be getting the filter to target page text but not tags.

Also, in the matching category, what is the purpose of the "or" expression within the parentheses since nothing follows it?


RE: Some text matching questions - lnminente - Sep. 10, 2008 08:09 PM

if you want to isolate pink, then put \s before and after
Match = "\spink(s|)\1\s"

Answering your question, (s|) is to match it with or without s.
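
As a rough illustration in Python (not Proxomitron syntax; the function name and sample text are made up for this sketch), the same idea of an optional "s" with whitespace on both sides looks like this. Because the whitespace is part of the match, the replacement has to put it back, and occurrences next to punctuation or at the very start of the text are skipped:

```python
import re

# Rough Python analogue of "\spink(s|)\1\s":
# whitespace, "pink", an optional "s" (captured), whitespace.
WHITESPACE_DELIMITED = re.compile(r"\spink(s?)\s", re.IGNORECASE)


def replace_whitespace_delimited(text: str) -> str:
    # The surrounding whitespace is consumed by the match,
    # so it is re-added around the replacement here.
    return WHITESPACE_DELIMITED.sub(lambda m: " red" + m.group(1) + " ", text)


sample = "a pink hat, two pinks here, hot pink. pinkie"
print(replace_whitespace_delimited(sample))
```

Note that "pink." and "pinkie" are left alone, and so is the second word in " pink pink ", because the first match already consumed the space between them.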


RE: Some text matching questions - Vendettta - Sep. 11, 2008 03:29 AM

Matching "\spink(s|)\1\s" and replacing with "red\1" produced some strange results. The good part is that it seems to ignore most tags, so that's improved. But for page text, it ignores many instances of the word and replaces others. This page is a good example:
http://en.wikipedia.org/wiki/Pink_(singer)

The behavior I'm looking for is to replace "pink" with "red" and "pinks" with "reds" on all page text but to ignore anything within a tag. And also ignore other words with that root, such as "pinky"


RE: Some text matching questions - ProxRocks - Sep. 11, 2008 12:18 PM

"[a-z0-9]" will match for any alphanumeric character...

set your match to "(^[a-z0-9.:_\\-])pink(^[a-z0-9.:_\\-])"

this will match "pink" but will NOT match "1pink", "\pink" (you need TWO \'s in the pattern; the first is the "proxo escape character"), "pink.", ":pink", "pink:", "pink1", "pinkie", et cetera...

the .:_\\- portion will need to include every possible character that you want to NOT match when before or after "pink"...
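
For comparison, here is a minimal Python sketch of the same "not preceded by / not followed by" idea, using lookarounds. This is an analogy only, not Proxomitron syntax; `BLOCKED`, `PINK` and `replace_pink` are names invented for the example, and the blocked character set is simply copied from the match above:

```python
import re

# Characters that must NOT sit directly before or after "pink".
# Copied from the bracket contents in the match above; extend as needed.
BLOCKED = r"[a-z0-9.:_\\-]"

# Lookbehind/lookahead check the neighbours without consuming them,
# so nothing around the word is lost in the replacement.
PINK = re.compile(rf"(?<!{BLOCKED})pink(?!{BLOCKED})", re.IGNORECASE)


def replace_pink(text: str) -> str:
    return PINK.sub("red", text)


print(replace_pink("a pink hat, pinkie, pink.gif, 1pink"))
```

One side effect of blocking "." and "s" is that a sentence-ending "pink." and the plural "pinks" are also skipped, so the blocked set needs tuning for real page text.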


edit: really the only way to prevent from matching within a "tag" is to set up a "preserve" filter at the "top" of your config that is NOT a "multi=TRUE" so that filters below it will NOT search for matches that the first filter replaced content with...


RE: Some text matching questions - Kye-U - Sep. 11, 2008 01:10 PM

This is a pretty interesting exercise!

Code:
[Patterns]
Name = "Pink to Red"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 5
Match = "(<$SET(open=1)|>$SET(open=))PrxNeverMatch"
        "|pink(?|)\1(^$TST(open=1))$TST(\1=(s|))"
Replace = "red\1"

What this does is it sets a variable ("open") to 1 every time a "<" character is matched. It empties it out when a ">" character is matched. PrxNeverMatch is just there to make sure it doesn't replace those tags.

You'll then note that there is a pipe separating the two lines, so it can match every occurrence of "pink" on the page, test whether it's inside a tag, and replace it with "red" if it isn't. If there's a character after "pink", it'll capture it in the \1 variable and only replace if that character is an "s" or if it doesn't exist at all. The byte limit is 5 because the text that the filter is matching can be a maximum of 5 characters, "pinks".

The ? in (?|) matches any single character. Wrapping it in ( |) means "match any character OR match nothing"; matching nothing gives us the flexibility of matching a string even if a character (or string or expression) doesn't exist. If the matching expression is test(s|) and the test string is "test tests", then it would match both words.
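
The "open" flag idea can be sketched outside Proxomitron too. The following Python is only an analogy under stated assumptions: `replace_outside_tags` is a hypothetical helper, and the lookbehind/lookahead around the word are an addition of this sketch (the filter above only tests the one character after "pink"):

```python
import re

# "pink" or "pinks" as a whole word; the optional "s" is captured.
WORD = re.compile(r"(?<![a-z0-9])pink(s?)(?![a-z0-9])", re.IGNORECASE)


def replace_outside_tags(html: str) -> str:
    out = []
    inside_tag = False  # plays the role of the "open" variable
    # Split on "<" and ">" but keep them, so the flag can be toggled.
    for piece in re.split(r"([<>])", html):
        if piece == "<":
            inside_tag = True
            out.append(piece)
        elif piece == ">":
            inside_tag = False
            out.append(piece)
        elif inside_tag:
            out.append(piece)  # tag contents pass through untouched
        else:
            out.append(WORD.sub(lambda m: "red" + m.group(1), piece))
    return "".join(out)


print(replace_outside_tags('<a href="pink.gif">pink</a> pinks pinkie'))
```

Like the filter, this simple flag is fooled by a stray "<" or ">" inside scripts or comments, so it is a sketch rather than a real HTML parser.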


RE: Some text matching questions - Vendettta - Sep. 12, 2008 03:40 AM

Wow. This is a lot more complex than I expected and a bit over my head. I'm going to have to study these examples and review the matching language page in depth. I'm glad the example is interesting to someone because I had worried that it might come across as kind of pointless. But these kinds of examples that produce obvious results really help me wrap my mind around what Proxomitron is doing with the various matching rules. I can tell this program has amazing potential, but it's kind of frustrating not being able to make my own filters.

I've tested the above two from ProxRocks & Kye-U and seem to be getting similar results. On the Wikipedia page it's still leaving some examples of "pink". I think this is because "pink" often follows a "<p>" tag even though it's not within the actual tag. And for some reason, it's eating the spaces before and after the word. I've tried
replace = " red\1 "
and
replace = "red\1"
but the space doesn't make a difference

It's also not preserving the capitalization, but from reading the prox rules, that may not be possible.


RE: Some text matching questions - ProxRocks - Sep. 12, 2008 08:29 AM

for the match, wherever pink occurs, you could replace that with (P$SET(0=Red)|p$SET(0=red))ink, then the replace would be \0\1 instead of red\1...

i "think" that will work for the capitalization...
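
The same "pick the replacement based on the first letter" idea, as a small Python sketch. It is an analogy, not Proxomitron syntax; `replace_keep_case` and `match_case` are invented names, and only the capitalization of the first letter is carried over (so "PINK" becomes "Red", not "RED"):

```python
import re

# Capture the first letter separately so its case can be inspected.
WORD = re.compile(r"(?<![a-z0-9])(p)ink(s?)(?![a-z0-9])", re.IGNORECASE)


def match_case(m: re.Match) -> str:
    # Mirror the case of the original "p"/"P" onto "r"/"R".
    first = "R" if m.group(1).isupper() else "r"
    return first + "ed" + m.group(2)


def replace_keep_case(text: str) -> str:
    return WORD.sub(match_case, text)


print(replace_keep_case("Pink and pinks"))
```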


RE: Some text matching questions - Guest - Sep. 13, 2008 04:03 AM

I'm not sure I understand. Is that meant to be used in addition to one of the above filters? If implemented like this:
Match = "(P$SET(0=Red)|p$SET(0=red))ink"
Replace = "\0\1"
on the Wiki page, it preserves case in some instances, but not in others, which is an improvement on that aspect of it. But it's still eating spaces and breaking links.


RE: Some text matching questions - ProxRocks - Sep. 13, 2008 02:30 PM

i was thinking more along the lines of:
Code:
[Patterns]
Name = "Pink to Red"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 5
Match = "(<$SET(open=1)|>$SET(open=))PrxNeverMatch"
        "|pink(?|)\1(^$TST(open=1))$TST(\1=(s|))"
Replace = "red\1"

changing to:
Code:
[Patterns]
Name = "Pink to Red"
Active = TRUE
URL = "$TYPE(htm)"
Limit = 5
Match = "(<$SET(open=1)|>$SET(open=))PrxNeverMatch"
        "|(P$SET(0=Red)|p$SET(0=red))ink(?|)\1(^$TST(open=1))$TST(\1=(s|))"
Replace = "\0\1"



RE: Some text matching questions - lnminente - Sep. 13, 2008 04:29 PM

Testing with: < pinks> 0pink apink Pinkie pinks pink pinks pinksie

Code:
[Patterns]
Name = "<example> Variables: Pink to Red"
Active = FALSE
Multi = TRUE
URL = "$TYPE(htm)"
Limit = 6
Match = "(<$SET(open=1)|>$SET(open=))ThisHereIsToNotMatchAndPreventLoseOfTags|"
        "(\spink(s|)\1\s(^$TST(open=1)))"
Replace = " red\1 "

Note: it doesn't work well because of the byte limit; it would need to be included inside the filter...
Note 2: nothing to add about capital letters


RE: Some text matching questions - Vendettta - Sep. 16, 2008 06:25 AM

Still testing them with the Wikipedia page linked above: on that page, 51 instances of "pink" remained in the main (non-link) text after ProxRocks' last two filters, and 9 remained after lnminente's.

I'm slowly coming to the conclusion that this just can't be done. Similar problems seem to exist with every tweak of the filter. Not that I'm complaining. I really appreciate the effort to find a solution from all who've tried. I've certainly had no better success with anything I've done.

One other thought occurred to me though: would blocklists help with this? For example, I know that URLs can be redirected or "jumped" in IncludeExclude, but how about replacing text?


RE: Some text matching questions - Guest - Sep. 16, 2008 08:53 PM

Code:
[Patterns]
Name = "Pink to Red"
Active = TRUE
Multi = TRUE
URL = "$TYPE(htm)"
Limit = 7
Match = "(<$SET(open=1)|>$SET(open=))PrxNeverMatch"
        "|"
        "([^a-z0-9])\2"
        "pink"
        "("
        "(s|)"
        "(\s|<|\'s|[^a-z0-9])"
        ")\1"
        "(^$TST(open=1))"
Replace = "\2RED\1"

Proxo Help wrote: all matching is case insensitive



RE: Some text matching questions - Vendettta - Sep. 17, 2008 07:01 AM

That one works! At least it does with my initial testing. True, it doesn't account for capitalization, but all instances of "pink" seem to have been replaced, and with the correct spacing. And it changes the text in links but doesn't break them. Exactly what I was looking for. Thank you.

Would it be possible for someone to explain, line by line, how the matching works?
My dim understanding of it is like this....

Line 1 checks for the beginning of a tag, sets a variable to "1" when the start of a tag is matched, empties the variable when the end of a tag is matched, and instructs it not to match anything while the variable is set to "1". Although I don't understand why the "PrxNeverMatch" is after the parenthesis.
Line 2 or
Line 3 checks for anything that's not a letter or digit (before the word "pink") and places it into variable 2
Line 4 matches "pink"
Line 5 hmmmm.
Line 6 includes the possibility of "pinks"
Line 7 ???
Line 8 places things following "pink" into variable "1"
Line 9 ???
Line 10 places items before and after "pink" back into position and replaces "pink" with "red"

Are the lines separated for visual clarity, or does the separation serve a matching purpose?