Old Proxomitron Forums
April 23, 2014, 07:09:36 AM *
Author Topic: PDF to HTML on the fly  (Read 1505 times)
Arne

« on: May 25, 2002, 08:20:47 AM »

This is a filter posted by JarC that uses Google's cache to convert a PDF to HTML on the fly. The filter only works if Google already has the PDF in its cache:

[HTTP headers]
In = FALSE
Out = TRUE
Key = "URL: Convert PDF to HTML thru Google (JarC)"
URL = "(^*209.85.129.132)*.pdf"
Match = "http://\1"
Replace = "$JUMP(http://209.85.129.132/search?q=cache:\1&hl=en)"

IP corrected in 2009
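
For anyone reading along without Proxomitron at hand, here is a minimal Python sketch of what the header filter above appears to do. It is only an approximation: the function name `pdf_cache_redirect` is made up for illustration, the substring checks are a loose stand-in for Proxomitron's URL matching, and the cache IP is simply the one quoted in the filter, which may well be out of date.

```python
from typing import Optional

# IP taken from the filter above; an assumption that it still serves the cache.
CACHE_HOST = "209.85.129.132"


def pdf_cache_redirect(url: str, cache_host: str = CACHE_HOST) -> Optional[str]:
    """Return the Google-cache URL to jump to, or None to leave the request alone."""
    lowered = url.lower()
    # Rough equivalent of the URL line: skip requests that already go to the
    # cache host, and only act on URLs that mention ".pdf".
    if cache_host in lowered or ".pdf" not in lowered:
        return None
    # Rough equivalent of the Match line: swallow "http://" and keep the rest.
    if not lowered.startswith("http://"):
        return None
    rest = url[len("http://"):]
    # Rough equivalent of the Replace line: jump to the cached copy.
    return f"http://{cache_host}/search?q=cache:{rest}&hl=en"


if __name__ == "__main__":
    print(pdf_cache_redirect("http://www.mysite.com/stuff.pdf"))
```

Note that a request which already points at the cache host is left alone, which is what keeps the filter from redirecting its own redirect.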
Best wishes
Arne
Imici username: Arne
« Last Edit: June 23, 2009, 03:50:30 AM by Admin »
cj.

« Reply #1 on: May 25, 2002, 01:43:51 PM »

This is a rather interesting filter, Arne. Thank you for posting it here. With this filter you could effectively create .doc files from PDF files.

Thank you, JarC. I shall play with this filter some more during the three-day Memorial Day weekend here in the States.



-cj.-
______
xartica
« Reply #2 on: May 25, 2002, 03:24:54 PM »

With the header filter, if Google doesn't have a cached version of a particular PDF, you're stuck with a "broken link" and would have to bypass the filter and reload the referring page to reach the linked document.

...so, I've drafted the following web filter. It lets you see, in advance, that a page anchor links to a PDF, and allows you to choose which version (PDF or HTML) to retrieve.

Name = "optionally view linked PDF as cached Google HTML page"
Active = TRUE
MULTI = TRUE
Bounds = "<a href\s*>*</a>"
URL = "^([^/]++.google.com|*216.239.3.100)"
Match = "\1 (href=$AV(*.pdf)\9)\2 (*>*)\3"
Replace = "\1 \2 \3<br>"
        "_OR_"
        "<br>try PDF cached as HTML at "
        "<a href='http://216.239.39.100/search?q=cache:\9&hl=en'>Google</a>"

Although I haven't field-tested the filter, it works in the Prox filter test window:

================ filter test window input:
stuff <a href="http:www.mysite.com/stuff.pdf" target=_new>get the PDF</a>

================ filter test window output:
stuff <a href="http:www.mysite.com/stuff.pdf" target=_new>get the PDF</a>
_OR_
try PDF cached as HTML at <a href='http://216.239.39.100/search?q=cache:&hl=en'>Google</a>



note:
An icon image (like the lozenge that the SuperOpener filter inserts)
would probably be preferable to using the "anchor text" I used in my test.


 

 
TEggHead
« Reply #3 on: May 30, 2002, 03:47:31 PM »

Hi Xartica,

Yes, that is a disadvantage if the PDF isn't on Google, and I like your alternative better...

BTW, I just noticed the same Google trick also works with Word DOC files. I haven't found any Excel sheets yet, but I would not be surprised if they could be viewed as HTML in the same way...





Edited by - TEggHead on 31 May 2002  03:09:25

 
sidki3003

« Reply #4 on: May 30, 2002, 04:02:23 PM »

Hey, cool. Excel sheets work:
http://216.239.37.100/search?q=cache:http://www.census.gov/population/cen2000/tab05.xls

 

 
TEggHead
« Reply #5 on: May 30, 2002, 04:09:22 PM »

So, what else do we have in formats that could be supported... Man, this is an unexpected surprise... and it has been staring everyone in the face when you're on Google, but I never thought about using it as an online conversion tool before.

 

 
sidki3003

« Reply #6 on: May 30, 2002, 04:11:10 PM »

Powerpoint too:
http://216.239.37.100/search?q=cache:http://www.internet2.edu/presentations/vimm/20011004-QoSWorkingGroup-Campanella.ppt

 

 
sidki3003

« Reply #7 on: May 30, 2002, 04:20:58 PM »

I think I'm running out of MS Office format knowledge now.
That was a great idea, JarC.


 

 
Jor

« Reply #8 on: May 31, 2002, 01:33:49 PM »

Xartica's filter does not seem to work (?)

 

 
TEggHead
« Reply #9 on: June 05, 2002, 10:49:54 PM »

Quote:
Xartica's filter does not seem to work (?)

Try this one. After Xartica's comment I've been playing with the initial version. It isn't perfect yet... but maybe it works for you.

[Patterns]
Name = "URL: Use Google Service to Convert ... files To HTML (on right)"
Active = TRUE
Multi = TRUE
URL = "(^*(google.com|216.239.3.100)*)"
Bounds = "<a\s*(<(\\|)/a>|(<a\s)\9)"
Limit = 256
Match = "(^*CLASS=PRX*)\0"
"(HREF=$AV(([a-z]+://|$SET(\#=\h))\8"
"\4.(DOC|PDF|PPT|PUB|RTF|XLS)\5))\1"
"(^*<IMG*)\2"
Replace = "\0\1\2\r\n<A CLASS=PRXdirect HREF="http://216.239.39.100"
"/search?q=cache:$ESC(\@\4.\5)&hl=en">\5& #187;HTML</A>\9"

corrected 2009. Remember to change IP


About how \8 is filled: the match needs to swallow the http:// part, as it must be removed from the URL that is fed to Google.

Then there are the cases where the links are relative and a hostname is absent. In this case there won't be an http:// either, so the right side of the OR makes sure the host part is present (luckily \h does not return an http:// in front, so we don't need to remove it again).

\8 is not used thereafter. The only reason for putting it there was to prevent \4 from taking up the http:// (if any), as its (\4) purpose is to match the entire URL without the protocol prefix, up to the extension dot.
Since the extension is also needed in the link text, it ends up in \5.


Now, the first part of the original link is stored in \0, the full href in \1 and the remainder in \2; these are then used to reconstruct the original link.
\@ will have the value of \h if the host wasn't present in the link, and \4 and \5 recreate the path to the document without the protocol prefix. The result is $ESCaped so spaces in document names don't throw off Google.

Finally, \5 is reused to create a text link. The space between & and # is there so the entity doesn't get converted during posting (it had changed when I came back for the edits), and it needs to be removed before use.

One last bit, the \9: I got the bounds from one of the Super Opener versions. It used \9 to catch the remainder of the buffer in case the <a tags were overlapping, and it returns last in the replace. Sadly, constructs like

<a href..> text/image link <br><a href..> image/text link </a> </a>

seem to occur more often than I thought, so I thought it better to go with something that has already proven its worth...
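
For those who find the match variables hard to follow, the URL handling described above can be sketched roughly in Python. This is an illustration, not a port of the filter: the names `cache_link`, `page_host` and `EXTENSIONS` are invented here, `urllib.parse.quote` is only assumed to be a fair stand-in for `$ESC`, and the bounds and overlapping-anchor handling are left out entirely. Like the filter, it simply glues the page's host in front of a link that has no protocol, so it is only right for links that start with a slash.

```python
import re
from typing import Optional
from urllib.parse import quote

CACHE_HOST = "216.239.39.100"  # IP from the filter; "remember to change IP"
EXTENSIONS = ("doc", "pdf", "ppt", "pub", "rtf", "xls")

_HREF = re.compile(
    r"^(?P<proto>[a-z]+://)?"   # like \8: protocol prefix, matched but thrown away
    r"(?P<path>.+)"             # like \4: everything up to the extension dot
    r"\.(?P<ext>" + "|".join(EXTENSIONS) + r")$",  # like \5: the extension
    re.IGNORECASE,
)


def cache_link(href: str, page_host: str, cache_host: str = CACHE_HOST) -> Optional[str]:
    """Build the extra 'view as HTML' anchor for a document link, or None."""
    m = _HREF.match(href)
    if m is None:
        return None
    # Like \@: only filled with the page's host when the link had no protocol.
    host_prefix = "" if m.group("proto") else page_host
    ext = m.group("ext")
    # Escape the result so spaces in document names survive the trip to Google.
    target = quote(f"{host_prefix}{m.group('path')}.{ext}")
    return (
        f'<A CLASS=PRXdirect HREF="http://{cache_host}/search?q=cache:{target}&hl=en">'
        f"{ext}&#187;HTML</A>"
    )


if __name__ == "__main__":
    print(cache_link("http://www.mysite.com/my file.pdf", "www.mysite.com"))
    print(cache_link("/docs/report.XLS", "www.example.com"))
```

The real filter appends this anchor after the original link rather than replacing it, so the reader can still choose the untouched document.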

HTH
JarC


Edited by - TEggHead on 06 Jun 2002  00:12:33
« Last Edit: June 23, 2009, 03:58:11 AM by Admin »

 
Jor

« Reply #10 on: June 06, 2002, 12:25:12 AM »

Wonderful, it now works. Thanks!

By the way, if you want to write &#187; so it will show up here, escape the ampersand as well: use &amp;#187;
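
A small Python illustration of that escaping tip, using only the standard library `html` module. Forum software differs in what it converts, so treat this as a sketch of the idea rather than of this board's behaviour; the variable names are just for the example.

```python
import html

entity = "&#187;"

# What to type into the post so the entity text itself survives one decoding pass.
escaped = html.escape(entity)

# What the reader ends up seeing after the forum decodes the post once.
shown = html.unescape(escaped)

# What happens when the ampersand is left unescaped: the entity becomes a character.
rendered = html.unescape(entity)

if __name__ == "__main__":
    print(escaped, shown, rendered)
```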

Edited by - Jor on 06 Jun 2002  01:27:13

 