Topic: PDF to HTML on the fly

Arne

« on: May 25, 2002, 08:20:47 AM »
This is a filter posted by JarC that uses Google's cache to display a PDF as HTML. The filter only works if Google already has the PDF in its cache:

[HTTP headers]
In = FALSE
Out = TRUE
Key = "URL: Convert PDF to HTML thru Google (JarC)"
URL = "(^*209.85.129.132)*.pdf"
Match = "http://\1"
Replace = "$JUMP(http://209.85.129.132/search?q=cache:\1&hl=en)"
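
For anyone not running Proxomitron, here is a rough Python sketch of the redirect this header filter performs. It is only an illustration under assumptions: the names `google_cache_url` and `CACHE_HOST` are made up for this example, and the checks are a simplified stand-in for the filter's URL pattern, not a port of Proxomitron's matching rules.

```python
from typing import Optional

# Cache host copied from the filter above; the thread notes this IP changes over time.
CACHE_HOST = "209.85.129.132"


def google_cache_url(url: str) -> Optional[str]:
    """Return the Google-cache URL to jump to, or None if the filter would not fire."""
    prefix = "http://"
    if not url.startswith(prefix):
        return None  # the Match line expects an http:// prefix
    if CACHE_HOST in url:
        return None  # requests already going to the cache host are left alone
    rest = url[len(prefix):]  # the part the Match line captures after http://
    if ".pdf" not in rest.lower():
        return None  # only PDF links are redirected
    return f"http://{CACHE_HOST}/search?q=cache:{rest}&hl=en"
```

Calling `google_cache_url("http://www.mysite.com/stuff.pdf")` gives the cache address for that document, while a request that already points at the cache host is passed through untouched, so the redirect cannot loop.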

IP corrected in 2009
Best wishes
Arne
Imici username: Arne
« Last Edit: June 23, 2009, 03:50:30 AM by Admin »

cj.

« Reply #1 on: May 25, 2002, 01:43:51 PM »
This is a rather interesting filter, Arne. Thank you for posting it here. With this filter you could effectively create .doc files from .pdf files.

Thank you, JarC. I shall play with this filter some more during the three-day Memorial Day weekend here in the States.



-cj.-
______

xartica

« Reply #2 on: May 25, 2002, 03:24:54 PM »
With the header filter, if Google doesn't have a cached version of a particular PDF, you're stuck with a "broken link" & would have to bypass and reload the referring page to reach the linked document.

...so, I've drafted the following WEBfilter. It lets you see, in advance, that a page anchor links to a PDF... and allows you to choose which [PDF/HTML] version to retrieve.

Name = "optionally view linked PDF as cached Google HTML page"
Active = TRUE
MULTI = TRUE
Bounds = "<a href\s*>*</a>"
URL = "^([^/]++.google.com|*216.239.3.100)"
Match = "\1 (href=$AV(*.pdf)\9)\2 (*>*)\3"
Replace = "\1 \2 \3
"
        "_OR_"
        "
try PDF cached as HTML at "
        "<a href='http://216.239.39.100/search?q=cache:9&hl=en'>Google</a>"

Although I haven't field-tested the filter, it works in the Prox filter test window.

================ filter test window input:
stuff <a href="http:www.mysite.com/stuff.pdf" target=_new>get the PDF</a>

================ filter test window output:
stuff <a href="http:www.mysite.com/stuff.pdf" target=_new>get the PDF</a>
_OR_
try PDF cached as HTML at <a href='http://216.239.39.100/search?q=cache:&hl=en'>Google</a>



note:
An icon image (like the lozenge that the SuperOpener filter inserts)
would probably be preferable to using the "anchor text" I used in my test.


 
 

TEggHead

« Reply #3 on: May 30, 2002, 03:47:31 PM »
Hi Xartica,

Yes, that is a disadvantage if the PDF isn't on Google, and I like your alternative better...

BTW, I just noticed the same Google trick also works with Word DOC files. I haven't found any Excel sheets yet, but I would not be surprised if those could be viewed as HTML in the same way...





Edited by - TEggHead on 31 May 2002  03:09:25
 

sidki3003

« Reply #4 on: May 30, 2002, 04:02:23 PM »
 

TEggHead

« Reply #5 on: May 30, 2002, 04:09:22 PM »
So, what else do we have in formats that could be supported? Man, this is an unexpected surprise... and it has been staring everyone in the face whenever you're on Google, but I never thought about using it as an online conversion tool before.

 
 


sidki3003

« Reply #7 on: May 30, 2002, 04:20:58 PM »
I think I'm running out of MS Office format knowledge now.
That was a great idea, JarC.


 
 

Jor

« Reply #8 on: May 31, 2002, 01:33:49 PM »
Xartica's filter does not seem to work (?)

 
 

TEggHead

« Reply #9 on: June 05, 2002, 10:49:54 PM »
quote: Xartica's filter does not seem to work (?)

Try this one. After Xartica's comment I've been playing with the initial version. It isn't perfect yet... but maybe it works for you.

[Patterns]
Name = "URL: Use Google Service to Convert ... files To HTML (on right)"
Active = TRUE
Multi = TRUE
URL = "(^*(google.com|216.239.3.100)*)"
Bounds = "<a\s*(<(\\|)/a>|(<a\s)\9)"
Limit = 256
Match = "(^*CLASS=PRX*)\0"
"(HREF=$AV(([a-z]+://|$SET(\#=\h))\8"
"\4.(DOC|PDF|PPT|PUB|RTF|XLS)\5))\1"
"(^*<IMG*)\2"
Replace = "\0\1\2\r\n<A CLASS=PRXdirect HREF="http://216.239.39.100"
"/search?q=cache:$ESC(\@\4.\5)&hl=en">\5& #187;HTML</A>\9"

corrected 2009. Remember to change IP


About how \8 is filled: the match needs to swallow the http:// part, as it must be removed from the URL that is fed to Google.

Then there are the cases where the link is relative and has no hostname. In that case there is no http:// either, so the right side of the OR, $SET(\#=\h), makes sure the host part is present. Luckily \h does not come with http:// in front, so it doesn't need to be removed again.

\8 is not used thereafter; the only reason for putting it there was to prevent \4 from taking up the http:// (if any), as the purpose of \4 is to match the entire URL without the protocol prefix, up to the extension dot.
Since the extension is also needed in the link text, it ends up in \5.


Now, the first part of the original link is stored in \0, the full href in \1 and the remainder in \2; these are then used to reconstruct the original link.
\@ will hold the value of \h if the host wasn't present in the link, and \4 and \5 recreate the path to the document without the protocol prefix. The result is $ESCaped so spaces in document names don't throw off Google.
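
If the variable juggling is hard to follow, the same steps can be sketched in Python. This is a hedged illustration, not a translation of the filter: `cache_link`, `CACHE_HOST`, `EXTENSIONS` and `page_host` are names invented for this example, `urllib.parse.quote` merely stands in for $ESC, and relative links are assumed to start with a slash.

```python
import re
from typing import Optional
from urllib.parse import quote

CACHE_HOST = "216.239.39.100"  # IP used in the filter above; remember it changes
EXTENSIONS = ("doc", "pdf", "ppt", "pub", "rtf", "xls")

# Optional scheme (the role of \8), then the path (\4), then the extension (\5).
_LINK = re.compile(r"(?i)^([a-z]+://)?(.+)\.(" + "|".join(EXTENSIONS) + r")$")


def cache_link(href: str, page_host: str) -> Optional[str]:
    """Build the extra Google-cache URL for one link, or None if it is not a supported document."""
    m = _LINK.match(href)
    if m is None:
        return None
    scheme, path, ext = m.groups()
    # Relative link: no scheme and no host, so use the current page's host instead.
    host = "" if scheme else page_host
    # Escape the result so spaces in document names survive the trip to Google.
    target = quote(f"{host}{path}.{ext}", safe="/:")
    return f"http://{CACHE_HOST}/search?q=cache:{target}&hl=en"
```

For example, `cache_link("/docs/report.doc", "www.mysite.com")` fills in the host for a relative link, while an absolute link has its scheme dropped before the cache query is built.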

Finally, \5 is reused to create the link text. The space between & and # is there so the entity doesn't get converted when posting (it had changed when I came back for the edits); it needs to be removed before use.

One last bit, the \9: I got the bounds from one of the Super Opener versions, which used \9 to catch the remainder of the buffer in case the <a tags were overlapping; it is returned last in the replace. Sadly, constructs like

<a href..> text/image link <br><a href..> image/text link </a> </a>

seem to occur more often than I thought, so I thought it better to go with something that has already proven its worth...

HTH
JarC


Edited by - TEggHead on 06 Jun 2002  00:12:33
« Last Edit: June 23, 2009, 03:58:11 AM by Admin »
 

Jor

« Reply #10 on: June 06, 2002, 12:25:12 AM »
Wonderful, it now works. Thanks!

By the way, if you want to write &#187; so it will show up here, escape the ampersand as well: use &amp;#187;
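
A quick way to see the effect of that extra escaping, sketched with Python's standard `html` module (the variable names are just for illustration):

```python
import html

entity = "&#187;"                  # the text you want readers to see literally
to_post = html.escape(entity)      # escape the ampersand before posting
rendered = html.unescape(to_post)  # roughly what a browser then displays

print(to_post)
print(rendered)
```

Posting the entity without the extra escape lets the board turn it into the character itself, which is what happened to the filter above.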

Edited by - Jor on 06 Jun 2002  01:27:13