Adapting “hosts” file block lists to Privoxy's way of blocking…
Jul. 24, 2015, 11:09 PM (This post was last modified: Aug. 01, 2015 09:10 PM by Faxopita.)
Post: #1
Adapting “hosts” file block lists to Privoxy's way of blocking…
Dear contributors,

Privoxy-adapted block lists are “about” half the size of a traditional hosts file. For example, if one wants to block “www.abc.com” and “abc.com”, one only needs to add “.abc.com” to the block section of the action file.
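
To illustrate why one dot-prefixed entry can stand in for several hosts lines, here is a minimal Python sketch of the matching rule described above. The function name `dot_pattern_matches` is made up for this example, and the sketch only covers the leading-dot case, not Privoxy's full pattern syntax.

```python
def dot_pattern_matches(pattern: str, host: str) -> bool:
    """Return True if a leading-dot pattern such as ".abc.com" covers host.

    Assumption: a leading dot means "this domain and any of its subdomains".
    """
    host = host.lower()
    if not pattern.startswith("."):
        # Without the leading dot, only an exact match counts in this sketch.
        return pattern.lower() == host
    domain = pattern[1:].lower()
    return host == domain or host.endswith("." + domain)


for name in ("abc.com", "www.abc.com", "notabc.com"):
    print(name, dot_pattern_matches(".abc.com", name))
```

The first two names are covered by “.abc.com”, while “notabc.com” is not, because the match is done on whole labels rather than on raw string endings.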

There are three websites from which I collect hosts-adapted block lists:
I have created a shell script that does the conversion to Privoxy language. As I'm not a pro in programming, you might feel the urge to yell at me… Anyway, this script could be a starting point for you to write nice, sharp code in Python (or anything else) that does it the proper way.

This post is also meant to thank the ProxHTTPSProxy contributors. I now use it and, judging by the Privoxy log, I can no longer “live” without it. There are, indeed, some nasty trackers tucked inside HTTPS connexions!!!

How to… (download archive)
First, run “MVPS2PRIV” yourself; it will in turn run “hpHosts Maintainer”. Unfortunately, you will need to revise the paths in those scripts and relocate the two “action” files.

The job of MVPS2PRIV is to clean the hosts file, while that of “hpHosts Maintainer” is to “optimise” the size of the resulting “hpHostsList.action” file.

MVPS2PRIV only applies to the third URL above. I haven't had the time to apply the script to the first two URLs, although it should be quick to do!!!

Tell me what you think or whether or not it would be interesting to convert my script into a more “professional” code…

———
Would anyone tell me how I can send my 3 MB files? Because, when trying to attach a file in this forum, I have either the “not allowed format” or the “too big size issue”. Very irritating indeed. Thank you. I don't use DropBox or any cloud stuff.
———

Thanks for contributing, as always.
Jul. 25, 2015, 02:05 AM (This post was last modified: Jul. 25, 2015 02:09 AM by JJoe.)
Post: #2
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
(Jul. 24, 2015 11:09 PM)Faxopita Wrote:  Would anyone tell me how I can send my 3 MB files? Because, when trying to attach a file in this forum, I have either the “not allowed format” or the “too big size issue”. Very irritating indeed. Thank you. I don't use DropBox or any cloud stuff.

I don't have any experience with it, but you could try www.filedropper.com or something like it.
Registration is not required.

filedropper.com/aboutus.php Wrote:How long are the files kept?
The files are kept forever as long as they are being downloaded. If the files are not downloaded even once within 30 days consecutively they are removed.

I don't use Privoxy but I will save the file. I can upload it when requested.

HTH
Jul. 25, 2015, 06:40 AM
Post: #3
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
Thanks JJoe. Here's the link for downloading the archive: http://www.filedropper.com/scripts_3

Enjoy!
Jul. 25, 2015, 03:45 PM (This post was last modified: Jul. 25, 2015 04:14 PM by cattleyavns.)
Post: #4
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
Hi!
I already wrote a tool like that using AutoHotkey. It is called convert2privoxy; here it is: https://www.dropbox.com/s/hdkhz45w5jpjfk...y.exe?dl=0

Source code (you can modify it if you want; AutoHotkey is the easiest programming language). There is still a very small bug in the Hosts2Privoxy feature, but we can simply delete the function() { and )}; lines. I was going to fix this later, but on second thought the bug is not mine: it was caused by APKHostsFileEngine... https://www.dropbox.com/s/gqe58zta1iecys...y.ahk?dl=0

Okay, let us try it! I tried it with the biggest hosts file in the world, built with APKHostsFileEngine (this tool is crazy! It generates a hosts file from ALL sources).

https://www.dropbox.com/s/9x5k2bp81cbrpr...0.jpg?dl=0

- Copy your hosts file into the first textbox, click Hosts2Privoxy, and copy the content of the second textbox into your Privoxy action file. That is all!
Plus: This tool can make Greasemonkey and Stylish scripts run on Privoxy. It can also convert AdBlock lists such as EasyList to Privoxy (including Element Hiding Helper filters, for example: ###ads).
Plus2: This tool might be able to run on Linux using IronAHK, though I'm not sure. If you want to convert something and don't have Windows to run my exe file, I can help you.

The result (5MB): https://www.dropbox.com/s/jh572cbgmmm65i...ction?dl=0

You might want to check my latest Privoxy bundle (if you know ImgLikeOpera, this bundle contains some filters like that, plus IframeLikeOpera, VideoLikeOpera, ObjectLikeOpera...): https://www.dropbox.com/s/2dtm8pvvc9z7bw...e.zip?dl=0

Good luck and have fun! Happy surfing!
Jul. 25, 2015, 04:40 PM
Post: #5
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
Hello Cattleyavns,

thanks very much for sharing your tools. I'll have a look at them and I'm sure I will have fun.

I'll get back to you.

Good weekend,

Faxo
Jul. 28, 2015, 01:16 PM (This post was last modified: Aug. 01, 2015 09:23 PM by Faxopita.)
Post: #6
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
Cattleyavns' add-on for converting hosts entries into Privoxy language is a nice tool for doing the conversion quickly.

For those who want to reduce the resulting file size, the script from post #3's download link could be a potential alternative. Sure, it can be improved by any brighter mind reading this post and I hope it will happen.

The purpose of this script (hpHosts Maintainer) is to consolidate the block list so that it follows Privoxy's blocking style. So, I thought it might be good to recall what this style is about…

Imagine these entries in a hosts file (the localhost address is left out for simplicity):
Code:
abc.com
www.abc.com
tuv.xyz.def.com
xyz.def.com
def.com
www23.ghi.com
jkl.mno.pqr.com
mno.pqr.com

The “hpHosts Maintainer” script will turn the above entries into:
Code:
.abc.com
.def.com
.ghi.com
.mno.pqr.com

Privoxy would accept the entries as such from the hosts file (Cattleyavns), but would provide the exact same blocking actions as the above consolidated version created by the “Maintainer” script.
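
For anyone who wants to port this step, the consolidation above can be sketched in Python roughly as follows. This is not the shell script itself: the function name `consolidate` and the rule for dropping a leading “www”/“wwwNN” label are guesses based on the example, and the sketch does not check for TLDs or cloud-hosted domains.

```python
import re


def consolidate(hosts):
    """Drop a leading www/wwwNN label, then remove entries covered by a parent."""
    cleaned = set()
    for host in hosts:
        host = host.strip().lower()
        if not host:
            continue
        stripped = re.sub(r"^www\d*\.", "", host)
        # Only strip when at least two labels remain (keeps "www.com" intact).
        if "." in stripped:
            host = stripped
        cleaned.add(host)

    kept = []
    for host in sorted(cleaned):
        labels = host.split(".")
        # Every proper parent with at least two labels, e.g. xyz.def.com -> def.com
        parents = {".".join(labels[i:]) for i in range(1, len(labels) - 1)}
        if not parents & cleaned:
            kept.append("." + host)
    return kept


example = [
    "abc.com", "www.abc.com", "tuv.xyz.def.com", "xyz.def.com",
    "def.com", "www23.ghi.com", "jkl.mno.pqr.com", "mno.pqr.com",
]
print("\n".join(consolidate(example)))
```

With the eight entries from the example, this prints the same four dot-prefixed lines as shown above.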

Now, the script goes a little further in simplifying entries. Imagine these entries in a hosts file:
Code:
www.zulu
www.com
mars.jupiter.saturn.uranus.neptune.com
jupiter.saturn.uranus.neptune.com
saturn.uranus.neptune.com
jupiter.uranus.neptune.com
mars.uranus.neptune.com

The “hpHosts Maintainer” script will turn the above entries into:
Code:
.zulu
.www.com
.uranus.neptune.com

“zulu” is not a known TLD. No problem, then. However, “www.com” cannot be turned into “.com” (a well-known TLD); otherwise, all “.com” domains would be blocked. So, the script just prefixes it with a dot. The good thing about prefixing with a dot is that not only the bad “www.com” website will be blocked, but also any of its subdomains. Visit http://hosts-file.net/default.asp?s=www.com

Finally, I have made the deliberate choice to limit blocking to the first subdomain of the website/server. However, brighter minds of this forum may not like this way of proceeding. That's why the script also moves cloud-based addresses out to another file so that they're not touched by URL compression. For example: “.tracker.not_a_bad_side.wordpress.com” will not be compressed to “.not_a_bad_side.wordpress.com”. In the above example, if “neptune.com” is not a cloud-based server, then the resulting entry is “.uranus.neptune.com”. The script compressed five “neptune.com”-related hosts entries into just one. The rationale is that if “mars”, “jupiter” and “saturn” are bad, it can mean that “uranus” is bad as well. We could then assume that “neptune” is bad too, but to stay generic, the script only shortens the URL down to the first subdomain.
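
The “shorten down to the first subdomain, except for cloud-based addresses” rule can be sketched in Python as below. The names `compress` and `CLOUD_DOMAINS`, and the single-entry cloud list, are assumptions made for illustration; the actual script keeps its cloud addresses in a separate file whose contents are not shown in this thread.

```python
CLOUD_DOMAINS = {"wordpress.com"}  # hypothetical list of shared hosting domains


def compress(host, suffix_labels=1):
    """Shorten host to extension + domain + first subdomain label.

    suffix_labels is the number of labels in the extension
    (1 for ".com", 2 for something like ".co.uk").
    """
    host = host.lower()
    labels = host.split(".")
    domain = ".".join(labels[-(suffix_labels + 1):])
    if domain in CLOUD_DOMAINS:
        return "." + host              # leave cloud-hosted names untouched
    keep = suffix_labels + 2           # extension + domain + first subdomain
    return "." + ".".join(labels[-keep:])


entries = [
    "mars.jupiter.saturn.uranus.neptune.com",
    "jupiter.saturn.uranus.neptune.com",
    "saturn.uranus.neptune.com",
    "jupiter.uranus.neptune.com",
    "mars.uranus.neptune.com",
]
print(sorted({compress(h) for h in entries}))
print(compress("tracker.not_a_bad_side.wordpress.com"))
```

The five “neptune.com” entries collapse into one pattern, while the wordpress.com address is only prefixed with a dot.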

One last thing. hosts file entries such as:
Code:
pub.casino-making-money.com
chocolate.article.diet.com
article.diet.com
viagra.pill.com

will be turned by the script into:
Code:
.casino-making-money.com
.diet.com
.pill.com

No need to go as far as the subdomain for these “scam”-related websites.
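
The post does not say how the script recognises these “scam”-related sites. One possible approach, shown purely as an assumption, is a keyword list applied to the domain label; `SCAM_KEYWORDS` and `compress_scam` are hypothetical names, and the sketch assumes a one-label extension such as “.com”.

```python
SCAM_KEYWORDS = ("casino", "diet", "pill")  # hypothetical keyword list


def compress_scam(host):
    """Cut a host down to its bare domain when the domain label looks scammy."""
    host = host.lower()
    labels = host.split(".")
    if len(labels) >= 2 and any(word in labels[-2] for word in SCAM_KEYWORDS):
        return "." + ".".join(labels[-2:])
    return "." + host  # otherwise only prefix with a dot


entries = [
    "pub.casino-making-money.com",
    "chocolate.article.diet.com",
    "article.diet.com",
    "viagra.pill.com",
]
print(sorted({compress_scam(h) for h in entries}))
```

A keyword match anywhere in the domain label is deliberately crude; a real list would need review to avoid blocking legitimate sites.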

One of the reasons why I wanted to give away my script is to make it available for all platforms, not only Unix-based ones. Anyone who wants to work on it, improve it and turn it into a binary or Python-based version is welcome. If you do so, just add my member name next to yours for future celebrity… Just in case!
Jul. 31, 2015, 04:45 AM
Post: #7
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
(Jul. 28, 2015 01:16 PM)Faxopita Wrote:  The good thing about prefixing with a dot is that not only the bad “www.com” website will be blocked, but also any of its subdomains.

If you convert ad.goodsite.com to .goodsite.com, I'm sure many good sites will be killed by mistake just because of their ad subdomains.
Jul. 31, 2015, 01:27 PM (This post was last modified: Jul. 31, 2015 02:37 PM by Faxopita.)
Post: #8
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
Hello whenever,

the script won't touch ad.goodsite.com, because shortening non-cloud-based addresses that include subdomains always leaves the domain and the first portion of its subdomain untouched. This is a convention of mine, but you are free to modify it and keep the second portion of the subdomain as well. However, the script would compress ad.goodsite.com to .goodsite.com if goodsite.com were mistakenly included in the original hosts file as well. If the hosts file contained server1.ad.goodsite.com and server2.ad.goodsite.com, then the script would return .ad.goodsite.com, thus getting rid of the subdomain's second portion; if the hosts file contained ad.goodsite.com and goodsite.com, then .goodsite.com would be the result. If you wanted to modify the script to keep the second portion of any subdomain when available, then your amended script would simply return .server1.ad.goodsite.com and .server2.ad.goodsite.com. The resulting dot-prefixed entries would prevent your browser from loading any potential content such as, for example, tracking.server1.ad.goodsite.com

Thanks for your input.
Jul. 31, 2015, 03:02 PM (This post was last modified: Jul. 31, 2015 03:12 PM by cattleyavns.)
Post: #9
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
(Jul. 31, 2015 01:27 PM)Faxopita Wrote:  Hello whenever,

the script won't touch ad.goodsite.com, because shortening non-cloud-based addresses that include subdomains always leaves the domain and the first portion of its subdomain untouched. This is a convention of mine, but you are free to modify it and keep the second portion of the subdomain as well. However, the script would compress ad.goodsite.com to .goodsite.com if goodsite.com were mistakenly included in the original hosts file as well. If the hosts file contained server1.ad.goodsite.com and server2.ad.goodsite.com, then the script would return .ad.goodsite.com, thus getting rid of the subdomain's second portion; if the hosts file contained ad.goodsite.com and goodsite.com, then .goodsite.com would be the result. If you wanted to modify the script to keep the second portion of any subdomain when available, then your amended script would simply return .server1.ad.goodsite.com and .server2.ad.goodsite.com. The resulting dot-prefixed entries would prevent your browser from loading any potential content such as, for example, tracking.server1.ad.goodsite.com

Thanks for your input.

I don't think so. In my opinion, you should not convert server1.ad.goodsite.com and server2.ad.goodsite.com to .ad.goodsite.com, because that will cause you some trouble in the future. For example:

The hosts file lists server1.ad.goodsite.com and server2.ad.goodsite.com, but the site uses cdn.ad.goodsite.com to host core content such as jQuery or AngularJS. Privoxy will then block that too, so using what the hosts file gives us is good enough.
With adimg.tv.com and mads.tv.com, the site might be similar to YouTube, and shortening to .tv.com would block the whole site.

But it is a good idea to optimize something like:
Code:
0.0.0.0 c.cnzz.com
0.0.0.0 hos1.cnzz.com
0.0.0.0 hzs1.cnzz.com
0.0.0.0 hzs2.cnzz.com
0.0.0.0 hzs4.cnzz.com
0.0.0.0 hzs8.cnzz.com
0.0.0.0 hzs10.cnzz.com
0.0.0.0 hzs13.cnzz.com
0.0.0.0 hzs15.cnzz.com
0.0.0.0 hzs22.cnzz.com
0.0.0.0 icon.cnzz.com
0.0.0.0 pcookie.cnzz.com
0.0.0.0 pw.cnzz.com
0.0.0.0 s1.cnzz.com
0.0.0.0 s3.cnzz.com
0.0.0.0 s4.cnzz.com
0.0.0.0 s5.cnzz.com
0.0.0.0 s7.cnzz.com
0.0.0.0 s8.cnzz.com
0.0.0.0 s9.cnzz.com
....
0.0.0.0 s132.cnzz.com
0.0.0.0 s137.cnzz.com

That would be a huge performance boost.
Jul. 31, 2015, 05:38 PM (This post was last modified: Aug. 01, 2015 09:46 PM by Faxopita.)
Post: #10
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
(Jul. 31, 2015 03:02 PM)cattleyavns Wrote:  With adimg.tv.com and mads.tv.com, the site might be similar to YouTube, and shortening to .tv.com would block the whole site.

The script won't touch adimg.tv.com or mads.tv.com, because two conditions are met at the same time in this case:
  • if the hosts file contains adimg.tv.com and mads.tv.com, they will only be prefixed with a dot and not shortened to .tv.com; the conversion to a single-line entry .tv.com would only happen if the hosts file also contained tv.com by accident.
  • although tv.com doesn't look like a second-level domain (like .us.com, for example), the script treats it as one. This is a limitation of the script, which protects addresses that include a second-level domain by using a regular expression that accounts for a “.vw.xyz”-like domain extension pattern. As a result, adimg will not be considered a “subdomain” of the site (although it is in reality). Hosts entries containing, for example, banner.adimg.tv.com and banner.mads.tv.com will therefore not be turned into .adimg.tv.com and .mads.tv.com respectively, although that would be ideal; instead, the script would only prefix banner.adimg.tv.com and banner.mads.tv.com with a dot.

After verification, my converted hosts file still contains .mads.tv.com and .adimg.tv.com

Equally, addresses like ad.mirror.co.uk will not be changed into .mirror.co.uk, because the script recognises the “wx.yz”-like pattern ending the address as a second-level domain. Finally, the script will also protect addresses ending in .com.sg, .us.com, .ac.be, etc.

(Jul. 31, 2015 03:02 PM)cattleyavns Wrote:  you should not convert server1.ad.goodsite.com and server2.ad.goodsite.com to .ad.goodsite.com, because that will cause you some trouble in the future. For example: the hosts file lists server1.ad.goodsite.com and server2.ad.goodsite.com, but the site uses cdn.ad.goodsite.com to host core content such as jQuery or AngularJS. Privoxy will then block that too

This hadn't occurred to me. I see what you mean. In this case, you could create an exception in the user.action file along these lines…
Code:
{ -block }
.cdn.ad.goodsite.com

The user.action file must be listed after converted_hosts_file.action in Privoxy's config file.

Other than this, I've been using the script for 14 months now and haven't had any deterioration in my browsing experience. The script has only been improved towards further file size reduction. I use it every week to update my blocking .action file. Having said that, this of course does not mean the script is perfect, as cattleyavns has just shown.

Regarding your last input:
Quote:
0.0.0.0 c.cnzz.com
0.0.0.0 hos1.cnzz.com
0.0.0.0 hzs1.cnzz.com
0.0.0.0 hzs2.cnzz.com
0.0.0.0 hzs4.cnzz.com
0.0.0.0 hzs8.cnzz.com
0.0.0.0 hzs10.cnzz.com
0.0.0.0 hzs13.cnzz.com
0.0.0.0 hzs15.cnzz.com
0.0.0.0 hzs22.cnzz.com
0.0.0.0 icon.cnzz.com
0.0.0.0 pcookie.cnzz.com
0.0.0.0 pw.cnzz.com
0.0.0.0 s1.cnzz.com
0.0.0.0 s3.cnzz.com
0.0.0.0 s4.cnzz.com
0.0.0.0 s5.cnzz.com
0.0.0.0 s7.cnzz.com
0.0.0.0 s8.cnzz.com
0.0.0.0 s9.cnzz.com
....
0.0.0.0 s132.cnzz.com
0.0.0.0 s137.cnzz.com

While doing regular updates of your hosts file, you could see a newly-added entry like:
Code:
0.0.0.0 cnzz.com
alongside other .cnzz.com addresses. It is in this scenario that the script would compress all these entries into the single entry .cnzz.com; and, indeed, after verification, my converted_hosts_file.action only contains .cnzz.com
Aug. 01, 2015, 08:52 AM (This post was last modified: Aug. 01, 2015 09:52 PM by Faxopita.)
Post: #11
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
In order to differentiate a top- or second-level domain from the rest of the address, I've made the script rely on the most common generic extensions. It's a personal choice, and I found it fair enough for me. However, someone could instead use the SLD.txt file provided in the archive (it currently contains 7313 domain extensions) and make the script rely on it.

Script portion dealing with domain extensions:
Code:
\(\.[a-z0-9]\{2\}\(\.[a-z0-9]\{2,3\}\)\?\|\.[a-z0-9]\{3\}\(\.[a-z0-9]\{2\}\)\?\|\.[a-z0-9\-]\{4,\}\|\(\.xn--[^\.]\+\)\{2\}\)$
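
For readers more at home in Python than in sed-style basic regular expressions, the pattern above translates roughly as follows. The helper name `split_extension` is made up for this sketch; the expression itself is a direct transcription, with the BRE escapes removed.

```python
import re

# Four alternatives, all anchored at the end of the host name:
#   .xx or .xx.yy / .xx.yyy     (e.g. ".uk", ".co.uk", ".tv.com")
#   .xxx or .xxx.yy             (e.g. ".com", ".com.sg")
#   a single label of 4+ chars  (e.g. ".info")
#   two punycode labels         (".xn--..." twice)
EXTENSION_RE = re.compile(
    r"(\.[a-z0-9]{2}(\.[a-z0-9]{2,3})?"
    r"|\.[a-z0-9]{3}(\.[a-z0-9]{2})?"
    r"|\.[a-z0-9\-]{4,}"
    r"|(\.xn--[^.]+){2})$"
)


def split_extension(host):
    """Split host into (rest, extension) using the pattern above."""
    match = EXTENSION_RE.search(host.lower())
    if match is None:
        return host.lower(), ""
    return host[:match.start()].lower(), match.group(0)


for name in ("www.abc.com", "ad.mirror.co.uk", "adimg.tv.com"):
    print(name, split_extension(name))
```

Note that “adimg.tv.com” is split as (“adimg”, “.tv.com”): the pattern cannot tell a real second-level domain such as “.co.uk” from an ordinary two-letter domain followed by “.com”, which is the limitation described in post #10.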

Sources of domain extensions:
The most exhaustive one though seems to be: https://publicsuffix.org/list/public_suffix_list.dat

File SLD.txt is based upon this latter list.
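
Making the script rely on a suffix list instead of the regular expression could look roughly like this in Python. This is a simplified sketch: it assumes the list uses the Public Suffix List text format (one suffix per line, “//” comments), it skips wildcard (“*.”) and exception (“!”) rules, and the names `load_suffixes`, `registered_domain` and the tiny `SAMPLE` list are made up for the example. The layout of SLD.txt itself is not shown in this thread.

```python
SAMPLE = """\
// tiny sample in public-suffix-list style
com
uk
co.uk
*.ck
"""


def load_suffixes(text):
    """Collect plain suffix rules, skipping comments, blanks, '*' and '!' rules."""
    suffixes = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        if line.startswith(("*", "!")):
            continue  # wildcard and exception rules are ignored in this sketch
        suffixes.add(line.lower())
    return suffixes


def registered_domain(host, suffixes):
    """Return the longest known suffix plus one label, or None if there is none."""
    labels = host.lower().split(".")
    for i in range(len(labels)):          # longest candidate suffix first
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None               # the host is itself a bare suffix
            return ".".join(labels[i - 1:])
    return None


suffixes = load_suffixes(SAMPLE)
print(registered_domain("ad.mirror.co.uk", suffixes))
print(registered_domain("adimg.tv.com", suffixes))
```

With a list-based lookup, “tv.com” is no longer mistaken for a second-level extension, because only “com” appears in the list.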
Aug. 02, 2015, 03:22 AM (This post was last modified: Aug. 02, 2015 04:44 AM by cattleyavns.)
Post: #12
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
I think JavaScript is a good programming language, so I will try to use either JavaScript or Python to create this converter. To be honest, creating a JavaScript version of my convert2privoxy from the start would have been much better for me: JavaScript is cross-OS, cross-browser and also cross-network (use it everywhere).

Creating a GUI with Python is hard for me, so I think I will try JavaScript. I will post my source code here if it is usable.

Okay, here we go: convert2privoxy, HTML + JS version. This is a very first alpha version. Can you integrate your optimization into it? I'm pretty busy at the moment.
Only Hosts2Privoxy for now; other features are coming soon.

Code:
<html>
<body>
<textarea id='input'></textarea>
<br>
<textarea id='output'></textarea>
<br>
<button id='hosts2privoxy' onclick='h2p();'>Hosts2Privoxy</button>


<script>
// Convert hosts-file lines such as "0.0.0.0 example.com" into a Privoxy block section.
function h2p() {
  var input = document.getElementById('input');
  var output = document.getElementById('output'); // was 'input' by mistake
  // Drop the leading 127.0.0.1 / 0.0.0.0 (dots escaped) and keep the rest of each line.
  var hosts = input.value.replace(/^[ \t]*(?:127\.0\.0\.1|0\.0\.0\.0)[ \t]+/gm, '');
  output.value = '{+block{hosts}}\n' + hosts;
}
</script>
<style>#input, #output {
width: 100%!important;
height: 30%!important;
}</style>

</body>
</html>

[Image: cTa3h1A.png]

It is like a small framework that we can extend with our work. JavaScript is brilliant! We can upload this tool to any web server, or simply open a network port and use it from everywhere, even Android or iOS.
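
Since a Python version was also asked for earlier in the thread, the same Hosts2Privoxy step can be sketched in Python as follows. It is only an illustration, not a port of convert2privoxy: the function name `hosts_to_privoxy` is made up, and skipping “localhost” and “#” comments is an assumption about what a cleaned list should contain.

```python
import re

# A hosts line: loopback or null address, whitespace, then the host name.
HOSTS_LINE = re.compile(r"^\s*(?:127\.0\.0\.1|0\.0\.0\.0)\s+(\S+)")


def hosts_to_privoxy(hosts_text, label="hosts"):
    """Turn hosts-file text into a Privoxy {+block{...}} section."""
    patterns = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments
        match = HOSTS_LINE.match(line)
        if match and match.group(1).lower() != "localhost":
            patterns.append(match.group(1).lower())
    return "{+block{%s}}\n" % label + "\n".join(patterns) + "\n"


sample = """\
# sample hosts file
127.0.0.1 localhost
0.0.0.0 c.cnzz.com
127.0.0.1  s1.cnzz.com # tracker
"""
print(hosts_to_privoxy(sample), end="")
```

The output can be pasted into an action file as is; any consolidation of the entries would be a separate step.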
Aug. 02, 2015, 07:48 AM
Post: #13
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
(Jul. 31, 2015 05:38 PM)Faxopita Wrote:  The script has only been improved towards further file size reduction.

Have you measured the performance gain of Privoxy after the block file size reduction? I'm not sure if that kind of optimization is really needed by Privoxy, or maybe Privoxy can do it internally?

On the other hand, unless you have to often run Privoxy off a usb stick, why not just use the Hosts file directly?
Aug. 02, 2015, 08:20 AM (This post was last modified: Aug. 02, 2015 08:28 AM by cattleyavns.)
Post: #14
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
@whenever: Privoxy is better than a hosts file, and a hosts file slows down once it reaches a certain size (about 200 KB, if I remember correctly). And like Proxomitron, Privoxy is portable and easy to set up. It can enhance user privacy by blocking canvas fingerprinting, and it is extensible: Privoxy can inject JavaScript and even run complex userscripts using GM_ functions, and it can hide ads like Firefox's Adblock by injecting style tags that set display:none; on an id/class/index.

And the hosts file is system-level (on Windows at least), so we need root permission to modify it, which is a big con.

All the features above are easy to achieve with convert2privoxy; my converter automates a lot, so we can generate those rules really easily.

Using hosts file rules as Privoxy rules is a way to reuse other people's previous work and minimize our own, which is good.

PS: I always carry Privoxy on my USB stick every time I leave home. I cannot live without my Privoxy set because, after a long time tweaking, it can replace almost all browser add-ons, so I just have to plug my USB stick in and enjoy my Privoxy set.
Aug. 03, 2015, 03:51 AM
Post: #15
RE: Adapting “hosts” file block lists to Privoxy's way of blocking…
(Aug. 02, 2015 08:20 AM)cattleyavns Wrote:  a hosts file slows down once it reaches a certain size (about 200 KB, if I remember correctly).

There is a workaround: http://winhelp2002.mvps.org/hosts.htm#DNS

I personally use hosts file with dnsmasq on the gateway machine.

I'm just curious if it is necessary to compress the file for Privoxy. ;)