
cattleyavns · (This post was last modified: May. 28, 2015 07:42 AM by cattleyavns.)

Great job! I think we should rewrite self.headers.update too; I'm trying to do that now. The urllib3 headers feature is not good, at least for now, so I think we should separate the whole header feature from it. I prefer to use built-ins as much as possible.

Can you tell me how to move this line into URLFilter.py so I can modify it as I want:

Code:

        headers = urllib3._collections.HTTPHeaderDict()
        for key, value in self.headers.items():
            headers.add(key, value)

I'm adding a proxy feature to AFProxy using proxy_from_url, but I want to patch the problem above by setting headers = self.headers (req.headers in URLFilter.py).

I'm learning Python, and I have a really tough question about threading; threading in Python is not easy at all. I would like to ask you a question and hope you can help me:
- With threading, how can we download a big file in parts and join the parts one by one as they finish, instead of waiting for all of them to finish and then joining?
Here is the code; save it as .py and run it.

Code:

import os
import threading
import time
import urllib.request

import requests

URL = "https://peach.blender.org/wp-content/uploads/poster_bunny_big.jpg"

def buildRange(value, numsplits):
    # split `value` bytes into `numsplits` inclusive "start-end" ranges;
    # integer arithmetic avoids the gaps that round() can leave between ranges
    lst = []
    for i in range(numsplits):
        start = i * value // numsplits
        end = (i + 1) * value // numsplits - 1
        if end >= start:
            lst.append('%s-%s' % (start, end))
    return lst

def main(url=None, splitBy=3):
    start_time = time.time()
    if not url:
        print("Please Enter some url to begin download.")
        return
    fileName = "1.jpg"
    sizeInBytes = requests.head(url, headers={'Accept-Encoding': 'identity'}).headers.get('content-length', None)
    print("%s bytes to download." % sizeInBytes)
    if not sizeInBytes:
        print("Size cannot be determined.")
        return
    dataDict = {}
    # split total num bytes into ranges
    ranges = buildRange(int(sizeInBytes), splitBy)

    def downloadChunk(idx, irange):
        print(idx)
        req = urllib.request.Request(url)
        req.add_header('Range', 'bytes={}'.format(irange))
        dataDict[idx] = urllib.request.urlopen(req).read()
        print("finish: " + str(irange))

    # create one downloading thread per chunk
    downloaders = [
        threading.Thread(target=downloadChunk, args=(idx, irange))
        for idx, irange in enumerate(ranges)
    ]
    # start threads, let run in parallel, wait for all to finish
    # (join() must be in its own loop; joining inside the start loop
    # would run the downloads one after another)
    for th in downloaders:
        th.start()
    for th in downloaders:
        th.join()
    print('done: got {} chunks, total {} bytes'.format(
        len(dataDict), sum(len(chunk) for chunk in dataDict.values())
    ))
    print("--- %s seconds ---" % str(time.time() - start_time))
    if os.path.exists(fileName):
        os.remove(fileName)
    # reassemble file in correct order
    with open(fileName, 'wb') as fh:
        for _idx, chunk in sorted(dataDict.items()):
            fh.write(chunk)
    # my earlier attempt at streaming the chunks, left disabled:
    #stream_chunk = 16 * 1024
    #with open(fileName, 'wb') as fp:
    #    while True:
    #        for _idx,chunk in sorted(dataDict.items()):
    #            #fh.write(chunk)
    #            chunking = chunk.read(stream_chunk)
    #            if not chunk:
    #                break
    #            fp.write(chunking)
    print("Finished Writing file %s" % fileName)
    print('file size {} bytes'.format(os.path.getsize(fileName)))

if __name__ == '__main__':
    main(URL, splitBy=3)

What I want is:
- For example, we have a big file of 100 MB.
- We split that file into parts using its Content-Length.
- We use the "threading" module to download the parts in parallel to get the fastest possible download speed, instead of downloading them one by one without threading and then joining the parts.
- But the problem is threading's "join()": we cannot stream the file or write it to disk as it arrives, like Free Download Manager/Flashget do, because the script waits in "join()" for all threads to finish.
- But without join(), the script simply does not work: the file size is 0 bytes because the file is written before the download tasks finish.
- So I want to make threading work like this:
+ Download a file with 4 threads.
+ When thread 1 finishes, stream thread 1's data, then wait until thread 2 finishes and append thread 2's data after thread 1's. Even if threads 3 and 4 finish earlier than thread 2, their data must not be appended after thread 1's, because that would break the file. They must wait until thread 2's data has been written, and then the data from 3 and 4 is appended in turn.
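
To make the ordering concrete, here is a rough sketch of the behavior I mean. It is only an illustration and makes no real network calls: split_ranges, stream_in_order and the fetch callback are names I made up for this example, and fetch(start, end) stands in for whatever actually downloads one byte range.

```python
import io
import threading
import time


def split_ranges(total_size, parts):
    """Return inclusive (start, end) byte offsets covering 0..total_size-1."""
    base, extra = divmod(total_size, parts)
    ranges, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue  # more parts than bytes: skip empty ranges
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def stream_in_order(fetch, ranges, out):
    """Fetch all ranges in parallel, but write them to `out` strictly in order.

    `fetch(start, end)` is a hypothetical callback returning the bytes of one
    range. Part N is written as soon as parts 0..N have finished, so later
    parts that finish early wait in memory for their turn.
    """
    results = {}

    def worker(idx, byte_range):
        results[idx] = fetch(*byte_range)

    threads = [
        threading.Thread(target=worker, args=(idx, byte_range))
        for idx, byte_range in enumerate(ranges)
    ]
    for th in threads:
        th.start()
    # Join in index order: each join() only waits for that one thread,
    # while the remaining threads keep downloading in the background.
    for idx, th in enumerate(threads):
        th.join()
        out.write(results.pop(idx))  # raises KeyError if that worker failed


if __name__ == "__main__":
    data = bytes(range(256)) * 4

    def fake_fetch(start, end):
        # Assumption for the demo: the second part is the slow one.
        time.sleep(0.05 if start == 256 else 0.0)
        return data[start:end + 1]

    buf = io.BytesIO()
    stream_in_order(fake_fetch, split_ranges(len(data), 4), buf)
    print(buf.getvalue() == data)
```

The idea is that join() is called on one thread at a time, in part order, rather than on all threads before anything is written. Thread 1's data can go to disk while thread 2 is still running, and the data from threads 3 and 4 stays in memory until thread 2's data has been written.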