#1
Hi,
I have huge files, about 15 GB each, and would like to zip them individually. Right now that process takes a long time. Is it possible to take a zip job, break it up into several pieces, and have other machines do it in chunks? Thanks
#2
Yes, this could be sped up with the Digipede Network.
Are you hoping to reduce the time it takes to compress a single file or a batch of files? Reducing the time for a batch of files is more straightforward than doing this for a single file. Speeding up a single file is certainly possible, but it may require unacceptable trade-offs or programming complexity. If you are compressing a batch, the job will be trivial to distribute. Answers to these questions will help you determine if it makes sense:
If you need to speed up a single file, you may need to get more creative. For example, one possibility is to break up the large input file on the front end, distribute the chunks across the Digipede Network, compress them, and then tar (i.e., archive without compression) them into a single file. Of course, this produces a different result format. Another possibility is to paste the ZIPped data together programmatically into a valid ZIP file that contains only one compressed file. I imagine that some third-party ZIP libraries would help with this, but I have no personal experience with any. My best suggestion is to get the Digipede Network Developer Edition (free to qualified parties) and try it out.
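To make the chunk-then-tar idea concrete, here is a minimal Python sketch using only the standard library. It is not Digipede code: the function names and the `chunkNNNNN.gz` member naming are made up for illustration, and it holds the data in memory for brevity, which a real 15 GB file would not allow (you would stream from disk instead). The gzip call inside the loop is the part that each worker machine would do.

```python
import gzip
import io
import tarfile


def compress_chunks_to_tar(data: bytes, chunk_size: int, tar_path: str) -> int:
    """Split data into chunks, gzip each one, and store them in an uncompressed tar."""
    count = 0
    with tarfile.open(tar_path, "w") as tar:  # mode "w" writes a tar with no compression
        for offset in range(0, len(data), chunk_size):
            # On a grid, this gzip step is the work a single task would perform.
            compressed = gzip.compress(data[offset:offset + chunk_size])
            info = tarfile.TarInfo(name=f"chunk{count:05d}.gz")  # hypothetical naming
            info.size = len(compressed)
            tar.addfile(info, io.BytesIO(compressed))
            count += 1
    return count


def restore_from_tar(tar_path: str) -> bytes:
    """Decompress every member in name order and concatenate the results."""
    parts = []
    with tarfile.open(tar_path, "r") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            parts.append(gzip.decompress(tar.extractfile(member).read()))
    return b"".join(parts)
```

The zero-padded member names are what keep the chunks in the right order on restore, so whatever naming you pick needs to sort the same way the file was split.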
#3
I just downloaded the Adsen Software File Splitter (it's available at http://www.adsensoftware.com/filesplitter). I can't vouch for the product, but it's freeware and I just used it with no problem.
It was a piece of cake to split a large file, then use the Digipede Network to zip the pieces. FileSplitter created files called FSS files. I used the Digipede Workbench to define a very simple job: one executable (I used gzip), one input file and one output file for each task, and the command line was simply "gzip $(FSSFile)". I just browsed to the network file share with my FSS files and selected them all--the Digipede Network automatically handled making all of the right command lines for me.
For kicks, I unzipped everything on the Digipede Network as well (just had to change the command line to "gzip -d $(GZFile)"), then stitched it all together using Adsen's File Splitter (note on FileSplitter--you need to make sure to store the FSM file, because it's necessary to stitch it all back together again).
Now that I know this works, I'm running it on a large file (3GB now) to get an idea of performance improvement. I'll post again when I have some hard numbers.
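For anyone who would rather script the split and stitch steps than use a GUI tool, here is a rough Python sketch of the same idea. To be clear, this does not read or write FileSplitter's FSS/FSM formats, which I have not looked into; the `.part` piece names and the JSON manifest are stand-ins invented for this example. The manifest plays the role described above for the FSM file: without it you do not know which pieces to join or in what order.

```python
import json
from pathlib import Path


def split_file(src: Path, out_dir: Path, piece_size: int) -> Path:
    """Write src as numbered pieces plus a JSON manifest; return the manifest path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []
    with src.open("rb") as f:
        while True:
            piece = f.read(piece_size)  # only one piece is held in memory at a time
            if not piece:
                break
            name = f"{src.name}.{len(names):04d}.part"  # hypothetical naming scheme
            (out_dir / name).write_bytes(piece)
            names.append(name)
    manifest = out_dir / f"{src.name}.manifest.json"
    manifest.write_text(json.dumps({"source": src.name, "pieces": names}))
    return manifest


def join_files(manifest: Path, dest: Path) -> None:
    """Concatenate the pieces listed in the manifest, in order, into dest."""
    pieces = json.loads(manifest.read_text())["pieces"]
    with dest.open("wb") as out:
        for name in pieces:
            out.write((manifest.parent / name).read_bytes())
```

Each `.part` file can then be handed to a gzip task exactly as in the job above, and unzipped again before calling `join_files`.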
__________________
Director of Products, Digipede Technologies
#4
Ok, I've got some good numbers for this.
First, the important part: you need a good network for this! Due to the huge file sizes involved, it's important that you have a network that can move them quickly. I tried this first on our regular (100 Mbit) network, and I didn't get significant improvement. We have several machines that have Gigabit Ethernet NICs; when I ran on that subnet, I got good improvement.
I was playing with a just-over-one-gigabyte ASCII file. Using gzip on one machine, zipping this file took 1 minute and 23 seconds (83 seconds). I used FileSplitter to break the file into 100 MB pieces (there were 11 of them, with one smaller than the others). Then I ran a job to zip those files on a pool of three machines, each with Gig-E network access: 26 seconds, including all of the file moving! I played a lot with using more or fewer machines. Three seemed to be the sweet spot (more machines just hammered the network and the hard drive too much).
Finally, I modified my job. Rather than copying the files around my network, I changed it so each machine was working directly on a file share on one server (so my command line was "gzip \\MyServerName\MyShareName\file1.fss"). That ran in 17 seconds--a huge improvement! Working in this way (directly on a file share rather than copying files), I could even gain some more performance by adding a fourth machine into the mix (got it down to 15 seconds). Unfortunately, I don't have any other machines with Gig-E NICs.
Note for comparison: running this same job on my 100 Mbit machines took about a minute and a half, which is longer than on one machine. In other words, moving the bits took longer than calculating the bits. So having the Gigabit Ethernet was definitely important!
One other note: splitting this file up using FileSplitter took 38 seconds. So with splitting and zipping, it's a total of 53 seconds using the Digipede Network (on 4 machines) versus 83 seconds not using the Digipede Network--and most of that was the splitter. It's hard to believe that it should take that long; I'd guess there are more efficient file splitting utilities out there.
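To put those timings side by side, here is a small Python calculation of the speedups, plus a back-of-the-envelope estimate of how long it takes just to push the file across the wire. The timings are the ones reported above. The transfer estimate rests on assumptions: it treats the file as exactly 1 GiB (it was "just over" a gigabyte) and uses the nominal link rate with no protocol overhead, so real transfers would be slower.

```python
BASELINE_S = 83  # gzip on a single machine, as measured above

# Wall-clock seconds for each distributed run reported in this post.
runs = {
    "3 machines, files copied": 26,
    "3 machines, file share": 17,
    "4 machines, file share": 15,
    "4 machines, file share, plus 38 s split": 15 + 38,
}

speedups = {label: BASELINE_S / secs for label, secs in runs.items()}
for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}x")


def transfer_seconds(size_bytes: int, link_bits_per_s: float) -> float:
    """Ideal time to move size_bytes once over a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_s


GIB = 1024 ** 3  # assumed file size; the real file was slightly larger
print(f"100 Mbit: {transfer_seconds(GIB, 100e6):.1f} s")
print(f"Gigabit:  {transfer_seconds(GIB, 1e9):.1f} s")
```

The speedups come out at roughly 3.2x, 4.9x and 5.5x for the three zip-only runs, and about 1.6x once the 38-second split is counted. The transfer estimate is about 86 seconds at 100 Mbit versus about 8.6 seconds at gigabit, which fits the observation that the 100 Mbit run took longer than zipping on one machine.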
__________________
Director of Products, Digipede Technologies
#5
Delcom5, you said that you had "huge files" (plural files).
As Rob pointed out, if you are zipping *multiple* files, there's no need to break each one to make it zip faster: you can simply zip files individually on individual machines. Skip the file splitting part; just set up a job where each file becomes a task.
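As a rough picture of the "each file becomes a task" pattern, here is a Python sketch that gzips a list of files with one task per file. A local thread pool stands in for the pool of grid machines; this is not the Digipede API, and the function names are made up for the example.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gzip_one(path: Path) -> Path:
    """One task: compress a single file to <name>.gz next to the original."""
    target = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so a large file is never loaded whole
    return target


def gzip_all(paths, workers: int = 4):
    """Run one gzip_one task per input file on a pool of local workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gzip_one, paths))
```

Unlike the command-line gzip, this leaves the original file in place. Because the tasks share nothing, no splitting or stitching step is needed.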
__________________
Director of Products, Digipede Technologies
#6
Hi lads,
Excellent replies, I must say. This is my scenario: I am backing up to a NAS server, and my backups are about 14 GB to 20 GB each. I zip these files before writing them to tape so that I can fit more on a tape; however, it takes a long time to zip about 15 20 GB files. My plan was to do this by jobs: take one 20 GB file, split it up, and send the pieces off to several machines on the network. After they send back their results, I merge them into one file. You guys should consider integrating this SDK with backup software of some sort, since everyone is moving to NAS storage. Thanks
#7
Hi,
Thanks for the info. I haven't had a chance to test this yet, as I was on another project; however, I downloaded your SDK and manuals last night, and I am excited to give this a shot. Another recommendation: add a page to your site listing possible scenarios in which developers can use your package. One of the questions I hear popping up is "Why or how can I use grid design to scale my app?" You could even break it down by industry type: "AeroSpace", "Distribution", "Media", etc. Thanks