[Cialug] Efficiently removing the beginning of a file

Jeffrey C. Ollie jeff at ocjtech.us
Mon May 21 12:10:58 CDT 2007


On Mon, 2007-05-21 at 11:04 -0500, Daniel A. Ramaley wrote:
> I have a ~70 MB file. The first 3635 bytes need to be removed. What is 
> the most efficient way to do that? I did this, knowing it would work 
> but would be slow:
>     $ dd if=inputfile of=outputfile ibs=1 obs=1M skip=3635
> It did indeed work. But it took 274 seconds (and pegged the CPU the 
> entire time), whereas simply copying the file with cp only takes 2 
> seconds. Since what i want to do is not *that* different an action from 
> just copying the file (at least in terms of the minimum disk operations 
> that would be required), it seems to me that there should be a way to 
> do it that only takes ~2 seconds. What are some other command line ways 
> to do this that would be more efficient?
> 
> Actually, before hitting send i tried another test, just flipping the 
> "ibs" and "skip" values:
>     $ dd if=inputfile of=outputfile2 ibs=3635 obs=1M skip=1
> That only took 2.5 seconds, which is much closer to the theoretical 2 
> second time that should be possible. But i guess what i'm curious about 
> is the general problem; if there is a large file and you need to remove 
> some small number of bytes from the beginning of it, how is that best 
> accomplished? If i had needed to remove only 1 byte for example, i 
> would have had to have used "ibs=1 skip=1" which would have taken 
> around 274 seconds again.

This has everything to do with your input block size.  With an input
block size of 1, dd has to issue a separate read for every single byte,
which is a lot of work (my CPU was spiked during this test, and nearly
all of the time shows up as sys):

        $ time dd if=test-in of=test-out ibs=1 obs=10M skip=3635
        73396685+0 records in
        6+1 records out
        73396685 bytes (73 MB) copied, 75.1107 seconds, 977 kB/s
        
        real    1m15.158s
        user    0m16.941s
        sys     0m55.442s

Switching dd to use an input block size of 3635 makes things go a LOT
faster:

        $ time dd if=test-in of=test-out ibs=3635 obs=10M skip=1
        20191+1 records in
        6+1 records out
        73396685 bytes (73 MB) copied, 0.316958 seconds, 232 MB/s
        
        real    0m0.365s
        user    0m0.050s
        sys     0m0.296s
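
The record counts that dd prints follow directly from the block sizes,
and each input record costs roughly one read. A minimal sketch of that
arithmetic (the helper name dd_records is made up for illustration):

```python
def dd_records(nbytes, block_size):
    """Return (full, partial) record counts, as in dd's 'F+P records' lines."""
    full, remainder = divmod(nbytes, block_size)
    return full, 1 if remainder else 0

copied = 73396685  # bytes copied in the runs above

print(dd_records(copied, 1))                 # ibs=1     -> (73396685, 0)
print(dd_records(copied, 3635))              # ibs=3635  -> (20191, 1)
print(dd_records(copied, 10 * 1024 * 1024))  # obs=10M   -> (6, 1)
```

So ibs=1 means about 73 million reads, while ibs=3635 needs only about
20 thousand for the same data.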
        
With a little scripting in Python this can be done a bit more generally:

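        import time

        s = time.time()

        # Open in binary mode so the bytes are copied unchanged
        i = open('test-in', 'rb')
        o = open('test-out', 'wb')

        # Skip the 3635 bytes that are to be removed
        i.seek(3635)

        # Copy the rest of the file in 10 MB chunks
        while True:
            data = i.read(10 * 1024 * 1024)
            if not data:    # an empty read means end of file
                break
            o.write(data)

        i.close()
        o.close()

        e = time.time()

        print(e - s)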

        $ python test.py 
        0.329372882843
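
The same idea can be wrapped in a reusable function so that the offset
is just a parameter. This is only a sketch: strip_prefix and its
argument names are invented here, and it leans on the standard
library's shutil.copyfileobj for the chunked copy loop.

```python
import shutil

def strip_prefix(src_path, dst_path, offset, chunk_size=10 * 1024 * 1024):
    """Copy src_path to dst_path, leaving out the first `offset` bytes."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        src.seek(offset)                          # skip the unwanted bytes
        shutil.copyfileobj(src, dst, chunk_size)  # copy the rest in big chunks
```

Calling strip_prefix('test-in', 'test-out', 3635) should do the same
job as the script. Because the seek is independent of the read size,
removing 1 byte should cost no more than removing 3635.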

Another factor to consider is how much RAM you have.  The timings above
are from after a number of runs, so much of the data would already have
been cached in RAM.
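
One way to see the caching effect is to time the same operation several
times in a row; the first run usually includes the real disk reads and
the later ones are mostly served from cache. A rough sketch (time_runs
is a hypothetical helper, not what produced the numbers above):

```python
import time

def time_runs(func, runs=5):
    """Call func() `runs` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations

# Example (assumes a file named 'test-in' exists):
#   import shutil
#   print(time_runs(lambda: shutil.copyfile('test-in', 'test-copy')))
```

If the first duration is much larger than the rest, the later runs are
probably measuring RAM rather than disk.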

Jeff