[Cialug] Algorithm; Cutting Up A File

Daniel A. Ramaley daniel.ramaley at DRAKE.EDU
Wed Dec 13 09:47:00 CST 2006


This sounds like a homework-type problem, but it is a little interesting
nevertheless. Below is the best I can do in Perl. It reads the file from
standard input; because it uses <>, a filename given on the command line
works too. The script doesn't care which tags appear in the input or
what order they are in; one datafile-<tag>.txt is created per block
either way. It should be easy to translate this to any language that has
decent regex support.

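#!/usr/bin/perl
use strict;
use warnings;
# Slurp the whole input (standard input, or a file named on the command line)
local $/ = undef;
my $input = <>;
# Find each <tag> ... </tag> block whose tags sit on their own lines and
# write the text between the tags (not the tags) to datafile-<tag>.txt
while ($input =~ m'^<([^/>][^>]*)>\n(.*?)^</\1>$'gms) {
    my ($tag, $body) = ($1, $2);
    open my $out, '>', "datafile-$tag.txt"
        or die "Cannot write datafile-$tag.txt: $!";
    print $out $body;
    close $out or die "Cannot close datafile-$tag.txt: $!";
}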
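
In case the regex is hard to read, here is a sketch of the same pattern
written with the /x flag so each piece can carry a comment. The variable
name $block and the sample text are only for illustration; the sample is
trimmed from the file in the question quoted below.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in the script, spread out with /x for commenting.
my $block = qr{
    ^ < ( [^/>] [^>]* ) > \n   # opening tag alone on a line (capture 1 = name)
    ( .*? )                    # block text, as little as possible (capture 2)
    ^ </ \1 > $                # closing tag with the same name, on its own line
}msx;

# Illustrative input: stray text, then two blocks in arbitrary order.
my $input = <<'END';
intro text outside any block
<reference>
/usr/dict/datafile
</reference>

<description>
This is a data file.  It holds data.
</description>
END

# /g walks through the input one block at a time.
while ($input =~ /$block/g) {
    print "tag=$1 text=$2";
}
```

The non-greedy (.*?) together with the \1 backreference is what keeps
each match from running past its own closing tag, and the /m flag lets
^ and $ anchor at line boundaries inside the slurped string.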

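For comparison, a line-by-line version closer in spirit to the
pseudocode quoted below could look like the following sketch. The
subroutine name extract_blocks and the idea of returning a hash are
assumptions made for illustration, not part of the script above. Note
that the quoted loop, as written, appears to write the closing tag line
to the output as well, because it writes each line before testing it;
this sketch tests first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: walk the lines once, remembering which block
# (if any) we are inside. Returns a hash of tag name => block text.
sub extract_blocks {
    my @lines = @_;
    my (%blocks, $current);
    for my $line (@lines) {
        if (defined $current) {
            if ($line =~ m{^</\Q$current\E>\s*$}) {
                undef $current;              # closing tag: stop collecting
            } else {
                $blocks{$current} .= $line;  # body line: keep it
            }
        } elsif ($line =~ m{^<([^/>][^>]*)>\s*$}) {
            $current = $1;                   # opening tag: start collecting
            $blocks{$current} = '';
        }
        # Anything else is text outside a block and is ignored.
    }
    return %blocks;
}

# Illustrative input: blocks out of order, with stray text and a blank line.
my @sample = map { "$_\n" } (
    'stray text outside any block',
    '<procedure>', '1. Read the file.', '</procedure>',
    '',
    '<description>', 'This is a data file.', '</description>',
);
my %blocks = extract_blocks(@sample);
print "$_: $blocks{$_}" for sort keys %blocks;
```

Because a single state variable tracks the current tag, the order of the
blocks does not matter and there is no need for one copy of the loop per
tag. Each hash entry can then be written to datafile-<tag>.txt.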

On Tuesday 12 December 2006 16:52, Todd Walton wrote:
>Hey scripters,
>
>I'm having trouble concocting an algorithm to cut up a text file into
>blocks.  I'm going to have text files that have three distinct blocks
>of information in them, and each block will be marked in some way.  By
>HTML style tags, I suppose.  For example:
>
>~/filez> cat datafile.txt
><description>
>This is a data file.  It holds data.
></description>
>
><procedure>
>1. Read the file.
>2. Ponder meaning of existence.
>3. Write new file.
></procedure>
>
><reference>
>/usr/dict/datafile
></reference>
>
>~/filez> _
>
>What I can assume about these files is that each will have three
>pre-defined blocks of text, enclosed by HTML style tags.  The tags are
>on their own line.  There may or may not be text outside of these
>three blocks.  There may or may not be blank lines between the blocks.
>The blocks may or may not be in a given order.  Etc.
>
>How can I read in the file's contents, take out the text between the
>tags (but not the tags!), and write that text to a file?  I begin with
>datafile.txt, I run the script, and I end up with
>datafile-description.txt, datafile-procedure.txt, and
>datafile-reference.txt.  Here's what I have so far:
>
>while datafile.position != end
>       # The block for description.
>       strLine = datafile.readNextLine
>       if strLine contains "<description>" then
>               until strLine = "</description>"
>                       strLine = datafile.readNextLine
>                       write strLine to datafile-description.txt
>               end until
>       end if
>
>       # The block for procedure. (same as for description)
>       # The block for reference. (same as for description)
>end while
>
>So, the script runs through the text file line by line, until it finds
>the opening description tag and then, starting with the next line,
>writes it all out to a new file until it comes to the end-description
>tag.  Same for the other two.  Will this work?  If the blocks are out
>of order in the datafile will this still work?  Should I change
>something?
>
>-todd
>_______________________________________________
>Cialug mailing list
>Cialug at cialug.org
>http://cialug.org/mailman/listinfo/cialug

-- 
------------------------------------------------------------------------
Dan Ramaley                            Dial Center 118, Drake University
Network Programmer/Analyst             2407 Carpenter Ave
+1 515 271-4540                        Des Moines IA 50311 USA

