[Cialug] Perl regex

Daniel A. Ramaley daniel.ramaley at DRAKE.EDU
Tue Jun 19 15:31:21 CDT 2007


On Tuesday 19 June 2007 14:01, Chris Freeman wrote:
>I'm not sure how UTF-8 conversion works better in a 'slurp' than a line-
>by-line load, but the online docs suggest that you can convert 
>pretty easily (possibly replace 'while()' with a 'for()' across the 
>array of lines in the file).

Conversion isn't really any more difficult with multiple lines, 
assuming the encoding is known (though i'd be surprised if it isn't 
faster to call Encode::decode once with a long string than to call it 
separately for each line). However, detecting the encoding of a text 
file is not an easy problem. Depending on which encodings you have to 
choose from, it may not even be possible. With most detection methods,
the probability of success increases with the size of the examined 
data. The detection method i'm currently using (regexes, with an 
implicit assumption that the encoding is UTF-8 until proven otherwise)
doesn't suffer from that limitation, but keeping the detection at the
file level will allow me to change detection methods easily if i learn
of a more accurate method, or if i have to add additional encodings
that aren't so easy to choose amongst.
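
A minimal sketch of keeping detection at the file level, so that the
method can be swapped without touching the rest of the program. The
names detect_encoding and @detectors are hypothetical, and the UTF-8
check here is a stand-in that uses a strict Encode::decode inside an
eval rather than a regex; any other detector could be dropped into
the list.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical detectors, tried in priority order. Each takes a byte string
# and returns true if the whole buffer is acceptable for that encoding.
my @detectors = (
    [ 'utf-8' => sub {
        my ($bytes) = @_;    # work on a copy: decode with a CHECK value may modify its input
        return eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
    } ],
    [ 'iso-8859-1' => sub {
        # Tab, LF, CR, printable ASCII, and the printable Latin-1 range
        return $_[0] =~ m/^[\x09\x0A\x0D\x20-\x7E\xA0-\xFF]*$/ ? 1 : 0;
    } ],
);

# Returns the name of the first encoding whose detector accepts the buffer,
# or nothing if no detector accepts it.
sub detect_encoding {
    my ($bytes) = @_;
    for my $entry (@detectors) {
        my ($name, $test) = @$entry;
        return $name if $test->($bytes);
    }
    return;
}

my $as_utf8   = detect_encoding("caf\xC3\xA9");   # UTF-8 bytes for "café"
my $as_latin1 = detect_encoding("caf\xE9");       # Latin-1 bytes for "café"
print "$as_utf8\n$as_latin1\n";
```

Changing the detection method then means editing only the entries in
@detectors; the callers keep asking for an encoding name.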

Here's how i read and decode the data (though ASCII is one of the 
possible encodings, i don't test for it since it is valid UTF-8).
If you haven't done much with encodings before, the 'utf-8' decode 
may appear to be pointless (since the data has tested good as UTF-8). 
But the decode lets Perl know to treat the data as UTF-8 and not an 
arbitrary binary blob.
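
A small illustration of what the decode changes, assuming a string
that holds the UTF-8 bytes for "café". The variable names are made up
for the example. Before decoding, Perl counts bytes; afterwards it
counts characters.

```perl
use strict;
use warnings;
use Encode ();

# The UTF-8 encoding of "café": 'c', 'a', 'f', then the two bytes C3 A9 for "é".
my $bytes = "caf\xC3\xA9";

# LEAVE_SRC keeps decode from emptying $bytes when a CHECK value is given.
my $chars = Encode::decode('UTF-8', $bytes,
                           Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "before decode: length %d\n", length($bytes);   # counts bytes
printf "after decode:  length %d\n", length($chars);   # counts characters

# Character-oriented operations now see "é" as a single unit.
print "last character is U+00E9\n" if substr($chars, -1) eq "\x{E9}";
```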

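    use Encode;  # for Encode::decode and Encode::FB_CROAK

    # Slurp the file: @ARGV gets the path and $/ is undefined, so <> reads it whole
    my $data = do { local (@ARGV, $/) = $file_path; <> };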
    # Figure out the proper encoding
    if (&is_utf8($data)) {
        $data = Encode::decode('utf-8', $data, Encode::FB_CROAK);
    } elsif (&is_iso8859_1($data)) {
        print "Using ISO-8859-1 encoding for: ${file_path}\n";
        $data = Encode::decode('iso-8859-1', $data, Encode::FB_CROAK);
    } else {
        print "Unable to determine encoding: ${file_path}\n";
        $fatal++;
    }
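
The one-line slurp above is compact but a little cryptic. A more
explicit sketch of the same step follows. The helper name slurp_raw
is hypothetical, and opening with the :raw layer is an added
assumption so that the bytes arrive untouched for the detection
step. The temporary file exists only to make the example runnable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as raw bytes, with no decoding layer.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                 # undefined record separator: read to end of file
    my $data = <$fh>;
    close($fh) or die "Cannot close $path: $!";
    return $data;
}

# Write a small sample file, then read it back in one piece.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "caf\xC3\xA9\nsecond line\n";
close($tmp_fh) or die "Cannot close $tmp_path: $!";

my $data = slurp_raw($tmp_path);
printf "read %d bytes\n", length($data);
```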

Here are my functions to do the encoding detection. I use regexes rather
than relying on the Encode module because Encode::decode gave erroneous
results in my testing (see the comment above is_iso8859_1).

################################################################################
# Returns true if the given string is valid ISO-8859-1, false otherwise.
# Regex swiped from http://lachy.id.au/log/2005/11/handling-character-encodings
# It would be possible to use this:
#    my ($junk) = @_; # Needed because if we call Encode::decode with $_[0],
#                     # the variable in our calling function will be destroyed!
#    eval { Encode::decode('iso-8859-1', $junk, Encode::FB_CROAK) };
#    return not $@;
# However, testing shows Encode::decode may return erroneous results if given
# UTF-8 data containing Japanese characters. So the regex is actually better.
sub is_iso8859_1 {
    $_[0] =~ m/^([\x09\x0A\x0D\x20-\x7E\xA0-\xFF])*$/x;
}

################################################################################
# Returns true if the given string is valid UTF-8, false otherwise.
# Regex swiped from http://www.w3.org/International/questions/qa-forms-utf-8
sub is_utf8 {
    $_[0] =~ m/^([\x09\x0A\x0D\x20-\x7E]            # ASCII
               | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
               |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
               | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
               |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
               |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
               | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
               |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
              )*$/x;
}
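
A short usage sketch for the two functions, with both repeated so the
example runs on its own. The sample byte strings are made up for
illustration. It also shows why the UTF-8 test has to come first:
some valid UTF-8 (such as the bytes for "café") also passes the
ISO-8859-1 regex, because C3 and A9 are both in the \xA0-\xFF range.

```perl
use strict;
use warnings;

sub is_iso8859_1 {
    $_[0] =~ m/^([\x09\x0A\x0D\x20-\x7E\xA0-\xFF])*$/x;
}

sub is_utf8 {
    $_[0] =~ m/^([\x09\x0A\x0D\x20-\x7E]            # ASCII
               | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
               |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
               | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
               |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
               |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
               | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
               |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
              )*$/x;
}

# Label and raw bytes for each sample.
my @samples = (
    [ 'plain ASCII'          => "plain text"   ],
    [ 'UTF-8 "cafe" w/ e9'   => "caf\xC3\xA9"  ],   # also passes the Latin-1 regex
    [ 'UTF-8 euro sign'      => "\xE2\x82\xAC" ],   # 0x82 is outside the Latin-1 regex
    [ 'Latin-1 "cafe" w/ e9' => "caf\xE9"      ],   # truncated sequence as UTF-8
    [ 'overlong C0 80'       => "\xC0\x80"     ],   # rejected by both
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-22s utf8=%-3s iso8859_1=%s\n",
        $label,
        (is_utf8($bytes)      ? 'yes' : 'no'),
        (is_iso8859_1($bytes) ? 'yes' : 'no');
}
```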

>Obviously, it is longer. But no one (including yourself) will come back to
>visit you in six months/one year/three years with a Percussive Teaching
>Instrument.

I might go with something like your idea. It is very simple to follow.
Thanks!

------------------------------------------------------------------------
Dan Ramaley                            Dial Center 118, Drake University
Network Programmer/Analyst             2407 Carpenter Ave
+1 515 271-4540                        Des Moines IA 50311 USA

