Recipe 8.11. Processing Binary Files (Perl Cookbook)


8.11. Processing Binary Files

Problem

Your system distinguishes between text and binary files. How do you process binary files on it?

Solution

Use the binmode function on the filehandle:

binmode(HANDLE);

Discussion

Not everyone agrees what constitutes a line in a text file, because one person's textual character set is another's binary gibberish. Even when everyone is using ASCII instead of EBCDIC, Rad50, or Unicode, discrepancies arise.

As mentioned in the Introduction, there is no such thing as a newline character. It is purely virtual, a figment of the operating system, standard libraries, device drivers, and Perl.

Under Unix or Plan 9, a "\n" represents the physical sequence "\cJ" (the Perl double-quote escape for Ctrl-J), a linefeed. However, on a terminal that's not in raw mode, an Enter key generates an incoming "\cM" (a carriage return) which turns into "\cJ", whereas an outgoing "\cJ" turns into "\cM\cJ". This strangeness doesn't happen with normal files, just terminal devices, and it is handled strictly by the device driver.

On a Mac, a "\n" is usually represented by "\cM"; just to make life interesting (and because the standard requires that "\n" and "\r" be different), a "\r" represents a "\cJ". This is exactly the opposite of the way that Unix, Plan 9, VMS, CP/M, or nearly anyone else does it. So, Mac programmers writing files for other systems or talking over a network have to be careful. If you send out "\n", you'll deliver a "\cM", and no "\cJ" will be seen. Most network services prefer to receive and send "\cM\cJ" as a line terminator, but most accept merely a "\cJ".
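Because "\n" and "\r" mean different bytes on different systems, one way to stay safe on the wire is to name the bytes by their octal codes. The following is a minimal sketch of that idea; the helper names net_line and strip_eol and the sample greeting are made up for illustration and are not part of any module.

```perl
use strict;
use warnings;

# Octal escapes name the bytes directly, so they mean the same thing
# on every platform, unlike "\r" and "\n".
my $CRLF = "\015\012";

# Hypothetical helper: terminate one outgoing line with CR LF.
sub net_line {
    my ($text) = @_;
    return $text . $CRLF;
}

# Hypothetical helper: strip either CR LF or a bare LF from a received line.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\015?\012\z//;
    return $line;
}

my $out = net_line("HELO example.org");
# %vd prints the byte values of the last two characters, joined by dots
printf "%d bytes, ends with %vd\n", length($out), substr($out, -2);
```

When printing such strings to a socket or file on a system that translates text, the handle would still need binmode so that the "\012" is not expanded a second time.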

Under VMS, DOS, or their derivatives, a "\n" represents "\cJ", similar to Unix and Plan 9. From the perspective of a tty, Unix and DOS behave identically: a user who hits Enter generates a "\cM", but this arrives at the program as a "\n", which is "\cJ". A "\n" (that's a "\cJ", remember) sent to a terminal shows up as a "\cM\cJ".

These strange conversions happen to Windows files as well. A DOS text file actually physically contains two characters at the end of every line, "\cM\cJ". The last block in the file has a "\cZ" to indicate where the text stops. When you write a line like "bad news\n" on those systems, the file contains "bad news\cM\cJ", just as if it were a terminal.

When you read a line on such systems, it's even stranger. The file itself contains "bad news\cM\cJ", a 10-byte string. When you read it in, your program gets nothing but "bad news\n", where that "\n" is the virtual newline character, that is, a linefeed ("\cJ"). That means to get rid of it, a single chop or chomp will do. But your poor program has been tricked into thinking it's only read nine bytes from the file. If you were to read 10 such lines, you would appear to have read just 90 bytes into the file, but in fact would be at position 100. That's why the tell function must always be used to determine your location. You can't infer your position just by counting what you've read.
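The arithmetic above can be reproduced on any platform. The sketch below assumes a Perl built with PerlIO (5.8 or later), whose :crlf layer applies DOS-style translation to a handle even on Unix. The file name badnews.txt and both helper subs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write $n DOS-style lines to $path, byte for byte.
sub write_dos_lines {
    my ($path, $n) = @_;
    open(my $out, ">", $path) or die "can't write $path: $!";
    binmode($out);                          # no translation on any platform
    print {$out} "bad news\015\012" for 1 .. $n;
    close($out) or die "can't close $path: $!";
}

# Read $path as DOS text; return (characters counted, position from tell).
sub count_vs_tell {
    my ($path) = @_;
    open(my $in, "<:crlf", $path) or die "can't read $path: $!";
    my $seen = 0;
    while (my $line = <$in>) {
        $seen += length $line;              # "bad news\n" is 9 characters
    }
    my $pos = tell($in);                    # offset in the underlying file
    close($in);
    return ($seen, $pos);
}

my $path = "badnews.txt";                   # scratch file name, an assumption
write_dos_lines($path, 10);
my ($seen, $pos) = count_vs_tell($path);
printf "counted %d, tell says %d, size on disk %d\n",
    $seen, $pos, (-s $path);
unlink $path;
```

The counted total falls short of the size on disk by one byte per line, which is why a position saved for a later seek should come from tell rather than from a running sum of lengths.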

This legacy of the old CP/M filesystem, whose equivalent of a Unix inode stored only block counts and not file sizes, has frustrated programmers for decades, and no end is in sight. Because DOS is compatible with CP/M file formats, Windows with DOS, and NT with Windows, the sins of the fathers have truly been visited unto the children of the fourth generation.

You can circumvent the single "\n" terminator by telling Perl (and the operating system) that you're working with binary data. The binmode function indicates that data read or written through the given filehandle should not be mangled the way a text file would likely be on those systems.

$gifname = "picture.gif";
open(GIF, "<", $gifname)    or die "can't open $gifname: $!";

binmode(GIF);               # now DOS won't mangle binary input from GIF
binmode(STDOUT);            # now DOS won't mangle binary output to STDOUT

# copy the file to STDOUT in 8 KB chunks
while (read(GIF, $buff, 8 * 2**10)) {
    print STDOUT $buff;
}
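The loop above stops both at end of file and on a read error, because read returns 0 in the first case and undef in the second. If that difference matters, a variation along the following lines might help. It is only a sketch: copy_binary is a hypothetical helper name, and lexical filehandles are used instead of the barewords above.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $src to $dst byte for byte.
# Returns the number of bytes copied; dies on any I/O error.
sub copy_binary {
    my ($src, $dst) = @_;
    open(my $in,  "<", $src) or die "can't open $src: $!";
    open(my $out, ">", $dst) or die "can't create $dst: $!";
    binmode($in);                           # no translation on input
    binmode($out);                          # no translation on output

    my ($buff, $total) = ("", 0);
    while (1) {
        my $got = read($in, $buff, 8 * 2**10);
        die "read error on $src: $!" unless defined $got;   # undef: error
        last if $got == 0;                                  # 0: end of file
        print {$out} $buff or die "write error on $dst: $!";
        $total += $got;
    }
    close($in);
    close($out) or die "can't close $dst: $!";
    return $total;
}
```

A call such as copy_binary("picture.gif", "copy.gif") would then either return the byte count or die with the reason, rather than silently producing a short copy.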

Calling binmode on systems that don't make this distinction (including Unix, the Mac, and Plan 9) is harmless. Inappropriately doing so (such as on a text file) on systems that do (including MVS, VMS, and DOS, regardless of its GUI) can mangle your files.

If you're not using binmode, the data you read using stdio (<>) will automatically have the native system's line terminator changed to "\n", even if you change $/. Similarly, any "\n" you print to the filehandle will be turned into the native line terminator. See this chapter's Introduction for more details.

If you want to get what was on the disk, byte for byte, you should set binmode if you're on one of the odd systems listed above. Then, of course, you also have to set $/ to the real record separator if you want to use <> on it.
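As an illustration of that last point, here is a minimal sketch that reads a file byte for byte and tells <> what the on-disk terminator really is. The helper name raw_records and the scratch file records.dat are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the records of $path, split on the literal
# byte sequence $sep, with no line-ending translation applied.
sub raw_records {
    my ($path, $sep) = @_;
    open(my $fh, "<", $path) or die "can't open $path: $!";
    binmode($fh);                   # deliver the bytes exactly as stored
    local $/ = $sep;                # the real record separator
    my @records = <$fh>;
    close($fh);
    chomp(@records);                # chomp removes whatever $/ holds
    return @records;
}

my $path = "records.dat";           # scratch file name, an assumption
open(my $out, ">", $path) or die "can't write $path: $!";
binmode($out);
print {$out} "alpha\015\012", "beta\015\012";
close($out) or die "can't close $path: $!";

my @records = raw_records($path, "\015\012");
print scalar(@records), " records: @records\n";
unlink $path;
```

Using local keeps the change to $/ confined to the helper, so other code that reads text files is not affected.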

See Also

The open and binmode functions in perlfunc(1) and in Chapter 3 of Programming Perl; your system's open(2) and fopen(3) manpages



Copyright © 2001 O'Reilly & Associates. All rights reserved.