Bits and Pieces (Programming Perl)

Chapter 2. Bits and Pieces

Since we're starting small, the progression through the next several chapters is necessarily from small to large. That is, we take a bottom-up approach, beginning with the smallest components of Perl programs and building them into more elaborate structures, much as molecules are built out of atoms. The disadvantage of this approach is that you don't necessarily get the Big Picture before getting lost in a welter of details. The advantage is that you can understand the examples as we go along. (If you're a top-down person, just turn the book over and read the chapters backward.)

Each chapter does build on the preceding chapter (or the subsequent chapter, if you're reading backward), so you'll need to be careful if you're the sort of person who skips around.

You're certainly welcome to peek at the reference materials toward the end of the book as we go along. (That doesn't count as skipping around.) In particular, any isolated word in typewriter font is likely to be found in Chapter 29, "Functions". And although we've tried to stay operating-system neutral, if you are unfamiliar with Unix terminology and run into a word that doesn't seem to mean what you think it ought to mean, you should check whether the word is in the Glossary. If the Glossary doesn't work, the index probably will.

2.1. Atoms

Although there are various invisible things going on behind the scenes that we'll explain presently, the smallest things you generally work with in Perl are individual characters. And we do mean characters; historically, Perl freely confused bytes with characters and characters with bytes, but in this new era of global networking, we must be careful to distinguish the two.

Perl may, of course, be written entirely in the 7-bit ASCII character set. Perl also allows you to write in any 8-bit or 16-bit character set, whether it's a national character set or some other legacy character set. However, if you choose to write in one of these older, non-ASCII character sets, you may use non-ASCII characters only within string literals. You are responsible for making sure that the semantics of your program are consistent with the particular national character set you've chosen. For instance, if you're using a 16-bit encoding for an Asian national character set, keep in mind that Perl will generally think of each of your characters as two bytes, not as one character.
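To make the two-bytes-versus-one-character point concrete, here is a minimal sketch. It assumes a Perl that ships the Encode module (5.8 or later, which postdates this text), and it picks Shift-JIS purely as an illustrative national character set; neither the module nor the choice of encoding comes from the discussion above.

```perl
use strict;
use warnings;
use Encode qw(decode);   # assumption: Encode is available (core since Perl 5.8)

# The two bytes that Shift-JIS uses for HIRAGANA LETTER A.
my $raw = "\x82\xA0";

# Without any decoding, Perl sees two separate one-byte characters.
my $raw_len = length $raw;

# Decoding turns the byte pair into a single character, U+3042.
my $text     = decode('shiftjis', $raw);
my $text_len = length $text;

printf "undecoded length: %d, decoded length: %d, code point: U+%04X\n",
    $raw_len, $text_len, ord($text);
```

Until something like the decode step happens, every string operation on the raw data counts and indexes bytes, which is exactly the bookkeeping you are responsible for when you stay in a legacy encoding.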

As described in Chapter 15, "Unicode", we've recently added support for Unicode to Perl.[1] This support is pervasive throughout the language: you can use Unicode characters in identifiers (variable names and such) as well as within literal strings. When you are using Unicode, you don't need to worry about how many bits or bytes it takes to represent a character. Perl just pretends all Unicode characters are the same size (that is, size 1), even though any given character might be represented by multiple bytes internally. Perl normally represents Unicode internally as UTF-8, a variable-length encoding. (For instance, a Unicode smiley character, U-263A, would be represented internally as a three-byte sequence.)
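The "size 1" claim and the three-byte internal sequence can both be checked with a short sketch. It uses the smiley mentioned above (U+263A in the standard notation) and, as an assumption not drawn from this chapter, the Encode module from later Perl releases to produce the UTF-8 bytes explicitly.

```perl
use strict;
use warnings;
use Encode qw(encode);   # assumption: Encode is available (core since Perl 5.8)

# A one-character string holding the smiley.
my $smiley = "\x{263A}";

# length() counts characters, not bytes.
my $chars = length $smiley;

# Encoding to UTF-8 gives us the byte sequence to inspect.
my $utf8  = encode('UTF-8', $smiley);
my $bytes = length $utf8;

# Render each byte of the encoded form as two hex digits.
my $hex = join ' ', map { sprintf('%02x', ord($_)) } split(//, $utf8);

printf "characters: %d, UTF-8 bytes: %d (%s)\n", $chars, $bytes, $hex;
```

On such a Perl this should report one character and three bytes, e2 98 ba, which is the variable-length encoding at work.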

[1] As excited as we are about Unicode support, most of our examples will be in ASCII, since not everyone has a decent Unicode editor yet.

If you'll let us drive our analogy of the physical elements a bit further, characters are atomic in the same sense as the individual atoms of the various elements. Yes, they're composed of smaller particles known as bits and bytes, but if you break a character apart (in a character accelerator, no doubt), the individual bits and bytes lose the distinguishing chemical properties of the character as a whole. Just as neutrons are an implementation detail of the U-238 atom, so too bytes are an implementation detail of the U-263A character.

So we'll be careful to say "characters" when we mean characters, and "bytes" when we mean bytes. But we don't mean to scare you--you can still do the good old-fashioned byte processing easily enough. All you have to do is tell Perl that you still want to think of bytes as characters. You can do that with a use bytes pragma (see Chapter 31, "Pragmatic Modules"). But even if you don't do that, Perl will still do a pretty good job of keeping small characters in 8 bits when you expect it to.
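A small sketch of what the pragma changes, assuming only the bytes pragma itself; the variable names and sample strings are ours, chosen for illustration. Note that the pragma is lexically scoped, so it can be confined to a single block, and that more recent Perl documentation discourages its use for anything other than debugging.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";   # one wide character
my $small  = "caf\xE9";    # four characters, each of which fits in 8 bits

# Default character semantics.
my $smiley_chars = length $smiley;
my $small_chars  = length $small;

my ($smiley_bytes, $small_bytes);
{
    use bytes;             # byte semantics, inside this block only
    $smiley_bytes = length $smiley;   # bytes of the internal representation
    $small_bytes  = length $small;    # small characters stay one byte each
}

printf "smiley: %d character, %d bytes\n", $smiley_chars, $smiley_bytes;
printf "small:  %d characters, %d bytes\n", $small_chars, $small_bytes;
```

The second string illustrates the last point above: as long as nothing forces it into the wider internal form, its characters keep occupying one byte apiece, so both counts agree.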

So don't sweat the small stuff. Let's move on to bigger and better things.

