=over

=item pack TEMPLATE,LIST
X<pack>

Takes a LIST of values and converts it into a string using the rules
given by the TEMPLATE.  The resulting string is the concatenation of
the converted values.  Typically, each converted value looks
like its machine-level representation.  For example, on 32-bit machines
an integer may be represented by a sequence of 4 bytes, which will in
Perl be presented as a string that's 4 characters long.

See L<perlpacktut> for an introduction to this function.

The TEMPLATE is a sequence of characters that give the order and type
of values, as follows:

    a  A string with arbitrary binary data, will be null padded.
    A  A text (ASCII) string, will be space padded.
    Z  A null-terminated (ASCIZ) string, will be null padded.

    b  A bit string (ascending bit order inside each byte,
       like vec()).
    B  A bit string (descending bit order inside each byte).
    h  A hex string (low nybble first).
    H  A hex string (high nybble first).

    c  A signed char (8-bit) value.
    C  An unsigned char (octet) value.
    W  An unsigned char value (can be greater than 255).

    s  A signed short (16-bit) value.
    S  An unsigned short value.

    l  A signed long (32-bit) value.
    L  An unsigned long value.

    q  A signed quad (64-bit) value.
    Q  An unsigned quad value.
         (Quads are available only if your system supports 64-bit
          integer values _and_ if Perl has been compiled to support
          those.  Raises an exception otherwise.)

    i  A signed integer value.
    I  An unsigned integer value.
         (This 'integer' is _at_least_ 32 bits wide.  Its exact
          size depends on what a local C compiler calls 'int'.)

    n  An unsigned short (16-bit) in "network" (big-endian) order.
    N  An unsigned long (32-bit) in "network" (big-endian) order.
    v  An unsigned short (16-bit) in "VAX" (little-endian) order.
    V  An unsigned long (32-bit) in "VAX" (little-endian) order.

    j  A Perl internal signed integer value (IV).
    J  A Perl internal unsigned integer value (UV).

    f  A single-precision float in native format.
    d  A double-precision float in native format.

    F  A Perl internal floating-point value (NV) in native format
    D  A float of long-double precision in native format.
         (Long doubles are available only if your system supports
          long double values _and_ if Perl has been compiled to
          support those.  Raises an exception otherwise.
          Note that there are different long double formats.)

    p  A pointer to a null-terminated string.
    P  A pointer to a structure (fixed-length string).

    u  A uuencoded string.
    U  A Unicode character number.  Encodes to a character in
       character mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms)
       in byte mode.

    w  A BER compressed integer (not an ASN.1 BER, see perlpacktut
       for details).  Its bytes represent an unsigned integer in
       base 128, most significant digit first, with as few digits
       as possible.  Bit eight (the high bit) is set on each byte
       except the last.

    x  A null byte (a.k.a ASCII NUL, "\000", chr(0))
    X  Back up a byte.
    @  Null-fill or truncate to absolute position, counted from the
       start of the innermost ()-group.
    .  Null-fill or truncate to absolute position specified by
       the value.
    (  Start of a ()-group.

One or more modifiers below may optionally follow certain letters in the
TEMPLATE (the second column lists letters for which the modifier is valid):

    !   sSlLiI     Forces native (short, long, int) sizes instead
                   of fixed (16-/32-bit) sizes.

    !   xX         Make x and X act as alignment commands.

    !   nNvV       Treat integers as signed instead of unsigned.

    !   @.         Specify position as byte offset in the internal
                   representation of the packed string.  Efficient
                   but dangerous.

    >   sSiIlLqQ   Force big-endian byte-order on the type.
        jJfFdDpP   (The "big end" touches the construct.)

    <   sSiIlLqQ   Force little-endian byte-order on the type.
        jJfFdDpP   (The "little end" touches the construct.)

The C<< > >> and C<< < >> modifiers can also be used on C<()> groups
to force a particular byte-order on all components in that group,
including all its subgroups.
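
The letters and modifiers listed above can be tried out in a few lines.
What follows is only a minimal sketch, assuming an ASCII platform; the
variable names (C<$rec>, C<@fields>, and so on) are illustrative and do not
come from the original text.

```perl
use strict;
use warnings;

# "A3" = space-padded 3-character text, "n" = 16-bit big-endian unsigned,
# "C"  = one unsigned octet.
my $rec    = pack("A3 n C", "ab", 258, 65);
my @fields = unpack("A3 n C", $rec);       # "A" strips the padding again

# Byte-order modifiers applied to a signed 16-bit short.
my $be = pack("s>", 1);                     # big-endian:    00 01
my $le = pack("s<", 1);                     # little-endian: 01 00

# "n" reads two bytes as unsigned; "n!" reads the same bytes as signed.
my ($unsigned) = unpack("n",  "\xff\xfe");
my ($signed)   = unpack("n!", "\xff\xfe");

print join(" ", map { sprintf "%02x", $_ } unpack("C*", $rec)), "\n";
print "@fields\n";
print "$unsigned $signed\n";
```

Because C<n>, C<< s> >>, and C<< s< >> fix both the size and the byte order,
the packed strings here should be the same on every platform, unlike a bare
C<s> or C<l>.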

=begin comment

Larry recalls that the hex and bit string formats (H, h, B, b) were added to
pack for processing data from NASA's Magellan probe.  Magellan was in an
elliptical orbit, using the antenna for the radar mapping when close to
Venus and for communicating data back to Earth for the rest of the orbit.
There were two transmission units, but one of these failed, and then the
other developed a fault whereby it would randomly flip the sense of all the
bits.  It was easy to automatically detect complete records with the correct
sense, and complete records with all the bits flipped.  However, this didn't
recover the records where the sense flipped midway.  A colleague of Larry's
was able to pretty much eyeball where the records flipped, so they wrote an
editor named kybble (a pun on the dog food Kibbles 'n Bits) to enable him to
manually correct the records and recover the data.  For this purpose pack
gained the hex and bit string format specifiers.

git shows that they were added to perl 3.0 in patch #44 (Jan 1991, commit
27e2fb84680b9cc1), but the patch description makes no mention of their
addition, let alone the story behind them.

=end comment

The following rules apply:

=over

=item *

Each letter may optionally be followed by a number indicating the repeat
count.  A numeric repeat count may optionally be enclosed in brackets, as
in C<pack("C[80]", @arr)>.  The repeat count gobbles that many values from
the LIST when used with all format types other than C<a>, C<A>, C<Z>, C<b>,
C<B>, C<h>, C<H>, C<@>, C<.>, C<x>, C<X>, and C<P>, where it means
something else, described below.

=item *

The C<p> format packs a pointer to a null-terminated string.  You are
responsible for ensuring that the string is not a temporary value, as that
could potentially get deallocated before you got around to using the packed
result.  The C<P> format packs a pointer to a structure of the size indicated
by the length.  A null pointer is created if the corresponding value for
C<p> or C<P>
is L<C<undef>|/undef EXPR>; similarly with
L<C<unpack>|/unpack TEMPLATE,EXPR>, where a null pointer unpacks into
L<C<undef>|/undef EXPR>.

If your system has a strange pointer size--meaning a pointer is neither as
big as an int nor as big as a long--it may not be possible to pack or
unpack pointers in big- or little-endian byte order.  Attempting to do
so raises an exception.

=item *

The C</> template character allows packing and unpacking of a sequence of
items where the packed structure contains a packed item count followed by
the packed items themselves.  This is useful when the structure you're
unpacking has encoded the sizes or repeat counts for some of its fields
within the structure itself as separate fields.

For L<C<pack>|/pack TEMPLATE,LIST>, you write
I<length-item>C</>I<sequence-item>, and the
I<length-item> describes how the length value is packed.  Formats likely
to be of most use are integer-packing ones like C<n> for Java strings,
C<w> for ASN.1 or SNMP, and C<N> for Sun XDR.

For L<C<pack>|/pack TEMPLATE,LIST>, I<sequence-item> may have a repeat
count, in which case the minimum of that and the number of available items
is used as the argument for I<length-item>.  If it has no repeat count or
uses a '*', the number of available items is used.

For L<C<unpack>|/unpack TEMPLATE,EXPR>, an internal stack of integer
arguments unpacked so far is used.  You write C</>I<sequence-item> and the
repeat count is obtained by popping off the last element from the stack.
The I<sequence-item> must not have a repeat count.

If I<sequence-item> refers to a string type (C<"A">, C<"a">, or C<"Z">),
the I<length-item> is the string length, not the number of strings.  With
an explicit repeat count for pack, the packed string is adjusted to that
length.  For example:

 This code:                             gives this result:

 unpack("W/a", "\004Gurusamy")          ("Guru")
 unpack("a3/A A*", "007 Bond  J ")      (" Bond", "J")
 unpack("a3 x2 /A A*", "007: Bond, J.") ("Bond, J", ".")

 pack("n/a* w/a","hello,","world")      "\000\006hello,\005world"
 pack("a/W2", ord("a") .. ord("z"))     "2ab"

The I<length-item> is not returned explicitly from
L<C<unpack>|/unpack TEMPLATE,EXPR>.

Supplying a count to the I<length-item> format letter is only useful with
C<A>, C<a>, or C<Z>.
Packing with a I<length-item> of C<a> or C<Z> may
introduce C<"\000"> characters, which Perl does not regard as legal in
numeric strings.

=item *

The integer types C<s>, C<S>, C<l>, and C<L> may be
followed by a C<!> modifier to specify native shorts or
longs.  As shown in the example above, a bare C<l> means
exactly 32 bits, although the native C<long> as seen by the local C compiler
may be larger.  This is mainly an issue on 64-bit platforms.  You can
see whether using C<!> makes any difference this way:

    printf "format s is %d, s! is %d\n",
        length pack("s"), length pack("s!");

    printf "format l is %d, l! is %d\n",
        length pack("l"), length pack("l!");

C<i!> and C<I!> are also allowed, but only for completeness' sake:
they are identical to C<i> and C<I>.

The actual sizes (in bytes) of native shorts, ints, longs, and long
longs on the platform where Perl was built are also available from
the command line:

    $ perl -V:{short,int,long{,long}}size
    shortsize='2';
    intsize='4';
    longsize='4';
    longlongsize='8';

or programmatically via the L<C<Config>|Config> module:

    use Config;
    print $Config{shortsize},    "\n";
    print $Config{intsize},      "\n";
    print $Config{longsize},     "\n";
    print $Config{longlongsize}, "\n";

C<$Config{longlongsize}> is undefined on systems without
long long support.

=item *

The integer formats C<s>, C<S>, C<i>, C<I>, C<l>, C<L>, C<j>, and C<J> are
inherently non-portable between processors and operating systems because
they obey native byteorder and endianness.  For example, a 4-byte integer
0x12345678 (305419896 decimal) would be ordered natively (arranged in and
handled by the CPU registers) into bytes as

    0x12 0x34 0x56 0x78  # big-endian
    0x78 0x56 0x34 0x12  # little-endian

Basically, Intel and VAX CPUs are little-endian, while everybody else,
including Motorola m68k/88k, PPC, Sparc, HP PA, Power, and Cray, are
big-endian.  Alpha and MIPS can be either: Digital/Compaq uses (well, used)
them in little-endian mode, but SGI/Cray uses them in big-endian mode.


The names I<big-endian> and I<little-endian> are comic references to the
egg-eating habits of the little-endian Lilliputians and the big-endian
Blefuscudians from the classic Jonathan Swift satire, I<Gulliver's Travels>.
This entered computer lingo via the paper "On Holy Wars and a Plea for
Peace" by Danny Cohen, USC/ISI IEN 137, April 1, 1980.

Some systems may have even weirder byte orders such as

    0x56 0x78 0x12 0x34
    0x34 0x12 0x78 0x56

These are called mid-endian, middle-endian, mixed-endian, or just weird.

You can determine your system endianness with this incantation:

    printf("%#02x ", $_) for unpack("W*", pack L=>0x12345678);

The byteorder on the platform where Perl was built is also available
via L<Config>:

    use Config;
    print "$Config{byteorder}\n";

or from the command line:

    $ perl -V:byteorder

Byteorders C<"1234"> and C<"12345678"> are little-endian; C<"4321">
and C<"87654321"> are big-endian.  Systems with multiarchitecture binaries
will have C<"ffff">, signifying that static information doesn't work;
one must use runtime probing.

For portably packed integers, either use the formats C<n>, C<N>, C<v>,
and C<V> or else use the C<< > >> and C<< < >> modifiers described
immediately below.  See also L<perlport>.

=item *

Floating-point numbers also have endianness.  Usually (but not always)
this agrees with the integer endianness.  Even though most platforms
these days use the IEEE 754 binary format, there are differences,
especially if long doubles are involved.  You can see the
C<Config> variables C<doublekind> and C<longdblkind> (also C<doublesize>,
C<longdblsize>): the "kind" values are enums, unlike C<byteorder>.

Portability-wise the best option is probably to keep to the IEEE 754
64-bit doubles, and of agreed-upon endianness.  Another possibility
is the C<"%a"> format of L<C<printf>|/printf FILEHANDLE FORMAT, LIST>.

=item *

Starting with Perl 5.10.0, integer and floating-point formats, along with
the C<p> and C<P> formats and C<()> groups, may all be followed by the
C<< > >> or C<< < >> endianness modifiers to respectively enforce big-
or little-endian byte-order.  These modifiers are especially useful
given how C<n>, C<N>, C<v>, and C<V> don't cover signed integers,
64-bit integers, or floating-point values.

Here are some concerns to keep in mind when using an endianness modifier:

=over

=item *

Exchanging signed integers between different platforms works only
when all platforms store them in the same format.  Most platforms store
signed integers in two's-complement notation, so usually this is not an
issue.

=item *

The C<< > >> or C<< < >> modifiers can only be used on floating-point
formats on big- or little-endian machines.  Otherwise, attempting to
use them raises an exception.

=item *

Forcing big- or little-endian byte-order on floating-point values for
data exchange can work only if all platforms use the same
binary representation such as IEEE floating-point.  Even if all
platforms are using IEEE, there may still be subtle differences.  Being able
to use C<< > >> or C<< < >> on floating-point values can be useful,
but also dangerous if you don't know exactly what you're doing.
It is not a general way to portably store floating-point values.

=item *

When using C<< > >> or C<< < >> on a C<()> group, this affects
all types inside the group that accept byte-order modifiers,
including all subgroups.  It is silently ignored for all other
types.  You are not allowed to override the byte-order within a group
that already has a byte-order modifier suffix.

=back

=item *

Real numbers (floats and doubles) are in native machine format only.
Due to the multiplicity of floating-point formats and the lack of a
standard "network" representation for them, no facility for interchange
has been made.
This means that packed floating-point data written on one machine
may not be readable on another, even if both use IEEE floating-point
arithmetic (because the endianness of the memory representation is not part
of the IEEE spec).  See also L<perlport>.

If you know I<exactly> what you're doing, you can use the C<< > >>
or C<< < >> modifiers to force big- or little-endian byte-order on
floating-point values.

Because Perl uses doubles (or long doubles, if configured) internally for
all numeric calculation, converting from double into float and thence
to double again loses precision, so C<unpack("f", pack("f", $foo))>
will not in general equal $foo.

=item *

Pack and unpack can operate in two modes: character mode (C<C0> mode) where
the packed string is processed per character, and UTF-8 byte mode (C<U0> mode)
where the packed string is processed in its UTF-8-encoded Unicode form on
a byte-by-byte basis.  Character mode is the default
unless the format string starts with C<U>.  You
can always switch mode mid-format with an explicit
C<C0> or C<U0> in the format.  This mode remains in effect until the next
mode change, or until the end of the C<()> group it (directly) applies to.

Using C<C0> to get Unicode characters while using C<U0> to get I<non>-Unicode
bytes is not necessarily obvious.  Probably only the first of these
is what you want:

    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
      perl -CS -ne 'printf "%v04X\n", $_ for unpack("C0A*", $_)'
    03B1.03C9
    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
      perl -CS -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
    CE.B1.CF.89
    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
      perl -C0 -ne 'printf "%v02X\n", $_ for unpack("C0A*", $_)'
    CE.B1.CF.89
    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
      perl -C0 -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
    C3.8E.C2.B1.C3.8F.C2.89

Those examples also illustrate that you should not try to use
L<C<pack>|/pack TEMPLATE,LIST>/L<C<unpack>|/unpack TEMPLATE,EXPR> as a
substitute for the L<Encode> module.

=item *

You must yourself do any alignment or padding by inserting, for example,
enough C<"x">es while packing.
There is no way for
L<C<pack>|/pack TEMPLATE,LIST> and L<C<unpack>|/unpack TEMPLATE,EXPR>
to know where characters are going to or coming from, so they
handle their output and input as flat sequences of characters.

=item *

A C<()> group is a sub-TEMPLATE enclosed in parentheses.  A group may
take a repeat count either as postfix, or for
L<C<unpack>|/unpack TEMPLATE,EXPR>, also via the C</> template
character.  Within each repetition of a group, positioning with C<@>
starts over at 0.  Therefore, the result of

    pack('@1A((@2A)@3A)', qw[X Y Z])

is the string C<"\0X\0\0YZ">.  (The template is single-quoted so that
Perl does not try to interpolate C<@1>, C<@2>, and C<@3> as arrays.)

=item *

C<x> and C<X> accept the C<!> modifier to act as alignment commands: they
jump forward or back to the closest position aligned at a multiple of C<count>
characters.  For example, to L<C<pack>|/pack TEMPLATE,LIST> or
L<C<unpack>|/unpack TEMPLATE,EXPR> a C structure like

    struct {
        char   c;    /* one signed, 8-bit character */
        double d;
        char   cc[2];
    }

one may need to use the template C<c x![d] d c[2]>.  This assumes that
doubles must be aligned to the size of double.

For alignment commands, a C<count> of 0 is equivalent to a C<count> of 1;
both are no-ops.

=item *

C<n>, C<N>, C<v> and C<V> accept the C<!> modifier to
represent signed 16-/32-bit integers in big-/little-endian order.
This is portable only when all platforms sharing packed data use the
same binary representation for signed integers; for example, when all
platforms use two's-complement representation.

=item *

Comments can be embedded in a TEMPLATE using C<#> through the end of line.
White space can separate pack codes from each other, but modifiers and
repeat counts must follow immediately.  Breaking complex templates into
individual line-by-line components, suitably annotated, can do as much to
improve legibility and maintainability of pack/unpack formats as C</x> can
for complicated pattern matches.

=item *

If TEMPLATE requires more arguments than L<C<pack>|/pack TEMPLATE,LIST>
is given, L<C<pack>|/pack TEMPLATE,LIST>
assumes additional C<""> arguments.  If TEMPLATE requires fewer arguments
than given, extra arguments are ignored.

=item *

Attempting to pack the special floating point values C<Inf> and C<NaN>
(infinity, also in negative, and not-a-number) into packed integer values
(like C<"L">) is a fatal error.  The reason for this is that there simply
isn't any sensible mapping for these special values into integers.

=back

Examples:

    $foo = pack("WWWW",65,66,67,68);
    # foo eq "ABCD"
    $foo = pack("W4",65,66,67,68);
    # same thing
    $foo = pack("W4",0x24b6,0x24b7,0x24b8,0x24b9);
    # same thing with Unicode circled letters.
    $foo = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
    # same thing with Unicode circled letters.  You don't get the
    # UTF-8 bytes because the U at the start of the format caused
    # a switch to U0-mode, so the UTF-8 bytes get joined into
    # characters
    $foo = pack("C0U4",0x24b6,0x24b7,0x24b8,0x24b9);
    # foo eq "\xe2\x92\xb6\xe2\x92\xb7\xe2\x92\xb8\xe2\x92\xb9"
    # This is the UTF-8 encoding of the string in the
    # previous example

    $foo = pack("ccxxcc",65,66,67,68);
    # foo eq "AB\0\0CD"

    # NOTE: The examples above featuring "W" and "c" are true
    # only on ASCII and ASCII-derived systems such as ISO Latin 1
    # and UTF-8.  On EBCDIC systems, the first example would be
    #      $foo = pack("WWWW",193,194,195,196);

    $foo = pack("s2",1,2);
    # "\001\000\002\000" on little-endian
    # "\000\001\000\002" on big-endian

    $foo = pack("a4","abcd","x","y","z");
    # "abcd"

    $foo = pack("aaaa","abcd","x","y","z");
    # "axyz"

    $foo = pack("a14","abcdefg");
    # "abcdefg\0\0\0\0\0\0\0"

    $foo = pack("i9pl", gmtime);
    # a real struct tm (on my system anyway)

    $utmp_template = "Z8 Z8 Z16 L";
    $utmp = pack($utmp_template, @utmp1);
    # a struct utmp (BSDish)

    @utmp2 = unpack($utmp_template, $utmp);
    # "@utmp1" eq "@utmp2"

    sub bintodec {
        unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
    }

    $foo = pack('sx2l', 12, 34);
    # short 12, two zero bytes padding, long 34
    $bar = pack('s@4l', 12, 34);
    # short 12, zero fill to position 4, long 34
    # $foo eq $bar
    $baz = pack('s.l', 12, 4, 34);
    # short 12, zero fill to position 4, long 34

    $foo = pack('nN', 42, 4711);
    # pack big-endian 16- and 32-bit unsigned integers
    $foo = pack('S>L>', 42, 4711);
    # exactly the same
    $foo = pack('s<l<', -42, 4711);
    # pack little-endian 16- and 32-bit signed integers
    $foo = pack('(sl)<', -42, 4711);
    # exactly the same

The same template may generally also be used in
L<C<unpack>|/unpack TEMPLATE,EXPR>.

=back
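
Several of the rules above (the C</> length prefix, C<()> groups with a
repeat count, C<x!> alignment, and C<#> comments inside a TEMPLATE) can be
combined in one short script.  This is only a sketch: the record layout and
the names C<$tmpl>, C<$msg>, C<$aligned>, and C<$grouped> are made up for
illustration and are not part of the original text.

```perl
use strict;
use warnings;

# Whitespace and "#" comments are allowed inside a TEMPLATE.
my $tmpl = q{
    n       # message id: 16-bit big-endian
    C/a     # one-octet length, then that many bytes of payload
    (C n)2  # two pairs of octet tag and 16-bit value
};

my $msg  = pack($tmpl, 7, "hi", 1, 100, 2, 200);
my @back = unpack($tmpl, $msg);

# "x!8" null-pads up to the next multiple of 8 bytes.
my $aligned = pack("c x!8 N", 1, 2);

# "@" positions restart at 0 inside each ()-group.  Single quotes keep
# Perl from interpolating @1, @2, and @3 as arrays.
my $grouped = pack('@1A((@2A)@3A)', qw[X Y Z]);

print join(" ", map { sprintf "%02x", $_ } unpack("C*", $msg)), "\n";
print "@back\n";
print length($aligned), "\n";
```

Using the same template string for both directions keeps the writer and the
reader in step; the unpack side uses C<C/a> without a repeat count, as the
rule for C</> requires.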

=over

=item *

Supplying a C<*> for the repeat count instead of a number means to use
however many items are left, except for:

=over

=item *

C<@>, C<x>, and C<X>, where it is equivalent to C<0>.

=item *

C<.>, where it means relative to the start of the string.

=item *

C<u>, where it is equivalent to 1 (or 45, which here is equivalent).

=back

One can replace a numeric repeat count with a template letter enclosed in
brackets to use the packed byte length of the bracketed template for the
repeat count.

For example, the template C<x[L]> skips as many bytes as in a packed long,
and the template C<"$t X[$t] $t"> unpacks twice whatever $t (when
variable-expanded) unpacks.  If the template in brackets contains alignment
commands (such as C<x![d]>), its packed length is calculated as if the
start of the template had the maximal possible alignment.

When used with C<Z>, a C<*> as the repeat count is guaranteed to add a
trailing null byte, so the resulting string is always one byte longer than
the byte length of the item itself.

When used with C<@>, the repeat count represents an offset from the start
of the innermost C<()> group.

When used with C<.>, the repeat count determines the starting position to
calculate the value offset as follows:

=over

=item *

If the repeat count is C<0>, it's relative to the current position.

=item *

If the repeat count is C<*>, the offset is relative to the start of the
packed string.

=item *

And if it's an integer I<n>, the offset is relative to the start of the
I<n>th innermost C<( )> group, or to the start of the string if I<n> is
bigger than the group level.

=back

The repeat count for C<u> is interpreted as the maximal number of bytes
to encode per line of output, with 0, 1 and 2 replaced by 45.  The repeat
count should not be more than 65.

=item *

The C<a>, C<A>, and C<Z> types gobble just one value, but pack it as a
string of length count, padding with nulls or spaces as needed.  When
unpacking, C<A> strips trailing whitespace and nulls, C<Z> strips everything
after the first null, and C<a> returns data with no stripping at all.

If the value to pack is too long, the result is truncated.  If it's too
long and an explicit count is provided, C<Z> packs only C<$count-1> bytes,
followed by a null byte.  Thus C<Z> always packs a trailing null, except
when the count is 0.

=item *

Likewise, the C<b> and C<B> formats pack a string that's that many bits long.
Each such format generates 1 bit of the result.  These are typically followed
by a repeat count like C<B8> or C<B64>.

Each result bit is based on the least-significant bit of the corresponding
input character, i.e., on C<ord($char)%2>.  In particular, characters C<"0">
and C<"1"> generate bits 0 and 1, as do characters C<"\000"> and C<"\001">.

Starting from the beginning of the input string, each 8-tuple
of characters is converted to 1 character of output.  With format C<b>,
the first character of the 8-tuple determines the least-significant bit of a
character; with format C<B>, it determines the most-significant bit of
a character.

If the length of the input string is not evenly divisible by 8, the
remainder is packed as if the input string were padded by null characters
at the end.  Similarly during unpacking, "extra" bits are ignored.

If the input string is longer than needed, remaining characters are ignored.

A C<*> for the repeat count uses all characters of the input field.
On unpacking, bits are converted to a string of C<0>s and C<1>s.

=item *

The C<h> and C<H> formats pack a string that many nybbles (4-bit groups,
representable as hexadecimal digits, C<"0".."9"> C<"a".."f">) long.

For each such format, L<C<pack>|/pack TEMPLATE,LIST> generates 4 bits of
result.  With non-alphabetical characters, the result is based on the 4
least-significant bits of the input character, i.e., on C<ord($char)%16>.
In particular, characters C<"0"> and C<"1"> generate nybbles 0 and 1, as do
bytes C<"\000"> and C<"\001">.  For characters C<"a".."f"> and C<"A".."F">,
the result is compatible with the usual hexadecimal digits, so that C<"a">
and C<"A"> both generate the nybble C<0xA==10>.  Use only these specific hex
characters with this format.

Starting from the beginning of the template to
L<C<pack>|/pack TEMPLATE,LIST>, each pair
of characters is converted to 1 character of output.  With format C<h>, the
first character of the pair determines the least-significant nybble of the
output character; with format C<H>, it determines the most-significant
nybble.

If the length of the input string is not even, it behaves as if padded by
a null character at the end.  Similarly, "extra" nybbles are ignored during
unpacking.

If the input string is longer than needed, extra characters are ignored.

A C<*> for the repeat count uses all characters of the input field.  For
L<C<unpack>|/unpack TEMPLATE,EXPR>, nybbles are converted to a string of
hexadecimal digits.

=back
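
The padding, stripping, bit-order, and nybble-order rules above are easiest
to check by running them.  The following is a small sketch rather than a
reference; all variable names are illustrative, and the byte values are
written as C<\x41> and C<\x42> so that nothing depends on the platform's
character set.

```perl
use strict;
use warnings;

# a/A/Z each consume ONE value and pad or truncate it to the count.
my $a_packed = pack("a4", "ab");         # null padded
my $A_packed = pack("A4", "ab");         # space padded
my $Z_trunc  = pack("Z4", "abcdef");     # count-1 bytes, then a null
my $Z_star   = pack("Z*", "ab");         # "*" always appends the null

# Unpacking strips differently for each of the three.
my ($A_back) = unpack("A4", "ab \0");    # trailing spaces and nulls removed
my ($Z_back) = unpack("Z4", "a\0bc");    # cut at the first null
my ($a_back) = unpack("a4", "ab\0 ");    # returned unchanged

# Bit strings: "B" takes the most-significant bit first, "b" the least.
my $from_B = pack("B8", "01000001");
my $from_b = pack("b8", "10000010");

# Hex strings: "H" takes the high nybble first, "h" the low nybble.
my $from_H = pack("H4", "4142");
my $from_h = pack("h4", "1424");
my ($hex)  = unpack("H*", "\x41\x42");

# A bracketed template as repeat count: skip as many bytes as a packed "N".
my $skipped = pack("x[N] C", 1);

printf "%vd %vd %s %d\n", $from_B, $from_H, $hex, length $skipped;
```

Both bit-string calls produce the single byte C<0x41>, and both hex-string
calls produce the two bytes C<0x41 0x42>; only the order in which the input
characters are consumed differs.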