Commit | Line | Data |
---|---|---|
ef6bacff RC |
1 | .PU |
2 | .TH bzip2 1 | |
3 | .SH NAME | |
ba95a000 | 4 | bzip2, bunzip2 \- a block-sorting file compressor, v1.0.2 |
ef6bacff RC |
5 | .br |
6 | bzcat \- decompresses files to stdout | |
7 | .br | |
8 | bzip2recover \- recovers data from damaged bzip2 files | |
9 | ||
10 | .SH SYNOPSIS | |
11 | .ll +8 | |
12 | .B bzip2 | |
13 | .RB [ " \-cdfkqstvzVL123456789 " ] | |
14 | [ | |
15 | .I "filenames \&..." | |
16 | ] | |
17 | .ll -8 | |
18 | .br | |
19 | .B bunzip2 | |
20 | .RB [ " \-fkvsVL " ] | |
21 | [ | |
22 | .I "filenames \&..." | |
23 | ] | |
24 | .br | |
25 | .B bzcat | |
26 | .RB [ " \-s " ] | |
27 | [ | |
28 | .I "filenames \&..." | |
29 | ] | |
30 | .br | |
31 | .B bzip2recover | |
32 | .I "filename" | |
33 | ||
34 | .SH DESCRIPTION | |
35 | .I bzip2 | |
36 | compresses files using the Burrows-Wheeler block sorting | |
37 | text compression algorithm, and Huffman coding. Compression is | |
38 | generally considerably better than that achieved by more conventional | |
39 | LZ77/LZ78-based compressors, and approaches the performance of the PPM | |
40 | family of statistical compressors. | |
41 | ||
42 | The command-line options are deliberately very similar to | |
43 | those of | |
44 | .I GNU gzip, | |
45 | but they are not identical. | |
46 | ||
47 | .I bzip2 | |
48 | expects a list of file names to accompany the | |
49 | command-line flags. Each file is replaced by a compressed version of | |
50 | itself, with the name "original_name.bz2". | |
51 | Each compressed file | |
52 | has the same modification date, permissions, and, when possible, | |
53 | ownership as the corresponding original, so that these properties can | |
54 | be correctly restored at decompression time. File name handling is | |
55 | naive in the sense that there is no mechanism for preserving original | |
56 | file names, permissions, ownerships or dates in filesystems which lack | |
57 | these concepts, or have serious file name length restrictions, such as | |
58 | MS-DOS. | |
59 | ||
60 | .I bzip2 | |
61 | and | |
62 | .I bunzip2 | |
63 | will by default not overwrite existing | |
64 | files. If you want this to happen, specify the \-f flag. | |
65 | ||
66 | If no file names are specified, | |
67 | .I bzip2 | |
68 | compresses from standard | |
69 | input to standard output. In this case, | |
70 | .I bzip2 | |
71 | will decline to | |
72 | write compressed output to a terminal, as this would be entirely | |
73 | incomprehensible and therefore pointless. | |
74 | ||
75 | .I bunzip2 | |
76 | (or | |
77 | .I bzip2 \-d) | |
78 | decompresses all | |
79 | specified files. Files which were not created by | |
80 | .I bzip2 | |
81 | will be detected and ignored, and a warning issued. | |
82 | .I bzip2 | |
83 | attempts to guess the filename for the decompressed file | |
84 | from that of the compressed file as follows: | |
85 | ||
86 | filename.bz2 becomes filename | |
87 | filename.bz becomes filename | |
88 | filename.tbz2 becomes filename.tar | |
89 | filename.tbz becomes filename.tar | |
90 | anyothername becomes anyothername.out | |
91 | ||
92 | If the file does not end in one of the recognised endings, | |
93 | .I .bz2, | |
94 | .I .bz, | |
95 | .I .tbz2 | |
96 | or | |
97 | .I .tbz, | |
98 | .I bzip2 | |
99 | complains that it cannot | |
100 | guess the name of the original file, and uses the original name | |
101 | with | |
102 | .I .out | |
103 | appended. | |
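The renaming rules above can be sketched in Python. This only illustrates the mapping as documented on this page; it is not the code bzip2 itself uses, and the function name `guess_output_name` is hypothetical.

```python
# Suffix rules as listed in this manual page, checked in order.
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name: str) -> str:
    """Return the name the decompressed file would get under the rules above."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the whole name and append ".out".
    return compressed_name + ".out"


for name in ("filename.bz2", "filename.tbz", "anyothername"):
    print(name, "becomes", guess_output_name(name))
```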
104 | ||
105 | As with compression, supplying no | |
106 | filenames causes decompression from | |
107 | standard input to standard output. | |
108 | ||
109 | .I bunzip2 | |
110 | will correctly decompress a file which is the | |
111 | concatenation of two or more compressed files. The result is the | |
112 | concatenation of the corresponding uncompressed files. Integrity | |
113 | testing (\-t) | |
114 | of concatenated | |
115 | compressed files is also supported. | |
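The concatenation property can be checked without the command-line tool. The sketch below assumes Python 3.3 or later, whose standard `bz2` module reads the same format and accepts multi-stream input; the sample data is made up for illustration.

```python
import bz2

first = b"first file\n"
second = b"second file\n"

# Each call produces a complete, independent compressed stream.
combined = bz2.compress(first) + bz2.compress(second)

# Decompressing the concatenation yields the concatenation of the originals.
restored = bz2.decompress(combined)
print(restored == first + second)
```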
116 | ||
117 | You can also compress or decompress files to the standard output by | |
118 | giving the \-c flag. Multiple files may be compressed and | |
119 | decompressed like this. The resulting outputs are fed sequentially to | |
120 | stdout. Compression of multiple files | |
121 | in this manner generates a stream | |
122 | containing multiple compressed file representations. Such a stream | |
123 | can be decompressed correctly only by | |
124 | .I bzip2 | |
125 | version 0.9.0 or | |
126 | later. Earlier versions of | |
127 | .I bzip2 | |
128 | will stop after decompressing | |
129 | the first file in the stream. | |
130 | ||
131 | .I bzcat | |
132 | (or | |
133 | .I bzip2 -dc) | |
134 | decompresses all specified files to | |
135 | the standard output. | |
136 | ||
137 | .I bzip2 | |
138 | will read arguments from the environment variables | |
139 | .I BZIP2 | |
140 | and | |
141 | .I BZIP, | |
142 | in that order, and will process them | |
143 | before any arguments read from the command line. This gives a | |
144 | convenient way to supply default arguments. | |
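A rough model of the ordering described above, in Python. The function name `effective_arguments` is hypothetical, and splitting the variables on whitespace is an assumption of this sketch rather than something this page specifies.

```python
def effective_arguments(environ: dict, command_line: list) -> list:
    """Environment arguments first (BZIP2, then BZIP), then the command line."""
    args = []
    for name in ("BZIP2", "BZIP"):
        # Assumption: arguments inside each variable are whitespace-separated.
        args.extend(environ.get(name, "").split())
    args.extend(command_line)
    return args


print(effective_arguments({"BZIP2": "-9 -v", "BZIP": "-k"}, ["-c", "notes.txt"]))
```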
145 | ||
146 | Compression is always performed, even if the compressed | |
147 | file is slightly | |
148 | larger than the original. Files of less than about one hundred bytes | |
149 | tend to get larger, since the compression mechanism has a constant | |
150 | overhead in the region of 50 bytes. Random data (including the output | |
151 | of most file compressors) is coded at about 8.05 bits per byte, giving | |
152 | an expansion of around 0.5%. | |
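The overhead is easy to observe with Python's standard `bz2` module, which produces the same format. This is a small experiment, not a benchmark: the inputs are arbitrary, and the pseudo-random bytes merely stand in for incompressible data.

```python
import bz2
import random

# A very small input: the fixed overhead dominates.
tiny = b"hello, world\n"
tiny_compressed = bz2.compress(tiny)

# Pseudo-random bytes (fixed seed so the run is repeatable).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))
noise_compressed = bz2.compress(noise)

print("tiny: ", len(tiny), "->", len(tiny_compressed))
print("noise:", len(noise), "->", len(noise_compressed))
```

In both cases the output should come out larger than the input, which is the behaviour described above.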
153 | ||
154 | As a self-check for your protection, | |
155 | .I | |
156 | bzip2 | |
157 | uses 32-bit CRCs to | |
158 | make sure that the decompressed version of a file is identical to the | |
159 | original. This guards against corruption of the compressed data, and | |
160 | against undetected bugs in | |
161 | .I bzip2 | |
162 | (hopefully very unlikely). The | |
163 | chance of data corruption going undetected is microscopic, about one | |
164 | chance in four billion for each file processed. Be aware, though, that | |
165 | the check occurs upon decompression, so it can only tell you that | |
166 | something is wrong. It can't help you | |
167 | recover the original uncompressed | |
168 | data. You can use | |
169 | .I bzip2recover | |
170 | to try to recover data from | |
171 | damaged files. | |
172 | ||
173 | Return values: 0 for a normal exit, 1 for environmental problems (file | |
174 | not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt | |
175 | compressed file, 3 for an internal consistency error (e.g., a bug) which | |
176 | caused | |
177 | .I bzip2 | |
178 | to panic. | |
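A wrapper script might translate these return values into messages. The sketch below is a plain lookup table built from the list above; the names `EXIT_MEANINGS` and `describe_exit_status` are hypothetical.

```python
# Return values as documented on this page.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error (bzip2 panicked)",
}


def describe_exit_status(code: int) -> str:
    """Map a bzip2 return value to a short description."""
    return EXIT_MEANINGS.get(code, f"undocumented status {code}")


print(describe_exit_status(2))
```

The `code` argument would typically be the `returncode` of a `subprocess.run` call that invoked bzip2.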
179 | ||
180 | .SH OPTIONS | |
181 | .TP | |
182 | .B \-c --stdout | |
183 | Compress or decompress to standard output. | |
184 | .TP | |
185 | .B \-d --decompress | |
186 | Force decompression. | |
187 | .I bzip2, | |
188 | .I bunzip2 | |
189 | and | |
190 | .I bzcat | |
191 | are | |
192 | really the same program, and the decision about what actions to take is | |
193 | done on the basis of which name is used. This flag overrides that | |
194 | mechanism, and forces | |
195 | .I bzip2 | |
196 | to decompress. | |
197 | .TP | |
198 | .B \-z --compress | |
199 | The complement to \-d: forces compression, regardless of the | |
ba95a000 | 200 | invocation name. |
ef6bacff RC |
201 | .TP |
202 | .B \-t --test | |
203 | Check integrity of the specified file(s), but don't decompress them. | |
204 | This really performs a trial decompression and throws away the result. | |
205 | .TP | |
206 | .B \-f --force | |
207 | Force overwrite of output files. Normally, | |
208 | .I bzip2 | |
209 | will not overwrite | |
210 | existing output files. Also forces | |
211 | .I bzip2 | |
212 | to break hard links | |
213 | to files, which it otherwise wouldn't do. | |
ba95a000 RC |
214 | |
215 | bzip2 normally declines to decompress files which don't have the | |
216 | correct magic header bytes. If forced (-f), however, it will pass | |
217 | such files through unmodified. This is how GNU gzip behaves. | |
ef6bacff RC |
218 | .TP |
219 | .B \-k --keep | |
220 | Keep (don't delete) input files during compression | |
221 | or decompression. | |
222 | .TP | |
223 | .B \-s --small | |
224 | Reduce memory usage, for compression, decompression and testing. Files | |
225 | are decompressed and tested using a modified algorithm which only | |
226 | requires 2.5 bytes per block byte. This means any file can be | |
227 | decompressed in 2300k of memory, albeit at about half the normal speed. | |
228 | ||
229 | During compression, \-s selects a block size of 200k, which limits | |
230 | memory use to around the same figure, at the expense of your compression | |
231 | ratio. In short, if your machine is low on memory (8 megabytes or | |
232 | less), use \-s for everything. See MEMORY MANAGEMENT below. | |
233 | .TP | |
234 | .B \-q --quiet | |
235 | Suppress non-essential warning messages. Messages pertaining to | |
236 | I/O errors and other critical events will not be suppressed. | |
237 | .TP | |
238 | .B \-v --verbose | |
239 | Verbose mode -- show the compression ratio for each file processed. | |
240 | Further \-v's increase the verbosity level, spewing out lots of | |
241 | information which is primarily of interest for diagnostic purposes. | |
242 | .TP | |
243 | .B \-L --license -V --version | |
244 | Display the software version, license terms and conditions. | |
245 | .TP | |
ba95a000 | 246 | .B \-1 (or \-\-fast) to \-9 (or \-\-best) |
ef6bacff RC |
247 | Set the block size to 100 k, 200 k, ..., 900 k when compressing. Has no |
248 | effect when decompressing. See MEMORY MANAGEMENT below. | |
ba95a000 RC |
249 | The \-\-fast and \-\-best aliases are primarily for GNU gzip |
250 | compatibility. In particular, \-\-fast doesn't make things | |
251 | significantly faster. | |
252 | And \-\-best merely selects the default behaviour. | |
ef6bacff RC |
253 | .TP |
254 | .B \-- | |
255 | Treats all subsequent arguments as file names, even if they start | |
256 | with a dash. This is so you can handle files with names beginning | |
257 | with a dash, for example: bzip2 \-- \-myfilename. | |
258 | .TP | |
259 | .B \--repetitive-fast --repetitive-best | |
260 | These flags are redundant in versions 0.9.5 and above. They provided | |
261 | some coarse control over the behaviour of the sorting algorithm in | |
262 | earlier versions, which was sometimes useful. 0.9.5 and above have an | |
263 | improved algorithm which renders these flags irrelevant. | |
264 | ||
265 | .SH MEMORY MANAGEMENT | |
266 | .I bzip2 | |
267 | compresses large files in blocks. The block size affects | |
268 | both the compression ratio achieved, and the amount of memory needed for | |
269 | compression and decompression. The flags \-1 through \-9 | |
270 | specify the block size to be 100,000 bytes through 900,000 bytes (the | |
271 | default) respectively. At decompression time, the block size used for | |
272 | compression is read from the header of the compressed file, and | |
273 | .I bunzip2 | |
274 | then allocates itself just enough memory to decompress | |
275 | the file. Since block sizes are stored in compressed files, it follows | |
276 | that the flags \-1 to \-9 are irrelevant to and so ignored | |
277 | during decompression. | |
278 | ||
279 | Compression and decompression requirements, | |
280 | in bytes, can be estimated as: | |
281 | ||
282 | Compression: 400k + ( 8 x block size ) | |
283 | ||
284 | Decompression: 100k + ( 4 x block size ), or | |
285 | 100k + ( 2.5 x block size ) | |
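The estimates above translate directly into code. This sketch works in units of k, taking the block size for flag \-N as N x 100k as described earlier in this section; the function names are hypothetical.

```python
def block_size_k(level: int) -> int:
    """Block size in k for flags -1 to -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return 100 * level


def compress_memory_k(level: int) -> int:
    """Compression: 400k + (8 x block size)."""
    return 400 + 8 * block_size_k(level)


def decompress_memory_k(level: int, small: bool = False) -> float:
    """Decompression: 100k + (4 x block size), or 2.5 x with -s."""
    factor = 2.5 if small else 4
    return 100 + factor * block_size_k(level)


for level in (1, 9):
    print(level, compress_memory_k(level), decompress_memory_k(level),
          decompress_memory_k(level, small=True))
```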
286 | ||
287 | Larger block sizes give rapidly diminishing marginal returns. Most of | |
288 | the compression comes from the first two or three hundred k of block | |
289 | size, a fact worth bearing in mind when using | |
290 | .I bzip2 | |
291 | on small machines. | |
292 | It is also important to appreciate that the decompression memory | |
293 | requirement is set at compression time by the choice of block size. | |
294 | ||
295 | For files compressed with the default 900k block size, | |
296 | .I bunzip2 | |
297 | will require about 3700 kbytes to decompress. To support decompression | |
298 | of any file on a 4 megabyte machine, | |
299 | .I bunzip2 | |
300 | has an option to | |
301 | decompress using approximately half this amount of memory, about 2300 | |
302 | kbytes. Decompression speed is also halved, so you should use this | |
303 | option only where necessary. The relevant flag is -s. | |
304 | ||
305 | In general, try to use the largest block size that memory constraints allow, | |
306 | since that maximises the compression achieved. Compression and | |
307 | decompression speed are virtually unaffected by block size. | |
308 | ||
309 | Another significant point applies to files which fit in a single block | |
310 | -- that means most files you'd encounter using a large block size. The | |
311 | amount of real memory touched is proportional to the size of the file, | |
312 | since the file is smaller than a block. For example, compressing a file | |
313 | 20,000 bytes long with the flag -9 will cause the compressor to | |
314 | allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 | |
315 | kbytes of it. Similarly, the decompressor will allocate 3700k but only | |
316 | touch 100k + 20000 * 4 = 180 kbytes. | |
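The arithmetic in this example generalises to any file that fits in a single block. A minimal sketch, assuming (as the example does) that the touched memory scales with the smaller of the file size and the block size; the function names are hypothetical.

```python
def touched_compress_k(file_bytes: int, level: int = 9) -> float:
    """Approximate memory touched while compressing, in k."""
    block_bytes = 100_000 * level
    active_bytes = min(file_bytes, block_bytes)
    return 400 + 8 * active_bytes / 1000


def touched_decompress_k(file_bytes: int, level: int = 9) -> float:
    """Approximate memory touched while decompressing, in k."""
    block_bytes = 100_000 * level
    active_bytes = min(file_bytes, block_bytes)
    return 100 + 4 * active_bytes / 1000


print(touched_compress_k(20_000), touched_decompress_k(20_000))
```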
317 | ||
318 | Here is a table which summarises the maximum memory usage for different | |
319 | block sizes. Also recorded is the total compressed size for 14 files of | |
320 | the Calgary Text Compression Corpus totalling 3,141,622 bytes. This | |
321 | column gives some feel for how compression varies with block size. | |
322 | These figures tend to understate the advantage of larger block sizes for | |
323 | larger files, since the Corpus is dominated by smaller files. | |
324 | ||
325 | Compress Decompress Decompress Corpus | |
326 | Flag usage usage -s usage Size | |
327 | ||
328 | -1 1200k 500k 350k 914704 | |
329 | -2 2000k 900k 600k 877703 | |
330 | -3 2800k 1300k 850k 860338 | |
331 | -4 3600k 1700k 1100k 846899 | |
332 | -5 4400k 2100k 1350k 845160 | |
333 | -6 5200k 2500k 1600k 838626 | |
334 | -7 6000k 2900k 1850k 834096 | |
335 | -8 6800k 3300k 2100k 828642 | |
336 | -9 7600k 3700k 2350k 828642 | |
337 | ||
338 | .SH RECOVERING DATA FROM DAMAGED FILES | |
339 | .I bzip2 | |
340 | compresses files in blocks, usually 900 kbytes long. Each | |
341 | block is handled independently. If a media or transmission error causes | |
342 | a multi-block .bz2 | |
343 | file to become damaged, it may be possible to | |
344 | recover data from the undamaged blocks in the file. | |
345 | ||
346 | The compressed representation of each block is delimited by a 48-bit | |
347 | pattern, which makes it possible to find the block boundaries with | |
348 | reasonable certainty. Each block also carries its own 32-bit CRC, so | |
349 | damaged blocks can be distinguished from undamaged ones. | |
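To make the boundary search concrete, here is a sketch of a bit-level scan in Python. It is not the bzip2recover source. The 48-bit value 0x314159265359 is not given on this page; it is the block header value used by the bzip2 format. The function name `find_block_starts` is hypothetical. Blocks need not start on byte boundaries, so the scan advances one bit at a time.

```python
import bz2
import random

BLOCK_MAGIC = 0x314159265359  # 48-bit block header pattern (not stated on this page)
MASK_48 = (1 << 48) - 1


def find_block_starts(data: bytes) -> list:
    """Return the bit offsets at which the 48-bit block pattern begins."""
    window = 0
    bit_position = 0
    starts = []
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bit_position += 1
            if bit_position >= 48 and window == BLOCK_MAGIC:
                starts.append(bit_position - 48)
    return starts


# Demonstration: incompressible input at the smallest block size spans
# more than one block.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(120_000))
stream = bz2.compress(payload, compresslevel=1)
starts = find_block_starts(stream)
print(starts)
```

The first offset corresponds to the bits immediately after the 4-byte stream header. A real recovery tool would also have to copy each block into a new file with a valid header and trailer, which this sketch does not attempt.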
350 | ||
351 | .I bzip2recover | |
352 | is a simple program whose purpose is to search for | |
353 | blocks in .bz2 files, and write each block out into its own .bz2 | |
354 | file. You can then use | |
355 | .I bzip2 | |
356 | \-t | |
357 | to test the | |
358 | integrity of the resulting files, and decompress those which are | |
359 | undamaged. | |
360 | ||
361 | .I bzip2recover | |
362 | takes a single argument, the name of the damaged file, | |
ba95a000 RC |
363 | and writes a number of files "rec00001file.bz2", |
364 | "rec00002file.bz2", etc, containing the extracted blocks. | |
ef6bacff RC |
365 | The output filenames are designed so that the use of |
366 | wildcards in subsequent processing -- for example, | |
ba95a000 | 367 | "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in |
ef6bacff RC |
368 | the correct order. |
369 | ||
370 | .I bzip2recover | |
371 | should be of most use dealing with large .bz2 | |
372 | files, as these will contain many blocks. It is clearly | |
373 | futile to use it on damaged single-block files, since a | |
374 | damaged block cannot be recovered. If you wish to minimise | |
375 | any potential data loss through media or transmission errors, | |
376 | you might consider compressing with a smaller | |
377 | block size. | |
378 | ||
379 | .SH PERFORMANCE NOTES | |
380 | The sorting phase of compression gathers together similar strings in the | |
381 | file. Because of this, files containing very long runs of repeated | |
382 | symbols, like "aabaabaabaab ..." (repeated several hundred times) may | |
383 | compress more slowly than normal. Versions 0.9.5 and above fare much | |
384 | better than previous versions in this respect. The ratio between | |
385 | worst-case and average-case compression time is in the region of 10:1. | |
386 | For previous versions, this figure was more like 100:1. You can use the | |
387 | \-vvvv option to monitor progress in great detail, if you want. | |
388 | ||
389 | Decompression speed is unaffected by these phenomena. | |
390 | ||
391 | .I bzip2 | |
392 | usually allocates several megabytes of memory to operate | |
393 | in, and then charges all over it in a fairly random fashion. This means | |
394 | that performance, both for compressing and decompressing, is largely | |
395 | determined by the speed at which your machine can service cache misses. | |
396 | Because of this, small changes to the code to reduce the miss rate have | |
397 | been observed to give disproportionately large performance improvements. | |
398 | I imagine | |
399 | .I bzip2 | |
400 | will perform best on machines with very large caches. | |
401 | ||
402 | .SH CAVEATS | |
403 | I/O error messages are not as helpful as they could be. | |
404 | .I bzip2 | |
405 | tries hard to detect I/O errors and exit cleanly, but the details of | |
406 | what the problem is sometimes seem rather misleading. | |
407 | ||
ba95a000 | 408 | This manual page pertains to version 1.0.2 of |
ef6bacff | 409 | .I bzip2. |
ba95a000 RC |
410 | Compressed data created by this version is entirely forwards and |
411 | backwards compatible with the previous public releases, versions | |
412 | 0.1pl2, 0.9.0, 0.9.5, 1.0.0 and 1.0.1, but with the following | |
413 | exception: 0.9.0 and above can correctly decompress multiple | |
414 | concatenated compressed files. 0.1pl2 cannot do this; it will stop | |
415 | after decompressing just the first file in the stream. | |
ef6bacff RC |
416 | |
417 | .I bzip2recover | |
ba95a000 RC |
418 | versions prior to this one, 1.0.2, used 32-bit integers to represent |
419 | bit positions in compressed files, so they could not handle compressed | |
420 | files more than 512 megabytes long. Versions 1.0.2 and above use | |
421 | 64-bit ints on some platforms which support them (GNU supported | |
422 | targets, and Windows). To establish whether or not bzip2recover was | |
423 | built with such a limitation, run it without arguments. In any event | |
424 | you can build yourself an unlimited version if you can recompile it | |
425 | with MaybeUInt64 set to be an unsigned 64-bit integer. | |
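The 512 megabyte figure follows from counting bit positions: a 32-bit counter can address 2^32 bits, which is 2^29 bytes. A short check in Python (the function name is hypothetical):

```python
BITS_PER_BYTE = 8
MEGABYTE = 1024 * 1024


def max_file_megabytes(position_bits: int) -> int:
    """Largest file, in megabytes, whose bit positions fit in the counter."""
    addressable_bits = 1 << position_bits
    return addressable_bits // BITS_PER_BYTE // MEGABYTE


print(max_file_megabytes(32))
```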
426 | ||
427 | ||
ef6bacff RC |
428 | |
429 | .SH AUTHOR | |
430 | Julian Seward, jseward@acm.org. | |
431 | ||
ba95a000 | 432 | http://sources.redhat.com/bzip2 |
ef6bacff RC |
433 | |
434 | The ideas embodied in | |
435 | .I bzip2 | |
436 | are due to (at least) the following | |
437 | people: Michael Burrows and David Wheeler (for the block sorting | |
438 | transformation), David Wheeler (again, for the Huffman coder), Peter | |
439 | Fenwick (for the structured coding model in the original | |
440 | .I bzip, | |
441 | and many refinements), and Alistair Moffat, Radford Neal and Ian Witten | |
442 | (for the arithmetic coder in the original | |
443 | .I bzip). | |
444 | I am much | |
445 | indebted for their help, support and advice. See the manual in the | |
446 | source distribution for pointers to sources of documentation. Christian | |
447 | von Roques encouraged me to look for faster sorting algorithms, so as to | |
448 | speed up compression. Bela Lubkin encouraged me to improve the | |
ba95a000 RC |
449 | worst-case compression performance. |
450 | The bz* scripts are derived from those of GNU gzip. | |
451 | Many people sent patches, helped | |
ef6bacff RC |
452 | with portability problems, lent machines, gave advice and were generally |
453 | helpful. |