]>
Commit | Line | Data |
---|---|---|
ef6bacff RC |
1 | \input texinfo @c -*- Texinfo -*- |
2 | @setfilename bzip2.info | |
3 | ||
4 | @ignore | |
ba95a000 | 5 | This file documents bzip2 version 1.0.2, and associated library |
ef6bacff RC |
6 | libbzip2, written by Julian Seward (jseward@acm.org). |
7 | ||
ba95a000 | 8 | Copyright (C) 1996-2002 Julian R Seward |
ef6bacff RC |
9 | |
10 | Permission is granted to make and distribute verbatim copies of | |
11 | this manual provided the copyright notice and this permission notice | |
12 | are preserved on all copies. | |
13 | ||
14 | Permission is granted to copy and distribute translations of this manual | |
15 | into another language, under the above conditions for verbatim copies. | |
16 | @end ignore | |
17 | ||
18 | @ifinfo | |
19 | @format | |
20 | START-INFO-DIR-ENTRY | |
21 | * Bzip2: (bzip2). A program and library for data compression. | |
22 | END-INFO-DIR-ENTRY | |
23 | @end format | |
24 | ||
25 | @end ifinfo | |
26 | ||
27 | @iftex | |
28 | @c @finalout | |
29 | @settitle bzip2 and libbzip2 | |
30 | @titlepage | |
31 | @title bzip2 and libbzip2 | |
32 | @subtitle a program and library for data compression | |
ba95a000 RC |
33 | @subtitle copyright (C) 1996-2002 Julian Seward |
34 | @subtitle version 1.0.2 of 30 December 2001 | |
ef6bacff RC |
35 | @author Julian Seward |
36 | ||
37 | @end titlepage | |
38 | ||
39 | @parindent 0mm | |
40 | @parskip 2mm | |
41 | ||
42 | @end iftex | |
ba95a000 RC |
43 | @node Top,,, (dir) |
44 | ||
45 | The following text is the License for this software. You should | |
46 | find it identical to that contained in the file LICENSE in the | |
47 | source distribution. | |
48 | ||
49 | @bf{------------------ START OF THE LICENSE ------------------} | |
ef6bacff RC |
50 | |
51 | This program, @code{bzip2}, | |
52 | and associated library @code{libbzip2}, are | |
ba95a000 | 53 | Copyright (C) 1996-2002 Julian R Seward. All rights reserved. |
ef6bacff RC |
54 | |
55 | Redistribution and use in source and binary forms, with or without | |
56 | modification, are permitted provided that the following conditions | |
57 | are met: | |
58 | @itemize @bullet | |
59 | @item | |
60 | Redistributions of source code must retain the above copyright | |
61 | notice, this list of conditions and the following disclaimer. | |
62 | @item | |
63 | The origin of this software must not be misrepresented; you must | |
64 | not claim that you wrote the original software. If you use this | |
65 | software in a product, an acknowledgment in the product | |
66 | documentation would be appreciated but is not required. | |
67 | @item | |
68 | Altered source versions must be plainly marked as such, and must | |
69 | not be misrepresented as being the original software. | |
70 | @item | |
71 | The name of the author may not be used to endorse or promote | |
72 | products derived from this software without specific prior written | |
73 | permission. | |
74 | @end itemize | |
75 | THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS | |
76 | OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED | |
77 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |
78 | ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY | |
79 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |
80 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE | |
81 | GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS | |
82 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, | |
83 | WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING | |
84 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS | |
85 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
86 | ||
87 | Julian Seward, Cambridge, UK. | |
88 | ||
89 | @code{jseward@@acm.org} | |
90 | ||
ba95a000 | 91 | @code{bzip2}/@code{libbzip2} version 1.0.2 of 30 December 2001. |
ef6bacff | 92 | |
ba95a000 | 93 | @bf{------------------ END OF THE LICENSE ------------------} |
ef6bacff | 94 | |
ba95a000 | 95 | Web sites: |
ef6bacff | 96 | |
ba95a000 RC |
97 | @code{http://sources.redhat.com/bzip2} |
98 | ||
99 | @code{http://www.cacheprof.org} | |
ef6bacff RC |
100 | |
101 | PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented | |
102 | algorithms. However, I do not have the resources available to carry out | |
103 | a full patent search. Therefore I cannot give any guarantee of the | |
104 | above statement. | |
105 | ||
106 | ||
107 | ||
108 | ||
109 | ||
110 | ||
111 | ||
ef6bacff RC |
112 | @chapter Introduction |
113 | ||
114 | @code{bzip2} compresses files using the Burrows-Wheeler | |
115 | block-sorting text compression algorithm, and Huffman coding. | |
116 | Compression is generally considerably better than that | |
117 | achieved by more conventional LZ77/LZ78-based compressors, | |
118 | and approaches the performance of the PPM family of statistical compressors. | |
119 | ||
120 | @code{bzip2} is built on top of @code{libbzip2}, a flexible library | |
121 | for handling compressed data in the @code{bzip2} format. This manual | |
122 | describes both how to use the program and | |
123 | how to work with the library interface. Most of the | |
124 | manual is devoted to this library, not the program, | |
125 | which is good news if your interest is only in the program. | |
126 | ||
127 | Chapter 2 describes how to use @code{bzip2}; this is the only part | |
128 | you need to read if you just want to know how to operate the program. | |
129 | Chapter 3 describes the programming interfaces in detail, and | |
130 | Chapter 4 records some miscellaneous notes which I thought | |
131 | ought to be recorded somewhere. | |
132 | ||
133 | ||
134 | @chapter How to use @code{bzip2} | |
135 | ||
136 | This chapter contains a copy of the @code{bzip2} man page, | |
137 | and nothing else. | |
138 | ||
139 | @quotation | |
140 | ||
141 | @unnumberedsubsubsec NAME | |
142 | @itemize | |
143 | @item @code{bzip2}, @code{bunzip2} | |
ba95a000 | 144 | - a block-sorting file compressor, v1.0.2 |
ef6bacff RC |
145 | @item @code{bzcat} |
146 | - decompresses files to stdout | |
147 | @item @code{bzip2recover} | |
148 | - recovers data from damaged bzip2 files | |
149 | @end itemize | |
150 | ||
151 | @unnumberedsubsubsec SYNOPSIS | |
152 | @itemize | |
153 | @item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ... ] | |
154 | @item @code{bunzip2} [ -fkvsVL ] [ filenames ... ] | |
155 | @item @code{bzcat} [ -s ] [ filenames ... ] | |
156 | @item @code{bzip2recover} filename | |
157 | @end itemize | |
158 | ||
159 | @unnumberedsubsubsec DESCRIPTION | |
160 | ||
161 | @code{bzip2} compresses files using the Burrows-Wheeler block sorting | |
162 | text compression algorithm, and Huffman coding. Compression is | |
163 | generally considerably better than that achieved by more conventional | |
164 | LZ77/LZ78-based compressors, and approaches the performance of the PPM | |
165 | family of statistical compressors. | |
166 | ||
167 | The command-line options are deliberately very similar to those of GNU | |
168 | @code{gzip}, but they are not identical. | |
169 | ||
170 | @code{bzip2} expects a list of file names to accompany the command-line | |
171 | flags. Each file is replaced by a compressed version of itself, with | |
172 | the name @code{original_name.bz2}. Each compressed file has the same | |
173 | modification date, permissions, and, when possible, ownership as the | |
174 | corresponding original, so that these properties can be correctly | |
175 | restored at decompression time. File name handling is naive in the | |
176 | sense that there is no mechanism for preserving original file names, | |
177 | permissions, ownerships or dates in filesystems which lack these | |
178 | concepts, or have serious file name length restrictions, such as MS-DOS. | |
179 | ||
180 | @code{bzip2} and @code{bunzip2} will by default not overwrite existing | |
181 | files. If you want this to happen, specify the @code{-f} flag. | |
182 | ||
183 | If no file names are specified, @code{bzip2} compresses from standard | |
184 | input to standard output. In this case, @code{bzip2} will decline to | |
185 | write compressed output to a terminal, as this would be entirely | |
186 | incomprehensible and therefore pointless. | |
187 | ||
188 | @code{bunzip2} (or @code{bzip2 -d}) decompresses all | |
189 | specified files. Files which were not created by @code{bzip2} | |
190 | will be detected and ignored, and a warning issued. | |
191 | @code{bzip2} attempts to guess the filename for the decompressed file | |
192 | from that of the compressed file as follows: | |
193 | @itemize | |
194 | @item @code{filename.bz2 } becomes @code{filename} | |
195 | @item @code{filename.bz } becomes @code{filename} | |
196 | @item @code{filename.tbz2} becomes @code{filename.tar} | |
197 | @item @code{filename.tbz } becomes @code{filename.tar} | |
198 | @item @code{anyothername } becomes @code{anyothername.out} | |
199 | @end itemize | |
200 | If the file does not end in one of the recognised endings, | |
201 | @code{.bz2}, @code{.bz}, | |
202 | @code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot | |
203 | guess the name of the original file, and uses the original name | |
204 | with @code{.out} appended. | |
205 | ||
206 | As with compression, supplying no | |
207 | filenames causes decompression from standard input to standard output. | |
208 | ||
209 | @code{bunzip2} will correctly decompress a file which is the | |
210 | concatenation of two or more compressed files. The result is the | |
211 | concatenation of the corresponding uncompressed files. Integrity | |
212 | testing (@code{-t}) of concatenated compressed files is also supported. | |
213 | ||
214 | You can also compress or decompress files to the standard output by | |
215 | giving the @code{-c} flag. Multiple files may be compressed and | |
216 | decompressed like this. The resulting outputs are fed sequentially to | |
217 | stdout. Compression of multiple files in this manner generates a stream | |
218 | containing multiple compressed file representations. Such a stream | |
219 | can be decompressed correctly only by @code{bzip2} version 0.9.0 or | |
220 | later. Earlier versions of @code{bzip2} will stop after decompressing | |
221 | the first file in the stream. | |
222 | ||
223 | @code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to | |
224 | the standard output. | |
225 | ||
226 | @code{bzip2} will read arguments from the environment variables | |
227 | @code{BZIP2} and @code{BZIP}, in that order, and will process them | |
228 | before any arguments read from the command line. This gives a | |
229 | convenient way to supply default arguments. | |
230 | ||
231 | Compression is always performed, even if the compressed file is slightly | |
232 | larger than the original. Files of less than about one hundred bytes | |
233 | tend to get larger, since the compression mechanism has a constant | |
234 | overhead in the region of 50 bytes. Random data (including the output | |
235 | of most file compressors) is coded at about 8.05 bits per byte, giving | |
236 | an expansion of around 0.5%. | |
237 | ||
238 | As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to | |
239 | make sure that the decompressed version of a file is identical to the | |
240 | original. This guards against corruption of the compressed data, and | |
241 | against undetected bugs in @code{bzip2} (hopefully very unlikely). The | |
242 | chances of data corruption going undetected is microscopic, about one | |
243 | chance in four billion for each file processed. Be aware, though, that | |
244 | the check occurs upon decompression, so it can only tell you that | |
245 | something is wrong. It can't help you recover the original uncompressed | |
246 | data. You can use @code{bzip2recover} to try to recover data from | |
247 | damaged files. | |
248 | ||
249 | Return values: 0 for a normal exit, 1 for environmental problems (file | |
250 | not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt | |
251 | compressed file, 3 for an internal consistency error (eg, bug) which | |
252 | caused @code{bzip2} to panic. | |
253 | ||
254 | ||
255 | @unnumberedsubsubsec OPTIONS | |
256 | @table @code | |
257 | @item -c --stdout | |
258 | Compress or decompress to standard output. | |
259 | @item -d --decompress | |
260 | Force decompression. @code{bzip2}, @code{bunzip2} and @code{bzcat} are | |
261 | really the same program, and the decision about what actions to take is | |
262 | done on the basis of which name is used. This flag overrides that | |
263 | mechanism, and forces bzip2 to decompress. | |
264 | @item -z --compress | |
265 | The complement to @code{-d}: forces compression, regardless of the | |
266 | invokation name. | |
267 | @item -t --test | |
268 | Check integrity of the specified file(s), but don't decompress them. | |
269 | This really performs a trial decompression and throws away the result. | |
270 | @item -f --force | |
271 | Force overwrite of output files. Normally, @code{bzip2} will not overwrite | |
272 | existing output files. Also forces @code{bzip2} to break hard links | |
273 | to files, which it otherwise wouldn't do. | |
ba95a000 RC |
274 | |
275 | @code{bzip2} normally declines to decompress files which don't have the | |
276 | correct magic header bytes. If forced (@code{-f}), however, it will | |
277 | pass such files through unmodified. This is how GNU @code{gzip} | |
278 | behaves. | |
ef6bacff RC |
279 | @item -k --keep |
280 | Keep (don't delete) input files during compression | |
281 | or decompression. | |
282 | @item -s --small | |
283 | Reduce memory usage, for compression, decompression and testing. Files | |
284 | are decompressed and tested using a modified algorithm which only | |
285 | requires 2.5 bytes per block byte. This means any file can be | |
286 | decompressed in 2300k of memory, albeit at about half the normal speed. | |
287 | ||
288 | During compression, @code{-s} selects a block size of 200k, which limits | |
289 | memory use to around the same figure, at the expense of your compression | |
290 | ratio. In short, if your machine is low on memory (8 megabytes or | |
291 | less), use -s for everything. See MEMORY MANAGEMENT below. | |
292 | @item -q --quiet | |
293 | Suppress non-essential warning messages. Messages pertaining to | |
294 | I/O errors and other critical events will not be suppressed. | |
295 | @item -v --verbose | |
296 | Verbose mode -- show the compression ratio for each file processed. | |
297 | Further @code{-v}'s increase the verbosity level, spewing out lots of | |
298 | information which is primarily of interest for diagnostic purposes. | |
299 | @item -L --license -V --version | |
300 | Display the software version, license terms and conditions. | |
ba95a000 | 301 | @item -1 (or --fast) to -9 (or --best) |
ef6bacff RC |
302 | Set the block size to 100 k, 200 k .. 900 k when compressing. Has no |
303 | effect when decompressing. See MEMORY MANAGEMENT below. | |
ba95a000 RC |
304 | The @code{--fast} and @code{--best} aliases are primarily for GNU |
305 | @code{gzip} compatibility. In particular, @code{--fast} doesn't make | |
306 | things significantly faster. And @code{--best} merely selects the | |
307 | default behaviour. | |
ef6bacff RC |
308 | @item -- |
309 | Treats all subsequent arguments as file names, even if they start | |
310 | with a dash. This is so you can handle files with names beginning | |
311 | with a dash, for example: @code{bzip2 -- -myfilename}. | |
312 | @item --repetitive-fast | |
313 | @item --repetitive-best | |
314 | These flags are redundant in versions 0.9.5 and above. They provided | |
315 | some coarse control over the behaviour of the sorting algorithm in | |
316 | earlier versions, which was sometimes useful. 0.9.5 and above have an | |
317 | improved algorithm which renders these flags irrelevant. | |
318 | @end table | |
319 | ||
320 | ||
321 | @unnumberedsubsubsec MEMORY MANAGEMENT | |
322 | ||
323 | @code{bzip2} compresses large files in blocks. The block size affects | |
324 | both the compression ratio achieved, and the amount of memory needed for | |
325 | compression and decompression. The flags @code{-1} through @code{-9} | |
326 | specify the block size to be 100,000 bytes through 900,000 bytes (the | |
327 | default) respectively. At decompression time, the block size used for | |
328 | compression is read from the header of the compressed file, and | |
329 | @code{bunzip2} then allocates itself just enough memory to decompress | |
330 | the file. Since block sizes are stored in compressed files, it follows | |
331 | that the flags @code{-1} to @code{-9} are irrelevant to and so ignored | |
332 | during decompression. | |
333 | ||
334 | Compression and decompression requirements, in bytes, can be estimated | |
335 | as: | |
336 | @example | |
337 | Compression: 400k + ( 8 x block size ) | |
338 | ||
339 | Decompression: 100k + ( 4 x block size ), or | |
340 | 100k + ( 2.5 x block size ) | |
341 | @end example | |
342 | Larger block sizes give rapidly diminishing marginal returns. Most of | |
343 | the compression comes from the first two or three hundred k of block | |
344 | size, a fact worth bearing in mind when using @code{bzip2} on small machines. | |
345 | It is also important to appreciate that the decompression memory | |
346 | requirement is set at compression time by the choice of block size. | |
347 | ||
348 | For files compressed with the default 900k block size, @code{bunzip2} | |
349 | will require about 3700 kbytes to decompress. To support decompression | |
350 | of any file on a 4 megabyte machine, @code{bunzip2} has an option to | |
351 | decompress using approximately half this amount of memory, about 2300 | |
352 | kbytes. Decompression speed is also halved, so you should use this | |
353 | option only where necessary. The relevant flag is @code{-s}. | |
354 | ||
355 | In general, try and use the largest block size memory constraints allow, | |
356 | since that maximises the compression achieved. Compression and | |
357 | decompression speed are virtually unaffected by block size. | |
358 | ||
359 | Another significant point applies to files which fit in a single block | |
360 | -- that means most files you'd encounter using a large block size. The | |
361 | amount of real memory touched is proportional to the size of the file, | |
362 | since the file is smaller than a block. For example, compressing a file | |
363 | 20,000 bytes long with the flag @code{-9} will cause the compressor to | |
364 | allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 | |
365 | kbytes of it. Similarly, the decompressor will allocate 3700k but only | |
366 | touch 100k + 20000 * 4 = 180 kbytes. | |
367 | ||
368 | Here is a table which summarises the maximum memory usage for different | |
369 | block sizes. Also recorded is the total compressed size for 14 files of | |
370 | the Calgary Text Compression Corpus totalling 3,141,622 bytes. This | |
371 | column gives some feel for how compression varies with block size. | |
372 | These figures tend to understate the advantage of larger block sizes for | |
373 | larger files, since the Corpus is dominated by smaller files. | |
374 | @example | |
375 | Compress Decompress Decompress Corpus | |
376 | Flag usage usage -s usage Size | |
377 | ||
378 | -1 1200k 500k 350k 914704 | |
379 | -2 2000k 900k 600k 877703 | |
380 | -3 2800k 1300k 850k 860338 | |
381 | -4 3600k 1700k 1100k 846899 | |
382 | -5 4400k 2100k 1350k 845160 | |
383 | -6 5200k 2500k 1600k 838626 | |
384 | -7 6100k 2900k 1850k 834096 | |
385 | -8 6800k 3300k 2100k 828642 | |
386 | -9 7600k 3700k 2350k 828642 | |
387 | @end example | |
388 | ||
389 | @unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES | |
390 | ||
391 | @code{bzip2} compresses files in blocks, usually 900kbytes long. Each | |
392 | block is handled independently. If a media or transmission error causes | |
393 | a multi-block @code{.bz2} file to become damaged, it may be possible to | |
394 | recover data from the undamaged blocks in the file. | |
395 | ||
396 | The compressed representation of each block is delimited by a 48-bit | |
397 | pattern, which makes it possible to find the block boundaries with | |
398 | reasonable certainty. Each block also carries its own 32-bit CRC, so | |
399 | damaged blocks can be distinguished from undamaged ones. | |
400 | ||
401 | @code{bzip2recover} is a simple program whose purpose is to search for | |
402 | blocks in @code{.bz2} files, and write each block out into its own | |
403 | @code{.bz2} file. You can then use @code{bzip2 -t} to test the | |
404 | integrity of the resulting files, and decompress those which are | |
405 | undamaged. | |
406 | ||
407 | @code{bzip2recover} | |
ba95a000 RC |
408 | takes a single argument, the name of the damaged file, and writes a |
409 | number of files @code{rec00001file.bz2}, @code{rec00002file.bz2}, etc, | |
410 | containing the extracted blocks. The output filenames are designed so | |
411 | that the use of wildcards in subsequent processing -- for example, | |
412 | @code{bzip2 -dc rec*file.bz2 > recovered_data} -- processes the files in | |
413 | the correct order. | |
ef6bacff RC |
414 | |
415 | @code{bzip2recover} should be of most use dealing with large @code{.bz2} | |
ba95a000 RC |
416 | files, as these will contain many blocks. It is clearly futile to use |
417 | it on damaged single-block files, since a damaged block cannot be | |
418 | recovered. If you wish to minimise any potential data loss through | |
419 | media or transmission errors, you might consider compressing with a | |
420 | smaller block size. | |
ef6bacff RC |
421 | |
422 | ||
423 | @unnumberedsubsubsec PERFORMANCE NOTES | |
424 | ||
425 | The sorting phase of compression gathers together similar strings in the | |
426 | file. Because of this, files containing very long runs of repeated | |
427 | symbols, like "aabaabaabaab ..." (repeated several hundred times) may | |
428 | compress more slowly than normal. Versions 0.9.5 and above fare much | |
429 | better than previous versions in this respect. The ratio between | |
430 | worst-case and average-case compression time is in the region of 10:1. | |
431 | For previous versions, this figure was more like 100:1. You can use the | |
432 | @code{-vvvv} option to monitor progress in great detail, if you want. | |
433 | ||
434 | Decompression speed is unaffected by these phenomena. | |
435 | ||
436 | @code{bzip2} usually allocates several megabytes of memory to operate | |
437 | in, and then charges all over it in a fairly random fashion. This means | |
438 | that performance, both for compressing and decompressing, is largely | |
439 | determined by the speed at which your machine can service cache misses. | |
440 | Because of this, small changes to the code to reduce the miss rate have | |
441 | been observed to give disproportionately large performance improvements. | |
442 | I imagine @code{bzip2} will perform best on machines with very large | |
443 | caches. | |
444 | ||
445 | ||
446 | @unnumberedsubsubsec CAVEATS | |
447 | ||
448 | I/O error messages are not as helpful as they could be. @code{bzip2} | |
449 | tries hard to detect I/O errors and exit cleanly, but the details of | |
450 | what the problem is sometimes seem rather misleading. | |
451 | ||
ba95a000 | 452 | This manual page pertains to version 1.0.2 of @code{bzip2}. Compressed |
ef6bacff | 453 | data created by this version is entirely forwards and backwards |
ba95a000 RC |
454 | compatible with the previous public releases, versions 0.1pl2, 0.9.0, |
455 | 0.9.5, 1.0.0 and 1.0.1, but with the following exception: 0.9.0 and | |
456 | above can correctly decompress multiple concatenated compressed files. | |
457 | 0.1pl2 cannot do this; it will stop after decompressing just the first | |
458 | file in the stream. | |
459 | ||
460 | @code{bzip2recover} versions prior to this one, 1.0.2, used 32-bit | |
461 | integers to represent bit positions in compressed files, so it could not | |
462 | handle compressed files more than 512 megabytes long. Version 1.0.2 and | |
463 | above uses 64-bit ints on some platforms which support them (GNU | |
464 | supported targets, and Windows). To establish whether or not | |
465 | @code{bzip2recover} was built with such a limitation, run it without | |
466 | arguments. In any event you can build yourself an unlimited version if | |
467 | you can recompile it with @code{MaybeUInt64} set to be an unsigned | |
468 | 64-bit integer. | |
ef6bacff | 469 | |
ef6bacff RC |
470 | |
471 | ||
472 | @unnumberedsubsubsec AUTHOR | |
473 | Julian Seward, @code{jseward@@acm.org}. | |
474 | ||
ba95a000 RC |
475 | @code{http://sources.redhat.com/bzip2} |
476 | ||
ef6bacff RC |
477 | The ideas embodied in @code{bzip2} are due to (at least) the following |
478 | people: Michael Burrows and David Wheeler (for the block sorting | |
479 | transformation), David Wheeler (again, for the Huffman coder), Peter | |
480 | Fenwick (for the structured coding model in the original @code{bzip}, | |
481 | and many refinements), and Alistair Moffat, Radford Neal and Ian Witten | |
482 | (for the arithmetic coder in the original @code{bzip}). I am much | |
483 | indebted for their help, support and advice. See the manual in the | |
484 | source distribution for pointers to sources of documentation. Christian | |
485 | von Roques encouraged me to look for faster sorting algorithms, so as to | |
486 | speed up compression. Bela Lubkin encouraged me to improve the | |
ba95a000 RC |
487 | worst-case compression performance. The @code{bz*} scripts are derived |
488 | from those of GNU @code{gzip}. Many people sent patches, helped with | |
489 | portability problems, lent machines, gave advice and were generally | |
ef6bacff RC |
490 | helpful. |
491 | ||
492 | @end quotation | |
493 | ||
494 | ||
495 | ||
496 | ||
497 | @chapter Programming with @code{libbzip2} | |
498 | ||
499 | This chapter describes the programming interface to @code{libbzip2}. | |
500 | ||
501 | For general background information, particularly about memory | |
502 | use and performance aspects, you'd be well advised to read Chapter 2 | |
503 | as well. | |
504 | ||
505 | @section Top-level structure | |
506 | ||
507 | @code{libbzip2} is a flexible library for compressing and decompressing | |
508 | data in the @code{bzip2} data format. Although packaged as a single | |
509 | entity, it helps to regard the library as three separate parts: the low | |
510 | level interface, and the high level interface, and some utility | |
511 | functions. | |
512 | ||
513 | The structure of @code{libbzip2}'s interfaces is similar to | |
514 | that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib} | |
515 | library. | |
516 | ||
517 | All externally visible symbols have names beginning @code{BZ2_}. | |
518 | This is new in version 1.0. The intention is to minimise pollution | |
519 | of the namespaces of library clients. | |
520 | ||
521 | @subsection Low-level summary | |
522 | ||
523 | This interface provides services for compressing and decompressing | |
524 | data in memory. There's no provision for dealing with files, streams | |
525 | or any other I/O mechanisms, just straight memory-to-memory work. | |
526 | In fact, this part of the library can be compiled without inclusion | |
527 | of @code{stdio.h}, which may be helpful for embedded applications. | |
528 | ||
529 | The low-level part of the library has no global variables and | |
530 | is therefore thread-safe. | |
531 | ||
532 | Six routines make up the low level interface: | |
533 | @code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, and @* @code{BZ2_bzCompressEnd} | |
534 | for compression, | |
535 | and a corresponding trio @code{BZ2_bzDecompressInit}, @* @code{BZ2_bzDecompress} | |
536 | and @code{BZ2_bzDecompressEnd} for decompression. | |
537 | The @code{*Init} functions allocate | |
538 | memory for compression/decompression and do other | |
539 | initialisations, whilst the @code{*End} functions close down operations | |
540 | and release memory. | |
541 | ||
542 | The real work is done by @code{BZ2_bzCompress} and @code{BZ2_bzDecompress}. | |
543 | These compress and decompress data from a user-supplied input buffer | |
544 | to a user-supplied output buffer. These buffers can be any size; | |
545 | arbitrary quantities of data are handled by making repeated calls | |
546 | to these functions. This is a flexible mechanism allowing a | |
547 | consumer-pull style of activity, or producer-push, or a mixture of | |
548 | both. | |
549 | ||
550 | ||
551 | ||
552 | @subsection High-level summary | |
553 | ||
554 | This interface provides some handy wrappers around the low-level | |
555 | interface to facilitate reading and writing @code{bzip2} format | |
556 | files (@code{.bz2} files). The routines provide hooks to facilitate | |
557 | reading files in which the @code{bzip2} data stream is embedded | |
558 | within some larger-scale file structure, or where there are | |
559 | multiple @code{bzip2} data streams concatenated end-to-end. | |
560 | ||
561 | For reading files, @code{BZ2_bzReadOpen}, @code{BZ2_bzRead}, | |
562 | @code{BZ2_bzReadClose} and @* @code{BZ2_bzReadGetUnused} are supplied. For | |
563 | writing files, @code{BZ2_bzWriteOpen}, @code{BZ2_bzWrite} and | |
564 | @code{BZ2_bzWriteFinish} are available. | |
565 | ||
566 | As with the low-level library, no global variables are used | |
567 | so the library is per se thread-safe. However, if I/O errors | |
568 | occur whilst reading or writing the underlying compressed files, | |
569 | you may have to consult @code{errno} to determine the cause of | |
570 | the error. In that case, you'd need a C library which correctly | |
571 | supports @code{errno} in a multithreaded environment. | |
572 | ||
573 | To make the library a little simpler and more portable, | |
574 | @code{BZ2_bzReadOpen} and @code{BZ2_bzWriteOpen} require you to pass them file | |
575 | handles (@code{FILE*}s) which have previously been opened for reading or | |
576 | writing respectively. That avoids portability problems associated with | |
577 | file operations and file attributes, whilst not being much of an | |
578 | imposition on the programmer. | |
579 | ||
580 | ||
581 | ||
582 | @subsection Utility functions summary | |
583 | For very simple needs, @code{BZ2_bzBuffToBuffCompress} and | |
584 | @code{BZ2_bzBuffToBuffDecompress} are provided. These compress | |
585 | data in memory from one buffer to another buffer in a single | |
586 | function call. You should assess whether these functions | |
587 | fulfill your memory-to-memory compression/decompression | |
588 | requirements before investing effort in understanding the more | |
589 | general but more complex low-level interface. | |
590 | ||
591 | Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} / | |
592 | @code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to | |
593 | give better @code{zlib} compatibility. These functions are | |
594 | @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush}, | |
595 | @code{BZ2_bzclose}, | |
596 | @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}. You may find these functions | |
597 | more convenient for simple file reading and writing, than those in the | |
598 | high-level interface. These functions are not (yet) officially part of | |
599 | the library, and are minimally documented here. If they break, you | |
600 | get to keep all the pieces. I hope to document them properly when time | |
601 | permits. | |
602 | ||
603 | Yoshioka also contributed modifications to allow the library to be | |
604 | built as a Windows DLL. | |
605 | ||
606 | ||
607 | @section Error handling | |
608 | ||
609 | The library is designed to recover cleanly in all situations, including | |
610 | the worst-case situation of decompressing random data. I'm not | |
611 | 100% sure that it can always do this, so you might want to add | |
612 | a signal handler to catch segmentation violations during decompression | |
613 | if you are feeling especially paranoid. I would be interested in | |
614 | hearing more about the robustness of the library to corrupted | |
615 | compressed data. | |
616 | ||
617 | Version 1.0 is much more robust in this respect than | |
618 | 0.9.0 or 0.9.5. Investigations with Checker (a tool for | |
619 | detecting problems with memory management, similar to Purify) | |
620 | indicate that, at least for the few files I tested, all single-bit | |
621 | errors in the decompressed data are caught properly, with no | |
622 | segmentation faults, no reads of uninitialised data and no | |
623 | out of range reads or writes. So it's certainly much improved, | |
624 | although I wouldn't claim it to be totally bombproof. | |
625 | ||
626 | The file @code{bzlib.h} contains all definitions needed to use | |
627 | the library. In particular, you should definitely not include | |
628 | @code{bzlib_private.h}. | |
629 | ||
630 | In @code{bzlib.h}, the various return values are defined. The following | |
631 | list is not intended as an exhaustive description of the circumstances | |
632 | in which a given value may be returned -- those descriptions are given | |
633 | later. Rather, it is intended to convey the rough meaning of each | |
634 | return value. The first five actions are normal and not intended to | |
635 | denote an error situation. | |
636 | @table @code | |
637 | @item BZ_OK | |
638 | The requested action was completed successfully. | |
639 | @item BZ_RUN_OK | |
640 | @itemx BZ_FLUSH_OK | |
641 | @itemx BZ_FINISH_OK | |
642 | In @code{BZ2_bzCompress}, the requested flush/finish/nothing-special action | |
643 | was completed successfully. | |
644 | @item BZ_STREAM_END | |
645 | Compression of data was completed, or the logical stream end was | |
646 | detected during decompression. | |
647 | @end table | |
648 | ||
649 | The following return values indicate an error of some kind. | |
650 | @table @code | |
651 | @item BZ_CONFIG_ERROR | |
652 | Indicates that the library has been improperly compiled on your | |
653 | platform -- a major configuration error. Specifically, it means | |
654 | that @code{sizeof(char)}, @code{sizeof(short)} and @code{sizeof(int)} | |
655 | are not 1, 2 and 4 respectively, as they should be. Note that the | |
656 | library should still work properly on 64-bit platforms which follow | |
657 | the LP64 programming model -- that is, where @code{sizeof(long)} | |
658 | and @code{sizeof(void*)} are 8. Under LP64, @code{sizeof(int)} is | |
659 | still 4, so @code{libbzip2}, which doesn't use the @code{long} type, | |
660 | is OK. | |
661 | @item BZ_SEQUENCE_ERROR | |
662 | When using the library, it is important to call the functions in the | |
663 | correct sequence and with data structures (buffers etc) in the correct | |
664 | states. @code{libbzip2} checks as much as it can to ensure this is | |
665 | happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which | |
666 | complies precisely with the function semantics, as detailed below, | |
667 | should never receive this value; such an event denotes buggy code | |
668 | which you should investigate. | |
669 | @item BZ_PARAM_ERROR | |
670 | Returned when a parameter to a function call is out of range | |
671 | or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR}, | |
672 | this denotes a bug in the client code. The distinction between | |
673 | @code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth | |
674 | making. | |
675 | @item BZ_MEM_ERROR | |
676 | Returned when a request to allocate memory failed. Note that the | |
677 | quantity of memory needed to decompress a stream cannot be determined | |
678 | until the stream's header has been read. So @code{BZ2_bzDecompress} and | |
679 | @code{BZ2_bzRead} may return @code{BZ_MEM_ERROR} even though some of | |
680 | the compressed data has been read. The same is not true for | |
681 | compression; once @code{BZ2_bzCompressInit} or @code{BZ2_bzWriteOpen} have | |
682 | successfully completed, @code{BZ_MEM_ERROR} cannot occur. | |
683 | @item BZ_DATA_ERROR | |
684 | Returned when a data integrity error is detected during decompression. | |
685 | Most importantly, this means when stored and computed CRCs for the | |
686 | data do not match. This value is also returned upon detection of any | |
687 | other anomaly in the compressed data. | |
688 | @item BZ_DATA_ERROR_MAGIC | |
689 | As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to | |
690 | know when the compressed stream does not start with the correct | |
691 | magic bytes (@code{'B' 'Z' 'h'}). | |
692 | @item BZ_IO_ERROR | |
693 | Returned by @code{BZ2_bzRead} and @code{BZ2_bzWrite} when there is an error | |
694 | reading or writing in the compressed file, and by @code{BZ2_bzReadOpen} | |
695 | and @code{BZ2_bzWriteOpen} for attempts to use a file for which the | |
696 | error indicator (viz, @code{ferror(f)}) is set. | |
697 | On receipt of @code{BZ_IO_ERROR}, the caller should consult | |
698 | @code{errno} and/or @code{perror} to acquire operating-system | |
699 | specific information about the problem. | |
700 | @item BZ_UNEXPECTED_EOF | |
701 | Returned by @code{BZ2_bzRead} when the compressed file finishes | |
702 | before the logical end of stream is detected. | |
703 | @item BZ_OUTBUFF_FULL | |
704 | Returned by @code{BZ2_bzBuffToBuffCompress} and | |
705 | @code{BZ2_bzBuffToBuffDecompress} to indicate that the output data | |
706 | will not fit into the output buffer provided. | |
707 | @end table | |
708 | ||
709 | ||
710 | ||
711 | @section Low-level interface | |
712 | ||
713 | @subsection @code{BZ2_bzCompressInit} | |
714 | @example | |
715 | typedef | |
716 | struct @{ | |
717 | char *next_in; | |
718 | unsigned int avail_in; | |
719 | unsigned int total_in_lo32; | |
720 | unsigned int total_in_hi32; | |
721 | ||
722 | char *next_out; | |
723 | unsigned int avail_out; | |
724 | unsigned int total_out_lo32; | |
725 | unsigned int total_out_hi32; | |
726 | ||
727 | void *state; | |
728 | ||
729 | void *(*bzalloc)(void *,int,int); | |
730 | void (*bzfree)(void *,void *); | |
731 | void *opaque; | |
732 | @} | |
733 | bz_stream; | |
734 | ||
735 | int BZ2_bzCompressInit ( bz_stream *strm, | |
736 | int blockSize100k, | |
737 | int verbosity, | |
738 | int workFactor ); | |
739 | ||
740 | @end example | |
741 | ||
742 | Prepares for compression. The @code{bz_stream} structure | |
743 | holds all data pertaining to the compression activity. | |
744 | A @code{bz_stream} structure should be allocated and initialised | |
745 | prior to the call. | |
746 | The fields of @code{bz_stream} | |
747 | comprise the entirety of the user-visible data. @code{state} | |
748 | is a pointer to the private data structures required for compression. | |
749 | ||
750 | Custom memory allocators are supported, via fields @code{bzalloc}, | |
751 | @code{bzfree}, | |
752 | and @code{opaque}. The value | |
753 | @code{opaque} is passed to as the first argument to | |
754 | all calls to @code{bzalloc} and @code{bzfree}, but is | |
755 | otherwise ignored by the library. | |
756 | The call @code{bzalloc ( opaque, n, m )} is expected to return a | |
757 | pointer @code{p} to | |
758 | @code{n * m} bytes of memory, and @code{bzfree ( opaque, p )} | |
759 | should free | |
760 | that memory. | |
761 | ||
762 | If you don't want to use a custom memory allocator, set @code{bzalloc}, | |
763 | @code{bzfree} and | |
764 | @code{opaque} to @code{NULL}, | |
765 | and the library will then use the standard @code{malloc}/@code{free} | |
766 | routines. | |
767 | ||
768 | Before calling @code{BZ2_bzCompressInit}, fields @code{bzalloc}, | |
769 | @code{bzfree} and @code{opaque} should | |
770 | be filled appropriately, as just described. Upon return, the internal | |
771 | state will have been allocated and initialised, and @code{total_in_lo32}, | |
772 | @code{total_in_hi32}, @code{total_out_lo32} and | |
773 | @code{total_out_hi32} will have been set to zero. | |
774 | These four fields are used by the library | |
775 | to inform the caller of the total amount of data passed into and out of | |
776 | the library, respectively. You should not try to change them. | |
777 | As of version 1.0, 64-bit counts are maintained, even on 32-bit | |
778 | platforms, using the @code{_hi32} fields to store the upper 32 bits | |
779 | of the count. So, for example, the total amount of data in | |
780 | is @code{(total_in_hi32 << 32) + total_in_lo32}. | |
781 | ||
782 | Parameter @code{blockSize100k} specifies the block size to be used for | |
783 | compression. It should be a value between 1 and 9 inclusive, and the | |
784 | actual block size used is 100000 x this figure. 9 gives the best | |
785 | compression but takes most memory. | |
786 | ||
787 | Parameter @code{verbosity} should be set to a number between 0 and 4 | |
788 | inclusive. 0 is silent, and greater numbers give increasingly verbose | |
789 | monitoring/debugging output. If the library has been compiled with | |
790 | @code{-DBZ_NO_STDIO}, no such output will appear for any verbosity | |
791 | setting. | |
792 | ||
793 | Parameter @code{workFactor} controls how the compression phase behaves | |
794 | when presented with worst case, highly repetitive, input data. If | |
795 | compression runs into difficulties caused by repetitive data, the | |
796 | library switches from the standard sorting algorithm to a fallback | |
797 | algorithm. The fallback is slower than the standard algorithm by | |
798 | perhaps a factor of three, but always behaves reasonably, no matter how | |
799 | bad the input. | |
800 | ||
801 | Lower values of @code{workFactor} reduce the amount of effort the | |
802 | standard algorithm will expend before resorting to the fallback. You | |
803 | should set this parameter carefully; too low, and many inputs will be | |
804 | handled by the fallback algorithm and so compress rather slowly, too | |
805 | high, and your average-to-worst case compression times can become very | |
806 | large. The default value of 30 gives reasonable behaviour over a wide | |
807 | range of circumstances. | |
808 | ||
809 | Allowable values range from 0 to 250 inclusive. 0 is a special case, | |
810 | equivalent to using the default value of 30. | |
811 | ||
812 | Note that the compressed output generated is the same regardless of | |
813 | whether or not the fallback algorithm is used. | |
814 | ||
815 | Be aware also that this parameter may disappear entirely in future | |
816 | versions of the library. In principle it should be possible to devise a | |
817 | good way to automatically choose which algorithm to use. Such a | |
818 | mechanism would render the parameter obsolete. | |
819 | ||
820 | Possible return values: | |
821 | @display | |
822 | @code{BZ_CONFIG_ERROR} | |
823 | if the library has been mis-compiled | |
824 | @code{BZ_PARAM_ERROR} | |
825 | if @code{strm} is @code{NULL} | |
826 | or @code{blockSize} < 1 or @code{blockSize} > 9 | |
827 | or @code{verbosity} < 0 or @code{verbosity} > 4 | |
828 | or @code{workFactor} < 0 or @code{workFactor} > 250 | |
829 | @code{BZ_MEM_ERROR} | |
830 | if not enough memory is available | |
831 | @code{BZ_OK} | |
832 | otherwise | |
833 | @end display | |
834 | Allowable next actions: | |
835 | @display | |
836 | @code{BZ2_bzCompress} | |
837 | if @code{BZ_OK} is returned | |
838 | no specific action needed in case of error | |
839 | @end display | |
840 | ||
841 | @subsection @code{BZ2_bzCompress} | |
842 | @example | |
843 | int BZ2_bzCompress ( bz_stream *strm, int action ); | |
844 | @end example | |
845 | Provides more input and/or output buffer space for the library. The | |
846 | caller maintains input and output buffers, and calls @code{BZ2_bzCompress} to | |
847 | transfer data between them. | |
848 | ||
849 | Before each call to @code{BZ2_bzCompress}, @code{next_in} should point at | |
850 | the data to be compressed, and @code{avail_in} should indicate how many | |
851 | bytes the library may read. @code{BZ2_bzCompress} updates @code{next_in}, | |
852 | @code{avail_in} and @code{total_in} to reflect the number of bytes it | |
853 | has read. | |
854 | ||
855 | Similarly, @code{next_out} should point to a buffer in which the | |
856 | compressed data is to be placed, with @code{avail_out} indicating how | |
857 | much output space is available. @code{BZ2_bzCompress} updates | |
858 | @code{next_out}, @code{avail_out} and @code{total_out} to reflect the | |
859 | number of bytes output. | |
860 | ||
861 | You may provide and remove as little or as much data as you like on each | |
862 | call of @code{BZ2_bzCompress}. In the limit, it is acceptable to supply and | |
863 | remove data one byte at a time, although this would be terribly | |
864 | inefficient. You should always ensure that at least one byte of output | |
865 | space is available at each call. | |
866 | ||
867 | A second purpose of @code{BZ2_bzCompress} is to request a change of mode of the | |
868 | compressed stream. | |
869 | ||
870 | Conceptually, a compressed stream can be in one of four states: IDLE, | |
871 | RUNNING, FLUSHING and FINISHING. Before initialisation | |
872 | (@code{BZ2_bzCompressInit}) and after termination (@code{BZ2_bzCompressEnd}), a | |
873 | stream is regarded as IDLE. | |
874 | ||
875 | Upon initialisation (@code{BZ2_bzCompressInit}), the stream is placed in the | |
876 | RUNNING state. Subsequent calls to @code{BZ2_bzCompress} should pass | |
877 | @code{BZ_RUN} as the requested action; other actions are illegal and | |
878 | will result in @code{BZ_SEQUENCE_ERROR}. | |
879 | ||
880 | At some point, the calling program will have provided all the input data | |
881 | it wants to. It will then want to finish up -- in effect, asking the | |
882 | library to process any data it might have buffered internally. In this | |
883 | state, @code{BZ2_bzCompress} will no longer attempt to read data from | |
884 | @code{next_in}, but it will want to write data to @code{next_out}. | |
885 | Because the output buffer supplied by the user can be arbitrarily small, | |
886 | the finishing-up operation cannot necessarily be done with a single call | |
887 | of @code{BZ2_bzCompress}. | |
888 | ||
889 | Instead, the calling program passes @code{BZ_FINISH} as an action to | |
890 | @code{BZ2_bzCompress}. This changes the stream's state to FINISHING. Any | |
891 | remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and | |
892 | transferred to the output buffer. To do this, @code{BZ2_bzCompress} must be | |
893 | called repeatedly until all the output has been consumed. At that | |
894 | point, @code{BZ2_bzCompress} returns @code{BZ_STREAM_END}, and the stream's | |
895 | state is set back to IDLE. @code{BZ2_bzCompressEnd} should then be | |
896 | called. | |
897 | ||
898 | Just to make sure the calling program does not cheat, the library makes | |
899 | a note of @code{avail_in} at the time of the first call to | |
900 | @code{BZ2_bzCompress} which has @code{BZ_FINISH} as an action (ie, at the | |
901 | time the program has announced its intention to not supply any more | |
902 | input). By comparing this value with that of @code{avail_in} over | |
903 | subsequent calls to @code{BZ2_bzCompress}, the library can detect any | |
904 | attempts to slip in more data to compress. Any calls for which this is | |
905 | detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a | |
906 | programming mistake which should be corrected. | |
907 | ||
908 | Instead of asking to finish, the calling program may ask | |
909 | @code{BZ2_bzCompress} to take all the remaining input, compress it and | |
910 | terminate the current (Burrows-Wheeler) compression block. This could | |
911 | be useful for error control purposes. The mechanism is analogous to | |
912 | that for finishing: call @code{BZ2_bzCompress} with an action of | |
913 | @code{BZ_FLUSH}, remove output data, and persist with the | |
914 | @code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned. As | |
915 | with finishing, @code{BZ2_bzCompress} detects any attempt to provide more | |
916 | input data once the flush has begun. | |
917 | ||
918 | Once the flush is complete, the stream returns to the normal RUNNING | |
919 | state. | |
920 | ||
921 | This all sounds pretty complex, but isn't really. Here's a table | |
922 | which shows which actions are allowable in each state, what action | |
923 | will be taken, what the next state is, and what the non-error return | |
924 | values are. Note that you can't explicitly ask what state the | |
925 | stream is in, but nor do you need to -- it can be inferred from the | |
926 | values returned by @code{BZ2_bzCompress}. | |
927 | @display | |
928 | IDLE/@code{any} | |
929 | Illegal. IDLE state only exists after @code{BZ2_bzCompressEnd} or | |
930 | before @code{BZ2_bzCompressInit}. | |
931 | Return value = @code{BZ_SEQUENCE_ERROR} | |
932 | ||
933 | RUNNING/@code{BZ_RUN} | |
934 | Compress from @code{next_in} to @code{next_out} as much as possible. | |
935 | Next state = RUNNING | |
936 | Return value = @code{BZ_RUN_OK} | |
937 | ||
938 | RUNNING/@code{BZ_FLUSH} | |
939 | Remember current value of @code{next_in}. Compress from @code{next_in} | |
940 | to @code{next_out} as much as possible, but do not accept any more input. | |
941 | Next state = FLUSHING | |
942 | Return value = @code{BZ_FLUSH_OK} | |
943 | ||
944 | RUNNING/@code{BZ_FINISH} | |
945 | Remember current value of @code{next_in}. Compress from @code{next_in} | |
946 | to @code{next_out} as much as possible, but do not accept any more input. | |
947 | Next state = FINISHING | |
948 | Return value = @code{BZ_FINISH_OK} | |
949 | ||
950 | FLUSHING/@code{BZ_FLUSH} | |
951 | Compress from @code{next_in} to @code{next_out} as much as possible, | |
952 | but do not accept any more input. | |
953 | If all the existing input has been used up and all compressed | |
954 | output has been removed | |
955 | Next state = RUNNING; Return value = @code{BZ_RUN_OK} | |
956 | else | |
957 | Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK} | |
958 | ||
959 | FLUSHING/other | |
960 | Illegal. | |
961 | Return value = @code{BZ_SEQUENCE_ERROR} | |
962 | ||
963 | FINISHING/@code{BZ_FINISH} | |
964 | Compress from @code{next_in} to @code{next_out} as much as possible, | |
965 | but to not accept any more input. | |
966 | If all the existing input has been used up and all compressed | |
967 | output has been removed | |
968 | Next state = IDLE; Return value = @code{BZ_STREAM_END} | |
969 | else | |
970 | Next state = FINISHING; Return value = @code{BZ_FINISHING} | |
971 | ||
972 | FINISHING/other | |
973 | Illegal. | |
974 | Return value = @code{BZ_SEQUENCE_ERROR} | |
975 | @end display | |
976 | ||
977 | That still looks complicated? Well, fair enough. The usual sequence | |
978 | of calls for compressing a load of data is: | |
979 | @itemize @bullet | |
980 | @item Get started with @code{BZ2_bzCompressInit}. | |
981 | @item Shovel data in and shlurp out its compressed form using zero or more | |
982 | calls of @code{BZ2_bzCompress} with action = @code{BZ_RUN}. | |
983 | @item Finish up. | |
984 | Repeatedly call @code{BZ2_bzCompress} with action = @code{BZ_FINISH}, | |
985 | copying out the compressed output, until @code{BZ_STREAM_END} is returned. | |
986 | @item Close up and go home. Call @code{BZ2_bzCompressEnd}. | |
987 | @end itemize | |
988 | If the data you want to compress fits into your input buffer all | |
989 | at once, you can skip the calls of @code{BZ2_bzCompress ( ..., BZ_RUN )} and | |
990 | just do the @code{BZ2_bzCompress ( ..., BZ_FINISH )} calls. | |
991 | ||
992 | All required memory is allocated by @code{BZ2_bzCompressInit}. The | |
993 | compression library can accept any data at all (obviously). So you | |
994 | shouldn't get any error return values from the @code{BZ2_bzCompress} calls. | |
995 | If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in | |
996 | your programming. | |
997 | ||
998 | Trivial other possible return values: | |
999 | @display | |
1000 | @code{BZ_PARAM_ERROR} | |
1001 | if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL} | |
1002 | @end display | |
1003 | ||
1004 | @subsection @code{BZ2_bzCompressEnd} | |
1005 | @example | |
1006 | int BZ2_bzCompressEnd ( bz_stream *strm ); | |
1007 | @end example | |
1008 | Releases all memory associated with a compression stream. | |
1009 | ||
1010 | Possible return values: | |
1011 | @display | |
1012 | @code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | |
1013 | @code{BZ_OK} otherwise | |
1014 | @end display | |
1015 | ||
1016 | ||
1017 | @subsection @code{BZ2_bzDecompressInit} | |
1018 | @example | |
1019 | int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small ); | |
1020 | @end example | |
1021 | Prepares for decompression. As with @code{BZ2_bzCompressInit}, a | |
1022 | @code{bz_stream} record should be allocated and initialised before the | |
1023 | call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be | |
1024 | set if a custom memory allocator is required, or made @code{NULL} for | |
1025 | the normal @code{malloc}/@code{free} routines. Upon return, the internal | |
1026 | state will have been initialised, and @code{total_in} and | |
1027 | @code{total_out} will be zero. | |
1028 | ||
1029 | For the meaning of parameter @code{verbosity}, see @code{BZ2_bzCompressInit}. | |
1030 | ||
1031 | If @code{small} is nonzero, the library will use an alternative | |
1032 | decompression algorithm which uses less memory but at the cost of | |
1033 | decompressing more slowly (roughly speaking, half the speed, but the | |
1034 | maximum memory requirement drops to around 2300k). See Chapter 2 for | |
1035 | more information on memory management. | |
1036 | ||
1037 | Note that the amount of memory needed to decompress | |
1038 | a stream cannot be determined until the stream's header has been read, | |
1039 | so even if @code{BZ2_bzDecompressInit} succeeds, a subsequent | |
1040 | @code{BZ2_bzDecompress} could fail with @code{BZ_MEM_ERROR}. | |
1041 | ||
1042 | Possible return values: | |
1043 | @display | |
1044 | @code{BZ_CONFIG_ERROR} | |
1045 | if the library has been mis-compiled | |
1046 | @code{BZ_PARAM_ERROR} | |
1047 | if @code{(small != 0 && small != 1)} | |
1048 | or @code{(verbosity < 0 || verbosity > 4)} | |
1049 | @code{BZ_MEM_ERROR} | |
1050 | if insufficient memory is available | |
1051 | @end display | |
1052 | ||
1053 | Allowable next actions: | |
1054 | @display | |
1055 | @code{BZ2_bzDecompress} | |
1056 | if @code{BZ_OK} was returned | |
1057 | no specific action required in case of error | |
1058 | @end display | |
1059 | ||
1060 | ||
1061 | ||
1062 | @subsection @code{BZ2_bzDecompress} | |
1063 | @example | |
1064 | int BZ2_bzDecompress ( bz_stream *strm ); | |
1065 | @end example | |
1066 | Provides more input and/out output buffer space for the library. The | |
1067 | caller maintains input and output buffers, and uses @code{BZ2_bzDecompress} | |
1068 | to transfer data between them. | |
1069 | ||
1070 | Before each call to @code{BZ2_bzDecompress}, @code{next_in} | |
1071 | should point at the compressed data, | |
1072 | and @code{avail_in} should indicate how many bytes the library | |
1073 | may read. @code{BZ2_bzDecompress} updates @code{next_in}, @code{avail_in} | |
1074 | and @code{total_in} | |
1075 | to reflect the number of bytes it has read. | |
1076 | ||
1077 | Similarly, @code{next_out} should point to a buffer in which the uncompressed | |
1078 | output is to be placed, with @code{avail_out} indicating how much output space | |
1079 | is available. @code{BZ2_bzCompress} updates @code{next_out}, | |
1080 | @code{avail_out} and @code{total_out} to reflect | |
1081 | the number of bytes output. | |
1082 | ||
1083 | You may provide and remove as little or as much data as you like on | |
1084 | each call of @code{BZ2_bzDecompress}. | |
1085 | In the limit, it is acceptable to | |
1086 | supply and remove data one byte at a time, although this would be | |
1087 | terribly inefficient. You should always ensure that at least one | |
1088 | byte of output space is available at each call. | |
1089 | ||
1090 | Use of @code{BZ2_bzDecompress} is simpler than @code{BZ2_bzCompress}. | |
1091 | ||
1092 | You should provide input and remove output as described above, and | |
1093 | repeatedly call @code{BZ2_bzDecompress} until @code{BZ_STREAM_END} is | |
1094 | returned. Appearance of @code{BZ_STREAM_END} denotes that | |
1095 | @code{BZ2_bzDecompress} has detected the logical end of the compressed | |
1096 | stream. @code{BZ2_bzDecompress} will not produce @code{BZ_STREAM_END} until | |
1097 | all output data has been placed into the output buffer, so once | |
1098 | @code{BZ_STREAM_END} appears, you are guaranteed to have available all | |
1099 | the decompressed output, and @code{BZ2_bzDecompressEnd} can safely be | |
1100 | called. | |
1101 | ||
1102 | If case of an error return value, you should call @code{BZ2_bzDecompressEnd} | |
1103 | to clean up and release memory. | |
1104 | ||
1105 | Possible return values: | |
1106 | @display | |
1107 | @code{BZ_PARAM_ERROR} | |
1108 | if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | |
1109 | or @code{strm->avail_out < 1} | |
1110 | @code{BZ_DATA_ERROR} | |
1111 | if a data integrity error is detected in the compressed stream | |
1112 | @code{BZ_DATA_ERROR_MAGIC} | |
1113 | if the compressed stream doesn't begin with the right magic bytes | |
1114 | @code{BZ_MEM_ERROR} | |
1115 | if there wasn't enough memory available | |
1116 | @code{BZ_STREAM_END} | |
1117 | if the logical end of the data stream was detected and all | |
1118 | output in has been consumed, eg @code{s->avail_out > 0} | |
1119 | @code{BZ_OK} | |
1120 | otherwise | |
1121 | @end display | |
1122 | Allowable next actions: | |
1123 | @display | |
1124 | @code{BZ2_bzDecompress} | |
1125 | if @code{BZ_OK} was returned | |
1126 | @code{BZ2_bzDecompressEnd} | |
1127 | otherwise | |
1128 | @end display | |
1129 | ||
1130 | ||
1131 | @subsection @code{BZ2_bzDecompressEnd} | |
1132 | @example | |
1133 | int BZ2_bzDecompressEnd ( bz_stream *strm ); | |
1134 | @end example | |
1135 | Releases all memory associated with a decompression stream. | |
1136 | ||
1137 | Possible return values: | |
1138 | @display | |
1139 | @code{BZ_PARAM_ERROR} | |
1140 | if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | |
1141 | @code{BZ_OK} | |
1142 | otherwise | |
1143 | @end display | |
1144 | ||
1145 | Allowable next actions: | |
1146 | @display | |
1147 | None. | |
1148 | @end display | |
1149 | ||
1150 | ||
1151 | @section High-level interface | |
1152 | ||
1153 | This interface provides functions for reading and writing | |
1154 | @code{bzip2} format files. First, some general points. | |
1155 | ||
1156 | @itemize @bullet | |
1157 | @item All of the functions take an @code{int*} first argument, | |
1158 | @code{bzerror}. | |
1159 | After each call, @code{bzerror} should be consulted first to determine | |
1160 | the outcome of the call. If @code{bzerror} is @code{BZ_OK}, | |
1161 | the call completed | |
1162 | successfully, and only then should the return value of the function | |
1163 | (if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR}, | |
1164 | there was an error | |
1165 | reading/writing the underlying compressed file, and you should | |
1166 | then consult @code{errno}/@code{perror} to determine the | |
1167 | cause of the difficulty. | |
1168 | @code{bzerror} may also be set to various other values; precise details are | |
1169 | given on a per-function basis below. | |
1170 | @item If @code{bzerror} indicates an error | |
1171 | (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}), | |
1172 | you should immediately call @code{BZ2_bzReadClose} (or @code{BZ2_bzWriteClose}, | |
1173 | depending on whether you are attempting to read or to write) | |
1174 | to free up all resources associated | |
1175 | with the stream. Once an error has been indicated, behaviour of all calls | |
1176 | except @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) is undefined. | |
1177 | The implication is that (1) @code{bzerror} should | |
1178 | be checked after each call, and (2) if @code{bzerror} indicates an error, | |
1179 | @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) should then be called to clean up. | |
1180 | @item The @code{FILE*} arguments passed to | |
1181 | @code{BZ2_bzReadOpen}/@code{BZ2_bzWriteOpen} | |
1182 | should be set to binary mode. | |
1183 | Most Unix systems will do this by default, but other platforms, | |
1184 | including Windows and Mac, will not. If you omit this, you may | |
1185 | encounter problems when moving code to new platforms. | |
1186 | @item Memory allocation requests are handled by | |
1187 | @code{malloc}/@code{free}. | |
1188 | At present | |
1189 | there is no facility for user-defined memory allocators in the file I/O | |
1190 | functions (could easily be added, though). | |
1191 | @end itemize | |
1192 | ||
1193 | ||
1194 | ||
1195 | @subsection @code{BZ2_bzReadOpen} | |
1196 | @example | |
1197 | typedef void BZFILE; | |
1198 | ||
1199 | BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f, | |
1200 | int small, int verbosity, | |
1201 | void *unused, int nUnused ); | |
1202 | @end example | |
1203 | Prepare to read compressed data from file handle @code{f}. @code{f} | |
1204 | should refer to a file which has been opened for reading, and for which | |
1205 | the error indicator (@code{ferror(f)})is not set. If @code{small} is 1, | |
1206 | the library will try to decompress using less memory, at the expense of | |
1207 | speed. | |
1208 | ||
1209 | For reasons explained below, @code{BZ2_bzRead} will decompress the | |
1210 | @code{nUnused} bytes starting at @code{unused}, before starting to read | |
1211 | from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be | |
1212 | supplied like this. If this facility is not required, you should pass | |
1213 | @code{NULL} and @code{0} for @code{unused} and n@code{Unused} | |
1214 | respectively. | |
1215 | ||
1216 | For the meaning of parameters @code{small} and @code{verbosity}, | |
1217 | see @code{BZ2_bzDecompressInit}. | |
1218 | ||
1219 | The amount of memory needed to decompress a file cannot be determined | |
1220 | until the file's header has been read. So it is possible that | |
1221 | @code{BZ2_bzReadOpen} returns @code{BZ_OK} but a subsequent call of | |
1222 | @code{BZ2_bzRead} will return @code{BZ_MEM_ERROR}. | |
1223 | ||
1224 | Possible assignments to @code{bzerror}: | |
1225 | @display | |
1226 | @code{BZ_CONFIG_ERROR} | |
1227 | if the library has been mis-compiled | |
1228 | @code{BZ_PARAM_ERROR} | |
1229 | if @code{f} is @code{NULL} | |
1230 | or @code{small} is neither @code{0} nor @code{1} | |
1231 | or @code{(unused == NULL && nUnused != 0)} | |
1232 | or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))} | |
1233 | @code{BZ_IO_ERROR} | |
1234 | if @code{ferror(f)} is nonzero | |
1235 | @code{BZ_MEM_ERROR} | |
1236 | if insufficient memory is available | |
1237 | @code{BZ_OK} | |
1238 | otherwise. | |
1239 | @end display | |
1240 | ||
1241 | Possible return values: | |
1242 | @display | |
1243 | Pointer to an abstract @code{BZFILE} | |
1244 | if @code{bzerror} is @code{BZ_OK} | |
1245 | @code{NULL} | |
1246 | otherwise | |
1247 | @end display | |
1248 | ||
1249 | Allowable next actions: | |
1250 | @display | |
1251 | @code{BZ2_bzRead} | |
1252 | if @code{bzerror} is @code{BZ_OK} | |
1253 | @code{BZ2_bzClose} | |
1254 | otherwise | |
1255 | @end display | |
1256 | ||
1257 | ||
1258 | @subsection @code{BZ2_bzRead} | |
1259 | @example | |
1260 | int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); | |
1261 | @end example | |
1262 | Reads up to @code{len} (uncompressed) bytes from the compressed file | |
1263 | @code{b} into | |
1264 | the buffer @code{buf}. If the read was successful, | |
1265 | @code{bzerror} is set to @code{BZ_OK} | |
1266 | and the number of bytes read is returned. If the logical end-of-stream | |
1267 | was detected, @code{bzerror} will be set to @code{BZ_STREAM_END}, | |
1268 | and the number | |
1269 | of bytes read is returned. All other @code{bzerror} values denote an error. | |
1270 | ||
1271 | @code{BZ2_bzRead} will supply @code{len} bytes, | |
1272 | unless the logical stream end is detected | |
1273 | or an error occurs. Because of this, it is possible to detect the | |
1274 | stream end by observing when the number of bytes returned is | |
1275 | less than the number | |
1276 | requested. Nevertheless, this is regarded as inadvisable; you should | |
1277 | instead check @code{bzerror} after every call and watch out for | |
1278 | @code{BZ_STREAM_END}. | |
1279 | ||
1280 | Internally, @code{BZ2_bzRead} copies data from the compressed file in chunks | |
1281 | of size @code{BZ_MAX_UNUSED} bytes | |
1282 | before decompressing it. If the file contains more bytes than strictly | |
1283 | needed to reach the logical end-of-stream, @code{BZ2_bzRead} will almost certainly | |
1284 | read some of the trailing data before signalling @code{BZ_SEQUENCE_END}. | |
1285 | To collect the read but unused data once @code{BZ_SEQUENCE_END} has | |
1286 | appeared, call @code{BZ2_bzReadGetUnused} immediately before @code{BZ2_bzReadClose}. | |
1287 | ||
1288 | Possible assignments to @code{bzerror}: | |
1289 | @display | |
1290 | @code{BZ_PARAM_ERROR} | |
1291 | if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} | |
1292 | @code{BZ_SEQUENCE_ERROR} | |
1293 | if @code{b} was opened with @code{BZ2_bzWriteOpen} | |
1294 | @code{BZ_IO_ERROR} | |
1295 | if there is an error reading from the compressed file | |
1296 | @code{BZ_UNEXPECTED_EOF} | |
1297 | if the compressed file ended before the logical end-of-stream was detected | |
1298 | @code{BZ_DATA_ERROR} | |
1299 | if a data integrity error was detected in the compressed stream | |
1300 | @code{BZ_DATA_ERROR_MAGIC} | |
1301 | if the stream does not begin with the requisite header bytes (ie, is not | |
1302 | a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}. | |
1303 | @code{BZ_MEM_ERROR} | |
1304 | if insufficient memory was available | |
1305 | @code{BZ_STREAM_END} | |
1306 | if the logical end of stream was detected. | |
1307 | @code{BZ_OK} | |
1308 | otherwise. | |
1309 | @end display | |
1310 | ||
1311 | Possible return values: | |
1312 | @display | |
1313 | number of bytes read | |
1314 | if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END} | |
1315 | undefined | |
1316 | otherwise | |
1317 | @end display | |
1318 | ||
1319 | Allowable next actions: | |
1320 | @display | |
1321 | collect data from @code{buf}, then @code{BZ2_bzRead} or @code{BZ2_bzReadClose} | |
1322 | if @code{bzerror} is @code{BZ_OK} | |
1323 | collect data from @code{buf}, then @code{BZ2_bzReadClose} or @code{BZ2_bzReadGetUnused} | |
1324 | if @code{bzerror} is @code{BZ_SEQUENCE_END} | |
1325 | @code{BZ2_bzReadClose} | |
1326 | otherwise | |
1327 | @end display | |
1328 | ||
1329 | ||
1330 | ||
1331 | @subsection @code{BZ2_bzReadGetUnused} | |
1332 | @example | |
1333 | void BZ2_bzReadGetUnused ( int* bzerror, BZFILE *b, | |
1334 | void** unused, int* nUnused ); | |
1335 | @end example | |
1336 | Returns data which was read from the compressed file but was not needed | |
1337 | to get to the logical end-of-stream. @code{*unused} is set to the address | |
1338 | of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will | |
1339 | be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive. | |
1340 | ||
1341 | This function may only be called once @code{BZ2_bzRead} has signalled | |
1342 | @code{BZ_STREAM_END} but before @code{BZ2_bzReadClose}. | |
1343 | ||
1344 | Possible assignments to @code{bzerror}: | |
1345 | @display | |
1346 | @code{BZ_PARAM_ERROR} | |
1347 | if @code{b} is @code{NULL} | |
1348 | or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL} | |
1349 | @code{BZ_SEQUENCE_ERROR} | |
1350 | if @code{BZ_STREAM_END} has not been signalled | |
1351 | or if @code{b} was opened with @code{BZ2_bzWriteOpen} | |
1352 | @code{BZ_OK} | |
1353 | otherwise | |
1354 | @end display | |
1355 | ||
1356 | Allowable next actions: | |
1357 | @display | |
1358 | @code{BZ2_bzReadClose} | |
1359 | @end display | |
1360 | ||
1361 | ||
1362 | @subsection @code{BZ2_bzReadClose} | |
1363 | @example | |
1364 | void BZ2_bzReadClose ( int *bzerror, BZFILE *b ); | |
1365 | @end example | |
1366 | Releases all memory pertaining to the compressed file @code{b}. | |
1367 | @code{BZ2_bzReadClose} does not call @code{fclose} on the underlying file | |
1368 | handle, so you should do that yourself if appropriate. | |
1369 | @code{BZ2_bzReadClose} should be called to clean up after all error | |
1370 | situations. | |
1371 | ||
1372 | Possible assignments to @code{bzerror}: | |
1373 | @display | |
1374 | @code{BZ_SEQUENCE_ERROR} | |
1375 | if @code{b} was opened with @code{BZ2_bzOpenWrite} | |
1376 | @code{BZ_OK} | |
1377 | otherwise | |
1378 | @end display | |
1379 | ||
1380 | Allowable next actions: | |
1381 | @display | |
1382 | none | |
1383 | @end display | |
1384 | ||
1385 | ||
1386 | ||
1387 | @subsection @code{BZ2_bzWriteOpen} | |
1388 | @example | |
1389 | BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f, | |
1390 | int blockSize100k, int verbosity, | |
1391 | int workFactor ); | |
1392 | @end example | |
1393 | Prepare to write compressed data to file handle @code{f}. | |
1394 | @code{f} should refer to | |
1395 | a file which has been opened for writing, and for which the error | |
1396 | indicator (@code{ferror(f)})is not set. | |
1397 | ||
1398 | For the meaning of parameters @code{blockSize100k}, | |
1399 | @code{verbosity} and @code{workFactor}, see | |
1400 | @* @code{BZ2_bzCompressInit}. | |
1401 | ||
1402 | All required memory is allocated at this stage, so if the call | |
1403 | completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a | |
1404 | subsequent call to @code{BZ2_bzWrite}. | |
1405 | ||
1406 | Possible assignments to @code{bzerror}: | |
1407 | @display | |
1408 | @code{BZ_CONFIG_ERROR} | |
1409 | if the library has been mis-compiled | |
1410 | @code{BZ_PARAM_ERROR} | |
1411 | if @code{f} is @code{NULL} | |
1412 | or @code{blockSize100k < 1} or @code{blockSize100k > 9} | |
1413 | @code{BZ_IO_ERROR} | |
1414 | if @code{ferror(f)} is nonzero | |
1415 | @code{BZ_MEM_ERROR} | |
1416 | if insufficient memory is available | |
1417 | @code{BZ_OK} | |
1418 | otherwise | |
1419 | @end display | |
1420 | ||
1421 | Possible return values: | |
1422 | @display | |
1423 | Pointer to an abstract @code{BZFILE} | |
1424 | if @code{bzerror} is @code{BZ_OK} | |
1425 | @code{NULL} | |
1426 | otherwise | |
1427 | @end display | |
1428 | ||
1429 | Allowable next actions: | |
1430 | @display | |
1431 | @code{BZ2_bzWrite} | |
1432 | if @code{bzerror} is @code{BZ_OK} | |
1433 | (you could go directly to @code{BZ2_bzWriteClose}, but this would be pretty pointless) | |
1434 | @code{BZ2_bzWriteClose} | |
1435 | otherwise | |
1436 | @end display | |
1437 | ||
1438 | ||
1439 | ||
1440 | @subsection @code{BZ2_bzWrite} | |
1441 | @example | |
1442 | void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); | |
1443 | @end example | |
1444 | Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be | |
1445 | compressed and written to the file. | |
1446 | ||
1447 | Possible assignments to @code{bzerror}: | |
1448 | @display | |
1449 | @code{BZ_PARAM_ERROR} | |
1450 | if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} | |
1451 | @code{BZ_SEQUENCE_ERROR} | |
1452 | if b was opened with @code{BZ2_bzReadOpen} | |
1453 | @code{BZ_IO_ERROR} | |
1454 | if there is an error writing the compressed file. | |
1455 | @code{BZ_OK} | |
1456 | otherwise | |
1457 | @end display | |
1458 | ||
1459 | ||
1460 | ||
1461 | ||
1462 | @subsection @code{BZ2_bzWriteClose} | |
1463 | @example | |
1464 | void BZ2_bzWriteClose ( int *bzerror, BZFILE* f, | |
1465 | int abandon, | |
1466 | unsigned int* nbytes_in, | |
1467 | unsigned int* nbytes_out ); | |
1468 | ||
1469 | void BZ2_bzWriteClose64 ( int *bzerror, BZFILE* f, | |
1470 | int abandon, | |
1471 | unsigned int* nbytes_in_lo32, | |
1472 | unsigned int* nbytes_in_hi32, | |
1473 | unsigned int* nbytes_out_lo32, | |
1474 | unsigned int* nbytes_out_hi32 ); | |
1475 | @end example | |
1476 | ||
1477 | Compresses and flushes to the compressed file all data so far supplied | |
1478 | by @code{BZ2_bzWrite}. The logical end-of-stream markers are also written, so | |
1479 | subsequent calls to @code{BZ2_bzWrite} are illegal. All memory associated | |
1480 | with the compressed file @code{b} is released. | |
1481 | @code{fflush} is called on the | |
1482 | compressed file, but it is not @code{fclose}'d. | |
1483 | ||
1484 | If @code{BZ2_bzWriteClose} is called to clean up after an error, the only | |
1485 | action is to release the memory. The library records the error codes | |
1486 | issued by previous calls, so this situation will be detected | |
1487 | automatically. There is no attempt to complete the compression | |
1488 | operation, nor to @code{fflush} the compressed file. You can force this | |
1489 | behaviour to happen even in the case of no error, by passing a nonzero | |
1490 | value to @code{abandon}. | |
1491 | ||
1492 | If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the | |
1493 | total volume of uncompressed data handled. Similarly, @code{nbytes_out} | |
1494 | will be set to the total volume of compressed data written. For | |
1495 | compatibility with older versions of the library, @code{BZ2_bzWriteClose} | |
1496 | only yields the lower 32 bits of these counts. Use | |
1497 | @code{BZ2_bzWriteClose64} if you want the full 64 bit counts. These | |
1498 | two functions are otherwise absolutely identical. | |
1499 | ||
1500 | ||
1501 | Possible assignments to @code{bzerror}: | |
1502 | @display | |
1503 | @code{BZ_SEQUENCE_ERROR} | |
1504 | if @code{b} was opened with @code{BZ2_bzReadOpen} | |
1505 | @code{BZ_IO_ERROR} | |
1506 | if there is an error writing the compressed file | |
1507 | @code{BZ_OK} | |
1508 | otherwise | |
1509 | @end display | |
1510 | ||
1511 | @subsection Handling embedded compressed data streams | |
1512 | ||
1513 | The high-level library facilitates use of | |
1514 | @code{bzip2} data streams which form some part of a surrounding, larger | |
1515 | data stream. | |
1516 | @itemize @bullet | |
1517 | @item For writing, the library takes an open file handle, writes | |
1518 | compressed data to it, @code{fflush}es it but does not @code{fclose} it. | |
1519 | The calling application can write its own data before and after the | |
1520 | compressed data stream, using that same file handle. | |
1521 | @item Reading is more complex, and the facilities are not as general | |
1522 | as they could be since generality is hard to reconcile with efficiency. | |
1523 | @code{BZ2_bzRead} reads from the compressed file in blocks of size | |
1524 | @code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot | |
1525 | the logical end of compressed stream. | |
1526 | To recover this data once decompression has | |
1527 | ended, call @code{BZ2_bzReadGetUnused} after the last call of @code{BZ2_bzRead} | |
1528 | (the one returning @code{BZ_STREAM_END}) but before calling | |
1529 | @code{BZ2_bzReadClose}. | |
1530 | @end itemize | |
1531 | ||
1532 | This mechanism makes it easy to decompress multiple @code{bzip2} | |
1533 | streams placed end-to-end. As the end of one stream, when @code{BZ2_bzRead} | |
1534 | returns @code{BZ_STREAM_END}, call @code{BZ2_bzReadGetUnused} to collect the | |
1535 | unused data (copy it into your own buffer somewhere). | |
1536 | That data forms the start of the next compressed stream. | |
1537 | To start uncompressing that next stream, call @code{BZ2_bzReadOpen} again, | |
1538 | feeding in the unused data via the @code{unused}/@code{nUnused} | |
1539 | parameters. | |
1540 | Keep doing this until @code{BZ_STREAM_END} return coincides with the | |
1541 | physical end of file (@code{feof(f)}). In this situation | |
1542 | @code{BZ2_bzReadGetUnused} | |
1543 | will of course return no data. | |
1544 | ||
1545 | This should give some feel for how the high-level interface can be used. | |
1546 | If you require extra flexibility, you'll have to bite the bullet and get | |
1547 | to grips with the low-level interface. | |
1548 | ||
1549 | @subsection Standard file-reading/writing code | |
1550 | Here's how you'd write data to a compressed file: | |
1551 | @example @code | |
1552 | FILE* f; | |
1553 | BZFILE* b; | |
1554 | int nBuf; | |
1555 | char buf[ /* whatever size you like */ ]; | |
1556 | int bzerror; | |
1557 | int nWritten; | |
1558 | ||
1559 | f = fopen ( "myfile.bz2", "w" ); | |
1560 | if (!f) @{ | |
1561 | /* handle error */ | |
1562 | @} | |
1563 | b = BZ2_bzWriteOpen ( &bzerror, f, 9 ); | |
1564 | if (bzerror != BZ_OK) @{ | |
1565 | BZ2_bzWriteClose ( b ); | |
1566 | /* handle error */ | |
1567 | @} | |
1568 | ||
1569 | while ( /* condition */ ) @{ | |
1570 | /* get data to write into buf, and set nBuf appropriately */ | |
1571 | nWritten = BZ2_bzWrite ( &bzerror, b, buf, nBuf ); | |
1572 | if (bzerror == BZ_IO_ERROR) @{ | |
1573 | BZ2_bzWriteClose ( &bzerror, b ); | |
1574 | /* handle error */ | |
1575 | @} | |
1576 | @} | |
1577 | ||
1578 | BZ2_bzWriteClose ( &bzerror, b ); | |
1579 | if (bzerror == BZ_IO_ERROR) @{ | |
1580 | /* handle error */ | |
1581 | @} | |
1582 | @end example | |
1583 | And to read from a compressed file: | |
1584 | @example | |
1585 | FILE* f; | |
1586 | BZFILE* b; | |
1587 | int nBuf; | |
1588 | char buf[ /* whatever size you like */ ]; | |
1589 | int bzerror; | |
1590 | int nWritten; | |
1591 | ||
1592 | f = fopen ( "myfile.bz2", "r" ); | |
1593 | if (!f) @{ | |
1594 | /* handle error */ | |
1595 | @} | |
1596 | b = BZ2_bzReadOpen ( &bzerror, f, 0, NULL, 0 ); | |
1597 | if (bzerror != BZ_OK) @{ | |
1598 | BZ2_bzReadClose ( &bzerror, b ); | |
1599 | /* handle error */ | |
1600 | @} | |
1601 | ||
1602 | bzerror = BZ_OK; | |
1603 | while (bzerror == BZ_OK && /* arbitrary other conditions */) @{ | |
1604 | nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ ); | |
1605 | if (bzerror == BZ_OK) @{ | |
1606 | /* do something with buf[0 .. nBuf-1] */ | |
1607 | @} | |
1608 | @} | |
1609 | if (bzerror != BZ_STREAM_END) @{ | |
1610 | BZ2_bzReadClose ( &bzerror, b ); | |
1611 | /* handle error */ | |
1612 | @} else @{ | |
1613 | BZ2_bzReadClose ( &bzerror ); | |
1614 | @} | |
1615 | @end example | |
1616 | ||
1617 | ||
1618 | ||
1619 | @section Utility functions | |
1620 | @subsection @code{BZ2_bzBuffToBuffCompress} | |
1621 | @example | |
1622 | int BZ2_bzBuffToBuffCompress( char* dest, | |
1623 | unsigned int* destLen, | |
1624 | char* source, | |
1625 | unsigned int sourceLen, | |
1626 | int blockSize100k, | |
1627 | int verbosity, | |
1628 | int workFactor ); | |
1629 | @end example | |
1630 | Attempts to compress the data in @code{source[0 .. sourceLen-1]} | |
1631 | into the destination buffer, @code{dest[0 .. *destLen-1]}. | |
1632 | If the destination buffer is big enough, @code{*destLen} is | |
1633 | set to the size of the compressed data, and @code{BZ_OK} is | |
1634 | returned. If the compressed data won't fit, @code{*destLen} | |
1635 | is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. | |
1636 | ||
1637 | Compression in this manner is a one-shot event, done with a single call | |
1638 | to this function. The resulting compressed data is a complete | |
1639 | @code{bzip2} format data stream. There is no mechanism for making | |
1640 | additional calls to provide extra input data. If you want that kind of | |
1641 | mechanism, use the low-level interface. | |
1642 | ||
1643 | For the meaning of parameters @code{blockSize100k}, @code{verbosity} | |
1644 | and @code{workFactor}, @* see @code{BZ2_bzCompressInit}. | |
1645 | ||
1646 | To guarantee that the compressed data will fit in its buffer, allocate | |
1647 | an output buffer of size 1% larger than the uncompressed data, plus | |
1648 | six hundred extra bytes. | |
1649 | ||
1650 | @code{BZ2_bzBuffToBuffDecompress} will not write data at or | |
1651 | beyond @code{dest[*destLen]}, even in case of buffer overflow. | |
1652 | ||
1653 | Possible return values: | |
1654 | @display | |
1655 | @code{BZ_CONFIG_ERROR} | |
1656 | if the library has been mis-compiled | |
1657 | @code{BZ_PARAM_ERROR} | |
1658 | if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} | |
1659 | or @code{blockSize100k < 1} or @code{blockSize100k > 9} | |
1660 | or @code{verbosity < 0} or @code{verbosity > 4} | |
1661 | or @code{workFactor < 0} or @code{workFactor > 250} | |
1662 | @code{BZ_MEM_ERROR} | |
1663 | if insufficient memory is available | |
1664 | @code{BZ_OUTBUFF_FULL} | |
1665 | if the size of the compressed data exceeds @code{*destLen} | |
1666 | @code{BZ_OK} | |
1667 | otherwise | |
1668 | @end display | |
1669 | ||
1670 | ||
1671 | ||
1672 | @subsection @code{BZ2_bzBuffToBuffDecompress} | |
1673 | @example | |
1674 | int BZ2_bzBuffToBuffDecompress ( char* dest, | |
1675 | unsigned int* destLen, | |
1676 | char* source, | |
1677 | unsigned int sourceLen, | |
1678 | int small, | |
1679 | int verbosity ); | |
1680 | @end example | |
1681 | Attempts to decompress the data in @code{source[0 .. sourceLen-1]} | |
1682 | into the destination buffer, @code{dest[0 .. *destLen-1]}. | |
1683 | If the destination buffer is big enough, @code{*destLen} is | |
1684 | set to the size of the uncompressed data, and @code{BZ_OK} is | |
1685 | returned. If the compressed data won't fit, @code{*destLen} | |
1686 | is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. | |
1687 | ||
1688 | @code{source} is assumed to hold a complete @code{bzip2} format | |
1689 | data stream. @* @code{BZ2_bzBuffToBuffDecompress} tries to decompress | |
1690 | the entirety of the stream into the output buffer. | |
1691 | ||
1692 | For the meaning of parameters @code{small} and @code{verbosity}, | |
1693 | see @code{BZ2_bzDecompressInit}. | |
1694 | ||
1695 | Because the compression ratio of the compressed data cannot be known in | |
1696 | advance, there is no easy way to guarantee that the output buffer will | |
1697 | be big enough. You may of course make arrangements in your code to | |
1698 | record the size of the uncompressed data, but such a mechanism is beyond | |
1699 | the scope of this library. | |
1700 | ||
1701 | @code{BZ2_bzBuffToBuffDecompress} will not write data at or | |
1702 | beyond @code{dest[*destLen]}, even in case of buffer overflow. | |
1703 | ||
1704 | Possible return values: | |
1705 | @display | |
1706 | @code{BZ_CONFIG_ERROR} | |
1707 | if the library has been mis-compiled | |
1708 | @code{BZ_PARAM_ERROR} | |
1709 | if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} | |
1710 | or @code{small != 0 && small != 1} | |
1711 | or @code{verbosity < 0} or @code{verbosity > 4} | |
1712 | @code{BZ_MEM_ERROR} | |
1713 | if insufficient memory is available | |
1714 | @code{BZ_OUTBUFF_FULL} | |
1715 | if the size of the compressed data exceeds @code{*destLen} | |
1716 | @code{BZ_DATA_ERROR} | |
1717 | if a data integrity error was detected in the compressed data | |
1718 | @code{BZ_DATA_ERROR_MAGIC} | |
1719 | if the compressed data doesn't begin with the right magic bytes | |
1720 | @code{BZ_UNEXPECTED_EOF} | |
1721 | if the compressed data ends unexpectedly | |
1722 | @code{BZ_OK} | |
1723 | otherwise | |
1724 | @end display | |
1725 | ||
1726 | ||
1727 | ||
1728 | @section @code{zlib} compatibility functions | |
1729 | Yoshioka Tsuneo has contributed some functions to | |
1730 | give better @code{zlib} compatibility. These functions are | |
1731 | @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush}, | |
1732 | @code{BZ2_bzclose}, | |
1733 | @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}. | |
1734 | These functions are not (yet) officially part of | |
1735 | the library. If they break, you get to keep all the pieces. | |
1736 | Nevertheless, I think they work ok. | |
1737 | @example | |
1738 | typedef void BZFILE; | |
1739 | ||
1740 | const char * BZ2_bzlibVersion ( void ); | |
1741 | @end example | |
1742 | Returns a string indicating the library version. | |
1743 | @example | |
1744 | BZFILE * BZ2_bzopen ( const char *path, const char *mode ); | |
1745 | BZFILE * BZ2_bzdopen ( int fd, const char *mode ); | |
1746 | @end example | |
1747 | Opens a @code{.bz2} file for reading or writing, using either its name | |
1748 | or a pre-existing file descriptor. | |
1749 | Analogous to @code{fopen} and @code{fdopen}. | |
1750 | @example | |
1751 | int BZ2_bzread ( BZFILE* b, void* buf, int len ); | |
1752 | int BZ2_bzwrite ( BZFILE* b, void* buf, int len ); | |
1753 | @end example | |
1754 | Reads/writes data from/to a previously opened @code{BZFILE}. | |
1755 | Analogous to @code{fread} and @code{fwrite}. | |
1756 | @example | |
1757 | int BZ2_bzflush ( BZFILE* b ); | |
1758 | void BZ2_bzclose ( BZFILE* b ); | |
1759 | @end example | |
1760 | Flushes/closes a @code{BZFILE}. @code{BZ2_bzflush} doesn't actually do | |
1761 | anything. Analogous to @code{fflush} and @code{fclose}. | |
1762 | ||
1763 | @example | |
1764 | const char * BZ2_bzerror ( BZFILE *b, int *errnum ) | |
1765 | @end example | |
1766 | Returns a string describing the more recent error status of | |
1767 | @code{b}, and also sets @code{*errnum} to its numerical value. | |
1768 | ||
1769 | ||
1770 | @section Using the library in a @code{stdio}-free environment | |
1771 | ||
1772 | @subsection Getting rid of @code{stdio} | |
1773 | ||
1774 | In a deeply embedded application, you might want to use just | |
1775 | the memory-to-memory functions. You can do this conveniently | |
1776 | by compiling the library with preprocessor symbol @code{BZ_NO_STDIO} | |
1777 | defined. Doing this gives you a library containing only the following | |
1778 | eight functions: | |
1779 | ||
1780 | @code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, @code{BZ2_bzCompressEnd} @* | |
1781 | @code{BZ2_bzDecompressInit}, @code{BZ2_bzDecompress}, @code{BZ2_bzDecompressEnd} @* | |
1782 | @code{BZ2_bzBuffToBuffCompress}, @code{BZ2_bzBuffToBuffDecompress} | |
1783 | ||
1784 | When compiled like this, all functions will ignore @code{verbosity} | |
1785 | settings. | |
1786 | ||
1787 | @subsection Critical error handling | |
1788 | @code{libbzip2} contains a number of internal assertion checks which | |
1789 | should, needless to say, never be activated. Nevertheless, if an | |
1790 | assertion should fail, behaviour depends on whether or not the library | |
1791 | was compiled with @code{BZ_NO_STDIO} set. | |
1792 | ||
1793 | For a normal compile, an assertion failure yields the message | |
1794 | @example | |
1795 | bzip2/libbzip2: internal error number N. | |
ba95a000 | 1796 | This is a bug in bzip2/libbzip2, 1.0.2, 30-Dec-2001. |
ef6bacff RC |
1797 | Please report it to me at: jseward@@acm.org. If this happened |
1798 | when you were using some program which uses libbzip2 as a | |
1799 | component, you should also report this bug to the author(s) | |
1800 | of that program. Please make an effort to report this bug; | |
1801 | timely and accurate bug reports eventually lead to higher | |
ba95a000 | 1802 | quality software. Thanks. Julian Seward, 30 December 2001. |
ef6bacff | 1803 | @end example |
ba95a000 RC |
1804 | where @code{N} is some error code number. If @code{N == 1007}, it also |
1805 | prints some extra text advising the reader that unreliable memory is | |
1806 | often associated with internal error 1007. (This is a | |
1807 | frequently-observed-phenomenon with versions 1.0.0/1.0.1). | |
1808 | ||
1809 | @code{exit(3)} is then called. | |
ef6bacff RC |
1810 | |
1811 | For a @code{stdio}-free library, assertion failures result | |
1812 | in a call to a function declared as: | |
1813 | @example | |
1814 | extern void bz_internal_error ( int errcode ); | |
1815 | @end example | |
1816 | The relevant code is passed as a parameter. You should supply | |
1817 | such a function. | |
1818 | ||
1819 | In either case, once an assertion failure has occurred, any | |
1820 | @code{bz_stream} records involved can be regarded as invalid. | |
1821 | You should not attempt to resume normal operation with them. | |
1822 | ||
1823 | You may, of course, change critical error handling to suit | |
1824 | your needs. As I said above, critical errors indicate bugs | |
1825 | in the library and should not occur. All "normal" error | |
1826 | situations are indicated via error return codes from functions, | |
1827 | and can be recovered from. | |
1828 | ||
1829 | ||
1830 | @section Making a Windows DLL | |
1831 | Everything related to Windows has been contributed by Yoshioka Tsuneo | |
1832 | @* (@code{QWF00133@@niftyserve.or.jp} / | |
1833 | @code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to | |
1834 | him (but perhaps Cc: me, @code{jseward@@acm.org}). | |
1835 | ||
1836 | My vague understanding of what to do is: using Visual C++ 5.0, | |
1837 | open the project file @code{libbz2.dsp}, and build. That's all. | |
1838 | ||
1839 | If you can't | |
1840 | open the project file for some reason, make a new one, naming these files: | |
1841 | @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, | |
1842 | @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* | |
1843 | @code{randtable.c} and @code{libbz2.def}. You will also need | |
1844 | to name the header files @code{bzlib.h} and @code{bzlib_private.h}. | |
1845 | ||
1846 | If you don't use VC++, you may need to define the proprocessor symbol | |
1847 | @code{_WIN32}. | |
1848 | ||
1849 | Finally, @code{dlltest.c} is a sample program using the DLL. It has a | |
1850 | project file, @code{dlltest.dsp}. | |
1851 | ||
1852 | If you just want a makefile for Visual C, have a look at | |
1853 | @code{makefile.msc}. | |
1854 | ||
1855 | Be aware that if you compile @code{bzip2} itself on Win32, you must set | |
1856 | @code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file | |
1857 | @code{bzip2.c}, before compiling. Otherwise the resulting binary won't | |
1858 | work correctly. | |
1859 | ||
1860 | I haven't tried any of this stuff myself, but it all looks plausible. | |
1861 | ||
1862 | ||
1863 | ||
1864 | @chapter Miscellanea | |
1865 | ||
1866 | These are just some random thoughts of mine. Your mileage may | |
1867 | vary. | |
1868 | ||
1869 | @section Limitations of the compressed file format | |
1870 | @code{bzip2-1.0}, @code{0.9.5} and @code{0.9.0} | |
1871 | use exactly the same file format as the previous | |
1872 | version, @code{bzip2-0.1}. This decision was made in the interests of | |
1873 | stability. Creating yet another incompatible compressed file format | |
1874 | would create further confusion and disruption for users. | |
1875 | ||
1876 | Nevertheless, this is not a painless decision. Development | |
1877 | work since the release of @code{bzip2-0.1} in August 1997 | |
1878 | has shown complexities in the file format which slow down | |
1879 | decompression and, in retrospect, are unnecessary. These are: | |
1880 | @itemize @bullet | |
1881 | @item The run-length encoder, which is the first of the | |
1882 | compression transformations, is entirely irrelevant. | |
1883 | The original purpose was to protect the sorting algorithm | |
1884 | from the very worst case input: a string of repeated | |
1885 | symbols. But algorithm steps Q6a and Q6b in the original | |
1886 | Burrows-Wheeler technical report (SRC-124) show how | |
1887 | repeats can be handled without difficulty in block | |
1888 | sorting. | |
1889 | @item The randomisation mechanism doesn't really need to be | |
1890 | there. Udi Manber and Gene Myers published a suffix | |
1891 | array construction algorithm a few years back, which | |
1892 | can be employed to sort any block, no matter how | |
1893 | repetitive, in O(N log N) time. Subsequent work by | |
1894 | Kunihiko Sadakane has produced a derivative O(N (log N)^2) | |
1895 | algorithm which usually outperforms the Manber-Myers | |
1896 | algorithm. | |
1897 | ||
1898 | I could have changed to Sadakane's algorithm, but I find | |
1899 | it to be slower than @code{bzip2}'s existing algorithm for | |
1900 | most inputs, and the randomisation mechanism protects | |
1901 | adequately against bad cases. I didn't think it was | |
1902 | a good tradeoff to make. Partly this is due to the fact | |
1903 | that I was not flooded with email complaints about | |
1904 | @code{bzip2-0.1}'s performance on repetitive data, so | |
1905 | perhaps it isn't a problem for real inputs. | |
1906 | ||
1907 | Probably the best long-term solution, | |
1908 | and the one I have incorporated into 0.9.5 and above, | |
1909 | is to use the existing sorting | |
1910 | algorithm initially, and fall back to a O(N (log N)^2) | |
1911 | algorithm if the standard algorithm gets into difficulties. | |
1912 | @item The compressed file format was never designed to be | |
1913 | handled by a library, and I have had to jump though | |
1914 | some hoops to produce an efficient implementation of | |
1915 | decompression. It's a bit hairy. Try passing | |
1916 | @code{decompress.c} through the C preprocessor | |
1917 | and you'll see what I mean. Much of this complexity | |
1918 | could have been avoided if the compressed size of | |
1919 | each block of data was recorded in the data stream. | |
1920 | @item An Adler-32 checksum, rather than a CRC32 checksum, | |
1921 | would be faster to compute. | |
1922 | @end itemize | |
1923 | It would be fair to say that the @code{bzip2} format was frozen | |
1924 | before I properly and fully understood the performance | |
1925 | consequences of doing so. | |
1926 | ||
1927 | Improvements which I was able to incorporate into | |
1928 | 0.9.0, despite using the same file format, are: | |
1929 | @itemize @bullet | |
1930 | @item Single array implementation of the inverse BWT. This | |
1931 | significantly speeds up decompression, presumably | |
1932 | because it reduces the number of cache misses. | |
1933 | @item Faster inverse MTF transform for large MTF values. The | |
1934 | new implementation is based on the notion of sliding blocks | |
1935 | of values. | |
1936 | @item @code{bzip2-0.9.0} now reads and writes files with @code{fread} | |
1937 | and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. | |
1938 | Duh! Well, you live and learn. | |
1939 | ||
1940 | @end itemize | |
1941 | Further ahead, it would be nice | |
1942 | to be able to do random access into files. This will | |
1943 | require some careful design of compressed file formats. | |
1944 | ||
1945 | ||
1946 | ||
1947 | @section Portability issues | |
1948 | After some consideration, I have decided not to use | |
1949 | GNU @code{autoconf} to configure 0.9.5 or 1.0. | |
1950 | ||
1951 | @code{autoconf}, admirable and wonderful though it is, | |
1952 | mainly assists with portability problems between Unix-like | |
1953 | platforms. But @code{bzip2} doesn't have much in the way | |
1954 | of portability problems on Unix; most of the difficulties appear | |
1955 | when porting to the Mac, or to Microsoft's operating systems. | |
1956 | @code{autoconf} doesn't help in those cases, and brings in a | |
1957 | whole load of new complexity. | |
1958 | ||
1959 | Most people should be able to compile the library and program | |
1960 | under Unix straight out-of-the-box, so to speak, especially | |
1961 | if you have a version of GNU C available. | |
1962 | ||
1963 | There are a couple of @code{__inline__} directives in the code. GNU C | |
1964 | (@code{gcc}) should be able to handle them. If you're not using | |
1965 | GNU C, your C compiler shouldn't see them at all. | |
1966 | If your compiler does, for some reason, see them and doesn't | |
1967 | like them, just @code{#define} @code{__inline__} to be @code{/* */}. One | |
1968 | easy way to do this is to compile with the flag @code{-D__inline__=}, | |
1969 | which should be understood by most Unix compilers. | |
1970 | ||
1971 | If you still have difficulties, try compiling with the macro | |
1972 | @code{BZ_STRICT_ANSI} defined. This should enable you to build the | |
1973 | library in a strictly ANSI compliant environment. Building the program | |
1974 | itself like this is dangerous and not supported, since you remove | |
1975 | @code{bzip2}'s checks against compressing directories, symbolic links, | |
1976 | devices, and other not-really-a-file entities. This could cause | |
1977 | filesystem corruption! | |
1978 | ||
1979 | One other thing: if you create a @code{bzip2} binary for public | |
1980 | distribution, please try and link it statically (@code{gcc -s}). This | |
1981 | avoids all sorts of library-version issues that others may encounter | |
1982 | later on. | |
1983 | ||
1984 | If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and | |
1985 | @code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling. | |
1986 | Otherwise the resulting binary won't work correctly. | |
1987 | ||
1988 | ||
1989 | ||
1990 | @section Reporting bugs | |
1991 | I tried pretty hard to make sure @code{bzip2} is | |
1992 | bug free, both by design and by testing. Hopefully | |
1993 | you'll never need to read this section for real. | |
1994 | ||
1995 | Nevertheless, if @code{bzip2} dies with a segmentation | |
1996 | fault, a bus error or an internal assertion failure, it | |
1997 | will ask you to email me a bug report. Experience with | |
1998 | version 0.1 shows that almost all these problems can | |
1999 | be traced to either compiler bugs or hardware problems. | |
2000 | @itemize @bullet | |
2001 | @item | |
2002 | Recompile the program with no optimisation, and see if it | |
2003 | works. And/or try a different compiler. | |
2004 | I heard all sorts of stories about various flavours | |
2005 | of GNU C (and other compilers) generating bad code for | |
2006 | @code{bzip2}, and I've run across two such examples myself. | |
2007 | ||
2008 | 2.7.X versions of GNU C are known to generate bad code from | |
2009 | time to time, at high optimisation levels. | |
2010 | If you get problems, try using the flags | |
2011 | @code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}. | |
2012 | You should specifically @emph{not} use @code{-funroll-loops}. | |
2013 | ||
2014 | You may notice that the Makefile runs six tests as part of | |
2015 | the build process. If the program passes all of these, it's | |
2016 | a pretty good (but not 100%) indication that the compiler has | |
2017 | done its job correctly. | |
2018 | @item | |
2019 | If @code{bzip2} crashes randomly, and the crashes are not | |
2020 | repeatable, you may have a flaky memory subsystem. @code{bzip2} | |
2021 | really hammers your memory hierarchy, and if it's a bit marginal, | |
2022 | you may get these problems. Ditto if your disk or I/O subsystem | |
2023 | is slowly failing. Yup, this really does happen. | |
2024 | ||
2025 | Try using a different machine of the same type, and see if | |
2026 | you can repeat the problem. | |
2027 | @item This isn't really a bug, but ... If @code{bzip2} tells | |
2028 | you your file is corrupted on decompression, and you | |
2029 | obtained the file via FTP, there is a possibility that you | |
2030 | forgot to tell FTP to do a binary mode transfer. That absolutely | |
2031 | will cause the file to be non-decompressible. You'll have to transfer | |
2032 | it again. | |
2033 | @end itemize | |
2034 | ||
2035 | If you've incorporated @code{libbzip2} into your own program | |
2036 | and are getting problems, please, please, please, check that the | |
2037 | parameters you are passing in calls to the library, are | |
2038 | correct, and in accordance with what the documentation says | |
2039 | is allowable. I have tried to make the library robust against | |
2040 | such problems, but I'm sure I haven't succeeded. | |
2041 | ||
2042 | Finally, if the above comments don't help, you'll have to send | |
2043 | me a bug report. Now, it's just amazing how many people will | |
2044 | send me a bug report saying something like | |
2045 | @display | |
2046 | bzip2 crashed with segmentation fault on my machine | |
2047 | @end display | |
2048 | and absolutely nothing else. Needless to say, a such a report | |
2049 | is @emph{totally, utterly, completely and comprehensively 100% useless; | |
2050 | a waste of your time, my time, and net bandwidth}. | |
2051 | With no details at all, there's no way I can possibly begin | |
2052 | to figure out what the problem is. | |
2053 | ||
2054 | The rules of the game are: facts, facts, facts. Don't omit | |
2055 | them because "oh, they won't be relevant". At the bare | |
2056 | minimum: | |
2057 | @display | |
2058 | Machine type. Operating system version. | |
2059 | Exact version of @code{bzip2} (do @code{bzip2 -V}). | |
2060 | Exact version of the compiler used. | |
2061 | Flags passed to the compiler. | |
2062 | @end display | |
2063 | However, the most important single thing that will help me is | |
2064 | the file that you were trying to compress or decompress at the | |
2065 | time the problem happened. Without that, my ability to do anything | |
2066 | more than speculate about the cause, is limited. | |
2067 | ||
2068 | Please remember that I connect to the Internet with a modem, so | |
2069 | you should contact me before mailing me huge files. | |
2070 | ||
2071 | ||
2072 | @section Did you get the right package? | |
2073 | ||
2074 | @code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles | |
2075 | and memory. Also, it gives very large latencies. In the worst case, you | |
2076 | can feed many megabytes of uncompressed data into the library before | |
2077 | getting any compressed output, so this probably rules out applications | |
2078 | requiring interactive behaviour. | |
2079 | ||
2080 | These aren't faults of my implementation, I hope, but more | |
2081 | an intrinsic property of the Burrows-Wheeler transform (unfortunately). | |
2082 | Maybe this isn't what you want. | |
2083 | ||
2084 | If you want a compressor and/or library which is faster, uses less | |
2085 | memory but gets pretty good compression, and has minimal latency, | |
2086 | consider Jean-loup | |
ba95a000 | 2087 | Gailly's and Mark Adler's work, @code{zlib-1.1.3} and |
ef6bacff RC |
2088 | @code{gzip-1.2.4}. Look for them at |
2089 | ||
ba95a000 | 2090 | @code{http://www.zlib.org} and |
ef6bacff RC |
2091 | @code{http://www.gzip.org} respectively. |
2092 | ||
2093 | For something faster and lighter still, you might try Markus F X J | |
2094 | Oberhumer's @code{LZO} real-time compression/decompression library, at | |
2095 | @* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}. | |
2096 | ||
2097 | If you want to use the @code{bzip2} algorithms to compress small blocks | |
2098 | of data, 64k bytes or smaller, for example on an on-the-fly disk | |
2099 | compressor, you'd be well advised not to use this library. Instead, | |
2100 | I've made a special library tuned for that kind of use. It's part of | |
2101 | @code{e2compr-0.40}, an on-the-fly disk compressor for the Linux | |
2102 | @code{ext2} filesystem. Look at | |
2103 | @code{http://www.netspace.net.au/~reiter/e2compr}. | |
2104 | ||
2105 | ||
2106 | ||
2107 | @section Testing | |
2108 | ||
2109 | A record of the tests I've done. | |
2110 | ||
2111 | First, some data sets: | |
2112 | @itemize @bullet | |
2113 | @item B: a directory containing 6001 files, one for every length in the | |
2114 | range 0 to 6000 bytes. The files contain random lowercase | |
2115 | letters. 18.7 megabytes. | |
2116 | @item H: my home directory tree. Documents, source code, mail files, | |
2117 | compressed data. H contains B, and also a directory of | |
2118 | files designed as boundary cases for the sorting; mostly very | |
2119 | repetitive, nasty files. 565 megabytes. | |
2120 | @item A: directory tree holding various applications built from source: | |
2121 | @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc. | |
2122 | 2200 megabytes. | |
2123 | @end itemize | |
2124 | The tests conducted are as follows. Each test means compressing | |
2125 | (a copy of) each file in the data set, decompressing it and | |
2126 | comparing it against the original. | |
2127 | ||
2128 | First, a bunch of tests with block sizes and internal buffer | |
2129 | sizes set very small, | |
2130 | to detect any problems with the | |
2131 | blocking and buffering mechanisms. | |
2132 | This required modifying the source code so as to try to | |
2133 | break it. | |
2134 | @enumerate | |
2135 | @item Data set H, with | |
2136 | buffer size of 1 byte, and block size of 23 bytes. | |
2137 | @item Data set B, buffer sizes 1 byte, block size 1 byte. | |
2138 | @item As (2) but small-mode decompression. | |
2139 | @item As (2) with block size 2 bytes. | |
2140 | @item As (2) with block size 3 bytes. | |
2141 | @item As (2) with block size 4 bytes. | |
2142 | @item As (2) with block size 5 bytes. | |
2143 | @item As (2) with block size 6 bytes and small-mode decompression. | |
2144 | @item H with buffer size of 1 byte, but normal block | |
2145 | size (up to 900000 bytes). | |
2146 | @end enumerate | |
2147 | Then some tests with unmodified source code. | |
2148 | @enumerate | |
2149 | @item H, all settings normal. | |
2150 | @item As (1), with small-mode decompress. | |
2151 | @item H, compress with flag @code{-1}. | |
2152 | @item H, compress with flag @code{-s}, decompress with flag @code{-s}. | |
2153 | @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, | |
2154 | @code{bzip2-0.9.5} decompressing, all settings normal. | |
2155 | @item Backwards compatibility: H, @code{bzip2-0.9.5} compressing, | |
2156 | @code{bzip2-0.1pl2} decompressing, all settings normal. | |
2157 | @item Bigger tests: A, all settings normal. | |
2158 | @item As (7), using the fallback (Sadakane-like) sorting algorithm. | |
2159 | @item As (8), compress with flag @code{-1}, decompress with flag | |
2160 | @code{-s}. | |
2161 | @item H, using the fallback sorting algorithm. | |
2162 | @item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing, | |
2163 | @code{bzip2-0.9.5} decompressing, all settings normal. | |
2164 | @item Backwards compatibility: A, @code{bzip2-0.9.5} compressing, | |
2165 | @code{bzip2-0.1pl2} decompressing, all settings normal. | |
2166 | @item Misc test: about 400 megabytes of @code{.tar} files with | |
2167 | @code{bzip2} compiled with Checker (a memory access error | |
2168 | detector, like Purify). | |
2169 | @item Misc tests to make sure it builds and runs ok on non-Linux/x86 | |
2170 | platforms. | |
2171 | @end enumerate | |
2172 | These tests were conducted on a 225 MHz IDT WinChip machine, running | |
2173 | Linux 2.0.36. They represent nearly a week of continuous computation. | |
2174 | All tests completed successfully. | |
2175 | ||
2176 | ||
2177 | @section Further reading | |
2178 | @code{bzip2} is not research work, in the sense that it doesn't present | |
2179 | any new ideas. Rather, it's an engineering exercise based on existing | |
2180 | ideas. | |
2181 | ||
2182 | Four documents describe essentially all the ideas behind @code{bzip2}: | |
2183 | @example | |
2184 | Michael Burrows and D. J. Wheeler: | |
2185 | "A block-sorting lossless data compression algorithm" | |
2186 | 10th May 1994. | |
2187 | Digital SRC Research Report 124. | |
2188 | ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz | |
2189 | If you have trouble finding it, try searching at the | |
2190 | New Zealand Digital Library, http://www.nzdl.org. | |
2191 | ||
2192 | Daniel S. Hirschberg and Debra A. LeLewer | |
2193 | "Efficient Decoding of Prefix Codes" | |
2194 | Communications of the ACM, April 1990, Vol 33, Number 4. | |
2195 | You might be able to get an electronic copy of this | |
2196 | from the ACM Digital Library. | |
2197 | ||
2198 | David J. Wheeler | |
2199 | Program bred3.c and accompanying document bred3.ps. | |
2200 | This contains the idea behind the multi-table Huffman | |
2201 | coding scheme. | |
2202 | ftp://ftp.cl.cam.ac.uk/users/djw3/ | |
2203 | ||
2204 | Jon L. Bentley and Robert Sedgewick | |
2205 | "Fast Algorithms for Sorting and Searching Strings" | |
2206 | Available from Sedgewick's web page, | |
2207 | www.cs.princeton.edu/~rs | |
2208 | @end example | |
2209 | The following paper gives valuable additional insights into the | |
2210 | algorithm, but is not immediately the basis of any code | |
2211 | used in bzip2. | |
2212 | @example | |
2213 | Peter Fenwick: | |
2214 | Block Sorting Text Compression | |
2215 | Proceedings of the 19th Australasian Computer Science Conference, | |
2216 | Melbourne, Australia. Jan 31 - Feb 2, 1996. | |
2217 | ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps | |
2218 | @end example | |
2219 | Kunihiko Sadakane's sorting algorithm, mentioned above, | |
2220 | is available from: | |
2221 | @example | |
2222 | http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz | |
2223 | @end example | |
2224 | The Manber-Myers suffix array construction | |
2225 | algorithm is described in a paper | |
2226 | available from: | |
2227 | @example | |
2228 | http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps | |
2229 | @end example | |
2230 | Finally, the following paper documents some recent investigations | |
2231 | I made into the performance of sorting algorithms: | |
2232 | @example | |
2233 | Julian Seward: | |
2234 | On the Performance of BWT Sorting Algorithms | |
2235 | Proceedings of the IEEE Data Compression Conference 2000 | |
2236 | Snowbird, Utah. 28-30 March 2000. | |
2237 | @end example | |
2238 | ||
2239 | ||
2240 | @contents | |
2241 | ||
2242 | @bye | |
2243 |