1\input texinfo @c -*- Texinfo -*-
2@setfilename bzip2.info
3
4@ignore
5This file documents bzip2 version 1.0.2, and associated library
6libbzip2, written by Julian Seward (jseward@acm.org).
7
8Copyright (C) 1996-2002 Julian R Seward
9
10Permission is granted to make and distribute verbatim copies of
11this manual provided the copyright notice and this permission notice
12are preserved on all copies.
13
14Permission is granted to copy and distribute translations of this manual
15into another language, under the above conditions for verbatim copies.
16@end ignore
17
18@ifinfo
19@format
20START-INFO-DIR-ENTRY
21* Bzip2: (bzip2). A program and library for data compression.
22END-INFO-DIR-ENTRY
23@end format
24
25@end ifinfo
26
27@iftex
28@c @finalout
29@settitle bzip2 and libbzip2
30@titlepage
31@title bzip2 and libbzip2
32@subtitle a program and library for data compression
33@subtitle copyright (C) 1996-2002 Julian Seward
34@subtitle version 1.0.2 of 30 December 2001
35@author Julian Seward
36
37@end titlepage
38
39@parindent 0mm
40@parskip 2mm
41
42@end iftex
43@node Top,,, (dir)
44
45The following text is the License for this software. You should
46find it identical to that contained in the file LICENSE in the
47source distribution.
48
49@bf{------------------ START OF THE LICENSE ------------------}
50
51This program, @code{bzip2},
52and associated library @code{libbzip2}, are
53Copyright (C) 1996-2002 Julian R Seward. All rights reserved.
54
55Redistribution and use in source and binary forms, with or without
56modification, are permitted provided that the following conditions
57are met:
58@itemize @bullet
59@item
60 Redistributions of source code must retain the above copyright
61 notice, this list of conditions and the following disclaimer.
62@item
63 The origin of this software must not be misrepresented; you must
64 not claim that you wrote the original software. If you use this
65 software in a product, an acknowledgment in the product
66 documentation would be appreciated but is not required.
67@item
68 Altered source versions must be plainly marked as such, and must
69 not be misrepresented as being the original software.
70@item
71 The name of the author may not be used to endorse or promote
72 products derived from this software without specific prior written
73 permission.
74@end itemize
75THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
76OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
77WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
78ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
79DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
80DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
81GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
82INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
83WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
84NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
85SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
86
87Julian Seward, Cambridge, UK.
88
89@code{jseward@@acm.org}
90
91@code{bzip2}/@code{libbzip2} version 1.0.2 of 30 December 2001.
92
93@bf{------------------ END OF THE LICENSE ------------------}
94
95Web sites:
96
97@code{http://sources.redhat.com/bzip2}
98
99@code{http://www.cacheprof.org}
100
101PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
102algorithms. However, I do not have the resources available to carry out
103a full patent search. Therefore I cannot give any guarantee of the
104above statement.
105
106
107
108
109
110
111
112@chapter Introduction
113
114@code{bzip2} compresses files using the Burrows-Wheeler
115block-sorting text compression algorithm, and Huffman coding.
116Compression is generally considerably better than that
117achieved by more conventional LZ77/LZ78-based compressors,
118and approaches the performance of the PPM family of statistical compressors.
119
120@code{bzip2} is built on top of @code{libbzip2}, a flexible library
121for handling compressed data in the @code{bzip2} format. This manual
122describes both how to use the program and
123how to work with the library interface. Most of the
124manual is devoted to this library, not the program,
125which is good news if your interest is only in the program.
126
127Chapter 2 describes how to use @code{bzip2}; this is the only part
128you need to read if you just want to know how to operate the program.
129Chapter 3 describes the programming interfaces in detail, and
130Chapter 4 records some miscellaneous notes which I thought
131ought to be recorded somewhere.
132
133
134@chapter How to use @code{bzip2}
135
136This chapter contains a copy of the @code{bzip2} man page,
137and nothing else.
138
139@quotation
140
141@unnumberedsubsubsec NAME
142@itemize
143@item @code{bzip2}, @code{bunzip2}
144- a block-sorting file compressor, v1.0.2
145@item @code{bzcat}
146- decompresses files to stdout
147@item @code{bzip2recover}
148- recovers data from damaged bzip2 files
149@end itemize
150
151@unnumberedsubsubsec SYNOPSIS
152@itemize
153@item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ... ]
154@item @code{bunzip2} [ -fkvsVL ] [ filenames ... ]
155@item @code{bzcat} [ -s ] [ filenames ... ]
156@item @code{bzip2recover} filename
157@end itemize
158
159@unnumberedsubsubsec DESCRIPTION
160
161@code{bzip2} compresses files using the Burrows-Wheeler block sorting
162text compression algorithm, and Huffman coding. Compression is
163generally considerably better than that achieved by more conventional
164LZ77/LZ78-based compressors, and approaches the performance of the PPM
165family of statistical compressors.
166
167The command-line options are deliberately very similar to those of GNU
168@code{gzip}, but they are not identical.
169
170@code{bzip2} expects a list of file names to accompany the command-line
171flags. Each file is replaced by a compressed version of itself, with
172the name @code{original_name.bz2}. Each compressed file has the same
173modification date, permissions, and, when possible, ownership as the
174corresponding original, so that these properties can be correctly
175restored at decompression time. File name handling is naive in the
176sense that there is no mechanism for preserving original file names,
177permissions, ownerships or dates in filesystems which lack these
178concepts, or have serious file name length restrictions, such as MS-DOS.
179
180@code{bzip2} and @code{bunzip2} will by default not overwrite existing
181files. If you want this to happen, specify the @code{-f} flag.
182
183If no file names are specified, @code{bzip2} compresses from standard
184input to standard output. In this case, @code{bzip2} will decline to
185write compressed output to a terminal, as this would be entirely
186incomprehensible and therefore pointless.
187
188@code{bunzip2} (or @code{bzip2 -d}) decompresses all
189specified files. Files which were not created by @code{bzip2}
190will be detected and ignored, and a warning issued.
191@code{bzip2} attempts to guess the filename for the decompressed file
192from that of the compressed file as follows:
193@itemize
194@item @code{filename.bz2 } becomes @code{filename}
195@item @code{filename.bz } becomes @code{filename}
196@item @code{filename.tbz2} becomes @code{filename.tar}
197@item @code{filename.tbz } becomes @code{filename.tar}
198@item @code{anyothername } becomes @code{anyothername.out}
199@end itemize
200If the file does not end in one of the recognised endings,
201@code{.bz2}, @code{.bz},
202@code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot
203guess the name of the original file, and uses the original name
204with @code{.out} appended.
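
The renaming rules above can be sketched in C as follows. This is an illustrative reimplementation only, not the code @code{bzip2} itself uses; the names @code{ends_with} and @code{guess_output_name} are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `name` ends with `suffix`, otherwise 0. */
static int ends_with(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Hypothetical helper: writes the guessed decompressed name into `out`.
   Returns 1 if a recognised ending was replaced, 0 if ".out" was
   appended instead, and -1 if `out` is too small. */
int guess_output_name(const char *in, char *out, size_t out_size)
{
    /* Recognised endings and what replaces them. */
    static const char *const map[][2] = {
        { ".bz2",  ""     },
        { ".bz",   ""     },
        { ".tbz2", ".tar" },
        { ".tbz",  ".tar" },
    };
    size_t i;
    int n;

    for (i = 0; i < sizeof map / sizeof map[0]; i++) {
        if (ends_with(in, map[i][0])) {
            size_t stem = strlen(in) - strlen(map[i][0]);
            n = snprintf(out, out_size, "%.*s%s", (int)stem, in, map[i][1]);
            return (n >= 0 && (size_t)n < out_size) ? 1 : -1;
        }
    }
    /* No recognised ending: keep the name and append ".out". */
    n = snprintf(out, out_size, "%s.out", in);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

A return value of 0 corresponds to the case in which @code{bzip2} complains that it cannot guess the original name.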
205
206As with compression, supplying no
207filenames causes decompression from standard input to standard output.
208
209@code{bunzip2} will correctly decompress a file which is the
210concatenation of two or more compressed files. The result is the
211concatenation of the corresponding uncompressed files. Integrity
212testing (@code{-t}) of concatenated compressed files is also supported.
213
214You can also compress or decompress files to the standard output by
215giving the @code{-c} flag. Multiple files may be compressed and
216decompressed like this. The resulting outputs are fed sequentially to
217stdout. Compression of multiple files in this manner generates a stream
218containing multiple compressed file representations. Such a stream
219can be decompressed correctly only by @code{bzip2} version 0.9.0 or
220later. Earlier versions of @code{bzip2} will stop after decompressing
221the first file in the stream.
222
223@code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to
224the standard output.
225
226@code{bzip2} will read arguments from the environment variables
227@code{BZIP2} and @code{BZIP}, in that order, and will process them
228before any arguments read from the command line. This gives a
229convenient way to supply default arguments.
230
231Compression is always performed, even if the compressed file is slightly
232larger than the original. Files of less than about one hundred bytes
233tend to get larger, since the compression mechanism has a constant
234overhead in the region of 50 bytes. Random data (including the output
235of most file compressors) is coded at about 8.05 bits per byte, giving
236an expansion of around 0.5%.
237
238As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to
239make sure that the decompressed version of a file is identical to the
240original. This guards against corruption of the compressed data, and
241against undetected bugs in @code{bzip2} (hopefully very unlikely). The
242chance of data corruption going undetected is microscopic, about one
243chance in four billion for each file processed. Be aware, though, that
244the check occurs upon decompression, so it can only tell you that
245something is wrong. It can't help you recover the original uncompressed
246data. You can use @code{bzip2recover} to try to recover data from
247damaged files.
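
For readers curious about the check itself, here is a minimal bitwise sketch of a 32-bit CRC. This manual does not state the CRC parameters; the ones used below (polynomial @code{0x04C11DB7}, most-significant bit first, initial value and final XOR of all ones) are those commonly catalogued as "CRC-32/BZIP2", and are an assumption here. The function name @code{crc32_bzip2} is hypothetical, and a real implementation would normally use a lookup table for speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32, MSB first, non-reflected.
   Parameters assumed from the commonly catalogued "CRC-32/BZIP2". */
uint32_t crc32_bzip2(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;               /* initial value */
    size_t i;
    int k;

    for (i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 24;       /* bring next byte into the top */
        for (k = 0; k < 8; k++) {
            if (crc & 0x80000000u)
                crc = (crc << 1) ^ 0x04C11DB7u;
            else
                crc <<= 1;
        }
    }
    return ~crc;                              /* final XOR with all ones */
}
```

Because the result has 32 bits, a random corruption slips through with probability about 1 in 2^32, which is where the "one chance in four billion" figure comes from.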
248
249Return values: 0 for a normal exit, 1 for environmental problems (file
250not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
251compressed file, 3 for an internal consistency error (e.g. a bug) which
252caused @code{bzip2} to panic.
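
A program that runs @code{bzip2} as a child process might translate these return values along the following lines. This is only a sketch; the function name @code{bzip2_exit_meaning} is hypothetical and the strings merely paraphrase the list above.

```c
/* Hypothetical helper: maps a bzip2 exit status to a short description. */
const char *bzip2_exit_meaning(int status)
{
    switch (status) {
    case 0:  return "normal exit";
    case 1:  return "environmental problem";
    case 2:  return "corrupt compressed file";
    case 3:  return "internal consistency error";
    default: return "unknown status";
    }
}
```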
253
254
255@unnumberedsubsubsec OPTIONS
256@table @code
257@item -c --stdout
258Compress or decompress to standard output.
259@item -d --decompress
260Force decompression. @code{bzip2}, @code{bunzip2} and @code{bzcat} are
261really the same program, and the decision about what actions to take is
262done on the basis of which name is used. This flag overrides that
263mechanism, and forces bzip2 to decompress.
264@item -z --compress
265The complement to @code{-d}: forces compression, regardless of the
266invocation name.
267@item -t --test
268Check integrity of the specified file(s), but don't decompress them.
269This really performs a trial decompression and throws away the result.
270@item -f --force
271Force overwrite of output files. Normally, @code{bzip2} will not overwrite
272existing output files. Also forces @code{bzip2} to break hard links
273to files, which it otherwise wouldn't do.
274
275@code{bzip2} normally declines to decompress files which don't have the
276correct magic header bytes. If forced (@code{-f}), however, it will
277pass such files through unmodified. This is how GNU @code{gzip}
278behaves.
279@item -k --keep
280Keep (don't delete) input files during compression
281or decompression.
282@item -s --small
283Reduce memory usage, for compression, decompression and testing. Files
284are decompressed and tested using a modified algorithm which only
285requires 2.5 bytes per block byte. This means any file can be
286decompressed in 2300k of memory, albeit at about half the normal speed.
287
288During compression, @code{-s} selects a block size of 200k, which limits
289memory use to around the same figure, at the expense of your compression
290ratio. In short, if your machine is low on memory (8 megabytes or
291less), use -s for everything. See MEMORY MANAGEMENT below.
292@item -q --quiet
293Suppress non-essential warning messages. Messages pertaining to
294I/O errors and other critical events will not be suppressed.
295@item -v --verbose
296Verbose mode -- show the compression ratio for each file processed.
297Further @code{-v}'s increase the verbosity level, spewing out lots of
298information which is primarily of interest for diagnostic purposes.
299@item -L --license -V --version
300Display the software version, license terms and conditions.
301@item -1 (or --fast) to -9 (or --best)
302Set the block size to 100 k, 200 k, ..., 900 k when compressing. Has no
303effect when decompressing. See MEMORY MANAGEMENT below.
304The @code{--fast} and @code{--best} aliases are primarily for GNU
305@code{gzip} compatibility. In particular, @code{--fast} doesn't make
306things significantly faster. And @code{--best} merely selects the
307default behaviour.
308@item --
309Treats all subsequent arguments as file names, even if they start
310with a dash. This is so you can handle files with names beginning
311with a dash, for example: @code{bzip2 -- -myfilename}.
312@item --repetitive-fast
313@item --repetitive-best
314These flags are redundant in versions 0.9.5 and above. They provided
315some coarse control over the behaviour of the sorting algorithm in
316earlier versions, which was sometimes useful. 0.9.5 and above have an
317improved algorithm which renders these flags irrelevant.
318@end table
319
320
321@unnumberedsubsubsec MEMORY MANAGEMENT
322
323@code{bzip2} compresses large files in blocks. The block size affects
324both the compression ratio achieved, and the amount of memory needed for
325compression and decompression. The flags @code{-1} through @code{-9}
326specify the block size to be 100,000 bytes through 900,000 bytes (the
327default) respectively. At decompression time, the block size used for
328compression is read from the header of the compressed file, and
329@code{bunzip2} then allocates itself just enough memory to decompress
330the file. Since block sizes are stored in compressed files, it follows
331that the flags @code{-1} to @code{-9} are irrelevant to and so ignored
332during decompression.
333
334Compression and decompression requirements, in bytes, can be estimated
335as:
336@example
337      Compression:   400k + ( 8 x block size )
338
339      Decompression: 100k + ( 4 x block size ), or
340                     100k + ( 2.5 x block size )
341@end example
342Larger block sizes give rapidly diminishing marginal returns. Most of
343the compression comes from the first two or three hundred k of block
344size, a fact worth bearing in mind when using @code{bzip2} on small machines.
345It is also important to appreciate that the decompression memory
346requirement is set at compression time by the choice of block size.
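
The estimates above are easy to tabulate. The following sketch assumes that "k" means 1000 bytes and that flag @code{-N} selects a block of N x 100k; all function names are hypothetical. Figures tabulated elsewhere in this chapter may not match the formulas exactly for every flag.

```c
/* Block size in kbytes for flags -1 .. -9 (100k per level). */
static int block_kbytes(int level)
{
    return 100 * level;
}

/* Estimated memory needed to compress, in kbytes. */
int compress_kbytes(int level)
{
    return 400 + 8 * block_kbytes(level);
}

/* Estimated memory needed to decompress normally, in kbytes. */
int decompress_kbytes(int level)
{
    return 100 + 4 * block_kbytes(level);
}

/* Estimated memory needed to decompress with -s, in kbytes.
   2.5 x block size is computed as 5/2; this is exact because
   block_kbytes() is always a multiple of 100. */
int decompress_small_kbytes(int level)
{
    return 100 + 5 * block_kbytes(level) / 2;
}
```

For example, @code{compress_kbytes(9)} evaluates 400 + 8 x 900, and @code{decompress_small_kbytes(9)} evaluates 100 + 2.5 x 900.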
347
348For files compressed with the default 900k block size, @code{bunzip2}
349will require about 3700 kbytes to decompress. To support decompression
350of any file on a 4 megabyte machine, @code{bunzip2} has an option to
351decompress using approximately half this amount of memory, about 2300
352kbytes. Decompression speed is also halved, so you should use this
353option only where necessary. The relevant flag is @code{-s}.
354
355In general, try to use the largest block size memory constraints allow,
356since that maximises the compression achieved. Compression and
357decompression speed are virtually unaffected by block size.
358
359Another significant point applies to files which fit in a single block
360-- that means most files you'd encounter using a large block size. The
361amount of real memory touched is proportional to the size of the file,
362since the file is smaller than a block. For example, compressing a file
36320,000 bytes long with the flag @code{-9} will cause the compressor to
364allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
365kbytes of it. Similarly, the decompressor will allocate 3700k but only
366touch 100k + 20000 * 4 = 180 kbytes.
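
The arithmetic in this example generalises as sketched below: the amount touched depends on the smaller of the file size and the block size. The function names are hypothetical, and "k" is taken to mean 1000 bytes, as the worked figures above imply.

```c
/* Bytes in one block for flags -1 .. -9. */
static long block_bytes(int level)
{
    return 100000L * level;
}

/* Smaller of the file size and the block size, in bytes. */
static long bytes_in_play(long file_bytes, int level)
{
    long b = block_bytes(level);
    return file_bytes < b ? file_bytes : b;
}

/* Approximate memory touched while compressing, in kbytes. */
long compress_touched_kbytes(long file_bytes, int level)
{
    return 400 + 8 * bytes_in_play(file_bytes, level) / 1000;
}

/* Approximate memory touched while decompressing, in kbytes. */
long decompress_touched_kbytes(long file_bytes, int level)
{
    return 100 + 4 * bytes_in_play(file_bytes, level) / 1000;
}
```

Once the file is at least one block long, the results equal the full estimates for that block size.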
367
368Here is a table which summarises the maximum memory usage for different
369block sizes. Also recorded is the total compressed size for 14 files of
370the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
371column gives some feel for how compression varies with block size.
372These figures tend to understate the advantage of larger block sizes for
373larger files, since the Corpus is dominated by smaller files.
374@example
375           Compress   Decompress   Decompress   Corpus
376    Flag     usage      usage       -s usage     Size
377
378     -1      1200k       500k         350k      914704
379     -2      2000k       900k         600k      877703
380     -3      2800k      1300k         850k      860338
381     -4      3600k      1700k        1100k      846899
382     -5      4400k      2100k        1350k      845160
383     -6      5200k      2500k        1600k      838626
384     -7      6100k      2900k        1850k      834096
385     -8      6800k      3300k        2100k      828642
386     -9      7600k      3700k        2350k      828642
387@end example
388
389@unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES
390
391@code{bzip2} compresses files in blocks, usually 900kbytes long. Each
392block is handled independently. If a media or transmission error causes
393a multi-block @code{.bz2} file to become damaged, it may be possible to
394recover data from the undamaged blocks in the file.
395
396The compressed representation of each block is delimited by a 48-bit
397pattern, which makes it possible to find the block boundaries with
398reasonable certainty. Each block also carries its own 32-bit CRC, so
399damaged blocks can be distinguished from undamaged ones.
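
Because blocks are not aligned to byte boundaries, a search for the delimiter has to examine every bit position. The sketch below shows the idea. It is not the code of @code{bzip2recover}; the function name @code{find_block_magic} is hypothetical, and the pattern value @code{0x314159265359} and the most-significant-bit-first ordering are not given in this manual but are assumed from the @code{bzip2} file format.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC 0x314159265359ULL   /* assumed 48-bit block delimiter */
#define MAGIC_MASK  0xFFFFFFFFFFFFULL   /* keeps the low 48 bits */

/* Scans `buf` bit by bit (most significant bit of each byte first) and
   records the bit offset at which each occurrence of the pattern starts.
   At most `max_hits` offsets are stored; the return value is the total
   number of occurrences found. */
size_t find_block_magic(const unsigned char *buf, size_t len,
                        uint64_t *bit_offsets, size_t max_hits)
{
    uint64_t window = 0;     /* the last 48 bits seen */
    uint64_t bits_seen = 0;
    size_t hits = 0;
    size_t i;
    int b;

    for (i = 0; i < len; i++) {
        for (b = 7; b >= 0; b--) {
            window = ((window << 1) | ((buf[i] >> b) & 1u)) & MAGIC_MASK;
            bits_seen++;
            /* Require 48 real bits, so leading zeros cannot fake a match. */
            if (bits_seen >= 48 && window == BLOCK_MAGIC) {
                if (hits < max_hits)
                    bit_offsets[hits] = bits_seen - 48;
                hits++;
            }
        }
    }
    return hits;
}
```

A 48-bit pattern can also occur by chance inside compressed data, which is why the boundaries are found only "with reasonable certainty" and why the per-block CRC is still needed.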
400
401@code{bzip2recover} is a simple program whose purpose is to search for
402blocks in @code{.bz2} files, and write each block out into its own
403@code{.bz2} file. You can then use @code{bzip2 -t} to test the
404integrity of the resulting files, and decompress those which are
405undamaged.
406
407@code{bzip2recover}
408takes a single argument, the name of the damaged file, and writes a
409number of files @code{rec00001file.bz2}, @code{rec00002file.bz2}, etc,
410containing the extracted blocks. The output filenames are designed so
411that the use of wildcards in subsequent processing -- for example,
412@code{bzip2 -dc rec*file.bz2 > recovered_data} -- processes the files in
413the correct order.
414
415@code{bzip2recover} should be of most use dealing with large @code{.bz2}
416files, as these will contain many blocks. It is clearly futile to use
417it on damaged single-block files, since a damaged block cannot be
418recovered. If you wish to minimise any potential data loss through
419media or transmission errors, you might consider compressing with a
420smaller block size.
421
422
423@unnumberedsubsubsec PERFORMANCE NOTES
424
425The sorting phase of compression gathers together similar strings in the
426file. Because of this, files containing very long runs of repeated
427symbols, like "aabaabaabaab ..." (repeated several hundred times) may
428compress more slowly than normal. Versions 0.9.5 and above fare much
429better than previous versions in this respect. The ratio between
430worst-case and average-case compression time is in the region of 10:1.
431For previous versions, this figure was more like 100:1. You can use the
432@code{-vvvv} option to monitor progress in great detail, if you want.
433
434Decompression speed is unaffected by these phenomena.
435
436@code{bzip2} usually allocates several megabytes of memory to operate
437in, and then charges all over it in a fairly random fashion. This means
438that performance, both for compressing and decompressing, is largely
439determined by the speed at which your machine can service cache misses.
440Because of this, small changes to the code to reduce the miss rate have
441been observed to give disproportionately large performance improvements.
442I imagine @code{bzip2} will perform best on machines with very large
443caches.
444
445
446@unnumberedsubsubsec CAVEATS
447
448I/O error messages are not as helpful as they could be. @code{bzip2}
449tries hard to detect I/O errors and exit cleanly, but the details of
450what the problem is sometimes seem rather misleading.
451
452This manual page pertains to version 1.0.2 of @code{bzip2}. Compressed
453data created by this version is entirely forwards and backwards
454compatible with the previous public releases, versions 0.1pl2, 0.9.0,
4550.9.5, 1.0.0 and 1.0.1, but with the following exception: 0.9.0 and
456above can correctly decompress multiple concatenated compressed files.
4570.1pl2 cannot do this; it will stop after decompressing just the first
458file in the stream.
459
460@code{bzip2recover} versions prior to this one, 1.0.2, used 32-bit
461integers to represent bit positions in compressed files, so it could not
462handle compressed files more than 512 megabytes long. Versions 1.0.2 and
463above use 64-bit integers on some platforms which support them (GNU
464supported targets, and Windows). To establish whether or not
465@code{bzip2recover} was built with such a limitation, run it without
466arguments. In any event you can build yourself an unlimited version if
467you can recompile it with @code{MaybeUInt64} set to be an unsigned
46864-bit integer.
469
470
471
472@unnumberedsubsubsec AUTHOR
473Julian Seward, @code{jseward@@acm.org}.
474
475@code{http://sources.redhat.com/bzip2}
476
477The ideas embodied in @code{bzip2} are due to (at least) the following
478people: Michael Burrows and David Wheeler (for the block sorting
479transformation), David Wheeler (again, for the Huffman coder), Peter
480Fenwick (for the structured coding model in the original @code{bzip},
481and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
482(for the arithmetic coder in the original @code{bzip}). I am much
483indebted for their help, support and advice. See the manual in the
484source distribution for pointers to sources of documentation. Christian
485von Roques encouraged me to look for faster sorting algorithms, so as to
486speed up compression. Bela Lubkin encouraged me to improve the
487worst-case compression performance. The @code{bz*} scripts are derived
488from those of GNU @code{gzip}. Many people sent patches, helped with
489portability problems, lent machines, gave advice and were generally
490helpful.
491
492@end quotation
493
494
495
496
497@chapter Programming with @code{libbzip2}
498
499This chapter describes the programming interface to @code{libbzip2}.
500
501For general background information, particularly about memory
502use and performance aspects, you'd be well advised to read Chapter 2
503as well.
504
505@section Top-level structure
506
507@code{libbzip2} is a flexible library for compressing and decompressing
508data in the @code{bzip2} data format. Although packaged as a single
509entity, it helps to regard the library as three separate parts: the
510low-level interface, the high-level interface, and some utility
511functions.
512
513The structure of @code{libbzip2}'s interfaces is similar to
514that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib}
515library.
516
517All externally visible symbols have names beginning @code{BZ2_}.
518This is new in version 1.0. The intention is to minimise pollution
519of the namespaces of library clients.
520
521@subsection Low-level summary
522
523This interface provides services for compressing and decompressing
524data in memory. There's no provision for dealing with files, streams
525or any other I/O mechanisms, just straight memory-to-memory work.
526In fact, this part of the library can be compiled without inclusion
527of @code{stdio.h}, which may be helpful for embedded applications.
528
529The low-level part of the library has no global variables and
530is therefore thread-safe.
531
532Six routines make up the low-level interface:
533@code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, and @* @code{BZ2_bzCompressEnd}
534for compression,
535and a corresponding trio @code{BZ2_bzDecompressInit}, @* @code{BZ2_bzDecompress}
536and @code{BZ2_bzDecompressEnd} for decompression.
537The @code{*Init} functions allocate
538memory for compression/decompression and do other
539initialisations, whilst the @code{*End} functions close down operations
540and release memory.
541
542The real work is done by @code{BZ2_bzCompress} and @code{BZ2_bzDecompress}.
543These compress and decompress data from a user-supplied input buffer
544to a user-supplied output buffer. These buffers can be any size;
545arbitrary quantities of data are handled by making repeated calls
546to these functions. This is a flexible mechanism allowing a
547consumer-pull style of activity, or producer-push, or a mixture of
548both.
549
550
551
552@subsection High-level summary
553
554This interface provides some handy wrappers around the low-level
555interface to facilitate reading and writing @code{bzip2} format
556files (@code{.bz2} files). The routines provide hooks to facilitate
557reading files in which the @code{bzip2} data stream is embedded
558within some larger-scale file structure, or where there are
559multiple @code{bzip2} data streams concatenated end-to-end.
560
561For reading files, @code{BZ2_bzReadOpen}, @code{BZ2_bzRead},
562@code{BZ2_bzReadClose} and @* @code{BZ2_bzReadGetUnused} are supplied. For
563writing files, @code{BZ2_bzWriteOpen}, @code{BZ2_bzWrite} and
564@code{BZ2_bzWriteClose} are available.
565
566As with the low-level library, no global variables are used
567so the library is per se thread-safe. However, if I/O errors
568occur whilst reading or writing the underlying compressed files,
569you may have to consult @code{errno} to determine the cause of
570the error. In that case, you'd need a C library which correctly
571supports @code{errno} in a multithreaded environment.
572
573To make the library a little simpler and more portable,
574@code{BZ2_bzReadOpen} and @code{BZ2_bzWriteOpen} require you to pass them file
575handles (@code{FILE*}s) which have previously been opened for reading or
576writing respectively. That avoids portability problems associated with
577file operations and file attributes, whilst not being much of an
578imposition on the programmer.
579
580
581
582@subsection Utility functions summary
583For very simple needs, @code{BZ2_bzBuffToBuffCompress} and
584@code{BZ2_bzBuffToBuffDecompress} are provided. These compress
585data in memory from one buffer to another buffer in a single
586function call. You should assess whether these functions
587fulfill your memory-to-memory compression/decompression
588requirements before investing effort in understanding the more
589general but more complex low-level interface.
590
591Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} /
592@code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to
593give better @code{zlib} compatibility. These functions are
594@code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush},
595@code{BZ2_bzclose},
596@code{BZ2_bzerror} and @code{BZ2_bzlibVersion}. You may find these functions
597more convenient for simple file reading and writing, than those in the
598high-level interface. These functions are not (yet) officially part of
599the library, and are minimally documented here. If they break, you
600get to keep all the pieces. I hope to document them properly when time
601permits.
602
603Yoshioka also contributed modifications to allow the library to be
604built as a Windows DLL.
605
606
607@section Error handling
608
609The library is designed to recover cleanly in all situations, including
610the worst-case situation of decompressing random data. I'm not
611100% sure that it can always do this, so you might want to add
612a signal handler to catch segmentation violations during decompression
613if you are feeling especially paranoid. I would be interested in
614hearing more about the robustness of the library to corrupted
615compressed data.
616
617Version 1.0 is much more robust in this respect than
6180.9.0 or 0.9.5. Investigations with Checker (a tool for
619detecting problems with memory management, similar to Purify)
620indicate that, at least for the few files I tested, all single-bit
621errors in the compressed data are caught properly, with no
622segmentation faults, no reads of uninitialised data and no
623out of range reads or writes. So it's certainly much improved,
624although I wouldn't claim it to be totally bombproof.
625
626The file @code{bzlib.h} contains all definitions needed to use
627the library. In particular, you should definitely not include
628@code{bzlib_private.h}.
629
630In @code{bzlib.h}, the various return values are defined. The following
631list is not intended as an exhaustive description of the circumstances
632in which a given value may be returned -- those descriptions are given
633later. Rather, it is intended to convey the rough meaning of each
634return value. The first five values are normal and do not
635denote an error situation.
636@table @code
637@item BZ_OK
638The requested action was completed successfully.
639@item BZ_RUN_OK
640@itemx BZ_FLUSH_OK
641@itemx BZ_FINISH_OK
642In @code{BZ2_bzCompress}, the requested run (nothing special), flush or finish action
643was completed successfully.
644@item BZ_STREAM_END
645Compression of data was completed, or the logical stream end was
646detected during decompression.
647@end table
648
649The following return values indicate an error of some kind.
650@table @code
651@item BZ_CONFIG_ERROR
652Indicates that the library has been improperly compiled on your
653platform -- a major configuration error. Specifically, it means
654that @code{sizeof(char)}, @code{sizeof(short)} and @code{sizeof(int)}
655are not 1, 2 and 4 respectively, as they should be. Note that the
656library should still work properly on 64-bit platforms which follow
657the LP64 programming model -- that is, where @code{sizeof(long)}
658and @code{sizeof(void*)} are 8. Under LP64, @code{sizeof(int)} is
659still 4, so @code{libbzip2}, which doesn't use the @code{long} type,
660is OK.
661@item BZ_SEQUENCE_ERROR
662When using the library, it is important to call the functions in the
663correct sequence and with data structures (buffers etc) in the correct
664states. @code{libbzip2} checks as much as it can to ensure this is
665happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which
666complies precisely with the function semantics, as detailed below,
667should never receive this value; such an event denotes buggy code
668which you should investigate.
669@item BZ_PARAM_ERROR
670Returned when a parameter to a function call is out of range
671or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR},
672this denotes a bug in the client code. The distinction between
673@code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth
674making.
675@item BZ_MEM_ERROR
676Returned when a request to allocate memory failed. Note that the
677quantity of memory needed to decompress a stream cannot be determined
678until the stream's header has been read. So @code{BZ2_bzDecompress} and
679@code{BZ2_bzRead} may return @code{BZ_MEM_ERROR} even though some of
680the compressed data has been read. The same is not true for
681compression; once @code{BZ2_bzCompressInit} or @code{BZ2_bzWriteOpen} has
682successfully completed, @code{BZ_MEM_ERROR} cannot occur.
683@item BZ_DATA_ERROR
684Returned when a data integrity error is detected during decompression.
685Most importantly, this means when stored and computed CRCs for the
686data do not match. This value is also returned upon detection of any
687other anomaly in the compressed data.
688@item BZ_DATA_ERROR_MAGIC
689As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to
690know when the compressed stream does not start with the correct
691magic bytes (@code{'B' 'Z' 'h'}).
692@item BZ_IO_ERROR
693Returned by @code{BZ2_bzRead} and @code{BZ2_bzWrite} when there is an error
694reading or writing in the compressed file, and by @code{BZ2_bzReadOpen}
695and @code{BZ2_bzWriteOpen} for attempts to use a file for which the
696error indicator (viz, @code{ferror(f)}) is set.
697On receipt of @code{BZ_IO_ERROR}, the caller should consult
698@code{errno} and/or @code{perror} to acquire operating-system
699specific information about the problem.
700@item BZ_UNEXPECTED_EOF
701Returned by @code{BZ2_bzRead} when the compressed file finishes
702before the logical end of stream is detected.
703@item BZ_OUTBUFF_FULL
704Returned by @code{BZ2_bzBuffToBuffCompress} and
705@code{BZ2_bzBuffToBuffDecompress} to indicate that the output data
706will not fit into the output buffer provided.
707@end table
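
The size assumptions behind @code{BZ_CONFIG_ERROR} can also be checked by hand. The following is a minimal sketch, not part of the library; the name @code{type_sizes_ok} is made up for illustration.

```c
/* Returns 1 when the basic type sizes match what the text above says
   libbzip2 requires (1, 2 and 4 bytes), and 0 otherwise.
   type_sizes_ok is a hypothetical helper, not a libbzip2 function. */
static int type_sizes_ok(void)
{
    return sizeof(char) == 1
        && sizeof(short) == 2
        && sizeof(int) == 4;
}
```

On an LP64 platform this should still return 1, since only @code{long} and pointers widen to 8 bytes there.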
708
709
710
711@section Low-level interface
712
713@subsection @code{BZ2_bzCompressInit}
@example
typedef
   struct @{
      char *next_in;
      unsigned int avail_in;
      unsigned int total_in_lo32;
      unsigned int total_in_hi32;

      char *next_out;
      unsigned int avail_out;
      unsigned int total_out_lo32;
      unsigned int total_out_hi32;

      void *state;

      void *(*bzalloc)(void *,int,int);
      void (*bzfree)(void *,void *);
      void *opaque;
   @}
   bz_stream;

int BZ2_bzCompressInit ( bz_stream *strm,
                         int blockSize100k,
                         int verbosity,
                         int workFactor );
@end example
741
742Prepares for compression. The @code{bz_stream} structure
743holds all data pertaining to the compression activity.
744A @code{bz_stream} structure should be allocated and initialised
745prior to the call.
746The fields of @code{bz_stream}
747comprise the entirety of the user-visible data. @code{state}
748is a pointer to the private data structures required for compression.
749
750Custom memory allocators are supported, via fields @code{bzalloc},
751@code{bzfree},
752and @code{opaque}. The value
@code{opaque} is passed as the first argument to
754all calls to @code{bzalloc} and @code{bzfree}, but is
755otherwise ignored by the library.
756The call @code{bzalloc ( opaque, n, m )} is expected to return a
757pointer @code{p} to
758@code{n * m} bytes of memory, and @code{bzfree ( opaque, p )}
759should free
760that memory.
761
762If you don't want to use a custom memory allocator, set @code{bzalloc},
763@code{bzfree} and
764@code{opaque} to @code{NULL},
765and the library will then use the standard @code{malloc}/@code{free}
766routines.
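
As an illustration of the allocator hooks, here is a minimal sketch of a pair of functions with the signatures of the @code{bzalloc} and @code{bzfree} fields. The record @code{alloc_stats} and the names @code{counting_alloc} and @code{counting_free} are assumptions made up for this example; they simply wrap @code{malloc}/@code{free} and keep a count via @code{opaque}.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping record, handed to the hooks via `opaque`. */
typedef struct {
    int  live_blocks;   /* allocations not yet freed */
    long total_bytes;   /* bytes requested so far */
} alloc_stats;

/* Same shape as the bzalloc field: void *(*)(void *, int, int).
   Returns a pointer to n * m bytes, or NULL on failure. */
static void *counting_alloc(void *opaque, int n, int m)
{
    alloc_stats *st = (alloc_stats *)opaque;
    void *p = malloc((size_t)n * (size_t)m);

    if (p != NULL) {
        st->live_blocks++;
        st->total_bytes += (long)n * (long)m;
    }
    return p;
}

/* Same shape as the bzfree field: void (*)(void *, void *). */
static void counting_free(void *opaque, void *p)
{
    alloc_stats *st = (alloc_stats *)opaque;

    if (p != NULL) {
        st->live_blocks--;
        free(p);
    }
}
```

To use them, one would set @code{strm.bzalloc = counting_alloc}, @code{strm.bzfree = counting_free} and point @code{strm.opaque} at an @code{alloc_stats} record before the initialisation call.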
767
768Before calling @code{BZ2_bzCompressInit}, fields @code{bzalloc},
769@code{bzfree} and @code{opaque} should
770be filled appropriately, as just described. Upon return, the internal
771state will have been allocated and initialised, and @code{total_in_lo32},
772@code{total_in_hi32}, @code{total_out_lo32} and
773@code{total_out_hi32} will have been set to zero.
774These four fields are used by the library
775to inform the caller of the total amount of data passed into and out of
776the library, respectively. You should not try to change them.
777As of version 1.0, 64-bit counts are maintained, even on 32-bit
778platforms, using the @code{_hi32} fields to store the upper 32 bits
779of the count. So, for example, the total amount of data in
780is @code{(total_in_hi32 << 32) + total_in_lo32}.
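
The expression above is meant arithmetically. In C, shifting a 32-bit @code{unsigned int} by 32 bits is not well defined, so the upper half has to be widened first. A minimal sketch, assuming a compiler that provides @code{unsigned long long}; the helper name @code{combine_count} is hypothetical:

```c
/* Combine the two 32-bit halves of a libbzip2 byte count into one
   64-bit value. combine_count is a hypothetical helper name. */
static unsigned long long combine_count(unsigned int hi32, unsigned int lo32)
{
    /* Widen before shifting so the shift happens in 64 bits. */
    return ((unsigned long long)hi32 << 32) | (unsigned long long)lo32;
}
```

For example, @code{combine_count(strm.total_in_hi32, strm.total_in_lo32)} would give the total number of input bytes seen so far.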
781
782Parameter @code{blockSize100k} specifies the block size to be used for
783compression. It should be a value between 1 and 9 inclusive, and the
784actual block size used is 100000 x this figure. 9 gives the best
785compression but takes most memory.
786
787Parameter @code{verbosity} should be set to a number between 0 and 4
788inclusive. 0 is silent, and greater numbers give increasingly verbose
789monitoring/debugging output. If the library has been compiled with
790@code{-DBZ_NO_STDIO}, no such output will appear for any verbosity
791setting.
792
793Parameter @code{workFactor} controls how the compression phase behaves
794when presented with worst case, highly repetitive, input data. If
795compression runs into difficulties caused by repetitive data, the
796library switches from the standard sorting algorithm to a fallback
797algorithm. The fallback is slower than the standard algorithm by
798perhaps a factor of three, but always behaves reasonably, no matter how
799bad the input.
800
801Lower values of @code{workFactor} reduce the amount of effort the
802standard algorithm will expend before resorting to the fallback. You
should set this parameter carefully: too low, and many inputs will be
handled by the fallback algorithm and so compress rather slowly; too
high, and your average-to-worst case compression times can become very
806large. The default value of 30 gives reasonable behaviour over a wide
807range of circumstances.
808
809Allowable values range from 0 to 250 inclusive. 0 is a special case,
810equivalent to using the default value of 30.
811
812Note that the compressed output generated is the same regardless of
813whether or not the fallback algorithm is used.
814
815Be aware also that this parameter may disappear entirely in future
816versions of the library. In principle it should be possible to devise a
817good way to automatically choose which algorithm to use. Such a
818mechanism would render the parameter obsolete.
819
820Possible return values:
821@display
822 @code{BZ_CONFIG_ERROR}
823 if the library has been mis-compiled
824 @code{BZ_PARAM_ERROR}
825 if @code{strm} is @code{NULL}
         or @code{blockSize100k} < 1 or @code{blockSize100k} > 9
827 or @code{verbosity} < 0 or @code{verbosity} > 4
828 or @code{workFactor} < 0 or @code{workFactor} > 250
829 @code{BZ_MEM_ERROR}
830 if not enough memory is available
831 @code{BZ_OK}
832 otherwise
833@end display
834Allowable next actions:
835@display
836 @code{BZ2_bzCompress}
837 if @code{BZ_OK} is returned
838 no specific action needed in case of error
839@end display
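
The documented parameter ranges can be restated as code. This is only a sketch of the rules given above, not the library's own checking code, and all three function names are made up for illustration.

```c
/* Returns 1 if the three arguments lie within the ranges documented
   for BZ2_bzCompressInit, and 0 otherwise (hypothetical helper). */
static int compress_params_in_range(int blockSize100k, int verbosity,
                                    int workFactor)
{
    return blockSize100k >= 1 && blockSize100k <= 9
        && verbosity >= 0 && verbosity <= 4
        && workFactor >= 0 && workFactor <= 250;
}

/* Work factor actually in effect: 0 selects the default of 30. */
static int effective_work_factor(int workFactor)
{
    return workFactor == 0 ? 30 : workFactor;
}

/* Block size in bytes: 100000 times blockSize100k. */
static long block_size_bytes(int blockSize100k)
{
    return 100000L * blockSize100k;
}
```

A caller could run such a check before initialisation to report a bad setting in its own terms rather than waiting for @code{BZ_PARAM_ERROR}.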
840
841@subsection @code{BZ2_bzCompress}
842@example
843 int BZ2_bzCompress ( bz_stream *strm, int action );
844@end example
845Provides more input and/or output buffer space for the library. The
846caller maintains input and output buffers, and calls @code{BZ2_bzCompress} to
847transfer data between them.
848
849Before each call to @code{BZ2_bzCompress}, @code{next_in} should point at
850the data to be compressed, and @code{avail_in} should indicate how many
bytes the library may read. @code{BZ2_bzCompress} updates @code{next_in},
@code{avail_in} and the @code{total_in_lo32}/@code{total_in_hi32} counters
to reflect the number of bytes it has read.
854
855Similarly, @code{next_out} should point to a buffer in which the
856compressed data is to be placed, with @code{avail_out} indicating how
much output space is available. @code{BZ2_bzCompress} updates
@code{next_out}, @code{avail_out} and the
@code{total_out_lo32}/@code{total_out_hi32} counters to reflect the
number of bytes output.
860
861You may provide and remove as little or as much data as you like on each
862call of @code{BZ2_bzCompress}. In the limit, it is acceptable to supply and
863remove data one byte at a time, although this would be terribly
864inefficient. You should always ensure that at least one byte of output
865space is available at each call.
866
867A second purpose of @code{BZ2_bzCompress} is to request a change of mode of the
868compressed stream.
869
870Conceptually, a compressed stream can be in one of four states: IDLE,
871RUNNING, FLUSHING and FINISHING. Before initialisation
872(@code{BZ2_bzCompressInit}) and after termination (@code{BZ2_bzCompressEnd}), a
873stream is regarded as IDLE.
874
875Upon initialisation (@code{BZ2_bzCompressInit}), the stream is placed in the
876RUNNING state. Subsequent calls to @code{BZ2_bzCompress} should pass
877@code{BZ_RUN} as the requested action; other actions are illegal and
878will result in @code{BZ_SEQUENCE_ERROR}.
879
880At some point, the calling program will have provided all the input data
881it wants to. It will then want to finish up -- in effect, asking the
882library to process any data it might have buffered internally. In this
883state, @code{BZ2_bzCompress} will no longer attempt to read data from
884@code{next_in}, but it will want to write data to @code{next_out}.
885Because the output buffer supplied by the user can be arbitrarily small,
886the finishing-up operation cannot necessarily be done with a single call
887of @code{BZ2_bzCompress}.
888
889Instead, the calling program passes @code{BZ_FINISH} as an action to
890@code{BZ2_bzCompress}. This changes the stream's state to FINISHING. Any
891remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and
892transferred to the output buffer. To do this, @code{BZ2_bzCompress} must be
893called repeatedly until all the output has been consumed. At that
894point, @code{BZ2_bzCompress} returns @code{BZ_STREAM_END}, and the stream's
895state is set back to IDLE. @code{BZ2_bzCompressEnd} should then be
896called.
897
898Just to make sure the calling program does not cheat, the library makes
899a note of @code{avail_in} at the time of the first call to
900@code{BZ2_bzCompress} which has @code{BZ_FINISH} as an action (ie, at the
901time the program has announced its intention to not supply any more
902input). By comparing this value with that of @code{avail_in} over
903subsequent calls to @code{BZ2_bzCompress}, the library can detect any
904attempts to slip in more data to compress. Any calls for which this is
905detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a
906programming mistake which should be corrected.
907
908Instead of asking to finish, the calling program may ask
909@code{BZ2_bzCompress} to take all the remaining input, compress it and
910terminate the current (Burrows-Wheeler) compression block. This could
911be useful for error control purposes. The mechanism is analogous to
912that for finishing: call @code{BZ2_bzCompress} with an action of
913@code{BZ_FLUSH}, remove output data, and persist with the
@code{BZ_FLUSH} action until the value @code{BZ_RUN_OK} is returned. As
915with finishing, @code{BZ2_bzCompress} detects any attempt to provide more
916input data once the flush has begun.
917
918Once the flush is complete, the stream returns to the normal RUNNING
919state.
920
921This all sounds pretty complex, but isn't really. Here's a table
922which shows which actions are allowable in each state, what action
923will be taken, what the next state is, and what the non-error return
924values are. Note that you can't explicitly ask what state the
925stream is in, but nor do you need to -- it can be inferred from the
926values returned by @code{BZ2_bzCompress}.
927@display
928IDLE/@code{any}
929 Illegal. IDLE state only exists after @code{BZ2_bzCompressEnd} or
930 before @code{BZ2_bzCompressInit}.
931 Return value = @code{BZ_SEQUENCE_ERROR}
932
933RUNNING/@code{BZ_RUN}
934 Compress from @code{next_in} to @code{next_out} as much as possible.
935 Next state = RUNNING
936 Return value = @code{BZ_RUN_OK}
937
938RUNNING/@code{BZ_FLUSH}
      Remember current value of @code{avail_in}. Compress from @code{next_in}
940 to @code{next_out} as much as possible, but do not accept any more input.
941 Next state = FLUSHING
942 Return value = @code{BZ_FLUSH_OK}
943
944RUNNING/@code{BZ_FINISH}
      Remember current value of @code{avail_in}. Compress from @code{next_in}
946 to @code{next_out} as much as possible, but do not accept any more input.
947 Next state = FINISHING
948 Return value = @code{BZ_FINISH_OK}
949
950FLUSHING/@code{BZ_FLUSH}
951 Compress from @code{next_in} to @code{next_out} as much as possible,
952 but do not accept any more input.
953 If all the existing input has been used up and all compressed
954 output has been removed
955 Next state = RUNNING; Return value = @code{BZ_RUN_OK}
956 else
957 Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK}
958
959FLUSHING/other
960 Illegal.
961 Return value = @code{BZ_SEQUENCE_ERROR}
962
963FINISHING/@code{BZ_FINISH}
964 Compress from @code{next_in} to @code{next_out} as much as possible,
      but do not accept any more input.
966 If all the existing input has been used up and all compressed
967 output has been removed
968 Next state = IDLE; Return value = @code{BZ_STREAM_END}
969 else
         Next state = FINISHING; Return value = @code{BZ_FINISH_OK}
971
972FINISHING/other
973 Illegal.
974 Return value = @code{BZ_SEQUENCE_ERROR}
975@end display
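
The table can also be read as a small state machine. The sketch below is a literal transcription of the table into C and is not the library's implementation. The enum names are local to the example, and the @code{drained} flag is a hypothetical stand-in for the condition "all the existing input has been used up and all compressed output has been removed".

```c
typedef enum { ST_IDLE, ST_RUNNING, ST_FLUSHING, ST_FINISHING } model_state;
typedef enum { ACT_RUN, ACT_FLUSH, ACT_FINISH } model_action;
typedef enum {
    RET_SEQUENCE_ERROR, RET_RUN_OK, RET_FLUSH_OK,
    RET_FINISH_OK, RET_STREAM_END
} model_result;

/* Apply one row of the table: update *st and return the result code. */
static model_result model_step(model_state *st, model_action act, int drained)
{
    switch (*st) {
    case ST_RUNNING:
        if (act == ACT_RUN)
            return RET_RUN_OK;                 /* stays RUNNING */
        if (act == ACT_FLUSH) {
            *st = ST_FLUSHING;
            return RET_FLUSH_OK;
        }
        *st = ST_FINISHING;                    /* act == ACT_FINISH */
        return RET_FINISH_OK;
    case ST_FLUSHING:
        if (act != ACT_FLUSH)
            return RET_SEQUENCE_ERROR;
        if (drained) {
            *st = ST_RUNNING;
            return RET_RUN_OK;
        }
        return RET_FLUSH_OK;
    case ST_FINISHING:
        if (act != ACT_FINISH)
            return RET_SEQUENCE_ERROR;
        if (drained) {
            *st = ST_IDLE;
            return RET_STREAM_END;
        }
        return RET_FINISH_OK;
    case ST_IDLE:
    default:
        return RET_SEQUENCE_ERROR;             /* IDLE accepts no action */
    }
}
```

Stepping this model through a sequence of actions shows how the current state can be inferred from the return values alone.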
976
977That still looks complicated? Well, fair enough. The usual sequence
978of calls for compressing a load of data is:
979@itemize @bullet
980@item Get started with @code{BZ2_bzCompressInit}.
981@item Shovel data in and shlurp out its compressed form using zero or more
982calls of @code{BZ2_bzCompress} with action = @code{BZ_RUN}.
983@item Finish up.
984Repeatedly call @code{BZ2_bzCompress} with action = @code{BZ_FINISH},
985copying out the compressed output, until @code{BZ_STREAM_END} is returned.
986@item Close up and go home. Call @code{BZ2_bzCompressEnd}.
987@end itemize
988If the data you want to compress fits into your input buffer all
989at once, you can skip the calls of @code{BZ2_bzCompress ( ..., BZ_RUN )} and
990just do the @code{BZ2_bzCompress ( ..., BZ_FINISH )} calls.
991
992All required memory is allocated by @code{BZ2_bzCompressInit}. The
993compression library can accept any data at all (obviously). So you
994shouldn't get any error return values from the @code{BZ2_bzCompress} calls.
995If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in
996your programming.
997
998Trivial other possible return values:
999@display
1000 @code{BZ_PARAM_ERROR}
         if @code{strm} is @code{NULL}, or @code{strm->state} is @code{NULL}
1002@end display
1003
1004@subsection @code{BZ2_bzCompressEnd}
1005@example
1006int BZ2_bzCompressEnd ( bz_stream *strm );
1007@end example
1008Releases all memory associated with a compression stream.
1009
1010Possible return values:
1011@display
      @code{BZ_PARAM_ERROR}   if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
1013 @code{BZ_OK} otherwise
1014@end display
1015
1016
1017@subsection @code{BZ2_bzDecompressInit}
1018@example
1019int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );
1020@end example
1021Prepares for decompression. As with @code{BZ2_bzCompressInit}, a
1022@code{bz_stream} record should be allocated and initialised before the
1023call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be
1024set if a custom memory allocator is required, or made @code{NULL} for
1025the normal @code{malloc}/@code{free} routines. Upon return, the internal
state will have been initialised, and @code{total_in_lo32},
@code{total_in_hi32}, @code{total_out_lo32} and @code{total_out_hi32}
will be zero.
1028
1029For the meaning of parameter @code{verbosity}, see @code{BZ2_bzCompressInit}.
1030
1031If @code{small} is nonzero, the library will use an alternative
1032decompression algorithm which uses less memory but at the cost of
1033decompressing more slowly (roughly speaking, half the speed, but the
1034maximum memory requirement drops to around 2300k). See Chapter 2 for
1035more information on memory management.
1036
1037Note that the amount of memory needed to decompress
1038a stream cannot be determined until the stream's header has been read,
1039so even if @code{BZ2_bzDecompressInit} succeeds, a subsequent
1040@code{BZ2_bzDecompress} could fail with @code{BZ_MEM_ERROR}.
1041
1042Possible return values:
1043@display
1044 @code{BZ_CONFIG_ERROR}
1045 if the library has been mis-compiled
1046 @code{BZ_PARAM_ERROR}
1047 if @code{(small != 0 && small != 1)}
1048 or @code{(verbosity < 0 || verbosity > 4)}
      @code{BZ_MEM_ERROR}
         if insufficient memory is available
      @code{BZ_OK}
         otherwise
1051@end display
1052
1053Allowable next actions:
1054@display
1055 @code{BZ2_bzDecompress}
1056 if @code{BZ_OK} was returned
1057 no specific action required in case of error
1058@end display
1059
1060
1061
1062@subsection @code{BZ2_bzDecompress}
1063@example
1064int BZ2_bzDecompress ( bz_stream *strm );
1065@end example
Provides more input and/or output buffer space for the library. The
1067caller maintains input and output buffers, and uses @code{BZ2_bzDecompress}
1068to transfer data between them.
1069
1070Before each call to @code{BZ2_bzDecompress}, @code{next_in}
1071should point at the compressed data,
1072and @code{avail_in} should indicate how many bytes the library
may read. @code{BZ2_bzDecompress} updates @code{next_in}, @code{avail_in}
and the @code{total_in_lo32}/@code{total_in_hi32} counters
to reflect the number of bytes it has read.
1076
1077Similarly, @code{next_out} should point to a buffer in which the uncompressed
1078output is to be placed, with @code{avail_out} indicating how much output space
is available. @code{BZ2_bzDecompress} updates @code{next_out},
@code{avail_out} and the @code{total_out_lo32}/@code{total_out_hi32}
counters to reflect the number of bytes output.
1082
1083You may provide and remove as little or as much data as you like on
1084each call of @code{BZ2_bzDecompress}.
1085In the limit, it is acceptable to
1086supply and remove data one byte at a time, although this would be
1087terribly inefficient. You should always ensure that at least one
1088byte of output space is available at each call.
1089
1090Use of @code{BZ2_bzDecompress} is simpler than @code{BZ2_bzCompress}.
1091
1092You should provide input and remove output as described above, and
1093repeatedly call @code{BZ2_bzDecompress} until @code{BZ_STREAM_END} is
1094returned. Appearance of @code{BZ_STREAM_END} denotes that
1095@code{BZ2_bzDecompress} has detected the logical end of the compressed
1096stream. @code{BZ2_bzDecompress} will not produce @code{BZ_STREAM_END} until
1097all output data has been placed into the output buffer, so once
1098@code{BZ_STREAM_END} appears, you are guaranteed to have available all
1099the decompressed output, and @code{BZ2_bzDecompressEnd} can safely be
1100called.
1101
In case of an error return value, you should call @code{BZ2_bzDecompressEnd}
1103to clean up and release memory.
1104
1105Possible return values:
1106@display
1107 @code{BZ_PARAM_ERROR}
         if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
1109 or @code{strm->avail_out < 1}
1110 @code{BZ_DATA_ERROR}
1111 if a data integrity error is detected in the compressed stream
1112 @code{BZ_DATA_ERROR_MAGIC}
1113 if the compressed stream doesn't begin with the right magic bytes
1114 @code{BZ_MEM_ERROR}
1115 if there wasn't enough memory available
1116 @code{BZ_STREAM_END}
         if the logical end of the data stream was detected and all
         output has been consumed, eg @code{strm->avail_out > 0}
1119 @code{BZ_OK}
1120 otherwise
1121@end display
1122Allowable next actions:
1123@display
1124 @code{BZ2_bzDecompress}
1125 if @code{BZ_OK} was returned
1126 @code{BZ2_bzDecompressEnd}
1127 otherwise
1128@end display
1129
1130
1131@subsection @code{BZ2_bzDecompressEnd}
1132@example
1133int BZ2_bzDecompressEnd ( bz_stream *strm );
1134@end example
1135Releases all memory associated with a decompression stream.
1136
1137Possible return values:
1138@display
1139 @code{BZ_PARAM_ERROR}
         if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
1141 @code{BZ_OK}
1142 otherwise
1143@end display
1144
1145Allowable next actions:
1146@display
1147 None.
1148@end display
1149
1150
1151@section High-level interface
1152
1153This interface provides functions for reading and writing
1154@code{bzip2} format files. First, some general points.
1155
1156@itemize @bullet
1157@item All of the functions take an @code{int*} first argument,
1158 @code{bzerror}.
1159 After each call, @code{bzerror} should be consulted first to determine
1160 the outcome of the call. If @code{bzerror} is @code{BZ_OK},
1161 the call completed
1162 successfully, and only then should the return value of the function
1163 (if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR},
1164 there was an error
1165 reading/writing the underlying compressed file, and you should
1166 then consult @code{errno}/@code{perror} to determine the
1167 cause of the difficulty.
1168 @code{bzerror} may also be set to various other values; precise details are
1169 given on a per-function basis below.
1170@item If @code{bzerror} indicates an error
1171 (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}),
1172 you should immediately call @code{BZ2_bzReadClose} (or @code{BZ2_bzWriteClose},
1173 depending on whether you are attempting to read or to write)
1174 to free up all resources associated
1175 with the stream. Once an error has been indicated, behaviour of all calls
1176 except @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) is undefined.
1177 The implication is that (1) @code{bzerror} should
1178 be checked after each call, and (2) if @code{bzerror} indicates an error,
1179 @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) should then be called to clean up.
1180@item The @code{FILE*} arguments passed to
1181 @code{BZ2_bzReadOpen}/@code{BZ2_bzWriteOpen}
1182 should be set to binary mode.
1183 Most Unix systems will do this by default, but other platforms,
1184 including Windows and Mac, will not. If you omit this, you may
1185 encounter problems when moving code to new platforms.
1186@item Memory allocation requests are handled by
1187 @code{malloc}/@code{free}.
1188 At present
1189 there is no facility for user-defined memory allocators in the file I/O
1190 functions (could easily be added, though).
1191@end itemize
1192
1193
1194
1195@subsection @code{BZ2_bzReadOpen}
@example
   typedef void BZFILE;

   BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f,
                            int verbosity, int small,
                            void *unused, int nUnused );
@end example
1203Prepare to read compressed data from file handle @code{f}. @code{f}
1204should refer to a file which has been opened for reading, and for which
the error indicator (@code{ferror(f)}) is not set. If @code{small} is 1,
1206the library will try to decompress using less memory, at the expense of
1207speed.
1208
1209For reasons explained below, @code{BZ2_bzRead} will decompress the
1210@code{nUnused} bytes starting at @code{unused}, before starting to read
1211from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be
1212supplied like this. If this facility is not required, you should pass
@code{NULL} and @code{0} for @code{unused} and @code{nUnused}
1214respectively.
1215
1216For the meaning of parameters @code{small} and @code{verbosity},
1217see @code{BZ2_bzDecompressInit}.
1218
1219The amount of memory needed to decompress a file cannot be determined
1220until the file's header has been read. So it is possible that
1221@code{BZ2_bzReadOpen} returns @code{BZ_OK} but a subsequent call of
1222@code{BZ2_bzRead} will return @code{BZ_MEM_ERROR}.
1223
1224Possible assignments to @code{bzerror}:
1225@display
1226 @code{BZ_CONFIG_ERROR}
1227 if the library has been mis-compiled
1228 @code{BZ_PARAM_ERROR}
1229 if @code{f} is @code{NULL}
1230 or @code{small} is neither @code{0} nor @code{1}
1231 or @code{(unused == NULL && nUnused != 0)}
1232 or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))}
1233 @code{BZ_IO_ERROR}
1234 if @code{ferror(f)} is nonzero
1235 @code{BZ_MEM_ERROR}
1236 if insufficient memory is available
1237 @code{BZ_OK}
1238 otherwise.
1239@end display
1240
1241Possible return values:
1242@display
1243 Pointer to an abstract @code{BZFILE}
1244 if @code{bzerror} is @code{BZ_OK}
1245 @code{NULL}
1246 otherwise
1247@end display
1248
1249Allowable next actions:
1250@display
1251 @code{BZ2_bzRead}
1252 if @code{bzerror} is @code{BZ_OK}
      @code{BZ2_bzReadClose}
1254 otherwise
1255@end display
1256
1257
1258@subsection @code{BZ2_bzRead}
1259@example
1260 int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
1261@end example
1262Reads up to @code{len} (uncompressed) bytes from the compressed file
1263@code{b} into
1264the buffer @code{buf}. If the read was successful,
1265@code{bzerror} is set to @code{BZ_OK}
1266and the number of bytes read is returned. If the logical end-of-stream
1267was detected, @code{bzerror} will be set to @code{BZ_STREAM_END},
1268and the number
1269of bytes read is returned. All other @code{bzerror} values denote an error.
1270
1271@code{BZ2_bzRead} will supply @code{len} bytes,
1272unless the logical stream end is detected
1273or an error occurs. Because of this, it is possible to detect the
1274stream end by observing when the number of bytes returned is
1275less than the number
1276requested. Nevertheless, this is regarded as inadvisable; you should
1277instead check @code{bzerror} after every call and watch out for
1278@code{BZ_STREAM_END}.
1279
1280Internally, @code{BZ2_bzRead} copies data from the compressed file in chunks
1281of size @code{BZ_MAX_UNUSED} bytes
1282before decompressing it. If the file contains more bytes than strictly
1283needed to reach the logical end-of-stream, @code{BZ2_bzRead} will almost certainly
read some of the trailing data before signalling @code{BZ_STREAM_END}.
To collect the read but unused data once @code{BZ_STREAM_END} has
appeared, call @code{BZ2_bzReadGetUnused} immediately before @code{BZ2_bzReadClose}.
1287
1288Possible assignments to @code{bzerror}:
1289@display
1290 @code{BZ_PARAM_ERROR}
1291 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1292 @code{BZ_SEQUENCE_ERROR}
1293 if @code{b} was opened with @code{BZ2_bzWriteOpen}
1294 @code{BZ_IO_ERROR}
1295 if there is an error reading from the compressed file
1296 @code{BZ_UNEXPECTED_EOF}
1297 if the compressed file ended before the logical end-of-stream was detected
1298 @code{BZ_DATA_ERROR}
1299 if a data integrity error was detected in the compressed stream
1300 @code{BZ_DATA_ERROR_MAGIC}
1301 if the stream does not begin with the requisite header bytes (ie, is not
1302 a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}.
1303 @code{BZ_MEM_ERROR}
1304 if insufficient memory was available
1305 @code{BZ_STREAM_END}
1306 if the logical end of stream was detected.
1307 @code{BZ_OK}
1308 otherwise.
1309@end display
1310
1311Possible return values:
1312@display
1313 number of bytes read
1314 if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END}
1315 undefined
1316 otherwise
1317@end display
1318
1319Allowable next actions:
1320@display
1321 collect data from @code{buf}, then @code{BZ2_bzRead} or @code{BZ2_bzReadClose}
1322 if @code{bzerror} is @code{BZ_OK}
1323 collect data from @code{buf}, then @code{BZ2_bzReadClose} or @code{BZ2_bzReadGetUnused}
         if @code{bzerror} is @code{BZ_STREAM_END}
1325 @code{BZ2_bzReadClose}
1326 otherwise
1327@end display
1328
1329
1330
1331@subsection @code{BZ2_bzReadGetUnused}
@example
   void BZ2_bzReadGetUnused ( int* bzerror, BZFILE *b,
                              void** unused, int* nUnused );
@end example
1336Returns data which was read from the compressed file but was not needed
1337to get to the logical end-of-stream. @code{*unused} is set to the address
1338of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will
1339be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive.
1340
1341This function may only be called once @code{BZ2_bzRead} has signalled
1342@code{BZ_STREAM_END} but before @code{BZ2_bzReadClose}.
1343
1344Possible assignments to @code{bzerror}:
1345@display
1346 @code{BZ_PARAM_ERROR}
1347 if @code{b} is @code{NULL}
1348 or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL}
1349 @code{BZ_SEQUENCE_ERROR}
1350 if @code{BZ_STREAM_END} has not been signalled
1351 or if @code{b} was opened with @code{BZ2_bzWriteOpen}
1352 @code{BZ_OK}
1353 otherwise
1354@end display
1355
1356Allowable next actions:
1357@display
1358 @code{BZ2_bzReadClose}
1359@end display
1360
1361
1362@subsection @code{BZ2_bzReadClose}
1363@example
1364 void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
1365@end example
1366Releases all memory pertaining to the compressed file @code{b}.
1367@code{BZ2_bzReadClose} does not call @code{fclose} on the underlying file
1368handle, so you should do that yourself if appropriate.
1369@code{BZ2_bzReadClose} should be called to clean up after all error
1370situations.
1371
1372Possible assignments to @code{bzerror}:
1373@display
1374 @code{BZ_SEQUENCE_ERROR}
         if @code{b} was opened with @code{BZ2_bzWriteOpen}
1376 @code{BZ_OK}
1377 otherwise
1378@end display
1379
1380Allowable next actions:
1381@display
1382 none
1383@end display
1384
1385
1386
1387@subsection @code{BZ2_bzWriteOpen}
@example
   BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f,
                             int blockSize100k, int verbosity,
                             int workFactor );
@end example
1393Prepare to write compressed data to file handle @code{f}.
1394@code{f} should refer to
1395a file which has been opened for writing, and for which the error
indicator (@code{ferror(f)}) is not set.
1397
1398For the meaning of parameters @code{blockSize100k},
1399@code{verbosity} and @code{workFactor}, see
1400@* @code{BZ2_bzCompressInit}.
1401
1402All required memory is allocated at this stage, so if the call
1403completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a
1404subsequent call to @code{BZ2_bzWrite}.
1405
1406Possible assignments to @code{bzerror}:
1407@display
1408 @code{BZ_CONFIG_ERROR}
1409 if the library has been mis-compiled
1410 @code{BZ_PARAM_ERROR}
1411 if @code{f} is @code{NULL}
1412 or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1413 @code{BZ_IO_ERROR}
1414 if @code{ferror(f)} is nonzero
1415 @code{BZ_MEM_ERROR}
1416 if insufficient memory is available
1417 @code{BZ_OK}
1418 otherwise
1419@end display
1420
1421Possible return values:
1422@display
1423 Pointer to an abstract @code{BZFILE}
1424 if @code{bzerror} is @code{BZ_OK}
1425 @code{NULL}
1426 otherwise
1427@end display
1428
1429Allowable next actions:
1430@display
1431 @code{BZ2_bzWrite}
1432 if @code{bzerror} is @code{BZ_OK}
1433 (you could go directly to @code{BZ2_bzWriteClose}, but this would be pretty pointless)
1434 @code{BZ2_bzWriteClose}
1435 otherwise
1436@end display
1437
1438
1439
1440@subsection @code{BZ2_bzWrite}
1441@example
1442 void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
1443@end example
1444Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be
1445compressed and written to the file.
1446
1447Possible assignments to @code{bzerror}:
1448@display
1449 @code{BZ_PARAM_ERROR}
1450 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1451 @code{BZ_SEQUENCE_ERROR}
         if @code{b} was opened with @code{BZ2_bzReadOpen}
1453 @code{BZ_IO_ERROR}
1454 if there is an error writing the compressed file.
1455 @code{BZ_OK}
1456 otherwise
1457@end display
1458
1459
1460
1461
1462@subsection @code{BZ2_bzWriteClose}
@example
   void BZ2_bzWriteClose ( int *bzerror, BZFILE* b,
                           int abandon,
                           unsigned int* nbytes_in,
                           unsigned int* nbytes_out );

   void BZ2_bzWriteClose64 ( int *bzerror, BZFILE* b,
                             int abandon,
                             unsigned int* nbytes_in_lo32,
                             unsigned int* nbytes_in_hi32,
                             unsigned int* nbytes_out_lo32,
                             unsigned int* nbytes_out_hi32 );
@end example
1476
1477Compresses and flushes to the compressed file all data so far supplied
1478by @code{BZ2_bzWrite}. The logical end-of-stream markers are also written, so
1479subsequent calls to @code{BZ2_bzWrite} are illegal. All memory associated
1480with the compressed file @code{b} is released.
1481@code{fflush} is called on the
1482compressed file, but it is not @code{fclose}'d.
1483
1484If @code{BZ2_bzWriteClose} is called to clean up after an error, the only
1485action is to release the memory. The library records the error codes
1486issued by previous calls, so this situation will be detected
1487automatically. There is no attempt to complete the compression
1488operation, nor to @code{fflush} the compressed file. You can force this
1489behaviour to happen even in the case of no error, by passing a nonzero
1490value to @code{abandon}.
1491
If @code{nbytes_in} is non-null, @code{*nbytes_in} is set to the
total volume of uncompressed data handled.  Similarly, if
@code{nbytes_out} is non-null, @code{*nbytes_out} is set to the total
volume of compressed data written.  For compatibility with older
versions of the library, @code{BZ2_bzWriteClose} only yields the lower
32 bits of these counts.  Use @code{BZ2_bzWriteClose64} if you want the
full 64-bit counts, delivered as separate low and high 32-bit halves.
The two functions are otherwise identical.
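
The low and high halves reported by @code{BZ2_bzWriteClose64} can be
joined into a single 64-bit value.  The following is a minimal sketch,
assuming a C99 compiler that provides @code{stdint.h}; the helper name
@code{bz_count64} is hypothetical and is not part of the library.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libbzip2: joins the two 32-bit
   halves reported by BZ2_bzWriteClose64 into one 64-bit count.
   The mask guards against platforms where unsigned int is wider
   than 32 bits. */
uint64_t bz_count64(unsigned int lo32, unsigned int hi32)
{
    return ((uint64_t)(hi32 & 0xFFFFFFFFu) << 32)
         |  (uint64_t)(lo32 & 0xFFFFFFFFu);
}
```

A caller would pass the values written through
@code{nbytes_in_lo32} and @code{nbytes_in_hi32} (or the corresponding
output pair) after the close call has returned.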
1499
1500
1501Possible assignments to @code{bzerror}:
1502@display
1503 @code{BZ_SEQUENCE_ERROR}
1504 if @code{b} was opened with @code{BZ2_bzReadOpen}
1505 @code{BZ_IO_ERROR}
1506 if there is an error writing the compressed file
1507 @code{BZ_OK}
1508 otherwise
1509@end display
1510
1511@subsection Handling embedded compressed data streams
1512
1513The high-level library facilitates use of
1514@code{bzip2} data streams which form some part of a surrounding, larger
1515data stream.
1516@itemize @bullet
1517@item For writing, the library takes an open file handle, writes
1518compressed data to it, @code{fflush}es it but does not @code{fclose} it.
1519The calling application can write its own data before and after the
1520compressed data stream, using that same file handle.
1521@item Reading is more complex, and the facilities are not as general
1522as they could be since generality is hard to reconcile with efficiency.
1523@code{BZ2_bzRead} reads from the compressed file in blocks of size
1524@code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot
the logical end of the compressed stream.
1526To recover this data once decompression has
1527ended, call @code{BZ2_bzReadGetUnused} after the last call of @code{BZ2_bzRead}
1528(the one returning @code{BZ_STREAM_END}) but before calling
1529@code{BZ2_bzReadClose}.
1530@end itemize
1531
This mechanism makes it easy to decompress multiple @code{bzip2}
streams placed end-to-end.  At the end of one stream, when @code{BZ2_bzRead}
returns @code{BZ_STREAM_END}, call @code{BZ2_bzReadGetUnused} to collect the
unused data (copy it into your own buffer somewhere).
That data forms the start of the next compressed stream.
To start uncompressing that next stream, call @code{BZ2_bzReadOpen} again,
feeding in the unused data via the @code{unused}/@code{nUnused}
parameters.
Keep doing this until a @code{BZ_STREAM_END} return coincides with the
physical end of file (@code{feof(f)}).  In this situation
@code{BZ2_bzReadGetUnused} will of course return no data.
1544
1545This should give some feel for how the high-level interface can be used.
1546If you require extra flexibility, you'll have to bite the bullet and get
1547to grips with the low-level interface.
1548
1549@subsection Standard file-reading/writing code
1550Here's how you'd write data to a compressed file:
@example
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "wb" );   /* binary mode matters on Win32 */
if (!f) @{
   /* handle error */
@}
/* block size 9, verbosity 0, workFactor 0 (library default) */
b = BZ2_bzWriteOpen ( &bzerror, f, 9, 0, 0 );
if (bzerror != BZ_OK) @{
   BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
   /* handle error */
@}

while ( /* condition */ ) @{
   /* get data to write into buf, and set nBuf appropriately */
   BZ2_bzWrite ( &bzerror, b, buf, nBuf );   /* returns void */
   if (bzerror == BZ_IO_ERROR) @{
      BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
      /* handle error */
   @}
@}

/* abandon = 0: finish the stream; NULL: byte counts not wanted */
BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
if (bzerror == BZ_IO_ERROR) @{
   /* handle error */
@}
fclose ( f );   /* the library flushes f but never closes it */
@end example
1583And to read from a compressed file:
@example
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "rb" );   /* binary mode matters on Win32 */
if (!f) @{
   /* handle error */
@}
/* verbosity 0, small 0, no unused data carried over */
b = BZ2_bzReadOpen ( &bzerror, f, 0, 0, NULL, 0 );
if (bzerror != BZ_OK) @{
   BZ2_bzReadClose ( &bzerror, b );
   /* handle error */
@}

bzerror = BZ_OK;
while (bzerror == BZ_OK && /* arbitrary other conditions */) @{
   nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
   /* the final call returns BZ_STREAM_END and may still deliver data */
   if (bzerror == BZ_OK || bzerror == BZ_STREAM_END) @{
      /* do something with buf[0 .. nBuf-1] */
   @}
@}
if (bzerror != BZ_STREAM_END) @{
   BZ2_bzReadClose ( &bzerror, b );
   /* handle error */
@} else @{
   BZ2_bzReadClose ( &bzerror, b );
@}
fclose ( f );   /* the library never closes f */
@end example
1616
1617
1618
1619@section Utility functions
1620@subsection @code{BZ2_bzBuffToBuffCompress}
1621@example
1622 int BZ2_bzBuffToBuffCompress( char* dest,
1623 unsigned int* destLen,
1624 char* source,
1625 unsigned int sourceLen,
1626 int blockSize100k,
1627 int verbosity,
1628 int workFactor );
1629@end example
1630Attempts to compress the data in @code{source[0 .. sourceLen-1]}
1631into the destination buffer, @code{dest[0 .. *destLen-1]}.
1632If the destination buffer is big enough, @code{*destLen} is
1633set to the size of the compressed data, and @code{BZ_OK} is
1634returned. If the compressed data won't fit, @code{*destLen}
1635is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1636
1637Compression in this manner is a one-shot event, done with a single call
1638to this function. The resulting compressed data is a complete
1639@code{bzip2} format data stream. There is no mechanism for making
1640additional calls to provide extra input data. If you want that kind of
1641mechanism, use the low-level interface.
1642
1643For the meaning of parameters @code{blockSize100k}, @code{verbosity}
1644and @code{workFactor}, @* see @code{BZ2_bzCompressInit}.
1645
1646To guarantee that the compressed data will fit in its buffer, allocate
1647an output buffer of size 1% larger than the uncompressed data, plus
1648six hundred extra bytes.
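
The sizing rule above can be written as a small helper.  This is a
sketch rather than library code: the name @code{bz_compress_bound} is
hypothetical, and rounding the 1% term upwards is a conservative
reading of the rule, not something the library prescribes.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libbzip2: the output buffer size
   suggested by the manual, i.e. the input size plus 1% (rounded up)
   plus 600 bytes.  Computed in 64 bits so the sum cannot wrap. */
uint64_t bz_compress_bound(unsigned int sourceLen)
{
    uint64_t n = sourceLen;
    return n + (n + 99) / 100 + 600;
}
```

Since @code{*destLen} is an @code{unsigned int}, a caller should check
that the result fits in that type before allocating the buffer and
calling @code{BZ2_bzBuffToBuffCompress}.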
1649
@code{BZ2_bzBuffToBuffCompress} will not write data at or
beyond @code{dest[*destLen]}, even in case of buffer overflow.
1652
1653Possible return values:
1654@display
1655 @code{BZ_CONFIG_ERROR}
1656 if the library has been mis-compiled
1657 @code{BZ_PARAM_ERROR}
1658 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1659 or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1660 or @code{verbosity < 0} or @code{verbosity > 4}
1661 or @code{workFactor < 0} or @code{workFactor > 250}
1662 @code{BZ_MEM_ERROR}
1663 if insufficient memory is available
1664 @code{BZ_OUTBUFF_FULL}
1665 if the size of the compressed data exceeds @code{*destLen}
1666 @code{BZ_OK}
1667 otherwise
1668@end display
1669
1670
1671
1672@subsection @code{BZ2_bzBuffToBuffDecompress}
1673@example
1674 int BZ2_bzBuffToBuffDecompress ( char* dest,
1675 unsigned int* destLen,
1676 char* source,
1677 unsigned int sourceLen,
1678 int small,
1679 int verbosity );
1680@end example
1681Attempts to decompress the data in @code{source[0 .. sourceLen-1]}
1682into the destination buffer, @code{dest[0 .. *destLen-1]}.
1683If the destination buffer is big enough, @code{*destLen} is
1684set to the size of the uncompressed data, and @code{BZ_OK} is
returned.  If the uncompressed data won't fit, @code{*destLen}
1686is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1687
1688@code{source} is assumed to hold a complete @code{bzip2} format
1689data stream. @* @code{BZ2_bzBuffToBuffDecompress} tries to decompress
1690the entirety of the stream into the output buffer.
1691
1692For the meaning of parameters @code{small} and @code{verbosity},
1693see @code{BZ2_bzDecompressInit}.
1694
1695Because the compression ratio of the compressed data cannot be known in
1696advance, there is no easy way to guarantee that the output buffer will
1697be big enough. You may of course make arrangements in your code to
1698record the size of the uncompressed data, but such a mechanism is beyond
1699the scope of this library.
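
One possible arrangement for recording the uncompressed size is to
store it in a small fixed header in front of the compressed data.  The
sketch below assumes a 4-byte big-endian header; that layout and the
names @code{put_size_header} and @code{get_size_header} are
illustrative choices, not anything defined by the library or the
@code{bzip2} file format.

```c
#include <stdint.h>

#define BZ_SIZE_HEADER_LEN 4   /* hypothetical header length in bytes */

/* Stores the uncompressed length as 4 big-endian bytes in hdr. */
void put_size_header(unsigned char *hdr, uint32_t uncompressedLen)
{
    hdr[0] = (unsigned char)(uncompressedLen >> 24);
    hdr[1] = (unsigned char)(uncompressedLen >> 16);
    hdr[2] = (unsigned char)(uncompressedLen >> 8);
    hdr[3] = (unsigned char)(uncompressedLen);
}

/* Reads the length back, independent of host byte order. */
uint32_t get_size_header(const unsigned char *hdr)
{
    return ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16)
         | ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
}
```

The reader would call @code{get_size_header} first, allocate that many
bytes for @code{dest}, and pass the data following the header as
@code{source}.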
1700
1701@code{BZ2_bzBuffToBuffDecompress} will not write data at or
1702beyond @code{dest[*destLen]}, even in case of buffer overflow.
1703
1704Possible return values:
1705@display
1706 @code{BZ_CONFIG_ERROR}
1707 if the library has been mis-compiled
1708 @code{BZ_PARAM_ERROR}
1709 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1710 or @code{small != 0 && small != 1}
1711 or @code{verbosity < 0} or @code{verbosity > 4}
1712 @code{BZ_MEM_ERROR}
1713 if insufficient memory is available
1714 @code{BZ_OUTBUFF_FULL}
      if the size of the uncompressed data exceeds @code{*destLen}
1716 @code{BZ_DATA_ERROR}
1717 if a data integrity error was detected in the compressed data
1718 @code{BZ_DATA_ERROR_MAGIC}
1719 if the compressed data doesn't begin with the right magic bytes
1720 @code{BZ_UNEXPECTED_EOF}
1721 if the compressed data ends unexpectedly
1722 @code{BZ_OK}
1723 otherwise
1724@end display
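
@code{BZ_DATA_ERROR_MAGIC} in the list above is reported when the
input does not start with the expected magic bytes.  A cheap pre-check
can be sketched as follows.  It assumes the usual description of the
stream header, namely the three characters @code{BZh} followed by a
block-size digit from 1 to 9; that detail is not spelled out in this
section, and the helper name @code{looks_like_bzip2} is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libbzip2.  Returns 1 if buf starts
   with "BZh" followed by a digit '1'..'9', and 0 otherwise.  Passing
   this check does not guarantee that decompression will succeed. */
int looks_like_bzip2(const unsigned char *buf, size_t len)
{
    return len >= 4
        && buf[0] == 'B' && buf[1] == 'Z' && buf[2] == 'h'
        && buf[3] >= '1' && buf[3] <= '9';
}
```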
1725
1726
1727
1728@section @code{zlib} compatibility functions
Yoshioka Tsuneo has contributed some functions to
give better @code{zlib} compatibility.  These functions are
@code{BZ2_bzopen}, @code{BZ2_bzdopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite},
@code{BZ2_bzflush}, @code{BZ2_bzclose},
@code{BZ2_bzerror} and @code{BZ2_bzlibVersion}.
These functions are not (yet) officially part of
the library.  If they break, you get to keep all the pieces.
Nevertheless, I think they work ok.
1737@example
1738typedef void BZFILE;
1739
1740const char * BZ2_bzlibVersion ( void );
1741@end example
1742Returns a string indicating the library version.
1743@example
1744BZFILE * BZ2_bzopen ( const char *path, const char *mode );
1745BZFILE * BZ2_bzdopen ( int fd, const char *mode );
1746@end example
1747Opens a @code{.bz2} file for reading or writing, using either its name
1748or a pre-existing file descriptor.
1749Analogous to @code{fopen} and @code{fdopen}.
1750@example
1751int BZ2_bzread ( BZFILE* b, void* buf, int len );
1752int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
1753@end example
1754Reads/writes data from/to a previously opened @code{BZFILE}.
1755Analogous to @code{fread} and @code{fwrite}.
1756@example
1757int BZ2_bzflush ( BZFILE* b );
1758void BZ2_bzclose ( BZFILE* b );
1759@end example
1760Flushes/closes a @code{BZFILE}. @code{BZ2_bzflush} doesn't actually do
1761anything. Analogous to @code{fflush} and @code{fclose}.
1762
@example
const char * BZ2_bzerror ( BZFILE *b, int *errnum );
@end example
Returns a string describing the most recent error status of
@code{b}, and also sets @code{*errnum} to its numerical value.
1768
1769
1770@section Using the library in a @code{stdio}-free environment
1771
1772@subsection Getting rid of @code{stdio}
1773
1774In a deeply embedded application, you might want to use just
1775the memory-to-memory functions. You can do this conveniently
1776by compiling the library with preprocessor symbol @code{BZ_NO_STDIO}
1777defined. Doing this gives you a library containing only the following
1778eight functions:
1779
1780@code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, @code{BZ2_bzCompressEnd} @*
1781@code{BZ2_bzDecompressInit}, @code{BZ2_bzDecompress}, @code{BZ2_bzDecompressEnd} @*
1782@code{BZ2_bzBuffToBuffCompress}, @code{BZ2_bzBuffToBuffDecompress}
1783
1784When compiled like this, all functions will ignore @code{verbosity}
1785settings.
1786
1787@subsection Critical error handling
1788@code{libbzip2} contains a number of internal assertion checks which
1789should, needless to say, never be activated. Nevertheless, if an
1790assertion should fail, behaviour depends on whether or not the library
1791was compiled with @code{BZ_NO_STDIO} set.
1792
1793For a normal compile, an assertion failure yields the message
@example
   bzip2/libbzip2: internal error number N.
   This is a bug in bzip2/libbzip2, 1.0.2, 30-Dec-2001.
   Please report it to me at: jseward@@acm.org.  If this happened
   when you were using some program which uses libbzip2 as a
   component, you should also report this bug to the author(s)
   of that program.  Please make an effort to report this bug;
   timely and accurate bug reports eventually lead to higher
   quality software.  Thanks.  Julian Seward, 30 December 2001.
@end example
where @code{N} is some error code number.  If @code{N == 1007}, it also
prints some extra text advising the reader that unreliable memory is
often associated with internal error 1007.  (This is a
frequently observed phenomenon with versions 1.0.0/1.0.1.)
1808
1809@code{exit(3)} is then called.
1810
1811For a @code{stdio}-free library, assertion failures result
1812in a call to a function declared as:
1813@example
1814 extern void bz_internal_error ( int errcode );
1815@end example
1816The relevant code is passed as a parameter. You should supply
1817such a function.
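
A minimal sketch of such a function is shown below.  It only records
the code and hands control to an optional platform hook; the variable
@code{bz_last_internal_error} and the hook @code{bz_fatal_hook} are
hypothetical names introduced for illustration.  What the hook should
do (reset the device, halt, log over a serial line) depends entirely
on the target and is left as an assumption.

```c
/* Last assertion code seen; 0 means none.  Hypothetical bookkeeping,
   not part of libbzip2. */
volatile int bz_last_internal_error = 0;

/* Optional platform hook (hypothetical).  On a real target this would
   typically not return, for example by resetting or halting. */
void (*bz_fatal_hook)(int errcode) = 0;

/* The function a stdio-free build of the library expects the
   application to supply. */
void bz_internal_error(int errcode)
{
    bz_last_internal_error = errcode;
    if (bz_fatal_hook != 0)
        bz_fatal_hook(errcode);
    /* If control returns to the library from here, the caller must
       still treat every bz_stream record involved as invalid. */
}
```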
1818
1819In either case, once an assertion failure has occurred, any
1820@code{bz_stream} records involved can be regarded as invalid.
1821You should not attempt to resume normal operation with them.
1822
1823You may, of course, change critical error handling to suit
1824your needs. As I said above, critical errors indicate bugs
1825in the library and should not occur. All "normal" error
1826situations are indicated via error return codes from functions,
1827and can be recovered from.
1828
1829
1830@section Making a Windows DLL
1831Everything related to Windows has been contributed by Yoshioka Tsuneo
1832@* (@code{QWF00133@@niftyserve.or.jp} /
1833@code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to
1834him (but perhaps Cc: me, @code{jseward@@acm.org}).
1835
1836My vague understanding of what to do is: using Visual C++ 5.0,
1837open the project file @code{libbz2.dsp}, and build. That's all.
1838
1839If you can't
1840open the project file for some reason, make a new one, naming these files:
1841@code{blocksort.c}, @code{bzlib.c}, @code{compress.c},
1842@code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
1843@code{randtable.c} and @code{libbz2.def}. You will also need
1844to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
1845
If you don't use VC++, you may need to define the preprocessor symbol
@code{_WIN32}.
1848
1849Finally, @code{dlltest.c} is a sample program using the DLL. It has a
1850project file, @code{dlltest.dsp}.
1851
1852If you just want a makefile for Visual C, have a look at
1853@code{makefile.msc}.
1854
1855Be aware that if you compile @code{bzip2} itself on Win32, you must set
1856@code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file
1857@code{bzip2.c}, before compiling. Otherwise the resulting binary won't
1858work correctly.
1859
1860I haven't tried any of this stuff myself, but it all looks plausible.
1861
1862
1863
1864@chapter Miscellanea
1865
1866These are just some random thoughts of mine. Your mileage may
1867vary.
1868
1869@section Limitations of the compressed file format
1870@code{bzip2-1.0}, @code{0.9.5} and @code{0.9.0}
1871use exactly the same file format as the previous
1872version, @code{bzip2-0.1}. This decision was made in the interests of
1873stability. Creating yet another incompatible compressed file format
1874would create further confusion and disruption for users.
1875
1876Nevertheless, this is not a painless decision. Development
1877work since the release of @code{bzip2-0.1} in August 1997
1878has shown complexities in the file format which slow down
1879decompression and, in retrospect, are unnecessary. These are:
1880@itemize @bullet
1881@item The run-length encoder, which is the first of the
1882 compression transformations, is entirely irrelevant.
1883 The original purpose was to protect the sorting algorithm
1884 from the very worst case input: a string of repeated
1885 symbols. But algorithm steps Q6a and Q6b in the original
1886 Burrows-Wheeler technical report (SRC-124) show how
1887 repeats can be handled without difficulty in block
1888 sorting.
1889@item The randomisation mechanism doesn't really need to be
1890 there. Udi Manber and Gene Myers published a suffix
1891 array construction algorithm a few years back, which
1892 can be employed to sort any block, no matter how
1893 repetitive, in O(N log N) time. Subsequent work by
1894 Kunihiko Sadakane has produced a derivative O(N (log N)^2)
1895 algorithm which usually outperforms the Manber-Myers
1896 algorithm.
1897
1898 I could have changed to Sadakane's algorithm, but I find
1899 it to be slower than @code{bzip2}'s existing algorithm for
1900 most inputs, and the randomisation mechanism protects
1901 adequately against bad cases. I didn't think it was
1902 a good tradeoff to make. Partly this is due to the fact
1903 that I was not flooded with email complaints about
1904 @code{bzip2-0.1}'s performance on repetitive data, so
1905 perhaps it isn't a problem for real inputs.
1906
1907 Probably the best long-term solution,
1908 and the one I have incorporated into 0.9.5 and above,
1909 is to use the existing sorting
      algorithm initially, and fall back to an O(N (log N)^2)
1911 algorithm if the standard algorithm gets into difficulties.
1912@item The compressed file format was never designed to be
      handled by a library, and I have had to jump through
1914 some hoops to produce an efficient implementation of
1915 decompression. It's a bit hairy. Try passing
1916 @code{decompress.c} through the C preprocessor
1917 and you'll see what I mean. Much of this complexity
1918 could have been avoided if the compressed size of
1919 each block of data was recorded in the data stream.
1920@item An Adler-32 checksum, rather than a CRC32 checksum,
1921 would be faster to compute.
1922@end itemize
1923It would be fair to say that the @code{bzip2} format was frozen
1924before I properly and fully understood the performance
1925consequences of doing so.
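
To make the checksum point in the list above concrete, here is a
straightforward Adler-32 routine.  It is a sketch for comparison only:
@code{bzip2} itself uses CRC32, the function name
@code{adler32_simple} is hypothetical, and no attempt is made here to
defer the modulo operations the way tuned implementations do.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain Adler-32: two running sums modulo 65521, with no lookup
   table.  Illustrative only; not used anywhere in bzip2. */
uint32_t adler32_simple(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        a = (a + data[i]) % 65521u;   /* sum of bytes, plus one */
        b = (b + a) % 65521u;         /* sum of the running sums */
    }
    return (b << 16) | a;
}
```

The inner loop needs only additions and a reduction, which is the
reason such a checksum tends to be cheap to compute.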
1926
Improvements which I was able to incorporate into
0.9.0, despite using the same file format, are:
1929@itemize @bullet
1930@item Single array implementation of the inverse BWT. This
1931 significantly speeds up decompression, presumably
1932 because it reduces the number of cache misses.
1933@item Faster inverse MTF transform for large MTF values. The
1934 new implementation is based on the notion of sliding blocks
1935 of values.
1936@item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
1937 and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
1938 Duh! Well, you live and learn.
1939
1940@end itemize
1941Further ahead, it would be nice
1942to be able to do random access into files. This will
1943require some careful design of compressed file formats.
1944
1945
1946
1947@section Portability issues
1948After some consideration, I have decided not to use
1949GNU @code{autoconf} to configure 0.9.5 or 1.0.
1950
1951@code{autoconf}, admirable and wonderful though it is,
1952mainly assists with portability problems between Unix-like
1953platforms. But @code{bzip2} doesn't have much in the way
1954of portability problems on Unix; most of the difficulties appear
1955when porting to the Mac, or to Microsoft's operating systems.
1956@code{autoconf} doesn't help in those cases, and brings in a
1957whole load of new complexity.
1958
1959Most people should be able to compile the library and program
1960under Unix straight out-of-the-box, so to speak, especially
1961if you have a version of GNU C available.
1962
1963There are a couple of @code{__inline__} directives in the code. GNU C
1964(@code{gcc}) should be able to handle them. If you're not using
1965GNU C, your C compiler shouldn't see them at all.
1966If your compiler does, for some reason, see them and doesn't
1967like them, just @code{#define} @code{__inline__} to be @code{/* */}. One
1968easy way to do this is to compile with the flag @code{-D__inline__=},
1969which should be understood by most Unix compilers.
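
The same effect can be had from within a source or configuration
header, as in the sketch below.  The guard on @code{__GNUC__} is an
assumption about how one might detect compilers that already accept
the keyword, and @code{bz_demo_twice} is a made-up function that is
not taken from the @code{bzip2} sources.

```c
/* Equivalent to compiling with -D__inline__= on a compiler that does
   not understand the GNU keyword.  Compilers that define __GNUC__
   are assumed to accept __inline__ as it stands. */
#if !defined(__GNUC__) && !defined(__inline__)
#define __inline__ /* */
#endif

/* Illustrative function only; it shows that a declaration using
   __inline__ still compiles once the macro is in place. */
static __inline__ int bz_demo_twice(int x)
{
    return 2 * x;
}
```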
1970
1971If you still have difficulties, try compiling with the macro
1972@code{BZ_STRICT_ANSI} defined. This should enable you to build the
1973library in a strictly ANSI compliant environment. Building the program
1974itself like this is dangerous and not supported, since you remove
1975@code{bzip2}'s checks against compressing directories, symbolic links,
1976devices, and other not-really-a-file entities. This could cause
1977filesystem corruption!
1978
One other thing: if you create a @code{bzip2} binary for public
distribution, please try to link it statically (@code{gcc -static}).  This
avoids all sorts of library-version issues that others may encounter
later on.
1983
1984If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and
1985@code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling.
1986Otherwise the resulting binary won't work correctly.
1987
1988
1989
1990@section Reporting bugs
1991I tried pretty hard to make sure @code{bzip2} is
1992bug free, both by design and by testing. Hopefully
1993you'll never need to read this section for real.
1994
1995Nevertheless, if @code{bzip2} dies with a segmentation
1996fault, a bus error or an internal assertion failure, it
1997will ask you to email me a bug report. Experience with
1998version 0.1 shows that almost all these problems can
1999be traced to either compiler bugs or hardware problems.
2000@itemize @bullet
2001@item
2002Recompile the program with no optimisation, and see if it
2003works. And/or try a different compiler.
2004I heard all sorts of stories about various flavours
2005of GNU C (and other compilers) generating bad code for
2006@code{bzip2}, and I've run across two such examples myself.
2007
2.7.X versions of GNU C are known to generate bad code from
time to time, at high optimisation levels.
2010If you get problems, try using the flags
2011@code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}.
2012You should specifically @emph{not} use @code{-funroll-loops}.
2013
2014You may notice that the Makefile runs six tests as part of
2015the build process. If the program passes all of these, it's
2016a pretty good (but not 100%) indication that the compiler has
2017done its job correctly.
2018@item
2019If @code{bzip2} crashes randomly, and the crashes are not
2020repeatable, you may have a flaky memory subsystem. @code{bzip2}
2021really hammers your memory hierarchy, and if it's a bit marginal,
2022you may get these problems. Ditto if your disk or I/O subsystem
2023is slowly failing. Yup, this really does happen.
2024
2025Try using a different machine of the same type, and see if
2026you can repeat the problem.
2027@item This isn't really a bug, but ... If @code{bzip2} tells
2028you your file is corrupted on decompression, and you
2029obtained the file via FTP, there is a possibility that you
2030forgot to tell FTP to do a binary mode transfer. That absolutely
2031will cause the file to be non-decompressible. You'll have to transfer
2032it again.
2033@end itemize
2034
If you've incorporated @code{libbzip2} into your own program
and are getting problems, please, please, please, check that the
parameters you are passing in calls to the library are
correct, and in accordance with what the documentation says
is allowable.  I have tried to make the library robust against
such problems, but I'm sure I haven't succeeded.
2041
2042Finally, if the above comments don't help, you'll have to send
2043me a bug report. Now, it's just amazing how many people will
2044send me a bug report saying something like
2045@display
2046 bzip2 crashed with segmentation fault on my machine
2047@end display
and absolutely nothing else.  Needless to say, such a report
2049is @emph{totally, utterly, completely and comprehensively 100% useless;
2050a waste of your time, my time, and net bandwidth}.
2051With no details at all, there's no way I can possibly begin
2052to figure out what the problem is.
2053
2054The rules of the game are: facts, facts, facts. Don't omit
2055them because "oh, they won't be relevant". At the bare
2056minimum:
2057@display
2058 Machine type. Operating system version.
2059 Exact version of @code{bzip2} (do @code{bzip2 -V}).
2060 Exact version of the compiler used.
2061 Flags passed to the compiler.
2062@end display
2063However, the most important single thing that will help me is
2064the file that you were trying to compress or decompress at the
time the problem happened.  Without that, my ability to do anything
more than speculate about the cause is limited.
2067
2068Please remember that I connect to the Internet with a modem, so
2069you should contact me before mailing me huge files.
2070
2071
2072@section Did you get the right package?
2073
2074@code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles
2075and memory. Also, it gives very large latencies. In the worst case, you
2076can feed many megabytes of uncompressed data into the library before
2077getting any compressed output, so this probably rules out applications
2078requiring interactive behaviour.
2079
2080These aren't faults of my implementation, I hope, but more
2081an intrinsic property of the Burrows-Wheeler transform (unfortunately).
2082Maybe this isn't what you want.
2083
If you want a compressor and/or library which is faster, uses less
memory but gets pretty good compression, and has minimal latency,
consider Jean-loup Gailly's and Mark Adler's work, @code{zlib-1.1.3} and
@code{gzip-1.2.4}.  Look for them at
@code{http://www.zlib.org} and
@code{http://www.gzip.org} respectively.
2092
2093For something faster and lighter still, you might try Markus F X J
2094Oberhumer's @code{LZO} real-time compression/decompression library, at
2095@* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}.
2096
2097If you want to use the @code{bzip2} algorithms to compress small blocks
2098of data, 64k bytes or smaller, for example on an on-the-fly disk
2099compressor, you'd be well advised not to use this library. Instead,
2100I've made a special library tuned for that kind of use. It's part of
2101@code{e2compr-0.40}, an on-the-fly disk compressor for the Linux
2102@code{ext2} filesystem. Look at
2103@code{http://www.netspace.net.au/~reiter/e2compr}.
2104
2105
2106
2107@section Testing
2108
2109A record of the tests I've done.
2110
2111First, some data sets:
2112@itemize @bullet
2113@item B: a directory containing 6001 files, one for every length in the
2114 range 0 to 6000 bytes. The files contain random lowercase
2115 letters. 18.7 megabytes.
2116@item H: my home directory tree. Documents, source code, mail files,
2117 compressed data. H contains B, and also a directory of
2118 files designed as boundary cases for the sorting; mostly very
2119 repetitive, nasty files. 565 megabytes.
2120@item A: directory tree holding various applications built from source:
2121 @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc.
2122 2200 megabytes.
2123@end itemize
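
For readers who want to build something like data set B themselves,
the sketch below fills a buffer with pseudo-random lowercase letters.
The generator, the seed handling and the function names are
assumptions made for illustration; the original files were not
necessarily produced this way.

```c
#include <stddef.h>

/* Small linear congruential generator, chosen so that the output is
   reproducible from a seed.  An assumption of this sketch. */
static unsigned long lcg_next(unsigned long *state)
{
    *state = (*state * 1103515245UL + 12345UL) & 0x7FFFFFFFUL;
    return *state;
}

/* Fills buf[0 .. len-1] with pseudo-random letters 'a'..'z'.
   Hypothetical helper; not part of bzip2 or its test suite. */
void fill_random_lowercase(char *buf, size_t len, unsigned long seed)
{
    unsigned long state = seed;
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] = (char)('a' + (lcg_next(&state) >> 16) % 26);
}
```

Calling it once for each length from 0 to 6000 and writing each buffer
to its own file would give a directory with the same shape as B.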
2124The tests conducted are as follows. Each test means compressing
2125(a copy of) each file in the data set, decompressing it and
2126comparing it against the original.
2127
2128First, a bunch of tests with block sizes and internal buffer
2129sizes set very small,
2130to detect any problems with the
2131blocking and buffering mechanisms.
2132This required modifying the source code so as to try to
2133break it.
2134@enumerate
2135@item Data set H, with
2136 buffer size of 1 byte, and block size of 23 bytes.
2137@item Data set B, buffer sizes 1 byte, block size 1 byte.
2138@item As (2) but small-mode decompression.
2139@item As (2) with block size 2 bytes.
2140@item As (2) with block size 3 bytes.
2141@item As (2) with block size 4 bytes.
2142@item As (2) with block size 5 bytes.
2143@item As (2) with block size 6 bytes and small-mode decompression.
2144@item H with buffer size of 1 byte, but normal block
2145 size (up to 900000 bytes).
2146@end enumerate
2147Then some tests with unmodified source code.
2148@enumerate
2149@item H, all settings normal.
2150@item As (1), with small-mode decompress.
2151@item H, compress with flag @code{-1}.
2152@item H, compress with flag @code{-s}, decompress with flag @code{-s}.
2153@item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
2154 @code{bzip2-0.9.5} decompressing, all settings normal.
2155@item Backwards compatibility: H, @code{bzip2-0.9.5} compressing,
2156 @code{bzip2-0.1pl2} decompressing, all settings normal.
2157@item Bigger tests: A, all settings normal.
2158@item As (7), using the fallback (Sadakane-like) sorting algorithm.
2159@item As (8), compress with flag @code{-1}, decompress with flag
2160 @code{-s}.
2161@item H, using the fallback sorting algorithm.
2162@item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing,
2163 @code{bzip2-0.9.5} decompressing, all settings normal.
2164@item Backwards compatibility: A, @code{bzip2-0.9.5} compressing,
2165 @code{bzip2-0.1pl2} decompressing, all settings normal.
2166@item Misc test: about 400 megabytes of @code{.tar} files with
2167 @code{bzip2} compiled with Checker (a memory access error
2168 detector, like Purify).
2169@item Misc tests to make sure it builds and runs ok on non-Linux/x86
2170 platforms.
2171@end enumerate
2172These tests were conducted on a 225 MHz IDT WinChip machine, running
2173Linux 2.0.36. They represent nearly a week of continuous computation.
2174All tests completed successfully.
2175
2176
2177@section Further reading
2178@code{bzip2} is not research work, in the sense that it doesn't present
2179any new ideas. Rather, it's an engineering exercise based on existing
2180ideas.
2181
2182Four documents describe essentially all the ideas behind @code{bzip2}:
2183@example
2184Michael Burrows and D. J. Wheeler:
2185 "A block-sorting lossless data compression algorithm"
2186 10th May 1994.
2187 Digital SRC Research Report 124.
2188 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
2189 If you have trouble finding it, try searching at the
2190 New Zealand Digital Library, http://www.nzdl.org.
2191
Daniel S. Hirschberg and Debra A. Lelewer
2193 "Efficient Decoding of Prefix Codes"
2194 Communications of the ACM, April 1990, Vol 33, Number 4.
2195 You might be able to get an electronic copy of this
2196 from the ACM Digital Library.
2197
2198David J. Wheeler
2199 Program bred3.c and accompanying document bred3.ps.
2200 This contains the idea behind the multi-table Huffman
2201 coding scheme.
2202 ftp://ftp.cl.cam.ac.uk/users/djw3/
2203
2204Jon L. Bentley and Robert Sedgewick
2205 "Fast Algorithms for Sorting and Searching Strings"
2206 Available from Sedgewick's web page,
2207 www.cs.princeton.edu/~rs
2208@end example
2209The following paper gives valuable additional insights into the
2210algorithm, but is not immediately the basis of any code
2211used in bzip2.
2212@example
2213Peter Fenwick:
2214 Block Sorting Text Compression
2215 Proceedings of the 19th Australasian Computer Science Conference,
2216 Melbourne, Australia. Jan 31 - Feb 2, 1996.
2217 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
2218@end example
2219Kunihiko Sadakane's sorting algorithm, mentioned above,
2220is available from:
2221@example
2222http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
2223@end example
2224The Manber-Myers suffix array construction
2225algorithm is described in a paper
2226available from:
2227@example
2228http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
2229@end example
2230Finally, the following paper documents some recent investigations
2231I made into the performance of sorting algorithms:
2232@example
2233Julian Seward:
2234 On the Performance of BWT Sorting Algorithms
2235 Proceedings of the IEEE Data Compression Conference 2000
2236 Snowbird, Utah. 28-30 March 2000.
2237@end example
2238
2239
2240@contents
2241
2242@bye
2243