]> cygwin.com Git - cygwin-apps/setup.git/blame - bz2lib/manual_4.html
2002-04-29 Robert Collins <rbtcollins@hotmail.com>
[cygwin-apps/setup.git] / bz2lib / manual_4.html
CommitLineData
ef6bacff 1<HTML>
ba95a000
RC
2<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
3<!-- Created on January, 5 2002 by texi2html 1.64 -->
4<!--
5Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author)
6 Karl Berry <karl@freefriends.org>
7 Olaf Bachmann <obachman@mathematik.uni-kl.de>
8 and many others.
9Maintained by: Olaf Bachmann <obachman@mathematik.uni-kl.de>
10Send bugs and suggestions to <texi2html@mathematik.uni-kl.de>
11
12-->
ef6bacff 13<HEAD>
ba95a000 14<TITLE>Untitled Document: 4. Miscellanea</TITLE>
ef6bacff 15
ba95a000
RC
16<META NAME="description" CONTENT="Untitled Document: 4. Miscellanea">
17<META NAME="keywords" CONTENT="Untitled Document: 4. Miscellanea">
18<META NAME="resource-type" CONTENT="document">
19<META NAME="distribution" CONTENT="global">
20<META NAME="Generator" CONTENT="texi2html 1.64">
ef6bacff
RC
21
22</HEAD>
ef6bacff 23
ba95a000 24<BODY LANG="" BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#800080" ALINK="#FF0000">
ef6bacff 25
ba95a000
RC
26<A NAME="SEC43"></A>
27<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
28<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_3.html#SEC42"> &lt; </A>]</TD>
29<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC44"> &gt; </A>]</TD>
30<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
31<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
32<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
33<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
34<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
35<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
36<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
37</TR></TABLE>
38<H1> 4. Miscellanea </H1>
39<!--docid::SEC43::-->
ef6bacff 40<P>
ba95a000 41
ef6bacff
RC
42These are just some random thoughts of mine. Your mileage may
43vary.
ba95a000
RC
44</P><P>
45
46<HR SIZE="6">
47<A NAME="SEC44"></A>
48<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
49<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC43"> &lt; </A>]</TD>
50<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC45"> &gt; </A>]</TD>
51<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
52<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
53<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
54<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
55<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
56<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
57<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
58</TR></TABLE>
59<H2> 4.1 Limitations of the compressed file format </H2>
60<!--docid::SEC44::-->
ef6bacff
RC
61<CODE>bzip2-1.0</CODE>, <CODE>0.9.5</CODE> and <CODE>0.9.0</CODE>
62use exactly the same file format as the previous
63version, <CODE>bzip2-0.1</CODE>. This decision was made in the interests of
64stability. Creating yet another incompatible compressed file format
65would create further confusion and disruption for users.
ef6bacff 66<P>
ba95a000 67
ef6bacff
RC
68Nevertheless, this is not a painless decision. Development
69work since the release of <CODE>bzip2-0.1</CODE> in August 1997
70has shown complexities in the file format which slow down
71decompression and, in retrospect, are unnecessary. These are:
ef6bacff
RC
72<UL>
73<LI>The run-length encoder, which is the first of the
ef6bacff
RC
74 compression transformations, is entirely irrelevant.
75 The original purpose was to protect the sorting algorithm
76 from the very worst case input: a string of repeated
77 symbols. But algorithm steps Q6a and Q6b in the original
78 Burrows-Wheeler technical report (SRC-124) show how
79 repeats can be handled without difficulty in block
80 sorting.
81<LI>The randomisation mechanism doesn't really need to be
ef6bacff
RC
82 there. Udi Manber and Gene Myers published a suffix
83 array construction algorithm a few years back, which
84 can be employed to sort any block, no matter how
85 repetitive, in O(N log N) time. Subsequent work by
86 Kunihiko Sadakane has produced a derivative O(N (log N)^2)
87 algorithm which usually outperforms the Manber-Myers
88 algorithm.
ba95a000 89<P>
ef6bacff
RC
90
91 I could have changed to Sadakane's algorithm, but I find
92 it to be slower than <CODE>bzip2</CODE>'s existing algorithm for
93 most inputs, and the randomisation mechanism protects
94 adequately against bad cases. I didn't think it was
95 a good tradeoff to make. Partly this is due to the fact
96 that I was not flooded with email complaints about
97 <CODE>bzip2-0.1</CODE>'s performance on repetitive data, so
98 perhaps it isn't a problem for real inputs.
ba95a000 99</P><P>
ef6bacff
RC
100
101 Probably the best long-term solution,
102 and the one I have incorporated into 0.9.5 and above,
103 is to use the existing sorting
104 algorithm initially, and fall back to a O(N (log N)^2)
105 algorithm if the standard algorithm gets into difficulties.
106<LI>The compressed file format was never designed to be
ef6bacff
RC
107 handled by a library, and I have had to jump though
108 some hoops to produce an efficient implementation of
109 decompression. It's a bit hairy. Try passing
110 <CODE>decompress.c</CODE> through the C preprocessor
111 and you'll see what I mean. Much of this complexity
112 could have been avoided if the compressed size of
113 each block of data was recorded in the data stream.
114<LI>An Adler-32 checksum, rather than a CRC32 checksum,
ef6bacff
RC
115 would be faster to compute.
116</UL>
ef6bacff
RC
117It would be fair to say that the <CODE>bzip2</CODE> format was frozen
118before I properly and fully understood the performance
119consequences of doing so.
ef6bacff 120<P>
ba95a000 121
ef6bacff
RC
122Improvements which I was able to incorporate into
1230.9.0, despite using the same file format, are:
ef6bacff
RC
124<UL>
125<LI>Single array implementation of the inverse BWT. This
ef6bacff
RC
126 significantly speeds up decompression, presumably
127 because it reduces the number of cache misses.
128<LI>Faster inverse MTF transform for large MTF values. The
ef6bacff
RC
129 new implementation is based on the notion of sliding blocks
130 of values.
131<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE>
ef6bacff
RC
132 and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>.
133 Duh! Well, you live and learn.
ba95a000 134<P>
ef6bacff
RC
135
136</UL>
ef6bacff
RC
137Further ahead, it would be nice
138to be able to do random access into files. This will
139require some careful design of compressed file formats.
ef6bacff 140<P>
ba95a000
RC
141
142<HR SIZE="6">
143<A NAME="SEC45"></A>
144<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
145<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC44"> &lt; </A>]</TD>
146<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC46"> &gt; </A>]</TD>
147<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
148<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
149<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
150<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
151<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
152<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
153<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
154</TR></TABLE>
155<H2> 4.2 Portability issues </H2>
156<!--docid::SEC45::-->
ef6bacff
RC
157After some consideration, I have decided not to use
158GNU <CODE>autoconf</CODE> to configure 0.9.5 or 1.0.
ef6bacff 159<P>
ba95a000 160
ef6bacff
RC
161<CODE>autoconf</CODE>, admirable and wonderful though it is,
162mainly assists with portability problems between Unix-like
163platforms. But <CODE>bzip2</CODE> doesn't have much in the way
164of portability problems on Unix; most of the difficulties appear
165when porting to the Mac, or to Microsoft's operating systems.
166<CODE>autoconf</CODE> doesn't help in those cases, and brings in a
167whole load of new complexity.
ba95a000 168</P><P>
ef6bacff 169
ef6bacff
RC
170Most people should be able to compile the library and program
171under Unix straight out-of-the-box, so to speak, especially
172if you have a version of GNU C available.
ba95a000 173</P><P>
ef6bacff 174
ef6bacff
RC
175There are a couple of <CODE>__inline__</CODE> directives in the code. GNU C
176(<CODE>gcc</CODE>) should be able to handle them. If you're not using
177GNU C, your C compiler shouldn't see them at all.
178If your compiler does, for some reason, see them and doesn't
179like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be <CODE>/* */</CODE>. One
180easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>,
181which should be understood by most Unix compilers.
ba95a000 182</P><P>
ef6bacff 183
ef6bacff
RC
184If you still have difficulties, try compiling with the macro
185<CODE>BZ_STRICT_ANSI</CODE> defined. This should enable you to build the
186library in a strictly ANSI compliant environment. Building the program
187itself like this is dangerous and not supported, since you remove
188<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links,
189devices, and other not-really-a-file entities. This could cause
190filesystem corruption!
ba95a000 191</P><P>
ef6bacff 192
ef6bacff
RC
193One other thing: if you create a <CODE>bzip2</CODE> binary for public
194distribution, please try and link it statically (<CODE>gcc -s</CODE>). This
195avoids all sorts of library-version issues that others may encounter
196later on.
ba95a000 197</P><P>
ef6bacff 198
ef6bacff
RC
199If you build <CODE>bzip2</CODE> on Win32, you must set <CODE>BZ_UNIX</CODE> to 0 and
200<CODE>BZ_LCCWIN32</CODE> to 1, in the file <CODE>bzip2.c</CODE>, before compiling.
201Otherwise the resulting binary won't work correctly.
ba95a000
RC
202</P><P>
203
204<HR SIZE="6">
205<A NAME="SEC46"></A>
206<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
207<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC45"> &lt; </A>]</TD>
208<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC47"> &gt; </A>]</TD>
209<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
210<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
211<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
212<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
213<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
214<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
215<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
216</TR></TABLE>
217<H2> 4.3 Reporting bugs </H2>
218<!--docid::SEC46::-->
ef6bacff
RC
219I tried pretty hard to make sure <CODE>bzip2</CODE> is
220bug free, both by design and by testing. Hopefully
221you'll never need to read this section for real.
ef6bacff 222<P>
ba95a000 223
ef6bacff
RC
224Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation
225fault, a bus error or an internal assertion failure, it
226will ask you to email me a bug report. Experience with
227version 0.1 shows that almost all these problems can
228be traced to either compiler bugs or hardware problems.
ef6bacff
RC
229<UL>
230<LI>
ef6bacff
RC
231Recompile the program with no optimisation, and see if it
232works. And/or try a different compiler.
233I heard all sorts of stories about various flavours
234of GNU C (and other compilers) generating bad code for
235<CODE>bzip2</CODE>, and I've run across two such examples myself.
ba95a000 236<P>
ef6bacff
RC
237
2382.7.X versions of GNU C are known to generate bad code from
239time to time, at high optimisation levels.
240If you get problems, try using the flags
241<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>.
242You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>.
ba95a000 243</P><P>
ef6bacff
RC
244
245You may notice that the Makefile runs six tests as part of
246the build process. If the program passes all of these, it's
247a pretty good (but not 100%) indication that the compiler has
248done its job correctly.
249<LI>
ef6bacff
RC
250If <CODE>bzip2</CODE> crashes randomly, and the crashes are not
251repeatable, you may have a flaky memory subsystem. <CODE>bzip2</CODE>
252really hammers your memory hierarchy, and if it's a bit marginal,
253you may get these problems. Ditto if your disk or I/O subsystem
254is slowly failing. Yup, this really does happen.
ba95a000 255<P>
ef6bacff
RC
256
257Try using a different machine of the same type, and see if
258you can repeat the problem.
259<LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells
ef6bacff
RC
260you your file is corrupted on decompression, and you
261obtained the file via FTP, there is a possibility that you
262forgot to tell FTP to do a binary mode transfer. That absolutely
263will cause the file to be non-decompressible. You'll have to transfer
264it again.
265</UL>
ef6bacff 266<P>
ba95a000 267
ef6bacff
RC
268If you've incorporated <CODE>libbzip2</CODE> into your own program
269and are getting problems, please, please, please, check that the
270parameters you are passing in calls to the library, are
271correct, and in accordance with what the documentation says
272is allowable. I have tried to make the library robust against
273such problems, but I'm sure I haven't succeeded.
ba95a000 274</P><P>
ef6bacff 275
ef6bacff
RC
276Finally, if the above comments don't help, you'll have to send
277me a bug report. Now, it's just amazing how many people will
278send me a bug report saying something like
ba95a000
RC
279<TABLE><tr><td>&nbsp;</td><td class=display><pre style="font-family: serif"> bzip2 crashed with segmentation fault on my machine
280</pre></td></tr></table>and absolutely nothing else. Needless to say, a such a report
ef6bacff
RC
281is <EM>totally, utterly, completely and comprehensively 100% useless;
282a waste of your time, my time, and net bandwidth</EM>.
283With no details at all, there's no way I can possibly begin
284to figure out what the problem is.
ba95a000 285</P><P>
ef6bacff 286
ef6bacff
RC
287The rules of the game are: facts, facts, facts. Don't omit
288them because "oh, they won't be relevant". At the bare
289minimum:
ba95a000 290<TABLE><tr><td>&nbsp;</td><td class=display><pre style="font-family: serif"> Machine type. Operating system version.
ef6bacff
RC
291 Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>).
292 Exact version of the compiler used.
293 Flags passed to the compiler.
ba95a000 294</pre></td></tr></table>However, the most important single thing that will help me is
ef6bacff
RC
295the file that you were trying to compress or decompress at the
296time the problem happened. Without that, my ability to do anything
297more than speculate about the cause, is limited.
ba95a000 298</P><P>
ef6bacff 299
ef6bacff
RC
300Please remember that I connect to the Internet with a modem, so
301you should contact me before mailing me huge files.
ba95a000
RC
302</P><P>
303
304<HR SIZE="6">
305<A NAME="SEC47"></A>
306<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
307<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC46"> &lt; </A>]</TD>
308<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC48"> &gt; </A>]</TD>
309<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
310<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
311<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
312<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
313<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
314<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
315<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
316</TR></TABLE>
317<H2> 4.4 Did you get the right package? </H2>
318<!--docid::SEC47::-->
ef6bacff 319<P>
ba95a000 320
ef6bacff
RC
321<CODE>bzip2</CODE> is a resource hog. It soaks up large amounts of CPU cycles
322and memory. Also, it gives very large latencies. In the worst case, you
323can feed many megabytes of uncompressed data into the library before
324getting any compressed output, so this probably rules out applications
325requiring interactive behaviour.
ba95a000 326</P><P>
ef6bacff 327
ef6bacff
RC
328These aren't faults of my implementation, I hope, but more
329an intrinsic property of the Burrows-Wheeler transform (unfortunately).
330Maybe this isn't what you want.
ba95a000 331</P><P>
ef6bacff 332
ef6bacff
RC
333If you want a compressor and/or library which is faster, uses less
334memory but gets pretty good compression, and has minimal latency,
335consider Jean-loup
ba95a000 336Gailly's and Mark Adler's work, <CODE>zlib-1.1.3</CODE> and
ef6bacff 337<CODE>gzip-1.2.4</CODE>. Look for them at
ba95a000 338</P><P>
ef6bacff 339
ba95a000 340<CODE>http://www.zlib.org</CODE> and
ef6bacff 341<CODE>http://www.gzip.org</CODE> respectively.
ba95a000 342</P><P>
ef6bacff 343
ef6bacff
RC
344For something faster and lighter still, you might try Markus F X J
345Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at
346<BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>.
ba95a000 347</P><P>
ef6bacff 348
ef6bacff
RC
349If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks
350of data, 64k bytes or smaller, for example on an on-the-fly disk
351compressor, you'd be well advised not to use this library. Instead,
352I've made a special library tuned for that kind of use. It's part of
353<CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux
354<CODE>ext2</CODE> filesystem. Look at
355<CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>.
ba95a000
RC
356</P><P>
357
358<HR SIZE="6">
359<A NAME="SEC48"></A>
360<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
361<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC47"> &lt; </A>]</TD>
362<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC49"> &gt; </A>]</TD>
363<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
364<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
365<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
366<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
367<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
368<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
369<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
370</TR></TABLE>
371<H2> 4.5 Testing </H2>
372<!--docid::SEC48::-->
ef6bacff 373<P>
ba95a000 374
ef6bacff 375A record of the tests I've done.
ba95a000 376</P><P>
ef6bacff 377
ef6bacff 378First, some data sets:
ef6bacff
RC
379<UL>
380<LI>B: a directory containing 6001 files, one for every length in the
ef6bacff
RC
381 range 0 to 6000 bytes. The files contain random lowercase
382 letters. 18.7 megabytes.
383<LI>H: my home directory tree. Documents, source code, mail files,
ef6bacff
RC
384 compressed data. H contains B, and also a directory of
385 files designed as boundary cases for the sorting; mostly very
386 repetitive, nasty files. 565 megabytes.
387<LI>A: directory tree holding various applications built from source:
ef6bacff
RC
388 <CODE>egcs</CODE>, <CODE>gcc-2.8.1</CODE>, KDE, GTK, Octave, etc.
389 2200 megabytes.
390</UL>
ef6bacff
RC
391The tests conducted are as follows. Each test means compressing
392(a copy of) each file in the data set, decompressing it and
393comparing it against the original.
ef6bacff 394<P>
ba95a000 395
ef6bacff
RC
396First, a bunch of tests with block sizes and internal buffer
397sizes set very small,
398to detect any problems with the
399blocking and buffering mechanisms.
400This required modifying the source code so as to try to
401break it.
ef6bacff
RC
402<OL>
403<LI>Data set H, with
ef6bacff
RC
404 buffer size of 1 byte, and block size of 23 bytes.
405<LI>Data set B, buffer sizes 1 byte, block size 1 byte.
ef6bacff 406<LI>As (2) but small-mode decompression.
ef6bacff 407<LI>As (2) with block size 2 bytes.
ef6bacff 408<LI>As (2) with block size 3 bytes.
ef6bacff 409<LI>As (2) with block size 4 bytes.
ef6bacff 410<LI>As (2) with block size 5 bytes.
ef6bacff 411<LI>As (2) with block size 6 bytes and small-mode decompression.
ef6bacff 412<LI>H with buffer size of 1 byte, but normal block
ef6bacff
RC
413 size (up to 900000 bytes).
414</OL>
ef6bacff 415Then some tests with unmodified source code.
ef6bacff
RC
416<OL>
417<LI>H, all settings normal.
ef6bacff 418<LI>As (1), with small-mode decompress.
ef6bacff 419<LI>H, compress with flag <CODE>-1</CODE>.
ef6bacff 420<LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>.
ef6bacff 421<LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing,
ef6bacff
RC
422 <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
423<LI>Backwards compatibility: H, <CODE>bzip2-0.9.5</CODE> compressing,
ef6bacff
RC
424 <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
425<LI>Bigger tests: A, all settings normal.
ef6bacff 426<LI>As (7), using the fallback (Sadakane-like) sorting algorithm.
ef6bacff 427<LI>As (8), compress with flag <CODE>-1</CODE>, decompress with flag
ef6bacff
RC
428 <CODE>-s</CODE>.
429<LI>H, using the fallback sorting algorithm.
ef6bacff 430<LI>Forwards compatibility: A, <CODE>bzip2-0.1pl2</CODE> compressing,
ef6bacff
RC
431 <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
432<LI>Backwards compatibility: A, <CODE>bzip2-0.9.5</CODE> compressing,
ef6bacff
RC
433 <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
434<LI>Misc test: about 400 megabytes of <CODE>.tar</CODE> files with
ef6bacff
RC
435 <CODE>bzip2</CODE> compiled with Checker (a memory access error
436 detector, like Purify).
437<LI>Misc tests to make sure it builds and runs ok on non-Linux/x86
ef6bacff
RC
438 platforms.
439</OL>
ef6bacff
RC
440These tests were conducted on a 225 MHz IDT WinChip machine, running
441Linux 2.0.36. They represent nearly a week of continuous computation.
442All tests completed successfully.
ef6bacff 443<P>
ba95a000
RC
444
445<HR SIZE="6">
446<A NAME="SEC49"></A>
447<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
448<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC48"> &lt; </A>]</TD>
449<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt; ]</TD>
450<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
451<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD>
452<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
453<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
454<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
455<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
456<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
457</TR></TABLE>
458<H2> 4.6 Further reading </H2>
459<!--docid::SEC49::-->
ef6bacff
RC
460<CODE>bzip2</CODE> is not research work, in the sense that it doesn't present
461any new ideas. Rather, it's an engineering exercise based on existing
462ideas.
ef6bacff 463<P>
ef6bacff 464
ba95a000
RC
465Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>:
466<TABLE><tr><td>&nbsp;</td><td class=example><pre>Michael Burrows and D. J. Wheeler:
ef6bacff
RC
467 "A block-sorting lossless data compression algorithm"
468 10th May 1994.
469 Digital SRC Research Report 124.
470 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
471 If you have trouble finding it, try searching at the
472 New Zealand Digital Library, http://www.nzdl.org.
473
474Daniel S. Hirschberg and Debra A. LeLewer
475 "Efficient Decoding of Prefix Codes"
476 Communications of the ACM, April 1990, Vol 33, Number 4.
477 You might be able to get an electronic copy of this
478 from the ACM Digital Library.
479
480David J. Wheeler
481 Program bred3.c and accompanying document bred3.ps.
482 This contains the idea behind the multi-table Huffman
483 coding scheme.
484 ftp://ftp.cl.cam.ac.uk/users/djw3/
485
486Jon L. Bentley and Robert Sedgewick
487 "Fast Algorithms for Sorting and Searching Strings"
488 Available from Sedgewick's web page,
489 www.cs.princeton.edu/~rs
ba95a000 490</pre></td></tr></table>The following paper gives valuable additional insights into the
ef6bacff
RC
491algorithm, but is not immediately the basis of any code
492used in bzip2.
ba95a000 493<TABLE><tr><td>&nbsp;</td><td class=example><pre>Peter Fenwick:
ef6bacff
RC
494 Block Sorting Text Compression
495 Proceedings of the 19th Australasian Computer Science Conference,
496 Melbourne, Australia. Jan 31 - Feb 2, 1996.
497 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
ba95a000 498</pre></td></tr></table>Kunihiko Sadakane's sorting algorithm, mentioned above,
ef6bacff 499is available from:
ba95a000
RC
500<TABLE><tr><td>&nbsp;</td><td class=example><pre>http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
501</pre></td></tr></table>The Manber-Myers suffix array construction
ef6bacff
RC
502algorithm is described in a paper
503available from:
ba95a000
RC
504<TABLE><tr><td>&nbsp;</td><td class=example><pre>http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
505</pre></td></tr></table>Finally, the following paper documents some recent investigations
ef6bacff 506I made into the performance of sorting algorithms:
ba95a000 507<TABLE><tr><td>&nbsp;</td><td class=example><pre>Julian Seward:
ef6bacff
RC
508 On the Performance of BWT Sorting Algorithms
509 Proceedings of the IEEE Data Compression Conference 2000
510 Snowbird, Utah. 28-30 March 2000.
ba95a000
RC
511</pre></td></tr></table></P><P>
512
513<HR SIZE="6">
514<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
515<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[ &lt;&lt; ]</TD>
516<TD VALIGN="MIDDLE" ALIGN="LEFT">[ &gt;&gt; ]</TD>
517<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD>
518<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD>
519<TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD>
520<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD>
521</TR></TABLE>
522<BR>
523<FONT SIZE="-1">
524This document was generated
525by <I>Julian Seward</I> on <I>January, 5 2002</I>
526using <A HREF="http://www.mathematik.uni-kl.de/~obachman/Texi2html
527"><I>texi2html</I></A>
ef6bacff 528
ef6bacff
RC
529</BODY>
530</HTML>
This page took 0.078929 seconds and 5 git commands to generate.