]>
Commit | Line | Data |
---|---|---|
ef6bacff | 1 | <HTML> |
ba95a000 RC |
2 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
3 | <!-- Created on January, 5 2002 by texi2html 1.64 --> | |
4 | <!-- | |
5 | Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author) | |
6 | Karl Berry <karl@freefriends.org> | |
7 | Olaf Bachmann <obachman@mathematik.uni-kl.de> | |
8 | and many others. | |
9 | Maintained by: Olaf Bachmann <obachman@mathematik.uni-kl.de> | |
10 | Send bugs and suggestions to <texi2html@mathematik.uni-kl.de> | |
11 | ||
12 | --> | |
ef6bacff | 13 | <HEAD> |
ba95a000 | 14 | <TITLE>Untitled Document: 4. Miscellanea</TITLE> |
ef6bacff | 15 | |
ba95a000 RC |
16 | <META NAME="description" CONTENT="Untitled Document: 4. Miscellanea"> |
17 | <META NAME="keywords" CONTENT="Untitled Document: 4. Miscellanea"> | |
18 | <META NAME="resource-type" CONTENT="document"> | |
19 | <META NAME="distribution" CONTENT="global"> | |
20 | <META NAME="Generator" CONTENT="texi2html 1.64"> | |
ef6bacff RC |
21 | |
22 | </HEAD> | |
ef6bacff | 23 | |
ba95a000 | 24 | <BODY LANG="" BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#800080" ALINK="#FF0000"> |
ef6bacff | 25 | |
ba95a000 RC |
26 | <A NAME="SEC43"></A> |
27 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
28 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_3.html#SEC42"> < </A>]</TD> | |
29 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC44"> > </A>]</TD> | |
30 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
31 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD> | |
32 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
33 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
34 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
35 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
36 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
37 | </TR></TABLE> | |
38 | <H1> 4. Miscellanea </H1> | |
39 | <!--docid::SEC43::--> | |
ef6bacff | 40 | <P> |
ba95a000 | 41 | |
ef6bacff RC |
42 | These are just some random thoughts of mine. Your mileage may |
43 | vary. | |
ba95a000 RC |
44 | </P><P> |
45 | ||
46 | <HR SIZE="6"> | |
47 | <A NAME="SEC44"></A> | |
48 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
49 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC43"> < </A>]</TD> | |
50 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC45"> > </A>]</TD> | |
51 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
52 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD> | |
53 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
54 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
55 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
56 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
57 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
58 | </TR></TABLE> | |
59 | <H2> 4.1 Limitations of the compressed file format </H2> | |
60 | <!--docid::SEC44::--> | |
ef6bacff RC |
61 | <CODE>bzip2-1.0</CODE>, <CODE>0.9.5</CODE> and <CODE>0.9.0</CODE> |
62 | use exactly the same file format as the previous | |
63 | version, <CODE>bzip2-0.1</CODE>. This decision was made in the interests of | |
64 | stability. Creating yet another incompatible compressed file format | |
65 | would create further confusion and disruption for users. | |
ef6bacff | 66 | <P> |
ba95a000 | 67 | |
ef6bacff RC |
68 | Nevertheless, this is not a painless decision. Development |
69 | work since the release of <CODE>bzip2-0.1</CODE> in August 1997 | |
70 | has shown complexities in the file format which slow down | |
71 | decompression and, in retrospect, are unnecessary. These are: | |
ef6bacff RC |
72 | <UL> |
73 | <LI>The run-length encoder, which is the first of the | |
ef6bacff RC |
74 | compression transformations, is entirely irrelevant. |
75 | The original purpose was to protect the sorting algorithm | |
76 | from the very worst case input: a string of repeated | |
77 | symbols. But algorithm steps Q6a and Q6b in the original | |
78 | Burrows-Wheeler technical report (SRC-124) show how | |
79 | repeats can be handled without difficulty in block | |
80 | sorting. | |
81 | <LI>The randomisation mechanism doesn't really need to be | |
ef6bacff RC |
82 | there. Udi Manber and Gene Myers published a suffix |
83 | array construction algorithm a few years back, which | |
84 | can be employed to sort any block, no matter how | |
85 | repetitive, in O(N log N) time. Subsequent work by | |
86 | Kunihiko Sadakane has produced a derivative O(N (log N)^2) | |
87 | algorithm which usually outperforms the Manber-Myers | |
88 | algorithm. | |
ba95a000 | 89 | <P> |
ef6bacff RC |
90 | |
91 | I could have changed to Sadakane's algorithm, but I find | |
92 | it to be slower than <CODE>bzip2</CODE>'s existing algorithm for | |
93 | most inputs, and the randomisation mechanism protects | |
94 | adequately against bad cases. I didn't think it was | |
95 | a good tradeoff to make. Partly this is due to the fact | |
96 | that I was not flooded with email complaints about | |
97 | <CODE>bzip2-0.1</CODE>'s performance on repetitive data, so | |
98 | perhaps it isn't a problem for real inputs. | |
ba95a000 | 99 | </P><P> |
ef6bacff RC |
100 | |
101 | Probably the best long-term solution, | |
102 | and the one I have incorporated into 0.9.5 and above, | |
103 | is to use the existing sorting | |
104 | algorithm initially, and fall back to a O(N (log N)^2) | |
105 | algorithm if the standard algorithm gets into difficulties. | |
106 | <LI>The compressed file format was never designed to be | |
ef6bacff RC |
107 | handled by a library, and I have had to jump though |
108 | some hoops to produce an efficient implementation of | |
109 | decompression. It's a bit hairy. Try passing | |
110 | <CODE>decompress.c</CODE> through the C preprocessor | |
111 | and you'll see what I mean. Much of this complexity | |
112 | could have been avoided if the compressed size of | |
113 | each block of data was recorded in the data stream. | |
114 | <LI>An Adler-32 checksum, rather than a CRC32 checksum, | |
ef6bacff RC |
115 | would be faster to compute. |
116 | </UL> | |
ef6bacff RC |
117 | It would be fair to say that the <CODE>bzip2</CODE> format was frozen |
118 | before I properly and fully understood the performance | |
119 | consequences of doing so. | |
ef6bacff | 120 | <P> |
ba95a000 | 121 | |
ef6bacff RC |
122 | Improvements which I was able to incorporate into |
123 | 0.9.0, despite using the same file format, are: | |
ef6bacff RC |
124 | <UL> |
125 | <LI>Single array implementation of the inverse BWT. This | |
ef6bacff RC |
126 | significantly speeds up decompression, presumably |
127 | because it reduces the number of cache misses. | |
128 | <LI>Faster inverse MTF transform for large MTF values. The | |
ef6bacff RC |
129 | new implementation is based on the notion of sliding blocks |
130 | of values. | |
131 | <LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE> | |
ef6bacff RC |
132 | and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>. |
133 | Duh! Well, you live and learn. | |
ba95a000 | 134 | <P> |
ef6bacff RC |
135 | |
136 | </UL> | |
ef6bacff RC |
137 | Further ahead, it would be nice |
138 | to be able to do random access into files. This will | |
139 | require some careful design of compressed file formats. | |
ef6bacff | 140 | <P> |
ba95a000 RC |
141 | |
142 | <HR SIZE="6"> | |
143 | <A NAME="SEC45"></A> | |
144 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
145 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC44"> < </A>]</TD> | |
146 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC46"> > </A>]</TD> | |
147 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
148 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD> | |
149 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
150 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
151 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
152 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
153 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
154 | </TR></TABLE> | |
155 | <H2> 4.2 Portability issues </H2> | |
156 | <!--docid::SEC45::--> | |
ef6bacff RC |
157 | After some consideration, I have decided not to use |
158 | GNU <CODE>autoconf</CODE> to configure 0.9.5 or 1.0. | |
ef6bacff | 159 | <P> |
ba95a000 | 160 | |
ef6bacff RC |
161 | <CODE>autoconf</CODE>, admirable and wonderful though it is, |
162 | mainly assists with portability problems between Unix-like | |
163 | platforms. But <CODE>bzip2</CODE> doesn't have much in the way | |
164 | of portability problems on Unix; most of the difficulties appear | |
165 | when porting to the Mac, or to Microsoft's operating systems. | |
166 | <CODE>autoconf</CODE> doesn't help in those cases, and brings in a | |
167 | whole load of new complexity. | |
ba95a000 | 168 | </P><P> |
ef6bacff | 169 | |
ef6bacff RC |
170 | Most people should be able to compile the library and program |
171 | under Unix straight out-of-the-box, so to speak, especially | |
172 | if you have a version of GNU C available. | |
ba95a000 | 173 | </P><P> |
ef6bacff | 174 | |
ef6bacff RC |
175 | There are a couple of <CODE>__inline__</CODE> directives in the code. GNU C |
176 | (<CODE>gcc</CODE>) should be able to handle them. If you're not using | |
177 | GNU C, your C compiler shouldn't see them at all. | |
178 | If your compiler does, for some reason, see them and doesn't | |
179 | like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be <CODE>/* */</CODE>. One | |
180 | easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>, | |
181 | which should be understood by most Unix compilers. | |
ba95a000 | 182 | </P><P> |
ef6bacff | 183 | |
ef6bacff RC |
184 | If you still have difficulties, try compiling with the macro |
185 | <CODE>BZ_STRICT_ANSI</CODE> defined. This should enable you to build the | |
186 | library in a strictly ANSI compliant environment. Building the program | |
187 | itself like this is dangerous and not supported, since you remove | |
188 | <CODE>bzip2</CODE>'s checks against compressing directories, symbolic links, | |
189 | devices, and other not-really-a-file entities. This could cause | |
190 | filesystem corruption! | |
ba95a000 | 191 | </P><P> |
ef6bacff | 192 | |
ef6bacff RC |
193 | One other thing: if you create a <CODE>bzip2</CODE> binary for public |
194 | distribution, please try and link it statically (<CODE>gcc -s</CODE>). This | |
195 | avoids all sorts of library-version issues that others may encounter | |
196 | later on. | |
ba95a000 | 197 | </P><P> |
ef6bacff | 198 | |
ef6bacff RC |
199 | If you build <CODE>bzip2</CODE> on Win32, you must set <CODE>BZ_UNIX</CODE> to 0 and |
200 | <CODE>BZ_LCCWIN32</CODE> to 1, in the file <CODE>bzip2.c</CODE>, before compiling. | |
201 | Otherwise the resulting binary won't work correctly. | |
ba95a000 RC |
202 | </P><P> |
203 | ||
204 | <HR SIZE="6"> | |
205 | <A NAME="SEC46"></A> | |
206 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
207 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC45"> < </A>]</TD> | |
208 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC47"> > </A>]</TD> | |
209 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
210 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD> | |
211 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
212 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
213 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
214 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
215 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
216 | </TR></TABLE> | |
217 | <H2> 4.3 Reporting bugs </H2> | |
218 | <!--docid::SEC46::--> | |
ef6bacff RC |
219 | I tried pretty hard to make sure <CODE>bzip2</CODE> is |
220 | bug free, both by design and by testing. Hopefully | |
221 | you'll never need to read this section for real. | |
ef6bacff | 222 | <P> |
ba95a000 | 223 | |
ef6bacff RC |
224 | Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation |
225 | fault, a bus error or an internal assertion failure, it | |
226 | will ask you to email me a bug report. Experience with | |
227 | version 0.1 shows that almost all these problems can | |
228 | be traced to either compiler bugs or hardware problems. | |
ef6bacff RC |
229 | <UL> |
230 | <LI> | |
ef6bacff RC |
231 | Recompile the program with no optimisation, and see if it |
232 | works. And/or try a different compiler. | |
233 | I heard all sorts of stories about various flavours | |
234 | of GNU C (and other compilers) generating bad code for | |
235 | <CODE>bzip2</CODE>, and I've run across two such examples myself. | |
ba95a000 | 236 | <P> |
ef6bacff RC |
237 | |
238 | 2.7.X versions of GNU C are known to generate bad code from | |
239 | time to time, at high optimisation levels. | |
240 | If you get problems, try using the flags | |
241 | <CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>. | |
242 | You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>. | |
ba95a000 | 243 | </P><P> |
ef6bacff RC |
244 | |
245 | You may notice that the Makefile runs six tests as part of | |
246 | the build process. If the program passes all of these, it's | |
247 | a pretty good (but not 100%) indication that the compiler has | |
248 | done its job correctly. | |
249 | <LI> | |
ef6bacff RC |
250 | If <CODE>bzip2</CODE> crashes randomly, and the crashes are not |
251 | repeatable, you may have a flaky memory subsystem. <CODE>bzip2</CODE> | |
252 | really hammers your memory hierarchy, and if it's a bit marginal, | |
253 | you may get these problems. Ditto if your disk or I/O subsystem | |
254 | is slowly failing. Yup, this really does happen. | |
ba95a000 | 255 | <P> |
ef6bacff RC |
256 | |
257 | Try using a different machine of the same type, and see if | |
258 | you can repeat the problem. | |
259 | <LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells | |
ef6bacff RC |
260 | you your file is corrupted on decompression, and you |
261 | obtained the file via FTP, there is a possibility that you | |
262 | forgot to tell FTP to do a binary mode transfer. That absolutely | |
263 | will cause the file to be non-decompressible. You'll have to transfer | |
264 | it again. | |
265 | </UL> | |
ef6bacff | 266 | <P> |
ba95a000 | 267 | |
ef6bacff RC |
268 | If you've incorporated <CODE>libbzip2</CODE> into your own program |
269 | and are getting problems, please, please, please, check that the | |
270 | parameters you are passing in calls to the library, are | |
271 | correct, and in accordance with what the documentation says | |
272 | is allowable. I have tried to make the library robust against | |
273 | such problems, but I'm sure I haven't succeeded. | |
ba95a000 | 274 | </P><P> |
ef6bacff | 275 | |
ef6bacff RC |
276 | Finally, if the above comments don't help, you'll have to send |
277 | me a bug report. Now, it's just amazing how many people will | |
278 | send me a bug report saying something like | |
ba95a000 RC |
279 | <TABLE><tr><td> </td><td class=display><pre style="font-family: serif"> bzip2 crashed with segmentation fault on my machine |
280 | </pre></td></tr></table>and absolutely nothing else. Needless to say, a such a report | |
ef6bacff RC |
281 | is <EM>totally, utterly, completely and comprehensively 100% useless; |
282 | a waste of your time, my time, and net bandwidth</EM>. | |
283 | With no details at all, there's no way I can possibly begin | |
284 | to figure out what the problem is. | |
ba95a000 | 285 | </P><P> |
ef6bacff | 286 | |
ef6bacff RC |
287 | The rules of the game are: facts, facts, facts. Don't omit |
288 | them because "oh, they won't be relevant". At the bare | |
289 | minimum: | |
ba95a000 | 290 | <TABLE><tr><td> </td><td class=display><pre style="font-family: serif"> Machine type. Operating system version. |
ef6bacff RC |
291 | Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>). |
292 | Exact version of the compiler used. | |
293 | Flags passed to the compiler. | |
ba95a000 | 294 | </pre></td></tr></table>However, the most important single thing that will help me is |
ef6bacff RC |
295 | the file that you were trying to compress or decompress at the |
296 | time the problem happened. Without that, my ability to do anything | |
297 | more than speculate about the cause, is limited. | |
ba95a000 | 298 | </P><P> |
ef6bacff | 299 | |
ef6bacff RC |
300 | Please remember that I connect to the Internet with a modem, so |
301 | you should contact me before mailing me huge files. | |
ba95a000 RC |
302 | </P><P> |
303 | ||
304 | <HR SIZE="6"> | |
305 | <A NAME="SEC47"></A> | |
306 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
307 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC46"> < </A>]</TD> | |
308 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC48"> > </A>]</TD> | |
309 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
310 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD> | |
311 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
312 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
313 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
314 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
315 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
316 | </TR></TABLE> | |
317 | <H2> 4.4 Did you get the right package? </H2> | |
318 | <!--docid::SEC47::--> | |
ef6bacff | 319 | <P> |
ba95a000 | 320 | |
ef6bacff RC |
321 | <CODE>bzip2</CODE> is a resource hog. It soaks up large amounts of CPU cycles |
322 | and memory. Also, it gives very large latencies. In the worst case, you | |
323 | can feed many megabytes of uncompressed data into the library before | |
324 | getting any compressed output, so this probably rules out applications | |
325 | requiring interactive behaviour. | |
ba95a000 | 326 | </P><P> |
ef6bacff | 327 | |
ef6bacff RC |
328 | These aren't faults of my implementation, I hope, but more |
329 | an intrinsic property of the Burrows-Wheeler transform (unfortunately). | |
330 | Maybe this isn't what you want. | |
ba95a000 | 331 | </P><P> |
ef6bacff | 332 | |
ef6bacff RC |
333 | If you want a compressor and/or library which is faster, uses less |
334 | memory but gets pretty good compression, and has minimal latency, | |
335 | consider Jean-loup | |
ba95a000 | 336 | Gailly's and Mark Adler's work, <CODE>zlib-1.1.3</CODE> and |
ef6bacff | 337 | <CODE>gzip-1.2.4</CODE>. Look for them at |
ba95a000 | 338 | </P><P> |
ef6bacff | 339 | |
ba95a000 | 340 | <CODE>http://www.zlib.org</CODE> and |
ef6bacff | 341 | <CODE>http://www.gzip.org</CODE> respectively. |
ba95a000 | 342 | </P><P> |
ef6bacff | 343 | |
ef6bacff RC |
344 | For something faster and lighter still, you might try Markus F X J |
345 | Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at | |
346 | <BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>. | |
ba95a000 | 347 | </P><P> |
ef6bacff | 348 | |
ef6bacff RC |
349 | If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks |
350 | of data, 64k bytes or smaller, for example on an on-the-fly disk | |
351 | compressor, you'd be well advised not to use this library. Instead, | |
352 | I've made a special library tuned for that kind of use. It's part of | |
353 | <CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux | |
354 | <CODE>ext2</CODE> filesystem. Look at | |
355 | <CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>. | |
ba95a000 RC |
356 | </P><P> |
357 | ||
358 | <HR SIZE="6"> | |
359 | <A NAME="SEC48"></A> | |
360 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
361 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC47"> < </A>]</TD> | |
362 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC49"> > </A>]</TD> | |
363 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
364 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD> | |
365 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
366 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
367 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
368 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
369 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
370 | </TR></TABLE> | |
371 | <H2> 4.5 Testing </H2> | |
372 | <!--docid::SEC48::--> | |
ef6bacff | 373 | <P> |
ba95a000 | 374 | |
ef6bacff | 375 | A record of the tests I've done. |
ba95a000 | 376 | </P><P> |
ef6bacff | 377 | |
ef6bacff | 378 | First, some data sets: |
ef6bacff RC |
379 | <UL> |
380 | <LI>B: a directory containing 6001 files, one for every length in the | |
ef6bacff RC |
381 | range 0 to 6000 bytes. The files contain random lowercase |
382 | letters. 18.7 megabytes. | |
383 | <LI>H: my home directory tree. Documents, source code, mail files, | |
ef6bacff RC |
384 | compressed data. H contains B, and also a directory of |
385 | files designed as boundary cases for the sorting; mostly very | |
386 | repetitive, nasty files. 565 megabytes. | |
387 | <LI>A: directory tree holding various applications built from source: | |
ef6bacff RC |
388 | <CODE>egcs</CODE>, <CODE>gcc-2.8.1</CODE>, KDE, GTK, Octave, etc. |
389 | 2200 megabytes. | |
390 | </UL> | |
ef6bacff RC |
391 | The tests conducted are as follows. Each test means compressing |
392 | (a copy of) each file in the data set, decompressing it and | |
393 | comparing it against the original. | |
ef6bacff | 394 | <P> |
ba95a000 | 395 | |
ef6bacff RC |
396 | First, a bunch of tests with block sizes and internal buffer |
397 | sizes set very small, | |
398 | to detect any problems with the | |
399 | blocking and buffering mechanisms. | |
400 | This required modifying the source code so as to try to | |
401 | break it. | |
ef6bacff RC |
402 | <OL> |
403 | <LI>Data set H, with | |
ef6bacff RC |
404 | buffer size of 1 byte, and block size of 23 bytes. |
405 | <LI>Data set B, buffer sizes 1 byte, block size 1 byte. | |
ef6bacff | 406 | <LI>As (2) but small-mode decompression. |
ef6bacff | 407 | <LI>As (2) with block size 2 bytes. |
ef6bacff | 408 | <LI>As (2) with block size 3 bytes. |
ef6bacff | 409 | <LI>As (2) with block size 4 bytes. |
ef6bacff | 410 | <LI>As (2) with block size 5 bytes. |
ef6bacff | 411 | <LI>As (2) with block size 6 bytes and small-mode decompression. |
ef6bacff | 412 | <LI>H with buffer size of 1 byte, but normal block |
ef6bacff RC |
413 | size (up to 900000 bytes). |
414 | </OL> | |
ef6bacff | 415 | Then some tests with unmodified source code. |
ef6bacff RC |
416 | <OL> |
417 | <LI>H, all settings normal. | |
ef6bacff | 418 | <LI>As (1), with small-mode decompress. |
ef6bacff | 419 | <LI>H, compress with flag <CODE>-1</CODE>. |
ef6bacff | 420 | <LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>. |
ef6bacff | 421 | <LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing, |
ef6bacff RC |
422 | <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal. |
423 | <LI>Backwards compatibility: H, <CODE>bzip2-0.9.5</CODE> compressing, | |
ef6bacff RC |
424 | <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal. |
425 | <LI>Bigger tests: A, all settings normal. | |
ef6bacff | 426 | <LI>As (7), using the fallback (Sadakane-like) sorting algorithm. |
ef6bacff | 427 | <LI>As (8), compress with flag <CODE>-1</CODE>, decompress with flag |
ef6bacff RC |
428 | <CODE>-s</CODE>. |
429 | <LI>H, using the fallback sorting algorithm. | |
ef6bacff | 430 | <LI>Forwards compatibility: A, <CODE>bzip2-0.1pl2</CODE> compressing, |
ef6bacff RC |
431 | <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal. |
432 | <LI>Backwards compatibility: A, <CODE>bzip2-0.9.5</CODE> compressing, | |
ef6bacff RC |
433 | <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal. |
434 | <LI>Misc test: about 400 megabytes of <CODE>.tar</CODE> files with | |
ef6bacff RC |
435 | <CODE>bzip2</CODE> compiled with Checker (a memory access error |
436 | detector, like Purify). | |
437 | <LI>Misc tests to make sure it builds and runs ok on non-Linux/x86 | |
ef6bacff RC |
438 | platforms. |
439 | </OL> | |
ef6bacff RC |
440 | These tests were conducted on a 225 MHz IDT WinChip machine, running |
441 | Linux 2.0.36. They represent nearly a week of continuous computation. | |
442 | All tests completed successfully. | |
ef6bacff | 443 | <P> |
ba95a000 RC |
444 | |
445 | <HR SIZE="6"> | |
446 | <A NAME="SEC49"></A> | |
447 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
448 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_4.html#SEC48"> < </A>]</TD> | |
449 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ > ]</TD> | |
450 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
451 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top"> Up </A>]</TD> | |
452 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
453 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
454 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
455 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
456 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
457 | </TR></TABLE> | |
458 | <H2> 4.6 Further reading </H2> | |
459 | <!--docid::SEC49::--> | |
ef6bacff RC |
460 | <CODE>bzip2</CODE> is not research work, in the sense that it doesn't present |
461 | any new ideas. Rather, it's an engineering exercise based on existing | |
462 | ideas. | |
ef6bacff | 463 | <P> |
ef6bacff | 464 | |
ba95a000 RC |
465 | Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>: |
466 | <TABLE><tr><td> </td><td class=example><pre>Michael Burrows and D. J. Wheeler: | |
ef6bacff RC |
467 | "A block-sorting lossless data compression algorithm" |
468 | 10th May 1994. | |
469 | Digital SRC Research Report 124. | |
470 | ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz | |
471 | If you have trouble finding it, try searching at the | |
472 | New Zealand Digital Library, http://www.nzdl.org. | |
473 | ||
474 | Daniel S. Hirschberg and Debra A. LeLewer | |
475 | "Efficient Decoding of Prefix Codes" | |
476 | Communications of the ACM, April 1990, Vol 33, Number 4. | |
477 | You might be able to get an electronic copy of this | |
478 | from the ACM Digital Library. | |
479 | ||
480 | David J. Wheeler | |
481 | Program bred3.c and accompanying document bred3.ps. | |
482 | This contains the idea behind the multi-table Huffman | |
483 | coding scheme. | |
484 | ftp://ftp.cl.cam.ac.uk/users/djw3/ | |
485 | ||
486 | Jon L. Bentley and Robert Sedgewick | |
487 | "Fast Algorithms for Sorting and Searching Strings" | |
488 | Available from Sedgewick's web page, | |
489 | www.cs.princeton.edu/~rs | |
ba95a000 | 490 | </pre></td></tr></table>The following paper gives valuable additional insights into the |
ef6bacff RC |
491 | algorithm, but is not immediately the basis of any code |
492 | used in bzip2. | |
ba95a000 | 493 | <TABLE><tr><td> </td><td class=example><pre>Peter Fenwick: |
ef6bacff RC |
494 | Block Sorting Text Compression |
495 | Proceedings of the 19th Australasian Computer Science Conference, | |
496 | Melbourne, Australia. Jan 31 - Feb 2, 1996. | |
497 | ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps | |
ba95a000 | 498 | </pre></td></tr></table>Kunihiko Sadakane's sorting algorithm, mentioned above, |
ef6bacff | 499 | is available from: |
ba95a000 RC |
500 | <TABLE><tr><td> </td><td class=example><pre>http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz |
501 | </pre></td></tr></table>The Manber-Myers suffix array construction | |
ef6bacff RC |
502 | algorithm is described in a paper |
503 | available from: | |
ba95a000 RC |
504 | <TABLE><tr><td> </td><td class=example><pre>http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps |
505 | </pre></td></tr></table>Finally, the following paper documents some recent investigations | |
ef6bacff | 506 | I made into the performance of sorting algorithms: |
ba95a000 | 507 | <TABLE><tr><td> </td><td class=example><pre>Julian Seward: |
ef6bacff RC |
508 | On the Performance of BWT Sorting Algorithms |
509 | Proceedings of the IEEE Data Compression Conference 2000 | |
510 | Snowbird, Utah. 28-30 March 2000. | |
ba95a000 RC |
511 | </pre></td></tr></table></P><P> |
512 | ||
513 | <HR SIZE="6"> | |
514 | <TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0> | |
515 | <TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[ << ]</TD> | |
516 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[ >> ]</TD> | |
517 | <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT"> <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual.html#SEC_Top">Top</A>]</TD> | |
518 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_toc.html#SEC_Contents">Contents</A>]</TD> | |
519 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[Index]</TD> | |
520 | <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="manual_abt.html#SEC_About"> ? </A>]</TD> | |
521 | </TR></TABLE> | |
522 | <BR> | |
523 | <FONT SIZE="-1"> | |
524 | This document was generated | |
525 | by <I>Julian Seward</I> on <I>January, 5 2002</I> | |
526 | using <A HREF="http://www.mathematik.uni-kl.de/~obachman/Texi2html | |
527 | "><I>texi2html</I></A> | |
ef6bacff | 528 | |
ef6bacff RC |
529 | </BODY> |
530 | </HTML> |