This is the mail archive of the gdb-patches@sources.redhat.com mailing list for the GDB project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH RFC] Character set support


On Sep 19,  2:38pm, Jim Blandy wrote:

> Kevin, I've committed the following paragraph to the Red Hat internal
> sources:
> 
>     Note that these are all single-byte character sets.  More work inside
>     GDB is needed to support multi-byte or variable-width character
>     encodings, like the UTF-8 and UCS-2 encodings of Unicode.
> 
> Could you re-create the patch?

Below is a new patch for the affected file.  I'll commit everything
but this change shortly and wait for Eli to comment on the doc additions
below.

Kevin

gdb/doc/ChangeLog:
2002-MM-DD  Jim Blandy  <jimb@redhat.com>

	* gdb.texinfo: Add character set documentation.


Index: doc/gdb.texinfo
===================================================================
RCS file: /cvs/src/src/gdb/doc/gdb.texinfo,v
retrieving revision 1.120
diff -u -p -r1.120 gdb.texinfo
--- doc/gdb.texinfo	5 Sep 2002 12:13:08 -0000	1.120
+++ doc/gdb.texinfo	19 Sep 2002 20:18:09 -0000
@@ -4421,6 +4421,8 @@ Table}.
 * Vector Unit::                 Vector Unit
 * Memory Region Attributes::    Memory region attributes
 * Dump/Restore Files::          Copy between memory and a file
+* Character Sets::              Debugging programs that use a different
+                                character set than GDB does
 @end menu
 
 @node Expressions
@@ -5806,6 +5808,254 @@ These offsets are relative to the addres
 the @var{bias} argument is applied.
 
 @end table
+
+@node Character Sets
+@section Character Sets
+@cindex character sets
+@cindex charset
+@cindex translating between character sets
+@cindex host character set
+@cindex target character set
+
+If the program you are debugging uses a different character set to
+represent characters and strings than the one @value{GDBN} uses itself,
+@value{GDBN} can automatically translate between the character sets for
+you.  The character set @value{GDBN} uses we call the @dfn{host
+character set}; the one the inferior program uses we call the
+@dfn{target character set}.
+
+For example, if you are running @value{GDBN} on a Linux system, which
+uses the ISO Latin 1 character set, but you are using @value{GDBN}'s
+remote protocol (@pxref{Remote,Remote Debugging}) to debug a program
+running on an IBM mainframe, which uses the @sc{ebcdic} character set,
+then the host character set is Latin-1, and the target character set is
+@sc{ebcdic}.  If you give @value{GDBN} the command @code{set
+target-charset ebcdic-us}, then @value{GDBN} translates between
+@sc{ebcdic} and Latin 1 as you print character or string values, or use
+character and string literals in expressions.
+
+@value{GDBN} has no way to automatically recognize which character set
+the inferior program uses; you must tell it, using the @code{set
+target-charset} command, described below.
+
+Here are the commands for controlling @value{GDBN}'s character set
+support:
+
+@table @code
+@item set target-charset @var{charset}
+@kindex set target-charset
+Set the current target character set to @var{charset}.  We list the
+character set names @value{GDBN} recognizes below, but if you invoke the
+@code{set target-charset} command with no argument, @value{GDBN} lists
+the character sets it supports.
+@end table
+
+@table @code
+@item set host-charset @var{charset}
+@kindex set host-charset
+Set the current host character set to @var{charset}.
+
+By default, @value{GDBN} uses a host character set appropriate to the
+system it is running on; you can override that default using the
+@code{set host-charset} command.
+
+@value{GDBN} can only use certain character sets as its host character
+set.  We list the character set names @value{GDBN} recognizes below, and
+indicate which can be host character sets, but if you invoke the
+@code{set host-charset} command with no argument, @value{GDBN} lists the
+character sets it supports, placing an asterisk (@samp{*}) after those
+it can use as a host character set.
+
+@item set charset @var{charset}
+@kindex set charset
+Set the current host and target character sets to @var{charset}.  If you
+invoke the @code{set charset} command with no argument, it lists the
+character sets it supports.  @value{GDBN} can only use certain character
+sets as its host character set; it marks those in the list with an
+asterisk (@samp{*}).
+
+@item show charset
+@itemx show host-charset
+@itemx show target-charset
+@kindex show charset
+@kindex show host-charset
+@kindex show target-charset
+Show the current host and target charsets.  The @code{show host-charset}
+and @code{show target-charset} commands are synonyms for @code{show
+charset}.
+
+@end table
+
+@value{GDBN} currently includes support for the following character
+sets:
+
+@table @code
+
+@item ASCII
+@cindex ASCII character set
+Seven-bit U.S. @sc{ascii}.  @value{GDBN} can use this as its host
+character set.
+
+@item ISO-8859-1
+@cindex ISO 8859-1 character set
+@cindex ISO Latin 1 character set
+The ISO Latin 1 character set.  This extends ASCII with accented
+characters needed for French, German, and Spanish.  @value{GDBN} can use
+this as its host character set.
+
+@item EBCDIC-US
+@itemx IBM1047
+@cindex EBCDIC character set
+@cindex IBM1047 character set
+Variants of the @sc{ebcdic} character set, used on some of IBM's
+mainframe operating systems.  (Linux on the S/390 uses U.S. @sc{ascii}.)
+@value{GDBN} cannot use these as its host character set.
+
+@end table
+
+Note that these are all single-byte character sets.  More work inside
+GDB is needed to support multi-byte or variable-width character
+encodings, like the UTF-8 and UCS-2 encodings of Unicode.
+
+Here is an example of @value{GDBN}'s character set support in action.
+Assume that the following source code has been placed in the file
+@file{charset-test.c}:
+
+@example
+#include <stdio.h>
+
+char ascii_hello[]
+  = @{72, 101, 108, 108, 111, 44, 32, 119,
+     111, 114, 108, 100, 33, 10, 0@};
+char ibm1047_hello[]
+  = @{200, 133, 147, 147, 150, 107, 64, 166,
+     150, 153, 147, 132, 90, 37, 0@};
+
+main ()
+@{
+  printf ("Hello, world!\n");
+@}
+@end example
+
+In this program, @code{ascii_hello} and @code{ibm1047_hello} are arrays
+containing the string @samp{Hello, world!} followed by a newline,
+encoded in the @sc{ascii} and @sc{ibm1047} character sets.
+
+We compile the program, and invoke the debugger on it:
+
+@example
+$ gcc -g charset-test.c -o charset-test
+$ gdb -nw charset-test
+GNU gdb 2001-12-19-cvs
+Copyright 2001 Free Software Foundation, Inc.
+@dots{}
+(gdb) 
+@end example
+
+We can use the @code{show charset} command to see what character sets
+@value{GDBN} is currently using to interpret and display characters and
+strings:
+
+@example
+(gdb) show charset
+The current host and target character set is `iso-8859-1'.
+(gdb) 
+@end example
+
+For the sake of printing this manual, let's use @sc{ascii} as our
+initial character set:
+@example
+(gdb) set charset ascii
+(gdb) show charset
+The current host and target character set is `ascii'.
+(gdb) 
+@end example
+
+Let's assume that @sc{ascii} is indeed the correct character set for our
+host system --- in other words, let's assume that if @value{GDBN} prints
+characters using the @sc{ascii} character set, our terminal will display
+them properly.  Since our current target character set is also
+@sc{ascii}, the contents of @code{ascii_hello} print legibly:
+
+@example
+(gdb) print ascii_hello
+$1 = 0x401698 "Hello, world!\n"
+(gdb) print ascii_hello[0]
+$2 = 72 'H'
+(gdb) 
+@end example
+
+@value{GDBN} uses the target character set for character and string
+literals you use in expressions:
+
+@example
+(gdb) print '+'
+$3 = 43 '+'
+(gdb) 
+@end example
+
+The @sc{ascii} character set uses the number 43 to encode the @samp{+}
+character.
+
+@value{GDBN} relies on the user to tell it which character set the
+target program uses.  If we print @code{ibm1047_hello} while our target
+character set is still @sc{ascii}, we get jibberish:
+
+@example
+(gdb) print ibm1047_hello
+$4 = 0x4016a8 "\310\205\223\223\226k@@\246\226\231\223\204Z%"
+(gdb) print ibm1047_hello[0]
+$5 = 200 '\310'
+(gdb) 
+@end example
+
+If we invoke the @code{set target-charset} command without an argument,
+@value{GDBN} tells us the character sets it supports:
+
+@example
+(gdb) set target-charset
+Valid character sets are:
+  ascii *
+  iso-8859-1 *
+  ebcdic-us  
+  ibm1047  
+* - can be used as a host character set
+@end example
+
+We can select @sc{ibm1047} as our target character set, and examine the
+program's strings again.  Now the @sc{ascii} string is wrong, but
+@value{GDBN} translates the contents of @code{ibm1047_hello} from the
+target character set, @sc{ibm1047}, to the host character set,
+@sc{ascii}, and they display correctly:
+
+@example
+(gdb) set target-charset ibm1047
+(gdb) show charset
+The current host character set is `ascii'.
+The current target character set is `ibm1047'.
+(gdb) print ascii_hello
+$6 = 0x401698 "\110\145%%?\054\040\167?\162%\144\041\012"
+(gdb) print ascii_hello[0]
+$7 = 72 '\110'
+(gdb) print ibm1047_hello
+$8 = 0x4016a8 "Hello, world!\n"
+(gdb) print ibm1047_hello[0]
+$9 = 200 'H'
+(gdb)
+@end example
+
+As above, @value{GDBN} uses the target character set for character and
+string literals you use in expressions:
+
+@example
+(gdb) print '+'
+$10 = 78 '+'
+(gdb) 
+@end example
+
+The IBM1047 character set uses the number 78 to encode the @samp{+}
+character.
+
 
 @node Macros
 @chapter C Preprocessor Macros


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]