This is the mail archive of the libc-hacker@sources.redhat.com mailing list for the glibc project.

Note that libc-hacker is a closed list. You may look at the archives of this list, but subscription and posting are not open.



[PATCH] Some more regex tests, speed up UTF-8 handling


Hi!

The following patch adds a few new tests to tst-regex, so that
case insensitive searches and backwards searches are tested.
It also optimizes UTF-8 handling, because otherwise tst-regex
takes hours if not days (I always killed it).

For non-UTF-8, I'm afraid re_search_internal should do a re_string_construct
instead of re_string_allocate if range < 0, and perhaps some fields need to
be added to re_string_t for this, so that re_string_reconstruct can in
that case just move the pointers and never recompute anything (it would be
enough to add one bit), because searching from the start of the string is
really not acceptable performance-wise.

Now, when reading the code I think I found a few problems I'd like
to discuss here.
1) build_wcs_upper_buffer doesn't seem to take into account characters
   where mb length of lowercase character is different from mb length of
   uppercase character, say in tr_TR.UTF-8:
   uppercase of 'i' is '\xc4\xb0' (I with dot above),
   lowercase of 'I' is '\xc4\xb1' (i without dot).
   I have tried to create a testcase for this - bug-regex15.c, but am not
   sure if it should pass, could you please look at it? If it should pass
   (it fails now), I think more tests should be added to test this.
   There is also the question of which characters in the mbs and mbs_case
   re_string_t arrays are considered valid. If it is just the first
   character in the multi-byte sequence, then this could be handled by
   doing wcrtomb into a temporary buffer and then copying back just the
   first character.
2) do we really need the mbs_case array at all? Especially if the non-UTF-8
   range < 0 re_search is handled by re_string_construct, it could make
   a visible difference in memory usage. From what I can see, mbs_case is
   only really used in regcomp, never in regexec, and even there only on
   rare occasions. Furthermore, it looks like mbs_case is not set up
   in all cases, e.g. build_wcs_upper_buffer sets it only for
   mbclen == 1 chars.
3) how is translate supposed to work with MB_CUR_MAX > 1 charsets?
   Most of the places just apply the translation if mbclen == 1 (e.g. for
   UTF-8 this is 0..0x7f), but some other places apply trans when
   wc <= 0xff (which e.g. in UTF-8 is a different thing).
4) I don't fully understand the COMPLEX_BRACKET magic in
   re_compile_fastmap_iter with collate sequences, and wonder whether it
   would be possible to make fastmap work with MB_CUR_MAX > 1 && icase.
   For the individual chars, setting fastmap for both the first byte of
   the lowercase mb sequence and the first byte of the uppercase mb
   sequence should be enough - this could speed up range < 0 searching
   a lot.
5) I see IS_WORD_CHAR used in transit_state_bkref_loop, shouldn't that
   be IS_WIDE_WORD_CHAR for MB_CUR_MAX > 1?
6) re_search_internal calls re_string_allocate with input_len
   dfa->nodes_len + 1. Is there any guarantee this is greater than or
   equal to MB_CUR_MAX? I'm afraid bad things will happen if the buffers
   cannot hold at least one full character.

We need more tests for MB_CUR_MAX > 1 regex searches, ChangeLog.8
has only two non-ASCII chars, which is not enough to cover this.
Apparently none of the testcases check whether translate works.

2002-11-13  Jakub Jelinek  <jakub@redhat.com>

	* posix/regex_internal.c (re_string_allocate, re_string_construct,
	re_string_realloc_buffers, re_string_context_at): Use mb_cur_max
	re_string_t field instead of MB_CUR_MAX.
	(re_string_reconstruct): Likewise.  Optimize UTF-8.
	(re_string_construct_common): Initialize mb_cur_max, utf8.
	(build_wcs_buffer): Fix comment.
	(build_wcs_upper_buffer): Use towupper on wide character instead
	of toupper.
	* posix/regex_internal.h (re_string_t): Add utf8 and mb_cur_max
	fields.
	(re_dfa_t): Move state_hash_mask field to avoid padding on 64-bit
	arches.
	* posix/regexec.c (re_search_internal): Use mb_cur_max re_string_t
	field instead of MB_CUR_MAX.
	(check_matching, transit_state_sb, extend_buffers): Likewise.
	* posix/tst-regex.c (umemlen): New variable.
	(test_expr): Add expectedicase argument.  Test case insensitive
	searches as well as backwards searches (case sensitive and
	insensitive).
	(run_test): Add icase argument.  Use it to compute regcomp flags.
	(run_test_backwards): New function.
	(main): Cast read to size_t to avoid warning.  Set umemlen.
	Add expectedicase arguments to test_expr.

	* posix/bug-regex15.c: New test.
	* posix/Makefile (tests): Add bug-regex15.
	(bug-regex15-ENV): Set LOCPATH.
	* localedata/Makefile (LOCALES): Add tr_TR.UTF-8.

--- libc/posix/regex_internal.c.jj	2002-11-11 12:48:20.000000000 +0100
+++ libc/posix/regex_internal.c	2002-11-13 20:12:57.000000000 +0100
@@ -109,7 +109,7 @@ re_string_allocate (pstr, str, len, init
 		    : (unsigned char *) str);
   pstr->mbs = MBS_ALLOCATED (pstr) ? pstr->mbs : pstr->mbs_case;
   pstr->valid_len = (MBS_CASE_ALLOCATED (pstr) || MBS_ALLOCATED (pstr)
-		     || MB_CUR_MAX > 1) ? pstr->valid_len : len;
+		     || pstr->mb_cur_max > 1) ? pstr->valid_len : len;
   return REG_NOERROR;
 }
 
@@ -141,7 +141,7 @@ re_string_construct (pstr, str, len, tra
   if (icase)
     {
 #ifdef RE_ENABLE_I18N
-      if (MB_CUR_MAX > 1)
+      if (pstr->mb_cur_max > 1)
 	build_wcs_upper_buffer (pstr);
       else
 #endif /* RE_ENABLE_I18N  */
@@ -150,7 +150,7 @@ re_string_construct (pstr, str, len, tra
   else
     {
 #ifdef RE_ENABLE_I18N
-      if (MB_CUR_MAX > 1)
+      if (pstr->mb_cur_max > 1)
 	build_wcs_buffer (pstr);
       else
 #endif /* RE_ENABLE_I18N  */
@@ -175,7 +175,7 @@ re_string_realloc_buffers (pstr, new_buf
      int new_buf_len;
 {
 #ifdef RE_ENABLE_I18N
-  if (MB_CUR_MAX > 1)
+  if (pstr->mb_cur_max > 1)
     {
       wint_t *new_array = re_realloc (pstr->wcs, wint_t, new_buf_len);
       if (BE (new_array == NULL, 0))
@@ -219,6 +219,12 @@ re_string_construct_common (str, len, ps
   pstr->len = len;
   pstr->trans = trans;
   pstr->icase = icase ? 1 : 0;
+  pstr->mb_cur_max = MB_CUR_MAX;
+#ifdef _LIBC
+  if (pstr->mb_cur_max > 1
+      && strcmp (_NL_CURRENT (LC_CTYPE, _NL_CTYPE_CODESET_NAME), "UTF-8") == 0)
+    pstr->utf8 = 1;
+#endif
 }
 
 #ifdef RE_ENABLE_I18N
@@ -264,7 +270,7 @@ build_wcs_buffer (pstr)
 	  pstr->cur_state = prev_st;
 	}
 
-      /* Apply the translateion if we need.  */
+      /* Apply the translation if we need.  */
       if (pstr->trans != NULL && mbclen == 1)
 	{
 	  int ch = pstr->trans[pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx]];
@@ -314,7 +320,7 @@ build_wcs_upper_buffer (pstr)
 	      ch = pstr->trans[ch];
 	      pstr->mbs_case[byte_idx] = ch;
 	    }
-	  pstr->wcs[byte_idx] = iswlower (wc) ? toupper (wc) : wc;
+	  pstr->wcs[byte_idx] = iswlower (wc) ? towupper (wc) : wc;
 	  pstr->mbs[byte_idx++] = islower (ch) ? toupper (ch) : ch;
 	  if (BE (mbclen == (size_t) -1, 0))
 	    pstr->cur_state = prev_st;
@@ -326,7 +332,7 @@ build_wcs_upper_buffer (pstr)
 	  else
 	    memcpy (pstr->mbs + byte_idx,
 		    pstr->raw_mbs + pstr->raw_mbs_idx + byte_idx, mbclen);
-	  pstr->wcs[byte_idx++] = iswlower (wc) ? toupper (wc) : wc;
+	  pstr->wcs[byte_idx++] = iswlower (wc) ? towupper (wc) : wc;
 	  /* Write paddings.  */
 	  for (remain_len = byte_idx + mbclen - 1; byte_idx < remain_len ;)
 	    pstr->wcs[byte_idx++] = WEOF;
@@ -386,7 +392,7 @@ build_upper_buffer (pstr)
       int ch = pstr->raw_mbs[pstr->raw_mbs_idx + char_idx];
       if (pstr->trans != NULL)
 	{
-	  ch =  pstr->trans[ch];
+	  ch = pstr->trans[ch];
 	  pstr->mbs_case[char_idx] = ch;
 	}
       if (islower (ch))
@@ -429,7 +435,7 @@ re_string_reconstruct (pstr, idx, eflags
     {
       /* Reset buffer.  */
 #ifdef RE_ENABLE_I18N
-      if (MB_CUR_MAX > 1)
+      if (pstr->mb_cur_max > 1)
 	memset (&pstr->cur_state, '\0', sizeof (mbstate_t));
 #endif /* RE_ENABLE_I18N */
       pstr->len += pstr->raw_mbs_idx;
@@ -453,7 +459,7 @@ re_string_reconstruct (pstr, idx, eflags
 	  pstr->tip_context = re_string_context_at (pstr, offset - 1, eflags,
 						    newline);
 #ifdef RE_ENABLE_I18N
-	  if (MB_CUR_MAX > 1)
+	  if (pstr->mb_cur_max > 1)
 	    memmove (pstr->wcs, pstr->wcs + offset,
 		     (pstr->valid_len - offset) * sizeof (wint_t));
 #endif /* RE_ENABLE_I18N */
@@ -473,13 +479,38 @@ re_string_reconstruct (pstr, idx, eflags
 	  /* No, skip all characters until IDX.  */
 	  pstr->valid_len = 0;
 #ifdef RE_ENABLE_I18N
-	  if (MB_CUR_MAX > 1)
+	  if (pstr->mb_cur_max > 1)
 	    {
 	      int wcs_idx;
-	      wint_t wc;
-	      pstr->valid_len = re_string_skip_chars (pstr, idx, &wc) - idx;
-	      for (wcs_idx = 0; wcs_idx < pstr->valid_len; ++wcs_idx)
-		pstr->wcs[wcs_idx] = WEOF;
+	      wint_t wc = WEOF;
+
+	      if (pstr->utf8 && offset >= pstr->mb_cur_max)
+		{
+		  /* Optimize UTF-8.  */
+		  for (wcs_idx = 1; wcs_idx <= pstr->mb_cur_max; ++wcs_idx)
+		    if ((pstr->raw_mbs[pstr->raw_mbs_idx + offset
+				       - wcs_idx] & 0xc0) != 0x80)
+		      break;
+		  if (wcs_idx <= pstr->mb_cur_max)
+		    {
+		      mbstate_t cur_state;
+
+		      memset (&cur_state, 0, sizeof (cur_state));
+		      if (mbrtowc (&wc, (const char *) pstr->raw_mbs + offset
+					- wcs_idx, wcs_idx, &cur_state)
+			  == wcs_idx)
+			memset (&pstr->cur_state, '\0', sizeof (mbstate_t));
+		      else
+			wc = WEOF;
+		    }
+		}
+	      if (wc == WEOF)
+		{
+		  pstr->valid_len = re_string_skip_chars (pstr, idx, &wc)
+				    - idx;
+		  for (wcs_idx = 0; wcs_idx < pstr->valid_len; ++wcs_idx)
+		    pstr->wcs[wcs_idx] = WEOF;
+ 		}
 	      if (pstr->trans && wc <= 0xff)
 		wc = pstr->trans[wc];
 	      pstr->tip_context = (IS_WIDE_WORD_CHAR (wc) ? CONTEXT_WORD
@@ -511,7 +542,7 @@ re_string_reconstruct (pstr, idx, eflags
 
   /* Then build the buffers.  */
 #ifdef RE_ENABLE_I18N
-  if (MB_CUR_MAX > 1)
+  if (pstr->mb_cur_max > 1)
     {
       if (pstr->icase)
 	build_wcs_upper_buffer (pstr);
@@ -562,7 +593,7 @@ re_string_context_at (input, idx, eflags
 	return ((eflags & REG_NOTEOL) ? CONTEXT_ENDBUF
 		: CONTEXT_NEWLINE | CONTEXT_ENDBUF);
     }
-  if (MB_CUR_MAX == 1)
+  if (input->mb_cur_max == 1)
     {
       c = re_string_byte_at (input, idx);
       if (IS_WORD_CHAR (c))
--- libc/posix/regex_internal.h.jj	2002-11-11 08:58:48.000000000 +0100
+++ libc/posix/regex_internal.h	2002-11-11 15:46:22.000000000 +0100
@@ -261,6 +261,10 @@ struct re_string_t
   RE_TRANSLATE_TYPE trans;
   /* 1 if REG_ICASE.  */
   unsigned int icase : 1;
+  /* 1 if UTF-8.  */
+  unsigned int utf8 : 1;
+  /* MB_CUR_MAX.  */
+  int mb_cur_max;
 };
 typedef struct re_string_t re_string_t;
 /* In case of REG_ICASE, we allocate the buffer dynamically for mbs.  */
@@ -476,11 +480,11 @@ struct re_dfa_t
   re_node_set *eclosures;
   re_node_set *inveclosures;
   struct re_state_table_entry *state_table;
-  unsigned int state_hash_mask;
   re_dfastate_t *init_state;
   re_dfastate_t *init_state_word;
   re_dfastate_t *init_state_nl;
   re_dfastate_t *init_state_begbuf;
+  unsigned int state_hash_mask;
   int states_alloc;
   int init_node;
   int nbackref; /* The number of backreference in this dfa.  */
--- libc/posix/regexec.c.jj	2002-11-08 11:55:51.000000000 +0100
+++ libc/posix/regexec.c	2002-11-11 16:15:13.000000000 +0100
@@ -631,7 +631,7 @@ re_search_internal (preg, string, length
   incr = (range < 0) ? -1 : 1;
   left_lim = (range < 0) ? start + range : start;
   right_lim = (range < 0) ? start : start + range;
-  sb = MB_CUR_MAX == 1;
+  sb = input.mb_cur_max == 1;
   fast_translate = sb || !(preg->syntax & RE_ICASE || preg->translate);
 
   for (;;)
@@ -971,7 +971,7 @@ check_matching (preg, mctx, fl_search, f
 	      /* Restart from initial state, since we are searching
 		 the point from where matching start.  */
 #ifdef RE_ENABLE_I18N
-	      if (MB_CUR_MAX == 1
+	      if (mctx->input->mb_cur_max == 1
 		  || re_string_first_byte (mctx->input, cur_str_idx))
 #endif /* RE_ENABLE_I18N */
 		cur_state = acquire_init_state_context (&err, preg, mctx,
@@ -2318,7 +2318,7 @@ transit_state_sb (err, preg, state, fl_s
     {
 #ifdef RE_ENABLE_I18N
       int not_initial = 0;
-      if (MB_CUR_MAX > 1)
+      if (mctx->input->mb_cur_max > 1)
 	for (node_cnt = 0; node_cnt < next_nodes.nelem; ++node_cnt)
 	  if (dfa->nodes[next_nodes.elems[node_cnt]].type == CHARACTER)
 	    {
@@ -3219,7 +3219,7 @@ extend_buffers (mctx)
   if (pstr->icase)
     {
 #ifdef RE_ENABLE_I18N
-      if (MB_CUR_MAX > 1)
+      if (pstr->mb_cur_max > 1)
 	build_wcs_upper_buffer (pstr);
       else
 #endif /* RE_ENABLE_I18N  */
@@ -3228,7 +3228,7 @@ extend_buffers (mctx)
   else
     {
 #ifdef RE_ENABLE_I18N
-      if (MB_CUR_MAX > 1)
+      if (pstr->mb_cur_max > 1)
 	build_wcs_buffer (pstr);
       else
 #endif /* RE_ENABLE_I18N  */
--- libc/posix/tst-regex.c.jj	2001-08-23 18:48:55.000000000 +0200
+++ libc/posix/tst-regex.c	2002-11-13 19:26:47.000000000 +0100
@@ -1,4 +1,4 @@
-/* Copyright (C) 2001 Free Software Foundation, Inc.
+/* Copyright (C) 2001, 2002 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -44,10 +44,13 @@ static iconv_t cd;
 static char *mem;
 static char *umem;
 static size_t memlen;
+static size_t umemlen;
 
-static int test_expr (const char *expr, int expected);
+static int test_expr (const char *expr, int expected, int expectedicase);
 static int run_test (const char *expr, const char *mem, size_t memlen,
-		     int expected);
+		     int icase, int expected);
+static int run_test_backwards (const char *expr, const char *mem,
+			       size_t memlen, int icase, int expected);
 
 
 int
@@ -78,7 +81,7 @@ main (void)
   if (mem == NULL)
     error (EXIT_FAILURE, errno, "while allocating buffer");
 
-  if (read (fd, mem, memlen) != memlen)
+  if ((size_t) read (fd, mem, memlen) != memlen)
     error (EXIT_FAILURE, 0, "cannot read entire file");
   mem[memlen] = '\0';
 
@@ -102,6 +105,7 @@ main (void)
   outmem = umem;
   outlen = 2 * memlen - 1;
   iconv (cd, &inmem, &inlen, &outmem, &outlen);
+  umemlen = outmem - umem;
   if (inlen != 0)
     error (EXIT_FAILURE, errno, "cannot convert buffer");
 
@@ -116,11 +120,11 @@ main (void)
 
   /* Run the actual tests.  All tests are run in a single-byte and a
      multi-byte locale.  */
-  result = test_expr ("[äáàâéèêíìîñöóòôüúùû]", 2);
-  result |= test_expr ("G.ran", 2);
-  result |= test_expr ("G.\\{1\\}ran", 2);
-  result |= test_expr ("G.*ran", 3);
-  result |= test_expr ("[äáàâ]", 0);
+  result = test_expr ("[äáàâéèêíìîñöóòôüúùû]", 2, 2);
+  result |= test_expr ("G.ran", 2, 3);
+  result |= test_expr ("G.\\{1\\}ran", 2, 3);
+  result |= test_expr ("G.*ran", 3, 44);
+  result |= test_expr ("[äáàâ]", 0, 0);
 
   /* Free the resources.  */
   free (umem);
@@ -132,7 +136,7 @@ main (void)
 
 
 static int
-test_expr (const char *expr, int expected)
+test_expr (const char *expr, int expected, int expectedicase)
 {
   int result;
   char *inmem;
@@ -146,7 +150,14 @@ test_expr (const char *expr, int expecte
     error (EXIT_FAILURE, 0, "cannot set locale de_DE.ISO-8859-1");
 
   printf ("\nTest \"%s\" with 8-bit locale\n", expr);
-  result = run_test (expr, mem, memlen, expected);
+  result = run_test (expr, mem, memlen, 0, expected);
+  printf ("\nTest \"%s\" with 8-bit locale, case insensitive\n", expr);
+  result |= run_test (expr, mem, memlen, 1, expectedicase);
+  printf ("\nTest \"%s\" backwards with 8-bit locale\n", expr);
+  result |= run_test_backwards (expr, mem, memlen, 0, expected);
+  printf ("\nTest \"%s\" backwards with 8-bit locale, case insensitive\n",
+	  expr);
+  result |= run_test_backwards (expr, mem, memlen, 1, expectedicase);
 
   /* Second test: search with an UTF-8 locale.  */
   if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
@@ -163,14 +174,22 @@ test_expr (const char *expr, int expecte
 
   /* Run the tests.  */
   printf ("\nTest \"%s\" with multi-byte locale\n", expr);
-  result |= run_test (uexpr, umem, 2 * memlen - outlen, expected);
+  result |= run_test (uexpr, umem, umemlen, 0, expected);
+  printf ("\nTest \"%s\" with multi-byte locale, case insensitive\n", expr);
+  result |= run_test (uexpr, umem, umemlen, 1, expectedicase);
+  printf ("\nTest \"%s\" backwards with multi-byte locale\n", expr);
+  result |= run_test_backwards (uexpr, umem, umemlen, 0, expected);
+  printf ("\nTest \"%s\" backwards with multi-byte locale, case insensitive\n",
+	  expr);
+  result |= run_test_backwards (uexpr, umem, umemlen, 1, expectedicase);
 
   return result;
 }
 
 
 static int
-run_test (const char *expr, const char *mem, size_t memlen, int expected)
+run_test (const char *expr, const char *mem, size_t memlen, int icase,
+	  int expected)
 {
 #ifdef _POSIX_CPUTIME
   struct timespec start;
@@ -186,7 +205,7 @@ run_test (const char *expr, const char *
     use_clock = clock_gettime (cl, &start) == 0;
 #endif
 
-  err = regcomp (&re, expr, REG_NEWLINE);
+  err = regcomp (&re, expr, REG_NEWLINE | (icase ? REG_ICASE : 0));
   if (err != REG_NOERROR)
     {
       char buf[200];
@@ -257,3 +276,97 @@ run_test (const char *expr, const char *
      expect.  */
   return cnt != expected;
 }
+
+
+static int
+run_test_backwards (const char *expr, const char *mem, size_t memlen,
+		    int icase, int expected)
+{
+#ifdef _POSIX_CPUTIME
+  struct timespec start;
+  struct timespec finish;
+#endif
+  struct re_pattern_buffer re;
+  const char *err;
+  size_t offset;
+  int cnt;
+
+#ifdef _POSIX_CPUTIME
+  if (use_clock)
+    use_clock = clock_gettime (cl, &start) == 0;
+#endif
+
+  re_set_syntax ((RE_SYNTAX_POSIX_BASIC & ~RE_DOT_NEWLINE)
+		 | RE_HAT_LISTS_NOT_NEWLINE
+		 | (icase ? RE_ICASE : 0));
+
+  memset (&re, 0, sizeof (re));
+  re.fastmap = malloc (256);
+  if (re.fastmap == NULL)
+    error (EXIT_FAILURE, errno, "cannot allocate fastmap");
+
+  err = re_compile_pattern (expr, strlen (expr), &re);
+  if (err != NULL)
+    error (EXIT_FAILURE, 0, "cannot compile expression: %s", err);
+
+  if (re_compile_fastmap (&re))
+    error (EXIT_FAILURE, 0, "couldn't compile fastmap");
+
+  cnt = 0;
+  offset = memlen;
+  assert (mem[memlen] == '\0');
+  while (offset <= memlen)
+    {
+      int start;
+      const char *sp;
+      const char *ep;
+
+      start = re_search (&re, mem, memlen, offset, -offset, NULL);
+      if (start == -1)
+	break;
+
+      if (start == -2)
+	error (EXIT_FAILURE, 0, "internal error in re_search");
+
+      sp = mem + start;
+      while (sp > mem && sp[-1] != '\n')
+	--sp;
+
+      ep = mem + start;
+      while (*ep != '\0' && *ep != '\n')
+	++ep;
+
+      printf ("match %d: \"%.*s\"\n", ++cnt, (int) (ep - sp), sp);
+
+      offset = sp - 1 - mem;
+    }
+
+  regfree (&re);
+
+#ifdef _POSIX_CPUTIME
+  if (use_clock)
+    {
+      use_clock = clock_gettime (cl, &finish) == 0;
+      if (use_clock)
+	{
+	  if (finish.tv_nsec < start.tv_nsec)
+	    {
+	      finish.tv_nsec -= start.tv_nsec - 1000000000;
+	      finish.tv_sec -= 1 + start.tv_sec;
+	    }
+	  else
+	    {
+	      finish.tv_nsec -= start.tv_nsec;
+	      finish.tv_sec -= start.tv_sec;
+	    }
+
+	  printf ("elapsed time: %ld.%09ld sec\n",
+		  finish.tv_sec, finish.tv_nsec);
+	}
+    }
+#endif
+
+  /* Return an error if the number of matches found is not what we
+     expect.  */
+  return cnt != expected;
+}
--- libc/posix/bug-regex15.c.jj	2002-11-13 17:55:35.000000000 +0100
+++ libc/posix/bug-regex15.c	2002-11-13 17:43:02.000000000 +0100
@@ -0,0 +1,85 @@
+/* Turkish regular expression tests.
+   Copyright (C) 2002 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Jakub Jelinek <jakub@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, write to the Free
+   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+   02111-1307 USA.  */
+
+#include <sys/types.h>
+#include <mcheck.h>
+#include <regex.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <locale.h>
+
+/* Tests supposed to match.  */
+struct
+{
+  const char *pattern;
+  const char *string;
+  int flags, nmatch;
+  regmatch_t rm[5];
+} tests[] = {
+  { "\xc4\xb0I*\xc4\xb1$", "aBi\xc4\xb1\xc4\xb1I", REG_ICASE, 2,
+    { { 2, 8 }, { -1, -1 } } }
+};
+
+int
+main (void)
+{
+  regex_t re;
+  regmatch_t rm[5];
+  size_t i;
+  int n, ret = 0;
+
+  setlocale (LC_ALL, "tr_TR.UTF-8");
+  for (i = 0; i < sizeof (tests) / sizeof (tests[0]); ++i)
+    {
+      n = regcomp (&re, tests[i].pattern, tests[i].flags);
+      if (n != 0)
+	{
+	  char buf[500];
+	  regerror (n, &re, buf, sizeof (buf));
+	  printf ("regcomp %zd failed: %s\n", i, buf);
+	  ret = 1;
+	  continue;
+	}
+
+      if (regexec (&re, tests[i].string, tests[i].nmatch, rm, 0))
+	{
+	  printf ("regexec %zd failed\n", i);
+	  ret = 1;
+	  regfree (&re);
+	  continue;
+	}
+
+      for (n = 0; n < tests[i].nmatch; ++n)
+	if (rm[n].rm_so != tests[i].rm[n].rm_so
+              || rm[n].rm_eo != tests[i].rm[n].rm_eo)
+	  {
+	    if (tests[i].rm[n].rm_so == -1 && tests[i].rm[n].rm_eo == -1)
+	      break;
+	    printf ("regexec match failure rm[%d] %d..%d\n",
+		    n, rm[n].rm_so, rm[n].rm_eo);
+	    ret = 1;
+	    break;
+	  }
+
+      regfree (&re);
+    }
+
+  return ret;
+}
--- libc/posix/Makefile.jj	2002-10-25 12:37:56.000000000 +0200
+++ libc/posix/Makefile	2002-11-13 20:00:15.000000000 +0100
@@ -75,7 +75,7 @@ tests		:= tstgetopt testfnm runtests run
 		   tst-chmod bug-regex1 bug-regex2 bug-regex3 bug-regex4 \
 		   tst-gnuglob tst-regex bug-regex5 bug-regex6 bug-regex7 \
 		   bug-regex8 bug-regex9 bug-regex10 bug-regex11 bug-regex12 \
-		   bug-regex13 bug-regex14
+		   bug-regex13 bug-regex14 bug-regex15
 ifeq (yes,$(build-shared))
 test-srcs	:= globtest
 tests           += wordexp-test tst-exec tst-spawn
@@ -131,6 +131,7 @@ bug-regex1-ENV = LOCPATH=$(common-objpfx
 tst-regex-ENV = LOCPATH=$(common-objpfx)localedata
 bug-regex5-ENV = LOCPATH=$(common-objpfx)localedata
 bug-regex6-ENV = LOCPATH=$(common-objpfx)localedata
+bug-regex15-ENV = LOCPATH=$(common-objpfx)localedata
 
 testcases.h: TESTS TESTS2C.sed
 	sed -f TESTS2C.sed < $< > $@T
--- libc/localedata/Makefile.jj	2002-08-12 15:27:49.000000000 +0200
+++ libc/localedata/Makefile	2002-11-13 17:34:25.000000000 +0100
@@ -129,7 +129,8 @@ ifeq (no,$(cross-compiling))
 # We have to generate locales
 LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 \
 	   en_US.ISO-8859-1 ja_JP.EUC-JP da_DK.ISO-8859-1 \
-	   hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 ja_JP.SJIS fr_FR.ISO-8859-1
+	   hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 ja_JP.SJIS \
+	   fr_FR.ISO-8859-1 tr_TR.UTF-8
 LOCALE_SRCS := $(shell echo "$(LOCALES)"|sed 's/\([^ .]*\)[^ ]*/\1/g')
 CHARMAPS := $(shell echo "$(LOCALES)" | \
 		    sed -e 's/[^ .]*[.]\([^ ]*\)/\1/g' -e s/SJIS/SHIFT_JIS/g)

	Jakub

