Re: [PATCH] Async signal safe TLS accesses


I've looked at the synchronization bits in this patch.  Detailed
comments below, and one general comment:

Please document the synchronization code you add in more detail.  The
C11 memory model should be the base model, even though the atomic
operations we currently have are somewhat different.
In particular, if using barriers / memory orders (MO) such as "_acq",
please make sure that you document which ordering they are meant to
achieve.  (Or, if there are no such barriers or just relaxed MO, please
document why we don't need a particular ordering.)  For example, if you
use an acquire barrier / MO, please briefly document with which release
barriers / MO this is supposed to create happens-before relations.
Also, the atomic operations that we currently have don't cover all the
options in the C11 model (for example, we don't have relaxed MO
read-modify-write ops AFAIR); thus, if you use a glibc atomic operation
that is stronger than what you'd actually need in the C11 model, please
document this as well so we can easily change it later on once we do
have those atomic ops.
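
To illustrate the kind of pairing documentation I have in mind, here is
a minimal, self-contained C11 sketch (not code from the patch; the
names are invented for the example):

#include <stdatomic.h>
#include <stddef.h>

static _Atomic (void *) slot;   /* stands in for a dtv entry */
static int payload;             /* stands in for the TLS block */

void
publisher (void)
{
  payload = 42;   /* initialize the block */
  /* Release MO: pairs with the acquire load in consumer () so that the
     initialization of payload happens-before any use of it there.  */
  atomic_store_explicit (&slot, &payload, memory_order_release);
}

int
consumer (void)
{
  void *p;
  /* Acquire MO: pairs with the release store in publisher ().  */
  while ((p = atomic_load_explicit (&slot, memory_order_acquire)) == NULL)
    ;   /* spin until published */
  return *(int *) p;
}

Every acquire names the release it is meant to synchronize with, and
vice versa; that is the level of detail that makes such code reviewable.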

On Thu, 2013-11-07 at 11:30 -0800, Paul Pluzhnikov wrote:
> diff --git a/elf/dl-reloc.c b/elf/dl-reloc.c
> index 5c54310..3cabac7 100644
> --- a/elf/dl-reloc.c
> +++ b/elf/dl-reloc.c
> @@ -16,8 +16,10 @@
>     License along with the GNU C Library; if not, see
>     <http://www.gnu.org/licenses/>.  */
>  
> +#include <atomic.h>
>  #include <errno.h>
>  #include <libintl.h>
> +#include <signal.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <ldsodefs.h>
> @@ -70,8 +72,6 @@ _dl_try_allocate_static_tls (struct link_map *map)
>  
>  size_t offset = GL(dl_tls_static_used) + (freebytes - n * map->l_tls_align
>                                            - map->l_tls_firstbyte_offset);
> -
> -  map->l_tls_offset = GL(dl_tls_static_used) = offset;
>  #elif TLS_DTV_AT_TP
>    /* dl_tls_static_used includes the TCB at the beginning.  */
>    size_t offset = (((GL(dl_tls_static_used)
> @@ -83,7 +83,27 @@ _dl_try_allocate_static_tls (struct link_map *map)
>    if (used > GL(dl_tls_static_size))
>      goto fail;
>  
> -  map->l_tls_offset = offset;
> +#else
> +# error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
> +#endif
> +  /* We've computed the new value we want, now try to install it.  */
> +  ptrdiff_t val;
> +  while ((val = map->l_tls_offset) == NO_TLS_OFFSET)
> +    {
> +      atomic_compare_and_exchange_bool_acq (&map->l_tls_offset,
> +                                           offset,
> +                                           NO_TLS_OFFSET);
> +    }
> +  if (val != offset)
> +    {
> +      /* Lost a race to a TLS access in another thread.  Too bad,
> +        nothing we can do here.  */
> +      goto fail;
> +    }

It looks as if you use the loop only to reload map->l_tls_offset after
the CAS, right?  (AFAICT, the only state transition you're trying to do
or wait for is from map->l_tls_offset == NO_TLS_OFFSET to something
else.  Our CAS is a strong CAS in the C11 terminology (ie, it won't fail
spuriously).  Thus, doing the CAS once should be sufficient.)

If so, why don't you either use just an atomic_compare_and_exchange_val_acq,
or an atomic_compare_and_exchange_val_acq with a load+if in front of it to
avoid the CAS overhead when not needed (ie, like test-and-test-and-set)?

The use of _acq MO seems inconsistent.  If you don't need _acq on the
CAS, but were using it because it's the weakest atomic op we currently
offer, please document that.  If you do need it, then you also need an
acquire load+if in front of the CAS, if the load+if can skip the CAS;
also, please document with which release MO this is supposed to
synchronize in this case (e.g., does the CAS need to have
acquire-release MO?).
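
For concreteness, here is a sketch (untested) of what I mean, using our
existing atomic_compare_and_exchange_val_acq, which returns the value
it saw:

  /* Plain load first, to skip the CAS when the decision was already
     made (test-and-test-and-set).  */
  ptrdiff_t prev = map->l_tls_offset;
  if (prev == NO_TLS_OFFSET)
    /* Our CAS is strong, so it cannot fail spuriously; one attempt
       suffices.  It returns NO_TLS_OFFSET iff we installed offset.  */
    prev = atomic_compare_and_exchange_val_acq (&map->l_tls_offset,
                                                offset, NO_TLS_OFFSET);
  if (prev != NO_TLS_OFFSET)
    /* Another thread decided first (e.g., forced this map to dynamic
       TLS), so we cannot install a static offset anymore.  */
    goto fail;

Whether the _acq is actually needed here depends on the questions
above; if it is not, a comment saying so would be enough.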

> +  /* We installed the value; now update the globals.  */
> +#if TLS_TCB_AT_TP
> +  GL(dl_tls_static_used) = offset;
> +#elif TLS_DTV_AT_TP
>    map->l_tls_firstbyte_offset = GL(dl_tls_static_used);
>    GL(dl_tls_static_used) = used;
>  #else
> @@ -114,8 +134,18 @@ void
>  internal_function __attribute_noinline__
>  _dl_allocate_static_tls (struct link_map *map)
>  {
> -  if (map->l_tls_offset == FORCED_DYNAMIC_TLS_OFFSET
> -      || _dl_try_allocate_static_tls (map))
> +  /* We wrap this in a signal mask because it has to iterate all
> +     threads (including this one) and update this map's TLS entry.  A
> +     handler accessing TLS would try to do the same update and
> +     break.  */
> +  sigset_t old;
> +  _dl_mask_all_signals (&old);
> +  int err = -1;
> +  if (map->l_tls_offset != FORCED_DYNAMIC_TLS_OFFSET)
> +    err = _dl_try_allocate_static_tls (map);
> +
> +  _dl_unmask_signals (&old);
> +  if (err != 0)
>      {
>        _dl_signal_error (0, map->l_name, NULL, N_("\
>  cannot allocate memory in static TLS block"));
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index 576d9a1..3e2c895 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -17,6 +17,7 @@
>     <http://www.gnu.org/licenses/>.  */
>  
>  #include <assert.h>
> +#include <atomic.h>
>  #include <errno.h>
>  #include <libintl.h>
>  #include <signal.h>
> @@ -293,7 +294,7 @@ allocate_dtv (void *result)
>       initial set of modules.  This should avoid in most cases expansions
>       of the dtv.  */
>    dtv_length = GL(dl_tls_max_dtv_idx) + DTV_SURPLUS;
> -  dtv = calloc (dtv_length + 2, sizeof (dtv_t));
> +  dtv = __signal_safe_calloc (dtv_length + 2, sizeof (dtv_t));
>    if (dtv != NULL)
>      {
>        /* This is the initial length of the dtv.  */
> @@ -463,6 +464,18 @@ _dl_allocate_tls (void *mem)
>  }
>  rtld_hidden_def (_dl_allocate_tls)
>  
> +void
> +internal_function
> +_dl_clear_dtv (dtv_t *dtv)
> +{
> +  for (size_t cnt = 0; cnt < dtv[-1].counter; ++cnt)
> +    if (! dtv[1 + cnt].pointer.is_static
> +       && dtv[1 + cnt].pointer.val != TLS_DTV_UNALLOCATED)
> +      __signal_safe_free (dtv[1 + cnt].pointer.val);
> +  memset (dtv, '\0', (dtv[-1].counter + 1) * sizeof (dtv_t));
> +}
> +
> +rtld_hidden_def (_dl_clear_dtv)
>  
>  #ifndef SHARED
>  extern dtv_t _dl_static_dtv[];
> @@ -479,11 +492,11 @@ _dl_deallocate_tls (void *tcb, bool dealloc_tcb)
>    for (size_t cnt = 0; cnt < dtv[-1].counter; ++cnt)
>      if (! dtv[1 + cnt].pointer.is_static
>         && dtv[1 + cnt].pointer.val != TLS_DTV_UNALLOCATED)
> -      free (dtv[1 + cnt].pointer.val);
> +      __signal_safe_free (dtv[1 + cnt].pointer.val);
>  
>    /* The array starts with dtv[-1].  */
>    if (dtv != GL(dl_initial_dtv))
> -    free (dtv - 1);
> +    __signal_safe_free (dtv - 1);
>  
>    if (dealloc_tcb)
>      {
> @@ -521,20 +534,21 @@ rtld_hidden_def (_dl_deallocate_tls)
>  # endif
>  
>  
> -static void *
> -allocate_and_init (struct link_map *map)
> +static void
> +allocate_and_init (dtv_t *dtv, struct link_map *map)
>  {
>    void *newp;
> -
> -  newp = __libc_memalign (map->l_tls_align, map->l_tls_blocksize);
> +  newp = __signal_safe_memalign (map->l_tls_align, map->l_tls_blocksize);
>    if (newp == NULL)
>      oom ();
>  
> -  /* Initialize the memory.  */
> +  /* Initialize the memory. Since this is our thread's space, we are
> +     under a signal mask, and no one has touched this section before,
> +     we can safely just overwrite whatever's there.  */
>  memset (__mempcpy (newp, map->l_tls_initimage, map->l_tls_initimage_size),
>           '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
>  
> -  return newp;
> +  dtv->pointer.val = newp;
>  }
>  
>  
> @@ -576,7 +590,15 @@ _dl_update_slotinfo (unsigned long int req_modid)
>          the entry we need.  */
>        size_t new_gen = listp->slotinfo[idx].gen;
>        size_t total = 0;
> -
> +      int ret;
> +      sigset_t old;
> +      _dl_mask_all_signals (&old);
> +      /* We use the signal mask as a lock against reentrancy here.
> +         Check that a signal taken before the lock didn't already
> +         update us.  */
> +      dtv = THREAD_DTV ();
> +      if (dtv[0].counter >= listp->slotinfo[idx].gen)
> +        goto out;
>        /* We have to look through the entire dtv slotinfo list.  */
>        listp =  GL(dl_tls_dtv_slotinfo_list);
>        do
> @@ -596,25 +618,27 @@ _dl_update_slotinfo (unsigned long int req_modid)
>               if (gen <= dtv[0].counter)
>                 continue;
>  
> +             size_t modid = total + cnt;
> +
>               /* If there is no map this means the entry is empty.  */
>               struct link_map *map = listp->slotinfo[cnt].map;
>               if (map == NULL)
>                 {
>                   /* If this modid was used at some point the memory
>                      might still be allocated.  */
> -                 if (! dtv[total + cnt].pointer.is_static
> -                     && dtv[total + cnt].pointer.val != TLS_DTV_UNALLOCATED)
> +                 if (dtv[-1].counter >= modid
> +                     && !dtv[modid].pointer.is_static
> +                     && dtv[modid].pointer.val != TLS_DTV_UNALLOCATED)
>                     {
> -                     free (dtv[total + cnt].pointer.val);
> -                     dtv[total + cnt].pointer.val = TLS_DTV_UNALLOCATED;
> +                     __signal_safe_free (dtv[modid].pointer.val);
> +                     dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
>                     }
>  
>                   continue;
>                 }
>  
> +             assert (modid == map->l_tls_modid);
>              /* Check whether the current dtv array is large enough.  */
> -             size_t modid = map->l_tls_modid;
> -             assert (total + cnt == modid);
>               if (dtv[-1].counter < modid)
>                 {
>                   /* Reallocate the dtv.  */
> @@ -628,17 +652,17 @@ _dl_update_slotinfo (unsigned long int req_modid)
>                     {
>                       /* This is the initial dtv that was allocated
>                          during rtld startup using the dl-minimal.c
> -                        malloc instead of the real malloc.  We can't
> +                        malloc instead of the real allocator.  We can't
>                          free it, we have to abandon the old storage.  */
>  
> -                     newp = malloc ((2 + newsize) * sizeof (dtv_t));
> +                     newp = __signal_safe_malloc ((2 + newsize) * sizeof (dtv_t));
>                       if (newp == NULL)
>                         oom ();
>                       memcpy (newp, &dtv[-1], (2 + oldsize) * sizeof (dtv_t));
>                     }
>                   else
>                     {
> -                     newp = realloc (&dtv[-1],
> +                     newp = __signal_safe_realloc (&dtv[-1],
>                                       (2 + newsize) * sizeof (dtv_t));
>                       if (newp == NULL)
>                         oom ();
> @@ -668,7 +692,7 @@ _dl_update_slotinfo (unsigned long int req_modid)
>                    deallocate even if it is this dtv entry we are
>                    supposed to load.  The reason is that we call
>                    memalign and not malloc.  */
> -               free (dtv[modid].pointer.val);
> +               __signal_safe_free (dtv[modid].pointer.val);
>  
>               /* This module is loaded dynamically- We defer memory
>                  allocation.  */
> @@ -685,6 +709,8 @@ _dl_update_slotinfo (unsigned long int req_modid)
>  
>        /* This will be the new maximum generation counter.  */
>        dtv[0].counter = new_gen;
> +   out:
> +      _dl_unmask_signals (&old);
>      }
>  
>    return the_map;
> @@ -710,39 +736,50 @@ tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
>  
>        the_map = listp->slotinfo[idx].map;
>      }
> -
> - again:
> -  /* Make sure that, if a dlopen running in parallel forces the
> -     variable into static storage, we'll wait until the address in the
> -     static TLS block is set up, and use that.  If we're undecided
> -     yet, make sure we make the decision holding the lock as well.  */
> -  if (__builtin_expect (the_map->l_tls_offset
> -                       != FORCED_DYNAMIC_TLS_OFFSET, 0))
> +  sigset_t old;
> +  _dl_mask_all_signals (&old);
> +
> +  /* As with update_slotinfo, we use the sigmask as a check against
> +     reentrancy.  */
> +  if (dtv[GET_ADDR_MODULE].pointer.val != TLS_DTV_UNALLOCATED)
> +    goto out;
> +
> +  /* Synchronize against a parallel dlopen() forcing this variable
> +     into static storage. If that happens, we have to be more careful
> +     about initializing the area, as that dlopen() will be iterating
> +     the threads to do so itself.  */
> +  ptrdiff_t offset;
> +  while ((offset = the_map->l_tls_offset) == NO_TLS_OFFSET)
>      {
> -      __rtld_lock_lock_recursive (GL(dl_load_lock));
> -      if (__builtin_expect (the_map->l_tls_offset == NO_TLS_OFFSET, 1))
> -       {
> -         the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
> -         __rtld_lock_unlock_recursive (GL(dl_load_lock));
> -       }
> -      else
> +      atomic_compare_and_exchange_bool_acq (&the_map->l_tls_offset,
> +                                           FORCED_DYNAMIC_TLS_OFFSET,
> +                                           NO_TLS_OFFSET);
> +    }
> +  if (offset == FORCED_DYNAMIC_TLS_OFFSET)
> +    {
> +      allocate_and_init (&dtv[GET_ADDR_MODULE], the_map);
> +    }

The previous comments should apply here in the same way as above.

> +  else
> +    {
> +      void **pp = &dtv[GET_ADDR_MODULE].pointer.val;
> +      while (atomic_forced_read (*pp) == TLS_DTV_UNALLOCATED)
>         {
> -         __rtld_lock_unlock_recursive (GL(dl_load_lock));
> -         if (__builtin_expect (the_map->l_tls_offset
> -                               != FORCED_DYNAMIC_TLS_OFFSET, 1))
> -           {
> -             void *p = dtv[GET_ADDR_MODULE].pointer.val;
> -             if (__builtin_expect (p == TLS_DTV_UNALLOCATED, 0))
> -               goto again;
> -
> -             return (char *) p + GET_ADDR_OFFSET;
> -           }
> +         /* For lack of a better (safe) thing to do, just spin.
> +           Someone else (not us; it's done under a signal mask) set
> +           this map to a static TLS offset, and they'll iterate all
> +           threads to initialize it.  They'll eventually set
> +           is_static=true, at which point we know they've fully
> +           completed initialization.  */
> +         atomic_delay ();

You could use a futex to block until *pp != TLS_DTV_UNALLOCATED.  If
this is a one-shot transition (ie, like in a latch), then this should be
simple and would just need a suitable futex wake operation elsewhere.
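
For illustration, a minimal one-shot latch along those lines, written
against the raw futex syscall and GCC atomics rather than our internal
wrappers (note that futexes operate on 32-bit words, so on 64-bit
targets the dtv pointer itself could not serve as the futex word; a
separate flag like the one below would be needed):

#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int ready;   /* 0 = block not initialized yet, 1 = initialized */

static void
latch_wait (void)
{
  while (__atomic_load_n (&ready, __ATOMIC_ACQUIRE) == 0)
    /* Sleep until latch_set () wakes us; a spurious or racing return
       simply re-checks the flag.  */
    syscall (SYS_futex, &ready, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
}

static void
latch_set (void)
{
  __atomic_store_n (&ready, 1, __ATOMIC_RELEASE);
  /* The transition is one-shot, so waking everyone once is enough.  */
  syscall (SYS_futex, &ready, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}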

However, the pure spin-wait is probably fine unless we want to support
programs with real-time priorities (ie, where a higher-prio thread could
starve a lower-prio thread forever, for example, if the former is
spin-waiting for a result produced by the latter).  I don't know whether
we do or do not want to support that, so I'm leaving that to others to
decide.

>         }
> +      /* Make sure we've picked up their initialization of the actual block.  */
> +      atomic_read_barrier ();

Please document where the matching release barrier or release-MO store
is.  Is it the write barrier I've commented on below?

> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index 1e0fe1f..2e4ab98 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -242,11 +242,7 @@ get_cached_stack (size_t *sizep, void **memp)
>  
>    /* Clear the DTV.  */
>    dtv_t *dtv = GET_DTV (TLS_TPADJ (result));
> -  for (size_t cnt = 0; cnt < dtv[-1].counter; ++cnt)
> -    if (! dtv[1 + cnt].pointer.is_static
> -       && dtv[1 + cnt].pointer.val != TLS_DTV_UNALLOCATED)
> -      free (dtv[1 + cnt].pointer.val);
> -  memset (dtv, '\0', (dtv[-1].counter + 1) * sizeof (dtv_t));
> +  _dl_clear_dtv (dtv);
>  
>    /* Re-initialize the TLS.  */
>    _dl_allocate_tls_init (TLS_TPADJ (result));
> @@ -1177,13 +1173,15 @@ init_one_static_tls (struct pthread *curp, struct link_map *map)
>  #  error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
>  # endif
>  
> -  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
> -  dtv[map->l_tls_modid].pointer.val = dest;
> -  dtv[map->l_tls_modid].pointer.is_static = true;
> -
>    /* Initialize the memory.  */
>    memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
>           '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
> +
> +  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
> +  dtv[map->l_tls_modid].pointer.is_static = true;
> +  atomic_write_barrier ();
> +  dtv[map->l_tls_modid].pointer.val = dest;

Is this the matching release barrier for the acquire MOs above?  If so,
please document it.  In general, if we have a barrier + atomic
load/store combination, I would prefer that we point out this
combination too (to show by which route the happens-before relation is
supposed to be established).  In this case it's rather easy to see
(well, I *guess* it's the final store :)), but documenting it can't
hurt either.
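
Concretely, something like this on both sides (a sketch of the comments
I would like to see, based on my reading of the patch):

  /* init_one_static_tls (writer):  */
  dtv[map->l_tls_modid].pointer.is_static = true;
  /* Release barrier: orders the initialization of the TLS block and of
     is_static before the store of pointer.val below; pairs with the
     atomic_read_barrier in tls_get_addr_tail.  */
  atomic_write_barrier ();
  dtv[map->l_tls_modid].pointer.val = dest;

  /* tls_get_addr_tail (reader):  */
  while (atomic_forced_read (*pp) == TLS_DTV_UNALLOCATED)
    atomic_delay ();
  /* Acquire barrier: pairs with the atomic_write_barrier above; it
     makes the writer's initialization happen-before our use of the
     block.  */
  atomic_read_barrier ();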

