[C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

Mon Oct 29 06:07:00 GMT 2018

Hi

 Â Â Â  Some feedback regarding this patch ?

Thanks,
FranÃ§ois

On 8/21/18 10:34 PM, FranÃ§ois Dumont wrote:
> I missed a test that was failing because of this patch. So I revert a 
> small part of it and here is the new proposal.
>
> Tested under Linux x86_64, ok to commit ?
>
> FranÃ§ois
>
>
> On 24/07/2018 12:22, FranÃ§ois Dumont wrote:
>> Hi
>>
>> Â Â Â  Any chance to review this patch ?
>>
>> FranÃ§ois
>>
>>
>> On 06/06/2018 18:39, FranÃ§ois Dumont wrote:
>>> Hi
>>>
>>> Â Â Â  I review and rework this proposal. I noticed that the same idea 
>>> to limit buffer size within inplace_merge also apply to stable_sort.
>>>
>>> Â Â Â  I also change the decision when buffer is too small to consider 
>>> the buffer size rather than going through successive cuts of the 
>>> original ranges. This way we won't cut the range more than 
>>> necessary. Note that if you prefer this second part could be 
>>> committed seperatly.
>>>
>>> Â Â Â  PR libstdc++/83938
>>> Â Â Â  * include/bits/stl_algo.h:
>>> Â Â Â  (__stable_partition_adaptive): When buffer to small cut range using
>>> Â Â Â  buffer size.
>>> Â Â Â  (__inplace_merge): Take temporary buffer length from smallest 
>>> range.
>>> Â Â Â  (__merge_adaptive): When buffer too small consider smallest 
>>> range first
>>> Â Â Â  and cut based on buffer size.
>>> Â Â Â  (__stable_sort_adaptive): When buffer too small cut based on buffer
>>> Â Â Â  size.
>>> Â Â Â  (__stable_sort): Limit temporary buffer length.
>>> Â Â Â  * include/bits/stl_tempbuf.h (get_temporary_buffer): Change 
>>> function
>>> Â Â Â  to reduce requested buffer length on allocation failure.
>>>
>>> Tested under Linux x86_64.
>>>
>>> Ok to commit ?
>>>
>>> FranÃ§ois
>>>
>>>
>>> On 25/01/2018 23:37, chang jc wrote:
>>>> Hi:
>>>>
>>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>>
>>>> 2. The coding gain has been shownÂ  PR c++/83938; I re-post here
>>>>
>>>>
>>>>
>>>>
>>>> Â Â  21
>>>> Â Â  20
>>>> Â Â  19
>>>> Â Â  18
>>>> Â Â  17
>>>> Â Â  16
>>>>
>>>>
>>>> Â Â  0.471136
>>>> Â Â  0.625695
>>>> Â Â  0.767262
>>>> Â Â  0.907461
>>>> Â Â  1.04838
>>>> Â Â  1.19508
>>>>
>>>>
>>>> Â Â  0.340845
>>>> Â Â  0.48651
>>>> Â Â  0.639139
>>>> Â Â  0.770133
>>>> Â Â  0.898454
>>>> Â Â  1.04632
>>>>
>>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>>> 4325376 & B with 12451840 elements, total with 16M integers
>>>>
>>>> The proposed method has the speed up under given buffer size =, ex
>>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>>> given sizof(int)*64K bytes.
>>>>
>>>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>>
>>>> 4. It represents a fact that the intuitive idea to split from larger
>>>> part is wrong.
>>>>
>>>> For example, if you have an input sorted array A & B, A has 8 integers
>>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>>
>>>> Current it tries to split from B, right?
>>>>
>>>> Then we have:
>>>>
>>>> A1 | A2Â  B1 | B2
>>>>
>>>> B1 & B2 has 12 integers each, right?
>>>>
>>>> Current algorithm selects pivot as 13th integer from B, right?
>>>>
>>>> If the corresponding upper bound of A is 6th integer.
>>>>
>>>> Then it split in
>>>>
>>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>>
>>>> After rotate, we have two arrays to merge
>>>>
>>>> [A1 = 5 | B1 = 12]Â  & [A2 = 3 | B2 = 12]
>>>>
>>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>>
>>>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>>
>>>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>>>> | B1 = 12] again.
>>>>
>>>>
>>>> But wait, if we always split from the smaller one instead of larger 
>>>> one.
>>>>
>>>> After rotate, it promises two split arrays both contain 
>>>> ceiling[small/2].
>>>>
>>>> Since tmp buffer also allocate by starting from sizeof(small) &
>>>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>>
>>>> It means the allocated tmp buffer promises to be sufficient at the
>>>> level of (# of fail allocate).
>>>>
>>>> Instead, you can see if split from large at level (# of fail allocate)
>>>> several split array still CANNOT useÂ  tmp buffer to do buffered merge.
>>>>
>>>>
>>>> As you know, buffered merge is far speed then (split, rotate, and
>>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>>
>>>> the way should provide speedup.
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 24/01/2018 18:23, FranÃ§ois Dumont wrote:
>>>>
>>>> Hi
>>>>
>>>>
>>>> Â Â Â Â  It sounds like a very sensitive change to make but nothing 
>>>> worth figures.
>>>> Do you have any bench showing the importance of the gain ?
>>>>
>>>> Â Â Â Â  At least the memory usage optimization is obvious.
>>>>
>>>> On 19/01/2018 10:43, chang jc wrote:
>>>>
>>>> Current std::inplace_merge() suffers from performance issue by 
>>>> inefficient
>>>>
>>>> logic under limited memory,
>>>>
>>>> It leads to performance downgrade.
>>>>
>>>> Please help to review it.
>>>>
>>>> Index: include/bits/stl_algo.h
>>>> ===================================================================
>>>> --- include/bits/stl_algo.hÂ Â Â  (revision 256871)
>>>> +++ include/bits/stl_algo.hÂ Â Â  (working copy)
>>>> @@ -2437,7 +2437,7 @@
>>>> Â Â Â Â Â Â Â Â  _BidirectionalIterator __second_cut = __middle;
>>>> Â Â Â Â Â Â Â Â  _Distance __len11 = 0;
>>>> Â Â Â Â Â Â Â Â  _Distance __len22 = 0;
>>>> -Â Â Â Â Â  if (__len1 > __len2)
>>>> +Â Â Â Â Â  if (__len1 < __len2)
>>>> Â Â Â Â Â Â Â Â Â Â  {
>>>> Â Â Â Â Â Â Â Â Â Â Â Â  __len11 = __len1 / 2;
>>>> Â Â Â Â Â Â Â Â Â Â Â Â  std::advance(__first_cut, __len11);
>>>> @@ -2539,9 +2539,15 @@
>>>> Â Â Â Â Â Â Â Â  const _DistanceType __len1 = std::distance(__first, 
>>>> __middle);
>>>> Â Â Â Â Â Â Â Â  const _DistanceType __len2 = std::distance(__middle, __last);
>>>>
>>>> +
>>>>
>>>> Â Â Â Â Â Â Â Â  typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>> _TmpBuf;
>>>>
>>>> -Â Â Â Â Â  _TmpBuf __buf(__first, __last);
>>>> -
>>>> +Â Â Â Â Â  _BidirectionalIterator __start, __end;
>>>> +Â Â Â Â Â  if (__len1 < __len2) {
>>>> +Â Â Â  __start = __first; __end = __middle;
>>>> +Â Â Â Â Â  } else {
>>>> +Â Â Â  __start = __middle; __end = __last;
>>>> +Â Â Â Â Â  }
>>>> +Â Â Â Â Â  _TmpBuf __buf(__start, ___end);
>>>>
>>>> Note another optimization, we could introduce a _Temporary_buffer<> 
>>>> constructor
>>>> in order to write:
>>>>
>>>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>>>
>>>> So that std::distance do not need to be called again.
>>>>
>>>> Â Â Â Â Â Â Â Â  if (__buf.begin() == 0)
>>>> Â Â Â Â Â Â  std::__merge_without_buffer
>>>> Â Â Â Â Â Â Â Â  (__first, __middle, __last, __len1, __len2, __comp);
>>>> Index: include/bits/stl_tempbuf.h
>>>> ===================================================================
>>>> --- include/bits/stl_tempbuf.hÂ Â Â  (revision 256871)
>>>> +++ include/bits/stl_tempbuf.hÂ Â Â  (working copy)
>>>> @@ -95,7 +95,7 @@
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  std::nothrow));
>>>> Â Â Â Â Â Â Â Â  if (__tmp != 0)
>>>> Â Â Â Â Â Â Â Â Â Â  return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>> -Â Â Â Â Â  __len /= 2;
>>>> +Â Â Â Â Â  __len = (__len + 1) / 2;
>>>>
>>>> This part is problematic, if __len is 1 and allocation fails it 
>>>> will loop
>>>> forever.
>>>>
>>>> It doesn't seem really necessary for your patch.
>>>>
>>>>
>>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>>
>>>>> This is a libstdc++ bug and patch, not the C++ front end.Â  So I'm
>>>>> adding the libstdc++ list to CC.
>>>>>
>>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>Â  
>>>>> wrote:
>>>>>> Current std::inplace_merge() suffers from performance issue by
>>>>> inefficient
>>>>>> logic under limited memory,
>>>>>>
>>>>>> It leads to performance downgrade.
>>>>>>
>>>>>> Please help to review it.
>>>>>>
>>>>>> Index: include/bits/stl_algo.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_algo.hÂ Â Â Â  (revision 256871)
>>>>>> +++ include/bits/stl_algo.hÂ Â Â Â  (working copy)
>>>>>> @@ -2437,7 +2437,7 @@
>>>>>> Â Â Â Â Â Â Â Â Â Â  _BidirectionalIterator __second_cut = __middle;
>>>>>> Â Â Â Â Â Â Â Â Â Â  _Distance __len11 = 0;
>>>>>> Â Â Â Â Â Â Â Â Â Â  _Distance __len22 = 0;
>>>>>> -Â Â Â Â Â Â Â Â  if (__len1 > __len2)
>>>>>> +Â Â Â Â Â Â Â Â  if (__len1 < __len2)
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â  {
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â  __len11 = __len1 / 2;
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â  std::advance(__first_cut, __len11);
>>>>>> @@ -2539,9 +2539,15 @@
>>>>>> Â Â Â Â Â Â Â  const _DistanceType __len1 = std::distance(__first, 
>>>>>> __middle);
>>>>>> Â Â Â Â Â Â Â  const _DistanceType __len2 = std::distance(__middle, 
>>>>>> __last);
>>>>>>
>>>>>> +
>>>>>> Â Â Â Â Â Â Â  typedef _Temporary_buffer<_BidirectionalIterator, 
>>>>>> _ValueType>
>>>>> _TmpBuf;
>>>>>> -Â Â Â Â Â  _TmpBuf __buf(__first, __last);
>>>>>> -
>>>>>> +Â Â Â Â Â  _BidirectionalIterator __start, __end;
>>>>>> +Â Â Â Â Â  if (__len1 < __len2) {
>>>>>> +Â Â Â Â Â Â  __start = __first; __end = __middle;
>>>>>> +Â Â Â Â Â  } else {
>>>>>> +Â Â Â Â Â Â  __start = __middle; __end = __last;
>>>>>> +Â Â Â Â Â  }
>>>>>> +Â Â Â Â Â  _TmpBuf __buf(__start, ___end);
>>>>>> Â Â Â Â Â Â Â  if (__buf.begin() == 0)
>>>>>> Â Â Â Â Â Â Â Â  std::__merge_without_buffer
>>>>>> Â Â Â Â Â Â Â Â Â Â  (__first, __middle, __last, __len1, __len2, __comp);
>>>>>> Index: include/bits/stl_tempbuf.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_tempbuf.hÂ  (revision 256871)
>>>>>> +++ include/bits/stl_tempbuf.hÂ  (working copy)
>>>>>> @@ -95,7 +95,7 @@
>>>>>> std::nothrow));
>>>>>> Â Â Â Â Â Â Â Â Â Â  if (__tmp != 0)
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â  return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>> -Â Â Â Â Â Â Â Â  __len /= 2;
>>>>>> +Â Â Â Â Â Â Â Â  __len = (__len + 1) / 2;
>>>>>> Â Â Â Â Â Â Â Â  }
>>>>>> Â Â Â Â Â Â Â  return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>> Â Â Â Â Â  }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>
>>>
>>
>