This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512


On Thu, Jul 25, 2013 at 03:17:43PM +0300, Janne Blomqvist wrote:
> > On Wed, Jul 24, 2013 at 9:52 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> >> On 07/24/2013 05:23 AM, Richard Biener wrote:
> >> > "H.J. Lu" <hjl.tools@gmail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> Here is a patch to extend x86-64 psABI to support AVX-512:
> >> >
> >> > AFAIK AVX-512 doubles the number of xmm registers. Can we get them callee-saved, please?
> >>
> >> Having them callee saved pre-supposes that one knows the width of the register.
> >>
> >> There's room in the instruction set for avx1024.  Does anyone believe that is
> >> not going to appear in the next few years?
> >>
> > It would be a mistake for Intel to focus on AVX-1024. You hit diminishing
> > returns, and only a few workloads could utilize loading 128 bytes at once.
> > The problem with vectorization is that it becomes memory bound, so you will
> > not gain much because performance is dominated by cache throughput.
> >
> > You would get a bigger speedup from more effective pipelining, more
> > fusion...
> 
> ISTR that one of the main reasons "long" vector ISAs did so well on
> some workloads was not that the vector length was big, per se, but
> rather that the scatter/gather instructions these ISAs typically have
> allowed them to extract much more parallelism from the memory
> subsystem. The typical example being sparse matrix style problems, but
> I suppose other types of problems with indirect accesses could benefit
> as well. Deeper OoO buffers would in principle allow the same memory
> level parallelism extraction, but those apparently have quite steep
> power and silicon area cost scaling (O(n**2) or maybe even O(n**3)),
> making really deep buffers impractical.
> 
> And, IIRC scatter/gather instructions are featured as of some
> recent-ish AVX-something version. That being said, maybe current
> cache-based memory subsystems are different enough from the vector
> supercomputers of yore that the above doesn't hold to the same extent
> anymore..
>
Also, this depends on how many details Intel got right. One example is the
pmovmsk instruction: it is trivial to implement in silicon and gives an
advantage over other architectures.

When the problem is 'find the elements in an array that satisfy some expression',
then without pmovmsk or an equivalent, finding which elements matched is relatively expensive.

One problem is that, depending on the profile, you may spend the majority of
the time on small sizes. So you need effective branches for those sizes
(gcc does not handle that well yet), and then you run into the problem that
this increases icache pressure.

Then another problem is that you could often benefit from vector
instructions if you could read/write more memory than requested. Reading can
be done inexpensively by checking whether the access crosses a page boundary;
writing extra data is a problem, so we take a suboptimal path just to write
only the data that changed.

This could also be solved in hardware if a masked move instruction committed
only the memory accesses that changed, and thus avoided possible race
conditions in the unchanged parts.
> 
> --
> Janne Blomqvist

