Abstract

The article states the problem of matrix multiplication. It is shown that the problem is simple to formulate, but solving it efficiently may require both heuristic methods and a set of modifications, at the level of algorithmic and high-level software optimization, that take the specifics of the problem into account and increase multiplication performance. A comparative analysis of performance with and without GPU-specific optimizations showed that computations that do not optimize access to global GPU memory achieve low processing performance. Optimizing how data is distributed between the GPU's global and local memory reduces computation time and increases real performance. To compare the software implementations developed for OpenGL and CUDA, identical calculations were performed on identical GPUs; they showed higher real performance when using CUDA. Specific performance values measured for the multithreaded GPU implementation are given for each of the described optimizations. The most effective approach is caching sub-blocks of the matrices (tiles) in the GPU's on-chip local memory; with a specialized software implementation it provides 275.3 GFLOP/s on a GeForce GTX 960M GPU.
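The tiling idea named in the abstract can be illustrated with a minimal CPU-side sketch. It is not the authors' OpenGL or CUDA kernel: the function names, the default tile size, and the use of small Python lists as a stand-in for on-chip local memory are all assumptions made for illustration. Each pair of tiles is staged once and then reused for a whole block of the result, which is the property that reduces traffic to slow global memory on a GPU.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_tiled(a, b, tile=4):
    """Blocked multiply (hypothetical sketch): stage one tile of A and one
    tile of B in small buffers, standing in for on-chip local memory."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # "Load" the current tiles once; edge tiles may be smaller.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Accumulate the partial products of this tile pair into C.
                for ii, a_row in enumerate(a_tile):
                    c_row = c[i0 + ii]
                    for kk, a_val in enumerate(a_row):
                        b_row = b_tile[kk]
                        for jj, b_val in enumerate(b_row):
                            c_row[j0 + jj] += a_val * b_val
    return c
```

In a GPU kernel the two staged tiles would typically live in shared (local) memory and be loaded cooperatively by the threads of a work-group; the loop structure above only mirrors the data-reuse pattern, not the parallel execution.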

Highlights

  • The problem of computing the product of dense matrices arises in a number of scientific and engineering fields

  • It is shown that the problem is simple to formulate, but solving it efficiently may require both heuristic methods and a set of modifications, at the level of algorithmic and high-level software optimization, that take the specifics of the problem into account and increase multiplication performance. A comparative analysis of performance with and without GPU-specific optimizations showed that computations that do not optimize access to global GPU memory achieve low processing performance

  • 2. Vatutin E.I., Martynov I.A., Titov V.S. Evaluation of the real performance of modern video cards supporting CUDA technology in the matrix multiplication problem

Summary

TECHNICAL SCIENCES

Measurements of the real performance achieved on the NVidia GeForce GTX 960M GPU gave 275.3 GFLOP/s, which is approximately 10–20% lower than the results obtained under the same experimental conditions for the same GPU using the CUDA toolkit. Algorithmic optimization of the software implementation of dense real matrix multiplication algorithms on graphics processors supporting OpenGL technology // Proceedings of the Southwest State University (Izvestiya Yugo-Zapadnogo gosudarstvennogo universiteta). This OpenGL kernel is launched with a work-group size of 32 for the test GPUs, which were the GeForce GTX 960M and …. Comparison of processing performance on the CPU and the GPU for the unoptimized implementation, CPU Intel Core i7-4750HQ + GPU GeForce GTX 960M. On the OpenGL platform, the j-th column of matrix B is cached in the fast local memory of the work-group. To optimize access to local memory, a multiplication algorithm that caches the i-th row of matrix A was also implemented; the corresponding kernel is given below.
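The kernel from the paper is not reproduced in this summary. As a rough illustration only, the row-caching idea can be sketched on the CPU as follows; the function names are hypothetical, a plain Python list stands in for work-group local memory, and the `gflops` helper assumes the conventional count of 2·N³ floating-point operations for an N×N product, which the summary does not state explicitly.

```python
def matmul_row_cached(a, b):
    """For each i, copy row i of A into a buffer once and reuse it for
    every column j of B (CPU stand-in for work-group local memory)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_cache = list(a[i])  # single read of row i from "global" memory
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += row_cache[k] * b[k][j]
            c[i][j] = acc
    return c


def gflops(n, seconds):
    """Real performance of an n x n product, assuming 2 * n**3 operations."""
    return 2.0 * n ** 3 / seconds / 1e9
```

Under that operation count, a 1000×1000 product that takes one second corresponds to `gflops(1000, 1.0)`, that is, 2 GFLOP/s. Caching the j-th column of B instead would follow the same pattern with the roles of the two matrices swapped.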

The results of using this