High Performance Computing - Key factor for the competitiveness of the Country, Science and...
High Performance Computing
Key factor for the competitiveness of the Country, Science, and Industry
Igor Freitas, Application Engineer, 05/11/2015
Agenda
What is High Performance Computing?
HPC & Competitiveness of Industry, Science, and the Country
Intel's HPC Initiatives in Brazil
What is High Performance Computing?
“High-performance computing (HPC) is the use of parallel processing for running advanced application programs efficiently, reliably and quickly. The term applies especially to systems that function above a teraflop or 10^12 floating-point operations per second.”
Or, in a simpler way...
How to solve the hardest problems in the world, in every aspect of our lives, using a powerful and efficient supercomputer
Extending to New Dimensions
HPC can be used in different areas of science and industry
HPC applications:
Enterprise applications
Medical image analysis
Climate modeling & weather forecasting
Financial markets
Energy – seismic applications
Digital content
Molecular dynamics
Fluid dynamics
Manufacturing and CAD/CAM
DNA sequencing
Automation in the electronics industry
Defense & security
Search engines
Parallel databases
Business intelligence / data mining
Democratizing supercomputer performance and operation
IBM's “Automatic Sequence Controlled Calculator”, or “Mark I”
Mission: “to develop a machine that could perform fast scientific calculations in order to understand matters of war, such as the trajectory of warheads”
“This involved translating mathematical problems into a numerical language that the computer could understand.”
Grace Murray Hopper at the UNIVAC keyboard, c. 1960 - Source
The democratization of HPC clusters
The last 20 years
[Chart: $/FLOP versus year, 1994¹ to 2014; >15,000X improvement¹]
Advances in science
High ROI in the industrial innovation process
Beowulf Cluster*
¹Source: Intel per socket estimate comparing Intel DX4™ processor (Beowulf) versus Intel® Xeon Phi™ (Knights Corner).
*Other brands and names are the property of their respective owners.
HPC vs Big Data
Programming model
  HPC: FORTRAN / C++ applications, MPI; high performance
  Big Data: Java* applications, Hadoop*; simple to use
Resource manager
  HPC: SLURM; supports large-scale startup
  Big Data: YARN*; more resilient to hardware failures
File system
  HPC: Lustre*; remote storage
  Big Data: HDFS*, SPARK*; local storage
Hardware
  HPC: compute & memory focused; high performance components
  Big Data: storage focused; standard server components
Infrastructure
  HPC: server, storage (SSDs), switch (fabric)
  Big Data: server, storage (HDDs), switch (Ethernet)
Daniel Reed and Jack Dongarra, “Exascale Computing and Big Data,” Communications of the ACM, July 2015 (Vol. 58, No. 7), and Intel analysis. Other brands and names are the property of their respective owners.
Big Data + HPC: “heavy” processing in real time
[Diagram axes: Data, Compute, Solution Urgency]
Small Data + Small Compute: e.g. data analysis
Big Data + Small Compute: e.g. search, streaming, data preconditioning
Small Data + Big Compute: e.g. mechanical design, multi-physics
Big Data + Big Compute: e.g. real-time local weather modeling, convolutional neural nets
FAST DATA
Intel's vision for HPC
Balanced compute, storage, and interconnects based on workload
COMPUTE, STORAGE, NETWORKING, SOFTWARE
A paradigm shift for massively parallel systems
Processor + high-speed networks + memory = Knights Landing
[Diagram labels: Coprocessor, Fabric, Memory, Server Processor, Knights Landing]
Memory bandwidth: ~500 GB/s STREAM
Memory capacity: over 25x* KNC
Resiliency: systems scalable to >100 PF
Power efficiency: over 25% better than card¹
I/O: up to 100 GB/s with integrated fabric
Cost: less costly than discrete parts¹
Flexibility: limitless configurations
Density: 3+ KNL with fabric in 1U³
*Comparison to 1st Generation Intel® Xeon Phi™ 7120P Coprocessor (formerly codenamed Knights Corner).
¹Results based on internal Intel analysis using estimated power consumption and projected component pricing in the 2015 timeframe. This analysis is provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
²Comparison to a discrete Knights Landing processor and discrete fabric component.
³Theoretical density for air-cooled system; other cooling solutions and configurations will enable lower or higher density.
A Single Architecture for HPC & Big Data
Programming model
  HPC: FORTRAN / C++ applications, MPI; high performance
  Big Data: Java* applications, Hadoop*; simple to use
Resource manager
  HPC & Big Data-aware resource manager
File system
  Lustre* with Hadoop* adapter; remote storage
Hardware
  Compute & Big Data capable; scalable performance components
Infrastructure
  Server, storage (SSDs and burst buffers), Intel® Omni-Path Architecture
*Other names and brands may be claimed as the property of others
Next steps for HPC & Big Data
Adaptable memory & storage hierarchy
[Diagram: storage tiers across Processor, Compute Node, I/O Node, and Remote Storage; bandwidth is higher, and latency and capacity lower, toward the top]
Compute today (top to bottom):
  Caches
  Local Memory
  SSD Storage
  Parallel File System (Hard Drive Storage)
Compute future (top to bottom):
  Caches
  In-Package High Bandwidth Memory*
  Non-Volatile Memory
  Burst Buffer Storage
  Parallel File System (Hard Drive Storage)
Changes from today to the future:
  Local memory is now faster & in processor package
  I/O Node storage moves to compute node
  Some remote data moves onto I/O node
*cache, memory or hybrid mode
#HPC Matters
HPC Transforms Parkinson's Disease - SC15
SC15 - Climate Modeling
HPC & Competitiveness of Industry, Science, and the Country
HPC enables a new scientific methodology
Innovation in industry
[Diagram: two iterative method cycles. One runs Hypothesis, Prediction, Physical Prototyping, then Analysis / Conclusion / Refinement. The other adds Modeling & Simulation and Experiment Refinement between Prediction and Physical Prototyping. Labels: “Accelerates the Method”, “Iterate”]
To Compete, You Must Compute
1. Satava, Richard M. “The Scientific Method Is Dead-Long Live the (New) Scientific Method.” Journal of Surgical Innovation (June 2005).
HPC & Competitiveness of Industry, Science, and the Country
• President Obama's executive order for a “national supercomputing program”
• HPC as a “top priority” to boost US competitiveness
“In order to maximize the benefits of HPC for economic competitiveness and scientific discovery, the United States Government must create a coordinated Federal strategy in HPC research, development, and deployment.” Executive Order, Barack Obama
Source: The White House, Office of the Press Secretary
Dyson Creates a Revolutionary Fan
Utilizing new scientific method
Reduced the number of costly, time-consuming physical prototypes
2.5x better fan performance while eliminating external moving parts, by investigating 10x the number of design possibilities using virtual prototyping
[Images: Dyson Air Multiplier fan; virtual prototype]
Source: Ansys Advantage, Volume IV, Issue 2, 2010, pp. 5-7. © Ansys Corp.
Audi Workflow
Real-time, photo-realistic predictive rendering
Topline (innovation):
  More compelling, accurate visualization of car design
  Avoid physical prototyping spin by identifying body part fit issues
  Reduce turn-around from identifying design changes
Bottom line (costs):
  Got the most for their Autodesk software investment with optimized performance on Intel platforms
  Intel® Xeon® Processor E5-2600 product family based solution across workstations and clusters reduced deployment and maintenance costs
[Images: virtual prototyped images. Images courtesy of The Audi Group, used by permission]
DreamWorks Animation Results
Enables more iterations, improves movie production process
Intel® Xeon® Workstation: Intel® Xeon® Processor E5-2600 product family enabled artist workstations
Large Cluster Computation: large, shared rendering clusters configured with Intel® Xeon® Processor E5-2600 product family
“By combining Xeon E5-2600 class processors with a Xeon Phi coprocessor, we are now able to provide artists with extremely high-quality light transport simulation in large scenes at interactive speeds. This enables us to bring further technical innovation to bear on the ways breathtaking film imagery is created.”
-- Evan Smyth, Staff Architect, DreamWorks Animation
[Diagram label: proprietary software]
Monsanto Result
Getting seeds to farmers quicker with fewer resources
Genomics search algorithm: BLAST
[Diagram labels: Desktop, Intel based display device (work done on cluster); Large Cluster]
Expanded shared cluster capacity with a 100+ node Intel® Xeon® processor E5-2600 product family cluster
• Compute capacity expanded 61%
• Rack space increased by only 22%
28% faster BLAST workload performance
Research team decreased time-to-results from 2 weeks to 6 days
Source: Results courtesy Monsanto Corporation, 2012
Intel's HPC Initiatives in Brazil
Oil & Gas - Reservoir Simulator at PETROBRAS
• Up to 10.5x performance gains in their reservoir simulator software
LNCC - National Laboratory for Scientific Computing, largest HPC cluster in Latin America
• Up to 30x performance gain in Oil & Gas applications
NCC / UNESP, an Intel® Modern Code Partner
• 5 HPC hands-on workshops
• 340 developers trained
• Ongoing white papers together with other institutes
Intel's HPC Initiatives in Brazil
• Modernizing applications to increase parallelism and scalability
• Leveraging the cores, caches, threads, and vector capabilities of microprocessors and coprocessors
• Current centers in Brazil
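As an illustration of what "leveraging cores, threads, and vector capabilities" can look like in source code, here is a minimal sketch in C. It is not taken from the slides: the function name `dot` and the choice of a dot product are assumptions made for the example, and it uses the standard OpenMP combined construct `parallel for simd` rather than any Intel-specific directive.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product that exposes both thread-level and vector-level parallelism.
   With OpenMP enabled at compile time, iterations are split across threads
   and each thread's chunk may be vectorized. Without OpenMP the pragma is
   ignored and the loop runs serially. The name `dot` is hypothetical. */
double dot(const double *x, const double *y, long n)
{
    assert(x != NULL && y != NULL && n >= 0);
    double sum = 0.0;
    /* reduction(+:sum) gives every thread and vector lane a private
       partial sum that is combined at the end. */
#pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the reduction changes the order of the additions, results can differ from a serial run in the last bits for general floating-point data.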
¹Author: Gilvan Vieira - [email protected] – PETROBRAS/CENPES
Case Study: PETROBRAS - Reservoir Simulation
Code optimization using the Intel® VTune™ Amplifier and Intel® Compiler tools
Up to 3.8x speedup in matrix-vector multiplications (using only 1 CPU core)
Performance gains¹
Assembly of the Fortran code: 3 scalar instructions
Assembly of the C++ templated code: 1 vectorized, 2 scalar
Speedup of the C++ template version vs. the original Fortran code, using the Intel Compiler in a Linux environment.
Part of the optimization: in this case VTune showed that the vectorized code was inefficient, so #pragma novector was used.
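The slide does not show the loop that received `#pragma novector`, so the following is only a sketch of how the directive is placed. It assumes a very short inner loop, one common case where vector code does not pay off; the block size `BLOCK` and the function name `block_matvec` are hypothetical. `#pragma novector` is an Intel compiler directive; other compilers ignore it, possibly with a warning.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 3  /* hypothetical small block size, chosen for illustration */

/* y = A * x for a small dense BLOCK x BLOCK block stored row-major. */
void block_matvec(const double *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (int row = 0; row < BLOCK; row++) {
        double acc = 0.0;
        /* Ask the Intel compiler to keep this short loop scalar. */
#pragma novector
        for (int col = 0; col < BLOCK; col++) {
            acc += A[row * BLOCK + col] * x[col];
        }
        y[row] = acc;
    }
}
```

Whether disabling vectorization helps is something to confirm with a profiler on the real code, as the case study did with VTune.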
Case Study: PETROBRAS - Reservoir Simulation
Performance gains in a parallel environment using 16 CPU cores, obtained with the Intel® Trace Analyzer & Collector tool¹
• Intel Trace Analyzer and Collector helped visualize the “serial effect communication” caused by blocking MPI_Sendrecv calls, so non-blocking calls were used instead
• Event timeline: MPI communication using 16 ranks
Performance gains from 1.28x to 10.5x in matrix-vector multiplication kernels
¹Authors: Frederico L. Cabral – [email protected], Marcio Murad – [email protected], Carla Osthoff – [email protected]
Case Study: LNCC – National Laboratory for Scientific Computing
1st project: “Fine-Tuning Xeon architecture Vectorization and Parallelization of a Numerical Method for convection-diffusion equations”
Awaiting publication in CCIS volume 565, Springer: "Second Latin American Conference, CARLA 2015, Petrópolis, Brazil, August 26-28, 2015, Proceedings/Revised Selected Papers".
Performance gain on a dual-socket Xeon® server using 56 threads
30x performance gain vs. the original code
Technical cooperation focused on Oil & Gas research projects
Case Study: LNCC – National Laboratory for Scientific Computing
Step 1: “don't guess, measure!”
Optimize applications for a single thread through vectorization
“X-ray” your application with Intel® VTune™ Amplifier
Findings:
CPU resources were being wasted
The CPU division unit was overloaded
Latency problems were hindering vectorization
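The slides do not say how the overloaded division unit was relieved. One common remedy, shown here purely as an assumption, is to replace repeated division by a loop-invariant value with multiplication by its reciprocal. The function names `scale_div` and `scale_mul` and the parameter `h` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one floating-point division per element. */
void scale_div(double *out, const double *in, size_t n, double h)
{
    assert(h != 0.0);
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] / h;
    }
}

/* Variant: the divisor does not change inside the loop, so its reciprocal
   is computed once and the loop body only multiplies. Results may differ
   from scale_div in the last bits because of rounding. */
void scale_mul(double *out, const double *in, size_t n, double h)
{
    assert(h != 0.0);
    const double inv_h = 1.0 / h;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * inv_h;
    }
}
```

In the spirit of "don't guess, measure", a change like this is worth keeping only if profiling before and after shows a gain on the target machine.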
Case Study: LNCC – National Laboratory for Scientific Computing
Step 3: give the compiler some “hints” so it can exploit the parallelism inside each CPU core
double alfa_aux = 1.0 - 2.0*alfa;
#pragma simd vectorlengthfor(double), private(alfa)
#pragma vector nontemporal(U_old) // improves cache usage
#pragma prefetch *64:128
for (i = head+1 ; i <= N-2 ; i+=2) {
    U_old[i] = alfa*(U_new[i-1] + U_new[i+1]) + alfa_aux * U_new[i];
    // U_old[i] = alfa*(U_new[i-1] + U_new[i+1]) + (1.0 - 2.0*alfa)*U_new[i];
}
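The loop on this slide relies on Intel-specific pragmas. As a portable sketch of the same update, the version below hoists the loop-invariant coefficient in the same way and swaps in the standard `#pragma omp simd` hint. This is a substitution, not the authors' code: the function name `stencil_update`, its signature, and the `restrict` qualifiers (which assume `U_old` and `U_new` are distinct arrays) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Update every second point of U_old from U_new, starting at head + 1 and
   stopping at N - 2, mirroring the loop bounds on the slide. */
void stencil_update(double *restrict U_old, const double *restrict U_new,
                    int N, int head, double alfa)
{
    assert(U_old != NULL && U_new != NULL && head >= 0 && N >= 3);
    /* Loop-invariant coefficient computed once, outside the loop. */
    const double alfa_aux = 1.0 - 2.0 * alfa;
    /* Standard OpenMP SIMD hint, used in place of the Intel pragmas.
       It is ignored when OpenMP support is not enabled. */
#pragma omp simd
    for (int i = head + 1; i <= N - 2; i += 2) {
        U_old[i] = alfa * (U_new[i - 1] + U_new[i + 1]) + alfa_aux * U_new[i];
    }
}
```

The non-temporal store and prefetch hints have no direct standard equivalent and are left out of this sketch.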
¹Authors: Frederico L. Cabral – [email protected], Marcio Rentes Borges – [email protected], Carla Osthoff – [email protected]
Case Study: LNCC – National Laboratory for Scientific Computing
2nd project: “Fine Tuning Optimization applied in a Porous Media Flow Application using Intel Tools” (to be published)
Phase 1: improve the performance of single-threaded applications on the Intel® Xeon® processor
Up to 4.1x performance gain vs. the original code (partial results)
Technical cooperation focused on Oil & Gas research projects
Case Study: FATEC – Baixada Santista Rubens Lara
“Parallel Recommender System Based on the Intel® Xeon® and Xeon Phi™”
Performance prediction with Intel® Advisor before investing effort in optimizing the code
Xeon: 16 threads would be the best scenario
Xeon Phi: 120 threads would be the best scenario
Case Study: FATEC – Baixada Santista Rubens Lara
“Parallel Recommender System Based on the Intel® Xeon® and Xeon Phi™”
Intel Compiler report
Understand what optimizations were performed... and how to extract the maximum performance.
LOOP BEGIN at regressao-xeon.c(116,18) inlined into regressao-xeon.c(55,6)
   remark #15389: vectorization support: reference beta_756 has unaligned access [ regressao-xeon.c(118,11) ]
   remark #15389: vectorization support: reference entrada_756 has unaligned access [ regressao-xeon.c(118,11) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15427: loop was completely unrolled
   remark #15399: vectorization support: unroll factor set to 6
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 12
   remark #15477: vector loop cost: 13.500
   remark #15478: estimated potential speedup: 3.640
   remark #15479: lightweight vector operations: 7
   remark #15488: --- end vector loop cost summary ---
LOOP END
Hints to declare data aligned, to assist vectorization:
double *beta = (double*) _mm_malloc (TOTBETAS * sizeof(double), AVX_ALIGN);
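The `_mm_malloc` call on the slide is specific to x86 compilers. A portable way to request the same kind of aligned buffer is C11 `aligned_alloc`, sketched below as a stand-in. The value 32 for `AVX_ALIGN` is an assumption (the slide does not give it), and the helper names `alloc_aligned_doubles` and `is_avx_aligned` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define AVX_ALIGN 32  /* assumed value: 32 bytes matches a 256-bit AVX register */

/* Allocate n doubles on an AVX_ALIGN-byte boundary. C11 aligned_alloc
   expects the size to be a multiple of the alignment, so the byte count is
   rounded up. Release the buffer with free(). Returns NULL on failure. */
double *alloc_aligned_doubles(size_t n)
{
    assert(n <= (SIZE_MAX - AVX_ALIGN) / sizeof(double));
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + AVX_ALIGN - 1) / AVX_ALIGN * AVX_ALIGN;
    if (rounded == 0) {
        rounded = AVX_ALIGN;
    }
    return (double *)aligned_alloc(AVX_ALIGN, rounded);
}

/* Nonzero when p sits on an AVX_ALIGN-byte boundary. */
int is_avx_aligned(const void *p)
{
    return ((uintptr_t)p % AVX_ALIGN) == 0;
}
```

Aligned allocation alone may not silence the "unaligned access" remarks: the compiler usually also needs to be told, at the point of use, that the pointer is aligned.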
Case Study: FATEC – Baixada Santista Rubens Lara
“Parallel Recommender System Based on the Intel® Xeon® and Xeon Phi™”
Partial conclusions – First part
• Intel Advisor performance predictions were very precise
• Although “OpenMP + MKL Offload to Xeon Phi” showed a 1.2x speedup, there is room for higher speedups!
• Possible path: investigate an MPI + OpenMP version to exploit Xeon + Xeon Phi
[Chart: speedup using only host processors as the number of threads increases]
Threads:  1     4     8     16    24    32
Speedup:  1     2.28  3.03  4.58  4.71  4.85
[Chart: speedup achieved by enabling Automatic Offload in MKL]
OpenMP+MKL: 1
OpenMP+MKL Offload: 1.23
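To make the flattening of the host-only speedup curve concrete, the sketch below computes parallel efficiency (speedup divided by thread count) and the extra speedup gained per added thread. It assumes the chart values on this slide pair up as threads 1, 4, 8, 16, 24, 32 with speedups 1, 2.28, 3.03, 4.58, 4.71, 4.85; the function names are hypothetical.

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the number of threads.
   A value of 1.0 would mean perfect scaling. */
double parallel_efficiency(double speedup, int threads)
{
    assert(threads > 0);
    return speedup / (double)threads;
}

/* Extra speedup obtained per thread added between two measurements. */
double marginal_gain(double s_from, int t_from, double s_to, int t_to)
{
    assert(t_to > t_from);
    return (s_to - s_from) / (double)(t_to - t_from);
}
```

With those values, efficiency at 16 threads is 4.58 / 16, roughly 0.29. Going from 8 to 16 threads adds about 0.19 of speedup per thread, while going from 16 to 32 adds only about 0.017 per thread, which is consistent with the Advisor prediction that 16 threads is the best scenario on the Xeon host.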