Sunway TaihuLight China Supercomputer

Dark Kaeser

Contributor
Staff
The rumors about the Shenwei CPU turned out not to be an urban myth after all, although it has not been confirmed that the chips are based on Digital's Alpha.

The Sunway TaihuLight supercomputer, which was developed at the National Research Center of Parallel Computer Engineering and Technology (NRCPC) and is in full production running early-stage workloads at the National Supercomputing Center in Wuxi, China, features the custom-designed SW26010 processor.
http://www.nextplatform.com/2016/06/20/look-inside-chinas-chart-topping-new-supercomputer/

All the articles on the subject are based on this report by Jack Dongarra, a professor affiliated with Oak Ridge National Lab

m8l34z.png



2v0g6f6.png


http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
 
Inside Look at Key Applications on China’s New Top Supercomputer

The Chinese-developed CAM weather model has been a focal point for teams to scale and represents some of the challenges inherent to exploiting a new architecture.

n315j8.jpg


Scalability and performance results for the CAM model can be seen above comparing both use with the management core and sub-cores and with just the management core. For some kernels that are compute intensive, the team saw a speedup of between 10-22X, but for others that were memory-bound, the speedup wasn’t high, just 2-3X. The results here show speedup for the entire model and if there is any takeaway here, this is scaling to quite impressive heights for code that’s still in process on a new architecture—1.5 million cores.
http://www.nextplatform.com/2016/06/30/inside-look-key-applications-chinas-new-top-supercomputer/

That said, Dongarra's report had already mentioned that, depending on the application, in some cases the gains were only 30%.


And a look at what the Chinese have been doing or are planning for the future:

China’s Triple Play For Pre-Exascale Systems

As it turns out, China is not betting solely on the Shenwei chips, and apparently has plans to build three different pre-exascale systems with three very different architectures, according to some Tweets put out by James Lin, vice director for the Center of HPC at Shanghai Jiao Tong University
...
According to Lin, the push toward exascale machines in China will set up a three-way horse race between different organizations to build pre-exascale clusters based on ARM, Shenwei, and AMD (presumably Opteron) technologies.
http://www.nextplatform.com/2016/07/11/chinas-triple-play-pre-exascale-systems/
 
Rise of China, Real-World Benchmarks Top Supercomputing Agenda
The system burst onto the Top500 list last year on the strength of the dominant results from the Linpack benchmark. According to Dongarra, the TaihuLight has 1.1 times the performance of all the systems run by the US Department of Energy combined. However, the story changes a bit when the system is measured by the HPCG benchmark, with its focus on real applications. On that list, the massive system comes in fourth and gets only 0.3% system utilization when looking at the theoretical peak. Most systems were getting somewhere between 1% and 3%, with the Riken system from Japan hitting 5.3%.

https://www.nextplatform.com/2017/0...genda-marks-rise-china-real-world-benchmarks/
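To put those percentages in absolute terms, here is a minimal sketch. It takes the 125 petaflops peak figure quoted elsewhere in this thread for TaihuLight and applies the utilization percentages from the article. The function name is made up for illustration, and the outputs are implied values, not measured HPCG results.

```python
def implied_hpcg_pflops(peak_pflops: float, utilization_pct: float) -> float:
    """Sustained HPCG rate implied by a peak figure and a utilization percentage."""
    return peak_pflops * utilization_pct / 100.0


TAIHULIGHT_PEAK_PFLOPS = 125.0  # peak double precision, as quoted in this thread

# TaihuLight at the 0.3% utilization reported in the article.
taihulight = implied_hpcg_pflops(TAIHULIGHT_PEAK_PFLOPS, 0.3)

# Hypothetical: the same peak at the 1% to 3% range most systems reach.
typical_low = implied_hpcg_pflops(TAIHULIGHT_PEAK_PFLOPS, 1.0)
typical_high = implied_hpcg_pflops(TAIHULIGHT_PEAK_PFLOPS, 3.0)

print(f"TaihuLight at 0.3%: {taihulight:.3f} PF")
print(f"Same peak at 1%-3%: {typical_low:.2f} to {typical_high:.2f} PF")
```

In other words, roughly a third of a petaflops on HPCG out of 125 petaflops of peak, if the rounded figures hold.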
 
A group of scientists from the Center for High Performance Computing at Shanghai Jiao Tong University (China) and the Tokyo Institute of Technology (Japan) published a study on their attempt to run an application, the Open Source Field Operation and Manipulation (OpenFOAM) CFD (Computational Fluid Dynamics) package, on this system:

They called OpenFOAM one of the most popular CFD applications built on C++.

In their study, titled “Hybrid Implementation and Optimization of OpenFOAM on the SW26010 Many-core Processor,” the researchers laid out the challenge presented by the chip when running C++ programs.

“The processor includes four core groups(CGs), each of which consists of one management processing element (MPE) and sixty-four computing processing elements (CPEs) arranged by an eight by eight grid,” they wrote. “The basic compiler components on MPE support C/C++ programming language, while the compiler components on CPE only support C. The compilation incompatibility problem makes it difficult for C++ programs to exploit the computing power of the SW26010 processor.”
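The core count follows directly from that description. A minimal sketch, using only the numbers in the quote (the variable and function names are mine):

```python
CORE_GROUPS = 4          # core groups (CGs) per SW26010
MPES_PER_GROUP = 1       # one management processing element per CG
CPES_PER_GROUP = 8 * 8   # computing processing elements in an 8x8 grid


def cores_per_chip(core_groups: int) -> int:
    """Total cores on a chip built from identical core groups."""
    return core_groups * (MPES_PER_GROUP + CPES_PER_GROUP)


sw26010_cores = cores_per_chip(CORE_GROUPS)
print(sw26010_cores)  # 260
```

That is 260 cores per chip, of which only the 4 MPEs can be targeted by the C++ compiler, which is the crux of the porting problem the researchers describe.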

In the study, the application was compared against a Xeon E5-2695 v3 (2.3GHz) system.

In the tests comparing the performance of the MPE, the CPE cluster and the Intel chip, they found that after optimizing the CPE cluster, there was an 8.03-times performance increase based on the optimized implementation on the MPE. In addition, the CPE cluster was 1.18 times faster than the single-core Intel chip. However, while the CPE cluster performance was better than that of the Intel processor, there were issues with efficiency. Those were due to a smaller cache and scratchpad memory (SPM) size of the SW26010, which means having to repeatedly load data into the SPM and hindering memory access. In addition, the DMA latency was high and the automatic optimizations of the SW26010 applied by the compiler were less efficient than with the Intel chip.
https://www.nextplatform.com/2017/0...computer-blazes-real-world-application-trail/

Setting aside some problems, such as the MPE supporting C/C++ while the CPEs only support C, and even granting that they managed to work around that issue, the small caches and the need to keep copying data then undid much of the gain. Still, with a few more attempts this could yet put up a fight.
 
A First Peek At China’s Sunway Exascale Supercomputer
So, if you were building the Sunway TaihuLight supercomputer, how would you scale it from 125 petaflops peak double precision performance up to at least 1 exaflops? There are a bunch of ways to do it, but according to this paper, NRCPC took some obvious approaches enabled in part by the chip process manufacturing shrink that was no doubt used from Semiconductor Manufacturing International Corp.
Drilling down into the architecture of the Sunway exascale machine, the first thing that NRCPC did was to double up the number of Core Groups on the future processor from four with the SW26010 to eight with what we will call the SW52020. That has the effect of cramming twice as many compute elements into the same building block, and that gives a 2X boost in peak performance at the same clock speed, device to device. So that is from 260 cores to 520 cores, just to lay it out.
The second thing that NRCPC did was stretch the vector engines in the MPE and CPE cores from 256-bits to 512-bits. And combined with the doubling of the compute elements, the performance of the SW52020 is now wider. Moreover, the vector units not only support 32-bit single precision and 64-bit double precision floating point operations, but they now also support 16-bit half precision floating point, which is useful for certain HPC and many AI workloads.
That is only halfway to exascale, unless you want to cheat and use lower precision – which we don’t want to do and neither does NRCPC. So there is only one option at that point, and that is to expand the network and add more nodes to the system. And this is precisely what NRCPC is doing with the Sunway exascale system. The paper says the machine will have over 80,000 nodes, but our math says it will take 81,920 nodes to do the job if the clock speeds stay the same at 1.45 GHz with what we are calling the SW52020. That gets you to the mythical and magical 1.028 exaflops number for peak performance, and the future Sunway machine can legitimately be called an exascale-class system for both HPC and AI.

Here is a block diagram of the Sunway exascale system:

sunway-exascale-block-diagram.jpg

https://www.nextplatform.com/2021/02/10/a-sneak-peek-at-chinas-sunway-exascale-supercomputer/
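The scaling argument in the article can be checked with back-of-the-envelope arithmetic. This is only a sketch under stated assumptions: the TaihuLight node count of 40,960 is not given in the article, and I assume that doubling the vector width doubles per-core peak at the same clock.

```python
TAIHULIGHT_PEAK_PFLOPS = 125.0  # from the article (rounded)
TAIHULIGHT_NODES = 40_960       # assumption: not stated in the article
EXASCALE_NODES = 81_920         # from the article

core_group_factor = 8 / 4       # core groups per chip: four -> eight
vector_factor = 512 / 256       # vector width in bits (assumed to scale peak linearly)
node_factor = EXASCALE_NODES / TAIHULIGHT_NODES

per_chip_factor = core_group_factor * vector_factor
peak_exaflops = TAIHULIGHT_PEAK_PFLOPS * per_chip_factor * node_factor / 1000.0

print(f"per-chip gain: {per_chip_factor:.0f}x")
print(f"node gain:     {node_factor:.0f}x")
print(f"peak:          {peak_exaflops:.2f} exaflops")
```

The chip changes alone give 4x (about 500 petaflops, the "halfway" point the article mentions) and doubling the nodes gives the rest. With the rounded 125 petaflops input this lands at about 1 exaflops, slightly under the article's 1.028 figure, which I have not tried to reproduce exactly.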
 
A short excerpt from a PDF with some details on the Tianhe-3 prototype, which uses other Chinese processors.

4.1 Tianhe-3 prototype

Tianhe-3 prototype has a high-performance cluster environment, which is designed for high-performance computing and massive data processing. It provides two kinds of CPUs, including Phytium MT2000+ and Phytium FT2000+. Tianhe-3 prototype owns 512 boards with three Phytium MT2000+ CPUs on each board, and 128 boards with four Phytium FT2000+ CPUs on each board. Each Phytium MT2000+ CPU is divided into four nodes, which owns 32 cores and 16GB RAM. Each Phytium FT2000+ CPU owns 64 cores and 64GB RAM. Besides, the floating-point computing performance of Tianhe-3 prototype could reach 3.146PFlops. The capacity of total parallel storage is 1PB, which could meet the needs of users.

https://link.springer.com/content/pdf/10.1186/s42774-020-00056-5.pdf
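Adding up the numbers in that excerpt gives a feel for the size of the prototype. A minimal sketch, assuming I read the excerpt correctly that each MT2000+ node (not each CPU) has 32 cores and 16GB, so one MT2000+ CPU is 4 nodes with 128 cores and 64GB. The `Partition` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    boards: int
    cpus_per_board: int
    cores_per_cpu: int
    ram_gb_per_cpu: int

    @property
    def cpus(self) -> int:
        return self.boards * self.cpus_per_board

    @property
    def cores(self) -> int:
        return self.cpus * self.cores_per_cpu

    @property
    def ram_gb(self) -> int:
        return self.cpus * self.ram_gb_per_cpu


# Assumption: 4 nodes per MT2000+ CPU, each node with 32 cores and 16GB.
mt = Partition("MT2000+", boards=512, cpus_per_board=3,
               cores_per_cpu=4 * 32, ram_gb_per_cpu=4 * 16)
ft = Partition("FT2000+", boards=128, cpus_per_board=4,
               cores_per_cpu=64, ram_gb_per_cpu=64)

for p in (mt, ft):
    print(f"{p.name}: {p.cpus} CPUs, {p.cores} cores, {p.ram_gb} GB RAM")

total_cores = mt.cores + ft.cores
print(f"total cores: {total_cores}")
```

Under that reading the prototype has 1,536 MT2000+ CPUs and 512 FT2000+ CPUs, for a little under 230,000 cores in total.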

What links these two companies, Phytium and Sunway, is that a few days ago the United States blacklisted them. Phytium, at least, had its chips manufactured at TSMC, and both use "Western" technology and tools.

U.S. Blacklists Two Major Chinese CPU Developers


The U.S. Commerce Department has added seven Chinese entities to the DoC's Entity List, essentially barring these companies and organizations from obtaining almost all advanced technologies developed in the U.S. Among the entities are two major CPU developers from China: Tianjin Phytium Information Technology and Sunway Microelectronics (or Shenwei Microelectronics).

The Department of Commerce’s Bureau of Industry and Security (BIS) believes that the newly added seven entities supported modernization of the Chinese People’s Liberation Army by producing supercomputers used for military purposes, development of new weapons of mass destruction as well as other destabilizing efforts. In particular, BIS blacklisted four supercomputer sites in China, including the National Supercomputing Center Jinan, the National Supercomputing Center Shenzhen, the National Supercomputing Center Wuxi, and the National Supercomputing Center Zhengzhou.

Also, the blacklist now includes CPU designer Tianjin Phytium Information Technology, which develops system-on-chips for client and server PCs based on the Armv8 ISA, and Sunway Microelectronics, which as a part of Shanghai High-Performance Integrated Circuit Design Center, designs proprietary supercomputer processors.

The inclusion of an entity into the Entity List restricts its ability to access items and technologies that are parts of the U.S. Export Administration Regulations (EAR). American companies cannot export, re-export, or transfer items subject to the EAR to entities in the Entity List without a special license, which will be subject to a presumption of denial.

CPUs and SoCs, including those for supercomputers, are designed using electronic design automation (EDA) as well as other tools and technologies developed in the U.S. Without access to these tools and technologies, it will be close to impossible for Phytium or Sunway to develop their processors. It is unclear whether contract makers of semiconductor like TSMC or SMIC can actually produce chips for Phytium and Sunway.

"I have not in my decade in China met a chip design company that isn’t using either Synopsys or Cadence," said Stewart Randall, a consultant in Shanghai who sells electronic design automation software to top Chinese chipmakers in a conversation with The Washington Post.

Many supercomputer centers in China nowadays use CPUs and SoCs developed in the country, but they still use certain technologies designed in the U.S. From now on, those who want to sell the aforementioned four supercomputer centers in China something made or developed in the U.S. will have to apply for an appropriate license.

"Supercomputing capabilities are vital for the development of many – perhaps almost all – modern weapons and national security systems, such as nuclear weapons and hypersonic weapons," said U.S. Secretary of Commerce Gina M. Raimondo in a statement. "The Department of Commerce will use the full extent of its authorities to prevent China from leveraging U.S. technologies to support these destabilizing military modernization efforts."

Previously the DoC blacklisted Huawei Technologies and its chip design arm HiSilicon as well as contract maker of chips SMIC on the same grounds of supporting Chinese military efforts.

https://www.tomshardware.com/uk/new...u-developers-supercomputer-centers-from-china

It will be interesting to see what consequences this has for the next Chinese supercomputers.
 
Thanks for the article 🤓

Yes, I had indeed read the article about new Chinese companies and entities being added to the blacklist. By Biden! 😲

Without access to TSMC, that leaves SMIC, but SMIC is still at 14nm, and its 7nm plans have fallen through given the ban on access to ASML's EUV and the lack of alternatives on the domestic market (I mentioned this somewhere). Not that 7nm is impossible without EUV, but it is not easy: EUV saves quite a few steps in the process, and in terms of quality using EUV is better (just ask Intel).

But the list also covers EDA, and the companies mentioned (Cadence and Synopsys) do not just sell software:

Design IP Sales Grew 16.7% in 2020, to reach $4.6B and this is the best growth since year 2000!
Table-IP-vendors-2021.jpg.webp

https://semiwiki.com/ip/297995-design-ip-sales-grew-16-7-in-2020-best-growth-rate-ever/

Much of the IP and software they "sell" actually has a hardware component as well.


Cadence Dynamic Duo Upgrade Debuts
Dynamic-Duo-min-768x397.png.webp

https://semiwiki.com/eda/297771-cadence-dynamic-duo-upgrade-debuts/

As a rule, this comes with the PDKs for the foundries' processes included.

Missing from the list above is the "other" EDA vendor, Siemens (actually the former Mentor Graphics):

Siemens EDA Updates, Completes Its Hardware-Assisted Verification Portfolio
Siemens-Hardware-assisted-Verification-platform-launch_graphic-2_32521-min-768x432.png.webp

https://semiwiki.com/eda/297619-siemens-accelerator-line-updated-completed/

Let's see what the next episodes bring.
 
It seems China already has not one but two supercomputers that exceed 1 exaflops in FP64, and has had them since March 2021. That should put them roughly a year ahead of the United States at this point.
One of them is from Sunway, the "OceanLite". The other would be from Phytium, the "Tianhe-3".

"OceanLite", with 42 million cores:
We have it on outstanding authority (under condition of anonymity) that LINPACK was run in March 2021 on the Sunway “Oceanlite” system, which is the follow-on to the #4-ranked Sunway TaihuLight machine. The results yielded 1.3 exaflops peak performance with 1.05 sustained performance in the ideal 35 megawatt power sweet spot.
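Two derived figures are worth having at hand here. A quick sketch from the three numbers quoted above, assuming the 35 megawatt figure applies to the same LINPACK run as the sustained result (the article does not say so explicitly):

```python
PEAK_EXAFLOPS = 1.3
SUSTAINED_EXAFLOPS = 1.05
POWER_MEGAWATTS = 35.0  # assumption: power drawn during the LINPACK run

# Fraction of theoretical peak reached on LINPACK.
linpack_efficiency = SUSTAINED_EXAFLOPS / PEAK_EXAFLOPS

# 1 exaflops = 1e9 gigaflops; 1 megawatt = 1e6 watts.
gflops_per_watt = (SUSTAINED_EXAFLOPS * 1e9) / (POWER_MEGAWATTS * 1e6)

print(f"LINPACK efficiency: {linpack_efficiency:.1%}")
print(f"Energy efficiency:  {gflops_per_watt:.1f} GFLOPS/W")
```

That works out to roughly 81% of peak and about 30 gigaflops per watt, if the unofficial numbers are accurate.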

We’ve already published what little we knew about the Sunway Oceanlite architecture, and earlier this year (and now, in the absence of verified system information) our conjecture was that this new machine was a die shrink, allowing 2X the elements and 2X the performance per socket and with a doubling of sockets (and other engineering of course), Wuxi could create an exascale system. Clearly, Wuxi has.


Wuxi is using those 42 million cores for sustained exascale supercomputing in full-scale quantum simulation production, which we learned today via a preview ahead of the annual Supercomputing Conference (SC21). The TaihuLight follow-on is capable of running a quantum simulation that can be parallelized across the entire machine. This simulation also bodes well for an AI/ML training and inference workloads as it highlights extensive use of mixed-precision math, including 16-bit floating point performance of a reported 4.4 exaflops.

Without delving into all the quantum details, the Wuxi team, along with collaborators at Tsinghua University and the Shanghai Research Center for Quantum Sciences, have developed the tensor-based simulator for random quantum circuits that is optimized for compute density and can “reduce the simulation sampling time of Google Sycamore to 304 seconds from the previously claimed 10,000 years.”
This is just a preview abstract and there aren’t a lot of details on this result but it’s worth mentioning to tee up what we find out in mid-November when a paper is released detailing the simulation.
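For scale, the claimed reduction from 10,000 years to 304 seconds can be expressed as a single factor. A tiny sketch, assuming a Julian year of 365.25 days for the conversion:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

previous_claim_seconds = 10_000 * SECONDS_PER_YEAR
new_time_seconds = 304

speedup = previous_claim_seconds / new_time_seconds
print(f"speedup: {speedup:.2e}x")
```

That is a factor of about a billion, though it compares an estimate for a classical machine against a new simulation method, so it says more about the algorithm than about the hardware alone.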
"Tianhe-3", Arm with accelerators:
But let’s get back to fully benchmarked (LINPACK) exascale systems in China. The same authority confirmed that a second exascale run in China, this time on the Tianhe-3 system, which we previewed back in May 2019, reached almost identical performance with 1.3 exaflops peak and enough sustained to be functional exascale. We do not have a power figure for this but we were able to confirm this machine is based on the FeiTeng line of processors from Phytium, which is Arm-based with a matrix accelerator. (For clarity, FeiTeng is kind of like “Xeon,” it’s a brand of CPUs from Phytium).

This is not a new architecture. Here’s the analysis from 2015 when we first got wind of Phytium’s HPC ambitions, and here is a follow-on deep dive into the “Mars” 64-core FT-2000/64 architecture. The “Mars” processor then was always intended for use in China’s supercomputers but of course, has had to evolve with the times. The matrix engine that adds the real “oomph” to these devices is still based on an updated variant of that Matrix 2000 DSP accelerator we saw in Tianhe-2A (another top supercomputer of its day), which is called the Matrix-2000+ accelerator. The whole software stack for Tianhe-2A took major footwork to tune to the DSP. It was never likely that National University of Defense Technology would swap all of that effort for an architecture that performed quite well, especially on LINPACK.

Recall that this Phytium emergence and the emergence of the Matrix 2000 DSP accelerators for the Tianhe-2A system came about because China couldn’t use an Intel Xeon Phi many-core processor as planned due to trade restrictions at the time.
The "political" part:
And here’s another subtle detail. Our source confirms these LINPACK results for both of China’s exascale systems — the first in the world — were achieved in March 2021. When did the entity list appear citing Phytium and Sunway and the centers that host their showboat systems? In April 2021.

The politics at play are strange and muddled. But our source, as close as can be to issues at hand, confirms China was first to exascale and with two separate machines based on two different (but fully Chinese native) architectures.


In the absence of US chips and accelerators being made available, it is clear that the trade restrictions will satisfy concerns in the near term that China is using US technology to boost development of its nuclear programs. But in the long term, this is major impetus for China to kickstart chip development, fab building, and gun all the engines needed for the semiconductor wars that will continue to simmer, if not yet boil over.
https://www.nextplatform.com/2021/10/26/china-has-already-reached-exascale-on-two-separate-systems/
 
It seems that one of the Chinese exascale systems really is an evolution of this TaihuLight

Three Chinese Exascale Systems Detailed at SC21: Two Operational and One Delayed


1 - OceanLight


The new Sunway machine, called OceanLight, is the successor to TaihuLight.
OceanLight was reportedly completed in March 2021, delivering a purported 1.05 exaflops Linpack out of 1.3 exaflops theoretical peak. “That is not official, but that’s what we’ve been told by several people,” said Kahaner.
Three of the six research teams nominated for the 2021 Gordon Bell Prize leveraged the system, including the winning paper: Closing the “Quantum Supremacy” Gap: Achieving Real-Time Simulation of a Random Quantum Circuit Using a New Sunway Supercomputer. According to the paper’s authors, OceanLight achieved a sustained performance of 1.2 exaflops of single-precision computing power.
OceanLight employs the new-generation, domestically designed and manufactured Sunway architecture, based on upgraded SW26010Pro CPUs. The peak performance of the SW26010Pro processor is 14.026 teraflops in double precision and 55.296 teraflops in half precision, according to another Gordon Bell Prize nominated research group (authors of Extreme-Scale Ab initio Quantum Raman Spectra Simulations on the Leadership HPC System in China).
https://www.hpcwire.com/2021/11/24/...iled-at-sc21-two-operational-and-one-delayed/
 
Sunway SW26010-Pro
lxUHod4.png

Sunway SW26010 Pro is seemingly four times more powerful than the SW26010 chip. It runs faster and has more cores with wider vector widths.
Each Sunway SW26010 Pro chip seemingly has a maximum throughput in double-precision floating-point format (FP64) of 13.8 TFLOPS, which would be a pretty remarkable result when an AMD EPYC 9654 CPU has a peak FP64 performance of around 5.4 TFLOPS.
Each SW26010-Pro CPU includes a whopping total of 384 computing cores, which are packed in six different core groups (CG). A separate management processing element (MPE) provides a superscalar, out-of-order core with a vector engine to manage the computing traffic, which ultimately goes through a meager 128-bit DDR4-3200 memory interface.
The chip tries everything it can to reduce data movement between cores, and it does so with a 2.25 GHz clock for the computing cores and a 2.10 GHz clock for the MPE. The previous Sunway SW26010 CPU boasted a 1.45GHz clock for both cores and MPE. The previously employed DDR3 memory controller has also been replaced with DDR4 memory, which increases the total amount of RAM supported by one CPU from 32GB to 96GB.
https://www.techspot.com/news/100962-sunway-sw26010-pro-china-most-powerful-supercomputer-chip.html
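The peak numbers floating around for this chip can be roughly reconstructed from the core counts and clocks above. This is only a sketch under assumptions that none of the articles state: one MPE per core group (as on the SW26010), 16 FP64 operations per cycle per core (a 512-bit fused multiply-add over 8 doubles), and a half-precision rate 4x the FP64 rate.

```python
CORE_GROUPS = 6
CPES_PER_GROUP = 384 // CORE_GROUPS  # 384 computing cores in six core groups
MPES_PER_GROUP = 1                   # assumption: same layout as the SW26010
CPE_CLOCK_GHZ = 2.25
MPE_CLOCK_GHZ = 2.10
FP64_FLOPS_PER_CYCLE = 16            # assumption: 512-bit FMA = 8 doubles x 2 ops
FP16_FLOPS_PER_CYCLE = 64            # assumption: 4x the FP64 rate

cpes = CORE_GROUPS * CPES_PER_GROUP
mpes = CORE_GROUPS * MPES_PER_GROUP
total_cores = cpes + mpes

# cores x GHz x flops/cycle gives GFLOPS; divide by 1000 for TFLOPS.
cpe_fp64_tflops = cpes * CPE_CLOCK_GHZ * FP64_FLOPS_PER_CYCLE / 1000
mpe_fp64_tflops = mpes * MPE_CLOCK_GHZ * FP64_FLOPS_PER_CYCLE / 1000
cpe_fp16_tflops = cpes * CPE_CLOCK_GHZ * FP16_FLOPS_PER_CYCLE / 1000

print(f"cores:            {total_cores}")
print(f"FP64, CPEs only:  {cpe_fp64_tflops:.3f} TFLOPS")
print(f"FP64, CPEs + MPE: {cpe_fp64_tflops + mpe_fp64_tflops:.3f} TFLOPS")
print(f"FP16, CPEs only:  {cpe_fp16_tflops:.3f} TFLOPS")
```

Under these assumptions the CPEs alone give the 13.8 TFLOPS quoted by TechSpot, adding the MPEs gives the 14.026 TFLOPS quoted in the HPCwire piece, and the half-precision figure matches 55.296 TFLOPS. That the numbers line up suggests the assumptions are plausible, not that they are confirmed.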

Pretty impressive numbers.
Maybe chiplets? Because 390 cores (even if most of them are small), plus 12 memory and network channels, on a monolithic chip seems difficult to me.
It would be interesting to know which ISA is in there, where it is manufactured, and what its TDP is.
 