Post on 19-Jul-2015
Study of the robustness of self-converging algorithms against SEUs
Dr. Raoul Velazco
TIMA Laboratory, «ARIS» Group
Grenoble, France http://tima.imag.fr
PRiSME Laboratory, «SYSCOM» Group
University of Versailles Saint-Quentin-en-Yvelines, France
http://www.prism.uvsq.fr/
Universidad Complutense de Madrid - 16th March 2015
Outline
1. Radiation effects in ICs
2. The Self-Stabilizing Algorithm
3. SEUs in processor-based applications
4. The LEON3 processor
5. The ASTERICS test platform
6. Simulation of SEUs on the LEON3
7. Conclusions
1. Radiation effects in ICs: context
• Aerospace electronic systems operate in a radiation environment
• Charged particles come from three main sources: the Van Allen belts, cosmic rays and solar flares
Cosmic rays
Protons from solar flares
1. Radiation effects in ICs: context
• Microelectronic technology is constantly changing:
– higher density,
– faster devices,
– lower power.
• These trends increase the devices' vulnerability to the effects of radiation (not only in nuclear or space environments)
• In some applications, no failure is allowed
• Advanced technologies are potentially sensitive to the effects of atmospheric neutrons
• Space agencies favor the use of COTS technologies
1. Radiation effects in ICs: types of faults
Radiation and Electronic Devices
• Accumulated effects: displacement damage and total ionizing dose (T.I.D.)
• Single-particle effects: single event effects (S.E.E.)
1. Radiation effects in ICs: description of SEEs
What you always wanted to know about Single Event Effects (SEEs)
• What are they? One of the results of the interaction between radiation and electronic devices
• How do they act? By creating free charge in the silicon bulk which, in practice, behaves as a short-lived but intense current pulse
• What are the ultimate consequences? Anything from simple bit flips or noise-like signals to the physical destruction of the device
1. Radiation effects in ICs: description of SEEs
The Physical Mechanism
The incident particle generates a dense track of electron-hole pairs, and this ionization causes a transient current pulse if the strike occurs near a sensitive volume (the charge collection volume).
1. Radiation effects in ICs: classification of SEEs
• SINGLE EVENT UPSET (SEU): CHANGE OF DATA IN MEMORY CELLS
• MULTIPLE BIT UPSET (MBU): SEVERAL SIMULTANEOUS SEUs
• SINGLE EVENT TRANSIENT (SET): PEAKS IN COMBINATIONAL ICs
• SINGLE EVENT LATCH-UP (SEL): PARASITIC THYRISTOR TRIGGER
• SINGLE EVENT FUNCTIONAL INTERRUPTION (SEFI): PHENOMENA IN CRITICAL PARTS
• AND OTHERS…
HARD ERRORS and SOFT ERRORS
1. Radiation effects in ICs: description of SEEs
Some Useful Definitions
• CROSS SECTION (σ): $\sigma = \dfrac{N_{\mathrm{EVENTS}}}{\text{Particle fluence} \cdot N_{\mathrm{DEV}}}$
• LINEAR ENERGY TRANSFER (LET)
• SOFT ERROR RATE (SER): PROBABILITY OF AN ERROR UNDER USUAL CONDITIONS
• FIT: typical unit of SER → 1 error every $10^9$ h
E.g. 180-nm SRAM: 1000-3000 FIT/Mb
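As a rough illustration of the definitions above, the following C sketch computes a cross section from an event count and converts a FIT/Mb figure into an expected number of errors. The function names and the example values are assumptions made for illustration; they do not come from the slides.

```c
#include <assert.h>

/* Cross section: observed events divided by (particle fluence x number of devices).
   With the fluence in particles/cm^2, the result is in cm^2 per device. */
double cross_section(double n_events, double fluence, double n_devices)
{
    assert(fluence > 0.0 && n_devices > 0.0);
    return n_events / (fluence * n_devices);
}

/* Expected number of errors for a memory of `megabits` Mb with a soft error
   rate of `fit_per_mb` FIT/Mb, operated for `hours` hours.
   1 FIT corresponds to 1 error per 10^9 device-hours. */
double expected_errors(double fit_per_mb, double megabits, double hours)
{
    assert(hours >= 0.0);
    return fit_per_mb * megabits * hours / 1e9;
}
```

For instance, `expected_errors(1000.0, 1.0, 1e6)` evaluates to 1.0: a 1-Mb SRAM at the lower end of the 180-nm range quoted above would be expected to show about one error per million hours of operation.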
1. Radiation effects in ICs: sources of SEEs
Traditionally, SEEs have been associated with space missions because of the absence of the atmospheric shield…
Cosmic rays
Protons from solar flares
Unfortunately, our quiet oasis seems to be vanishing, since the enemy is knocking on the door…
• Alpha particles from trace amounts of U or Th
• Atmospheric neutrons and other cosmic rays
1. Radiation effects in ICs: sources of SEEs
Alpha Particles
Sometimes they appear without warning, and the source is only detected after several months and a lot of money*.
• In 1978, Intel had to stop a factory because its water was drawn from a nearby river that, upstream, ran too close to an old uranium mine.
* J. F. Ziegler and H. Puchner, “SER – History, Trends and Challenges. A Guide for Designing with Memory ICs”, Cypress Semiconductor, USA, 2004.
1. Radiation effects in ICs: sources of SEEs
Alpha Particles
Sometimes they appear without warning, and the source is only detected after several months and a lot of money*.
• In 1986, IBM detected a high rate of defective devices and traced it to the phosphoric acid, whose bottles were cleaned with a ²¹⁰Po deionizer device… hundreds of km away.
* J. F. Ziegler and H. Puchner, “SER – History, Trends and Challenges. A Guide for Designing with Memory ICs”, Cypress Semiconductor, USA, 2004.
1. Radiation effects in ICs: sources of SEEs
Alpha Particles
Sometimes they appear without warning, and the source is only detected after several months and a lot of money*.
• In 1992, the problem came from phosphorus obtained from the droppings of bats living in caves with traces of Th and U.
* J. F. Ziegler and H. Puchner, “SER – History, Trends and Challenges. A Guide for Designing with Memory ICs”, Cypress Semiconductor, USA, 2004.
1. Radiation effects in ICs: sources of SEEs
Alpha Particles
But sometimes we are a little naive…
• Solder balls are usually made from Sn and Pb, which come from minerals that may contain traces of uranium and thorium.
Nevertheless, the designer forgets this detail and places the solder balls too close to critical nodes!
1. Radiation effects in ICs: sources of SEEs
Alpha Particles
• Fortunately, they are easily controlled by following some simple rules during the manufacturing process.
But sometimes the enemy strikes back!
In 2005, a figure of 2·10^6 FIT/Mbit was observed in the SRAMs of pacemakers in which:
• the package had been removed for cosmetic reasons and the solder balls had not been previously purified*.
Fortunately, nobody died (fingers crossed).
* J. Wilkinson, IEEE Trans. Dev. Mat. Reliab., 5 (3), pp. 428-433, 2005
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays
They have long been a headache for the designers of electronics on board space missions…
Here are some of their practical jokes*…
• Cassini Mission (1997): some information was lost because of MBUs.
• Deep Space 1: an SEU caused a solar panel to stop unfolding.
• Mars Odyssey (2001): two weeks after launch, alarms went off because of errors later attributed to an SEU.
• GPS satellite network: one of the satellites is out of service, probably because of a latch-up.
* B. E. Pritchard, IEEE NSREC 2002 Data Workshop Proceedings, pp. 7-17, 2002
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays
A nice example… The birth of a star, in a picture taken by the Hubble Telescope
Do you notice anything odd in the picture?
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays at Ground Level
• The highest flux is reached at an altitude of 15-20 km.
• Less than 1% of this particle rain reaches sea level.
• The composition has also changed…
• Basically neutrons, muons and some pions
The neutron flux is usually referenced to that of New York City, whose value is (apparently) only 15 n/cm²/h
• This value depends on altitude (approximately ×10 every 3 km until saturation at 15-20 km).
• And also on latitude: the nearer the Poles, the higher the rate.
• South Atlantic Anomaly (SAA), close to Argentina.
• 1.5 m of concrete reduces the flux by half.
Such a weak foe: should we really be afraid of it?
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays at Ground Level
Perhaps we believe that we are in a safe shelter, but…
– 1992: the PERFORM system, used by airplanes to manage the take-off manoeuvre, had to be replaced at short notice because of SEUs in its SRAMs*.
– 1998: a study reported that, every day, 1 out of 10000 SRAMs in pacemakers underwent bit flips**. This factor is 300 times higher if the patient has taken a transoceanic flight.
* J. Olsen, IEEE Trans. Nucl. Sci., 1993, 40, 74-77
** P. D. Bradley, IEEE Trans. Nucl. Sci., 45 (6), 2829-2940
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays at Ground Level
– The call of the Thousand (2000): Sun Unix server systems crashed in dozens of places all over the USA because of SEUs in their cache memory, costing several million dollars*.
– 2003: elections in Belgium were held simultaneously in the traditional way and electronically. A difference of 4096 votes was found. Experts explained this difference as the consequence of an SEU**.
– 2005: after 102 days, the ASC Q Cluster supercomputer showed 7170 errors in its 81-Gb cache memory, 243 of which led to a crash of the programs or the operating system***.
* Forbes, 2000
** Chantal Enguehard, Jean-Didier Graton. Electronic Voting: the Devil is in the Details. 2008. hal-00274635
*** K. W. Harris, IEEE Trans. Dev. Mat. Reliab., 2005, 5, 336-342
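The figure of 4096 in the Belgian case is suggestive because 4096 = 2^12, which is exactly the change produced by flipping bit 12 of a binary counter. The sketch below shows this arithmetic in C; the function names and the counter value 514 are invented for illustration and are not taken from the cited report.

```c
#include <assert.h>
#include <stdint.h>

/* Flip one bit of a 32-bit word, as an SEU would do to a stored value. */
uint32_t flip_bit(uint32_t word, unsigned bit)
{
    assert(bit < 32);
    return word ^ (UINT32_C(1) << bit);
}

/* Return 1 if a and b differ in exactly one bit position, 0 otherwise. */
int differs_by_single_bit(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    /* a non-zero value with a single set bit is a power of two */
    return diff != 0 && (diff & (diff - 1)) == 0;
}
```

With a hypothetical counter holding 514, `flip_bit(514, 12)` yields 4610, a difference of exactly 4096, and `differs_by_single_bit(514, 4610)` returns 1. A check of this kind can hint at a bit flip, but it cannot prove one.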
ALWAYS BLAMING THE PROGRAM DEVELOPER?
PERHAPS IT WAS AN SEU!
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays at Ground Level
Why are these exotic phenomena appearing at lower and lower altitudes?
The present trend is to shrink the typical feature size.
This has helped to decrease the sensitive volume, but it has also decreased the critical charge.
The most pessimistic simulations show a minimum at 130-180 nm, and a sudden increase is expected for more advanced technologies.
T. Granlund, IEEE Trans. Nucl. Sci., 2003, 50, 2065-2068
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays at Ground Level
In any case, everybody agrees that the error rate of the whole system is increasing*…
And that the sensitivity of combinational logic devices is increasing too.
* R. Baumann, IEEE Trans. Dev. Mat. Reliab., 2005, 5, 305-316
1. Radiation effects in ICs: sources of SEEs
Cosmic Rays at Ground Level
Can this background get worse? Yes, it can. Some details may increase the neutron sensitivity.
– Power supply voltage: the lower it is, the more likely SEUs are
– Operating frequency: SEUs are more dangerous while the system is reading or writing.
– Presence of boron: one isotope of boron, 10B, can capture low-energy thermal neutrons and release an energetic alpha particle:
$^{10}_{5}\mathrm{B} + {}^{1}_{0}n \rightarrow {}^{4}_{2}\alpha + {}^{7}_{3}\mathrm{Li}$
– Altitude
2. Self-stabilizing algorithms
• Self-stabilizing algorithms are used for communications in computer or sensor networks
They are supposed to have fault-tolerant capabilities
• Are they robust with respect to soft errors?
The ASTERICS test platform was used to simulate SEUs by HW/SW means
SEU fault injection experiments were performed on the LEON3 while it executed a self-converging application
• Final goal: identify sensitive resources and explore SW fault tolerance solutions for the self-stabilizing algorithm
2. Self-stabilizing algorithms
• Defined by Edsger Dijkstra in 1974
• Self-stabilization is a property of distributed systems: when the system is wrongly initialized or perturbed, it automatically returns to correct operation in a finite number of calculation steps
• Applications:
– in « theoretical computer science », in domains where human intervention to restart a system after a failure is impossible
– in computer networks and sensor networks, as well as in critical systems such as satellites.
« Testing shows the presence, not the absence, of bugs! » Edsger Dijkstra
2. Self-stabilizing distributed algorithms
• Idea: a fault can put the system in any arbitrary state
• From any state, the system resumes normal behavior and remains in it
• Defined by:
– Convergence: the system eventually reaches normal behavior
– Closure: when no fault occurs, the system keeps behaving in the intended manner
2. Self-stabilizing algorithms: behaviour
2. Self-stabilizing algorithms: Self-convergence
• A fault leads to an arbitrary state
• The algorithm gives a correct answer:
– if the error does not occur too close to the end (e.g. just before return)
– if the error does not modify the data
2. Self-stabilizing algorithms: Distributed Shortest Paths in a graph
• Given:
– A weighted graph G defined by its matrix (an array) and its size (an integer)
• Computes:
– shortest paths from any node i to node 0
• Mimics the behavior of a distributed self-stabilizing algorithm
2. Self-stabilizing algorithms: Distributed Shortest Paths in a graph (cont'd)
• Any node i knows
– its distance l_ij to any neighbor j
• Node 0 knows it is the sink
– so its distance to itself is 0, and the shortest path is to remain on 0
– Once no computation can modify d, d_i is the distance from i to 0 and next_i is the next step on the shortest path from i to 0.
if (i = 0)
    d_i := 0
    next_i := 0
else
    d_i := min{l_ij + d_j}          // with j neighbor of i
    next_i := argmin{l_ij + d_j}    // with j neighbor of i
endif
« The shortest path in a graph is never the one we think, it can come from nowhere and, most of the time, it does not exist » Edsger Dijkstra
2. Self-stabilizing algorithms: Self-convergent shortest paths
Matrix T (N×N) represents a graph: nodes i and j are connected by an edge of length T(i,j).
D is an N×1 matrix: the distance between node i and node 0 is D_i = min_j(T_ij + D_j).
b=c=1
while(b||c) {
    c=b;
    b=0;
    D[0]=0;
    for(i=1; i<N; i++) {
        m = VERY LARGE;
        for(j = 0; j<N; j++) {
            if(m>=D[j]+T[N*i+j])
                m=D[j]+T[N*i+j];
        }
        if(D[i]!=m)
            b=1;
        D[i]=m;
    }
}
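The loop above can be turned into a compilable C function with little change. The version below is a sketch under stated assumptions: the constant `VERY_LARGE`, the pass counter and the function signature are additions for illustration, and missing edges (including the diagonal) are assumed to be encoded as `VERY_LARGE`.

```c
#include <assert.h>

#define VERY_LARGE 1000000  /* marks "no edge"; two of them add without int overflow */

/* Relax D until two consecutive passes leave it unchanged (the b/c flags of
   the slide). T is an n x n weight matrix stored row by row; D has n entries
   and may start in any state with values in [0, VERY_LARGE].
   Returns the number of passes executed. */
int self_converge(const int *T, int *D, int n)
{
    int b = 1, c = 1, passes = 0;
    assert(n > 0);
    while (b || c) {
        c = b;
        b = 0;
        D[0] = 0;                              /* node 0 is the sink */
        for (int i = 1; i < n; i++) {
            int m = VERY_LARGE;
            for (int j = 0; j < n; j++) {
                if (m >= D[j] + T[n * i + j])
                    m = D[j] + T[n * i + j];   /* best distance via neighbor j */
            }
            if (D[i] != m)
                b = 1;                         /* something changed in this pass */
            D[i] = m;
        }
        passes++;
    }
    return passes;
}
```

On a hypothetical 4-node graph with edges 0-1 (length 1), 1-2 (length 2), 0-2 (length 5) and 2-3 (length 1), starting from the corrupted vector {7, 123, 0, 99999}, the function returns after 3 passes with D = {0, 1, 3, 4}. The last pass changes nothing; it only confirms that the state is stable.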
3. SEUs in processor-based applications
• The first studies on SEUs were done at the end of the 1960s
• They strictly considered space applications
• ICs issued from advanced manufacturing processes are sensitive to thermal neutrons present in the Earth's atmosphere, even at ground level
• Processors and memories embed a significant number of SEU targets
• Applications for which soft errors may have critical consequences must be evaluated with respect to SEUs
3. SEUs in processor-based applications: The CEU method
• Presented for the first time in 2000*
• Estimates the number of particles required to obtain an observable event in an application by combining fault injection and accelerated test results
• Provides data on the system's sensitivity at an early stage of development
• How to do that?
1. Calculate the probability that a fault provokes an error in the application:
$\tau_{INJ} = \dfrac{\#\,\text{application errors}}{\#\,\text{injected faults}}$
2. Obtain the static cross-section (from the literature or from measurements):
$\sigma_{SEU} = \dfrac{\#\,\text{errors in the configuration memory}}{\text{fluence}}$
3. Obtain the system error rate:
$\tau_{PRED} = \sigma_{SEU} \cdot \tau_{INJ}$
* R. Velazco, S. Rezgui, R. Ecoffet, “Predicting Error Rate for Microprocessor-Based Digital Architectures through C.E.U. (Code Emulating Upsets) Injection”, IEEE Transactions on Nuclear Science, Vol. 47, No. 6, Dec. 2000, pp. 2405-2411.
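The three steps of the CEU method reduce to two small calculations, sketched here in C. The function names are assumptions for illustration, and any cross-section value passed to `tau_pred` has to come from the literature or from a measurement; none is given in this talk.

```c
#include <assert.h>

/* Step 1: tau_INJ, the fraction of injected faults that produced an
   error in the application. */
double tau_inj(unsigned long app_errors, unsigned long injected_faults)
{
    assert(injected_faults > 0);
    return (double)app_errors / (double)injected_faults;
}

/* Step 3: tau_PRED = sigma_SEU * tau_INJ. With sigma_SEU in cm^2, the result
   is the predicted number of application errors per unit of fluence. */
double tau_pred(double sigma_seu, double tau_injection)
{
    assert(sigma_seu >= 0.0);
    return sigma_seu * tau_injection;
}
```

As an example of step 1, 1709 application errors out of 15068 injected faults give a `tau_inj` of about 0.113.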
3. SEUs in processor-based applications: The CEU method
Fault injection mechanism:
• Faults are injected using an external interrupt of the processor
• The bit-flip target is selected using the instruction set
=> The accuracy of the method depends on the number of accessible memory elements compared to the total number of memory cells embedded in the DUT
3. SEUs in processor-based applications: The CEU method
• Can be applied to any processor:
– in its HW version
– implemented in an FPGA
• SEU targets are memory cells accessible through the instruction set:
– Registers
– Special function registers (SP, PC, …)
– Internal SRAM
– Cache memory
– …
• CEU codes strongly depend on the studied processor's architecture and instruction set
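At its core, a CEU is an interrupt routine that flips the selected bit of the selected target. The sketch below simulates that step in portable C on an array standing in for the accessible targets. The structure, the names and the array size are hypothetical; on a real processor the routine would be written with the processor's own instructions, which is why CEU codes are architecture-dependent.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TARGETS 8   /* hypothetical number of accessible 32-bit targets */

/* One injection vector: which word to corrupt and which bit(s) to flip. */
struct injection_vector {
    unsigned target;    /* index of the target word */
    uint32_t bit_mask;  /* 1s mark the bits to flip */
};

/* Body of the simulated injection routine: XOR the target with the mask.
   On real hardware this would run inside the interrupt handler. */
void inject_fault(uint32_t targets[NUM_TARGETS], struct injection_vector v)
{
    assert(v.target < NUM_TARGETS);
    targets[v.target] ^= v.bit_mask;
}
```

Because XOR is its own inverse, applying the same vector twice restores the original content, which is convenient when checking an injection campaign in simulation.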
4. LEON3 processor
Generalities:
LEON3 is a synthesizable VHDL model
• 32-bit processor compliant with the SPARC V8 architecture
Main features:
• 7-stage pipeline
• High-performance, fully pipelined IEEE-754 FPU
• Separate instruction and data caches (Harvard architecture)
• AMBA-2.0 AHB bus interface
• Symmetric multi-processor (SMP) support
• Up to 125 MHz in FPGA and 400 MHz on 0.13 µm ASIC technologies
• Fault-tolerant and SEU-proof version available for space applications
• High performance: 1.4 DMIPS/MHz, 1.8 CoreMark/MHz (gcc-4.1.2)
• Free: http://www.gaisler.com/
4. LEON3 processor: interfaces and peripherals
4. LEON3 processor: specific features
• Unlike typical processors, the LEON3 does not have a single Stack Pointer (SP) register
• The LEON3 is organized around a system of 8 ‘windows’. Each window provides a separate register environment
• A function call or an interrupt provokes a window switch
• The input registers of window Wn become the output registers of window Wn+1, and Wn+1 receives a new set of local and out registers
• Each window has its own stack pointer, stored in o6 (an out register)
4. LEON3 processor: Register file
• 136 general-purpose registers: 8 global registers + 128 window registers
• Only 32 are accessible at any time by an instruction:
– 8 global registers (g0 to g7)
– 24 window registers:
  8 in registers (i0 to i7)
  8 local registers (l0 to l7)
  8 out registers (o0 to o7)
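The count of 136 follows from the overlap between windows: since the in registers of one window are the out registers of its neighbour, each of the 8 windows adds only 16 registers (8 locals and 8 ins), so 8 + 8 × 16 = 136. The C sketch below makes this explicit. The visible numbering (r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins) follows the SPARC V8 convention, while the physical layout and all names are assumptions for illustration, not the actual LEON3 implementation.

```c
#include <assert.h>

#define NWINDOWS 8
#define NGLOBALS 8
#define REGS_PER_WINDOW 16  /* 8 locals + 8 ins; outs are shared with a neighbour */

/* Total number of physical general-purpose registers. */
int total_registers(void)
{
    return NGLOBALS + NWINDOWS * REGS_PER_WINDOW;
}

/* Map visible register r (0..31) of window w (0..7) to a physical index
   in a hypothetical flat register file. */
int physical_index(int w, int r)
{
    assert(w >= 0 && w < NWINDOWS && r >= 0 && r < 32);
    if (r < 8)                                   /* globals: shared by all windows */
        return r;
    if (r < 16) {                                /* outs of Wn are the ins of Wn-1 */
        int prev = (w + NWINDOWS - 1) % NWINDOWS;
        return NGLOBALS + prev * REGS_PER_WINDOW + 8 + (r - 8);
    }
    if (r < 24)                                  /* locals: private to the window */
        return NGLOBALS + w * REGS_PER_WINDOW + (r - 16);
    return NGLOBALS + w * REGS_PER_WINDOW + 8 + (r - 24);   /* ins */
}
```

In this layout, `physical_index(1, 8)` and `physical_index(0, 24)` return the same index: o0 of window 1 and i0 of window 0 are one physical register, so a single bit flip there is visible from both windows.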
4. LEON3 processor: accessible SEU-targets
Processor control registers:
* Processor State Register (PSR)
* Current Window Pointer (CWP)
* Window Invalid Mask (WIM)
* Program Counters (PC & nPC)
User application registers and memories:
* Register file: 136 general-purpose registers (8 global registers + 128 window registers). Program Counter (PC) and next Program Counter (nPC) are special registers in the interrupt window
* Data and instruction caches: both are configurable (associativity, size, …)
  Our data cache is 1 Kb, direct mapped
  Our instruction cache is 1 Kb, direct mapped
4. LEON3 processor: Accessible and non-accessible registers
[Figure: LEON3 integer unit, with registers marked as accessible or non-accessible using the instruction set]
5. ASTERICS (Advanced System for the TEst under Radiation of IC and Systems)
• Built around two Virtex-4 FPGAs:
– Control FPGA: XC4VFX60
– Chipset FPGA: XC4VLX40
• The PowerPC embedded in the FPGA is used to control the tester
• Up to 1 GB of DDR-SDRAM for the Control FPGA
• Compact Flash memory used to store the FPGA configuration and the PowerPC instruction code
• Up to 180 IOs available for connecting the Device Under Test (DUT) to the tester via a high-speed connector
• The DUT can access 32 Mb of SRAM and 512 Mb of DDR-SDRAM
• The configuration of the Chipset FPGA is managed by the Control FPGA
• Tester remotely controlled via a 10/100/1000 Ethernet link
5. ASTERICS characteristics
Operating conditions:
* The PowerPC embedded in the Control FPGA runs at 300 MHz
* DUT frequency up to 200 MHz
* Available IO voltages: 3.3 V, 2.5 V, 1.8 V, 1.5 V, 1.2 V
Typical target DUTs (Devices Under Test):
* Advanced digital processors up to 64 bits
* Memories (SRAM, DRAM, etc.)
* Mixed analog/digital circuits (ADC, DAC, SoC, …)
* MEMS (potential upgrade depending on the specs)
5. ASTERICS characteristics
[Figure: ASTERICS board, showing the Control FPGA, the DDR-SDRAM for the PowerPC, the Ethernet link, the DUT connector, the Chipset FPGA, the DUT DDR-SDRAM and the DUT SRAM]
6. Simulation of SEUs on the LEON3: CEU fault-injection environment
Fault injection mechanism
• Faults are injected using an external interrupt of the processor
• The bit-flip target is selected using the instruction set
Experimental results can be used to predict the application error rate
• The accuracy of the error-rate prediction method depends on the number of accessible memory elements compared to the total number of memory cells embedded in the DUT
6. Simulation of SEUs on the LEON3: CEU fault-injection environment
• Hardware setup: PC + ASTERICS + power supply
• No DUT board: the Chipset FPGA is used as DUT
• ASTERICS memory: LEON3 code & data
• Functions embedded in the Chipset FPGA:
- Shared-memory controller (allows access by the CP and by the LEON3)
- Supervisor (controls the LEON3 under experiment and its peripherals)
• LEON3 application: a benchmark self-stabilizing algorithm
[Figure: block diagram of the setup, with the blocks Comm. FPGA, LEON3 + peripherals, shared-memory controller, supervisor, memory, Ethernet link, ASTERICS, Chipset FPGA and power supply]
6. Simulation of SEUs on the LEON3: CEU fault-injection environment
• Store the injection vectors: instant, target, register, bit mask
• Start the execution of the LEON3 application
• Generate the interrupt according to the instant vector
• Detect the normal end of the application
• Compare the obtained results with the expected results and count the errors
• Deal with timeouts: there are 3 types of timeouts
- Boot timeout: when the boot sequence does not finish
- ASTERICS timeout: when the running application does not finish
- Computer timeout: when the supervisor does not work properly or the ASTERICS stops responding
[Figure: timeline showing the fault injection, the expected end, and the boot, ASTERICS and computer timeouts]
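The outcome of each run can be classified with a few ordered checks. The following C sketch is one possible way to express the list above; the type names, the status flags and the order of the tests are assumptions for illustration, not the actual supervisor code.

```c
#include <assert.h>
#include <string.h>

enum run_outcome {
    RUN_OK,
    RUN_RESULT_ERROR,
    RUN_BOOT_TIMEOUT,
    RUN_ASTERICS_TIMEOUT,
    RUN_COMPUTER_TIMEOUT
};

/* Hypothetical status flags collected for one injection run. */
struct run_status {
    int booted;            /* the boot sequence finished */
    int app_finished;      /* the application ended before the run limit */
    int supervisor_alive;  /* supervisor and ASTERICS still answer the computer */
};

/* Classify one run; `result` and `expected` hold n words each. */
enum run_outcome classify_run(struct run_status s, const int *result,
                              const int *expected, size_t n)
{
    if (!s.supervisor_alive)
        return RUN_COMPUTER_TIMEOUT;
    if (!s.booted)
        return RUN_BOOT_TIMEOUT;
    if (!s.app_finished)
        return RUN_ASTERICS_TIMEOUT;
    /* normal end: compare the obtained results with the reference */
    return memcmp(result, expected, n * sizeof *result) == 0
               ? RUN_OK : RUN_RESULT_ERROR;
}
```

Checking the timeouts before comparing results matters: a run that never finished has no meaningful result vector to compare.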
6. Simulation of SEUs on the LEON3: Experiment flowchart
[Flowchart across three actors: Computer, Supervisor and LEON3]
• Initialize shared memory
• Generate injection vectors
• Store injection vectors
• Send init. memory command
• Application run
• Generate interrupt
• Fault injection routine
• Detect end of execution or generate timeout
• Send Read Memory command
• Send results
• Compare results with reference
Fault injection rate: 1 SEU / 2 s
6. Simulation of SEUs on the LEON3: Preliminary results: target = register file
Self-converging algorithm:
N = 16
T = NxN matrix
D = Nx1 matrix
b=c=1
while(b||c){
    c=b;
    b=0;
    D[0]=0;
    for(i=1; i<N; i++){
        m = BIGNUMBER;
        for(j = 0; j<N; j++) {
            if(m>=D[j]+T[N*i+j])
                m=D[j]+T[N*i+j];
        }
        if(D[i]!=m)
            b=1;
        D[i]=m;
    }
}
Preliminary results of fault injection experiments
Test #   Inj. faults   Result errors    Timeouts          Silent   Run limit
1        130577        204 (0.15 %)     32143 (24.6 %)    219      1.5
2        199550        324 (0.16 %)     49478 (24.8 %)    384      1.5
3        15068         1709 (11.3 %)    992 (6.6 %)       28       5
4        14264         1614 (11.3 %)    900 (6.3 %)       0        8
5        8007          887 (11.07 %)    508 (6.3 %)       17       16
Sensitivity of the program variables
Variable   Observed errors        Recoverable
i          timeouts               yes
j          timeouts               yes
m          errors and timeouts    yes
D          errors and timeouts    no
T          errors and timeouts    no
b          timeouts               yes
c          timeouts               yes
• During Tests 1 and 2, very few errors but a high number of timeouts were detected
• The self-converging algorithm requires more than 1.5 × 336 ms (336 ms being the nominal time) to converge
• Tests 3, 4 and 5 showed that timeouts masked result errors: => a suitable timeout limit is 5 times the nominal execution time or higher
6. Simulation of SEUs on the LEON3: SW modifications
• Use the modulo operator « % » when indexing an array, i.e.
m=D[j%16]+T[((N*(i%16))+(j%16))%256];
• Assign every variable to a specific register of the register file by using the following « C » declaration (goal: reduce the number of registers used)
register unsigned int variable asm ("register name");
• Initialize the variables b and c with an 8-bit number instead of « 1 », to avoid a bit flip making them equal to « 0 »
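Two of the three modifications can be shown in portable C, as in the sketch below. It is an illustration under assumptions rather than the code used in the experiments: the value 0xFF for the flags is a guess at what "an 8-bit number" means, `BIGNUMBER` is given an arbitrary value, and the register pinning is left out because the `asm("register name")` form is compiler- and target-specific.

```c
#include <assert.h>

#define N 16
#define BIGNUMBER 1000000
#define FLAG_SET 0xFFu   /* assumed flag value: one bit flip cannot clear it */

/* Self-converging shortest paths with modulo-protected indices.
   T is an N x N weight matrix stored row by row, D the distance vector. */
void self_converge_hardened(const int T[N * N], int D[N])
{
    unsigned b = FLAG_SET, c = FLAG_SET;
    unsigned i, j;
    int m;
    while (b || c) {
        c = b;
        b = 0;
        D[0] = 0;
        for (i = 1; i < N; i++) {
            m = BIGNUMBER;
            for (j = 0; j < N; j++) {
                /* the modulo keeps a corrupted i or j inside the arrays */
                int cand = D[j % 16] + T[((N * (i % 16)) + (j % 16)) % 256];
                if (m >= cand)
                    m = cand;
            }
            if (D[i % 16] != m)
                b = FLAG_SET;
            D[i % 16] = m;
        }
    }
}
```

The modulo does not make a corrupted index correct; it only turns an out-of-bounds access, which could crash or hang the program, into a wrong but in-bounds access that later passes of the loop can repair.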
6. Simulation of SEUs on the LEON3: SEU injections on the modified version. Target = register file
• The running limit is set to 5 times the time required for the application to end execution without fault injection
• The erroneous results decrease from 11.3% to 4.45%
• The timeouts decrease from 6.6% to 2.6%
Results of fault injection on the modified source code
# runs   # errors      # timeouts    # converges
8000     356 (4.45%)   208 (2.6%)    2972 (37.15%)
6. Simulation of SEUs on the LEON3: SEU injections on the modified version. Target: other resources
• Data and instruction caches are also very sensitive to SEUs. Both can be accessed by the CEU through the load and store instructions
• A fault injection campaign was performed on each of the caches while the LEON3 executed the modified algorithm
• The last campaign was performed on all the resources at the same time (2075 registers of 32 bits each):
- Register file
- PC and nPC
- Instruction cache
- Data cache
• The running limit was set to 5
Results of fault injection in new resources
Zone              # of runs   # of errors     # of timeouts
Inst. cache       12174       107 (0.88%)     385 (3.16%)
Data cache        12348       547 (4.42%)     0 (0%)
Multi-resources   88410       2196 (2.48%)    1415 (1.6%)
6. Simulation of SEUs on the LEON3: Triple Modular Redundancy (TMR)
• A TMR was emulated: 3 LEON3 cores simultaneously executing the same self-convergent algorithm
• The comparison was done in the external PC
• SEUs can hit one, two or three cores in one simulation
• The executable is the modified self-convergence algorithm
• The TMR result is:
– Error: if there are two errors, or one error and a timeout
– Timeout: if two timeouts occur
– Converge: if the self-converging algorithm converges in at least one of the cores, with a correct result
[Figure: Core 1, Core 2 and Core 3 feed the TMR comparison, whose outcomes are Error, Timeout and Converge]
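The Error and Timeout rules above can be written as a small voting function. The C sketch below is one reading of those rules, with invented type names; the precedence between the tests is an assumption, since the slide does not state it. The "Converge" figure is not modelled here: it appears to count runs in which a core re-converged after the fault, which is a separate statistic rather than a voter outcome.

```c
#include <assert.h>

enum core_outcome { CORE_CORRECT, CORE_ERROR, CORE_TIMEOUT };
enum tmr_outcome  { TMR_CORRECT, TMR_ERROR, TMR_TIMEOUT };

/* Combine the outcomes of the three cores into one TMR outcome. */
enum tmr_outcome tmr_vote(const enum core_outcome cores[3])
{
    int correct = 0, timeouts = 0, i;
    for (i = 0; i < 3; i++) {
        if (cores[i] == CORE_CORRECT)
            correct++;
        else if (cores[i] == CORE_TIMEOUT)
            timeouts++;
    }
    if (correct >= 2)
        return TMR_CORRECT;   /* a majority of cores give the right result */
    if (timeouts >= 2)
        return TMR_TIMEOUT;   /* two timeouts */
    return TMR_ERROR;         /* two errors, or one error and one timeout */
}
```

With this reading, a single faulty core never changes the voted result, which matches the general expectation that TMR masks single faults.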
6. Simulation of SEUs on the LEON3: Three-core fault injection results. Target: register file
• The running limit is set to 5 times the time required for the application to end execution without fault injection
• In 17.73% of the simulations the self-converging algorithm converges to correct results
• The error rate decreases from 4.45% to 0.64%
• The timeouts decrease from 2.6% to 0.18%
Results of fault injection on the three-core processor
# runs   # errors      # timeouts   # converges
42543    276 (0.64%)   77 (0.18%)   7543 (17.73%)
6. Simulation of SEUs on the LEON3: Three-core fault injection results. Target: all resources
• The running limit is set to 5
• In 1.825% of the simulations the self-converging algorithm converges to correct results
• The erroneous results decrease from 2.48% to 0.085%
• The timeouts decrease from 1.6% to 0.015%
Results of fault injection on three cores for all resources
# runs    # of errors   # of timeouts   # of converges
100000    85 (0.085%)   15 (0.015%)     1825 (1.825%)
6. Simulation of SEUs on the LEON3: Three-core fault injection results. Target: all resources
Distribution of errors on all resources
[Figure: two charts. 48 double SEUs, with counts 26, 18, 2, 1, 1 and categories 1DC/1IC, 2DC, 1DC/1RF, 1DC/1nPC, 1DC/1PC. 37 triple SEUs, with counts 9, 1, 15, 1, 9, 1, 1 and categories 2IC/1DC, 2DC/1PC, 2DC/1IC, 3IC, 3DC, 1DC/1IC/1PC, 2DC/1RF]
7. Conclusions and future work
• The sensitivity to SEUs of a self-converging algorithm was studied
• Fault injection experiments were performed on a benchmark self-converging program executed by a LEON3 processor implemented on an FPGA
• The CEU (Code Emulated Upsets) approach was adopted to perform SEU fault injection experiments using the ASTERICS test platform
• The results obtained show the fault tolerance and the “Achilles' heels” of the studied program
• Different versions were explored. The one implementing a TMR was immune to SEUs and quite robust with respect to MBUs. SEUs were not injected in the voter
• In future work, new versions of self-converging algorithms will be implemented in a Network on Chip to perform radiation ground testing
Acknowledgements
• Dr. Francisco Javier Franco Peláez (UCM)
• Dr. Juan Antonio Clemente (UCM)
• Dr. Devan Sohier (Prisme, Univ. de Versailles)
• Dr. Alain Bui (Prisme, Univ. de Versailles)
• Dr. Greicy Costa (TIMA Lab.)
THANK YOU FOR YOUR ATTENTION!
TIME FOR QUESTIONS