Fundación Séneca | Funding

Fault-tolerant architectures

Over the last four years, our research has focused on architectural mechanisms for hardware fault mitigation. Specifically, we have developed several architectural designs that detect and correct transient faults in parallel architectures.

The work under this extension grant would pursue a number of new research paths opened by that previous work.

The potential research paths are highlighted below:

Study of different management policies in LBRA.
Our initial LBRA mechanism follows an eager version management policy: memory values are updated in place, typically in the L1 cache, even before verification. A known drawback is that, when a fault occurs, all updated memory values must be discarded. The scheme is nevertheless beneficial, because the occurrence of a fault can be considered the uncommon case.
A lazy version management policy, on the other hand, keeps the most recent version of data values in the background while the old values remain in place. This would allow slave threads to access main memory directly to retrieve data values. However, two problems arise: a) how to hide the latency of deploying buffered values to memory after a successful verification, and b) how to provide data sharing in parallel applications. Studying these trade-offs is expected to yield a substantial improvement over our previous work.
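The difference between the two policies can be illustrated with a minimal software sketch. It is not the LBRA hardware design: the class names `EagerStore` and `LazyStore`, the dictionary standing in for memory, and the undo-log and write-buffer structures are all assumptions made for illustration only.

```python
_MISSING = object()  # marks an address that had no value before a write


class EagerStore:
    """Eager versioning (illustrative): write in place, log the old value."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.undo_log = []  # (address, previous value) pairs

    def write(self, addr, value):
        # Save the old value first, then update memory in place.
        self.undo_log.append((addr, self.memory.get(addr, _MISSING)))
        self.memory[addr] = value

    def commit(self):
        # Verification succeeded: new values are already in place.
        self.undo_log.clear()

    def rollback(self):
        # Fault detected: undo the writes in reverse order.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old


class LazyStore:
    """Lazy versioning (illustrative): buffer new values, keep old in place."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.write_buffer = {}

    def write(self, addr, value):
        # Memory still holds the old value until commit.
        self.write_buffer[addr] = value

    def read(self, addr):
        # The writer sees its own buffered value if there is one.
        return self.write_buffer.get(addr, self.memory.get(addr))

    def commit(self):
        # Verification succeeded: buffered values must now be deployed.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def rollback(self):
        # Fault detected: memory was never modified, drop the buffer.
        self.write_buffer.clear()
```

In this sketch the eager policy makes `commit` trivial and `rollback` costly, while the lazy policy does the opposite: `rollback` is trivial, but `commit` has to copy every buffered value, which corresponds to problem a) above.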
Expected Miss Ratio (EMR) due to permanent faults in caches
SRAM cells, which are used to build embedded memory arrays such as caches, are highly susceptible to permanent faults. In [3] we developed an analytical model that calculates the EMR of faulty caches when mitigation techniques such as block or word disabling are applied. These techniques disable the faulty portions of the cache, allowing safe operation at a performance cost caused by the reduced cache capacity.
In [4] we assumed that faults are distributed randomly across an entire population of caches. However, recent studies indicate that faults follow distributions that exhibit clustering effects.
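To make the capacity cost of disabling concrete, the following minimal sketch computes the expected fraction of blocks or words that remain usable when every cell fails independently with the same probability. It is not the EMR model of [3]: it covers only capacity, ignores tag bits and the miss ratio itself, and the function names, the 512-bit block and the 32-bit word are assumptions chosen for illustration.

```python
def fault_free_probability(p_cell, cells):
    """Probability that none of `cells` cells is faulty.

    Assumes each cell fails independently with probability `p_cell`.
    """
    if not 0.0 <= p_cell <= 1.0:
        raise ValueError("p_cell must be a probability in [0, 1]")
    return (1.0 - p_cell) ** cells


def expected_usable_fraction(p_cell, block_bits=512, word_bits=32):
    """Expected fraction of units that stay enabled under each technique.

    A unit (block or word) is disabled as soon as one of its cells is faulty.
    The default sizes are illustrative assumptions, not values from the text.
    """
    return {
        "block_disabling": fault_free_probability(p_cell, block_bits),
        "word_disabling": fault_free_probability(p_cell, word_bits),
    }
```

The independence assumption in `fault_free_probability` is the one that clustering calls into question: if faults concentrate in a few regions, fewer units are hit than this formula predicts for the same number of faulty cells.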

We therefore plan to extend our model as follows:

Take into account clustering effects to provide more accurate results for the EMR.
Derive a performance model for the whole architecture from our EMR model.
Apply our model to the study of parallel applications.

[1] D. Sánchez, J. L. Aragón, and J. M. García. REPAS: Reliable execution for parallel applications in tiled-CMPs. In Proceedings of the 15th International European Conference on Parallel and Distributed Computing, pages 321-333, 2009.
[2] D. Sánchez, J. L. Aragón, and J. M. García. A log-based redundant architecture for reliable parallel computation. In 17th International Conference on High Performance Computing, 2010.
[3] D. Sánchez, Y. Sazeides, J. L. Aragón, and J. M. García. An analytical model for the calculation of the exe
Program
Talento Investigador y su Empleabilidad (Research Talent and Employability)
Call
Grants associated with carrying out supercomputing projects
Area
Information and communication technologies (ICT) / Computer architecture and technology (035)
File number
18279/BSCU/11
Researcher
Sanchez Pedreño, Daniel
Research group
GACOP