Case Study | Intel & SK hynix: Memory Failure Analysis and Prevention in Data Centers

Data Center
Intel® Memory Resilience Technology

Intel & SK hynix: Memory Failure Analysis and Prevention in Data Centers

Maximizing Memory Reliability in Data Centers Through Artificial Intelligence-Assisted Failure Analysis

Servers rely heavily on dynamic random-access memory (DRAM) as their primary memory because of its speed and cost efficiency. However, DRAM failures can lead to computational errors that directly affect the reliability, availability, and serviceability (RAS) of servers and can disrupt data center continuity. These memory failures often go unnoticed until a server crashes. To address this issue, Intel® Memory Resilience Technology was developed to give system administrators an early detection tool for identifying and preventing potential memory failures before they occur.

The Challenges of Memory Reliability

Memory faults can lead to a variety of correctable errors (CEs), including single-bit errors, single-row errors, and multi-array errors, each with its own frequency pattern (see Figure 1). These faults can also have their own victim patterns, and some carry a higher risk of becoming uncorrectable errors (UEs). Some memory faults are intermittent and difficult to trace, while others can be replicated.

Currently, there is no one-size-fits-all solution for addressing memory errors. For example, random single-bit errors can be corrected using Error Correction Code (ECC), while other types of memory errors require different technologies such as System ECC, Single Device Data Correction (SDDC), Post-Package Repair (PPR), and Intel® Memory Resilience Technology.

The team at Intel and SK hynix identified a small batch of DDR4 DIMMs with memory faults that can be replicated. This allowed Intel to conduct a deep-dive analysis to better understand their failures. Additionally, Intel was able to gather large-scale data from its own data centers to complete a comprehensive memory failure analysis.

Figure 1.
A simple classification of memory fault modes

Analyzing a Small Sample of Faulty DDR4 DIMMs

To trace memory faults, the defective areas of the DIMMs were translated into hardware addresses. Then, using an Intel® Xeon® Scalable platform, the error characteristics were recorded both with and without Intel® Memory Resilience Technology enabled. The goal was to study memory failure patterns to determine whether Intel® Memory Resilience Technology can mitigate errors caused by unreliable DIMMs.

The root causes of memory errors are defects in the manufacturing of DIMMs, while the errors themselves are symptoms of these defects. These defects, such as row, column, or bank faults, can affect multiple memory pages in the operating system that share the faulty physical DRAM address. Additionally, simply counting the number of CEs per page does not fully capture the complexity of cross-page faults.

The challenge is further exacerbated by the fact that traditional OS page offlining solutions lack knowledge of platform-specific ECC implementations and DRAM-specific memory failure characteristics. ECC is the error correction capability provided by the CPU, and DRAM-specific failure characteristics depend on the microarchitecture of the DRAM. Furthermore, not all faults or pages with a given rate of CEs are equally likely to experience future UEs. The rate of CEs in the past is not a reliable indicator of future