Facebook’s Open Compute Project wants servers to run cooler

Thermal simulation is helping to improve the thermal design of the Open Compute Project’s Intel-based servers, writes Tom Gregory of 6SigmaET.

The Open Compute Project Foundation is challenging itself to successfully scale computing infrastructure in a way that is both efficient and economical. Founded by Facebook, the foundation is fostering a fast-growing community of talented engineers around the world who are tasked with the design and delivery of the most efficient server, storage and data centre hardware designs for scalable computing.


Figure 2: CFD simulation showing airflow velocity through the improved system

The University of Texas at Arlington (UTA) is one organisation involved in the Open Compute Project. UTA is investigating new cooling strategies to improve the thermal design of the Open Compute Project’s Intel-based servers, and the team turned to thermal simulation to help find a solution.

The approach to improving the server’s thermal design was essentially two-pronged.

The first looked at warm water cooling while the other looked to improve the ducting inside the server, to ensure the hardware maintains consistent temperatures. To assess the feasibility of both options, UTA used the 6SigmaET software throughout their project to create, simulate and fine-tune their proposed solutions to the server’s thermal design issues.

Liquid Cooling

In order to achieve a completely liquid-cooled system, the existing air-cooled heat sinks were replaced with liquid-cooled cold plates and the system was sealed from the ambient air to prevent gaseous and component contamination.

In addition to the liquid-cooled plates, a heat exchanger was incorporated to cool the air inside the server, assisting those remaining components that are not directly liquid cooled. A combination of warm water and recirculated air would be used to cool the server. This solution required custom ducting to direct the recirculated air over the server’s DIMMs (dual in-line memory modules), PCH (Platform Controller Hub), hard drive (HDD) and other heat generating components.

A range of candidate ducts were designed by the team at UTA, with the goal of ensuring adequate airflow through the system and keeping component temperatures within the server’s critical limits. These designs were modelled and simulated in 6SigmaET, and the results used to determine the optimum duct design.

The first prototype was built from this design and then subjected to thermal testing. It was tested with water inlet temperatures from 27.5°C to 45°C in increments of 2.5°C. The server was exercised computationally at idle, 40%, 60%, 80% and 100% CPU loading, and one test was performed with maximum CPU and memory power levels to provide continuous heat dissipation.
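The test conditions described above can be enumerated in a few lines. The following is a minimal Python sketch, assuming every CPU load level was run at every water inlet temperature (the article does not state this explicitly). The function names are illustrative and not taken from the UTA study, and the single maximum CPU and memory power test is left out.

```python
from itertools import product

# CPU load levels listed in the article; idle is kept as a label.
CPU_LOADS = ["idle", "40%", "60%", "80%", "100%"]


def inlet_temperatures(start=27.5, stop=45.0, step=2.5):
    """Return water inlet temperatures in deg C, from start to stop inclusive."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


def build_test_matrix():
    """Pair every inlet temperature with every CPU load level (assumed full factorial)."""
    return list(product(inlet_temperatures(), CPU_LOADS))


if __name__ == "__main__":
    print(len(inlet_temperatures()), "inlet temperatures")
    print(len(build_test_matrix()), "temperature/load combinations")
```

Under that assumption, the sweep covers eight inlet temperatures and five load levels.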

The thermal testing results demonstrated a correlation between the server’s performance and the increased water inlet temperatures. The server’s cooling power consumption, radiator fan speeds, CPU temperatures and IT power consumption increased as the water temperature increased.

However, the CPU temperatures remained below the critical die temperature of 80°C throughout the experiment.

The prototyped duct regulated the air flow through the DIMMs, HDD and other auxiliary components as expected, maintaining component temperatures below critical. The university concluded that there was ample incentive to operate at higher water temperatures, up to 45°C.


Figure 3: DIMM temperatures from experimental tests. A4-A7 are closer to the internal radiator fan. A0-A3 are further from the fan, and are affected by pre-heat from the other DIMMs. The modified design maintains component temperatures below critical, up to 45°C

Improved Ducting

The first server had a removable chassis cover with an integrated air ducting system. However, this ducting was only provided in the CPU1 region, causing excessive flow bypass in the CPU0 region, resulting in warm air entering heat sink 1. The UTA team decided to investigate whether modifications to the server’s ducting system would improve its thermal performance.

Physical experiments were conducted on the server to determine its system impedance, flow rates, total server power consumption, fan speeds and fan power consumption at various power levels.

The resulting experimental data was in turn used to generate and calibrate a detailed CFD model of the server using 6SigmaET. The model was solved using the KE (k-epsilon) turbulence model and matched against the temperatures obtained from testing. The CFD model and experimental data showed good agreement (see figure 4), with a maximum error of 12%.
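A maximum-error figure of this kind is typically obtained by comparing paired measured and simulated values. The article does not say how the error was defined, so the sketch below assumes it is the largest absolute difference expressed as a percentage of the measured value. The function name and the sample numbers are hypothetical and are not data from the study.

```python
def max_percent_error(measured, simulated):
    """Largest absolute difference between paired values, as a percentage of the measured value."""
    if len(measured) != len(simulated):
        raise ValueError("measured and simulated must have the same length")
    if not measured:
        raise ValueError("at least one pair of values is required")
    return max(abs(s - m) * 100.0 / abs(m) for m, s in zip(measured, simulated))


if __name__ == "__main__":
    # Placeholder temperatures in deg C, for illustration only.
    measured = [50.0, 60.0, 80.0]
    simulated = [55.0, 57.0, 80.0]
    print(f"maximum error: {max_percent_error(measured, simulated):.1f}%")
```

Note that a percentage error on temperature depends on the unit and reference used (for example, temperature rise above ambient rather than absolute value), so the definition should be stated alongside any quoted figure.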

Once the model had been calibrated against the experimental data, the university used 6SigmaET to improve the server’s ducting system parametrically. The key goal was to reduce flow bypass in the CPU0 region without causing a temperature rise in the CPU1 region or an increase in fan power, which would increase total server power consumption.

The calibrated CFD model was used to parameterise the size and location of the duct, and solved for each new design iteration to determine how the processor temperatures would be affected in each case.
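The selection step in a parametric study like this can be sketched as a filter followed by a ranking. The Python example below is not UTA's procedure or a 6SigmaET API: it assumes each duct variant has already been solved, uses CPU0 temperature as a stand-in for flow bypass, and treats "no rise in CPU1 temperature or fan power" as hard constraints against the original server. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DuctResult:
    width_mm: float      # hypothetical duct size parameter
    offset_mm: float     # hypothetical duct location parameter
    cpu0_temp_c: float   # solved CPU0 temperature
    cpu1_temp_c: float   # solved CPU1 temperature
    fan_power_w: float   # solved fan power


def pick_best(results, baseline_cpu1_temp_c, baseline_fan_power_w):
    """Return the feasible design with the lowest CPU0 temperature, or None.

    A design is feasible if it raises neither the CPU1 temperature nor the
    fan power above the baseline (original server) values.
    """
    feasible = [
        r for r in results
        if r.cpu1_temp_c <= baseline_cpu1_temp_c
        and r.fan_power_w <= baseline_fan_power_w
    ]
    return min(feasible, key=lambda r: r.cpu0_temp_c, default=None)


if __name__ == "__main__":
    # Placeholder values for illustration, not results from the study.
    candidates = [
        DuctResult(100.0, 0.0, 62.0, 58.0, 9.0),
        DuctResult(120.0, 10.0, 59.0, 61.0, 8.5),   # CPU1 gets hotter: rejected
        DuctResult(140.0, 20.0, 60.0, 57.5, 8.0),
    ]
    print(pick_best(candidates, baseline_cpu1_temp_c=58.0, baseline_fan_power_w=9.0))
```

Keeping the constraints separate from the ranking makes it easy to tighten or relax them without changing how the remaining designs are compared.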

This process led to a final design for the improved ducting system. This design was prototyped and tested in the same way as the original server.


Figure 4: Temperature plots of the original server (top) and the server with improved ducting (bottom)

The test results were positive: fan power consumption was reduced by 23.4-40%, fan speeds by 22-26% and flow rate by 31.3-37.3%, while the server’s temperature stayed within the recommended range (see figure 4).

This case study gives a perfect demonstration of how thermal simulations give engineers a unique visual representation of the temperature and airflow inside equipment. This insight allows them to make better engineering decisions. Even in the context of a global mission like the Open Compute Project, these details matter and can make a big difference.

Both projects used thermal simulation to analyse airflow and temperatures in the original server and determine where improvements could be made. This allowed the university to create and test a range of designs, iteratively improve them, and determine which performed best.

The design with the best results was then prototyped and physically tested, with good agreement between the CFD model and the experimental results. Thermal simulation using 6SigmaET reduced the time and cost of the design stage, and meant that a wide range of potential solutions could be investigated.

Tom Gregory is a product specialist at 6SigmaET

 

