Overview
On-premise enterprise data centers house petabytes of data, thousands of applications, and numerous users, making it complex to extract the data relevant to any one project. This complexity causes significant delays for engineering teams that need on-premise data in order to leverage cloud compute resources.
One of Spillbox’s key strengths is identifying the essential on-premise data and making it readily available in the cloud for successful job execution. The system eliminates the need to pre-stage on-premise data in the cloud by transferring files in real time as jobs require them. Spillbox intelligently transfers only the specific files each job needs, minimizing cloud storage requirements.
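Conceptually, the per-file, on-demand transfer can be pictured with the minimal Python sketch below. The paths, the `open_on_demand` helper, and the local copy standing in for the actual transport are illustrative assumptions; Spillbox’s real mechanism is not detailed here.

```python
from pathlib import Path
import shutil

# Illustrative stand-ins (not Spillbox internals): ONPREM_ROOT models the
# on-premise project tree, CACHE_ROOT the cloud-local cache. A plain copy
# stands in for the real network transport.
ONPREM_ROOT = Path("/mnt/onprem")
CACHE_ROOT = Path("/scratch/cache")

def open_on_demand(rel_path: str, mode: str = "rb"):
    """Open a file, transferring it from on-premise only on first access."""
    cached = CACHE_ROOT / rel_path
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        # Transfer exactly this file -- never the whole project tree.
        shutil.copy2(ONPREM_ROOT / rel_path, cached)
    return open(cached, mode)

# A job that reads one netlist pulls one file, not the full project store:
# with open_on_demand("blocks/cpu/netlist.v") as f:
#     contents = f.read()
```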
Goal
Spillbox partnered with two major EDA companies to demonstrate that users can run regressions in the cloud without manual data transfers or complex configuration. The primary objectives of the proof of concept (POC) were:
- Automated Data Availability: Ensure only dependent data is automatically prepared in the cloud, allowing users to run regressions seamlessly.
- Seamless Integration: Showcase smooth integration with the customer’s existing infrastructure without requiring changes to their R&D environment.
- Performance Assessment: Evaluate the performance impact of Spillbox on customer workloads.
- Scalability: Ensure multiple users can run workloads in the cloud independently, with the system scaling linearly to hundreds of users.
Key Findings/Challenges
- One EDA partner had hostnames hardcoded in their build machine list, limiting cloud portability. To address this, the host list needed to be generated dynamically in the cloud. Spillbox lets customers detect whether a workflow is running on-premise or in the cloud, ensuring seamless portability and adaptability (see the detection sketch after this list).
- We encountered a major bug specific to Azure containers: the kernel’s procfs entries for file descriptors opened inside a container were not updated, deviating from standard container behavior. This issue was resolved by running the file server directly on the hosts (a short procfs diagnostic follows this list).
- For jobs that run in Docker containers, users should not have direct access to the containers. Spillbox implemented a tight Docker integration with the job manager to enforce this restriction (a wrapper of this shape is sketched after this list).
- LSF currently cannot handle output/error file paths that exist only inside the container and not on the host. Spillbox developed a solution to address this limitation (one possible path-remapping workaround is sketched after this list).
- For a C/C++ build at one of the EDA partners, Spillbox cloud bursting delivered 4X faster build times than the on-premise setup with comparable core counts and hardware.
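As referenced in the first finding above, a workflow can detect whether it is running in the cloud by probing the instance metadata service (IMDS) that both AWS and Azure expose at the link-local address 169.254.169.254. The sketch below is a generic probe, not Spillbox’s actual API; `discover_cloud_hosts` and `ONPREM_BUILD_HOSTS` are hypothetical names.

```python
import urllib.error
import urllib.request

def running_in_cloud(timeout: float = 0.5) -> bool:
    """Best-effort check for a cloud instance metadata service (IMDS)."""
    probes = [
        # AWS IMDS
        ("http://169.254.169.254/latest/meta-data/", {}),
        # Azure IMDS (requires the Metadata header)
        ("http://169.254.169.254/metadata/instance?api-version=2021-02-01",
         {"Metadata": "true"}),
    ]
    for url, headers in probes:
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req, timeout=timeout):
                return True
        except urllib.error.HTTPError:
            return True   # IMDS answered (even with an error status): we are in the cloud
        except OSError:
            continue      # no route / refused / timeout: no IMDS here
    return False

# The build machine list can then be generated instead of hardcoded:
# hosts = discover_cloud_hosts() if running_in_cloud() else ONPREM_BUILD_HOSTS
```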
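For context on the Azure container bug above: on Linux, /proc/&lt;pid&gt;/fd holds one symlink per open file descriptor, and standard container runtimes keep these entries in sync. A quick check like the one below, run inside a container, is the kind of probe that exposes the deviation; it is a diagnostic sketch, not the fix itself.

```python
import os

def list_open_fds() -> None:
    """Print this process's open file descriptors via procfs.

    On healthy Linux hosts and containers, opening a file immediately
    adds a /proc/self/fd/<n> symlink pointing at it. The Azure-specific
    bug meant such entries were not updated for descriptors opened by
    the containerized file server.
    """
    fd_dir = "/proc/self/fd"
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            target = "?"  # descriptor closed while we were iterating
        print(f"fd {fd} -> {target}")

# f = open("/tmp/probe.txt", "w")
# list_open_fds()   # expected: an entry pointing at /tmp/probe.txt
```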
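The Docker restriction above amounts to letting the job manager, not the user, own the container lifecycle. A minimal wrapper of that shape might look as follows; the function and image names are hypothetical, and the flags shown are standard Docker CLI options rather than Spillbox’s actual integration.

```python
import subprocess

def run_job_in_container(image: str, command: list[str], workdir: str) -> int:
    """Run a batch job inside a throwaway container owned by the job manager.

    Because the wrapper (not the user) creates the container and removes
    it on exit (--rm), there is no long-lived container for users to
    `docker exec` into or attach to.
    """
    docker_cmd = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:{workdir}",  # bind-mount the job's working directory
        "-w", workdir,                 # start the job in that directory
        image,
        *command,
    ]
    return subprocess.run(docker_cmd).returncode

# e.g. rc = run_job_in_container("eda-tools:latest", ["make", "regress"], "/proj/run42")
```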
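On the LSF limitation above: LSF writes the files named by bsub -o/-e on the host side, so a path that exists only inside the container fails. This document does not detail Spillbox’s fix; one common shape for such a workaround, sketched below with a hypothetical HOST_STAGE directory, is to point -o/-e at a host staging area that the container launcher bind-mounts at the in-container log path.

```python
import os
import subprocess

# Hypothetical host directory that LSF can write to and that the container
# launcher bind-mounts at the container-internal log directory.
HOST_STAGE = "/shared/lsf_io"

def submit_with_remapped_io(job_name: str, command: str) -> None:
    """Submit an LSF job whose log paths would otherwise exist only in-container.

    -o/-e are pointed at host-side files; the container runtime is expected
    to bind-mount HOST_STAGE wherever the in-container tools look for logs.
    """
    os.makedirs(HOST_STAGE, exist_ok=True)
    out = os.path.join(HOST_STAGE, f"{job_name}.out")
    err = os.path.join(HOST_STAGE, f"{job_name}.err")
    subprocess.run(["bsub", "-J", job_name, "-o", out, "-e", err, command],
                   check=True)
```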
Further experiments on the C/C++ build revealed two key insights:
- The 4X performance gain is achieved only when the NFS server has a 100 Gbit/s network connection; at 10 Gbit/s, performance drops to parity (1X) with on-premise.
- For builds, lower-performance EBS volumes and high-performance NVMe storage on the NFS server deliver similar performance.
The key takeaway is that, for build performance, investing in faster network speeds offers a significantly higher ROI than upgrading to high-performance storage.
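A back-of-envelope calculation illustrates why the link speed dominates; the 500 GB working set below is an assumed figure for illustration, not a measurement from the POC.

```python
# Time to move a build's I/O working set over the NFS server's network link.
working_set_gb = 500  # assumed I/O volume for one build (illustrative)
for link_gbit in (10, 100):
    gbytes_per_s = link_gbit / 8  # Gbit/s -> GB/s, ignoring protocol overhead
    seconds = working_set_gb / gbytes_per_s
    print(f"{link_gbit:>3} Gbit/s link: {seconds:6.0f} s to move {working_set_gb} GB")

# -> 400 s at 10 Gbit/s vs 40 s at 100 Gbit/s. A 10x swing from the network
#    alone, whereas swapping EBS for NVMe changes nothing once the link,
#    not the storage, is the bottleneck.
```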
Results
Across both EDA companies, millions of tests were conducted using various tools, including front-end simulation, formal verification, physical verification, static timing analysis, and place & route. Job managers like LSF and UGE were employed, with multiple users running tools concurrently, scaling up to several thousand cores in the cloud.
These results were validated on both AWS and Azure. Spillbox successfully transferred only the necessary files, reducing the required cloud storage in one case from 250 TB to just 0.5 TB. Spillbox demonstrated that cloud cost and performance were comparable to a traditional data center, with the added advantage of compute and storage that scale as user demand grows.