Cisco IT deploys AI-ready data center in weeks while scaling for the future

Cisco designed an AI-ready infrastructure with Cisco Compute, best-in-class NVIDIA GPUs, and Cisco Networking that supports AI model training and inferencing across dozens of use cases for Cisco product and engineering teams.

It's no secret that the pressure to implement AI across the business presents challenges for IT teams. It pushes us to deploy new technology faster than ever before and to re-evaluate how data centers are built to meet growing demands across compute, networking, and storage. While the pace of innovation and business impact is exciting, it can also feel daunting.

How do you quickly build the data center infrastructure needed to power AI workloads while keeping pace with critical business needs? That's exactly what our team at Cisco IT faced.

The ask from the business

The product team approached us needing a way to run AI workloads that would be used to develop and test new AI capabilities for Cisco products. It would eventually support model training and inferencing for multiple teams across the business. And they needed it fast. To get innovations to our customers as quickly as possible, we had to deliver the new environment in just three months.

Technical requirements

We started by mapping out the requirements for the new AI infrastructure. A non-blocking, lossless network was essential for the AI compute fabric to ensure reliable, predictable, and high-performance data transmission within the AI cluster. Ethernet was the first-class choice. Other requirements included:

  • Intelligent buffering, low latency: As in any good data center, these are necessary to maintain smooth data flow and minimize delays, and they are especially important for the responsiveness of AI workloads.
  • Dynamic congestion avoidance for various workloads: AI workloads can vary significantly in their demands on network and compute resources. Dynamic congestion avoidance would ensure that resources were allocated efficiently, prevent performance degradation during peak usage, maintain service levels, and avert bottlenecks that could disrupt operations.
  • Dedicated front-end and back-end networks, non-blocking fabric: To build a scalable infrastructure, a non-blocking fabric would provide sufficient bandwidth for data to flow freely, enabling high-speed transmission of the large data volumes typical of AI applications. By separating our front-end and back-end networks, we could improve security, performance, and manageability.
  • Automation for Day 0 to Day 2 operations: From the day we deployed, through configuration and ongoing management, we had to minimize manual intervention to keep processes fast and reduce human error.
  • Telemetry and visibility: Together, these would provide insight into system performance and health, enabling proactive management and troubleshooting.
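To make the non-blocking requirement above concrete: a leaf switch is non-blocking when its uplink bandwidth toward the spine matches or exceeds its host-facing bandwidth, so every GPU node can transmit at line rate simultaneously. A minimal sketch of that check (the port counts and speeds are hypothetical examples, not details of our actual design):

```python
def oversubscription_ratio(downlink_ports: int, downlink_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Ratio of host-facing bandwidth to spine-facing bandwidth on a leaf.

    A ratio <= 1.0 means the leaf is non-blocking: all host ports can
    send at line rate at once without contending for uplink capacity.
    """
    downlink_bw = downlink_ports * downlink_gbps
    uplink_bw = uplink_ports * uplink_gbps
    return downlink_bw / uplink_bw

# Hypothetical leaf: 32 x 100G ports to GPU nodes, 8 x 400G uplinks to spines.
ratio = oversubscription_ratio(32, 100, 8, 400)
print(f"Oversubscription: {ratio:.2f}:1")  # 1.00:1 -> non-blocking
```

The same arithmetic works for any leaf-spine design: if the ratio creeps above 1:1, the fabric can block under full load.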

The plan – with a few hurdles to overcome

With the requirements in place, we set about figuring out where to build the cluster. Our existing data center facilities had not been designed to support AI workloads. We knew that building from scratch with a full data center refresh would take 18-24 months, which was not an option. We needed to deliver an operational AI infrastructure in a matter of weeks, so we leveraged an existing facility with minor changes to cabling and device distribution to accommodate it.

Our next concerns were around the data used to train the models. Since some of that data would not be stored locally in the same facility as our AI infrastructure, we decided to replicate data from other data centers into the AI infrastructure's storage systems to avoid the performance penalty of reaching across the network. Our network team had to ensure sufficient network capacity to handle this data replication into the AI infrastructure.
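Sizing that replication capacity reduces to a back-of-the-envelope calculation: dataset size divided by usable link bandwidth. A minimal sketch (the dataset size, link speed, and utilization figure below are hypothetical, not numbers from our deployment):

```python
def replication_hours(dataset_tb: float, link_gbps: float,
                      utilization: float = 0.7) -> float:
    """Estimate hours to replicate a dataset over a data center interconnect.

    `utilization` discounts the raw link rate for protocol overhead and
    traffic sharing the same path.
    """
    dataset_bits = dataset_tb * 1e12 * 8           # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization  # usable bits per second
    return dataset_bits / effective_bps / 3600     # seconds -> hours

# Hypothetical: 500 TB of training data over a 100G link at 70% utilization.
print(f"~{replication_hours(500, 100):.1f} hours")
```

Running the estimate for each candidate dataset makes it easy to see whether replication fits the maintenance window or whether the interconnect needs more capacity.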

Now we get to the actual infrastructure. We designed the heart of the infrastructure with Cisco Compute, best-in-class GPUs from NVIDIA, and Cisco Networking. On the networking side, we built a front-end Ethernet network and a back-end lossless Ethernet network. With this model, we were confident we could quickly deploy advanced AI capabilities in any environment and continue to add them as needs grew.


Supporting a growing environment

After deploying the initial infrastructure, the business added more use cases every week, and we added additional AI clusters to support them. We needed a way to make it all easier to manage, including managing switch configurations and monitoring for packet loss. We used Cisco Nexus Dashboard, which dramatically streamlined operations and ensured we could grow and scale for the future. We were already using it in other parts of our data center operations, so it was easy to extend it to our AI infrastructure and didn't require the team to learn an additional tool.

Results

Our team was able to move fast and overcome several hurdles in designing the solution. We designed and deployed the back-end AI fabric in under three weeks, and deployed the entire AI cluster and fabric within three months, which was 80% faster than the alternative of a full rebuild.

Today, the environment supports more than 25 use cases across the business, with more added each week, including:

  • Webex Audio: Improving codec development for noise cancellation and lower-bandwidth data prediction
  • Webex Video: Model training for background replacement, gesture recognition, and facial landmarks
  • Custom LLM training for product and cybersecurity capabilities

Not only are we able to support the needs of the business today, we are also designing how our data centers must evolve for the future. We are actively building more clusters and will share additional details about our journey in future blogs. The modularity and flexibility of Cisco networking, compute, and security give us the confidence that we can keep scaling with the business.

