Sunday, July 21, 2024

Where does eBay do most of its AI development? You’d be surprised

For a company that did business in the cloud before the concept of cloud computing existed, eBay Inc. has taken a decidedly non-cloud-centric approach to training and deploying artificial intelligence.

While the company draws on public cloud resources during seasonal peaks, most of its AI work is done in its own data centers. That approach lets it meet high customer privacy and compliance standards and speeds time to market, said Parantap Lahiri (pictured), vice president of network and data center engineering at the e-commerce giant.

“The public cloud is a friend where you can rent resources to solve some of your load balancing problems, but we’re going to have our core competency on-premises,” he said in an interview with SiliconANGLE.

Blessed with talent

EBay makes much of its server hardware and has ample engineering talent, Lahiri said. “We found that beyond a certain point of scale, it makes much more business and financial sense to run the bulk of our workloads on-premises,” he said. “We’re fortunate to have the right engineering talent to pre-train models, fine-tune them, deploy them on our own infrastructure and integrate them with our applications.”

EBay was an early adopter of AI among retail organizations, creating its first apps in 2016. Its generative Magic Listing feature allows sellers to take or upload a photo, and the AI fills in the details about the item being listed. The app can write titles, descriptions, product release dates, detailed category and subcategory metadata, and even suggest a listing price and shipping cost.

A Personalized Recommendations feature, launched late last year, generates buyer recommendations from hundreds of candidates based on an individual user’s shopping experience and predicted purchasing behavior. The company just took top honors for “Best Overall Generative AI Solution” at Tech Breakthrough LLC’s AI Breakthrough Awards.

AI is also being used in customer service to analyze previous interactions with individual customers and summarize their concerns, setting the stage for “a much more effective call,” Lahiri said.

Dedicated AI stack

EBay’s on-premises AI stack consists of a high-performance computing cluster with a dedicated set of Nvidia Corp. H100 Tensor Core graphics processing units and high-speed interconnects. Lahiri said standard cloud infrastructure is not suited to the needs of large training jobs.

“You can’t train those models on cloud infrastructure because you need more of an HPC approach with Infiniband, RDMA and back-end connections because one GPU needs to access the memory of another GPU to train the model,” he said. Remote direct memory access, or RDMA, allows one computer to directly access the memory of another without involving the operating system of either.

The company uses a variety of language and machine learning models, from large language models to smaller open-source variants. Having a dedicated local resource has proven to be “very time-efficient because we don’t have to wait for resources to be acquired from the public cloud,” Lahiri said. “The local network is flat and high-speed, so data movement is much easier.” Any capacity that can’t be handled locally is moved to the public cloud.

Layered architecture

By refining its AI architecture over the years, eBay has created a layered approach that abstracts much of the complexity away from the user and application.

At the top level, “the application makes a call and then we create a shim [a small piece of code that acts as a middleman between two systems or software components], so it doesn’t matter who is serving,” Lahiri said. “It could be Nvidia, AMD or Intel hardware. The application doesn’t have to worry about the differences between them.”
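The shim idea can be illustrated with a minimal Python sketch. This is not eBay's code: the class name `InferenceShim`, the backend functions and the preference order are all hypothetical, and the backends are stand-ins that only tag their output so the routing is visible.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for vendor-specific serving code.
def serve_on_nvidia(prompt: str) -> str:
    return f"[nvidia] {prompt}"

def serve_on_amd(prompt: str) -> str:
    return f"[amd] {prompt}"

def serve_on_intel(prompt: str) -> str:
    return f"[intel] {prompt}"

class InferenceShim:
    """Thin layer between the application and whichever hardware serves it."""

    def __init__(self, backends: Dict[str, Callable[[str], str]],
                 preferred: List[str]) -> None:
        self._backends = backends    # backends currently available
        self._preferred = preferred  # order in which to try them

    def generate(self, prompt: str) -> str:
        # The application calls generate() and never names a vendor.
        for name in self._preferred:
            backend = self._backends.get(name)
            if backend is not None:
                return backend(prompt)
        raise RuntimeError("no inference backend available")

# Only AMD hardware is registered here, so the call lands on it.
shim = InferenceShim({"amd": serve_on_amd}, ["nvidia", "amd", "intel"])
print(shim.generate("Describe this item"))
```

The application code stays the same if the registered backends change; only the mapping handed to the shim differs.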

Running a local HPC environment is not without its challenges. One is simply keeping up with the rapid evolution of GPUs.

“Capacities are growing three to six times with each generation,” Lahiri said. “The cycles are completely different than on the CPU side. You can’t mix and match (GPU generations) because if you train a model on a lower-quality GPU and a higher-quality GPU, it will default to the lower quality.”
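One simplified way to see why a mixed fleet falls back to its slowest member is a toy model of synchronous training, sketched below in Python. It assumes every GPU gets an equal share of work per step and that all GPUs must finish before the next step begins; the throughput numbers are illustrative only and do not describe any real hardware.

```python
from typing import Sequence

def synchronous_step_time(work_per_gpu: float,
                          gpu_throughputs: Sequence[float]) -> float:
    """Time for one synchronous step: everyone waits for the slowest GPU."""
    return max(work_per_gpu / t for t in gpu_throughputs)

def effective_throughput(work_per_gpu: float,
                         gpu_throughputs: Sequence[float]) -> float:
    """Total work completed per unit of time across the whole group."""
    step = synchronous_step_time(work_per_gpu, gpu_throughputs)
    return work_per_gpu * len(gpu_throughputs) / step

# Illustrative relative speeds: 1.0 for an older GPU, 3.0 for a newer one.
mixed = [1.0, 1.0, 3.0, 3.0]
all_old = [1.0, 1.0, 1.0, 1.0]
all_new = [3.0, 3.0, 3.0, 3.0]

print(effective_throughput(6.0, mixed))    # same as the all-old group
print(effective_throughput(6.0, all_old))
print(effective_throughput(6.0, all_new))
```

Under these assumptions the mixed group performs no better than a group made up entirely of the older GPUs, because the faster GPUs sit idle while the slower ones finish each step.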

Another challenge is that running a large number of GPUs taxes power and cooling infrastructure. An x86 server processor draws between 200 and 500 watts, well below the 700-watt peak of an Nvidia H100 GPU or the 1,200 watts of Nvidia’s GB200 “superchip.”
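The gap can be put in rough numbers with a short Python calculation. The wattage figures are the ones cited above; the count of eight GPUs per server is an assumption made for illustration and is not a detail of eBay's deployment.

```python
import math

CPU_WATTS_RANGE = (200, 500)   # x86 server processor, per the article
H100_PEAK_WATTS = 700          # per the article
GB200_PEAK_WATTS = 1200        # per the article
GPUS_PER_SERVER = 8            # assumption for illustration only

def ratio_range(accel_watts: float, cpu_range: tuple) -> tuple:
    """Accelerator draw as a multiple of the hungriest and leanest CPU."""
    low, high = cpu_range
    return accel_watts / high, accel_watts / low

def server_gpu_power_kw(gpu_watts: float,
                        gpus_per_server: int = GPUS_PER_SERVER) -> float:
    """Peak draw of the GPUs alone in one server, in kilowatts."""
    return gpu_watts * gpus_per_server / 1000

h100_min, h100_max = ratio_range(H100_PEAK_WATTS, CPU_WATTS_RANGE)
gb200_min, gb200_max = ratio_range(GB200_PEAK_WATTS, CPU_WATTS_RANGE)

print(f"H100 vs. CPU:  {h100_min:.1f}x to {h100_max:.1f}x")
print(f"GB200 vs. CPU: {gb200_min:.1f}x to {gb200_max:.1f}x")
print(f"GPUs in one server at peak: {server_gpu_power_kw(H100_PEAK_WATTS):.1f} kW")
```

The per-server figure covers the GPUs only; CPUs, memory, networking and cooling overhead would add to it.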

Lahiri said that when eBay’s fleet of H100 GPUs is running at full power, the cooling systems create so much noise that data center employees have to wear ear protection. Liquid cooling is an alternative, but it is expensive and disruptive to install.

Discovering AI infrastructure

Lahiri said he’s confident these issues will be resolved over time. “Over the next two to three years, we’ll figure out the right kind of infrastructure for inference, training, and managing GPU infrastructure,” he said. “There will be a lot of innovation in the inference world as multiple chips emerge that focus primarily on that rather than training.”

There will be plenty of new options, as more than a dozen startups are working on AI-specific chipsets, mostly focused on inference. Lahiri said his team is staying up to date on their progress, but practical considerations warrant caution.

“You can fall in love with technology, but you have to look at the reality of how to implement it in your data center,” he said. “The technology may seem really interesting at the moment, but it has to keep up with the pressure of time.”

Photo from: eBay
