Artificial intelligence training can now require exabyte-scale data transfers, which in turn demand a new transport architecture and much higher network capacities, a Lumen Technologies white paper notes.
“If you go in to rent or start an AI workload and you have a 10-gig link, and a training is a petabyte, you are paying for 222 hours of idle time just to move your data into that cloud,” says David Ward, Lumen Technologies chief technology and product officer.
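That 222-hour figure is straightforward size-over-rate arithmetic. The short Python sketch below reproduces it, assuming a decimal petabyte (10^15 bytes) and a fully saturated 10-Gbps link with no protocol overhead.

# Back-of-the-envelope transfer time: data volume divided by link speed.
def transfer_hours(data_gb, link_gbps):
    gigabits = data_gb * 8                 # convert gigabytes to gigabits
    return gigabits / link_gbps / 3600     # seconds on the link, converted to hours

petabyte_gb = 1_000_000                    # one decimal petabyte, expressed in gigabytes
print(round(transfer_hours(petabyte_gb, 10)))  # -> 222 hours on a 10-Gbps link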
Consider the cost implications of capacity when running an AI training session. If data delivery (transport) costs about $0.02 per gigabyte, then loading training data into a remotely hosted model over a 10-Gbps connection works out to roughly $431,000, Lumen estimates.
Over a 400-Gbps connection, which moves the same data in one-fortieth of the time, costs are about 40 times lower, Ward argues.
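To see which part of the bill the faster link actually reduces, the sketch below splits cost into a per-gigabyte transport charge (the $0.02 figure above) plus an hourly charge for resources that sit idle while the data loads. The $1,000-per-hour idle rate and the one-petabyte volume are hypothetical placeholders for illustration, not figures from the white paper.

# Illustrative cost split: fixed per-GB transport charge plus time-dependent idle cost.
COST_PER_GB = 0.02            # transport (egress) charge, $/GB, per the article
IDLE_COST_PER_HOUR = 1_000.0  # hypothetical cost of compute sitting idle while data loads

def transfer_hours(data_gb, link_gbps):
    return data_gb * 8 / link_gbps / 3600

def cost_breakdown(data_gb, link_gbps):
    transport = data_gb * COST_PER_GB                               # unchanged by link speed
    idle = transfer_hours(data_gb, link_gbps) * IDLE_COST_PER_HOUR  # shrinks with faster links
    return transport, idle

data_gb = 1_000_000  # one petabyte, expressed in gigabytes (hypothetical volume)
for gbps in (10, 400):
    transport, idle = cost_breakdown(data_gb, gbps)
    print(f"{gbps} Gbps: transport ${transport:,.0f}, idle ${idle:,.0f}")

The per-gigabyte term does not change with link speed; only the time-dependent term shrinks, and it shrinks by the same factor of 40 as the jump from 10 Gbps to 400 Gbps.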
So, at petabyte and exabyte scale, the transport network, not the graphics processing units, sets AI model training time and cost, Lumen argues.
For example, egressing an exabyte of data over a 10-Gbps connection would take 1,389 hours, Lumen argues. Over a 400-Gbps or 800-Gbps connection, the time drops to 694 and 347 hours, respectively.
The implication is that network bandwidth now matters in a new way: it shapes how efficiently the graphics processing units used for training can be kept busy.
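To make that GPU-utilization point concrete, here is a toy model in Python that assumes training cannot start until the remote data has fully landed and that the cluster is billed for the entire wall-clock run. The one-petabyte data volume and the 500-hour compute phase are hypothetical; the 10/400/800-Gbps link speeds match the examples above.

# Toy model of GPU idle time caused by data loading.
def transfer_hours(data_gb, link_gbps):
    return data_gb * 8 / link_gbps / 3600

DATA_GB = 1_000_000   # one petabyte of training data (hypothetical)
COMPUTE_HOURS = 500   # time the GPUs spend actually training (hypothetical)

for gbps in (10, 400, 800):
    load = transfer_hours(DATA_GB, gbps)
    idle_share = load / (load + COMPUTE_HOURS)  # fraction of the run spent waiting on data
    print(f"{gbps} Gbps: {load:.0f} h loading, {idle_share:.0%} of the run spent waiting")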