Feed aggregator

Partitioning to optimize AI inference for multi-core platforms

EDN Network - Mon, 01/08/2024 - 09:22

Not so long ago, artificial intelligence (AI) inference at the edge was a novelty easily supported by a single neural processing unit (NPU) IP accelerator embedded in the edge device. Expectations have accelerated rapidly since then. Now we want embedded AI inference to handle multiple cameras, complex scene segmentation, voice recognition with intelligent noise suppression, fusion between multiple sensors, and even very large and complex generative AI models.

Such applications can deliver acceptable throughput for edge products only when run on multi-core AI processors. NPU IP accelerators are already available to meet this need, extending to eight or more parallel cores and able to handle multiple inference tasks in parallel. But how should you partition expected AI inference workloads for your product to take maximum advantage of all that horsepower?

Figure 1 Multi-core AI processors can deliver acceptable throughput for edge applications like scene segmentation. Source: Ceva

Paths to exploit parallelism for AI inference

As in any parallelism problem, we start with a defined set of resources for our AI inference objective: some number of available accelerators with local L1 cache, shared L2 cache and a DDR interface, each with defined buffer sizes. The task is then to map the network graphs required by the application to that structure, optimizing total throughput and resource utilization.
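As a rough illustration, a partitioner might represent those resources with a small data model like the sketch below (Python, with hypothetical names and illustrative buffer sizes; this is an assumption about structure, not Ceva's actual tooling):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Engine:
    """One NPU core with its private L1 buffer (sizes are illustrative)."""
    engine_id: int
    l1_kb: int = 512
    busy_until: float = 0.0   # time at which this engine becomes free

@dataclass
class Platform:
    """Shared resources visible to the partitioning step."""
    engines: List[Engine]
    l2_kb: int = 4096
    ddr_gbps: float = 8.0

# an eight-core configuration, matching the core counts mentioned above
platform = Platform(engines=[Engine(i) for i in range(8)])
```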

One obvious strategy applies when a large input image must be split into multiple tiles: partitioning by input map, where each engine is allocated a tile. Here, multiple engines search the input map in parallel, looking for the same feature. Conversely, you can partition by output map: the same tile is fed into multiple engines in parallel, each running the same model but with different weights, so that different features are detected in the input image at the same time.
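A minimal sketch of the two schemes, assuming NumPy arrays for the feature maps and hypothetical helper names:

```python
import numpy as np

def partition_by_input_map(image: np.ndarray, n_engines: int):
    """Split a large input map into row bands, one band per engine;
    every engine runs the same model and weights on its own tile."""
    tiles = np.array_split(image, n_engines, axis=0)
    return list(enumerate(tiles))            # (engine_id, tile)

def partition_by_output_map(weight_sets: list, n_engines: int):
    """Send the same tile to every engine, but give each engine a different
    group of weights so each detects a different set of features."""
    groups = [weight_sets[i::n_engines] for i in range(n_engines)]
    return list(enumerate(groups))           # (engine_id, weights for that engine)

# example: a 2160 x 3840 frame split across four engines
tiles = partition_by_input_map(np.zeros((2160, 3840, 3), dtype=np.uint8), 4)
```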

Parallelism within a neural net is commonly seen in subgraphs, as in the example below (Figure 2). Resource allocation will typically optimize breadth-wise, then depth-wise, each time optimizing for the current step. Obviously that approach won’t necessarily find a global optimum in one pass, so the algorithm must allow for backtracking to explore improvements. In this example, three engines can deliver >230% of the performance that would be possible if only one engine were available.

Figure 2 Subgraphs highlight parallelism within a neural net. Source: Ceva
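The scheduling effect described above can be approximated with a toy scheduler like the sketch below, under the assumption that each parallel branch has a single estimated cycle cost; exhaustive permutation stands in for the backtracking step:

```python
import itertools

def makespan(branch_costs, n_engines):
    """Greedy breadth-first allocation: each branch goes to the engine
    that frees up earliest."""
    loads = [0] * n_engines
    for cost in branch_costs:
        i = loads.index(min(loads))
        loads[i] += cost
    return max(loads)

def best_schedule(branch_costs, n_engines):
    """Try all assignment orders (a crude stand-in for backtracking)
    and keep the one with the shortest makespan."""
    return min(makespan(p, n_engines)
               for p in itertools.permutations(branch_costs))

costs = [40, 35, 30]                        # three parallel branches, in cycles
print(best_schedule(costs, 3), sum(costs))  # 40 vs. 105 -> roughly 2.6x speedup
```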

While some AI inference models or subgraphs may exhibit significant parallelism as in the graph above, others may display long threads of operations, which may not seem very parallelizable. However, they can still be pipelined, which can be beneficial when considering streaming operations through the network.

One example is layer-by-layer processing in a deep neural network (DNN). Simply organizing layer operations per image to minimize context switches per engine can boost throughput, while allowing downstream pipeline stages to start later yet still sooner than in purely sequential processing. Another good example is provided by transformer-based generative AI networks, where alternation between attention and normalization steps allows sequential recognition tasks to be pipelined.
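A back-of-the-envelope model of that pipelining gain, assuming one layer group per engine and illustrative per-stage cycle counts:

```python
def pipeline_throughput(n_frames: int, stage_costs: list):
    """Stream frames through a layer-per-engine pipeline: after the first
    frame fills the pipe, throughput is limited by the slowest stage,
    not by the sum of all stages."""
    fill_latency = sum(stage_costs)              # first frame, end to end
    bottleneck = max(stage_costs)                # steady-state cost per frame
    pipelined = fill_latency + (n_frames - 1) * bottleneck
    sequential = n_frames * sum(stage_costs)     # one frame at a time
    return pipelined, sequential

print(pipeline_throughput(100, [5, 8, 6, 7]))    # e.g. (818, 2600) cycles
```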

Batch partitioning is another method, providing support for the same AI inference model running on multiple engines, each fed by a separate sensor. This might support multiple image sensors for a surveillance device. Finally, you can partition by having different engines run different models. This strategy is especially useful in semantic segmentation, say for autonomous driving, where some engines might detect lane markings, others might handle free (drivable) space segmentation, and still others might detect objects (pedestrians and other cars).
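Both schemes reduce to a simple static assignment; a hypothetical plan for a four-camera surveillance device and a driving stack might look like this (all names are illustrative only):

```python
# Batch partitioning: the same detector model instantiated once per sensor.
camera_streams = ["cam0", "cam1", "cam2", "cam3"]
batch_plan = {f"engine{i}": {"model": "person_vehicle_detector", "input": cam}
              for i, cam in enumerate(camera_streams)}

# Model partitioning: different engines run different models on the same frame.
driving_plan = {
    "engine0": "lane_marking_detection",
    "engine1": "free_space_segmentation",
    "engine2": "object_detection",   # pedestrians and other cars
}

print(batch_plan["engine2"], driving_plan["engine1"])
```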

Architecture planning

There are plenty of options for optimizing throughput and utilization, but how do you decide how best to tune for your AI inference application's needs? This architecture planning step must necessarily come before model compilation and optimization. Here you want to explore tradeoffs between partitioning strategies.

For example, a subgraph with parallelism followed by a long thread of operations might sometimes be best served simply by pipelining rather than by a combination of parallelism and pipelining. The best option in each case will depend on the graph, buffer sizes, and context-switching latencies. Here, support for experimentation is critical to determining optimal implementations.
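One way to support that experimentation is a coarse analytical cost model evaluated before committing to a compile flow; the sketch below is an assumed, simplified example, not a description of any vendor tool:

```python
def estimate_cycles(strategy: str, stage_costs: list, n_engines: int,
                    ctx_switch: int) -> float:
    """Rough per-frame cost of two candidate partitionings of a graph whose
    first stage is parallelizable and whose tail is a serial thread."""
    if strategy == "pipeline_only":
        # one stage per engine; steady state is set by the slowest stage
        return max(stage_costs) + ctx_switch
    if strategy == "parallel_then_pipeline":
        # split the first stage across all engines, pipeline the tail,
        # and pay extra context switches for the re-partitioning
        head, *tail = stage_costs
        return max([head / n_engines] + tail) + 2 * ctx_switch
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("pipeline_only", "parallel_then_pipeline"):
    print(s, estimate_cycles(s, [24, 10, 9], n_engines=3, ctx_switch=2))
```

With the illustrative numbers above the combined strategy wins, but raising the context-switch cost quickly flips the answer toward plain pipelining, which is exactly the kind of tradeoff the planning step has to expose.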

Rami Drucker is machine learning software architect at Ceva.

Related Content


The post Partitioning to optimize AI inference for multi-core platforms appeared first on EDN.

ISRO’s Aditya-L1 Reaches Destination, successfully placed in a halo orbit around L1 point

ELE Times - Mon, 01/08/2024 - 08:06

India’s maiden solar mission Aditya-L1 successfully reached its destination and was placed in a halo orbit around the L1 point on January 6, 2024.

The Indian Space Research Organisation (ISRO) launched the solar mission on September 2, 2023. The L1 point is located roughly 1.5 million km from Earth and will enable the spacecraft to view the sun continuously.

ISRO Chairman S Somnath told reporters that the halo orbit insertion process was done as intended. “Today’s event was to place the Aditya-L1 in the precise halo orbit. The spacecraft was moving towards the halo orbit but we had to make some corrections to put it in the right place. If we do not do the correction today there could have been a possibility that it could escape from this point (L-1 point). But we would not have allowed that to happen as there are some contingencies in place, but I am only telling mathematically it can escape.”

“So that has been very precisely done [placing the spacecraft in the halo orbit]. What we have achieved today is exact placement based on our measurement and very correct prediction of the velocity requirement. Right now, in our calculation the spacecraft is in the right place,” he added.

The post ISRO’s Aditya-L1 Reaches Destination, successfully placed in a halo orbit around L1 point appeared first on ELE Times.

Latching relay question

Reddit:Electronics - Sun, 01/07/2024 - 16:05

Hi all, I'm new here. I'd appreciate your help with the following.

I have a latching relay module (the Adafruit one if it matters - 3V). It's meant to be used with their Feather MCU but it can also be used standalone (which is what I'm doing).

That relay has two separate trigger inputs for SET and RESET. I have a timer board with a single OUT wire that can automatically trigger a relay, an LED etc. after a set amount of time.

My question is, is it possible to trigger the relay on & off somehow (without an MCU) by having that single trigger voltage (e.g. 1 wire) alternate between the separate SET and RESET pins of the relay?

So effectively, pulse -> SET, pulse -> RESET, pulse -> SET, pulse -> RESET etc.

I hope I've explained this properly. Any questions, let me know. Thank you.

submitted by /u/ThinkMuffin

Makita Radio DMR102

Reddit:Electronics - Sat, 01/06/2024 - 19:21

Can somebody help me with this part? What is it called? Is it a transistor? 12V arrives at the right contact, but all the other ones are at 0V. It's hard to read, but I think it says YL4686 SA7BRLXD. I can't find anything by googling it.

submitted by /u/Olli_Ohh

Weekly discussion, complaint, and rant thread

Reddit:Electronics - Sat, 01/06/2024 - 18:00

Open to anything, including discussions, complaints, and rants.

Sub rules do not apply, so don't bother reporting incivility, off-topic, or spam.

Reddit-wide rules do apply.

To see the newest posts, sort the comments by "new" (instead of "best" or "top").

submitted by /u/AutoModerator
