Wallstreetcn
2024.05.22 00:10

Intel executives appear at the Microsoft conference: unleashing the AI PC's superpowers with an optimized, innovation-ready platform for running AI models

AI PCs include optimized versions of OpenVINO and DirectML that can efficiently run generative AI models such as Phi-3 on CPUs, GPUs, and NPUs. The presentation covered deploying AI agents that can reason and take action using tools, running AI models efficiently on AI PCs with quantization and optimized decoding techniques, and use cases such as personal assistants, secure local chat, code generation, Retrieval Augmented Generation (RAG), and more.

Microsoft's annual Build developer conference kicked off on Tuesday, where Saurabh Tangri, Principal Software Architect at Intel, and Guy Boudoukh, head of Intel's AI applications research team, presented development and application trends for AI PCs.

Tangri explained that AI agents and generative AI applications give PC users unprecedented capabilities. AI PCs include optimized versions of OpenVINO and DirectML that can efficiently run generative AI models such as Phi-3 on CPUs, GPUs, and NPUs. Developers can deploy AI agents that reason and take action using tools, and run AI models efficiently on AI PCs with quantization and optimized decoding techniques, for use cases such as personal assistants, secure local chat, code generation, Retrieval Augmented Generation (RAG), and more.

Tangri noted that current AI technology allows some of these capabilities to be built into the platform. He said, "When users have static language models trained on static data, they need a way to extend them at run time. Today, that can be done by running Retrieval Augmented Generation (RAG) to enhance their capabilities, so the AI can take on more tasks."

He gave an example: in a consumer scenario, a common question is "Have I exceeded my budget?" Now you can bring your private data to the AI, analyze it with an advanced LLM (large language model), and extract conclusions and actions from it.

"This element is very novel. I am very excited about this. This is the first time we have shown this complete pipeline, from RAG to LLM to reaction, inference, all running on your PC. It's very interesting, very cutting-edge."

Guy Boudoukh then demonstrated the small multimodal model Phi-3 running on an Intel Core Ultra processor, including the Phi-3 AI agent's responses, its interaction with private data, how users converse with their documents, and how answers are generated through RAG.

Boudoukh explained that the front end of the Phi-3 ReAct agent is the instructions and context the user gives the language model to accomplish the desired task, such as chat or question answering. He noted that ReAct prompting, a new prompting method whose name stands for reasoning and acting, was first introduced last year by Princeton University and Google.
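As a rough illustration of such a prompt (an assumed template in the general ReAct style, not the exact prompt used in the demo), the user's question and the available tools might be filled into a format that interleaves reasoning steps, tool calls, and observations:

```python
# Illustrative ReAct-style prompt template. The format and tool names are
# assumptions based on the talk's description, not Intel's actual prompt.

REACT_TEMPLATE = """Answer the question, using the tools below when needed.
Available tools: {tool_descriptions}

Use this format:
Thought: reason about what to do next
Action: tool_name[tool input]
Observation: result returned by the tool
... (Thought/Action/Observation can repeat)
Thought: I now have enough information
Final Answer: the answer to the question

Question: {question}
"""

prompt = REACT_TEMPLATE.format(
    tool_descriptions="rag_search, gmail_send, wikipedia, bing_search",
    question="How many teams are taking part in this year's Champions League?",
)
```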

He said this approach lets the LLM go beyond simple text generation: it enables the LLM to use tools and take actions to better handle the user's input, combining tools such as RAG, Gmail, Wikipedia, and Bing search, some of which access private data on the device while others access the internet. First, the user's query is inserted into the ReAct template and injected into the Phi-3 agent, which decides whether a tool is needed to answer it. If so, the tool is called, its output is appended to the prompt dialog and passed back to the agent, and the agent can then decide whether to use another tool; this loop repeats. Only when the agent determines it has enough information to answer the user's query does it generate an answer.
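A minimal Python sketch of that reason-act-observe loop might look like the following. The tool functions, the llm() stand-in, and the prompt format are illustrative assumptions, not Intel's implementation; in the demo, the model call would be the OpenVINO-optimized Phi-3.

```python
# Minimal sketch of a ReAct-style agent loop: the model either requests a tool
# or produces a final answer; tool outputs are fed back as observations.

import re

def rag_search(query: str) -> str:
    """Hypothetical tool: retrieve passages from a local vector index."""
    return "36 teams are competing in this year's Champions League."

def gmail_send(body: str) -> str:
    """Hypothetical tool: send the given text via Gmail."""
    return "email sent"

TOOLS = {"rag_search": rag_search, "gmail_send": gmail_send}

# Scripted stand-in for the Phi-3 call so the sketch runs end to end.
SCRIPTED_REPLIES = iter([
    "Thought: I should search local documents.\nAction: rag_search[Champions League teams]",
    "Thought: I now have enough information.\nFinal Answer: 36 teams.",
])

def llm(prompt: str) -> str:
    return next(SCRIPTED_REPLIES)

def react_agent(user_query: str, max_steps: int = 5) -> str:
    prompt = f"Question: {user_query}\n"
    for _ in range(max_steps):
        # The agent reasons about the query and either picks a tool or answers.
        step = llm(prompt)
        prompt += step + "\n"
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if match is None:
            # No tool requested: the agent has enough information to answer.
            return step
        tool_name, tool_input = match.groups()
        # Call the tool and feed its output back into the prompt as an observation.
        observation = TOOLS[tool_name](tool_input)
        prompt += f"Observation: {observation}\n"
    return "Step limit reached without a final answer."

print(react_agent("How many teams are taking part in this year's Champions League?"))
```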

In the demonstration, Boudoukh asked how many teams are taking part in this year's Champions League. The agent reasoned that RAG was needed to answer the question, so it searched through 160 BBC sports news articles. He then asked the agent to send the answer via Gmail, and the agent called another tool, Gmail, to do so.

Next, Boudoukh walked through how the Phi-3 agent executes RAG. He said RAG lets the LLM access external knowledge by injecting retrieved information. First, the user indexes hundreds or even thousands of files on the device; they are embedded and saved to a vector database (Vector DB). Then, when the user submits a query, relevant information is retrieved from the database, a new unified prompt is built from the user's query and the retrieved information, and that prompt is fed to the LLM to generate an answer.
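A minimal sketch of that flow, using a small embedding model and an in-memory index as stand-ins for the components used in the demo (the model name is an assumption, and the toy documents simply echo the article's Champions League example), might look like this:

```python
# Minimal RAG sketch: embed local documents, retrieve the passages closest to
# the query, and assemble a single prompt for the LLM.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# 1. Indexing: embed the local documents once and keep the vectors.
documents = [
    "36 teams take part in this year's revamped Champions League.",
    "The previous Champions League format featured 32 teams in eight groups.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # 2. Retrieval: rank documents by cosine similarity to the query.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(query: str) -> str:
    # 3. Prompt assembly: combine the retrieved passages with the user query.
    context = "\n".join(retrieve(query))
    return (
        "Use the context to answer the question.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

print(build_prompt("How many teams are taking part in this year's Champions League?"))
```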

He said RAG has several advantages. First, it extends the LLM's knowledge without retraining the model. Second, it uses data very efficiently, since only the retrieved information needs to be provided rather than the entire document. It also reduces the model's hallucinations and improves reliability, because the model references relevant data when producing its answers.

In a follow-up demonstration, Boudoukh bypassed the agent and asked directly how many teams are taking part in this year's Champions League. Without RAG, the model generated an incorrect answer of 32 teams, when in fact 36 teams are competing. He then asked the same question with RAG and received the correct answer.

Boudoukh said this shows developers how to split work in the software stack across the NPU, CPU, and integrated GPU. For example, the speech recognition model Whisper runs on the NPU, Phi-3 inference runs on the integrated GPU, and the database search runs on the CPU.
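In OpenVINO terms, that split might look roughly like the sketch below; the model file names are placeholders, and the device assignments simply mirror the mapping described above rather than the demo's actual code.

```python
# Sketch of assigning workloads to the Core Ultra's engines with OpenVINO.

import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']

# Speech recognition (e.g. a Whisper encoder exported to OpenVINO IR) on the NPU.
whisper_model = core.compile_model(core.read_model("whisper_encoder.xml"), "NPU")

# LLM inference (e.g. Phi-3 exported to OpenVINO IR) on the integrated GPU.
phi3_model = core.compile_model(core.read_model("phi3.xml"), "GPU")

# Vector-database search stays as ordinary CPU code (e.g. the NumPy retrieval
# in the earlier RAG sketch); no accelerator offload is needed for it.
```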

Finally, Boudoukh demonstrated the LLaVA Phi-3 multimodal model. He explained that the model is trained on vision and language, so it can handle multimodal tasks involving text and images. He fed an image to the model and asked it to describe the scene; the model provided a detailed understanding of the scene and even suggested the spot would be a good place to relax and fish.

He also walked through one of the core parts of the model code, the LLM inference portion. He said that running Phi-3 and LLM inference on an Intel Core Ultra processor is easy: define the model name, define the quantization configuration, load the model, load the tokenizer, provide a few examples, tokenize the input, and then generate results. The demonstration used the AI PC-optimized version of OpenVINO.
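A minimal sketch of those steps, using the optimum-intel integration of OpenVINO (the model ID and quantization settings here are assumptions, not necessarily what the demo used), might look like this:

```python
# Sketch of the steps listed above: model name, quantization config, model,
# tokenizer, tokenized input, generation.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"        # assumed model name
quant_config = OVWeightQuantizationConfig(bits=4)    # assumed 4-bit weight quantization

# Export the model to OpenVINO IR with weight quantization and load it.
model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=quant_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the input and generate a response.
inputs = tokenizer(
    "How many teams are taking part in this year's Champions League?",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```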

Tangri said this was a wonderful demonstration of the AI PC running an LLM. He described four pillars for AI in the real world: efficiency, security, the ability to work in concert with the network, and developer readiness. Having the first three without developer readiness means no one can innovate on the platform.

He said that high efficiency means extending device battery life, not just chasing impressive-looking teraflops. "Ultimately, what we are really pursuing is the customer experience and user experience, which involves integrating natural language interfaces with graphical user interfaces. So in the end, what we are after is the experience, not false performance metrics."

Tangri said Intel has spent the past few years working with Microsoft to establish standards such as the Open Neural Network Exchange (ONNX). On developer readiness, he said Intel now has cutting-edge research demonstrations that run entirely on the PC. "So we are truly meeting developers' needs and lowering the barrier to innovating on our platform; none of this requires going online or to the cloud, it can all be done on your PC."