Tech giant Apple has finally hinted at some of its artificial intelligence plans in a research paper published last week. Apple discussed its large language model (LLM), Reference Resolution As Language Modeling (ReALM), and how it can “substantially outperform” OpenAI’s GPT-4.
Although Large Language Models (LLMs) boast significant capabilities across a range of tasks, their application in reference resolution, especially concerning non-conversational entities, remains relatively untapped. Unlike humans, who grasp such nuances effortlessly, AI chatbots often struggle to discern the context of non-conversational entities and, consequently, to interpret them correctly.
“This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality,” Apple said in the paper.
According to Apple, the ability of a user to refer to something on a screen with “that,” “it,” “them,” “they,” or any such word, and have a chatbot understand the reference, would be crucial to creating a hands-free screen experience.
The Cupertino-based company said it has seen large improvements over an existing system with similar functionality across different types of references, with its smallest model obtaining absolute gains of over 5% for on-screen references.
“We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it,” it added.
What is ReALM trying to understand?
Apple has delineated three different types of entities it is trying to comprehend (a rough sketch of how they might be encoded as text follows the list). They are:
On-Screen Entities: These are entities currently displayed on the user’s screen.
Conversational Entities: These are entities relevant to the conversation, often coming from a previous turn. For instance, when the user says “call mom”, the contact for mom would be the relevant entity in question.
Background Entities: These are relevant entities that come from background processes and are not necessarily a direct part of what the user sees on their screen or of their interaction with the virtual agent; for example, a song that starts playing in the background, which ReALM can take note of and resolve.
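The paper’s core idea is to turn reference resolution into a text problem: candidate entities from these three categories are serialized into the prompt so the model can pick which one a word like “it” or “that” refers to. The sketch below is a minimal illustration of that framing under stated assumptions, not Apple’s implementation; the entity fields, prompt wording, and build_prompt helper are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical sketch of framing reference resolution as a language-modeling
# task, in the spirit of the ReALM paper. The entity categories mirror the
# three types described above; the prompt format and helper names are
# assumptions for illustration, not Apple's actual code.

class EntityType(Enum):
    ON_SCREEN = "on-screen"            # currently displayed on the user's screen
    CONVERSATIONAL = "conversational"  # mentioned in an earlier turn ("call mom")
    BACKGROUND = "background"          # e.g. a song playing in the background

@dataclass
class Entity:
    index: int
    entity_type: EntityType
    description: str  # textual rendering of the entity (name, label, etc.)

def build_prompt(utterance: str, entities: list[Entity]) -> str:
    """Serialize candidate entities into text so an LLM can resolve
    which one the user's utterance ("it", "that", "them") refers to."""
    lines = ["Candidate entities:"]
    for e in entities:
        lines.append(f"{e.index}. [{e.entity_type.value}] {e.description}")
    lines.append(f'User utterance: "{utterance}"')
    lines.append("Answer with the index of the entity the user is referring to.")
    return "\n".join(lines)

if __name__ == "__main__":
    candidates = [
        Entity(1, EntityType.ON_SCREEN, "Phone number 555-0123 shown on the current page"),
        Entity(2, EntityType.CONVERSATIONAL, "Contact 'Mom' from the previous turn"),
        Entity(3, EntityType.BACKGROUND, "Song playing in the background"),
    ]
    # The resulting text prompt would be passed to an LLM, which is expected
    # to output "1" (the on-screen phone number).
    print(build_prompt("call that number", candidates))
```

Once every candidate is rendered as plain text in this way, even entities that live on the screen rather than in the conversation can be handled by an ordinary text-only language model, which is the conversion the paper describes.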
Researchers at Apple noted that, since GPT-3.5 only accepts text, their input to it consisted of the prompt alone.
In the case of GPT-4, however, which can also contextualize images, the researchers provided the system with a screenshot for the task of on-screen reference resolution, which they found improved its performance substantially.
“Note that our ChatGPT prompt and prompt+image formulation are, to the best of our knowledge, in and of themselves novel. While we believe it might be possible to further improve results, for example, by sampling semantically similar utterances up until we hit the prompt length, this more complex approach deserves further, dedicated exploration, and we leave this to future work,” the paper further noted.
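To make the prompt+image comparison concrete, here is a hedged sketch of how a screenshot plus a textual prompt could be sent to a vision-capable GPT-4 model via the OpenAI Python SDK. The model name, prompt wording, and file path are placeholders; this does not reproduce the paper’s actual formulation.

```python
import base64
from openai import OpenAI

# Hypothetical prompt+image query for on-screen reference resolution,
# assuming the OpenAI Python SDK and a vision-capable model. The model
# name, prompt text, and screenshot path are placeholders, not the
# formulation used in the paper.

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def resolve_on_screen_reference(utterance: str, screenshot_path: str) -> str:
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for a vision-capable GPT-4 variant
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f'Given this screenshot, which on-screen element does the user mean by: "{utterance}"?'},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    )
    return response.choices[0].message.content

# Example call (the path and utterance are illustrative):
# print(resolve_on_screen_reference("call that number", "screenshot.png"))
```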
This essentially means that ReALM is better than GPT-4 on certain benchmarks it was specifically designed for, and generalizing that it is better than GPT-4 overall would be premature.
Apple has yet to formally announce its AI plans, but it has scheduled its annual WWDC conference for June 10, where an unveiling is expected.