Apple’s ReALM Outperforms GPT-4 in Understanding Screen Context


Apple has published yet another research paper. Their model, ReALM, can understand on-screen context and perform tasks based on that information. Even with fewer parameters than GPT-4, ReALM’s largest models already outperform it on personal-assistant tasks.

Personal assistants seem to be another lucrative direction for AI companies. Apple has been working on AI behind closed doors, releasing a one-off research paper every now and then to narrate its discoveries and their implications. On the proprietary side, Apple is likely working on much more and only teasing us with the most remarkable breakthroughs, all of which will presumably tie into a final AI update. Now they have a new research paper, “ReALM: Reference Resolution As Language Modeling,” in which they describe how this new system can resolve references of various types by converting reference resolution into a language modeling problem.

In understanding on-screen context, ReALM was significantly superior to GPT-3.5 and even GPT-4. It’s worth noting that GPT models are built for answering queries, not for understanding on-screen context; they can do it, but it’s not their specialization.

The idea behind Apple’s ReALM is to offer the next step in the evolution of a “true hands-free experience in voice assistants,” one that can read on-screen context and act on it. If this is bundled as a core capability in Siri, a task could be browsing a law firm’s webpage and asking Siri to call the free consultation number. Siri would understand the context from the webpage, find the free-consultation section, and place the call for you.
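To make the “resolution as language modeling” idea concrete, here is a minimal Python sketch of how on-screen entities might be serialized into a textual prompt so a language model can pick the one a user’s request refers to. The function name, entity format, and prompt wording are all illustrative assumptions, not Apple’s actual implementation.

```python
# Hypothetical sketch: frame reference resolution as a text-to-text task.
# On-screen entities are serialized into an indexed, tagged list, and the
# model is asked to answer with the index of the referenced entity.
# The prompt format below is an assumption for illustration only.

def build_resolution_prompt(entities, query):
    """Serialize on-screen entities and a user query into one LM prompt.

    entities: list of (entity_type, text) tuples extracted from the screen.
    query:    the user's spoken or typed request.
    """
    lines = ["Screen entities:"]
    for i, (etype, text) in enumerate(entities, start=1):
        # Each entity gets an index the model can answer with.
        lines.append(f"[{i}] ({etype}) {text}")
    lines.append(f"User query: {query}")
    lines.append("Which entity does the query refer to? Answer with its index.")
    return "\n".join(lines)

# Example mirroring the scenario above: a webpage with a consultation
# heading and a phone number (both values are made up).
prompt = build_resolution_prompt(
    [("heading", "Free Consultation"), ("phone", "1-800-555-0199")],
    "call the free consultation number",
)
print(prompt)
```

The prompt text would then be sent to the language model, whose single-index answer maps back to a concrete on-screen element the assistant can act on, such as placing the call.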

This, alongside Apple’s other ongoing research, will presumably surface in a major update arriving with the next version of iOS. Hopefully, the next generation of their hardware, especially the iPhones, will also have better native AI capabilities.

By Abhimanyu

Unwrapping the fast-evolving AI popular culture.