Why AI language models choke on too much text

Large language models represent text using tokens, each of which is typically a few characters long. Short words like "the" or "it" are represented by a single token, whereas longer words may be split into several tokens (GPT-4o represents "indivisible" with "ind," "iv," and "isible").
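As a concrete illustration, here is a minimal sketch that counts tokens, assuming OpenAI's open-source tiktoken library and its o200k_base encoding (the one associated with GPT-4o). The library choice is an assumption, and the exact splits it prints may differ slightly from the example above.

```python
# Minimal sketch: count how many tokens a few words occupy.
# Assumes the open-source `tiktoken` library (pip install tiktoken);
# "o200k_base" is the encoding associated with GPT-4o.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for word in ["the", "it", "indivisible"]:
    token_ids = enc.encode(word)            # list of integer token IDs
    pieces = [enc.decode([t]) for t in token_ids]  # the text each token covers
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")
```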

When OpenAI released ChatGPT two years ago, it had a memory -- known as a context window -- of just 8,192 tokens. That works out to roughly 6,000 words of text. This meant that if you fed it more than about 15 pages of text, it would "forget" information from the beginning of its context. This limited the size and complexity of tasks ChatGPT could handle.

Today's LLMs are far more capable:

- OpenAI's GPT-4o can handle up to 128,000 tokens.
- Anthropic's Claude 3.5 Sonnet can accept up to 200,000 tokens.
- Google's Gemini 1.5 Pro allows up to 2 million tokens.

Still, it's going to take a lot more progress if we want AI systems with human-level cognitive abilities.

Many people envision a future where AI systems are able to do many -- perhaps most -- of the jobs performed by humans. Yet human workers read and hear hundreds of millions of words over the course of their careers -- and they absorb even more information from the sights, sounds, and smells of the world around them. To achieve human-level intelligence, AI systems will need the capacity to absorb similar quantities of information.

Right now, the most popular way to build an LLM-based system that can handle large amounts of information is called retrieval-augmented generation (RAG). These systems try to find documents relevant to a user's query and then insert the most relevant ones into an LLM's context window.
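Here is a minimal, self-contained sketch of that pipeline. The embedding function is a toy hashed bag-of-words stand-in for a real embedding model, and `build_prompt` is a hypothetical helper -- both are illustrative assumptions, not part of any particular RAG framework.

```python
# Sketch of a RAG pipeline: embed documents, retrieve the closest matches
# to a query, and insert them into the prompt sent to an LLM.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash each word into a fixed-size, unit-length vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "The 2024 budget increased research funding by 12 percent.",
    "Employees may carry over up to five unused vacation days.",
    "The data center migration is scheduled for the third quarter.",
]
doc_vectors = np.stack([embed(d) for d in documents])  # the "vector database"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are most similar to the query."""
    scores = doc_vectors @ embed(query)  # cosine similarity (vectors are unit-length)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(query: str) -> str:
    """Insert the retrieved documents into the LLM's context ahead of the question."""
    context = "\n".join(retrieve(query))
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many vacation days can I carry over?"))
```

A production system would swap the toy embedding for a learned embedding model and a proper vector database, but the shape of the pipeline -- embed the documents, search for the nearest matches, and stuff the winners into the prompt -- is the same.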

This sometimes works better than a conventional search engine, but today's RAG systems leave a lot to be desired. They only produce good results if the system puts the most relevant documents into the LLM's context. But the mechanism used to find those documents -- often, searching in a vector database -- is not very sophisticated. If the user asks a complicated or confusing question, there's a good chance the RAG system will retrieve the wrong documents and the chatbot will return the wrong answer.
