Are you curious about how AI systems efficiently gather mountains of information while minimizing operational costs? Let’s explore some practical strategies involving web content.
Table of Contents
- Teaching GPT on The Go: In-Context Learning
- Problem with ICL: Too Many Input Tokens
- ICL on a Diet: Web to Information Compression
1. Teaching GPT on The Go: In-Context Learning
GPT is great for chatting and understanding a wide range of topics, almost like talking to a human. But when it comes to giving out information, it’s not perfect. It only knows what it was fed during training, and that knowledge stops at its training cutoff. That’s where in-context learning (ICL) steps in. ICL lets GPT pick up new information while it’s being used, without changing how the model fundamentally works. In simple terms, you can give GPT new information in the prompt and then ask it a question based on that information. This means GPT doesn’t just rely on what it was trained on; it can use fresh info to answer questions. That makes its answers more accurate and trustworthy, and it helps the model avoid making up things that aren’t true.
This is why numerous prompting techniques, including web search and appending web content to the prompt (as in Retrieval-Augmented Generation, or RAG), have become increasingly important. These techniques are designed to expand GPT’s knowledge base beyond its initial training data. By integrating web search, GPT can access and utilize real-time, updated information from the internet. When a query is made, the model retrieves relevant information from the web, which is then appended to the prompt. This process significantly enhances the model’s ability to generate responses that are not only relevant but also current.
For instance, in RAG, the AI model first retrieves a set of documents from a large corpus (like the web) that are relevant to the input query. It then uses this additional context to generate a more informed and accurate response. This method bridges the gap between GPT’s pre-trained knowledge and the constantly evolving information available online. It’s particularly useful for questions where current or specialized knowledge is key. By leveraging these techniques, GPT becomes more versatile and adaptive, capable of providing responses that are aligned with the latest developments and data.
GPT-4 without In-Context Learning
import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

const prompt = `What is Twitter's new name?`;
const chatCompletion = await openai.chat.completions.create({
  model: "gpt-4-1106-preview",
  messages: [{ role: "user", content: prompt }],
});
console.log(chatCompletion.choices[0].message.content);
// As of my last update in April 2023, Twitter has not changed its name. It remains "Twitter."
// If there have been any recent developments or rebranding efforts that have resulted in a
// name change for Twitter, I would not have information on that. For the most current
// information, please check the latest news from Twitter or visit their official channels.
GPT-4 with In-Context Learning
// Fetch DuckDuckGo's lightweight results page and append it to the prompt.
const response = await fetch('https://lite.duckduckgo.com/lite/?q=twitter');
const webContent = await response.text();
const prompt = `Based on this web search about Twitter: ${webContent}\nWhat is Twitter's new name?`;
const chatCompletion = await openai.chat.completions.create({
  model: "gpt-4-1106-preview",
  messages: [{ role: "user", content: prompt }],
});
console.log(chatCompletion.choices[0].message.content);
// According to the search results, Twitter's new name is "X." The snippet from the CBS News
// link indicates, "Twitter is now X. Here's what that means." Additionally, there are
// references throughout other snippets that mention "X," which suggest a name change or
// rebranding of Twitter. However, for the latest and most accurate information, it would be
// best to verify from direct sources like Twitter's official announcements or reputable
// news outlets.
As these examples show, GPT alone is not very useful for tasks like retrieving real-time information unless the prompt is supplemented with accurate data. If we ask “What is Twitter’s new name?” without providing any additional context, GPT simply does not know the answer. However, when we include DuckDuckGo search results for “Twitter” and then pose the same question, GPT is able to find the answer: “X”.
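The example above appends raw search results directly to the prompt. The full RAG flow described earlier adds a retrieval step that ranks candidate documents first and passes along only the best matches. Below is a minimal sketch of that step, assuming a tiny in-memory corpus and OpenAI’s text-embedding-ada-002 endpoint; the corpus contents and the cosineSimilarity helper are illustrative, not part of the original examples.

import OpenAI from "openai";

const openai = new OpenAI();

// Toy in-memory "corpus" standing in for a real document store.
const corpus = [
  "Twitter rebranded to X in July 2023.",
  "The Eiffel Tower is located in Paris, France.",
];

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

async function embed(text) {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: text,
  });
  return res.data[0].embedding;
}

// 1. Retrieve: rank the corpus by similarity to the query.
const query = "What is Twitter's new name?";
const queryVec = await embed(query);
const scored = [];
for (const doc of corpus) {
  scored.push({ doc, score: cosineSimilarity(queryVec, await embed(doc)) });
}
scored.sort((a, b) => b.score - a.score);

// 2. Generate: append only the best-matching document as context.
const prompt = `Context: ${scored[0].doc}\nQuestion: ${query}`;
const chatCompletion = await openai.chat.completions.create({
  model: "gpt-4-1106-preview",
  messages: [{ role: "user", content: prompt }],
});
console.log(chatCompletion.choices[0].message.content);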
2. Problem with ICL: Too Many Input Tokens
In-context learning (ICL) is a powerful feature for AI models like GPT-4, but it has a downside when it comes to input tokens, processing time, and costs. Let’s break it down. ICL works by adding extra information or context to the prompt you give the AI. For instance, if you’re asking about something new or recent, you need to include some background details so the AI can understand and answer properly. The catch is that AI models like GPT-4 have a limit on how many tokens (words or pieces of words) they can handle at once, and on how quickly they can process them. So when you add more context carelessly, you make things worse.
Now, this becomes a problem for cost. Many AI services charge based on how many tokens you use, so more tokens for context means you’re spending more every time you ask a question. This is especially tricky for tasks where you need to update the AI with new info often, or cover lots of different topics that each need their own background info. The more complex the topic, the more background you need, and the more tokens you use.
Furthermore, web pages carry a lot of extra stuff like HTML markup, menus, ads, and more. When you use web content to give context, all these irrelevant parts also eat up your token capacity, which means higher costs and slower processing. It’s like paying to carry a lot of extra baggage when you only need the essentials. So if you’re pulling information from a web page to use as context, it’s important to be deliberate and include only the parts that are really needed. This way you save on tokens and, in turn, on costs, making your use of the AI model more efficient and cost-effective for whatever project you’re working on. A quick way to see the problem coming is to count tokens before sending the prompt, as sketched below.
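Here is a minimal sketch of such a pre-flight token and cost estimate using the js-tiktoken package; the $0.01-per-1K-token input price mirrors the figures quoted in the examples that follow and will differ between models.

// Pre-flight check: estimate token count (and rough cost) before calling the API.
// Assumes the js-tiktoken npm package; the pricing constant is illustrative.
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4");

function estimatePromptCost(prompt) {
  const tokenCount = enc.encode(prompt).length;
  const estimatedCost = (tokenCount / 1000) * 0.01; // USD, input tokens only
  return { tokenCount, estimatedCost };
}

console.log(estimatePromptCost("Where is Sam?"));
// e.g. { tokenCount: 4, estimatedCost: 0.00004 }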
In-Context Learning without HTML Page Content
const start = performance.now();
const loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sam is in Montego Bay Jamaica. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.";
const prompt = `Context: ${loremIpsum}\nQuestion: Where is Sam?`;
const chatCompletion = await openai.chat.completions.create({
  messages: [{ role: "user", content: prompt }],
  model: "gpt-4-1106-preview",
});
console.log(chatCompletion.choices[0].message.content);
// Sam is in Montego Bay, Jamaica.
console.log(chatCompletion.usage.prompt_tokens);
// 45 tokens ($0.00045)
console.log((performance.now() - start) / 1000);
// 1.037341266989708 seconds
In-Context Learning with HTML Page Content
const start = performance.now();
const loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sam is in Montego Bay Jamaica. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.";
const htmlContent = `
<!DOCTYPE html>
<html>
<head>
  <title>Extended Sample Page</title>
  <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/pace-js@1.0.2/pace-theme-minimal.css">
  <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/pace-js@1.0.2/pace.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0/dist/js/bootstrap.bundle.min.js"></script>
  <style>
    body { background-color: #f8f9fa; }
    header { color: white; background-color: #007bff; }
    .navbar { margin-bottom: 15px; }
    main { padding: 15px; }
    footer { font-size: 0.8em; text-align: center; background-color: #e9ecef; padding: 10px; }
    .custom-style { color: #555; border: 1px solid #ddd; }
  </style>
</head>
<body>
  <header>
    <nav class="navbar navbar-expand-lg navbar-dark bg-primary">
      <a class="navbar-brand" href="#">Extended Navbar</a>
      <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarNavAltMarkup" aria-controls="navbarNavAltMarkup" aria-expanded="false" aria-label="Toggle navigation">
        <span class="navbar-toggler-icon"></span>
      </button>
      <div class="collapse navbar-collapse" id="navbarNavAltMarkup">
        <div class="navbar-nav">
          <a class="nav-item nav-link active" href="#">Home</a>
          <a class="nav-item nav-link" href="#">Features</a>
          <a class="nav-item nav-link" href="#">Pricing</a>
          <a class="nav-item nav-link disabled" href="#" tabindex="-1" aria-disabled="true">Disabled</a>
        </div>
      </div>
    </nav>
  </header>
  <main class="container custom-style">
    <p>${loremIpsum}</p>
  </main>
  <footer>
    <a href="https://example.com">Visit our extended website</a>
  </footer>
</body>
</html>`;
const prompt = `Context: ${htmlContent}\nQuestion: Where is Sam?`;
const chatCompletion = await openai.chat.completions.create({
  messages: [{ role: "user", content: prompt }],
  model: "gpt-4-1106-preview",
});
console.log(chatCompletion.choices[0].message.content);
// Based on the content provided in the HTML paragraph element within the main tag, Sam is in Montego Bay, Jamaica.
console.log(chatCompletion.usage.prompt_tokens);
// 581 tokens ($0.00581)
console.log((performance.now() - start) / 1000);
// 3.171168835043907 seconds
In this experiment, we compared how GPT-4 responds to two versions of the same prompt: simple text versus a complex HTML page. Although GPT answers correctly in both cases, there is a huge difference in cost and processing time. The simple text, just a paragraph with the key information about Sam, resulted in GPT using fewer tokens and responding faster. When we wrapped the same information in a lengthy HTML page with additional elements like styles and scripts, the token usage increased significantly and GPT took longer to respond.
3. ICL on a Diet: Web to Information Compression
In the world of AI models like GPT-4, handling web content smartly is key, especially since these models have a limit on how much data they can process at a time. Plus, more data means higher costs. Let’s look at three straightforward strategies to reduce the amount of data, or tokens, when we work with web content.
1. Stripping HTML Tags: The Basic Cleanup
A simple and effective first step is to strip HTML tags from the web content. HTML tags are the parts of the code that format web pages: styles, layout, images, and so on. GPT doesn’t need these to understand the text. By removing them, we focus on the main text, which cuts down the data GPT has to handle. Think of it like removing unnecessary packaging to get to the product. It makes the process quicker and less costly.
In-Context Learning with Stripped HTML Page Content
// Continuation of the HTML example; assumes `import striptags from "striptags";`.
const prompt = `Context: ${striptags(htmlContent)}\nQuestion: Where is Sam?`;
const chatCompletion = await openai.chat.completions.create({
  messages: [{ role: "user", content: prompt }],
  model: "gpt-4-1106-preview",
});
console.log(chatCompletion.choices[0].message.content);
// Sam is in Montego Bay, Jamaica.
console.log(chatCompletion.usage.prompt_tokens);
// 193 tokens ($0.00193)
console.log((performance.now() - start) / 1000);
// 1.414866109299659 seconds
65% lower cost and 2X faster processing. Stripping HTML tags greatly reduced the input token count; however, the prompt still contained various unrelated text from the header, footer, and menu, and even the CSS rules inside the style block survived as plain text.
2. Smarter Article Extraction Algorithms
To get even more efficient, we can use article extraction algorithms. These are tools designed to identify and extract just the main text (like articles or headlines) from a web page. Popular options include ‘unfluff’, ‘article-extractor’, and Mozilla’s ‘Readability’. These tools do more than strip away HTML tags; they pick out the important content and leave out things like ads, menus, and footers. Using them ensures GPT gets just the relevant, useful part of the web page, which reduces the processing load.
In-Context Learning with Extracted HTML Page Content
// Continuation of the HTML example; assumes `import unfluff from "unfluff";`.
const prompt = `Context: ${unfluff(htmlContent).text}\nQuestion: Where is Sam?`;
const chatCompletion = await openai.chat.completions.create({
  messages: [{ role: "user", content: prompt }],
  model: "gpt-4-1106-preview",
});
console.log(chatCompletion.choices[0].message.content);
// Sam is in Montego Bay, Jamaica.
console.log(chatCompletion.usage.prompt_tokens);
// 45 tokens ($0.00045)
console.log((performance.now() - start) / 1000);
// 1.2853350500017406 seconds
90% lower cost and 3X faster processing. Article extraction cut the input tokens down to just the content itself in this example, a big improvement over merely stripping HTML tags. Keep in mind, though, that a real full-size website will rarely compress down this cleanly.
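Since Mozilla’s Readability is mentioned above but not demonstrated, here is a minimal sketch of the same extraction step using it, assuming the @mozilla/readability and jsdom packages; htmlContent is the page from the earlier example.

// Article extraction with Mozilla's Readability (an alternative to unfluff).
// Assumes: npm install @mozilla/readability jsdom
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const dom = new JSDOM(htmlContent, { url: "https://example.com/" });
const article = new Readability(dom.window.document).parse();

// parse() returns null if no readable article is found; otherwise
// textContent holds just the body text, ready to use as prompt context.
console.log(article?.textContent);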
3. Advanced Compression Techniques and APIs
For those looking to push the boundaries, there are advanced text compression and prompt compression techniques. Text compression squeezes the content down without losing its essence. Similarly, prompt compression tailors the input for GPT models, ensuring it’s short but still informative. One simple version of this idea is to let a cheaper model summarize the context before the expensive model sees it, as sketched below. There are also specialized services that offer optimized content compression for GPT. They might cost a bit, but they’re worth it for reducing data size and improving AI performance.
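Here is a minimal sketch of summarization-based prompt compression. The summarizer model (gpt-3.5-turbo) and the 100-word budget are illustrative assumptions, and webContent is the raw page from the earlier search example.

// Sketch: compress the context with a cheap model, then answer with GPT-4.
// The summarizer model and word budget are assumptions, not a fixed recipe.
async function compressContext(context) {
  const summary = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{
      role: "user",
      content: `Summarize the following in at most 100 words, keeping every concrete fact:\n${context}`,
    }],
  });
  return summary.choices[0].message.content;
}

const compressed = await compressContext(webContent);
const prompt = `Context: ${compressed}\nQuestion: What is Twitter's new name?`;
const chatCompletion = await openai.chat.completions.create({
  model: "gpt-4-1106-preview",
  messages: [{ role: "user", content: prompt }],
});
console.log(chatCompletion.choices[0].message.content);

Note the trade-off: the summarization call costs tokens too, so this pays off mainly when the compressed context is reused across many questions or when the downstream model is much more expensive than the summarizer.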
To put it briefly, several strategies let us optimize how we feed web content to GPT models. Starting from basic HTML tag elimination, we progress to article extraction algorithms and sophisticated compression techniques. Implementing these methods allows us to streamline the data and concentrate on the relevant portions, resulting in 60% to 90% lower costs and faster, more precise responses from the GPT model.