Editor’s notes
1. Small correction. On Tuesday, we wrote that Prasanna was the founder of Ripple in our podcast copy. Our ardent readers were quick to point out the error. Prasanna was formerly at Rippling. Some users were not able to use the code to access 0xppl. The code is now fixed and can be accessed here.
2. Today’s piece is special in that we look at how the broad arc of the web is evolving through the lens of crypto-natives. Shlok has gone above and beyond with context and insights in this piece.
While not a sponsored deep-dive, we received an incredible amount of time and insights from a few people that made it possible. I want to express our gratitude to Anatoly from No Limit Holdings, JJ from Anagram, and members of the Masa and Wynd team.
If you are a late-stage protocol building cool things that should be written about, reach out to Sid - our head of ventures.
3. We have laid down a thesis for an emergent sector. We are keen on working with and investing in early-stage founders who are building cool things within it. Use the form below to get in touch.
Or join our community of 6000 members on Telegram to discuss the piece at length. The article may break in your e-mail client. Use the view in browser button in the top right corner to read it directly on our website.
Okay, back to the story.
Joel
15 million images.
22,000 categories.
That was the size of ImageNet, the dataset Fei-Fei Li, then an assistant professor at Princeton University, wanted to create. She hoped that doing so would help spur advancements in the dormant field of computer vision. It was an audacious undertaking. 22,000 categories were at least two orders of magnitude more than any previous image dataset created.
Her peers, who believed that the answer to building better artificial intelligence systems lay in algorithmic innovation, questioned her wisdom. “The more I discussed the idea for ImageNet with my colleagues, the lonelier I felt.”
Despite the scepticism, Fei-Fei, along with her small team, which included PhD candidate Jia Deng and a few undergrad students paid $10 an hour, began labelling images sourced from search engines. Progress was slow and painful. Jia estimated that, at their pace, it would take them 18 years to finish ImageNet—time no one had. That is when a Master’s student introduced Fei-Fei to Amazon Mechanical Turk, a marketplace that crowdsourced contributors from around the world to complete “human intelligence tasks.” Fei-Fei realised immediately that this was exactly what they needed.
In 2009, three years after Fei-Fei started the most important project of her life, with the help of a distributed, global workforce, ImageNet was finally ready. In the shared mission of advancing computer vision, she had done her part.
Now, it was up to researchers to develop algorithms that would leverage this massive data set to help computers see more like humans. Yet, for the first two years, that didn’t happen. The algorithms barely outperformed the pre-ImageNet status quo.
Fei-Fei began to wonder if her colleagues had been right all along about ImageNet being a futile effort.
Then, in August 2012, as Fei-Fei was giving up hopes of her project spurring the changes she envisioned, a frantic Jia called to tell her about AlexNet. This new algorithm, trained on ImageNet, was outperforming every computer vision algorithm in history. Created by a trio of researchers from the University of Toronto, AlexNet used a nearly discarded AI architecture called “neural networks” and performed beyond Fei-Fei’s wildest expectations.
At that moment, she knew her efforts had borne fruit. “History had just been made, and only a handful of people in the world knew it.”1
ImageNet, combined with AlexNet, was historic for several reasons.
First, the reintroduction of neural networks, long thought to be a dead-end technique, became the de-facto architecture behind the algorithms that fueled more than a decade of exponential growth in AI development.
Second, the trio of researchers from Toronto (one of whom was Ilya Sutskever, a name you might have heard of) were among the first to use graphical processing units (GPUs) to train AI models. This, too, is now industry-standard.
Third, the AI industry had finally woken up to a realisation Fei-Fei Li first had many years earlier: the crucial ingredient for advanced artificial intelligence is large amounts of data.
We’ve all read and heard adages like “data is the new oil” and “garbage in, garbage out” a million times. We’d get sick of them if they weren’t fundamental truths about our world. Over the years, artificial intelligence, in the shadows, has become an increasingly bigger part of our lives—influencing everything from the Tweets we read and movies we watch to the prices we pay and credit we’re deemed worthy of. All of this is driven by data collected through meticulously tracking every move we make in the digital world.
But over the last two years, since a relatively unknown startup called OpenAI released a chatbot application called ChatGPT, the prominence of AI has come out of the shadows and into the open. We are on the cusp of machine intelligence permeating every aspect of our lives. And as the race for who gets to control this intelligence heats up, so does the demand for the data that drives it.
That is what this piece is about. We discuss the scale and urgency of data needed by AI companies and the problems they face in procuring it. We explore how this insatiable demand threatens everything we love about the internet and the billions who contribute to it. Finally, we cover some emerging startups that are using crypto to come up with solutions to some of these problems and concerns.
Quick note before we dive in: this article is written from the perspective of training large language models (LLMs), and not all AI systems. Thus, I often use “AI” and “LLMs” interchangeably. While this usage is not technically accurate, the same concepts and problems that apply to LLMs, especially when it comes to data, also apply to other forms of AI models.
Show Me The Data
The training of large language models is bounded by three primary resources: compute, energy, and data. Corporations, governments, and start-ups are simultaneously competing for these same resources with large pools of capital backing them. Among the three, the scramble for compute is the most documented, thanks in part to NVIDIA’s meteoric stock price rise.
Training LLMs require large clusters of specialised graphic processing units (GPUs), particularly NVIDIA’s A100, H100, and upcoming B100 models. These are not computers you can purchase off-the-shelf from Amazon or your local computer store. Instead, they cost tens of thousands of dollars. NVIDIA decides how to allocate this supply across its AI lab, startup, data center, and hyperscaler customers.
In the 18 months following ChatGPT’s launch, GPU demand far exceeded supply, with wait times as high as 11 months. However, supply-demand dynamics are normalising as the dust settles on the initial frenzy. Startups shutting down, improvements in training algorithms and model architectures, the emergence of specialised chips from other companies2, and NVIDIA ramping up production are all contributing to increased GPU availability and falling prices.
Second, energy. Running GPUs in data centres requires vast amounts of energy. By some estimates, data centres will consume 4.5% of global energy by 2030. As this surging demand stresses existing power grids, tech companies are exploring alternative energy solutions. Amazon recently purchased a data centre campus powered by a nuclear power plant for $650 million. Microsoft has hired a head of nuclear technologies. OpenAI’s Sam Altman has backed energy startups like Helion, Exowatt, and Oklo.
From the perspective of training an AI model - energy and compute are mere commodities. Access to a B100 over a H100, or nuclear power over traditional sources might make the training process cheaper, faster, or more efficient—but it won’t impact the model’s quality. In other words, in the race to create the most intelligent and human-like AI models, energy and compute are bare essentials, not the differentiating factors.
The critical resource is data.
James Betker is a research engineer at OpenAI. He has, in his own words, trained more generative models “than anyone really has any right to train.” In a blog post, he noted that “trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point.” This means that what differentiates one AI model from another is the dataset. Nothing else.
When we refer to a model as “ChatGPT,” “Claude,” “Mistral,” or “Lambda,” we’re not talking about the architecture, GPUs used, or energy consumed, but the dataset it was trained on.
If data is food for AI training, then the models are what they eat.
How much data does it take to train a state-of-the-art generative model?
The answer: a lot.
GPT-4, still considered the best large language model more than a year after its release, was trained on an estimated 12 trillion tokens (or ~9 trillion words). This data comes from scraping the publicly available internet, including Wikipedia, Reddit, Common Crawl (a free, open repository of web crawl data), over a million hours of transcribed YouTube data, and code platforms like GitHub and Stack Overflow.
If you think that's a lot of data, hold up. There is a concept in generative AI called the "Chinchilla Scaling Laws," which states that for a given computational budget, it is more efficient to train smaller models on larger datasets than larger models on smaller datasets. If we extrapolate the computational resources AI companies are estimated to allocate for training the next generation of AI models (such as GPT-5 and Llama-4) - we find that these models are expected to require five to six times more computational power, utilising up to 100 trillion tokens for their training.
With most of the public internet already scraped, indexed, and used for training existing models, where does the additional data come from? This has become a frontier research problem for AI companies. There are two ways to fix this. One is that you decide to come up with synthetic data, which is data generated directly by LLMs instead of humans. However, the usefulness of such data in making the models more intelligent is still untested.
An alternative is to simply seek high-quality data instead of creating it synthetically. However, obtaining additional data is challenging, especially as AI companies are confronting problems that threaten not only the training of future models but also the validity of existing ones.
The first data problem involves legal issues. Although AI companies claim to have trained models on “publicly available data,” much of it is copyrighted. For example, the Common Crawl dataset contains millions of articles from publications like the New York Times and The Associated Press and other copyrighted material such as published books and song lyrics.
Some publications and creators are taking legal action against AI companies, alleging copyright and intellectual property infringements. The Times sued OpenAI and Microsoft for the “unlawful copying and use of The Times’ uniquely valuable works.” A group of programmers collectively filed a class action lawsuit challenging the legality of using open source code to train GitHub copilot, a popular AI programming assistant.
Comedian Sarah Silverman and author Paul Tremblay have also sued AI companies for using their work without permission.
Others have embraced the changing times by partnering with AI companies. The Associated Press, Financial Times, and Axel Springer have all signed content licensing deals with OpenAI. Apple is exploring similar arrangements with news organisations like Condé Nast and NBC. Google agreed to pay Reddit $60 million a year for access to their API for training models, while Stack Overflow struck a similar deal with OpenAI. Meta allegedly considered outright buying publishing house Simon & Schuster.
These arrangements coincide with the second problem AI companies are facing: the closing off of the open web.
Internet forums and social media websites have recognised the value AI companies generate by training their models on data from their platforms. Before striking a deal with Google (and potentially other AI companies in the future), Reddit started charging for its previously free APIs, killing off its popular third party clients. Similarly, Twitter has both limited access to and increased prices for their APIs, with Elon Musk using Twitter data to train models for his own AI company, xAI.
Even smaller publications, fan fiction forums, and the other niche corners of the internet that produced content for everyone to freely consume while monetising through ads (if at all) are now closing off. The internet was envisioned as this magical cyberspace where every individual could find a tribe that shared their unique interests and quirks. That magic seems to be slowly dissipating.
This combination of lawsuit threats, the increasing trend of multi-million dollar content deals, and the closing off of the open web has two implications.
First, the data wars are highly skewed in favour of the tech giants. Startups and smaller companies can neither access previously available APIs nor afford the cash required to purchase usage rights without taking legal risks. This has obvious centralising effects where the rich—who can buy the best data, and by extension, create the best models—get richer.
Second, the business model of user-generated content platforms becomes increasingly lopsided against users. Platforms like Reddit and Stack Overflow rely on the contributions of millions of unpaid human creators and moderators. Yet, when these platforms enter multi-million dollar deals with AI companies, they neither compensate nor seek permission from their users, without whom there would be no data to sell.
Both Reddit and Stack Overflow have experienced prominent user strikes in protest of these decisions. The Federal Trade Commission (FTC), for their part, has opened an inquiry into Reddit’s sale, licensing, and sharing of user posts with outside organisations to train AI models.
The questions these problems raise are relevant for training the next generation of AI models and the future of content on the web. As things stand, that future looks unpromising. Can crypto solutions level the playing field for smaller companies and internet users, addressing some of these issues?
The Pipeline
Training AI models and creating useful applications are complex, expensive endeavours that require months of planning, resource allocation, and execution. These processes consist of multiple phases, each serving a different purpose and having different data needs.
Let's break down these phases to understand how crypto can fit into the larger AI puzzle.
Pre-training
Pre-training, the first and most resource-intensive step in the LLM training process, forms the model's foundation. In this step, the AI model is trained on a vast amount of unlabeled text to capture general knowledge and language usage information about the world. When we say GPT-4 was trained on 12 trillion tokens, this refers to the data used for pre-training.
We need a high-level overview of how LLMs work to understand why pre-training is the foundation for LLMs. Note that this is a simplified overview. You can find more thorough explanations in this excellent article by Jon Stokes, this delightful video by Andrej Karpathy, or an even deeper breakdown in this brilliant book by Stephen Wolfram.
LLMs use a statistical technique called next-token prediction. In simple terms, given a series of tokens (i.e., words), the model tries to predict the next most likely token. This process repeats to form complete responses. Thus, you can think of a large language model as a “completion machine.”
Let’s understand this with an example.
When I ask ChatGPT a question like “What direction does the sun rise from?”, it starts by first predicting the word “the'', followed by each subsequent word in the phrase “sun rises from the East.” But where do these predictions come from? How does ChatGPT determine that after “the sun rises from,” it should follow with “the East” rather than “the West,” “the North,” or “Amsterdam”? In other words, how does it know that “the East” is more statistically probable than other options?3
The answer lies in learning statistical patterns from massive quantities of high-quality training data. If you consider all the text on the internet, what is more likely to appear—“the sun rises in the East” or “the sun rises in the West”? The latter may be found in specific contexts, like literary metaphors (“That is as absurd as believing that the sun rises in the West”) or discussions about other planets (like Venus, where the sun does indeed rise in the West). But, by and large, the former is much more common.
By repeatedly predicting the next word, the LLM develops a general worldview (what we call common sense) and an understanding of the rules and patterns of language. Another way to think of an LLM is as a compressed version of the internet. This also helps understand why the data needs to be both in large quantities (more patterns to pick) and of high quality (increased accuracy of pattern learning).
But as discussed earlier, AI companies are running out of data to train larger models. The rate of training data requirement growth is much faster than the rate of new data generation on the open internet. With impending lawsuits and the closing off of major forums, AI companies face a serious problem.
This problem is exacerbated for smaller companies, which cannot afford to enter multi-million dollar deals with proprietary data providers like Reddit.
This brings us to Grass, a decentralised residential proxy provider that aims to solve some of these data problems. They call themselves the “data layer of AI.” Let’s first understand what a residential proxy provider does.
The internet is the best source for training data, and scraping the internet is the preferred method for companies to gain access to this data. In practice, scraping software is hosted in data centres for scale, ease, and efficiency. But companies with valuable data don't want their data to be used to train AI models (not unless they’re being paid, anyway). To implement these restrictions, they often block the IP addresses of known data centres, preventing mass scraping.
That is where a residential proxy provider comes in. Websites block IP addresses only for known data centres and not for regular internet users like you and me, making our internet connections, or residential internet connections, valuable. Residential proxy providers aggregate millions of such connections to scrape websites for AI companies at scale.
However, centralised residential proxy providers operate covertly. They are often not explicit about their intentions. Users may not be willing to part with their bandwidth without being compensated if they know a product is using it. Even worse, they may ask to be compensated for the bandwidth a product uses, which, in turn, reduces the profit they make.
To protect their bottom line, residential proxy providers piggyback their bandwidth-consuming code on free applications with wide distribution, such as mobile utility applications (think calculators and voice recorders), VPN providers, or even consumer TV screensavers. Users who believe they are getting access to a free product are often unaware that a third-party residential provider is consuming their bandwidth (these details are often buried in the terms of service, which few people read).
Eventually, some of this data makes its way to AI companies, who use it to train models and create value for themselves.4
Andrej Radonjic, when running his own residential proxy provider, realised the unethical nature of these practices and their unfairness to users. He looked at how crypto was evolving and identified a way to create a more equitable solution. This is how Grass was founded in late 2022. A few weeks later, ChatGPT released, changing the world and putting Grass in the right place at the right time.
Unlike the sneaky tactics employed by other residential proxy providers, Grass makes the usage of bandwidth to train AI models explicit to its users. In return, they are directly compensated with incentives. This model flips the way residential proxy providers operate on its head. By willingly giving access to bandwidth and becoming part owners of the network, users transform from being unsuspecting passive participants to active evangelists, increasing the network’s reliability and benefiting from the value generated by AI.
Grass’s growth has been remarkable. Since launching in June 2023, they have amassed over 2 million active users running nodes (by installing either a browser extension or mobile application) and contributing bandwidth to the network. This growth has occurred with zero external marketing costs, driven by a highly successful referral program.
Using Grass’s services allows companies of all sizes, from big AI labs to open-source startups, to access scraped training data without having to pay millions of dollars. At the same time, every day users get compensated for sharing access to their internet connections, becoming a part of the growing AI economy.
Beyond just raw scraped data, Grass also provides a few additional services to its customers.
First, they are converting unstructured web pages into structured data that can more easily be processed by AI models. This step, known as data cleaning, is a resource intensive task usually undertaken by AI labs. By providing structured, clean data sets, Grass enhances its value to customers. Additionally, Grass is training an open-source LLM to automate the process of scrapping, preparing, and tagging data.
Second, Grass is bundling data sets with irrefutable proofs of their origin. Given the importance of high-quality data for AI models, assurances that bad actors - both websites and residential proxy providers - have not tampered with a data set are crucial for AI companies.
The seriousness of this problem is reflected in the forming of bodies like the Data & Trust Alliance, a non-profit group of more than twenty companies, including Meta, IBM, and Walmart, working together to create the provenance standards that help organisations determine if a body of data is suitable and trusted for use.
Grass is undertaking similar measures. Every time a Grass node scrapes a webpage, it also records metadata that verifies the webpage it was scraped from. These proofs of provenance are stored on the blockchain and shared with customers (who can further share them with their users).
Even though Grass is building on Solana, one of the highest throughput blockchains, storing the provenance of every scraping job on an L1 is infeasible. Thus, Grass is building a rollup (one of the first ones on Solana) that uses ZK processors to batch proofs of provenance before posting them on Solana. This rollup, what Grass calls the “data layer of AI,” becomes a data ledger for all their scraped data.
Grass’s Web 3-first approach gives it a couple of advantages over centralised residential proxy providers. First, by making use of incentives to get users to directly share bandwidth, they are distributing the value generated by AI more equitably (while also saving on the costs of paying app developers to bundle their code). Second, they can charge a premium for providing customers with “legitimate traffic,” which is highly valued in the industry.
Another protocol building on the “legitimate traffic” angle is Masa. The network allows users to pass on their logins for platforms like Reddit, Twitter, or TikTok. Nodes on the network then scrape through for highly contextual, updated data. The advantage in such a model is that the data collected is what a normal user on Twitter would see in their feed. You can in real time, have rich data sets that explain sentiment or content that is just about to go viral.
What are their data-sets used for? As it stands, there are two primary use-cases for such contextual data.
Financial - If you have mechanisms to see what tens of thousands of people are seeing on their feeds, you could develop trading strategies off of them. Autonomous agents that feed off sentimental data can be trained on Masa’s data-sets
Social - The emergence of AI-based companions (or tools like Replika) would mean we need data-sets that mimic human conversations. These conversations also need to be updated with the latest information. Masa’s data-streams can be used to train agents that can meaningfully talk about the latest trends on Twitter.
Masa’s approach takes information from walled gardens (like Twitter) with user consent, and makes them available for developers to build applications on. Such a social-first approach to collecting data also allows for building datasets around regional languages.
For instance, a bot that speaks in Hindi could use data that is fed from social networks that are operated in Hindi. The kind of applications these networks open up are yet to be explored.
Model Alignment
A pre-trained LLM is not nearly ready for production use. Think about it. All the model knows so far is how to predict the next word in a sequence, and nothing else. If you give a pre-trained model some text like “Who is Satoshi”, any of these would be a valid response:
Completing the question: Nakamoto?
Turning the phrase into a sentence: is a question that has perplexed Bitcoin believers for years.
Actually answering the question: Satoshi Nakamoto is the pseudonymous person or group of people who created Bitcoin, the first decentralised cryptocurrency, and its underlying technology, blockchain.
An LLM designed to provide useful answers would provide the third response. Yet, pre-trained models do not respond as coherently or correctly. In fact, they often spout random text that would make no sense to an end user. Worst case, the model confidentially responds with factually incorrect, toxic, or harmful information. When this happens, the model is said to be “hallucinating.”
The goal of model alignment is to make a pre-trained model useful to an end user. In other words, to convert it from a mere statistical text completion tool to a chatbot that understands and aligns with user needs and holds coherent, useful conversations.
Conversational Finetuning
The first step of this process is conversational finetuning. Finetuning is the process of taking a pre-trained machine learning model and further training it on a smaller, targeted dataset, helping it adapt to a specific task or use case. For training an LLM, this specific use case is engaging in human-like conversations. Naturally, the dataset for such finetuning is a collection of human-generated prompt-response pairs that demonstrate to the model how to behave.
These datasets span different types of conversations (question-answer, summarization, translation, code generation) and are typically designed by highly educated humans (sometimes called AI tutors) who possess excellent language skills and subject-matter expertise.
State of the art models like GPT-4 are estimated to be trained on ~100,000 of these prompt-response pairs.
Reinforcement learning from human feedback (RLHF)
Think of this stage as similar to how a human trains a pet puppy: rewarding good and reprimanding bad behaviour. A model is given a prompt, and its response is shared with a human labeller who rates it on a numerical scale (e.g., 1-5) based on the accuracy and quality of the output. Another version of RLHF is getting a prompt to produce multiple responses that a human labeller then ranks from best to worst.
RLHF serves to nudge the model towards human preferences and desired behaviour. In fact, if you’re a user of ChatGPT, OpenAI also uses you as an RLHF data labeller! This happens when the model sometimes produces two responses and you’re asked to choose the better one.
Even the simple thumbs up or thumbs down icons that prompt you to rate the usefulness of an answer are a form of RLHF training for the model.
When using AI models, we rarely consider the millions of hours of human labour that go into building them. This isn’t unique to LLMs. Historically, even traditional machine learning use cases like content moderation, self driving, and tumour detection have required significant human involvement for data labelling. (This excellent story from 2019 by the New York Times shows what happens behind the scenes at the Indian offices of iAgent, a company that specialises in human labelling).
Mechanical Turk, the service used by Fei-Fei Lee to create the ImageNet database, has been called “Artificial Artificial Intelligence” by Jeff Bezos for the role its workers play behind the scenes in AI training.
In a bizarre story from earlier this year, it was revealed that Amazon’s Just Walk Out stores, where customers could simply pick items from shelves and walk out (being charged automatically later), were not powered by some advanced AI. Instead 1,000 Indian contractors were manually sifting through store footage.
The point is, every large-scale AI system relies on humans to some degree, and LLMs have only increased the demand for these services. Companies like Scale AI, whose customers include OpenAI, have reached 11-figure valuations on the back of this demand. Even Uber is repurposing some of its workers in India to label AI outputs when they’re not driving their vehicles.
In their quest to become a full stack AI data solution, Grass is entering this market as well. They will soon release an AI labelling solution (as an extension to their primary product) where users on their platform will be able to earn incentives for completing RLHF tasks.
The question is: what advantages does Grass gain by making this process decentralised against the hundreds of centralised companies in the same domain?
Grass can bootstrap a network of workers with token incentives. Just as they reward users for sharing their internet bandwidth with tokens, they can also reward humans for labelling AI training data. In the Web2 world, payments to gig economy workers, especially for globally distributed jobs, are an inferior user experience compared to the instant liquidity provided on a fast blockchain like Solana.
The crypto community in general, and Grass’s existing community in particular, already have a high concentration of educated, internet-native, and tech-savvy users. This reduces the resources Grass needs to spend on recruiting and training workers.
You might be wondering whether the task of labelling AI model responses in exchange for incentives might attract the attention of farmers and bots. I did as well. Fortunately, extensive research has been conducted into using consensus-based techniques to identify high-quality labellers and weed out bots.
Note that Grass is, at least for now, only entering the RLHF market, and not helping companies with conversational finetuning, which requires a highly specialised workforce and logistics that are much harder to automate.
Specialised Finetuning
Once the pre-training and alignment steps are completed, we have what is called a foundation model. A foundation model has a general understanding of how the world operates and can hold fluent, human-like conversations on a wide range of topics. It also has a solid grasp over language and can help users write emails, stories, poems, essays and songs with ease.
When you use ChatGPT, you’re interacting with the foundation model GPT-4.
Foundation models are general models. While they know sufficiently enough about millions of topics, they don’t specialise in any of them. When asked to help understand the tokenomics of Bitcoin, the response will be useful and mostly accurate. However, when you ask it to lay down the security edge case risks of a restaking protocol like EigenLayer, you shouldn’t trust it too much.
Recall that finetuning is the process of taking a pre-trained machine learning model and further training it on a smaller, targeted dataset, helping it adapt to a specific task or use case. Earlier, we discussed finetuning in the context of turning a raw text completion tool into a conversational model. Similarly, we can also finetune the resulting foundation model to specialise in a particular domain or a specific task.
Med-PaLM2, a finetuned version of Google’s foundation model PaLM-2, is trained to provide high quality answers to medical questions. MetaMath is finetuned on Mistral-7B to perform better at mathematical reasoning. Some finetuned models specialise in broad categories like storytelling, text summarization, and customer service, while others specialise in niches like Portuguese poetry, Hinglish translation, and Sri Lankan law.
Finetuning a model for a specific use case requires high quality data sets relevant to that use case. These datasets can be sourced from domain-specific websites (like this newsletter for crypto data), proprietary data sets (a hospital might transcribe thousands of patient-doctor interactions), or experiences of an expert (which would require thorough interviews to capture).
As we enter a world with millions of AI models, these niche long-tailed datasets are becoming increasingly valuable. The owners of such datasets, from big accounting firms like EY to freelance photographers in Gaza, are being courted for what is quickly becoming the hottest commodity in the AI arms race. Services like Gulp Data have emerged to help businesses fairly assess the value of their data.
OpenAI even has an open request for data partnerships with entities possessing “large-scale datasets that reflect human society and that are not already easily accessible online to the public today.”
We know of at least one great way to match buyers looking for sellers of niche products: internet marketplaces! Ebay created one for collectibles, Upwork for human labour, and so did countless platforms for countless other categories. Unsurprisingly, we’re seeing the emergence of marketplaces, some decentralised, for niche data sets as well.
Bagel is building the “artificial general infrastructure,” a set of tools that enables holders of “high-quality, diverse data” to share their data with AI companies in a trustless, privacy preserving way. They use technologies like zero knowledge (ZK) and fully homomorphic encryption (FHE) to achieve this.
Companies often sit on highly valuable data that they cannot monetise due to privacy or competitive concerns. For example, a research lab may have troves of genomic data that they can’t share to protect patient privacy, or a consumer goods manufacturer may have supply chain waste reduction data that it can’t reveal without also revealing competitive secrets. Bagel uses advancements in cryptography to make these data sets useful while allaying ancillary concerns.
Grass’s residential proxy service can also help create specialised datasets. For example, if you want to fine-tune a model to provide expert culinary advice, you could ask Grass to scrape data from Reddit subreddits like r/Cooking and r/AskCulinary. Similarly, the creators of a travel-oriented model could ask Grass to scrape data from TripAdvisor forums.
While these are not exactly proprietary data sources, they can still be valuable complements to other datasets. Grass also plans to use its network to create archived datasets that can be reused by any customer.
Context Level Data
Try asking your preferred LLM “what is your training cut off date?” and you’ll get an answer like November 2023. This means that foundational models only provide information available before that date. When you consider how computationally expensive and time consuming it is to train these models (or finetune them, even), this makes sense.
To keep them updated in real time, you’d have to train and deploy a new model every day, which is simply not feasible (at least so far).
Yet, an AI that doesn’t have the latest information about the world is pretty useless for many use cases. For example, if I’m using a personal digital assistant that relies on LLMs for responses, they would be handicapped when asked to summarise unread emails or provide the goal scorers from the last Liverpool game.
To circumvent these limitations and provide users with responses based on real-time information, app developers can query and insert information into what is called a foundation model’s “context window.” The context window is the input text an LLM can process for response generation. It is measured in tokens and represents the text an LLM can “see” at any given moment.
So, when I ask my digital assistant to summarise my unread emails, the application first queries my email provider for the contents of all unread emails, inserts the response into the prompt sent to the LLM, and appends the prompt with something like: “I’ve provided you with the list of unread emails from Shlok’s inbox. Please summarise them.” The LLM, with this new context, can then complete its task and provide a response. Think of this process as you copy-pasting an email into ChatGPT and asking it to generate a response, but happening in the backend.
To create applications with the latest responses, developers need access to real-time data. Grass nodes, which can scrape any website in real-time, can provide developers with this data. For example, an LLM-based news application can ask Grass to scrape all trending articles on Google News every five minutes. When a user queries “What was the magnitude of the earthquake that just hit New York City?” the news app retrieves the relevant article, adds it to the LLM’s context window, and shares the response with the user.
This is also where Masa fits in today. As it stands, Alphabet, Meta, and X are the only large platforms with constantly updating user data as they have the user base. Masa levels the playing ground for smaller startups.
The technical term for this process is retrieval-augmented generation (RAG). RAG workflows are central to all modern LLM-based applications. This process involves vectorising text, or converting text into arrays of numbers, which can then be easily interpreted, manipulated, stored, and searched by computers.
Grass aims to release physical hardware nodes in the future, providing customers with vectorised, low-latency real-time data to simplify their RAG workflows.
Most builders in the industry predict context-level querying (also called inference) to utilise the bulk of resources (energy, compute, data) in the future. This makes sense. The training of a model will always be a time-bound process which consumes a set allocation of resources. Application-level usage, on the other hand, can have a theoretically infinite demand.
Grass is already seeing this playing out with a bulk of their text data requests coming for customers looking for real-time data.
The context windows for LLMs are expanding over time. When OpenAI first released ChatGPT, it had a context window of 32,000 tokens. Less than two years later, Google’s Gemini models have context windows of more than a million tokens. A million tokens is equivalent to over eleven 300-page books—a lot of text.
These developments make the implications of what can be built with context windows much bigger than just accessing real time information. Someone can, for example, dump the lyrics of all Taylor Swift songs, or the entire archive of this newsletter, into the context window and ask the LLM to generate a new piece of content in a similar style.
Unless explicitly programmed not to, the model will produce a more than decent output.
If you can sense where this discussion is heading, hold up for what’s coming next. So far, we’ve mainly discussed text models, but generative models are becoming extremely proficient at other modalities like sounds, image, and video generation. I recently came across this very cool illustration of London by Orkhan Isayen on Twitter.
Midjourney, the popular (and extremely good) text-to-image tool has a feature called Style Tuner that can generate new images in the style of existing ones (this feature also relies on RAG-like workflows, but not exactly). I uploaded Orkhan’s human-made illustration and used Style Tuner to prompt Midjourney to change the city to New York. This is what I got:
Four images that, if you browse the artist’s illustrations, can easily be mistaken for their work. These were generated by an AI in 30 seconds based on just a single input image. I asked for ‘New York’ but the subject could have been anything, really. Similar kinds of replications are also possible in other modalities, like music.
Recall from our previous discussion that some of the various entities that are suing AI companies included creators and you can see why they may have a point.
The internet was a boon for creators, a way for them to share their stories, art, music, and other forms of creative expression with the whole world; a way for them to find their 1,000 true fans. Now, the same global platform is becoming the biggest threat to their livelihood.
Why pay someone like Orkhan a $500 commission when you can get a good-enough copy of a piece in their style for a $30/month Midjourney subscription?
Sounds dystopian?
The wonderful thing about technology is that it almost always comes up with a way to solve the very problems it creates. If you flip what seems like a dire situation for creators on its head, what you get is an opportunity for them to monetise their talents at an unprecedented scale.
Before AI, the amount of artwork Orkhan could create was bounded by the hours they had in a day. With AI, they can now theoretically serve an infinite clientele.
To understand what I mean, let’s look at elf.tech, an AI music platform by musician Grimes. Elf Tech allows you to upload a recording of a song, which it then turns into Grimes’ voice and style. Any royalties earned from the song are split 50-50 between Grimes and the creator. This means that as a fan of Grimes, her voice, her music, or her distribution, you can simply come up with an idea for a song, which the platform, using AI, turns into Grimes’ voice.
If the song goes viral, both you and Grimes benefit. This also enables Grimes to scale her talent and leverage her distribution passively.
TRINITI, the technology that powers elf.tech, is a tool created by the company CreateSafe. Their litepaper sheds light on what we foresee as one of the most interesting intersections of blockchain and generative AI technology.
Expanding the definition of digital content through creator-controlled smart contracts and reimagining distribution with blockchain-based, peer-to-peer, pay-for-access micro-transactions allows any streaming platform to instantly authenticate and access digital content. The generative AI then executes an instant micropayment based on creator-specified terms and streams the experience to consumers.
Balaji puts it more simply.
As new mediums emerge, we scramble to figure out how humanity would interact with them. When clubbed with networks, they become powerful engines for change. Books fuelled the Protestant Revolution. The radio and TV were a key part of the Cold Wars. Media is often a dual-edged sword. It can be used for good. And it can be used for bad.
What we have today are centralised firms that own the bulk of user data. It is almost as though we trust our corporations to do the right thing for creativity, our mental wellbeing, and the development of a better society. That is too much power to be handed off to a handful of firms, many of whose inner operations we barely understand.
We are very early in the LLM revolution. Much like Ethereum in 2016, we barely know the kind of applications that may be built using them. An LLM that can talk to my grandmother in Hindi? An agent that scours through feeds and surfaces only high quality data? A mechanism for independent contributors to share culture-specific nuances (like slangs) in languages? We don’t quite know what is possible yet.
What is evident, however, is that building these applications will be constrained by one key ingredient: data.
Protocols like Grass, Masa, and Bagel are infrastructure that powers its sourcing in an equitable way. Human imagination is the limit when you consider what can be built atop it. To me, that seems exciting.
Lost in 19th-century Central Asia,
Shlok Khemani
Fei-Fei Lee shares the story behind ImageNet (and much more) in her excellent memoir The Worlds I See
Google, Amazon, Meta, and Microsoft are all working on creating specialised chips to reduce their reliance on NVIDIA. AMD, NVIDIA’s biggest competitor is also upping its game. Startups like Grok, which is making chips specialised for inference, are also attracting increased funding.
Another way to understand this is by comparing the number of Wikipedia pages containing these phrases. 'The sun rises in the East' yields 55 pages, whereas 'the sun rises in the West' returns 27 pages. 'The sun rises in Amsterdam' shows no results! These are the patterns ChatGPT picks up.
COOL!