AI “gold rush” for chatbot training data could run out of human-written text
Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter: the tens of trillions of words people have written and shared online. A new study released Thursday by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade, sometime between 2026 and 2032.

Comparing it to a “literal gold rush” that depletes finite natural resources, Tamay Besiroglu, an author of the study, said the AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore,” Besiroglu said. “And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”
The amount of text data fed into AI language models has been growing about 2.5 times per year, while computing has grown about 4 times per year, according to the Epoch study. The researchers first made their projections two years ago, shortly before ChatGPT’s debut, in a working paper that forecast a more imminent 2026 cutoff of high-quality text data. Much has changed since then, including new techniques that enabled AI researchers to make better use of the data they already have and sometimes “overtrain” on the same sources multiple times.

From the perspective of AI developers, Epoch’s study says paying millions of humans to generate the text that AI models will need “is unlikely to be an economical way” to drive better technical performance.
If real human-crafted sentences remain a critical AI data source, those who are stewards of the most sought-after troves (websites like Reddit and Wikipedia, as well as news and book publishers) have been forced to think hard about how they’re being used.

“Maybe you don’t lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having natural resource conversations about human-created data. I shouldn’t laugh about it, but I do find it kind of amazing.”

Still, Deckelmann said she hopes there continue to be incentives for people to keep contributing, especially as a flood of cheap and automatically generated “garbage content” starts polluting the internet. AI companies should be “concerned about how human-generated content continues to exist and continues to be accessible,” she said.
As OpenAI begins work on training the next generation of its GPT large language models, CEO Sam Altman told the audience at a United Nations event last month that the company has already experimented with “generating lots of synthetic data” for training. But he also expressed reservations about relying too heavily on synthetic data over other technical methods to improve AI models. “There’d be something very strange if the best way to train a model was to just generate, like, a quadrillion tokens of synthetic data and feed that back in,” Altman said. “I think what you need is high-quality data. There is low-quality synthetic data. There’s low-quality human data.”
Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto who was not involved in the Epoch study, said building more skilled AI systems can also come from training models that are more specialized for specific tasks. But he has concerns about training generative AI systems on the same outputs they’re producing, which leads to the degraded performance known as “model collapse.” Training on AI-generated data is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” Papernot said. Not only that, but his research has also found that it can further encode the mistakes, bias and unfairness already baked into the information ecosystem.
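To make the photocopy analogy concrete, here is a toy, self-contained sketch (an illustration constructed for this piece, not drawn from the Epoch study or Papernot’s research): a trivial “model,” a fitted Gaussian, is repeatedly retrained on samples drawn from the previous generation of itself, and the fitted spread tends to shrink until the distribution collapses toward its mean.

```python
# Toy "model collapse" demo: each generation fits a Gaussian to the previous
# generation's samples, then generates its own training data for the next one.
import random
import statistics

random.seed(42)
samples = [random.gauss(0.0, 1.0) for _ in range(50)]  # generation 0: "human" data

for generation in range(1, 101):
    mu = statistics.fmean(samples)     # "train": estimate the mean
    sigma = statistics.pstdev(samples)  # ...and the spread
    # The next generation trains only on the current model's own output.
    samples = [random.gauss(mu, sigma) for _ in range(50)]
    if generation % 25 == 0:
        print(f"generation {generation:3d}: fitted sigma = {sigma:.3f}")
# The estimated sigma tends to drift downward across generations: each "copy
# of a copy" loses a little information, and the distribution narrows.
```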
The scramble for conversational training data is already familiar to chatbot builders. One of the most compelling use cases for AI at the moment is developing chatbots and conversational agents, and while the AI part of the equation works reasonably well, getting the training data organized to build and train accurate chatbots has emerged as the bottleneck for wider adoption.
Recent advances in natural language processing (NLP) and transfer learning have helped to lower the technical bar to building chatbots and conversational agents. Instead of creating a whole NLP system from scratch, users can borrow a pre-trained deep learning model and customize just a few layers, as in the sketch below.
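Here is a minimal sketch of that pattern, assuming the Hugging Face transformers library and PyTorch are installed; the model name, label count, and intent example are illustrative assumptions, not details from the article.

```python
# Transfer-learning sketch: reuse a pre-trained encoder, train only a small
# classification head on top. Model name and labels are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # hypothetical choice of pre-trained model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Freeze the borrowed encoder so only the new head ("just a few layers") trains.
for param in model.distilbert.parameters():
    param.requires_grad = False

# One toy training step on a made-up intent-classification example.
inputs = tokenizer("Where is my order?", return_tensors="pt")
labels = torch.tensor([1])  # hypothetical "order_status" intent id
loss = model(**inputs, labels=labels).loss
loss.backward()  # gradients flow only into the unfrozen classification head
```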
When you combine this democratization of NLP tech with the workplace disruptions of COVID, we have a situation where chatbots appear to have sprung up everywhere almost overnight.
Andrew Hong also saw this sudden surge in chatbot creation and usage while working at a venture capital firm a few years ago. With the chatbot market expanding at a 24% compound annual growth rate, according to one forecast, it’s a potentially lucrative place for a technology investor, and Hong wanted to be in on it.

But while a good chatbot seems to work effortlessly, there’s a lot of work going on behind the scenes to get there. Most chatbots today require countless hours of training to deliver the right responses to queries; on average, training a chatbot takes about 12 to 18 months of collecting just the right data. The problem, as you may have guessed, is that conversational data is a mess. According to Hong, organizations are devoting extensive data science and data engineering resources to prepare large amounts of raw chat transcripts and other conversational data so it can be used to train chatbots and agents.
For starters, raw text files that serve as the training data must be cleansed, prepped, and labeled. Sentences must be strung together, and the questions and answers in a conversation grouped into pairs. As part of this process, the data is typically extracted from a data lake and loaded into a repository where it can be queried and analyzed, such as a relational database.
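A minimal sketch of those steps, with a made-up in-memory transcript and SQLite standing in for the relational repository; the transcript, helper names, and schema are illustrative assumptions, not Dashbot’s actual pipeline.

```python
# Data-prep sketch: cleanse raw transcript turns, pair each user question with
# the agent answer that follows it, and load the pairs into a relational DB.
import re
import sqlite3

# A made-up raw transcript; a real pipeline would pull this from a data lake.
raw_transcript = [
    ("user",  "  Where is my ORDER??  "),
    ("agent", "Your order shipped on Monday."),
    ("user",  "how do i reset my password"),
    ("agent", "Use the 'Forgot password' link on the login page."),
]

def cleanse(text: str) -> str:
    """Trim whitespace, collapse runs of spaces and repeated punctuation."""
    text = re.sub(r"\s+", " ", text.strip())
    return re.sub(r"([?!.])\1+", r"\1", text)  # "??" -> "?"

# Group the conversation into (question, answer) training pairs.
pairs = [
    (cleanse(q_text), cleanse(a_text))
    for (q_role, q_text), (a_role, a_text) in zip(raw_transcript, raw_transcript[1:])
    if q_role == "user" and a_role == "agent"
]

# Load into a queryable repository (SQLite standing in for the relational DB).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qa_pairs (question TEXT, answer TEXT)")
conn.executemany("INSERT INTO qa_pairs VALUES (?, ?)", pairs)
for row in conn.execute("SELECT * FROM qa_pairs"):
    print(row)
```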
That’s what drove the folks at Dashbot to develop a data platform specifically for chatbot creation and optimization.