Search Engines, Information Cocoons, and Corpus Database Pollution

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Every June, it releases the metadata it crawled that year. What is interesting is that after post-processing, such as removing NSFW content, the simplified Chinese data came to only about 6 TB compressed (roughly 30 TB decompressed), and the traditional Chinese data to another 6 TB compressed. Measured against the Chinese share of the world’s internet users, this volume is remarkably low. The Chinese data peaked in 2019 and has fallen back since. The reason is the 2020 “Qinglang” campaign organized by the Cyberspace Administration of China, which shut down a large number of forums and effectively stopped new ones from being established. Baidu Tieba, for example, a forum that once produced half of the Chinese internet’s content, deleted all posts from before 2015.
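For a sense of how such shares can be measured, here is a minimal sketch (not Common Crawl’s own methodology) that streams one of a crawl’s WET text extracts with the warcio library and counts the records that are predominantly CJK. The filename and the 30% threshold are placeholder assumptions for illustration:

```python
# A minimal sketch: estimate the share of Chinese text in one
# Common Crawl WET (extracted-text) file. The filename below is a
# placeholder; real paths are listed in each crawl's wet.paths.gz.
from warcio.archiveiterator import ArchiveIterator

WET_FILE = "CC-MAIN-example.warc.wet.gz"  # hypothetical local file

def mostly_cjk(text: str, threshold: float = 0.3) -> bool:
    """Crude heuristic: is at least `threshold` of the characters CJK?"""
    if not text:
        return False
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text) >= threshold

total = chinese = 0
with open(WET_FILE, "rb") as stream:  # warcio transparently handles gzip
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # WET text records
            continue
        total += 1
        text = record.content_stream().read().decode("utf-8", "replace")
        if mostly_cjk(text):
            chinese += 1

print(f"{chinese}/{total} records look predominantly Chinese")
```

Real measurements aggregate over every file in a crawl and use proper language identification rather than a character-range heuristic, but the shape of the count is the same.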

Last month, OpenAI released GPT-4o, a decidedly flirty new large language model (LLM) equipped with new and advanced capabilities. However, some Chinese researchers soon noticed that something seemed off about this newest version of the chatbot: the Chinese tokens it uses to parse text were full of spam and porn phrases. It is not rare for a language model to pick up spam when collecting training data, but normally significant effort goes into cleaning the data before it is used. What is even more interesting is that cleaning such data, at least most of it, is not a difficult task: the relevant techniques have been used in spam filtering for more than twenty years. The content of these Chinese tokens suggests that they were polluted by a specific phenomenon: spam websites hijacking unrelated content in Chinese and other languages to promote their messages. OpenAI has not yet responded, but if this was done intentionally for some purpose, it would be extremely terrifying. From a pessimistic perspective, it means that the Chinese corpus has withered, or been polluted, to the point where it is difficult to clean. After all, the possibility that OpenAI made a mistake as elementary as “forgetting to clean the data” is very small.
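The claim about the tokens is easy to check for yourself. Below is a minimal sketch using OpenAI’s open-source tiktoken library: it walks the o200k_base vocabulary used by GPT-4o and prints long tokens that decode to purely Chinese text, which is where the spam phrases were reported. The five-character cutoff is an arbitrary choice for illustration:

```python
# A minimal sketch: scan GPT-4o's o200k_base vocabulary for long
# Chinese tokens, the ones reported to contain spam and porn phrases.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the tokenizer GPT-4o uses

def is_cjk(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"

for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip unused ids and partial UTF-8 byte sequences
    stripped = text.strip()
    # Arbitrary cutoff: tokens of 5+ characters that are entirely CJK.
    if len(stripped) >= 5 and all(is_cjk(ch) for ch in stripped):
        print(token_id, stripped)
```

Reportedly, the same scan over the older cl100k_base vocabulary turns up far fewer such tokens, which is what made the GPT-4o result stand out.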

Website hijacking is not the only thing destroying search results; AI-driven SEO (search engine optimization) techniques are even more destructive. Doubao, an AI chatbot from ByteDance, used SEO to gain higher weight in search engines: it has the AI produce content (conversation data), renders it into static web pages, and lets search engines index those pages to drive traffic back to itself. A “content farm” or “content mill” is a company that employs large numbers of freelance writers, or uses automated tools, to generate huge amounts of textual web content designed specifically to satisfy ranking algorithms and maximize retrieval by search engines, i.e. SEO. In the AI era, the speed of producing low-quality content has increased a thousandfold. I really did not expect ByteDance itself to so blatantly operate a content farm. Worse, this is no longer just an SEO issue but a privacy and security issue. We all know that LLM vendors use our conversation data for training; that is by now a tacit consensus. But it is outrageous that the chat records were actually published and made searchable by search engines.
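To make the mechanism concrete, here is a minimal, hypothetical sketch of how a chat platform turns conversations into search-indexable pages; the names and data shapes are invented and do not reflect Doubao’s actual code. The comment in the page template shows how little it would take to keep such pages out of search engines:

```python
# A hypothetical sketch of the content-farm pipeline described above:
# render chat logs into static HTML pages that search engines can index.
from html import escape
from pathlib import Path

conversations = [  # stand-in for a platform's stored chat logs
    {"id": "abc123", "question": "How do I learn Python?",
     "answer": "Start with the official tutorial..."},
]

PAGE = """<!doctype html>
<html>
<head>
  <title>{title}</title>
  <!-- One line would keep this page out of search indexes,
       so its absence is a choice, not an accident:
       <meta name="robots" content="noindex"> -->
</head>
<body>
  <h1>{title}</h1>
  <p>{answer}</p>
</body>
</html>
"""

out = Path("static_pages")
out.mkdir(exist_ok=True)
for conv in conversations:
    page = PAGE.format(title=escape(conv["question"]),
                       answer=escape(conv["answer"]))
    # Each conversation becomes a crawlable URL like /chat/abc123.html.
    (out / f"{conv['id']}.html").write_text(page, encoding="utf-8")
```

The noindex directive and robots.txt have been honored by major search engines for decades, which is why publishing chat logs in indexable form reads as deliberate SEO rather than an oversight.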

If this phenomenon is allowed to continue, the Chinese corpus will be flooded with low-quality AI-generated data. Training a good Chinese LLM will become even more difficult, and ordinary users will be buried in a dump of information garbage. When a language can no longer generate new information, that is when it dies.

Recently, people have been debating whether ChatGPT will replace search engines. I must say: it is dangerous to let AI chatbots such as ChatGPT replace search engines.

Not only should ChatGPT not replace search engines; neither should the short-video platforms, social media platforms, and shopping platforms you have come to rely on in recent years.

The supply of knowledge must not be monopolized.

Search engines can certainly manipulate the ranking of search results, but that is a comparatively weak intervention, and there are many different search engines in the world. Once we get used to something spoon-feeding us the single answer to every question, we are inviting a cyber Big Brother back in.