Language research project ceases as generative AI has ‘polluted the data’

GLOBAL – An open-source project looking up the frequency of words online to analyse language popularity will no longer be updated, as its creator says generative artificial intelligence has ‘polluted’ the data.

The wordfreq Python library offers access to estimates of the frequency with which a word is used, in over 40 languages, based on data sources including social media, Wikipedia, news, books and web text.

According to documentation on the project's GitHub page, written by creator Robyn Speer, wordfreq will no longer be updated, but its latest version will remain accessible.

Generative AI has “polluted the data”, Speer wrote in the documentation, saying: “I don't think anyone has reliable information about post-2021 language usage by humans.

“The open Web (via OSCAR) was one of wordfreq’s data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.”

While there was previously “spam” in the wordfreq data sources, it was manageable and often identifiable, Speer noted. The arrival of large language models, however, has inserted generated text into the data. Speer wrote: “Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.”

Speer cited the example of the word ‘delve’, which is overused by ChatGPT, according to analysis by Philip Shapira.
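The skew Speer describes can be illustrated with a toy calculation. The sketch below is not wordfreq's code and uses no real data: the sample sentences, the `relative_frequencies` helper and the tokenising rule are all hypothetical, chosen only to show how adding generated text that favours one word raises that word's measured share.

```python
from collections import Counter
import re


def relative_frequencies(texts):
    """Return each word's share of all word tokens across the given texts."""
    counts = Counter()
    for text in texts:
        # Simplistic tokeniser (an assumption): lowercase runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical stand-ins for human-written web text.
human_texts = [
    "we looked into the survey results last week",
    "the team will dig into the numbers tomorrow",
    "researchers rarely delve into raw logs by hand",
]

# Hypothetical stand-ins for machine-generated filler that overuses one word.
generated_texts = [
    "let us delve into the rich tapestry of insights",
    "in this article we delve into key trends",
    "delve deeper as we delve into the data",
]

before = relative_frequencies(human_texts)
after = relative_frequencies(human_texts + generated_texts)

print(f"'delve' share before: {before['delve']:.3f}")
print(f"'delve' share after:  {after['delve']:.3f}")
```

In this made-up sample the share of ‘delve’ rises once the generated sentences are mixed in, even though no human wrote them. A real corpus is far larger and the effect on any one word far subtler, which is part of why such text is hard to filter out.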

Changes in the availability of online information from public sources, specifically Twitter (now X) and Reddit, also mean that data that was previously free is now “expensive” for open source purposes, which Speer cited as another reason for sunsetting the project.

The story was first reported by 404 Media.

Journalists, NGOs, and researchers use raw data from Twitter’s API to gather insights for research. The company ended free access to its API in February 2023, a few months after Elon Musk took over the business, prompting the Coalition for Independent Technology Research to warn that halting free access would disrupt research projects. Reddit also began restricting free access to its API in April 2023. 

Research Live is published by MRS.
