2025

‘Shadow Library’ Apparently Pirated My Book!

I was sitting in a Dallas doctor’s office the other day, reading The Atlantic on my phone since I forgot to bring a book. This was about a week before that venerable magazine — founded in 1857 — made headlines when its editor-in-chief was accidentally put on Signal, an encrypted messaging app, with key members of the administration’s team planning an attack on Houthi rebels in Yemen, including the vice president, CIA director, director of national intelligence and secretary of defense. Elect a clown. Expect a circus.

The Atlantic is now owned by Laurene Powell Jobs, widow of Apple co-founder Steve Jobs and a noted philanthropist. I have subscribed to it for decades and happily continue to do so. We need strong, independent journalism more than ever.

The article that got my attention was titled The Unbelievable Scale of AI’s Pirated-Books Problem, by Alex Reisner. He is a computer programmer who has written extensively about generative artificial intelligence, made famous by ChatGPT, and now being used in search engines, customer service and elsewhere. It has rapidly become ubiquitous. For example, if you use Google to look up something, the answer is generated by that company’s version of AI.

I am OK with that. I often use ChatGPT in research, carefully checking the citations and links it provides. I don’t use it to write these pieces. That would be cheating.

What I, and many others, take issue with is how these large language models are being trained. Reisner’s article points out that court documents he obtained indicate that Meta, the company that owns Facebook, pirated millions of books and research papers to train its flagship AI model, Llama 3. It did so by using LibGen, a notorious buccaneer of copyrighted materials. Founded in Russia, it is known as a “shadow library.” As one of several lawsuits against LibGen states:

“Libgen enables users to download, for free, fiction and non-fiction books (among other types of works), including educational textbooks, instead of buying or renting lawful copies or checking them out from a legitimate library. Defendants have absolutely no legal justification for what they do and operate in complete and knowing defiance of the rule of law.”

Rather than paying authors to use their work, LibGen steals it. Meta reportedly decided to use LibGen to train Llama 3. Now it is being sued by several authors, including Sarah Silverman and Junot Diaz, for copyright infringement. OpenAI, which owns ChatGPT, is also being sued and accused of copyright infringement by The Authors Guild, The New York Times, and others. (After the Atlantic article was published, OpenAI said the models powering ChatGPT now weren’t developed using the LibGen datasets, and that it hasn’t used them since 2021.) So, while I use ChatGPT and accept that AI is going to increasingly be part of the intellectual fabric of society, these companies are making billions of dollars in profit off other people’s work. The originators of the works should be compensated. I don’t know how that would pan out but am hopeful the courts will side with the creators.

Reisner also provided a search bar in his article so one could search for an author in LibGen, with some caveats. There is no way of knowing what specific content Meta and OpenAI used to train their models. Just because a particular title is in LibGen doesn’t necessarily mean it was used to train one of those AI models.

With that caveat in mind, I typed my name into Reisner’s search bar. A total of 36 results popped up, most having nothing to do with me. But the top two results were a book I wrote that was published 19 years ago by the University of Texas Press: A Hanging in Nacogdoches: Murder, Race, Politics, and Polemics in Texas’s Oldest Town, 1870-1916.

LibGen has stolen my book! Could I be entitled to compensation, as the lawyer commercials say? Doubtful. I don’t even know who to sue. LibGen is pretty shady as to ownership. From what I gather, it constantly changes domains and mirror sites to evade lawsuits from publishers, such as its chief foe Elsevier, a major academic journal and book publisher.

The Hanging book (my shorthand title) is still in print, thanks to print-on-demand technology. It still sells regularly in Nacogdoches, usually to tourists drawn to the title. The Nacogdoches T-Shirt Company (formerly Bosslight) kindly stocks it. It can be found on Amazon and other book sites.

The Hanging books can also be found in a box in my garage. I buy a box a year wholesale from UT Press to keep the Nacogdoches store stocked, plus (very) occasional requests from someone who wants a signed copy. Every October, UT Press deposits a royalty payment in my bank account, usually enough to take my Beautiful Mystery Companion out for Mexican food. The royalty check won’t cover the margaritas, however.

What the heck? I asked ChatGPT if it uses LibGen. Here was its response:

LibGen is an unauthorized repository of copyrighted content, including books and academic papers, often shared without the permission of publishers or authors. OpenAI, the company behind ChatGPT, follows strict legal and ethical guidelines and does not use data from illegal or pirated sources like LibGen in training or in generating responses.

Instead, ChatGPT is trained on a mixture of licensed data, publicly available information, and data created by human trainers. When it comes to academic or scientific topics, it provides summaries or explanations based on general knowledge, not from pirated PDFs or paywalled articles.

OK, then. I guess the courts eventually will have the final say.
