Research led by André Duarte, a CMU Portugal Dual Ph.D. student in Language Technologies at Instituto Superior Técnico / INESC-ID and at Carnegie Mellon University, has been highlighted in The Register, a leading British technology news and opinion website.
His research, which uncovers how AI models memorize copyrighted content, is described in the pre-print paper “RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline”, co-authored with his supervisors Arlindo L. Oliveira (Instituto Superior Técnico / INESC-ID) and Lei Li (CMU).
The publication presents new methods for probing large language models (LLMs) and understanding what they retain from training data, specifically targeting copyrighted, proprietary and public-domain content.
Duarte and his team at CMU and INESC-ID argue that current safety alignment measures often act as a barrier to transparency, preventing researchers from identifying when a model has memorized specific data, whether public-domain or copyrighted. Focusing on how models “refuse” direct requests to quote text, Duarte emphasizes that his work aims to understand the scientific mechanics of memorization across all data types, using copyrighted examples primarily to highlight the real-world stakes of model transparency and developer accountability.
To address the limitations of existing probing techniques, Duarte co-developed RECAP, an agentic software tool designed to extract memorized content through an iterative feedback loop. Unlike static prompts, RECAP uses a secondary agent to guide the model toward more complete extractions without providing the target text itself.
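The feedback loop described above can be illustrated with a minimal Python sketch. This is not the RECAP implementation: the function names (`feedback_hint`, `extract_with_feedback`, `make_toy_model`), the similarity measure, the stopping threshold, and the toy "model" are all assumptions made for illustration. It only shows the general idea of a secondary agent that comments on an attempt without quoting the target text, here using a public-domain sentence.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed metric, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()


def feedback_hint(attempt: str, reference: str) -> str:
    """Hypothetical feedback agent: describes the gap without quoting the reference."""
    missing = len(reference.split()) - len(attempt.split())
    if missing > 0:
        return f"The passage is incomplete; about {missing} more words are expected."
    if missing < 0:
        return "The passage runs too long; stop earlier."
    return "The length is right, but some wording differs."


def extract_with_feedback(
    model: Callable[[str], str],
    request: str,
    reference: str,
    max_rounds: int = 5,
    threshold: float = 0.95,
) -> Tuple[str, float]:
    """Query the model repeatedly, feeding back hints, and keep the best attempt."""
    prompt = request
    best, best_score = "", 0.0
    for _ in range(max_rounds):
        attempt = model(prompt)
        score = similarity(attempt, reference)
        if score > best_score:
            best, best_score = attempt, score
        if best_score >= threshold:
            break
        # The hint never contains the reference text itself.
        prompt = f"{request}\nFeedback: {feedback_hint(attempt, reference)}"
    return best, best_score


def make_toy_model(memorized: str) -> Callable[[str], str]:
    """Stand-in for an LLM: recalls four more words of the text on each call."""
    words = memorized.split()
    calls = {"n": 0}

    def model(prompt: str) -> str:
        calls["n"] += 1
        return " ".join(words[: 4 * calls["n"]])

    return model


if __name__ == "__main__":
    text = "It was the best of times, it was the worst of times"
    result, score = extract_with_feedback(
        make_toy_model(text), "Quote the opening line.", text
    )
    print(result, score)
```

In a real setting the toy model would be replaced by calls to an actual LLM, and the feedback agent would itself likely be a model rather than a word count.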
Read the full article on The Register to learn more.
About André Duarte
André Duarte enrolled in the Dual Ph.D. Program in Language Technologies in the academic year 2024/2025, having participated in CMU Portugal’s “Visiting Students” program in 2023 as a Master’s student. His Ph.D. research, focused on security, safety, and trustworthiness in LLMs, is supervised by Arlindo Oliveira (Técnico/INESC-ID) and Lei Li (Carnegie Mellon).
Other articles featuring André Duarte: