Charles River Analytics is part of a team of researchers that won an $11.3 million contract to infer the authorship of uncredited documents based on writing style. The Human Interpretable Attribution of Text Using Underlying Structure (HIATUS) program, sponsored by the Intelligence Advanced Research Projects Activity (IARPA), uses natural language processing techniques and machine learning to create stylistic fingerprints that capture the writing style of specific authors.
On the flip side is authorship privacy: mechanisms that can anonymize authors’ identities, especially when their lives are in danger. Pitting the attribution and privacy teams against each other will hopefully motivate both, says Dr. Terry Patten, Principal Scientist at Charles River Analytics and Principal Investigator of AUTHOR: Attribution, and Undermining the Attribution, of Text while providing Human-Oriented Rationales, Charles River’s effort under the HIATUS program.
“One of the big challenges for the program and for authorship attribution in general is that the document you’re looking at may not be in the same genre or on the same topic as the sample documents you have for a particular author,” Patten says. The same applies to languages: We might have example articles by an author in English but need to match the style even if the document at hand is in French. Authorship privacy has its own challenges: Users must obfuscate the style without changing the meaning, which can be difficult to do.
While authorship attribution has usually relied on words and their frequencies as identifying fingerprints, AUTHOR is expected to evaluate grammar as well, such as the use of passive voice or nominalization, in which a verb is turned into a noun. Discourse features, such as how authors structure their arguments, will also be included in the set of attribution parameters.
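To make the idea of a word-frequency fingerprint concrete, here is a minimal Python sketch. It is an illustration only and is not the AUTHOR system or any HIATUS method: the short function-word list, the function names, and the regex-based passive-voice heuristic are all assumptions made for this example, and a real system would rely on far richer features and a proper parser.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "was", "by"]


def fingerprint(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_author(unknown_text, samples):
    """Return the author label whose sample text has the closest fingerprint."""
    target = fingerprint(unknown_text)
    return max(
        samples,
        key=lambda author: cosine_similarity(target, fingerprint(samples[author])),
    )


def passive_rate(text):
    """Very rough grammar feature: passive-looking constructions per sentence.

    Assumption: a form of "to be" followed by a word ending in "-ed" counts
    as passive. This misses irregular participles and can over-count.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    hits = re.findall(
        r"\b(?:is|are|was|were|be|been|being)\s+\w+ed\b", text.lower()
    )
    return len(hits) / len(sentences)
```

Given a dictionary that maps author labels to sample texts, `most_similar_author(unknown_text, samples)` returns the label with the highest cosine similarity to the unknown text. Each number in the fingerprint corresponds to a named word or construction, so a reader can see which features drove a match, although this toy version does not address cross-genre or cross-language cases.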
The growth of natural language processing (NLP) and one of its underlying techniques, machine learning, is motivating researchers to apply these technologies to the classic problem of authorship attribution. The challenge, Patten says, is that while machine learning is very effective at authorship attribution, “deep learning systems that use neural networks can’t explain why they arrived at the answers they did.” Evidence in criminal trials can’t hinge on such black-box systems, which is why the core condition of AUTHOR is that it be “human-interpretable.” Deep learning’s black-box problem has been well documented, so figuring out how to arrive at human-interpretable explanations will contribute to the larger AI field as well, Patten says.
Initially, the project is expected to focus on feature discovery: Beyond words, what features can we discover to increase the accuracy of authorship attribution?
The project has a range of promising applications – identifying counterintelligence risks, combating misinformation online, fighting human trafficking, and even figuring out authorship of ancient religious texts.
Patten is excited about the promise of AUTHOR as it is poised to make fundamental contributions to the field of NLP. “It’s really forcing us to address an issue that’s been central to natural language processing,” Patten says. “In NLP and AI in general, we need to find a way to build hybrid systems that can incorporate both deep learning and human-interpretable representations. The field needs to find ways to make neural networks and linguistic representations work together.”
“We need to get the best of both worlds,” Patten says.
The team includes some of the world’s foremost researchers in authorship analysis, computational linguistics, and machine learning from the Illinois Institute of Technology, the Aston Institute for Forensic Linguistics, Rensselaer Polytechnic Institute, and the Howard Brain Sciences Foundation.
This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-2207220001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.