Army intelligence analysts work with large volumes of information in the form of reports. Because these documents are written for human consumption, they contain ambiguous and implicit information that is difficult for computers to extract, leaving analysts to manually comb through the data for key nuggets of information.
Charles River Analytics has been awarded two contracts totaling $3.6 million from the Small Business Innovation Research (SBIR) program to address this problem. The technology under development uses an innovative hybrid approach to natural language processing to extract relevant information from Army reports while adhering to the semantics that Army personnel use.
“Off-the-shelf large language models have impressive language processing capabilities. But they are not geared toward military language, so they struggle with the jargon and phraseology,” says Dr. Terry Patten, Principal Scientist at Charles River Analytics and Principal Investigator. “We’re training these models to understand the linguistic idiosyncrasies and nuances that appear in military reports.”
Fine-tuning a large language model (LLM) typically requires thousands of input-output training examples, which are difficult to obtain in practice. This project involves developing a novel approach that enables LLMs to be extensively fine-tuned given only a few representative examples.
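The article does not describe how the few-example approach works. As a purely illustrative sketch of one generic way to stretch a handful of annotated examples into a larger fine-tuning set (template-based synthetic augmentation; the report phrasing, field names, and vocabulary lists below are all hypothetical, not taken from the program):

```python
# Illustrative sketch only: expand one hand-labeled seed example into many
# synthetic (text, label) fine-tuning pairs by varying slot values in its
# implicit template. All values here are invented for demonstration.
import itertools

# One hand-labeled seed: (report text, extracted structured output)
SEED = ("2x BTR-80 observed vic NAI 3 at 0620Z",
        {"equipment": "BTR-80", "count": 2, "location": "NAI 3", "time": "0620Z"})

# Hypothetical variation tables for each slot in the seed's template
EQUIPMENT = ["BTR-80", "T-72", "BMP-2"]
COUNTS = [1, 2, 4]
LOCATIONS = ["NAI 3", "OBJ FALCON", "CP 7"]
TIMES = ["0620Z", "1430Z", "2215Z"]

def expand_seeds():
    """Generate synthetic (text, label) pairs from the seed's template."""
    examples = []
    for eq, n, loc, t in itertools.product(EQUIPMENT, COUNTS, LOCATIONS, TIMES):
        text = f"{n}x {eq} observed vic {loc} at {t}"
        label = {"equipment": eq, "count": n, "location": loc, "time": t}
        examples.append((text, label))
    return examples

examples = expand_seeds()
print(len(examples))  # 3 * 3 * 3 * 3 = 81 synthetic pairs from one seed
```

The resulting pairs could then feed a standard supervised fine-tuning loop; the actual program's hybrid approach is not public and may differ substantially.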
Through an initial Phase II effort, Charles River developed and demonstrated an early prototype. Under the sequential Phase II effort, the team is continuing to mature the technology, perform evaluations on Army data, and prepare the technology for integration with the Army Intelligence Data Platform (AIDP).
“Through our work, Charles River has pioneered techniques to train an LLM for particular types of language, and we are already using these techniques on other programs,” says Dr. Michael Giancola, Co-Principal Investigator and AI Scientist. “It’s exciting to show how LLMs can be adapted efficiently to applications involving idiosyncratic language across different domains.”
The benefits of maturing this technology are far-reaching. Patten explains, “Every large organization faces challenges around extracting information from unstructured documents written by people—from technical product reviews to medical or legal documents. This technology shows how generic LLMs can be adapted to applications that involve highly specialized language.”
Contact us to learn more about capabilities in natural language processing and adaptive intelligent training.
This material is based upon work supported by the ASA(ALT) SBIR CCoE under Contract No. W51701-24-C-0126. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of ASA(ALT) SBIR CCoE.