Scale AI has been contracted by the Pentagon's Chief Digital and Artificial Intelligence Office (CDAO) to develop a reliable method for assessing and analyzing large language models, which can both aid and hinder military decision-making and planning. Under the deal, the company will build a thorough test-and-evaluation (T&E) framework for the Defense Department's generative AI program.
The work under this new one-year contract will provide the CDAO with "a framework to deploy AI safely by measuring model performance, offering real-time feedback for warfighters, and creating specialized public sector evaluation sets to test AI models for military support applications, such as organizing the findings from after action reports," according to a statement shared exclusively with DefenseScoop by the San Francisco-based company.
Large language models fall under the broader field of generative AI, which encompasses emerging tools that can, given human prompts, produce convincing (but not necessarily accurate) text, code, images, and other media.
This fast-moving field offers significant opportunities for the Defense Department, but it also poses serious and still largely unknown risks. To accelerate its components' understanding, assessment, and use of generative AI, Pentagon leadership established Task Force Lima under the CDAO's Algorithmic Warfare Directorate last year.
Departmental systems, platforms, and technologies have historically relied on T&E procedures to guarantee their safe and dependable operation before full fielding. T&E for AI safety is already difficult because there are no globally agreed-upon rules and standards, and it becomes considerably more challenging with large language models and all of their inherent uncertainties.
In general, T&E allows specialists to determine the baseline performance level of a given model. For example, to train a computer vision algorithm to distinguish photographs of dogs and cats from images containing neither, an expert might feed the algorithm millions of labeled images of both types of animals. At the same time, the expert holds aside a varied selection of labeled data that can later be fed to the trained algorithm.
They can then compare the model's predictions on that held-out evaluation set against the known labels, sometimes called the "ground truth," to find out how often the model gets the classification wrong.
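To make the holdout idea concrete, here is a minimal sketch in Python using scikit-learn; the feature vectors, labels, and model choice are synthetic stand-ins for illustration rather than real dog-and-cat images.

```python
# Minimal sketch of holdout evaluation: train on one split, measure
# the error rate on data the model never saw. All data here is a
# synthetic stand-in for pre-extracted image features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(1000, 64)             # placeholder feature vectors
y = np.random.randint(0, 3, size=1000)   # placeholder labels: 0 = cat, 1 = dog, 2 = neither

# Hold aside a varied selection of data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare predictions on the held-out set against the ground-truth labels
predictions = model.predict(X_test)
error_rate = 1.0 - accuracy_score(y_test, predictions)
print(f"Held-out error rate: {error_rate:.2%}")
```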
T&E of large language models follows a similar strategy at Scale AI, but there is no equivalent body of "ground truth" for these complicated systems, owing to their generative nature and the difficulty of evaluating natural language. Asked the same question, an LLM may produce five answers that are all factually correct but worded differently, so outputs cannot simply be matched against a single expected response.
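The snippet below illustrates the problem rather than Scale AI's actual method: an exact-string check rejects a correctly worded paraphrase, and even a simplified token-overlap F1 score (similar in spirit to metrics used in question-answering benchmarks) rates it poorly.

```python
# Two answers can be equally correct while sharing few exact words,
# which is why scoring LLM outputs is harder than checking labels.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Naive token-overlap F1 between a model output and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "The after action report was filed on 12 March."
answers = [
    "The after action report was filed on 12 March.",  # exact match
    "That report was submitted on March 12.",          # same fact, different wording
]
for answer in answers:
    print(f"exact={answer == reference}  f1={token_f1(answer, reference):.2f}")
```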
The company is developing the framework, methods, and technology that the CDAO can employ for testing and evaluating large language models. As part of that effort, it is creating "holdout datasets" in which DOD personnel contribute prompt-and-response pairs, and the responses are reviewed over multiple rounds to ensure they meet military standards. The process is designed to be iterative.
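As a rough illustration of what one record in such a holdout dataset might look like, the sketch below uses hypothetical field names and an assumed two-round review rule; it is not the DOD's or Scale AI's actual schema.

```python
# Hypothetical record for a holdout dataset of prompt-and-response
# pairs, with a simple multi-round review gate before a record is
# allowed into the evaluation set.
from dataclasses import dataclass, field

@dataclass
class HoldoutRecord:
    prompt: str              # question or task written by DOD personnel
    reference_response: str  # answer a military expert would accept
    domain: str              # e.g. "after action reports"
    reviews: list = field(default_factory=list)  # sign-offs from successive review rounds

    def approved(self, required_reviews: int = 2) -> bool:
        """Only enter the evaluation set after enough review rounds."""
        return len(self.reviews) >= required_reviews

record = HoldoutRecord(
    prompt="Summarize the key findings of this after action report: ...",
    reference_response="The unit identified three logistics shortfalls ...",
    domain="after action reports",
)
record.reviews.append("reviewer-1: meets standard")
print(record.approved())  # False until a second review round is logged
```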
After the experts create and refine DOD-relevant datasets for attributes such as truthfulness and world knowledge, they can benchmark existing large language models against them.
With these holdout datasets in hand, experts can run evaluations and produce model cards: brief documents that outline the intended use of a given machine-learning model and report metrics for gauging its effectiveness.
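The sketch below shows one plausible shape for a model card assembled from evaluation results; the model name, dataset names, scores, and intended-use text are placeholders, not real figures.

```python
# Assemble a minimal model card from per-dataset evaluation scores.
# Real model cards also document training data, limitations, and caveats.
import json

def build_model_card(model_name: str, results: dict, intended_use: str) -> dict:
    return {
        "model": model_name,
        "intended_use": intended_use,
        "evaluation": [
            {"dataset": dataset, "metric": "accuracy", "score": score}
            for dataset, score in results.items()
        ],
    }

card = build_model_card(
    model_name="example-llm-v1",                    # hypothetical model
    results={"dod-truthfulness-holdout": 0.81,      # placeholder scores
             "dod-world-knowledge-holdout": 0.74},
    intended_use="Organizing and summarizing findings from after action reports.",
)
print(json.dumps(card, indent=2))
```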
Officials aim to automate this development process to a large extent so that when new models are introduced, a baseline understanding of their performance, strengths, and potential weak spots can be established.
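A hedged sketch of that kind of automation might look like the following, where `run_model` and `score` are hypothetical stand-ins for whatever inference harness and scoring function are actually used.

```python
# When a new model is registered, run it against every approved holdout
# dataset and record a per-dataset baseline; strengths and weak spots
# show up as high and low scores.
from typing import Callable

def baseline_new_model(
    model_name: str,
    datasets: dict[str, list[tuple[str, str]]],   # dataset name -> (prompt, reference) pairs
    run_model: Callable[[str, str], str],          # (model_name, prompt) -> model output
    score: Callable[[str, str], float],            # (output, reference) -> score in [0, 1]
) -> dict[str, float]:
    baseline = {}
    for name, pairs in datasets.items():
        scores = [score(run_model(model_name, prompt), reference)
                  for prompt, reference in pairs]
        baseline[name] = sum(scores) / len(scores)
    return baseline
```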
From there, the models will be able to feed signals back to the CDAO officials evaluating them, providing warnings if they begin to deviate from the parameters they were tested against.
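One simple way to picture that monitoring step is a check that compares live scores against the baseline a model was tested at and raises a warning when the gap grows too large; the threshold and numbers below are purely illustrative.

```python
# Flag datasets where a deployed model's observed score has dropped
# meaningfully below the baseline established during testing.
def check_for_drift(baseline: dict[str, float],
                    live_scores: dict[str, float],
                    tolerance: float = 0.05) -> list[str]:
    warnings = []
    for dataset, expected in baseline.items():
        observed = live_scores.get(dataset)
        if observed is not None and expected - observed > tolerance:
            warnings.append(
                f"{dataset}: score dropped from {expected:.2f} to {observed:.2f}"
            )
    return warnings

# Example: the model slips on the truthfulness set after deployment
alerts = check_for_drift(
    baseline={"dod-truthfulness-holdout": 0.81},
    live_scores={"dod-truthfulness-holdout": 0.70},
)
print(alerts)
```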
Through this project, the Department of Defense will be able to refine its test-and-evaluation standards for generative AI, measuring performance through quantitative benchmarking and qualitative user feedback. Drawing on the Department of Defense's terminology and knowledge bases, the evaluation criteria will identify generative AI models that can support military applications with relevant and accurate results. According to Scale AI's statement, "The rigorous T&E process aims to enhance the robustness and resilience of AI systems in classified environments, enabling the adoption of LLM technology in secure environments."
Along with the CDAO, the company has formed partnerships with Meta, Microsoft, the US Army, OpenAI, GM, Toyota Research Institute, the Defense Innovation Unit, Nvidia, and others.
“Testing and evaluating generative AI will help the DoD understand the strengths and limitations of the technology so it can be deployed responsibly. Scale is honored to partner with the DoD on this framework,” said Scale AI’s founder and CEO, Alexandr Wang.