Thank you for being here. Letβs take a deep breath and dive in to the best LLM papers of this week!
1. AIOS: LLM Agent Operating System
π Author(s): Kai Mei, et al. from Rutgers University
π Publication Date: Mar 26, 2024
β¨ Key Insights:
Whatβs New? They presented AIOS, an LLM agent operating system, which embeds large language model into operating systems (OS) as the brain of the OS, enabling an operating system βwith soulβ, taking a step towards AGI.
Behind the New. The AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, and maintain access control for agents.
So, How can we use this? Through AIOS you can improve performance and efficiency when using multiple LLM agents!
π Read Full Paper, Explore Github Repo
2. Can multiple-choice questions really be useful in detecting the abilities of LLMs?
π Author(s): Wangyue Li, et al. from Meetyou AI Lab
π Publication Date: Mar 28, 2024
β¨ Key Insights:
Whatβs New? They focused on testing the effectiveness of multiple choice question evaluating on LLMs, identifying the issue with positional sensitivity (biased to first answer) and its gap with long form generation outputs.
Behind the New. They revealed that LLMs show relatively low correlation between answers from multiple choice questions and long form generation for identical questions.
So, How can we use this? Is using multiple-choice benchmarks really a good way to evaluate LLMs?
π Read Full Paper
3. Long-form Factuality in Large Language Models
π Author(s): Jerry Wei, et al. from Google DeepMind
π Publication Date: Mar 27, 2024
β¨ Key Insights:
Whatβs New? They proposed Search-Augmented Factuality Evaluator(SAFE) method, using LLM agents as automated evaluators for long-form factuality.
Behind the New. The proposed SAFE method utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results.
So, How can we use this? Checking LLM response factuality is a must.
π Read Full Paper
4. Top Leaderboard Ranking = Top Coding Proficiency, Always? EVOEVAL: Evolving Coding Benchmarks via LLM
π Author(s): Chunqiu Steven Xia et al. from University of Illinois Urbana-Champaign
π Publication Date: Mar 28, 2024
β¨ Key Insights:
Whatβs New? They introduced EVOEVAL - a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities.
Behind the New. Prior benchmarks contain only a very limited set of problems, both in quantity and variety. Further, due to popularity and age, many benchmarks are prone to data leakage where example solutions can be readily found on the web and thus potentially in training data.
So, How can we use this? You can find that Claude-3 is really comparable to GPT-4-Turbo in EVOEVAL! I think benchmark should be changing continuously to avoid data leakage problems.
π Read Full Paper, Explore Github Repo
5. Understanding Emergent Abilities of Language Models from the Loss Perspective
π Author(s): Zhengxiao Du, et al. from Zhipu AI
π Publication Date: Mar 24, 2024
β¨ Key Insights:
Whatβs New? They demonstrated that the models with the same pre-training loss, but different model and data sizes, generate the same performance on various downstream tasks.
Behind the New. They observed that 1) smaller models can also exhibit high performance on emergent abilities and 2) there is doubt on the discontinuous metrics used to measure these abilities.
So, How can we use this? The size of the model is not important. Training loss is all you need!
π Read Full Paper
6. Defending Against Indirect Prompt Injection Attacks With Spotlighting
π Author(s): Keegan Hines, et al. from Microsoft
π Publication Date: Mar 20, 2024
β¨ Key Insights:
Whatβs New? They introduced spotlighting, a family of prompt engineering techniques that can be used to improve LLMβs ability to distinguish among multiple sources of input.
Behind the New. They evaluated spotlighting as a defense against indirect prompt injection attacks, and find that it is a robus defense that hs minimal detrimental impact to underlying NLP tasks.
So, How can we use this? Make it easier for the model to distinguish between our valid system instructions and any input text which should be treated as untrustworthy!
π Read Full Paper
Stay curious, and until next week!