Benchmarking Large Language Models Against Human Annotators in News Article Annotation

📌 Key facts

🎯 Objective: Evaluate the performance of Large Language Models (LLMs) on complex text-annotation tasks for news articles

⏱️ When: Start date is flexible (ideally from October 14)! Applications are open!

📥 How to apply: Send us an e-mail (contact details at the end of this page) with your CV and grade report

❗IMPORTANT: You should be proficient in German, as you will be required to perform labeling and annotation tasks on German news articles.

💡 Background

In today's competitive media landscape, companies must effectively manage content performance, evaluate reader preferences, and mitigate bias in their articles. Achieving these goals requires transforming unstructured data, such as articles, into structured information that can be readily analyzed. One method to achieve this transformation is through text annotation, which prepares data for subsequent analysis. Typically, this task is performed by professional annotators (Spinde et al., 2021), but manual annotation is inherently time-consuming, non-scalable, and costly (Snow et al., 2008).

To address these challenges, media companies are increasingly exploring more efficient methods for text annotation. A promising solution lies in leveraging Large Language Models (LLMs). By utilizing LLMs for text annotation, companies can save both time and money, effectively addressing the pressures of a competitive market (Björkroth & Grönlund, 2018). This shift not only enhances efficiency but also allows media organizations to focus resources on other critical areas, ultimately improving their ability to deliver high-quality content.

🎯 Goals

Your goal is to annotate a set of German news articles (n = 50–100) based on a predefined set of rules. After completing the initial annotations, you will collaborate with fellow students in this research group to cross-validate each other's work, aiming to reach a consensus and create "perfect" (gold-standard) text annotations. These annotations will serve as the test data set for evaluating the performance of Large Language Model (LLM) outputs on the same text-annotation tasks in news articles.
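
To make the cross-validation step concrete, the sketch below shows one common way to quantify agreement between two annotators using Cohen's kappa and to flag items for consensus discussion. The label set, the per-sentence granularity, and the use of scikit-learn are illustrative assumptions, not part of the actual annotation guidelines.

```python
# Minimal sketch: quantify inter-annotator agreement with Cohen's kappa.
# Assumes each annotator assigned exactly one categorical label per sentence;
# the labels below are illustrative, not the project's annotation scheme.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["neutral", "biased", "neutral", "biased", "neutral"]
annotator_b = ["neutral", "biased", "biased", "biased", "neutral"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance level

# Items where the annotators disagree would be discussed until consensus is reached.
disagreements = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b]
print("Items to discuss:", disagreements)
```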

You will also need to develop a comprehensive evaluation metric to assess the performance of LLM outputs. This metric should include comparisons of LLM performance across varying levels of extraction complexity and different article types.
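
As a rough illustration of what such an evaluation could look like, the sketch below scores hypothetical LLM labels against the consensus ("gold") annotations, broken down by article type. The field names, label values, and binary precision/recall/F1 metrics are assumptions for the example; designing the actual metric is part of your task.

```python
# Minimal sketch: compare LLM annotations against consensus gold labels per article type.
# Field names and label values are illustrative placeholders.
from collections import defaultdict
from sklearn.metrics import precision_score, recall_score, f1_score

gold = [
    {"article_type": "politics", "label": "biased"},
    {"article_type": "politics", "label": "neutral"},
    {"article_type": "sports",   "label": "neutral"},
    {"article_type": "sports",   "label": "biased"},
]
llm_labels = ["biased", "biased", "neutral", "biased"]  # one LLM label per gold item

# Group gold and LLM labels by article type.
by_type = defaultdict(lambda: ([], []))
for item, pred in zip(gold, llm_labels):
    y_true, y_pred = by_type[item["article_type"]]
    y_true.append(item["label"])
    y_pred.append(pred)

# Report binary precision/recall/F1 with "biased" as the positive class.
for article_type, (y_true, y_pred) in by_type.items():
    p = precision_score(y_true, y_pred, pos_label="biased")
    r = recall_score(y_true, y_pred, pos_label="biased")
    f = f1_score(y_true, y_pred, pos_label="biased")
    print(f"{article_type}: precision={p:.2f}, recall={r:.2f}, F1={f:.2f}")
```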

The data collection process has already been completed for you. We have access to clean news article data provided by major media outlets, allowing you to focus on the annotation and evaluation tasks.

🦾 Who We Are

The Chair for Strategy and Organization focuses on research with impact. This means we do not want to repeat old ideas or base our research solely on work done ten years ago; instead, we study topics that will shape the future, including Agile Organizations and Digital Disruption, Blockchain Technology, Creativity and Innovation, Digital Transformation and Business Model Innovation, Diversity, Education (Education Technology and Performance Management), HRTech, Leadership, and Teams. We aim to be early in spotting the trends, technologies, strategies, and organizations that will shape the future, which has its ups and downs.

🧠 Topics of Interest

  • AI
  • Data retrieval
  • Data validation & Data quality
  • GenAI
  • Large language models (LLMs)
  • Text annotations / Text labeling
  • Bias Detection

🎓 Profile

  • Interest in AI and Large Language Models (LLMs), especially LLM output quality and reliability
  • Analytical Thinking and Statistical Knowledge
  • Programming experience is desired, e.g., in Python
  • As the data you will work with are in German, proficient German language skills are preferred to guarantee a valid test data set
  • You work very accurately: when building a test data set, there is no room for the 80/20 principle. An excellent academic track record is therefore expected!

📚 Further Reading

  • https://arxiv.org/pdf/2105.11910 Spinde, T., Rudnitckaia, L., Sinha, K., Hamborg, F., Gipp, B., & Donnay, K. (2021). MBIC – A media bias annotation dataset including annotator characteristics. arXiv preprint arXiv:2105.11910.
  • https://www.pnas.org/doi/pdf/10.1073/pnas.2305016120 Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
  • https://ceur-ws.org/Vol-3671/paper1.pdf Raza, S., Rahman, M., & Ghuge, S. (2024). Dataset annotation and model building for identifying biases in news narratives. In Text2Story@ECIR (pp. 5-15).
  • https://arxiv.org/pdf/2404.09682 Choi, J., Yun, J., Jin, K., & Kim, Y. (2024). Multi-News+: Cost-efficient dataset cleansing via LLM-based data annotation. arXiv preprint arXiv:2404.09682.
  • https://dl.acm.org/doi/abs/10.1145/3639631.3639663 Vijayan, A. (2023, December). A prompt engineering approach for structured data extraction from unstructured text using conversational LLMs. In Proceedings of the 2023 6th International Conference on Algorithms, Computing and Artificial Intelligence (pp. 183-189).

📝 How to Apply

If you are interested, please contact Joe Yu by submitting your CV, your grade report, and a short motivation letter explaining why you are interested in this topic and why you are a good fit for it.

Joe Yu (Chair for Strategy and Organization) 👉 joe.yu@tum.de