His research sits at the intersection of Software Engineering, NLP, and Empirical Methods, using large-scale data analysis, machine learning, and LLMs to study how developers communicate, how AI tools change the way software is built, and how emotion and toxicity shape open-source communities.
GenAI & Code Quality
Empirical study of how generative AI tools affect software development practice and quality.
"TODO: Fix the Mess Gemini Created": Towards Understanding GenAI-Induced Self-Admitted Technical Debt
International Conference on Technical Debt (TechDebt) 2026 [PDF] [Zenodo]
Investigates how developers explicitly acknowledge AI-generated technical debt in source code comments. Key finding: Among 81 annotated comments, 15 cases of GenAI-Induced SATD (GIST) were identified — a new category where developers admit AI-generated code introduced debt requiring future fixes. AI plays four roles in SATD: Source, Catalyst, Mitigator, and Neutral.
OLAF: Towards Robust LLM-Based Annotation Framework in Empirical Software Engineering
WSESE Workshop 2026, ICSE Companion [PDF] [Project]
Proposes a robust LLM-based annotation framework to support large-scale labeling in empirical SE research. Key finding: LLM-based annotation can closely replicate human judgment on SE datasets, with structured prompting strategies significantly improving annotation consistency and scalability.
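For readers unfamiliar with how LLM annotations are validated against human labels, a standard check is chance-corrected agreement such as Cohen's kappa. This is a minimal, self-contained sketch (the labels and the `cohens_kappa` helper are illustrative, not the paper's actual pipeline or data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. LLM labels on six comments.
human = ["SATD", "SATD", "none", "none", "SATD", "none"]
llm   = ["SATD", "none", "none", "none", "SATD", "none"]
print(round(cohens_kappa(human, llm), 3))  # → 0.667
```

A kappa near 1.0 indicates the LLM closely replicates human judgment; values near 0 mean agreement is no better than chance.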
Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub
Mining Software Repositories (MSR) — Mining Challenge 2026 [ArXiv]
Analyzes 33,000 agent-authored pull requests on GitHub to identify what makes AI-generated contributions fail to merge. Key finding: AI agents succeed most at documentation and CI/CD tasks but fail most at bug-fixing. Rejected PRs touch more files and make larger changes — but socio-technical misalignment (e.g., implementing features maintainers didn't want) is a major underexplored failure driver.
Toxicity & Conversational Derailment in Open Source Software
Understanding and mitigating harmful communication in open source developer communities.
Toxicity Ahead: Forecasting Conversational Derailment on GitHub
International Conference on Software Engineering (ICSE) 2026 [PDF] [Replication]
Proactively forecasts toxic derailment in GitHub issue threads using LLM-generated Summaries of Conversation Dynamics (SCDs). Key finding: Achieved F1 = 0.901 (Qwen) and F1 = 0.852 (Llama) on 159 toxic + 207 non-toxic threads. The model generalizes to external benchmarks (F1 = 0.797 on Raman et al.), outperforming few-shot baselines.
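The F1 scores reported across these papers balance precision and recall on the toxic (positive) class. A minimal sketch of the metric, with hypothetical confusion-matrix counts chosen only for illustration:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted-toxic threads, how many were toxic
    recall = tp / (tp + fn)     # of truly toxic threads, how many were caught
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
print(f1_score(tp=8, fp=2, fn=2))  # → 0.8
```

Because F1 ignores true negatives, it is a natural choice when the toxic class is the minority of threads, as in these datasets.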
Understanding and Predicting Derailment in Toxic Conversations on GitHub
[ArXiv] [Replication]
A comprehensive empirical study of derailment patterns in toxic OSS conversations with LLM-based prediction. Key finding: Achieved 70% F1-score in derailment prediction. Linguistic markers like second-person pronouns, negation terms, and "Bitter Frustration and Impatience" emotion tone are strong early predictors of impending toxicity.
"Silent Is Not Actually Silent": An Investigation of Toxicity on Bug Report Discussion
Foundations of Software Engineering — Ideas, Visions and Reflections (FSE-IVR) 2025 [ArXiv] [Replication]
Qualitatively investigates the nature and impact of toxicity specifically within bug report discussions. Key finding: ~40% of analyzed bug threads (81 of 203) contained toxicity. Top drivers: misaligned perceptions of bug severity/priority, unresolved tool frustrations, and communication lapses. Toxic threads are measurably less likely to produce a linked PR resolution.
Incivility in Open Source Projects: A Comprehensive Annotated Dataset of Locked GitHub Issue Threads
Mining Software Repositories (MSR) 2024 [PDF] [Repo]
Releases a large annotated dataset of locked GitHub issues spanning 2013–2023 to support incivility research in OSS. Key finding: From 338 locked ("too-heated") issues, 1,365 comments were annotated across 9 uncivil feature types, 8 triggers, 5 target categories, and 7 consequence types — the most comprehensive OSS incivility dataset to date.
Emotion & Communication in Software Engineering
Using emotion as a lens to study and improve developer communication.
Learning Programming in Informal Spaces: Using Emotion as a Lens to Understand Novice Struggles on r/learnprogramming
ICSE — Software Engineering Education and Training (SEET) 2026 [PDF] [Replication]
Studies emotional experiences of novice programmers through 1,500 annotated posts from r/learnprogramming. Key finding: Frustration and confusion dominate novice programming struggles. DBSCAN clustering revealed distinct emotional patterns tied to specific learning barriers — pointing to the need for affect-aware intelligent tutoring in informal learning spaces.
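DBSCAN, the clustering algorithm used here, groups dense regions of points and marks sparse outliers as noise, with no need to fix the number of clusters in advance. A minimal pure-Python sketch on toy 2D points (in the paper the points would be embeddings of posts; the data below is purely illustrative):

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id >= 0, or -1 for noise."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Precompute eps-neighborhoods; each point's neighborhood includes itself.
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:   # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster               # start a new cluster from this core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=2))  # → [0, 0, 0, 1, 1, 1, -1]
```

Two dense blobs become clusters 0 and 1, while the isolated point is labeled -1 (noise) — the property that makes DBSCAN suited to finding recurring emotional patterns without forcing every post into a cluster.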
Uncovering the Causes of Emotions in Software Developer Communication Using Zero-shot LLMs
International Conference on Software Engineering (ICSE) 2024 [PDF] [Repo]
Applies zero-shot LLMs to identify root causes of emotions expressed in developer communications. Key finding: Zero-shot LLMs can identify emotion causes in SE texts without fine-tuning. Technical disagreements, ambiguous requirements, and unresponsive collaborators are the most frequent emotional triggers in developer communication.
Shedding Light on Software Engineering-specific Metaphors and Idioms
International Conference on Software Engineering (ICSE) 2024 [PDF] [Repo]
Studies the prevalence and role of figurative language in software engineering texts. Key finding: Figurative language is pervasive in SE communication and significantly degrades NLP tool performance — SE-specific models must account for domain idioms to avoid systematic misinterpretation.
Emotion Classification In Software Engineering Texts: A Comparative Analysis of Pre-trained Transformers
Natural Language-Based Software Engineering (NLBSE) 2024 [PDF]
Benchmarks pre-trained transformer models for emotion classification across SE communication datasets. Key finding: No single transformer dominates across all SE emotion datasets — performance varies significantly by data source, showing that domain-aware model selection is critical rather than defaulting to general-purpose LLMs.
Data Augmentation for Improving Emotion Recognition in Software Engineering Communication
Automated Software Engineering (ASE) 2022 [PDF] [Repo]
Applies data augmentation strategies to address severe label scarcity in SE emotion recognition. Key finding: Data augmentation consistently improves minority-class emotion detection — demonstrating that class imbalance, not model capacity, is the primary bottleneck in SE emotion recognition.
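To illustrate the class-imbalance problem the paper targets: the simplest rebalancing baseline is random oversampling, duplicating minority-class examples until label counts match. The paper's augmentation operates on the text itself, so this `oversample` helper and its data are only an illustrative sketch of the rebalancing idea:

```python
import random
from collections import Counter

def oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until every label matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):      # top up this class to the target size
            out_texts.append(rng.choice(pool))
            out_labels.append(label)
    return out_texts, out_labels

# Hypothetical imbalanced dataset: 4 neutral comments, 1 angry one.
texts = ["lgtm", "ship it", "ok", "fine", "this API is infuriating"]
labels = ["neutral", "neutral", "neutral", "neutral", "anger"]
_, balanced = oversample(texts, labels)
print(Counter(balanced))  # → Counter({'neutral': 4, 'anger': 4})
```

Text-level augmentation (paraphrasing, word substitution) goes further by generating *new* minority-class examples rather than duplicates, which is what drives the gains reported here.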
Bug Reports & Developer Tools
Improving software quality through better bug reporting and developer tooling.
LLPut: Investigating Large Language Models for Bug Report-Based Input Generation
ACM Foundations of Software Engineering (FSE) Companion 2025 [ACM] [Replication]
Evaluates LLMs for automatically generating failure-reproducing test inputs from bug report descriptions. Key finding: LLMs show promise for bug-report-based input generation but struggle with complex reproduction steps and environment-specific bugs — pointing to gaps that future LLM-based testing tools must address.
Using Clarification Questions to Improve Software Developers' Web Search
Information and Software Technology (IST) 2022 [PDF] [Repo]
Uses targeted clarification questions to expand and refine developer web search queries for better results. Key finding: Asking developers a small number of clarification questions about their intent significantly improves query quality and search result relevance, outperforming standard query expansion baselines.
Automatically Selecting Follow-up Questions for Deficient Bug Reports
Mining Software Repositories (MSR) 2021 [PDF] [Repo]
Automatically ranks follow-up questions to elicit missing information from incomplete bug reports, trained on 25,000 GitHub issues. Key finding: Neural ranking models substantially outperform retrieval baselines in selecting the most relevant follow-up questions — showing that bug report deficiencies follow learnable patterns that can be addressed automatically.