His research sits at the intersection of Software Engineering, NLP, and Empirical Methods, using large-scale data analysis, machine learning, and LLMs to study how developers communicate, how AI tools change the way software is built, and how emotion and toxicity shape open-source communities.
GenAI & Code Quality
Empirical study of how generative AI tools affect software development practice and quality.
"TODO: Fix the Mess Gemini Created": Towards Understanding GenAI-Induced Self-Admitted Technical Debt
International Conference on Technical Debt (TechDebt) 2026 [PDF] [Zenodo]
Investigates how developers explicitly acknowledge AI-generated technical debt in source code comments. Key finding: Among 81 annotated comments, 15 cases of GenAI-Induced SATD (GIST) were identified — a new category where developers admit AI-generated code introduced debt requiring future fixes. AI plays four roles in SATD: Source, Catalyst, Mitigator, and Neutral.
OLAF: Towards Robust LLM-Based Annotation Framework in Empirical Software Engineering
WSESE Workshop 2026, ICSE Companion [PDF] [Project]
Proposes a robust LLM-based annotation framework to support large-scale labeling in empirical SE research. Key finding: LLM-based annotation can closely replicate human judgment on SE datasets, with structured prompting strategies significantly improving annotation consistency and scalability.
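For readers unfamiliar with how LLM annotations are validated against human labels, a standard check is chance-corrected agreement such as Cohen's kappa. This is a minimal, self-contained sketch (the labels and the `cohens_kappa` helper are illustrative, not the paper's actual pipeline or data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. LLM labels on six comments.
human = ["SATD", "SATD", "none", "none", "SATD", "none"]
llm   = ["SATD", "none", "none", "none", "SATD", "none"]
print(round(cohens_kappa(human, llm), 3))  # → 0.667
```

A kappa near 1.0 indicates the LLM closely replicates human judgment; values near 0 mean agreement is no better than chance.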
Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub
Mining Software Repositories (MSR) — Mining Challenge 2026 [ArXiv]
Analyzes 33,000 agent-authored pull requests on GitHub to identify what makes AI-generated contributions fail to merge. Key finding: AI agents succeed most at documentation and CI/CD tasks but fail most at bug-fixing. Rejected PRs touch more files and make larger changes — but socio-technical misalignment (e.g., implementing features maintainers didn't want) is a major underexplored failure driver.
Toxicity & Conversational Derailment in Open Source Software
Understanding and mitigating harmful communication in open source developer communities.
Toxicity Ahead: Forecasting Conversational Derailment on GitHub
International Conference on Software Engineering (ICSE) 2026 [PDF] [Replication]
Proactively forecasts toxic derailment in GitHub issue threads using LLM-generated Summaries of Conversation Dynamics (SCDs). Key finding: Achieved F1 = 0.901 (Qwen) and F1 = 0.852 (Llama) on 159 toxic + 207 non-toxic threads. The model generalizes to external benchmarks (F1 = 0.797 on Raman et al.), outperforming few-shot baselines.
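The F1 scores reported across these papers balance precision and recall on the toxic (positive) class. A minimal sketch of the metric, with hypothetical confusion-matrix counts chosen only for illustration:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted-toxic threads, how many were toxic
    recall = tp / (tp + fn)     # of truly toxic threads, how many were caught
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
print(f1_score(tp=8, fp=2, fn=2))  # → 0.8
```

Because F1 ignores true negatives, it is a natural choice when the toxic class is the minority of threads, as in these datasets.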
Understanding and Predicting Derailment in Toxic Conversations on GitHub
[ArXiv] [Replication]
A comprehensive empirical study of derailment patterns in toxic OSS conversations with LLM-based prediction. Key finding: Achieved 70% F1-score in derailment prediction. Linguistic markers like second-person pronouns, negation terms, and "Bitter Frustration and Impatience" emotion tone are strong early predictors of impending toxicity.
"Silent Is Not Actually Silent": An Investigation of Toxicity on Bug Report Discussion
Foundations of Software Engineering — Ideas, Visions and Reflections (FSE-IVR) 2025 [ArXiv] [Replication]
Qualitatively investigates the nature and impact of toxicity specifically within bug report discussions. Key finding: ~40% of analyzed bug threads (81 of 203) contained toxicity. Top drivers: misaligned perceptions of bug severity/priority, unresolved tool frustrations, and communication lapses. Toxic threads are measurably less likely to produce a linked PR resolution.
Incivility in Open Source Projects: A Comprehensive Annotated Dataset of Locked GitHub Issue Threads
Mining Software Repositories (MSR) 2024 [PDF] [Repo]
Releases a large annotated dataset of locked GitHub issues spanning 2013–2023 to support incivility research in OSS. Key finding: From 338 locked ("too-heated") issues, 1,365 comments were annotated across 9 uncivil feature types, 8 triggers, 5 target categories, and 7 consequence types — the most comprehensive OSS incivility dataset to date.
Emotion & Communication in Software Engineering
Using emotion as a lens to study and improve developer communication.
Learning Programming in Informal Spaces: Using Emotion as a Lens to Understand Novice Struggles on r/learnprogramming
ICSE — Software Engineering Education and Training (SEET) 2026 [PDF] [Replication]
Studies emotional experiences of novice programmers through 1,500 annotated posts from r/learnprogramming. Key finding: Frustration and confusion dominate novice programming struggles. DBSCAN clustering revealed distinct emotional patterns tied to specific learning barriers — pointing to the need for affect-aware intelligent tutoring in informal learning spaces.
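DBSCAN, the clustering algorithm used here, groups dense regions of points and marks sparse outliers as noise, with no need to fix the number of clusters in advance. A minimal pure-Python sketch on toy 2D points (in the paper the points would be embeddings of posts; the data below is purely illustrative):

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id >= 0, or -1 for noise."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Precompute eps-neighborhoods; each point's neighborhood includes itself.
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:   # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster               # start a new cluster from this core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=2))  # → [0, 0, 0, 1, 1, 1, -1]
```

Two dense blobs become clusters 0 and 1, while the isolated point is labeled -1 (noise) — the property that makes DBSCAN suited to finding recurring emotional patterns without forcing every post into a cluster.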
Uncovering the Causes of Emotions in Software Developer Communication Using Zero-shot LLMs
International Conference on Software Engineering (ICSE) 2024 [PDF] [Repo]
Applies zero-shot LLMs to identify root causes of emotions expressed in developer communications. Key finding: Zero-shot LLMs can identify emotion causes in SE texts without fine-tuning. Technical disagreements, ambiguous requirements, and unresponsive collaborators are the most frequent emotional triggers in developer communication.
Shedding Light on Software Engineering-specific Metaphors and Idioms
International Conference on Software Engineering (ICSE) 2024 [PDF] [Repo]
Studies the prevalence and role of figurative language in software engineering texts. Key finding: Figurative language is pervasive in SE communication and significantly degrades NLP tool performance — SE-specific models must account for domain idioms to avoid systematic misinterpretation.
Emotion Classification In Software Engineering Texts: A Comparative Analysis of Pre-trained Transformers
Natural Language-Based Software Engineering (NLBSE) 2024 [PDF]
Benchmarks pre-trained transformer models for emotion classification across SE communication datasets. Key finding: No single transformer dominates across all SE emotion datasets — performance varies significantly by data source, showing that domain-aware model selection is critical rather than defaulting to general-purpose LLMs.
Data Augmentation for Improving Emotion Recognition in Software Engineering Communication
Automated Software Engineering (ASE) 2022 [PDF] [Repo]
Applies data augmentation strategies to address severe label scarcity in SE emotion recognition. Key finding: Data augmentation consistently improves minority-class emotion detection — demonstrating that class imbalance, not model capacity, is the primary bottleneck in SE emotion recognition.
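To illustrate the class-imbalance problem the paper targets: the simplest rebalancing baseline is random oversampling, duplicating minority-class examples until label counts match. The paper's augmentation operates on the text itself, so this `oversample` helper and its data are only an illustrative sketch of the rebalancing idea:

```python
import random
from collections import Counter

def oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until every label matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):      # top up this class to the target size
            out_texts.append(rng.choice(pool))
            out_labels.append(label)
    return out_texts, out_labels

# Hypothetical imbalanced dataset: 4 neutral comments, 1 angry one.
texts = ["lgtm", "ship it", "ok", "fine", "this API is infuriating"]
labels = ["neutral", "neutral", "neutral", "neutral", "anger"]
_, balanced = oversample(texts, labels)
print(Counter(balanced))  # → Counter({'neutral': 4, 'anger': 4})
```

Text-level augmentation (paraphrasing, word substitution) goes further by generating *new* minority-class examples rather than duplicates, which is what drives the gains reported here.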
Bug Reports & Developer Tools
Improving software quality through better bug reporting and developer tooling.
LLPut: Investigating Large Language Models for Bug Report-Based Input Generation
ACM Foundations of Software Engineering (FSE) Companion 2025 [ACM] [Replication]
Evaluates LLMs for automatically generating failure-reproducing test inputs from bug report descriptions. Key finding: LLMs show promise for bug-report-based input generation but struggle with complex reproduction steps and environment-specific bugs — pointing to gaps that future LLM-based testing tools must address.
Using Clarification Questions to Improve Software Developers' Web Search
Information and Software Technology (IST) 2022 [PDF] [Repo]
Uses targeted clarification questions to expand and refine developer web search queries for better results. Key finding: Asking developers a small number of clarification questions about their intent significantly improves query quality and search result relevance, outperforming standard query expansion baselines.
Automatically Selecting Follow-up Questions for Deficient Bug Reports
Mining Software Repositories (MSR) 2021 [PDF] [Repo]
Automatically ranks follow-up questions to elicit missing information from incomplete bug reports, trained on 25,000 GitHub issues. Key finding: Neural ranking models substantially outperform retrieval baselines in selecting the most relevant follow-up questions — showing that bug report deficiencies follow learnable patterns that can be addressed automatically.