Yiqing Xie's Personal Webpage

Rm 6607, Gates and Hillman Centers

4902 Forbes Ave

Pittsburgh, PA 15213, USA

I am a third-year Ph.D. student at the Language Technologies Institute of Carnegie Mellon University, working with Carolyn Rosé and Daniel Fried. Previously, I obtained my Master's degree in the data mining group at the University of Illinois Urbana-Champaign, supervised by Jiawei Han, and my Bachelor's degree at the Hong Kong University of Science and Technology, where I received the Academic Achievement Medal.

My research mainly focuses on annotation-efficient generation and evaluation systems, especially for code generation. The topics include (i) building generalizable and annotation-efficient NLP systems to assist humans with practical tasks, and (ii) building reliable and automatic evaluation systems for NLP methods.

Annotation-efficient NLP Systems

Pretraining & continuous pretraining (Anchor-DR, METRO-T0)
Training environment (RepoST)
Model-generated reward signals (FenCE)
Data augmentation (FenCE, Anchor-DR, CMTrans, Eider)
Guidance under heuristic metrics or prior knowledge (AlaGCN, RL-MMR, KoMen)
Unsupervised or semi-supervised methods (Set-CoExpan, CoRel)

Reliable and Automatic Evaluation Systems

Evaluation benchmarks (RepoST, TheAgentCompany, CodeRAG-Bench, CodeBenchGen)
Evaluation frameworks (DocLens)
Evaluator models (FenCE)

LLMs for Code Generation

Code generation training (RepoST, CMTrans)
Code generation evaluation (RepoST, TheAgentCompany, CodeRAG-Bench, CodeBenchGen)
Code generation analysis (SACL, Strong-Weak-Colab)

News

Jul 7, 2025	One paper on synthetic coding environment construction for repo-level code generation got accepted to COLM 2025! (RepoST)
Jun 25, 2025	Really excited about our two new preprints on analysis for code generation! (SACL, Strong-Weak-Colab)
May 15, 2025	One paper on factuality evaluator training got accepted to ACL 2025! (FenCE 🚧)
Apr 9, 2025	Gave a talk about repo-level coding environment construction at the EFML Reading Group (Stanford / UW)!
Feb 19, 2025	Really excited about our new preprint: RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing! (RepoST)
Jan 22, 2025	One paper on RAG-for-code benchmark got accepted to NAACL-Findings 2025! (CodeRAG-Bench)
Dec 18, 2024	Really excited about our new preprint on a benchmark for LLM agents! (TheAgentCompany)
May 16, 2024	One paper on medical evaluation got accepted to ACL 2024! (DocLens 🔍)
Apr 5, 2024	Gave a talk on our recent work on evaluation of medical text at Microsoft!

Education

Carnegie Mellon University 2022 - Present

Ph.D. in Language and Information Technologies

Research focus: code generation, LLM agents, evaluation

Advisors: Daniel Fried and Carolyn Rosé

University of Illinois at Urbana-Champaign 2020 - 2022

Master of Science in Computer Science (GPA: 4.0/4.0)

Research focus: information extraction, graph-based machine learning

Advisor: Jiawei Han

Hong Kong University of Science and Technology 2016 - 2020

B.Sc. in Computer Science with a double major in Mathematics (GPA: 3.9/4.3)

Research focus: graph-based machine learning, text mining

Advisor: Raymond Chi-Wing Wong

Work experience

Meta AI 2024.05 - 2024.10: Research Intern, GenAI; Work on training a fine-grained critique-based evaluator model and using it to improve generators' factuality [FenCE]; Manager: Hejia Zhang; Peers: Di Jin, Sinong Wang
Microsoft Research Redmond 2023.06 - 2023.08: Research Intern, Health Futures; Work on a multi-aspect fine-grained evaluation framework for medical text generation [DocLens]; Managers: Sheng Zhang, Hao Cheng, Hoifung Poon
Microsoft Research Redmond 2022.05 - 2022.08: Research Intern, Productivity and Intelligence group; Work on continuously pre-trained models for zero-shot dense retrieval [Anchor-DR]; Manager: Chenyan Xiong
Alibaba DAMO Academy 2020.07 - 2021.02: Research Intern, Data Analytics and Intelligence Lab; Work on few-shot interaction recommendation under multiple scenarios [KoMen]; Managers: Yaliang Li, Bolin Ding

Honors and Awards

CMU Presidential Fellowship in LTI 2024-2025
Siebel Scholar, class of 2022 2021-2022
Hong Kong University of Science and Technology Academic Achievement Medal (top 1%) 2020
Hong Kong Special Administrative Region Government Scholarship Fund - Reaching Out Award 2018
Hong Kong University of Science and Technology's Scholarship for Continuing Undergraduate Students 2017-2019
Dean’s List, Hong Kong University of Science and Technology Three times, 2017-2019
Silver medal of China Girls Math Olympiad 2015

Additional Information

Conference Reviews: ARR (Oct 2024, June 2024, Apr 2024, Feb 2024, Dec 2023), EMNLP 2023, ACL 2023, TKDE 2023, COLING 2022, AACL 2022
Teaching Assistant: 11-711: Advanced NLP CMU, Fall 2024
Teaching Assistant: CS412: Introduction to Data Mining UIUC, Spring 2022
Teaching Assistant: COMP 2012: Object-Oriented Programming and Data Structures HKUST, Fall 2018
Teaching Assistant: COMP 1022P: Introduction to Java Programming HKUST, Fall 2018

Selected publications

For the complete list of publications, check here

COLM

RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing

Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, and Carolyn Rose

(COLM, 2025)

PDF
ACL

Improving Model Factuality with Fine-grained Critique-based Evaluator

Yiqing Xie, Wenxuan Zhou, Pradyot Prakash, Di Jin, Yuning Mao, Quintin Fettes, Arya Talebzadeh, Sinong Wang, Han Fang, Carolyn Rose, Daniel Fried, and Hejia Zhang

(ACL, 2025)
ACL

DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation

Yiqing Xie, Sheng Zhang, Hao Cheng, Pengfei Liu, Zelalem Gero, Cliff Wong, Tristan Naumann, Hoifung Poon, and Carolyn Rose

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL, 2024)

PDF
EMNLP Findings

Data Augmentation for Code Translation with Comparable Corpora and Multiple References

Yiqing Xie, Atharva Naik, Daniel Fried, and Carolyn Rose

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings, 2023)

PDF
SIGIR

Unsupervised Dense Retrieval Training with Web Anchors

Yiqing Xie, Xiao Liu, and Chenyan Xiong

In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR, 2023)

PDF
ACL Findings

Eider: Evidence-enhanced Document-level Relation Extraction

Yiqing Xie, Jiaming Shen, Sha Li, Yuning Mao, and Jiawei Han

In Findings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL Findings, 2022)

PDF
WWW

KoMen: Domain Knowledge-Guided Few-Shot Interaction Recommendation on Multiplex Networks

Yiqing Xie, Zhen Wang, Carl Yang, Yaliang Li, Hongbo Deng, Bolin Ding, and Jiawei Han

In Proceedings of the Web Conference (WWW, 2022)

PDF
IJCAI

When Do GNNs Work: Understanding and Improving Neighborhood Aggregation

Yiqing Xie*, Sha Li*, Carl Yang, Raymond Chi-Wing Wong, and Jiawei Han

In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI, 2020)

PDF
WWW

Guiding Corpus-Based Set Expansion by Auxiliary Sets Generation and Co-Expansion

Jiaxin Huang*, Yiqing Xie*, Yu Meng, Jiaming Shen, Yunyi Zhang, and Jiawei Han

In Proceedings of The Web Conference (WWW, 2020)

PDF