Work Location: Singapore
Job Description
We are looking for research interns to work on foundational areas for coding language models, including pre-training data, mid-training data, synthetic data generation, evaluation, and agentic coding.
Responsibilities
* Explore data-centric methods for improving coding LLMs, including data filtering, quality assessment, deduplication, data mixture, and diversity analysis.
* Build synthetic data and evaluation pipelines for code generation, code editing, repo-level reasoning, tool use, and multi-step coding tasks.
* Run experiments to analyze how data, model, and training strategies affect coding capabilities.
* Work with large-scale code corpora, developer activity data, and agentic coding trajectories.
Requirements
* Strong programming skills in Python.
* Solid understanding of machine learning and large language models.
* Familiarity with LLM pre-training, mid-training, code models, data curation, evaluation, agents, or tool use.
* Strong experiment design, data analysis, and problem-solving skills.
* Interest in code intelligence, software engineering automation, and agentic coding.
Preferred Qualifications
* Experience with code data processing, GitHub-scale data, synthetic data, LLM evaluation, semantic deduplication, or agentic coding.
* Research experience, publications, or open-source projects in related areas are a plus.
What We Offer
* Access to large-scale real-world coding data and agentic trajectories.
* Rich compute resources and model APIs for fast research iteration.
* Opportunities to work on real-world coding model applications and the full model development loop.