I am a Senior Machine Learning Engineering Manager at Apple, where I lead a team building large-scale ML systems for information extraction and curation. I am part of the core team behind Apple Intelligence, where I focus on pre-training data quality and curation for the foundation models that power AI features across Apple devices, as well as MM1, Apple's multimodal model family.

I have over 15 years of experience in AI and ML, with a focus on information extraction and retrieval at web scale. Previously, I was a Machine Learning Researcher at NASA Jet Propulsion Laboratory, where I worked on crawling and search technologies for the deep and dark web as part of the DARPA Memex program. At JPL, I also contributed to the Mars Target Encyclopedia, an NLP system for extracting compositional knowledge from Mars science literature.

I am an elected member of the Apache Software Foundation, a committer and PMC member on Apache Nutch, and co-creator of Sparkler, a distributed web crawler on Apache Spark. I hold a Master's in Computer Science from USC Viterbi School of Engineering.


Highlights

EACL 2026 [2026] Paper "Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training" accepted at EACL 2026
Apple Foundation Models Architecture [2025] Launched new versions of Apple's On-Device and Server Foundation Language Models supporting several new capabilities and languages
Apple Intelligence [2024] Apple Intelligence launched at WWDC; contributed to pre-training data quality and curation for the foundation models. Press: CNBC, TechCrunch, The Verge, NYT
MM1 [2024] Our paper on MM1, Apple's multimodal model family, appeared at ECCV 2024. Featured in Wired, VentureBeat, Nasdaq
Sparkler talk at Spark Summit [2017] Presented Sparkler at Spark Summit East (video, slides); also spoke at ApacheCon North America and ApacheCon Big Data Europe
Podcast [2017] Featured on the Science and Supercomputers podcast — "When Data's Deep, Dark Places Need to be Illuminated" — discussing how we used the TACC Wrangler supercomputer to combat human trafficking through deep web analysis
Polar Deep Insights [2017] Presented Polar Deep Insights at the EarthCube All Hands Meeting — mining scientific data from polar repositories using Sparkler and NSF XSEDE supercomputing resources [NSF award]
Apache Software Foundation [2017] Elected as a member of the Apache Software Foundation [announcement]
DARPA Memex [2016] Joined NASA JPL as part of the DARPA Memex program to build search technologies for the deep and dark web
Microsoft Imagine Cup [2009] Top 4 national finalist at the Microsoft Imagine Cup in India
Dell Social Innovation [2009] Worldwide semi-finalist at the Dell Social Innovation Competition at UT Austin

Publications

See Google Scholar for a complete list.

Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining
J. Li, J. P. Gardner, D. Kang, F. Shi, K. Singh, et al. EACL 2026
[paper]
Apple Intelligence Foundation Language Models: Tech Report 2025
E. Li, A. B. L. Larsen, ...K. Singh... et al. Apple, 2025
[paper] [arxiv]
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
B. McKinzie, Z. Gan, ...K. Singh... et al. ECCV 2024
[paper] [arxiv] [wired] [venturebeat]
Apple Intelligence Foundation Language Models
T. Gunter, Z. Wang, ...K. Singh... et al. Apple ML Research, 2024
[paper] [arxiv] [product] [wikipedia]
Mars Target Encyclopedia: Rock and Soil Composition Extracted from the Literature
K. Wagstaff, R. Francis, T. Gowda, Y. Lu, E. Riloff, K. Singh, N. Lanza. AAAI, 2018
[paper] [project]
An Automated Approach for Information and Referral of Social Services Using Machine Learning
M. Sharan, N. K. Ottilingam, C. A. Mattmann, K. Singh, et al. IEEE IRI, 2017
[paper]

Open Source

Apache Nutch — Committer & PMC Member
A highly extensible and scalable open-source web crawler built on Apache Hadoop. Used widely in production search and data-mining pipelines
Apache DRAT — Committer & PMC Member
A Distributed Release Audit Tool that automates license header checking and code compliance analysis across large codebases
Sparkler — Co-Inventor
A distributed web crawler built on Apache Spark that enables large-scale crawling with integrated NLP and ML capabilities

Recognition & Engagement