About Centific
Centific is a frontier AI data foundry that curates diverse, high-quality data, using our purpose-built technology platforms to empower the Magnificent Seven and our enterprise clients with safe, scalable AI deployment. Our team includes more than 150 PhDs and data scientists, along with more than 4,000 AI practitioners and engineers. We harness the power of an integrated solution ecosystem—comprising industry-leading partnerships and 1.8 million vertical domain experts in more than 230 markets—to create contextual, multilingual, pre-trained datasets; fine-tuned, industry-specific LLMs; and RAG pipelines supported by vector databases. Our zero-distance innovation™ solutions for GenAI can reduce GenAI costs by up to 80% and bring solutions to market 50% faster.
Our mission is to bridge the gap between AI creators and industry leaders by bringing best practices in GenAI to unicorn innovators and enterprise customers. We aim to help these organizations unlock significant business value by deploying GenAI at scale, helping to ensure they stay at the forefront of technological advancement and maintain a competitive edge in their respective markets.
About the Job
Key Responsibilities
- Multimodal Benchmark Design & Development: Design and build evaluation benchmarks for multimodal foundation models across one or more modality combinations (text-image, text-audio, text-video, or cross-modal retrieval). Define task formats, annotation guidelines, scoring criteria, and coverage dimensions.
- Benchmark Execution & Analysis: Run multimodal models against benchmark suites, analyze performance patterns, identify failure modes, and synthesize findings into clear, actionable research summaries and recommendations.
- Metric & Scoring Research: Investigate and compare automated scoring approaches for multimodal outputs—including model-as-judge methods, reference-free metrics, and human alignment studies. Assess tradeoffs in reliability, validity, cost, and scalability.
- Dataset Curation & Quality Assurance: Contribute to the collection, filtering, and quality review of multimodal evaluation data, including annotation scheme design and inter-rater reliability analysis.
- Literature Review & Methodology: Survey the state of the art in multimodal evaluation and benchmarking, identify gaps in existing benchmark coverage, and propose novel evaluation angles grounded in the literature.
- Documentation & Communication: Produce high-quality internal research write-ups, benchmark datasheets, and presentation-ready summaries of your findings for both technical and non-technical audiences.
Primary Focus Areas
This internship centers on multimodal benchmarking. Depending on your background and project fit, you may focus on one or more of:
- Vision-Language Evaluation: Benchmarking image captioning, visual question answering, document understanding, chart reasoning, and image-text alignment.
- Audio & Speech-Language Benchmarking: Evaluating spoken language comprehension, audio captioning, and cross-modal speech-text tasks.
- Video Understanding Benchmarks: Designing temporal reasoning, video QA, and video-text retrieval evaluation suites.
- Cross-Modal Consistency & Robustness: Testing model behavior under modality perturbations, distribution shifts, and adversarial multimodal inputs.
- Automated Multimodal Scoring: Developing and validating judge-model pipelines for evaluating open-ended multimodal generation.
Required Qualifications
- Education: Currently enrolled in an MS or PhD program in Computer Science, Machine Learning, Statistics, AI, Linguistics, or a closely related quantitative field.
- Multimodal or NLP Experience: Coursework, research projects, or hands-on experience with multimodal models, vision-language systems, or NLP. Familiarity with at least one modality beyond text (image, audio, or video).
- Benchmarking or Evaluation Background: Some exposure to model evaluation concepts—benchmark design, metric selection, experimental comparison—through coursework, a research project, or a prior internship.
- Technical Proficiency: Solid Python skills for data processing, model inference, and quantitative analysis. Working experience with PyTorch or Hugging Face Transformers.
- Analytical Rigor: Comfort with basic statistical analysis, including variance, statistical significance, and the limits of benchmark conclusions.
- Communication: Ability to write clearly and present findings in an organized, audience-appropriate way.
Preferred Qualifications
- Hands-on experience running inference or fine-tuning with multimodal foundation models (e.g., LLaVA, GPT-4o, Gemini, Flamingo, or similar).
- Familiarity with existing multimodal benchmarks such as MMBench, MMMU, SEED-Bench, VQAv2, NaturalBench, AudioCaps, or ActivityNet-QA.
- Experience with annotation tools, dataset pipelines, or human evaluation studies.
- Prior research publications, workshop papers, or open-source contributions in multimodal ML, NLP, or evaluation.
- Familiarity with model-as-judge or LLM-based automated evaluation frameworks.
- Interest in measurement validity, inter-rater reliability, and the science of evaluation design.
What You’ll Gain
- Mentorship from senior research scientists and ML engineers working on frontier AI evaluation problems.
- Ownership of a focused, publishable research project with real-world impact on how leading AI models are evaluated.
- Exposure to enterprise AI workflows, customer-facing research consulting, and cross-functional applied research teams.
- Potential co-authorship on publications or open-source benchmark releases upon completion of high-quality work.
- A competitive internship stipend and a flexible hybrid/remote working arrangement.
Hourly Rate: $40/hr
Centific is an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, citizenship status, age, mental or physical disability, medical condition, sex (including pregnancy), gender identity or expression, sexual orientation, marital status, familial status, veteran status, or any other characteristic protected by applicable law. We consider qualified applicants regardless of criminal histories, consistent with legal requirements.