LLM Evaluation, Benchmarking & Experimentation Engineer
1 week ago

Job summary
We're looking for an LLM Evaluation, Benchmarking & Experimentation Engineer to rigorously test our proprietary LLM API and build the infrastructure for systematic model improvement.
Job description
Similar jobs
LLM Evaluation and Benchmarking Mentor
1 month ago
I'm seeking a technical mentor to help deepen my understanding of LLM evaluation and benchmarking, with particular attention to high-stakes applications (e.g., mental health), while developing a generalizable framework for reasoning about model performance across domains. · ...
We are seeking a skilled Technical Creative Writing Benchmark Developer to help us benchmark large language models (LLMs) for 30 hours per week. · Mandatory skills: · Creative Writing · Content Writing · Search Engine Optimization · Writing ...
AI/ML Engineer
2 weeks ago
We're looking for an ML engineer to help us evaluate and benchmark language models using proprietary datasets. · Assess how existing models perform against our specialized datasets · ...
Assistant Professor of Physics
1 month ago
Write and refine prompts to guide model behavior in physics contexts. · Evaluate LLM-generated responses to physics-related queries for conceptual accuracy, mathematical correctness, and reasoning quality. · Conduct fact-checking using authoritative public sources and domain kn ...
Quality Assurance Specialist
4 weeks ago
Evaluate LLM-generated responses for effectiveness in answering user queries. Conduct fact-checking using trusted public sources and external tools. Generate high-quality human evaluation data by annotating response strengths, areas for improvement, and factual inaccuracies. · Eval ...
Conversational AI Evaluator
2 weeks ago
Mercor connects creative and technical talent with AI research labs. We are looking for evaluators to assess LLM-generated responses. · ...
Data Annotator
2 weeks ago
Evaluate LLM-generated responses on their ability to effectively answer user queries. Conduct fact-checking using trusted public sources and external tools. · ...
Data Annotator
3 weeks ago
Mercor connects elite creative and technical talent with leading AI research labs. · Bachelor's degree · Significant experience using large language models (LLMs) · ...
Research Physicist
1 month ago
We are seeking a Research Physicist to join our team of elite creative and technical talent. As a Physics AI Evaluator, you will write and refine prompts to guide model behavior in physics contexts. · ...
Data Annotator
1 month ago
Evaluate LLM-generated responses for effectiveness in answering user queries. · Evaluate whether model responses align with expected conversational behavior and system guidelines. ...
LLM Evaluation Specialist
3 weeks ago
Evaluate LLM-generated responses on their ability to effectively answer user queries. Conduct fact-checking using trusted public sources and external tools. Generate high-quality human evaluation data by annotating response strengths, areas for improvement, and factual inaccuraci ...
Content Reviewer
2 weeks ago
Mercor connects elite creative and technical talent with leading AI research labs. · Bachelor's degree · Native speaker or ILR 5/primary fluency (C2 on the CEFR scale) in French · ...
Conversational AI Evaluator
1 month ago
Mercor connects elite creative and technical talent with leading AI research labs. We are looking for an experienced Conversational AI Evaluator to join our team. · Evaluate LLM-generated responses for effectiveness in answering user queries. Conduct fact-checking using tru ...
Mercor connects elite creative and technical talent with leading AI research labs. Headquartered in San Francisco. · ...
Electrical Engineering Consultant
4 weeks ago
Write and refine prompts to guide model behavior in engineering scenarios. · Evaluate LLM-generated responses to engineering-related queries for technical accuracy and applied reasoning. · Conduct fact-checking and verify technical claims using authoritative public sources and do ...
LLM Evaluation Specialist
4 weeks ago
Mercor connects elite creative and technical talent with leading AI research labs. · ...
Senior Financial Analyst | Up to 105/hr Hourly
1 month ago
Mercor connects elite creative and technical talent with leading AI research labs. · Headquartered in San Francisco, our investors include Benchmark, General Catalyst, Peter Thiel, Adam D'Angelo, Larry Summers, and Jack Dorsey. · ...
Linguist
1 month ago
Evaluate LLM-generated responses for effectiveness in answering user queries. · Conduct fact-checking using trusted public sources and external tools. · Generate high-quality human evaluation data by ...
Conversational AI Specialist
1 month ago
Mercor connects elite creative and technical talent with leading AI research labs. · Evaluate LLM-generated responses for effectiveness in answering user queries. · Conduct fact-checking using trusted public sources and external tools. · Bachelor's degree · Native speaker or ILR ...
Financial Reporting Analyst | Up to 105/hr Hourly
1 month ago
Job summary · Write and refine prompts to guide model behavior in financial contexts. · Responsibilities · Evaluate LLM-generated responses to finance-related user queries for accuracy, reasoning quality, and cl ...