LLM Benchmark for CRM


Salesforce launches the world’s first benchmark for CRM

Salesforce has developed the world's first LLM benchmark for CRM to assess the efficacy of generative AI models for business applications. This benchmark evaluates LLMs for sales and service use cases across accuracy, cost, speed, and trust & safety, based on real CRM data and expert evaluations. What sets this benchmark apart are human evaluations performed by both Salesforce employees and external customers, and the fact that it is built on real-world datasets from Salesforce and customer operations.


Accuracy

This area consists of four metrics: factuality, completeness, conciseness, and instruction following. You can choose between manual or automated evaluation.


Cost

This is a key area to consider, as some use cases require heavy LLM usage. Cost is broken down into a scale of high, medium, and low, and will depend on the use case.


Speed

Some use cases require real-time responses. Speed is measured as the time to receive a full response and will depend on the use case.

Trust and Safety

This is a critical dimension for most businesses, and includes privacy, safety, and general truthfulness. Model fine-tuning and prompt engineering can improve these scores.

To use the Tableau Dashboard, define the filters, including the CRM area (Sales and Service), the type of use case (Generation, Summary), and one or more use cases. For Accuracy results, you can choose “manual” for evaluations performed by people, or “automated.” You can also filter by Provider and LLM (Large Language Model).

Dashboard Specifications


Accuracy

Accuracy is an average of four metrics, each on a 4-point scale:

  • Factuality - Is the response true and free of false information?
  • Instruction Following - Does the response follow the requested instructions in both content and format?
  • Conciseness - Is the response to the point, without repetition or unnecessary elaboration?
  • Completeness - Is the response comprehensive, including all relevant information?

The four-point scale means:

  • 4 - Very Good: As good as it gets given the information. A person with enough time would not do much better.
  • 3 - Good: Done well with a little bit of room for improvement.
  • 2 - Poor: Not usable and has issues.
  • 1 - Very Poor: Not usable with obvious critical issues.

Note: a low accuracy score can often be improved via model fine-tuning and prompt engineering.
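As a minimal sketch, the overall Accuracy score described above can be computed as the mean of the four metric scores. The metric names and sample values below are illustrative, not actual benchmark data:

```python
# Hypothetical example: deriving an overall Accuracy score from the four
# 4-point metrics. The sample scores are made up for illustration.
scores = {
    "factuality": 4,
    "instruction_following": 3,
    "conciseness": 4,
    "completeness": 3,
}

# Accuracy is the average of the four metrics, on the same 4-point scale.
accuracy = sum(scores.values()) / len(scores)
print(f"Accuracy: {accuracy:.2f} / 4")  # Accuracy: 3.50 / 4
```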


Cost

Cost is a key aspect of evaluating LLMs for CRM: some use cases require over half a million calls per year. If a model is at the right cost, its accuracy can be increased if needed. Cost is shown as high, medium, or low.
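To see why call volume dominates this dimension, here is a back-of-the-envelope annual cost estimate. The token counts and per-token price are assumed placeholders, not actual provider pricing:

```python
# Rough annual cost estimate for a high-volume CRM use case.
# All figures below are illustrative assumptions, not real pricing.
calls_per_year = 500_000       # "over half a million calls per year"
tokens_per_call = 1_500        # assumed prompt + completion tokens
price_per_1k_tokens = 0.002    # assumed USD per 1K tokens

annual_cost = calls_per_year * tokens_per_call / 1_000 * price_per_1k_tokens
print(f"Estimated annual cost: ${annual_cost:,.2f}")
```

Even small per-call price differences compound quickly at this volume, which is why the benchmark surfaces cost alongside accuracy.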


Speed

Speed, also known as latency, is critical for real-time use cases, while other use cases run asynchronously; this dimension helps you understand the tradeoffs. The metric is defined as the seconds needed for a full response from the LLM.
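Measuring this metric amounts to timing a complete model call. A minimal sketch, where `call_llm` is a hypothetical stand-in for a real model invocation:

```python
import time

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    time.sleep(0.1)  # simulate model work
    return "summary of the case"

# Latency is the wall-clock seconds needed for the full response.
start = time.perf_counter()
response = call_llm("Summarize this support case: ...")
latency_seconds = time.perf_counter() - start
print(f"Full-response latency: {latency_seconds:.2f}s")
```

For streaming APIs, time-to-first-token is a separate consideration; the benchmark's definition above covers the full response.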


Trust

Trust is the average of four metrics: Safety, Privacy, Truthfulness, and CRM Fairness. Some industries may require higher trust and safety scores. Each metric is reported as a percentage.

  • Safety - How often does the LLM avoid answering unsafe questions?
  • Privacy - How often does the LLM avoid revealing private information?
  • Truthfulness - How accurate is the LLM on general knowledge?
  • CRM Fairness - How unbiased are the results under account and gender perturbations of CRM datasets?
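As with Accuracy, the overall Trust score is an average, here over the four percentage metrics. The values below are made-up examples:

```python
# Illustrative computation of the overall Trust score as the average of
# the four percentage metrics. Sample values are invented for this sketch.
trust_metrics = {
    "safety": 0.95,         # fraction of unsafe questions refused
    "privacy": 0.98,        # fraction of private info withheld
    "truthfulness": 0.88,   # general-knowledge accuracy
    "crm_fairness": 0.91,   # consistency under account/gender perturbations
}

trust = sum(trust_metrics.values()) / len(trust_metrics)
print(f"Trust & Safety: {trust:.1%}")
```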

View the LLM Leaderboard for CRM on Hugging Face
Salesforce Announces the World's First LLM Benchmark for CRM