Model Quality
- Quality Index: A single value that aggregates multiple metrics (e.g., BLEU, ROUGE, SuperGLUE, BIG-bench, CIDEr, METEOR) to represent overall model performance.
- Error Rate: The percentage of model responses that are incorrect or invalid. Human evaluation helps define and generate this metric.
- Latency: The time delay between when a query is submitted to the model and when the response is returned. It depends on the model's parallel-processing capabilities, the model architecture, and the deployment infrastructure and its availability.
- Accuracy Range: The baseline precision and accuracy thresholds the model is expected to meet. For this metric, it is often helpful to establish a red team to analyze and challenge your model.
- Safety Score: The number of harmful or sensitive categories and topics, as defined by the business, against which the model's outputs are evaluated.
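A minimal sketch of how the first two metrics above might be computed. The metric names, weights, and normalization to a 0-1 range are illustrative assumptions, not a standard formula:

```python
# Hypothetical Quality Index: a weighted average of metric scores that have
# already been normalized to the 0-1 range. Weights reflect how much the
# business values each metric; both are assumptions for illustration.

def quality_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Aggregate several normalized metric scores into one value."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

def error_rate(judgments: list[bool]) -> float:
    """Fraction of responses judged incorrect or invalid (True = error),
    e.g. from human evaluation."""
    return sum(judgments) / len(judgments) if judgments else 0.0

scores = {"bleu": 0.42, "rouge_l": 0.55, "meteor": 0.61}
weights = {"bleu": 1.0, "rouge_l": 1.0, "meteor": 2.0}
print(quality_index(scores, weights))          # weighted toward METEOR
print(error_rate([True, False, False, False]))
```

The weighting step is where teams differ most in practice; the point is simply that the Quality Index collapses many benchmark scores into one trackable number.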
System Quality
- Data Relevance: The degree to which the data used is necessary for the current model and project. Be warned: extraneous data can introduce biases and inefficiencies that lead to harmful outputs.
- Data and AI Asset Reusability: The percentage of your data and AI assets that are discoverable and reusable.
- Throughput: The volume of information a gen AI system can handle in a specific period of time. Calculating this metric involves understanding the processing speed of the model, efficiency at scale, parallelization, and optimized resource utilization.
- System Latency: The time it takes the system to respond with an answer, including any ingress- or egress-based networking delays, data latency, model latency, and so on.
- Integration and Backward Compatibility: The availability of APIs for upstream and downstream systems to integrate directly with gen AI models. Also consider whether the next model version will impact systems built on top of existing models (beyond prompt engineering alone).
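The throughput and system-latency metrics above can be sketched from request timing data. The timestamps here are synthetic, and the nearest-rank percentile method is one of several reasonable choices; in practice these numbers would come from request logs or tracing:

```python
# Illustrative calculations for system latency percentiles and throughput.
# Latencies are end-to-end times in milliseconds (network + data + model).

def latency_percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile of observed end-to-end latencies."""
    ordered = sorted(latencies_ms)
    idx = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[idx]

def throughput(request_count: int, window_seconds: float) -> float:
    """Requests handled per second over a measurement window."""
    return request_count / window_seconds

latencies = [120.0, 95.0, 300.0, 150.0, 110.0, 980.0, 130.0, 105.0]
print(latency_percentile(latencies, 50))  # median latency in ms
print(latency_percentile(latencies, 95)) # tail latency dominated by slow outliers
print(throughput(len(latencies), 2.0))   # requests per second
```

Reporting a tail percentile (p95/p99) alongside the median matters here: a single slow outlier like the 980 ms request above barely moves the median but dominates the tail.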
Business Impact
- Adoption rate: The number of active users over the lifetime of a campaign or project divided by the total intended audience, expressed as a percentage.
- Frequency of use: The number of times queries are sent per user on a daily, weekly, or monthly basis.
- Session length: The average duration of continuous interactions.
- Queries per session: The number of queries users submit per session.
- Query length: The average number of words or characters per query.
- Abandonment rate: The percentage of sessions ended before users find answers.
- User satisfaction: Surveys assessing user experience or other customer satisfaction metrics, such as Net Promoter Score (NPS).
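Several of the engagement metrics above can be derived from one session log. This is a hedged sketch over a toy dataset; the field names ("user_id", "queries", "abandoned") and the audience size are illustrative assumptions, not a defined schema:

```python
# Toy session log: each record is one continuous interaction with the system.
sessions = [
    {"user_id": "a", "queries": 5, "duration_min": 12.0, "abandoned": False},
    {"user_id": "a", "queries": 2, "duration_min": 3.5,  "abandoned": True},
    {"user_id": "b", "queries": 7, "duration_min": 20.0, "abandoned": False},
]
intended_audience = 10  # total users the project was meant to reach (assumed)

active_users = {s["user_id"] for s in sessions}
adoption_rate = len(active_users) / intended_audience          # active / intended
queries_per_session = sum(s["queries"] for s in sessions) / len(sessions)
avg_session_length = sum(s["duration_min"] for s in sessions) / len(sessions)
abandonment_rate = sum(s["abandoned"] for s in sessions) / len(sessions)

print(f"adoption={adoption_rate:.0%} queries/session={queries_per_session:.1f}")
print(f"avg length={avg_session_length:.1f} min, abandonment={abandonment_rate:.0%}")
```

Note that adoption counts distinct users while the per-session metrics average over sessions, which is why the same user can appear once in the first and twice in the second.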
Source: https://cloud.google.com/transform/kpis-for-gen-ai-why-measuring-your-new-ai-is-essential-to-its-success