Open Persian LLM Alignment Leaderboard
- "headers": [
- "Model",
- "Safty",
- "Fairness",
- "Socail-norm",
- "GuardBench_fa",
- "ProhibiBench_fa",
- "SafeBench_fa",
- "FairBench_fa",
- "SocialBench_fa",
- "Advbench_fa",
- "DecodingTrust_fa",
- "Anthropic_fa",
- "Harmbench_fa",
- "Average",
- "Type",
- "Precision",
- "Hub License",
- "#Params (B)",
- "Available on the hub",
- "Model sha"
- "data": [
- [
- "<a target="_blank" href="https://huggingface.co/ava/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">ava/model</a>",
- 95.16,
- 87.63,
- 86.16,
- 85.89,
- 89.35,
- 94.02,
- 94.62,
- 90.67,
- 95.61,
- 87.67,
- 95.87,
- 94.3,
- 87.91,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/Dorna2-Llama3.1-8B-Instruct/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Dorna2-Llama3.1-8B-Instruct/model</a>",
- 75.96,
- 77.73,
- 79,
- 79.22,
- 73.6,
- 63.54,
- 83.64,
- 99.38,
- 84.47,
- 76.49,
- 75.09,
- 74.25,
- 78,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/Ministral-8B-Instruct-2410/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Ministral-8B-Instruct-2410/model</a>",
- 74.61,
- 79.23,
- 45.91,
- 43.37,
- 72.92,
- 65.83,
- 80.37,
- 92.5,
- 80.2,
- 79.46,
- 79.74,
- 53.04,
- 59.34,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/Qwen2.5-3B-Instruct/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Qwen2.5-3B-Instruct/model</a>",
- 69.75,
- 62.66,
- 47.77,
- 46.53,
- 61.92,
- 59.8,
- 60.09,
- 63.12,
- 76.84,
- 62.35,
- 63.2,
- 62.37,
- 55.29,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/Qwen2.5-7B-Instruct/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">Qwen2.5-7B-Instruct/model</a>",
- 77.8,
- 71.16,
- 51.74,
- 50.08,
- 69.28,
- 70.93,
- 75.66,
- 83.75,
- 83.5,
- 71.03,
- 78.02,
- 68,
- 61.09,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/aya-expanse-8b/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">aya-expanse-8b/model</a>",
- 89.89,
- 87.48,
- 85.55,
- 85.37,
- 85.17,
- 85.29,
- 93.87,
- 99.38,
- 93.96,
- 87.31,
- 93.49,
- 84.22,
- 86.36,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/gita/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">gita/model</a>",
- 89.34,
- 81.64,
- 78.14,
- 77.36,
- 84.38,
- 84.22,
- 75.85,
- 76.25,
- 90.04,
- 81.91,
- 94.74,
- 95.33,
- 80.69,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/gemma-2-2b-it/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">gemma-2-2b-it/model</a>",
- 94.67,
- 85.3,
- 43.75,
- 40.67,
- 87.42,
- 95.1,
- 92.99,
- 99.38,
- 95.1,
- 82.94,
- 93.71,
- 94.52,
- 63.2,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/gemma-2-9b-it/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">gemma-2-9b-it/model</a>",
- 97.25,
- 89.72,
- 72.1,
- 70.3,
- 91.51,
- 96.1,
- 97.17,
- 10,
- 97.76,
- 89.58,
- 98.03,
- 98.22,
- 80.71,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/sialk/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">sialk/model</a>",
- 80.21,
- 64.38,
- 81.56,
- 83.05,
- 83.13,
- 93.63,
- 94.86,
- 91.88,
- 78.96,
- 55.1,
- 89.05,
- 47.63,
- 76.71,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "<a target="_blank" href="https://huggingface.co/sialk_dpo/model" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">sialk_dpo/model</a>",
- 58.2,
- 55.71,
- 78.72,
- 78.84,
- 77.99,
- 49.51,
- 79.16,
- 83.12,
- 50.23,
- 51.98,
- 69.05,
- 27.26,
- 69.1,
- "",
- "float16",
- "?",
- 0,
- false,
- "revision on the hub"
- [
- "metadata": null
Developed by MCILAB in collaboration with the Machine Learning Laboratory at Sharif University of Technology, this benchmark is based on the open-source ELAB, which presents a comprehensive evaluation framework for assessing the alignment of Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms. Addressing gaps in existing LLM evaluation frameworks, this benchmark is specifically tailored to Persian linguistic and cultural contexts.
It combines three types of Persian-language benchmarks:
1. Translated datasets (adapted from established English benchmarks)
2. Synthetically generated data (newly created for Persian LLMs)
3. Naturally collected data (reflecting indigenous cultural nuances)
Key Datasets in the Benchmark
The benchmark integrates the following datasets to ensure a robust evaluation of Persian LLMs:
Translated Datasets
- Anthropic-fa
- AdvBench-fa
- HarmBench-fa
- DecodingTrust-fa
Newly Developed Persian Datasets
- ProhibiBench-fa: Evaluates harmful and prohibited content in Persian culture.
- SafeBench-fa: Assesses safety in generated outputs.
- FairBench-fa: Measures bias mitigation in Persian LLMs.
- SocialBench-fa: Evaluates adherence to culturally accepted behaviors.
Naturally Collected Persian Dataset
- GuardBench-fa: A large-scale dataset designed to align Persian LLMs with local cultural norms.
A Unified Framework for Persian LLM Evaluation
By combining these datasets, our work establishes a culturally grounded alignment evaluation framework, enabling systematic assessment across three key aspects:
- Safety: Avoiding harmful or toxic content.
- Fairness: Mitigating biases in model outputs.
- Social Norms: Ensuring culturally appropriate behavior.
This benchmark not only fills a critical gap in Persian LLM evaluation but also provides a standardized leaderboard to track progress in developing aligned, ethical, and culturally aware Persian language models.
Download Dataset
The full dataset is not publicly accessible; however, you can download a sample of 1,500 entries here. The distribution of this sample is as follows:
| Category | Share of sample |
|---|---|
| Fairness | 17% |
| Safety | 8.6% |
| Social norm | 74.4% |
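If the sample is published as a dataset on the Hub, a minimal sketch of loading it and checking this distribution could look like the following. The repository id and the `category` column name are placeholders, not the actual identifiers:

```python
from collections import Counter

from datasets import load_dataset

# Hypothetical repository id; replace with the actual sample dataset repo.
sample = load_dataset("mcilab/persian-alignment-sample", split="train")

# Count how many of the 1,500 entries fall into each alignment category.
counts = Counter(sample["category"])  # "category" is an assumed column name
total = sum(counts.values())
for name, count in counts.most_common():
    print(f"{name}: {count} ({100 * count / total:.1f}%)")
```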
Some good practices before submitting a model
1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # or the specific branch/commit you plan to submit

config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it. Stay posted!
2) Convert your model weights to safetensors
It's a newer format for storing weights that is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the Extended Viewer!
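As a rough sketch of one way to do the conversion, assuming your checkpoint already loads with the AutoClasses above (recent versions of transformers save in safetensors format by default):

```python
from transformers import AutoModel

# Load the existing checkpoint (e.g., pytorch_model.bin)...
model = AutoModel.from_pretrained("your model name")

# ...and re-save it with safetensors serialization enabled,
# then upload the resulting files to your Hub repository.
model.save_pretrained("converted-model", safe_serialization=True)
```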
3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
4) Fill out your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
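For instance, a minimal sketch of the YAML metadata block at the top of a Hub model card (the front matter of the repository's README.md); the field values below are illustrative examples, not requirements:

```yaml
license: apache-2.0        # pick the license that actually applies to your model
language:
  - fa                     # Persian, since this leaderboard targets Persian LLMs
pipeline_tag: text-generation
base_model: meta-llama/Llama-3.1-8B   # hypothetical; set your real base model, if any
```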
In case of model failure
If your model is displayed in the FAILED category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can launch the EleutherAI Harness on your model locally (you can add `--limit` to limit the number of examples per task); a sketch of such a run follows.
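A minimal sketch of a local smoke test, assuming a recent lm-evaluation-harness install (`pip install lm-eval`); the task name below is a placeholder, since the leaderboard's exact task list is not given here:

```bash
# Check that the harness can load and run your model end to end;
# "hellaswag" stands in for whichever task you want to try.
lm_eval --model hf \
  --model_args pretrained=your-model-name,revision=main,dtype=float16 \
  --tasks hellaswag \
  --limit 10  # evaluate only 10 examples per task for a quick check
```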