Reliability-Benchmark

📊 Results

LLM output reliability is critical, particularly for numerical operations and action execution. Upsonic addresses this through a multi-layered reliability system, enabling control agents and verification rounds to ensure output accuracy.

Upsonic is a reliability-focused framework. The results in the table below were generated with a small dataset and measure how reliably each framework transforms JSON keys. No hard-coded changes were made to the frameworks during testing; only the existing features of each framework were enabled and run. All tests used GPT-4o.

Each field was transformed 10 times. The numbers in the field columns are error counts, so a 7 means that 7 of the 10 transformations were incorrect. The table is based on initial results. We are expanding the dataset, and the tests will become more reliable once a larger test set is in place. Reliability benchmark repo: https://github.com/Upsonic/Reliability-Benchmark
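As a rough illustration of how per-field error counts like these could be tallied, the sketch below compares each run's output with an expected record. It is not the benchmark's own code: the function name `count_field_errors`, the field names, and the sample values are all hypothetical, and it uses three runs instead of ten to stay short.

```python
from collections import Counter


def count_field_errors(expected, outputs):
    """Count, per field, how many outputs differ from the expected value.

    A field that is missing from an output counts as an error.
    """
    errors = Counter({field: 0 for field in expected})
    for output in outputs:
        for field, want in expected.items():
            if output.get(field) != want:
                errors[field] += 1
    return dict(errors)


# Hypothetical data: the field names and values are placeholders.
expected = {"asin_code": "B000000000", "warranty_time": 24}
outputs = [
    {"asin_code": "B000000000", "warranty_time": 24},  # fully correct
    {"asin_code": "B000000000", "warranty_time": 12},  # wrong warranty
    {"warranty_time": 24},                             # ASIN missing
]

field_errors = count_field_errors(expected, outputs)
print(field_errors)
```

Each value in the resulting dictionary plays the role of one cell in a framework's row of the results table.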

Framework	Reliability Score (%)	ASIN Code	HS Code	CIS Code	Marketing URL	Usage URL	Warranty Time	Policy Link	Policy Description
Upsonic	99.3	0	1	0	0	0	0	0	0
CrewAI	87.5	0	3	2	1	1	0	1	2
LangGraph	6.3	10	10	7	10	8	10	10	10
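The README does not state how the reliability score is derived from the error counts. One plausible reading, shown here only as an assumption, is the share of correct transformations across all 8 fields and 10 runs per field. The function name `reliability_score` is hypothetical; the error counts are copied from the table.

```python
RUNS_PER_FIELD = 10  # stated in the README: 10 runs per field

# Error counts per field, copied from the results table.
error_counts = {
    "Upsonic": [0, 1, 0, 0, 0, 0, 0, 0],
    "CrewAI": [0, 3, 2, 1, 1, 0, 1, 2],
    "LangGraph": [10, 10, 7, 10, 8, 10, 10, 10],
}


def reliability_score(errors, runs_per_field=RUNS_PER_FIELD):
    """Assumed formula: percentage of runs that produced no error."""
    total_runs = len(errors) * runs_per_field
    return 100.0 * (total_runs - sum(errors)) / total_runs


scores = {name: reliability_score(errs) for name, errs in error_counts.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}%")
```

Under this assumption the formula gives 87.5 for CrewAI and 6.25 for LangGraph, which agrees with the table after rounding. For Upsonic it gives 98.75 rather than the reported 99.3, so the published score may be computed or weighted differently.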

How can I run the benchmark?

Clone the repository

git clone https://github.com/Upsonic/Reliability-Benchmark.git

Create an environment and install the dependencies

pip install uv
uv venv
uv sync

Set your environment variables in .env

# for Upsonic
AZURE_OPENAI_ENDPOINT="https://**.com/"
AZURE_OPENAI_API_VERSION="****-**-**"
AZURE_OPENAI_API_KEY="***"
# for CrewAI
AZURE_API_KEY="***"
AZURE_API_BASE="https://**.com/"
AZURE_API_VERSION="****-**-**"
# for LangGraph
OPENAI_API_VERSION="****-**-**"
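Before starting a run, it can help to confirm that every variable listed above is set. The following is a minimal sketch and not part of the repository: the helper `missing_vars` is hypothetical, and it only inspects the mapping it is given (here the process environment), so it assumes the values from .env have already been exported or loaded.

```python
import os

# Variable names taken from the .env template above, grouped by framework.
REQUIRED_VARS = {
    "Upsonic": [
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_VERSION",
        "AZURE_OPENAI_API_KEY",
    ],
    "CrewAI": ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"],
    "LangGraph": ["OPENAI_API_VERSION"],
}


def missing_vars(env):
    """Return {framework: [names]} for variables that are unset or empty."""
    missing = {}
    for framework, names in REQUIRED_VARS.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            missing[framework] = absent
    return missing


# Report anything missing from the current process environment.
for framework, names in missing_vars(os.environ).items():
    print(f"{framework}: missing {', '.join(names)}")
```

The check prints nothing when all variables are present, so it can be run as a quick preflight step.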

Run the benchmark

uv run run_benchmark.py

Compare the results

streamlit run compare_results.py
