
A Coding Implementation of a Complete Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks


In this tutorial, we develop a comprehensive benchmarking framework to evaluate different types of agentic AI systems on real-world enterprise software tasks. We design a suite of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how various agents, including rule-based, LLM-powered, and hybrid ones, perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent's strengths and trade-offs in enterprise environments. Check out the Full Codes here.

import json
import time
import random
from typing import Dict, List, Any, Callable
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


@dataclass
class Task:
   id: str
   name: str
   description: str
   category: str
   complexity: int
   expected_output: Any


@dataclass
class BenchmarkResult:
   task_id: str
   agent_name: str
   success: bool
   execution_time: float
   accuracy: float
   error_message: str = ""


class EnterpriseTaskSuite:
   def __init__(self):
       self.tasks = self._create_tasks()


   def _create_tasks(self) -> List[Task]:
       return [
           Task("data_transform", "CSV Data Transformation",
                "Transform customer data by aggregating sales", "data_processing", 3,
                {"total_sales": 15000, "avg_order": 750}),
           Task("api_integration", "REST API Integration",
                "Parse API response and extract key metrics", "integration", 2,
                {"status": "success", "active_users": 1250}),
           Task("workflow_automation", "Multi-Step Workflow",
                "Execute data validation -> processing -> reporting", "automation", 4,
                {"validated": True, "processed": 100, "report_generated": True}),
           Task("error_handling", "Error Recovery",
                "Handle malformed data gracefully", "reliability", 3,
                {"errors_caught": 5, "recovery_success": True}),
           Task("optimization", "Query Optimization",
                "Optimize database query performance", "performance", 5,
                {"execution_time_ms": 45, "rows_scanned": 1000}),
           Task("data_validation", "Schema Validation",
                "Validate data against business rules", "validation", 2,
                {"valid_records": 95, "invalid_records": 5}),
           Task("reporting", "Executive Dashboard",
                "Generate KPI summary report", "analytics", 3,
                {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
           Task("integration_test", "System Integration",
                "Test end-to-end integration flow", "testing", 4,
                {"all_systems_connected": True, "latency_ms": 120}),
       ]


   def get_task(self, task_id: str) -> Task:
       return next((t for t in self.tasks if t.id == task_id), None)

We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds several enterprise-relevant tasks such as data transformation, reporting, and integration. This lays the foundation for consistently evaluating different types of agents across these tasks. Check out the Full Codes here.
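As a quick sanity check (this snippet is illustrative and not part of the benchmark itself), we can instantiate the suite and look up one of the tasks defined above to confirm the data structures behave as expected:

# Illustrative usage: instantiate the suite and inspect a single task defined above.
suite = EnterpriseTaskSuite()
sample_task = suite.get_task("data_transform")
print(sample_task.name)             # CSV Data Transformation
print(sample_task.complexity)       # 3
print(sample_task.expected_output)  # {'total_sales': 15000, 'avg_order': 750}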

class BaseAgent:
   def __init__(self, name: str):
       self.name = name


   def execute(self, task: Task) -> Dict[str, Any]:
       raise NotImplementedError


class RuleBasedAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.1, 0.3))
       if task.category == "data_processing":
           return {"total_sales": 15000 + random.randint(-500, 500),
                   "avg_order": 750 + random.randint(-50, 50)}
       elif task.category == "integration":
           return {"status": "success", "active_users": 1250}
       elif task.category == "automation":
           return {"validated": True, "processed": 98, "report_generated": True}
       else:
           return task.expected_output

We introduce the base agent structure and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents. Check out the Full Codes here.

class LLMAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.2, 0.5))
       accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
       result = {}
       for key, value in task.expected_output.items():
           if isinstance(value, (int, float)):
               variation = value * (1 - accuracy_boost)
               result[key] = value + random.uniform(-variation, variation)
           else:
               result[key] = value
       return result


class HybridAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.15, 0.35))
       if task.complexity <= 2:
           return task.expected_output
       else:
           result = {}
           for key, value in task.expected_output.items():
               if isinstance(value, (int, float)):
                   variation = value * 0.03
                   result[key] = value + random.uniform(-variation, variation)
               else:
                   result[key] = value
           return result

We develop two intelligent agent types: the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based approaches improve task accuracy, especially for complex enterprise workflows. Check out the Full Codes here.
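As a small optional check (illustrative only, not part of the benchmark run), we can execute each agent once on a single task and print the raw outputs to see the behavioral differences before running the full suite:

# Illustrative one-off comparison of the three agents on a single task (assumes the classes above are defined).
suite = EnterpriseTaskSuite()
demo_task = suite.get_task("workflow_automation")
for demo_agent in [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]:
    print(demo_agent.name, "->", demo_agent.execute(demo_task))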

class BenchmarkEngine:
   def __init__(self, task_suite: EnterpriseTaskSuite):
       self.task_suite = task_suite
       self.results: List[BenchmarkResult] = []


   def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
       print(f"\n{'='*60}")
       print(f"Benchmarking Agent: {agent.name}")
       print(f"{'='*60}")
       for task in self.task_suite.tasks:
           print(f"\nTask: {task.name} (Complexity: {task.complexity}/5)")
           for i in range(iterations):
               result = self._execute_task(agent, task, i+1)
               self.results.append(result)
               status = "✓ PASS" if result.success else "✗ FAIL"
               print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")

Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log results, and measure key parameters like execution time and accuracy. This creates a systematic and repeatable benchmarking loop. Check out the Full Codes here.

   def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
       start_time = time.time()
       try:
           output = agent.execute(task)
           execution_time = time.time() - start_time
           accuracy = self._calculate_accuracy(output, task.expected_output)
           success = accuracy >= 0.85
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                  execution_time=execution_time, accuracy=accuracy)
       except Exception as e:
           execution_time = time.time() - start_time
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                  execution_time=execution_time, accuracy=0.0, error_message=str(e))


   def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
       if not output:
           return 0.0
       scores = []
       for key, expected_val in expected.items():
           if key not in output:
               scores.append(0.0)
               continue
           actual_val = output[key]
           if isinstance(expected_val, bool):
               scores.append(1.0 if actual_val == expected_val else 0.0)
           elif isinstance(expected_val, (int, float)):
               diff = abs(actual_val - expected_val)
               tolerance = abs(expected_val * 0.1)
               score = max(0, 1 - (diff / (tolerance + 1e-9)))
               scores.append(score)
           else:
               scores.append(1.0 if actual_val == expected_val else 0.0)
       return np.mean(scores) if scores else 0.0

We define the task execution logic and the accuracy computation. We measure each agent's performance by comparing its outputs against expected results using a scoring mechanism. This step ensures our benchmarking process is quantitative and fair, providing insight into how closely agents align with enterprise expectations. Check out the Full Codes here.
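To make the scoring concrete, here is a small worked example (hypothetical numbers, not taken from an actual run) of how the 10% tolerance formula behaves for one numeric field and one boolean field:

# Hypothetical worked example of the tolerance-based accuracy score used above.
expected_val, actual_val = 15000, 15300
tolerance = abs(expected_val * 0.1)                                               # 1500.0
numeric_score = max(0, 1 - abs(actual_val - expected_val) / (tolerance + 1e-9))   # 0.80
boolean_score = 1.0                                                               # exact match on a boolean field
print((numeric_score + boolean_score) / 2)                                        # 0.90 -> clears the 0.85 success threshold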

   def generate_report(self):
       df = pd.DataFrame([asdict(r) for r in self.results])
       print(f"\n{'='*60}")
       print("BENCHMARK REPORT")
       print(f"{'='*60}\n")
       for agent_name in df['agent_name'].unique():
           agent_df = df[df['agent_name'] == agent_name]
           print(f"{agent_name}:")
           print(f"  Success Rate: {agent_df['success'].mean():.1%}")
           print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
           print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
       return df


   def visualize_results(self, df: pd.DataFrame):
       fig, axes = plt.subplots(2, 2, figsize=(14, 10))
       fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight='bold')
       success_rate = df.groupby('agent_name')['success'].mean()
       axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 0].set_title('Success Rate by Agent', fontweight='bold')
       axes[0, 0].set_ylabel('Success Rate')
       axes[0, 0].set_ylim(0, 1.1)
       for i, v in enumerate(success_rate.values):
           axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')
       time_data = df.groupby('agent_name')['execution_time'].mean()
       axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 1].set_title('Average Execution Time', fontweight='bold')
       axes[0, 1].set_ylabel('Time (seconds)')
       for i, v in enumerate(time_data.values):
           axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha='center', fontweight='bold')
       df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
       axes[1, 0].set_title('Accuracy Distribution', fontweight='bold')
       axes[1, 0].set_xlabel('Agent')
       axes[1, 0].set_ylabel('Accuracy')
       plt.sca(axes[1, 0])
       plt.xticks(rotation=15)
       task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
       df['complexity'] = df['task_id'].map(task_complexity)
       complexity_perf = df.groupby(['complexity', 'agent_name'])['accuracy'].mean().unstack()
       complexity_perf.plot(kind='line', ax=axes[1, 1], marker='o', linewidth=2)
       axes[1, 1].set_title('Accuracy by Task Complexity', fontweight='bold')
       axes[1, 1].set_xlabel('Task Complexity')
       axes[1, 1].set_ylabel('Accuracy')
       axes[1, 1].legend(title='Agent', loc='best')
       axes[1, 1].grid(True, alpha=0.3)
       plt.tight_layout()
       plt.show()


if __name__ == "__main__":
   print("Enterprise Software program Benchmarking for Agentic Brokers")
   print("="*60)
   task_suite = EnterpriseTaskSuite()
   benchmark = BenchmarkEngine(task_suite)
   brokers = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
   for agent in brokers:
       benchmark.run_benchmark(agent, iterations=3)
   results_df = benchmark.generate_report()
   benchmark.visualize_results(results_df)
   results_df.to_csv('agent_benchmark_results.csv', index=False)
   print("nResults exported to: agent_benchmark_results.csv")

We generate detailed reports and create visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow.
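As an optional follow-up (not part of the original script), the exported CSV can be reloaded and summarized in a compact table, for example:

# Illustrative post-processing of the exported results file.
import pandas as pd
summary = pd.read_csv('agent_benchmark_results.csv')
print(summary.groupby('agent_name')[['accuracy', 'execution_time', 'success']].mean())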

In conclusion, we implemented a robust, extensible benchmarking system that lets us measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity and how visual analytics highlight performance trends. This process enables us to evaluate current agents and provides a strong foundation for next-generation enterprise AI agents, optimized for reliability and intelligence.




