The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Quesma Releases OTelBench: Independent Benchmark Reveals Frontier LLMs Struggle with Real-World SRE Tasks

New benchmark shows top LLMs achieve only 29% pass rate on OpenTelemetry instrumentation, exposing the gap between coding ability and real-world SRE work.

“OTelBench shows that while LLMs are impressive at generating code snippets, they’re not yet capable of the cross-cutting reasoning required for production engineering.”

— Jacek Migdał, founder of Quesma

WARSAW, POLAND, January 20, 2026 /EINPresswire.com/ — Quesma, Inc. announced the release of OTelBench, the first comprehensive benchmark for evaluating LLMs on OpenTelemetry instrumentation tasks. The open-source dataset tests 14 state-of-the-art models across 23 real-world tasks in 11 programming languages, revealing significant gaps in AI’s ability to handle production-grade Site Reliability Engineering (SRE) work.

While frontier LLMs have demonstrated impressive coding capabilities, the benchmark reveals a stark reality: the best-performing model, Claude Opus 4.5, achieved only a 29% pass rate on OpenTelemetry instrumentation tasks, compared with an 80.9% pass rate on SWE-Bench. This gap highlights a critical distinction between writing code and performing the complex, cross-cutting engineering work required for production systems.

The $1.4 Million Per Hour Problem
Enterprise outages cost an average of $1.4 million per hour, making production visibility mission-critical. Distributed tracing, the gold standard for debugging complex microservices, allows teams to link user actions to every underlying service call. However, implementing this visibility remains difficult, with 39% of organizations citing complexity as their top observability obstacle. OpenTelemetry has emerged as the industry standard with backing from 1,100+ organizations, yet configuring it correctly remains a major source of toil for SRE teams.

Fundamental Limitations Exposed
The benchmark tested models on agentic coding tasks where they were given source code from realistic applications, an interactive Linux terminal, and clear instrumentation objectives. The results revealed several critical failure modes:

Context propagation, passing trace context between services to maintain parent-child span relationships, proved to be an insurmountable barrier for most models. This is particularly concerning because context propagation is fundamental to distributed tracing.
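To make the idea concrete, here is a minimal sketch of context propagation using only the Python standard library. It is an illustration, not OTelBench code and not the OpenTelemetry SDK: the helper names (new_span, inject, extract) and the dictionary-based span are assumptions made for this example. The header layout follows the W3C Trace Context "traceparent" format (version, trace ID, parent span ID, flags).

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span(parent=None):
    """Create a span (hypothetical dict form); it inherits its parent's trace ID."""
    return {
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent["span_id"] if parent else None,
    }


def inject(span, headers):
    """Caller side: write the current span's identity into outgoing headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def extract(headers):
    """Callee side: recover the remote parent from incoming headers, if valid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None  # no usable context: the callee starts a new, unlinked trace
    trace_id, span_id, _flags = match.groups()
    return {"trace_id": trace_id, "span_id": span_id, "parent_id": None}


# Service A handles a user request and calls service B.
span_a = new_span()
outgoing_headers = {}
inject(span_a, outgoing_headers)

# Service B receives the request and continues the same trace.
span_b = new_span(parent=extract(outgoing_headers))

# If the header is never sent or never read, service B's span is orphaned.
orphan_b = new_span(parent=extract({}))
```

In this sketch span_b shares span_a's trace ID and records span_a as its parent, while orphan_b belongs to a separate trace. Real instrumentation has to make the inject and extract steps happen at every service boundary, which is the kind of cross-cutting change the benchmark describes.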

“The backbone of the software industry consists of complex, high-scale production systems with mission-critical reliability, and seasoned engineers are architecting, evolving, and troubleshooting them,” said Jacek Migdał, founder of Quesma. “OTelBench shows that while LLMs are impressive at generating code snippets, they’re not yet capable of the cross-cutting reasoning and sustained problem-solving required for production engineering. This gap matters because many vendors are marketing AI SRE solutions with bold claims but no independent verification. We need benchmarks like this to separate reality from hype.”

Language Ecosystems Matter
Success rates varied dramatically across programming languages, revealing that AI generalization is far weaker than that of human engineers. Models had moderate success with Go and, quite surprisingly, C++. A few tasks were completed for JavaScript, PHP, .NET, and Python. Just a single model solved a single task in Rust. None of the models solved a single task in Swift, Ruby, or, to our biggest surprise, Java (due to a build issue).

Why This Matters for AI Development
OTelBench reveals several reasons why OpenTelemetry instrumentation challenges current LLMs:
– Reliability-critical applications reside in private repositories at companies like Apple, Airbnb, and Netflix, limiting training data.
– Instrumentation requires cross-cutting changes across codebases, rather than sequential additions.
– Some tasks required 50+ commands over 10+ minutes. Models consistently performed worse as tasks lengthened.

Migdał added, “AI SRE in 2026 is what DevOps Anomaly Detection was in 2016—lots of marketing, huge budgets, but lacking independent benchmarks. Just as SWE-Bench became the standard for coding evaluation, we need SRE-style benchmarks to determine what actually works. That’s why we’re releasing OTelBench as open-source: to create a North Star for navigating the AI hype and to enable the community to track real progress.”

A Path Forward
Despite the challenges, the benchmark reveals promising signals. Claude Opus 4.5, GPT-5.2, and Gemini 3 models show capability on specific tasks, with go-otel-microservices-traces reaching a 52% pass rate. With more environments for Reinforcement Learning with Verifiable Rewards, OpenTelemetry instrumentation appears to be a solvable problem for future AI systems.

Until then, organizations requiring distributed tracing across services should expect to write that code themselves—or work alongside AI assistants that understand their limitations.

OTelBench is available today as an open-source project at https://quesma.com/benchmarks/otel/, enabling researchers and practitioners to reproduce results and contribute additional test cases.

Lucie Šimečková
Quesma
press@quesma.com

