Reliable Evaluations for LLMs and AI Agents: End-to-End Evaluation Frameworks for LLMs and Autonomous AI Agents

  • Binding: Paperback
  • Product code: 9783032267481

Full Description

This book gives practitioners a concrete, systematic framework for designing evals that make AI systems safe, robust, and customer-ready before they reach production. Drawing on real-world failures, from chatbots that went off the rails to shopping assistants that hallucinated product information, it shows how seemingly small evaluation gaps can cascade into legal, financial, and reputational crises, and how to close those gaps with disciplined testing.

Moving from foundational concepts to advanced practice, Reliable Evaluations for LLMs and AI Agents introduces the four core levers of effective evals: sets, templates, metrics, and evaluators. It then extends these to the unique challenges of autonomous AI agents, where systems perceive, reason, act, and adapt in iterative loops that demand fundamentally different eval approaches. Along the way, it guides readers through benchmark selection, custom eval set design, statistical rigor in metrics, human and LLM-as-a-judge rating strategies, and the infrastructure needed to automate evals at scale.
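
To make the four levers concrete, here is a minimal illustrative sketch in Python. It is not code from the book; all names (EvalCase, TEMPLATE, exact_match, run_evals) are assumptions invented for this example. It shows an eval set of cases, a prompt template, a metric, and an evaluator that wires them together.

```python
# A minimal sketch (not the book's API) of the four eval levers named
# in the description: eval sets, templates, metrics, and evaluators.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class EvalCase:
    """One item in an eval set: an input and a reference answer."""
    query: str
    reference: str

# Lever 1: the eval set -- a fixed collection of cases to test against.
EVAL_SET = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Capital of France?", "Paris"),
]

# Lever 2: the prompt template that turns a case into a model input.
TEMPLATE = "Answer concisely.\nQuestion: {query}\nAnswer:"

# Lever 3: a metric scoring one (output, reference) pair in [0, 1].
def exact_match(output: str, reference: str) -> float:
    return float(output.strip().lower() == reference.strip().lower())

# Lever 4: the evaluator that wires set, template, model, and metric together.
def run_evals(model: Callable[[str], str],
              cases: Iterable[EvalCase],
              metric: Callable[[str, str], float]) -> float:
    scores = [metric(model(TEMPLATE.format(query=c.query)), c.reference)
              for c in cases]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # A canned stub "model" so the sketch runs without any API calls.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    model = lambda prompt: next((a for q, a in canned.items() if q in prompt), "")
    print(f"mean exact-match score: {run_evals(model, EVAL_SET, exact_match):.2f}")
```

In practice each lever varies independently: the same eval set can be scored under different templates or metrics, which is what makes the four levers useful as separate knobs.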

For engineering leaders, applied researchers, data scientists, and product teams shipping LLM- and agent-powered experiences, this volume offers a blueprint for building eval flywheels that continuously improve AI quality. It shows how to progress from ad-hoc checks to production-grade eval systems, align model metrics with real user satisfaction, integrate offline evals with online A/B testing, and design accessible interfaces that democratize rigorous testing across an organization.
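
As one concrete illustration of linking offline evals to online A/B testing, a two-proportion z-test can check whether an offline pass-rate gap between a candidate and a baseline model is statistically meaningful before committing to an online experiment. This sketch and its made-up counts are not from the book; it simply shows the standard statistical technique.

```python
# A hedged sketch: compare two offline pass rates with a two-proportion
# z-test before promoting a candidate model to an online A/B test.
from math import sqrt, erf

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int):
    """Return (z, two-sided p) for H0: the two pass rates are equal."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal-tail approximation
    return z, p

# Illustrative counts: candidate passes 870/1000 cases vs. baseline 840/1000.
z, p = two_proportion_z(870, 1000, 840, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # gate the A/B launch on significance
```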

Contents

1. The Evaluation Imperative: Why Untested LLMs Break Products
2. Why Evals Are Critical: Learning from Practice
3. The Core Levers of LLM Evaluation: Sets, Templates, and Raters
4. Evaluating AI Agents
5. Evaluation Infrastructure
6. The Evaluation Flywheel for Continuous Improvement
