# Tool Decathlon

<div className="homepage-container">
  <div className="hero-section">
    <div className="hero-content">
      <h1 className="hero-title">The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution</h1>

      <p className="hero-description">
        Real-world language agents must handle complex, multi-step workflows across diverse applications.
        The Tool Decathlon (Toolathlon for short) is a benchmark for language agents that offers diverse
        applications and tools, realistic environment setup, and reliable execution-based evaluation.
        Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as
        Google Calendar and Notion to professional applications like WooCommerce, Kubernetes, and BigQuery.
        It includes 108 manually sourced or crafted tasks that require interacting with multiple applications
        and take approximately 20 interaction turns on average to complete. Each task is strictly verifiable
        through dedicated evaluation scripts.
      </p>

      <div className="hero-actions">
        <a href="https://github.com/hkust-nlp/Toolathlon" className="btn-github">
          <svg width="20" height="20" viewBox="0 0 24 24" fill="currentColor">
            <path d="M12 0c-6.626 0-12 5.373-12 12 0 5.302 3.438 9.8 8.207 11.387.599.111.793-.261.793-.577v-2.234c-3.338.726-4.033-1.416-4.033-1.416-.546-1.387-1.333-1.756-1.333-1.756-1.089-.745.083-.729.083-.729 1.205.084 1.839 1.237 1.839 1.237 1.07 1.834 2.807 1.304 3.492.997.107-.775.418-1.305.762-1.604-2.665-.305-5.467-1.334-5.467-5.931 0-1.311.469-2.381 1.236-3.221-.124-.303-.535-1.524.117-3.176 0 0 1.008-.322 3.301 1.23.957-.266 1.983-.399 3.003-.404 1.02.005 2.047.138 3.006.404 2.291-1.552 3.297-1.23 3.297-1.23.653 1.653.242 2.874.118 3.176.77.84 1.235 1.911 1.235 3.221 0 4.609-2.807 5.624-5.479 5.921.43.372.823 1.102.823 2.222v3.293c0 .319.192.694.801.576 4.765-1.589 8.199-6.086 8.199-11.386 0-6.627-5.373-12-12-12z" />
          </svg>

          GitHub
        </a>

        <a href="https://arxiv.org/abs/2510.25726v2" className="btn-arxiv">
          <svg width="20" height="20" viewBox="0 0 24 24" fill="currentColor">
            <path fill="#bdb9b4" d="m6.565 9.368 2.266 2.738 6.674-7.84c.353-.47.52-.717.353-1.117a1.218 1.218 0 0 0-1.061-.748.953.953 0 0 0-.712.262Z" />

            <path fill="#b31b1b" d="M12.541 10.677 1.935.503a1.413 1.413 0 0 0-.834-.5 1.09 1.09 0 0 0-1.027.66c-.167.4-.047.681.319 1.206l8.44 10.242-6.282 7.716a1.336 1.336 0 0 0-.323 1.3 1.114 1.114 0 0 0 1.04.69.992.992 0 0 0 .748-.365l8.519-7.92a1.924 1.924 0 0 0 .006-2.855Z" />

            <path fill="#bdb9b4" d="M17.336 22.364 8.811 12.089 6.546 9.352l-1.389 1.254a2.063 2.063 0 0 0 0 2.965L15.969 23.99a.925.925 0 0 0 .742.282 1.039 1.039 0 0 0 .953-.667 1.261 1.261 0 0 0-.328-1.241Z" />
          </svg>

          arXiv
        </a>

        <a href="https://huggingface.co/datasets/hkust-nlp/Toolathlon-Trajectories" className="btn-huggingface">
          <svg fill="none" height="20" width="20" xmlns="http://www.w3.org/2000/svg" viewBox="24.974000000000036 32.14499999999995 206.94599999999997 191.3099999999999">
            <path d="M230.721 172.7a18.97 18.97 0 0 0-2.575-5.692c.25-.917.441-1.849.568-2.791.829-5.976-1.243-11.447-5.147-15.68-2.115-2.312-4.382-3.839-6.783-4.776a90.602 90.602 0 0 0 2.377-20.568c0-3.163-.179-6.261-.479-9.313a105.88 105.88 0 0 0-.567-4.56 90.985 90.985 0 0 0-3.051-13.21 91.22 91.22 0 0 0-3.054-8.374 91.93 91.93 0 0 0-6.041-11.754 81.369 81.369 0 0 0-4.907-7.262 68.979 68.979 0 0 0-2.704-3.446 90.535 90.535 0 0 0-9.033-9.486 69.938 69.938 0 0 0-3.315-2.862 81.76 81.76 0 0 0-3.424-2.704 96.056 96.056 0 0 0-7.262-4.907c-13.781-8.37-29.942-13.17-47.215-13.17-50.292 0-91.052 40.762-91.052 91.051-.002 7.012.81 14 2.42 20.824-2.16.938-4.23 2.4-6.15 4.515-3.903 4.231-5.976 9.682-5.147 15.658.126.949.315 1.889.567 2.813a19.006 19.006 0 0 0-2.573 5.694c-1.2 4.561-.805 8.674.72 12.278-1.658 4.71-1.244 9.726.915 14.087 1.57 3.185 3.817 5.649 6.587 7.851 3.293 2.618 7.415 4.842 12.387 6.976 5.932 2.53 13.173 4.907 16.466 5.779 8.506 2.202 16.662 3.598 24.928 3.666 11.777.109 21.919-2.66 29.18-9.747a88.02 88.02 0 0 0 10.752.654 93.752 93.752 0 0 0 11.358-.715c7.244 7.132 17.425 9.926 29.245 9.814 8.265-.066 16.421-1.462 24.905-3.667 3.315-.872 10.553-3.249 16.488-5.779 4.972-2.137 9.094-4.361 12.409-6.975 2.749-2.203 4.994-4.666 6.565-7.851 2.181-4.362 2.573-9.378.938-14.088 1.51-3.604 1.903-7.726.704-12.283zm-8.44 11.973c1.671 3.171 1.778 6.754.304 10.091-2.236 5.057-7.79 9.041-18.577 13.318-6.708 2.66-12.85 4.361-12.904 4.376-8.872 2.301-16.896 3.47-23.842 3.47-11.502 0-20.061-3.174-25.489-9.442a85.461 85.461 0 0 1-27.747.158c-5.435 6.164-13.945 9.284-25.35 9.284-6.947 0-14.97-1.169-23.843-3.47-.054-.015-6.194-1.716-12.904-4.376-10.786-4.277-16.342-8.258-18.577-13.318-1.474-3.337-1.367-6.92.304-10.091.154-.295.32-.582.497-.86a12.803 12.803 0 0 1-1.728-10.341c.664-2.523 2.035-4.621 3.897-6.128a12.75 12.75 0 0 1-1.73-4.822c-.536-3.714.697-7.422 3.47-10.446 2.16-2.353 5.213-3.648 8.593-3.648h.09a84.45 84.45 0 0 1-3.832-25.235c0-46.671 37.836-84.51 84.514-84.51 
46.677 0 84.513 37.835 84.513 84.51a84.398 84.398 0 0 1-3.859 25.299c.408-.04.808-.06 1.201-.061 3.38 0 6.434 1.295 8.592 3.648 2.773 3.021 4.007 6.732 3.47 10.446a12.757 12.757 0 0 1-1.729 4.822c1.862 1.507 3.234 3.605 3.897 6.128a12.803 12.803 0 0 1-1.728 10.341c.177.275.345.562.497.857z" fill="#fff" />

            <path d="M221.784 183.816a12.798 12.798 0 0 0 1.728-10.341c-.664-2.523-2.036-4.621-3.897-6.128a12.74 12.74 0 0 0 1.729-4.822c.537-3.714-.696-7.422-3.47-10.446-2.158-2.353-5.212-3.648-8.592-3.648-.393 0-.793.021-1.201.061a84.415 84.415 0 0 0 3.852-25.297c0-46.672-37.836-84.51-84.509-84.51-46.674 0-84.514 37.834-84.514 84.51a84.46 84.46 0 0 0 3.832 25.235h-.09c-3.38 0-6.433 1.294-8.592 3.647-2.773 3.021-4.007 6.733-3.47 10.446a12.762 12.762 0 0 0 1.73 4.823c-1.862 1.506-3.234 3.604-3.898 6.127a12.808 12.808 0 0 0 1.73 10.343c-.178.278-.342.565-.497.86-1.67 3.171-1.778 6.754-.303 10.091 2.236 5.057 7.79 9.041 18.577 13.318 6.707 2.66 12.85 4.361 12.904 4.376 8.872 2.301 16.896 3.47 23.842 3.47 11.406 0 19.916-3.12 25.351-9.284a85.49 85.49 0 0 0 27.747-.158c5.428 6.268 13.987 9.442 25.489 9.442 6.946 0 14.97-1.169 23.841-3.47.055-.015 6.195-1.716 12.905-4.376 10.787-4.277 16.342-8.261 18.577-13.318 1.474-3.337 1.367-6.92-.304-10.091-.152-.297-.32-.585-.497-.86zm-111.647 13.181a34.659 34.659 0 0 1-1.502 2.394c-1.405 2.057-3.253 3.629-5.398 4.797-4.1 2.236-9.29 3.017-14.562 3.017-8.329 0-16.867-1.949-21.652-3.19-.236-.061-29.334-8.28-25.65-15.276.62-1.177 1.64-1.647 2.925-1.647 5.187 0 14.632 7.724 18.69 7.724.908 0 1.548-.386 1.809-1.328 1.73-6.204-26.293-8.812-23.933-17.796.416-1.59 1.546-2.236 3.134-2.236 6.858-.001 22.25 12.06 25.469 12.06.247 0 .424-.073.52-.225.014-.023.028-.045.041-.069 1.511-2.495.644-4.309-9.707-10.649l-.994-.605c-11.391-6.894-19.386-11.043-14.84-15.993.524-.571 1.266-.824 2.167-.824 1.068 0 2.36.357 3.785.957 6.016 2.537 14.354 9.456 17.837 12.473a146 146 0 0 1 1.633 1.441s4.41 4.586 7.076 4.586c.614 0 1.135-.242 1.488-.84 1.891-3.188-17.563-17.93-18.66-24.013-.744-4.121.522-6.209 2.862-6.209 1.113 0 2.47.474 3.97 1.425 4.65 2.951 13.628 18.379 16.915 24.381 1.102 2.011 2.983 2.861 4.678 2.861 3.363 0 5.992-3.343.308-7.591-8.543-6.392-5.545-16.84-1.468-17.483.174-.028.35-.042.525-.042 3.708 0 5.343 6.389 5.343 6.389s4.794 12.038 
13.029 20.267c7.472 7.469 8.516 13.598 4.162 21.244zm26.629 1.41l-.427.051-.728.083c-.383.04-.767.078-1.152.113l-.375.034-.343.029-.486.039-.537.039-.536.035-.119.008c-.14.008-.28.017-.422.024l-.179.01c-.166.009-.332.017-.5.024l-.581.025-.527.018-.352.01h-.179c-.11 0-.219.006-.329.007h-.174c-.11 0-.219 0-.329.005l-.448.006h-.625c-.491 0-.981-.005-1.469-.015l-.396-.009c-.113 0-.226-.005-.337-.009l-.42-.012-.521-.02-.47-.021-.121-.005-.447-.023c-.125-.007-.248-.013-.372-.022l-.289-.017a79.64 79.64 0 0 1-1.089-.076l-.38-.031c-.16-.012-.32-.027-.479-.041-.187-.016-.374-.034-.561-.052a59.687 59.687 0 0 1-.939-.095h-.015c4.57-10.195 2.259-19.717-6.976-28.944-6.057-6.049-10.086-14.981-10.922-16.942-1.692-5.805-6.17-12.258-13.607-12.258-.629 0-1.257.05-1.878.148-3.258.513-6.106 2.388-8.138 5.21-2.196-2.731-4.33-4.902-6.26-6.128-2.91-1.845-5.814-2.781-8.643-2.781-3.531 0-6.687 1.45-8.887 4.08l-.056.067c-.042-.173-.082-.346-.123-.52l-.005-.023a73.685 73.685 0 0 1-1.054-5.412c0-.012 0-.024-.006-.036-.022-.137-.042-.275-.063-.412-.062-.406-.12-.813-.173-1.22-.024-.185-.05-.37-.073-.555l-.068-.555c-.022-.185-.04-.353-.06-.529l-.006-.044c-.08-.72-.15-1.44-.21-2.162l-.022-.277-.035-.472c-.01-.129-.02-.259-.027-.389 0-.031-.005-.061-.006-.09a52.476 52.476 0 0 1-.065-1.088c-.01-.189-.02-.377-.028-.567l-.02-.496-.005-.15-.016-.457-.01-.389c0-.155-.008-.31-.01-.465-.003-.155-.007-.325-.008-.489-.002-.164 0-.326-.005-.489-.004-.164 0-.327 0-.49 0-41.853 33.93-75.784 75.788-75.784 41.856 0 75.786 33.93 75.786 75.784v.979c0 .163-.005.327-.008.489 0 .135-.006.268-.01.405 0 .12-.005.241-.009.357 0 .153-.009.306-.014.459v.012l-.021.531c-.007.155-.013.311-.021.466l-.005.11-.027.496a80.723 80.723 0 0 1-.241 3.184v.013c-.017.174-.034.348-.053.522l-.045.411-.089.804-.051.407-.063.479c-.023.174-.046.349-.072.522-.026.195-.055.389-.084.583l-.069.459-.082.52c-.028.173-.058.345-.09.517-.033.173-.059.345-.089.517-.06.344-.123.688-.189 1.031-.101.513-.204 1.025-.31 
1.537l-.11.507c-.036.169-.075.339-.113.508-2.133-2.073-4.958-3.202-8.073-3.202-2.827 0-5.734.935-8.643 2.78-1.93 1.226-4.063 3.398-6.26 6.128-2.035-2.822-4.883-4.697-8.139-5.21a12.05 12.05 0 0 0-1.878-.148c-7.439 0-11.914 6.453-13.607 12.258-.84 1.961-4.87 10.893-10.932 16.951-9.229 9.198-11.557 18.677-7.059 28.83zm78.241-20.409l-.03.089a5.416 5.416 0 0 1-.263.587c-.075.14-.156.276-.244.408-.167.249-.35.487-.549.711-.046.052-.09.104-.142.155a7.853 7.853 0 0 1-.22.227c-1.346 1.334-3.398 2.504-5.718 3.577-.263.119-.53.238-.799.358l-.268.119c-.179.079-.358.157-.546.234-.179.078-.365.156-.551.232l-.558.23c-1.305.537-2.642 1.049-3.946 1.554l-.558.217-.551.216c-.367.143-.729.286-1.085.429l-.531.214-.522.213-.256.108c-.171.071-.338.142-.505.213-3.837 1.647-6.598 3.322-6.018 5.4.016.059.034.115.054.17.052.154.123.299.212.436.052.081.112.158.179.228.682.709 1.923.597 3.488.034.22-.081.439-.165.656-.253l.136-.056c.358-.152.737-.322 1.124-.506.097-.046.195-.09.293-.141 1.914-.936 4.083-2.196 6.235-3.343a55.812 55.812 0 0 1 2.618-1.325c2.038-.959 3.954-1.639 5.494-1.639.723 0 1.361.148 1.893.488l.089.059c.334.235.614.537.823.887.041.067.081.138.12.211.761 1.445.124 2.941-1.367 4.408-1.431 1.409-3.657 2.79-6.187 4.068-.188.095-.376.19-.567.283-7.53 3.698-17.391 6.483-17.528 6.518-2.628.681-6.386 1.575-10.62 2.244l-.626.098-.103.015c-.474.072-.949.139-1.425.201-.483.065-.971.124-1.462.179l-.09.01a68.56 68.56 0 0 1-5.358.406h-.026c-.648.023-1.295.035-1.943.035h-.747a46.38 46.38 0 0 1-2.959-.134c-.023 0-.048 0-.071-.006a39.666 39.666 0 0 1-2.149-.231 24.641 24.641 0 0 1-.715-.107 57.653 57.653 0 0 1-.725-.121l-.329-.062-.025-.005a26.107 26.107 0 0 1-1.036-.219c-.2-.045-.399-.089-.596-.143l-.119-.03c-.098-.024-.193-.05-.29-.076l-.053-.014-.308-.09c-.112-.031-.224-.065-.336-.098l-.039-.011-.291-.089c-.11-.034-.22-.07-.329-.106l-.268-.089-.197-.069c-.19-.067-.379-.136-.566-.208l-.178-.07-.147-.058a23.243 23.243 0 0 1-.845-.358l-.185-.09-.031-.014c-.066-.031-.131-.062-.197-.089a17.142 
17.142 0 0 1-.384-.191l-.039-.019-.184-.097a15.943 15.943 0 0 1-.961-.546l-.172-.106a8.547 8.547 0 0 1-.256-.164l-.224-.148-.241-.166-.144-.103c-.152-.108-.301-.22-.447-.335l-.233-.179a15.173 15.173 0 0 1-.276-.228c-.077-.063-.152-.129-.227-.195l-.006-.005c-.081-.071-.16-.142-.239-.215a11.85 11.85 0 0 1-.232-.216l-.009-.009a8.569 8.569 0 0 1-.235-.232c-.077-.078-.156-.156-.231-.236-.075-.079-.152-.16-.226-.243-.074-.082-.142-.157-.212-.238l-.023-.027a9.34 9.34 0 0 1-.201-.238 12.442 12.442 0 0 1-.416-.525c-.135-.18-.267-.364-.396-.551l-.123-.184c-.164-.24-.324-.482-.479-.728a18.6 18.6 0 0 1-.339-.536c-.071-.113-.139-.227-.207-.339l-.028-.046c-.065-.11-.129-.218-.191-.327a3.17 3.17 0 0 1-.102-.179c-.033-.062-.071-.125-.106-.188l-.057-.099-.035-.064c-.067-.12-.133-.241-.197-.363-.03-.054-.059-.108-.09-.16l-.089-.173-.09-.171c-.225-.45-.438-.906-.638-1.368l-.071-.169c-.046-.113-.089-.225-.135-.336-.022-.054-.044-.107-.063-.161a16.731 16.731 0 0 1-.776-2.639c-.011-.055-.022-.11-.031-.163a11.54 11.54 0 0 1-.127-.806c-.008-.053-.014-.106-.02-.159l-.017-.162a11.337 11.337 0 0 1-.049-.638c0-.054-.005-.108-.007-.16a8.436 8.436 0 0 1-.008-.318c-.056-4.273 2.106-8.381 6.729-13.002 8.235-8.227 13.029-20.266 13.029-20.266s.129-.505.397-1.232c.037-.101.076-.205.12-.314.156-.407.332-.807.527-1.197l.039-.075c.166-.332.348-.656.544-.971.046-.073.09-.145.141-.218.147-.217.302-.429.465-.634.089-.111.186-.221.283-.328.039-.042.077-.084.118-.124.477-.493 1.022-.895 1.639-1.109l.078-.026c.052-.017.104-.033.157-.048.061-.016.122-.03.185-.043l.029-.006c.13-.026.262-.043.395-.052h.011c.069 0 .139-.007.21-.007.089 0 .172 0 .259.009.09.008.179.018.269.032.742.118 1.448.56 2.056 1.242.231.26.439.54.621.836.12.192.233.395.34.609.043.089.084.171.124.259a7.8 7.8 0 0 1 .28.691c.195.55.342 1.116.439 1.692.084.505.134 1.015.15 1.526.008.273.008.55 0 .829a11.936 11.936 0 0 1-.787 3.792c-.042.111-.089.223-.134.335a9.264 9.264 0 0 
1-.302.665c-.081.165-.166.331-.258.496-.06.11-.123.22-.186.33-.161.274-.335.546-.522.817l-.113.162a13.628 13.628 0 0 1-1.472 1.728 15.425 15.425 0 0 1-1.699 1.47c-.602.446-1.167.94-1.689 1.477-1.503 1.577-1.853 2.969-1.515 4.024.054.166.125.327.211.479.101.174.221.334.357.48l.053.055.054.054c.054.052.111.103.172.153l.06.048c.145.112.299.212.46.3.047.025.089.05.142.074.174.085.353.158.537.217.051.017.102.032.154.048l.065.017.09.024.077.019.084.018.083.017.079.013c.058.01.118.02.178.027l.057.009.104.01.064.007.105.007h.062l.11.006h.346l.099-.006.114-.007.139-.013.13-.015c.03 0 .06-.008.09-.014.4-.058.792-.164 1.167-.316l.159-.067a4.819 4.819 0 0 0 .772-.421c.229-.15.445-.318.647-.503.048-.043.095-.089.141-.133.023-.022.045-.043.067-.067.044-.045.089-.089.133-.138.32-.352.597-.74.825-1.157a228.925 228.925 0 0 1 6.151-10.514l.294-.471.297-.471c.148-.239.298-.474.447-.708l.15-.234c.498-.78 1.004-1.555 1.519-2.324l.305-.456c.612-.907 1.222-1.789 1.827-2.627l.301-.415a56.224 56.224 0 0 1 2.054-2.661l.282-.338c.047-.056.09-.112.141-.166.093-.11.186-.217.277-.321.046-.053.089-.105.138-.157l.268-.302.134-.147c.135-.145.267-.284.397-.417.09-.09.173-.179.259-.263a10.5 10.5 0 0 1 1.669-1.386l.14-.09c.134-.09.273-.174.415-.25 2.364-1.342 4.321-1.441 5.448-.314.682.682 1.06 1.813 1.039 3.387 0 .069 0 .139-.005.211v.077c0 .072-.006.144-.012.217 0 .09-.01.179-.019.269-.009.089-.014.157-.023.237 0 .022-.004.045-.008.069-.006.069-.015.14-.025.211 0 .021 0 .043-.008.065a5.354 5.354 0 0 1-.041.283c-.011.09-.026.174-.042.262l-.026.149a4.007 4.007 0 0 1-.1.42 6.16 6.16 0 0 1-.283.758 10.99 10.99 0 0 1-.514.987c-.104.178-.211.353-.322.526-.114.179-.234.36-.358.543-.316.452-.644.895-.985 1.328l-.156.197a50.56 50.56 0 0 1-1.722 2.035l-.187.21c-.252.281-.508.564-.77.848l-.197.214c-.131.143-.268.286-.4.43-.131.143-.268.288-.406.433l-.411.433-.417.436-.42.436c-.282.292-.565.584-.85.876-4.055 4.159-8.327 8.304-9.773 10.888a5.365 5.365 0 0 0-.262.519c-.206.47-.292.872-.233 1.197a.914.914 0 0 0 
.111.303c.081.141.18.271.295.387.053.052.109.1.168.144.298.212.657.321 1.023.311h.114l.117-.009.117-.013.097-.014c.013-.002.027-.004.04-.008l.089-.017.023-.005.099-.021.035-.009.104-.028c.035-.01.083-.023.125-.037.176-.053.348-.116.517-.188.089-.036.176-.075.262-.117.045-.02.09-.041.132-.063l.134-.066c.32-.166.632-.347.936-.541l.133-.09c.045-.028.09-.057.133-.089l.133-.089.071-.049.192-.135c.179-.123.346-.25.515-.379l.015-.012.269-.208c.367-.29.715-.582 1.031-.857l.21-.184.019-.018.11-.097c.258-.232.488-.448.679-.626l.079-.077c.069-.065.132-.126.189-.178l.112-.111.04-.039.011-.011.117-.117.074-.077.009-.007.035-.032.044-.04.014-.013.037-.034.204-.179.114-.102c.061-.053.12-.107.179-.162l.136-.121c.025-.02.049-.042.074-.064l.143-.125.21-.185.112-.097c.435-.378.964-.835 1.572-1.35l.249-.211.411-.345.421-.351c.55-.457 1.142-.942 1.768-1.445l.411-.33c.35-.279.709-.563 1.073-.849.147-.115.296-.23.448-.344a100.786 100.786 0 0 1 3.762-2.788l.384-.268c.269-.185.537-.371.805-.552l.243-.164c.479-.325.965-.642 1.455-.951l.243-.153.241-.15c.242-.15.482-.297.721-.44l.239-.143.478-.278.469-.269.095-.052.371-.204c.155-.084.309-.165.463-.244l.229-.118.223-.112c.078-.037.155-.076.231-.113a21.446 21.446 0 0 1 1.954-.845l.41-.144c.123-.041.243-.08.358-.115l.04-.012c.062-.02.124-.038.185-.054l.018-.005c.128-.037.255-.069.38-.099h.009a8.156 8.156 0 0 1 1.077-.183c.167-.017.334-.026.502-.025h.084c.112 0 .22.007.327.018.049 0 .098.01.146.016h.02c.048.006.096.013.144.024.047.009.095.017.141.028h.015c.047.01.089.022.138.036.254.071.496.181.718.325.109.071.211.151.305.24l.027.026a.805.805 0 0 1 .051.05l.049.053c.402.421.734.904.984 1.43l.038.09a3.309 3.309 0 0 1 .042 2.553 4.676 4.676 0 0 1-.339.715 8.141 8.141 0 0 1-1.099 1.452l-.089.095c-.133.14-.271.28-.416.42-.064.063-.131.125-.198.188l-.205.189-.107.095a22.682 22.682 0 0 1-1.285 1.057c-.586.448-1.183.88-1.791 1.297a59.56 59.56 0 0 1-1.11.743 99.67 99.67 0 0 1-2.786 1.763c-1.968 1.21-4.149 2.504-6.474 
3.911l-.602.365c-.659.402-1.281.786-1.867 1.152l-.295.185-.558.358c-.37.237-.739.476-1.108.715l-.297.196c-.145.094-.289.19-.433.286l-.141.09-.432.291-.229.157-.268.186-.249.173c-.417.295-.805.576-1.162.843l-.134.102c-.21.159-.418.321-.623.486-.309.249-.591.487-.844.716l-.124.113c-.071.065-.141.13-.208.194-.046.045-.089.09-.137.133l-.064.064c-.143.144-.283.291-.418.442l-.066.076c-.147.17-.275.333-.387.492l-.05.071c-.088.128-.169.26-.242.396-.019.034-.036.068-.053.102l-.049.102-.033.074-.021.05-.017.045-.023.061a2.55 2.55 0 0 0-.13.523l-.008.062-.006.058v.315c0 .026 0 .052.007.08l.005.048c0 .026.006.051.01.079.004.026.011.073.019.11v.005c.007.035.014.069.023.104s.018.075.029.111c.019.071.043.141.068.211.016.042.031.083.048.124 0 .008.006.017.01.025l.036.081.05.111c.054.115.113.227.177.336l.066.114.068.114a.426.426 0 0 0 .041.054l.022.023.025.023.026.02a.487.487 0 0 0 .193.083c.577.13 1.763-.347 3.339-1.179.089-.048.187-.098.282-.15l.48-.262.234-.13c.167-.089.337-.19.511-.289l.317-.179c2.083-1.199 4.571-2.739 7.143-4.243.241-.141.483-.282.725-.421l.486-.311c.564-.323 1.131-.642 1.7-.957a80.121 80.121 0 0 1 2.168-1.15l.476-.241c.316-.156.629-.308.938-.456a40.899 40.899 0 0 1 1.815-.809l.335-.136.04-.016c1.775-.703 3.384-1.137 4.686-1.137.281-.003.563.021.84.069h.009c.089.016.17.034.253.054h.015c.213.054.42.131.616.23.275.142.522.331.731.559.096.106.181.221.256.343.14.215.25.448.329.692.032.095.06.188.089.287a3.93 3.93 0 0 1-.06 2.305z" fill="#ff9d00" />

            <path clip-rule="evenodd" d="M203.21 123.685v-.491c0-41.854-33.918-75.783-75.775-75.783-41.856 0-75.787 33.931-75.787 75.783v.164a7.13 7.13 0 0 0 0 .327c.005.163.007.326.005.489l.005.36.003.129c0 .06.002.119.004.179.003.095.005.191.005.286l.011.389.016.457.005.15.02.473v.023c.008.185.018.369.027.553l.001.014c.01.188.02.377.032.566.01.174.02.348.033.522l.002.031c.009.149.019.299.03.448l.003.04.033.432.003.028c.006.084.012.168.02.249.06.721.13 1.442.21 2.161l.004.045.061.529.068.555.05.377.023.177c.053.408.111.815.173 1.221l.004.027.059.384c.286 1.829.64 3.647 1.06 5.45l.005.022.032.135.091.385.056-.067c2.2-2.63 5.356-4.08 8.887-4.08 2.83 0 5.733.936 8.643 2.781 1.93 1.226 4.064 3.397 6.26 6.128 2.032-2.822 4.88-4.698 8.138-5.21.621-.098 1.25-.147 1.878-.148 7.436 0 11.915 6.453 13.607 12.258.836 1.961 4.865 10.893 10.941 16.935 9.236 9.227 11.547 18.748 6.976 28.943h.016c.311.035.624.067.939.096.187.018.373.036.561.052l.066.006.413.035.38.03c.362.028.725.054 1.089.077l.289.017.229.014.142.008.447.023.122.005.469.021.522.02.419.012.07.002c.089.004.178.007.267.007l.096.003c.59.014 1.179.021 1.769.02h.626l.447-.005c.11-.005.219-.005.33-.005h.174l.151-.004c.059-.002.118-.004.178-.004h.179l.351-.009.528-.018.581-.026c.168-.006.334-.015.5-.023l.179-.011.266-.014.156-.009.118-.008.537-.035.536-.039.487-.039.342-.029.376-.034a62.347 62.347 0 0 0 1.88-.197l.427-.051c-4.499-10.152-2.17-19.632 7.027-28.822 6.063-6.058 10.092-14.99 10.932-16.952 1.693-5.804 6.169-12.257 13.607-12.257.629 0 1.258.05 1.879.148 3.255.512 6.103 2.388 8.138 5.21 2.197-2.73 4.33-4.903 6.261-6.129 2.909-1.844 5.815-2.78 8.642-2.78 3.116 0 5.94 1.13 8.073 
3.203.039-.169.077-.338.114-.508l.109-.506c.039-.185.078-.37.115-.555.066-.327.132-.654.195-.984.066-.342.129-.686.189-1.03l.031-.186c.019-.11.037-.22.058-.331.034-.172.062-.344.09-.518l.011-.066.071-.453.07-.459v-.004c.051-.339.099-.678.144-1.017l.011-.084.063-.478.051-.408.09-.804.035-.323.009-.088c.019-.174.037-.348.053-.522v-.014c.013-.138.027-.277.039-.416.071-.788.131-1.58.179-2.375.009-.13.016-.261.024-.392v-.006l.026-.491.006-.11c.016-.332.03-.664.041-.996v-.012l.005-.13c.005-.109.009-.219.009-.329l.002-.044c.004-.103.008-.209.008-.314l.003-.09c.003-.105.006-.209.006-.314l.002-.089c.003-.133.006-.267.006-.4zm-94.572 75.706c6.002-8.801 5.576-15.407-2.658-23.637-8.236-8.231-13.029-20.267-13.029-20.267s-1.789-6.991-5.869-6.349-7.073 11.089 1.47 17.484c8.542 6.395-1.7 10.731-4.988 4.73-3.288-6.002-12.265-21.429-16.919-24.38s-7.927-1.297-6.83 4.785c.545 3.019 5.613 8.172 10.348 12.986 4.804 4.884 9.265 9.42 8.311 11.025-1.893 3.187-8.56-3.745-8.56-3.745s-20.876-18.998-25.42-14.047c-4.19 4.563 2.271 8.442 12.227 14.421.846.508 1.718 1.032 2.611 1.572 11.391 6.896 12.277 8.715 10.66 11.324-.597.964-4.41-1.325-9.1-4.14-7.995-4.801-18.537-11.13-20.026-5.465-1.288 4.903 6.468 7.907 13.502 10.632 5.86 2.27 11.22 4.346 10.431 7.164-.817 2.922-5.246.485-10.087-2.179-5.435-2.991-11.39-6.267-13.339-2.57-3.683 6.99 25.41 15.219 25.65 15.28 9.4 2.438 33.272 7.604 41.615-4.624zm38.665 0c-6.002-8.801-5.576-15.407 2.659-23.637 8.235-8.231 13.028-20.267 13.028-20.267s1.789-6.991 5.869-6.349 7.073 11.089-1.469 17.484c-8.543 6.395 1.699 10.731 4.987 4.73 3.289-6.002 12.26-21.429 16.914-24.38s7.929-1.297 6.831 4.785c-.544 3.019-5.613 8.172-10.348 12.987-4.804 4.884-9.265 9.419-8.312 11.024 1.893 3.187 8.565-3.749 8.565-3.749s20.875-18.997 25.421-14.046c4.189 4.562-2.272 8.442-12.229 14.421-.871.523-1.741 1.047-2.61 1.572-11.391 6.896-12.277 8.715-10.661 11.323.598.965 4.411-1.325 9.1-4.14 7.996-4.8 18.538-11.13 20.027-5.464 1.289 4.903-6.468 7.907-13.502 10.632-5.86 2.27-11.22 
4.346-10.432 7.164.816 2.921 5.244.484 10.084-2.18 5.435-2.991 11.391-6.269 13.339-2.569 3.684 6.994-25.414 15.215-25.649 15.275-9.4 2.446-33.272 7.612-41.612-4.616z" fill="#ffd21e" fill-rule="evenodd" />

            <path clip-rule="evenodd" d="M152.047 102.567c1.182.418 2.061 1.69 2.897 2.901 1.13 1.636 2.182 3.159 3.796 2.301a10.909 10.909 0 0 0 4.247-15.214 10.912 10.912 0 0 0-7.742-5.198 10.904 10.904 0 0 0-11.689 6.589 10.909 10.909 0 0 0 .436 9.314c.748 1.407 2.408.743 4.16.042 1.373-.549 2.804-1.121 3.895-.735zm-51.375 0c-1.182.418-2.061 1.691-2.897 2.901-1.13 1.637-2.183 3.159-3.796 2.301a10.903 10.903 0 0 1 8.263-20.068 10.909 10.909 0 0 1 7.707 9.348 10.906 10.906 0 0 1-1.221 6.211c-.749 1.407-2.409.743-4.161.043-1.374-.55-2.803-1.122-3.895-.736zm43.427 46.751c8.143-6.415 11.134-16.889 11.134-23.341 0-5.1-3.431-3.495-8.924-.775l-.31.153c-5.042 2.497-11.754 5.822-19.122 5.822-7.369 0-14.081-3.325-19.122-5.823-5.671-2.809-9.228-4.571-9.228.624 0 6.656 3.182 17.585 11.916 23.934a18.968 18.968 0 0 1 11.575-9.786c.872-.26 1.77 1.241 2.689 2.778.887 1.482 1.794 2.998 2.716 2.998.983 0 1.948-1.494 2.891-2.952.985-1.525 1.946-3.01 2.875-2.713a18.967 18.967 0 0 1 10.91 9.081z" fill="#32343d" fill-rule="evenodd" />

            <path d="M144.097 149.317c-4.241 3.342-9.878 5.583-17.219 5.583-6.897 0-12.291-1.978-16.435-4.989a18.966 18.966 0 0 1 11.575-9.786c1.712-.511 3.527 5.776 5.405 5.776 2.01 0 3.947-6.246 5.766-5.665a18.974 18.974 0 0 1 10.908 9.081z" fill="#ff323d" />

            <path clip-rule="evenodd" d="M81.2 111.64a7.078 7.078 0 0 1-6.65.655 7.062 7.062 0 0 1-3.837-3.837 7.082 7.082 0 0 1 .657-6.65 7.087 7.087 0 1 1 9.83 9.832zm101.413 0a7.08 7.08 0 0 1-6.651.655 7.064 7.064 0 0 1-3.837-3.837 7.102 7.102 0 0 1-.504-3.407 7.103 7.103 0 0 1 3.411-5.385 7.083 7.083 0 0 1 8.656 1.07 7.079 7.079 0 0 1 1.536 7.724 7.089 7.089 0 0 1-2.611 3.18z" fill="#ffad03" fill-rule="evenodd" />
          </svg>

          Trajectories
        </a>

        <a href="/docs/test" className="btn-contribute">Quick Start</a>
      </div>
    </div>
  </div>

  <div className="performance-section">
    <div className="performance-header">
      <a href="/docs/leaderboard" className="view-full-link">view full leaderboard ↗</a>
    </div>

    <div className="category-section">
      <table className="performance-table">
        <thead>
          <tr>
            <th>Model</th>
            <th>Type</th>
            <th>Agent</th>
            <th>Date</th>
            <th>Pass\@1</th>
            <th>Pass\@3</th>
            <th>Pass^3</th>
            <th># Turns</th>
          </tr>
        </thead>

        <tbody>
          <tr className="rank-1">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" width="20px" height="20px" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><title>Gemini</title><path d="M20.616 10.835a14.147 14.147 0 01-4.45-3.001 14.111 14.111 0 01-3.678-6.452.503.503 0 00-.975 0 14.134 14.134 0 01-3.679 6.452 14.155 14.155 0 01-4.45 3.001c-.65.28-1.318.505-2.002.678a.502.502 0 000 .975c.684.172 1.35.397 2.002.677a14.147 14.147 0 014.45 3.001 14.112 14.112 0 013.679 6.453.502.502 0 00.975 0c.172-.685.397-1.351.677-2.003a14.145 14.145 0 013.001-4.45 14.113 14.113 0 016.453-3.678.503.503 0 000-.975 13.245 13.245 0 01-2.003-.678z" fill="#3186FF" /><path d="M20.616 10.835a14.147 14.147 0 01-4.45-3.001 14.111 14.111 0 01-3.678-6.452.503.503 0 00-.975 0 14.134 14.134 0 01-3.679 6.452 14.155 14.155 0 01-4.45 3.001c-.65.28-1.318.505-2.002.678a.502.502 0 000 .975c.684.172 1.35.397 2.002.677a14.147 14.147 0 014.45 3.001 14.112 14.112 0 013.679 6.453.502.502 0 00.975 0c.172-.685.397-1.351.677-2.003a14.145 14.145 0 013.001-4.45 14.113 14.113 0 016.453-3.678.503.503 0 000-.975 13.245 13.245 0 01-2.003-.678z" fill="url(#lobe-icons-gemini-fill-0)" /><path d="M20.616 10.835a14.147 14.147 0 01-4.45-3.001 14.111 14.111 0 01-3.678-6.452.503.503 0 00-.975 0 14.134 14.134 0 01-3.679 6.452 14.155 14.155 0 01-4.45 3.001c-.65.28-1.318.505-2.002.678a.502.502 0 000 .975c.684.172 1.35.397 2.002.677a14.147 14.147 0 014.45 3.001 14.112 14.112 0 013.679 6.453.502.502 0 00.975 0c.172-.685.397-1.351.677-2.003a14.145 14.145 0 013.001-4.45 14.113 14.113 0 016.453-3.678.503.503 0 000-.975 13.245 13.245 0 01-2.003-.678z" fill="url(#lobe-icons-gemini-fill-1)" /><path d="M20.616 10.835a14.147 14.147 0 01-4.45-3.001 14.111 14.111 0 01-3.678-6.452.503.503 0 00-.975 0 14.134 14.134 0 01-3.679 6.452 14.155 14.155 0 01-4.45 3.001c-.65.28-1.318.505-2.002.678a.502.502 0 000 .975c.684.172 1.35.397 2.002.677a14.147 14.147 0 014.45 3.001 14.112 14.112 0 013.679 6.453.502.502 0 00.975 
0c.172-.685.397-1.351.677-2.003a14.145 14.145 0 013.001-4.45 14.113 14.113 0 016.453-3.678.503.503 0 000-.975 13.245 13.245 0 01-2.003-.678z" fill="url(#lobe-icons-gemini-fill-2)" /><defs><linearGradient gradientUnits="userSpaceOnUse" id="lobe-icons-gemini-fill-0" x1="7" x2="11" y1="15.5" y2="12"><stop stop-color="#08B962" /><stop offset="1" stop-color="#08B962" stop-opacity="0" /></linearGradient><linearGradient gradientUnits="userSpaceOnUse" id="lobe-icons-gemini-fill-1" x1="8" x2="11.5" y1="5.5" y2="11"><stop stop-color="#F94543" /><stop offset="1" stop-color="#F94543" stop-opacity="0" /></linearGradient><linearGradient gradientUnits="userSpaceOnUse" id="lobe-icons-gemini-fill-2" x1="3.5" x2="17.5" y1="13.5" y2="12"><stop stop-color="#FABC12" /><stop offset=".46" stop-color="#FABC12" stop-opacity="0" /></linearGradient></defs></svg>Gemini-3.5-Flash<span className="verified-badge" aria-hidden="true" title="Evaluated by us">✓</span></td>
            <td className="model-type-cell" data-label="Model Type">Proprietary</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-05-19</td>
            <td className="score-cell best-score" data-label="Pass@1">56.5<sub>± 2.7</sub></td>
            <td className="score-cell best-score" data-label="Pass@3">68.5</td>
            <td className="score-cell best-score" data-label="Pass^3">43.5</td>
            <td className="score-cell" data-label="# Turns">44.7</td>
          </tr>

          <tr className="rank-2">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" fill="currentColor" width="20px" height="20px" viewBox="0 0 24 24" role="img" xmlns="http://www.w3.org/2000/svg"><title>OpenAI icon</title><path d="M22.2819 9.8211a5.9847 5.9847 0 0 0-.5157-4.9108 6.0462 6.0462 0 0 0-6.5098-2.9A6.0651 6.0651 0 0 0 4.9807 4.1818a5.9847 5.9847 0 0 0-3.9977 2.9 6.0462 6.0462 0 0 0 .7427 7.0966 5.98 5.98 0 0 0 .511 4.9107 6.051 6.051 0 0 0 6.5146 2.9001A5.9847 5.9847 0 0 0 13.2599 24a6.0557 6.0557 0 0 0 5.7718-4.2058 5.9894 5.9894 0 0 0 3.9977-2.9001 6.0557 6.0557 0 0 0-.7475-7.0729zm-9.022 12.6081a4.4755 4.4755 0 0 1-2.8764-1.0408l.1419-.0804 4.7783-2.7582a.7948.7948 0 0 0 .3927-.6813v-6.7369l2.02 1.1686a.071.071 0 0 1 .038.052v5.5826a4.504 4.504 0 0 1-4.4945 4.4944zm-9.6607-4.1254a4.4708 4.4708 0 0 1-.5346-3.0137l.142.0852 4.783 2.7582a.7712.7712 0 0 0 .7806 0l5.8428-3.3685v2.3324a.0804.0804 0 0 1-.0332.0615L9.74 19.9502a4.4992 4.4992 0 0 1-6.1408-1.6464zM2.3408 7.8956a4.485 4.485 0 0 1 2.3655-1.9728V11.6a.7664.7664 0 0 0 .3879.6765l5.8144 3.3543-2.0201 1.1685a.0757.0757 0 0 1-.071 0l-4.8303-2.7865A4.504 4.504 0 0 1 2.3408 7.872zm16.5963 3.8558L13.1038 8.364 15.1192 7.2a.0757.0757 0 0 1 .071 0l4.8303 2.7913a4.4944 4.4944 0 0 1-.6765 8.1042v-5.6772a.79.79 0 0 0-.407-.667zm2.0107-3.0231l-.142-.0852-4.7735-2.7818a.7759.7759 0 0 0-.7854 0L9.409 9.2297V6.8974a.0662.0662 0 0 1 .0284-.0615l4.8303-2.7866a4.4992 4.4992 0 0 1 6.6802 4.66zM8.3065 12.863l-2.02-1.1638a.0804.0804 0 0 1-.038-.0567V6.0742a4.4992 4.4992 0 0 1 7.3757-3.4537l-.142.0805L8.704 5.459a.7948.7948 0 0 0-.3927.6813zm1.0976-2.3654l2.602-1.4998 2.6069 1.4998v2.9994l-2.5974 1.4997-2.6067-1.4997Z" /></svg> <a href="https://openai.com/index/introducing-gpt-5-5/">GPT-5.5-xhigh</a></td>
            <td className="model-type-cell" data-label="Model Type">Proprietary</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-04-24</td>
            <td className="score-cell" data-label="Pass@1">55.6</td>
            <td className="score-cell" data-label="Pass@3">--</td>
            <td className="score-cell" data-label="Pass^3">--</td>
            <td className="score-cell" data-label="# Turns">--</td>
          </tr>

          <tr className="rank-3">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" fill="#4D6BFE" width="20px" height="20px" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><title>DeepSeek</title><path d="M23.748 4.482c-.254-.124-.364.113-.512.234-.051.039-.094.09-.137.136-.372.397-.806.657-1.373.626-.829-.046-1.537.214-2.163.848-.133-.782-.575-1.248-1.247-1.548-.352-.156-.708-.311-.955-.65-.172-.241-.219-.51-.305-.774-.055-.16-.11-.323-.293-.35-.2-.031-.278.136-.356.276-.313.572-.434 1.202-.422 1.84.027 1.436.633 2.58 1.838 3.393.137.093.172.187.129.323-.082.28-.18.552-.266.833-.055.179-.137.217-.329.14a5.526 5.526 0 01-1.736-1.18c-.857-.828-1.631-1.742-2.597-2.458a11.365 11.365 0 00-.689-.471c-.985-.957.13-1.743.388-1.836.27-.098.093-.432-.779-.428-.872.004-1.67.295-2.687.684a3.055 3.055 0 01-.465.137 9.597 9.597 0 00-2.883-.102c-1.885.21-3.39 1.102-4.497 2.623C.082 8.606-.231 10.684.152 12.85c.403 2.284 1.569 4.175 3.36 5.653 1.858 1.533 3.997 2.284 6.438 2.14 1.482-.085 3.133-.284 4.994-1.86.47.234.962.327 1.78.397.63.059 1.236-.03 1.705-.128.735-.156.684-.837.419-.961-2.155-1.004-1.682-.595-2.113-.926 1.096-1.296 2.746-2.642 3.392-7.003.05-.347.007-.565 0-.845-.004-.17.035-.237.23-.256a4.173 4.173 0 001.545-.475c1.396-.763 1.96-2.015 2.093-3.517.02-.23-.004-.467-.247-.588zM11.581 18c-2.089-1.642-3.102-2.183-3.52-2.16-.392.024-.321.471-.235.763.09.288.207.486.371.739.114.167.192.416-.113.603-.673.416-1.842-.14-1.897-.167-1.361-.802-2.5-1.86-3.301-3.307-.774-1.393-1.224-2.887-1.298-4.482-.02-.386.093-.522.477-.592a4.696 4.696 0 011.529-.039c2.132.312 3.946 1.265 5.468 2.774.868.86 1.525 1.887 2.202 2.891.72 1.066 1.494 2.082 2.48 2.914.348.292.625.514.891.677-.802.09-2.14.11-3.054-.614zm1-6.44a.306.306 0 01.415-.287.302.302 0 01.2.288.306.306 0 01-.31.307.303.303 0 01-.304-.308zm3.11 1.596c-.2.081-.399.151-.59.16a1.245 1.245 0 01-.798-.254c-.274-.23-.47-.358-.552-.758a1.73 1.73 0 
01.016-.588c.07-.327-.008-.537-.239-.727-.187-.156-.426-.199-.688-.199a.559.559 0 01-.254-.078c-.11-.054-.2-.19-.114-.358.028-.054.16-.186.192-.21.356-.202.767-.136 1.146.016.352.144.618.408 1.001.782.391.451.462.576.685.914.176.265.336.537.445.848.067.195-.019.354-.25.452z" /></svg> DeepSeek-V4-Pro Max<span className="verified-badge" aria-hidden="true" title="Evaluated by us">✓</span></td>
            <td className="model-type-cell" data-label="Model Type">Open-Source</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-04-25</td>
            <td className="score-cell" data-label="Pass@1">52.8<sub>± 1.9</sub></td>
            <td className="score-cell" data-label="Pass@3">63.9</td>
            <td className="score-cell" data-label="Pass^3">38.9</td>
            <td className="score-cell" data-label="# Turns">24.1</td>
          </tr>

          <tr className="rank-other">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" fill="#D97757" width="20px" height="20px" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><title>Claude</title><path d="M4.709 15.955l4.72-2.647.08-.23-.08-.128H9.2l-.79-.048-2.698-.073-2.339-.097-2.266-.122-.571-.121L0 11.784l.055-.352.48-.321.686.06 1.52.103 2.278.158 1.652.097 2.449.255h.389l.055-.157-.134-.098-.103-.097-2.358-1.596-2.552-1.688-1.336-.972-.724-.491-.364-.462-.158-1.008.656-.722.881.06.225.061.893.686 1.908 1.476 2.491 1.833.365.304.145-.103.019-.073-.164-.274-1.355-2.446-1.446-2.49-.644-1.032-.17-.619a2.97 2.97 0 01-.104-.729L6.283.134 6.696 0l.996.134.42.364.62 1.414 1.002 2.229 1.555 3.03.456.898.243.832.091.255h.158V9.01l.128-1.706.237-2.095.23-2.695.08-.76.376-.91.747-.492.584.28.48.685-.067.444-.286 1.851-.559 2.903-.364 1.942h.212l.243-.242.985-1.306 1.652-2.064.73-.82.85-.904.547-.431h1.033l.76 1.129-.34 1.166-1.064 1.347-.881 1.142-1.264 1.7-.79 1.36.073.11.188-.02 2.856-.606 1.543-.28 1.841-.315.833.388.091.395-.328.807-1.969.486-2.309.462-3.439.813-.042.03.049.061 1.549.146.662.036h1.622l3.02.225.79.522.474.638-.079.485-1.215.62-1.64-.389-3.829-.91-1.312-.329h-.182v.11l1.093 1.068 2.006 1.81 2.509 2.33.127.578-.322.455-.34-.049-2.205-1.657-.851-.747-1.926-1.62h-.128v.17l.444.649 2.345 3.521.122 1.08-.17.353-.608.213-.668-.122-1.374-1.925-1.415-2.167-1.143-1.943-.14.08-.674 7.254-.316.37-.729.28-.607-.461-.322-.747.322-1.476.389-1.924.315-1.53.286-1.9.17-.632-.012-.042-.14.018-1.434 1.967-2.18 2.945-1.726 1.845-.414.164-.717-.37.067-.662.401-.589 2.388-3.036 1.44-1.882.93-1.086-.006-.158h-.055L4.132 18.56l-1.13.146-.487-.456.061-.746.231-.243 1.908-1.312-.006.006z" fill-rule="nonzero" /></svg> Claude-Opus-4.7<span className="verified-badge" aria-hidden="true" title="Evaluated by us">✓</span></td>
            <td className="model-type-cell" data-label="Model Type">Proprietary</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-04-25</td>
            <td className="score-cell" data-label="Pass@1">52.8<sup>†</sup></td>
            <td className="score-cell" data-label="Pass@3">--</td>
            <td className="score-cell" data-label="Pass^3">--</td>
            <td className="score-cell" data-label="# Turns">16.2</td>
          </tr>

          <tr className="rank-other">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" fill="currentColor" fill-rule="evenodd" width="20px" height="20px" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><title>Kimi</title><path d="M19.738 5.776c.163-.209.306-.4.457-.585.07-.087.064-.153-.004-.244-.655-.861-.717-1.817-.34-2.787.283-.73.909-1.072 1.674-1.145.477-.045.945.004 1.379.236.57.305.902.77 1.01 1.412.086.512.07 1.012-.075 1.508-.257.878-.888 1.333-1.753 1.448-.718.096-1.446.108-2.17.157-.056.004-.113 0-.178 0z" /><path d="M17.962 1.844h-4.326l-3.425 7.81H5.369V1.878H1.5V22h3.87v-8.477h6.824a3.025 3.025 0 002.743-1.75V22h3.87v-8.477a3.87 3.87 0 00-3.588-3.86v-.01h-2.125a3.94 3.94 0 002.323-2.12l2.545-5.689z" /></svg> <a href="https://huggingface.co/moonshotai/Kimi-K2.6">Kimi-K2.6</a></td>
            <td className="model-type-cell" data-label="Model Type">Open-Source</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-04-21</td>
            <td className="score-cell" data-label="Pass@1">50.0</td>
            <td className="score-cell" data-label="Pass@3">--</td>
            <td className="score-cell" data-label="Pass^3">--</td>
            <td className="score-cell" data-label="# Turns">--</td>
          </tr>

          <tr className="rank-other">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" fill="#E2167E" width="20px" height="20px" viewBox="0 0 24 24" role="img" xmlns="http://www.w3.org/2000/svg"><title>Minimax icon</title><path d="M16.278 2c1.156 0 2.093.927 2.093 2.07v12.501a.74.74 0 00.744.709.74.74 0 00.743-.709V9.099a2.06 2.06 0 012.071-2.049A2.06 2.06 0 0124 9.1v6.561a.649.649 0 01-.652.645.649.649 0 01-.653-.645V9.1a.762.762 0 00-.766-.758.762.762 0 00-.766.758v7.472a2.037 2.037 0 01-2.048 2.026 2.037 2.037 0 01-2.048-2.026v-12.5a.785.785 0 00-.788-.753.785.785 0 00-.789.752l-.001 15.904A2.037 2.037 0 0113.441 22a2.037 2.037 0 01-2.048-2.026V18.04c0-.356.292-.645.652-.645.36 0 .652.289.652.645v1.934c0 .263.142.506.372.638.23.131.514.131.744 0a.734.734 0 00.372-.638V4.07c0-1.143.937-2.07 2.093-2.07zm-5.674 0c1.156 0 2.093.927 2.093 2.07v11.523a.648.648 0 01-.652.645.648.648 0 01-.652-.645V4.07a.785.785 0 00-.789-.78.785.785 0 00-.789.78v14.013a2.06 2.06 0 01-2.07 2.048 2.06 2.06 0 01-2.071-2.048V9.1a.762.762 0 00-.766-.758.762.762 0 00-.766.758v3.8a2.06 2.06 0 01-2.071 2.049A2.06 2.06 0 010 12.9v-1.378c0-.357.292-.646.652-.646.36 0 .653.29.653.646V12.9c0 .418.343.757.766.757s.766-.339.766-.757V9.099a2.06 2.06 0 012.07-2.048 2.06 2.06 0 012.071 2.048v8.984c0 .419.343.758.767.758.423 0 .766-.339.766-.758V4.07c0-1.143.937-2.07 2.093-2.07z" /></svg><a href="https://www.minimax.io/news/minimax-m27-en">MiniMax-M2.7</a></td>
            <td className="model-type-cell" data-label="Model Type">Open-Source</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-03-18</td>
            <td className="score-cell" data-label="Pass@1">46.3</td>
            <td className="score-cell" data-label="Pass@3">--</td>
            <td className="score-cell" data-label="Pass^3">--</td>
            <td className="score-cell" data-label="# Turns">--</td>
          </tr>

          <tr className="rank-other">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" width="20px" height="20px" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><title>ChatGLM</title><defs><linearGradient id="lobe-icons-chat-glm-fill" x1="-18.756%" x2="70.894%" y1="49.371%" y2="90.944%"><stop offset="0%" stop-color="#504AF4" /><stop offset="100%" stop-color="#3485FF" /></linearGradient></defs><path d="M9.917 2c4.906 0 10.178 3.947 8.93 10.58-.014.07-.037.14-.057.21l-.003-.277c-.083-3-1.534-8.934-8.87-8.934-3.393 0-8.137 3.054-7.93 8.158-.04 4.778 3.555 8.4 7.95 8.332l.073-.001c1.2-.033 2.763-.429 3.1-1.657.063-.031.26.534.268.598.048.256.112.369.192.34.981-.348 2.286-1.222 1.952-2.38-.176-.61-1.775-.147-1.921-.347.418-.979 2.234-.926 3.153-.716.443.102.657.38 1.012.442.29.052.981-.2.96.242-1.5 3.042-4.893 5.41-8.808 5.41C3.654 22 0 16.574 0 11.737 0 5.947 4.959 2 9.917 2zM9.9 5.3c.484 0 1.125.225 1.38.585 3.669.145 4.313 2.686 4.694 5.444.255 1.838.315 2.3.182 1.387l.083.59c.068.448.554.737.982.516.144-.075.254-.231.328-.47a.2.2 0 01.258-.13l.625.22a.2.2 0 01.124.238 2.172 2.172 0 01-.51.92c-.878.917-2.757.664-3.08-.62-.14-.554-.055-.626-.345-1.242-.292-.621-1.238-.709-1.69-.295-.345.315-.407.805-.406 1.282L12.6 15.9a.9.9 0 01-.9.9h-1.4a.9.9 0 01-.9-.9v-.65a1.15 1.15 0 10-2.3 0v.65a.9.9 0 01-.9.9H4.8a.9.9 0 01-.9-.9l.035-3.239c.012-1.884.356-3.658 2.47-4.134.2-.045.252.13.29.342.025.154.043.252.053.294.701 3.058 1.75 4.299 3.144 3.722l.66-.331.254-.13c.158-.082.25-.131.276-.15.012-.01-.165-.206-.407-.464l-1.012-1.067a8.925 8.925 0 01-.199-.216c-.047-.034-.116.068-.208.306-.074.157-.251.252-.272.326-.013.058.108.298.362.72.164.288.22.508-.31.343-1.04-.8-1.518-2.273-1.684-3.725-.004-.035-.162-1.913-.162-1.913a1.2 1.2 0 011.113-1.281L9.9 5.3zm12.994 8.68c.037.697-.403.704-1.213.591l-1.783-.276c-.265-.053-.385-.099-.313-.147.47-.315 3.268-.93 3.31-.168zm-.915-.083l-.926.042c-.85.077-1.452.24.338.336l.103.003c.815.012 
1.264-.359.485-.381zm1.667-3.601h.01c.79.398.067 1.03-.65 1.393-.14.07-.491.176-1.052.315-.241.04-.457.092-.333.16l.01.005c1.952.958-3.123 1.534-2.495 1.285l.38-.148c.68-.266 1.614-.682 1.666-1.337.038-.48 1.253-.442 1.493-.968.048-.106 0-.236-.144-.389-.05-.047-.094-.094-.107-.148-.073-.305.7-.431 1.222-.168zm-2.568-.474c-.135 1.198-2.479 4.192-1.949 2.863l.017-.042c.298-.717.376-2.221 1.337-3.221.25-.26.636.035.595.4zm-7.976-.253c.02-.694 1.002-.968 1.346-.347.01-1.274-1.941-.768-1.346.347z" fill="url(#lobe-icons-chat-glm-fill)" fill-rule="evenodd" /></svg> <a href="https://z.ai/blog/glm-5.1">GLM-5.1</a></td>
            <td className="model-type-cell" data-label="Model Type">Open-Source</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-04-07</td>
            <td className="score-cell" data-label="Pass@1">40.7</td>
            <td className="score-cell" data-label="Pass@3">--</td>
            <td className="score-cell" data-label="Pass^3">--</td>
            <td className="score-cell" data-label="# Turns">--</td>
          </tr>

          <tr className="rank-other">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" width="20px" height="20px" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><title>Qwen</title><path d="M12.604 1.34c.393.69.784 1.382 1.174 2.075a.18.18 0 00.157.091h5.552c.174 0 .322.11.446.327l1.454 2.57c.19.337.24.478.024.837-.26.43-.513.864-.76 1.3l-.367.658c-.106.196-.223.28-.04.512l2.652 4.637c.172.301.111.494-.043.77-.437.785-.882 1.564-1.335 2.34-.159.272-.352.375-.68.37-.777-.016-1.552-.01-2.327.016a.099.099 0 00-.081.05 575.097 575.097 0 01-2.705 4.74c-.169.293-.38.363-.725.364-.997.003-2.002.004-3.017.002a.537.537 0 01-.465-.271l-1.335-2.323a.09.09 0 00-.083-.049H4.982c-.285.03-.553-.001-.805-.092l-1.603-2.77a.543.543 0 01-.002-.54l1.207-2.12a.198.198 0 000-.197 550.951 550.951 0 01-1.875-3.272l-.79-1.395c-.16-.31-.173-.496.095-.965.465-.813.927-1.625 1.387-2.436.132-.234.304-.334.584-.335a338.3 338.3 0 012.589-.001.124.124 0 00.107-.063l2.806-4.895a.488.488 0 01.422-.246c.524-.001 1.053 0 1.583-.006L11.704 1c.341-.003.724.032.9.34zm-3.432.403a.06.06 0 00-.052.03L6.254 6.788a.157.157 0 01-.135.078H3.253c-.056 0-.07.025-.041.074l5.81 10.156c.025.042.013.062-.034.063l-2.795.015a.218.218 0 00-.2.116l-1.32 2.31c-.044.078-.021.118.068.118l5.716.008c.046 0 .08.02.104.061l1.403 2.454c.046.081.092.082.139 0l5.006-8.76.783-1.382a.055.055 0 01.096 0l1.424 2.53a.122.122 0 00.107.062l2.763-.02a.04.04 0 00.035-.02.041.041 0 000-.04l-2.9-5.086a.108.108 0 010-.113l.293-.507 1.12-1.977c.024-.041.012-.062-.035-.062H9.2c-.059 0-.073-.026-.043-.077l1.434-2.505a.107.107 0 000-.114L9.225 1.774a.06.06 0 00-.053-.031zm6.29 8.02c.046 0 .058.02.034.06l-.832 1.465-2.613 4.585a.056.056 0 01-.05.029.058.058 0 01-.05-.029L8.498 9.841c-.02-.034-.01-.052.028-.054l.216-.012 6.722-.012z" fill="url(#lobe-icons-qwen-fill)" fill-rule="nonzero" /><defs><linearGradient id="lobe-icons-qwen-fill" x1="0%" x2="100%" y1="0%" y2="0%"><stop offset="0%" stop-color="#6336E7" stop-opacity=".84" /><stop 
offset="100%" stop-color="#6F69F7" stop-opacity=".84" /></linearGradient></defs></svg> <a href="https://qwen.ai/blog?id=qwen3.6">Qwen3.6-Plus</a></td>
            <td className="model-type-cell" data-label="Model Type">Proprietary</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2026-04-02</td>
            <td className="score-cell" data-label="Pass@1">39.8</td>
            <td className="score-cell" data-label="Pass@3">--</td>
            <td className="score-cell" data-label="Pass^3">--</td>
            <td className="score-cell" data-label="# Turns">--</td>
          </tr>

          <tr className="rank-other">
            <td className="model-name-cell" data-label="Model"><svg className="org-icon" fill="currentColor" fill-rule="evenodd" width="20px" height="20px" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><title>Grok</title><path d="M9.27 15.29l7.978-5.897c.391-.29.95-.177 1.137.272.98 2.369.542 5.215-1.41 7.169-1.951 1.954-4.667 2.382-7.149 1.406l-2.711 1.257c3.889 2.661 8.611 2.003 11.562-.953 2.341-2.344 3.066-5.539 2.388-8.42l.006.007c-.983-4.232.242-5.924 2.75-9.383.06-.082.12-.164.179-.248l-3.301 3.305v-.01L9.267 15.292M7.623 16.723c-2.792-2.67-2.31-6.801.071-9.184 1.761-1.763 4.647-2.483 7.166-1.425l2.705-1.25a7.808 7.808 0 00-1.829-1A8.975 8.975 0 005.984 5.83c-2.533 2.536-3.33 6.436-1.962 9.764 1.022 2.487-.653 4.246-2.34 6.022-.599.63-1.199 1.259-1.682 1.925l7.62-6.815" /></svg> Grok-4<span className="verified-badge" aria-hidden="true" title="Evaluated by us">✓</span></td>
            <td className="model-type-cell" data-label="Model Type">Proprietary</td>
            <td className="text-cell" data-label="Agent">Default</td>
            <td className="date-cell" data-label="Date">2025-10-28</td>
            <td className="score-cell" data-label="Pass@1">27.5<sub>± 1.7</sub></td>
            <td className="score-cell" data-label="Pass@3">38.9</td>
            <td className="score-cell" data-label="Pass^3">16.7</td>
            <td className="score-cell" data-label="# Turns">20.3</td>
          </tr>
        </tbody>
      </table>
    </div>
  </div>
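
The Pass@1, Pass@3, and Pass^3 columns above can be read as aggregates over repeated runs of the same task set. The sketch below is a minimal illustration under stated assumptions, not the benchmark's own evaluation code: it assumes Pass@3 counts a task as solved when at least one of three runs passes, Pass^3 requires all three runs to pass, and Pass@1 is the mean per-run pass rate with the ± value taken as the population standard deviation across runs (sample standard deviation is an equally plausible reading). The function name `summarize` and the input layout are hypothetical.

```python
from statistics import mean, pstdev


def summarize(runs: list[list[bool]]) -> dict[str, float]:
    """Aggregate pass/fail results; runs[r][t] is True when task t passed in run r.

    Hypothetical helper for illustration only. All values are percentages.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one task")
    n_tasks = len(runs[0])
    if any(len(run) != n_tasks for run in runs):
        raise ValueError("every run must cover the same tasks")

    # Pass rate of each individual run, in percent.
    per_run = [100.0 * sum(run) / n_tasks for run in runs]
    # Regroup the results so each tuple holds one task's outcomes across runs.
    per_task = list(zip(*runs))

    return {
        "pass@1": mean(per_run),        # average single-run pass rate
        "pass@1_std": pstdev(per_run),  # spread across runs (assumed population std)
        "pass@k": 100.0 * sum(any(t) for t in per_task) / n_tasks,  # solved in any run
        "pass^k": 100.0 * sum(all(t) for t in per_task) / n_tasks,  # solved in every run
    }


# Toy data: three runs over four tasks (made-up values, not leaderboard results).
toy_runs = [
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, False],
]
result = summarize(toy_runs)
```

On this toy input each run passes two of four tasks, so the sketch yields a Pass@1 of 50.0 with zero spread, a Pass@3 of 75.0 (three tasks pass at least once), and a Pass^3 of 25.0 (only one task passes every time). Under these definitions Pass^3 can never exceed Pass@1, and Pass@1 can never exceed Pass@3, which matches the ordering of the rows that report all three values.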

  <div className="task-showcase-section">
    <div className="task-showcase-header">
      <a href="/docs/tasks/campus/34" className="view-full-link">view all 108 tasks ↗</a>
    </div>

    <div className="video-showcase-section">
      <Tabs>
        <Tab title="💰NV Market">
          <a href="https://toolathlon-traj.xyz/claude-4.5-sonnet_nvidia-market" className="task-showcase-card">
            <div className="task-card-header">
              <h3 className="task-title">NVIDIA Market</h3>
              <div className="task-category finance">💰 Finance & Market</div>
            </div>

            <p className="task-description">Analyze NVIDIA's institutional ownership trends across 8 quarters, adjust for the stock split, and populate results\_template.xlsx with common holdings only.</p>

            <video src="https://raw.githubusercontent.com/WaitHZ/toolathlon-website/main/videos/1.mp4" className="w-full" autoPlay muted loop />
          </a>
        </Tab>

        <Tab title="🏢Travel Reimbursement">
          <a href="https://toolathlon-traj.xyz/claude-4.5-sonnet_travel-expense-reimbursement" className="task-showcase-card">
            <div className="task-card-header">
              <h3 className="task-title">Travel Expense Reimbursement</h3>
              <div className="task-category office">🏢 Office & Business</div>
            </div>

            <p className="task-description">Validate expense claims against invoices.</p>

            <video src="https://raw.githubusercontent.com/WaitHZ/toolathlon-website/main/videos/2.mp4" className="w-full" autoPlay muted loop />
          </a>
        </Tab>

        <Tab title="💻Exp Recordings">
          <a href="https://toolathlon-traj.xyz/claude-4.5-sonnet_experiments-recordings" className="task-showcase-card">
            <div className="task-card-header">
              <h3 className="task-title">Exp Recordings</h3>
              <div className="task-category tech">💻 Tech & Dev</div>
            </div>

            <p className="task-description">Update the Notion table with best scores and steps per benchmark from W\&B runs, combining same-named runs and averaging available metrics.</p>

            <video src="https://raw.githubusercontent.com/WaitHZ/toolathlon-website/main/videos/3.mp4" className="w-full" autoPlay muted loop />
          </a>
        </Tab>

        <Tab title="🛒Product Recall">
          <a href="https://toolathlon-traj.xyz/claude-4.5-sonnet_woocommerce-product-recall" className="task-showcase-card">
            <div className="task-card-header">
              <h3 className="task-title">WooCommerce Product Recall</h3>
              <div className="task-category shopping">🛒 Shopping & E-commerce</div>
            </div>

            <p className="task-description">Sync the latest not-yet-updated product inventories from each city's SQLite warehouse database to the WooCommerce online store.</p>

            <video src="https://raw.githubusercontent.com/WaitHZ/toolathlon-website/main/videos/4.mp4" className="w-full" autoPlay muted loop />
          </a>
        </Tab>

        <Tab title="🎓Homework Grader">
          <a href="https://toolathlon-traj.xyz/claude-4.5-sonnet_canvas-homework-grader-python" className="task-showcase-card">
            <div className="task-card-header">
              <h3 className="task-title">Canvas Homework Grader Python</h3>
              <div className="task-category campus">🎓 Campus & Study</div>
            </div>

            <p className="task-description">Grade Homework2 by downloading the latest Python submissions from email, running them to check for errors, and assigning 10 (pass) or 0 (fail) in Canvas based on correctness.</p>

            <video src="https://raw.githubusercontent.com/WaitHZ/toolathlon-website/main/videos/5.mp4" className="w-full" autoPlay muted loop />
          </a>
        </Tab>

        <Tab title="💻K8S PR Preview Testing">
          <a href="https://toolathlon-traj.xyz/claude-4.5-sonnet_k8s-pr-preview-testing" className="task-showcase-card">
            <div className="task-card-header">
              <h3 className="task-title">K8S PR Preview Testing</h3>
              <div className="task-category tech">💻 Tech & Dev</div>
            </div>

            <p className="task-description">Deploy the feature/pr-123 branch of SimpleShopping to Kubernetes.</p>

            <video src="https://raw.githubusercontent.com/WaitHZ/toolathlon-website/main/videos/6.mp4" className="w-full" autoPlay muted loop />
          </a>
        </Tab>

        <Tab title="🎤Final Performance Analysis">
          <a href="https://toolathlon-traj.xyz/claude-4.5-sonnet_inter-final-performance-analysis" className="task-showcase-card">
            <div className="task-card-header">
              <h3 className="task-title">Inter Final Performance Analysis</h3>
              <div className="task-category daily">🎤 Daily & Entertainment</div>
            </div>

            <p className="task-description">Populate Inter Milan's 2023 and 2025 UCL final stats into three Google Sheets tabs, compute differences in "StatsDifference", and mark missing data.</p>

            <video src="https://raw.githubusercontent.com/WaitHZ/toolathlon-website/main/videos/7.mp4" className="w-full" autoPlay muted loop />
          </a>
        </Tab>
      </Tabs>
    </div>
  </div>

  <div className="homepage-footer">
    <div className="footer-content">
      <p>Built by <a href="https://github.com/hkust-nlp" target="_blank" rel="noopener noreferrer" className="footer-link">HKUST NLP</a></p>
    </div>
  </div>
</div>
