Arabic.AI and Stanford launch HELM Arabic Enterprise
Category: AI & ML
By Arin Sol
Published: 2026-06-10T12:15:21.000Z
If you cannot measure something, you cannot really improve it, and for a long time that was the quiet problem holding back Arabic AI. Arabic.AI, a Dubai-based company, has launched HELM Arabic Enterprise with Stanford, a benchmark built to test how Arabic models perform on real business tasks.
If you cannot measure something, you cannot really improve it, and for a long time that was the quiet problem holding back Arabic artificial intelligence. The language had no shared, rigorous way to judge how well AI models actually handled it. Arabic.AI, a Dubai-based enterprise AI company, has now taken another step toward fixing that, announcing the launch of HELM Arabic Enterprise in collaboration with Stanford University's Center for Research on Foundation Models. The new benchmark is built specifically to test how Arabic language models perform on the kind of work businesses actually need done.
The partnership itself is not brand new. Earlier this year the two sides teamed up to create HELM Arabic, the first holistic benchmark and public leaderboard for Arabic models, extending Stanford's well-regarded HELM framework into the language for the first time. HELM, which stands for Holistic Evaluation of Language Models, has become a global reference point for transparent and reproducible model testing, so bringing it to Arabic was a meaningful milestone. HELM Arabic Enterprise is the next layer, narrowing the focus from general language ability to practical business use, evaluating models across six enterprise tasks that span content generation, financial reasoning and legal question answering.
The reason this matters comes down to trust and decision making. Until now, a company in the region trying to choose between competing Arabic models had little objective basis for comparison, often relying on vendor claims or ad hoc testing.
A shared benchmark tied to real workflows gives teams a common baseline they can use for internal assessment, for comparing one vendor against another, and for keeping an eye on model performance over time. Arabic.AI's chief executive Nour Al Hassan framed the need plainly, arguing that Arabic enterprise AI requires an evaluation framework that is rigorous, open and connected to genuine business tasks rather than abstract scores.
There is a competitive subtext worth acknowledging too. Arabic.AI is not a neutral party, since its own flagship model, marketed as the region's top-ranked Arabic system, sits at the head of the HELM Arabic leaderboard. That gives the company an obvious interest in promoting a benchmark it performs well on, though partnering with an independent academic institution like Stanford lends the effort credibility it could not manufacture alone. The framework also evaluates both open and commercial models, which helps it serve the wider community rather than just one vendor.
The regional stakes are real. Arabic is spoken by more than 400 million people yet has long been underserved in AI development and testing, even as Gulf governments pour money into the field. Building proper measurement infrastructure is the unglamorous but essential groundwork that lets the rest of the ecosystem mature. As Saudi Arabia, the UAE and others push to build sovereign Arabic models, having a trusted yardstick to judge them against could prove just as important as the models themselves.