Tensormesh Secures $4.5 Million to Slash AI Costs Tenfold

Tensormesh: Pioneering AI Inference Efficiency

As AI infrastructure expands at an accelerating pace, so does the pressure to squeeze more inference out of every available Graphics Processing Unit (GPU). Against this backdrop, Tensormesh has emerged from stealth with $4.5 million in seed funding led by Laude Ventures, with additional backing from database pioneer Michael Franklin. Tensormesh will use the funding to build a commercial version of LMCache, the open-source tool developed and maintained by co-founder Yihua Zheng. When deployed well, LMCache can cut inference costs by up to ten times, which has made it a staple of open-source deployments and drawn the attention of giants such as Google and NVIDIA. Tensormesh now aims to turn that academic success into a sustainable, innovative commercial business.


Illustrative image showing the structure of a knowledge graph

Core Technology: Key-Value Cache (KV cache) System

Without KV Cache (Traditional Method)

  • Repeated re-computation of previous tokens.
  • High resource consumption and slow responses.

With KV Cache (Tensormesh Optimization)

  • Storage and reuse of key and value matrices.
  • Faster inference and significant resource savings.

Tensormesh's core technology revolves around the key-value (KV) cache, a crucial optimization in Large Language Models (LLMs) that stores the computed key and value matrices for previously processed tokens. By avoiding redundant re-computation in the self-attention mechanism, the model can generate new tokens faster and more cheaply, which matters most for long text generation and AI conversations that must remember prior context (Hugging Face). Caching therefore speeds up model responses without re-processing information that has already been computed (Microsoft Research, May 8, 2024). In conventional deployments, however, this cache is typically discarded after each query, which Tensormesh CEO Guochun Jiang considers a major source of inefficiency.
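The saving the KV cache provides can be made concrete with a toy model. The sketch below (illustrative only; real LLMs cache per-layer key/value tensors, not strings) counts how many key/value "projections" are performed during autoregressive decoding with and without a cache:

```python
def decode(prompt_tokens, steps, use_cache):
    """Generate `steps` tokens, counting how many key/value
    projections are computed along the way."""
    computations = 0
    cache = []  # stored (key, value) pairs for already-seen tokens
    context = list(prompt_tokens)
    for _ in range(steps):
        if use_cache:
            # Only tokens not yet cached need fresh key/value projections.
            for tok in context[len(cache):]:
                cache.append((f"K({tok})", f"V({tok})"))
                computations += 1
        else:
            # Without a cache, every token's keys/values are recomputed.
            cache = [(f"K({t})", f"V({t})") for t in context]
            computations += len(context)
        # Stand-in for the model emitting the next token.
        context.append(f"tok{len(context)}")
    return computations

no_cache = decode(["a", "b", "c"], steps=5, use_cache=False)
with_cache = decode(["a", "b", "c"], steps=5, use_cache=True)
print(no_cache, with_cache)  # 25 vs. 7 projections
```

Without the cache the cost grows quadratically with sequence length; with it, each new token costs one projection after the prompt is processed once.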


A blurry image depicting a linear graphical representation on a screen

Overcoming Traditional Limitations and Reusing the KV Cache

  • Traditional Approach: The KV cache is discarded after each query, forcing continuous re-computation ("forgetting").
  • Tensormesh Solution: The KV cache is retained and reused across consecutive queries, giving the model persistent "memory."
  • Intelligent Distribution: Data can be spread across multiple storage tiers (GPU, CPU, disk) to conserve resources.
  • Maximum Efficiency: Significantly higher inference capability from the same allocated server resources.

Guochun Jiang, co-founder of Tensormesh, describes the status quo: "It's like having a very smart analyst who reads all the data, but forgets everything they learned after each new question." Tensormesh avoids this waste by retaining the KV cache so it can be reused whenever the model encounters similar computations in subsequent queries. Because GPU memory is scarce and expensive, this may require spreading the data across multiple storage tiers, offloading key/value data from GPU memory to cheaper media such as CPU memory or disk; the cache can then be reloaded and inference resumed without re-computation (BentoML). The result is significantly higher inference capability from the same allocated server resources.
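The tiered-offloading idea can be sketched as a small two-level cache. This is a hypothetical design, not Tensormesh's actual system: a size-limited "GPU" tier evicts its least-recently-used entries into a larger "CPU" tier instead of discarding them, so a repeated query is promoted back rather than recomputed:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small fast tier ("GPU") backed by a
    larger slow tier ("CPU"). Evictions are offloaded, not discarded."""

    def __init__(self, gpu_slots=2):
        self.capacity = gpu_slots
        self.gpu = OrderedDict()   # fast tier, limited capacity
        self.cpu = {}              # slower, larger tier
        self.recomputed = 0        # count of expensive full misses

    def get(self, prefix):
        if prefix in self.gpu:     # hit in the fast tier
            self.gpu.move_to_end(prefix)
            return self.gpu[prefix]
        if prefix in self.cpu:     # hit in the slow tier: promote it
            kv = self.cpu.pop(prefix)
            self._put_gpu(prefix, kv)
            return kv
        self.recomputed += 1       # full miss: recompute from scratch
        kv = f"KV({prefix})"
        self._put_gpu(prefix, kv)
        return kv

    def _put_gpu(self, prefix, kv):
        self.gpu[prefix] = kv
        if len(self.gpu) > self.capacity:
            # Offload the least-recently-used entry to the CPU tier.
            old, old_kv = self.gpu.popitem(last=False)
            self.cpu[old] = old_kv

cache = TieredKVCache(gpu_slots=2)
for query in ["doc-A", "doc-B", "doc-C", "doc-A"]:
    cache.get(query)
print(cache.recomputed)  # the repeated doc-A query avoids recomputation
```

A cache that simply discarded evicted entries would recompute all four queries; here the repeated query is served from the slower tier, which is still far cheaper than running the model again.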


Dynamic animation illustrating the construction of a knowledge graph

Innovative Applications and the Value of Tensormesh Solution

Challenge: Building an In-house Solution

  • High Technical Complexity: A daunting task requiring specialized expertise.
  • Long Time and Huge Resources: May take months and employ dozens of engineers.
  • Exorbitant Costs: Significant investment without guaranteed optimal results.

Value: The Ready-to-Use Tensormesh Solution

  • Immediate High Efficiency: Reducing operational costs by up to ten times.
  • Time and Resource Savings: Overcoming technical challenges with a ready-to-use application.
  • Enhanced Performance and Reliability: Leveraging the specialized expertise of the Tensormesh team.

This approach is particularly effective in chat interfaces, where the model must repeatedly refer to a growing conversation history to produce coherent, relevant responses, and in agent systems that maintain an ever-growing log of actions and goals. AI companies could, in principle, implement these optimizations in-house, but the technical complexity makes that a daunting, time-consuming task. Given its team's specialized expertise and the depth of its research into the intricacies of the process, Tensormesh is betting on strong demand for a ready-to-use solution that delivers the required efficiency.
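Why chat workloads benefit so much can be shown with a simple prefix-matching sketch (illustrative only, using character counts as a stand-in for token counts; this is not Tensormesh's actual algorithm). Each turn's prompt extends the previous one, so a cache of prior prompts covers almost everything except the newest suffix:

```python
def cost_of_turn(history_turns, cache):
    """Return how much of the prompt must be freshly processed when
    the cache holds the full prompts of earlier turns."""
    prompt = " ".join(history_turns)
    best = ""
    for cached in cache:                 # find the longest cached prefix
        if prompt.startswith(cached) and len(cached) > len(best):
            best = cached
    cache.add(prompt)                    # remember this prompt for next turn
    return len(prompt) - len(best)       # only the uncached suffix is new

cache = set()
turns, costs = [], []
for msg in ["hello", "tell me about KV caches", "and about offloading"]:
    turns.append(msg)
    costs.append(cost_of_turn(turns, cache))
print(costs)  # cost of each turn shrinks to just the new message
```

After the first turn, each request pays only for its new message rather than the entire accumulated history, which is exactly the pattern chat interfaces and long-running agents produce.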


The image represents a digital display of glowing charts and data lines

Jiang explains the difficulty, stating: "Retaining the KV cache in secondary storage and efficiently reusing it without slowing down the entire system presents a significant technical challenge. We've seen companies employ 20 engineers and take three or four months to build such a system. With our product, they can achieve this with high efficiency and significantly save time and resources."
