
Work Here?
LM Studio provides a local-first AI platform that lets users download, install, and run large language models directly on their own computers, avoiding cloud services. It supports multiple LLM frameworks and offers an easy interface for configuration and customization, so users can tailor models for tasks like natural language processing, text generation, and data analysis. By running models locally, LM Studio keeps data on the user’s device, giving them greater privacy and control while removing reliance on external servers. This makes LM Studio different from many competitors that require cloud access or single-framework support, as it emphasizes local execution, privacy, and broad compatibility. The goal is to help developers, researchers, and enthusiasts use AI tools securely and efficiently, integrating local LLMs into their workflows and projects.
Industries
Data & Analytics
Consumer Software
Enterprise Software
AI & Machine Learning
Company Size
11-50
Company Stage
N/A
Total Funding
N/A
Headquarters
New York City, New York
Founded
2023
People at LM Studio who can refer or advise you
Help us improve and share your feedback! Did you find this helpful?
Health Insurance
401(k) Retirement Plan
Remote Work Options
Paid Vacation
Flexible Work Hours
Wellness Program
Mental Health Support
Conference Attendance Budget
Stock Options
Company Equity
Professional Development Budget
Tuition Reimbursement
Meal Benefits
Phone/Internet Stipend
Home Office Stipend
Parental Leave
Family Planning Benefits
Improving LM Studio's MLX Engine for agentic workflows. Jun 5, 2026 · LM Studio recently released mlx-engine v1.8.5 in LM Studio. This update dramatically improves performance for repeated, long-context agentic workflows by checkpointing your KV cache. It also adds continuous batching for VLM requests. This work is open source; you can view the PR here. In this post, I'll explain the cache-reuse problem this solves, why current open-source LLM models make rewinding harder, and how the new disk-backed cache works. Its benchmarks show up to 80% lower extra RAM usage, up to 2x more throughput, and up to 3.5x faster processing for image requests. Adrien uses the new mlx-engine for a local review of a URL-shortener app with codex -oss. What is mlx-engine? MLX Engine (mlx-engine) is an MIT-licensed inference engine optimized for Apple silicon. It was created and maintained by LM Studio. It uses Apple's MLX machine learning library, and builds on projects such as mlx-lm and mlx-vlm. MLX Engine is LM Studio's backend for all MLX inferencing. Current model architectures, and the shortcomings of mlx-engine. Two of the most popular open-source models right now are Qwen 3.5 (and 3.6) and Gemma 4. As part of each model's architecture, they use some nifty tricks to reduce the size of the KV cache at large context lengths. Qwen 3.5 uses a hybrid architecture and Gemma 4 uses a sliding window architecture. These attention strategies reduce memory usage at large context lengths, but they make the KV cache not arbitrarily rewindable. Let's walk through how Gemma 4 handles inference. This example is focused on Gemma 4 E2B; it interleaves "local" attention layers (sliding window of 512 tokens), and "global" attention layers. Gemma 4 interleaves local and global attention layers. Rewinding after a reasoning-heavy agent turn can leave parts of the local KV cache missing. Step 1: Prompt prefill. Compute the KV cache for the system prompt and user message. Step 2: Decode. Build up the KV cache while computing the assistant reasoning content and assistant message. Step 3: Rewind. Trim the KV cache to step (1) and append the assistant message without the prior reasoning content. So, a key problem that inferencing engines have to solve is avoiding re-computation when rewinding the KV cache to prepare a follow-up response. How LM Studio improved prompt caching in mlx-engine. LM Studio devised a solution for KV cache rewinding for these agentic use cases. By saving and restoring prompt cache to disk, the KV cache for follow-up requests does not need to be recomputed. Saving the KV cache to disk. Copying and storing these KV caches at 256-token boundaries lets LM Studio restore exact cached prefixes when the corresponding KV cache tensors are still present. If part of the prompt was edited, never computed, or evicted from the disk cache, mlx-engine falls back to recomputing that suffix. 256 tokens is small enough to avoid wasting much work on recomputation, while large enough to keep the disk cache efficient. mlx-engine saves KV cache records at fixed 256-token boundaries, then restores the longest available cached prefix for follow-up requests. First, at every boundary of 256 tokens (sequence len % 256 == 0), stream a copy of the local attention layers' KV cache to a disk-writer backend. While the model is processing the prompt or generating new tokens, a background disk-writing process is running. At every 256 token boundary, the system copies the KV cache corresponding to the most recent 256 tokens and sends it to the disk writer which then persists that block to disk. Since Apple silicon has a unified-memory architecture, LM Studio commit the local attention KV cache to disk and evict it from memory. This ensures that mlx-engine's memory usage footprint scales with active sequences, rather than all previously seen sequences. Restoring the KV cache from disk. First, calculate a key for each block of 256 tokens. Then, determine which global and local KV cache blocks need to be retrieved. Using the list of keys and cache types for the prompt, load as much KV cache as LM Studio can from the disk. For prompt sections that never had their KV cache computed (or had their KV cache evicted from disk), schedule those sections for prompt prefill. The disk cache is an LRU store, so whenever LM Studio save to or load from its disk store, the store evicts the least recently used KV cache tensors. This ensures that its disk store optimizes for the usage pattern. If the engine is sent short prompts using the same system prompt, the system prompt's local attention KV cache will not be evicted, but the KV cache of stale conversations will be evicted. And, if the engine is only receiving requests for one ever-growing conversation, the earlier local attention KV cache will get evicted in order to make room for the longer global attention KV cache. Disk cache design. LM Studio designed the disk cache to clean itself up after the model is unloaded. In other words, the cache is temporary and will not leave persistent files. The disk cache is one scratch file, not a folder full of independent cache files. LM Studio pack many cache records into that one file. Each KV cache entry is a serialized safetensors blob, and the engine keeps an in-memory table saying: "entry X starts at byte offset Y and is Z bytes long." When KV cache entries are evicted, their byte ranges are returned to a free list and reused by later records; if free space reaches the end of the file, the file is shrunk. LM Studio make the disk cache temporary by using the operating system's temporary-file mechanism in /tmp, and by treating all lookup metadata as model-lifetime state only. On model unload, the cache store clears its in-memory index and closes the scratch file. If the model process exits, the OS closes the file handle and releases the storage. And, continuous batching. LM Studio also added continuous batching to its vision model runner. Plenty of ink has already been spilled on the implementation and benefits of continuous batching; Hugging Face has a great explainer. Continuous batching allows users to use the same model for concurrent request processing. Along with the KV cache improvements described earlier, mlx-engine can now be used for serious agentic workloads. Benchmarks. To make the performance improvements more concrete, LM Studio ran a few end-to-end LM Studio API benchmarks on an M3 Max MacBook Pro with 36 GB of RAM, using lmstudio-community/Qwen3.6-27B-MLX-4bit. These benchmarks focus on the workloads that this update is intended to improve: parallel chat, long-prompt processing, and repeated high-resolution image prompts. Benchmark: parallel chat throughput. Setup: The model was loaded with parallel=4, then four short chat requests were sent concurrently through the LM Studio API. Each response was allowed to stop naturally. Parallel chat throughput mlx-engine Output tokens End-to-end output tok/s Total tokens End-to-end total tok/s 2.2x faster Result: for this four-way parallel chat workload, mlx-engine v1.8.5 completed the run about 2.2x faster end-to-end, with nearly identical output token counts. Benchmark: memory under parallel long prompts. Setup: The model was loaded with parallel=4, then four large prompts were sent concurrently through the LM Studio API. RAM usage was measured after the model loaded and again after the run completed. Memory under parallel long prompts mlx-engine Input tokens Output tokens Total tok/s RAM after load RAM after run Extra RAM after run 82% less extra RAM Result: for this parallel long-prompt workload, mlx-engine v1.8.5 used about 82% less extra RAM after the run, while maintaining similar wall-clock time and slightly higher total token throughput. This is the expected benefit of moving inactive prompt-cache records out of unified memory. The active sequences still need to stay resident, but stale cache records no longer have to keep accumulating in RAM. Benchmark: repeated high-resolution image prompt. Setup: The same image prompt was sent twice, generating one token per request. This isolates the cost of processing the image-expanded prompt and restoring the prompt cache. mlx-engine Cached prompt tokens Uncached prompt tokens
LM Studio: run any AI model on your computer with a beautiful GUI. Not everyone wants to live in a terminal. For developers, researchers, and curious users who prefer a point-and-click experience, LM Studio is the gold standard for running AI models locally. It combines a polished desktop application with serious technical capabilities - making local LLMs accessible to anyone, regardless of their command-line comfort level. Released as a free desktop app for macOS, Windows, and Linux, LM Studio has quietly become one of the most-used tools in the local AI space. As of 2026, it supports thousands of models from Hugging Face, features a built-in chat interface, and offers an OpenAI-compatible local server - all wrapped in one of the cleanest UIs in open source software. What is LM Studio? LM Studio is a desktop application that lets you discover, download, and run open source language models entirely on your local machine. It acts as a friendly frontend for llama.cpp and other inference backends, handling all the technical complexity behind the scenes. Where Ollama focuses on simplicity and developer-first CLI usage, LM Studio prioritizes visual accessibility. You can browse models, read their descriptions, check hardware compatibility warnings, download with a progress bar, and start chatting - all without writing a single line of code. Key features. Hugging Face model browser. LM Studio integrates directly with Hugging Face, giving you access to tens of thousands of models from a searchable in-app directory. Filters help you narrow by model type, size, quantization format, and hardware compatibility. GGUF model support. LM Studio runs models in GGUF format - the standard quantized format for consumer-grade local inference. Quantization shrinks model size by representing weights in lower precision (e.g., 4-bit instead of 32-bit), making large models runnable on everyday hardware with minimal quality loss. Built-in chat interface. Switch between models mid-conversation, adjust system prompts, tweak generation parameters (temperature, top-p, context length) all from the UI - no config files required. Local inference server. LM Studio can run as a local server that mimics the OpenAI API. This means tools like Cursor, Continue, or any custom application built against OpenAI's SDK can be pointed at LM Studio with minimal changes. Multi-model sessions. Recent versions allow running multiple models simultaneously and routing between them - useful for comparing outputs or building multi-agent workflows. How to get started. Step 1 - Download LM Studio Visit lmstudio.ai and download the installer for your platform. It is a standard application installer - no dependencies, no terminal required. Step 2 - Browse and Download a Model Open LM Studio and navigate to the Discover tab. Search for a model - try "Mistral 7B" or "Llama 3.2" to start. LM Studio will show you compatible quantized versions and flag whether they fit in your available RAM. Click Download. A progress bar shows the download status. Most 7B models are 4-6GB depending on quantization. Step 3 - Load the Model and Chat Go to the Chat tab, select your downloaded model from the dropdown, and start typing. The model loads into memory (usually 5-20 seconds depending on size and hardware) and you are ready to go. Step 4 - Start the Local Server Navigate to the Local Server tab, select a model, and click Start Server. LM Studio will run an OpenAI-compatible API at http://localhost:1234/v1. You can now use it with any compatible tool or SDK. Understanding quantization: what do those letters mean? When browsing models in LM Studio, you will see file names like mistral-7b-instruct.Q4_K_M.gguf. The quantization suffix tells you the quality/size trade-off: For most use cases, Q4_K_M is the sweet spot - it fits comfortably in 8GB of RAM and produces output that is nearly indistinguishable from the full-precision model. LM Studio vs Ollama: which should you use? Both tools are excellent. The right choice depends on your workflow. Choose LM Studio if: you prefer a GUI, you want to browse and discover models visually, you are not comfortable with the command line, or you want to quickly compare multiple models side-by-side. Choose Ollama if: you prefer CLI tools, you are building scripts or automated pipelines, you want to integrate with Docker or server environments, or you need the lightest possible footprint. Many practitioners use both - LM Studio for exploration and experimentation, Ollama for integration into development workflows. Privacy: the real selling point. It is worth stepping back and appreciating what LM Studio actually gives you from a privacy perspective. When you use ChatGPT, Claude, or Gemini, every prompt you send travels over the internet to a remote server. Your conversations may be used to improve models, reviewed by human trainers under certain conditions, or stored for extended periods. For many consumer use cases this is fine. For sensitive work - legal documents, medical notes, confidential business strategy, personal journaling - it is a meaningful concern. LM Studio eliminates this entirely. Your prompts never leave your machine. There is no account required (you can use it completely anonymously), no usage data sent to a server, and no terms of service governing what you can say. What you type stays on your computer. The bottom line. LM Studio is the most accessible entry point into local AI for users who want power without complexity. Its clean interface, deep model library, and seamless server functionality make it genuinely useful for both beginners and experienced practitioners. If you have been curious about running AI locally but were put off by command-line tools, LM Studio removes every barrier. Download it, pull a model, and have your first fully private AI conversation today.
LM Studio has released version 0.3.17, introducing support for the Model Context Protocol (MCP) - a step forward in enabling language models to access external tools and data sources.
Prepared, an AI and cloud-based platform that optimizes emergency response systems, has raised $80M in Series C funding led by General Catalyst.
A war is raging between the proponents of open-source and closed-source AI. Amid this landscape, one Singapore startup is making a splash. Menlo Research wants to build the “brain for robots,” software that helps machines see, think, and act in the real world.Image credit: Timmy LoenThe company is behind Jan, a lightweight desktop app that lets users run large language models locally, without sending any data to the cloud. Since launching in October 2023, Jan has exploded in popularity. The free app has been downloaded millions of times, with an average of 7,000 new installs per day. Some users have touted it as an easy way to run open-source LLMs without the need for powerful machines
Find jobs on Simplify and start your career today
Industries
Data & Analytics
Consumer Software
Enterprise Software
AI & Machine Learning
Company Size
11-50
Company Stage
N/A
Total Funding
N/A
Headquarters
New York City, New York
Founded
2023
Find jobs on Simplify and start your career today