Full-Time

Senior Solution Architect

Honeycomb.io

201-500 employees

Observability platform for cloud applications

Compensation Overview

£143.7k - £169.2k/yr

+ Commission

Remote in UK

Remote in the United Kingdom; travel up to 20% for customer engagements and events.

Category
Sales & Solution Engineering
Required Skills
Kubernetes
OpenTelemetry
Docker
Observability
REST APIs
Requirements
  • 5+ years of pre-sales, solution architecture, sales engineering, or customer-facing technical experience in B2B SaaS
  • Experience supporting complex technical sales cycles with multiple stakeholders, ideally in enterprise or commercial accounts with sophisticated engineering teams
  • Strong solutioning skills and the ability to turn ambiguous customer needs into clear technical approaches and evaluation plans
  • Deep working knowledge of observability, distributed systems, distributed tracing, and OpenTelemetry
  • Hands-on fluency with modern cloud environments, containers, Kubernetes, API-driven architectures, and at least one major cloud provider
  • Ability to troubleshoot instrumentation and integration issues across a range of languages, frameworks, and deployment patterns
  • Experience owning POCs or technical evaluations, including defining success criteria, managing risk, and driving toward measurable outcomes
  • Strong consultative communication skills, with the ability to tailor technical messaging across practitioners, architects, and director-level stakeholders
  • Credibility with developer, platform, SRE, and engineering-led audiences through both technical depth and business context
  • A clear point of view on Honeycomb’s value and differentiation in the observability landscape
  • Strong judgment, autonomy, and follow-through in fast-moving deal environments
  • A collaborative, low-ego working style with the ability to pull in cross-functional partners when the situation calls for it
  • A mentorship mindset and desire to make the team stronger through coaching, documentation, and reusable work
  • Willingness to travel ~20% for customer engagements, field events, and internal collaboration
Responsibilities
  • Own the technical strategy for complex sales opportunities, partnering closely with Account Executives and sales leadership to move deals forward
  • Lead discovery, solution design, and technical validation for customer environments with multiple stakeholders and competing requirements
  • Run complex evaluations and proofs of concept end-to-end, including success criteria, technical execution, risk identification, and debrief
  • Translate customer architectures into clear solution narratives that connect Honeycomb’s platform to engineering priorities and business outcomes
  • Deliver compelling demos, workshops, and technical presentations tailored to audiences ranging from practitioners to director-level stakeholders
  • Get hands-on when it matters: support instrumentation strategy, troubleshoot integration issues, and help customers prove value in real environments
  • Manage competitive evaluations and bake-offs with strong judgment, disciplined execution, and a clear point of view on Honeycomb’s differentiation
  • Identify technical and strategic risks across active deals and proactively adjust the evaluation approach when needed
  • Partner across Sales, Product, Field Engineering, Marketing, and DevRel to unblock deals and improve the overall go-to-market motion
  • Contribute reusable architectures, demo assets, and technical content that increase team effectiveness and scale best practices
  • Bring structured field insight back into the company to help inform product feedback, messaging, and technical strategy
  • Mentor more junior Solution Architects through collaboration, coaching, and shared learning
Desired Qualifications
  • Familiarity with MEDDPICC or Command of the Message is a plus

Honeycomb.io provides an observability platform for cloud applications. It collects telemetry data from applications and lets users query billions of events to understand performance and behavior. It uses Service Level Objectives (SLOs) to alert when customer experiences degrade, enabling teams to quickly debug issues. It offers tools like Query Builder, Query Assistant, and BubbleUp to analyze data and compare subsets against baselines. Its pricing is likely subscription or usage-based, charging based on data sent for analysis. The company differentiates itself by handling high-cardinality event data at scale with fast, flexible querying and by focusing on problem diagnosis and reliability for a wide range of customers. The goal is to help teams identify, debug, and optimize cloud applications to improve reliability and customer experience.

Company Size

201-500

Company Stage

Series D

Total Funding

$146.9M

Headquarters

San Francisco, California

Founded

2016

Simplify's Take

What believers are saying

  • Private Cloud launched November 2025 secures enterprise data residency on dedicated AWS.
  • Refinery 3.0 cuts CPU 70% and memory 60%, enabling cost-efficient high-throughput sampling.
  • Hosted MCP integrates observability into GitHub Copilot, driving developer adoption.

What critics are saying

  • Datadog replicates MCP server within 6-12 months, capturing existing customers.
  • Amazon Q Developer bundles native observability, undercutting Honeycomb in 12-18 months.
  • Metrics promo pricing expires June 2026, spiking churn among mid-market users.

What makes Honeycomb.io unique

  • Honeycomb unifies logs, metrics, traces into single event type for sub-second queries.
  • Canvas AI copilot enables natural language investigations with autonomous anomaly detection.
  • Agent Timeline visualizes multi-agent AI workflows tracking LLM calls in real-time.

Benefits

Health Insurance

Unlimited Paid Time Off

Remote Work Options

Home Office Stipend

Phone/Internet Stipend

Parental Leave

Professional Development Budget

Company Equity

Growth & Insights and Company News

Headcount

6 month growth

2%

1 year growth

3%

2 year growth

1%
Nurbak
Apr 2nd, 2026
Honeycomb vs Datadog: which observability tool in 2026?

Honeycomb vs Datadog: which observability tool in 2026? An honest comparison of Honeycomb and Datadog for observability. Honeycomb excels at event-driven debugging. Datadog is the all-in-one SaaS. Here is when to pick each - and when neither fits.

Honeycomb and Datadog are both observability tools, but they approach the problem from fundamentally different directions. Honeycomb was built to answer unknown questions about your systems. Datadog was built to be the single platform for everything operations-related. This comparison covers how they actually differ in 2026, where each one excels, and - importantly - when neither is the right choice.

Honeycomb: event-driven debugging

Honeycomb was founded by Charity Majors and Christine Yen, both from Facebook's infrastructure team. The core idea: traditional monitoring tools force you to decide what to measure before you know what questions you will ask. Honeycomb flips this: you send high-cardinality events, and query them later.

How it works: every span, log, or event you send to Honeycomb is stored in a columnar store optimized for ad-hoc queries. You can group by, filter, and break down on any attribute - user ID, shopping cart size, feature flag variant, database query text - without pre-defining indexes. This is the key differentiator. In Datadog, if you want to filter APM data by a custom attribute, you need to index it (and pay for it). In Honeycomb, every attribute is queryable by default.

Standout features:

* BubbleUp: Select a group of slow or erroring requests, and Honeycomb automatically identifies which attributes differ between the selected group and the baseline. Instead of guessing root causes, the tool shows you.
* Query builder: Flexible enough to replace many custom dashboards. Group by multiple dimensions, calculate percentiles, heatmaps, and rates - all in a single query interface.
* SLOs: Define service-level objectives tied to your trace data. Honeycomb tracks burn rate and alerts when you are consuming your error budget too fast.
* OpenTelemetry native: First-class OTLP support. Honeycomb was an early and active contributor to the OpenTelemetry project.

Pricing (2026):

* Free: 20M events/month
* Pro: $130/month for 100M events
* Enterprise: custom pricing, SSO, advanced roles

Limitations:

* Not an all-in-one platform. No infrastructure monitoring, no synthetics, no log management (though it can ingest structured logs as events).
* Steeper learning curve. Getting value from Honeycomb requires understanding high-cardinality querying, which is a different mental model than dashboards-and-alerts.
* Smaller ecosystem. Fewer integrations, fewer pre-built dashboards, fewer community resources than Datadog.

Datadog: the all-in-one platform

Datadog started as infrastructure monitoring in 2010 and has expanded into a comprehensive observability and security platform. In 2026, it covers infrastructure, APM, logs, synthetics, real user monitoring (RUM), security, CI/CD visibility, database monitoring, and more. You install the Datadog Agent on your hosts (or use the serverless integration). The agent collects metrics, traces, and logs automatically. Datadog's auto-instrumentation libraries handle most popular frameworks, so you get APM data with minimal code changes. Everything lands in a single platform with cross-linking: click on a trace to see related logs, jump from a metric spike to the traces that caused it, or correlate infrastructure metrics with application performance.

* Unified platform: Infrastructure, APM, logs, RUM, synthetics, security - all in one UI with cross-linking between signals.
* Auto-instrumentation: Datadog's agent auto-instruments most frameworks. Less manual work than OpenTelemetry-based tools.
* Service map: Automatically generated dependency graph showing how services communicate, with health indicators on each edge.
* Notebooks and dashboards: Rich visualization with team sharing, annotations, and incident timelines built in.
* Watchdog AI: Automated anomaly detection that flags unusual patterns without manual threshold configuration.

Pricing (2026):

* Infrastructure: $15/host/month
* APM: $31/host/month
* Logs: $0.10/GB ingested (after plan inclusion)
* Indexed Spans: $1.70 per million (after 1M included with APM)
* Synthetics: $5/1000 API test runs

The complexity of Datadog pricing is itself a feature - or a bug, depending on your perspective. Many teams report bill shock after scaling up, because each product has separate metering and the costs compound.

Limitations:

* Cost unpredictability at scale. The per-host, per-GB, per-million-spans pricing model makes budgeting difficult.
* Vendor lock-in. Datadog's proprietary agent and query language make migration expensive.
* Jack of all trades. Each individual feature is good but rarely best-in-class. Honeycomb's debugging is deeper, Grafana's dashboards are more flexible, and PagerDuty's alerting is more sophisticated.

Head-to-head comparison:

| Dimension | Honeycomb | Datadog |
| Philosophy | Debug unknown unknowns | Monitor everything in one place |
| Query power | Excellent (high-cardinality native) | Good (requires indexing for custom attributes) |
| Infrastructure monitoring | No | Yes (core strength) |
| Log management | No (events only) | Yes |
| Synthetics | No | Yes |
| Auto-instrumentation | Via OpenTelemetry | Proprietary agent (more automatic) |
| SLO tracking | Yes (built-in) | Yes (built-in) |
| Free tier | 20M events/month | 5 hosts, limited features |
| Cost at scale | Predictable (event-based) | Unpredictable (multi-axis metering) |
| Best for | Debugging distributed systems | Full-stack operations teams |

When to choose Honeycomb:

* Your primary pain is debugging - finding why specific requests fail or slow down.
* You run microservices and need to trace requests across service boundaries.
* You already use other tools for infrastructure (Prometheus, Grafana) and logs (Loki, ELK).
* You want OpenTelemetry-native tooling without vendor lock-in.
* Your team is comfortable with a query-driven workflow (vs. dashboard-driven).

When to choose Datadog:

* You want one platform for infrastructure, APM, logs, and more.
* Your team prefers dashboards and pre-built views over ad-hoc queries.
* You need auto-instrumentation with minimal code changes.
* You have the budget and want to minimize the number of vendors.
* You need compliance features (SOC 2, HIPAA, audit logging) from a single vendor.

When neither fits

Both Honeycomb and Datadog are built for teams running distributed systems at meaningful scale. But a large portion of modern applications do not look like that. If you are running a single Next.js application - deployed on Vercel or a VPS - you do not have distributed traces to analyze. You do not have 50 hosts to monitor. You have API routes that need to be fast, reliable, and monitored.

For that scenario, both tools are overkill. Honeycomb's high-cardinality debugging is powerful but unnecessary when your "distributed system" is one application. Datadog's all-in-one platform costs more per month than most indie products earn. What you need is focused API monitoring: per-endpoint response times, error rates, status code tracking, and instant alerts when something breaks.

Nurbak Watch is built for this exact use case. It runs inside your Next.js server via instrumentation.ts - five lines of code - and monitors every API route automatically. Alerts hit Slack, email, or WhatsApp in under 10 seconds. $29/month flat, free during beta. No per-host pricing, no per-span charges, no bill surprises. If your architecture grows into microservices, you can graduate to Honeycomb or Datadog. But start with what your architecture actually needs today.

The Nurbak team builds developer-first API monitoring tools and shares insights on uptime, performance, alerting, and best practices for keeping APIs healthy in production. Ready to try it? Nurbak Watch is free during beta: five lines of code, first alert in under five minutes.

PR Newswire
Mar 11th, 2026
Honeycomb launches AI observability tools and metrics platform for autonomous software development

Honeycomb, an observability platform, has launched new AI capabilities and announced general availability of Honeycomb Metrics, positioning itself as the first observability platform built for AI agent-driven development. The company has raised no new funding in this announcement. New features include Agent Skills for Claude Code and Cursor, Automated Investigations for autonomous issue detection, a Slackbot for natural language queries, and Pipeline Intelligence for automated telemetry pipeline creation. Honeycomb Metrics offers both time series and event-based models on one platform, eliminating cardinality-based billing concerns. The company is expanding its Model Context Protocol integrations across AI development tools, with early adopters including Notion, Scribe, and Fortune 500 retailers. Introductory pricing starts at $2 per 1,000 time series monthly until June 2026.

Honeycomb
Feb 17th, 2026
Teaching AI How to Refinery

Honeycomb released a new tool for its MCP server that helps your AIs understand Refinery, and how Honeycomb handles sampling. In this blog, I'll walk you through the new feature and show how it can help you run Refinery.

At the beginning of February, we released v3.1 of Refinery, our advanced, tail-based sampling solution. The new version comes with more performance enhancements, bug fixes, and a few new pieces of telemetry. In tandem with the 3.1 release, we also released a new tool for our MCP server which helps your AIs understand Refinery, and how Honeycomb handles sampling. With this new tool, you can ask Canvas or your favorite AI questions like: "What are my sampling rules doing?" "Is my Refinery configured correctly?" or "Am I achieving my goal sample rate?"

How we found a bug in our own sampling rules

A great example of how useful this feature can be is how it surfaced an issue in our own sampling rules during development. Once we had written the static content about Refinery, I wanted to test it using an AI assistant. What better way to test than to ask questions about our own sampling? We generate a lot of telemetry from our ingest pipeline, so we use Refinery to sample the data, with a RulesBasedSampler and seven different rules. One of the rules is set up to ensure that all our end-to-end test data (which runs roughly every minute in production) is kept:

  - Name: E2E Data
    Conditions:
      - Field: service.name
        Operator: =
        Value: shepherd
      - Field: app.environment.id
        Operator: =
        Value: 49427
    SampleRate: 1

Some very astute Refinery readers may already see the issue, but at the time, I didn't.
While testing, I asked an AI assistant that was hooked up to Honeycomb's MCP server: "Look at my production_us1_refinery_values.yaml production refinery rules. What do they do? Use the honeycomb mcp to learn about Refinery first." It used the new Refinery tool to learn about Refinery, review the rules, and give a detailed answer. For this story, the important part of the response was:

  Sampling Rules Breakdown
  1. E2E Data Rule (lines 128-137)
     - Keeps: 100% of traces (SampleRate: 1)
     - When: shepherd service AND app.environment.id = "49427" (string comparison)
     - Purpose: Preserve all E2E test traces for the test environment

Again, readers who have hit this issue before may see that the AI already knows something really important about app.environment.id. But again, I didn't see it. This was all I needed to test; the ability for an AI to accurately review and critique Refinery rules was my goal, and goal achieved! But I've been going through my own journey learning how to use AI, and I was curious (and skeptical) whether it would be able to apply its knowledge to the data itself. So, I asked a follow-up question: "Based on your understanding of the rules, is the data in the shepherd dataset being sampled correctly? Also look in the refinery-metrics dataset. Use the honeycomb mcp server to review this data." The AI used the Honeycomb MCP to query data and assess whether it was sampling correctly based on the rules it had reviewed. Eventually, it returned an unexpected answer:

  CRITICAL ISSUE: E2E Data Not Sampled Correctly
  Problem: E2E environment traces (app.environment.id = 49427) are matching the wrong rule.
  - Expected: Should match "E2E Data" rule (lines 128-136) with SampleRate: 1 (100% retention)
  - Actual: Matching "DeterministicSampler for pro plus/enterprise handle_batched_event" rule with SampleRate: 10
  - Impact: E2E data is being sampled at 1:10 instead of keeping all traces

  Since rules are evaluated in order and the E2E rule comes FIRST, it should match before the Pro/Enterprise rule. The fact that it's not matching suggests:
  1. Possible cause: The app.environment.id field may not exist or not be set to 49427 on all spans in E2E traces
  2. Rule evaluation: Without Scope: span on the E2E rule, it uses Scope: trace (default), meaning conditions can match across different spans - but if app.environment.id doesn't exist anywhere in the trace, the rule fails

I was very skeptical, as this rule had been in place for a long time. Surely, this was just another case of AI getting it wrong, right? I told it to try again: "This would be pretty serious. Rerun your analysis to confirm your math is correct. Since the e2e rule comes first it should take priority. Test your hypothesis that app.environment.id does not exist." While it ran, I went and searched the data myself and found the same result the AI had: the E2E Data rule was never being hit. I couldn't think of why that would be happening. At this point, my AI assistant was done thinking and returned the answer for me. I'm including the entire output this time because it did such a good job of supporting its claim:

  FOUND THE BUG!
  Looking at the YAML, I can now provide my final analysis. The data confirms:
  Confirmed Issue: 388 E2E traces exist that match BOTH conditions (service.name = shepherd AND app.environment.id = 49427), but 0 traces match the E2E rule.
  Root Cause: The most likely issue is that the integer value 49427 on line 135 needs to be quoted as a string in Refinery's configuration:
    Value: "49427"  # Should be a string, not an integer
  Why This Matters: In Refinery's rule evaluation, field values may need to be strings regardless of the underlying Honeycomb field type. Even though app.environment.id is stored as an integer in Honeycomb, Refinery may perform string-based comparison during rule evaluation.
  Evidence:
  1. Working rules: The Pro/Enterprise rule uses Value: handle_batched_event (string) and Value: shepherd (string) - both work
  2. Failing rules: The E2E rule uses Value: 49427 (unquoted integer) - doesn't match
  3. Boolean example: The API requests rule uses Value: true with Datatype: bool - also doesn't match (similar type issue)
  Impact Assessment:
  - Current State: ~388 E2E traces/hour are being sampled at 1:10 instead of 1:1
  - Data Loss: ~90% of E2E test traces are being dropped
  - Severity: HIGH - E2E monitoring is significantly degraded
  Recommendation: Change line 135 from Value: 49427 to Value: "49427". This should be tested in a non-production environment first to verify it fixes the rule matching.

I couldn't believe I had missed this bug. This is a well-known footgun in Refinery that is easy to get wrong. Back in Refinery 1.20, we even added support for Datatype so that users could explicitly set the expected type to use when comparing! It's hard to figure out, though (hence the defensive Datatype feature), as it requires a deep understanding of both your rules and your data. In this case, when we set these rules up a long time ago, we made this mistake and never checked whether we were sampling correctly (whoops).

Conclusion: this experience really opened my eyes to how good AI can be when pattern-matching correctly. It also highlights how flexible AI assistants can be when given the proper tools.
We even found a second issue in our Refinery configuration by following the same "teach, then ask" flow. With the power of Honeycomb's MCP, all your high-cardinality, high-dimensionality data is readily available for your favorite AI assistant to be your pattern-matcher. Be curious and ask questions; you might be surprised what answers you'll receive.

Honeycomb
Nov 19th, 2025
Honeycomb Launches New Private Cloud Offering to Address Security, Compliance, and Cost Concerns

Honeycomb launches new Private Cloud offering to address security, compliance, and cost concerns. The latest release also includes revamped Metrics capabilities and Canvas in GA, providing secure, unified, and intelligent observability to aid enterprises in their adoption of AI.

SAN FRANCISCO, November 19, 2025 - Honeycomb, the creators of observability, today announced a series of new product advancements: Honeycomb Private Cloud, Honeycomb Metrics, and Canvas, part of Honeycomb Intelligence. Together, these new capabilities bolster Honeycomb's enterprise portfolio, combining security, performance, and agentic experiences to help organizations deliver resilient software at scale.

Enterprises are rapidly increasing their adoption of private and hybrid clouds for better control, residency, and security. Observability is an essential function for teams who need deep visibility and rapid debugging across cloud-native multi-service architectures, regardless of deployment location. With Honeycomb Private Cloud, organizations get dedicated AWS infrastructure built to meet stringent security, data residency, and regulatory requirements - all without sacrificing performance or developer experience. By providing complete management over data and environment isolation, this offering is especially well suited for customers in highly regulated industries like finance and healthcare, who need to confidently access the lightning-fast query performance and intuitive workflows that define Honeycomb's platform.

"With the rise of AI, enterprises can no longer just monitor uptime or latency; they must understand how systems are performing in real time," said Thomas Squeo, CTO, Americas at Thoughtworks.
"Observability is now a business imperative, and Thoughtworks is thrilled to be Honeycomb's preferred partner for implementing Private Cloud. We believe this offering will unlock a new chapter of observability for the enterprise, particularly as organizations strive to effectively manage their complex data and development requirements." Customers are able to select self-managed or Honeycomb-managed options. Additionally, this new Bring Your Own Cloud (BYOC) option offers a cost-effective solution for organizations that want to keep their data in their own cloud accounts and control costs over time using their existing AWS discounts. With the introduction of overhauled and improved Metrics, Honeycomb is expanding its observability platform to include native support for standard OpenTelemetry metrics. This builds on Honeycomb's longstanding support for custom metrics through events, giving teams a unified way to explore both high-level signals and detailed event data in one place. With Metrics, users can now bring in gauges, counters, and histograms to track trends, monitor system health, and detect performance changes over time. All of this happens within the same intuitive Honeycomb experience. Unlike traditional monitoring tools that separate system data from application behavior, Honeycomb's unified model allows engineers to seamlessly connect what is happening in their infrastructure to why it is happening in their applications. This new capability gives teams faster insights, deeper context, and a clearer path from detection to resolution, making it easier than ever to understand how their systems and users interact. "Full time-series metrics are the industry standard for most developers and site reliability engineers," said Graham Siener, VP of Product at Honeycomb. 
"Our new Metrics capabilities meet customers where they are, combining open-standard metrics methodologies with our industry-leading tracing data to better determine the context around critical issues. The Honeycomb Intelligence platform, which includes our improved Metrics, is purpose-built for teams of every size and provides a platform for success that organizations can use to navigate their AI journey." Also launching today for General Availability is Canvas, Honeycomb's AI-guided dashboard that blends natural language investigation with interactive notebooks for collaborative debugging. "Writing code is no longer a bottleneck with AI, but quality is suffering and organizations are deploying software they may not fully understand, which makes observability more important than ever," said James Governor, Co-founder of RedMonk, the developer-focused analyst company. "Honeycomb is responding by making natural language a native interface for its strengths in troubleshooting, debugging, and problem resolution, integrating AI across the platform." With Canvas, engineers can ask questions in plain English and watch as Honeycomb autonomously explores telemetry data, surfaces anomalies, and visualizes findings in real time. Canvas runs multiple queries, comparisons, and BubbleUp analyses automatically, producing dynamic charts and trace visualizations that evolve as the investigation progresses. As part of Honeycomb Intelligence, Canvas works seamlessly with MCP Server and Anomaly Detection, bringing AI-driven investigation directly into the developer workflow rather than bolting it on as an afterthought. With these new offerings, Honeycomb further extends its leadership in helping enterprises achieve faster resolution, tighter control, and deeper understanding across distributed systems. To see Honeycomb in action, including Honeycomb Intelligence and Metrics, find honeycomb.io at AWS Re:Invent (booth 1378) on December 1-4, 2025. 
To learn more or get started, visit https://www.honeycomb.io/get-a-demo.

About Honeycomb: Honeycomb is the future-proofed observability platform that enables engineering teams to find and solve problems they couldn't before. It unifies telemetry, returns fast queries, integrates with AI agents, and reveals issues others miss in your ever-evolving tech stack. Honeycomb's unique event-based pricing model provides a predictable way to ingest high-cardinality telemetry without penalties or hidden costs. Learn more at www.honeycomb.io and follow Honeycomb on LinkedIn.

Media Contact: Kaileigh Higgins, [email protected]

Honeycomb
Oct 13th, 2025
How We Saved 70% of CPU and 60% of Memory in Refinery's Go Code, No Rust Required

We've just released Refinery 3.0, a performance-focused update which significantly improves Refinery's CPU and memory efficiency. Refinery has a big job: it performs dynamic, consistent tail-based sampling that maintains proportions across key fields, adjusts to changes in throughput, and reports accurate sampling rates. The traffic patterns it handles are challenging, with long or large traces requiring it to hold lots of information in memory, while sudden volume spikes leave little time for infrastructure to scale up - all in a package that people want to run as cheaply as possible, since one of the primary use cases for sampling is cost control. When you're spending money to save money, you always want to spend less. Version 3.0 is a big advance in that direction.

When we upgraded our internal Refinery cluster, total CPU usage dropped by 70%, while RAM use dropped by 60%. With an improvement like this, we can downsize this 72-node cluster by half - a meaningful savings - while still keeping more headroom than before. If you're a Refinery user, hopefully so can you.

How did we pull off such a big change? The code's all in this merge, but I'll cover the basics here. Like many programming languages, Go is capable of being very fast under the right circumstances (working with bounded quantities of strongly typed data), and very slow under the wrong ones. Unfortunately, Refinery's job of handling customer-defined trace spans is very close to the wrong one. Historically, we followed the standard approach and fully de-serialized every span that came in through the API. Since there's no fixed schema, the fields went into a big map[string]any - hundreds of heap allocations, pointers everywhere. It was simple and effective, but it was also expensive.
Compounding this cost, in a cluster configuration the majority of spans are handled twice, since they're redirected from the receiving node to the node which "owns" the relevant trace. In a CPU profile of a typical clustered Refinery, there's a lot going on, but almost a quarter of CPU time goes to garbage collection. Digging further reveals that a lot of the leaf nodes are ultimately some form of malloc. In total, 50% of all CPU time in this process is allocation-related, all in order to hold onto span data as it waits for a sampling decision - after which, in most cases, it's simply thrown away without being sent to Honeycomb! There's also a lot of overhead just for setting up data structures which we're hardly going to use. The actual refining, which is the reason we're doing all of this, happens in the collect loop - a mere 12% of time in this profile. Even that is mostly internal metrics instrumentation rather than the core decision-making algorithms.

The best way to make all this de-serialization (and, eventually, re-serialization) fast is not to do it at all. Refinery only ever looks at a handful of fields in any given span; the rest is just cargo. And it's very possible to extract only the fields you need from a serialized blob. Two benchmarks demonstrate the difference: de-serializing into a map, versus the new selective approach, where we pull out any fields Refinery needs, then hang onto the serialized data for re-transmission. I'm using MessagePack here because that's Refinery's native format, with a low-level serialization API provided by the tinylib/msgp library.
You can see that the selective approach involves much more code, but it's hard to argue with the results. This is a very simple scenario, and it's common for real spans to have hundreds or even thousands of fields, which in the old version meant much longer parsing times and thousands of distinct allocations per span. Instead, Refinery 3.0 keeps the serialized data, retaining it in a format which is much more compact than the web of headers and pointers created for a fully realized map. This more compact data is the main reason for Refinery's improved memory footprint.

Of course, Refinery supports three other types of input data besides its native MessagePack (libhoney/json, otlp/proto, otlp/json). To handle the others, Refinery now transcodes those formats directly to serial MessagePack, binary-to-binary, again extracting any useful fields along the way. This code is even more voluminous than the selective extraction from MessagePack, but it avoids an expensive additional step of translation from generated protobuf data structures into Refinery's own.

To add icing to this cake, we also optimized our metrics instrumentation, implemented pools to re-use large buffers, and (coming soon as a minor version update) parallelized the core decision loop to scale across many CPUs. Notably, there are no clever algorithms or language tricks at play here. We didn't have to rewrite it in Rust. All we've done is reimagine which work this process really needs to do, and focus on only doing that.