Full-Time
Posted on 9/5/2025
Open-source data preprocessing for unstructured data
No salary listed
Seattle, WA, USA
Hybrid
Unstructured.io provides tools for turning raw unstructured data into ML-ready formats. It delivers open-source libraries and APIs that developers and data scientists use to build custom data-preprocessing pipelines for labeling, training, and production workflows. The pipelines accept HTML, PDF, XML, PPTX, and DOCX files as well as CRM data, and can be orchestrated with machine learning models, cleaning scripts, and regular expressions, with easy integration with downstream services and strong data security. Users can publish their own APIs and format data for ingestion by various ML services, enabling scalable use of unstructured data. The goal is to help organizations extract value from unstructured data at scale by providing flexible, reusable preprocessing tools.
Company Size
51-200
Company Stage
Series B
Total Funding
$65M
Headquarters
San Francisco, California
Founded
2022
Remote Work Options
Unlimited Paid Time Off
Home Office Stipend
Health Insurance
Dental Insurance
Vision Insurance
Professional Development Budget
The most innovative data science companies of 2026. March 24, 2026. Why Unstructured, Feedzai, Synchron, and Chalk are among Fast Company's Most Innovative Companies in data science for 2026.
Unstructured has partnered with Teradata to embed its data processing platform natively inside Teradata Enterprise Vector Store, enabling enterprises to transform unstructured content into AI-ready data without external tools. The integration will be available to eligible Teradata customers from April 2026. The partnership allows automatic ingestion and processing of documents, PDFs, images, video and audio directly within Teradata Enterprise Vector Store. Unstructured's preprocessing capabilities support over 70 file types, converting them into structured data and embeddings whilst maintaining the same governance and security standards as Teradata's structured analytics. The integration addresses a critical challenge, as roughly 80% of enterprise data exists in formats AI systems cannot natively use. It supports hybrid deployment across AWS, Azure, GCP, on-premises and air-gapped environments, particularly benefiting regulated industries like financial services, healthcare and government.
Unstructured has been awarded a $2 million Tactical Funding Increase contract by AFWERX in partnership with the U.S. Air Force Test Center's 96th Test Wing. The contract will develop advanced multimodal data pipelines for generative AI-enabled testing tools and establish test and evaluation frameworks for AI applications across the Air Force. The technology will enable the Air Force to process complex test data formats including charts, diagrams, images, audio, video and telemetry, which current AI tools struggle to access. Unstructured's solution will allow personnel to query and analyse information through AI-powered assistants whilst reducing processing costs and storage requirements. The company will also work with AFTC to develop frameworks measuring accuracy, speed and reliability of AI tools, accelerating test cycles and reducing redundant analysis.
Unstructured secures FedRAMP High authorization to deliver AI-ready data to federal agencies and partners.
SACRAMENTO, Calif. - (BUSINESS WIRE) - Unstructured, the leader in AI-ready data orchestration, today announced it has achieved FedRAMP High authorization. This milestone affirms Unstructured's commitment to delivering secure, scalable, and mission-ready solutions to US government agencies and industry partners, including those with the most stringent data security and compliance requirements. With this authorization, Unstructured becomes one of the few AI infrastructure companies authorized to operate at the FedRAMP High baseline.
"FedRAMP High is more than a compliance milestone - it's our gateway to accelerating outcomes and unlocking data preparation cost savings for our public sector customers and partners," said Brian Raymond, Founder and CEO of Unstructured. "With this authorization, government users and industry partners can deploy Unstructured's enterprise-grade solution to get their data AI-ready and focus on delivering production-ready AI applications at scale."
Government and industry partners are no longer just experimenting with GenAI - they're building real systems. But when it is time to move from pilot to production, most efforts hit a wall: brittle GenAI data pipelines, modality-specific workarounds, and fragmented architectures that can't adapt as models, file types, modalities or downstream systems evolve.
Rather than rebuilding custom data pipelines for every GenAI use case, agencies and integrators can rely on Unstructured's Platform: a modular, enterprise-grade solution purpose-built to extract, transform, enrich, chunk, embed, and deliver AI-ready data - no matter the source or destination. It supports diverse modalities out of the box, works with any model or data store (vector, relational, etc.), and is now accessible in highly secure environments.
Unstructured also helps reduce infrastructure and processing costs by intelligently adapting its transformation pipeline to the characteristics of each file - maximizing performance while minimizing costs where possible. Unstructured delivers the production-ready data layer that every GenAI application needs - so teams can focus on building outcomes, not maintaining open-source data pipelines.
Unstructured's open source is already widely adopted across the federal government, powering tools like NIPRGPT, CamoGPT, and other systems within the military, national security, federal civilian, and even state and local governments. With the FedRAMP High authorized Platform, government users and industry partners can now operationalize these capabilities at enterprise scale - supported by full end-to-end orchestration across ingestion, transformation, enrichment, and delivery.
"Our open-source tools have helped federal teams experiment with LLMs using unstructured data," said Raymond. "Now, with FedRAMP High authorization of our GenAI data orchestration platform, agencies can move beyond experimentation - deploying a secure, production-ready data platform to scale GenAI applications with confidence."
About Unstructured
Unstructured delivers mission-ready data transformation and orchestration solutions that turn unstructured, multimodal content into AI-ready data at scale. Its modular open platform eliminates the brittleness and high costs of traditional data engineering pipelines, enabling government and commercial organizations to rapidly build and deploy GenAI applications. To learn more or deploy Unstructured, contact [email protected].
Unstructured, a leading provider of scalable, mission-ready Generative AI (GenAI) solutions powered by advanced data transformation and orchestration, announced it has joined Palantir Technologies' FedStart program.