Full-Time

AI Operations Specialist

Northeastern University

Compensation Overview

$87.8k - $124k/yr

No H1B Sponsorship

Boston, MA, USA

Hybrid

Three days on-site per week.

Category
DevOps & Infrastructure
Required Skills
MLOps
Microsoft Azure
Airflow
Apache Spark
Apache Kafka
Infrastructure as Code (IaC)
Docker
Vulnerability Analysis
AWS
DevOps
Google Cloud Platform
Requirements
  • A Bachelor of Science degree in Computer Science, Information Technology, or a related field.
  • A minimum of three years of experience in information technology operations, with at least one year focused on AI/ML systems and data pipeline support.
  • Experience with cloud platforms (Amazon Web Services, Microsoft Azure, or Google Cloud Platform) and their AI/ML and data engineering service offerings.
  • MLOps experience: a demonstrated track record of operationalizing and maintaining machine learning models in production environments, including deployment, monitoring, and lifecycle management.
  • Data pipeline operations experience: extensive experience maintaining and troubleshooting data pipelines built with tools such as Apache Airflow, Prefect, Apache Spark, Apache Kafka, and cloud data services, ensuring reliable data flow for AI systems.
  • System monitoring experience: proficiency in monitoring AI system and data pipeline performance, detecting anomalies, and implementing proactive measures to ensure system reliability and availability; ability to troubleshoot, diagnose, and resolve issues, and to prioritize incidents by business impact.
  • Performance optimization knowledge: familiarity with techniques for optimizing AI system and data pipeline performance, including resource allocation, scaling strategies, and performance tuning.
  • Change management experience: implementing changes to production AI systems and data pipelines with testing, validation, and rollback procedures.
  • Data quality management: understanding of data quality principles and the ability to address data-related issues in processing pipelines.
  • Documentation and knowledge management excellence: ability to create and maintain operational documentation, runbooks, and knowledge articles for AI systems and data pipelines.
  • Automation skills: ability to create and implement automation scripts and workflows to streamline routine operational tasks for AI systems and data flows.
  • DevOps practices: familiarity with DevOps and continuous integration/continuous deployment (CI/CD) principles as applied to AI systems and data pipelines, including containerization, orchestration, and infrastructure as code.
  • Security awareness: understanding of security best practices for AI operations and data handling, including access control, data protection, and vulnerability management.
Responsibilities
  • Monitor AI system and data pipeline health, performance, and availability using established monitoring tools and dashboards. Detect, triage, and resolve incidents affecting AI systems and their data infrastructure, coordinating with technical teams as needed. Implement proactive measures to prevent recurring issues and minimize service disruptions.
  • Perform routine operational tasks to maintain AI systems and data pipelines, including model updates, data refreshes, pipeline maintenance, and system patches. Implement scheduled maintenance activities with minimal service disruption. Manage user access and permissions for AI platforms according to security policies.
  • Analyze AI system and data pipeline performance metrics, identify bottlenecks and inefficiencies, and implement optimizations to improve response times, data flow, accuracy, and resource utilization. Monitor for model drift and data quality issues, coordinating retraining or pipeline adjustments when necessary.
  • Create and maintain comprehensive operational documentation, including runbooks, standard operating procedures, and knowledge base articles. Document system configurations, data pipeline dependencies, and recovery procedures to ensure operational continuity.
  • Identify opportunities for process improvement and automation in AI operations. Develop and implement scripts and workflows to automate routine tasks, reducing manual effort and minimizing human error. Contribute to the evolution of MLOps practices based on operational experience and emerging best practices.
Desired Qualifications
  • Technical certifications in cloud platforms, MLOps, and data engineering preferred.