At Robots & Pencils, we build meaningful, scalable digital products by blending strategy, design, and engineering. We are seeking a Level 4 AI Engineer to build production LLM applications for an enterprise client as part of a long-term, delivery-focused engagement.

You will own the AI stack end-to-end, including RAG pipelines, prompt engineering, and evaluation frameworks. This is a hands-on role: you will write production code, tune prompts, build evaluation and observability systems, and iterate based on real user feedback.

There is a working proof of concept in place. Your responsibility is to make it production-ready and extend it with intelligent, reliable features that operate at enterprise scale.

What You’ll Do

AI & LLM Application Delivery

  • Build, optimize, and evolve RAG pipelines, including retrieval strategies, chunking, and re-ranking
  • Develop prompts and guardrails for domain-specific LLM applications
  • Implement hallucination detection, mitigation, and fact-checking mechanisms
  • Build embeddings-based search and recommendation features
  • Validate AI features with real users and iterate based on qualitative and quantitative feedback

Evaluation, Monitoring & Reliability

  • Set up and maintain LLM evaluation frameworks to measure quality, relevance, and reliability
  • Implement observability and monitoring for production AI systems
  • Monitor live AI systems and resolve quality, accuracy, and performance issues
  • Continuously improve AI outputs based on evaluation data and user behavior

Platform & System Integration

  • Work closely with product and engineering teams to integrate AI into user-facing features
  • Build and maintain backend services in Python
  • Integrate with vector databases to support retrieval and semantic search workflows
  • Ensure AI solutions meet enterprise requirements for security, scalability, and maintainability

Delivery & Collaboration

  • Collaborate with cross-functional partners across product, engineering, and design
  • Operate effectively in environments with evolving requirements and ambiguity
  • Communicate clearly with technical and non-technical stakeholders
  • Take ownership of delivery outcomes from experimentation through production

Required Skills & Experience

  • 8+ years of professional software engineering experience, with 4+ years focused on applied AI/ML or data-driven systems in production environments
  • 3+ years building and operating production AI systems
  • Strong hands-on experience with LLM applications, including RAG, prompt engineering, and evaluation
  • Experience implementing hallucination detection and mitigation techniques
  • Proficiency in Python
  • Experience working with vector databases (Weaviate, Pinecone, or similar)
  • Experience with LLM evaluation frameworks (Langfuse, Weights & Biases, or custom solutions)
  • Production experience using Claude and/or GPT APIs
  • Strong understanding of embeddings and semantic search
  • Comfortable working with ambiguity and iterating on unclear problems
  • Bachelor’s degree in Computer Science, Engineering, Data Science, or a related technical field, or equivalent practical experience
  • Advanced degree (Master’s or PhD) in a relevant field preferred

Nice to Have

  • Experience with Azure AI services, including Azure OpenAI and Cognitive Services
  • Experience with document processing (PDF extraction, OCR)
  • Exposure to audio or speech processing (e.g., Whisper or similar tools)
  • Experience building enterprise B2B software
  • Experience with ML classification and model training

Tech Stack

  • LLMs: Claude (Anthropic), Azure OpenAI
  • Vector Database: Weaviate
  • Backend: Python
  • Infrastructure: Azure
  • Evaluation & Observability: Langfuse or similar

How You Work

  • You are hands-on and delivery-focused, writing code and owning outcomes
  • You balance speed with quality in production environments
  • You communicate clearly and collaborate effectively across disciplines
  • You take ownership of ambiguous problems and drive them to resolution
  • You prioritize reliability, maintainability, and real-world impact

Why Robots & Pencils

  • Real production impact, not a POC that sits on a shelf
  • Exposure to the full AI lifecycle: RAG, LLM applications, evaluation, classification, and monitoring
  • End-to-end ownership of the AI stack and technical decision-making
  • A small, senior team with direct access to enterprise clients