At Robots & Pencils, we build meaningful, scalable digital products by blending strategy, design, and engineering. We are seeking a Level 4 AI Engineer to build production LLM applications for an enterprise client as part of a long-term, delivery-focused engagement.
You will own the AI stack end-to-end, including RAG pipelines, prompt engineering, and evaluation frameworks. This is a hands-on role: you will write production code, tune prompts, build evaluation and observability systems, and iterate based on real user feedback.
There is a working proof of concept in place. Your responsibility is to make it production-ready and extend it with intelligent, reliable features that operate at enterprise scale.
What You’ll Do
AI & LLM Application Delivery
- Build, optimize, and evolve RAG pipelines, including retrieval strategies, chunking, and re-ranking
- Develop prompts and guardrails for domain-specific LLM applications
- Implement hallucination detection, mitigation, and fact-checking mechanisms
- Build embeddings-based search and recommendation features
- Validate AI features with real users and iterate based on qualitative and quantitative feedback
Evaluation, Monitoring & Reliability
- Set up and maintain LLM evaluation frameworks to measure quality, relevance, and reliability
- Implement observability and monitoring for production AI systems
- Monitor live AI systems and resolve quality, accuracy, and performance issues
- Continuously improve AI outputs based on evaluation data and user behavior
Platform & System Integration
- Work closely with product and engineering teams to integrate AI into user-facing features
- Build and maintain backend services in Python
- Integrate with vector databases to support retrieval and semantic search workflows
- Ensure AI solutions meet enterprise requirements for security, scalability, and maintainability
Delivery & Collaboration
- Collaborate with cross-functional partners across product, engineering, and design
- Operate effectively in environments with evolving requirements and ambiguity
- Communicate clearly with technical and non-technical stakeholders
- Take ownership of delivery outcomes from experimentation through production
Required Skills & Experience
- 8+ years of professional software engineering experience, with 4+ years focused on applied AI/ML or data-driven systems in production environments
- 3+ years building and operating production AI systems
- Strong hands-on experience with LLM applications, including RAG, prompt engineering, and evaluation
- Experience implementing hallucination detection and mitigation techniques
- Proficiency in Python
- Experience working with vector databases (Weaviate, Pinecone, or similar)
- Experience with LLM evaluation frameworks (Langfuse, Weights & Biases, or custom solutions)
- Production experience using Claude and/or GPT APIs
- Strong understanding of embeddings and semantic search
- Comfortable working with ambiguity and iterating on unclear problems
- Bachelor's degree in Computer Science, Engineering, Data Science, or a related technical field, or equivalent practical experience
- Advanced degree (Master’s or PhD) in a relevant field is a plus
Nice to Have
- Experience with Azure AI services, including Azure OpenAI and Cognitive Services
- Experience with document processing (PDF extraction, OCR)
- Exposure to audio or speech processing (e.g., Whisper or similar tools)
- Experience building enterprise B2B software
- Experience with ML classification and model training
Tech Stack
- LLMs: Claude (Anthropic), Azure OpenAI
- Vector Database: Weaviate
- Backend: Python
- Infrastructure: Azure
- Evaluation & Observability: Langfuse or similar
How You Work
- You are hands-on and delivery-focused, writing code and owning outcomes
- You balance speed with quality in production environments
- You communicate clearly and collaborate effectively across disciplines
- You take ownership of ambiguous problems and drive them to resolution
- You prioritize reliability, maintainability, and real-world impact
Why Robots & Pencils
- Real production impact, not a POC that sits on a shelf
- Exposure to the full AI lifecycle: RAG, LLM applications, evaluation, classification, and monitoring
- End-to-end ownership of the AI stack and technical decision-making
- A small, senior team with direct access to enterprise clients
