Kutir Inc is seeking a Vision-Language Model (VLM) Engineer / Applied Scientist who can design, fine-tune, and deploy multimodal models that understand images/videos and text together. The role involves owning the path from prototype to production, including AWS-based deployment for scalable, secure inference.
Responsibilities:
- Build and adapt vision-language models (VLMs) for enterprise use cases (visual inspection, safety monitoring, document/image understanding, workflow automation)
- Fine-tune pretrained models on custom datasets using LoRA / QLoRA / adapters (see the LoRA sketch after this list)
- Create pipelines that carry image/video ingestion through model inference to structured outputs (JSON, labels, alerts, summaries); a sketch follows this list
- Deploy inference services on AWS with monitoring, scaling, and cost control
- Optimize for performance and reliability (batching, quantization, caching, GPU utilization)
- Run evaluation, error analysis, and continuous improvement using task-specific metrics
- Partner with product and engineering teams to integrate VLM services into applications/APIs
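By way of illustration, here is a minimal sketch of the adapter-based fine-tuning mentioned above, assuming the Hugging Face transformers and peft libraries; the checkpoint name, target modules, and hyperparameters are illustrative placeholders, not Kutir project specifics:

```python
# Minimal LoRA sketch: attach low-rank adapters to a pretrained VLM so that
# only a small fraction of weights is trained. All names and values below
# are placeholders.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder VLM checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.float16)

# Low-rank adapters on the attention projections; the base weights stay frozen.
lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

QLoRA follows the same pattern with the base model loaded in 4-bit precision, which is what makes fine-tuning large VLMs feasible on modest GPU budgets.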
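Similarly, a sketch of the ingestion → inference → structured-output pattern, assuming the transformers pipeline API; the captioning checkpoint, input path, and JSON schema are placeholders:

```python
# Sketch of one ingestion -> inference -> structured-output step.
# Checkpoint, input path, and output schema are placeholders.
import json
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def analyze_frame(path: str) -> str:
    """Run one image through the model and emit a JSON record."""
    image = Image.open(path).convert("RGB")
    summary = captioner(image)[0]["generated_text"]
    return json.dumps({"source": path, "summary": summary})

print(analyze_frame("frame_0001.jpg"))  # placeholder input file
```

In production the same shape extends to video by sampling frames and batching them through the model, with the JSON records feeding labels, alerts, or downstream summaries.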
Requirements:
- Strong hands-on experience with multimodal AI / Vision-Language Models
- Proficiency in Python and PyTorch (or equivalent deep learning framework)
- Real-world experience with fine-tuning and model adaptation (LoRA/QLoRA, prompt tuning)
- Experience deploying ML services on AWS (see the deployment sketch at the end of this list), including:
  - Amazon SageMaker (endpoints, model hosting, pipelines)
  - Amazon EC2 with GPUs, Auto Scaling, and load balancers
  - Amazon ECR (container registry) + Docker
  - AWS Lambda / API Gateway (where suitable)
  - CloudWatch (logs/metrics)
- Strong understanding of computer vision fundamentals (classification, detection, embeddings)
- Familiarity with Hugging Face Transformers, OpenCV, and ONNX/TensorRT
- Experience with ECS / EKS (Kubernetes) for container orchestration
- Infrastructure as Code with Terraform / AWS CDK / CloudFormation
- Knowledge of security best practices: IAM roles, VPC setup, secrets management
- Experience with multimodal RAG (Retrieval-Augmented Generation) using vector databases
- Experience with dataset labeling workflows and MLOps practices
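To give a flavor of the AWS deployment work, a minimal sketch using the SageMaker Python SDK's Hugging Face model class; the IAM role ARN, S3 artifact path, framework versions, and instance type are illustrative placeholders:

```python
# Minimal SageMaker real-time hosting sketch. Role ARN, S3 path, versions,
# and instance type are placeholders, not actual project values.
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    model_data="s3://example-bucket/vlm-model.tar.gz",           # placeholder artifact
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder role
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

# Deploy to a GPU-backed endpoint; SageMaker publishes invocation and
# latency metrics to CloudWatch automatically.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)

# Payload format depends on the container and task.
response = predictor.predict({"inputs": "..."})

# Cost control: tear the endpoint down when it is no longer needed.
predictor.delete_endpoint()
```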