We are looking for a Data Engineer to own the data pipelines, storage architecture, and AI-enablement layer for a media monitoring platform. The role focuses on building reliable data foundations for the large-scale ingestion, processing, enrichment, and serving of multilingual media content for NLP and machine learning use cases.
Core Responsibilities
- Design and maintain scalable batch and streaming pipelines for news, social media, web, and broadcast data ingestion.
- Build ETL/ELT processes to clean, normalize, deduplicate, enrich, and structure unstructured media content.
- Prepare datasets, labels, and feature-ready data for AI/ML model training, fine-tuning, and evaluation.
- Support NLP workflows such as entity extraction, topic classification, sentiment analysis, clustering, summarization, and alerting.
- Ensure data quality, schema consistency, lineage, observability, and fault tolerance across the platform.
- Optimize storage, compute, and query performance across the data stack.
- Implement governance, access control, and auditability for sensitive or regulated content.
- Work closely with the full-stack developer to align backend data services with product requirements and API consumption patterns.
Required Skills
- Strong experience in data engineering, pipeline orchestration, and distributed data processing.
- Proficiency in Python and SQL.
- Experience with cloud data platforms, object storage, data warehouses, workflow orchestration, and message queues.
- Familiarity with unstructured text data, NLP workflows, and ML data preparation.
- Understanding of data modeling, system reliability, monitoring, and performance tuning.
Preferred Experience
- Background in media monitoring, social listening, content intelligence, or news analytics.
- Experience with multilingual datasets and text-heavy pipelines.
- Exposure to LLM-based systems, vector search, or retrieval pipelines.
- Familiarity with secure deployment environments and data governance practices.