Design, develop, and maintain scalable, high-performance data pipelines using PySpark and Databricks (a brief sketch follows this list).
Ensure data quality, consistency, and security throughout all pipeline stages.
Optimize data workflows and pipeline performance, ensuring efficient data processing.
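For context, here is a minimal sketch of the kind of PySpark pipeline this role covers, assuming a hypothetical S3 landing path, an order_id/order_date schema, and a Databricks runtime with Delta Lake available:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_pipeline").getOrCreate()

# Read raw events from a landing zone (hypothetical path and schema).
raw = spark.read.json("s3://example-bucket/landing/orders/")

# Basic quality gates: deduplicate and drop records missing the key field.
clean = (
    raw.dropDuplicates(["order_id"])
       .filter(F.col("order_id").isNotNull())
       .withColumn("ingested_at", F.current_timestamp())
)

# Write to a partitioned Delta table for downstream consumers.
(clean.write
      .format("delta")
      .mode("append")
      .partitionBy("order_date")
      .save("s3://example-bucket/curated/orders/"))
```

Partitioning the output by order_date keeps downstream reads selective; the actual columns, quality rules, and storage layout would follow the customer's data.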
Cloud-Based Data Solutions:
Architect and implement cloud-native data solutions using AWS services (e.g., S3, Glue, Lambda, Redshift), GCP (Dataproc, Dataflow), and Azure (ADF, ADLS).
Build ETL processes to extract, transform, and load data across cloud platforms (see the Glue example below).
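As one hedged illustration of driving a cloud ETL service from Python, the snippet below triggers a hypothetical AWS Glue job with boto3; the job name, bucket path, and region are assumptions, and AWS credentials are expected to come from the environment:

```python
import boto3

# Hypothetical job name and region; AWS credentials are assumed to be
# configured in the environment.
glue = boto3.client("glue", region_name="us-east-1")

# Kick off the Glue job, passing the source path as a job argument.
run = glue.start_job_run(
    JobName="orders_etl_job",
    Arguments={"--source_path": "s3://example-bucket/landing/orders/"},
)

# Poll the run once to check its state (RUNNING, SUCCEEDED, FAILED, ...).
status = glue.get_job_run(JobName="orders_etl_job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])
```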
SQL & Data Modeling:
Utilize SQL (including window functions) to query and analyze large datasets efficiently; an example follows below.
Work with different data schemas and models relevant to various business contexts (e.g., star/snowflake schemas, normalized and denormalized models).
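To make the window-function point concrete, here is a small PySpark sketch over hypothetical order data that keeps each customer's most recent order using row_number over a partitioned window (the same logic can be written directly in SQL):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical customer/order data.
orders = spark.createDataFrame(
    [("c1", "2024-01-01", 120.0), ("c1", "2024-01-05", 80.0), ("c2", "2024-01-03", 200.0)],
    ["customer_id", "order_date", "amount"],
)

# Rank each customer's orders by recency and keep only the latest one.
w = Window.partitionBy("customer_id").orderBy(F.col("order_date").desc())
latest = (
    orders
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
latest.show()
```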
Data Security & Compliance:
Implement robust data security measures, including encryption, access control, and compliance with industry standards and regulations (an access-control sketch follows below).
Monitor and troubleshoot data pipeline performance and security issues.
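As a narrow example of access control on Databricks, the sketch below grants and revokes table privileges through Unity Catalog SQL; the catalog, schema, table, and group names are assumptions, and a Unity Catalog-enabled workspace is presumed:

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` already exists; getOrCreate simply returns it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical catalog/schema/table and group names; assumes Unity Catalog.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")

# Revoke write access the group should no longer hold.
spark.sql("REVOKE MODIFY ON TABLE main.sales.orders FROM `data_analysts`")
```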
Collaboration & Communication:
Collaborate with cross-functional teams (data scientists, software engineers, and business stakeholders) to design and integrate end-to-end data pipelines.
Communicate technical concepts clearly and effectively to non-technical stakeholders.
Domain Expertise:
Understand and work with domain-related data, tailoring solutions to address the specific business needs of the customer.
Optimize data solutions for the business context, ensuring alignment with customer requirements and goals.
Mentorship & Leadership:
Provide guidance to junior team members, fostering a collaborative environment and ensuring best practices are followed.
Drive innovation and promote a culture of continuous learning and improvement within the team.
Required Qualifications
Experience:
6-8 years of total experience in data engineering, with 3+ years of hands-on experience in Databricks, PySpark, and AWS.
3+ years of experience in Python and SQL for data engineering tasks.
Experience working with cloud ETL services such as AWS Glue, GCP Dataproc/Dataflow, and Azure ADF/ADLS.
Technical Skills:
Strong proficiency in PySpark for large-scale data processing and transformation.
Expertise in SQL, including window functions, for data manipulation and querying.
Experience with cloud-based ETL tools (AWS Glue, GCP Dataflow, Azure ADF) and an understanding of their integration with cloud data platforms.
Deep understanding of data schemas and models used across various business contexts.
Familiarity with data warehousing optimization techniques, including partitioning, indexing, and query optimization (a brief sketch follows this list).
Knowledge of data security best practices (e.g., encryption, access control, and compliance).
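As a rough sketch of the warehouse-optimization point above, combining partition pruning with Databricks' OPTIMIZE/ZORDER; the Delta table path and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical curated table path; assumes a Databricks runtime with Delta Lake.
path = "s3://example-bucket/curated/orders/"

# Partition pruning: filtering on the partition column lets Spark skip
# whole directories instead of scanning the full table.
jan = spark.read.format("delta").load(path).filter(F.col("order_date") == "2024-01-01")
print(jan.count())

# Compact small files and co-locate rows by a frequent filter column
# (Databricks OPTIMIZE / ZORDER).
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (customer_id)")
```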
Agile Methodologies: Experience working in Agile (Scrum or Kanban) teams for iterative development and delivery.
Communication: Excellent verbal and written communication skills, with the ability to explain complex technical concepts to non-technical stakeholders.
Skills
Python, Databricks, PySpark, SQL