Building a serverless data pipeline
This is a summary note/transcript of the technical workshop held at Andela Nairobi in February 2019.
OVERVIEW & PURPOSE
The technology we use today has become integral to our lives, and increasingly we expect it to be always available and responsive to our unique needs, whether it is showing us when a service is about to fail or auto-playing a relevant song based on our play history. As software engineers, we are responsible for delivering technology that meets these expectations, and increasingly we rely on massive amounts of data to make that possible. In this workshop, we explore the process of organizing the data we receive to allow for real-time analytics, using the example of an organization that needs to visualize logs from its backend servers on a dashboard.
SCOPE
- Transform raw data to processed data.
- Google Cloud Platform (GCP).
OBJECTIVES
- Explain what a data pipeline is.
- Give an overview of the evolution of data pipelines.
- Build a working example using GCP Dataflow and BigQuery.
PREREQUISITES
- Basic Python knowledge.
OUTCOMES
Success looks like:
- Data read from a plain text/CSV file and loaded into an analytics database.
- Attendees able to run the code on their own.
DATA PIPELINE
A data pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next.
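As a toy illustration of this idea (not from the workshop code; the log lines below are made up), each stage in plain Python consumes the output of the previous one:

# Toy pipeline: three processing elements connected in series.
raw_lines = ['GET /home 200', 'GET /login 500', 'POST /login 200']

def parse(lines):                 # element 1: raw text -> structured records
    for line in lines:
        method, path, status = line.split()
        yield {'method': method, 'path': path, 'status': int(status)}

def only_errors(records):         # element 2: keep only failed requests
    return (r for r in records if r['status'] >= 500)

def count(records):               # element 3: summarize for a dashboard
    return sum(1 for _ in records)

print(count(only_errors(parse(raw_lines))))  # -> 1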
PROPERTIES:
- Low event latency: ability to query recent event data within seconds of it becoming available.
- Scalability: adapts to growing data volumes as product use rises, and ensures data remains available for querying.
- Interactive querying: supports both long-running batch queries and smaller interactive queries without delays.
- Versioning: ability to make changes to the pipeline and data definitions without downtime or data loss.
- Monitoring: generates alerts if expected data is not being received.
- Testing: ability to test pipeline components while ensuring test events are not added to storage.
TYPES OF DATA
- Raw data
  - Unprocessed data in the format used at the source, e.g. JSON.
  - No schema applied.
- Processed data
  - Raw data with a schema applied.
  - Stored in event tables/destinations in the pipeline.
- Cooked data
  - Processed data that has been summarized.
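For example (a made-up request log event, just to illustrate the distinction):

import json

# Raw: as received from the source, no schema applied.
raw = '{"ts": "2019-02-09T10:15:00Z", "path": "/login", "status": "500"}'

# Processed: the same event with a schema applied, ready for an event table.
event = json.loads(raw)
event['status'] = int(event['status'])  # enforce the column type

# Cooked: processed events summarized, e.g. error counts per endpoint.
cooked = {('/login', 500): 1}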
MOTIVATION FOR DATA PIPELINE
Why are you collecting and herding the data you have? Ask the end users what they want to ask or answer about the app or its data, e.g. how much traffic are we supporting, what is the peak period, where are users located, etc.
PIPELINES EVOLUTION
- Flat file era: logs are saved on the server.
- Database era: data staged in TXT or CSV files is loaded into the database.
- Data lake era: data staged in Hadoop/S3 is loaded into the database.
- Serverless era: managed services are used for storage and querying.
GOTCHAS
- Central point of failure.
- Bottlenecks due to too many events from one source.
- Query performance degradation, where a query takes hours to complete.
KEY CONCEPTS IN APACHE BEAM
- Pipeline: encapsulates the entire workflow of the data processing task, from start to finish.
- PCollection: represents a distributed dataset that the Beam pipeline operates on.
- PTransform: represents a data processing operation, or a step, in the pipeline.
- ParDo: a Beam transform for generic parallel processing.
- DoFn: applies the processing logic to each element in the input PCollection and populates the elements of an output PCollection.
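To tie these concepts together, here is a minimal sketch of a Beam pipeline (illustrative only; the sample lines are made up, and with no runner specified it runs on the DirectRunner):

# Minimal word-count-style pipeline showing Pipeline, PCollection, PTransform, ParDo and DoFn.
from __future__ import print_function

import apache_beam as beam


class SplitWordsFn(beam.DoFn):
    """DoFn: the per-element processing logic applied by ParDo."""
    def process(self, element):
        for word in element.split():
            yield word.lower()


with beam.Pipeline() as p:  # Pipeline: the whole workflow, start to finish
    lines = p | 'Create' >> beam.Create([  # PCollection: a distributed dataset
        'serverless pipelines on GCP',
        'Beam pipelines run on many runners',
    ])
    words = lines | 'SplitWords' >> beam.ParDo(SplitWordsFn())  # ParDo applies the DoFn
    counts = (words  # each named '>>' step is a PTransform
              | 'PairWithOne' >> beam.Map(lambda w: (w, 1))
              | 'SumPerWord' >> beam.CombinePerKey(sum))
    counts | 'Print' >> beam.Map(print)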
CODE
- Set up a Python environment and install Beam: https://beam.apache.org/get-started/quickstart-py/ (requires Python 2.7).
- Activate the virtual environment, e.g. source venv/bin/activate.
GOOGLE CLOUD PLATFORM
- Dataflow setup: https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python
- Sign up for a free trial (card required, $300 in free credit).
- Create a project.
- Create a storage bucket.
- Ensure the Dataflow API is enabled.
CODE EXECUTION
- Run the code remotely on GCP (from the root folder of the code):

python -m kiva_org.loans \
  --runner=DataflowRunner \
  --project=kiva-org-pipeline \
  --temp_location=gs://kiva_org_pipeline/tmp \
  --staging_location=gs://kiva_org_pipeline/staging \
  --output=gs://kiva_org_pipeline/results/loans \
  --input=gs://kiva_org_pipeline/data/kiva/loans.csv

- Run the code locally:

python -m kiva_org.loans \
  --runner=DirectRunner \
  --input=loans.csv \
  --output=tmp/loans

- Run the BigQuery version of the code remotely:

python -m kiva_org.loans_bigquery \
  --runner=DataflowRunner \
  --project=kiva-org-pipeline \
  --temp_location=gs://kiva_org_pipeline/tmp \
  --staging_location=gs://kiva_org_pipeline/staging \
  --output=gs://kiva_org_pipeline/results/loans \
  --input=gs://kiva_org_pipeline/data/kiva/loans.csv \
  --table_name=kiva_loans_summary \
  --dataset=kiva_dataset
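For reference, below is a sketch of what the BigQuery version of the pipeline might look like. This is an assumption rather than the actual workshop code: the column positions in loans.csv, the per-sector summary, and the ParseLoanFn helper are hypothetical, while the WriteToBigQuery transform and its options are the standard Beam API.

# Hypothetical sketch of kiva_org/loans_bigquery.py: count loans per sector
# and write the summary to BigQuery. The column layout of loans.csv is assumed.
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseLoanFn(beam.DoFn):
    """Parses a CSV line; skips the header and malformed rows."""
    def process(self, line):
        fields = line.split(',')
        try:
            # Assumed columns: id, name, amount, sector, country, ...
            yield {'sector': fields[3], 'amount': float(fields[2])}
        except (IndexError, ValueError):
            pass


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True)
    parser.add_argument('--output', required=False)  # accepted for parity with the command above
    parser.add_argument('--dataset', required=True)
    parser.add_argument('--table_name', required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(known_args.input)
         | 'Parse' >> beam.ParDo(ParseLoanFn())
         | 'KeyBySector' >> beam.Map(lambda loan: (loan['sector'], 1))
         | 'CountPerSector' >> beam.CombinePerKey(sum)
         | 'ToTableRow' >> beam.Map(lambda kv: {'sector': kv[0], 'loan_count': kv[1]})
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             known_args.table_name,
             dataset=known_args.dataset,
             schema='sector:STRING,loan_count:INTEGER',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))


if __name__ == '__main__':
    run()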
REFERENCES
- https://towardsdatascience.com/data-science-for-startups-data-pipelines-786f6746a59a
- https://beam.apache.org/get-started/mobile-gaming-example/
- https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/
- https://www.coursera.org/learn/serverless-data-analysis-bigquery-cloud-dataflow-gcp/