Building a serverless data pipeline

This is a summary note/transcript for the technical workshop held at Andela Nairobi in February 2019.

OVERVIEW & PURPOSE

The technology we use today has become integral to our lives, and we increasingly expect it to be always available and responsive to our unique needs, whether that means warning us when a service is about to fail or auto-playing a relevant song based on our play history. As software engineers, we are responsible for delivering technology that meets these expectations, and increasingly we rely on massive amounts of data to make that possible. In this workshop, we explore how to organize the data we receive to allow for real-time analytics, using the example of an organization that needs to visualize the logs from its backend servers on a dashboard.

SCOPE

  • Transform raw data to processed data.
  • Google Cloud Platform (GCP): Dataflow and BigQuery.

OBJECTIVES

  • Explain what a data pipeline is.
  • Give an overview of the evolution of data pipelines.
  • Build a working example using GCP Dataflow and BigQuery.

PREREQUISITES

Basic Python knowledge

OUTCOMES

Success looks like:

  • Data read from a plain text/CSV file and loaded into an analytics DB.
  • Attendees able to run the code on their own.

DATA PIPELINE

A set of data processing elements connected in series where the output of one element is the input of the next one.
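
A quick illustration of the idea in plain Python (not part of the workshop code; the log format and function names are invented for this example), where each stage consumes the previous stage's output:

  def read_lines(path):
      # Stage 1: produce raw lines from a log file.
      with open(path) as f:
          for line in f:
              yield line.rstrip("\n")

  def parse(lines):
      # Stage 2: turn raw lines into structured records.
      for line in lines:
          level, _, message = line.partition(" ")
          yield {"level": level, "message": message}

  def count_errors(records):
      # Stage 3: summarize the structured records.
      return sum(1 for r in records if r["level"] == "ERROR")

  # The output of one element is the input of the next.
  total_errors = count_errors(parse(read_lines("server.log")))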

PROPERTIES:

  • Low event latency: able to query recent event data within seconds of availability.
  • Scalability: adapts to growing data as product use rises and ensures data is available for querying.
  • Interactive querying: supports both long-running batch queries and smaller interactive queries without delays.
  • Versioning: ability to make changes to the pipeline and data definitions without downtime or data loss.
  • Monitoring: generates alerts if expected data is not being received.
  • Testing: able to test pipeline components while ensuring test events are not added to storage.

TYPES OF DATA

  • Raw data
      • Unprocessed data in the format used at the source, e.g. JSON.
      • No schema applied.
  • Processed data
      • Raw data with a schema applied.
      • Stored in event tables/destinations in the pipeline.
  • Cooked data
      • Processed data that has been summarized (see the illustration after this list).
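
A rough illustration of the three shapes (the event and field names below are invented for this example, not taken from the workshop dataset):

  # Raw: unprocessed, as received from the source, no schema applied.
  raw_event = '{"ts": "2019-02-01T10:15:03Z", "msg": "GET /loans 200 123ms"}'

  # Processed: the same event with a schema applied, ready for an event table.
  processed_event = {
      "timestamp": "2019-02-01T10:15:03Z",
      "method": "GET",
      "path": "/loans",
      "status": 200,
      "latency_ms": 123,
  }

  # Cooked: processed data summarized, e.g. per path per day.
  cooked_row = {
      "date": "2019-02-01",
      "path": "/loans",
      "request_count": 48210,
      "p95_latency_ms": 310,
  }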

MOTIVATION FOR DATA PIPELINE

Why are you collecting and herding the data you have? Ask the end users what questions they want to answer about the app or its data, e.g. how much traffic are we supporting, what is the peak period, where are users located, etc.

PIPELINES EVOLUTION

  • Flat file era: logs are saved on the server.
  • Database era: data staged in TXT or CSV files is loaded into the database.
  • Data lake era: data staged in Hadoop/S3 is loaded into the database.
  • Serverless era: managed services are used for storage and querying.

GOTCHAS

  • Central point of failure.
  • Bottlenecks due to too many events from one source.
  • Query performance degradation, where a query takes hours to complete.

KEY CONCEPTS IN APACHE BEAM

  • Pipeline: encapsulates the workflow of the entire data processing task from start to finish.
  • PCollection: represents a distributed dataset that the Beam pipeline operates on.
  • PTransform: represents a data processing operation, or a step, in the pipeline.
  • ParDo: a Beam transform for generic parallel processing.
  • DoFn: applies the logic to each element in the input PCollection and populates the elements of an output PCollection (see the sketch below).
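
A minimal sketch tying these concepts together by counting log lines per level. This is illustrative only; the file names and transform labels are assumed, and it is not the workshop's kiva_org code:

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  class ParseLine(beam.DoFn):
      # DoFn: applied to each element of the input PCollection.
      def process(self, element):
          level = element.split(" ")[0]
          yield (level, 1)

  # Pipeline: encapsulates the whole workflow from read to write.
  with beam.Pipeline(options=PipelineOptions()) as pipeline:
      (
          pipeline
          | "Read" >> beam.io.ReadFromText("server.log")      # PCollection of raw lines
          | "Parse" >> beam.ParDo(ParseLine())                 # ParDo applies the DoFn in parallel
          | "Count" >> beam.CombinePerKey(sum)                 # PTransform: one processing step
          | "Format" >> beam.Map(lambda kv: "{},{}".format(kv[0], kv[1]))
          | "Write" >> beam.io.WriteToText("tmp/level_counts")
      )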

CODE

GOOGLE CLOUD PLATFORM

CODE EXECUTION

  • Run the code remotely on GCP (from the root folder of the code):

    python -m kiva_org.loans \
      --runner=DataflowRunner \
      --project=kiva-org-pipeline \
      --temp_location=gs://kiva_org_pipeline/tmp \
      --staging_location=gs://kiva_org_pipeline/staging \
      --output=gs://kiva_org_pipeline/results/loans \
      --input=gs://kiva_org_pipeline/data/kiva/loans.csv

  • Run the code locally:

    python -m kiva_org.loans \
      --runner=DirectRunner \
      --input=loans.csv \
      --output=tmp/loans

  • Run the BigQuery version of the code remotely (see the sketch after these commands):

    python -m kiva_org.loans_bigquery \
      --runner=DataflowRunner \
      --project=kiva-org-pipeline \
      --temp_location=gs://kiva_org_pipeline/tmp \
      --staging_location=gs://kiva_org_pipeline/staging \
      --output=gs://kiva_org_pipeline/results/loans \
      --input=gs://kiva_org_pipeline/data/kiva/loans.csv \
      --table_name=kiva_loans_summary \
      --dataset=kiva_dataset
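
The BigQuery variant typically replaces the text sink with a WriteToBigQuery step. Below is a hedged sketch of what such a module might look like; it is not the actual kiva_org.loans_bigquery code, and the CSV column positions and table schema are assumed purely for illustration:

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  def run(argv=None):
      options = PipelineOptions(argv)
      with beam.Pipeline(options=options) as p:
          (
              p
              | "Read" >> beam.io.ReadFromText(
                  "gs://kiva_org_pipeline/data/kiva/loans.csv", skip_header_lines=1)
              # Naive comma split for illustration; real CSV may need a proper parser.
              | "Split" >> beam.Map(lambda line: line.split(","))
              # Assumed column positions: 0 = sector, 1 = loan_amount.
              | "ToRow" >> beam.Map(lambda cols: {"sector": cols[0],
                                                  "loan_amount": float(cols[1])})
              | "Write" >> beam.io.WriteToBigQuery(
                  table="kiva_loans_summary",
                  dataset="kiva_dataset",
                  project="kiva-org-pipeline",
                  schema="sector:STRING,loan_amount:FLOAT",
                  create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                  write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
          )

  if __name__ == "__main__":
      run()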
