Building a serverless data pipeline

This is a summary note/transcript for the technical workshop held at Andela Nairobi in February 2019.

OVERVIEW & PURPOSE

The technology we use today has become integral to our lives, and we increasingly expect it to be always available and responsive to our unique needs, whether that means warning us when a service is about to fail or auto-playing a relevant song based on our play history. As software engineers, we are responsible for delivering technology that meets these expectations, and increasingly we rely on massive amounts of data to make that possible.

In this workshop, we explore the process of organizing the data we receive to allow for real-time analytics. Our example use case is an organization that needs to visualize logs from its backend servers on a dashboard.

SCOPE

  • Transform raw data to processed data.
  • Google Cloud Platform (GCP).

OBJECTIVES

  • Explain what a data pipeline is.
  • Give an overview of the evolution of data pipelines.
  • Build a working example using GCP Dataflow and BigQuery.

PREREQUISITES

Basic Python knowledge

OUTCOMES

Success looks like:

  • Data read from a plain text/CSV file loaded to an analytics DB.
  • Attendees able to run the code on their own.

DATA PIPELINE

A set of data processing elements connected in series where the output of one element is the input of the next one.
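
As a toy illustration (not workshop code), the same idea can be expressed in plain Python as a chain of generator stages, where each stage consumes the previous stage's output:

    # A toy sketch of the pipeline idea: processing elements connected in
    # series, where the output of one element is the input of the next.
    def read(lines):
        for line in lines:
            yield line.strip()

    def parse(records):
        for record in records:
            yield record.split(',')

    def summarize(rows):
        return sum(int(row[1]) for row in rows)

    logs = ['a,1\n', 'b,2\n', 'c,3\n']
    print(summarize(parse(read(logs))))  # 6 -- each stage feeds the next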

PROPERTIES:

  • Low event latency: Able to query recent event data within seconds of it becoming available.
  • Scalability: Adapt to growing data volumes as product use rises and ensure data remains available for querying.
  • Interactive querying: Support both long-running batch queries and smaller interactive queries without delays.
  • Versioning: Ability to make changes to the pipeline and data definitions without downtime or data loss.
  • Monitoring: Generate alerts if expected data is not being received.
  • Testing: Able to test pipeline components while ensuring test events are not added to storage (see the sketch after this list).
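
To make the testing property concrete, here is a minimal sketch of exercising a single transform with Apache Beam's testing utilities so that test events stay in memory and never reach storage. The SplitCsvLineFn class is a hypothetical example for illustration, not workshop code.

    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to


    class SplitCsvLineFn(beam.DoFn):
        """Hypothetical DoFn that splits a raw CSV line into a list of fields."""
        def process(self, element):
            yield element.split(',')


    def test_split_csv_line():
        # TestPipeline runs on the in-memory DirectRunner when the context exits.
        with TestPipeline() as p:
            lines = p | beam.Create(['1,Kenya,500'])
            fields = lines | beam.ParDo(SplitCsvLineFn())
            # Assertions run against the in-memory PCollection; nothing is persisted.
            assert_that(fields, equal_to([['1', 'Kenya', '500']]))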

TYPES OF DATA

  • Raw data
      • Unprocessed data in the format used at the source, e.g. JSON.
      • No schema applied.
  • Processed data
      • Raw data with a schema applied (see the sketch after this list).
      • Stored in event tables/destinations in the pipeline.
  • Cooked data
      • Processed data that has been summarized.
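
As a small, hedged illustration of the difference between raw and processed data, the following snippet applies an assumed schema (the field names are not the workshop's actual schema) to a raw CSV line:

    import csv
    import io

    # Raw data: an unprocessed line in the format used at the source.
    raw_line = '653207,1000,Kenya,Agriculture'

    # The schema we choose to apply to each record (assumed field names).
    SCHEMA = ('id', 'loan_amount', 'country', 'sector')


    def apply_schema(line):
        """Parses one CSV line and returns a dict keyed by the schema fields."""
        values = next(csv.reader(io.StringIO(line)))
        record = dict(zip(SCHEMA, values))
        record['loan_amount'] = int(record['loan_amount'])  # enforce a typed field
        return record


    print(apply_schema(raw_line))
    # {'id': '653207', 'loan_amount': 1000, 'country': 'Kenya', 'sector': 'Agriculture'}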

MOTIVATION FOR DATA PIPELINE

Why are you collecting and herding the data you have? Ask the end users what they want to ask or answer about the app or data, e.g. how much traffic are we supporting, what is the peak period, where are users located, etc.

PIPELINES EVOLUTION

  • Flat file era: Logs are saved on the server.
  • Database era: Data staged in TXT or CSV files is loaded into the database.
  • Data lake era: Data staged in Hadoop/S3 is loaded into the database.
  • Serverless era: Managed services are used for storage and querying.

GOTCHAS

  • Central point of failure.
  • Bottlenecks due to too many events from one source.
  • Query performance degradation, where a query takes hours to complete.

KEY CONCEPTS IN APACHE BEAM

  • Pipeline: Encapsulates the workflow of the entire data processing task from start to finish.
  • PCollection: Represents a distributed dataset that the Beam pipeline operates on.
  • PTransform: Represents a data processing operation, or a step, in the pipeline.
  • ParDo: The Beam transform for generic parallel processing.
  • DoFn: Applies the logic to each element in the input PCollection and populates the elements of an output PCollection (the sketch after this list shows these pieces together).
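
The sketch below ties these concepts together in a minimal Beam pipeline. It is not the workshop's kiva_org module; the file paths and the splitting logic are assumptions for illustration.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class SplitLoanFn(beam.DoFn):
        """DoFn: applies logic to each element of the input PCollection."""
        def process(self, element):
            # Each raw CSV line becomes a list of fields in the output PCollection.
            yield element.split(',')


    def run():
        options = PipelineOptions()  # defaults to the local DirectRunner
        with beam.Pipeline(options=options) as p:  # Pipeline: the whole workflow
            (
                p
                | 'Read' >> beam.io.ReadFromText('loans.csv')  # PCollection of raw lines (assumed path)
                | 'Split' >> beam.ParDo(SplitLoanFn())         # PTransform applied via ParDo
                | 'Write' >> beam.io.WriteToText('tmp/loans')  # write the output PCollection
            )


    if __name__ == '__main__':
        run()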

CODE

GOOGLE CLOUD PLATFORM

CODE EXECUTION

  • Run the code remotely on GCP (from the root folder of the code):

    python -m kiva_org.loans \
      --runner=DataflowRunner \
      --project=kiva-org-pipeline \
      --temp_location=gs://kiva_org_pipeline/tmp \
      --staging_location=gs://kiva_org_pipeline/staging \
      --output=gs://kiva_org_pipeline/results/loans \
      --input=gs://kiva_org_pipeline/data/kiva/loans.csv

  • Run the code locally:

    python -m kiva_org.loans \
      --runner=DirectRunner \
      --input=loans.csv \
      --output=tmp/loans

  • Run the BigQuery version of the code remotely (a sketch of what this module might contain follows these commands):

    python -m kiva_org.loans_bigquery \
      --runner=DataflowRunner \
      --project=kiva-org-pipeline \
      --temp_location=gs://kiva_org_pipeline/tmp \
      --staging_location=gs://kiva_org_pipeline/staging \
      --output=gs://kiva_org_pipeline/results/loans \
      --input=gs://kiva_org_pipeline/data/kiva/loans.csv \
      --table_name=kiva_loans_summary \
      --dataset=kiva_dataset
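
The --table_name and --dataset flags suggest that the loans_bigquery variant ends with a write to BigQuery. Below is a hedged sketch of what such a module might contain; the CSV field positions, column names, and schema are assumptions for illustration, not the actual kiva_org.loans_bigquery code.

    import argparse

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def parse_line(line):
        # Assumed CSV layout: id, loan_amount, country, sector.
        fields = line.split(',')
        return {'country': fields[2], 'loan_amount': int(fields[1])}


    def run(argv=None):
        parser = argparse.ArgumentParser()
        parser.add_argument('--input', required=True)
        parser.add_argument('--dataset', required=True)
        parser.add_argument('--table_name', required=True)
        known_args, pipeline_args = parser.parse_known_args(argv)

        with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
            (
                p
                | 'Read' >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
                | 'ToRow' >> beam.Map(parse_line)
                | 'Write' >> beam.io.WriteToBigQuery(
                    table=known_args.table_name,
                    dataset=known_args.dataset,
                    schema='country:STRING,loan_amount:INTEGER',
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
            )


    if __name__ == '__main__':
        run()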
