Serverless NBA Data Lake Application with API Gateway, AWS Lambda, Amazon S3, AWS Glue and Athena Using Terraform


Jan 15, 2025 - 15:22

In sports analytics, the ability to process and analyze large volumes of data in real time has become a game-changer. A serverless architecture makes it possible to ingest, store, and query large datasets of NBA statistics while keeping the scalability and cost-efficiency of pay-per-use services.
In this project, we’ll build a Serverless NBA Data Lake Application using API Gateway, AWS Lambda, Amazon S3, AWS Glue, and Amazon Athena, all provisioned with Terraform.

System Architecture Overview
The architecture uses the following components:
• Amazon S3: Serves as the central data lake, storing raw, processed, and curated NBA data in JSON format.
• AWS Lambda: A Lambda function fetches NBA data from sportsdata.io, formats it, and uploads it to Amazon S3.
• Amazon API Gateway: Provides a RESTful API that triggers the Lambda function to fetch NBA data from sportsdata.io and upload it to an S3 bucket.
• AWS Glue: Uses a Glue Data Catalog database and a Glue crawler to automatically discover the data stored in S3 and catalog its schema for efficient querying.
• Amazon Athena: Enables serverless querying of the data lake using standard SQL, so users can retrieve insights from the curated NBA data. Query results are stored in an Amazon S3 bucket.

Prerequisites:
• An AWS account with the access and permissions required to configure Lambda, S3, Glue, API Gateway, and Athena.
• Experience with a programming language supported by AWS Lambda, such as Python.
• Terraform installed on your local machine.
• AWS CLI installed and configured on your local machine.

Define Your Lambda function
We will write a Python script for our Lambda function that retrieves NBA data from sportsdata.io, processes it, and uploads it to Amazon S3. The complete Python code is available in the repository.
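To make the fetch, format, and upload flow concrete, here is a minimal sketch of what such a handler could look like. It is not the code from the repository: the endpoint route, the environment variable names (SPORTS_DATA_API_KEY, BUCKET_NAME), the raw-data key prefix, and the extra s3_client and fetch parameters (added so the function can be exercised without AWS access) are all assumptions for illustration. The records are written as newline-delimited JSON because Athena's JSON SerDes expect one JSON object per line.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone

# Assumed endpoint; the repository may call a different sportsdata.io route.
NBA_ENDPOINT = "https://api.sportsdata.io/v3/nba/scores/json/Players"


def fetch_nba_data(api_key, url=NBA_ENDPOINT):
    """Call the API and return the decoded JSON payload (a list of records)."""
    request = urllib.request.Request(
        url, headers={"Ocp-Apim-Subscription-Key": api_key}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record) for record in records)


def build_key(now, prefix="raw-data"):
    """Build a timestamped object key under a hypothetical prefix."""
    return f"{prefix}/nba_player_data_{now:%Y%m%dT%H%M%SZ}.jsonl"


def lambda_handler(event, context, s3_client=None, fetch=fetch_nba_data):
    """Fetch NBA data and upload it to the bucket named in BUCKET_NAME."""
    if s3_client is None:
        import boto3  # bundled with the AWS Lambda Python runtime

        s3_client = boto3.client("s3")

    records = fetch(os.environ["SPORTS_DATA_API_KEY"])
    key = build_key(datetime.now(timezone.utc))
    s3_client.put_object(
        Bucket=os.environ["BUCKET_NAME"],
        Key=key,
        Body=to_json_lines(records).encode("utf-8"),
    )
    # statusCode/body is the response shape a Lambda proxy integration expects.
    return {
        "statusCode": 200,
        "body": json.dumps({"uploaded": key, "records": len(records)}),
    }
```

Passing the API key and bucket name through environment variables keeps secrets out of the source file; in a setup like this one, Terraform would typically set them on the Lambda function from terraform.tfvars.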


Terraform Configuration
We will use Terraform modules for this deployment to keep our infrastructure as code modular, reusable, and maintainable. Each folder in the modules directory defines the infrastructure configuration required to deploy a specific AWS service:

• API Gateway module: Deploys an API Gateway that serves as the trigger for the Lambda function, which retrieves data from sportsdata.io and uploads it to Amazon S3.
• iam_role module: Contains the Terraform code that defines the permissions Lambda needs to retrieve NBA data and upload it to Amazon S3, and the permissions API Gateway needs to trigger the Lambda function.
• Lambda module: Defines the Terraform code that archives the Python code into a zip file and creates the Lambda function that retrieves NBA data from sportsdata.io, processes it, and uploads it to Amazon S3.
• S3 module: Defines the Terraform code that creates the Amazon S3 bucket used to store the data the Lambda function retrieves from sportsdata.io.
• glue module: Defines the Terraform code that creates the AWS Glue Data Catalog database, the Glue crawler, and the Glue table, which together discover the data stored in S3 and catalog its schema for efficient querying.
• athena module: Defines the Terraform code that creates an Athena workgroup, which enables serverless querying of the sports data lake stored in S3 using standard SQL.

The full Terraform configuration is available at:

https://github.com/OjoOluwagbenga700/sport-data-lake.git

Step 1: Clone the Terraform Code
Cloning the repository gives us the infrastructure-as-code configuration needed for the deployment.

Clone the repository: Make sure Git is installed and configured on your system, then use the git clone command to copy the Terraform code repository to your local machine:

git clone https://github.com/OjoOluwagbenga700/sport-data-lake.git

Change into the sport-data-lake directory.

Update the terraform.tfvars file with your API key from sportsdata.io.

Step 2: Running Terraform Commands

terraform init: Initializes Terraform in the project directory and downloads the required providers and modules.

terraform plan: Generates an execution plan so you can preview the changes Terraform will make to the infrastructure.

terraform apply: Run terraform apply --auto-approve to deploy the infrastructure on AWS without an interactive confirmation prompt.

Step 3: Confirm resources deployed on AWS

After the apply completes, confirm in the AWS console that the following resources were created:
• Lambda function
• Glue crawler
• Glue catalog database and table
• S3 bucket (no data uploaded yet)
• Athena workgroup
• API Gateway
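The same confirmation can be scripted instead of clicking through the console. The sketch below assumes boto3 is installed and AWS credentials are configured; every resource name you pass in is whatever your Terraform configuration created, and the dictionary keys are labels chosen here for illustration. The clients parameter exists only so the function can be tried with stand-in objects.

```python
def check_resources(names, clients=None):
    """Return a dict mapping each resource label to True if it exists."""
    if clients is None:
        import boto3  # assumed to be installed locally

        clients = {
            service: boto3.client(service)
            for service in ("lambda", "glue", "s3", "athena")
        }

    # One lookup call per resource; each raises if the resource is missing.
    checks = {
        "lambda_function": lambda: clients["lambda"].get_function(
            FunctionName=names["lambda_function"]
        ),
        "glue_crawler": lambda: clients["glue"].get_crawler(
            Name=names["glue_crawler"]
        ),
        "glue_database": lambda: clients["glue"].get_database(
            Name=names["glue_database"]
        ),
        "s3_bucket": lambda: clients["s3"].head_bucket(
            Bucket=names["s3_bucket"]
        ),
        "athena_workgroup": lambda: clients["athena"].get_work_group(
            WorkGroup=names["athena_workgroup"]
        ),
    }

    results = {}
    for label, call in checks.items():
        try:
            call()
            results[label] = True
        except Exception:
            # boto3 raises botocore.exceptions.ClientError for missing or
            # inaccessible resources; the broad catch keeps this sketch
            # importable without botocore.
            results[label] = False
    return results
```

Calling check_resources with a dictionary of your five resource names returns a label-to-boolean mapping, so a False entry points directly at the module whose deployment needs a second look.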

Step 4: Testing the Application

To trigger the Lambda function to retrieve, process, and upload NBA data to S3, we send a GET request to the API Gateway invoke URL.

Copy the API Gateway invoke URL into your browser, append /dev/data (the API stage and resource path), and press Enter.

https://r3zks22udh.execute-api.us-east-1.amazonaws.com/dev/data
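If you prefer to send the GET request from a script rather than a browser, a sketch along these lines should work using only the Python standard library. The helper names are made up for this example, and the opener parameter is there only so the call can be swapped out when no network is available; substitute your own invoke URL.

```python
import urllib.request


def build_invoke_url(base_url, stage="dev", path="data"):
    """Join the invoke URL, stage, and resource path into one URL."""
    return f"{base_url.rstrip('/')}/{stage}/{path.lstrip('/')}"


def trigger_ingest(base_url, opener=urllib.request.urlopen):
    """Send the GET request and return (HTTP status, response body)."""
    with opener(build_invoke_url(base_url), timeout=30) as response:
        return response.status, response.read().decode("utf-8")
```

For example, trigger_ingest("https://r3zks22udh.execute-api.us-east-1.amazonaws.com") requests the same /dev/data URL shown above.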


NBA Data Uploaded into S3


Preview data table in Athena


Performing Simple SQL query in Athena
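The same kind of query can also be run programmatically. The sketch below uses boto3's Athena client (start_query_execution, get_query_execution, get_query_results) and assumes the workgroup created by Terraform already has a query result location configured. The database, table, workgroup, and column names in the usage note are placeholders, not the names from the repository.

```python
import time


def run_athena_query(athena, sql, database, workgroup, poll_seconds=1.0):
    """Run a query, wait for it to finish, and return rows as lists of strings."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # Athena queries are asynchronous, so poll until a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"
        ]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} {status['State']}: {reason}")

    # Returns only the first page of results; for SELECT queries the
    # first row holds the column names.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [column.get("VarCharValue") for column in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

With a real client this could be called as run_athena_query(boto3.client("athena"), "SELECT * FROM nba_players LIMIT 10", "nba_database", "nba_workgroup"), where all three names are hypothetical.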


Athena Query Result

Query results are stored in a designated folder in the S3 bucket and can be downloaded from there.
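To download a result file from a script, one option is to ask Athena where it wrote the output and then fetch that object from S3. This is a sketch under the assumption that you have the query execution ID (shown in the Athena console for each query) and boto3 clients for Athena and S3. The function names are illustrative.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_query_result(athena, s3, query_id, local_path):
    """Download the result file Athena wrote for the given query."""
    execution = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"
    ]
    # OutputLocation is the full S3 URI of the result file.
    bucket, key = split_s3_uri(execution["ResultConfiguration"]["OutputLocation"])
    s3.download_file(bucket, key, local_path)
    return bucket, key
```

Reading the location from the query execution avoids hard-coding the results folder, which depends on how the Athena workgroup was configured.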


Conclusion: Congratulations! We have successfully built a Serverless NBA Data Lake Application using AWS services such as API Gateway, Lambda, S3, Glue, and Athena. Terraform ensures the infrastructure is provisioned consistently and can be replicated or modified with ease. This architecture showcases the potential of serverless computing and can be extended into other domains, such as real-time analytics, machine learning, or personalized user experiences.

To clean up: Run terraform destroy to delete all infrastructure deployed by the Terraform code.