How to extract and handle real-time data with AWS Lambda and Snowflake

This demo shows how to use AWS and Snowflake to handle real-time data from the VR API and visualize it in Power BI. This blog will walk you through:

  • The basics of AWS S3, IAM and Lambda
  • How to use AWS Lambda to load data automatically from an API
  • How to parse JSON into tabular format in Snowflake
  • How to connect Snowflake to Power BI for data visualization

The demo proceeds as follows: fetch data from an open API; implement a Lambda function that automatically fetches data from the API at a desired frequency and saves it in JSON format to an S3 bucket; connect Snowflake to AWS and parse the JSON data into tabular format; and finally connect Power BI to Snowflake and visualize the data.

 

Data API

VR shares its real-time data via an open API, and at https://www.digitraffic.fi/rautatieliikenne/ you can browse the different types of data available for streaming.

I have chosen https://rata.digitraffic.fi/api/v1/train-locations/latest/ to see real-time train location information. Similarly, you can get data from other applications by accessing their APIs.

 

Fetch data automatically via AWS

An AWS Lambda function retrieves the most recent information from the API in JSON format. We can add a trigger (such as an Amazon CloudWatch Events rule) to the Lambda function so that the data is fetched automatically, for example daily.

Through the IAM role attached to the Lambda function, you specify which services the function may access. In our case, we attach a policy granting access to the S3 bucket, so that when the Lambda function is triggered, the data is stored in the bucket.
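As a rough sketch, a minimal Lambda handler in Python for this step could look like the following. The bucket name and key prefix are illustrative assumptions, so adapt them to your own setup (boto3 is preinstalled in the Lambda Python runtime):

```python
# Sketch of a Lambda handler that fetches the latest train locations
# from the open API and writes the raw JSON response to S3.
import json
import urllib.request
from datetime import datetime, timezone

API_URL = "https://rata.digitraffic.fi/api/v1/train-locations/latest/"
BUCKET = "my-train-data-bucket"  # assumed bucket name - use your own


def build_key(now):
    """Build a timestamped S3 object key so each fetch is stored separately."""
    return "train-locations/" + now.strftime("%Y-%m-%dT%H-%M-%S") + ".json"


def lambda_handler(event, context):
    # boto3 ships with the Lambda Python runtime; imported here so the
    # rest of the sketch can be exercised without it installed locally.
    import boto3
    with urllib.request.urlopen(API_URL) as resp:
        payload = resp.read()
    key = build_key(datetime.now(timezone.utc))
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=payload)
    return {"statusCode": 200, "body": json.dumps({"key": key})}
```

A CloudWatch Events (EventBridge) schedule rule attached as the trigger then runs this handler at the chosen frequency.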

 

Connecting an AWS S3 bucket to Snowflake

Snowflake can be connected directly to your AWS S3 bucket by creating a stage. You can create the stage manually through the web interface or run a command in the Snowflake console like:

 

create or replace stage trainstage url = 's3://<your s3 bucket name>'
   credentials = (AWS_KEY_ID = '<your AWS KEY>' AWS_SECRET_KEY = '<your AWS Secret KEY>')
   file_format = (type = 'JSON');

 

Snowpipe

After you have created a table with a single VARIANT column to store the JSON data, you can build a Snowpipe that loads new files from the stage in near real time. With auto_ingest = true, the pipe is triggered automatically by S3 event notifications; alternatively, your Lambda function can trigger loads through the Snowpipe REST API. As you can see in the pipe definition, it copies data from your stage.

create or replace pipe <your pipename> auto_ingest = true as
copy into <your tablename> from @<your stagename>
file_format = (type = 'json', strip_outer_array = true);


You can also backfill matching files manually with a COPY command. Note that COPY reads from a stage, not from a pipe:

copy into train from @<your stagename>
pattern = '.*activities.*.json'
file_format = (type = 'json', strip_outer_array = true);

The next thing you want to do is parse the JSON data into tabular form. In Snowflake you can explore the structure with the following query, which shows the first-level keys and values of the JSON:

select * from train,
lateral flatten(input => jsontext) b,
lateral flatten(input => b.value, recursive => true) c;

You can cast the JSON values to typed columns using ::. Take the first line of the code below as an example: jsontext:departureDate::date departureDate. Here jsontext is the VARIANT column of the original table, departureDate is the key in the JSON, :: casts the value to the desired data type (date), and the trailing departureDate is the new column name.

create or replace table train as (select
   jsontext:departureDate::date departureDate,
   jsontext:location::variant location,
   jsontext:speed::int speed,
   jsontext:timestamp::datetime timestamp,
   jsontext:trainNumber::int trainNumber
from train);

 

Sometimes you will need to do this a few times, if there are dictionaries nested inside dictionaries. The logic is to work level by level: first parse the top-level keys into columns, keeping any nested dictionary as a variant column, and then run a similar statement on the variant column to parse the next level.
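The level-by-level logic can be illustrated in plain Python on a sample record shaped like an element of the train-locations response (the field values and the flatten helper are assumptions for illustration, not part of the Snowflake workflow):

```python
# One sample record, shaped like an element of the train-locations response.
record = {
    "trainNumber": 1,
    "departureDate": "2024-01-01",
    "timestamp": "2024-01-01T12:00:00.000Z",
    "speed": 92,
    "location": {"type": "Point", "coordinates": [24.94, 60.17]},
}


def flatten(d, prefix=""):
    """Flatten nested dicts into dotted column names, one level at a time."""
    flat = {}
    for key, value in d.items():
        name = prefix + key
        if isinstance(value, dict):
            # A dict inside a dict: recurse, just as you would parse a
            # variant column again in Snowflake.
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


row = flatten(record)
# row now has flat keys such as "speed" and "location.coordinates".
```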

Once all the data is parsed, you can select the relevant columns and create a view. A view is recommended here because it reflects new data every time data is loaded.

 

Data visualization

You can connect your Snowflake account to Power BI through the Snowflake ODBC driver. You can find detailed instructions on how to download the driver here:

https://docs.snowflake.net/manuals/user-guide/odbc-download.html

Once Power BI has access to your Snowflake data warehouse, you can create the desired visuals.

This demo gives you a brief overview of how to do simple data extraction and handling from an API with the help of the above-mentioned tools. I personally found Snowflake very handy for parsing data, and it is easy to use with standard SQL commands. The most challenging part may be the AWS Lambda function if you are a beginner with AWS. Check out my other blog if you would like to know the basics of AWS services for data extraction, handling and analysis.