Julian Swart

Data Professional



JJ Furniture - Data Model

Julian Swart Aug 2022 10 min read



Table of Contents

Intro
Data Model
AWS
Scripts
Results
What was Learned


This post was in collaboration with Jonathan Adams.


Intro

As data scientists, our job is mainly focused on using data to solve business problems, not engineering the enviroment for our use. However, a colleague and I were keen to learn more about the nuances of this work because we were suddenly tasked with some of these data engineering responsbilities. We each needed to learn for our jobs and saw it as a chance to build our online portfolio.

Instead of using a commonly accessible dataset, we decided to simulate what a real company’s data looks like. This allowed the project to be more fun and gave us a data structure we were familiar with. We called the fake company JJ Furnitiure (JJ stands for Julian and Jon) and we created artifical transaction, customer, and product data.

This post (Part 1) is about data engineering and our learnings. Part 2 will be data science oriented. Thanks for reading and I hope you enjoy!



Data Model

Data engineers use two types of databases for their main tasks of:

  • loading in raw data from source systems (OLTP: writing data | recording system | AWS RDS)
  • designing tables that benefit end-users (OLAP: reading data | analytical system | AWS Redshift)

The below image is the usual data model that teams follow. They load in their raw transactional data into an OLTP (online transactional processing) database. This is best for writing data - actions like insert, update, and delete. Next is ETL (extract, transform, load). Here, you design a star schema and load in the results to an OLAP (online analytical processing) database. This is best for reading data, aka SQL queries. End-users who do analytics and make dashboards benefit from this system.


Typical cycle of data processing for a business



For our project, we wanted to create raw source data that somewhat imitated reality. This type of data is usually messy and needs wrangling. We also tried to design it how real data engineers would. Generally, this consists of having small tables that you would later join on in an ETL process. This enables the engineer to more easily keep track of and manipulate tables. Here is the OLTP database we designed.


ERD: Entity Relationship Diagram of our OLTP database





AWS

Businesses create data through transactions and need to store it somewhere. That’s where the OLTP database comes in. End-users should never see this database.

We used Amazon Web Services free tier to create our OLTP database. In AWS, it’s called RDS (relational database service). It has row-wise storage which is optimized for writing data. Conversely, the OLAP model in AWS is the familiar Redshift. It stores data column-wise to minimize query processesing time.

For this basic project, all we had to do is make an AWS account and go to the service called RDS and click “Create database”.



AWS Sign-in page




Amazon RDS page for creating our OLTP database




We followed the steps the website took us through to make the RDS. One of the decision points was which SQL language to use - we chose postgres because it’s common and familiar. Next, we needed a SQL client to connect to it. DBeaver is a good for Mac’s, so we used that.

At this point is where we encountered friction. Thankfully, others had experienced the same errors and made good posts online on how to overcome them. Main issues were:

  • “Connection timeout” error. Solution: make a security group with Amazon EC2. This was not an intuitive design by Amazon.
  • Our database name didn’t work for the connection. Solution: use the default name “postgres”, not what you named your database on the Amazon RDS website…geez.

Here are screenshots of the solutions we found online that saved us:


Database connection hurdle #1





Database connection hurdle #2




We finally achieved a successful connection thanks to the above two posts and were glad to begin the next stage of writing python code to simulate our fake furniture company’s data. This, along with learning how to automate table creation and data insertion into the database we just made.



DBeaver database connection window to our Amazon RDS





A great sight! Successful database connection



Scripts

These four scripts work in concert with one another to connect to the database, drop/create empty tables, and then create/insert data. Jon audited a data enginnering Coursera class and we re-purposed some of that code for the first three scripts listed below. #4 we created from scratch.


1) Postgres connector



2) SQL queries



3) Create tables



4) Insert rows of fake data




Results

We achived our goal of creating business-like raw datasets and loading them into a database. The below image shows our tables: colors, descriptions, materials, products, and transactions, along with a subset example of our fake customers table. The next step would be to design a star-schema, create an ETL process, and load those results into a Redshift databse. This is what data scientists and analysts would use.


Image showing the OLTP tables and some results of our fake customer dataset



What was Learned

This is an overview of what we learned doing this project:

  • what an RDS is
  • how to set up an RDS in AWS with postgres
  • connecting using DBeaver sql client
  • better github practices
  • olap vs oltp
  • wrtiting python scripts for automating table creation and inserting rows

Things we didn’t do but know need to happen:

  • automating data quality checks
  • slow-changing dimensions
  • design fact and dimension tables (ETL)
  • data security
  • group permissions
  • privacy data omissions

This was a fun project and we feel more in-tune with the challenges data engineers face. Mainly, what it takes to architect a data managment solution that is efficient, secure, and makes end-users happy. We are mindful of the fact that real-world data is a lot messier and larger than the watered-down example we used here.





Website built with Hugo • Theme is a variation of Tanka • Hosted on Github Pages • Custom domain name from GoDaddy.com