Full Cycle Data Science is an educational format that explores data science and analytics implementations in an end-to-end fashion. A series often comprises four or more articles and aims to bring both breadth and depth to the discussion of analytics implementation.
Streaming Data Pipeline: the full cycle
Timotius Marselo
Timotius is a voracious learner from his days studying and living in China. He returned to Indonesia with the goal of seriously pursuing data science, investing his time across mathematics, NLP (natural language processing) and machine learning along the way.
- Project Overview and Environment Setup: An overview of what it takes to build a scalable streaming data processing pipeline using fully open source technologies (Kafka, Spark Streaming, Cassandra, MySQL, Streamlit, Docker) and how to set up our Docker Compose file to get all of this working together.
- OLTP and OLAP database setup: Design and implement OLTP and OLAP databases using Cassandra and MySQL respectively.
- Transaction data ingestion and processing: Introducing Apache Kafka and Spark Streaming into our stack, we will build an efficient, high-throughput data pipeline that loads and processes the data in micro-batches, feeding the OLTP and OLAP databases created earlier.
- Querying transactional, real-time data with Streamlit: With the streaming data now arriving in our databases (OLAP, OLTP), we focus on the analytical aspect by diving into queries and visualization. We’ll build a web application with Streamlit and Altair, where our interactive visualizations are hosted.
In the first part of the article, we will discuss the overview of the project and how to set up the environment using Docker Compose.
In the second part of the article, we will walk through the design and implementation of OLTP and OLAP databases using Cassandra and MySQL respectively.
In the third part of the article, we will demonstrate how to ingest and process transactional data using Kafka and Spark Streaming.
In the fourth and final part of the article, we will perform some analytical queries on the MySQL database and create a dashboard using Streamlit.
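To make the micro-batch idea from the third part concrete, here is a minimal sketch in plain Python. It is not the series' implementation and uses neither Kafka nor Spark; the function names, the `customer` and `amount` fields, and the in-memory stand-ins for the transactional and analytical stores are all assumptions made for illustration.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable, batch_size: int) -> Iterator[List]:
    """Yield successive lists of at most `batch_size` items from `events`."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(
    events: Iterable[dict], batch_size: int
) -> Tuple[List[dict], Dict[str, float]]:
    """Consume events batch by batch and feed two hypothetical stores."""
    raw_rows: List[dict] = []      # stand-in for the transactional store: one row per event
    totals: Dict[str, float] = {}  # stand-in for the analytical store: total amount per customer
    for batch in micro_batches(events, batch_size):
        # Every event in the batch is written to the transactional store as-is.
        raw_rows.extend(batch)
        # The same batch is aggregated before it reaches the analytical store.
        for event in batch:
            key = event["customer"]
            totals[key] = totals.get(key, 0.0) + event["amount"]
    return raw_rows, totals


if __name__ == "__main__":
    sample = [
        {"customer": "a", "amount": 10.0},
        {"customer": "b", "amount": 5.0},
        {"customer": "a", "amount": 2.5},
    ]
    rows, per_customer = process_stream(sample, batch_size=2)
    print(len(rows), per_customer)
```

In a real pipeline the event source would be a message broker and the two stores would be databases; the sketch only shows how grouping events into small batches lets one pass serve both a row-level write and an aggregate update.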
Create end-to-end machine learning projects
Join the Supertype Fellowship for a peer-to-peer, elective-based learning program. Ready to contribute? Be a part of Supertype Collective and connect with a highly curated network of data scientists and full stack product developers.
Open source. Community-driven.
peer to peer learning
But with 🥇 badges, timelines, points and elective-based curations.
The Supertype Fellowship platform is developed to be open source from day 1. It is the backbone of the Fellowship program (formerly Supertype Development Program), where participants learn by writing code, creating features, and shipping code to complete one of 10+ electives under mentorship and peer support.
Open source. Community-driven.
Developer Profiles
But with PDF-print, automatic icons, mobile layouts and automatic blog feed. Lightning fast ⚡.
The Supertype Collective platform is an open source tool that generates visually stunning Developer Profiles that look great on every medium: mobile, tablets, and PCs. It even offers a PDF export option.
It features plug-and-play React Components and Hooks you can bring in to effortlessly create a Developer Profile that impresses, in 20 lines or less (low-code).
Automatic Sentiment Analysis for your Mobile Apps
A report generation pipeline that takes an App Store or Google Play Store URL as input and, in minutes, outputs a custom 30-page PDF with an aggregated summary and text analysis of the app's user reviews.
It uses a keyword extractor routine developed in-house to handle many of the language processing tasks involved in identifying these reviews and grouping them into topics (“unfriendly paywall”, “long tutorial”, “app crashes”, “poor customer service”, “very fast loading time”, etc.), a task known as topic detection in the natural language processing (NLP) space.
The program also uses components of spaCy and NLU by John Snow Labs to meaningfully sift through up to tens of thousands of user reviews when determining user sentiment, with full support for 20+ languages.
The PDF that is generated can be customized to include the client’s logo, as well as an ending CTA (call-to-action) slide, making it perfect for mass lead-generation and client outreach.
Read more about our work
Case Studies
Learn more about our Data Science Consultancy services for companies & enterprises
Portfolio
See a gallery of projects built and shipped by Supertype's finest product developers