supertype consultancy
Full Cycle Data Science Series

Full Cycle Data Science is an educational format that explores the application of data science and analytics implementations in an end-to-end fashion. A series typically comprises four or more articles and aims to bring both breadth and depth to the discussion on analytics implementation.

Streaming Data Pipeline: the full cycle

About the author
Timotius Marselo

Timotius is a voracious learner from his days studying and living in China. He returned to Indonesia with the goal of seriously pursuing data science, investing his time across mathematics, NLP (natural language processing) and machine learning along the way.

Streaming Data Pipeline: The Full Cycle features an end-to-end data engineering project in which our data scientist, Timotius Marselo, walks you through the entire process of building a streaming data pipeline.
 
The series consists of four parts, and by the end of it we have a live, fully working Streamlit application powered by a data ingestion and processing pipeline for real-time data streaming.
 
Here’s how the series breaks down:
 
  1. Project Overview and Environment Setup: An overview of what it takes to build a scalable streaming data processing pipeline using fully open source technologies (Kafka, Spark Streaming, Cassandra, MySQL, Streamlit, Docker) and how to set up our Docker Compose file to get all of this working together.
  2. OLTP and OLAP database setup: Design and implement OLTP and OLAP databases using Cassandra and MySQL respectively.
  3. Transaction data ingestion and processing: Introducing Apache Kafka and Spark Streaming into our stack, we build a high-throughput data pipeline that loads and processes the data in micro-batches and feeds it into the OLTP and OLAP databases created earlier.
  4. Querying transactional, real-time data with Streamlit: With the streaming data now arriving in our OLTP and OLAP databases, we focus on the analytical aspect by diving into queries and visualization. We’ll build a web application with Streamlit and Altair to host our interactive visualizations.
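
To make the micro-batch idea in part 3 concrete, here is a minimal, library-free sketch in plain Python. It is not the code from the series and does not use the Kafka or Spark Streaming APIs; the event fields (`id`, `category`, `amount`), the function names, and the in-memory stand-ins for the two databases are all assumptions made for illustration. It only shows the general shape: events are grouped into small batches, and each batch is written both to a raw transactional store and to a running aggregate.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, object]


def micro_batches(events: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `batch_size` items from `events`."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(
    events: Iterable[Event], batch_size: int = 3
) -> Tuple[List[Event], Dict[str, float]]:
    """Write each micro-batch to two hypothetical in-memory sinks."""
    raw_store: List[Event] = []                        # stand-in for the transactional (OLTP) store
    totals: Dict[str, float] = defaultdict(float)      # stand-in for the analytical (OLAP) store
    for batch in micro_batches(events, batch_size):
        raw_store.extend(batch)                        # keep every record as it arrived
        for event in batch:
            totals[str(event["category"])] += float(event["amount"])  # running total per category
    return raw_store, dict(totals)


# Hypothetical sample transactions, invented for this sketch.
sample_events = [
    {"id": 1, "category": "food", "amount": 10.0},
    {"id": 2, "category": "books", "amount": 5.0},
    {"id": 3, "category": "food", "amount": 2.5},
    {"id": 4, "category": "books", "amount": 7.5},
]

raw_store, totals = process_stream(sample_events, batch_size=3)
```

In a real deployment the batching would be handled by the streaming engine rather than by hand, and the two sinks would be database writes; the sketch only illustrates why one pass over each batch can serve both stores.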

Building a Streaming Data Pipeline with Open Source Stacks | Project Overview and Environment Setup (Part 1)

Building a Streaming Data Pipeline with Open Source Stacks | OLTP and OLAP Databases Setup (Part 2)

Building a Streaming Data Pipeline with Open Source Stacks | Transactional Data Ingestion and Processing (Part 3)

Building a Streaming Data Pipeline with Open Source Stacks | Analyzing and Visualizing Data in a Dashboard (Part 4)

Create end-to-end machine learning projects

Join Supertype Fellowship for a peer-to-peer, elective-based learning program. Ready to contribute? Become part of Supertype Collective and connect with a highly curated network of data scientists and full stack product developers.

Open source. Community-driven.

Peer-to-peer learning

But with 🥇 badges, timelines, points and elective-based curations.

The Supertype Fellowship platform is developed to be open source from day 1. It is the backbone of the Fellowship program (formerly Supertype Development Program), where participants learn by writing code, creating features, and shipping code to complete one of 10+ electives under mentorship and peer support.

Open source. Community-driven.

Developer Profiles

But with PDF-print, automatic icons, mobile layouts and automatic blog feed. Lightning fast ⚡.

The Supertype Collective platform is an open source tool that generates visually stunning Developer Profiles that look great on every medium: mobile, tablets, and PCs. It even offers a PDF export option.

It features plug-and-play React Components and Hooks you can bring in to effortlessly create a Developer Profile that impresses, in 20 lines or less (low-code).

Automatic Sentiment Analysis for your Mobile Apps

A report generation pipeline that takes an App Store or Google Play Store URL as input and, within minutes, outputs a custom 30-page PDF with an aggregated summary and text analysis of the app's user reviews.

It uses a keyword extractor routine developed in-house to handle many of the language processing tasks related to identifying and grouping these reviews into topics (“unfriendly paywall”, “long tutorial”, “app crashes”, “poor customer service”, “very fast loading time”, etc.), a task known as topic detection in the natural language processing (NLP) space.

The program also uses components of spaCy and NLU by John Snow Labs to meaningfully sift through up to tens of thousands of user reviews to determine user sentiment, with full support for 20+ languages.

The generated PDF can be customized to include the client’s logo, as well as a closing CTA (call-to-action) slide, making it well suited to mass lead generation and client outreach.

Read more about our work

Case Studies

Learn more about our Data Science Consultancy services for companies & enterprises

Portfolio

See a gallery of projects built and shipped by Supertype's finest product developers