Data Engineering
Developing a robust data engineering pipeline is essential for Kgen to effectively leverage its data. This section outlines a comprehensive data engineering architecture that integrates diverse data sources, processing mechanisms, and tools to generate actionable insights. Specifically, the pipeline consolidates multiple data sources into a centralized data lake architecture to achieve two primary objectives:
Supporting analytics and business intelligence through diverse ingestion patterns and processing workflows.
Generating a POG Score for users.
This section is organized into the following layers, each of which is depicted in the accompanying diagrams and described in detail, along with the AWS services utilized at each stage:
Data Source Layer
Data Ingestion Layer
Data Lake Layer
Processing and Cataloguing Layer
Consumption Layer
Data Source Layer
This layered approach ensures that data from both internal and external sources is effectively ingested, processed, and made actionable for analytics and decision-making. The platform processes data from over 90 tables spanning multiple domains, enabling comprehensive insights into user interactions, financial transactions, and platform engagement. Key data domains include:
User Profiles & KYC: Information related to user identities, verification, and profiles.
Earnings & Rewards: Data tracking user earnings, rewards distribution, and incentive structures.
Gaming & Quests: Insights into user participation in gaming activities, leaderboard standings, and quest completions.
Transactions & Wallets: Detailed records of on-chain transactions and wallet activities.
Social Features & Clans: Data on user engagement with social features, clan memberships, and collaborations.
KShop & Commerce: Information about in-platform purchases and commerce activities.
Esports and Loyalty: Data on user participation and earnings across all seven games integrated so far.
External Sources: Data collected on user activity outside the Kgen ecosystem, including social, gaming, and commerce.
The platform employs two primary data processing methodologies: batch processing and real-time streaming.
Batch Processing Sources
Batch processing aggregates data at scheduled intervals to handle large volumes of information. This is particularly suitable for internal data and external integrations requiring substantial preprocessing time.
Internal Data (Processed Twice Daily):
DynamoDB Tables (Primary Source: 50+ tables): User Profile Data, Transaction Records, Wallet Balances and Histories, Gaming and Quest Engagement Data, KYC Verification Information, Social and Clan Features
RDS Tables: Transaction Events, Blockchain Native Transactions, Inscription Events
External Data: External APIs enhance the platform’s data ecosystem by integrating diverse third-party sources.
Social Media Platforms: X (Twitter) tracks user interactions, trends, and activities; Discord and Telegram capture community engagement and communication patterns.
Blockchain Data: On-chain transactions record user-initiated and system-generated blockchain events; wallet balances track holdings and changes across multiple wallets.
External sources are ingested at different frequencies, depending on the API and how fresh the data needs to be.
Daily Integrations: Social media platforms, including X (formerly Twitter).
Periodic Integrations (e.g., every few days): Blockchain on-chain transaction data. This extended timeline accommodates the processing of high-volume wallets with significant activity.
Real-Time Sources
Real-time data ingestion ensures up-to-the-minute insights, enhancing user engagement tracking and event responsiveness.
Live Event Collection: Discord bot continuously monitors and captures user engagement.
[Diagram: real-time data collection from Discord]
Real-time CDC pipelines: Pixel tracker eventing system
Why We Built a Pixel Tracker
Imagine running a dynamic platform where users engage in a variety of activities — completing quests, interacting with content, and clicking on CTAs. Capturing these interactions in real-time and transforming them into actionable insights is crucial for improving user experience, optimising features, and driving growth. However, as the number of users and the complexity of interactions increase, managing and analysing this data becomes a significant challenge.
Traditional data pipelines often fall short in such scenarios, leading to delays in data processing and insights generation. To address these challenges, we built Px Tracker — a system that can handle the complexities of modern data needs, ensuring that every user interaction is captured, processed, and analysed as it happens.
Px Tracker is a carefully crafted data pipeline that leverages a suite of AWS services to provide a scalable, cost-effective solution for real-time data processing.
Data Ingestion Layer
The data ingestion layer supports both real-time and batch ingestion workflows with scalability and fault tolerance as core priorities. Real-time ingestion begins with data capture via Amazon Kinesis, where it is processed in near real-time using AWS Lambda or AWS Glue streaming jobs. The processed data is then stored in Amazon S3, enabling immediate availability for analytics or further transformation. Batch workflows, on the other hand, rely heavily on scheduled AWS Glue ETL jobs, which extract data from primary sources, apply transformations such as schema standardization or enrichment, and load the refined data into the lake. Apache Airflow workflows further extend this capability by handling integrations with external APIs, such as social media platforms or blockchain networks, and writing the fetched data to Amazon S3.
Scalability and fault tolerance are integral to the design of this layer. The combination of Kinesis for high-velocity data streams, Airflow on Kubernetes for elastic workload management, and AWS Glue with its distributed computing framework ensures that the system can handle spikes in data volume and complexity without compromising performance. By integrating these tools and services, the data ingestion layer delivers a robust, automated, and highly reliable pipeline that effectively bridges the gap between raw data and actionable insights.
Real-time data integration
This workflow begins with event capture via API Gateway, followed by authentication and streaming through Kinesis Data Streams. The data is enriched, stored in Amazon S3 in a queryable format, and made accessible for analysis through tools like AWS Glue and Athena. The streamlined process supports real-time dashboards and aggregated insights, providing the foundation for optimizing user engagement and platform performance.
Let’s bring this architecture to life with a practical example of a gamer doing a quest. Here’s how Px Tracker would track these activities in real-time:
Architecture
Step-by-Step Flow
User Completes a Quest: A user completes a quest on the platform. This action triggers an event that’s immediately captured by API Gateway, which uses the Lambda Authoriser to authenticate the request.
Processing the Event: Once the request is authenticated, API Gateway pushes the event directly into Kinesis Data Streams. From there, the event is processed and enriched with additional metadata such as the user’s ID, the exact time of completion, and the quest’s difficulty level by a downstream Lambda function.
Streaming the Data: Kinesis Data Firehose reads the enriched events from the stream, cleanses and reshapes them with a transformation Lambda function, and delivers them to Amazon S3 in partitioned Parquet format (a sketch of such a transformation function follows this walkthrough).
Storing the Data: The event data is stored in Amazon S3, where it’s cataloged by AWS Glue. Glue organizes the data and makes it ready for querying, ensuring that you can easily access it later for deeper analysis.
Querying and Analysis: Using Amazon Athena, you can run SQL queries on the stored data to answer questions like “How many users completed this quest today?” or “Which quests have the highest completion rates?”
Visualisation: The results from Athena can be visualized on a real-time dashboard, showing key metrics such as quest completion rates, user engagement trends, and the effectiveness of CTAs.
Aggregated Insights: For more aggregated insights, EventBridge triggers a Lambda function that aggregates the data and stores it in Aurora RDS. The Data Mesh Reader Service powered by GraphQL fetches this data, delivering it to the user in real-time.
Optimised Performance: Frequently accessed computations are cached in Redis, ensuring that users receive data with minimal delay.
This end-to-end example illustrates how Px Tracker captures, processes, and analyses user interactions in real-time, providing you with the insights needed to optimise your platform and enhance the user experience.
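As a concrete illustration of the streaming step, here is a minimal sketch of a Firehose transformation Lambda, assuming JSON-encoded events; the enrichment fields (processed_at, quest_difficulty) are illustrative placeholders, not the production schema.

import base64
import json
from datetime import datetime, timezone

def lambda_handler(event, context):
    """Cleanse and enrich Kinesis Data Firehose records before delivery to S3."""
    output = []
    for record in event["records"]:
        # Firehose hands each record as base64-encoded bytes.
        payload = json.loads(base64.b64decode(record["data"]))

        # Illustrative enrichment; these field names are assumptions.
        payload["processed_at"] = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        payload.setdefault("quest_difficulty", "unknown")

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped" or "ProcessingFailed" are the other valid results
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}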
Batch Processing Data Integration - Glue Jobs
The batch data integration layer utilizes AWS Glue to orchestrate and execute Extract, Transform, Load (ETL) jobs, enabling seamless ingestion of data from DynamoDB, RDS, and other sources into an S3 data lake. These Glue jobs are designed to handle both incremental and full-load scenarios while ensuring data quality through validation steps. Partitioning strategies are applied to optimize query performance and downstream analytics.
Incremental and Full-Load Processing
AWS Glue jobs are categorized into incremental and full-load ETL processes:
• Incremental Load: Captures only the changes since the last load, leveraging Hudi for upserts and ensuring efficient updates. For example, the DynamoToS3Incremental-ProdStage-earnings job processes updates with partitioning by entity, year, month, day. This minimizes processing overhead and reduces the time to integrate fresh data. This approach is used for all large tables to reduce AWS costs (see the sketch after this list).
• Full Load: Retrieves the entire dataset from the source where incremental tracking is unnecessary or infeasible, typically for small tables with unpredictable growth, such as the quest tables, which contain metadata about the quests available on the platform.
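A minimal sketch of such an incremental Hudi upsert inside a Glue PySpark job; the bucket names, key fields, and table layout are hypothetical, not the production configuration.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical input: incremental records exported from DynamoDB for the earnings table.
incremental_df = spark.read.parquet("s3://example-bucket/raw/earnings/")

hudi_options = {
    "hoodie.table.name": "earnings",
    "hoodie.datasource.write.recordkey.field": "record_id",        # assumed key column
    "hoodie.datasource.write.precombine.field": "updated_at",      # latest record wins
    "hoodie.datasource.write.partitionpath.field": "entity,year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert into the partitioned Hudi table in the processed zone.
(incremental_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/processed/earnings/"))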
Transformation Example: Flattening Data for Better Queries
In our system, data often contains nested structures that can be challenging to query directly. For example, the PlatformCampaignHub_AirdropUserCampaignProgressStore DynamoDB table contains nested arrays and objects in the progressDetails field that must be flattened for analytics. Using AWS Glue and PySpark, we transform and flatten the data, as shown in the sketch below.
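A minimal sketch of this flattening step, assuming hypothetical S3 paths and illustrative field names inside progressDetails.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode_outer

spark = SparkSession.builder.appName("flatten-progress-details").getOrCreate()

# Hypothetical input path; the actual export location differs.
df = spark.read.json("s3://example-bucket/raw/AirdropUserCampaignProgressStore/")

# Explode the nested progressDetails array into one row per element,
# then promote the nested attributes to top-level columns.
flattened = (
    df.withColumn("progress", explode_outer(col("progressDetails")))
      .select(
          col("userId"),
          col("campaignId"),
          col("progress.stepId").alias("step_id"),          # assumed nested field names
          col("progress.status").alias("step_status"),
          col("progress.completedAt").alias("completed_at"),
      )
)

flattened.write.mode("overwrite").parquet("s3://example-bucket/processed/campaign_progress/")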
Data Quality Checks
Data quality is an integral part of the ETL pipeline. Glue jobs incorporate mechanisms to validate key aspects of the data, such as schema integrity, presence of mandatory fields, and format consistency. This ensures downstream systems work with accurate and reliable data.
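A minimal sketch of such checks in PySpark; the mandatory field names and the timestamp format rule are assumptions for illustration.

from pyspark.sql.functions import col

MANDATORY_FIELDS = ["userId", "eventTimestamp"]  # assumed mandatory columns

def validate(df):
    # Schema integrity: fail fast if a mandatory column is missing.
    missing = [f for f in MANDATORY_FIELDS if f not in df.columns]
    if missing:
        raise ValueError(f"Missing mandatory columns: {missing}")

    # Presence of mandatory fields: drop rows with nulls and report the count.
    clean_df = df.dropna(subset=MANDATORY_FIELDS)
    dropped = df.count() - clean_df.count()
    print(f"Dropped {dropped} rows with null mandatory fields")

    # Format consistency: keep only ISO-like timestamps (illustrative check).
    return clean_df.filter(col("eventTimestamp").rlike(r"^\d{4}-\d{2}-\d{2}T"))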
Hudi Integration for Data Storage
Many Glue jobs, particularly for incremental loads, utilize Apache Hudi to enable efficient upserts and versioning within the S3 data lake. By organizing data into partitioned directories (e.g., year, month, day), Glue jobs make querying with Athena faster and more cost-effective. For instance, DynamoToS3IngestionETLIncremental-ProdStage-user-wallets partitions data by point_of_creation, year, month, day.
Future Improvements
We currently perform full exports from DynamoDB; switching to incremental exports would reduce AWS costs.
POG Glue Jobs
While the Glue jobs described above move data from DynamoDB tables in the production account to the data account, applying the transformations analysts need, another set of Glue jobs aggregates most of these tables to create the POG attributes used in the scoring equations. More details are presented in the discussion of Step Functions below.
Batch Processing Data Integration - Airflow
Apache Airflow is an orchestration platform that efficiently schedules and manages workflows, utilizing Kubernetes for dynamic scaling and resource management. In this implementation, Airflow orchestrates a range of ingestion tasks using Directed Acyclic Graphs (DAGs) to process data in batches. Some of the key DAGs include:
user_activity_timestamp: Tracks and stores user activity timestamps for analytics purposes.
twitter_daily_dag: Handles daily data ingestion and processing from Twitter.
twitter_fetch_user_details: Fetches detailed user information from Twitter for analysis.
telegram_engagement_dag: Collects and processes engagement metrics from Telegram.
hedera_onchain_transactions: Retrieves transaction data from the Hedera blockchain.
hedera_onchain_balances: Extracts and processes balance data from the Hedera blockchain.
polygon_onchain_data_retrieval: Gathers data from the Polygon blockchain for further processing.
polygon_onchain_balance_retrieval: Collects balance information from the Polygon blockchain.
kgen_discord_server_users_dag: Ingests and processes user data from Discord servers.
discord_OAuth: Manages OAuth authentication and related data for Discord.
Dynamic Resource Allocation with Kubernetes
To ensure cost efficiency and scalability, Airflow utilizes the Kubernetes Pod Operator to dynamically spin up Kubernetes pods for task execution. This approach offers several advantages:
Resource Isolation
Each pod is allocated dedicated resources, ensuring that workloads do not interfere with one another.
Scalability
Kubernetes can scale pods up or down based on workload demand, maintaining high availability while optimizing costs.
Flexibility
Pods can run custom containers, allowing the pipeline to handle diverse workloads, including blockchain data retrieval, social media ingestion, and engagement metrics processing.
By dynamically spinning up resources only when required, this architecture reduces operational costs compared to traditional fixed clusters. Additionally, Airflow’s ability to schedule and monitor tasks ensures reliability and transparency in batch processing workflows. AWS services, such as S3, often serve as the target for storing processed data, enabling efficient querying and downstream analytics.
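A minimal sketch of one such DAG using the Kubernetes Pod Operator; the image, schedule, and arguments are placeholders, and import paths or parameter names may vary slightly across Airflow and provider versions.

from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="hedera_onchain_transactions",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch_transactions = KubernetesPodOperator(
        task_id="fetch_hedera_transactions",
        name="fetch-hedera-transactions",
        image="example.registry/hedera-ingest:latest",  # hypothetical container image
        cmds=["python", "ingest.py"],
        arguments=["--target", "s3://example-bucket/raw/hedera/transactions/"],
        get_logs=True,
        is_delete_operator_pod=True,  # free pod resources when the task finishes
    )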
Step Functions for Generating POG
AWS Step Functions is an orchestration tool like Airflow, so why use it when Airflow is already in place? We run only one state machine, which generates the POG score, and Step Functions was preferred for the following reasons:
Ease of Use: Highly intuitive and tightly integrated with AWS services, including AWS Glue. Simple to set up with minimal learning curve. Visual workflow designer for easy orchestration.
Flexibility: Focused on AWS services, with built-in support for Glue jobs, Lambda, DynamoDB, and more. Great for serverless and event-driven workflows.
Scalability: Serverless, fully managed, and scales automatically with workload.
Cost: Charged per state transition; each movement from one state to the next in a workflow counts as one transition.
Monitoring & Debugging: Built-in visual console with clear success/failure paths for each state. Logs are automatically integrated into CloudWatch.
Integration with AWS Glue: Directly supports AWS Glue as a native service. Simple integration with parameters passed between steps.
More details about what each of these Glue jobs does are provided in the POG section below. A sketch of how the state machine chains Glue jobs follows.
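A minimal sketch of such a state machine definition using the native Glue integration; the job names and role ARN are placeholders, not the actual production resources.

import json
import boto3

# Two-step sketch: aggregate POG attributes, then compute the POG score.
pog_state_machine = {
    "Comment": "Aggregate POG attributes, then compute the POG score",
    "StartAt": "AggregatePOGAttributes",
    "States": {
        "AggregatePOGAttributes": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the Glue job
            "Parameters": {"JobName": "pog-attribute-aggregation"},
            "Next": "ComputePOGScore",
        },
        "ComputePOGScore": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "pog-score-computation"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="pog-score-generation",
    definition=json.dumps(pog_state_machine),
    roleArn="arn:aws:iam::123456789012:role/example-stepfunctions-role",
)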
Data Lake Layer
Our system leverages Amazon S3 as the central repository for our data lake, providing scalable, secure, and cost-effective storage for all data. The data lake is organized into three distinct zones, each serving a specific purpose within the data pipeline. This multi-zone architecture ensures scalability, cost efficiency, flexibility, and high performance.
Scalability: Amazon S3 provides limitless storage, accommodating growing data volumes without compromising performance.
Cost Efficiency: Storing raw data in the Raw Zone allows us to retain the original data inexpensively, while the Curated Zone optimizes query costs by pre-aggregating data.
Flexibility: The separation of zones allows different teams—such as Data Engineers, Data Analysts, and Backend Developers—to access data tailored to their needs.
High Performance: By storing data in optimized formats (e.g., Parquet) in the Curated Zone, query times are significantly reduced.
The Data Lake Layer is central to our data ecosystem, enabling seamless ingestion, transformation, and consumption of data while supporting a wide array of business use cases.
Bronze Layer - Raw Zone
The Raw Zone serves as the initial storage layer where ingested data is stored in its original, unprocessed format. This zone primarily consists of full data exports from DynamoDB tables, capturing the entirety of the raw data. Below is an example of a typical DynamoDB export configuration:
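A minimal sketch using boto3 and the DynamoDB export-to-S3 API, with placeholder table and bucket names; point-in-time recovery must be enabled on the table.

import boto3

dynamodb = boto3.client("dynamodb")

# Full export of a table snapshot into the raw zone, kept in its original DynamoDB JSON form.
dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/earnings",  # placeholder ARN
    S3Bucket="example-datalake-raw",
    S3Prefix="dynamodb-exports/earnings/",
    ExportFormat="DYNAMODB_JSON",
)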
The raw data in this zone includes full exports such as earnings, user notifications, and leaderboard information. These exports provide a complete snapshot of the data at a given time, ensuring no data is lost during ingestion.
Silver Layer - Processed Zone
Data in the Processed Zone undergoes ETL (Extract, Transform, Load) operations using AWS Glue. Glue jobs transform raw data into a structured, enriched format optimized for analytics and downstream processing, and the processed data is stored in designated S3 paths categorized by use case or domain. In this layer, raw data from DynamoDB is unmarshalled, standardized, and transformed into a structure suitable for analytics and downstream processing. Key operations include:
Unmarshalling: Deserializing DynamoDB JSON records into human-readable formats, such as splitting composite keys into meaningful fields (e.g., gameid and userid from userID).
Timestamp Formatting: Converting timestamps into ISO-8601 formatted strings with millisecond precision and deriving partitioning fields like year, month, and day.
Nested Data Handling: Exploding and expanding nested arrays such as progressDetails to create flattened structures while retaining all hierarchical information.
Data Validation and Enrichment: Ensuring null or empty fields are handled gracefully, enriching the data by splitting fields, and reformatting structured data into JSON strings for compatibility.
Hudi Integration: Leveraging Apache Hudi to enable upsert and delete operations, ensuring the processed data remains consistent and optimized for querying. The data is partitioned dynamically based on derived fields like year, month, day, or specific domains such as gameid.
Delta Handling: Comparing incoming data with existing records to determine changes (additions, updates, and deletions) and writing results to the respective S3 paths.
These transformations ensure that the processed data in this zone is clean, well-organized, and ready for advanced analytics, machine learning, and reporting use cases.
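A minimal sketch of the unmarshalling, timestamp formatting, and partition derivation steps in PySpark; the input path, composite-key delimiter, and column names are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, date_format, dayofmonth, month, split, to_timestamp, year,
)

spark = SparkSession.builder.appName("silver-transform").getOrCreate()

# Hypothetical raw export path.
raw = spark.read.json("s3://example-bucket/raw/quest-engagement/")

processed = (
    raw
    # Split a composite key such as "gameid#userid" into its parts (assumed delimiter).
    .withColumn("gameid", split(col("userID"), "#").getItem(0))
    .withColumn("userid", split(col("userID"), "#").getItem(1))
    # Normalize the event time to ISO-8601 with millisecond precision.
    .withColumn("event_ts", to_timestamp(col("eventTimestamp")))
    .withColumn("event_time_iso", date_format(col("event_ts"), "yyyy-MM-dd'T'HH:mm:ss.SSS"))
    # Derive partition columns for the processed zone.
    .withColumn("year", year(col("event_ts")))
    .withColumn("month", month(col("event_ts")))
    .withColumn("day", dayofmonth(col("event_ts")))
)

(processed.write
    .partitionBy("year", "month", "day")
    .mode("append")
    .parquet("s3://example-bucket/processed/quest-engagement/"))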
Gold Layer - Curated Zone
The Curated Zone is the final layer in the data lake, containing aggregated and highly processed data tailored for analytics, reporting, and machine learning workloads. Glue jobs further transform the data from the Processed Zone into insightful aggregates, such as the list of all quests completed by a user, the sum of all earnings per user, and each user's last active time.
This zone is optimized for business-critical use cases, where data is stored in formats that maximize query performance for tools like Amazon Athena and Amazon QuickSight.
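A minimal sketch of how such aggregates could be produced in PySpark; the paths and column names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold-aggregates").getOrCreate()

# Hypothetical processed-zone inputs.
earnings = spark.read.parquet("s3://example-bucket/processed/earnings/")
activity = spark.read.parquet("s3://example-bucket/processed/user-activity/")

# Sum of all earnings per user.
total_earnings = earnings.groupBy("userid").agg(F.sum("amount").alias("total_earnings"))

# Each user's last active time.
last_active = activity.groupBy("userid").agg(F.max("event_time_iso").alias("last_active_at"))

curated = total_earnings.join(last_active, on="userid", how="outer")
curated.write.mode("overwrite").parquet("s3://example-bucket/curated/user-summary/")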
Processing and Cataloguing Layer
In this layer, AWS Glue Crawlers are utilized to automatically update table metadata in the AWS Glue Data Catalog, which serves as a central repository for storing and querying data using services like Amazon Athena. However, given the large number of tables being processed, several challenges have arisen:
1. Crawler Failures: Instances where the Glue Crawler fails to run or executes without updating the metadata.
2. Outdated Metadata: These issues result in outdated table metadata in Athena, which directly impacts the accuracy and efficiency of analytics workflows.
To address these challenges and reduce dependency on Glue Crawlers, an alternative solution is being implemented. Apache Airflow is being used to create custom DAGs that update the metadata for all tables. This approach offers several advantages:
• Cost Optimization: Eliminates the recurring cost associated with AWS Glue Crawlers.
• Improved Performance: Custom DAGs ensure faster and more reliable metadata updates, reducing the latency that affects analytics tasks.
• Scalability: Airflow provides flexibility in scheduling and orchestrating metadata updates, accommodating the growing volume of tables and datasets.
By transitioning to this Airflow-based solution, the processing and cataloging layer achieves better efficiency, improved reliability, and significant cost savings.
This solution also aligns with best practices outlined in the AWS Glue Documentation, which recommends designing workflows to manage metadata updates efficiently for environments with a high volume of data. For example:
• Optimizing Crawlers: AWS Glue documentation suggests partitioning data and limiting the crawler scope to improve performance.
• Manual Updates: For dynamic data environments, manual updates to the Glue Data Catalog, or the use of its APIs via frameworks like Airflow, are recommended for better control over metadata.
By leveraging Airflow to manage metadata updates, this architecture ensures the Processing and Cataloging Layer remains robust, scalable, and cost-effective for analytics workflows.
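A minimal sketch of a crawler-free metadata refresh DAG that re-syncs partitions through Athena; the table list, database name, and output location are placeholders.

import boto3
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["earnings", "user_wallets", "quest_engagement"]  # placeholder table names

def refresh_partitions(table_name: str) -> None:
    """Re-sync partition metadata for one table in the Glue Data Catalog via Athena."""
    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table_name}",
        QueryExecutionContext={"Database": "datalake_processed"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )

with DAG(
    dag_id="glue_catalog_metadata_refresh",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        PythonOperator(
            task_id=f"refresh_{table}",
            python_callable=refresh_partitions,
            op_args=[table],
        )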
Consumption Layer
The Consumption Layer is designed to serve a variety of personas, each leveraging the data ecosystem to derive insights and drive decision-making. Here’s a breakdown of the key roles and their interactions within this layer:
Data Analyst
Data Analysts play a critical role in transforming raw data into actionable insights. Their primary activities include:
Using Amazon Athena: Analysts frequently query Athena to explore and analyze large datasets stored in Amazon S3. They utilize SQL-based queries to generate insights and create aggregated views of the data.
Working with Superset: Superset serves as their main tool for building interactive dashboards and visualizations. Analysts use these dashboards to answer key business questions and communicate findings to stakeholders.
Downloading Large Datasets: Occasionally, analysts download large datasets for offline analysis or for use in specialized data tools.
Collaboration with Data Engineers
To effectively meet business needs, Data Analysts closely collaborate with Data Engineers:
Defining Requirements: Analysts communicate their requirements for specific datasets, transformations, or aggregations needed to deliver dashboards and reports.
Feedback Loop: There’s a continuous feedback loop between Data Analysts and Data Engineers to refine data models, optimize queries, and ensure that the data infrastructure supports evolving analytics needs.
Enhanced Tools and Practices
To improve the efficiency and effectiveness of the Consumption Layer, several practices and tools are employed:
Data Governance: Implementing robust data governance practices ensures that analysts have access to clean, consistent, and accurate data.
Pre-built Views: Engineers often create reusable, pre-aggregated views in Athena to accelerate analysis and reduce query complexity (a sketch appears after this list).
Cost Optimization: Query performance in Athena is monitored to minimize data scanning costs, ensuring that analysts can run large queries efficiently without incurring high expenses.
Role-based Access Control (RBAC): Analysts have secure, role-based access to datasets, ensuring compliance with data security policies.
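As an illustration of the pre-built views mentioned above, here is a minimal sketch that creates a pre-aggregated Athena view through boto3; the table, columns, database, and output location are assumptions.

import boto3

CREATE_VIEW_SQL = """
CREATE OR REPLACE VIEW daily_quest_completions AS
SELECT userid,
       date(from_iso8601_timestamp(event_time_iso)) AS completion_date,
       count(*) AS quests_completed
FROM quest_engagement
GROUP BY userid, date(from_iso8601_timestamp(event_time_iso))
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=CREATE_VIEW_SQL,
    QueryExecutionContext={"Database": "datalake_processed"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)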
Backend Team
Once a feature is developed and tested by the QA team, it goes live. However, post-launch, the team often needs to monitor, diagnose, and analyze data to address issues and optimize performance.
Monitoring Post-Launch Features
Issue Identification: After a feature is live, the backend team monitors for potential issues such as transaction failures, i.e., instances where processes like payments or data synchronization fail unexpectedly. Feature performance tracking is another use case: continuously tracking a feature's performance to ensure it meets expected benchmarks or usage patterns before implementing further changes.
Tools and Processes
To effectively handle these post-launch scenarios, the backend team relies on data aggregation, visualization, and querying tools:
Custom Dashboards:
Dashboards are specifically designed to serve the needs of the QA and backend teams. These dashboards provide aggregated data views based on specific logic to help pinpoint issues quickly, with visual cues and alerts for anomalies or critical thresholds.
Amazon Athena:
The backend team uses Athena to query large datasets stored in Amazon S3. This allows them to extract insights efficiently and address issues in near real time. Compared to querying DynamoDB directly, Athena offers easier access to historical and aggregated data and the ability to run SQL-based queries on vast datasets without additional infrastructure.