feat: Pypaimon support ML integration #6770
Open
kaori-seasons wants to merge 26 commits into apache:master from kaori-seasons:ml-intergation
Purpose
Related to issue-6762
1. Overview
1.1 Background and Problem Statement
Paimon is an open-source data lake storage system that supports efficient data queries and update operations. However, in the ML engineering field, Paimon lacks deep integration with mainstream ML frameworks (PyTorch, TensorFlow), leading to the following problems:
Main Issues:
Low Data Pipeline Efficiency
Insufficient Feature Engineering Capabilities
Distributed Training Support Defects
Incomplete Sampling and Data Augmentation
1.2 Industry Benchmarking and Existing Solutions
Industry Benchmarks:
Paimon's Current Capabilities:
2. Technical Solution Design
2.1 Overall Architecture
2.2 Core Module Design
2.2.1 PyTorch Advanced Sampling
Module Path:
pypaimon/ml/pytorch/advanced_sampling.py
Design Principles:
Core Classes:
Use Cases:
2.2.2 Feature Engineering
Module Path:
pypaimon/ml/pytorch/feature_engineering.py
Design Principles:
Core Classes:
Use Cases:
2.2.3 Online Feature Computation
Module Path:
pypaimon/ml/pytorch/online_features.py
Design Principles:
Core Classes:
Use Cases:
2.2.4 TensorFlow Performance Optimization
Module Path:
pypaimon/ml/tensorflow/performance.py
Design Principles:
Core Classes:
Optimization Strategies:
2.2.5 Distributed Training Support
Module Path:
pypaimon/ml/tensorflow/distributed.py
Design Principles:
Core Classes:
Use Cases:
3. Scenarios Considered and Tradeoff Solutions
3.1 Class Imbalance Scenario
Scenario Description:
Tradeoff Options:
Adopted Solution: Weighted sampling + Hard example mining
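To make the adopted approach concrete, here is a minimal, framework-free sketch of inverse-frequency weighted sampling combined with a simple form of hard example mining (boosting the weights of the highest-loss samples). The function names, the `top_fraction` and `boost` parameters, and the toy data are all assumptions for illustration; they are not the API of `advanced_sampling.py`.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample a weight of 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def boost_hard_examples(weights, losses, top_fraction=0.1, boost=2.0):
    """Multiply the weights of the highest-loss samples by `boost`."""
    k = max(1, int(len(losses) * top_fraction))
    hardest = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)[:k]
    boosted = list(weights)
    for i in hardest:
        boosted[i] *= boost
    return boosted


def sample_indices(weights, num_samples, seed=None):
    """Draw sample indices with replacement, proportionally to `weights`."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


# Toy data (hypothetical): 90 majority-class samples, 10 minority-class samples.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)

# Pretend the last sample had a much higher loss in the previous epoch.
losses = [0.1] * 99 + [5.0]
boosted = boost_hard_examples(weights, losses, top_fraction=0.01)

indices = sample_indices(weights, num_samples=10_000, seed=0)
```

With inverse-frequency weights, each class carries the same total weight, so the minority class is drawn roughly as often as the majority class. In PyTorch, per-sample weights of this shape can be passed to `torch.utils.data.WeightedRandomSampler`.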
3.2 Memory Constraint Scenario
Scenario Description:
Tradeoff Options:
Adopted Solution: Partial cache + Disk cache option
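One way to read "partial cache + disk cache option" is sketched below: keep only a bounded number of entries in memory and, if a directory is configured, write the remainder to disk. The class name `PartialCache`, its parameters, and the pickle-based file layout are assumptions made for this sketch, not the PR's implementation.

```python
import os
import pickle
import tempfile


class PartialCache:
    """Hold up to `max_memory_items` entries in memory; overflow goes to disk if enabled."""

    def __init__(self, max_memory_items, disk_dir=None):
        self.max_memory_items = max_memory_items
        self.disk_dir = disk_dir      # None disables the disk tier
        self._memory = {}
        self._disk_paths = {}         # key -> path of the pickled value

    def put(self, key, value):
        """Store the value and return the tier used: 'memory', 'disk', or None."""
        if key in self._memory or len(self._memory) < self.max_memory_items:
            self._memory[key] = value
            return "memory"
        if self.disk_dir is None:
            return None               # memory budget exhausted and no disk tier
        if key not in self._disk_paths:
            name = f"entry_{len(self._disk_paths)}.pkl"
            self._disk_paths[key] = os.path.join(self.disk_dir, name)
        with open(self._disk_paths[key], "wb") as f:
            pickle.dump(value, f)
        return "disk"

    def get(self, key):
        """Return the cached value, or None if the key was never cached."""
        if key in self._memory:
            return self._memory[key]
        path = self._disk_paths.get(key)
        if path is None:
            return None
        with open(path, "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    cache = PartialCache(max_memory_items=2, disk_dir=tmp)
    tiers = [cache.put(i, [i] * 3) for i in range(4)]
    values = [cache.get(i) for i in range(4)]
```

The memory tier never evicts in this sketch, which keeps the memory footprint predictable: the first entries stay hot and everything else costs a disk read.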
3.3 Distributed Training Scenario
Scenario Description:
Tradeoff Options:
Adopted Solution: AllReduce (MirroredStrategy and MultiWorkerMirroredStrategy) + Parameter server (optional)
3.4 Data Preprocessing Latency Scenario
Scenario Description:
Tradeoff Options:
Adopted Solution: Feature cache + LRU eviction policy
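The feature cache with LRU eviction could look roughly like the following sketch, which memoizes an expensive feature computation and discards the least recently used entry once capacity is exceeded. `LRUFeatureCache`, `compute_fn`, and the hit/miss counters are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class LRUFeatureCache:
    """Cache computed features, evicting the least recently used entry when full."""

    def __init__(self, capacity, compute_fn):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._compute_fn = compute_fn
        self._entries = OrderedDict()   # oldest entry first, most recent last
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._compute_fn(key)           # the slow preprocessing step
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # drop the least recently used
        return value

    def keys(self):
        return list(self._entries)


feature_cache = LRUFeatureCache(capacity=2, compute_fn=lambda x: x * x)
results = [feature_cache.get(k) for k in (1, 2, 1, 3, 2)]
```

For pure functions of hashable arguments, `functools.lru_cache` from the standard library provides the same eviction policy without a custom class.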
4. Data Volume Analysis and Necessity
4.1 Performance Benchmarking and Data Assumptions
Scenario A: Medium-scale Dataset
Scenario B: Large-scale Dataset (Recommendation)
Scenario C: Massive-scale Dataset (Industrial)
4.2 Performance Comparison Analysis
Scenario A: 10GB Dataset
Necessity Assessment:
Scenario B: 500GB Dataset
Necessity Assessment:
Scenario C: 20TB Dataset
Necessity Assessment:
4.3 Cost-Benefit Analysis
GPU Cost Comparison (V100 as example)
Scenario A: 10GB Dataset, 100 epochs
Scenario B: 500GB Dataset, 10 epochs
Scenario C: 20TB Dataset, 10 epochs
4.4 Necessity Summary
5. New API Inventory
5.1 PyTorch Module API
advanced_sampling.py
feature_engineering.py
online_features.py
5.2 TensorFlow Module API
performance.py
distributed.py
5.3 API Usage Examples
Example 1: Handle Class Imbalance
Example 2: Feature Normalization and Encoding
Example 3: Data Pipeline Optimization
Example 4: Distributed Training
6. Production Readiness Guarantee
6.1 Exception Handling and Logging
All modules implement comprehensive exception handling:
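The exact handler code is not reproduced here; as a hedged illustration of the general pattern, the sketch below logs a failure with the full traceback via `exc_info=True` and then re-raises. The logger name and the `safe_transform` helper are assumptions, not identifiers from the PR.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("pypaimon.ml.example")


def safe_transform(transform, batch):
    """Apply `transform` to `batch`; log the traceback and re-raise on failure."""
    try:
        return transform(batch)
    except (ValueError, TypeError) as exc:
        # exc_info=True attaches the current traceback to the log record.
        logger.error("Feature transform failed: %s", exc, exc_info=True)
        raise
```

Re-raising after logging keeps the failure visible to the caller while still leaving a traceback in the logs.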
Key Features:
exc_info=True)
6.2 Memory and Performance Monitoring
Tests
API and Format
Documentation