26 commits
86038ee
feat: support ml dataset interface
kaori-seasons Dec 8, 2025
035faef
enhance: add test && type check
kaori-seasons Dec 8, 2025
aa41f54
feat: add distribute compute engine
kaori-seasons Dec 8, 2025
d981daa
enhance: support online learning
kaori-seasons Dec 8, 2025
df7d169
chore: translate doc
kaori-seasons Dec 8, 2025
fd54d95
chore: translate english
kaori-seasons Dec 8, 2025
c6738a9
enhance: add batch size
kaori-seasons Dec 8, 2025
07bef5e
bugfix: fix tests
kaori-seasons Dec 8, 2025
7a8a8d5
bugfix: fix tests
kaori-seasons Dec 8, 2025
571f0d7
bugfix: fix test
kaori-seasons Dec 8, 2025
46d4bc2
bugfix: fix test
kaori-seasons Dec 8, 2025
6064773
bugfix: fix test
kaori-seasons Dec 8, 2025
899069c
bugfix: fix test
kaori-seasons Dec 8, 2025
735cb23
bugfix: fix test
kaori-seasons Dec 8, 2025
bc1490b
bugfix: fix test
kaori-seasons Dec 8, 2025
bd46ec4
chore: add deps
kaori-seasons Dec 8, 2025
0771173
Merge remote-tracking branch 'upstream/master' into ml-intergation
kaori-seasons Dec 8, 2025
e27384b
chore: remove useless deps
kaori-seasons Dec 8, 2025
f92d434
chore: add deps
kaori-seasons Dec 8, 2025
40c1397
chore: add cleanup ci
kaori-seasons Dec 8, 2025
404afad
chore: add cache pip choice clean
kaori-seasons Dec 8, 2025
47ab4a4
chore: add deps check
kaori-seasons Dec 8, 2025
ff0a9bc
chore: add deps check
kaori-seasons Dec 8, 2025
0dae28b
chore: add deps check
kaori-seasons Dec 8, 2025
be6581e
chore: fix system deps retry logic
kaori-seasons Dec 8, 2025
c20a546
chore: add deps check
kaori-seasons Dec 8, 2025
65 changes: 61 additions & 4 deletions .github/workflows/paimon-python-checks.yml
@@ -49,6 +49,24 @@ jobs:
         python-version: ['3.6.15', '3.10']

     steps:
+      - name: Clean up disk space
+        shell: bash
+        run: |
+          echo "Disk usage before cleanup:"
+          df -h
+          echo "Cleaning old pip cache and build artifacts..."
+          # Only clean pip cache older than 7 days
+          find ~/.cache/pip -type f -atime +7 -delete 2>/dev/null || true
+          # Clean Maven cache older than 7 days
+          find ~/.m2/repository -type f -atime +7 -delete 2>/dev/null || true
+          # Clean old Maven artifacts in /tmp (but not recent ones); the -name tests are grouped so -type and -atime apply to both
+          find /tmp -maxdepth 2 -type f \( -name '*.jar' -o -name '*.pom' \) -atime +2 -delete 2>/dev/null || true
+          # Clean system package caches
+          apt-get clean 2>/dev/null || true
+          rm -rf /var/lib/apt/lists/* 2>/dev/null || true
+          echo "Disk usage after cleanup:"
+          df -h
+
       - name: Checkout code
         uses: actions/checkout@v2
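
Review note: the cleanup step above deletes files by last-access age and, for `/tmp`, by file extension. The same rule can be expressed as a small Python function, which makes the intended grouping (extension filter AND age filter) explicit. This is only a sketch: `remove_stale_files` and its parameters are hypothetical names, not part of the PR, and the age check is an approximation of `find -atime`, which rounds to whole days.

```python
import os
import time
from pathlib import Path


def remove_stale_files(root, max_age_days, suffixes=None, now=None):
    """Delete regular files under `root` last accessed more than `max_age_days` ago.

    If `suffixes` is given (e.g. {".jar", ".pom"}), only files with one of
    those extensions are considered. Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    root = Path(root)
    if not root.is_dir():
        return removed  # nothing to clean, mirrors `|| true` in the workflow
    for path in root.rglob("*"):
        # Skip directories and symlinks; only plain files are candidates.
        if path.is_symlink() or not path.is_file():
            continue
        # The extension filter and the age filter must BOTH hold.
        if suffixes is not None and path.suffix not in suffixes:
            continue
        if path.stat().st_atime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Unlike the `find` invocation, this sketch does not limit the search depth; a depth limit would need to be added for an exact equivalent of `-maxdepth 2`.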

@@ -66,11 +84,28 @@ jobs:
       - name: Install system dependencies
         shell: bash
         run: |
-          apt-get update && apt-get install -y \
+          # Retry logic for apt-get update
+          for i in {1..3}; do
+            echo "apt-get update attempt $i..."
+            apt-get update -y && break
+            if [ $i -lt 3 ]; then
+              echo "Retrying in 10 seconds..."
+              sleep 10
+            else
+              echo "ERROR: apt-get update failed after 3 retries"
+              exit 1
+            fi
+          done
+
+          # Install packages with --fix-missing for resilience
+          echo "Installing system dependencies..."
+          apt-get install -y --fix-missing \
             build-essential \
             git \
-            curl \
-            && rm -rf /var/lib/apt/lists/*
+            curl
+
+          # Clean up package lists to free space
+          rm -rf /var/lib/apt/lists/*

       - name: Verify Java and Maven installation
         run: |
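
Review note: the `apt-get update` loop in this hunk is a fixed-delay retry: try up to three times, wait 10 seconds between attempts, and fail the step if every attempt fails. A minimal Python sketch of the same pattern follows. The `retry` helper and its parameters are hypothetical and not part of the PR; the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time


def retry(action, attempts=3, delay_seconds=10, sleep=time.sleep):
    """Call `action` until it returns True, at most `attempts` times.

    Waits `delay_seconds` between attempts, but not after the last one.
    Returns True on the first success, False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

A caller could wrap a command with something like `retry(lambda: subprocess.run(["apt-get", "update", "-y"]).returncode == 0)` and exit non-zero when it returns `False`, which corresponds to the `exit 1` branch in the workflow.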
@@ -94,8 +129,30 @@ jobs:
             python -m pip install -q readerwriterlock==1.0.9 'fsspec==2021.10.1' 'cachetools==4.2.4' 'ossfs==2021.8.0' pyarrow==6.0.1 pandas==1.1.5 'polars==0.9.12' 'fastavro==1.4.7' zstandard==0.19.0 dataclasses==0.8.0 flake8 pytest py4j==0.10.9.9 requests parameterized==0.8.1 2>&1 >/dev/null
           else
             python -m pip install --upgrade pip
-            python -m pip install -q readerwriterlock==1.0.9 fsspec==2024.3.1 cachetools==5.3.3 ossfs==2023.12.0 ray==2.48.0 fastavro==1.11.1 pyarrow==16.0.0 zstandard==0.24.0 polars==1.32.0 duckdb==1.3.2 numpy==1.24.3 pandas==2.0.3 pylance==0.39.0 flake8==4.0.1 pytest~=7.0 py4j==0.10.9.9 requests parameterized==0.9.0 2>&1 >/dev/null
+            # Use compatible numpy version for Python 3.10: numpy>=1.26.0 or 1.23.x
+            python -m pip install -q numpy==1.26.0
+            python -m pip install -q readerwriterlock==1.0.9 fsspec==2024.3.1 cachetools==5.3.3 ossfs==2023.12.0 ray==2.48.0 fastavro==1.11.1 pyarrow==16.0.0 zstandard==0.24.0 polars==1.32.0 duckdb==1.3.2 pandas==2.0.3 pylance==0.39.0 flake8==4.0.1 pytest~=7.0 py4j==0.10.9.9 requests parameterized==0.9.0 2>&1 >/dev/null
+            # Install PyTorch CPU version - optional for ML module testing
+            echo "Installing PyTorch CPU version..."
+            python -m pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu 2>&1 && echo "PyTorch installed successfully" || echo "Warning: PyTorch installation failed, ML tests will be skipped"
+            # Install TensorFlow CPU version - optional for ML module testing
+            echo "Installing TensorFlow CPU version..."
+            python -m pip install -q tensorflow-cpu 2>&1 && echo "TensorFlow installed successfully" || echo "Warning: TensorFlow installation failed, ML tests will be skipped"
           fi
+
+      - name: Clean pip cache after dependency installation
+        shell: bash
+        run: |
+          echo "Disk usage before cleanup:"
+          df -h
+          echo "Cleaning old pip cache to free space..."
+          # First clean pip cache files older than 3 days
+          find ~/.cache/pip -type f -atime +3 -delete 2>/dev/null || true
+          # Then purge the remaining cache if `pip cache purge` is available (pip 20.1+)
+          python -m pip cache purge 2>/dev/null || true
+          apt-get clean 2>/dev/null || true
+          echo "Disk usage after cleanup:"
+          df -h
       - name: Run lint-python.sh
         shell: bash
         run: |
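
Review note: the install step treats PyTorch and TensorFlow as optional and prints "ML tests will be skipped" when installation fails, so the test suite has to detect a missing framework at run time. One common way to do that is sketched below using only the standard library. `has_module`, `HAS_TORCH` and `TorchDatasetTest` are hypothetical names for illustration; how the PR's tests actually skip is not shown in this diff.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if top-level module `name` is installed, without importing it."""
    return importlib.util.find_spec(name) is not None


# Evaluated once at import time; False when the optional install step failed.
HAS_TORCH = has_module("torch")


@unittest.skipUnless(HAS_TORCH, "PyTorch is not installed")
class TorchDatasetTest(unittest.TestCase):  # hypothetical test class
    def test_iterable_dataset_is_available(self):
        # Imported inside the test so this file still loads without torch.
        from torch.utils.data import IterableDataset
        self.assertTrue(isinstance(IterableDataset, type))
```

With pytest, `pytest.importorskip("torch")` at the top of a test module gives a similar effect.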
56 changes: 56 additions & 0 deletions paimon-python/pypaimon/ml/__init__.py
@@ -0,0 +1,56 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
PyPaimon ML/AI Integration Module

This module provides seamless integration with PyTorch and TensorFlow
for machine learning workloads on Paimon tables.

Example:
    PyTorch usage::

        from pypaimon import CatalogFactory
        from pypaimon.ml.pytorch import PaimonIterableDataset
        from torch.utils.data import DataLoader

        catalog = CatalogFactory.create({'warehouse': '/path/to/warehouse'})
        table = catalog.get_table('default.training_data')
        read_builder = table.new_read_builder()

        dataset = PaimonIterableDataset(read_builder=read_builder)
        dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

    TensorFlow usage::

        from pypaimon import CatalogFactory
        from pypaimon.ml.tensorflow import PaimonTensorFlowDataset
        import tensorflow as tf

        catalog = CatalogFactory.create({'warehouse': '/path/to/warehouse'})
        table = catalog.get_table('default.training_data')
        read_builder = table.new_read_builder()

        tf_dataset = PaimonTensorFlowDataset.from_paimon(
            table=table,
            read_builder=read_builder
        )
        tf_dataset = tf_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
"""

__all__ = ['pytorch', 'tensorflow']
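
Review note: the PyTorch example in the docstring passes `num_workers=4` to `DataLoader`. With an iterable-style dataset, each worker process runs its own copy of the iterator, so the dataset has to divide the input among workers or every row is read four times. The diff does not show how `PaimonIterableDataset` handles this, so the following is only a sketch of one possible approach; `shard_for_worker` and the notion of a list of "splits" are assumptions for illustration.

```python
def shard_for_worker(splits, worker_id, num_workers):
    """Return the splits that worker `worker_id` should read.

    Splits are assigned round-robin, so every split goes to exactly one worker.
    """
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return splits[worker_id::num_workers]


# Inside an IterableDataset.__iter__, the worker identity would typically come
# from torch.utils.data.get_worker_info(), which returns None in the main
# process and an object with `.id` and `.num_workers` in a worker:
#
#     info = torch.utils.data.get_worker_info()
#     if info is None:
#         my_splits = splits
#     else:
#         my_splits = shard_for_worker(splits, info.id, info.num_workers)
```

Round-robin assignment keeps the shards balanced to within one split; it does not account for splits of very different sizes.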