Welcome to Federated Data Valuation, a federated learning framework that implements the methods proposed in the paper:
**Data Valuation and Detections in Federated Learning**
Wenqian Li, Shuran Fu, Fengrui Zhang, Yan Pang
arXiv:2311.05304
This project leverages the Wasserstein distance to evaluate client contributions and detect noisy or irrelevant data in a privacy-preserving manner. It provides a scalable and efficient solution for data valuation in federated learning without relying on validation datasets.
- Features
- Getting Started
- Usage
- Blockchain Integration for DP FL
- Project Structure
- Detailed Description
- Results
- Notes
- Contributing
- License
- Authors
- Acknowledgements
- Contact
- References
- Privacy-Preserving Data Valuation: Evaluate client contributions using Wasserstein distance without sharing raw data.
- No Validation Dataset Required: Efficient computation of Wasserstein barycenter eliminates the need for validation datasets.
- Data Detection: Identify and filter out noisy or irrelevant data points.
- Scalability: Optimized for large-scale federated learning with numerous clients.
- Advanced Models: Supports Vision Transformer (ViT) and ResNet architectures.
- Optimized Training and Evaluation: Utilizes multiprocessing and efficient data loading techniques.
- Logging and Visualization: Detailed logging and plots for client contributions and training accuracy.
- Extensible: Easy to integrate with custom datasets and models.
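For intuition about the first feature, here is how a 1-D Wasserstein distance between two empirical samples looks with SciPy. This is a toy sketch only, not the repo's implementation, which works on full data distributions:

```python
from scipy.stats import wasserstein_distance

# Two small empirical samples standing in for a client's data
# and the global reference distribution.
client_samples = [0.1, 0.4, 0.5, 0.9]
global_samples = [0.2, 0.3, 0.6, 0.8]

print(wasserstein_distance(client_samples, global_samples))  # smaller = more similar
```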
Ensure you have the following installed:
- Python 3.11.8 (has not been tested with other versions of Python)
- PyTorch
- torchvision
- transformers
- Other dependencies listed in `requirements.txt`
- Or, simply run the `setup.sh` file, which will set up a new conda environment for you and install the necessary requirements
- Clone the repository

  ```bash
  git clone https://github.com/datanovatrust/federated-data-valuation.git
  cd federated-data-valuation
  ```

- Create a virtual environment (optional but recommended)

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
Before running the training script, you can adjust the configuration in `src/config/config.yaml`:

```yaml
model:
  name: vit        # Options: 'vit', 'resnet'
  num_labels: 10

training:
  batch_size: 32
  epochs: 1
  learning_rate: 1e-4

federated_learning:
  num_clients: 5
  rounds: 5
  fraction_fit: 0.6  # Fraction of clients selected per round
  num_shards: 50     # Partitions data into 50 shards
  num_samples: 100   # Limits dataset size to 100 samples

fedbary:
  use_validation: true  # Use validation data for the global distribution
```
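If you want to read the same configuration from your own scripts, a minimal sketch (assuming PyYAML, which the repo's YAML configs imply):

```python
import yaml

# Load the training configuration used by the federated scripts.
with open("src/config/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["federated_learning"]["num_clients"])  # -> 5
```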
To start federated training:

```bash
python scripts/train_federated.py
```

You can also specify the differential privacy epsilon value via a command-line argument:
```bash
python scripts/train_federated.py --epsilon=10
```

After training, evaluation metrics and plots are saved in the `experiments` directory:
- Client Contributions Plot: `experiments/client_contributions.png`
- Training Accuracy Plot: `experiments/training_accuracy.png`
- Confusion Matrix: `experiments/confusion_matrix_round_{round_num}.png` (one for each round)
To enhance privacy and security, we integrate Blockchain and Differential Privacy into the federated learning pipeline. This approach records global model updates on-chain, leverages IPFS for decentralized storage, and can use differential privacy to protect client data. By default, we use a local Ethereum node (Anvil from Foundry) and a locally running IPFS node to avoid authentication issues with Infura.
- Ethereum Node (Anvil): We use Anvil (part of Foundry) as the local Ethereum node.
- Brownie: A Python framework for Ethereum, used to compile, deploy, and interact with smart contracts.
- IPFS Node (Local): We run a local IPFS daemon to store global model checkpoints without relying on external services.
Ensure you are in the `federated-data-valuation` directory and have the virtual environment activated:

```bash
pip install eth-brownie
```

Check the Brownie installation:

```bash
brownie --version
```

You should see a Brownie version in the output.
- Install Foundry & Anvil:

  ```bash
  curl -L https://foundry.paradigm.xyz | bash
  # Follow the instructions, then:
  source ~/.zshenv
  foundryup
  ```

- Run Anvil in a separate terminal:

  ```bash
  anvil
  ```

  Keep Anvil running in this terminal. It listens at `http://127.0.0.1:8545`.
- Compile the Contract. We assume you have `FLRegistry.sol` in `contracts/`. Run:

  ```bash
  brownie compile
  ```
- Deploy the Contract. Use the provided deployment script (e.g., `scripts/deploy_contract.py`):

  ```bash
  brownie run scripts/deploy_contract.py --network development
  ```

  Brownie connects to Anvil and prints the deployed contract address. Copy that address.
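For reference, a minimal Brownie deployment script looks like the sketch below. The repo ships its own `scripts/deploy_contract.py`; any constructor arguments are assumptions here:

```python
# Illustrative sketch of a Brownie deployment script.
from brownie import FLRegistry, accounts

def main():
    # Anvil's first unlocked account pays for the deployment.
    deployer = accounts[0]
    registry = FLRegistry.deploy({"from": deployer})
    print(f"FLRegistry deployed at {registry.address}")
```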
- Extract the ABI. After compilation, your contract ABI is in `build/contracts/FLRegistry.json`. Extract it:

  ```bash
  cat build/contracts/FLRegistry.json | jq '.abi' > src/config/FLRegistry_abi.json
  ```
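If you don't have `jq` installed, the same extraction takes a few lines of Python:

```python
import json

# Pull the ABI out of Brownie's build artifact.
with open("build/contracts/FLRegistry.json") as f:
    artifact = json.load(f)

with open("src/config/FLRegistry_abi.json", "w") as f:
    json.dump(artifact["abi"], f, indent=2)
```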
- Update `blockchain_config.yaml`. Edit `src/config/blockchain_config.yaml` to point to your deployed contract, local RPC, and local IPFS:

  ```yaml
  blockchain:
    enabled: true
    rpc_url: "http://127.0.0.1:8545"        # Anvil endpoint
    contract_address: "0xYourDeployedContractAddress"
    ipfs_gateway: "http://127.0.0.1:5001"   # Local IPFS daemon endpoint
    abi_file: "src/config/FLRegistry_abi.json"
  ```
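To sanity-check this configuration, you can build a contract handle from it directly. A sketch assuming web3.py v6+ and PyYAML:

```python
import json
import yaml
from web3 import Web3

with open("src/config/blockchain_config.yaml") as f:
    cfg = yaml.safe_load(f)["blockchain"]

# Connect to Anvil and load the ABI extracted earlier.
w3 = Web3(Web3.HTTPProvider(cfg["rpc_url"]))
with open(cfg["abi_file"]) as f:
    abi = json.load(f)

contract = w3.eth.contract(
    address=Web3.to_checksum_address(cfg["contract_address"]),
    abi=abi,
)
assert w3.is_connected(), "Is Anvil running on 127.0.0.1:8545?"
```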
- Install IPFS. On macOS (Homebrew):

  ```bash
  brew install ipfs
  ```

  On Linux, follow the instructions at https://docs.ipfs.tech/install/.

- Initialize and Run the IPFS Daemon:

  ```bash
  ipfs init
  ipfs daemon
  ```

  This runs IPFS locally with an API at `127.0.0.1:5001`.
With Anvil running, the contract deployed, the IPFS daemon running, and `blockchain_config.yaml` pointing to the local IPFS and Ethereum nodes:

```bash
python scripts/train_blockchain_federated.py
```

What this does:

- Performs federated training as in `train_federated.py`.
- Uses DP for privacy if `--epsilon` is provided.
- Uploads global model checkpoints to IPFS (no authentication required locally).
- Records the model hash on the blockchain contract.
- If a `PRIVATE_KEY` environment variable is set, it can send transactions to record model updates on-chain. Without a private key, it will only read from the blockchain.
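Under the hood, the IPFS upload and on-chain record look roughly like the sketch below, assuming `ipfshttpclient` and web3.py. The checkpoint path and the `recordModelUpdate` method name are hypothetical; use whatever `FLRegistry.sol` actually exposes:

```python
import json
import ipfshttpclient
from web3 import Web3

# Upload a checkpoint to the local IPFS daemon (API on 127.0.0.1:5001).
client = ipfshttpclient.connect("/ip4/127.0.0.1/tcp/5001/http")
cid = client.add("checkpoints/global_model.pt")["Hash"]

# Record the content hash on-chain. Anvil's accounts are unlocked,
# so a plain transact() from its first account works locally.
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))
with open("src/config/FLRegistry_abi.json") as f:
    abi = json.load(f)
contract = w3.eth.contract(address="0xYourDeployedContractAddress", abi=abi)

tx = contract.functions.recordModelUpdate(cid).transact({"from": w3.eth.accounts[0]})
w3.eth.wait_for_transaction_receipt(tx)
```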
- IPFS Connection Refused: Make sure `ipfs daemon` is running locally.
- Blockchain Issues: Ensure Anvil is running and the contract is deployed.
This project also references privacy-preserving approaches in Blockchain-based Federated Learning as discussed in:
Privacy-Preserving in Blockchain-based Federated Learning Systems
Sameera K. M., Serena Nicolazzo, Marco Arazzi, Antonino Nocera, Rafidha Rehiman K. A., Vinod P., Mauro Conti
arXiv:2401.03552
If you are using Blockchain and DP features, consider citing:
```bibtex
@misc{sameera2024privacy,
  title={Privacy-preserving in Blockchain-based Federated Learning Systems},
  author={K. M., Sameera and Nicolazzo, Serena and Arazzi, Marco and Nocera, Antonino and Rehiman K. A., Rafidha and P., Vinod and Conti, Mauro},
  year={2024},
  eprint={2401.03552},
  archivePrefix={arXiv},
  note={Computer Communications Journal, 2024}
}
```
```
federated-data-valuation
├── Dockerfile
├── README.md
├── banner.svg
├── checkpoints
│   └── ...
├── data
│   └── ...
├── docs
├── experiments
│   └── ...
├── logs
│   └── ...
├── notebooks
│   └── federated_training.ipynb
├── requirements.txt
├── scripts
│   ├── deploy_contract.py
│   ├── run_rmia_attack.py
│   ├── train_federated.py
│   └── train_peft_federated.py
├── src
│   ├── attacks
│   ├── config
│   │   ├── config.yaml
│   │   ├── peft_config.yaml
│   │   └── blockchain_config.yaml
│   ├── models
│   ├── trainers
│   └── utils
│       ├── blockchain_utils.py
│       └── ...
└── tests
    └── ...
```

- Data Loading: MNIST dataset is loaded and transformed to match the input requirements of ViT.
- Data Partitioning: Data is partitioned among clients in a non-IID fashion using shards.
- Client Setup: Local clients are initialized with their respective datasets.
- Client Contribution Evaluation: Wasserstein distance is computed to evaluate data distribution similarity.
- Client Selection: Clients are selected based on their Wasserstein distances.
- Training Rounds: For each round:
  - Selected clients train the model locally.
  - Models are aggregated to update the global model.
  - The global model is evaluated on the test dataset.
- Results Visualization: Plots for client contributions and training accuracy are generated.
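As a rough illustration of the shard-based non-IID split described above (a sketch only; the repo's partitioner may differ in detail, and `num_shards`/`num_clients` mirror `config.yaml`):

```python
import numpy as np

def partition_by_shards(labels, num_clients=5, num_shards=50, seed=0):
    """Sort indices by label, cut into shards, deal shards to clients."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                  # group sample indices by class
    shards = np.array_split(order, num_shards)  # contiguous, label-skewed blocks
    dealt = rng.permutation(num_shards)
    per_client = num_shards // num_clients
    return {
        c: np.concatenate(
            [shards[s] for s in dealt[c * per_client:(c + 1) * per_client]]
        )
        for c in range(num_clients)
    }
```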
- Wasserstein Distance: Measures the distribution similarity between client data and the global distribution.
- Federated Barycenter Computation: Approximates the Wasserstein barycenter among client distributions.
- Privacy Preservation: No raw data is shared; only interpolating measures are communicated.
- Client Selection Strategy: Clients with the smallest Wasserstein distances are selected for training.
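Client selection then reduces to keeping the `fraction_fit` fraction of clients closest to the global distribution. A minimal sketch:

```python
def select_clients(distances, fraction_fit=0.6):
    """distances: {client_id: Wasserstein distance to the global distribution}."""
    k = max(1, int(len(distances) * fraction_fit))
    return sorted(distances, key=distances.get)[:k]

# select_clients({0: 0.12, 1: 0.40, 2: 0.08, 3: 0.55, 4: 0.21}) -> [2, 0, 4]
```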
- Duality Theorem: Utilizes the dual formulation of the Wasserstein distance to compute calibrated gradients.
- Datum Evaluation: Calculates the contribution of individual data points to the overall distance.
- Noisy Data Detection: Identifies and filters out noisy or irrelevant data points before training.
- Efficiency: Detects data issues without the need for model training or validation datasets.
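As a toy illustration of dual-based valuation, the sketch below uses the POT library's dual potentials as a stand-in for the paper's calibrated gradients; the paper's exact calibration differs in detail:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 2))                 # client data
y = rng.normal(size=(30, 2))                 # reference (global) distribution
a, b = np.full(20, 1 / 20), np.full(30, 1 / 30)

M = ot.dist(x, y)                            # pairwise squared Euclidean costs
_, log = ot.emd(a, b, M, log=True)           # exact OT, with dual potentials

u = log["u"]                                 # one dual potential per client datum
suspect = np.argsort(u)[-3:]                 # largest potentials: most "expensive" points
```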
- Device Selection: The script automatically uses GPU if available.
- Optimizations: Data loading and evaluation are optimized for performance.
- Error Handling: Extensive error handling and logging are implemented for robustness.
- Scalability: Designed to handle large numbers of clients efficiently.
Contributions are welcome! Please open an issue or submit a pull request for improvements.
This project is licensed under the MIT License.
- David Zagardo - Initial work - dzagardo
This project implements methods from the paper:
- Data Valuation and Detections in Federated Learning
Wenqian Li, Shuran Fu, Fengrui Zhang, Yan Pang
arXiv:2311.05304
We thank the authors for their valuable contributions to the field.
Additionally, we acknowledge the use of the Fast Differential Privacy (fastDP) library developed by Zhiqi Bu and colleagues, which provides efficient differentially private optimization for PyTorch models.
- Fast Differential Privacy Library (GitHub repository)
Please consider citing their work:
```bibtex
@inproceedings{bu2023differentially,
  title={Differentially private optimization on large model at small cost},
  author={Bu, Zhiqi and Wang, Yu-Xiang and Zha, Sheng and Karypis, George},
  booktitle={International Conference on Machine Learning},
  pages={3192--3218},
  year={2023},
  organization={PMLR}
}
```
We are grateful for their valuable contributions to the field and for making their library available.
Feel free to reach out for any inquiries or support.
- Email: dave@greenwillowstudios.com
- GitHub: dzagardo
- Li et al., "Data Valuation and Detections in Federated Learning," arXiv:2311.05304.
- Sameera K. M. et al., "Privacy-preserving in Blockchain-based Federated Learning Systems," arXiv:2401.03552.
- Bu et al., "Differentially private optimization on large model at small cost," ICML 2023.
- fastDP
- PyTorch Docs
- Hugging Face Transformers
- Federated Learning Concepts
Just when you thought it couldn't get better, we've added more features!
You can now use your own custom datasets by modifying the `load_custom_dataset` function in `src/utils/data_loader.py`. The function supports loading images and CSV files.
In `train_federated.py`, replace the MNIST data loading with your custom dataset:

```python
from src.utils.data_loader import load_custom_dataset

train_dataset = load_custom_dataset(
    data_dir='path/to/your/data',
    file_type='jpg',
    transform=transform,
)
```

We've extended support for more models:
- MobileNetV2
- DenseNet
You can specify the model in the configuration:
```yaml
model:
  name: mobilenet  # Options: 'vit', 'resnet', 'mobilenet', 'densenet'
  num_labels: 10
```

- Data Privacy: Implemented differential privacy mechanisms to ensure client data remains secure.
- Secure Aggregation: Models are aggregated using secure protocols to prevent leakage.
- Confusion Matrix: Generate confusion matrices to analyze model performance.
- Per-Class Accuracy: Evaluate accuracy for each class individually.
```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# After model evaluation
cm = confusion_matrix(all_targets, all_predictions)
sns.heatmap(cm, annot=True)
plt.savefig('experiments/confusion_matrix.png')
```

We've integrated a live monitoring dashboard using TensorBoard to visualize training progress in real time.
- Start TensorBoard:

  ```bash
  tensorboard --logdir=runs
  ```

- Access the Dashboard: Open http://localhost:6006 in your browser.
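The dashboard displays whatever the scripts log under `runs/`. To log your own scalars alongside, a minimal sketch with PyTorch's `SummaryWriter` (the tag and accuracy values are illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/federated_demo")
for round_num, acc in enumerate([0.71, 0.78, 0.83]):
    writer.add_scalar("global/accuracy", acc, round_num)
writer.close()
```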
I hope you enjoy these new features! If you have any suggestions or encounter any issues, please let me know. Happy coding!


