cuDF - GPU DataFrames
Built on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
cuDF provides a pandas-like API that will be familiar to data engineers and data scientists, so they can use it to accelerate their workflows without going into the details of CUDA programming.
For example, the following snippet downloads a CSV, then uses the GPU to parse it into rows and columns and run calculations:
import cudf
import requests
from io import StringIO
url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')
tips_df = cudf.read_csv(StringIO(content))
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100
# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())
Output:
size
1 21.729201548727808
2 16.571919173482897
3 15.215685473711837
4 14.594900639351332
5 14.149548965142023
6 15.622920072028379
Name: tip_percentage, dtype: float64
For additional examples, browse our complete API documentation, or check out our more detailed notebooks.
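The filtering, joining, and aggregating operations mentioned above can also be tried on small in-memory data. The following is a minimal sketch, not taken from the cuDF documentation: the `orders` and `customers` tables and their columns are hypothetical, and the fallback to pandas is an assumption added only so the snippet can run on a machine without a supported GPU. It relies on the fact that the calls used here (`DataFrame`, boolean-mask filtering, `merge`, `groupby(...).sum()`) have the same spelling in both libraries.

```python
# Sketch: cuDF mirrors much of the pandas API, so these calls work on either backend.
try:
    import cudf as xdf  # GPU path; needs a supported NVIDIA GPU and driver
except Exception:  # cudf missing or unusable on this machine
    import pandas as xdf  # CPU fallback, for illustration only (an assumption)

# Hypothetical in-memory data, not the tips.csv dataset used above
orders = xdf.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [20.0, 35.0, 15.0, 50.0, 5.0],
})
customers = xdf.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Filter: keep orders of at least 10.0
large = orders[orders["amount"] >= 10.0]

# Join: attach each customer's region to their orders
joined = large.merge(customers, on="customer_id", how="inner")

# Aggregate: total order amount per region
totals = joined.groupby("region")["amount"].sum().sort_index()
print(totals)
```

With this data the totals are 85.0 for `east` and 35.0 for `west` on either backend; only the import line decides whether the work runs on the GPU.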
Quick Start
Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can use cuDF.
Installation
CUDA/GPU requirements
CUDA 11.0+
NVIDIA driver 450.80.02+
Pascal architecture or better (Compute Capability >=6.0)
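The three minimums above can be compared against a machine's reported versions with plain tuple comparison. This is a hedged sketch: the helper names `parse_version` and `meets_requirements` are hypothetical, and the version strings passed in at the end are made-up examples rather than values queried from real hardware (in practice they would come from tools such as `nvidia-smi`).

```python
# Minimums copied from the requirements list above
MIN_CUDA = (11, 0)
MIN_DRIVER = (450, 80, 2)
MIN_COMPUTE_CAPABILITY = (6, 0)


def parse_version(text):
    """Turn a dotted version string such as '450.80.02' into a tuple of ints.

    Expects at least major.minor; a bare '11' would compare lower than (11, 0).
    """
    return tuple(int(part) for part in text.split("."))


def meets_requirements(cuda, driver, compute_capability):
    """Return a dict mapping each requirement name to True or False."""
    return {
        "cuda": parse_version(cuda) >= MIN_CUDA,
        "driver": parse_version(driver) >= MIN_DRIVER,
        "compute_capability": parse_version(compute_capability) >= MIN_COMPUTE_CAPABILITY,
    }


# Hypothetical values for illustration only
report = meets_requirements(cuda="11.2", driver="460.32.03", compute_capability="7.5")
print(report)
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating "450.9" as newer than "450.80".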
Conda
cuDF can be installed with conda (miniconda, or the full Anaconda distribution) from the rapidsai channel:
For cudf version == 21.06:
# for CUDA 11.0
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
cudf=21.06 python=3.7 cudatoolkit=11.0
# or, for CUDA 11.2
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
cudf=21.06 python=3.7 cudatoolkit=11.2
For the nightly version of cudf:
# for CUDA 11.0
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
cudf python=3.7 cudatoolkit=11.0
# or, for CUDA 11.2
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
cudf python=3.7 cudatoolkit=11.2
Note: cuDF is supported only on Linux, and with Python versions 3.7 and later.
See the Get RAPIDS version picker for more OS and version info.
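After installing, a quick way to confirm that the package is importable in the active environment is a short check like the one below. This is a sketch rather than an official verification step: `cudf_status` is a hypothetical helper name, and the broad `except` is an assumption meant to cover import failures such as a missing GPU or a driver mismatch.

```python
import importlib.util


def cudf_status():
    """Describe whether cudf can be imported in the current environment."""
    if importlib.util.find_spec("cudf") is None:
        return "cudf is not installed in this environment"
    try:
        import cudf
    except Exception as exc:  # e.g. no usable GPU or a driver mismatch
        return f"cudf is installed but failed to import: {exc}"
    return f"cudf {cudf.__version__} imported successfully"


print(cudf_status())
```

If the message reports a failed import, rechecking the CUDA and driver requirements above is a reasonable first step.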
GitHub
https://github.com/rapidsai/cudf
Source: https://pythonawesome.com/a-gpu-dataframe-library-for-loading-and-otherwise-manipulating-data/