Files
strategy-lab/to_explore/pyquantnews/53_PolarsVsPandas.ipynb
David Brazda e3da60c647 daily update
2024-10-21 20:57:56 +02:00

474 lines
10 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "122e94bb",
"metadata": {},
"source": [
"<div style=\"background-color:#000;\"><img src=\"pqn.png\"></img></div>"
]
},
{
"cell_type": "markdown",
"id": "aed964b7",
"metadata": {},
"source": [
"This code retrieves S&P 500 company tickers and fetches historical index data for those companies using the OpenBB SDK. It converts the data between Pandas and Polars dataframes and performs various operations like writing to CSV, reading from CSV, selecting, filtering, and grouping data. The code also measures the performance of these operations using Pandas and Polars. This is useful for comparing the efficiency of data manipulation operations between the two libraries in practice."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fb65f4c",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import polars as pl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ec4ddfd",
"metadata": {},
"outputs": [],
"source": [
"from openbb_terminal.sdk import openbb"
]
},
{
"cell_type": "markdown",
"id": "f3646b59",
"metadata": {},
"source": [
"Retrieve S&P 500 tickers from Wikipedia and create a list of symbols"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9a984b2",
"metadata": {},
"outputs": [],
"source": [
"table = pd.read_html(\"http://en.wikipedia.org/wiki/List_of_S%26P_500_companies\")[0]\n",
"tickers = table.Symbol.tolist()"
]
},
{
"cell_type": "markdown",
"id": "d90b333a",
"metadata": {},
"source": [
"Fetch historical index data for the retrieved tickers using OpenBB SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f46d2ccb",
"metadata": {},
"outputs": [],
"source": [
"df_pandas = openbb.economy.index(tickers, start_date=\"1990-01-01\")"
]
},
{
"cell_type": "markdown",
"id": "ab6544ab",
"metadata": {},
"source": [
"Save the fetched data to a CSV file using Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "133d2e71",
"metadata": {},
"outputs": [],
"source": [
"df_pandas.to_csv(\"data.csv\")"
]
},
{
"cell_type": "markdown",
"id": "354713f3",
"metadata": {},
"source": [
"Convert the Pandas dataframe to a Polars dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e24a6b45",
"metadata": {},
"outputs": [],
"source": [
"df_polars = pl.from_pandas(df_pandas)"
]
},
{
"cell_type": "markdown",
"id": "dffdb2ef",
"metadata": {},
"source": [
"Measure the time taken to write the Pandas dataframe to a CSV file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d3c7eac",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas.to_csv(\"data.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"id": "21c90011",
"metadata": {},
"source": [
"Measure the time taken to write the Polars dataframe to a CSV file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b18f6cb",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.write_csv(\"data.csv\")"
]
},
{
"cell_type": "markdown",
"id": "21fcfb58",
"metadata": {},
"source": [
"Measure the time taken to read the CSV file into a Pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ef305fac",
"metadata": {},
"outputs": [],
"source": [
"%timeit pd.read_csv(\"data.csv\")"
]
},
{
"cell_type": "markdown",
"id": "e146436a",
"metadata": {},
"source": [
"Measure the time taken to read the CSV file into a Polars dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e2a5ab4",
"metadata": {},
"outputs": [],
"source": [
"%timeit pl.scan_csv(\"data.csv\")"
]
},
{
"cell_type": "markdown",
"id": "bf112c8b",
"metadata": {},
"source": [
"Select the first 100 tickers from the list"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c37ceba",
"metadata": {},
"outputs": [],
"source": [
"selected = tickers[:100]"
]
},
{
"cell_type": "markdown",
"id": "07db719c",
"metadata": {},
"source": [
"Measure the time taken to select columns in the Pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be6affd1",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas[selected]"
]
},
{
"cell_type": "markdown",
"id": "1952cfdc",
"metadata": {},
"source": [
"Measure the time taken to select columns in the Polars dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "787fe7d3",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.select(pl.col(selected))"
]
},
{
"cell_type": "markdown",
"id": "a6ff1cdc",
"metadata": {},
"source": [
"Measure the time taken to filter rows in the Pandas dataframe where 'GE' > 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bfcbf64d",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas[df_pandas[\"GE\"] > 100]"
]
},
{
"cell_type": "markdown",
"id": "e7ac6d8d",
"metadata": {},
"source": [
"Measure the time taken to filter rows in the Polars dataframe where 'GE' > 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ed35dc2",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.filter(pl.col(\"GE\") > 100)"
]
},
{
"cell_type": "markdown",
"id": "8f26a363",
"metadata": {},
"source": [
"Measure the time taken to group by 'GE' and calculate the mean in the Pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43a2fe0d",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas.groupby(\"GE\").mean()"
]
},
{
"cell_type": "markdown",
"id": "41a927a8",
"metadata": {},
"source": [
"Measure the time taken to group by 'GE' and calculate the mean in the Polars dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af594a24",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.groupby(\"GE\").mean()"
]
},
{
"cell_type": "markdown",
"id": "ba86d47f",
"metadata": {},
"source": [
"Measure the time taken to create a new column 'GE_Return' with percentage change in Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "067e1c7c",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas.assign(GE_Return=df_pandas[\"GE\"].pct_change())"
]
},
{
"cell_type": "markdown",
"id": "7f218df1",
"metadata": {},
"source": [
"Measure the time taken to create a new column 'GE_return' with percentage change in Polars"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa653184",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.with_columns((pl.col(\"GE\").pct_change()).alias(\"GE_return\"))"
]
},
{
"cell_type": "markdown",
"id": "b8037123",
"metadata": {},
"source": [
"Measure the time taken to fill missing values in the 'GE' column with 0 in Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9715f643",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas[\"GE\"].fillna(0)"
]
},
{
"cell_type": "markdown",
"id": "e7495700",
"metadata": {},
"source": [
"Measure the time taken to fill missing values in the 'GE' column with 0 in Polars"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7113f4df",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.with_columns(pl.col(\"GE\").fill_null(0))"
]
},
{
"cell_type": "markdown",
"id": "5ff4e917",
"metadata": {},
"source": [
"Measure the time taken to sort the dataframe by the 'GE' column in Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8df68642",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas.sort_values(\"GE\")"
]
},
{
"cell_type": "markdown",
"id": "1c98f0ef",
"metadata": {},
"source": [
"Measure the time taken to sort the dataframe by the 'GE' column in Polars"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69b396c5",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.sort(\"GE\")"
]
},
{
"cell_type": "markdown",
"id": "40cea3be",
"metadata": {},
"source": [
"Measure the time taken to calculate the rolling mean for 'GE' with a window of 20 in Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d59086c",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_pandas.GE.rolling(window=20).mean()"
]
},
{
"cell_type": "markdown",
"id": "9c533081",
"metadata": {},
"source": [
"Measure the time taken to calculate the rolling mean for 'GE' with a window of 20 in Polars"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4fcf1c5e",
"metadata": {},
"outputs": [],
"source": [
"%timeit df_polars.with_columns(pl.col(\"GE\").rolling_mean(20))"
]
},
{
"cell_type": "markdown",
"id": "e9a81f98",
"metadata": {},
"source": [
"<a href=\"https://pyquantnews.com/\">PyQuant News</a> is where finance practitioners level up with Python for quant finance, algorithmic trading, and market data analysis. Looking to get started? Check out the fastest growing, top-selling course to <a href=\"https://gettingstartedwithpythonforquantfinance.com/\">get started with Python for quant finance</a>. For educational purposes. Not investment advise. Use at your own risk."
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
}
},
"nbformat": 4,
"nbformat_minor": 5
}