strategy-lab/to_explore/pyquantnews/48_KMeans.ipynb
David Brazda e3da60c647 daily update
2024-10-21 20:57:56 +02:00


This notebook runs k-means clustering on the component stocks of the Dow Jones Industrial Average (DJIA), using price data from 2020 to 2022. It downloads historical prices, computes each stock's annualized return and volatility, and clusters the stocks on those two metrics. The elbow method is used to choose the number of clusters. Finally, it draws the clusters as a scatter plot and annotates each stock with its ticker and cluster label, which makes it easier to spot groups of stocks with similar return and risk profiles.

In [ ]:
from math import sqrt
In [ ]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
plt.rc("font", size=10)
In [ ]:
from openbb_terminal.sdk import openbb

Fetch the list of Dow Jones Industrial Average (DJIA) component symbols from Wikipedia

In [ ]:
dji = (
    pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average')[1]
)
symbols = dji.Symbol.tolist()

Download historical stock price data for the DJIA components using OpenBB SDK

In [ ]:
data = openbb.stocks.ca.hist(
    symbols, 
    start_date="2020-01-01",
    end_date="2022-12-31"
)

Calculate annualized returns and volatility for each stock in the dataset

In [ ]:
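moments = (
    data
    .pct_change()  # daily simple returns per symbol
    .describe()  # summary statistics, one column per symbol
    .T[["mean", "std"]]
    .rename(columns={"mean": "returns", "std": "vol"})
) * [252, sqrt(252)]  # annualize: mean scales by 252, std by sqrt(252)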
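
As a worked illustration of the annualization above, the sketch below applies the same scaling (252 times the mean daily return, sqrt(252) times the daily standard deviation) to a handful of made-up prices, using only the standard library. The annualize helper, the TRADING_DAYS constant, and the price list are assumptions for illustration and are not part of the notebook.

```python
from math import sqrt
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed trading days per year, matching the notebook's factors


def annualize(daily_returns):
    """Return (annualized mean return, annualized volatility) for daily returns."""
    ann_return = mean(daily_returns) * TRADING_DAYS
    # stdev is the sample standard deviation (ddof=1), like pandas' describe().
    ann_vol = stdev(daily_returns) * sqrt(TRADING_DAYS)
    return ann_return, ann_vol


# Made-up closing prices for one hypothetical stock (not real market data).
prices = [100.0, 101.0, 99.99, 102.0, 103.02]

# Daily simple returns, the same quantity pct_change() produces.
daily = [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]

ann_return, ann_vol = annualize(daily)
print(f"returns={ann_return:.3f}, vol={ann_vol:.3f}")
```

Scaling the standard deviation by sqrt(252) rather than 252 rests on the usual simplifying assumption that daily returns are uncorrelated, so variances add across days.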

Compute the sum of squared errors (SSE) for k-means with cluster counts from 2 to 14, for use with the elbow method. scikit-learn exposes the SSE of a fitted model as its inertia_ attribute. The elbow is the point on the SSE curve beyond which adding more clusters no longer reduces the error by much.

In [ ]:
sse = []
for k in range(2, 15):
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(moments)
    sse.append(kmeans.inertia_)
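
The inertia_ value collected in the loop above is the sum of squared distances from each point to its nearest cluster center. A minimal sketch of that quantity in plain Python follows; the sum_squared_error helper and the sample points are hypothetical and only illustrate the definition.

```python
def sum_squared_error(points, centers):
    """Sum of squared Euclidean distances from each 2-D point to its nearest center."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers)
    return total


# Four made-up points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(sum_squared_error(points, one_center))   # 104.0
print(sum_squared_error(points, two_centers))  # 4.0
print(sum_squared_error(points, points))       # 0.0
```

The error always shrinks as centers are added and reaches zero when every point is its own center, which is why the elbow method looks for where the improvement levels off rather than for the minimum.
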
In [ ]:
plt.plot(range(2, 15), sse)
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE")
plt.title("Elbow Curve");
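
The elbow is normally read off the plot by eye. As a rough numerical cross-check, one possible heuristic (an assumption here, not the notebook's method) is to pick the k where the curve bends most sharply, measured by the largest second difference of the SSE values. The helper name and the SSE values below are made up, and the sketch assumes evenly spaced values of k.

```python
def elbow_by_second_difference(ks, sse_values):
    """Return the k with the largest second difference of SSE (sharpest bend)."""
    if len(ks) != len(sse_values) or len(ks) < 3:
        raise ValueError("need at least three (k, SSE) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    # Endpoints have no neighbor on one side, so only interior points qualify.
    for i in range(1, len(ks) - 1):
        bend = sse_values[i - 1] - 2 * sse_values[i] + sse_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


ks = list(range(2, 8))
example_sse = [10.0, 6.0, 3.0, 2.6, 2.3, 2.1]  # made-up, steadily decreasing

print(elbow_by_second_difference(ks, example_sse))  # 4
```

With the notebook's own objects, the call would be elbow_by_second_difference(list(range(2, 15)), sse). On noisy curves the heuristic can disagree with a visual reading, so it is best treated as a sanity check.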

Perform k-means clustering with 5 clusters on the calculated returns and volatility metrics. Visualize the clusters in a scatter plot and annotate each stock with its cluster label for easy identification.

In [ ]:
kmeans = KMeans(n_clusters=5, n_init=10).fit(moments)
# Draw the scatter, title, and labels in one cell so they share a figure.
plt.scatter(
    moments.returns,
    moments.vol,
    c=kmeans.labels_,
    cmap="rainbow",
)
plt.title("Dow Jones stocks by return and volatility (K=5)")
for i in range(len(moments.index)):
    txt = f"{moments.index[i]} ({kmeans.labels_[i]})"
    # Nudge each label 0.01 above its point on the volatility axis.
    xy = tuple(moments.iloc[i, :] + [0, 0.01])
    plt.annotate(txt, xy)
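
Alongside the annotated plot, it can help to list the members of each cluster as text. The sketch below groups tickers by cluster label; the group_by_cluster helper, the placeholder tickers, and the labels are hypothetical and stand in for moments.index and kmeans.labels_.

```python
from collections import defaultdict


def group_by_cluster(tickers, labels):
    """Map each cluster label to the sorted list of tickers assigned to it."""
    groups = defaultdict(list)
    for ticker, label in zip(tickers, labels):
        groups[int(label)].append(ticker)  # int() also accepts NumPy integers
    return {label: sorted(members) for label, members in sorted(groups.items())}


tickers = ["AAA", "BBB", "CCC", "DDD", "EEE"]  # placeholder tickers, not real symbols
labels = [0, 2, 0, 1, 2]                       # made-up cluster labels

for label, members in group_by_cluster(tickers, labels).items():
    print(label, ", ".join(members))
```

With the notebook's objects, the call would be group_by_cluster(moments.index, kmeans.labels_). Cluster numbers are arbitrary identifiers, and because the notebook does not set random_state, they can differ from run to run.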

PyQuant News is where finance practitioners level up with Python for quant finance, algorithmic trading, and market data analysis. Looking to get started? Check out the fastest-growing, top-selling course for getting started with Python for quant finance. For educational purposes. Not investment advice. Use at your own risk.