strategy-lab/to_explore/pyquantnews/48_KMeans.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c60b36d4",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#000;\"><img src=\"pqn.png\"></img></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0709d1d",
   "metadata": {},
   "source": [
    "This code performs k-means clustering on the Dow Jones Industrial Average (DJIA) stock data from 2020 to 2022. It extracts historical stock prices, calculates returns and volatility, and then clusters the stocks based on these metrics. The 'Elbow Method' is used to determine the optimal number of clusters. Finally, it visualizes the clusters with a scatter plot, annotating each stock with its cluster label. This is useful for identifying patterns or groupings in stock performance metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09e34b5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from math import sqrt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d73dcf13",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.cluster import KMeans\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rc(\"font\", size=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da61cf6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openbb_terminal.sdk import openbb"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c671ff3",
   "metadata": {},
   "source": [
    "Fetch the list of Dow Jones Industrial Average (DJIA) component symbols from Wikipedia"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0dcdce9",
   "metadata": {},
   "outputs": [],
   "source": [
    "dji = (\n",
    "    pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average')[1]\n",
    ")\n",
    "symbols = dji.Symbol.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7140d32a",
   "metadata": {},
   "source": [
    "Download historical stock price data for the DJIA components using OpenBB SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82f64558",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = openbb.stocks.ca.hist(\n",
    "    symbols, \n",
    "    start_date=\"2020-01-01\",\n",
    "    end_date=\"2022-12-31\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08436f25",
   "metadata": {},
   "source": [
    "Calculate annualized returns and volatility for each stock in the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3306d91",
   "metadata": {},
   "outputs": [],
   "source": [
    "moments = (\n",
    "    data\n",
    "    .pct_change()\n",
    "    .describe()\n",
    "    .T[[\"mean\", \"std\"]]\n",
    "    .rename(columns={\"mean\": \"returns\", \"std\": \"vol\"})\n",
    ") * [252, sqrt(252)]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b3b12c5",
   "metadata": {},
   "source": [
    "Compute sum of squared errors (SSE) for k-means clustering with different cluster counts to use the Elbow Method for optimal k determination. SSE helps identify the point where adding more clusters doesn't significantly improve the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da6607e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "sse = []\n",
    "for k in range(2, 15):\n",
    "    kmeans = KMeans(n_clusters=k, n_init=10)\n",
    "    kmeans.fit(moments)\n",
    "    sse.append(kmeans.inertia_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4853d749",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(range(2, 15), sse)\n",
    "plt.title(\"Elbow Curve\");"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d4b74d2",
   "metadata": {},
   "source": [
    "Perform k-means clustering with 5 clusters on the calculated returns and volatility metrics. Visualize the clusters in a scatter plot and annotate each stock with its cluster label for easy identification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cab4e3bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "kmeans = KMeans(n_clusters=5, n_init=10).fit(moments)\n",
    "plt.scatter(\n",
    "    moments.returns, \n",
    "    moments.vol, \n",
    "    c=kmeans.labels_, \n",
    "    cmap=\"rainbow\",\n",
    ");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e0d3176",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.title(\"Dow Jones stocks by return and volatility (K=5)\")\n",
    "for i in range(len(moments.index)):\n",
    "    txt = f\"{moments.index[i]} ({kmeans.labels_[i]})\"\n",
    "    xy = tuple(moments.iloc[i, :] + [0, 0.01])\n",
    "    plt.annotate(txt, xy)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b962026",
   "metadata": {},
   "source": [
    "<a href=\"https://pyquantnews.com/\">PyQuant News</a> is where finance practitioners level up with Python for quant finance, algorithmic trading, and market data analysis. Looking to get started? Check out the fastest growing, top-selling course to <a href=\"https://gettingstartedwithpythonforquantfinance.com/\">get started with Python for quant finance</a>. For educational purposes. Not investment advise. Use at your own risk."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}