{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "968808e1",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#000;\"><img src=\"pqn.png\"></img></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28fcc5f0",
   "metadata": {},
   "source": [
    "This code extracts financial data from the SEC's EDGAR database, specifically targeting Apple's filings. It downloads and processes quarterly and annual financial statement datasets, converts them to a more efficient format (parquet), and constructs a comprehensive dataset of Apple's financials. The code then calculates key financial metrics such as P/E ratios from earnings per share (EPS) and stock price data. Additionally, it prepares the dataset for further analysis or visualization. This is useful for financial analysis, investment research, and academic purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef94fd0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from io import BytesIO\n",
    "from zipfile import ZipFile, BadZipFile\n",
    "from pathlib import Path\n",
    "from tqdm import tqdm\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6383b976",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openbb import obb"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffc1ec08",
   "metadata": {},
   "source": [
    "Set the base URL and file path for SEC data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb09a939",
   "metadata": {},
   "outputs": [],
   "source": [
    "SEC_URL = \"https://www.sec.gov/\"\n",
    "FSN_PATH = \"files/dera/data/financial-statement-and-notes-data-sets/\"\n",
    "DATA_PATH = Path(\"edgar\")\n",
    "user_agent = \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c76c3c09",
   "metadata": {},
   "source": [
    "Generate a list of filing periods (quarters) to download"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aebfd944",
   "metadata": {},
   "outputs": [],
   "source": [
    "filing_periods = [\n",
    "    (d.year, d.quarter) for d in pd.date_range(\"2015\", \"2015-12-31\", freq=\"QE\")\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e6a00ad",
   "metadata": {},
   "source": [
    "Loop through each filing period to download and extract data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c2bde09",
   "metadata": {},
   "outputs": [],
   "source": [
    "for yr, qtr in tqdm(filing_periods):\n",
    "    path = DATA_PATH / f\"{yr}_{qtr}\" / \"source\"\n",
    "    if not path.exists():\n",
    "        path.mkdir(parents=True)\n",
    "    filing = f\"{yr}q{qtr}_notes.zip\"\n",
    "    url = f\"{SEC_URL}{FSN_PATH}{filing}\"\n",
    "    response = requests.get(url, headers={\"User-Agent\": user_agent}).content\n",
    "    with ZipFile(BytesIO(response)) as zip_file:\n",
    "        for file in zip_file.namelist():\n",
    "            local_file = path / file\n",
    "            if local_file.exists():\n",
    "                continue\n",
    "            with local_file.open(\"wb\") as output:\n",
    "                for line in zip_file.open(file).readlines():\n",
    "                    output.write(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c658bd59",
   "metadata": {},
   "source": [
    "Convert downloaded TSV files to parquet format for efficiency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7c2850f",
   "metadata": {},
   "outputs": [],
   "source": [
    "for f in tqdm(sorted(list(DATA_PATH.glob(\"**/*.tsv\")))):\n",
    "    parquet_path = f.parent.parent / \"parquet\"\n",
    "    if not parquet_path.exists():\n",
    "        parquet_path.mkdir(parents=True)\n",
    "    file_name = f.stem + \".parquet\"\n",
    "    if not (parquet_path / file_name).exists():\n",
    "        df = pd.read_csv(\n",
    "            f, sep=\"\\t\", encoding=\"latin1\", low_memory=False, on_bad_lines=\"skip\"\n",
    "        )\n",
    "        df.to_parquet(parquet_path / file_name)\n",
    "        f.unlink()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d8efa7c",
   "metadata": {},
   "source": [
    "Filter the subset of data related to Apple Inc. for further analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e10dd7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "sub = pd.read_parquet(DATA_PATH / '2015_3' / 'parquet' / 'sub.parquet')\n",
    "name = \"APPLE INC\"\n",
    "cik = sub[sub.name == name].T.dropna().squeeze().cik"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc2b687f",
   "metadata": {},
   "source": [
    "Aggregate Apple's filings into a single DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef3debc0",
   "metadata": {},
   "outputs": [],
   "source": [
    "aapl_subs = pd.DataFrame()\n",
    "for sub in DATA_PATH.glob(\"**/sub.parquet\"):\n",
    "    sub = pd.read_parquet(sub)\n",
    "    aapl_sub = sub[\n",
    "        (sub.cik.astype(int) == cik) & (sub.form.isin([\"10-Q\", \"10-K\"]))\n",
    "    ]\n",
    "    aapl_subs = pd.concat([aapl_subs, aapl_sub])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81ce9915",
   "metadata": {},
   "source": [
    "Extract numerical data from the filings and convert to parquet format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70b07ffb",
   "metadata": {},
   "outputs": [],
   "source": [
    "aapl_nums = pd.DataFrame()\n",
    "for num in DATA_PATH.glob(\"**/num.parquet\"):\n",
    "    num = pd.read_parquet(num).drop(\"dimh\", axis=1)\n",
    "    aapl_num = num[num.adsh.isin(aapl_subs.adsh)]\n",
    "    aapl_nums = pd.concat([aapl_nums, aapl_num])\n",
    "aapl_nums.ddate = pd.to_datetime(aapl_nums.ddate, format=\"%Y%m%d\")\n",
    "aapl_nums.to_parquet(DATA_PATH / \"aapl_nums.parquet\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3b458e0",
   "metadata": {},
   "source": [
    "Filter EPS data and adjust for stock splits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88af52cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "eps = aapl_nums[\n",
    "    (aapl_nums.tag == \"EarningsPerShareDiluted\") & (aapl_nums.qtrs == 1)\n",
    "].drop(\"tag\", axis=1)\n",
    "eps = eps.groupby(\"adsh\").apply(\n",
    "    lambda x: x.nlargest(n=1, columns=[\"ddate\"]), include_groups=False\n",
    ")\n",
    "eps = eps[[\"ddate\", \"value\"]].set_index(\"ddate\").squeeze().sort_index()\n",
    "ax = eps.plot.bar()\n",
    "ax.set_xticklabels(eps.index.to_period(\"Q\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69b0411a",
   "metadata": {},
   "source": [
    "Retrieve historical stock price data and calculate P/E ratio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08f476fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "aapl = (\n",
    "    obb.equity.price.historical(\n",
    "        \"AAPL\", start_date=\"2014-12-31\", end_date=eps.index.max(), provider=\"yfinance\"\n",
    "    )\n",
    "    .to_df()\n",
    "    .resample(\"D\")\n",
    "    .last()\n",
    "    .loc[\"2014\":\"2015\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "969044d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "pe = aapl.close.to_frame(\"price\").join(eps.to_frame(\"eps\")).ffill().dropna()\n",
    "pe[\"pe_ratio\"] = pe.price.div(pe.eps)\n",
    "ax = pe.plot(subplots=True, figsize=(16, 8), legend=False, lw=0.5)\n",
    "ax[0].set_title(\"Adj Close\")\n",
    "ax[1].set_title(\"Diluted EPS\")\n",
    "ax[2].set_title(\"Trailing P/E\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf2e1d3d",
   "metadata": {},
   "source": [
    "Define fields of interest for further financial analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7071c22c",
   "metadata": {},
   "outputs": [],
   "source": [
    "fields = [\n",
    "    \"EarningsPerShareDiluted\",\n",
    "    \"PaymentsOfDividendsCommonStock\",\n",
    "    \"WeightedAverageNumberOfDilutedSharesOutstanding\",\n",
    "    \"OperatingIncomeLoss\",\n",
    "    \"NetIncomeLoss\",\n",
    "    \"GrossProfit\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cadcb87b",
   "metadata": {},
   "source": [
    "<a href=\"https://pyquantnews.com/\">PyQuant News</a> is where finance practitioners level up with Python for quant finance, algorithmic trading, and market data analysis. Looking to get started? Check out the fastest growing, top-selling course to <a href=\"https://gettingstartedwithpythonforquantfinance.com/\">get started with Python for quant finance</a>. For educational purposes. Not investment advise. Use at your own risk."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}