ttools/tests/data_loader_tryme.ipynb
David Brazda b23a772836 remote fetch
2024-11-10 14:08:41 +01:00


Load data

Make sure you have a .env file in ttools or any parent directory with your Alpaca keys:

ACCOUNT1_LIVE_API_KEY=api_key
ACCOUNT1_LIVE_SECRET_KEY=secret_key

Cache directories

Daily trade files - DATADIR/tradecache
Agg data cache - DATADIR/aggcache

DATADIR - user_data_dir from appdirs library - see config.py
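The resulting layout can be sketched as follows. The stand-in below mirrors what appdirs.user_data_dir returns per platform; the app name "v2realbot" is an assumption inferred from the cache path logged later in this notebook — config.py in ttools is authoritative:

```python
import sys
from pathlib import Path

def user_data_dir(appname: str) -> Path:
    """Minimal stand-in for appdirs.user_data_dir (ttools uses the real library)."""
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Application Support" / appname
    if sys.platform.startswith("win"):
        return Path.home() / "AppData" / "Local" / appname
    return Path.home() / ".local" / "share" / appname

DATADIR = user_data_dir("v2realbot")   # app name assumed from the fetch log below
TRADE_CACHE = DATADIR / "tradecache"   # daily trade files, e.g. BAC-2024-01-16.parquet
AGG_CACHE = DATADIR / "aggcache"       # aggregated bar files
```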

In [1]:
import pandas as pd
import numpy as np
from ttools.utils import AggType
from datetime import datetime
from ttools.aggregator_vectorized import generate_time_bars_nb, aggregate_trades
from ttools.loaders import load_data, prepare_trade_cache, fetch_daily_stock_trades
from ttools.utils import zoneNY
import vectorbtpro as vbt
from lightweight_charts import PlotDFAccessor, PlotSRAccessor


vbt.settings.set_theme("dark")
vbt.settings['plotting']['layout']['width'] = 1280
vbt.settings.plotting.auto_rangebreaks = True
# Set the option to display with pagination
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_rows', 10)  # Number of rows per page
TTOOLS: Loaded env variables from file /Users/davidbrazda/Documents/Development/python/.env

Fetching aggregated data

Available aggregation types:

  • time based bars - AggType.OHLCV, resolution = bar length in seconds
  • volume based bars - AggType.OHLCV_VOL, resolution = volume threshold
  • dollar based bars - AggType.OHLCV_DOL, resolution = dollar threshold
  • renko bars - AggType.OHLCV_RENKO, resolution = brick size
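The idea behind the threshold-based types can be shown with a toy volume-bar aggregator in plain pandas — a simplification for intuition only, not the ttools implementation (that is the vectorized aggregate_trades):

```python
import pandas as pd

def volume_bars(trades: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Toy volume bars: a new bar starts every time cumulative traded size
    crosses the threshold. Expects columns 'p' (price) and 's' (size),
    matching the trade frames shown later in this notebook."""
    bar_id = trades["s"].cumsum() // threshold
    grouped = trades.groupby(bar_id)
    return pd.DataFrame({
        "open": grouped["p"].first(),
        "high": grouped["p"].max(),
        "low": grouped["p"].min(),
        "close": grouped["p"].last(),
        "volume": grouped["s"].sum(),
    })
```

Dollar bars follow the same pattern with `(trades["p"] * trades["s"]).cumsum() // dollar_threshold` as the bar id.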
In [7]:
#This is how to call the load_data function
symbol = ["SPY"]
#datetime in zoneNY 
day_start = datetime(2024, 9, 15, 9, 30, 0)
day_stop = datetime(2024, 10, 20, 16, 0, 0)
day_start = zoneNY.localize(day_start)
day_stop = zoneNY.localize(day_stop)

#requested AGG
resolution = 12 #12s bars
agg_type = AggType.OHLCV #other types AggType.OHLCV_VOL, AggType.OHLCV_DOL, AggType.OHLCV_RENKO
exclude_conditions = ['C','O','4','B','7','V','P','W','U','Z','F','9','M','6'] #None to defaults
minsize = 100 #min trade size to include
main_session_only = False
force_remote = False

data = load_data(symbol = symbol,
                     agg_type = agg_type,
                     resolution = resolution,
                     start_date = day_start,
                     end_date = day_stop,
                     #exclude_conditions = None,
                     minsize = minsize,
                     main_session_only = main_session_only,
                     force_remote = force_remote,
                     return_vbt = True, #returns vbt object
                     verbose = False
                     )
data.ohlcv.data[symbol[0]]
#data.ohlcv.data[symbol[0]].lw.plot()
Out[7]:
<style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
open high low close volume
time
2024-09-16 04:01:24-04:00 562.22 562.22 562.22 562.22 200.0
2024-09-16 04:02:24-04:00 562.17 562.17 562.17 562.17 293.0
2024-09-16 04:04:36-04:00 562.54 562.54 562.54 562.54 100.0
2024-09-16 04:10:00-04:00 562.39 562.39 562.39 562.39 102.0
2024-09-16 04:10:24-04:00 562.44 562.44 562.44 562.44 371.0
... ... ... ... ... ...
2024-10-18 19:57:24-04:00 584.80 584.80 584.80 584.80 100.0
2024-10-18 19:57:48-04:00 584.84 584.84 584.84 584.84 622.0
2024-10-18 19:58:48-04:00 584.77 584.79 584.77 584.79 4158.0
2024-10-18 19:59:36-04:00 584.80 584.82 584.80 584.82 298.0
2024-10-18 19:59:48-04:00 584.76 584.76 584.72 584.72 258.0

64218 rows × 5 columns

In [ ]:
data.ohlcv.data[symbol[0]]

Prepare daily trade cache

This is how to prepare the trade cache for given symbols and period (if daily trades are not cached, they are fetched remotely).

In [ ]:
symbols = ["BAC", "AAPL"]
#datetime in zoneNY 
day_start = datetime(2024, 10, 1, 9, 45, 0)
day_stop = datetime(2024, 10, 27, 15, 1, 0)
day_start = zoneNY.localize(day_start)
day_stop = zoneNY.localize(day_stop)
force_remote = False

prepare_trade_cache(symbols, day_start, day_stop, force_remote, verbose = True)

Prepare daily trade cache - cli script

The Python script prepares the trade cache for the specified symbols and date range.

One day usually takes about 35s. The result is stored in the /tradecache/ directory as one file per day, keyed by symbol.

To run this script in the background with specific arguments:

# Running without forcing remote fetch
python3 prepare_cache.py --symbols BAC AAPL --day_start 2024-10-14 --day_stop 2024-10-18 &

# Running with force_remote set to True
python3 prepare_cache.py --symbols BAC AAPL --day_start 2024-10-14 --day_stop 2024-10-18 --force_remote &

Aggregated data are stored per symbol, date range, and conditions. If the requested dates fall within an already stored file with the same conditions but a wider date span, the data are loaded from that file.
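The span-coverage check can be sketched as a toy filename parser, assuming the naming convention visible in the cells below (SYMBOL-AGGTYPE-RESOLUTION-START-END-EXCLUDES-MINSIZE-MAINONLY.parquet, timestamps like 2024-01-15T09-30-00, and no hyphens in symbol or agg type) — the real matching lives in list_matching_files:

```python
from datetime import datetime

def covers(filename: str, start: datetime, end: datetime) -> bool:
    """True if a cached agg file's stored span contains the requested range.
    Toy parser for names like
    SPY-AggType.OHLCV-1-2024-01-15T09-30-00-2024-10-20T16-00-00-4679BCFMOPUVWZ-100-True.parquet
    (naive datetimes for simplicity)."""
    parts = filename.removesuffix(".parquet").split("-")
    # each timestamp splits into 5 hyphen-separated pieces: Y, m, dTH, M, S
    file_start = datetime.strptime("-".join(parts[3:8]), "%Y-%m-%dT%H-%M-%S")
    file_end = datetime.strptime("-".join(parts[8:13]), "%Y-%m-%dT%H-%M-%S")
    return file_start <= start and end <= file_end
```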

This is the matching part:

In [ ]:
from ttools.utils import list_matching_files, print_matching_files_info, zoneNY
from datetime import datetime
from ttools.config import AGG_CACHE

# Find all files covering January 15, 2024 9:30 to 16:00
files = list_matching_files(
    symbol='SPY',
    resolution="1",
    agg_type='AggType.OHLCV',
    start_date=datetime(2024, 1, 15, 9, 30),
    end_date=datetime(2024, 1, 15, 16, 0)
)

#print_matching_files_info(files)

# Example with all parameters specified
specific_files = list_matching_files(
    symbol="SPY",
    agg_type="AggType.OHLCV",
    resolution="12",
    start_date=zoneNY.localize(datetime(2024, 1, 15, 9, 30)),
    end_date=zoneNY.localize(datetime(2024, 1, 15, 16, 0)),
    excludes_str="4679BCFMOPUVWZ",
    minsize=100,
    main_session_only=True
)

print_matching_files_info(specific_files)

From this file, the requested subset of dates is loaded. Usually this is all done automatically by load_data in the loader.

In [1]:
#loading manually a range subset from existing files
import pandas as pd
from datetime import datetime
from ttools.utils import zoneNY
from ttools.config import AGG_CACHE

start = zoneNY.localize(datetime(2024, 1, 15, 9, 30))
end = zoneNY.localize(datetime(2024, 10, 20, 16, 00))

ohlcv_df = pd.read_parquet(
    AGG_CACHE / "SPY-AggType.OHLCV-1-2024-01-15T09-30-00-2024-10-20T16-00-00-4679BCFMOPUVWZ-100-True.parquet", 
    engine='pyarrow',
    filters=[('time', '>=', start), 
            ('time', '<=', end)]
)

ohlcv_df
In [1]:
from ttools.loaders import fetch_daily_stock_trades, fetch_trades_parallel
from ttools.utils import zoneNY
from datetime import datetime
TTOOLS: Loaded env variables from file /Users/davidbrazda/Documents/Development/python/.env

Fetching trades for whole range

In [2]:
#fetching one day
# df = fetch_daily_stock_trades(symbol="SPY",
#                               start=zoneNY.localize(datetime(2024, 1, 16, 9, 30)),
#                               end=zoneNY.localize(datetime(2024, 1, 16, 16, 00)))
# df.info()

#fetching multiple days with parallel
df = fetch_trades_parallel(symbol="BAC",
                              start_date=zoneNY.localize(datetime(2024, 1, 16, 0, 0)),
                              end_date=zoneNY.localize(datetime(2024, 1, 16, 23, 59)),
                              main_session_only=False,
                              exclude_conditions=None,
                              minsize=None,
                              force_remote=True)

df.info()
BAC Contains 1  market days
BAC Remote fetching: 100%|██████████| 1/1 [00:00<00:00, 434.55it/s]
Fetching from remote.
BAC Receiving trades:   0%|          | 0/1 [00:00<?, ?it/s]
Remote fetched completed whole day 2024-01-16
Exact UTC range fetched: 2024-01-16 05:00:00+00:00 - 2024-01-17 04:59:59.999999+00:00
BAC Receiving trades: 100%|██████████| 1/1 [00:42<00:00, 42.76s/it]
Saved to CACHE /Users/davidbrazda/Library/Application Support/v2realbot/tradecache/BAC-2024-01-16.parquet
Trimming 2024-01-16 00:00:00-05:00 2024-01-16 23:59:00-05:00
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 222754 entries, 2024-01-16 04:00:00.009225-05:00 to 2024-01-16 19:59:48.834830-05:00
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   x       222754 non-null  object 
 1   p       222754 non-null  float64
 2   s       222754 non-null  int64  
 3   i       222754 non-null  int64  
 4   c       222754 non-null  object 
 5   z       222754 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 11.9+ MB

In [22]:
df
Out[22]:
<style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
x p s i c z
t
2024-01-16 04:00:00.009225-05:00 K 32.800 1 52983525027912 [ , T, I] A
2024-01-16 04:00:00.012088-05:00 P 32.580 8 52983525027890 [ , T, I] A
2024-01-16 04:00:02.299262-05:00 P 32.750 1 52983525027916 [ , T, I] A
2024-01-16 04:00:03.895322-05:00 P 32.640 1 52983525027920 [ , T, I] A
2024-01-16 04:00:04.145553-05:00 P 32.740 1 52983525027921 [ , T, I] A
... ... ... ... ... ... ...
2024-01-16 18:58:10.081270-05:00 D 32.104 10 79371957716549 [ , T, I] A
2024-01-16 18:58:11.293971-05:00 T 32.090 3 62883460503386 [ , T, I] A
2024-01-16 18:58:24.511348-05:00 D 32.110 1 79371957716560 [ , T, I] A
2024-01-16 18:58:46.648899-05:00 D 32.110 1 79371957716786 [ , T, I] A
2024-01-16 18:59:54.013894-05:00 D 32.100 1 71710070428229 [ , T, I] A

159301 rows × 6 columns
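The trade frame columns follow Alpaca's trade schema: x = exchange, p = price, s = size, i = trade id, c = conditions, z = tape. Turning such ticks into the time bars loaded earlier can be sketched with pandas resample — a slow, simplified analogue of what the numba-vectorized generate_time_bars_nb does:

```python
import pandas as pd

def time_bars(trades: pd.DataFrame, seconds: int) -> pd.DataFrame:
    """Resample raw trades (columns 'p' price and 's' size on a DatetimeIndex)
    into OHLCV time bars; empty intervals with no trades are dropped."""
    ohlc = trades["p"].resample(f"{seconds}s").ohlc()
    ohlc["volume"] = trades["s"].resample(f"{seconds}s").sum()
    return ohlc.dropna(subset=["open"])
```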

In [3]:
#comparing dataframes
from ttools.utils import AGG_CACHE, compare_dataframes
import pandas as pd
file1 = AGG_CACHE / "SPY-AggType.OHLCV-1-2024-02-15T09-30-00-2024-10-20T16-00-00-4679BCFMOPUVWZ-100-False.parquet"
file2 = AGG_CACHE / "SPY-AggType.OHLCV-1-2024-02-15T09-30-00-2024-10-20T16-00-00-4679BCFMOPUVWZ-100-False_older2.parquet"
df1 = pd.read_parquet(file1)
df2 = pd.read_parquet(file2)
df1.equals(df2)

#compare_dataframes(df1, df2)
Out[3]:
True
In [5]:
from ttools.config import TRADE_CACHE
import pandas as pd
file1 = TRADE_CACHE / "BAC-2024-01-16.parquet"
df1 = pd.read_parquet(file1)
In [8]:
df1
Out[8]:
<style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
x p s i c z
t
2024-01-16 04:00:00.009225-05:00 K 32.80 1 52983525027912 [ , T, I] A
2024-01-16 04:00:00.012088-05:00 P 32.58 8 52983525027890 [ , T, I] A
2024-01-16 04:00:00.856156-05:00 K 32.61 14 52983525028705 [ , F, T, I] A
2024-01-16 04:00:02.299262-05:00 P 32.75 1 52983525027916 [ , T, I] A
2024-01-16 04:00:03.895322-05:00 P 32.64 1 52983525027920 [ , T, I] A
... ... ... ... ... ... ...
2024-01-16 19:59:24.796862-05:00 P 32.12 500 52983576997941 [ , T] A
2024-01-16 19:59:24.796868-05:00 P 32.12 500 52983576997942 [ , T] A
2024-01-16 19:59:24.796868-05:00 P 32.12 500 52983576997943 [ , T] A
2024-01-16 19:59:24.796871-05:00 P 32.12 500 52983576997944 [ , T] A
2024-01-16 19:59:48.834830-05:00 K 32.10 25 52983526941511 [ , T, I] A

222754 rows × 6 columns