Finance

Stock embeddings

https://arxiv.org/abs/2202.08968

https://github.com/cyrus723/stock-embeddings

Stock Embeddings: Learning Distributed Representations for Financial Assets

Identifying meaningful relationships between the price movements of financial assets is a challenging but important problem in a variety of financial applications. However, with recent research, particularly work using machine learning and deep learning techniques, focused mostly on price forecasting, the literature investigating the modelling of asset correlations has lagged somewhat. To address this, inspired by recent successes in natural language processing, we propose a neural model for training stock embeddings, which harnesses the dynamics of historical returns data in order to learn the nuanced relationships that exist between financial assets. We describe our approach in detail and discuss a number of ways that it can be used in the financial domain. Furthermore, we present evaluation results to demonstrate the utility of this approach, compared to several important benchmarks, in two real-world financial analytics tasks.
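The general idea, sketched very loosely below, borrows the word2vec intuition: just as words appearing in similar contexts get similar vectors, stocks whose returns co-move should get similar embeddings. The sketch illustrates that analogy only, not the paper's actual training objective; the tickers, simulated returns, and ranking heuristic are all made up.

import numpy as np
import pandas as pd
from gensim.models import Word2Vec

# Toy daily returns standing in for real historical data.
rng = np.random.default_rng(0)
tickers = ["AAPL", "MSFT", "NVDA", "AMD", "XOM", "CVX"]
returns = pd.DataFrame(rng.normal(0, 0.02, size=(500, len(tickers))), columns=tickers)

# For each day, rank tickers by return and treat the ordered list as a
# "sentence": assets with similar returns land next to each other, so
# co-moving assets share contexts.
sentences = [row.sort_values(ascending=False).index.tolist()
             for _, row in returns.iterrows()]

# Train skip-gram embeddings over these daily contexts.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=20)

# Nearest neighbours in embedding space act as a learned similarity measure.
print(model.wv.most_similar("AAPL"))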

https://arxiv.org/abs/2207.07183

https://github.com/wuxxx949/stock_embedding

Learning Embedded Representation of the Stock Correlation Matrix using Graph Machine Learning

Understanding non-linear relationships among financial instruments has various applications in investment processes ranging from risk management, portfolio construction and trading strategies. Here, we focus on interconnectedness among stocks based on their correlation matrix which we represent as a network with the nodes representing individual stocks and the weighted links between pairs of nodes representing the corresponding pair-wise correlation coefficients. Traditional network science techniques, which are extensively utilized in the financial literature, require handcrafted features such as centrality measures to understand such correlation networks. However, manually enlisting all such handcrafted features may quickly turn out to be a daunting task. Instead, we propose a new approach for studying nuances and relationships within the correlation network in an algorithmic way using a graph machine learning algorithm called Node2Vec. In particular, the algorithm compresses the network into a lower dimensional continuous space, called an embedding, where pairs of nodes that are identified as similar by the algorithm are placed closer to each other. By using log returns of S&P 500 stock data, we show that our proposed algorithm can learn such an embedding from its correlation network. We define various domain-specific quantitative (and objective) and qualitative metrics that are inspired by metrics used in the field of Natural Language Processing (NLP) to evaluate the embeddings in order to identify the optimal one. Further, we discuss various applications of the embeddings in investment management.
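A rough sketch of the pipeline the abstract describes, under toy assumptions: simulated log returns, a handful of tickers, and all pairs kept with |correlation| as the edge weight, whereas the paper evaluates more careful network constructions. It uses networkx plus the node2vec package (pip install node2vec).

import numpy as np
import pandas as pd
import networkx as nx
from node2vec import Node2Vec

# Toy log returns standing in for the S&P 500 data used in the paper.
rng = np.random.default_rng(1)
tickers = ["AAPL", "MSFT", "NVDA", "JPM", "GS", "XOM"]
log_ret = pd.DataFrame(rng.normal(0, 0.01, size=(250, len(tickers))), columns=tickers)

# Correlation matrix -> weighted network: nodes are stocks, edge weights are
# (absolute) pairwise correlation coefficients.
corr = log_ret.corr()
G = nx.Graph()
for i, a in enumerate(tickers):
    for b in tickers[i + 1:]:
        G.add_edge(a, b, weight=abs(corr.loc[a, b]))

# Node2Vec: biased random walks over the network, then skip-gram on the walks,
# compressing the network into a low-dimensional embedding space.
n2v = Node2Vec(G, dimensions=16, walk_length=10, num_walks=50, workers=1)
model = n2v.fit(window=5, min_count=1)

# Stocks placed close together are "similar" according to the walks.
print(model.wv.most_similar("AAPL"))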

 

https://github.com/wuxxx949/News_Based_Stock_Embedding

 

What does "distributed representation" mean in machine learning and deep learning?

  1. Representation Learning:
    • The goal is to transform raw data into a format that makes it easier for the machine learning model to understand and work with. Distributed representations are a powerful type of representation learning.
  2. Distributed Encoding:
    • In distributed representation, a piece of data (e.g., a word, an image) is represented by a pattern of activation across multiple units (neurons or features). Each unit captures some aspect of the data, and the combination of these units represents the entire data point.
    • This is in contrast to a local representation (or one-hot encoding), where each data point is represented by a single active unit; a toy comparison follows this list.
  3. Advantages of Distributed Representation:
    • Efficiency: By distributing information across multiple units, the representation can capture more complex patterns and nuances in the data.
    • Generalization: Distributed representations help models generalize better because they can capture relationships and similarities between data points.
    • Scalability: They can represent a vast amount of information compactly, making it feasible to handle large datasets.
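
To make the contrast in point 2 concrete, here is a minimal toy comparison; the embedding values are invented purely for illustration.

import numpy as np

vocab = ["king", "queen", "apple"]

# Local (one-hot) representation: one unit per item, so no notion of similarity.
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])  # 0.0 -- "king" and "queen" look unrelated

# Distributed representation: each item is a pattern across a few shared
# features (made-up values, e.g. "royalty", "gender", "edible").
embeddings = np.array([
    [0.9,  0.7, 0.0],   # king
    [0.9, -0.3, 0.0],   # queen
    [0.0,  0.0, 0.9],   # apple
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(embeddings[0], embeddings[1]))  # clearly positive: shared "royalty" feature
print(cosine(embeddings[0], embeddings[2]))  # zero: no shared features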

Examples in Deep Learning

  1. Word Embeddings:
    • In natural language processing (NLP), words are often represented using embeddings like Word2Vec, GloVe, or contextual embeddings from models like BERT. These embeddings distribute the representation of words across many dimensions, capturing semantic similarities and relationships between words.
    • For example, the word “king” might be represented by a 300-dimensional vector where each dimension captures some aspect of its meaning, and similar words like “queen” would have similar vectors.
  2. Neural Network Layers:
    • Each layer in a neural network learns a distributed representation of the input data. Lower layers might capture basic features (e.g., edges in images), while higher layers capture more abstract concepts (e.g., faces or objects in images).
    • For instance, in a convolutional neural network (CNN) for image recognition, the initial layers might detect edges and textures, while later layers might detect shapes and objects.

Mathematical Perspective

  • Vectors and Matrices:
    • In a neural network, the input data is often represented as a vector. During training, the network learns weight matrices that transform these input vectors into new representations at each layer.
    • Each neuron’s activation can be seen as a function of a weighted sum of inputs, allowing for complex patterns to be represented across the network.
  • Dimensionality Reduction:
    • Techniques like Principal Component Analysis (PCA) and autoencoders compress data into lower-dimensional distributed representations while preserving important information; a minimal numerical sketch follows this list.
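
A minimal numerical sketch of both bullets, with randomly generated values standing in for anything learned:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# One layer of a network: each unit's activation is a nonlinear function of a
# weighted sum of the inputs, so the 32 units jointly form a distributed
# representation of the 100-dimensional input.
x = rng.normal(size=100)          # input vector
W = rng.normal(size=(32, 100))    # weight matrix (learned during training)
hidden = np.maximum(0, W @ x)     # ReLU(Wx)

# Dimensionality reduction: PCA maps 100-dimensional points into 5 dimensions
# while keeping as much variance as possible.
X = rng.normal(size=(500, 100))   # 500 toy data points
Z = PCA(n_components=5).fit_transform(X)
print(hidden.shape, Z.shape)      # (32,) (500, 5)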

Summary

Distributed representation in machine and deep learning is a method of encoding information across multiple dimensions or units. This approach enables models to capture complex patterns, relationships, and abstractions in the data, leading to more efficient, scalable, and generalizable learning. It’s a fundamental concept underlying many advancements in neural networks and representation learning.

https://pubsonline.informs.org/doi/10.1287/mnsc.2023.4695

https://arxiv.org/abs/2306.11644

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3465888

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3630898

https://academic.oup.com/rfs/advance-article-abstract/doi/10.1093/rfs/hhae019/7675482?redirectedFrom=fulltext

https://academic.oup.com/rfs/article-abstract/36/12/4759/7162712?redirectedFrom=fulltext

 

Basics

import pandas_datareader.data as pdr
import yfinance as yf

# Route pandas_datareader's Yahoo Finance calls through yfinance.
# Note: pdr_override() is deprecated in recent yfinance releases, so on a
# new install prefer calling yf.download() directly, as in the next block.
yf.pdr_override()
df = pdr.get_data_yahoo('TSLA AAPL NVDA', start='2020-01-01', end='2023-12-31')['Adj Close']

import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

# Adjusted closes, rebased to the first observation to plot the growth of $1.
# With recent yfinance versions, pass auto_adjust=False to keep 'Adj Close'.
df = yf.download('TSLA AAPL AMD NVDA', start='2020-01-01', end='2023-12-31')['Adj Close']
df.divide(df.iloc[0]).plot()
plt.show()

# Same idea one ticker at a time, collected into a single DataFrame.
tickers = ['SBUX', 'WMT', 'AMZN', 'HD']
mydata = pd.DataFrame()
for t in tickers:
    mydata[t] = yf.download(t, start="2000-01-01", end="2022-05-31")['Adj Close']

mydata.head()
import pandas as pd
import pandas_datareader as pdr

# Sample window for the download (start_date/end_date were previously
# undefined; these values are placeholders -- adjust as needed).
start_date = "2000-01-01"
end_date = "2023-12-31"

# Monthly Fama-French three-factor data; element [0] of the returned dict is
# the monthly table, with returns quoted in percent.
factors_ff3_monthly_raw = pdr.DataReader(
    name="F-F_Research_Data_Factors",
    data_source="famafrench",
    start=start_date,
    end=end_date)[0]

# Convert percent to decimal returns and tidy the column names.
factors_ff3_monthly = (factors_ff3_monthly_raw
    .divide(100)
    .reset_index(names="month")
    .assign(month=lambda x: pd.to_datetime(x["month"].astype(str)))
    .rename(str.lower, axis="columns")
    .rename(columns={"mkt-rf": "mkt_excess"})
)
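
As a usage example, the cleaned factor table feeds naturally into a CAPM-style regression with statsmodels. The ticker, dates, and merge logic below are illustrative assumptions, and the block assumes factors_ff3_monthly from above is in scope.

import pandas as pd
import statsmodels.formula.api as smf
import yfinance as yf

# Monthly returns for one stock (AAPL purely as an example).
prices = yf.download("AAPL", start="2000-01-01", end="2023-12-31")["Adj Close"].squeeze()
ret = prices.resample("ME").last().pct_change().dropna()
ret = ret.rename("ret").reset_index()
ret["month"] = ret["Date"].dt.to_period("M").dt.to_timestamp()

# Merge with the factor table and compute the stock's excess return.
data = ret.merge(factors_ff3_monthly, on="month")
data["ret_excess"] = data["ret"] - data["rf"]

# CAPM: regress excess stock returns on the market excess return.
capm = smf.ols("ret_excess ~ mkt_excess", data=data).fit()
print(capm.summary())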


# Perfect-foresight strategy: each year, hold that year's best-performing asset.
# HistRet is assumed to be a DataFrame of annual asset returns indexed by year.
cumulative_return = 1

# Iterate through each year
for year in range(1926, 2010):
    best_asset = HistRet.loc[year].idxmax()       # best-performing asset for the current year
    asset_return = HistRet.loc[year, best_asset]  # its return for the current year
    cumulative_return *= (1 + asset_return)       # update cumulative return

print("Final cumulative dollar return:", '${:,.2f}'.format(cumulative_return))

Max Drawdown


Volatility


Options


Interest Rates


Fed Data


ML in Finance


DL in Finance

