Contents
Videos
Repository
EDGAR
Reading edgar database
https://github.com/LexPredict/openedgar
https://law.mit.edu/pub/openedgar/release/1
https://www.sec.gov/search-filings/edgar-search-assistance/accessing-edgar-data
https://www.kaggle.com/code/svendaj/extracting-data-from-sec-edgar-restful-apis/notebook
Scraping 10-K statements (annual reports) filed with the U.S. Securities and Exchange Commission (SEC) is a common first step in financial analysis, machine learning models, and investment research. Several Python libraries and approaches can help extract 10-K forms from the SEC’s EDGAR database. The main sources and methods are listed below:
1. SEC EDGAR Database
The SEC’s EDGAR database contains all publicly filed company documents, including 10-Ks. You can access these using SEC’s APIs or directly via web scraping.
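As a minimal sketch of the API route, the snippet below builds the URL of the JSON filing history for one company on data.sec.gov, which identifies companies by a 10-digit, zero-padded CIK. The helper names `pad_cik` and `submissions_url` are hypothetical, and no request is sent here; when actually fetching, the SEC asks automated clients to send a User-Agent header that identifies them.

```python
def pad_cik(cik):
    """Return the CIK as a 10-digit, zero-padded string."""
    return str(int(cik)).zfill(10)


def submissions_url(cik):
    """Build the data.sec.gov URL that lists a company's filing history."""
    return f"https://data.sec.gov/submissions/CIK{pad_cik(cik)}.json"


APPLE_CIK = 320193  # Apple's CIK, as used elsewhere in these notes
print(submissions_url(APPLE_CIK))
```

The returned JSON can then be filtered for form type 10-K before downloading the individual documents.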
2. Python Libraries for Scraping 10-K Statements
a. sec-edgar-downloader Library
This is a popular Python library designed specifically for downloading SEC filings like 10-Ks, 10-Qs, etc. It simplifies downloading filings in bulk from the SEC EDGAR database.
pip install sec-edgar-downloader
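from sec_edgar_downloader import Downloader

# Recent releases (5.x) expect a company name and email for the SEC User-Agent header; older releases took only the download folder
dl = Downloader("MyCompanyName", "my.email@example.com", "/path/to/download/folder")
dl.get("10-K", "AAPL")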
This will download all the 10-K filings for Apple (AAPL) into the specified folder.
Benefits:
Direct integration with EDGAR.
Documentation: sec-edgar-downloader GitHub
Efficient for downloading bulk filings.
Custom date ranges or limits on the number of filings.
b. edgar Python Package
Another library designed for interacting with SEC’s EDGAR. It provides access to company filings, including 10-Ks.
pip install edgar
from edgar import Company
company = Company("Apple Inc.", "0000320193") # Company name and CIK code
tree = company.get_all_filings(filing_type="10-K")
docs = Company.get_documents(tree, no_of_documents=5)
Benefits:
Easy to use and designed for specific EDGAR access.
Allows retrieval of documents in XML/HTML format.
Documentation: edgar PyPI
c. requests and BeautifulSoup (Manual Web Scraping)
If you want full control over the scraping process, you can use requests to download pages and BeautifulSoup to parse the HTML and extract the data. This is more flexible but also more complex.
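import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/320193/000032019323000056/aapl-20230930.htm"
# The SEC asks automated clients to identify themselves; replace with your own contact details
headers = {"User-Agent": "Your Name your.email@example.com"}
response = requests.get(url, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

# Extract specific parts of the filing, for example, the financial statements
tables = soup.find_all("table")
for table in tables:
    print(table.text)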
Benefits:
Full flexibility in terms of scraping and parsing data.
Control over the exact sections of the document to be scraped (e.g., Balance Sheets, MD&A, etc.).
Drawbacks:
Requires more effort compared to using dedicated libraries.
Potential for HTML structure changes in EDGAR.
d. sec-api (API-based Access)
If you prefer using an API instead of scraping directly, sec-api.io provides an API for accessing filings from the SEC EDGAR database.
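Example:
from sec_api import QueryApi

query_api = QueryApi(api_key="YOUR_API_KEY")
# Lucene-style query: Apple 10-K filings filed during 2023
query = {
    "query": 'ticker:AAPL AND formType:"10-K" AND filedAt:[2023-01-01 TO 2023-12-31]',
    "from": "0",    # pagination offset
    "size": "10",   # maximum number of filings to return
    "sort": [{"filedAt": {"order": "desc"}}],
}
filings = query_api.get_filings(query)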
Benefits:
Simplified access with ready-made API queries.
Ideal for avoiding legal or rate-limiting issues with scraping.
JSON-based response.
Documentation: sec-api.io Documentation
3. Financial Data Providers
Some commercial APIs and data providers also offer 10-K filings access, such as:
Alpha Vantage
Quandl (now owned by Nasdaq)
Xignite
These might have higher costs but provide reliable and easy access to financial filings.
Summary of Tools and Methods:
Method | Pros | Cons |
---|---|---|
sec-edgar-downloader | Easy-to-use, bulk downloading, Python API | Limited flexibility |
edgar Python package | Simple API access, company filings | Limited documentation |
requests + BeautifulSoup | Full control over scraping and parsing | Requires more effort, potential for structure changes |
sec-api.io | API-based, JSON responses, less rate-limiting | Commercial service, API limits |
Financial Data Providers | Ready-made, commercial services | Often requires subscription/payment |
For most users, the sec-edgar-downloader or edgar library should be sufficient for scraping 10-K statements with Python. If you require more granular control, requests and BeautifulSoup are great alternatives, while API-based solutions like sec-api provide robust and structured data access.
Literature showing how to extract text from EDGAR
"Lazy Prices" JF paper,
We draw from a variety of data sources to construct the sample used in this paper. We begin by downloading all complete 10-K, 10-K405, 10-KSB, and 10-Q filings from the SEC’s EDGAR website from 1995 to 2014. All complete 10-K and 10-Q filings are in HTML text format and contain an aggregation of all information that is submitted with each firm’s file, such as exhibits, graphics, XBRL files, PDF files, and Excel files. Similar to Loughran and McDonald (2011), we focus our analysis on the textual content of the document. We extract only the main 10-K and 10-Q texts in each document and remove all tables (if their numeric character content is greater than 15%), HTML tags, XBRL tables, exhibits, ASCII-encoded PDFs, graphics, XLS, and other binary files.
Footnote 15: Bill McDonald provides a detailed description on how to strip 10-K/Qs down to text files:
http://sraf.nd.edu/data/stage-one-10-x-parse-data
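The table rule quoted above (drop a table when its numeric character content exceeds 15%) can be sketched as follows. This is only an illustration under stated assumptions: the paper does not spell out the denominator, so the share is taken here as digits over non-whitespace characters of the table's visible text, tables are located with a simple regular expression rather than a full HTML parser, and the function names are hypothetical.

```python
import re

TABLE_RE = re.compile(r"<table\b.*?</table>", flags=re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def numeric_share(text):
    """Share of non-whitespace characters that are digits (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def drop_numeric_tables(html, threshold=0.15):
    """Remove <table> blocks whose visible text is more than `threshold` digits."""
    def keep_or_drop(match):
        visible = TAG_RE.sub(" ", match.group(0))
        return "" if numeric_share(visible) > threshold else match.group(0)

    return TABLE_RE.sub(keep_or_drop, html)


sample = (
    "<p>Intro</p>"
    "<table><tr><td>2019</td><td>1,234</td></tr></table>"
    "<table><tr><td>Risk factors are described below</td></tr></table>"
)
print(drop_numeric_tables(sample))
```

Tables that are mostly text survive this filter, which is the point of the threshold: some filers lay out narrative sections in table markup.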
We use monthly stock returns from the Center for Research in Security
Prices (CRSP) and firms’ book value of equity and earnings per share from
Compustat. We also obtain analyst data from the Institutional Brokers
Estimate System (I/B/E/S), and sentiment category identifiers from Loughran
and McDonald’s (2011) Master Dictionary.
We capture quarter-on-quarter similarities between 10-Q and 10-K filings
using four similarity measures taken from the literature in linguistics, textual similarity, and NLP: (i) cosine similarity, (ii) Jaccard similarity, (iii) minimum edit distance, and (iv) simple similarity. We describe each measure, and its respective calculation, below.
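Two of the four measures named in the quote are easy to illustrate. The sketch below computes cosine similarity on raw term-frequency vectors and Jaccard similarity on word sets; the whitespace tokenizer and the lack of any weighting are simplifying assumptions, not the paper's exact procedure, and minimum edit distance and simple similarity are not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(doc_a, doc_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(doc_a)), set(tokenize(doc_b))
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


last_quarter = "we expect demand to decrease"
this_quarter = "we expect demand to increase"
print(cosine_similarity(last_quarter, this_quarter))
print(jaccard_similarity(last_quarter, this_quarter))
```

Both scores lie between 0 and 1, and a single changed word moves the Jaccard score more than the cosine score in this example because Jaccard ignores word counts.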
Manager sentiment and stock returns, JFE
We then obtain 264,335 10-Ks and 10-Qs for 10,414 unique firms from the EDGAR website (www.sec.gov). We exclude firms in the financial and utility sectors and firms with missing or negative total assets. We compute the textual tone based on the entire document, since Loughran and McDonald (2011) find that the full document and Management’s Discussion and Analysis (MD&A) section often use similar words, and focusing on the MD&A section would lead to a loss of observations. Because the filed documents are often in HTML format, following Li (2008, 2010), we remove all encoded images, tables, exhibits, HTML code, special symbols, and other non-text items from the documents.
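A rough sketch of this kind of cleaning, using only the standard library, is shown below. It keeps visible text and skips anything inside `<table>`, `<script>`, or `<style>` elements; the class and function names are hypothetical, and real filings need further handling (exhibits, encoded images, and so on) that is not attempted here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping everything inside table/script/style."""

    SKIP = {"table", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of skipped elements currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and skipped elements, then collapse runs of whitespace."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


page = (
    "<html><body><p>Revenue grew &amp; margins held.</p>"
    "<table><tr><td>123</td></tr></table>"
    "<p>Outlook is stable.</p></body></html>"
)
print(html_to_text(page))
```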
The Fast and the Circuitous: Semantic Progression as a Type of Disclosure Complexity
The sample consists of all U.S. firms in the merged EDGAR/CRSP/Compustat data set from January 1994 to December 2018. That is, following Cohen et al. (2020), we download annual reports from EDGAR, including all 10-K, 10-K405, and 10-KSB filings. We keep only textual content following Loughran and McDonald (2011). Next, we compute the speed, volume, and circuitousness of each 10-K following Toubia et al. (2021). We collect data for several additional variables from Compustat, CRSP, Audit Analytics, I/B/E/S, Thomson Reuters 13F data, SEC Analytics Suite, and WRDS Beta Suite. We discuss these variables in more detail below and also list definitions in Appendix A and the footnote of Table 1.
The final merged sample consists of 88,272 firm-year observations for 10,956 unique firms. This sample is comparable with related studies, such as the 90,437 firm-years between 2003 and 2016 in Cao, Jiang, Yang, and Zhang (2020) and the 86,965 firm-years between 1995 to 2014 in Cohen et al. (2020).
We follow four steps to measure progression complexity. First, we pre-process all 10-Ks. In this step, we start with the Loughran and McDonald (2011) cleaned 10-X files, which omit all tables, HTML tags, XBRL, exhibits, ASCII-encoded PDFs, and other binary files. We then tokenize the remaining content using the Python natural language toolkit, NLTK. Second, each document is split into non-overlapping information chunks. We set the target chunk size to 250 words, but our empirical results are similar if we instead use a target of 125 or 375 words. To avoid breaking up coherent ideas, words in the same sentence are assigned to the same chunk. Thus, the actual chunks are often slightly larger than 250 words (with the exception of the very last chunk of each document). As a result, the average chunk in our sample contains 255.21 words. Third, following Cong, Liang, and Zhang (2019), each information chunk is represented by the average of the word vectors obtained from the pretrained GloVe model (Pennington et al., 2014). The length of each word vector is 300, which is the highest possible dimensionality in the GloVe model.
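The chunking step described above can be sketched as follows: sentences are added to the current chunk until it reaches the target word count, so a sentence is never split and chunks tend to run slightly over the target. The regular-expression sentence splitter is an assumption for illustration (the paper tokenizes with NLTK), and the embedding step with GloVe vectors is not shown.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_document(text, target_words=250):
    """Group whole sentences into chunks of at least `target_words` words.

    The final chunk holds whatever is left and may be shorter.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        current.append(sentence)
        count += len(sentence.split())
        if count >= target_words:
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six seven. Eight nine. Ten."
for chunk in chunk_document(demo, target_words=5):
    print(len(chunk.split()), chunk)
```

With a target of 5 words the demo text yields a 7-word chunk followed by a shorter final chunk, mirroring the paper's observation that actual chunks are often slightly larger than the target.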
Are Financial Constraints Priced? Evidence from Textual Analysis
From the EDGAR database, we download all filings of Form 10-K from 1994 to 2010. Following F. Li (2010), Hoberg and Maksimovic (2015), and Bodnaruk, Loughran, and McDonald (2015), from each 10-K filing we extract the Management’s Discussion and Analysis (MD&A) section, which contains a narrative explanation of the past performance of the firm, its financial condition, and its future prospects. As such, the MD&A material contains the textual information we want. We focus on the MD&A section because SEC Regulation S-K requires firms to discuss their liquidity needs and sources, and this discussion is always contained in the MD&A section. In this regard, we depart from Loughran and McDonald (2011), which examines the whole 10-K. However, their intent is to pick up word tone, which can appear anywhere in a 10-K, and our intent is to pick up specific discussions of financial frictions.
1.2.1 Preprocessing. After extracting the MD&A section from each 10-K filing, we preprocess each MD&A (Feinerer, Hornik, and Meyer, 2008, Li, 2010). The preprocessing steps are all standard, and their goal is to make the textual analysis more precise by reducing unnecessary noise in the text. We remove all characters that are not alphanumeric, we convert all letters to lowercase, we remove all stop words (e.g., “am” or “and”), and we stem each document. Stemming means that we reduce inflected or derived words to their stem, which is a standard procedure from computational linguistics to conflate related words. Consider, for example, the following sentence:
Diamond is the latest in a line of U.S. oil companies that have cut its contract prices over the last two days citing weak oil markets.
After stemming, this sentence becomes:
Diamond is the latest in a line of U.S. oil compani that have cut it contract price over the last two day cit weak oil market.
Finally, we remove all words that do not occur in at least 99% of the MD&A statements. The purpose of this step is to remove words that appear so infrequently that their meaning cannot easily be detected by our textual analysis. Because there is a remote possibility these words have a greater impact, we are careful to set the threshold high enough to remove only the very infrequent words, while keeping the rest
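The preprocessing steps in this excerpt can be sketched with the standard library. The stop-word list below is a tiny illustrative stand-in, the `min_share` cutoff is a free parameter rather than the paper's value, and stemming is left out because it needs a stemmer such as NLTK's PorterStemmer; all function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop words only; a real list would be much longer
STOP_WORDS = {"am", "and", "is", "the", "a", "of", "in", "that", "have"}


def preprocess(text, stop_words=STOP_WORDS):
    """Drop non-alphanumeric characters, lowercase, and remove stop words."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]


def drop_rare_words(token_docs, min_share):
    """Keep only words that occur in at least `min_share` of the documents."""
    doc_freq = Counter(word for doc in token_docs for word in set(doc))
    cutoff = min_share * len(token_docs)
    return [[w for w in doc if doc_freq[w] >= cutoff] for doc in token_docs]


print(preprocess("Diamond is the latest, citing weak oil markets."))

docs = [["oil", "market"], ["oil", "price"], ["oil", "market", "debt"]]
print(drop_rare_words(docs, min_share=0.5))
```

Document frequency is counted once per document (via `set(doc)`), so a word repeated many times in a single filing is still treated as rare.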
Pima Indians Diabetes Database
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data
https://archive.ics.uci.edu/dataset/34/diabetes
https://www.openml.org/search?type=data&id=37&sort=runs&status=active

### Method 1: Using `pandas` to load from a CSV file
import pandas as pd

# Load the dataset (assumes a local copy of the CSV that includes a header row)
df = pd.read_csv('pima-indians-diabetes.csv')
print(df.head())
print(df.shape)

### Method 2: Using `sklearn` datasets
# Caution: load_diabetes() returns scikit-learn's diabetes progression (regression) data with 442 samples, not the Pima Indians dataset
from sklearn.datasets import load_diabetes

data = load_diabetes()
# Access the data and target
X, y = data.data, data.target
print('Data shape:', X.shape)
print('Target shape:', y.shape)

### Method 3: Using `seaborn` with data from an online source
# Note: sns.load_dataset() only serves the datasets in the seaborn-data repository, which does not include this one, so the CSV is read with pandas and the DataFrame can then be passed to seaborn for plotting
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
df = pd.read_csv(url, header=None)  # this file has no header row
print(df.head())
print(df.shape)

### Method 4: Using `openml` with `sklearn`
from sklearn.datasets import fetch_openml

# Load the dataset (OpenML data id 37, named 'diabetes')
pima = fetch_openml(name='diabetes', version=1)
# Split data into features and target
X, y = pima.data, pima.target
print('Data shape:', X.shape)
print('Target shape:', y.shape)

### Method 5: Splitting a local CSV into features and target
# Note: mlxtend.data has no Pima loader (its loadlocal_mnist reads MNIST byte files), so a header-less local CSV is split with pandas instead
import pandas as pd

df = pd.read_csv('pima-indians-diabetes.csv', header=None)
X, y = df.iloc[:, :-1].to_numpy(), df.iloc[:, -1].to_numpy()
print('Data shape:', X.shape)
print('Target shape:', y.shape)

### Method 6: Using UCI repository with `pandas`
import pandas as pd

# This UCI path may no longer be available; the GitHub mirror listed above hosts the same header-less file
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv(url, header=None, names=columns)
print(df.head())
print(df.shape)
mobile price classification
https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification
MNIST data
The task is to classify grayscale images of handwritten digits (28 × 28 pixels) into their 10 categories (0 through 9). The dataset is a set of 60,000 training images, plus 10,000 test images, assembled by the National Institute of Standards and Technology (the NIST in MNIST) in the 1980s.

from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
>>> train_images.shape
(60000, 28, 28)
>>> len(train_labels)
60000
>>> train_labels
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)
(Source: Chollet, 2022)

Here are several ways to load the MNIST dataset in Python:

### Method 1: Using Keras
from keras.datasets import mnist

# Load the dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print('Training data shape:', train_images.shape, train_labels.shape)
print('Testing data shape:', test_images.shape, test_labels.shape)

### Method 2: Using TensorFlow
import tensorflow as tf

# Load the dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print('Training data shape:', train_images.shape, train_labels.shape)
print('Testing data shape:', test_images.shape, test_labels.shape)

### Method 3: Using PyTorch
import torch
from torchvision import datasets, transforms

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
# Download and load the training data
trainset = datasets.MNIST('MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
# Download and load the test data
testset = datasets.MNIST('MNIST_data/', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)
# Fetch one batch; the built-in next() replaces the removed dataiter.next() method
dataiter = iter(trainloader)
images, labels = next(dataiter)
print(images.shape)
print(labels.shape)

### Method 4: Using sklearn
from sklearn.datasets import fetch_openml

# Load MNIST data from openml
mnist = fetch_openml('mnist_784', version=1)
print(mnist.data.shape, mnist.target.shape)
# Split data into training and testing
X, y = mnist["data"], mnist["target"]
X = X.to_numpy().reshape(-1, 28, 28)
y = y.to_numpy().astype(int)
train_images, test_images = X[:60000], X[60000:]
train_labels, test_labels = y[:60000], y[60000:]
print('Training data shape:', train_images.shape, train_labels.shape)
print('Testing data shape:', test_images.shape, test_labels.shape)

### Method 5: Using `mlxtend`
from mlxtend.data import loadlocal_mnist

# Load training data (expects the unpacked MNIST byte files in the working directory)
X_train, y_train = loadlocal_mnist(images_path='train-images-idx3-ubyte', labels_path='train-labels-idx1-ubyte')
# Load testing data
X_test, y_test = loadlocal_mnist(images_path='t10k-images-idx3-ubyte', labels_path='t10k-labels-idx1-ubyte')
print('Training data shape:', X_train.shape, y_train.shape)
print('Testing data shape:', X_test.shape, y_test.shape)
Iris
https://archive.ics.uci.edu/dataset/53/iris
Wine quality
https://archive.ics.uci.edu/dataset/186/wine+quality
IMDB dataset
The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.
https://ai.stanford.edu/%7Eamaas/data/sentiment/

from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

"The argument num_words=10000 means you’ll only keep the top 10,000 most frequently occurring words in the training data. Rare words will be discarded. This allows us to work with vector data of manageable size. If we didn’t set this limit, we’d be working with 88,585 unique words in the training data, which is unnecessarily large. Many of these words only occur in a single sample, and thus can’t be meaningfully used for classification." (Source: Chollet, 2022)
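To make the idea behind `num_words` concrete, the sketch below keeps only the most frequent words of a toy corpus and drops the rest. It works on plain word lists and hypothetical helper names; Keras itself operates on integer-encoded reviews and has its own handling of out-of-vocabulary words, so this illustrates the principle rather than the library's implementation.

```python
from collections import Counter


def build_vocabulary(token_docs, num_words):
    """Return the `num_words` most frequent words across all documents."""
    counts = Counter(word for doc in token_docs for word in doc)
    return {word for word, _ in counts.most_common(num_words)}


def filter_rare(token_docs, vocabulary):
    """Drop every word that is not in the vocabulary."""
    return [[w for w in doc if w in vocabulary] for doc in token_docs]


reviews = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie"],
    ["great", "plot"],
]
vocab = build_vocabulary(reviews, num_words=2)
print(sorted(vocab))
print(filter_rare(reviews, vocab))
```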
Boston Housing dataset
https://keras.io/2.16/api/datasets/boston_housing/
https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset
https://github.com/selva86/datasets/blob/master/BostonHousing.csv
https://www.kaggle.com/datasets/altavish/boston-housing-dataset
https://scikit-learn.org/1.1/modules/generated/sklearn.datasets.load_boston.html
Number of Instances: 506
Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
Attribute Information (in order):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000's
Breast cancer dataset
The Breast Cancer dataset is a well-known dataset used for machine learning and data mining tasks, particularly for classification problems. There are several versions of this dataset, with the most popular being the Breast Cancer Wisconsin dataset. Here are the key details and sources where you can find this dataset:
### 1. Breast Cancer Wisconsin (Diagnostic) Dataset
**Description:**
- **Attributes:** 30 numeric features representing characteristics of cell nuclei present in breast cancer biopsies.
- **Classes:** 2 (malignant or benign).
- **Samples:** 569 instances.
**Features:**
- Radius (mean of distances from center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter² / area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal dimension (approximation of the ‘coastline’)
**Source:**
- UCI Machine Learning Repository: [Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
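The list above names 10 measurements while the dataset has 30 attributes. In the Diagnostic dataset each measurement is reported three times per image: as a mean, a standard error, and a "worst" (largest) value. The short sketch below enumerates the resulting 30 columns; the exact name strings are illustrative and may differ from the column names in any particular copy of the data.

```python
from itertools import product

MEASUREMENTS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
STATISTICS = ["mean", "standard error", "worst"]

# One column per (statistic, measurement) pair: 3 x 10 = 30 features
feature_names = [f"{stat} {name}" for stat, name in product(STATISTICS, MEASUREMENTS)]
print(len(feature_names))
print(feature_names[:3])
```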
### 2. Breast Cancer Wisconsin (Original) Dataset
**Description:**
- **Attributes:** 10 (a sample ID number plus the 9 features listed below).
- **Classes:** 2 (malignant or benign).
- **Samples:** 699 instances.
**Features:**
- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- Marginal Adhesion
- Single Epithelial Cell Size
- Bare Nuclei
- Bland Chromatin
- Normal Nucleoli
- Mitoses
**Source:**
- UCI Machine Learning Repository: [Breast Cancer Wisconsin (Original) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original))
### 3. Breast Cancer Coimbra Dataset
**Description:**
- **Attributes:** 9 numeric features.
- **Classes:** 2 (healthy controls or breast cancer patients).
- **Samples:** 116 instances.
**Features:**
- Age
- BMI
- Glucose
- Insulin
- HOMA
- Leptin
- Adiponectin
- Resistin
- MCP-1
**Source:**
- UCI Machine Learning Repository: [Breast Cancer Coimbra Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra)
These datasets are widely used for research and education to develop, test, and benchmark machine learning algorithms. They provide a solid foundation for exploring classification techniques and understanding the patterns associated with breast cancer detection and diagnosis.
import rpa as r  # assumed: the RPA for Python package, which these calls match

r.init()
r.url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/")
r.click("breast-cancer.data")
r.close()
ImageNet dataset
The pretrained network we’ll explore here was trained on a subset of the ImageNet dataset (http://imagenet.stanford.edu). ImageNet is a very large dataset of over 14 million images maintained by Stanford University. All of the images are labeled with a hierarchy of nouns that come from the WordNet dataset (http://wordnet.princeton.edu), which is in turn a large lexical database of the English language.