Data

EDGAR

Reading the EDGAR database

https://github.com/LexPredict/openedgar
https://law.mit.edu/pub/openedgar/release/1
https://www.sec.gov/search-filings/edgar-search-assistance/accessing-edgar-data
https://www.kaggle.com/code/svendaj/extracting-data-from-sec-edgar-restful-apis/notebook

Scraping 10-K statements (annual reports) filed with the U.S. Securities and Exchange Commission (SEC) is a common task in financial analysis, machine learning, and investment research. Several Python approaches and libraries can extract 10-K forms from the SEC's EDGAR database. Below are some of the best sources and methods you can use:

1. SEC EDGAR Database

The SEC’s EDGAR database contains all publicly filed company documents, including 10-Ks. You can access these using SEC’s APIs or directly via web scraping.
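EDGAR also exposes a free JSON API at data.sec.gov. As a minimal sketch, the snippet below lists a company's recent 10-K filings via the documented submissions endpoint; the User-Agent value is a placeholder that the SEC requires you to replace with your own name and contact email.

import requests

# SEC requires a descriptive User-Agent with contact info (placeholder below)
headers = {"User-Agent": "Sample Name sample@example.com"}

# Apple's CIK (0000320193), zero-padded to 10 digits
url = "https://data.sec.gov/submissions/CIK0000320193.json"
data = requests.get(url, headers=headers).json()

recent = data["filings"]["recent"]
for form, accession, doc in zip(recent["form"],
                                recent["accessionNumber"],
                                recent["primaryDocument"]):
    if form == "10-K":
        acc = accession.replace("-", "")  # strip dashes for the archive path
        print(f"https://www.sec.gov/Archives/edgar/data/320193/{acc}/{doc}")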

2. Python Libraries for Scraping 10-K Statements

a. sec-edgar-downloader Library

This is a popular Python library designed specifically for downloading SEC filings like 10-Ks, 10-Qs, etc. It simplifies downloading filings in bulk from the SEC EDGAR database.

pip install sec-edgar-downloader

from sec_edgar_downloader import Downloader

# Note: versions >= 5.0 require a company name and contact email (placeholders
# below); older versions took only the download folder
dl = Downloader("MyCompanyName", "my.email@example.com", "/path/to/download/folder")
dl.get("10-K", "AAPL")

This will download all the 10-K filings for Apple (AAPL) into the specified folder.

Benefits:

Direct integration with EDGAR.
Efficient for downloading bulk filings.
Custom date ranges or limits on the number of filings (see the sketch below).

Documentation: sec-edgar-downloader GitHub
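A minimal sketch of that filtering, using the `after`, `before`, and `limit` keyword arguments from the library's documented interface (check your installed version, as the API has changed across releases):

# Download at most two Apple 10-Ks filed between 2019 and 2022
dl.get("10-K", "AAPL", limit=2, after="2019-01-01", before="2022-12-31")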

b. edgar Python Package

Another library designed for interacting with SEC’s EDGAR. It provides access to company filings, including 10-Ks.

pip install edgar

from edgar import Company

company = Company("Apple Inc", "0000320193")  # company name and CIK (not the ticker)
tree = company.get_all_filings(filing_type="10-K")
docs = Company.get_documents(tree, no_of_documents=5)

Benefits:

Easy to use and designed specifically for EDGAR access.

Allows retrieval of documents in XML/HTML format.

Documentation: edgar PyPI

c. requests and BeautifulSoup (Manual Web Scraping)

If you want full control over the scraping process, you can use requests to download pages and BeautifulSoup to parse the HTML and extract the data. This is more flexible but also more complex.

import requests
from bs4 import BeautifulSoup

# SEC requires a User-Agent header identifying you (placeholder contact info)
headers = {"User-Agent": "Sample Name sample@example.com"}

url = "https://www.sec.gov/Archives/edgar/data/320193/000032019323000056/aapl-20230930.htm"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# Extract specific parts of the filing, for example, the financial statement tables
tables = soup.find_all('table')
for table in tables:
    print(table.text)

Benefits:

Full flexibility in terms of scraping and parsing data.

Control over the exact sections of the document to be scraped (e.g., Balance Sheets, MD&A, etc.); see the sketch after the drawbacks below.

Drawbacks:

Requires more effort compared to using dedicated libraries.

Potential for HTML structure changes in EDGAR.
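As an illustration of that control, the sketch below pulls the text between "Item 7." and "Item 8." (the MD&A section) with a regular expression. This is only a heuristic: 10-K HTML varies across filers and years, the item headings are not standardized, and the pattern will usually need tuning.

import re
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Sample Name sample@example.com"}  # placeholder contact info
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019323000056/aapl-20230930.htm"
html = requests.get(url, headers=headers).text

# Flatten the HTML to plain text before searching for item boundaries
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

# MD&A is assumed to run from "Item 7." to "Item 8."; filings repeat these
# headings in the table of contents, so take the last match
matches = re.findall(r"Item\s+7\..*?(?=Item\s+8\.)", text, flags=re.S | re.I)
mdna = matches[-1] if matches else ""
print(mdna[:500])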

d. sec-api (API-based Access)

If you prefer using an API instead of scraping directly, sec-api.io provides an API for accessing filings from the SEC EDGAR database.

Example:

from sec_api import QueryApi

queryApi = QueryApi(api_key="YOUR_API_KEY")

# Filter by ticker, form type, and filing date (Lucene range syntax)
query = {
    "query": {"query_string": {"query": 'ticker:AAPL AND formType:"10-K" AND filedAt:[2023-01-01 TO 2023-12-31]'}},
    "from": "0",
    "size": "10",
}
filings = queryApi.get_filings(query)

Benefits:

Simplified access with ready-made API queries.

Ideal for avoiding legal or rate-limiting issues with scraping.

JSON-based response.

Documentation: sec-api.io Documentation

3. Financial Data Providers

Some commercial APIs and data providers also offer 10-K filings access, such as:

Alpha Vantage

Quandl (now owned by Nasdaq)

Xignite

These might have higher costs but provide reliable and easy access to financial filings.

Summary of Tools and Methods:

| Method | Pros | Cons |
| --- | --- | --- |
| sec-edgar-downloader | Easy to use, bulk downloading, Python API | Limited flexibility |
| edgar Python package | Simple API access, company filings | Limited documentation |
| requests + BeautifulSoup | Full control over scraping and parsing | Requires more effort, HTML structure can change |
| sec-api.io | API-based, JSON responses, less rate-limiting | Commercial service, API limits |
| Financial data providers | Ready-made, commercial services | Often requires subscription/payment |

For most users, sec-edgar-downloader or edgar library should be sufficient for scraping 10-K statements with Python. If you require more granular control, requests and BeautifulSoup are great alternatives, while API-based solutions like sec-api provide robust and structured data access.

Literature showing how to extract text from EDGAR

"Lazy Prices" JF paper, 
We draw from a variety of data sources to construct the sample used in this
paper. We begin by downloading all complete 10-K, 10-K405, 10-KSB, and 10-Q
filings from the SEC’s EDGAR website14 from 1995 to 2014. All complete 10-K
and 10-Q filings are in HTML text format and contain an aggregation of all
information that is submitted with each firm’s file, such as exhibits, graphics, XBRL files, PDF files, and Excel files. Similar to Loughran and McDonald (2011), we focus our analysis on the textual content of the document. We extract only the main 10-K and 10-Q texts in each document and remove all tables (if their numeric character content is greater than 15%), HTML tags, XBRL tables, exhibits, ASCII-encoded PDFs, graphics, XLS, and other binary files. 15 Bill McDonald provides a detailed description on how to strip 10-K/Qs down to text files:
http://sraf.nd.edu/data/stage-one-10-x-parse-data
We use monthly stock returns from the Center for Research in Security
Prices (CRSP) and firms’ book value of equity and earnings per share from
Compustat. We also obtain analyst data from the Institutional Brokers
Estimate System (I/B/E/S), and sentiment category identifiers from Loughran
and McDonald’s (2011) Master Dictionary.
We capture quarter-on-quarter similarities between 10-Q and 10-K filings using four similarity measures taken from the literature in linguistics, textual similarity, and NLP: (i) cosine similarity, (ii) Jaccard similarity, (iii) minimum edit distance, and (iv) simple similarity. We describe each measure, and its respective calculation, below.
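To make the first two measures concrete, here is a small sketch of cosine and Jaccard similarity between two documents, using raw term counts and token sets. It illustrates the standard definitions, not the paper's exact implementation.

import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    # Cosine similarity between term-count vectors
    c1, c2 = Counter(doc1.split()), Counter(doc2.split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def jaccard_similarity(doc1, doc2):
    # Jaccard similarity between sets of unique terms
    s1, s2 = set(doc1.split()), set(doc2.split())
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

a = "the company faces litigation risk in several jurisdictions"
b = "the company faces regulatory risk in several jurisdictions"
print(cosine_similarity(a, b), jaccard_similarity(a, b))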
"Manager Sentiment and Stock Returns" (JFE paper):
We then obtain 264,335 10-Ks and 10-Qs for 10,414 unique firms from the EDGAR website (www.sec.gov). We exclude firms in the financial and utility sectors and firms with missing or negative total assets. We compute the textual tone based on the entire document, since Loughran and McDonald (2011) find that the full document and Management’s Discussion and Analysis (MD&A) section often use similar words, and focusing on the MD&A section would lead to a loss of observations. Because the filed documents are often in HTML format, following
Li (2008, 2010), we remove all encoded images, tables, exhibits, HTML code, special symbols, and other non-text items from the documents.
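A sketch of this kind of dictionary-based tone measure, assuming you have downloaded the Loughran-McDonald Master Dictionary as a CSV (the filename and column names below follow its published layout but should be verified against the file you download):

import re
import pandas as pd

# Load the LM Master Dictionary (filename and columns assumed; verify locally)
lm = pd.read_csv("LoughranMcDonald_MasterDictionary.csv")
positive = set(lm.loc[lm["Positive"] > 0, "Word"].str.lower())
negative = set(lm.loc[lm["Negative"] > 0, "Word"].str.lower())

def tone(text):
    # Tone = (positive count - negative count) / total word count
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return (pos - neg) / len(words) if words else 0.0

print(tone("The company reported strong growth despite adverse litigation."))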
"The Fast and the Circuitous: Semantic Progression as a Type of Disclosure Complexity":
The sample consists of all U.S. firms in the merged EDGAR/CRSP/Compustat data set from January 1994 to December 2018. That is, following Cohen et al. (2020), we download annual reports from EDGAR, including all 10-K, 10-K405, and 10-KSB filings. We keep only textual content following Loughran and McDonald (2011). Next, we compute the speed, volume, and circuitousness of each 10-K following Toubia et al. (2021). We collect data for several additional variables from Compustat, CRSP, Audit Analytics, I/B/E/S, Thomson Reuters 13F data, SEC Analytics Suite, and WRDS Beta Suite. We discuss these variables in more detail below and also list definitions in Appendix A and the footnote of Table 1.
The final merged sample consists of 88,272 firm-year observations for 10,956 unique firms. This sample is comparable with related studies, such as the 90,437 firm-years between 2003 and 2016 in Cao, Jiang, Yang, and Zhang (2020) and the 86,965 firm-years between 1995 to 2014 in Cohen et al. (2020).
We follow four steps to measure progression complexity. First, we pre-process all 10-Ks. In this step, we start with the Loughran and McDonald (2011) cleaned 10-X files, which omit all tables, HTML tags, XBRL, exhibits, ASCII-encoded PDFs, and other binary files. We then tokenize the remaining content using the Python natural language toolkit, NLTK. Second, each document is split into non-overlapping information chunks. We set the target chunk size to 250 words, but our empirical results are similar if we instead use a target of 125 or 375 words. To avoid breaking up coherent ideas, words in the same sentence are assigned to the same chunk. Thus, the actual chunks are often slightly larger than 250 words (with the exception of the very last chunk of each document). As a result, the average chunk in our sample contains 255.21 words. Third, following Cong, Liang, and Zhang (2019), each information chunk is represented by the average of the word vectors obtained from the pretrained GloVe model (Pennington et al., 2014). The length of each word vector is 300, which is the highest possible dimensionality in the GloVe model.
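A sketch of the chunking and embedding steps, assuming NLTK for sentence tokenization and gensim's downloader for a pretrained 300-dimensional GloVe model (model name as listed in gensim's model gallery):

import numpy as np
import nltk
import gensim.downloader as api

nltk.download("punkt", quiet=True)
glove = api.load("glove-wiki-gigaword-300")  # pretrained 300-d GloVe vectors

def chunk_document(text, target=250):
    # Greedily pack whole sentences into chunks of at least `target` words,
    # so coherent ideas are not split across chunks
    chunks, current = [], []
    for sentence in nltk.sent_tokenize(text):
        current.extend(sentence.split())
        if len(current) >= target:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

def chunk_vector(chunk):
    # Represent a chunk by the average of its words' GloVe vectors
    vectors = [glove[w.lower()] for w in chunk if w.lower() in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)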
Are Financial Constraints Priced? Evidence from Textual Analysis
From the EDGAR database, we download all filings of Form 10-K from
1994 to 2010. Following F. Li (2010), Hoberg and Maksimovic (2015), and
Bodnaruk, Loughran, and McDonald (2015), from each 10-K filing we extract
the Management's Discussion and Analysis (MD&A) section, which contains
a narrative explanation of the past performance of the firm, its financial
condition, and its future prospects. As such, the MD&A material contains the
textual information we want. We focus on the MD&A section because SEC
Regulation S-K requires firms to discuss their liquidity needs and sources, and
this discussion is always contained in the MD&A section. In this regard, we
depart from Loughran and McDonald (2011), which examines the whole 10-K.
However, their intent is to pick up word tone, which can appear anywhere in a
10-K, and our intent is to pick up specific discussions of financial frictions.

1.2.1 Preprocessing. After extracting the MD&A section from each 10-K
filing, we preprocess each MD&A (Feinerer, Hornik, and Meyer, 2008, Li,
2010). The preprocessing steps are all standard, and their goal is to make
the textual analysis more precise by reducing unnecessary noise in the text.
We remove all characters that are not alphanumeric, we convert all letters to
lowercase, we remove all stop words (e.g., “am” or “and”), and we stem each
document. Stemming means that we reduce inflected or derived words to their
stem, which is a standard procedure from computational linguistics to conflate
related words. Consider, for example, the following sentence:
Diamond is the latest in a line of U.S. oil companies that have cut
its contract prices over the last two days citing weak oil markets.

After stemming, this sentence becomes:
Diamond is the latest in a line of U.S. oil compani that have cut it
contract price over the last two day cit weak oil market.
Finally, we remove all words that do not occur in at least 99% of the MD&A statements. The purpose of this step is to remove words that appear so infrequently that their meaning cannot easily be detected by our textual analysis. Because there is a remote possibility these words have a greater impact, we are careful to set the threshold high enough to remove only the very infrequent words, while keeping the rest.
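A sketch of those preprocessing steps with NLTK (its stop-word list and Porter stemmer stand in for the paper's R-based pipeline, so the output will differ slightly):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Keep alphanumerics, lowercase, drop stop words, stem the remainder
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text).lower()
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop)

sentence = ("Diamond is the latest in a line of U.S. oil companies that have "
            "cut its contract prices over the last two days citing weak oil markets.")
print(preprocess(sentence))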

Pima Indians Diabetes Database

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data

https://archive.ics.uci.edu/dataset/34/diabetes

https://www.openml.org/search?type=data&id=37&sort=runs&status=active


### Method 1: Using `pandas` to load from a CSV file

import pandas as pd

# Load the dataset (the raw CSV has no header row, so name the columns)
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv('pima-indians-diabetes.csv', header=None, names=columns)

print(df.head())
print(df.shape)


### Method 2: Using `sklearn` datasets (a different diabetes dataset)

Caution: `load_diabetes` returns scikit-learn's regression diabetes dataset (442 samples, continuous target), not the Pima classification data; the two are easily confused. Use Method 4 below for the actual Pima dataset.

from sklearn.datasets import load_diabetes

# Load scikit-learn's bundled (regression) diabetes dataset
data = load_diabetes()

# Access the data and target
X, y = data.data, data.target

print('Data shape:', X.shape)
print('Target shape:', y.shape)


### Method 3: Using `pandas` to load from an online source

Seaborn's `load_dataset` only covers seaborn's own example datasets, which do not include Pima, so read the raw CSV mirror listed above with pandas instead.

import pandas as pd

# Raw CSV mirror (no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv(url, header=None, names=columns)

print(df.head())
print(df.shape)


### Method 4: Using `openml` with `sklearn`

from sklearn.datasets import fetch_openml

# Load the dataset
pima = fetch_openml(name='diabetes', version=1)

# Split data into features and target
X, y = pima.data, pima.target

print('Data shape:', X.shape)
print('Target shape:', y.shape)


### Method 5: Using `numpy`

mlxtend's data loaders do not include a Pima function, so `numpy.loadtxt` is a simple way to read a local copy of the CSV.

import numpy as np

# Load the local CSV; the last column is the 0/1 outcome label
data = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X, y = data[:, :-1], data[:, -1]

print('Data shape:', X.shape)
print('Target shape:', y.shape)


### Method 6: Using UCI repository with `pandas`

import pandas as pd

# Load the dataset (the UCI copy has been withdrawn; if this URL 404s,
# use the raw.githubusercontent.com mirror listed above)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv(url, header=None, names=columns)

print(df.head())
print(df.shape)

Mobile Price Classification

https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification

MNIST data

MNIST is the task of classifying grayscale images of handwritten digits (28 × 28 pixels) into their 10 categories (0 through 9). It consists of 60,000 training images plus 10,000 test images, assembled by the National Institute of Standards and Technology (the NIST in MNIST) in the 1980s.
------
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

>>> train_images.shape
(60000, 28, 28)
>>> len(train_labels) 
60000 
>>> train_labels
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

(Source: Chollet, 2022)

Here are several ways to load the MNIST dataset in Python:

### Method 1: Using Keras
from keras.datasets import mnist

# Load the dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print('Training data shape:', train_images.shape, train_labels.shape)
print('Testing data shape:', test_images.shape, test_labels.shape)

### Method 2: Using TensorFlow

import tensorflow as tf

# Load the dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print('Training data shape:', train_images.shape, train_labels.shape)
print('Testing data shape:', test_images.shape, test_labels.shape)

### Method 3: Using PyTorch

import torch
from torchvision import datasets, transforms

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Download and load the training data
trainset = datasets.MNIST('MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Download and load the test data
testset = datasets.MNIST('MNIST_data/', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)

dataiter = iter(trainloader)
images, labels = next(dataiter)  # dataiter.next() was removed in newer PyTorch
print(images.shape)
print(labels.shape)

### Method 4: Using sklearn

from sklearn.datasets import fetch_openml

# Load MNIST data from OpenML
mnist = fetch_openml('mnist_784', version=1)
print(mnist.data.shape, mnist.target.shape)

# Split data into training and testing
X, y = mnist["data"], mnist["target"]
X = X.to_numpy().reshape(-1, 28, 28)
y = y.to_numpy().astype(int)

train_images, test_images = X[:60000], X[60000:]
train_labels, test_labels = y[:60000], y[60000:]

print('Training data shape:', train_images.shape, train_labels.shape)
print('Testing data shape:', test_images.shape, test_labels.shape)

### Method 5: Using `mlxtend`

from mlxtend.data import loadlocal_mnist

# Load training data
X_train, y_train = loadlocal_mnist(images_path='train-images-idx3-ubyte', labels_path='train-labels-idx1-ubyte')

# Load testing data
X_test, y_test = loadlocal_mnist(images_path='t10k-images-idx3-ubyte', labels_path='t10k-labels-idx1-ubyte')

print('Training data shape:', X_train.shape, y_train.shape)
print('Testing data shape:', X_test.shape, y_test.shape)

Iris

https://archive.ics.uci.edu/dataset/53/iris
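For quick experiments, scikit-learn also bundles Iris directly:

from sklearn.datasets import load_iris

# Load the bundled Iris data as numpy arrays
iris = load_iris()
X, y = iris.data, iris.target

print('Data shape:', X.shape)
print('Classes:', iris.target_names)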

Wine quality

https://archive.ics.uci.edu/dataset/186/wine+quality
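The red-wine CSV can be read straight from UCI with pandas; note the semicolon separator (URL as historically published by UCI, so verify it still resolves):

import pandas as pd

# The UCI wine-quality CSVs are semicolon-separated
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=';')

print(df.head())
print(df.shape)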

IMDB dataset

The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing, with each set consisting of 50% negative and 50% positive reviews.

https://ai.stanford.edu/%7Eamaas/data/sentiment/

from tensorflow.keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

"The argument num_words=10000 means you’ll only keep the top 10,000 most frequently 
occurring words in the training data. Rare words will be discarded. This allows us to 
work with vector data of manageable size. If we didn’t set this limit, we’d be 
working with 88,585 unique words in the training data, which is unnecessarily large. 
Many of these words only occur in a single sample, and thus can’t be meaningfully 
used for classification." (Source: Chollet, 2022)
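Since load_data returns reviews as sequences of integer word indices, it is often useful to map a review back to words. A short sketch using the word index that Keras provides (the offset of 3 accounts for the reserved padding/start/unknown indices):

from tensorflow.keras.datasets import imdb

(train_data, train_labels), _ = imdb.load_data(num_words=10000)

word_index = imdb.get_word_index()
reverse_index = {value: key for key, value in word_index.items()}

# Indices 0-2 are reserved (padding, start-of-sequence, unknown), hence i - 3
decoded = " ".join(reverse_index.get(i - 3, "?") for i in train_data[0])
print(decoded[:200])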

Boston Housing dataset

https://keras.io/2.16/api/datasets/boston_housing/
https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset
https://github.com/selva86/datasets/blob/master/BostonHousing.csv
https://www.kaggle.com/datasets/altavish/boston-housing-dataset
https://scikit-learn.org/1.1/modules/generated/sklearn.datasets.load_boston.html

Number of Instances: 506

Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's
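Note that sklearn.datasets.load_boston was removed in scikit-learn 1.2 (hence the 1.1 documentation link above), so the Keras loader is a convenient alternative:

from tensorflow.keras.datasets import boston_housing

# Returns train/test splits of the 13 features and the MEDV target
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

print('Training data shape:', x_train.shape, y_train.shape)
print('Testing data shape:', x_test.shape, y_test.shape)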

Breast cancer dataset

The Breast Cancer dataset is a well-known dataset used for machine learning and data mining tasks, particularly for classification problems. There are several versions of this dataset, with the most popular being the Breast Cancer Wisconsin dataset. Here are the key details and sources where you can find this dataset:

### 1. Breast Cancer Wisconsin (Diagnostic) Dataset

**Description:**
- **Attributes:** 30 numeric features representing characteristics of cell nuclei present in breast cancer biopsies.
- **Classes:** 2 (malignant or benign).
- **Samples:** 569 instances.

**Features:**
- Radius (mean of distances from center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter² / area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal dimension (approximation of the ‘coastline’)

**Source:**
- UCI Machine Learning Repository: [Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))

### 2. Breast Cancer Wisconsin (Original) Dataset

**Description:**
- **Attributes:** 10 features.
- **Classes:** 2 (malignant or benign).
- **Samples:** 699 instances.

**Features:**
- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- Marginal Adhesion
- Single Epithelial Cell Size
- Bare Nuclei
- Bland Chromatin
- Normal Nucleoli
- Mitoses

**Source:**
- UCI Machine Learning Repository: [Breast Cancer Wisconsin (Original) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original))

### 3. Breast Cancer Coimbra Dataset

**Description:**
- **Attributes:** 9 numeric features.
- **Classes:** 2 (healthy controls or breast cancer patients).
- **Samples:** 116 instances.

**Features:**
- Age
- BMI
- Glucose
- Insulin
- HOMA
- Leptin
- Adiponectin
- Resistin
- MCP-1

**Source:**
- UCI Machine Learning Repository: [Breast Cancer Coimbra Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra)

These datasets are widely used for research and educational purposes to develop, test, and benchmark machine learning algorithms. They provide an excellent foundation for exploring classification techniques and understanding the patterns associated with breast cancer detection and diagnosis.
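The Diagnostic (WDBC) version also ships with scikit-learn, which is the quickest way to get started:

from sklearn.datasets import load_breast_cancer

# 569 samples, 30 features, binary malignant/benign target
data = load_breast_cancer()
X, y = data.data, data.target

print('Data shape:', X.shape)
print('Classes:', data.target_names)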

The raw data files can also be fetched with browser automation, here using the `rpa` (TagUI for Python) package:

import rpa as r

r.init()
r.url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/")
r.click("breast-cancer.data")
r.close()

ImageNet dataset

The pretrained network we’ll explore here was trained on a subset of the ImageNet 
dataset (http://imagenet.stanford.edu). ImageNet is a very large dataset of over 
14 million images maintained by Stanford University. All of the images are labeled 
with a hierarchy of nouns that come from the WordNet dataset 
(http://wordnet.princeton.edu), which is in turn a large lexical database of the 
English language.
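A sketch of loading such an ImageNet-pretrained network with torchvision (the weights enum shown requires torchvision 0.13 or newer; older versions use pretrained=True instead):

from torchvision import models

# Download a ResNet-50 pretrained on ImageNet and switch to inference mode
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
model.eval()

# The weights object also carries the matching input preprocessing pipeline
preprocess = weights.transforms()
print(model.fc)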



