Snippets

Commands

where python
whereis python
idle 
py -V
py -3 --version

ls -l
ls -lha
ls -lh
clear
ls
dir
cd
pwd

conda
where conda
conda -V 
conda list

conda env list
conda info --envs

conda create -n yourenvname python=x.x anaconda
conda create -p C:\Apps\Anaconda3\envs\yourenvname python==3.10.11  
conda activate C:\Apps\Anaconda3\envs\yourenvname
pip install -r "<Folder>\requirements.txt"
pip3 list 

python -m venv yourenvname 
cd yourenvname 
source bin/activate 
deactivate 

py -m pip install --upgrade pip 

# to update Spyder kernels, open conda shell 
(base) C:\Users\hy11>conda install spyder-kernels=2.1 
(base) C:\Users\hy11>pip install spyder-kernels==2.1.* 

# create the venv 
pip install virtualenv   # optional; the built-in venv module below is usually enough
python -m venv yourenvname 
yourenvname\Scripts\activate 
deactivate 

# If your virtual environment is in a directory called 'venv': 
rm -r venv 

pip install jupyter 

pip install nbconvert 

jupyter nbconvert testnotebook*.ipynb --to python 
jupyter nbconvert testnotebook.ipynb --to script 
jupyter nbconvert testnotebook.ipynb testnotebook1.ipynb testnotebook2.ipynb --to python

WRDS

Important: The WRDS Cloud compute nodes (wrds-sas1, wrds-sas36, etc.) are not Internet-accessible, so you must use the two WRDS Cloud head nodes (wrds-cloud-login1-h or wrds-cloud-login2-h) to upload packages to your WRDS home directory. Once uploaded, however, you can use them on the WRDS Cloud compute nodes as normal.

https://wrds-www.wharton.upenn.edu/pages/support/the-wrds-cloud/using-ssh-connect-wrds-cloud/

WRDS offers JupyterHub in its cloud on Linux machines; in theory it is faster and allows longer runtimes, which is its main benefit. However, some setup details are required.

The Jupyter terminal has no outbound connectivity to the internet, so you cannot create a virtual environment that way.

Instead, you need to access the login node over an SSH connection with PuTTY.

https://wrds-www.wharton.upenn.edu/pages/support/the-wrds-cloud/using-ssh-connect-wrds-cloud/

WRDS Cloud Putty Configuration

Open the PuTTY app and log in to

wrds-cloud.wharton.upenn.edu

using your login ID and password.

Create a virtual environment directory using:
python3 -m venv --copies ~/virtualenv

Activate the virtualenv:
source ~/virtualenv/bin/activate
and install the desired packages.

Once you are on a login node and your virtual environment is activated, you can install your packages. You must also install the ipykernel package.

Once you have installed your packages and Jupyter, you must create a custom Jupyter kernel. Do this by running the following command and giving it a name, remembering this name for later use:

pip install ipykernel

python3 -m ipykernel install --user --name=<kernel_name>

pip install -r "<Tidy-Finance-with-Python Folder>\requirements.txt"

Log in to JupyterHub and start a JupyterLab instance. Click the “+” in the upper left hand corner to start a new launcher. In the first row of the launcher tab, “Notebooks,” you will see a new kernel with the name you gave your kernel. Click it to start a notebook. This notebook is using the virtual environment you set up, including the packages you installed. Notice under “Notebook” and “Console” there are several kernels, including the one created named my-jupyter-env. Launching this will use the virtual environment you created the kernel from.

https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/installing-your-own-python-packages/#introduction

Google Colab Pro

https://colab.research.google.com/notebooks/pro.ipynb#scrollTo=SKQ4bH7qMGrA

Making the Most of your Colab Subscription

Faster GPUs
Users who have purchased one of Colab’s paid plans have access to premium GPUs. You can upgrade your notebook’s GPU settings in Runtime > Change runtime type in the menu to enable Premium accelerator. Subject to availability, selecting a premium GPU may grant you access to a V100 or A100 Nvidia GPU.

The free of charge version of Colab grants access to Nvidia’s T4 GPUs subject to quota restrictions and availability.

You can see what GPU you’ve been assigned at any time by executing the following cell. If the execution result of running the code cell below is “Not connected to a GPU”, you can change the runtime by going to Runtime > Change runtime type in the menu to enable a GPU accelerator, and then re-execute the code cell.

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In order to use a GPU with your notebook, select the Runtime > Change runtime type menu, and then set the hardware accelerator dropdown to GPU.

More memory
Users who have purchased one of Colab’s paid plans have access to high-memory VMs when they are available.

You can see how much memory you have available at any time by running the following code cell. If the execution result of running the code cell below is “Not using a high-RAM runtime”, then you can enable a high-RAM runtime via Runtime > Change runtime type in the menu. Then select High-RAM in the Runtime shape dropdown. After, re-execute the code cell.

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Longer runtimes
All Colab runtimes are reset after some period of time (which is faster if the runtime isn’t executing code). Colab Pro and Pro+ users have access to longer runtimes than those who use Colab free of charge.

Background execution
Colab Pro+ users have access to background execution, where notebooks will continue executing even after you’ve closed a browser tab. This is always enabled in Pro+ runtimes as long as you have compute units available.

Relaxing resource limits in Colab Pro
Your resources are not unlimited in Colab. To make the most of Colab, avoid using resources when you don’t need them. For example, only use a GPU when required and close Colab tabs when finished.

If you encounter limitations, you can relax those limitations by purchasing more compute units via Pay As You Go. Anyone can purchase compute units via Pay As You Go; no subscription is required.

Set-ups

from IPython.core.interactiveshell import InteractiveShell 
InteractiveShell.ast_node_interactivity = "all" 

from google.colab import drive 
drive.mount('/content/drive') 

%%capture 
You can use %%capture to prevent output from cluttering your notebook, 
especially if a cell produces a lot of text or warnings that you don't 
need to see.
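For instance, putting %%capture at the top of an install cell keeps the pip output out of the notebook (a minimal sketch; the name pip_output is arbitrary):

%%capture pip_output
# everything this cell prints is stored in pip_output instead of being displayed
!pip install nltk textblob
# run pip_output.show() in a later cell if you need to inspect the logs
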
import pkg_resources
import sys
installedPackages = {pkg.key for pkg in pkg_resources.working_set}
required = {'nltk', 'spacy', 'textblob', 'gensim'}
missing = required - installedPackages
if missing:
    !pip install nltk==3.4
    !pip install textblob==0.15.3
    !pip install gensim==3.8.2
    !pip install -U spacy==2.2.0
    !python -m spacy download en_core_web_lg

!pip install pandas==2.2.1  # restart the session
!pip install pyarrow
help(sum) 

import warnings 
warnings.filterwarnings('ignore')  

import pandas as pd 
pd.options.mode.chained_assignment = None # default='warn'
# Jupyter uses forward slashes to access folders in a path. For example, 
path = 'C:/Users/hy11/Documents/......../data/facebook.csv' 
# or path = 'C:\\Users\\hy11\\Documents\\........\\data\\facebook.csv'
numpy.inf 
IEEE 754 floating point representation of (positive) infinity.
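A few quick checks of how np.inf behaves (a minimal sketch):

import numpy as np

np.inf > 1e308                        # True: larger than any finite float
-np.inf < 0                           # True
1.0 / np.inf                          # 0.0
np.isinf(np.array([1.0, np.inf]))     # array([False,  True])
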
import numpy as np
np.set_printoptions(precision = 2)

import pandas as pd
pd.set_option('display.max_rows', 20)
pd.options.display.max_rows = 100
pd.options.display.float_format = '{:,.2f}'.format
# How to see more data 
with pd.option_context('display.min_rows', 30, 'display.max_columns', 82): 
    display(df.query('`ColumnA`.isna()'))
from google.colab import files 
uploaded = files.upload()
import os 
os.chdir('/content/drive/My Drive/Colab Notebooks/data/') 
os.getcwd() 
os.listdir()

files_list = os.listdir()
number_files = len(files_list)
print(number_files)
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

# Restart the kernel

import pandas as pd
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='Heart Disease', html={'style': {'full_width': True}})
profile.to_notebook_iframe()

Magic commands

%time: Time the execution of a single statement

%timeit: Time repeated execution of a single statement for more accuracy

%prun: Run code with the profiler

%lprun: Run code with the line-by-line profiler

%memit: Measure the memory use of a single statement

%mprun: Run code with the line-by-line memory profiler

%cd: Change the current working directory.
%conda: Run the conda package manager within the current kernel.

%dirs: Return the current directory stack.

%load_ext: Load an IPython extension by its module name.

%pwd: Return the current working directory path.
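
%lprun, %memit and %mprun are not built in; they assume the line_profiler and memory_profiler packages are installed and loaded as extensions. A minimal sketch (the total() helper is just for illustration):

!pip install line_profiler memory_profiler

%load_ext line_profiler
%load_ext memory_profiler

def total(n):
    return sum(range(n))

%time total(1_000_000)              # time one execution
%timeit total(1_000_000)            # repeat the timing for a more stable estimate
%lprun -f total total(1_000_000)    # line-by-line profile of total()
%memit total(1_000_000)             # peak memory used by the statement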
 

Tuple, List, Dictionary

my_list = [1, 2, 3, 4]
len(my_list)
output:
4

number_list = [2,4,6,8,10,12]
print(number_list[::2])
output:
[2, 6, 10]
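
The heading also covers tuples and dictionaries; a minimal sketch of the basics:

point = (3, 4)                   # tuple: immutable, fixed-length
x, y = point                     # unpacking
ages = {'Joe': 42, 'Ann': 37}    # dictionary: key -> value
ages['Joe']                      # 42
ages.get('Bob', 0)               # 0 (default when the key is missing)
list(ages.keys())                # ['Joe', 'Ann']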

Datetime

# from string to datetime or date
from datetime import datetime
datetime.strptime('2009-01-01', '%Y-%m-%d')

output:
datetime.datetime(2009, 1, 1, 0, 0)

datetime.strptime('2009-01-01', '%Y-%m-%d').date()

Output:
datetime.date(2009, 1, 1)
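
Going the other way, from datetime back to string, uses strftime:

from datetime import datetime
datetime(2009, 1, 1).strftime('%Y-%m-%d')

Output:
'2009-01-01'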

Function

def simulate_process(is_stationary: bool) -> np.ndarray:

# -> np.ndarray: This indicates that the function returns a value of type np.ndarray, which is the type used by NumPy to represent arrays.
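
A minimal sketch of such an annotated function (the AR(1)-style simulation here is illustrative, not the original implementation):

import numpy as np

def simulate_process(is_stationary: bool, n: int = 100) -> np.ndarray:
    # phi < 1 gives a stationary AR(1) series; phi = 1 gives a random walk
    phi = 0.5 if is_stationary else 1.0
    eps = np.random.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

simulate_process(True).shape     # (100,)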

I/O

The tarfile module makes it possible to read and write tar archives, including those 
using gzip, bz2 and lzma compression. 
Use the zipfile module to read or write .zip  files, or the higher-level functions 
in shutil.

import tarfile
tar = tarfile.open("sample.tar.gz")
tar.extractall(filter='data')
tar.close()
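
For .zip archives, the zipfile module mentioned above works the same way (a minimal sketch; the file name is a placeholder):

import zipfile

with zipfile.ZipFile("sample.zip") as zf:
    zf.printdir()              # list the archive members
    zf.extractall("data/")     # extract everything into data/
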
The urllib.request module defines functions and classes which help in opening URLs 
(mostly HTTP) in a complex world — basic and digest authentication, redirections, 
cookies and more.
urllib is part of the Python Standard Library and offers basic functionality for 
working with URLs and HTTP requests. 
urllib3 is a third-party library that provides a more extensive feature set for
making HTTP requests and is suitable for more complex web interaction tasks. 
It must be installed separately.
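
A minimal urllib3 sketch for comparison (it must be installed separately, e.g. pip install urllib3):

import urllib3

http = urllib3.PoolManager()
resp = http.request('GET', 'https://www.python.org/')
print(resp.status)        # 200
print(resp.data[:300])    # first 300 bytes of the response body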

import urllib.request 
with urllib.request.urlopen('http://www.python.org/') as f:
     print(f.read(300))
import urllib.request
min_date = '1963-07-31'
max_date = '2020-03-01'

ff_url = 'https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip'
urllib.request.urlretrieve(ff_url, 'factors.zip')
!unzip -a factors.zip
!rm factors.zip
with open('text.txt') as f: 
     text_in = f.read() 

print(text_in)
import pandas as pd 
url = 'https://github.com/mattharrison/datasets/raw/master/data/ames-housing-dataset.zip' 
df = pd.read_csv(url, engine='pyarrow', dtype_backend='pyarrow')
#removing the file. 
os.remove("file_name.txt")

file_name = "python.txt" 
os.rename(file_name,'Python1.txt')
# to tell the path 
import sys 
sys.path.append('/content/drive/My Drive/Colab Notebooks/data/')
import sys
print(sys.version)

#Listing out all the paths
import sys
print(sys.path)

# print out all modules imported
sys.modules"
print ('%s is %d years old' % ('Joe', 42))

Output:
Joe is 42 years old

print('We are the {} who say "{}!"'.format('knights', 'Ni'))

Output:
We are the knights who say "Ni!"
 

Data Wrangling – Numpy / Pandas

np.random.seed(42)

Setting the random seed is important for reproducibility. 
By fixing the seed to a specific value (in this case, 42), 
you ensure that the sequence of random numbers generated 
by NumPy will be the same every time you run your code, 
assuming that other sources of randomness in your code are 
also controlled or fixed.
X_b = np.c_[np.ones((100, 1)), X]    # add x0 = 1 to each instance
>>> np.ones(5)
array([1., 1., 1., 1., 1.])
>>> np.ones((2, 1))
array([[1.],
       [1.]])
>>> np.c_[np.array([1,2,3]), np.array([4,5,6])]
array([[1, 4],
       [2, 5],
       [3, 6]])
 
df.dtypes
df.shape
df.iloc[3:5]
df.loc[(df["Col_A"] == "brown") & (df["Col_B"] == "blue"), ["Col_D", "Col_Z"]]

# data type conversion 
df['year'].astype(str) 
df['number'].astype(float)

y = y.astype(np.uint8)
# We can alternatively use glob, as this directly allows pathname matching. 
# For example, if we only want Excel .xlsx files:

data_path = os.path.join(os.getcwd(), 'data_folder')

from glob import glob 
glob(os.path.join(data_path, '*.xlsx'))

output: 

'/content/drive/MyDrive/Colab Notebooks/data/output.xlsx',  
'/content/drive/MyDrive/Colab Notebooks/data/beta_file.xlsx',  

# index
df = df.set_index("Column 1")

df.index.name = "New Index"

# making data frame from csv file 
data = pd.read_csv("nba.csv", index_col ="Name") 

# replace the existing index with the default integer index 
df.reset_index(inplace = True, drop = True)
# to rename 
data.rename(columns={'OLD': 'NEW'}, inplace=True)  

# reformat column names 
df.columns = df.columns.str.lower().str.replace(" ", "_")
df["column_name"].str.lower() 

# to print memory usage 
memory_per_column = df.memory_usage(deep=True) / 1024 ** 2 

df_cat = df.copy()
Slices can be passed by name using  
.loc[startrow:stoprow:step, startcolumn:stopcolumn:step] 

or by position using .iloc[start:stop:step, start:stop:step].

df.loc[2::10, "name":"gender"]
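
The positional counterpart of the same slice, using .iloc (the column positions here are illustrative):

df.iloc[2::10, 0:3]
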
.isna()
.notna()

Always use pd.isnull() or pd.notnull() when conditioning on missing values. 
This is the most reliable.

df[pd.isnull(df['Column1'])]
df[pd.notnull(df['Column2'])].head()

df.query('ColumnA.isna()')

try:
    df = pd.read_csv(url, skiprows=0, header=0)
    print("Data loaded successfully.")
except Exception:
    print("Error loading sheet")
    # If an error occurs, halt further execution
    raise

if __name__ == "__main__":
    main()
# use the break and continue statements
for x in range(5,10):
    if (x == 7): break
    if (x % 2 == 0): continue
    print (x)
pd.concat([data_FM.head(), data_FM.tail()])
pd.concat([df.head(10), df.tail(10)])
# Categoricals - Pandas 2
df.select_dtypes('string')  # or 'string[pyarrow]'
# Categoricals
df.select_dtypes('string').describe().T
What is type and dtype in Python?

The type of a NumPy array is numpy.ndarray; this is just the type of Python object it is (similar to how type("hello") is str, for example). dtype just defines how the bytes in memory will be interpreted by a scalar (i.e. a single number) or an array and the way in which the bytes will be treated (e.g. int / float).
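
A quick illustration of the difference:

import numpy as np

arr = np.array([1, 2, 3])
type(arr)        # <class 'numpy.ndarray'> -- the Python object type
arr.dtype        # dtype('int64') on most platforms -- how the bytes are interpreted
type("hello")    # <class 'str'>
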
# Convert string columns to the `'category'` data type to save memory.
(df
.select_dtypes('string')
.memory_usage(deep=True)
.sum()
)


(df
.select_dtypes('string')
.astype('category')
.memory_usage(deep=True)
.sum()
)
# Missing numeric columns (and strings in Pandas 1)

df.isna().mean().mul(100).pipe(lambda ser: ser[ser > 0])
 
# sample five rows
df.sample(5)
# LAMBDA 
adult_data['Label'] = adult_data['Salary'].map(lambda x : 1 if '>50K' in x else 0) 
df.assign(Total_Salary = lambda x: df['Salary'] + df['Bonus'])

# SORT_VALUES 
oo.sort_values(by = ['Edition', 'Athlete']) 

# Count unique values 
oo['NOC'].value_counts(ascending = True) 
oo['NOC'].unique() 
oo[(oo['Medal']=='Gold') & (oo['Gender'] == 'Women')] 
oo[oo['Athlete'] == 'PHELPS, Michael']['Event'] 
oo["Athlete"].value_counts().sort_values(ascending = False).head(10) 
df['Label'].value_counts().plot(kind='bar') # or df.groupby('Label').size().plot(kind='bar') 
oo[oo.Athlete == 'PHELPS, Michael'][['Event', 'City','Edition']]  #or oo[oo['Athlete'] == 'PHELPS, Michael'][['Event', 'City','Edition']] 


# Delete a single column from the DataFrame 
data = data.drop(labels="deathes", axis=1) 

# Delete multiple columns from the DataFrame 
data = data.drop(labels=["deaths", "deaths_per_million"], axis=1) 

# Note that the "labels" parameter is by default the first, so # the above lines can be written slightly more concisely: 
data = data.drop("deaths", axis=1) 
data = data.drop(["deaths", "deaths_per_million"], axis=1) 

# Delete a single named column from the DataFrame 
data = data.drop(columns="cases") 

# Delete multiple named columns from the DataFrame 
data = data.drop(columns=["cases", "cases_per_million"]) 
pd.concat([s1, s2], axis=1) 
pd.concat([s1, s2], axis=1).reset_index() 
a.to_frame().join(b) 

# Delete column numbers 1, 2 and 5 from the DataFrame 
# Create a list of all column numbers to keep 
columns_to_keep = [x for x in range(data.shape[1]) if x not in [1,2,5]] 

# Delete columns by column number using iloc selection 
data = data.iloc[:, columns_to_keep] 

# delete a single row by index value 0
data = data.drop(labels=0, axis=0) 

# delete a few specified rows at index values 0, 15, 20. 
# Note that the index values do not always align to row numbers. 
data = data.drop(labels=[0, 15, 20], axis=0) 

# delete a range of rows - index values 10-20 
data = data.drop(labels=range(10, 20), axis=0)  

# The labels parameter name can be omitted, and axis is 0 by default 
# Shorter versions of the above: 
data = data.drop(0) 
data = data.drop([0, 15, 20]) 
data = data.drop(range(10, 20)) 
data.shape  
output: (238, 11) 

# Delete everything but the first 100 rows. 
data = data[:100] 
data.shape  
output: (100, 11) 

data = data[10:20] 
data.shape  
output: (10, 11)
# divide multiple columns by a number efficiently
df_ff[['MKT_RF','SMB','HML','RMW','CMA','RF']] /= 100.0
# appending a row (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, another_row])

.loc .iloc

data_ml.loc[data_ml["R12M_Usd"] > 50, ["stock_id", "date", "R12M_Usd"]]

Lambda

def add(x, y):
    return x + y

This is the same as 

add = lambda x, y: x + y
print(add(3, 5))

Output: 
8
(lambda x, y: x * y)(3,4)

Output: 
12
str1 = 'Texas'

reverse_upper = lambda string: string.upper()[::-1]

print(reverse_upper(str1))

Output:
SAXET
is_even_list = [lambda arg=x: arg * 10 for x in range(1, 5)]

for item in is_even_list:
    print(item())

Output:
10
20
30
40
>>> (lambda x: x + 1)(2)
3

list1 = ["1", "2", "9", "0", "-1", "-2"]
# sort list[str] numerically using sorted()
# and custom sorting key using lambda

print("Sorted numerically:", sorted(list1, key=lambda x: int(x)))

Output:
Sorted numerically: ['-2', '-1', '0', '1', '2', '9']
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [0, 0, 0, 0, 0]})
print(df)
df['col3'] = df['col1'].apply(lambda x: x * 10)
df

output:
   col1  col2
0     1     0
1     2     0
2     3     0
3     4     0
4     5     0

   col1  col2  col3
0     1     0    10
1     2     0    20
2     3     0    30
3     4     0    40
4     5     0    50
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'IT'],
    'Salary': [70000, 80000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)

# 1. Increase salary by 10% using lambda
df['New Salary'] = df['Salary'].apply(lambda x: x * 1.10)

# 2. Check if the new salary is greater than 80,000 using lambda
df['Above 80K'] = df['New Salary'].apply(lambda x: 'Yes' if x > 80000 else 'No')


Plot

df.plot.scatter("X", "Y", alpha=0.5)
# In Matplotlib it is possible to change styling settings globally with runtime configuration (rc) parameters. 
# The default Matplotlib styling configuration is set with matplotlib.rcParams. 
# This is a dictionary containing formatting settings and their values. 
import matplotlib as mpl 
mpl.rcParams['figure.figsize'] = (15, 10) 
mpl.rcParams["font.family"] = "monospace"
mpl.rcParams["font.family"] = "sans serif"

# Matplotlib comes with a selection of available style sheets. 
# These define a range of plotting parameters and can be used to apply those parameters to your plots. 
import matplotlib.pyplot as plt 
plt.style.available 
plt.style.use("dark_background") 
plt.style.use("ggplot")
plt.style.use('fivethirtyeight')

plt.tight_layout()


# check types
df.dtypes

# string to int 
df['string_col'] = df['string_col'].astype('int')

# If you want to convert a column to numeric, I recommend using pd.to_numeric(): 
df['length'] = pd.to_numeric(df['length']) 
df['length'].dtypes


import multiprocessing
multiprocessing.cpu_count()


import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set_style('whitegrid')
plt.style.use('seaborn-v0_8')  # the plain 'seaborn' style name was removed in newer Matplotlib
import plotly.express as px
fig = px.line(df, x="lifeExp", y="gdpPercap")
fig.show()

import plotly.express as px
fig = px.line(x1, x2, width=1000, height=480, title = 'Returns')
fig.show()
# To make plots show up within this notebook, we need to direct Bokeh output to the notebook.
import bokeh.io
bokeh.io.output_notebook()

from bokeh.plotting import figure, show
p = figure(title='Returns', x_axis_label='Date', y_axis_label='GOOG', height=400, width=800)
p.line(x, y1, color = 'firebrick', line_width=2)
p.line(x, y2, color = 'navy', line_width=2)
show(p)
fig, ax = plt.subplots()

ax.plot(train['date'], train['data'], 'g-.', label='Train')
ax.plot(test['date'], test['data'], 'b-', label='Test')

ax.set_xlabel('Date')
ax.set_ylabel('Earnings per share (USD)')

ax.axvspan(80, 83, color='#808080', alpha=0.2)
# ax.axvspan() highlights a vertical span from x=80 to x=83.
# color='#808080' sets the color to a shade of gray.
# alpha=0.2 sets the transparency of the highlighted area.

ax.legend(loc=2) 
#loc=2 places the legend in the upper left corner of the plot.

plt.xticks(np.arange(0, 85, 8), [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980]);
# the semicolon suppresses the output
# plt.xticks() sets custom ticks on the x-axis.
# np.arange(0, 85, 8) generates values from 0 to 84 in steps of 8.
# The second argument provides custom labels for these ticks.

fig.autofmt_xdate() 
# automatically formats the x-axis labels for better readability, often rotating them 
# to prevent overlap.

plt.tight_layout();
import altair as alt
base1= alt.Chart(m, width=800, height=400).encode(x='Date', y="GOOG_Returns")
base2= alt.Chart(m, width=800, height=400).encode(x='Date', y="Strategy_Returns")
base1.mark_line(color='gray') + base2.mark_line(color='navy')
https://www.tomasbeuzen.com/python-programming-for-data-science/chapters/chapter9-wrangling-advanced.html

RPA

import rpa

rpa.init(visual_automation=True, chrome_browser=True)

rpa.url("https://ourworldindata.org/coronavirus-source-data")

rpa.click("here")

rpa.click("CSV")
rpa.close()


!pip install rpa

import rpa as r

r.init()

r.url("https://www.urbizedge.com/custom-shape-map-in-power-bi/")
# Navigate to the website using the rpa bot

r.close()


r.init()
r.url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/")

r.click("breast-cancer.data")


r.init()
r.url('https://www.google.com')
r.type('//*[@name="q"]', 'USA[enter]')
print(r.read('result-stats'))
r.snap('page', 'USA.png')

r.close()
# This closes the RPA session running in Chrome.

Text

type(clean_words)

list
clean_words

['thank',
 'professor',
 'wieland',
" ".join(clean_words)

'thank professor wieland introduction thank institute monetary financial stability 

Print Friendly, PDF & Email