Contents
Commands
where python
whereis python
idle
py -V
py -3 --version
ls -l
ls -lha
ls -lh
clear
ls
dir
cd
pwd
conda
where conda
conda -V
conda list
conda env list
conda info --envs
conda create -n yourenvname python=x.x anaconda
conda create -p C:\Apps\Anaconda3\envs\yourenvname python==3.10.11
conda activate C:\Apps\Anaconda3\envs\yourenvname
pip install -r "<Folder>\requirements.txt"
pip3 list
python -m venv yourenvname
cd yourenvname
source bin/activate
deactivate
py -m pip install --upgrade pip
# to update Spyder kernels, open conda shell
(base) C:\Users\hy11>conda install spyder-kernels=2.1
(base) C:\Users\hy11>pip install spyder-kernels==2.1.*
# create the venv (on Windows)
pip install virtualenv  # only needed if you use virtualenv instead of the built-in venv
python -m venv yourenvname
yourenvname\Scripts\activate
deactivate
# If your virtual environment is in a directory called 'venv':
rm -r venv
pip install jupyter
pip install nbconvert
jupyter nbconvert testnotebook*.ipynb --to python
jupyter nbconvert testnotebook.ipynb --to script
jupyter nbconvert testnotebook.ipynb testnotebook1.ipynb testnotebook2.ipynb --to python
WRDS
Important: The WRDS Cloud compute nodes (wrds-sas1, wrds-sas36, etc.) are not Internet-accessible, so you must use the two WRDS Cloud head nodes (wrds-cloud-login1-h or wrds-cloud-login2-h) to upload packages to your WRDS home directory. Once uploaded, however, you can use them on the WRDS Cloud compute nodes as normal.
https://wrds-www.wharton.upenn.edu/pages/support/the-wrds-cloud/using-ssh-connect-wrds-cloud/
WRDS offers JupyterHub in its cloud on Linux machines; in theory it is faster and allows longer runtimes, which is its main benefit. However, some setup details are required.
The Jupyter terminal has no outbound connectivity to the internet, so you cannot create a virtual environment from it.
Instead, you need to access the login node using a PuTTY SSH connection.
Open the PuTTY app and log in to
wrds-cloud.wharton.upenn.edu
using your WRDS login ID and password.
Create a virtual environment directory using:
python3 -m venv --copies ~/virtualenv
Activate the virtualenv:
source ~/virtualenv/bin/activate
and install the desired packages.
Once you are on a login node and your virtual environment is activated, you can install your packages. You must also install the ipykernel package.
Once you have installed your packages and Jupyter, you must create a custom Jupyter kernel. Do this by running the following command and giving it a name, remembering this name for later use:
pip install ipykernel
python3 -m ipykernel install --user --name=<kernel_name>
pip install -r "<Tidy-Finance-with-Python Folder>\requirements.txt"
Log in to JupyterHub and start a JupyterLab instance. Click the “+” in the upper left-hand corner to start a new launcher. In the first row of the launcher tab, “Notebooks,” you will see a new kernel with the name you gave your kernel. Click it to start a notebook. This notebook uses the virtual environment you set up, including the packages you installed. Notice that under “Notebook” and “Console” there are several kernels, including the one you created (named my-jupyter-env in this example). Launching this will use the virtual environment you created the kernel from.
Google Colab Pro
https://colab.research.google.com/notebooks/pro.ipynb#scrollTo=SKQ4bH7qMGrA
Making the Most of your Colab Subscription
Faster GPUs
Users who have purchased one of Colab’s paid plans have access to premium GPUs. You can upgrade your notebook’s GPU settings in Runtime > Change runtime type in the menu to enable Premium accelerator. Subject to availability, selecting a premium GPU may grant you access to a V100 or A100 Nvidia GPU.
The free of charge version of Colab grants access to Nvidia’s T4 GPUs subject to quota restrictions and availability.
You can see what GPU you’ve been assigned at any time by executing the following cell. If the execution result of running the code cell below is “Not connected to a GPU”, you can change the runtime by going to Runtime > Change runtime type in the menu to enable a GPU accelerator, and then re-execute the code cell.
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)
In order to use a GPU with your notebook, select the Runtime > Change runtime type menu, and then set the hardware accelerator dropdown to GPU.
More memory
Users who have purchased one of Colab’s paid plans have access to high-memory VMs when they are available.
You can see how much memory you have available at any time by running the following code cell. If the execution result of running the code cell below is “Not using a high-RAM runtime”, then you can enable a high-RAM runtime via Runtime > Change runtime type in the menu. Then select High-RAM in the Runtime shape dropdown. After, re-execute the code cell.
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))
if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')
Longer runtimes
All Colab runtimes are reset after some period of time (which is faster if the runtime isn’t executing code). Colab Pro and Pro+ users have access to longer runtimes than those who use Colab free of charge.
Background execution
Colab Pro+ users have access to background execution, where notebooks will continue executing even after you’ve closed a browser tab. This is always enabled in Pro+ runtimes as long as you have compute units available.
Relaxing resource limits in Colab Pro
Your resources are not unlimited in Colab. To make the most of Colab, avoid using resources when you don’t need them. For example, only use a GPU when required and close Colab tabs when finished.
If you encounter limitations, you can relax those limitations by purchasing more compute units via Pay As You Go. Anyone can purchase compute units via Pay As You Go; no subscription is required.
Set-ups
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from google.colab import drive
drive.mount('/content/drive')
You can use %%capture to prevent output from cluttering your notebook, especially if a cell produces a lot of text or warnings that you don't need to see.
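For example, a minimal sketch (the package name is only a placeholder):
%%capture pip_log
!pip install some_noisy_package  # hypothetical package; the output is captured instead of printed
# later, pip_log.show() replays the captured output if you need it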
import pkg_resources
import pip
import sys
installedPackages = {pkg.key for pkg in pkg_resources.working_set}
required = {'nltk', 'spacy', 'textblob', 'gensim'}
missing = required - installedPackages
if missing:
    !pip install nltk==3.4
    !pip install textblob==0.15.3
    !pip install gensim==3.8.2
    !pip install -U SpaCy==2.2.0
    !python -m spacy download en_core_web_lg
!pip install pandas==2.2.1 # restart the session
!pip install pyarrow
help(sum)
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
# Jupyter uses forward slashes to access folders in a path. For example:
path = 'C:/Users/hy11/Documents/......../data/facebook.csv'
# or path = 'C:\\Users\\hy11\\Documents\\........\\data\\facebook.csv'
numpy.inf: IEEE 754 floating point representation of (positive) infinity.
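A quick illustration of how np.inf behaves:
import numpy as np
np.isinf(np.array([1.0, np.inf, -np.inf]))  # array([False,  True,  True])
np.inf > 1e308  # True: infinity compares greater than any finite float
1 / np.inf  # 0.0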
import numpy as np
np.set_printoptions(precision=2)
import pandas as pd
pd.set_option('display.max_rows', 20)
pd.options.display.max_rows = 100
pd.options.display.float_format = '{:,.2f}'.format
# How to see more data
with pd.option_context('display.min_rows', 30, 'display.max_columns', 82):
    display(df.query('`ColumnA`.isna()'))
from google.colab import files
uploaded = files.upload()
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/data/')
os.getcwd()
os.listdir()
file_list = os.listdir()  # avoid naming this "list", which shadows the built-in
number_files = len(file_list)
print(number_files)
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
# Restart the kernel
import pandas as pd
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='Heart Disease', html={'style': {'full_width': True}})
profile.to_notebook_iframe()
Magic commands
%time: Time the execution of a single statement
%timeit: Time repeated execution of a single statement for more accuracy
%prun: Run code with the profiler
%lprun: Run code with the line-by-line profiler
%memit: Measure the memory use of a single statement
%mprun: Run code with the line-by-line memory profiler
%cd: Change the current working directory.
%conda: Run the conda package manager within the current kernel.
%dirs: Return the current directory stack.
%load_ext: Load an IPython extension by its module name.
%pwd: Return the current working directory path.
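For example, the two timing magics can be compared on the same statement:
%time sum(range(1_000_000))  # a single run, reports wall and CPU time
%timeit sum(range(1_000_000))  # many repeated runs, reports an average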
Tuple, List, Dictionary
my_list = [1, 2, 3, 4]  # avoid naming a variable "list"; it shadows the built-in
len(my_list)
output:
4
number_list = [2,4,6,8,10,12]
print(number_list[::2])
output:
[2, 6, 10]
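Tuples and dictionaries, also named in this section's heading, in a minimal sketch (names are illustrative):
point = (3, 4)  # tuple: immutable sequence
x, y = point  # tuple unpacking
ages = {'Joe': 42, 'Ann': 35}  # dictionary: key-value pairs
ages['Joe']  # 42
ages.get('Bob', 0)  # 0 (default returned when the key is missing)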
Datetime
# from string to datetime or date
from datetime import datetime
datetime.strptime('2009-01-01', '%Y-%m-%d')
output: datetime.datetime(2009, 1, 1)
datetime.strptime('2009-01-01', '%Y-%m-%d').date()
output: datetime.date(2009, 1, 1)
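The reverse direction, datetime to string, uses strftime; a short sketch:
from datetime import datetime
datetime(2009, 1, 1).strftime('%Y-%m-%d')
output: '2009-01-01'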
Function
def simulate_process(is_stationary: bool) -> np.ndarray:
# -> np.ndarray indicates that the function returns a value of type np.ndarray,
# which is the type NumPy uses to represent arrays.
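A minimal sketch of such an annotated function (the name and body are illustrative, not the original simulate_process):
import numpy as np
def simulate_white_noise(n: int = 100) -> np.ndarray:
    # returns an array of n draws from a standard normal distribution
    return np.random.normal(size=n)
simulate_white_noise(5).shape  # (5,)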
I/O
The tarfile module makes it possible to read and write tar archives, including those using gzip, bz2 and lzma compression. Use the zipfile module to read or write .zip files, or the higher-level functions in shutil.
import tarfile
tar = tarfile.open("sample.tar.gz")
tar.extractall(filter='data')
tar.close()
The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more. urllib is part of the Python Standard Library and offers basic functionality for working with URLs and HTTP requests. urllib3 is a third-party library that provides a more extensive feature set for making HTTP requests and is suitable for more complex web interaction tasks. It must be installed separately.
import urllib.request
with urllib.request.urlopen('http://www.python.org/') as f:
    print(f.read(300))
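Since urllib3 is described above but not shown, here is a rough equivalent of the same request using urllib3 (assuming it is installed):
import urllib3
http = urllib3.PoolManager()
resp = http.request('GET', 'http://www.python.org/')
print(resp.status)  # HTTP status code
print(resp.data[:300])  # first 300 bytes of the response body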
import urllib.request
min_date = '1963-07-31'
max_date = '2020-03-01'
ff_url = 'https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip'
urllib.request.urlretrieve(ff_url, 'factors.zip')
!unzip -a factors.zip
!rm factors.zip
with open('text.txt') as f:
    text_in = f.read()
print(text_in)
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/ames-housing-dataset.zip'
df = pd.read_csv(url, engine='pyarrow', dtype_backend='pyarrow')
# removing a file
os.remove("file_name.txt")
# renaming a file
file_name = "python.txt"
os.rename(file_name, 'Python1.txt')
# to tell the path
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks/data/')
import sys
print(sys.version)
# Listing out all the paths
print(sys.path)
# print out all modules imported
sys.modules
print('%s is %d years old' % ('Joe', 42))
Output: Joe is 42 years old
print('We are the {} who say "{}!"'.format('knights', 'Ni'))
Output: We are the knights who say "Ni!"
Data Wrangling – Numpy / Pandas
np.random.seed(42) Setting the random seed is important for reproducibility. By fixing the seed to a specific value (in this case, 42), you ensure that the sequence of random numbers generated by NumPy will be the same every time you run your code, assuming that other sources of randomness in your code are also controlled or fixed.
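A quick check that fixing the seed reproduces the same draws:
import numpy as np
np.random.seed(42)
a = np.random.rand(3)
np.random.seed(42)
b = np.random.rand(3)
np.array_equal(a, b)  # True: identical draws after resetting the seed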
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
>>> np.ones(5)
array([1., 1., 1., 1., 1.])
>>> np.ones((2, 1))
array([[1.],
       [1.]])
>>> np.c_[np.array([1,2,3]), np.array([4,5,6])]
array([[1, 4],
       [2, 5],
       [3, 6]])
df.dtypes
df.shape
df.iloc[3:5]
df.loc[(df["Col_A"] == "brown") & (df["Col_B"] == "blue"), ["Col_D", "Col_Z"]]
# data type conversion
df['year'].astype(str)
df['number'].astype(float)
y = y.astype(np.uint8)
# We can alternatively use glob, which directly allows pathname matching.
# For example, if we only want Excel .xlsx files:
data_path = os.path.join(os.getcwd(), 'data_folder')
from glob import glob
glob(os.path.join(data_path, '*.xlsx'))
output:
'/content/drive/MyDrive/Colab Notebooks/data/output.xlsx',
'/content/drive/MyDrive/Colab Notebooks/data/beta_file.xlsx',
# index
df = df.set_index("Column 1")
df.index.name = "New Index"
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# remove own index with default index
df.reset_index(inplace=True, drop=True)
# to rename
data.rename(columns={'OLD': 'NEW'}, inplace=True)
# reformat column names
df.columns = df.columns.str.lower().str.replace(" ", "_")
df["column_name"].str.lower()
# to print memory usage
memory_per_column = df.memory_usage(deep=True) / 1024 ** 2
df_cat = df.copy()
Slices can be passed by name using .loc[startrow:stoprow:step, startcolumn:stopcolumn:step] or by position using .iloc[start:stop:step, start:stop:step].
df.loc[2::10, "name":"gender"]
.isna() .notna()
Always use pd.isnull() or pd.notnull() when conditioning on missing values. This is the most reliable.
df[pd.isnull(df['Column1'])]
df[pd.notnull(df['Column2'])].head()
df.query('ColumnA.isna()')
try:
    df = pd.read_csv(url, skiprows=0, header=0)
    print("Data loaded successfully.")
except Exception:
    print("Error loading sheet")
    # If an error occurs, halt further execution
    raise
if __name__ == "__main__":
    main()
# use the break and continue statements
for x in range(5, 10):
    if x == 7:
        break
    if x % 2 == 0:
        continue
    print(x)
pd.concat([data_FM.head(), data_FM.tail()])
pd.concat([df.head(10), df.tail(10)])
# Categoricals - Pandas 2
df.select_dtypes('string')  # or 'string[pyarrow]'
# Categoricals
df.select_dtypes('string').describe().T
What is type and dtype in Python?
The type of a NumPy array is numpy.ndarray ; this is just the type of Python object it is (similar to how type("hello") is str for example). dtype just defines how bytes in memory will be interpreted by a scalar (i.e. a single number) or an array and the way in which the bytes will be treated (e.g. int / float ).
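A short illustration of the distinction:
import numpy as np
arr = np.array([1, 2, 3])
type(arr)  # <class 'numpy.ndarray'> (the Python object type)
arr.dtype  # dtype('int64') on most platforms (how the bytes are interpreted)
type("hello")  # <class 'str'>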
# Convert string columns to the 'category' data type to save memory.
(df
 .select_dtypes('string')
 .memory_usage(deep=True)
 .sum()
)
(df
 .select_dtypes('string')
 .astype('category')
 .memory_usage(deep=True)
 .sum()
)
# Missing numeric columns (and strings in Pandas 1)
df.isna().mean().mul(100).pipe(lambda ser: ser[ser > 0])
# sample five rows
df.sample(5)
# SORT_VALUES
oo.sort_values(by=['Edition', 'Athlete'])
# Count unique values
oo['NOC'].value_counts(ascending=True)
oo['NOC'].unique()
oo[(oo['Medal'] == 'Gold') & (oo['Gender'] == 'Women')]
oo[oo['Athlete'] == 'PHELPS, Michael']['Event']
oo["Athlete"].value_counts().sort_values(ascending=False).head(10)
df['Label'].value_counts().plot(kind='bar')
# or
df.groupby('Label').size().plot(kind='bar')
oo[oo.Athlete == 'PHELPS, Michael'][['Event', 'City', 'Edition']]
# or
oo[oo['Athlete'] == 'PHELPS, Michael'][['Event', 'City', 'Edition']]
# Delete a single column from the DataFrame
data = data.drop(labels="deaths", axis=1)
# Delete multiple columns from the DataFrame
data = data.drop(labels=["deaths", "deaths_per_million"], axis=1)
# Note that the "labels" parameter is by default the first, so
# the above lines can be written slightly more concisely:
data = data.drop("deaths", axis=1)
data = data.drop(["deaths", "deaths_per_million"], axis=1)
# Delete a single named column from the DataFrame
data = data.drop(columns="cases")
# Delete multiple named columns from the DataFrame
data = data.drop(columns=["cases", "cases_per_million"])
pd.concat([s1, s2], axis=1)
pd.concat([s1, s2], axis=1).reset_index()
a.to_frame().join(b)
# Delete column numbers 1, 2 and 5 from the DataFrame
# Create a list of all column numbers to keep
columns_to_keep = [x for x in range(data.shape[1]) if x not in [1, 2, 5]]
# Delete columns by column number using iloc selection
data = data.iloc[:, columns_to_keep]
# Delete a single row by index value 0
data = data.drop(labels=0, axis=0)
# Delete a few specified rows at index values 1, 15, 20.
# Note that the index values do not always align to row numbers.
data = data.drop(labels=[1, 15, 20], axis=0)
# Delete a range of rows - index values 40-44
data = data.drop(labels=range(40, 45), axis=0)
# The labels parameter name can be omitted, and axis is 0 by default
# Shorter versions of the above:
data = data.drop(0)
data = data.drop([0, 15, 20])
data = data.drop(range(10, 20))
data.shape
output: (238, 11)
# Keep only the first 100 rows.
data = data[:100]
data.shape
output: (100, 11)
data = data[10:20]
data.shape
output: (10, 11)
# divide multiple columns by a number efficiently
df_ff[['MKT_RF', 'SMB', 'HML', 'RMW', 'CMA', 'RF']] /= 100.0
# handling missing variables
df = df.append(another_row)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat([df, another_row]) instead
.loc .iloc
data_ml.loc[data_ml["R12M_Usd"] > 50, ["stock_id", "date", "R12M_Usd"]]
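As a positional counterpart to the label-based selection above, an .iloc sketch (the row and column positions are illustrative):
data_ml.iloc[0:5, 0:3]  # rows 0-4 and the first three columns, selected by position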
Lambda
def add(x, y):
    return x + y
This is the same as:
add = lambda x, y: x + y
print(add(3, 5))
Output: 8
# LAMBDA
adult_data['Label'] = adult_data['Salary'].map(lambda x: 1 if '>50K' in x else 0)
df.assign(Total_Salary=lambda x: df['Salary'] + df['Bonus'])
(lambda x, y: x * y)(3, 4)
Output: 12
str1 = 'Texas'
reverse_upper = lambda string: string.upper()[::-1]
print(reverse_upper(str1))
Output: SAXET
is_even_list = [lambda arg=x: arg * 10 for x in range(1, 5)]
for item in is_even_list:
    print(item())
Output: 10 20 30 40
>>> (lambda x: x + 1)(2)
3
list1 = ["1", "2", "9", "0", "-1", "-2"]
# sort list[str] numerically using sorted()
# and custom sorting key using lambda
print("Sorted numerically:", sorted(list1, key=lambda x: int(x)))
Output:
Sorted numerically: ['-2', '-1', '0', '1', '2', '9']
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [0, 0, 0, 0, 0]})
print(df)
df['col3'] = df['col1'].apply(lambda x: x * 10)
df
output:
col1 col2
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
col1 col2 col3
0 1 0 10
1 2 0 20
2 3 0 30
3 4 0 40
4 5 0 50
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'IT'],
    'Salary': [70000, 80000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)
# 1. Increase salary by 10% using lambda
df['New Salary'] = df['Salary'].apply(lambda x: x * 1.10)
# 2. Check if the new salary is greater than 80,000 using lambda
df['Above 80K'] = df['New Salary'].apply(lambda x: 'Yes' if x > 80000 else 'No')
Plot
df.plot.scatter("X", "Y", alpha=0.5)
# In Matplotlib it is possible to change styling settings globally with runtime configuration (rc) parameters.
# The default Matplotlib styling configuration is set with matplotlib.rcParams.
# This is a dictionary containing formatting settings and their values.
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (15, 10)
mpl.rcParams["font.family"] = "monospace"
mpl.rcParams["font.family"] = "sans serif"
# Matplotlib comes with a selection of available style sheets.
# These define a range of plotting parameters and can be used to apply those parameters to your plots.
import matplotlib.pyplot as plt
plt.style.available
plt.style.use("dark_background")
plt.style.use("ggplot")
plt.style.use('fivethirtyeight')
plt.tight_layout()
# check types
df.dtypes
# string to int
df['string_col'] = df['string_col'].astype('int')
# If you want to convert a column to numeric, use pd.to_numeric():
df['length'] = pd.to_numeric(df['length'])
df['length'].dtypes
import multiprocessing
multiprocessing.cpu_count()
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.style.use('seaborn')
import plotly.express as px
fig = px.line(df, x="lifeExp", y="gdpPercap")
fig.show()
import plotly.express as px
fig = px.line(x1, x2, width=1000, height=480, title = 'Returns')
fig.show()
# Direct bokeh output to the notebook so plots show up inline.
import bokeh.io
bokeh.io.output_notebook()
from bokeh.plotting import figure, show
p = figure(title='Returns', x_axis_label='Date', y_axis_label='GOOG', height=400, width=800)
p.line(x, y1, color='firebrick', line_width=2)
p.line(x, y2, color='navy', line_width=2)
show(p)
fig, ax = plt.subplots()
ax.plot(train['date'], train['data'], 'g-.', label='Train')
ax.plot(test['date'], test['data'], 'b-', label='Test')
ax.set_xlabel('Date')
ax.set_ylabel('Earnings per share (USD)')
ax.axvspan(80, 83, color='#808080', alpha=0.2)
# ax.axvspan() highlights a vertical span from x=80 to x=83.
# color='#808080' sets the color to a shade of gray.
# alpha=0.2 sets the transparency of the highlighted area.
ax.legend(loc=2)  # loc=2 places the legend in the upper left corner of the plot.
plt.xticks(np.arange(0, 85, 8), [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980]);  # the semicolon suppresses the output
# plt.xticks() sets custom ticks on the x-axis.
# np.arange(0, 85, 8) generates values from 0 to 84 in steps of 8.
# The second argument provides custom labels for these ticks.
fig.autofmt_xdate()
# automatically formats the x-axis labels for better readability, often rotating them
# to prevent overlap.
plt.tight_layout();
import altair as alt
base1 = alt.Chart(m, width=800, height=400).encode(x='Date', y="GOOG_Returns")
base2 = alt.Chart(m, width=800, height=400).encode(x='Date', y="Strategy_Returns")
base1.mark_line(color='gray') + base2.mark_line(color='navy')
https://www.tomasbeuzen.com/python-programming-for-data-science/chapters/chapter9-wrangling-advanced.html
RPA
import rpa
rpa.init(visual_automation=True, chrome_browser=True)
rpa.url("https://ourworldindata.org/coronavirus-source-data")
rpa.click("here")
rpa.click("CSV")
rpa.close()
!pip install rpa
import rpa as r
r.init()
r.url("https://www.urbizedge.com/custom-shape-map-in-power-bi/")  # navigate to a website using the rpa bot
r.close()
r.init()
r.url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/")
r.click("breast-cancer.data")
r.init()
r.url('https://www.google.com')
r.type('//*[@name="q"]', 'USA[enter]')
print(r.read('result-stats'))
r.snap('page', 'USA.png')
r.close()  # closes the RPA-controlled Chrome session
Text
type(clean_words)
list
clean_words
['thank', 'professor', 'wieland',
" ".join(clean_words)
'thank professor wieland introduction thank institute monetary financial stability