Web Scraping SET50 Constituents with Python

Jan 25, 2021

Launched on August 16, 1995, SET50 has been the bedrock on which the Thai capital market is built. Financial products, including indices, funds, derivatives, and many capital market innovations yet to come, stem from the index.

Therefore, having a script that pulls the SET50 constituents from the official SET website is fundamental to any successful model or analysis built on the index itself.

In this article, I will show you how to:

Automatically download constituents from official website
Turn the PDF file into text
Convert the text into ready-to-use table and excel file

All examples and code are written in Python 3 and run in Google Colaboratory.

1. Automatically download constituents from official website

First, go to the link below, where you will see links to the constituents for the first and second half of each year. (Sometimes there are more than two files per year, if there are interim revisions.)

SET - Market Data - Constituents

www.set.or.th

Second, copy the URL for the period you are interested in.

1H21 URL Example

Then, paste the URL into Python code like this.

from pathlib import Path
import requests

filename = Path('SET50_100_H2_2020.pdf')  # set up file name
url = 'http://www.set.or.th/th/market/files/constituents/SET50_100_H2_2020.pdf'  # paste URL here
response = requests.get(url)
filename.write_bytes(response.content)

The saved file will appear in Google Colaboratory like this.
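Before moving on, it can be worth checking that the download really is a PDF, since a server may return an HTML error page instead of the file. The snippet below is a minimal sketch of such a check; `looks_like_pdf` and `save_if_pdf` are hypothetical helper names introduced here for illustration and are not part of `requests` or of the original code.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # a PDF file starts with this header


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes begin with the PDF file header."""
    return content.startswith(PDF_MAGIC)


def save_if_pdf(content: bytes, path: Path) -> Path:
    """Write content to path, refusing anything that is not a PDF."""
    if not looks_like_pdf(content):
        raise ValueError(f"{path.name}: response does not look like a PDF")
    path.write_bytes(content)
    return path
```

With the download code above, one option is to call `response.raise_for_status()` right after `requests.get(url)` and then use `save_if_pdf(response.content, filename)` in place of `filename.write_bytes(response.content)`.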

2. Turn the PDF file into text

For this section, I rely on the ‘PyPDF2’ library, based on Farooq’s code from the link below.

Python for Pdf

medium.com

Our PDF file will look like this.

Now, let’s turn the PDF file into a list of text lines that we can iterate over.

Install PyPDF2 before importing it.

pip install PyPDF2

From the code in above blocks, the result will look like this.
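As a rough sketch of this extraction step (not necessarily the exact code used for this article), the text of every page can be read and split into a flat list called `lines`. This assumes PyPDF2 2.0 or newer, where the reader class is `PdfReader` and pages expose `extract_text()`; older releases use different names. It also assumes the extracted text separates items with newlines, which depends on how the PDF was produced. The helper names `split_into_lines` and `extract_lines` are hypothetical.

```python
def split_into_lines(page_texts):
    """Flatten per-page text into one list of non-blank lines."""
    lines = []
    for text in page_texts:
        # Keep each line exactly as extracted; only blank lines are dropped.
        lines.extend(line for line in text.splitlines() if line.strip())
    return lines


def extract_lines(pdf_path):
    """Read every page of the PDF and return its text as a list of lines."""
    # Imported here so split_into_lines can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.0 or newer

    reader = PdfReader(str(pdf_path))
    # extract_text() may give an empty result for pages without text.
    return split_into_lines(page.extract_text() or "" for page in reader.pages)
```

Usage would then be `lines = extract_lines('SET50_100_H2_2020.pdf')`. Whitespace inside each line is left untouched, so any later filtering has to match the strings exactly as they were extracted.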

3. Convert the text into ready-to-use table and excel file

As the result shows, the information we are interested in starts at item #12 of our text list. Furthermore, each security has four important features that we want to store in our table: No., Ticker, Company Name, and Sector.

Now that we understand our data structure, we can start coding to store that data.
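One way to express this structure in code is to skip the header items and then cut the rest of the list into groups of four. The sketch below assumes that "#12" means zero-based index 12 and that the list is called `lines`; `rows_from_lines` is a hypothetical helper, and the company names and sectors in the example are placeholders, not real data.

```python
COLUMNS = ("No.", "Ticker", "Company Name", "Sector")


def rows_from_lines(lines, start=12, width=len(COLUMNS)):
    """Group lines[start:] into consecutive tuples of `width` items."""
    data = lines[start:]
    # Stop at the last complete group so a short tail never yields a partial row.
    last = len(data) - len(data) % width
    return [tuple(data[i:i + width]) for i in range(0, last, width)]


# Toy input: 12 header items, two complete records, and one leftover item.
example = ["header"] * 12 + [
    "1", "BAM", "Company A", "Sector X",
    "2", "COM7", "Company B", "Sector Y",
    "leftover",
]
rows = rows_from_lines(example)
print(rows)
```

Passing the result to `pandas.DataFrame(rows, columns=COLUMNS)` would give a table with the four columns. Because the grouping is purely positional, a single stray item in the middle of the list shifts every row after it.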

After running the code, the table will look a bit messy because we have not cleaned the data yet. Several items should not be in our text list at all; left in, they will break our loop. The texts inside the red boxes should be excluded from our data before running the code.

All unwanted texts that should be excluded from data

Remove unwanted texts with the code below

# remove unnecessary text that will interrupt our loop
lines = [e for e in lines if e not in (
    'Reserve list of SET50 ',
    'SET50 Inclusion: BAM, COM7 & DELTA',
    'SET50 Exclusion: BPP, IRPC & WHA',
    'PCL',
)]

Then, run the loop again, and run the code below to remove the SET50 reserve list (by removing duplicates in the ‘No.’ column).

set50_constituents.drop_duplicates(subset=['No.'],inplace=True) #drop duplicates | to clear out reserve list

The resulting DataFrame is shown in the picture below.
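To see why dropping duplicates clears the reserve list, here is a small sketch on toy data. It assumes the reserve list comes after the main list and restarts its numbering at 1; the tickers `RSV1` and `RSV2` and the `List` column are placeholders made up for this example.

```python
import pandas as pd

# Toy rows: the main list (No. 1-3) followed by a reserve list that restarts at 1.
table = pd.DataFrame(
    [
        ("1", "BAM", "main"),
        ("2", "COM7", "main"),
        ("3", "DELTA", "main"),
        ("1", "RSV1", "reserve"),
        ("2", "RSV2", "reserve"),
    ],
    columns=["No.", "Ticker", "List"],
)

# keep="first" is the default: the first row seen for each 'No.' survives.
deduped = table.drop_duplicates(subset=["No."])
print(deduped)
```

The trick only works when the reserve numbers also occur in the main list and the main list comes first, so it is worth checking the row count afterwards.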

Now, rows #55 to #59 are the remaining unwanted data. I will delete those rows, cast the ‘No.’ column to float, set the ‘No.’ column as the index, and sort by it.

set50_constituents = set50_constituents[:-4]  # drop bottom 4
set50_constituents = set50_constituents.astype({"No.": float})
set50_constituents = set50_constituents.set_index('No.')
set50_constituents = set50_constituents.sort_index()  # sort by 'No.'
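Slicing off a fixed number of rows depends on the exact layout of one PDF. As an alternative sketch (a different technique from the one used above), leftover rows can be dropped by keeping only the rows whose ‘No.’ value parses as a number. The toy frame and its values are made up for illustration, and casting to `int` instead of `float` is a choice made here.

```python
import pandas as pd

# Toy frame: three real rows out of order, then two leftover text rows.
raw = pd.DataFrame({
    "No.": ["2", "1", "3", "Note:", "Source"],
    "Ticker": ["COM7", "BAM", "DELTA", "x", "y"],
})

# Non-numeric values become NaN instead of raising an error.
numeric_no = pd.to_numeric(raw["No."], errors="coerce")
is_row = numeric_no.notna()

clean = raw[is_row].copy()
clean["No."] = numeric_no[is_row].astype(int)
clean = clean.set_index("No.").sort_index()
print(clean)
```

This keeps working if a later file has more or fewer trailing lines, at the cost of silently dropping any genuine row whose number failed to extract cleanly.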

After that, we can export it to Excel format (.xlsx) to be used in our analysis or model construction.

set50_constituents.to_excel('SET50_2H20_constituents.xlsx')

Thanks for reading :) I hope this article helps you with the tedious work of collecting SET50 constituents.

Here is the full code