Python Tales - The Python Tutor Blog

A Python Programming Blog, from a Pythoneer to Pythoneers, created by The Python Tutor.

Monday, September 11, 2017

Handling Relational Database with Python (Part I - Setting Database Up)

Hello my fellows Pythoneers in this new entry I’m going to talk about a cool topic, and that is handling SQL database systems with Python, but first let’s talks as usual about some background theory so we can put you in context.

Relational Database

A really quick definition about this concept is that al digital data is organized and stored as separated groups of data that a have relations on somehow between these groups, in the same storage, normally called database, we could have several data groups, these data groups in terms of SQL are called tables, and each table have unique records of information, these records are called rows.
So structurally speaking it would be something like this.
Database -> Table -> Row
Tables are organized as concerns in your applications, each table have a fixed  number of columns in which each column represent a feature of attribute of those elements that we want to save in each table. And finally every row is a unique entry. We can understand these concepts better by approaching with an example.


SQL stands for Structured Query Language, and this is not a programming language like Python or Javascript, but instead is a language for querying a database system, most of the relational databases use this query language as the only and main language.
This language gives you syntax and constructs so you can cleverly consult, insert, update and delete information in a logical and meaningful way, you can use a large set of statements to build a smart query for achieving a desired result, and we can see some examples later in this entry.

Database systems and Database management systems

Even that this topics are not so much of programming concerns as a programmer you need to be in somehow proficient in every aspect of computer science or informatics, and have some knowledge and practice with these concepts.
A database system is a software that you could install in your local computer, this software will automatically handle the way in which data is stored in the hard drive, this is not like saving information in a CSV file or in a Excel Workbook, one of the advantages of using a database system is that a large amount of data is to be stored and we want to retrieve information out of it very quickly, so the database system not only handles this optimized storage but also gives you a drive, engine or service so you can query the data to obtain valuable information, pretty cool right? There several other advantages by using a database system like users and permissions but here we will focus only in the basics.
Also we have database management systems; these are software that usually comes with the database system installation that you could use for managing the data base with the use of a graphical user interface.
In this entry we are going to build a simple database populating it with three main CSV files.
·         albums.csv
·         genres.csv
·         tracks.csv
You can click on any of these links to download the respective file.
One thing to point out, the information in all of these CSV files are music tracks, I took them from a college assignment of programming, from a friend that I helped with.
These tracks they could be gathered into a single CSV file, but they are separated to construct a schema that could be implemented in a database, so what is a schema? Is the way we can split data based on relations so we can store in separated tables. Why? Well for instance several tracks could have the same album name, a very long string, some bytes that you will store in each record or row in the tracks table right? for small databases this could be good enough but when our database tends to be really large, millions of records those bytes could be a size problem in terms of storage, so what can we do about it to optimize this, well can replace the album name by a unique identifier, that could take less bytes that the entire album name, and then hold in a separated table the full album name with  the same identifier, this identifier also called ID, takes one column for each table, en these columns are the relations that bound one table to another, that is why it is called relational database, in the albums table you only take one record for the album name instead of several times, as much tracks as the album perhaps has.
The same approach could be applied for genres, so if you open these CSV files you can see that this schema is already in place, but if you take a closer look you can see that you could even better optimize this data by replacing the names of artists by unique ID and build a new table with artists names only.
But let's keep it simple for now, we want just to build these three tables and make some queries on it.
First let's talk about the database system that we are going to use, for simplicity we are going to use sqlite, this instead of a database system is just a way of saving data into a single file, but query it with programming languages, for each programming language there must be a library that supports sqlite, in Python this library can be imported with sqlite3 module.

Python Functions

First we are going to build 3 functions for populating the database with each CSV file.
·         setup_tracks
·         setup_genres
·         setup_albums
Each one of these functions takes two arguments, the first one db, which is a database filename, and the second one data_file, which is the file handler for the respective CSV file, remember that a file handler is an object returned by the open function. Let's take a look at the setup_tracks function.
import sqlite3
import csv

def setup_tracks(dbdata_file):
    '''(str, file) -> NoneType
    Create and populate the Tracks table in the given database db with
    the data from the open file data_file.

    def get_sec(time_str):

            h, m, s = time_str.split(':')
            return int(h) * 3600 + int(m) * 60 + int(s)


            m, s = time_str.split(':')
            return int(m) * 60 + int(s)
    create connection
    con = sqlite3.connect(db)
    create cursor
    cur = con.cursor()

        cur.execute('DROP TABLE Tracks')


    cur.execute('CREATE TABLE IF NOT EXISTS Tracks(Title TEXT, ID INTEGER, Time INTEGER)')

    reader = csv.DictReader(data_file)

    for row in reader:

        track_title = row["Track Title"]
        track_id = row["Track ID"]
        time = get_sec(row["Time"])

        cur.execute("INSERT INTO Tracks VALUES(?,?,?)", (track_titletrack_id, time,))

    close everything

As you can see first we import our two main modules, csv for CSV file handling and sqlite3. Then in our setup_tracks function, we start by creating a connection object, this bound a sqlite database file to a Python local variable, and with the cursor method we create a cursor instance that can be used for sql query executions.
The first try and except statement is used for deleting the table Tracks in case it exists, so we can have a clean and fresh database setup each time we execute this function,  as you can see we use the execute method from the cursor created, this method takes a string which is a SQL query syntax, they are pretty much readable for this example. For the second execution we create a new table with the next attributes
·         Title (Text)
·         ID (Integer)
·         Time (Integer)
Each one with its respective data type, this is very important in Python variables type are defined at runtime dynamically, in SQL each attribute type requires to be specified at the moment the table is created.
The commit method applies the executions to the database, every time we use the execute method this queries are saved to a queue, but they are not entirely applied to the database we need to commit them.
The using the DictReader function from csv module we can create a list of dictionaries based on the CSV file header, so we use a for loop and dictionary keys access to take each value of each attribute for every track in the CSV file.
It could be kind of strange the third execution statement, as you can see we want to insert a new record to the Tracks table but what about those "?" questions marks, well the query must be a string, so you can convert and concatenate each attribute value and build a query, but this is not safe, an injection attack could be made with a dangerous attribute value, let me explain this with a cool and funny comic.

Taken from xkcd
As you can see there must be a safer way to execute sql statements from a programming language, well for Python this is one way, we use "?" in the respective positions, and in the second argument we provide a tuple with the values ordered in their respective value, in this way the execute method will take a closer look if the query is safe or not.
After we commit and we are done with the updates we close both the cursor and the connection objects.
In this function we used a local function for converting time formats, from h:mm:ss to seconds, or mm:ss to seconds.
For the others two functions we take the same approach let's take a look at them.
def setup_genres(db, data_file):
    '''(str, file) -> NoneType
    Create and populate the Genres table in the given database db with
    the data from the open file data_file.
    # create connection
    con = sqlite3.connect(db)
    # create cursor
    cur = con.cursor()
        cur.execute('DROP TABLE Genres')
    cur.execute('CREATE TABLE IF NOT EXISTS Genres(Artist TEXT, Genre TEXT)')
    reader = csv.DictReader(data_file)
    for row in reader:
        artist = row["Artist"]
        genre = row["Genres"]
        cur.execute("INSERT INTO Genres VALUES(?,?)", (artist, genre,))
    # close everything
def setup_albums(db, data_file):
    '''(str, file) -> NoneType
    Create and populate the Albums table in the given database db with
    the data from the open file data_file.
    # create connection
    con = sqlite3.connect(db)
    # create cursor
    cur = con.cursor()
        cur.execute('DROP TABLE Albums')
    cur.execute('CREATE TABLE IF NOT EXISTS Albums(ID INTEGER, Artist TEXT, Album TEXT)')
    reader = csv.DictReader(data_file)
    for row in reader:
        track_id = row["Track ID"]
        artist = row["Artist"]
        album_title = row["Album Title"]
        cur.execute("INSERT INTO Albums VALUES(?,?,?)", (track_id, artist, album_title,))
    # close everything

With these three functions we can populate our database file from the CSV files, we can take a main script for this
if __name__ == '__main__':
    # PART I: Set up the music database
    music_db = 'music.db'
    setup = False
    # Once the database has been set up, set setup to False to skip this step
    # while testing your other functions
    if setup:
        tracks_file = open("tracks.csv")
        albums_file = open("albums.csv")
        genres_file = open("genres.csv")
        setup_tracks(music_db, tracks_file)
        setup_albums(music_db, albums_file)
        setup_genres(music_db, genres_file)
        #pass # call your setup functions here

After executing this string you will have a music.db file in the same directory as the Python script file, also you must save all three CSV files in this same directory. Alright so for now we have our database filled with our tracks, what can we do about? Anything, remember that databases are the core store for almost every application out there in the world and in the Internet, but for now we will make some basic and simple queries, in the second part of this entry we will build a web application so we can interact with this database, so keep in touch if you want to learn more.
For a basic query we are going to create a simple function that will return the results from our database SQL execution.
def run_query(db, query, args=None):
    '''(str, str [, tuple]) -> list of tuple
    Return the results of running the given SQL query on database db.
    The args parameter is optional; if given, args contains the query arguments.
    # create connection
    con = sqlite3.connect(db)
    # create cursor
    cur = con.cursor()
    # run the query
    if args == None:
        # case when no args are passed
        # args has a value which should be a tuple
        cur.execute(query, args)
    # fetch result
    result = cur.fetchall()
    # close everything
    return result

With this function we can play a little with our database, the main difference with the other functions is that after we make our execution we can fetch data from the cursor with the fetchall method, this will return a list of tuples, and each tuple is a record with the selected attributes from the table that we used in our query, cool right? Let's do some queries with Python
>>> music_db = "music.db"
>>> output = run_query(music_db, 'SELECT Album FROM Albums')
>>> output[0][0]
'All This Bad Blood'
>>> len(output)
>>> output = run_query(music_db, 'SELECT Title FROM Tracks WHERE ID = ?', (85,))
>>> output
[('Next Lover',)]
>>> output[0]
('Next Lover',)
>>> output[0][0]
'Next Lover'

As you can see in each run_query  call a list of tuples is returned, so accessing a string value we might need to use double square brackets.
So this is it my friends I hope you have found this entry useful, so keep in touch if you want to see the second part of this entry, we will cover web application with this database, so comment, subscribe, follow my facebook page and share with your friends, best regards.