Data Processing using Python Generators

Hello my friends, in this entry we are going to talk about data processing with Python using a pipeline approach. For a better understanding, you can go back to the previous entry, where I explained iterators and generators. In this entry we move forward and apply generators to process data with pipelines.
But first, let's talk about comprehensions.

Python Comprehension

A comprehension is a way of creating a sequence from an iterable in a simple one-line statement. There are several kinds of comprehensions.

Python List Comprehension

For example, if we want to build a list with the squares of the first 10 positive integers, we can do it like this:
>>> squares = [x ** 2 for x in range(1, 11)]
>>> squares
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

We can also add an if clause to a list comprehension. Let's build the same list of squares, but only for the even numbers.
>>> squares = [x ** 2 for x in range(1, 11) if x % 2 == 0]
>>> squares
[4, 16, 36, 64, 100]

Pretty cool, right? Keep in mind that the if clause can be used in data processing to filter elements based on a condition. Another thing to point out is that you can apply some kind of element-wise operation (here, x ** 2) to each element before storing the resulting list in a variable.
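To make the filter and the element-wise operation concrete, here is a small sketch that builds the same list with an explicit for loop and checks that both forms agree. The name squares_loop is only used for this illustration.

```python
# List comprehension: a filter (if) plus an element-wise operation (x ** 2).
squares = [x ** 2 for x in range(1, 11) if x % 2 == 0]

# Equivalent explicit loop.
squares_loop = []
for x in range(1, 11):
    if x % 2 == 0:                    # the filter
        squares_loop.append(x ** 2)   # the element-wise operation

print(squares == squares_loop)
```

Both versions do the same work; the comprehension simply states it in one line.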

Python Dictionary Comprehension

You can also create dictionaries using the same approach. Let's look at a quick example in which we build a not very meaningful dictionary holding names, keyed by their position:
>>> names = ['John', 'Caleb', 'Matthew', 'Johan', 'David']
>>> data = {k: v for k, v in enumerate(names)}
>>> data
{0: 'John', 1: 'Caleb', 2: 'Matthew', 3: 'Johan', 4: 'David'}


Not that cool, but you can see how this could be useful to you.
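One case where a dictionary comprehension might come in handy is building a lookup table. Here is a minimal sketch using the same five names; name_lengths and positions are illustrative names chosen for this example.

```python
names = ['John', 'Caleb', 'Matthew', 'Johan', 'David']

# Map each name to its length.
name_lengths = {name: len(name) for name in names}

# Reverse lookup: from a name back to its position in the list.
positions = {name: index for index, name in enumerate(names)}

print(name_lengths['Matthew'])
print(positions['Johan'])
```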

Python Generator Comprehension

Finally, you can create generators using this same approach (formally, this is called a generator expression). In the last entry I used functions to yield the elements of a sequence on the fly. If each element can be produced by a one-line expression, as in the list example, you can build a generator just as easily. Let's turn the list example into a generator: just replace the square brackets [] in the statement with round brackets (). It's that simple.
>>> squares = (x ** 2 for x in range(1, 11))
>>> squares
<generator object <genexpr> at 0x02B74828>
>>> next(squares)
1
>>> next(squares)
4
>>> for s in squares:
         s
 
         
9
16
25
36
49
64
81
100
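Two properties of generators are worth keeping in mind before building pipelines: they produce values lazily, and they can be consumed only once. The following sketch illustrates both. The exact sizes reported by sys.getsizeof depend on your Python version, so no numbers are shown; as_list and as_gen are illustrative names.

```python
import sys

as_list = [x ** 2 for x in range(1, 100001)]
as_gen = (x ** 2 for x in range(1, 100001))

# The list stores every element; the generator only stores its own state.
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))

# A generator is exhausted after one pass.
first_pass = sum(as_gen)
second_pass = sum(as_gen)   # nothing left to iterate, so the sum is 0
print(first_pass == sum(as_list), second_pass)
```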

Cool, right? Now that you have taken a look at this, let's move on to pipelines.

Generators as Pipelines

Remember that generators are a good fit when you want to process a lot of data without loading it all into memory, using small and readable code. Sometimes you need to process a source of data element by element and apply several processing steps. Creating a separate for loop for each of these steps is not very Pythonic (remember the Zen of Python). Instead, you can build a pipeline in which each pipe creates a new generator from the generator fed by the previous pipe. Let's explain this with a small graphic.

As you can see, you want to take your data from a raw source, which could come from the Internet, a database or a simple CSV file, to a final formatted output such as another database, JSON or XML. Several conversions or algorithms need to be applied to each element of this data, so you split the work into pipes. Each pipe is a different processing operation, and each one returns a generator built from another generator. With this approach you get more modular and maintainable code.
Another thing to point out: if a data operation takes more than one line of code, don't worry. For that pipe you can write a function-based generator, and mix functions and comprehensions in the same pipeline.
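As a rough sketch of this idea, the following toy pipeline mixes a function-based generator with generator expressions. The data and the names raw, parse, evens and scaled are made up for this illustration and are not part of the Pokemon example.

```python
def parse(lines):
    """Function-based pipe: more than one line of work per element."""
    for line in lines:
        line = line.strip()
        if not line:          # skip empty lines
            continue
        yield int(line)

raw = ["1\n", "2\n", "\n", "3\n", "4\n"]     # stands in for a real source

numbers = parse(raw)                         # pipe 1: function generator
evens = (n for n in numbers if n % 2 == 0)   # pipe 2: filter
scaled = (n * 10 for n in evens)             # pipe 3: transform

# Nothing has been processed yet; the work happens only when we iterate.
result = list(scaled)
print(result)
```

Each pipe only knows about the generator it receives, so a step can be added, removed or reordered without touching the others.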
To explain this better, let's code an example. We are going to process a file that contains information about Pokemon; you can find the file in this link.
The file contains 721 entries of Pokemon, and each line holds comma-separated values for these properties, in order:
·         Number
·         Name
·         Type 1
·         Type 2
·         Total
·         HP
·         Attack
·         Defense
·         Sp. Atk
·         Sp. Def
·         Speed
·         Generation
·         Legendary
Now that we have discussed the source format, let's talk about the output format. How about dictionaries with the properties as keys and the corresponding values? Yep, that could work. But what if we convert each of these dictionaries into JSON? Even better.
Another thing we could do is convert the numbers from their string representation into actual Python integers.
So let's enumerate the pipes:
1.       Data reading: In this pipe we generate each line of the CSV file.
2.       Data packing: In this pipe we convert each line of comma-separated values into a dictionary.
3.       Data conversion: In this pipe we convert the dictionary values that represent integers into actual integers.
4.       Data formatting: In this pipe we convert each dictionary to JSON.
And there you have it: 4 generators in a pipeline that, as a whole, generates each output element on the fly. Mind-blowing.
Let's code the first generator of the pipeline.
# data_process1.py
 
pokefile = open("pokemon.csv")
pokefile.readline()
 
pokelines = (line.strip() for line in pokefile)

In this code we first read one line to get rid of the headers of the CSV file, and then apply the string strip method to each remaining line to remove the surrounding whitespace, including the trailing newline "\n". Now let's code the next pipe.
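The snippet above leaves the file open until the program ends. One possible variation, shown here as a sketch rather than as part of the original script, wraps the reading pipe in a generator function so that the file is closed once all lines have been consumed. The function name read_lines is an assumption, and a small temporary file stands in for pokemon.csv so the example runs on its own.

```python
import os
import tempfile

def read_lines(path):
    """Yield each data line of a CSV file, skipping the header row."""
    with open(path) as csvfile:
        next(csvfile, None)            # discard the header line, if any
        for line in csvfile:
            yield line.strip()

# Build a tiny sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("#,Name,Type 1\n1,Bulbasaur,Grass\n2,Ivysaur,Grass\n")
    sample_path = tmp.name

sample_lines = list(read_lines(sample_path))
os.remove(sample_path)
print(sample_lines)
```

The rest of the pipeline would not need to change, since read_lines still hands over one stripped line at a time.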
def process_lines(pokelines):

    # Column names, in the same order as the values on each line.
    keys = ["Number", "Name", "Type 1", "Type 2",
            "Total", "HP", "Attack", "Defense",
            "Sp. Atk", "Sp. Def", "Speed", "Generation",
            "Legendary"]

    for line in pokelines:

        values = line.split(",")

        # Pair each key with its value and build a dictionary.
        yield dict(zip(keys, values))

pokedicts = process_lines(pokelines)

This one takes more than one line, so we build a function that yields each dictionary. We use a pretty cool trick here: the zip function pairs keys and values in the same order, and dict builds a dictionary from those pairs.
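To see what zip does here, this small sketch packs one sample line by hand. The line is a shortened record with only three fields, written just for this illustration.

```python
keys = ["Number", "Name", "Type 1"]
line = "25,Pikachu,Electric"

values = line.split(",")
pairs = list(zip(keys, values))   # (key, value) tuples in matching order
record = dict(pairs)

print(pairs)
print(record)
```

Note that every value is still a string at this point, which is why a conversion pipe is needed afterwards.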
Now let's code the next pipe.
def process_dicts(pokedicts):

    number_keys = ["Number", "Total", "HP", "Attack",
                   "Defense", "Sp. Atk", "Sp. Def", "Speed",
                   "Generation"]

    for pokemon in pokedicts:

        # Convert the numeric fields from strings to integers.
        for key in number_keys:
            pokemon[key] = int(pokemon[key])

        # bool() of any non-empty string is True, even bool("False"),
        # so compare against the text instead. This assumes the file
        # stores the flag as the literal strings "True" and "False".
        pokemon["Legendary"] = pokemon["Legendary"] == "True"

        yield pokemon

pokeconverts = process_dicts(pokedicts)

Pretty much the same thing, but we have now split our program into several functions so we can maintain them more easily. Finally, let's convert each element into JSON.
import json

pokejson = (json.dumps(pokemon, indent=4) for pokemon in pokeconverts)

Let's print the first 5 elements. None of these five Pokemon is legendary, and the order of the keys may differ depending on your Python version.
>>> for i in range(5):
         print(next(pokejson))

{
    "Name": "Bulbasaur",
    "Generation": 1,
    "Sp. Atk": 65,
    "HP": 45,
    "Sp. Def": 65,
    "Type 2": "Poison",
    "Number": 1,
    "Type 1": "Grass",
    "Attack": 49,
    "Defense": 49,
    "Legendary": false,
    "Total": 318,
    "Speed": 45
}
{
    "Name": "Ivysaur",
    "Generation": 1,
    "Sp. Atk": 80,
    "HP": 60,
    "Sp. Def": 80,
    "Type 2": "Poison",
    "Number": 2,
    "Type 1": "Grass",
    "Attack": 62,
    "Defense": 63,
    "Legendary": false,
    "Total": 405,
    "Speed": 60
}
{
    "Name": "Venusaur",
    "Generation": 1,
    "Sp. Atk": 100,
    "HP": 80,
    "Sp. Def": 100,
    "Type 2": "Poison",
    "Number": 3,
    "Type 1": "Grass",
    "Attack": 82,
    "Defense": 83,
    "Legendary": false,
    "Total": 525,
    "Speed": 80
}
{
    "Name": "VenusaurMega Venusaur",
    "Generation": 1,
    "Sp. Atk": 122,
    "HP": 80,
    "Sp. Def": 120,
    "Type 2": "Poison",
    "Number": 3,
    "Type 1": "Grass",
    "Attack": 100,
    "Defense": 123,
    "Legendary": false,
    "Total": 625,
    "Speed": 80
}
{
    "Name": "Charmander",
    "Generation": 1,
    "Sp. Atk": 60,
    "HP": 39,
    "Sp. Def": 50,
    "Type 2": "",
    "Number": 4,
    "Type 1": "Fire",
    "Attack": 52,
    "Defense": 43,
    "Legendary": false,
    "Total": 309,
    "Speed": 65
}

As you can see, each of these elements is generated on the fly, from the first pipe to the last. This keeps memory use low, because only one element travels through the pipeline at a time instead of the whole file being loaded at once.
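One way to convince yourself that the pipeline is lazy is to record when each stage runs and then pull only a couple of elements. The sketch below uses a few in-memory lines reduced to three fields instead of the real file, and traced and log are hypothetical helpers written only for this illustration.

```python
import json
from itertools import islice

log = []

def traced(stage, items):
    """Pass items through unchanged, recording when each one flows by."""
    for item in items:
        log.append(stage)
        yield item

source = ["1,Bulbasaur,45", "2,Ivysaur,60", "3,Venusaur,80"]
keys = ["Number", "Name", "HP"]

lines = traced("read", source)
dicts = traced("pack", (dict(zip(keys, line.split(","))) for line in lines))
converted = traced(
    "convert",
    ({**d, "Number": int(d["Number"]), "HP": int(d["HP"])} for d in dicts),
)
as_json = (json.dumps(d) for d in converted)

# Pull only the first two elements; the third line is never touched.
first_two = list(islice(as_json, 2))
print(first_two[0])
print(log)
```

The log shows the stages interleaving element by element (read, pack, convert, then read again), rather than one stage finishing before the next one starts.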
Well my friends, we have come to the end of this entry. I really enjoyed writing this one, and I hope you can use some of this code in your programming and turn your old processing code into pipelines.
Thanks for coming by. Stop by the comments if you want to discuss this, share it on your social media and subscribe. Cheers, my friends.
