Python Tales - The Python Tutor Blog

A Python Programming Blog, from a Pythoneer to Pythoneers, created by The Python Tutor.

Monday, August 14, 2017

Data Processing using Python Generators

Hello my friends! In this entry we are going to talk about data processing with Python, using a pipeline approach. For a better understanding, you can go back to the previous entry, where I explained iterators and generators. In this entry we move forward and apply generators to process data with pipelines.
But first, let's talk about comprehensions.

Python Comprehension

A comprehension is a way of building a collection from an iterable in a single, simple expression. There are several kinds of comprehensions.

Python List Comprehension

For example, if we want to build a list with the squares of the first 10 positive integers, we can do it like this:
>>> squares = [x ** 2 for x in range(1, 11)]
>>> squares
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

We can also add an if clause to filter elements. Let's build the same squares again, but keep only the squares of the even numbers.
>>> squares = [x ** 2 for x in range(1, 11) if x % 2 == 0]
>>> squares
[4, 16, 36, 64, 100]

Pretty cool, right? Keep in mind that the if clause can be used in data processing to filter elements based on a condition. Another thing to point out is that you can build the list and apply some kind of element-wise operation before assigning the result to a variable.
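To make the filtering idea a bit more concrete, here is a small sketch that applies both a condition and an element-wise operation in one comprehension, next to the equivalent for loop. The `names` list is just sample data chosen for illustration.

```python
# Sample data, chosen only for illustration.
names = ["John", "Caleb", "Matthew", "Johan", "David"]

# Filter (names longer than 4 characters) and transform (upper case) in one step.
long_names_upper = [n.upper() for n in names if len(n) > 4]

# The same result written as an explicit loop, for comparison.
long_names_loop = []
for n in names:
    if len(n) > 4:
        long_names_loop.append(n.upper())

print(long_names_upper)
```

Both versions build the same list; the comprehension just says it in one line.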

Python Dictionary Comprehension

You can also create dictionaries using the same approach. Let's look at a quick example in which we build a not very meaningful dictionary holding names.
>>> names = ['John', 'Caleb', 'Matthew', 'Johan', 'David']
>>> data = {k:v for k, v in enumerate(names)}
>>> data
{0: 'John', 1: 'Caleb', 2: 'Matthew', 3: 'Johan', 4: 'David'}

Not that cool, but you can imagine how this could be useful to you.
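As a hedged illustration of where this might be useful, the sketch below builds two lookup tables from the same names: one that inverts the index-to-name mapping, and one that uses an if clause to keep only some entries. The variable names `index_by_name` and `short_name_lengths` are made up for this example.

```python
names = ["John", "Caleb", "Matthew", "Johan", "David"]
data = {k: v for k, v in enumerate(names)}

# Invert the mapping: look up the position of a name instead of the name at a position.
index_by_name = {name: idx for idx, name in data.items()}

# Dictionary comprehensions accept an if clause too:
# map each name to its length, keeping only names of 5 characters or fewer.
short_name_lengths = {name: len(name) for name in names if len(name) <= 5}

print(index_by_name["Matthew"])
print(short_name_lengths)
```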

Python Generator Comprehension

Finally, you can create generators using this same approach. In the last entry I used functions or methods to yield the elements of a sequence on the fly. If each element can be produced in a single expression, as we did with the list example, you can build a generator just as easily. Let's turn the list example into a generator: just replace the square brackets [] in the statement with round brackets (). It is that simple.
>>> squares = (x ** 2 for x in range(1, 11))
>>> squares
<generator object <genexpr> at 0x02B74828>
>>> next(squares)
1
>>> next(squares)
4
>>> for s in squares:
         print(s)
9
16
25
36
49
64
81
100

Cool, right? Now that you have taken a look at this, let's move on to pipelines.
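Two properties of generators matter for what follows: they do not store their elements, and they can only be consumed once. The sketch below is one way to observe both. It uses `sys.getsizeof`, which only reports the size of the container object itself, so treat the comparison as a rough indication rather than a precise measurement.

```python
import sys

# The list stores all 1000 results; the generator stores only its current state.
squares_list = [x ** 2 for x in range(1, 1001)]
squares_gen = (x ** 2 for x in range(1, 1001))

list_is_bigger = sys.getsizeof(squares_list) > sys.getsizeof(squares_gen)

# A generator can be passed to anything that accepts an iterable, such as sum().
total = sum(squares_gen)

# After that the generator is exhausted: iterating again produces nothing.
leftover = list(squares_gen)

print(list_is_bigger, total, leftover)
```

If you need the values more than once, build a list or create the generator again.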

Generators as Pipelines

Remember that generators are a better fit when you want to process a lot of data with small, readable code and without loading everything into memory. Sometimes you need to process a data source element by element and apply several processing steps. Writing a separate for loop for each of these steps is not very Pythonic (remember the Zen of Python). Instead, you can build a pipeline in which each pipe creates a new generator from the generator fed by the previous pipe. Let's explain this with a small graphic.

As you can see, you want to take your data from a raw source, which could come from the Internet, a database or a simple CSV file, to a final formatted output such as another database, JSON or XML. But several conversions or algorithms need to be applied to each element of this data. You split the work into pipes, where each pipe is a different processing operation and each one returns a generator built from another generator. With this approach you get more modular and maintainable code.
Another thing to point out: if a data operation takes more than one line of code, don't feel down. For that pipe you can write a function-based generator, and mix functions and comprehensions in the same pipeline.
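Here is a minimal sketch of that idea on a tiny in-memory source, mixing two generator expressions with one function-based generator. The `raw_lines` data and the `parse_items` helper are invented for this illustration and are not part of the Pokemon example.

```python
# Hypothetical raw input, made up for illustration.
raw_lines = ["  3 apples ", "", "10 pears", "  7 plums"]

# Pipe 1: a generator expression that strips surrounding whitespace.
stripped = (line.strip() for line in raw_lines)

# Pipe 2: a generator expression with a filter that drops empty lines.
non_empty = (line for line in stripped if line)

# Pipe 3: a function-based generator, because this step needs more than one line.
def parse_items(lines):
    for line in lines:
        count, name = line.split(" ", 1)
        yield {"name": name, "count": int(count)}

items = parse_items(non_empty)

# Nothing has been processed yet; consuming the last pipe pulls
# each element through all the previous pipes, one at a time.
result = list(items)
print(result)
```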
To explain this better, let's code a complete example. We are going to process a file that contains information about Pokemon; you can find the file in this link.
The file contains 721 entries of Pokemon, and each line holds comma-separated values for these properties, in order:
·         Number
·         Name
·         Type 1
·         Type 2
·         Total
·         HP
·         Attack
·         Defense
·         Sp. Atk
·         Sp. Def
·         Speed
·         Generation
·         Legendary
Now that we have discussed the source format, let's talk about the output format. How about dictionaries with the properties as keys and the corresponding values? Yep, that could work, but what if we convert each of these dictionaries into JSON? Even better.
Another thing we could do is convert numbers in string representation into actual Python integers.
So let's enumerate the pipes:
1.       Data reading: In this pipe we generate each line of the CSV file.
2.       Data packing: In this pipe we convert a single line of comma-separated values into a dictionary.
3.       Data conversion: In this pipe we convert each value inside the dictionaries that represents an integer.
4.       Data formatting: In this pipe we convert each dictionary to JSON.
And there you have it: 4 generators in a pipeline that, as a whole, generates each output element on the fly. Mind blowing.
Let's code the first generator from the pipeline.
pokefile = open("pokemon.csv")
next(pokefile)  # read and discard the header line
pokelines = (line.strip() for line in pokefile)

In this code we read the first line up front to get rid of the CSV headers, and apply the string strip method to remove surrounding whitespace, including the trailing newline "\n", from each line. Now let's code the next pipe.
def process_lines(pokelines):
    # Column names, in the same order as the values in each line
    keys = ["Number", "Name", "Type 1",  "Type 2",
            "Total", "HP", "Attack", "Defense",
            "Sp. Atk", "Sp. Def", "Speed", "Generation",
            "Legendary"]
    for line in pokelines:
        values = line.split(",")
        yield dict(zip(keys, values))
pokedicts = process_lines(pokelines)

This one takes more than one line, so we build a function that yields each dictionary. We use a pretty cool trick here: the zip function pairs keys and values in the same order, and dict turns those pairs into a dictionary.
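To see the zip trick in isolation, here is a small sketch using only the first three columns. It also shows `csv.DictReader` from the standard library as one possible alternative that takes the keys from the header line; this is not what the pipeline in this entry uses, just an option worth knowing. The sample lines are shortened versions of the file's rows.

```python
import csv

keys = ["Number", "Name", "Type 1"]
values = "1,Bulbasaur,Grass".split(",")

# zip pairs each key with the value in the same position...
pairs = list(zip(keys, values))

# ...and dict turns those pairs into a dictionary.
record = dict(pairs)

# Alternative sketch: csv.DictReader accepts any iterable of lines
# and uses the first line as the keys.
sample_lines = ["Number,Name,Type 1", "1,Bulbasaur,Grass", "4,Charmander,Fire"]
rows = [dict(row) for row in csv.DictReader(sample_lines)]

print(pairs)
print(record)
print(rows)
```

A plain split(",") works for this file as described, while the csv module would also cope with quoted values that contain commas.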
Now let's code the next pipe.
def process_dicts(pokedicts):
    # Columns whose values are integers stored as strings
    number_keys = ["Number", "Total", "HP", "Attack",
                   "Defense", "Sp. Atk", "Sp. Def", "Speed",
                   "Generation"]
    for pokemon in pokedicts:
        for key in number_keys:
            pokemon[key] = int(pokemon[key])
        # bool("False") is True, because any non-empty string is truthy,
        # so compare the text instead (this assumes the file stores "True"/"False")
        pokemon["Legendary"] = pokemon["Legendary"] == "True"
        yield pokemon
pokeconverts = process_dicts(pokedicts)

Pretty much the same thing, but we have now split our program into several functions so we can maintain them more easily. Finally, let's convert each element into JSON.
import json
pokejson = (json.dumps(pokemon, indent=4) for pokemon in pokeconverts)

Let's print the first 5 elements
>>> for i in range(5):
         print(next(pokejson))
{
    "Name": "Bulbasaur",
    "Generation": 1,
    "Sp. Atk": 65,
    "HP": 45,
    "Sp. Def": 65,
    "Type 2": "Poison",
    "Number": 1,
    "Type 1": "Grass",
    "Attack": 49,
    "Defense": 49,
    "Legendary": false,
    "Total": 318,
    "Speed": 45
}
{
    "Name": "Ivysaur",
    "Generation": 1,
    "Sp. Atk": 80,
    "HP": 60,
    "Sp. Def": 80,
    "Type 2": "Poison",
    "Number": 2,
    "Type 1": "Grass",
    "Attack": 62,
    "Defense": 63,
    "Legendary": false,
    "Total": 405,
    "Speed": 60
}
{
    "Name": "Venusaur",
    "Generation": 1,
    "Sp. Atk": 100,
    "HP": 80,
    "Sp. Def": 100,
    "Type 2": "Poison",
    "Number": 3,
    "Type 1": "Grass",
    "Attack": 82,
    "Defense": 83,
    "Legendary": false,
    "Total": 525,
    "Speed": 80
}
{
    "Name": "VenusaurMega Venusaur",
    "Generation": 1,
    "Sp. Atk": 122,
    "HP": 80,
    "Sp. Def": 120,
    "Type 2": "Poison",
    "Number": 3,
    "Type 1": "Grass",
    "Attack": 100,
    "Defense": 123,
    "Legendary": false,
    "Total": 625,
    "Speed": 80
}
{
    "Name": "Charmander",
    "Generation": 1,
    "Sp. Atk": 60,
    "HP": 39,
    "Sp. Def": 50,
    "Type 2": "",
    "Number": 4,
    "Type 1": "Fire",
    "Attack": 52,
    "Defense": 43,
    "Legendary": false,
    "Total": 309,
    "Speed": 65
}

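As you can see, each one of these elements is generated on the fly, from the first pipe to the last. This keeps memory use low, because only one element travels through the pipeline at a time instead of the whole file being loaded, and the first results are available without waiting for the rest of the data to be processed.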
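If you want to check this laziness yourself, the following sketch is one way to do it. It records every line that the first pipe actually hands out, then takes only five results from the end of the pipeline with `itertools.islice`. The `read_lines` helper, the `lines_read` list and the fake source are all invented for this demonstration.

```python
from itertools import islice

lines_read = []  # records which lines were actually pulled from the source

def read_lines(source):
    for line in source:
        lines_read.append(line)
        yield line

# A stand-in for a large file: 1000 made-up lines.
source = ["line %d" % i for i in range(1000)]

# A second pipe on top of the first one.
upper = (line.upper() for line in read_lines(source))

# Take only the first five results from the end of the pipeline.
first_five = list(islice(upper, 5))

print(first_five)
print(len(lines_read))
```

Only the lines needed for those five results pass through the first pipe; the remaining lines of the source are never touched.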
Well my friends, we have come to the end of this entry. I really enjoyed writing this one, and I hope you can use some of this code in your programming and turn your old processing code into pipelines.
Thanks for stopping by my new entry. I hope you have enjoyed it as much as I enjoyed writing it. Stop by the comments if you want to discuss any of this, share it on your social media and subscribe. Cheers my friends.