2014-02-10

Python, matplotlib: plot the partially colored histogram

Last week was really useful in terms of new experience. I faced the problem of building colored histograms from some data, and I used the awesome matplotlib library for it. Plotting the histogram itself was quite easy, but coloring separate bars of the histogram was really challenging. I spent about 3 hours browsing the internet before I figured out a solution, and now I'm going to share it.

Before you start:
  • Download and install matplotlib following the instructions on the official web site.
  • You might be asked to install some other libraries, such as numpy or dateutil. Don't hesitate to install them.

import matplotlib.pyplot as plt
import random

def buildHist(data):
    # define window size, output and axes
    fig, ax = plt.subplots(figsize=[8,6])
    # set plot title
    ax.set_title("Colored Histogram")
    # set x-axis name
    ax.set_xlabel("Random Number")
    # set y-axis name
    ax.set_ylabel("Number of Records")
    # create histogram within output
    N, bins, patches = ax.hist(data, bins=50, color="#777777")

    # find the biggest and the smallest bin sizes once, before the loop
    max_size, min_size = max(N), min(N)
    # Iterate through all histogram elements
    # each element in this iteration is one patch on the histogram, where:
    # - bin_size - number of records in current bin
    # - bin - left edge of current bin (x-axis)
    # - patch - a rectangle, object of class matplotlib.patches.Patch
    # more details on patch properties: http://matplotlib.org/api/artist_api.html#matplotlib.patches.Patch
    for bin_size, bin, patch in zip(N, bins, patches):
        if bin_size == max_size:
            patch.set_facecolor("#FF0000")
            patch.set_label("max")
        elif bin_size == min_size:
            patch.set_facecolor("#00FF00")
            patch.set_label("min")
    # add legend to a plot
    plt.legend()
    # save plot as an image
    plt.savefig("hist.png")
    # show plot
    plt.show()

if __name__ == "__main__":
    data = [random.randint(0,1000) for i in range(0, 1000)]
    buildHist(data)



As a result you will see a nice histogram in which the tallest bar is red and the shortest bar is green.
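
The color choice can also be separated from the drawing, which makes it easier to check without opening a window. The sketch below is only one possible arrangement: the names pick_bar_colors and plot_colored_hist and the file name hist_sketch.png are made up for this illustration, and the off-screen "Agg" backend is an assumption about where the script runs.

```python
def pick_bar_colors(counts, base="#777777", high="#FF0000", low="#00FF00"):
    """Return one color per bin: `high` for the fullest bins, `low` for the emptiest."""
    biggest, smallest = max(counts), min(counts)
    colors = []
    for count in counts:
        if count == biggest:
            colors.append(high)
        elif count == smallest:
            colors.append(low)
        else:
            colors.append(base)
    return colors


def plot_colored_hist(data, path="hist_sketch.png", bins=50):
    """Draw a histogram of `data`, recolor its bars and save it to `path`."""
    # Imported here so that pick_bar_colors stays usable without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # off-screen backend (assumption: no display needed)
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 6))
    counts, edges, patches = ax.hist(data, bins=bins)
    # ax.hist returns one patch per bin, in the same order as the counts.
    for patch, color in zip(patches, pick_bar_colors(list(counts))):
        patch.set_facecolor(color)
    fig.savefig(path)
    plt.close(fig)
```

Calling plot_colored_hist with the same random data as above should produce a similar picture. Note that several bins can share the maximum or minimum count, in which case all of them get the same color.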

2013-07-12

Faster filesearch with Python using glob

If you have a large, deep folder structure with lots of different files, glob can be much faster than os.walk.

import os, sys, glob
def getFilelist(root):
    def listIter(subroot):
        '''Local recursive function.'''
        # For a plain file the pattern matches nothing, so the recursion stops there.
        for name in glob.glob(os.path.join(subroot, '*')):
            print(name)
            listIter(name)
    listIter(root) # Call recursion.

if __name__ == '__main__':
    sys.exit(getFilelist(r'd:\example'))


This approach performs better when you search for a group of specific files or folders among a huge number of files.
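
To search for specific files, the same recursion can filter names as it goes. This is a minimal sketch, not a benchmark: the function name find_matching and the use of fnmatch for the name filter are my own choices for illustration. Keep in mind that the '*' pattern skips names starting with a dot.

```python
import fnmatch
import glob
import os


def find_matching(root, pattern):
    """Recursively collect paths under `root` whose base name matches `pattern`."""
    matches = []
    # '*' lists the direct children of root (names starting with '.' are skipped).
    for path in glob.glob(os.path.join(root, "*")):
        if fnmatch.fnmatch(os.path.basename(path), pattern):
            matches.append(path)
        # Descend into real directories only, to avoid loops through symlinks.
        if os.path.isdir(path) and not os.path.islink(path):
            matches.extend(find_matching(path, pattern))
    return sorted(matches)
```

For example, find_matching(r'd:\example', '*.log') would return the sorted paths of all .log files below that folder. Whether this beats os.walk depends on the folder tree and the Python version, so it is worth timing both on your own data.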

2013-03-22

Multiprocessing with Python

Let's consider a list containing a huge number of filenames. We need to perform a certain action on each file in this list, for example read the first line, and then write all the results into a file. Below is my working code with comments.

# First of all let's get the filelist:

import os
def getFileList(root_folder):
    '''
    Returns the list of files in specified folder.
    '''

    # A local list, so repeated calls do not accumulate results.
    filelist = []
    for root,dirs,files in os.walk(root_folder):
        for filename in files:
            filepath = os.path.join(root,filename)
            if os.path.isfile(filepath):
                filelist.append(filepath)
    return filelist

# And a function to return the first line from the file:

def readFirstLine(filename):
    '''
    Returns as text the first line from file (an empty string for an empty file).
    '''

    with open(filename, "r") as f:
        # readline() reads a single line instead of loading the whole file.
        return f.readline()

# The following function works with the list of files and throws the result into Queue.

def fileListProcessing(files, q):
    '''
    Puts first lines from all listed files into a Queue. Provides a safe way of getting the result from several processes.
    '''

    try:
        result = []
        for filename in files:
            result.append(readFirstLine(filename))
    except:
        q.put([])
        raise
    q.put(result)

# And here is an actual multiprocessing:

from multiprocessing import Queue, Process, cpu_count

def myMultiprocessing(folder):
    '''
    Splits the source filelist into sublists according to the number of CPU cores and provides multiprocessing of them.
    '''

    files = getFileList(folder)
    q = Queue()
    procs = []
    n = cpu_count()
    for i in range(0, n):
        # Split the source filelist into several sublists.
        lst = files[i::n]
        if len(lst)>0:
            p = Process(target=fileListProcessing, args=(lst, q))
            p.start()
            procs += [p]
    # Collect the results:
    all_results = []
    for i in range(0, len(procs)):
        # Save all results from the queue.
        all_results += q.get()
    # Wait for all processes to finish (the queue is already drained).
    for p in procs:
        p.join()

    # Output results into the file.
    log = open("logfile.log", "w")
    log.write(str(all_results) + "\n")
    log.close()

if __name__ == "__main__":
    myMultiprocessing("d:\\someFolder")

This is a basic example of multiprocessing with Python.
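
The part that splits the filelist into sublists, one per CPU core, can be looked at on its own. Below is a small sketch of that round-robin split; the helper name splitRoundRobin is hypothetical and does not appear in the code above.

```python
def splitRoundRobin(items, parts):
    """Split `items` into at most `parts` sublists, dealing items out in turn."""
    # items[i::parts] takes every parts-th item starting from index i,
    # i.e. exactly the items whose index j satisfies j % parts == i.
    sublists = [items[i::parts] for i in range(parts)]
    # Drop empty sublists, so no process is started without work.
    return [sub for sub in sublists if sub]


if __name__ == "__main__":
    print(splitRoundRobin(["a", "b", "c", "d", "e"], 2))
```

Dealing the files out in turn keeps the sublists within one item of each other in length, though it does not balance the actual work if some files are much slower to open than others.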

2012-11-10

Python: ElementTree to String

I've run into a problem when working with ElementTree.

import xml.etree.cElementTree as etree
tree = etree.parse('example.xml')
spam = etree.tostring(tree)

The code above will always give an error:
"AttributeError: 'ElementTree' object has no attribute 'tag'"
To fix this, find the root element first and convert it to a string, rather than the whole ElementTree instance:

import xml.etree.cElementTree as etree
tree = etree.parse('example.xml')
root = tree.getroot()
spam = etree.tostring(root)
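
Here is a self-contained variant that can be run without an example.xml file. It is only a sketch under a few assumptions: the sample document and the helper name tree_to_string are made up, and it imports xml.etree.ElementTree instead of cElementTree, because cElementTree was removed in Python 3.9.

```python
import io
import xml.etree.ElementTree as etree


def tree_to_string(tree):
    """Serialize a parsed ElementTree by converting its root element."""
    return etree.tostring(tree.getroot())


# Hypothetical in-memory document, used here instead of example.xml.
sample = io.BytesIO(b"<menu><item>spam</item></menu>")
tree = etree.parse(sample)
print(tree_to_string(tree))
```

On Python 3 the result is a bytes object; decode it if you need text.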