Showing posts with label matplotlib. Show all posts

2016-12-08

Outliers detection - Part 1

Sometimes we need to work with data which seems to be correct but in fact is not.
This happens especially often with telemetry data collected by sensors, such as humidity or temperature readings: something might go wrong along the way.

Let's consider such a situation. For example, we have some temperature observations taken in a room during winter:
import numpy as np
data = [24, 24, 24, 22, 16, 26, 15, 24, 16, 26, 18, 
        20, 16, 50, 23, 22, 25, 42, 18, 19, 17, 21, 
        26, 19, 16, 20, 17, 23, 22]
print("Array len: %d, mean: %0.3f, median: %0.3f, std: %0.3f"
      % (len(data), np.mean(data), np.median(data), np.std(data)))

# this will be the output:
# >>> Array len: 29, mean: 22.448, median: 22.000, std: 7.323
Well, nothing looks strange so far, does it? Sure, we might spot the abnormal values just by looking at the list, but what if we have a list with thousands of elements?
There is an easy way to spot them on a graph:
import seaborn as sns
import matplotlib.pyplot as plt
# histogram with a density curve (distplot is deprecated since seaborn 0.11)
sns.histplot(data, kde=True)
plt.show()
This code will plot a graph in which the abnormal observations are easy to find.

If you don't have seaborn or don't want to deal with it, you can just use the standard hist method from matplotlib:
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()
It will plot a similar graph (yes, it's less fancy, but still a quick and essential Python tool).
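If plotting is not an option at all (say, on a machine without a display), roughly the same picture can be printed as text. The sketch below is only an illustration: bin_counts is a hypothetical helper, not part of matplotlib or seaborn, and it simply mimics the 10 equal-width bins that plt.hist uses by default.

```python
def bin_counts(values, bins=10):
    """Count how many values fall into each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:
        # all values are identical, so everything lands in the first bin
        return [len(values)] + [0] * (bins - 1), low, width
    counts = [0] * bins
    for value in values:
        # the maximum value belongs to the last bin, not to a bin of its own
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts, low, width


data = [24, 24, 24, 22, 16, 26, 15, 24, 16, 26, 18,
        20, 16, 50, 23, 22, 25, 42, 18, 19, 17, 21,
        26, 19, 16, 20, 17, 23, 22]

counts, low, width = bin_counts(data)
for i, count in enumerate(counts):
    left = low + i * width
    # one '#' per observation in the bin
    print("%5.1f - %5.1f | %s" % (left, left + width, "#" * count))
```

Bins that sit alone after a run of empty ones are the candidates worth a closer look.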

To be more certain, we can look at measures of central tendency and spread. Just sort your data and look at the mean, median and standard deviation while throwing out a few of the largest elements:
data = sorted(data)
print("Whole data - mean: %0.3f, median: %0.3f, std: %0.3f" \
       % (np.mean(data), np.median(data), np.std(data)))
for i in range(1, 3):
    data_slice = data[:-i]
    print("Without %d max elem - mean: %0.3f, median: %0.3f, std: %0.3f" \
           % (i, np.mean(data_slice), np.median(data_slice), np.std(data_slice)))
And here you can spot a significant change in the standard deviation after removing just one or two elements:
Whole data - mean: 22.448, median: 22.000, std: 7.323
Without 1 max elem - mean: 21.464, median: 21.500, std: 5.241
Without 2 max elem - mean: 20.704, median: 21.000, std: 3.505
This means that your data might contain some outliers, because it becomes much less spread out once just a single element is removed. But to be 100% sure we need to do more math, which I'm going to reveal in the next part very soon.
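The same check can be wrapped in a small helper so that it is easy to repeat on other data. This is only a sketch: trimmed_stats is a hypothetical name, and it uses the standard library statistics module instead of numpy (statistics.pstdev is the population standard deviation, which is what np.std computes by default).

```python
import statistics


def trimmed_stats(values, max_dropped=2):
    """Yield (dropped, mean, median, std) after removing the
    `dropped` largest values, for dropped = 0 .. max_dropped."""
    ordered = sorted(values)
    for dropped in range(max_dropped + 1):
        kept = ordered[:len(ordered) - dropped]
        yield (dropped,
               statistics.mean(kept),
               statistics.median(kept),
               statistics.pstdev(kept))


data = [24, 24, 24, 22, 16, 26, 15, 24, 16, 26, 18,
        20, 16, 50, 23, 22, 25, 42, 18, 19, 17, 21,
        26, 19, 16, 20, 17, 23, 22]

for dropped, mean, median, std in trimmed_stats(data):
    print("Dropped %d largest - mean: %0.3f, median: %0.3f, std: %0.3f"
          % (dropped, mean, median, std))
```

A standard deviation that keeps shrinking sharply with every dropped value, while the median barely moves, is the hint that the largest values deserve a closer look.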

Stay tuned and love your data.

2014-02-10

Python, matplotlib: plot the partially colored histogram

Last week was really useful in terms of new experience. I faced the problem of building colored histograms from some data, and for this I used the awesome matplotlib library. Plotting the histogram itself was quite easy, but coloring separate bars on the histogram was really challenging. I spent about 3 hours browsing the internet before I figured out a solution, and now I'm going to share this knowledge.

Before you start:
  • Download and install matplotlib following the instructions on the official website
  • You might be asked to install some other libraries, such as numpy or dateutil. Don't hesitate to install them.

import random

import matplotlib.pyplot as plt


def buildHist(data):
    # define window size, output and axes
    fig, ax = plt.subplots(figsize=[8, 6])
    # set plot title
    ax.set_title("Colored Histogram")
    # set x-axis name
    ax.set_xlabel("Random Number")
    # set y-axis name
    ax.set_ylabel("Number of Records")
    # create histogram within output
    N, bins, patches = ax.hist(data, bins=50, color="#777777")

    # Iterate through all histogram elements;
    # each element in this iteration is one patch on the histogram, where:
    # - bin_size - number of records in the current bin
    # - bin_edge - left edge of the current bin (x-axis)
    # - patch - a rectangle, object of class matplotlib.patches.Patch
    # more details on patch properties: http://matplotlib.org/api/artist_api.html#matplotlib.patches.Patch
    for bin_size, bin_edge, patch in zip(N, bins, patches):
        if bin_size == max(N):
            patch.set_facecolor("#FF0000")
            patch.set_label("max")
        elif bin_size == min(N):
            patch.set_facecolor("#00FF00")
            patch.set_label("min")
    # add legend to the plot
    plt.legend()
    # save plot as an image
    plt.savefig("hist.png")
    # show plot
    plt.show()


if __name__ == "__main__":
    # 1000 random integers between 0 and 1000 (inclusive)
    data = [random.randint(0, 1000) for i in range(1000)]
    buildHist(data)



As a result you will see a nice histogram, with the fullest bins colored red and the emptiest ones green.
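If you want to reuse the coloring rule on other histograms, it can be separated from the plotting code. The sketch below is only one possible arrangement: bar_colors and color_patches are hypothetical helper names, and the only matplotlib call involved is set_facecolor on the patches, the same one used in the script above.

```python
def bar_colors(counts, base="#777777", max_color="#FF0000", min_color="#00FF00"):
    """Return one color per bar: max_color for the tallest bars,
    min_color for the shortest ones and base for everything else."""
    highest, lowest = max(counts), min(counts)
    colors = []
    for count in counts:
        if count == highest:
            colors.append(max_color)
        elif count == lowest:
            colors.append(min_color)
        else:
            colors.append(base)
    return colors


def color_patches(counts, patches):
    """Apply bar_colors to the rectangles returned by ax.hist."""
    for patch, color in zip(patches, bar_colors(counts)):
        patch.set_facecolor(color)


# the decision logic can be checked without drawing anything
print(bar_colors([3, 7, 1, 7, 4]))
```

Inside buildHist this would replace the loop: call color_patches(N, patches) right after ax.hist. Keeping the rule in a plain function makes it easy to swap in a different one, for example coloring every bin above some threshold.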