Data Analysis Snippets-1

23 Aug 2016


This is ambitious!

Well … as the world knows, Python is one of the coolest languages out there.

It is also one of the most recommended languages, particularly for data analysis, mostly because of the massive support that Python’s existing libraries give for this purpose. The power of Python’s standard library is what we will see in this snippet.

Some Background Noise ;-)

Life is all about data right now. So I chose to give myself a tough project. Every now and then, I want to record anything I find interesting in doing data analysis with Python.

Interestingly, I find Python a friend in need, while I recommend Java as the Guru in all situations.

But as a matter of fact, given the kind of skills I have in either of these languages or in data analysis, this project is very ambitious. I still would like to give myself this challenge.

Here is the first interesting thing I want to talk about. There is nothing that I own here. I am just copying things from the book Python for Data Analysis. But the point is, I am taking the most interesting thing in the initial 25 pages.

The first worked example in the text introduces the 1.USA.gov data from bit.ly. This is what the introduction says:

In 2011, URL shortening service bit.ly partnered with the United States government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil . As of this writing, in addition to providing a live feed, hourly snapshots are available as downloadable text files.

As said, this was true as of the book’s writing. But I guess the data is no longer available where the author said it was. Instead, it is available in the author’s GitHub repo. The author, Wes McKinney, shared all the worked examples from the book as Jupyter Notebooks in this repo, which makes it really interesting to learn from.

Anyway, getting into the data now …

The data specified here is in JSON format, but saved as a .txt file. There are a couple of ways I can load the data for my work in Python. But the simplest way shown in the book is …

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]

Now if you just run records[0], it will output the first line in the data in dict format.

{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 u'al': u'en-US,en;q=0.8',
 u'c': u'US',
 u'cy': u'Danvers',
 u'g': u'A6qOVH',
 u'gr': u'MA',
 u'h': u'wfLQtf',
 u'hc': 1331822918,
 u'hh': u'1.usa.gov',
 u'l': u'orofrog',
 u'll': [42.576698, -70.954903],
 u'nk': 1,
 u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 u't': 1331923247,
 u'tz': u'America/New_York',
 u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

And the total number of records (lines in the data file) is 3560.
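Since the original file may not be at hand, here is a minimal sketch of the same line-by-line loading on a tiny in-memory sample. It assumes Python 3 (so strings print without the u prefix), and the function name load_records and the sample lines are my own, not the book's.

```python
import json


def load_records(lines):
    """Parse an iterable of JSON-formatted lines into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing on them
        records.append(json.loads(line))
    return records


# Made-up lines shaped like the bit.ly records (values are illustrative only).
sample_lines = [
    '{"tz": "America/New_York", "c": "US"}',
    '',
    '{"c": "BR"}',
]
sample_records = load_records(sample_lines)
print(len(sample_records))
print(sample_records[0]["tz"])
```

With a real file, the same function can be fed an open file object, since iterating over a file yields its lines.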

If you observe, one of the keys in this dict is 'tz' which stands for timezone. You can extract all the values of this particular key in all the lines of the data and save it separately, like this.

time_zones = [rec['tz'] for rec in records if 'tz' in rec]

Now try time_zones[:10], and you will see this output:

[u'America/New_York',
u'America/Denver',
u'America/New_York',
u'America/Sao_Paulo',
u'America/New_York',
u'America/New_York',
u'Europe/Warsaw',
u'',
u'',
u'']

Now, the author shows how to produce counts by time zone using two different approaches.

First Approach

First step:

We are defining a new function get_counts that takes a sequence as an argument and counts what we want it to count.

def get_counts(sequence):
  # Map each distinct value to the number of times it appears.
  counts = {}
  for x in sequence:
    if x in counts:
      counts[x] += 1
    else:
      counts[x] = 1
  return counts

And the sequence in our example is time_zones, so … running this line below gives us a dictionary with all the timezones as keys and their counts as values.

counts = get_counts(time_zones)

Try print counts['America/New_York'] and the output is 1251.
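The if/else bookkeeping in the counting function can be avoided with collections.defaultdict. This is a minimal sketch under my own assumptions: Python 3, a function name (get_counts_defaultdict) that I picked for illustration, and a made-up sample list instead of the real data.

```python
from collections import defaultdict


def get_counts_defaultdict(sequence):
    """Count occurrences of each value; missing keys start at 0."""
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts


# Illustrative sample, not the real time_zones list.
sample_zones = ["America/New_York", "", "America/New_York", "Europe/Warsaw"]
sample_counts = get_counts_defaultdict(sample_zones)
print(sample_counts["America/New_York"])
```

The trade-off is that looking up a key that was never counted quietly returns 0 (and adds the key) instead of raising KeyError.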

Second step:

Now you need the top 10 timezones and their counts. We will define another function for this.

def top_counts(count_dict, n=10):
  value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
  value_key_pairs.sort()
  return value_key_pairs[-n:]

Our count_dict in this example is the dictionary of counts. Now, try print top_counts(counts) and the output looks like this.

[(33, u'America/Sao_Paulo'),
 (35, u'Europe/Madrid'),
 (36, u'Pacific/Honolulu'),
 (37, u'Asia/Tokyo'),
 (74, u'Europe/London'),
 (191, u'America/Denver'),
 (382, u'America/Los_Angeles'),
 (400, u'America/Chicago'),
 (521, u''),
 (1251, u'America/New_York')]
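Sorting the whole list only to keep the last n entries can also be written with heapq.nlargest. A hedged sketch, assuming Python 3: the name top_counts_desc and the toy dictionary are mine, and unlike the function above it returns (time zone, count) pairs with the largest count first.

```python
import heapq


def top_counts_desc(count_dict, n=10):
    """Return the n (key, count) pairs with the highest counts, largest first."""
    return heapq.nlargest(n, count_dict.items(), key=lambda item: item[1])


# Toy counts for illustration only.
toy_counts = {"America/New_York": 3, "Europe/Warsaw": 1, "": 2}
print(top_counts_desc(toy_counts, n=2))
```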

Second Approach

All that we have done to achieve the last output can be done in 3 lines of code by using the Counter class from the collections module, which is part of Python’s standard library.

from collections import Counter
counts2 = Counter(time_zones)
counts2.most_common(10)

And that’s it! It gives you the same counts, except that most_common returns (time zone, count) pairs with the largest count first.

[(u'America/New_York', 1251),
 (u'', 521),
 (u'America/Chicago', 400),
 (u'America/Los_Angeles', 382),
 (u'America/Denver', 191),
 (u'Europe/London', 74),
 (u'Asia/Tokyo', 37),
 (u'Pacific/Honolulu', 36),
 (u'Europe/Madrid', 35),
 (u'America/Sao_Paulo', 33)]
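To convince myself that the two approaches really agree, a small check on a toy list helps. This is only a sketch, assuming Python 3. The sample list is made up, and manual_counts is a compact stand-in for the hand-written counting function.

```python
from collections import Counter


def manual_counts(sequence):
    """Hand-written counting, equivalent to the first approach."""
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts


# Toy data with no ties, so the ordering of most_common is unambiguous.
toy_zones = ["America/New_York", "", "America/New_York",
             "Europe/Warsaw", "", "America/New_York"]

by_hand = manual_counts(toy_zones)
by_counter = Counter(toy_zones)

print(by_hand == dict(by_counter))   # same counts either way
print(by_counter.most_common(2))     # (value, count) pairs, largest count first
```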

I loved the way the book teaches the difference that Python’s standard library makes. It really gives an insight into how I can make my job much faster and more efficient when doing data analysis with Python.

Thanks for reading up to this line. Hope it made sense. I will come back with another snippet if I find anything interesting.

Enjoy Python!

The Loop Game

14 Jul 2016


Breaking things is easier than making!

I guess this is what Jonas Lekevicius wants to make us remain constantly aware of.

The Loop Game. There is nothing extraordinary about it. No novelty at all. Lots of such games have come and gone. It is just a subtle combination of minimalism and intense music, at its most immersive.

It takes you into a world of possibilities and imagination. It takes you into a completely different mental state.

A state we call Trance State of Mind.

Not just by the music it plays in your ears, gently slithering deep into your nerves, making you feel LOST. It does so by immersing you into the essence. Perhaps, while playing this, you are in one of those rarest moments in life where you are fully mindful, focused.

No more thoughts, period!

I am not sure what creativity is. To me, the word is as complicated as mathematics sounds. I am not sure how to describe such abstractions. Nature is full of these. That’s why spiritual gurus warn you against the temptation, the tendency to describe everything you see, feel, touch, know, etc.

I guess I find that sensible. Rather than spending your energy and time on finding ways to describe any beauty, just take it all into your eyes!

Well, all that I have done so far in these lines is … describing !!!

Anyway, back to the Loop Game, I find it really relaxing. Imagination gets fired up by the music it plays. I wish I could get it as an audio file.

Sometimes, you feel like you are in an alien ship. Sometimes you just want to close your eyes and focus on nothing specific. But it also feels like you are being judged on your emotional intelligence. I don’t know, it gave me a lot of other thoughts!

The Loop Game.

Go for it. As in the name,

The game is endless …..

Lessons while learning The Shell

19 May 2016

While learning to write a shell script that does what `which` command does in Linux

 ______________
< I love Linux >
 --------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

I have been struggling to find a learning resource, both interesting and useful, for taking my baby steps in shell scripting. Mostly, I ended up finding either somebody who made a tutorial too serious to follow, or someone who made it sound too informal.

Today, I came across this tutorial by Dr. Peter Chubb, who begins by saying very affirmatively,

This is not a talk! This is a tutorial!!

That sounded like a promise to me and I went on.

There were many lessons I learned in this tutorial, called Beginning with the Shell. But one thing simply caught my total attention. During this particular exercise, Dr. Chubb shows you not only how to write a shell script, run it with proper arguments and evaluate the exit codes, but also how to ask your shell to show each line of your script as it reads and executes it. That … is really helpful!

I am not yet sure if I can do this with any and every shell script, but I decided to put this on my blog instantly. In fact, when I wrote this post, I had finished only the above-mentioned part of the video. I have not watched it completely yet. I am sure there is more learning coming up.

For now, here is what I learned so far.

Make your own `which` command line utility

And call your script file wh!
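The script from the tutorial is not reproduced here. As a rough illustration of the lookup that a `which`-style tool performs, here is a sketch in Python rather than shell, to match the other snippets on this blog. It assumes Python 3. The function name find_executable is my own, and it deliberately ignores empty PATH entries, which a real shell treats as the current directory.

```python
import os


def find_executable(name, search_path=None):
    """Return the first executable file called `name` on the search path, or None."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # simplification: skip empty PATH entries
        candidate = os.path.join(directory, name)
        # A match must be a regular file that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(find_executable("sh"))
```

Python's standard library already ships shutil.which for real use. The point of the sketch is only to show the walk over PATH that the shell exercise is about.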

Just One Simple Error - GitHub Page Build Fails

19 May 2016


This is all about a tiny tiny error I made in organizing my GitHub Pages repo locally, after which everything ended up going wrong. In the end, my GitHub Pages site, this site, would not get any of my new posts.

This killed over 3 hours of my time. I still wanted to keep a note here so that, if I ever make a similar mistake, I can check this post and make my head-banging more manageable.

Here is the story!

I have been maintaining this site, although not very seriously. At least, whenever I feel there is something worth noting down, I immediately start scribbling my story in markdown syntax using vi and then use pandoc and lynx for previewing my .md file within the terminal. This markdown-preview within the terminal is one of my recent discoveries. I am thankful to the makers for helping me stick to the terminal for all things once considered only browser-based.

Here is how I do `markdown-preview` from within my Terminal

As per the makers’ suggestion, I just needed to add these lines to my .profile file.

### Preview Markdown files from within Terminal. 
### Source of this info: http://tosbourn.com/view-markdown-files-terminal/
### This makes use of Pandoc & Lynx
### Note: This LYNX thing shocked me like crazy. Linux is an art, really!!!
rmd() {
        pandoc "$1" | lynx -stdin
}

This is it! The colors and formatting appear in lynx style and I find it really very cool.

Back to the Story

I got an idea. I wanted to keep a directory within my GitHub pages repo, called drafts which should contain my writing-in-progress files.

And to make things easier (and to prove that I am lazier than ever), I added the usual Poole-Jekyll specific config lines at the beginning of the post. Something like this:

---
layout: post
title: Just One Simple Error - GitHub Page Build Fails
comments: true
---

And then pushed all my new stuff into the remote.

Bang!

I hit the wall. GitHub wouldn’t build my pages. It sent me a very generic email that said,

Page build failed. For more information, see https://help.github.com/articles/troubleshooting-github-pages-build-failures.

And I went all over the cosmos of google trying to find the god who can help me with this.

First I thought it was because my email on GitHub was not verified, following the suggestions on the GitHub help pages. That didn’t help. Then I thought it was due to the global config settings of git on my machine and tried to reconfigure all of it. Nope, that wasn’t the issue either. And then I remembered that I had formatted my OS about a month ago, which might have resulted in Jekyll-build related issues. So I started installing Jekyll, which required gem, which in turn required ruby, which further needs rvm to be installed on my machine.

Honestly, for anything that has to do with Java, JavaScript or Ruby, I just cannot find sufficient patience. It has turned into a nightmare every time I tried to get these things to work on my machine. I wish somebody like Digital Ocean would write simple and easy-to-follow instructions on how to work with this stuff.

Anyway, I somehow managed to get all this worked out and got my system ready with all these prerequisites. I still could not see my GitHub page updated with my new post, and I could not understand what else to do. Then I tried jekyll build and jekyll serve inside my site directory. Both of them gave me the same error.

Deprecation: You appear to have pagination turned on, but you haven't included the `jekyll-paginate` gem. Ensure you have `gems: [jekyll-paginate]` in your configuration file.

I started searching for this stuff now. Somebody suggested installing the jekyll-paginate gem and listing it under gems in the _config.yml of my Jekyll site, just as the message itself says. I did that, and the error above went away. But I still got one more error.

Invalid Date: '' is not a valid datetime.
  Liquid Exception: exit in _layouts/post.html

It became totally impossible to find where it all went wrong. Purely by chance, I came across a forum discussion where one of the responders asked a question.

Are you trying to use layout: post for something other than a post?

That triggered my brain!

Remember the drafts directory I created inside my site repo, and the layout: post entry I added to the post in it? I went back to that file, deleted it and pushed to the remote.

That’s it! This is all that went wrong. Even if I leave aside all my efforts in installing Jekyll and Ruby, finding this simple mistake felt like an impossible task. It finally worked! And here I am, back to my scribbling!
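To avoid hunting for this by hand next time, something like the following could list files outside _posts whose front matter says layout: post. It is only a sketch under my own assumptions: Python 3, a hypothetical function name, front matter delimited by --- lines, and a fixed set of file extensions. It is not part of Jekyll or GitHub Pages.

```python
import os


def post_layout_outside_posts(site_root):
    """List files outside _posts whose front matter declares `layout: post`."""
    offenders = []
    for dirpath, dirnames, filenames in os.walk(site_root):
        # Skip the real posts folder, the generated site and git metadata.
        dirnames[:] = [d for d in dirnames if d not in ("_posts", "_site", ".git")]
        for filename in filenames:
            if not filename.endswith((".md", ".markdown", ".html")):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as f:
                if f.readline().strip() != "---":
                    continue  # no front matter at the top of the file
                for line in f:
                    stripped = line.strip()
                    if stripped == "---":
                        break  # end of front matter
                    if stripped.replace(" ", "") == "layout:post":
                        offenders.append(path)
                        break
    return offenders
```

Calling post_layout_outside_posts(".") from the site directory would have pointed straight at the file in drafts.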

Hope it helps others, especially, someone like me!

Add SSL Certificates to your website

11 May 2016

;-) Sort of a Disclaimer: This post does not (none of my posts do, for that matter) mean that I am an authority in this domain of technology. I am learning and just sharing here so that somebody might find time to point out the errors. But all the steps mentioned here worked just right for me to get an A on the SSL Test of Qualys SSL Labs. Getting an A+ requires thorough knowledge of ssl_ciphers, which I do not have. But there are more knowledgeable people who have written several posts on getting an A+.

I assume you are using Apache as your web server; all the instructions are given based on that assumption.

NOTE: Make sure to take a backup of all necessary config files before trying this!!!

Step-1 | Collect SSL Certificates into the relevant directory

The vendor of SSL Certificates might have sent you a few certificate files usually with file extensions .crt & .key, perhaps all of them zipped. In this example, the certificates are coming from Network Solutions.

Once you unzip this folder, you may see 5 things.

A .crt file that looks similar to AddTrustExternalCARoot.crt
A .crt file that looks similar to USERTrust... or UTNRSACertificationAuthority.crt
A .crt file that looks similar to NetworkSolutionsOVServerCA2.crt
A .crt file whose file name contains your target domain name such as www.example.com. In this example, the file name would be WWW.EXAMPLE.COM.crt
A .key file which is your Certification Key File

Bring all these files into this path: /etc/apache2/ssl/ (Create ssl directory if you don’t already have it: sudo mkdir /etc/apache2/ssl.)

Step-2 | Make ChainTxt File

Let us make a chainfile.txt in the same directory.

So let’s go to that directory:

cd /etc/apache2/ssl

Create a file called chainfile.txt (use any editor of your choice):

sudo vi chainfile.txt

Into this file, copy the contents of the following files in the SAME EXACT ORDER, and save and close it.

AddTrustExternalCARoot.crt, UTNRSACertificationAuthority.crt & NetworkSolutionsOVServerCA2.crt
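Copying and pasting three certificates by hand is easy to get wrong, so the same concatenation could be scripted. A minimal sketch, assuming Python 3 and that the three .crt files are PEM text files with the names listed above; the function name build_chain_file is my own.

```python
def build_chain_file(cert_paths, output_path):
    """Concatenate certificate files, in the given order, into one file."""
    with open(output_path, "w") as out:
        for path in cert_paths:
            with open(path) as cert:
                text = cert.read()
            # Make sure each certificate ends with a newline before the next starts.
            out.write(text if text.endswith("\n") else text + "\n")


# Example call (run inside /etc/apache2/ssl, with enough permissions to write there):
# build_chain_file(
#     ["AddTrustExternalCARoot.crt",
#      "UTNRSACertificationAuthority.crt",
#      "NetworkSolutionsOVServerCA2.crt"],
#     "chainfile.txt",
# )
```

The order of the list is the order in which the files end up in chainfile.txt, so it mirrors the order given above.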

Step-3 | Configure the `default-ssl.conf`

Open this file:

sudo vi /etc/apache2/sites-enabled/default-ssl.conf

Before our intervention, the file might look something like this, excluding all the comments (lines beginning with #):

<VirtualHost _default_:80>
        ServerAdmin webmaster@localhost
        DocumentRoot /var/www/html/
        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

After creating a backup of this file, change its content so that it looks like this:

<IfModule mod_ssl.c>
    <VirtualHost _default_:443>
        ServerAdmin webmaster@localhost
        DocumentRoot /var/www/html/
        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

        SSLEngine on
        SSLCertificateFile /etc/apache2/ssl/WWW.EXAMPLE.COM.crt
        SSLCertificateKeyFile /etc/apache2/ssl/<your.key.file>
        SSLCertificateChainFile /etc/apache2/ssl/chainfile.txt
        SSLCipherSuite ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-SHA384:DHE-RSA-AES256-SHA256:ECDHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA
        SSLHonorCipherOrder on
        SSLProtocol all -SSLv2 -SSLv3
    </VirtualHost>
</IfModule>

Step-4 | Configure the `example.com.conf`

This step is the same as the previous one, except that you make the configuration entries in your website’s conf file. For example, if your website is www.example.com, then you will find a conf file in your sites-enabled folder /etc/apache2/sites-enabled/, with a filename that looks like www.example.com.conf. Make the same changes in this file as you did in the previous step.

Step-5 | Enable Apache SSL Module & Restart Apache

Run the following commands:

sudo a2enmod ssl
sudo service apache2 restart

This finishes the process of installing SSL certificates on your site. If everything went fine, you should get at least an A on the SSL Labs website. I hope you get it. If you don’t, please leave a comment here to help me learn from my mistakes.

Step-6 | Testing your SSL Certification Installation

Go to https://www.ssllabs.com/


Anand's Space :: My Studio of Thoughts!
