Pragati magazine on Kindle

I am a fan of Acorn, and one of the things he has been involved with for a few years now is Pragati magazine. It is a monthly publication of articles & opinion pieces that revolve around themes including, in their words, “economic freedom, realism in international relations, an open society, a culture of tolerance and an emphasis on good governance.” The magazine is available as a freely downloadable PDF file every month.

A couple of weeks ago, I bought a new Kindle electronic reader from Amazon, and since then I’ve been getting into the habit of reading long-form content on it during the daily commute rather than browsing Reddit on the phone.

Now although the Kindle supports PDF files - and Pragati’s very nicely done page layout and design are rendered perfectly on it - the magazine’s two-column format is a pain to read on the Kindle’s 6” screen.

[Image: Pragati on Kindle]

So I set about the task of converting the magazine into the Kindle’s native e-book file format.

I ruled out using the PDF file as a source for three reasons: (a) PDF is a pain to work with; (b) all of Pragati’s articles are also available on their website as regular web pages and (c) PDF is a real pain to work with! It is infinitely easier to slice, dice & transform HTML than it is to work with PDF.

If you start dealing with e-books, sooner or later you’ll encounter Calibre, an e-book library management application. Among its many features is the ability to download content from online newspapers & magazines and convert it into any of the several supported e-book formats such as .epub, .lit, .fb2 and the Kindle’s native format, .mobi.

Calibre uses what it calls recipes, essentially Python scripts, to do the website-to-e-book conversion. It comes pre-loaded with a few hundred recipes supporting sites such as the WSJ, The Economist, The Hindu, etc., and users can add their own recipes too.
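
To give a sense of what a recipe looks like before we get to the full one below, here is a minimal sketch. The class name and feed URL are made-up placeholders, but the pattern of subclassing BasicNewsRecipe is exactly what Calibre expects:

from calibre.web.feeds.news import BasicNewsRecipe

class MinimalRecipe(BasicNewsRecipe):
    # Metadata that ends up in the generated e-book
    title = 'Example Magazine'
    language = 'en'

    # For a site with an RSS/Atom feed, a list of (section, feed URL)
    # tuples is all a recipe strictly needs; Calibre downloads each
    # entry and converts it. The URL here is a placeholder.
    feeds = [('Articles', 'http://example.com/feed/')]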

TL;DR

I wrote a Calibre recipe that creates a Kindle-friendly version of the current month’s Pragati magazine issue. Download^ and try out the converted March 2011 issue on your Kindle. If you like it, install Calibre, follow this guide to load the recipe below into Calibre, and then schedule it to run monthly. The same recipe will also work if you wish to create e-books in any of the other formats that Calibre supports.

UPDATE: I’ve imported the recipe into my public Mercurial repository. You can now track changes or fork it from here. There’s already one new change compared to the code below!

# vim: filetype=python autoindent tabstop=4 expandtab shiftwidth=4 softtabstop=4

"""
pragati.nationalinterest.in
"""
import datetime

from calibre.web.feeds.news import BasicNewsRecipe

class PragatiRecipe(BasicNewsRecipe):

    __author__ = 'Deepak Sarda'

    today = datetime.datetime.now()

    title      = 'Pragati' + today.strftime(' %B %Y')
    description = 'The Indian National Interest Review'
    publisher = 'The Takshashila Institution'
    category = 'news, opinion, politics'
    isbn = '0973-8460'  # this is Pragati's ISSN rather than an ISBN
    publication_type = 'magazine'
    language = 'en'
    timefmt = ' %B %Y'

    INDEX = 'http://pragati.nationalinterest.in/'

    keep_only_tags = [dict(name='div', attrs={'id':'single'})]
    remove_tags_after = [dict(name='p', attrs={'class':'tags tabs'})] 
    no_stylesheets = True

    # TODO: Get better image!
    #def get_masthead_url(self):
    #    return self.INDEX + 'wp-content/uploads/2010/09/pragati-logo-cl.png'

    def parse_index(self):
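        # Override of BasicNewsRecipe.parse_index, used instead of RSS
        # feeds: build the article list by walking the current month's
        # archive pages on the website.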
        articles = []
        slug = self.INDEX + self.today.strftime("%Y/%m/")
        keep_going = True

        # Guard against tags with no class attribute (val will be None)
        is_pager = lambda val: val and 'navigation' in val

        while keep_going:

            self.log('will fetch ', slug)
            soup = self.index_to_soup(slug)

            posts = soup.find('div', attrs={'id':'posts'})
            for post in posts.findAll('div', attrs={'class':'content'}):
                #self.log('the post: ', post)
                h = post.find('h2')
                title = self.tag_to_string(h)
                a = h.find('a', href=True)
                url = a['href']
                if url.startswith('/'):
                    url = self.INDEX.rstrip('/') + url
                self.log('Found article:', title, 'at', url)

                paras = post.findAll('p')

                desc = None
                timestamp = None
                for p in paras:
                    if not p.has_key('class'): # paragraphs without class attribute
                        desc = self.tag_to_string(p)
                        # self.log('\t\t', desc)
                    elif p['class'] == 'postmetadata':
                        tstamp = p.find('span', attrs={'class':'timestamp'})
                        timestamp = self.tag_to_string(tstamp)

                articles.append({'title':title, 'url':url, 'description':desc, 'date': timestamp})

            # Assume this is the last archive page unless an 'older'
            # link turns up in the pager below
            keep_going = False

            pager = posts.find('div', attrs={'class': is_pager})
            if not pager:
                break

            pager_links = pager.findAll('a', href=True)
            if not pager_links:
                break

            for link in pager_links:
                if 'OLDER' in self.tag_to_string(link).upper():
                    slug = link['href']
                    self.log('found another older link to follow', slug)
                    keep_going = True
                    break


        feeds = [("Articles" , articles)]

        return feeds

    def populate_article_metadata(self, article, soup, first):
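        # Calibre calls this once per downloaded article with its parsed
        # soup; use it to pull the author's name out of the post metadata.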

        # Guard against tags with no class attribute (val will be None)
        is_metadata_para = lambda val: val and 'postmetadata' in val

        def recurse_delete(tag):
            # Remove this tag and every sibling that follows it
            if tag.nextSibling:
                recurse_delete(tag.nextSibling)
            tag.extract()

        metadata = soup.find('p', attrs={'class': is_metadata_para})

        if not metadata: return

        author = metadata.find('strong')

        if not author: return

        author_text = self.tag_to_string(author)
        article.author = author_text
        self.log('found author', article.author, 'for article', article.title)

        # Rip out everything after author including
        # 'publish time' and 'comments link' from metadata
        if author.nextSibling:
            recurse_delete(author.nextSibling)

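If you would rather test the recipe once without setting up a schedule, Calibre’s ebook-convert command-line tool can also run a recipe file directly and should produce the same e-book in one shot (the filenames here are just placeholders):

ebook-convert pragati.recipe pragati.mobi
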
I’ll try to keep this post updated as and when I tweak the recipe. If the Pragati website’s design changes in the future, the recipe will certainly stop working. Come back here and bug me to fix it :)

^ Pragati is made available under a permissive license which I think allows me to convert & redistribute it.