Pragati magazine on Kindle
I am a fan of Acorn and one of the things he’s been involved with for a few years now is the Pragati magazine. It is a monthly publication with articles & opinion pieces that revolve around themes that include - in their words - “economic freedom, realism in international relations, an open society, a culture of tolerance and an emphasis on good governance.” The magazine is available as a freely downloadable PDF file every month.
A couple of weeks ago, I bought a new Kindle electronic reader from Amazon and since then, I’ve been getting into the habit of reading long-form content on it during the daily commute rather than browsing Reddit on the phone.
Now although the Kindle supports PDF files - and Pragati’s very nicely done page layout and design is rendered perfectly on it - the magazine’s two-column format is a pain to read on the Kindle’s 6” screen.
So I set about the task of converting the magazine into the Kindle’s native e-book file format.
I ruled out using the PDF file as a source for three reasons: (a) PDF is a pain to work with; (b) all of Pragati’s articles are also available on their website as regular web pages and (c) PDF is a real pain to work with! It is infinitely easier to slice, dice & transform HTML than it is to work with PDF.
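To give a feel for how little it takes to pull structure out of HTML, here is a minimal sketch using only Python's standard-library `html.parser`. The `HeadlineLinkParser` class, the `headline_links` helper and the `SAMPLE` markup are all made up for illustration; the sample only loosely imitates an archive page and is not copied from the Pragati site.

```python
from html.parser import HTMLParser


class HeadlineLinkParser(HTMLParser):
    """Collect (title, href) pairs for links that sit inside <h2> tags."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (title, href) pairs
        self._in_h2 = False    # True while we are inside an <h2>
        self._href = None      # href of the <a> currently being read
        self._text = []        # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._in_h2 = True
        elif tag == 'a' and self._in_h2:
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a headline link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None
        elif tag == 'h2':
            self._in_h2 = False


def headline_links(html):
    """Return every (title, href) headline link found in an HTML string."""
    parser = HeadlineLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup, assumed for this example only.
SAMPLE = """
<div id="posts">
  <div class="content"><h2><a href="/2011/03/first/">First article</a></h2></div>
  <div class="content"><h2><a href="/2011/03/second/">Second article</a></h2></div>
</div>
"""

print(headline_links(SAMPLE))
```

Getting the same list of headlines out of a two-column PDF would mean reverse-engineering text positions on the page, which is exactly the pain I wanted to avoid.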
If you start dealing with e-books, sooner or later you’ll encounter an application called Calibre, an e-book library management app. Among its many features is the ability to download content from online newspapers & magazines and convert them into any of the several supported e-book formats such as .epub, .lit, .chm, .fb2 and the Kindle’s native format .mobi.
Calibre uses what it calls recipes, essentially Python scripts, to do the website to e-book conversion. It comes pre-loaded with a few hundred recipes supporting sites such as the WSJ, The Economist, The Hindu, etc., and users can add their own recipes too.
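A recipe that builds its own table of contents does so by returning a list of (section title, list of article dictionaries) pairs from its `parse_index` method. The snippet below is a small stand-alone sketch of that shape, with no Calibre import needed. The `check_index` helper and the choice of `title` and `url` as the required keys are my own assumptions for illustration, not something Calibre provides.

```python
# Keys this sketch insists on for every article (an assumption of this example).
REQUIRED_KEYS = ('title', 'url')


def check_index(feeds):
    """Return a list of problems found in a parse_index-style result."""
    problems = []
    for pos, feed in enumerate(feeds):
        # Each feed should be a 2-tuple: (section title, list of articles).
        if not (isinstance(feed, tuple) and len(feed) == 2):
            problems.append('feed %d is not a (title, articles) pair' % pos)
            continue
        feed_title, articles = feed
        for n, article in enumerate(articles):
            for key in REQUIRED_KEYS:
                if not article.get(key):
                    problems.append('%s: article %d has no %s' % (feed_title, n, key))
    return problems


# A made-up index with one section and one article.
sample_index = [
    ('Articles', [
        {'title': 'A sample headline',
         'url': 'http://example.com/2011/03/sample/',
         'description': 'One-line summary',
         'date': 'March 1, 2011'},
    ]),
]

print(check_index(sample_index))
```

Running a check like this on a recipe's output while developing it makes a missing URL show up as a readable message rather than as a failed download later on.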
TL;DR
I wrote a Calibre recipe that creates a Kindle-friendly version of the current month’s Pragati magazine issue. Download^ and try out the converted March 2011 issue on your Kindle. If you like it, install Calibre, follow this guide to load the recipe below into Calibre, and then schedule Calibre to run it monthly. The same recipe also works if you wish to create e-books in any of the other formats that Calibre supports.
UPDATE: I’ve imported the recipe into my public Mercurial repository. You can now track changes or fork it from here. There’s already one new change compared to the code below!
# vim: filetype=python autoindent tabstop=4 expandtab shiftwidth=4 softtabstop=4
"""
pragati.nationalinterest.in
"""
import string, re
import datetime
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString
class PragatiRecipe(BasicNewsRecipe):

    __author__ = 'Deepak Sarda'

    today = datetime.datetime.now()
    title = 'Pragati' + today.strftime(' %B %Y')
    description = 'The Indian National Interest Review'
    publisher = 'The Takshashila Institution'
    category = 'news, opinion, politics'
    isbn = '0973-8460'
    publication_type = 'magazine'
    language = 'en'
    timefmt = ' %B %Y'

    INDEX = 'http://pragati.nationalinterest.in/'

    # Keep only the article body and drop everything after the tags paragraph.
    keep_only_tags = [dict(name='div', attrs={'id':'single'})]
    remove_tags_after = [dict(name='p', attrs={'class':'tags tabs'})]
    no_stylesheets = True

    # TODO: Get better image!
    #def get_masthead_url(self):
    #    return self.INDEX + 'wp-content/uploads/2010/09/pragati-logo-cl.png'

    def parse_index(self):
        articles = []
        # Start from the archive page of the current month, e.g. .../2011/03/
        slug = self.INDEX + self.today.strftime("%Y/%m/")
        keep_going = True
        # The guard skips tags that have no class attribute at all.
        is_pager = lambda val: val is not None and 'navigation' in val
        while keep_going:
            self.log('will fetch ', slug)
            soup = self.index_to_soup(slug)
            posts = soup.find('div', attrs={'id':'posts'})
            for post in posts.findAll('div', attrs={'class':'content'}):
                #self.log('the post: ', post)
                h = post.find('h2')
                title = self.tag_to_string(h)
                a = h.find('a', href=True)
                url = a['href']
                if url.startswith('/'):
                    url = self.INDEX.rstrip('/') + url
                self.log('Found article:', title, 'at', url)
                paras = post.findAll('p')
                desc = None
                timestamp = None
                for p in paras:
                    if not p.has_key('class'): # paragraphs without class attribute
                        desc = self.tag_to_string(p)
                        # self.log('\t\t', desc)
                    elif p['class'] == 'postmetadata':
                        tstamp = p.find('span', attrs={'class':'timestamp'})
                        timestamp = self.tag_to_string(tstamp)
                articles.append({'title':title, 'url':url, 'description':desc, 'date': timestamp})
            # Follow the "older" pager link, if any, to the next archive page.
            keep_going = False
            pager = posts.find('div', attrs={'class': is_pager})
            if not pager:
                break
            pager_links = pager.findAll('a', href=True)
            if not pager_links:
                break
            for link in pager_links:
                if 'OLDER' in self.tag_to_string(link).upper():
                    slug = link['href']
                    self.log('found another older link to follow', slug)
                    keep_going = True
                    break
        feeds = [("Articles" , articles)]
        return feeds

    def populate_article_metadata(self, article, soup, first):
        is_metadata_para = lambda val: val is not None and 'postmetadata' in val

        def recurse_delete(tag):
            # Remove this node and every sibling that follows it.
            if tag is None:
                return
            if tag.nextSibling:
                recurse_delete(tag.nextSibling)
            tag.extract()

        metadata = soup.find('p', attrs={'class': is_metadata_para})
        if not metadata: return
        author = metadata.find('strong')
        if not author: return
        author_text = self.tag_to_string(author)
        article.author = author_text
        self.log('found author', article.author, 'for article', article.title)
        # Rip out everything after author including
        # 'publish time' and 'comments link' from metadata
        recurse_delete(author.nextSibling)
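The heart of `parse_index` is the loop that starts at the current month's archive page and keeps following the "older" pager link until there isn't one. Here is a rough, Calibre-free sketch of just that control flow so it can be tried in isolation. The `month_slug` and `collect_articles` functions, the `fetch_page` callable and the `PAGES` dictionary are hypothetical stand-ins for the live site. Note that the sketch resolves relative links with `urllib.parse.urljoin` rather than the string concatenation the recipe uses, and adds a `seen` set so a pager that loops back cannot spin forever.

```python
import datetime
from urllib.parse import urljoin

INDEX = 'http://pragati.nationalinterest.in/'


def month_slug(day, index=INDEX):
    """Archive URL for the month containing `day`, e.g. .../2011/03/."""
    return index + day.strftime('%Y/%m/')


def collect_articles(start_url, fetch_page):
    """Follow 'older' links from start_url, gathering article URLs.

    fetch_page(url) is a hypothetical callable returning
    (article_urls, pager_links), where pager_links is a list of
    (link_text, href) pairs taken from the page's navigation block.
    """
    articles, seen, url = [], set(), start_url
    while url and url not in seen:
        seen.add(url)  # guard against a pager that loops back
        article_urls, pager_links = fetch_page(url)
        # Site-relative links are resolved against the page they came from.
        articles.extend(urljoin(url, a) for a in article_urls)
        # Take the first pager link whose text mentions "older", if any.
        url = next((urljoin(url, href) for text, href in pager_links
                    if 'OLDER' in text.upper()), None)
    return articles


# Two fake archive pages standing in for the real site (made-up data).
PAGES = {
    INDEX + '2011/03/': (['/2011/03/a/', '/2011/03/b/'],
                         [('Older Entries', INDEX + '2011/03/page/2/')]),
    INDEX + '2011/03/page/2/': (['/2011/03/c/'],
                                [('Newer Entries', INDEX + '2011/03/')]),
}

start = month_slug(datetime.date(2011, 3, 15))
print(collect_articles(start, PAGES.__getitem__))
```

Passing the page fetcher in as an argument is what makes the loop testable without a network connection; in the recipe itself that job is done by `index_to_soup`.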
I’ll try and keep this post updated as and when I tweak the recipe. If the Pragati web site design changes in the future, then the recipe will certainly stop working. Come back here and bug me to fix it :)
^ Pragati is made available under a permissive license which I think allows me to convert & redistribute it.