GitHub repo: https://github.com/anuvrat/scrape-google-play
Unlike Apple, Google does not provide a list of all the apps in the Google Play store. There is no index linking to every active app in the marketplace (Apple has a nice alphabetical index, per category, of all apps in the iTunes App Store). The only way to discover apps in Google Play is to crawl the marketplace, finding new apps through the Similar Apps and More from Developer sections of each app page. But one needs an initial set of apps to start the crawl.
Google Play lists various categories, each with a URL of the form https://play.google.com/store/apps/category/<category>. Each category page has two feeds: Top Paid in <category> and Top Free in <category>. I scanned these two feeds for every category to create the initial set of apps to crawl.
In addition to the category pages, Google Play also has collections at https://play.google.com/store/apps/collection/topselling_<suffix>, with suffixes such as paid_game, free, paid, new_paid_game, new_free and new_paid. I added the apps in these collections to the initial set.
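As a rough sketch (not the code from the repo), the seed pages can be generated from the two URL patterns above. The function name `seed_urls` and the category names `GAME` and `TOOLS` are placeholders I made up for illustration; the suffix list is the one given above.

```python
CATEGORY_URL = "https://play.google.com/store/apps/category/{category}"
COLLECTION_URL = "https://play.google.com/store/apps/collection/topselling_{suffix}"

# Collection suffixes mentioned in the post.
COLLECTION_SUFFIXES = [
    "paid_game", "free", "paid", "new_paid_game", "new_free", "new_paid",
]


def seed_urls(categories):
    """Return the category and collection pages to scan for the initial apps."""
    urls = [CATEGORY_URL.format(category=name) for name in categories]
    urls += [COLLECTION_URL.format(suffix=suffix) for suffix in COLLECTION_SUFFIXES]
    return urls


# Placeholder category names, not a verified list of Google Play categories.
example = seed_urls(["GAME", "TOOLS"])
for url in example:
    print(url)
```

Each of these pages would then be fetched and parsed for app links, which is the part that depends on the page markup and is left out here.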
Now that I had my initial set of apps, all I had to do was to fetch each of them, look at Similar and More from Developer apps and add those I hadn’t already discovered to the list of pending apps.
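The discovery step is essentially a breadth-first traversal over the "related apps" graph. Here is a minimal sketch of that idea, assuming a hypothetical `fetch_related(app_id)` callable that returns the ids found in the Similar and More from Developer sections of one app page. The real fetching and parsing is not shown, and a toy dictionary stands in for the store.

```python
from collections import deque


def crawl(seed_apps, fetch_related):
    """Visit every reachable app once, queueing apps not seen before."""
    discovered = set(seed_apps)
    pending = deque(dict.fromkeys(seed_apps))  # de-duplicated, order preserved
    while pending:
        app_id = pending.popleft()
        for related in fetch_related(app_id):
            if related not in discovered:
                discovered.add(related)
                pending.append(related)
    return discovered


# Toy stand-in for Google Play: app id -> related app ids.
fake_store = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["b"],
    "e": ["a"],  # links out, but no crawled app links to it
}
found = crawl(["a"], lambda app_id: fake_store.get(app_id, []))
print(sorted(found))
```

In the toy data, app `e` is never found because no crawled app links to it. A traversal like this can only reach apps that are connected to the initial set.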
Wikipedia says that Google Play has around a million apps (as of July 2013). Unfortunately, I have only been able to discover about 300,000 apps. Here’s the script that crawls Google Play. The crawl state is saved to a file after every 100 apps crawled, so the crawl can be restarted after an error without losing data.
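To illustrate the checkpointing idea, here is one way it could be done. This is a sketch under my own assumptions, not necessarily how the script in the repo stores its state. The JSON layout and the names `save_state`, `load_state` and `CHECKPOINT_EVERY` are hypothetical.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 100  # save after this many apps, as described above


def save_state(path, discovered, pending):
    """Write the state to a temp file, then swap it in with a single rename."""
    state = {"discovered": sorted(discovered), "pending": list(pending)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash mid-write leaves the old file intact


def load_state(path):
    """Return (discovered, pending), or an empty state if no checkpoint exists."""
    if not os.path.exists(path):
        return set(), []
    with open(path) as handle:
        state = json.load(handle)
    return set(state["discovered"]), state["pending"]
```

Inside the crawl loop one would keep a counter and call `save_state(...)` whenever `crawled % CHECKPOINT_EVERY == 0`. On restart, `load_state(...)` gives back the discovered set and the pending list. In this sketch a restart repeats at most the apps crawled since the last checkpoint.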