Web Content & Reading it Later

I kind of want to make a site/app that aggregates literature. Kind of like an Amazon or Google Books, except that the content is decentralised – possibly under a restrictive copyright – and the only thing I'm providing is a set of programmatic instructions for the app to download the content from a third party.

It sort of exists for Arch Linux, for downloading third-party software. You can download a build script (a PKGBUILD) from the "Arch User Repository", which works like a kind of makefile: it installs the relevant dependencies, downloads the desired software, applies any patches if necessary, then builds it into a nice standard package ready to be installed, all the while respecting the copyright owner's original license.

I'm not sure if it's been done already, but the same sort of thing could be brought to literature (albeit with a nice market-style interface). The benefit would be that the publisher gets promoted via whatever distribution method they choose, while the end user gets (mostly) free content in a standardised layout, without having to deal with the lock-in and DRM that is rife in the traditional publishing industry.

I realise the web is already sort of this platform, but there are several things I think can be made better:

  • Discoverability: most people I know don't want to go searching for content. They listen to the radio, watch the television, maybe buy a bestseller or two after the movie comes out. With a "frontend" to all the disconnected content on the web, one could easily cut out the cruft that accumulates in search results and discover awesome, polished prose that's worthy of reading in bed. Not only that, but this could serve as a kind of ubercatalog of things from other locations such as Project Gutenberg, Google Books, and really anything with a standard HTTP interface.
  • Format Shifting: People still publish things in all different formats, and there are few things worse than having to deal with them all. Kindle, Nook, PDF, ePUB, HTML, DocBook, ugh. While the first two are crippled with DRM, the last three (among many others) are open and could be integrated into this service with relative ease.
  • Bookmarking: Services like Instapaper have laid the groundwork for this particular concept, allowing a person to save a page to read later. Both Instapaper and Safari have a reading mode which strips out the cruft, leaving a consistent and distraction-free reading environment.
  • Content Sanitation: The AUR model would allow the app to load a set of instructions for downloading and sanitising content. This might include fetching multiple pages in a web article, pulling in comments or related notes, converting PDF files to plain text (including programmatic instructions for fixing dodgy line breaks or PDF artefacts), and probably lots more things I haven't thought of yet. The major benefit is that the reading experience is made more consistent within the one interface.
  • Consistency: Presently you can get ebook readers that do any number of formats, but all these formats are markedly different ways of displaying content. This is really visible to the end user. This model could potentially unify these formats by ripping out the guts of them and rearranging the content in a consistent and readable manner.
  • Ownership: There's actually a market for a $2.99 downloadable copy of Ars Technica's three-part feature on a 48-hour "game jam", which indicates to me that people like owning stuff. This system would allow a person to download their own digital copy of a publicly available work which would effectively last forever, even if the original copy was moved or lost. While it's someone else's copyright, obviously, I suspect fair dealing could apply in this circumstance, as all the actual copying is being done on the user's personal device as a sort of format shifting.

This model presents a whole number of niggles: copyright, accountability, and who the hell is going to curate this library of things? An open project would be preferable, but would present its own set of challenges, such as how to safely allow anyone in the world to have this kind of control over a client device.
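
To make the Content Sanitation idea a bit more concrete, here's a rough sketch of what one of those instruction sets might look like. Everything in it is made up for illustration: the names Recipe, STEPS and build, the step names, and the idea that a recipe is plain data that can only pick clean-up steps from a fixed list the app already ships with (which is one possible answer to the "control over a client device" niggle, not a solved problem). The fetch function is passed in, so the sketch itself never touches the network.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def join_hyphenated(text: str) -> str:
    """Rejoin words split across a line break: 'for-\\nmat' becomes 'format'.

    Known limitation: a genuinely hyphenated word that happens to break
    at the hyphen loses its hyphen too.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def unwrap_lines(text: str) -> str:
    """Turn single line breaks into spaces; blank lines stay as paragraph breaks."""
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def collapse_spaces(text: str) -> str:
    """Squash runs of spaces and tabs, and trim the ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


# The only operations a recipe is allowed to name (hypothetical whitelist).
STEPS: Dict[str, Callable[[str], str]] = {
    "join_hyphenated": join_hyphenated,
    "unwrap_lines": unwrap_lines,
    "collapse_spaces": collapse_spaces,
}


@dataclass
class Recipe:
    """Instructions for assembling one work: where the pages are, how to clean them."""
    title: str
    pages: List[str]
    steps: List[str] = field(default_factory=list)


def build(recipe: Recipe, fetch: Callable[[str], str]) -> str:
    """Fetch every page in order, join them as paragraphs, then apply each step."""
    parts = [fetch(url).strip("\n") for url in recipe.pages]
    text = "\n\n".join(parts)
    for name in recipe.steps:
        # An unknown step name raises KeyError instead of running anything.
        text = STEPS[name](text)
    return text
```

In use, the app would hand build() a recipe and whatever it uses to download text; for a two-page article, the result is one cleaned-up string with the hard-wrapped lines rejoined and the page boundary kept as a paragraph break.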

I've been thinking about this for a while, and while my thinking is overly technical, my underlying sentiment is that I just want something lovely on the web to make it easier for me to find great content to stimulate my brain that isn't encumbered by shit. I want to force rich semantic goodness on the rest of the web rather than wait for it to happen of its own accord. It feels a bit like my idea betrays the spirit of the world wide web, but I suppose it's in a way that could possibly inspire a better one.