29. AI x piracy x data extraction

2023-05-16

It was interesting to see Anthropic launch a new 100K-token context window for Claude last week. I wasn’t, however, super clear on how I might use a context window that large. Earlier today, though, I stumbled on a great use case, one that always seems to drive innovation: piracy.

Really simply, I wanted to read an article that was paywalled. I could see in the page’s source that the article content was loaded, but I couldn’t find the paywall element in order to remove it. The article wasn’t that important, so I didn’t want to waste many brain cycles on it, but I did wonder: Could Claude just… extract it for me?

So here’s what I did:

1. Opened a terminal and typed: curl [url]
2. Copied the resulting output and pasted it into a new Claude chat window
3. Asked Claude: “Please strip this html down to the main article text”
4. After about a minute, the model printed the article contents
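The steps above can be sketched as a small script. This is a rough sketch under assumptions, not what was actually run (the original was copy and paste by hand): `fetch_html`, `build_prompt`, and `extract_article` are hypothetical helper names, and `ask_model` is a placeholder for whatever way you hand text to Claude and get a reply back.

```python
from typing import Callable
from urllib.request import urlopen

# The exact instruction from step 3.
INSTRUCTION = "Please strip this html down to the main article text"


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Rough stand-in for `curl [url]`: return the raw page source as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_prompt(html: str, instruction: str = INSTRUCTION) -> str:
    """Put the instruction first, then the raw HTML, as one message."""
    return f"{instruction}\n\n{html}"


def extract_article(
    url: str,
    ask_model: Callable[[str], str],
    fetch: Callable[[str], str] = fetch_html,
) -> str:
    """Fetch the page, wrap it in the prompt, and return the model's reply.

    `ask_model` is an assumption: any callable that takes the prompt text
    and returns the model's answer.
    """
    return ask_model(build_prompt(fetch(url)))
```

Here `ask_model` could be as low-tech as a function that prints the prompt for you to paste into a chat window and then reads the reply you paste back.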

Now, it’s not clear this is “worth” the cost of the tokens or the parsing time. It also only works when the full article text is already in the page source, so it definitely wouldn’t work on all paywalls out there. But it was a very easy solution to a problem I had.
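One way to trim the token cost would be to drop markup the model does not need before pasting. A minimal sketch, assuming that script and style blocks never hold article text (not true of every site); `strip_noise` is a hypothetical helper built on Python's standard `html.parser`. The result still contains navigation and other page text, so the model is still the one picking out the article.

```python
from html.parser import HTMLParser


class _NoiseStripper(HTMLParser):
    """Collect visible text, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}  # assumption: these never contain the article

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside a skipped element.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def strip_noise(html: str) -> str:
    """Return the page's text content, one chunk per line, without scripts or styles."""
    parser = _NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Comparing `len(html)` with `len(strip_noise(html))` gives a rough sense of how much smaller the pasted input becomes.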

If you squint, there’s a broader use case here: parsing scraped output with a model and fundamentally rethinking data extraction pipelines. I’m also sure there are many, many more piracy use cases that this and other AI features would greatly facilitate and accelerate.