My Old LibTech Blog (2013-2016)

Content worth stealing

Author: John Durno
Date: 2013-09-10

The library spends a lot of money on content. Every year we license tens of thousands of electronic journals and purchase heaps of books and other stuff. Whether we should have to pay for access is of course hotly debated in open access circles and other places. But for the moment, anyhow, we do pay, and how. The latest CARL statistics peg UVic's electronic serials expenditures at slightly less than $4M per year.

Is it worth it? I don't have the definitive answer to that question, but by at least one metric the answer is a resounding yes. And the metric I have in mind is how hard people are willing to work to steal it.

One thing we do here in Library Systems is manage technology that, when all goes well, enables access to our electronic content for the people who should have access (as determined by our content licensing agreements), and keeps out everyone else. Typically the select group of the fortunate includes current staff, students, faculty, faculty emeritus, and anyone who is physically on campus.

The principle is fairly simple and I don't think I'm giving away any secrets when I tell you what it is. It's based on the internet address of your computer, also known as an IP address. Every computer has one, and they look something like 142.104.35.79 (which happens to be the IP of one of the workstations in one of our labs). As long as you connect to our research databases from a computer with a UVic IP address, the database hosts will let you in.
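To make the idea concrete, here is a minimal sketch in Python of the kind of check a database host might run. It is an illustration only, not any vendor's actual code: the function name `is_on_campus` and the range 142.104.0.0/16 are assumptions made for the example (the only address given above is 142.104.35.79, and real licence agreements list their own ranges).

```python
import ipaddress

# Assumed campus range, for illustration only. The post gives a single
# example address, not the ranges actually registered with vendors.
CAMPUS_NETWORKS = [ipaddress.ip_network("142.104.0.0/16")]


def is_on_campus(address: str) -> bool:
    """Return True if the address falls inside any listed campus network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in CAMPUS_NETWORKS)


if __name__ == "__main__":
    for candidate in ("142.104.35.79", "8.8.8.8"):
        verdict = "let in" if is_on_campus(candidate) else "turn away"
        print(candidate, verdict)
```

The check knows nothing about who is sitting at the keyboard, only where the request appears to come from, and that is the whole point of the scheme.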

But what if you're off-campus and your computer has an IP address assigned by your service provider? If you belong to one of the authorized groups, you can log into the Library's proxy server, aka EZproxy. The proxy will then relay all of your traffic to the database hosts, making it appear to be coming from a UVic IP address. One of the nicer aspects of this approach, apart from its simplicity, is that you retain a degree of anonymity. All our content providers ever see is the network address of our proxy server; they have no obvious way to link search activity to individuals, so they can't augment their revenue stream by selling your personal data to advertisers. The Library could, but you trust us not to, right?

Of course, the content providers have to trust us too. They have no good way of verifying that we're holding up our end of the user authentication agreement, although they do have some legal recourse if we don't. But in fact we do try very hard to keep our end of the deal.

The weakest link in the chain is not the technology. It's that, when you have thousands of users, invariably some of their credentials (Netlink IDs and passwords) will get out in the wild. There are various ways this might happen. I suspect one of the biggest drivers is the phishing email that tells you your webmail account will be deleted unless you send your username, password, social insurance number, date of birth and credit card number to 'UVic Computing Systems' at their new email address in Nigeria. People do fall for those, and of course some of them are constructed a little more cleverly than that.

Then those phished IDs and passwords get shared around all sorts of places, and a surprising number of them wind up being used to authenticate spiders, mini-programs whose sole purpose is to log into our proxy and suck up licensed library content, presumably so it can be packaged and resold to researchers and students whose libraries aren't resourced well enough to afford subscriptions.

Publishers and other content vendors tend not to appreciate that, so the Library is contractually obliged to monitor our proxy server traffic to see if we can detect usage patterns that might indicate our licensed content is being spidered or harvested in bulk. We derive no particular joy from this task, but as we're legally bound to try to prevent this kind of activity, we do give it our best shot. So without going into too much detail, we have a couple of things we look for. One, of course, is sheer volume: anyone downloading substantially more content than they could possibly read in a given timeframe (or lifetime) is likely to have their access cut off. Another tip-off is a user who connects from different countries on the same night. We've had instances of the same user coming in from China, Russia and Canada at the same time; this is a pretty good indication that the account has been compromised.
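For the curious, both checks are easy to sketch in Python. What follows is a toy version under stated assumptions, not our actual monitoring setup: the `Request` record, the function names, the byte limit and the 12-hour window are all invented for the example, and it assumes each proxy log entry has already been tagged with a country (for instance by a GeoIP lookup, which is not shown).

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Set


class Request(NamedTuple):
    """One proxy log entry (hypothetical layout, for illustration)."""
    user: str
    when: datetime
    country: str
    size_bytes: int


def heavy_downloaders(requests: Iterable[Request], limit_bytes: int) -> Set[str]:
    """Users whose total download volume exceeds limit_bytes."""
    totals = defaultdict(int)
    for r in requests:
        totals[r.user] += r.size_bytes
    return {user for user, total in totals.items() if total > limit_bytes}


def multi_country_users(
    requests: Iterable[Request],
    window: timedelta = timedelta(hours=12),
    min_countries: int = 2,
) -> Set[str]:
    """Users seen in at least min_countries countries within one window."""
    by_user = defaultdict(list)
    for r in requests:
        by_user[r.user].append(r)

    flagged = set()
    for user, reqs in by_user.items():
        ordered = sorted(reqs, key=lambda r: r.when)
        start = 0
        in_window = defaultdict(int)  # country -> requests inside the window
        for r in ordered:
            in_window[r.country] += 1
            # Slide the window forward, dropping requests that are too old.
            while r.when - ordered[start].when > window:
                old = ordered[start]
                in_window[old.country] -= 1
                if in_window[old.country] == 0:
                    del in_window[old.country]
                start += 1
            if len(in_window) >= min_countries:
                flagged.add(user)
                break
    return flagged
```

Neither check proves anything on its own. A legitimate user on a VPN can trip the country test, so in practice a flag like this is a reason to look more closely, not a verdict.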

If we determine your account has likely been compromised, we block you from logging into our proxy server until we can confirm you've changed your password. It's a regrettable inconvenience, particularly if it happens on the night before your paper is due, and that also happens to be the night you started working on it. However, in a sense we're doing you a favour ... given all the other things your Netlink ID has access to, you really don't want random strangers sharing it around.

Lately we've become aware of a new tactic on the part of our adversaries, what I call slow-crawling. Apparently enough libraries now are monitoring for excessive downloading that the bad folks have had to come up with a new approach. Slow-crawlers are programs that typically don't download very much over the course of an hour, but they plug away, hour after hour, day after day, week after week. They never log out, and interestingly, never go away long enough for their proxy session to expire. So even if we block the account in the normal way, they don't lose access, because they never actually have to log in after the first time. Deviously clever, and my hat's off to them. Of course, now that we're wise to their new trick we can monitor for that too, so they're going to have to work even harder in the future. I'm sure they will ...
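One way to watch for slow-crawlers, sketched here in Python, is to measure how long a session has stayed alive without ever going quiet for long enough to expire. This is an illustration under assumptions rather than a description of what we actually run: the function names, the two-hour idle timeout and the two-day threshold are all made up for the example, and the input is simply the list of request times for one proxy session.

```python
from datetime import datetime, timedelta
from typing import Sequence


def longest_unbroken_activity(
    timestamps: Sequence[datetime], idle_timeout: timedelta
) -> timedelta:
    """Longest stretch in which no gap between requests reached idle_timeout."""
    if not timestamps:
        return timedelta(0)
    ordered = sorted(timestamps)
    longest = timedelta(0)
    run_start = ordered[0]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous >= idle_timeout:
            # The session would have expired here, so the run ends.
            longest = max(longest, previous - run_start)
            run_start = current
    return max(longest, ordered[-1] - run_start)


def looks_like_slow_crawler(
    timestamps: Sequence[datetime],
    idle_timeout: timedelta = timedelta(hours=2),  # assumed session timeout
    max_span: timedelta = timedelta(days=2),       # assumed human limit
) -> bool:
    """True if the session stayed alive longer than a person plausibly would."""
    return longest_unbroken_activity(timestamps, idle_timeout) >= max_span
```

People sleep, eat and go to class, so their sessions lapse. A session that has been kept warm for days on end is almost certainly a program, however modest its hourly download count.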