Opening Up the Law: Pacer, CITP, and the RECAP the Law Project

As some of you know, I am a Visiting Fellow this year at Princeton’s Center for Information Technology Policy. When I arrived a couple of weeks ago, I heard about a project in the works and have been dying to tell people about it. It is now live and looks great. It is called RECAP, and it just may change the way people access a major part of the law. We’re talking about the law that lurks outside cases; the actual guts of litigation.

Attorneys live and die by documents. As I tell my students, you must write well, because lawyers are paid in large part to write. With around 1.1 million attorneys practicing in the U.S., a large amount of paper, a.k.a. court documents, is generated each and every day. Court documents are essentially public documents (there are times when papers are sealed, etc., but that is a separate matter). The government runs a system called PACER that allows one to search for and access U.S. Appellate, District, and Bankruptcy court records and documents. But as the Washington Post explains, “The fee to access PACER is $0.08 per page: ‘The per page charge applies to the number of pages that results from any search, including a search that yields no matches (one page for no matches.) The charge applies whether or not pages are printed, viewed, or downloaded.’ For people who do a lot of legal research, those fees add up quickly.”
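
To make “add up quickly” concrete, here is a rough sketch of the arithmetic in Python. It assumes only the flat $0.08 per page rate and the one-page charge for an empty search quoted above; the function name and the sample page counts are made up for illustration, and the sketch ignores any caps, waivers, or other billing rules PACER may apply.

```python
from decimal import Decimal

# Per-page rate quoted in the passage above.
PER_PAGE_FEE = Decimal("0.08")


def pacer_cost(pages_per_request):
    """Estimate the total charge for a series of PACER requests.

    Each entry is the number of pages a request returned. A search with
    no matches is billed as one page, per the quoted rule.
    """
    billable_pages = sum(max(1, pages) for pages in pages_per_request)
    return PER_PAGE_FEE * billable_pages


# Hypothetical research session: three documents and one empty search.
session = [30, 0, 12, 45]
print(pacer_cost(session))
```

The empty search in the middle of the session still costs eight cents, which is the detail that surprises most people.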

In an era of transparent government, open source, and access-to-knowledge movements, it was only a matter of time before someone decided to find a way to make court documents available on a broader basis. The folks at Stanford have the IP Litigation Clearing House. That project aims to fill the “critical need for a comprehensive, online resource for scholars, policy makers, industry, lawyers, and litigation support firms in the field of intellectual property litigation.” That project has 23,000 documents and is growing. Pretty darn good, if you ask me. But wait; don’t order yet! Now comes RECAP from the folks at Princeton’s Center for Information Technology Policy. (Specifically, Harlan Yu, Steve Schultze, and Timothy B. Lee developed the project, which is led by Prof. Ed Felten.) Here is the link to the About Page, but let me tell you a little more.

CITP’s Harlan Yu explains:

RECAP is a plug-in for the Firefox web browser that makes it easier for users to share documents they have purchased from PACER, the court’s pay-to-play access system. With the plug-in installed, users still have to pay each time they use PACER, but whenever they do retrieve a PACER document, RECAP automatically and effortlessly donates a copy of that document to a public repository hosted at the Internet Archive.

In addition, if one is using PACER and RECAP, “The documents in this repository are, in turn, shared with other RECAP users, who will be notified whenever documents they are looking for can be downloaded from the free public repository.” So when one searches for a document, one is notified if a free copy of that document is available.
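
For the technically inclined, the share-and-notify loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RECAP's actual code: the `SharedRepository` class is an in-memory stand-in for the public repository, the `purchase` callback stands in for a paid PACER download, and keying documents by court and document id is my assumption.

```python
class SharedRepository:
    """In-memory stand-in for the free public repository (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def has(self, key):
        return key in self._docs

    def get(self, key):
        return self._docs[key]

    def donate(self, key, content):
        # Keep the first donated copy; later donations are no-ops.
        self._docs.setdefault(key, content)


def fetch_document(repo, court, doc_id, purchase):
    """Return (content, paid).

    If a free copy is already in the repository, use it. Otherwise pay
    for the document via `purchase` and donate a copy for the next user.
    """
    key = (court, doc_id)
    if repo.has(key):
        return repo.get(key), False
    content = purchase(court, doc_id)
    repo.donate(key, content)
    return content, True
```

The first person to request a document pays for it; everyone after that gets the donated copy, which is the whole point of the project.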

There is probably much more to say here, but for now I want to congratulate the folks here at CITP on a great idea that uses information, technology, law, and policy to craft an elegant solution to increasing government transparency. This resource should feed almost anyone interested in practicing or studying the law. Empirical researchers alone should be drooling at this new wealth of information.

7 Responses

  1. Sarah L. says:

    This is fantastic. Is there a way to browse the publicly available documents at the Internet Archive without logging onto Pacer, or a search engine for the documents that gets around having to log into Pacer?

  2. Harlan Yu says:

    Hi Sarah, the public repository can’t be searched or indexed by crawlers just yet because of privacy reasons. In the past, the Courts haven’t been very good at enforcing their own rule that requires attorneys to submit redacted versions of their filings. They’re starting to be more strict about this, but there is still plenty of private data that can be mined out of the existing documents.

    That said, if you know the court name and PACER case number, you can manually look at cases at the repository:

    [court] is the short abbreviation of the court name from the ECF domain name (e.g. ‘cand’ for Northern District of California) and [casenum] is the case number from PACER.

  3. Sarah L. says:

    Thank you–that’s very helpful. And let me just reiterate how incredibly great this project is.

  4. Harlan

    You state: “That said, if you know the court name and PACER case number”

    That is really not correct – you have not correctly described what RECAP does.

    The Pacer (actually CM/ECF) case number that RECAP elected to use is not the Docket Number but rather a hidden unique number used by the CM/ECF database. One can find this only if one inspects the source code of the docket html page. This unique case number is obscure and of no meaning to most people. It is like identifying you by your SS number, rather than your name.

    In addition, one would need to know the exact Docket Entry number for the document. So, your specification is not complete.

    People also need to know that judicial opinions marked by a judge as such are already completely free on PACER – CM/ECF as long as one has registered for a user name. FREE.

    Also, has many district court cases obtained this way from the free opinions on CM/ECF. But, they do something better than RECAP. They have a meaningful file name AND they stuff all of the metadata into the properties of the pdf file and the opinions are searchable on Google.

    RECAP would be much better if it used an understandable file name with the docket number and included the metadata on the docket sheet in the pdf file.

    That being said, the RECAP concept is brilliant and the programming is expert – but more needs to be done to make this effective and to persuade attorneys to sign on and offer up free documents.

    Alan Sugarman

  5. Harlan

    It occurred to me that you meant you would see a list of documents so I checked again to see if the directory was exposed, and it was.

    I now see that if one does happen to know the ECF case number, then one can see a list of all documents uploaded as to the case number – since the directory is exposed.

    Now having seen that, I checked out the files in this directory for a file I uploaded yesterday:

    I see now the metadata and the docket sheet that I looked at yesterday.

    First, the docket sheet html file leaves out all of the very valuable information at the start of the docket, including the type of case, the names of the attorneys, and the name of the judge. Most important, you leave out the coverage period for the docket report.

    I wonder why all of this was not captured.

    Second, it appears you collected the docket sheet report – in part – that I elected to ask for. The problem there is that one can ask for a docket sheet only for certain dates – this is done in big cases. So, I assume that when this is done, there will be an overwrite.

    Third, I looked at the “metadata” for the actual document uploaded. This is very limited and should be compared to the breadth of data on the written opinions report that CM/ECF provides (am I the only person who logs into ecf.nysd and sees CMECF at the top of the page – why do people refer to this as PACER???) See how does this. Much better. What you need to do is to fully populate the xml file and attach it to the pdf file. The metadata ought to go with the file. Sorry if this spoils your hash. Also, I see that you may want to attach the case meta xml file as well to the pdf file.

    This is what you have for the metadata for this document:
    ETag: "6cd55dac216aa2d147c30312185db880"
    accept-encoding: identity
    authorization: LOW MtXL0tEgFmJcLXjr:REDACTED_BY_IA_S3
    connection: close
    content-length: 38327
    content-type: application/x-www-form-urlencoded
    user-agent: Python-urllib/2.5
    x-archive-meta-attachment-num: 0
    x-archive-meta-available: 1
    x-archive-meta-collection: usfederalcourts
    x-archive-meta-court: vid
    x-archive-meta-doc-num: 370
    x-archive-meta-language: eng
    x-archive-meta-mediatype: texts
    x-archive-meta-neverindex: true
    x-archive-meta-noindex: true
    x-archive-meta-pacer-case-num: 10330
    x-archive-meta-pacer-doc-id: 1930210992
    x-archive-meta-sha1: 95b4d596913e6f653f8df2134f989b7a34f54fa7
    x-archive-meta-upload-date: 2009-08-15 11:43:45
    x-archive-queue-derive: 0
    x-upload-date: 2009-08-15T16:34:58.000Z

    Incredibly, for the metadata for this document, you omit the docket number of the case – one of the most important pieces of information. So, one could not search the xml file and find the case by docket number!!!!! But, you did include this in the docket xml file. It should be in both. I am assuming of course that the opinion files are ones that people will want to search on the internet at some point in the future. Oops. Now I see you left out the case name in the document xml, but it is in the docket html.

    For example, you do not even have the judge’s name. You also should drop in the name of the court in the metadata, although it is also in the file data.

    Basically, you need to parse out each field you can identify in the CMECF database – and have a separate entry in the metadata. And, the metadata of a separate doc file should include the metadata for the case.

    You also need more comprehensive descriptors – for example, the metadata should include a line stating “United States District Court” – although one could claim this is implied in the cryptic file name “vid.”

    You could also help out by counting the number of pages and characters in the document using a standard PDF SDK.

    Anyway, whatever you do, please do not toss out the information in the docket sheet header.
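
A rough sketch of the suggestion above (one metadata entry per field, with the case-level fields copied into each document's metadata) might look like the following in Python. It reads header lines in the `x-archive-meta-*` form listed in the comment; the function names and the case-level field names are hypothetical placeholders, not anything RECAP or the Internet Archive actually defines.

```python
PREFIX = "x-archive-meta-"


def parse_meta_headers(raw):
    """Collect 'x-archive-meta-*' header lines into a dict.

    Keys are the field names with the prefix stripped; other headers
    (ETag, user-agent, and so on) are ignored.
    """
    meta = {}
    for line in raw.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        name, sep, value = line.strip().partition(":")
        if sep and name.lower().startswith(PREFIX):
            meta[name[len(PREFIX):].lower()] = value.strip()
    return meta


def with_case_fields(doc_meta, case_meta):
    """Add case-level fields to document metadata.

    The document's own values win if a field appears in both.
    """
    merged = dict(case_meta)
    merged.update(doc_meta)
    return merged


raw_headers = """\
x-archive-meta-court: vid
x-archive-meta-doc-num: 370
x-archive-meta-pacer-case-num: 10330
x-archive-meta-upload-date: 2009-08-15 11:43:45
user-agent: Python-urllib/2.5
"""

doc_meta = parse_meta_headers(raw_headers)
# Placeholder case-level values; real ones would come from the docket sheet.
case_meta = {"docket-num": "EXAMPLE-DOCKET-NUMBER", "case-name": "EXAMPLE-CASE-NAME"}
print(with_case_fields(doc_meta, case_meta))
```

With the case fields merged in, a search over the document metadata alone could find a filing by docket number or case name.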


  6. Harlan Yu says:

    Hi Alan, thanks for your comments. I agree that these are all important issues. This release is just the first iteration of the project and we look forward to working with you and others to make these documents as useful as possible. A few responses:

    – We’ve open-sourced the client and would love to see outside developers add features and submit patches. Implementing better filenames is definitely one of these client-side features that somebody could run with (though, it might not be as easy as it sounds… will probably need a client-side cache of case names, etc.)

    – We’re slowly improving our scraping to gather more metadata from the docket sheet header. As you may have noticed, each instance of CM/ECF can choose to style their HTML pages differently, so a bit of logic and lots of testing is needed to make sure we’re scraping correctly for each court.

    – There’s now a centralized feedback forum for RECAP: It would be great if you can enter all of your specific suggestions there!

    If you have more technical questions, I’m happy to continue the discussion off-forum.