Structuring US Law

In 2013, the Office of the Law Revision Counsel of the U.S. House of Representatives released the Titles of the U.S. Code as “structured data” in XML. Previously the law had been available only as ordinary text. Structuring the law as data allows for interesting visualizations of, and interactions with, the law that were not previously feasible, such as the following:

Force Directed Explorer App screenshot

US Code Explorer App screenshot

This post will discuss what it means for US law to be structured as data and why this has enabled increased analysis and visualization of the law. (You can read more about the visualizations above here and here)

Structuring U.S. Law

The U.S. Code (the primary codification of federal statutory law) has always had an implicit structure. However, it now has an explicit, machine-readable structure.

Structured Law and Computer Analysis

The government's 2013 release of the United States Code in XML (Extensible Markup Language) format was the first time the law was officially released as data, rather than as ordinary text. XML is a general language for expressing data according to well-defined formatting rules, whether that data consists of business transactions, course enrollment, or the Titles of the U.S. Code. Releasing the laws in XML means that the federal laws have now been given a structure that can be read by computers.

Law in XML

To see why explicitly structuring the law in “machine-readable” form allows for more advanced computer analysis, let’s first contrast the concepts of implicit and explicit structure in the law.

The Structure of the United States Code

The US Code has a structure. At the highest structural level the Code is divided into over 50 “Titles”.

Title 15 - Commerce and Trade
.. 
Title 26 - Internal Revenue Code
..
Title 35 - Patents

Loosely speaking, each “Title” corresponds to a different topical area of lawmaking. For instance, Title 35 contains most of the Patent Laws, and Title 17 contains most of the Copyright Laws. (However, note this is an approximation, as many Titles contain a hodgepodge of unrelated topics housed under one document, e.g. Title 15, Commerce and Trade; conversely, the laws regulating some topics are spread across multiple Titles.) The fact that laws are loosely placed by topic within a particular Title is one form of overall structure.

Title Hierarchy: Parts -> Chapters -> Sections

Each Title, in turn, has its own hierarchical structure. Every Title is divided into smaller parts and sections at different hierarchical levels. A typical Title of the U.S. Code is divided into something like:

  Chapters -> Sub-Chapters ->
                   Sections -> Sub-Sections -> Paragraphs

and so on.

For instance, Title 35, the Patent Code, has different rules concerning different patent-related topics, and these are located in separate parts of the overall hierarchy of the Title.

Title 35 - Patents
  Part 1 -United States Patent And Trademark Office
     CHAPTER 1— Establishment, Officers And Employees
     CHAPTER 2— Proceedings In The Patent And Trademark Office
        § 21. Filing Date And Day For Taking Action
  ....
  Part 2 -Patentability Of Inventions And Grant Of Patents
  ... 
     CHAPTER 10— Patentability Of Inventions
        § 100. Definitions
        § 101. Inventions Patentable

Section 101, for instance, contains the rules that tell us what types of inventions can be patented, and these rules are located in the overall hierarchy in:

Title 35 - Part 2 - Chapter 10 - 
Section 101 - "Inventions Patentable". 

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Plain Text Law: Unstructured Text

The section just presented is an example of what might be called an “unstructured” or “plain text” version of the law. A “plain text” version of the law simply refers to the law as lawyers are used to seeing it written: in ordinary sentences designed for people to read (as opposed to computers).

I used the phrase “designed for people to read” to emphasize a point: laws written in plain text, like the one above, may be easy for people to read but are likely to be difficult for computers to read. “Plain text” can be contrasted with highly structured, “machine-readable” text, like the example below.

Computer Readable XML version of Law

Computers prefer text to be rigidly organized and precisely labeled (e.g. <section>) like this. Such text is considered “structured” (and machine-readable) because a computer can, following rigid rules, methodically go through and unambiguously identify each part. In the example above, there is legal language within <sectionText>, and the computer knows exactly where the <sectionText> language begins and where it ends.
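To make this concrete, here is a minimal sketch in Python of how a program might read such markup. The XML string is a simplified illustration built from the tag names mentioned in this post (<section>, <sectionText>); the attribute names are assumptions, and the official U.S. Code XML schema differs in its details.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup modeled on the tags named in this post.
# It is NOT the official U.S. Code XML schema.
SECTION_XML = """
<section num="101" description="Inventions Patentable">
  <sectionText>Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.</sectionText>
</section>
"""

section = ET.fromstring(SECTION_XML)

# The opening and closing tags, not spacing or bolding, mark where the text
# of the section starts and ends.
section_text = section.find("sectionText").text

print(section_text)
```

The program never has to guess where the legal language begins or ends; the parser simply returns whatever sits between the opening and closing <sectionText> tags.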

Plain Text Law: Implicit Structure

By contrast, attorneys know that a typical Title of the U.S. Code does have an internal structure, but we can think of that structure as being implicit, rather than explicit.

Patent table of Contents

The meaning of an “implicit structure” might not be obvious. If you’re an attorney looking at the above, you might be thinking, “There is an obvious explicit structure in the law above. It is divided up into chapters, sections, etc., all lined up neatly. I can see that plainly.”

True, if an attorney were to look at a printout of Title 35, she could see that it is divided into 5 “Parts”, each “Part” contains multiple “Chapters”, each “Chapter” in turn contains “Sections”, etc.

However, nothing in the official Title 35 document explicitly tells us that a “Part” comes above a “Chapter” in the hierarchy, that a “Chapter” is above a “Section”, or that the names of the sections can be found by looking at the text on a line of its own in bold.

Plain text image of section101

Rather, such an organization is implicit in the way the text is displayed and labeled, guided by long-standing legal conventions. Attorneys learn to parse this hierarchy by relying upon these familiar standards about how law is labeled and structured (e.g. “Sections tend to come below Chapters in Titles”), and by drawing upon their general legal knowledge and experience.

Visual Cues and Implicit Structure of Law

In looking at the law, visual cues are important for understanding the internal structure and hierarchy of a Title. For instance, when the law is displayed, it is often indented by several spaces at each new level in order to make the hierarchy apparent.

Similarly, emphasis such as bold is often used to demarcate section or chapter names, as in the example below.

Plain text printing of Section 101 with the heading in bold

Additionally, we rely on implicit visual cues such as spacing and blank lines to understand where one element ends and the next begins, such as where the section heading ends and where the section text begins.

For instance, in looking at the above plain text printing of Section 101, we understand through visual cues that the heading of the section is “Inventions Patentable”, and that the heading ends with the word “Patentable”, the last bolded word, which is followed by a blank line.

We understand from these cues that the text of the section begins after that blank line, with “Whoever invents…”, and ends with the word “title.” The change in formatting and spacing conveys to readers visually where the heading ends and the content begins.

Unstructured Law: Difficult For Computers

The structure in “plain text” sentences – like the law above – is obvious for attorneys to see, but for a computer, such implicit structure is typically difficult to unambiguously process. A computer might not be able to understand (without accuracy issues) the same cues (spacing, headings) that humans easily rely upon to separate out the law into its components and subcomponents.

A computer might, for instance, accidentally read the section heading and text as one continuous entity, “Inventions Patentable Whoever invents..”, not understanding that the change in bolding and blank line are meaningful indicators.

In general, computers are not as good as people at understanding arbitrary visual cues – like bolding and spacing – that are often used to indicate the implicit structure of printed documents.

While in principle you can program a computer to make educated guesses about the structure based upon the formatting and spacing, the computer is liable to make errors in “parsing” or reading the law if there are even minor variations (e.g. one section uses two blank lines, instead of one, between heading and text).
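As a rough illustration of how fragile such guessing can be, the Python sketch below separates a heading from its text by assuming exactly one blank line sits between them. The function name and the sample strings are hypothetical, chosen only for this example.

```python
def split_heading_and_text(plain_section: str) -> tuple:
    """Guess heading vs. body, assuming a blank line separates the two."""
    heading, _, body = plain_section.partition("\n\n")
    return heading.strip(), body.strip()


# The layout the heuristic expects: heading, blank line, text.
expected_layout = (
    "§ 101. Inventions Patentable\n\n"
    "Whoever invents or discovers any new and useful process"
)

# A minor formatting variation: the blank line is missing.
variant_layout = (
    "§ 101. Inventions Patentable\n"
    "Whoever invents or discovers any new and useful process"
)

good_heading, good_body = split_heading_and_text(expected_layout)
bad_heading, bad_body = split_heading_and_text(variant_layout)

print(repr(good_heading), repr(good_body))
print(repr(bad_heading), repr(bad_body))
```

In the second case the heuristic silently treats the heading and the text as one continuous entity and returns an empty body, which is the kind of error described above. A person reading the same two printouts would not be confused by the difference.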

By contrast, people are good at adapting to arbitrary formatting changes, and are great at picking up patterns that indicate implicit structure.

Such plain text law is difficult for computers to read with accuracy.

In sum, when the law is printed as plain text, as it has traditionally been printed for hundreds of years, very basic tasks are comparatively difficult for computers to do with a high level of accuracy. Examples include separating a Title into its different parts and sub-parts (e.g. headings, content, chapters, etc.) or differentiating section headings from the text of the legal rule.

Even a simple task such as counting the number of Sections in Title 35, easy for a person, would be error-prone for a computer that is examining plain text.
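A short Python sketch shows how such a count can go wrong. The sample text is invented for illustration (it is not actual statutory text), and the regular expression is just one plausible heuristic: treat any line that starts with “§”, a number, and a period as a section heading.

```python
import re

# Invented sample text with two sections. The second section cites another
# provision, and the citation happens to wrap onto the start of a line.
SAMPLE_TEXT = """\
§ 100. Definitions

Sample body text for the first section.

§ 101. Inventions Patentable

Sample body text that cites another provision, as described in
§ 112. The citation happens to wrap onto its own line.
"""

# Heuristic: a section heading is any line starting with "§ <number>."
heading_pattern = re.compile(r"^§ \d+\.", re.MULTILINE)
guessed_count = len(heading_pattern.findall(SAMPLE_TEXT))

print(guessed_count)  # 3, although the sample contains only 2 sections
```

The heuristic could be patched to handle this one case, but each patch tends to introduce new assumptions about formatting that some other part of the text will violate.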

Structuring Law: US Code Released as XML

In 2013, the U.S. House of Representatives released the Titles of the U.S. Code as structured data in XML format. The excellent Cornell Legal Information Institute had previously released an unofficial XML version of the federal law, but this was the first time a government source had done so.

The fact that the law is now marked up in XML means that Section 101 of the Patent Code now looks something like this:

Law in XML

Computer Friendly Law

This version of the law is much less human-friendly to read than the typical plain text form, but much friendlier for computers to read. Computers excel when there are precise, unambiguous rules to follow. The .xml version of the U.S. Code makes the structure and hierarchy of the law explicit in a way that a computer can be told to read.

For instance, rather than guessing about where the text of Section 101 begins and ends based upon bolding and spacing, we have been told explicitly, thanks to the <section> tags.

The text of Section 101 is everything between the labels:
<section num="101">

and

</section>

The US Government took the time to label the exact start and end of every single section, part, etc. of every law in the U.S. Code. This means that a computer no longer has to approximate the start or end of a section based upon visual cues or spacing. The end result is that a computer can unambiguously and accurately extract the text of any section, subsection, chapter, etc. in any Title of the U.S. Code.

Moreover, this structured format allows a computer to identify the headings and the numbers of each section, chapter, etc., without a problem. In the above example they have been explicitly marked as

number="101" and description = "Inventions Patentable"

in the xml code.
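The following Python sketch illustrates the idea of reading section numbers and headings directly from labeled markup. The XML is a simplified stand-in, and the attribute names (`num`, `description`) are assumptions based on the labels quoted in this post rather than the official schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup; attribute names are assumed for illustration.
CHAPTER_XML = """
<chapter num="10" description="Patentability Of Inventions">
  <section num="100" description="Definitions">
    <sectionText>Sample text of the section.</sectionText>
  </section>
  <section num="101" description="Inventions Patentable">
    <sectionText>Sample text of the section.</sectionText>
  </section>
</chapter>
"""

chapter = ET.fromstring(CHAPTER_XML)

# Every section is explicitly labeled, so finding and counting them is exact.
sections = chapter.findall("section")
headings = [(s.get("num"), s.get("description")) for s in sections]

print(len(sections))
for number, heading in headings:
    print(number, heading)
```

Counting sections, which was error-prone with plain text, reduces here to counting the <section> elements.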

Extracting the Hierarchy

Additionally, the hierarchy of parts within each US Title has been made explicit. For instance, Title 35 in .xml looks something like this:
Title 35 Hierarchy

This structure means that the computer does not have to guess about the hierarchy (e.g. what Part contains what Chapter) in the law based upon visual clues and indenting or “know” common legal conventions. Rather, “Title 35” explicitly contains Part I within its tags:

Part I nested inside the Title 35 tags

Including Part I inside the Title tags <title></title> indicates that Part I is below “Title” in the law hierarchy. By placing one portion within the tags of another portion, you are explicitly defining the hierarchy in a way that the computer can read.
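As a sketch of what nesting makes possible, the Python example below walks a small, simplified version of the Title 35 hierarchy and prints an indented outline. The tag and attribute names are assumptions made for illustration, and only a handful of the real parts, chapters, and sections are included.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup covering a small slice of Title 35.
TITLE_XML = """
<title num="35" description="Patents">
  <part num="I" description="United States Patent And Trademark Office">
    <chapter num="1" description="Establishment, Officers And Employees"/>
    <chapter num="2" description="Proceedings In The Patent And Trademark Office">
      <section num="21" description="Filing Date And Day For Taking Action"/>
    </chapter>
  </part>
  <part num="II" description="Patentability Of Inventions And Grant Of Patents">
    <chapter num="10" description="Patentability Of Inventions">
      <section num="100" description="Definitions"/>
      <section num="101" description="Inventions Patentable"/>
    </chapter>
  </part>
</title>
"""


def outline(element, depth=0):
    """Return one indented line per element, following the XML nesting."""
    label = f"{element.tag} {element.get('num')}: {element.get('description')}"
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(TITLE_XML)
outline_lines = outline(root)
print("\n".join(outline_lines))
```

The indentation in the output is derived from the nesting of the tags, not from legal conventions or visual cues. The same traversal could feed a tree explorer or a graph visualization.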

Conclusion: Structured Law = New Analysis and Visualization

The upshot is that computers can now precisely read or “parse” the structure of the U.S. Code. To be clear, this does not enable computers to understand the meaning of the law, only its structure.

Because of such structuring, we can begin to create new visualizations like the U.S. Code tree explorer or the Force Directed Graph that were not feasible in the era of “plain-text” law.

Visual Combined

Similarly, structured law is more conducive to certain types of analysis than plain text, such as this analysis of the Complexity of the Law by Prof. Dan Katz and Michael Bommarito.

Computers are good at analyzing data when it has a consistent format. Now that the law is essentially structured data, many types of sophisticated analysis of the law that were previously infeasible are now possible.
