Data Mining Lessons Applied to Analyzing Patent Documents

As the economic marketplace associated with patent assets continues to grow, involve new players, and become more transparent, methods for analyzing patents continue to be developed and used to gain insight into portfolios. Recently, we have seen two examples where the use of patent analytics has had a significant impact on the economic valuation of a collection of patents. The first involved a major Canadian bank doubling its valuation of RIM’s patent portfolio after RIM was mentioned in a patent study as having a stand-out portfolio. The second involved the analysis of AOL’s patent assets, where two different sets of analytics produced very different results. When Microsoft eventually purchased the assets, one of the valuations matched the price paid almost exactly. Both of these cases demonstrate how important well-thought-out analytics are to providing signals of value when working with patents.

There are a number of lessons that can be gleaned from the field of data mining and applied to patent analysis, but for the purposes of this article let’s focus on three of them to help us understand how to conduct patent analytics that provide real value.

One of the first things data analysts learn is that context matters. This is especially true with patent data. While it is romantic to think that the raw data and the numbers themselves will tell the story, in reality it is the understanding of the context behind that data that provides the real insight. Subject matter experts can help make sense of the data and apply it properly, but there is really no substitute for analysts who have worked in the patent field and who understand the nuances and peculiarities of patent information.

Data scientists also tend to think about analysis projects with regard to the scope of analysis pursued. They refer to the analysis of small data collections as micro scale, encompassing tens of documents or looking at a single entity of some type. The next scale level is called meso and looks at hundreds of documents or a collection of closely related organizations or individuals. Finally, when the data sets get large, encompassing tens of thousands of documents or entire fields or countries, the analysis is referred to as being on a macro level. Different tools and techniques work best on the various levels and it is important to understand these distinctions before the analyst starts working on a particular problem.

It is also important to keep in mind that, as a data scientist, you should never rely on a single source or piece of evidence when drawing your conclusions. Whenever possible, use more than one source of data, ideally collected in a different way or under different rules, to avoid generating false artifacts. If more than one independent data source, analyzed using different methods, leads to the same conclusion, then there is a higher likelihood that what is being observed is a real event, as opposed to an unforeseen bias introduced by the data collected or the method used to analyze it.

Keeping these three tenets in mind, let’s have a look at a popular patent analysis technique that is often used to determine the potential value of patent documents. Patent citations have been used for many years as an indicator of patent value. The point of this article is not to argue whether the practice is valuable or not, per se, but to look more closely at how the method should be practiced according to the data mining rules mentioned earlier.

I recently asked how many forward citations US8341981 has. This patent was granted on January 1st of this year, so the answer would obviously appear to be zero, since it was only recently published. Personally, I think the answer should be seven, and an argument can be made for 22. The first data mining lesson talked about context and understanding how the patenting process works. In this case, it is critical to realize that since 2001 US granted patents may also have a corresponding pre-grant application associated with them, and this equivalent document can have citations of its own, just as the patent itself can. Many patent information systems and methods of analysis look at citations to discrete documents and don’t take patent equivalents into consideration. This can dramatically underrepresent the true number of citations associated with a patent document. In the case of the ‘981 patent, the corresponding pre-grant application has seven citations to it, while the granted patent has none. Citations can also be collected for patent families, which in this case is where the answer of 22 comes from.
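As a rough illustration of why the three answers differ, the counting can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the `PatentDocument` class, the `forward_citations` function, and every identifier other than US8341981 are hypothetical placeholders, and the sets of citing documents are simply sized to match the counts quoted above. It also assumes that a citing document is counted once even if it cites several members of the same family, which may not be how any particular system arrives at its number.

```python
from dataclasses import dataclass, field


@dataclass
class PatentDocument:
    doc_id: str
    # IDs of the documents that cite this one (its forward citations).
    cited_by: set = field(default_factory=set)


def forward_citations(documents):
    """Count distinct citing documents across a group of related documents."""
    citing = set()
    for doc in documents:
        citing |= doc.cited_by
    return len(citing)


# Placeholder citing-document IDs, sized to match the counts in the article.
grant = PatentDocument("US8341981")
application = PatentDocument(
    "PRE-GRANT-APPLICATION", {f"CITING-{i}" for i in range(1, 8)}
)
other_members = PatentDocument(
    "OTHER-FAMILY-MEMBERS", {f"CITING-{i}" for i in range(8, 23)}
)

print(forward_citations([grant]))                              # discrete document: 0
print(forward_citations([grant, application]))                 # with its equivalent: 7
print(forward_citations([grant, application, other_members]))  # whole family: 22
```

The only design decision here is the unit of analysis: the same function gives zero, seven, or 22 depending on which documents are grouped together before counting.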

Looking at citations associated with patent equivalents or families is an example of conducting a micro level analysis. When working with patent data it is also possible to look at trends on the meso and macro levels. For instance, on the macro level we can look at the different citation patterns associated with patent documents coming from the US versus those coming from Europe. In these studies, it was found that a little over 75% of US granted patents have at least one forward citation associated with them. Interestingly, about the same percentage of pre-grant applications from the US also have at least one forward citation. For Europe, though, the statistics change dramatically. With European granted patents, the chance of finding at least one forward citation is less than 10%, but when the corresponding pre-grant applications are examined, between 35% and 45% of them have at least one forward citation. This is also an example of context, but it vividly demonstrates the differences in behavior between the two systems at a very high, or macro, level. This is critical to keep in mind when conducting an analysis that spans different countries.
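The macro level statistic described above, the share of documents with at least one forward citation, is straightforward to compute once citation counts are grouped by patent office and document type. The sketch below shows one way to do it. The function name `share_cited` and the record layout are assumptions made for illustration, and the toy records are invented purely to exercise the function; they are not the data behind the percentages quoted above.

```python
from collections import defaultdict


def share_cited(records):
    """Return, per (office, doc_type), the fraction of documents that have
    at least one forward citation.

    Each record is a tuple: (office, doc_type, forward_citation_count).
    """
    totals = defaultdict(int)
    cited = defaultdict(int)
    for office, doc_type, n_citations in records:
        key = (office, doc_type)
        totals[key] += 1
        if n_citations > 0:
            cited[key] += 1
    return {key: cited[key] / totals[key] for key in totals}


# Invented toy records, for illustration only.
toy_records = [
    ("US", "grant", 3),
    ("US", "grant", 0),
    ("US", "grant", 1),
    ("US", "grant", 2),
    ("EP", "grant", 0),
    ("EP", "grant", 0),
    ("EP", "application", 1),
    ("EP", "application", 0),
]

for key, share in sorted(share_cited(toy_records).items()):
    print(key, f"{share:.0%}")
```

Keeping the office and the document type together in the grouping key is what makes the US and European figures comparable on like-for-like terms.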

Finally, it is always best, when possible, to consider multiple sources when analyzing data, including patent data. In the case of the question I asked about the citations associated with US8341981, it was surprising to find that the various systems do not all agree with one another on how many citations are associated with the discrete documents. One of the systems suggested that the pre-grant application had seven citations, while another said six and a third said four. For this reason, it is important to consider different sources, or at the very least to document when the analysis took place and with what data source.
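One lightweight way to act on this is to store the counts from each system side by side, together with the retrieval date, and to flag any disagreement. The following is a minimal sketch of that idea. The function `compare_sources` and the source labels `system_a`, `system_b`, and `system_c` are hypothetical; the article does not name the systems, and only the counts of seven, six, and four come from the text.

```python
from datetime import date


def compare_sources(counts_by_source, retrieved_on):
    """Summarize citation counts reported by several sources for one document,
    recording when the counts were retrieved."""
    values = sorted(counts_by_source.values())
    return {
        "retrieved_on": retrieved_on.isoformat(),
        "counts_by_source": dict(counts_by_source),
        "sources_agree": len(set(values)) == 1,
        "lowest": values[0],
        "highest": values[-1],
    }


# Counts for the pre-grant application as reported in the article;
# the source labels are placeholders.
reported = {"system_a": 7, "system_b": 6, "system_c": 4}

summary = compare_sources(reported, retrieved_on=date.today())
print(summary)
```

A record like this does not say which system is right, but it makes the disagreement visible and ties every number to a source and a date.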

Patent analytics are playing a significant role in helping establish the economic value associated with patent documents and portfolios. As with almost any data source, it is important that the analysts conducting the studies understand the information they are working with and follow some general rules to provide real insight. In fact, it can be argued that it is even more important that patent-savvy individuals be tasked with working with patent data, since it is far more nuanced and complicated than other sources and potentially ripe for misinterpretation.

