A Study on Methods for Extracting US Patent Bibliographic Information: Focusing on the Use of HTML Parsing Techniques

Yoo-Jin Han, Seung-Woo Oh

doi:10.3743/kosim.2010.27.2.007

Abstract

This study proposes a method for extracting the most recent information from US patent documents. It adopts an HTML parsing technique that connects directly to the US Patent and Trademark Office (USPTO) Web page. After a keyword search returns a list of 50 documents, an algorithm based on HTML parsing extracts key bibliographic information, such as the patent number, applicant, and US patent class, directly from that list. The study also presents an algorithm that simultaneously extracts the US patent classes of a patent and of its subsequent patents by exploiting the relationship between preceding and subsequent patents, a distinctive feature of US patent documents. Although the proposed method has several limitations, it can effectively supplement existing databases in terms of timeliness and comprehensiveness.

Keywords: US patents, bibliographic information, extraction, HTML parsing
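The list-parsing step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the sample markup, the column order, and the names `SAMPLE_HTML`, `ResultListParser`, and `extract_records` are all assumptions made for illustration, and the actual USPTO result page is laid out differently. The sketch only shows the general idea of walking an HTML result list and collecting a patent number, an applicant, and a US patent class for each row, using Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical result-list markup. The abstract does not give the real
# USPTO page structure, so this table and its values are invented.
SAMPLE_HTML = """
<table>
  <tr><th>Patent No.</th><th>Applicant</th><th>US Class</th></tr>
  <tr><td>7,000,001</td><td>Example Corp.</td><td>707/3</td></tr>
  <tr><td>7,000,002</td><td>Sample Labs</td><td>715/234</td></tr>
</table>
"""


class ResultListParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (including header cells) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Return one dict per result row (assumed column order)."""
    parser = ResultListParser()
    parser.feed(html)
    parser.close()
    fields = ("patent_number", "applicant", "us_class")
    return [dict(zip(fields, row)) for row in parser.rows]


records = extract_records(SAMPLE_HTML)
for record in records:
    print(record)
```

In practice the HTML would come from the live search result page rather than from a string literal, and the tag and column handling would have to be adapted to that page's actual markup.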

Full Text