parsing data from tables (XML format?)


Andrea Rossi

I would like to extract the information from certain tables in a patent using a script. I was able to download the sequences as an XML file, but I could not find the other tables in XML format. Is there any reason why only sequences are stored in a non-image format?

I could use some OCR (Optical Character Recognition) technique to extract the data, but that would make my life much harder. 

Thanks in advance for any valuable suggestion you may provide!


Re: parsing data from tables (XML format?)

Dear Andrea,

For published PCT applications that the applicant submitted on paper or as images, we use automated OCR procedures to obtain the associated text. Unfortunately, because table layouts vary so widely across patents, these procedures are not reliable enough to reproduce the contents of tables accurately. Our OCR package identifies tables by recognizing the lines that separate rows and columns. Sequences contained in descriptions usually have no such lines and are therefore treated as text.

If the sequences are in XML format, it means that they have been supplied by the applicant to the International Bureau as such and we are therefore able to publish them in text form.

Unfortunately, the only solution for you is indeed to use an OCR package and to correct its errors for the patents you are interested in.
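
To give a rough idea of what the correction step could look like, here is a minimal Python sketch. It assumes you already have the plain text of a table from an OCR package of your choice, and that columns are separated by a tab or by at least two spaces. The function names and the sample data are made up for illustration; real OCR output will likely need more handling than this.

```python
import re
from collections import Counter

# Hypothetical OCR output for a small table (illustrative data only).
# The third data line contains typical OCR damage: "2O" and a split number.
SAMPLE = (
    "Compound   Dose (mg)   Yield\n"
    "A-1        10          0.82\n"
    "A-2        2O          0.9   1\n"
    "A-3        30          0.77\n"
)


def parse_ocr_table(text, min_gap=2):
    """Split each non-empty line into cells on tabs or runs of >= min_gap spaces."""
    splitter = re.compile(r"\t+| {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append([cell.strip() for cell in splitter.split(line)])
    return rows


def flag_suspect_rows(rows):
    """Return indices of rows whose cell count differs from the most common one."""
    if not rows:
        return []
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [i for i, row in enumerate(rows) if len(row) != expected]


if __name__ == "__main__":
    table = parse_ocr_table(SAMPLE)
    for index in flag_suspect_rows(table):
        print("check row", index, table[index])
```

Rows with an unexpected number of cells are only flagged, not repaired, so the manual correction can focus on them. Character-level errors inside a cell (such as "2O" instead of "20") are not detected by this check.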

Best regards,
