A New Script for Searching Texts Written by Hand

By IAN AUSTEN
December 30, 2004, New York Times, Page 86, Section Circuits, Vol CLIV, Number 53,079 

GOOGLE’S experimental service Google Scholar (www.scholar.google.com) scours the Web for academic papers. But while it may prove an asset for students and those who work in areas like science, neither it nor any other search engine is very useful for historians whose research involves plowing through documents from the time when handwriting ruled.

Even after handwritten documents have been scanned and made available through the Web, it is not possible for Google or any other widely used searching technology to read them. The only method in use for searching such documents involves first having them typed into standard computer text, a costly and time-consuming process.

“There is an enormous amount of handwritten stuff locked away in many archives, libraries and museums,” said R. Manmatha, a research assistant professor with the Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst. “Most of the time when people do research they just ignore this stuff because it’s not accessible.”

With handwritten documents, it is not just scholars who are frustrated by their inaccessibility. People tracing their family histories often find the process involves reading a lot of mostly irrelevant documents to find relatively small amounts of information.

Most of Dr. Manmatha’s work in computer science has involved using computers to search for photographs or bits of film in digital databases. When he turned to the problem of searching handwritten texts, he found his experience to be very complementary to the task.

“Handwritten manuscripts are just images,” he said.

Some sophisticated handwriting recognition systems are in use. The United States Postal Service and postal agencies around the world use them to read addresses at sorting stations. And forms filled out during the last United States Census were scanned and read by computers.

But Dr. Manmatha said the experience developed from those systems was not particularly useful when he and two graduate students, Toni Rath and Victor Lavrenko (who is now a postdoctoral research associate), began work on their manuscript searching project. The current systems have to cope with only a limited range of material - for example, names and addresses - written in a consistent format. On top of that, postal systems have large numbers of human readers as backups, something that wouldn’t be possible for a manuscript search engine.

To develop their system, Dr. Manmatha and his students obtained about 1,000 pages of George Washington’s correspondence that had been scanned from microfilm by the Library of Congress.

They began by working on a variation of an approach used to search digital photographs, trying to match specific typewritten letters with digital images of their handwritten counterparts. But Dr. Manmatha said the inherent variations in handwriting quickly made that approach too cumbersome.

The problem was also made more difficult by the fact that Washington dictated to several secretaries and wrote some of his letters personally. The result is that his papers contain at least five handwriting styles.

The breakthrough came from looking at research into how people read, Dr. Manmatha said. Rather than analyzing individual letters, he said, people look at words and even parts of sentences as whole units.

To develop software that would take a similar holistic approach, Dr. Manmatha turned to an idea developed to let search engine users enter queries in their own language to find Web pages written in another language. Rather than mapping words between the two languages one for one, Dr. Manmatha said, those systems rely on software that is trained to spot common ground. For example, he said, English-language search engines that can look through French Web pages use training software that compares pages and pages of debate transcripts from the Canadian House of Commons, which publishes all of its proceedings in both languages.

To train his manuscript software similarly, Dr. Manmatha had a portion of Washington’s papers converted into computer text. His software is given some help as it makes its learning comparisons. For example, the system is designed to eliminate the slanting in the handwriting and to rescale words to make them a consistent size.
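
The two normalization steps mentioned above, removing slant and rescaling words to a consistent size, can be illustrated with a small sketch. This is not the researchers' code: the article does not describe their implementation, so the function names, the tiny 0/1 pixel grids and the slant-estimation heuristic (pick the shear that makes the column ink counts most peaked) are all assumptions chosen for illustration.

```python
def deslant(image, slant):
    """Shear rows horizontally to undo a lean of `slant` pixels per row.

    `image` is a list of equal-length rows of 0/1 ink pixels, top row first.
    A positive slant means strokes lean to the right as they rise.
    """
    height, width = len(image), len(image[0])
    max_shift = int(round(abs(slant) * (height - 1)))
    offset = max_shift if slant > 0 else 0  # keeps shifted columns non-negative
    out = [[0] * (width + max_shift) for _ in range(height)]
    for y, row in enumerate(image):
        shift = int(round(slant * (height - 1 - y)))  # rows higher up shift more
        for x, ink in enumerate(row):
            if ink:
                out[y][x - shift + offset] = 1
    return out


def estimate_slant(image, candidates):
    """Assumed heuristic: upright strokes pile ink into few columns, so pick
    the candidate whose column ink counts have the largest sum of squares."""
    def peakedness(slant):
        sheared = deslant(image, slant)
        return sum(sum(column) ** 2 for column in zip(*sheared))
    return max(candidates, key=peakedness)


def crop_to_ink(image):
    """Trim empty rows and columns around the ink (image must contain ink)."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x, column in enumerate(zip(*image)) if any(column)]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def rescale(image, new_height, new_width):
    """Nearest-neighbour resize to a fixed size."""
    height, width = len(image), len(image[0])
    return [[image[y * height // new_height][x * width // new_width]
             for x in range(new_width)]
            for y in range(new_height)]


def normalize_word(image, new_height, new_width, candidates=(-1, 0, 1)):
    """Deslant, crop, and rescale one word image."""
    upright = deslant(image, estimate_slant(image, candidates))
    return rescale(crop_to_ink(upright), new_height, new_width)


# A single stroke leaning right by one pixel per row (made-up test image).
diagonal = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
normalized = normalize_word(diagonal, 8, 2)
```

After normalization, the leaning stroke becomes an upright one that fills the fixed-size grid, so two writers' versions of the same word look more alike before any comparison is made.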

Even after training, the resulting software lacks the accuracy of programs used to read and digitize printed books, Dr. Manmatha said. But that is not a significant problem.

“That’s the difference between having to recognize a thing and having to search it,” he said. “You don’t have to get every word right.”
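
One way to see why search tolerates errors that transcription does not is a toy ranking sketch. It assumes a hypothetical recognizer that returns, for each word image, a set of candidate words with probabilities; the page names, words and numbers below are invented, and the scoring rule is an illustrative choice, not the model used in the project.

```python
def page_score(word_guesses, query):
    """Probability that at least one word image on the page is the query,
    treating each word image's guesses as independent (an assumption)."""
    miss = 1.0
    for guesses in word_guesses:
        miss *= 1.0 - guesses.get(query, 0.0)
    return 1.0 - miss


def search(pages, query):
    """Return page names with a nonzero score, best match first."""
    scored = [(page_score(guesses, query), name)
              for name, guesses in pages.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Invented recognizer output: one dict of candidate words per word image.
pages = {
    "letter_a": [{"fort": 0.6, "port": 0.4}, {"supplies": 0.9}],
    "letter_b": [{"port": 0.7, "fort": 0.3}],
    "letter_c": [{"horse": 1.0}],
}
results = search(pages, "fort")
```

Here the recognizer's best guess for the word in letter_b is "port", so a search over a single best-guess transcript would miss that page. Ranking by probability still lists it, below letter_a, which is the sense in which a search tool does not have to get every word right.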

Right now, Dr. Manmatha believes that the system is about 65 percent accurate. Some of that has to do with the variations in handwriting between Washington and his secretaries. The inconsistent ink fading in letters written with quill pens is another problem. And the fact that the Library of Congress’s digital images were created from microfilm copies rather than original documents was also an issue.

Dr. Manmatha believes that refinements to the software will increase its accuracy, but that even the current level could be a useful tool for researchers. It could allow them to reduce their reading time by at least broadly indicating if a group of papers is potentially useful, he said.

Outside of accuracy, Dr. Manmatha said there were two things that needed improvement in the system. It can generally cope with the difference in the writing styles in Washington’s papers because they contain broad similarities. But somewhat like voice recognition software, the program has to be retrained before it can digest documents in a significantly different hand. Dr. Manmatha said that eliminating or minimizing that retraining step would be difficult, but that he believed it would be possible.

In addition, before the system can move from its current 1,000 pages to the full 140,000 pages of Washington manuscripts in the library’s digital collection, it will also need to operate more quickly. That may be as simple as adding more computing power, Dr. Manmatha said.

Though no library or archive has yet approached Dr. Manmatha about the system, he will brief Google about it early next year. With sufficient funds for software development and document scanning, Dr. Manmatha said, it may be possible within a decade for people to search historical manuscripts from home as easily as they now locate anything else on the Web.

“A tool like this will help people access such material and make possible new discoveries,” he said.
 
