Friday, July 3, 2020

The Pitfalls of Using Google Ngram to Study Language

Google's Ngram Viewer shows how prevalent words are in millions of books over hundreds of years, which makes it an amazing tool for studying language.
"Google populated the database from over 5 million books published up to 2008."

But there are some pitfalls...

"OCR, or optical character recognition, is how computers take the pixels of a scanned book and convert it into text. It's never a perfect process...lowercase long s in old books looks a lot like a f: ...case versus cafe, funk versus sunk, fame versus same."
"Google Books' English language corpus is a mishmash of fiction, nonfiction, reports, proceedings, and, as Dodds' paper seems to show, a whole lot of scientific literature. 'It's just too globbed together...'"
"scientific publications are taking up more and more of the corpus, certain non-scientific terms may appear to fall in relative popularity..."
"He notes that a search for Barack Obama restricted to years before his birth turns up 29 results..."
[If] "a book only appears once—whether it's been read once or millions of times...[vs] some random paper on mechanics. The two texts are weighted equally. It doesn't reflect what people are talking about so much as what people are publishing about..."
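The point about scientific publications crowding the corpus is easy to check with a little arithmetic. Here is a rough sketch in Python. All the numbers are made up purely for illustration (they are not taken from the Ngram data), and `relative_frequency` is just a helper name chosen for this example.

```python
def relative_frequency(word_count: int, total_words: int) -> float:
    """Share of all words in the corpus that are the word we care about."""
    return word_count / total_words


# Hypothetical figures, for illustration only.
word_count = 1_000            # times our word appears; identical in both years
general_words = 10_000_000    # non-scientific text; identical in both years
science_words_early = 2_000_000
science_words_late = 10_000_000   # scientific text has grown fivefold

early = relative_frequency(word_count, general_words + science_words_early)
late = relative_frequency(word_count, general_words + science_words_late)

# The word is used exactly as often as before, yet its share of the corpus drops.
print(f"early share: {early:.2e}")
print(f"late share:  {late:.2e}")
print(f"apparent decline: {1 - late / early:.0%}")
```

Nothing about how people use the word changed here. Only the mix of the corpus did, which is why a falling line on the graph does not by itself mean a word fell out of favor.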


You can even search with wildcards and for multiple forms of a word (book_INF matches booked, booking, books), and run other advanced searches; see https://books.google.com/ngrams/info
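If you want to save or share one of these advanced searches, you can build the viewer link yourself. This is a minimal sketch in Python. It assumes the query parameters that show up in the viewer's address bar (`content`, `year_start`, `year_end`); these are not a documented API and could change. The function name `ngram_url` is made up for this example.

```python
from urllib.parse import urlencode

# Base address of the graph page, as seen in the browser (an assumption, not a documented API).
NGRAM_GRAPH_URL = "https://books.google.com/ngrams/graph"


def ngram_url(queries, year_start=1800, year_end=2008):
    """Build a link to the Ngram Viewer for a list of search phrases."""
    params = {
        "content": ",".join(queries),  # phrases are comma-separated in the viewer
        "year_start": year_start,
        "year_end": year_end,
    }
    return f"{NGRAM_GRAPH_URL}?{urlencode(params)}"


# An inflection search and a wildcard search in one link.
url = ngram_url(["book_INF", "University of *"])
print(url)
```

Paste the printed link into a browser to see the graph. `urlencode` takes care of escaping the spaces, the comma, and the `*` wildcard.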
<iframe width="560" height="315" src="https://www.youtube.com/embed/DHacBPWrB8g" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

