python library 

trafilatura is a nice Python library that we use to extract article full text from HTML documents for indexing in scholar. It has good accuracy and recall, works with "old" HTML (e.g. from web archives), and pulls out metadata like title, author, and date. There are lots of similar tools, mostly focused on news articles, and trafilatura is an improvement over them.
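
For anyone curious what that looks like in practice, here is a minimal sketch, not our actual indexing code. The helper name `extract_article` is made up for this example, and the JSON field names (`title`, `author`, `date`, `text`) are what I understand trafilatura's JSON output to contain, so check them against the version you have installed.

```python
import json
from typing import Optional


def extract_article(html: str, url: Optional[str] = None) -> Optional[dict]:
    """Return title, author, date and main text for one HTML page, or None.

    Hypothetical helper for illustration; only trafilatura.extract() is a
    real library call here.
    """
    # Imported inside the function so this module still loads on machines
    # where trafilatura is not installed.
    import trafilatura

    # extract() returns a string in the requested format, or None when it
    # cannot find any main content in the page.
    raw = trafilatura.extract(
        html,
        url=url,
        output_format="json",
        with_metadata=True,
    )
    if raw is None:
        return None

    doc = json.loads(raw)
    # Keep only the fields mentioned above; absent keys come back as None.
    return {key: doc.get(key) for key in ("title", "author", "date", "text")}
```

Call it with the HTML of a page as a string, e.g. `extract_article(open("page.html").read())`. Very short or boilerplate-only pages may come back as `None`, so callers should handle that case.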

Thanks to Adrien Barbaresi for maintaining it!

python library 

@scholar I'm glad to see that your scholar project is multilingual as well! Testing a few key phrases in Japanese returned some results.

Does the IA have subscriptions to online journals, or is this just indexing based on abstracts, etc.? How are you handling journals that are still primarily distributed physically and have little or no online presence (in Japan this is very much the case for a huge number of important journals)? Are you pulling info from CiNii, Webcat, or anything else?

python library 

Scholar contains content that we have crawled from open sources, such as OA publishers, repositories, and national libraries. Most of our Japanese metadata and content comes via the JaLC DOI registrar and the J-STAGE hosting site.

There is also digitized print content in the Internet Archive's collections, and some of that ends up in scholar. I don't know of any specific Japanese research collections there.

There are more details in "the guide":

re: python library 

@scholar Oh I see... Interesting. If it is at all possible to integrate CiNii into that (I believe it has an API), I think that would be valuable. It has the most complete info for Japanese scholarly articles and books, even for journals that are only distributed via a physical medium.

I don't know how much access the API gives you, so I can't say how easy or even possible the integration would be, but I think it'd be a really good resource to look at to expand your Japanese source data!