trafilatura (https://github.com/adbar/trafilatura) is a nice Python library that we use to extract article full text from HTML documents for indexing in scholar. It has good accuracy and recall, works with "old" HTML (e.g. from web archives), and pulls out metadata like title, author, and date. There are lots of similar tools, mostly focused on news articles, and trafilatura improves on them.
Thanks to Adrien Barbaresi for maintaining it!
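For anyone curious what that looks like in practice, here is a minimal sketch, not the actual scholar pipeline. It assumes trafilatura's `extract()` function with `output_format="json"` and `with_metadata=True`, and that the JSON record carries `title`, `author`, `date`, and `text` keys. The helper names `pick_fields` and `extract_article` are made up for illustration.

```python
import json

# Fields we care about for indexing (an assumption for this sketch).
FIELDS = ("title", "author", "date", "text")


def pick_fields(record_json):
    """Keep only the fields of interest from a JSON record string.

    Keys missing from the record come back as None.
    """
    record = json.loads(record_json)
    return {key: record.get(key) for key in FIELDS}


def extract_article(html):
    """Return title/author/date/text for an HTML string.

    Returns None when trafilatura extracts nothing from the page.
    """
    # Imported lazily so the helper above works without the dependency.
    import trafilatura  # third-party: pip install trafilatura

    result = trafilatura.extract(html, output_format="json", with_metadata=True)
    return pick_fields(result) if result else None


if __name__ == "__main__":
    import sys

    # Usage: python extract_sketch.py page.html
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(extract_article(handle.read()))
```

Very short or boilerplate-only pages may yield no result at all, so callers should be ready for `None`.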
Scholar contains content that we have crawled from open sources, such as OA publishers, repositories, and national libraries. Most of our Japanese metadata and content comes via the JALC DOI registrar and the J-STAGE hosting site.
There is also digitized print content in archive.org, and some of that ends up in scholar. I don't know of any specific Japanese research collections there.
There are more details in "the guide": https://guide.fatcat.wiki/
re: python library