python library
trafilatura (https://github.com/adbar/trafilatura) is a nice Python library that we use to extract the full text of articles from HTML documents for indexing in scholar. It has good accuracy and recall, works with "old" HTML (e.g. from web archives), and pulls out metadata like title, author, and date. There are lots of similar tools, mostly focused on news articles, and trafilatura is an improvement over the ones we have tried.
Thanks to Adrien Barbaresi for maintaining it!
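For anyone curious what the extraction step looks like, here is a minimal sketch. It is not our actual pipeline code: the helper name `extract_fulltext` and its `extractor` parameter are made up for illustration, and the only trafilatura call it relies on is `trafilatura.extract`, which takes an HTML string and returns the main text (or None if it finds nothing).

```python
from typing import Callable, Optional


def extract_fulltext(
    html: str,
    extractor: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the main text of an HTML page, or None if nothing was extracted.

    `extractor` is a hypothetical hook for swapping in another tool;
    by default the function uses trafilatura.extract.
    """
    # Skip empty or whitespace-only documents before doing any work.
    if not html or not html.strip():
        return None
    if extractor is None:
        # Imported lazily so the helper can be tried without trafilatura installed.
        from trafilatura import extract
        extractor = extract
    text = extractor(html)
    # trafilatura.extract returns None when no main content is found.
    return text.strip() if text else None
```

Calling `extract_fulltext(html)` on a fetched page gives you the plain text to index. Metadata such as title, author, and date comes from trafilatura's other output options, which this sketch leaves out.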
python library
@TiljaCallahan
Scholar contains content that we have crawled from open sources, such as OA publishers, repositories, and national libraries. Most of our Japanese metadata and content comes via the JALC DOI registrar and the J-STAGE hosting site.
There is also digitized print content in archive.org, and some of that ends up in scholar. I don't know of any specific Japanese research collections there.
There are more details in "the guide": https://guide.fatcat.wiki/
re: python library
Does the IA have subscriptions to online journals, or is this just indexing based on abstracts, etc.? How are you handling journals that are still primarily distributed physically and have little or no online presence (in Japan this is very much the case for a huge number of important journals)? Are you pulling info from CiNii or Webcat or anything else?