trafilatura
all
A Python tool for web scraping and crawling that extracts main text, metadata, and comments from web pages. Designed for creating text corpora and extracting structured content.
Options (4)
-u, --URL (string)
Extract text from a URL
Example:
trafilatura {{[-u|--URL]}} {{url}}

-o, --output-dir (string)
Extract text and save the result to a directory
Example:
trafilatura {{[-u|--URL]}} {{url}} {{[-o|--output-dir]}} {{path/to/output_directory}}

-i, --input-file (string)
Extract text from multiple URLs listed in a file
Example:
trafilatura {{[-i|--input-file]}} {{path/to/url_list.txt}}

-h, --help (boolean)
Display help
Example:
trafilatura {{[-h|--help]}}

Examples (8)
Extract text from a URL
trafilatura [-u|--URL] url

Extract text and save the result to a directory
trafilatura [-u|--URL] url [-o|--output-dir] path/to/output_directory

Extract text in JSON format
trafilatura [-u|--URL] url --json

Extract text from multiple URLs listed in a file
trafilatura [-i|--input-file] path/to/url_list.txt

Crawl a website using its sitemap
trafilatura --sitemap url_to_sitemap.xml

Extract text while preserving HTML formatting
trafilatura [-u|--URL] url --formatting

Extract text including comments
trafilatura [-u|--URL] url --with-comments

Display help
trafilatura [-h|--help]

made by @shridhargupta | data from tldr-pages
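
The batch options above can be combined into a small script. The following is a minimal sketch, not an official recipe: the file name url_list.txt, the directory name extracted, and the example.org URLs are hypothetical placeholders, and the input file is assumed to hold one URL per line. The extraction step runs only if the trafilatura command is found on the PATH.

```shell
#!/bin/sh
# Sketch: build a URL list and pass it to trafilatura with the
# --input-file and --output-dir options listed on this page.
# All file names, directory names, and URLs below are hypothetical.
set -eu

url_list="url_list.txt"   # assumed format: one URL per line
out_dir="extracted"       # directory that will receive the results

# Write the URLs to process, one per line.
printf '%s\n' \
  "https://example.org/" \
  "https://example.org/about" > "$url_list"

# Make sure the output directory exists before extraction.
mkdir -p "$out_dir"

# Run the extraction only when the trafilatura command is installed.
if command -v trafilatura >/dev/null 2>&1; then
  trafilatura --input-file "$url_list" --output-dir "$out_dir" \
    || echo "extraction failed for one or more URLs" >&2
else
  echo "trafilatura not found; install it with: pip install trafilatura" >&2
fi
```

Keeping the URL list in a file makes it possible to rerun the same batch later or to extend it by appending lines.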