adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Language: Python
Total stars: 1562
Stars trend:
14 Aug 2023
7pm ██▎ +18
8pm ███▎ +26
9pm ███▊ +30
10pm ██▍ +19
11pm ██ +16
15 Aug 2023
12am ██ +16
1am █▊ +14
2am ███ +24
3am █▋ +13
4am ██▌ +20
5am ██▌ +20
#python
#articleextractor, #corpus, #corpusbuilder, #corpustools, #crawler, #htmltomarkdown, #html2text, #news, #newsaggregator, #newscrawler, #nlp, #readability, #rssfeed, #scraping, #tei, #textcleaning, #textextraction, #textmining, #textpreprocessing, #webscraping
wisupai/e2m
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.
Language: Jupyter Notebook
Total stars: 766
Stars trend:
14 Jan 2025
7pm ▉ +7
8pm ▊ +6
9pm ▊ +6
10pm ▌ +4
11pm ▍ +3
15 Jan 2025
12am █ +8
1am ▌ +4
2am ▉ +7
3am ▏ +1
4am █▌ +12
5am █▏ +9
6am █▎ +10
#jupyternotebook
#doc2x, #e2m, #llm, #markdown, #pdftomarkdown, #textcleaning