Coeus
Table of Contents
- 🧾 Description
- 📷 Demo
- ✨ Features
🧾 Description
Coeus is a web crawler designed to extract information from websites with high efficiency.
Coeus respects robots.txt, the file that website owners use to tell crawlers which pages may be crawled and which may not. This ensures that the program only visits pages it is allowed to, avoiding legal and ethical issues.
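A minimal sketch of what such a check might look like, using Python's standard `urllib.robotparser`; the `CoeusBot` user-agent string and the example URL are illustrative assumptions, not taken from the project:

```python
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "CoeusBot"  # hypothetical user-agent string


def allowed_by_robots(url: str) -> bool:
    """Return True if robots.txt on the URL's host permits crawling it."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = robotparser.RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    try:
        parser.read()  # fetch and parse robots.txt
    except OSError:
        return False   # be conservative if robots.txt cannot be fetched
    return parser.can_fetch(USER_AGENT, url)


print(allowed_by_robots("https://example.com/some/page"))
```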
Coeus also filters seen URLs and duplicated content, so the program does not waste time crawling the same pages multiple times. Additionally, it compresses HTML pages to conserve memory and avoid performance issues.
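A rough sketch of how URL and content deduplication could be combined with page compression, assuming an in-memory set of seen URLs, SHA-256 content fingerprints, and zlib compression; these particular choices are assumptions, not details from the Coeus codebase:

```python
import hashlib
import zlib
from typing import Optional

seen_urls = set()          # URLs already scheduled or fetched (assumed in-memory store)
seen_fingerprints = set()  # SHA-256 hashes of page bodies already stored


def should_fetch(url: str) -> bool:
    """Skip URLs that have already been seen."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True


def store_page(html: str) -> Optional[bytes]:
    """Deduplicate by content hash, then compress the HTML before storing."""
    fingerprint = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return None  # duplicate content, skip storing it again
    seen_fingerprints.add(fingerprint)
    return zlib.compress(html.encode("utf-8"))  # compressed page body
```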
To guard against node crashes, Coeus saves its state and data automatically, allowing it to be restarted after a failure.
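One possible way to implement such checkpointing, sketched here with Python's `pickle` and an atomic file swap; the `coeus_checkpoint.pkl` path and the exact state saved (frontier plus seen URLs) are assumptions for illustration:

```python
import os
import pickle

CHECKPOINT_PATH = "coeus_checkpoint.pkl"  # hypothetical checkpoint location


def save_checkpoint(frontier, seen_urls):
    """Atomically persist the crawl frontier and the seen-URL set."""
    tmp_path = CHECKPOINT_PATH + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"frontier": list(frontier), "seen": set(seen_urls)}, f)
    os.replace(tmp_path, CHECKPOINT_PATH)  # atomic swap: a crash never leaves a half-written file


def load_checkpoint():
    """Restore state after a restart, or start fresh if no checkpoint exists."""
    if not os.path.exists(CHECKPOINT_PATH):
        return [], set()
    with open(CHECKPOINT_PATH, "rb") as f:
        state = pickle.load(f)
    return state["frontier"], state["seen"]
```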
📷 Demo
✨ Features
- ✅ Avoid spider traps by enforcing a maximum URL length (see the sketch after this list)
- ✅ Distributed infrastructure consisting of a web server, key-value store, and processing engine
- ✅ Compress HTML pages to conserve memory
- ✅ Crawler can be restarted to guard against node crashes
- ✅ Filter seen URLs & duplicated content
- ✅ Respect robots.txt for crawler politeness
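Below is a small sketch of the maximum-URL-length guard mentioned in the feature list; the 200-character threshold is an assumed, configurable value rather than the project's actual setting:

```python
MAX_URL_LENGTH = 200  # assumed limit; the real threshold would be configurable


def is_probable_trap(url: str) -> bool:
    """Reject overly long URLs, which often indicate a spider trap
    (e.g. endlessly growing query strings or repeated path segments)."""
    return len(url) > MAX_URL_LENGTH
```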