A web crawler is an internet bot that systematically browses the web, either to index it or to search for specific content. In our case, we (Institute for Information Systems) needed text to train natural language processing (NLP) models, for which we used the NLP package from Stanford University. Each unit of the crawler was programmed in Java.
1. The Entry – Twitter
As its entry point to the web, the first bot downloaded tweets from Twitter that contained links. We used multiple accounts, one for each thread. Each thread filtered for different search terms taken from the “hot topics” of the day. The extracted links were then forwarded to the Index.
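The link extraction step can be sketched roughly as follows. This is a minimal illustration and not the original bot: the class name `TweetLinkExtractor`, the method `extractLinks` and the simplified URL pattern are assumptions, and downloading the tweets through the Twitter API is left out entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls http(s) links out of the raw text of a tweet.
class TweetLinkExtractor {

    // Deliberately simple pattern (an assumption): a scheme followed by non-whitespace characters.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Returns the distinct links of a tweet in order of first appearance.
    static List<String> extractLinks(String tweetText) {
        Set<String> links = new LinkedHashSet<>();
        Matcher matcher = URL.matcher(tweetText);
        while (matcher.find()) {
            // Drop trailing punctuation that usually belongs to the sentence, not the URL.
            links.add(matcher.group().replaceAll("[.,;:!?)\\]\"']+$", ""));
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String tweet = "Big news today: https://example.org/story, see also http://example.com/a?b=1!";
        // These links would then be forwarded to the Index.
        System.out.println(extractLinks(tweet));
    }
}
```

A `LinkedHashSet` is used here so that a link repeated within one tweet is forwarded only once while the original order is kept.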
2. The Index
The Index was sorted by top-level domain (TLD) and stored, for each website, all of its subdomains and links. Both the TLD entry and each link stored the time of the last visit, to prevent overloading the target websites and getting stuck in honeypots. The Index also received all links found on a visited web page, so that indexing could continue across the complete website and related web pages. With this method a website was indexed quickly, especially when it provided an archive. In addition, robots.txt, RSS feeds and other metadata were indexed too.
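One possible shape for such an index is sketched below. It is an assumption-laden illustration, not the original data structure: the names `CrawlIndex`, `addLink`, `nextHost`, `isLinkDue` and `markLinkVisited` are hypothetical, the minimum delay between visits is a made-up parameter, and the TLD is taken naively as the text after the last dot of the host name.

```java
import java.net.URI;
import java.time.Duration;
import java.time.Instant;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical index: hosts grouped by TLD, with last-visit times for hosts and links.
class CrawlIndex {

    private final Duration minDelay; // assumed politeness delay between two visits
    // TLD -> host -> time of last visit (Instant.EPOCH means "never visited")
    private final Map<String, Map<String, Instant>> hostsByTld = new TreeMap<>();
    // link -> time of last visit
    private final Map<String, Instant> linkVisits = new TreeMap<>();

    CrawlIndex(Duration minDelay) {
        this.minDelay = minDelay;
    }

    // Naive TLD detection: everything after the last dot of the host name.
    static String tldOf(String host) {
        int dot = host.lastIndexOf('.');
        return dot < 0 ? host : host.substring(dot + 1);
    }

    // Registers a link and its host; throws IllegalArgumentException on malformed URLs.
    void addLink(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return; // not an absolute URL with a host, ignore it
        }
        host = host.toLowerCase(Locale.ROOT);
        hostsByTld.computeIfAbsent(tldOf(host), k -> new TreeMap<>()).putIfAbsent(host, Instant.EPOCH);
        linkVisits.putIfAbsent(url, Instant.EPOCH);
    }

    // Hands out the first host whose last visit is at least minDelay ago and marks it as visited.
    Optional<String> nextHost(Instant now) {
        for (Map<String, Instant> hosts : hostsByTld.values()) {
            for (Map.Entry<String, Instant> entry : hosts.entrySet()) {
                if (Duration.between(entry.getValue(), now).compareTo(minDelay) >= 0) {
                    entry.setValue(now);
                    return Optional.of(entry.getKey());
                }
            }
        }
        return Optional.empty();
    }

    // True if the link is known and has not been visited within minDelay.
    boolean isLinkDue(String url, Instant now) {
        Instant last = linkVisits.get(url);
        return last != null && Duration.between(last, now).compareTo(minDelay) >= 0;
    }

    void markLinkVisited(String url, Instant now) {
        linkVisits.put(url, now);
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic. A real index would also need persistence and synchronization between threads, which are omitted here.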
3. The Text Extraction
Each node of the downloader requested a domain from the Index to visit. The content of each web page was extracted twice, once as the original source and once as plain text, and stored in an archive together with the date of the visit. This is important for detecting changes and edits, for example in news articles.
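Storing both versions with a visit date and detecting edits could look roughly like the sketch below. Everything in it is an assumption for illustration: the class `PageArchive`, the in-memory map, the crude regex-based tag stripping (a real extractor would use a proper HTML parser) and the use of a SHA-256 hash of the plain text to compare two visits.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory archive: one list of dated snapshots per URL.
class PageArchive {

    // One visit: original source, extracted plain text, visit date and a hash of the text.
    static final class Snapshot {
        final Instant visitedAt;
        final String rawHtml;
        final String plainText;
        final String textHash;

        Snapshot(Instant visitedAt, String rawHtml) {
            this.visitedAt = visitedAt;
            this.rawHtml = rawHtml;
            this.plainText = toPlainText(rawHtml);
            this.textHash = sha256(plainText);
        }
    }

    private final Map<String, List<Snapshot>> snapshotsByUrl = new HashMap<>();

    // Crude tag stripping for illustration only; it does not decode entities or handle broken markup.
    static String toPlainText(String html) {
        return html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                   .replaceAll("<[^>]+>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                                         .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every standard Java platform
        }
    }

    // Stores the snapshot and reports whether the plain text differs from the previous visit.
    boolean store(String url, String rawHtml, Instant visitedAt) {
        List<Snapshot> history = snapshotsByUrl.computeIfAbsent(url, k -> new ArrayList<>());
        Snapshot current = new Snapshot(visitedAt, rawHtml);
        boolean changed = !history.isEmpty()
                && !history.get(history.size() - 1).textHash.equals(current.textHash);
        history.add(current);
        return changed;
    }

    List<Snapshot> history(String url) {
        return snapshotsByUrl.getOrDefault(url, new ArrayList<>());
    }
}
```

Comparing the hash of the plain text rather than the raw source means that a pure markup change (for example a new advertisement container) is not reported as an edit, while a reworded sentence is.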