# ghcrawler

A robust GitHub API crawler that walks a queue of GitHub entities, retrieving and storing their contents. It is useful for:

* Retrieving all GitHub entities related to an org, repository, or user
* Efficiently storing the retrieved entities
* Keeping the stored data up to date when used in conjunction with a GitHub event tracker

GHCrawler focuses on successively retrieving and walking GitHub resources supplied on a (set of) queues.

## Usage

The crawler itself is not particularly runnable. It needs to be configured with:

- Queuing infrastructure that can take and supply requests to process the response from an API URL.
- A fetcher that queries APIs with the URL in a given request.
- One or more processors that handle requests and the fetched API document.
- A store used to persist the processed documents.

For more information, contact [email protected] with any additional questions or comments.
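To make the queue/fetcher/processor/store architecture concrete, here is a minimal sketch of how those four pieces could fit together in a crawl loop. All class and method names here (`InMemoryQueue`, `StubFetcher`, `OrgProcessor`, `InMemoryStore`, `crawl`) are hypothetical illustrations of the roles described above, not ghcrawler's actual API; the fetcher returns canned documents instead of calling GitHub.

```javascript
// Hypothetical sketch of the four crawler components; not ghcrawler's real API.

// Queuing infrastructure: holds requests waiting to be processed.
class InMemoryQueue {
  constructor() { this.items = []; }
  push(request) { this.items.push(request); }
  pop() { return this.items.shift(); }
  get length() { return this.items.length; }
}

// Fetcher: a real one would call the GitHub API; this stub returns canned documents.
class StubFetcher {
  constructor(documents) { this.documents = documents; }
  fetch(url) { return this.documents[url] || null; }
}

// Processor: handles an org document and queues its repositories for later crawling.
class OrgProcessor {
  process(request, document, queue) {
    for (const repoUrl of document.repoUrls || []) {
      queue.push({ type: 'repo', url: repoUrl });
    }
    return { url: request.url, name: document.name };
  }
}

// Store: persists processed documents, keyed by URL.
class InMemoryStore {
  constructor() { this.docs = new Map(); }
  save(doc) { this.docs.set(doc.url, doc); }
  get(url) { return this.docs.get(url); }
}

// Crawl loop: pop a request, fetch its document, process it (which may
// queue more requests), and store the result.
function crawl(queue, fetcher, processors, store) {
  while (queue.length > 0) {
    const request = queue.pop();
    const document = fetcher.fetch(request.url);
    if (!document) continue;
    const processor = processors[request.type];
    const processed = processor
      ? processor.process(request, document, queue)
      : { url: request.url, name: document.name };
    store.save(processed);
  }
}
```

Seeding the queue with a single org request is enough to walk the whole tree: the org processor enqueues each repository, and the loop keeps running until the queue drains.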


