
Transistor
==========

Transistor, a Python web scraping framework for intelligent use cases.

The web is full of data. Transistor is a web scraping framework for collecting, storing, and using targeted data from structured web pages.

Transistor's current strengths include the ability to:

- provide an interface to the `Splash <https://github.com/scrapinghub/splash>`_ headless browser / javascript rendering service.
- *optionally* use the scrapinghub.com `Crawlera <https://scrapinghub.com/crawlera>`_ 'smart' proxy service.
- ingest keyword search terms from a spreadsheet, or use RabbitMQ or Redis as a message broker, transforming keywords into task queues.
- scale one ``Spider`` into an arbitrary number of workers combined into a ``WorkGroup``.
- coordinate an arbitrary number of ``WorkGroups`` searching an arbitrary number of websites, in one scrape job.
- send out all the ``WorkGroups`` concurrently, using gevent based asynchronous I/O.
- return data from each website for each search term 'task' in the list, for easy website-to-website comparison.
- export data to CSV, XML, JSON, pickle, a file object, and/or your own custom exporter.
- save targeted scrape data to the database of your choice.

Suitable use cases include:

- comparing attributes like stock status and price, for a list of ``book titles`` or ``part numbers``, across multiple websites.
- concurrently processing a large list of search terms on a search engine and then scraping the results, or following links first and then scraping the results.

Development of Transistor is sponsored by `BOM Quote Manufacturing <https://www.bomquote.com>`_.

**Primary goals**:

1. Enable scraping targeted data from a wide range of websites, including sites rendered with Javascript.
2. Navigate websites which present logins, custom forms, and other blockers to data collection, like captchas.
3. Provide asynchronous I/O for task execution, using `gevent <https://github.com/gevent/gevent>`_.
4. Easily integrate within a web app like `Flask <https://github.com/pallets/flask>`_, `Django <https://github.com/django/django>`_, or other Python based `web frameworks <https://github.com/vinta/awesome-python#web-frameworks>`_.
5. Provide spreadsheet based data ingest and export options, like importing a list of search terms from excel, ods, or csv, and exporting data to each as well.
6. Offer quick and easy integrated task work queues which can be automatically filled with search terms by a simple spreadsheet import.
7. Integrate with more robust task queues like `Celery <https://github.com/celery/celery>`_ while using `rabbitmq <https://www.rabbitmq.com/>`_ or `redis <https://redis.io/>`_ as a message broker, as desired.
8. Provide hooks for users to persist data via any method they choose, while also supporting our own opinionated choice: a `PostgreSQL <https://www.postgresql.org/>`_ database along with `newt.db <https://github.com/newtdb/db>`_.
9. Contain useful abstractions, classes, and interfaces for scraping and crawling with machine learning assistance (wip, timeline tbd).
10. Further support data science use cases of the persisted data, where convenient and useful for us to provide in this library (wip, timeline tbd).
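To illustrate the spreadsheet-to-task-queue ingest described above, here is a minimal standard-library sketch. The file contents, column layout, and ``load_tasks`` helper are hypothetical placeholders, not part of Transistor's actual API::

    import csv
    import io
    import queue

    # Stand-in for a csv spreadsheet of search terms, one per row
    # (hypothetical data, for illustration only).
    CSV_DATA = "book title\nMastering Python\nFluent Python\nEffective Python\n"

    def load_tasks(csv_text):
        """Read keyword rows and load each one into a FIFO task queue."""
        tasks = queue.Queue()
        reader = csv.reader(io.StringIO(csv_text))
        next(reader)  # skip the header row
        for row in reader:
            tasks.put(row[0])
        return tasks

    tasks = load_tasks(CSV_DATA)
    print(tasks.qsize())  # 3 keyword tasks queued

In Transistor the queue would be backed by RabbitMQ or Redis rather than an in-process ``queue.Queue``, but the ingest shape is the same: one spreadsheet row becomes one task.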
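The concurrent dispatch of ``WorkGroups`` follows a fan-out/join pattern. Transistor itself uses gevent greenlets for this; the sketch below shows the same pattern with only the standard library's ``concurrent.futures``, and the ``scrape`` worker plus the site and term lists are hypothetical placeholders, not Transistor's API::

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical sites and search terms, for illustration only.
    SITES = ["site-a", "site-b", "site-c"]
    TERMS = ["Mastering Python", "Fluent Python"]

    def scrape(site, term):
        """Placeholder worker: handles one (site, search term) task."""
        return {"site": site, "term": term, "stock": "in stock"}

    # Fan out one worker per (site, term) pair, then join on the results,
    # giving a per-term row for each site for side-by-side comparison.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(scrape, s, t) for s in SITES for t in TERMS]
        results = [f.result() for f in futures]

    print(len(results))  # 3 sites x 2 terms = 6 result rows

The point of the pattern is that each (website, search term) pair is an independent I/O-bound task, so all of them can be in flight at once and the joined results line up naturally for website-to-website comparison.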