News+C

News+C is a project brought to life by user:bzc6p and is concerned with archiving news websites.

NewsGrabber vs. News+C

Wait, we already have NewsGrabber! How is this one different?

  • News+C focuses on websites that have user comments, especially those that use third-party comment plugins (Facebook, Disqus etc.)
  • As third-party comment plugins are usually full of Javascript and difficult to archive in an automated way, News+C is largely a manual project.
  • While NewsGrabber archives all articles of thousands of websites, News+C focuses on only a few (popular) websites, but archives those more thoroughly.
  • While anyone can join the NewsGrabber project, since it only requires starting a script, a News+C project needs more knowledge, time and attention.

News+C is in no way a replacement for or a competitor of NewsGrabber. It is a small-scale project with a different approach: it focuses on comments rather than on the news and, to some extent, prefers quality over quantity.

Tools

Archiving websites with a ton of Javascript is always a pain in the ass for an archivist, and third-party comment plugins do use a lot of Javascript. The problem is that you need to list and save all the URLs that these scripts request during user actions, but only browsers are able to interpret these scripts correctly.

So there are two approaches:

  • If you are a Javascript mage, you find a way to automate all those Ajax and whatever requests, so that you can fetch the comments with wget or wpull.
  • If you are not that expert/intelligent/patient/whatever and don't want to deal with all that, there is a slower but simpler, more universal and cosier approach: automating a web browser using computer vision.

Solving the script puzzle

If you know how to efficiently archive Facebook or Disqus comment threads with a script, do not hesitate to share. The founder of this project, however, doesn't know how, so he is developing the other method.

Using computer vision

Web browsers interpret Javascript well, and there are tools that archive websites as you browse (Webrecorder – https://webrecorder.io – and warcprox – https://github.com/internetarchive/warcprox – among others). So you can save a website pretty much perfectly if you yourself browse it in a web browser with one of these archiving tools running. (Alternatively, if you don't trust or otherwise can't use such a tool, you can export the list of requested URLs with some browser plugin and then save those URLs with wget or wpull.) (We are talking about WARC archives, of course; Ctrl+S-ing the website is not the optimal way for us.)
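
As an illustration of the second route, the fetching step can be driven from a tiny Python 2 script that hands the exported list to wget with WARC output enabled. This is only a sketch: the file names are made up, and the flags are ordinary wget WARC options.

  import subprocess

  URL_LIST = "exported_urls.txt"      # hypothetical export from the browser plugin
  WARC_NAME = "news_c_comments"       # wget appends .warc.gz itself

  subprocess.call([
      "wget",
      "--input-file=" + URL_LIST,     # read URLs, one per line
      "--warc-file=" + WARC_NAME,     # also write everything into a WARC
      "--warc-cdx",                   # produce a CDX index alongside it
      "--directory-prefix=tmp",       # keep the plain downloads out of the way
      "--tries=3",
      "-e", "robots=off",
  ])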

But the question is, as always: how do you automate this process?

This is where computer vision enters the stage. You can – surprisingly easily –

  • simulate keypresses
  • simulate mouse movement and clicks
  • find the location of an excerpt image on the screen

with a little programming.
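
Just to show how little code this takes, here is a minimal pyAutoGUI sketch of those three primitives (this is not the project's actual script, and the image file name is made up):

  import pyautogui

  # simulate keypresses: type a URL and hit Enter
  pyautogui.typewrite("https://example.com/some-article", interval=0.02)
  pyautogui.press("enter")

  # simulate mouse movement and clicks
  pyautogui.moveTo(640, 400, duration=0.3)
  pyautogui.click()

  # find the location of an excerpt image on the screen
  # ("read_comments.png" would be a cut-out of the button we are looking for)
  box = pyautogui.locateOnScreen("read_comments.png")
  if box is not None:
      x, y = pyautogui.center(box)
      pyautogui.click(x, y)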

This – according to user:bzc6p's knowledge – needs a graphical interface and can't be put in the background, but at least you can save a few hundred/thousand articles overnight, while you sleep.

Different scripts are necessary for different websites, but the approach is the same, and the scripts are also similar. The modifiable python2 script user:bzc6p uses has been named by its creator the Archiving SharpShooter (ASS).

Archiving SharpShooter

The actual code may be published later (or, if you are interested, you can ask user:bzc6p), but the project is still quite beta, so only the algorithm is explained here.

  • Input is a list of news URLs.
  • Key python2 libraries used are pyAutoGUI and openCV. The former is our hands, the latter our eyes.
  • pyautogui.press() and pyautogui.click() type, scroll and click. cv2.matchTemplate() finds the location of the "Read comments", "More comments" etc. buttons or links, and we click them. matchTemplate needs a template to search for (we cut these out from screenshots) and an up-to-date screenshot (we invoke scrot from Python and load that image). With matchTemplate we can also check whether the page has loaded or whether we have reached the bottom of the page. (The threshold for matchTemplate must be carefully chosen for each template, so that it neither misses the template nor finds a false positive; a sketch of this matching step follows the list.)
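
A rough sketch of that template-matching step, assuming hypothetical image files and a made-up threshold (the real ASS code differs in its details):

  import os
  import subprocess
  import cv2

  def find_on_screen(template_path, threshold):
      """Screenshot the desktop with scrot and look for the template in it.
      Returns the centre of the best match, or None if it is below the threshold."""
      shot = "/tmp/screen.png"
      if os.path.exists(shot):
          os.remove(shot)                  # some scrot versions refuse to overwrite
      subprocess.call(["scrot", shot])     # up-to-date screenshot
      screen = cv2.imread(shot)
      template = cv2.imread(template_path) # e.g. a cut-out "Read comments" button
      result = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
      min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
      if max_val < threshold:              # the threshold must be tuned per template
          return None
      h, w = template.shape[:2]
      return (max_loc[0] + w // 2, max_loc[1] + h // 2)

  # hypothetical usage; 0.9 is a placeholder threshold that would need testing
  pos = find_on_screen("read_comments.png", 0.9)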

What the program basically does (a rough code sketch follows the numbered steps):

  1. types URL in the address bar
  2. waits till page is loaded
  3. scrolls till it finds "Read comments" or equivalent sign
  4. clicks on that
  5. waits for comments to be loaded
  6. scrolls till "More comments" or equivalent is reached
  7. waits for more comments to be loaded
  8. repeats this until bottom of page is reached (no more comments)
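
Put together, the outer loop could look roughly like the sketch below. The helper functions, template images and file names are illustrative only (and, for brevity, the on-screen search uses pyAutoGUI's built-in image location instead of the scrot + matchTemplate routine described above):

  import time
  import pyautogui

  def wait_for(template, timeout=60):
      """Poll the screen until the template image appears, or give up."""
      deadline = time.time() + timeout
      while time.time() < deadline:
          if pyautogui.locateOnScreen(template) is not None:
              return True
          time.sleep(1)
      return False

  def scroll_until(template, max_scrolls=50):
      """Scroll down until the template is visible; return its centre or None."""
      for _ in range(max_scrolls):
          box = pyautogui.locateOnScreen(template)
          if box is not None:
              return pyautogui.center(box)
          pyautogui.scroll(-600)           # negative = scroll down
          time.sleep(0.5)
      return None

  def archive_article(url):
      pyautogui.hotkey("ctrl", "l")                     # 1. focus the address bar
      pyautogui.typewrite(url + "\n", interval=0.02)    #    and type the URL
      wait_for("page_loaded.png")                       # 2. wait till the page is loaded
      pos = scroll_until("read_comments.png")           # 3. scroll till "Read comments"
      if pos is None:
          return                                        #    (article without comments)
      pyautogui.click(pos[0], pos[1])                   # 4. click on it
      wait_for("comments_loaded.png")                   # 5. wait for the comments
      while True:
          more = scroll_until("more_comments.png")      # 6. scroll till "More comments"
          if more is None:
              break                                     # 8. bottom reached, no more comments
          pyautogui.click(more[0], more[1])
          wait_for("comments_loaded.png")               # 7. wait for more comments

  for line in open("news_urls.txt"):                    # input: a list of news URLs
      archive_article(line.strip())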

During this, warcprox runs in the background, and every request is immediately saved to a WARC file. (Warcprox provides a proxy, which is set in the browser.)

Disclaimer

The Archiving SharpShooter, or anything built on the same concept, may be slow, but it does the job, and we don't have anything better until someone comes up with one. Also, ASS is universal in the sense that, for each website, you only need a few templates (excerpt images), set (and test) the thresholds and set the command order, and you're all set; there is no need to carefully reverse-engineer tons of Javascript code. This may also help with archiving e.g. Facebook threads and other things besides news.
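
For illustration, the per-site setup could be as small as a structure like this (the site name, file names and thresholds are all made up):

  # Hypothetical per-site configuration for an ASS-style script: a few template
  # images, a tuned threshold for each, and the order of actions to perform.
  SITE_CONFIG = {
      "example-news.com": {
          "templates": {
              "read_comments": ("read_comments.png", 0.90),
              "more_comments": ("more_comments.png", 0.85),
              "page_bottom":   ("page_bottom.png",   0.95),
          },
          "steps": ["read_comments", "more_comments"],   # the command order
      },
  }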

Websites being archived

For an easier overview, let this page have subpages for countries/languages.