
YouTube

URL: http://youtube.com
Status: Online!
Archiving status: Not saved yet
Archiving type: Unknown
IRC channel: #archiveteam-bs (on hackint)

YouTube is a video-sharing website owned by Google, and is currently the most popular video-hosting website on the planet.

Archiving tools

Several free FLV downloaders and URL-to-video converter sites exist on the web; Archive Team rescue projects usually use youtube-dl.
YouTube annotations (the speech bubbles and notes overlaid on videos) are available as XML:

http://www.youtube.com/api/reviews/y/read2?feat=TCS&video_id=

To transform this XML to SRT, use ann2srt.
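For example, the annotations for a single video can be pulled down with curl, assuming the endpoint still responds; VIDEO_ID here is a placeholder for the real 11-character video ID, and the quotes keep the shell from interpreting the &:

curl -o VIDEO_ID.annotations.xml "http://www.youtube.com/api/reviews/y/read2?feat=TCS&video_id=VIDEO_ID"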

Recommended way to archive YouTube videos

First, download the video/playlist/channel/user using youtube-dl:

youtube-dl --title --continue --retries 4 --write-info-json --write-description --write-thumbnail --write-annotations --all-subs --ignore-errors -f bestvideo+bestaudio URL
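For a video titled Example, that run leaves behind files along these lines (the thumbnail and subtitle extensions vary with what YouTube serves; VIDEOID stands in for the real 11-character ID):

Example-VIDEOID.mkv
Example-VIDEOID.info.json
Example-VIDEOID.description
Example-VIDEOID.jpg
Example-VIDEOID.annotations.xml
Example-VIDEOID.en.vtt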

This can be simplified by running the script by emijrp and others, which also handles upload.

You need a recent ffmpeg or avconv (2014 or newer) for the bestvideo+bestaudio muxing to work. On Windows, you also need to run youtube-dl with Python 3.3/3.4 instead of Python 2.7, otherwise non-ASCII filenames will fail to mux.

Also, make sure you're using the most recent version of youtube-dl. Previous versions didn't work if the highest quality video+audio was webm+m4a. New versions should automagically merge incompatible formats into a .mkv file.[1]

Then, upload it to https://archive.org/upload/. Make sure to upload not only the video itself (.mp4 and/or .mkv files), but also the metadata files created along with it (.info.json, .jpg, .annotations.xml and .description).
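If you would rather script that step than use the web form, here is a sketch using the ia command-line client (pip install internetarchive, then run ia configure once with your archive.org keys); the youtube-VIDEOID identifier is just an illustrative naming scheme, not an Archive Team convention:

ia upload youtube-VIDEOID Example-VIDEOID.mkv Example-VIDEOID.info.json Example-VIDEOID.description Example-VIDEOID.jpg Example-VIDEOID.annotations.xml --metadata="mediatype:movies"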

kyan likes this method:

Youtube sucker (look out: it leaves some incomplete downloads in the directory afterward; clean up with rm -v ./*.mp4 ./*.webm, then ls | grep '\.part$', pull the video IDs out of those filenames, redownload them, and repeat until nothing is left). You can upload the WARCs only, e.g. using ia (the Python Internet Archive client) or warcdealer (an automated uploader I hacked together), or, if you want, the other files too, though that's kind of wasteful of storage space. In my opinion, getting stuff without a WARC is a great crime, given the ready availability of tools to create WARCs. Note that this method also works for the other websites youtube-dl supports, although it may need different cleanup commands afterward. It depends on youtube-dl and warcprox running on localhost:8000.
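A minimal sketch of getting the proxy side running first (warcprox is on PyPI; flag spellings can vary between versions, so check warcprox --help — here -p sets the listening port and -d the directory the WARCs are written to):

warcprox -p 8000 -d ./warcs

The --proxy="localhost:8000" and --no-check-certificate flags in the command below are what route the traffic through warcprox, which records everything into WARCs while intercepting HTTPS with its own CA certificate.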

youtube-dl --title --continue --retries 100 --write-info-json --write-description --write-thumbnail --proxy="localhost:8000" --write-annotations --all-subs --no-check-certificate --ignore-errors -k -f bestvideo+bestaudio/best (stick the video/channel/playlist/whatever URL in here)
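And a hedged sketch of the cleanup-and-retry pass described above, assuming youtube-dl's default %(title)s-%(id)s.%(ext)s output template; the sed pattern is my own assumption and can misfire on the rare video IDs that contain a dash, so eyeball the extracted IDs before trusting it:

# delete the completed videos; the WARCs already hold everything we care about
rm -v ./*.mp4 ./*.webm ./*.mkv
# pull the 11-character video ID off each leftover .part file and retry it
for f in ./*.part; do
  id=$(printf '%s\n' "$f" | sed -E 's/.*-([A-Za-z0-9_-]{11})\.[^/]*$/\1/')
  youtube-dl --proxy="localhost:8000" --continue "https://www.youtube.com/watch?v=$id"
done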

vxbinaca likes this method for entire channels:

Put this in ~/.config/youtube-dl.conf (a usage example follows the lists below):

-q --download-archive ~/.ytdlarchive --retries 100 --no-overwrites --call-home --continue --write-info-json --write-description --write-thumbnail --write-annotations --all-subs --sub-format srt --convert-subs srt --write-sub --add-metadata -f bestvideo+bestaudio/best --merge-output-format 'mkv'

  • -q severely cuts down on all that output, so we see only the things that matter, like errors.
  • The download-archive file means we don't download the same video over and over on subsequent channel rips.
  • No overwrites, so video and metadata files aren't touched once written.
  • Both of those also reduce traffic and lookups.
  • Standards-based file formats for both subs and video, with the subs embedded in the video file. Instead of some files being webm, some mp4, some mkv, I pick one free format that's more expansive than webm and go with that.

Optional flags that can be safely turned off:

  • Some of the subtitle stuff
  • --call-home, which phones home to aid development of the script
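With that file in place youtube-dl reads it automatically, so ripping a whole channel comes down to one line (CHANNELNAME is a placeholder); thanks to --download-archive, re-running the same command later fetches only videos whose IDs aren't already in ~/.ytdlarchive:

youtube-dl https://www.youtube.com/user/CHANNELNAME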

Site reconnaissance

Little is known about its database, but according to data from 2006 it was 45 TB and doubling every 4 months. At that rate it would be 660 petabytes by now (Oct 2014).
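As a sanity check on that extrapolation (assuming the 45 TB figure dates from mid-2006): October 2014 is roughly 100 months later, or about 25 four-month doublings, and 45 TB × 2^25 ≈ 1.5 million petabytes. The 660 PB figure therefore implies a much gentler curve, closer to one doubling every 7 months (45 TB × 2^13.8 ≈ 660 PB).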

According to Leo Leung's calculations based on available information, an often-updated Google spreadsheet estimates that YouTube's content reached 500 petabytes in size in early 2015.

FYI, all of Google Video was about 45 TB, and Archive Team's biggest project, MobileMe, was 200 TB. The Internet Archive's total capacity was 50 PB as of August 2014. So let's hope YouTube stays healthy, because Archive Team may have finally met its match.

Vital signs

Will be living off Google for a long time if nothing changes.

References

See also

External links