Difference between revisions of "User talk:Archive Maniac"

From Archiveteam

Revision as of 18:04, 27 November 2014

Hi Archive Maniac, if you're having trouble, it's best to chat on IRC on the #archiveteam channel on EFnet where more people can help. I don't know how to upload wikis so you will need to join the #wikiteam channel for help. Please be patient and leave your chat client connected to give someone time to answer. Thanks. Chfoo 01:37, 17 February 2014 (EST)

Hi, sorry for not responding to your earlier messages. I don't check the wiki for messages that often because Archive Team does all its discussion on IRC. There are no forums, unfortunately. If you have trouble with IRC, you can email me and I can get back to you sooner.

The best way to store your backups is to keep copies on multiple hard drives. Like VHS tapes and audio cassettes, CDs and DVDs wear out after a while. It's called disc rot. Although hard drives don't last long either, they hold much more data and are cheaper in the long run.

People who run the Warrior scripts manually usually have experience and money to spend on cloud computing for virtual hosts so they can run dozens of the scripts at once. This is why the people at the top of the Warrior leaderboards have gigabytes and gigabytes downloaded.

Archive Team already has a way for people to submit websites to be archived. It's called ArchiveBot and anyone can use it. All Archive Team files are placed into the archiveteam collection. Adding files to the collection is restricted since files under this collection show up in the Wayback Machine.

Regarding uploading things to Internet Archive, uploading archives with good conventions is excellent and I wish more people would take initiative and be proactive.

However when uploading websites, you need to upload WARC files instead of a 7z file of the website. With wget, you'll need to use the --warc-file option. For example, --warc-file example will produce a WARC file called example.warc.gz. You want to use WARC files so The Wayback Machine can load them and show the archives properly.
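A minimal sketch of such a wget call, written as a small Python helper that only builds and prints the command (it does not run wget). The URL is a placeholder, and the <code>--mirror</code> and <code>--page-requisites</code> flags are assumptions added for illustration; only <code>--warc-file</code> comes from the message above.

```python
import shlex


def wget_warc_command(url, warc_prefix):
    """Build a wget command line that records the crawl into a WARC file."""
    return [
        "wget",
        "--mirror",                    # recursive download (assumed, not from the message)
        "--page-requisites",           # also fetch images, CSS etc. (assumed)
        "--warc-file=" + warc_prefix,  # prefix of the WARC file to write
        url,
    ]


def expected_warc_name(warc_prefix):
    """Name of the compressed WARC file described above for a given prefix."""
    return warc_prefix + ".warc.gz"


if __name__ == "__main__":
    # Placeholder URL; replace it with the site you want to archive.
    command = wget_warc_command("http://example.com/", "example")
    print(shlex.join(command))
    print(expected_warc_name("example"))
```

The printed command can be pasted into a terminal once the placeholder URL is replaced.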

I hope I answered your questions and sorry for missing your earlier messages. --- Chfoo 16:18, 12 April 2014 (EDT)

Some friendly words

The text with small letters is obsolete, see update under that.

I don't like starting private conversations except about technical things. However, I've seen your strange activities on the ArchiveTeam IRC channels recently, and I can't help saying some words.

First, I won't ever be sarcastic or cynical with you. Some AT members may have been, but it's understandable. We have different amounts of patience. They have made assumptions about your age as well; however, we have no information about that.

Seeing your reactions and activities, I think I can understand your behaviour. I used to make similar actions and reactions myself, so we have common traits in some way, if you don't mind me saying this.

I was lucky to be present when the initial affair happened. I read through the lines several times, but the only thing I could conclude is that you accidentally wrote those lines to that window (they were totally out of context and you said this yourself too), but you were immediately banned. I don't remember you asking too much as you state on your user page. It is possible I didn't get something, and SketchCow and the others had their reasons to qualify you as "persona non grata", but I don't see them.

Either way, you shouldn't feel offended. If you had logged in with another nickname, no one would have ever remembered your earlier activity. Even if you had logged in as... you know how, even then, I'm sure, no one would have said a word against or about you, provided you acted normally.

Even now, you could see people tried to be friendly towards you. However, what you feel is that they have some kind of hatred against you, and you must take revenge. No. It's not true. People don't hate you, even now, and you don't need to take revenge. I hate to say this but if you go on acting like this, then they may become actually fed up. But it's not too late now to turn back on this crazy way.

What you have been doing is called "demonstrating" on Wikipedia. It is sad to see an otherwise valuable member do that. You seem to be a valuable member, doing useful things for/with/like ArchiveTeam. Please be collaborative and not disruptive. You don't have to say much, or even do much. I myself don't say or do too much (however, more and more as I've been an AT member for more and more time). I'm sure all (or at least 98% of) ArchiveTeam members count on your work and welcome you if you don't act in a kind of crazy way, if you don't mind me saying this.

And, one thing about SketchCow: he is neither a de jure nor a de facto leader of AT. He writes about himself: "While I am a (generally) beloved figure who is appreciated for his public speaking skills and snappy dressing, Archive Team has collectively disagreed with me and some projects have been approached completely different ways than I would have approached them." What's more, you don't have to talk to him. I myself haven't talked to him yet either; I just listen to him and agree or disagree with him privately.

You write on your user page that there are friendly people here. Definitely, more than you think. As I see, almost every one of them. There are ones who don't seem to be so good mannered – but where aren't people like them? They are good too, just not that patient or have their own problems or such. (SketchCow has really unique manners, some adorable, some maybe not, but the same could be said about any one of us.)

I want to assure you that you can ask me if you have questions or want to discuss something, and I won't try to get rid of you, and I will try not to hurt you with my words. And I want to encourage you to take part in ArchiveTeam's nice work. I've been in the group only for some months so far, but every day I know more and more about archiving, web, programming – and archiving is kind of fun, isn't it? Be sure your work is appreciated by everyone, just avoid demonstrating like today. Apart from today's demonstrative activities, your work (making website crawls, informing AT about closures, running warriors) is appreciated. I think everyone is ready to forget everything about you immediately, if you return to that kind of work, with a calm tone. The one on your user page is a good starting point.

I know what it is like to be touchy. I am (or used to be) touchy myself. People forget and forgive, and we outgrow our traits like that. So cheer up and ArchiveTeam awaits you in its journey and mission!

Yours truly, bzc6p (talk), 17 October 2014, 14:55 (UTC)

I studied your "history", the events preceding your ban. So basically the only problem was that you talked much offtopic on ArchiveTeam channels and asked many, not-that-much important questions.

About the first thing. No problem that you are chatty. You could think that these IRC channels are also meant for talk you initiated. You didn't mistake too much about it, just a bit. You just need to accept that these channels are not completely like you imagined. It's not a problem with you, nor with the channel. But the two together. You can't do much about it, but don't be angry with channel members. Nor are they angry with you, they just find what you were doing inappropriate.

About the second thing. For some of your questions, the previous paragraph applies. For the others: some answers you may find out yourself, some of them you don't necessarily need to know. No problem with curiosity, but members may find too many questions exhausting. I hope you understand this. (I say this while I myself tend to ask too many questions sometimes, to make sure, but I am also patient answering questions. Not all of us must be like me regarding this thing, it's understandable that some people don't like tons of questions.)

And about both of the two things in general: too much text in IRC channels and logs makes the essence get lost. At least I think this. This is another thing why we should talk only about archiving-related stuff on AT IRC channels.

Still I uphold much of what I wrote earlier. You shouldn't be cross with AT members, and you especially shouldn't swear at them. If you consider what I wrote in the preceding paragraphs, you will be welcome on IRC even after these things. (Or, if you want to make sure, you can choose another nickname. That doesn't matter too much, I think.) Don't let revenge lead your actions. That's disruptive and counterproductive. None of us can do quality work if we don't listen to each other, study, sometimes ask. We know more and more every day, and after a point we answer more than we ask. But only if we are collaborative. That's the way it goes.

I'm ready to answer your questions if I can, I think I won't run out of patience too early. (No problem if someone does, but then that person shouldn't be bothered too much.) You can use my talk page if you have questions you think I can answer.

I'm glad to see you didn't give up archiving, even if you communicated this on IRC in a quite provocative way. I want to repeat that you can't possibly do quality work if you ignore other, more experienced members. Don't get hurt if they say your product is not okay. What to do with incompatible or corrupted or incomplete files? You should accept the advice. All of us do so. If anything, archiving is a thing which you can't do with completely closed eyes and ears.

And please don't curse SketchCow or anyone else... We must conform to others' manners when we talk to them. They also do so when they talk to us. This is the way it goes, again. I'm sure you know how it feels to be hurt. Why would you hurt others then?

I myself feel that I must be careful when talking to some people, especially if they are much older than me or have strange manners. So do others when talking to us (e.g. not to hurt, being patient etc.). And, about mistakes, we all forget and forgive – and learn.

I know the things I just wrote may be seen as spam, or at least needless and offtopic and too personal for this wiki. However, I just wanted to tell you that your archiving efforts are appreciated, and with some experience you may soon become a valued member of ArchiveTeam, doing a lot of good stuff. You only need to be patient yourself, listen to others, read instructions and IRC, try things you are unsure of, and if something is important or you can't find it out, ask. More or less this is what I've been doing, and I haven't had quarrels with others in AT so far, but I'm already on the level of being able to answer some questions and do good work (I think so).

I think I can tell you on behalf of ArchiveTeam that if you consider what I've written above, you'll be fine and your work will be welcome.

I hope we can count on you in the future. That's why I wrote this 10kb-ish post. (Sorry everyone for writing so much, this is one of my weaknesses.)

Yours truly, bzc6p (talk), 18 October 2014, 20:23 (UTC)

You are welcome. However, I think it would be too early and strange if I entered the channel that "Hey guys, Dec-31-99 is sorry and wants you to forgive him"... It will resolve itself, if you wait a couple of days. Then, if you want to tell them something important (in short, to make sure), they won't kick you out, I'm sure – provided you follow the guidelines others and I told you.
I'm sure that I'm not the only one who "understood your situation". Rather, I may be the only time-millionaire who can type 10 kB to "explain ArchiveTeam".
Well, the message "if you know any other Hungarian sites..." is addressed to Hungarian people in the first place, they can find sunsetting sites easier, you guess why... but of course no one is excluded. I myself regularly check Google with keywords like "web site closes" (in Hungarian). (In fact, this is how I found Panoramio and alerted ArchiveTeam!) As for GPortál, it's a very big WYSIWYG website hosting service and has other services as well. I don't expect it to close without any notification, and if it is ever going to shut down, that will be a big thing and will make noise.
For the specific website you mentioned: if you want to archive that site (I don't have the time now, I'm concerned with Demotiváló right now – and you could learn by grabbing this donkeykong site), you can do two things. One is that you pass it to ArchiveBot. I haven't used that so you need to check out how it works. (My projects so far needed special care, I think ArchiveBot couldn't have done them itself. But if it's a simple website with not too much awful Javascript, hidden comments etc, it may be able to handle it.) The other thing is that you grab the website yourself. For that I recommend wpull, which is a wget-like software designed with creating WARC files in mind. I didn't check the website too deeply, but as far as I can see, website components reside under "donkeykong.gportal.hu" and "gportal.hu/portal/donkeykong". The wpull command I would try first:
wpull http://donkeykong.gportal.hu/ --accept-regex "donkeykong.gportal.hu|gportal.hu/portal/donkeykong" -o log.txt --no-warc-keep-log --recursive --level inf -p -H -Dgportal.hu --tries inf --no-robots --retry-connrefused --retry-dns-error --delete-after --warc-cdx --database DATABASEFILENAME --warc-file WARCFILENAME
where you choose DATABASEFILENAME and WARCFILENAME as you wish. The database file lets you continue the download; the only problem is that wpull then ignores the already existing WARC file (and overwrites it). If I archive a larger site, I prepare for this: I give the WARC file name the suffix _01 first, and if wpull gets stopped for some reason, I change the suffix to _02 etc., leaving the other options intact. Having several files is not too elegant, but later they may be merged together with some megawarc tool. But if you have a good internet connection (here the problem is that for some reason wpull pretends there is no connection when there is; it may be a bug) and the site is not that big, it may come down in one run – in that case you can omit the database file and the suffixes. The latter is the desirable way.
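The numbered-suffix habit described above can be sketched in Python like this. It is only an illustration under assumptions: the helper names, the base name "site" and the database name "site.db" are hypothetical, and the script only prints a command instead of starting wpull. The wpull options used are the ones from the command above.

```python
import re
import tempfile
from pathlib import Path


def next_warc_prefix(directory, base):
    """Return base_01, base_02, ... using one more than the highest number found."""
    pattern = re.compile(re.escape(base) + r"_(\d+)\.warc(\.gz)?$")
    used = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            used.append(int(match.group(1)))
    return "{}_{:02d}".format(base, max(used, default=0) + 1)


def wpull_resume_args(url, database, warc_prefix):
    """Build a wpull argument list that reuses the database but writes a new WARC."""
    return [
        "wpull", url,
        "--recursive", "--level", "inf", "-p",
        "--delete-after",
        "--database", database,      # same database file on every run
        "--warc-file", warc_prefix,  # new suffix on every run
    ]


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    # Pretend a first run already left a WARC file behind.
    (workdir / "site_01.warc.gz").touch()
    prefix = next_warc_prefix(workdir, "site")
    print(" ".join(wpull_resume_args("http://example.com/", "site.db", prefix)))
```

Since the database file name stays the same between runs, only the WARC suffix changes, which is what keeps the earlier WARC file from being overwritten.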
Wpull documentation, including a manpage-style option overview: http://wpull.readthedocs.org
See The WARC Ecosystem for warc-tools.
If you want to test your WARC, try warc-proxy. Even ArchiveTeam uses that sometimes. I've read somewhere that one of your (?) WARCs couldn't be injected into Wayback Machine for some reason. Well, if warc-proxy can read your WARC, that doesn't necessarily imply that Wayback also will, but we can hope.
These are all Linux tools. I don't know any tools for Windows. Software like HTTrack may be good in mirroring, but they don't speak WARC, and WARC is essential for Wayback Machine.
bzc6p (talk) 19 October 2014, 22:25 (UTC+2)
wpull has just dropped Python2 support.
You can run Python programs on Windows if you have Python and the other dependencies installed, can't you? (I haven't tried.)
bzc6p (talk) 20 October 2014, 17:52 (UTC+2)
A possible and handy solution is to create a virtual machine with a minimalist Linux installation (e.g. Debian testing, and when installing, choose Expert install and don't go further than installing the base (or core) system if you don't want a GUI). I do the same myself, as Debian stable (what I use) seems to be too obsolete for wpull. I don't remember errors when installing wpull on Debian testing.
On the other hand, I could install the ArchiveTeam scripts easily on Debian stable and had problems on Debian testing, so I run the scripts on the real stable system, and also a virtual machine with testing to run wpull. bzc6p (talk) 21 October 2014, 07:31 (UTC+2)

Re: Any Help on Chat?

I don't think I have that much a way with words. I rarely speak on AT IRC channels, and have never done on #archiveteam-bs and on #archivebot.

Regarding #archiveteam-bs, the best way to find out the appropriate behaviour is to read through some of the chatlogs. On http://badcheese.com/~steve/atlogs/ you can read the logs of some channels (including #archiveteam, #archiveteam-bs, but unfortunately not #archivebot) for the last 10 days directly, but by changing the parameter in the URL, you can even go back several months.

Regarding #archivebot, I've never been there and have no chatlogs, so I can rely only on what is written on the wiki: "Channel for controlling ArchiveBot. Discussions about ArchiveBot development also take place here."; and yipdw wrote on #archiveteam-bs on 2014-09-09: "in that channel the expectation is that you're there to issue commands, check up on a job, or talk about something to work on; talking about how archivebot works is fine but there's a point where it just gets annoying to deal with". There is a wiki page with basic information about ArchiveBot.

Regarding my IRC presence, I'm usually logged in to channels of featured – and currently active – projects, mainly to follow the news. Right now I'm available on #quitpic. Sometimes I also log in to #archiveteam, but usually for a short time, when announcing something important and waiting for the reactions. I said I rarely speak on IRC: I only answer questions not answered in some minutes, or announce important news or problems yet not noticed by the people "in charge".

My IRC username is the same as here: bzc6p. However, I'm much of the time away from keyboard, but I usually check the log when coming back, and reply to private messages if any.

bzc6p (talk) 22 October 2014, 12:59 (UTC+2)

Invitation for private chat

Let's talk in private my friend. Please come to #pmchannel on EFnet. (There you can recommend a better "place" if you have any.) I'll be by the computer or check often from today to Sunday from ~7:00 until 22:00 UTC. I count on your attendance. bzc6p (talk) 23 October 2014, 10:44 (UTC+2)

Damn timezones. Thank you for being there – I missed your arrival and leaving just by some tens of minutes... Well, the weekend may be better for us in terms of free time and sleeping patterns, but I don't wait until that. Next time I'll get up during the night (that's your afternoon and evening), and we can talk. We may give each other our email addresses (I don't want to disclose it publicly) to overcome this timezone issue. I want to end this private communication on this wiki.
See you there tomorrow, and sorry for this situation.
bzc6p (talk) 23 October 2014 21:03 PDT / 24 October 2014 06:03 CEST

ArchiveBot Requests

To tell the truth, I don't really care about ArchiveBot, at least for now, for three reasons. One, I don't consider most websites simple enough that a wpull run can get everything without human intelligence. Two, I don't want to use others' bandwidth while I manage with mine. Three, I can learn a lot about archiving websites if I do that myself.

So I think I can't take ArchiveBot requests. (I don't even know its commands, etc.) Moreover, I have no more right in ArchiveBot or ArchiveTeam channels than you. And, if you have only some sites you want to be grabbed by archivebot, people in #archiveteam usually initiate the task of archiving a page if you ask them.

If you've been banned in such a way that you can't enter those channels at all (even with another nickname), that's another case; then tell me and I'll pass on your request.

Sorry if I sounded rough or something, don't take it personally. I've been busy these days and I'm quite tired at the moment.

Regards, bzc6p (talk) 15:03, 19 November 2014 (EST)

1. My favourite web archiving tool is wpull. I had a problem with wget parsing certain HTML files. And a great thing in wpull (which wget lacks, as far as I know) is that it can store its database in a separate file. So, when continuing a mirroring, it doesn't need the files it earlier downloaded (they don't even need to be stored, --delete-after), but uses the database file instead. (You can even manipulate it with e.g. sqlitestudio, for example, to prevent failing URLs from being retried forever, to add new URLs, etc. – however, normal usage may not require this, and it may be inappropriate.)
I don't know about any other sophisticated WARC supporting mirroring tool.
2. Yes, of course, I do that myself too. As far as I know, ArchiveBot runs wpull, and I believe that an ArchiveBot command just initiates a recursive download of the site with page requisites – I don't know it well, but surely there is no (easy) way to apply human intelligence in a way like I do in my mirrors, in several steps, taking scripts and other things into consideration.
So I think there's nothing you couldn't do and ArchiveBot could. (Except maybe uploading directly to the ArchiveTeam collection, but that's not so important.) However, AB may have more space, better performance and a stable internet connection. But the latter can also be worked around: if you need to stop and continue the grab, you can do that with the help of the database file, but you must give a different WARC file name (you may append a postfix), and finally you can concatenate them using megawarc.
3. Possibly because Python is – I guess – much more portable and platform independent, and the source code doesn't need to be compiled every single time (it's an interpreted language).
4. I don't, but I bet you'll find information about it with a Google search.
No, I'm not annoyed at all. However, it may happen that I can't/don't answer very soon. bzc6p (talk) 14:43, 21 November 2014 (EST)
Well, as I remember, on a Debian Jessie (currently testing branch) I could install wpull smoothly. I think the python3 and possibly the python3-pip packages are necessary to issue pip3 install wpull, and that pulls the dependencies automatically.
On older Debian (Wheezy, it's the stable) I couldn't install it, because Wheezy seems to insist on Python2 as default. I run a virtual machine with Jessie (without GUI) to run wpull. You said you had similar problems on Windows. Well, I haven't used much Windows in a while and not at all its new versions, so I think I can't help with that. It should work on Debian Jessie. Or, if you prefer Ubuntu or something else, if it's a recent version and prefers (or at least supports well) Python3, that should suffice too. (There must be a Windows workaround too, I believe, documented somewhere on the internet, in general about Python3.)
bzc6p (talk) 15:04, 22 November 2014 (EST)
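On merging the numbered WARC parts mentioned in point 2 above: the sketch below is not megawarc, but plain byte-for-byte concatenation swapped in for illustration. It assumes the parts are gzip-compressed .warc.gz files; a file made of several gzip members is itself a valid gzip file. The file names and stand-in payloads are hypothetical and are not real WARC records, so only the gzip layer is demonstrated.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concatenate_warcs(parts, output):
    """Append each part to output byte for byte, in the given order."""
    output = Path(output)
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return output


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    # Stand-in data instead of real WARC records.
    (workdir / "site_01.warc.gz").write_bytes(gzip.compress(b"first part\n"))
    (workdir / "site_02.warc.gz").write_bytes(gzip.compress(b"second part\n"))
    parts = sorted(workdir.glob("site_*.warc.gz"))
    merged = concatenate_warcs(parts, workdir / "merged.warc.gz")
    # Both members can be read back from the merged file.
    print(gzip.decompress(merged.read_bytes()))
```

Whether the Wayback Machine accepts a file merged this way is not something this sketch can show; testing the result with a tool such as warc-proxy first seems prudent.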

Re: Wikiadownloader.py problem

bzc6p here, let me answer your question until chfoo gives a better one (if necessary).

At the beginning of the wikiadownloader.py you can read the following:

# using a list of wikia subdomains, it downloads all dumps available in Special:Statistics pages
# you can use the list available at the "listofwikis" directory, the file is called wikia.com and it contains +200k wikis

So, wikia.com is actually a file, so the script isn't wrong, at least at this point. However, I couldn't find the file where it is said to be. But indeed, there are files in that directory (in fact, in its subdirectories) that have lists of wikis. After studying the code, I think you need to download a list, rename it to wikia.com and start the script (the listfile must be in the same directory as the script). See also the instructions in the script file. bzc6p (talk) 13:04, 27 November 2014 (EST)
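The renaming step can be sketched as a small Python helper, under assumptions: the name of the downloaded list ("downloaded-list.txt") and the helper itself are hypothetical, and the only thing taken from the script's comments is that the list file must be called wikia.com and sit in the same directory as the script.

```python
import shutil
import tempfile
from pathlib import Path


def install_wiki_list(downloaded_list, script_dir):
    """Copy a downloaded list of wikis into script_dir under the name wikia.com."""
    source = Path(downloaded_list)
    if not source.is_file():
        raise FileNotFoundError("list file not found: {}".format(source))
    target = Path(script_dir) / "wikia.com"
    shutil.copyfile(source, target)
    return target


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    # Stand-in list with made-up entries; use a real list from the listofwikis directory.
    sample = workdir / "downloaded-list.txt"
    sample.write_text("examplewiki\nanotherwiki\n")
    script_dir = workdir / "scripts"
    script_dir.mkdir()
    print(install_wiki_list(sample, script_dir))
```

After that, wikiadownloader.py would be started from the same directory; what format the script expects inside the list is best checked in the script itself.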