Talk:Ex.ua/Scratchpad

From Archiveteam

Last-page detection - old version

There are two general ways I've found to do this.

The first (and mildly obvious) method would be to regex match portions of the HTML. This wouldn't be fooled by the (properly escaped) text on the site, but it's still not a sure thing at scale.

The second method I thought of relies on the fact that the site lets you seek past the end of conversations, and sends a page with a consistent, predictable HTML pattern when you do so.

The second approach seems more interesting to me, so I'll list it first (as Method A).


Method A

Object pages

Empty pages seem to look like this:

<p>
<table width=100% border=0 cellpadding=0 cellspacing=8 class=include_1>

</table>
<p>

The empty line is replaced with content on non-empty pages.

I think matching <p>\n<table width=100% border=0 cellpadding=0 cellspacing=8 class=include_1>\n\n</table>\n<p> could help here.
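As a rough sketch of that match in Python: the pattern below is copied from the sample above, while the function name `is_empty_object_page` and the tolerance for CRLF line endings are my own assumptions, not something the site is known to need.

```python
import re

# Pattern taken from the empty-page sample above. \r?\n is an assumption
# so the match survives if the server sends CRLF line endings.
EMPTY_OBJECT_PAGE = re.compile(
    r"<p>\r?\n"
    r"<table width=100% border=0 cellpadding=0 cellspacing=8 class=include_1>\r?\n"
    r"\r?\n"
    r"</table>\r?\n"
    r"<p>"
)


def is_empty_object_page(html: str) -> bool:
    """Return True if the page contains the empty include_1 table."""
    return EMPTY_OBJECT_PAGE.search(html) is not None
```

A crawler could request increasing page numbers and stop at the first page for which this returns True.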

I've also noticed that the content on these pages seems to be emitted on a single line (i.e. the HTML has no line breaks).


view_comments pages

Empty pages seem to look like this:

<table width=100% border=0 cellpadding=0 cellspacing=0 class=comment>
<tr><td></td></tr>

</table>

On non-empty pages, there's no blank line directly under the <tr><td></td></tr>; instead there's another <tr>, and further content follows it.

I think matching <table width=100% border=0 cellpadding=0 cellspacing=0 class=comment>\n<tr><td></td></tr>\n\n</table> could reliably tell when we're on an empty page.
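Since this pattern has no variable parts, a plain substring test may be enough instead of a regex. A minimal sketch, assuming the markup is exactly as in the sample (the name `is_empty_comments_page` and the line-ending normalization are my own additions):

```python
# Literal marker copied from the empty view_comments sample above.
EMPTY_COMMENTS_MARKER = (
    "<table width=100% border=0 cellpadding=0 cellspacing=0 class=comment>\n"
    "<tr><td></td></tr>\n"
    "\n"
    "</table>"
)


def is_empty_comments_page(html: str) -> bool:
    """Return True if the comment table holds only the empty placeholder row."""
    # Normalizing CRLF to LF is an assumption, in case the server sends CRLF.
    normalized = html.replace("\r\n", "\n")
    return EMPTY_COMMENTS_MARKER in normalized
```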


Method B

Object pages

On page 31 (http://rover.info/view_comments/93576596?p=30) you see this:

<td><font color=#808080><b>6201..6400</b></font></td>
<td><a href='/205?r=1&p=32'><img src='/t3/arr_r.gif' border=0 width=20 height=20 alt='перейти на следующую страницу, Ctrl →'></a></td>
<td class='small'>Ctrl →</td>
<td><a href='/205?r=1&p=32'>6401..6473</a></td>
<td><a href='/205?r=1&p=32'><img src='/t3/arr_e.gif' border=0 width=20 height=20 alt='перейти на последнюю страницу, всего позиций - 6473'></a></td>

On page 32 (http://rover.info/view_comments/93576596?p=31) the same section of HTML turns into this:

<td><font color=#808080><b>6401..6473</b></font></td>
<td><img src='/t3/arr_rg.gif' width=20 height=20 alt='вы находитесь на последней странице'></td>
<td><img src='/t3/arr_eg.gif' width=20 height=20 alt='вы находитесь на последней странице'></td>
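For reference, the alt texts in the samples translate roughly to "go to the next page", "go to the last page, total items - 6473" and "you are on the last page". One way to use the difference between the two samples is to look for the greyed-out arrow images, which only show up in the last-page sample. This is a sketch under that assumption; the function names are hypothetical, and whether the greyed icons appear anywhere else on the site is untested.

```python
import re

# Image paths taken from the last-page sample above. Treating them as a
# reliable last-page indicator is an assumption.
LAST_PAGE_ICONS = ("/t3/arr_rg.gif", "/t3/arr_eg.gif")

# The bold grey cell showing the item range of the current page.
CURRENT_RANGE = re.compile(
    r"<td><font color=#808080><b>(\d+)\.\.(\d+)</b></font></td>"
)


def is_last_object_page(html):
    """Return True if both greyed-out arrow icons are present."""
    return all(icon in html for icon in LAST_PAGE_ICONS)


def current_range(html):
    """Return (first, last) item numbers of the current page, or None."""
    match = CURRENT_RANGE.search(html)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The range is handy as a cross-check: on the last page its upper bound should equal the total item count shown in the alt text of earlier pages.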


view_comments pages

On page 11, you'll see this:

...<a href='/view_comments/93576596?p=11'>12</a>

On page 12, you'll see this:

...12

Detection ideas:

  1. Substring searches (inserting (curpage-1) and curpage into the test string)
  2. If a regex match against <span class=r_browse_selected>[0-9]+</span></td></tr></table> fails, you aren't on the last page yet. --I336 (talk) 23:52, 12 December 2016 (EST)
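Both ideas could look something like the sketch below. It assumes that `p` in the URL is zero-based (so the page labelled 12 is `p=11`, as in the samples), that the pager always shows a link to the page right after the current one, and that the regex in idea 2 starts at `<span` (the signature text is left out). The function names are hypothetical.

```python
import re

# Idea 2: the regex from the list above, without the signature text.
SELECTED_IS_LAST = re.compile(
    r"<span class=r_browse_selected>[0-9]+</span></td></tr></table>"
)


def has_next_comments_link(html, object_id, p):
    """Idea 1: substring search for the link to the following page.

    p is the zero-based page index from the URL; the following page is
    then linked as ?p=<p+1> with the visible label <p+2>.
    """
    next_link = f"<a href='/view_comments/{object_id}?p={p + 1}'>{p + 2}</a>"
    return next_link in html


def selected_page_is_last(html):
    """Idea 2: True if the selected page number closes the pager table."""
    return SELECTED_IS_LAST.search(html) is not None
```

With the samples above, `has_next_comments_link(html, 93576596, 10)` would be True on page 11 and `has_next_comments_link(html, 93576596, 11)` would be False on page 12.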