1. Datasets of Normal Crawl
We consider all the YouTube videos to form a directed graph, where each video is a node. If a video b is in the related video list (first 20 only) of a video a, then there is a directed edge from a to b. Our crawler uses a breadth-first search to find videos in the graph. We define an initial set of 0-depth video IDs, which the crawler reads into a queue at the beginning of the crawl. When processing each video, it checks the list of related videos and adds any new ones to the queue.
Given a video ID, the crawler first extracts information from the YouTube API, which provides all the meta-data except age, category and related videos. The crawler then scrapes the video's webpage to obtain the remaining information. We record the following information for each YouTube video, in this order; the fields are separated by tabs ('\t') in the data file.
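The breadth-first traversal described above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function name bfs_crawl, the get_related callback and the max_depth parameter are assumptions introduced here, and the toy graph stands in for real related-video lists.

```python
from collections import deque

MAX_RELATED = 20  # only the first 20 related videos count as edges


def bfs_crawl(seed_ids, get_related, max_depth=3):
    """Breadth-first traversal of the related-video graph.

    seed_ids: iterable of 0-depth video IDs.
    get_related: hypothetical callback returning the related IDs of a video.
    Returns a dict mapping each discovered video ID to its depth.
    """
    depth_of = {}
    queue = deque()
    for vid in seed_ids:
        if vid not in depth_of:
            depth_of[vid] = 0
            queue.append(vid)
    while queue:
        vid = queue.popleft()
        depth = depth_of[vid]
        if depth >= max_depth:
            continue  # videos at the deepest level are recorded but not expanded
        for rel in get_related(vid)[:MAX_RELATED]:
            if rel not in depth_of:  # add only videos not seen before
                depth_of[rel] = depth + 1
                queue.append(rel)
    return depth_of


# Toy graph with made-up IDs, used only to exercise the function.
toy_graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": ["e"], "e": []}
depths = bfs_crawl(["a"], lambda v: toy_graph.get(v, []), max_depth=2)
print(depths)
```

Keeping the depth next to each ID makes it straightforward to write the videos of each depth to separate files.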
video ID | an 11-character string, which is unique
uploader | a string of the video uploader's username
age | an integer number of days between the date when the video was uploaded and Feb. 15, 2005 (YouTube's establishment)
category | a string of the video category chosen by the uploader
length | an integer number of the video length
views | an integer number of the views
rate | a float number of the video rating
ratings | an integer number of the ratings
comments | an integer number of the comments
related IDs | up to 20 strings of the related video IDs
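A line of a data file could be parsed along these lines. This is a sketch under stated assumptions: the VideoRecord class and parse_record function are names invented for this example, the sample line uses made-up values, and treating lines with fewer than nine fields as incomplete is a guess about how partial records might look.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoRecord:
    video_id: str
    uploader: str
    age: int
    category: str
    length: int
    views: int
    rate: float
    ratings: int
    comments: int
    related_ids: List[str]


def parse_record(line: str) -> Optional[VideoRecord]:
    """Parse one tab-separated line; return None if fields are missing."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 9:
        return None  # assumption: incomplete records are skipped
    return VideoRecord(
        video_id=fields[0],
        uploader=fields[1],
        age=int(fields[2]),
        category=fields[3],
        length=int(fields[4]),
        views=int(fields[5]),
        rate=float(fields[6]),
        ratings=int(fields[7]),
        comments=int(fields[8]),
        # everything after the ninth field is a related video ID (up to 20)
        related_ids=[f for f in fields[9:] if f],
    )


# Made-up sample line, for illustration only.
sample = "abcdefghijk\tsomeuser\t700\tMusic\t215\t1000\t4.5\t20\t7\tAAAAAAAAAAA\tBBBBBBBBBBB"
record = parse_record(sample)
print(record)
```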
Our first crawl was on February 22nd, 2007, and started with an initial set of videos from the lists of "Recently Featured", "Most Viewed", "Top Rated" and "Most Discussed", for "Today", "This Week", "This Month" and "All Time", which totalled 189 unique videos on that day. The crawl went to more than four depths, finding approximately 750 thousand videos in about five days. In the following weeks we ran the crawler every two to three days, each time defining the initial set of videos from the lists of "Most Viewed", "Top Rated" and "Most Discussed", for "Today" and "This Week", which amounts to about 200 to 300 videos. On average, each crawl found 73 thousand distinct videos in less than 9 hours.
All 35 datasets can be downloaded from here. In each package, there are: (1) "0.txt", "1.txt", "2.txt" and "3.txt" (and "4.txt" in "0222.zip"), storing the video data of depth 0, 1, 2 and 3 respectively; (2) "log.txt", a log file indicating the start and finish time, the number of videos crawled and the duration of each depth.
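One way to read a downloaded package is to walk the per-depth files inside the zip archive. The sketch below is an illustration only: iter_depth_lines is a name chosen here, the UTF-8 decoding and the tolerance for files nested in a sub-folder are assumptions, and the demo builds a tiny in-memory archive with placeholder content rather than using a real package.

```python
import io
import zipfile


def iter_depth_lines(zip_file, max_depth=4):
    """Yield (depth, line) pairs from "0.txt", "1.txt", ... in a crawl package.

    zip_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_file) as zf:
        # Map bare file names to archive members, in case files sit in a folder.
        names = {name.rsplit("/", 1)[-1]: name for name in zf.namelist()}
        for depth in range(max_depth + 1):
            member = names.get(f"{depth}.txt")
            if member is None:
                continue  # e.g. "4.txt" exists only in "0222.zip"
            with zf.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                for line in text:
                    line = line.rstrip("\r\n")
                    if line:
                        yield depth, line


# Demo on a small in-memory archive with placeholder lines.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as demo:
    demo.writestr("0.txt", "id0\n")
    demo.writestr("1.txt", "id1\nid2\n")
    demo.writestr("log.txt", "placeholder log\n")
buf.seek(0)
pairs = list(iter_depth_lines(buf))
print(pairs)
```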
No. | File (click to download) | Date | Total Crawled Videos | Duration (seconds) | Note
1 | 0222.zip | Feb. 22nd, 2007 | 749361 | 432589 | crawled to fifth depth, but did not finish
2 | 0301.zip | Mar. 1st, 2007 | 155513 | 93352 |
3 | 0302.zip | Mar. 2nd, 2007 | 10324 | 4692 | only crawled three depths
4 | 0303.zip | Mar. 3rd, 2007 | 68343 | 30898 |
5 | 0305.zip | Mar. 5th, 2007 | 65754 | 21578 |
6 | 0306.zip | Mar. 6th, 2007 | 42775 | 14203 |
7 | 0309.zip | Mar. 9th, 2007 | 64239 | 29681 |
8 | 0313.zip | Mar. 13th, 2007 | 100382 | 42383 |
9 | 0314.zip | Mar. 14th, 2007 | 71347 | 31362 |
10 | 0315.zip | Mar. 15th, 2007 | 45338 | 19068 |
11 | 0316.zip | Mar. 16th, 2007 | 42628 | 17925 |
12 | 0318.zip | Mar. 18th, 2007 | 63725 | 32160 |
13 | 0320.zip | Mar. 20th, 2007 | 45239 | 18128 |
14 | 0322.zip | Mar. 22nd, 2007 | 53156 | 21199 |
15 | 0325.zip | Mar. 25th, 2007 | 62327 | 24960 |
16 | 0327.zip | Mar. 27th, 2007 | 50940 | 19849 |
17 | 0329.zip | Mar. 29th, 2007 | 103476 | 47433 |
18 | 0403.zip | Apr. 3rd, 2007 | 101693 | 39617 |
19 | 0410.zip | Apr. 10th, 2007 | 206590 | 120455 | fourth depth was not completed
20 | 0413.zip | Apr. 13th, 2007 | 83712 | 36535 |
21 | 0418.zip | Apr. 18th, 2007 | 91822 | 35555 |
22 | 0420.zip | Apr. 20th, 2007 | 112259 | 90939 |
23 | 0422.zip | Apr. 22nd, 2007 | 89017 | 29036 |
24 | 0424.zip | Apr. 24th, 2007 | 76936 | 24571 |
25 | 0426.zip | Apr. 26th, 2007 | 62897 | 20433 |
26 | 0428.zip | Apr. 28th, 2007 | 71794 | 21455 |
27 | 0430.zip | Apr. 30th, 2007 | 84675 | 26874 |
28 | 0502.zip | May 2nd, 2007 | 39238 | 13639 | fourth depth was not completed due to networking problem
29 | 0505.zip | May 5th, 2007 | 62203 | 26814 |
30 | 0507.zip | May 7th, 2007 | 62440 | 19911 |
31 | 0509.zip | May 9th, 2007 | 67939 | 21681 |
32 | 0511.zip | May 11th, 2007 | 66595 | 20531 |
33 | 0513.zip | May 13th, 2007 | 67370 | 21312 |
34 | 0515.zip | May 15th, 2007 | 45071 | 21312 |
35 | 0518.zip | May 18th, 2007 | 51873 | 15880 |
The new data crawled in 2008 are as follows:
No. | File (click to download) | Date | Total Crawled Videos | Duration (seconds) | Note
1 | 080327.zip | Mar. 27th, 2008 | 310005 | 72984 |
2 | 080329.zip | Mar. 29th, 2008 | 186671 | 44014 |
3 | 080331.zip | Mar. 31st, 2008 | 150382 | 38292 |
4 | 080402.zip | Apr. 2nd, 2008 | 154372 | 31394 |
5 | 080404.zip | Apr. 4th, 2008 | 155213 | 31903 |
6 | 080406.zip | Apr. 6th, 2008 | 128499 | 30830 |
7 | 080408.zip | Apr. 8th, 2008 | 133279 | 32044 |
8 | 080412.zip | Apr. 12th, 2008 | 123843 | 32691 |
9 | 080414.zip | Apr. 14th, 2008 | 86737 | 17580 |
10 | 080416.zip | Apr. 16th, 2008 | 101249 | 20779 |
11 | 080418.zip | Apr. 18th, 2008 | 114269 | 27237 |
12 | 080422.zip | Apr. 22nd, 2008 | 114269 | 28355 |
13 | 080424.zip | Apr. 24th, 2008 | 110232 | 30862 |
14 | 080426.zip | Apr. 26th, 2008 | 68205 | 16975 |
15 | 080428.zip | Apr. 28th, 2008 | 74816 | 17760 |
16 | 080430.zip | Apr. 30th, 2008 | 102969 | 21505 |
17 | 080502.zip | May 2nd, 2008 | 96684 | 25334 |
18 | 080504.zip | May 4th, 2008 | 69169 | 20761 |
19 | 080506.zip | May 6th, 2008 | 98766 | 19694 |
20 | 080508.zip | May 8th, 2008 | 84197 | 17148 |
21 | 080510.zip | May 10th, 2008 | 85094 | 23972 |
22 | 080512.zip | May 12th, 2008 | 91306 | 25485 |
23 | 080514.zip | May 14th, 2008 | 89824 | 39771 |
24 | 080516.zip | May 16th, 2008 | 88333 | 21908 |
25 | 080518.zip | May 18th, 2008 | 90320 | 25645 |
26 | 080520.zip | May 20th, 2008 | 78123 | 18765 |
27 | 080522.zip | May 22nd, 2008 | 73417 | 15274 |
28 | 080524.zip | May 24th, 2008 | 81537 | 18678 |
29 | 080526.zip | May 26th, 2008 | 73165 | 19818 |
30 | 080528.zip | May 28th, 2008 | 69112 | 23274 |
31 | 080530.zip | May 30th, 2008 | 72138 | 17795 |
32 | 080601.zip | Jun. 1st, 2008 | 69987 | 17197 |
33 | 080603.zip | Jun. 3rd, 2008 | 84000 | 18900 |
34 | 080605.zip | Jun. 5th, 2008 | 75995 | - | The crawl paused due to the networking problem.
35 | 080609.zip | Jun. 9th, 2008 | 77603 | 15333 |
36 | 080611.zip | Jun. 11th, 2008 | 72138 | 14528 |
37 | 080613.zip | Jun. 13th, 2008 | 72042 | 21381 |
38 | 080615.zip | Jun. 15th, 2008 | 68736 | 20136 |
39 | 080617.zip | Jun. 17th, 2008 | 74689 | 21144 |
40 | 080619.zip | Jun. 19th, 2008 | 72542 | 21808 |
41 | 080621.zip | Jun. 21st, 2008 | 55535 | 17371 |
42 | 080623.zip | Jun. 23rd, 2008 | 59165 | 15304 |
43 | 080625.zip | Jun. 25th, 2008 | 53232 | 18606 |
44 | 080627.zip | Jun. 27th, 2008 | 65906 | 17314 |
45 | 080629.zip | Jun. 29th, 2008 | 52219 | 13496 |
46 | 080701.zip | Jul. 1st, 2008 | 62065 | 15565 |
47 | 080703.zip | Jul. 3rd, 2008 | 62065 | 14607 |
48 | 080705.zip | Jul. 5th, 2008 | 51555 | 15052 |
49 | 080707.zip | Jul. 7th, 2008 | 63053 | 18723 |
50 | 08079.zip | Jul. 9th, 2008 | 54981 | 16367 |
51 | 080711.zip | Jul. 11th, 2008 | 65412 | 19833 |
52 | 080713.zip | Jul. 13th, 2008 | 48748 | 18690 |
53 | 080715.zip | Jul. 15th, 2008 | 54075 | 20403 |
54 | 080717.zip | Jul. 17th, 2008 | 48884 | 18205 |
55 | 080719.zip | Jul. 19th, 2008 | 72773 | 27226 |
56 | 080721.zip | Jul. 21st, 2008 | 46868 | 19656 |
57 | 080723.zip | Jul. 23rd, 2008 | 50290 | 22334 |
58 | 080725.zip | Jul. 25th, 2008 | 78680 | 31925 |
59 | 080727.zip | Jul. 27th, 2008 | 60205 | 32802 |

2. Datasets of Updating Crawl
To study the growth trend of video popularity, we also updated the statistics of some previously found videos. For this crawl we only retrieved the number of views for relatively new videos (uploaded after February 15th, 2007). This crawl was performed once a week from March 5th to April 16th, 2007, which resulted in seven datasets.
The 7 datasets can be downloaded from here. In each package, there are: (1) "update.txt", storing the updated video data; it contains video ID, views, rate, ratings and comments; (2) "id.txt", containing all the video IDs to be updated; (3) "log.txt", a log file indicating the start and finish time, number of videos updated and the duration.
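As an illustration of how two weekly snapshots might be compared, the sketch below computes the change in views per video. It assumes that each line of "update.txt" starts with the video ID followed by the view count, separated by whitespace; the function names read_views and view_growth and the sample lines are made up for this example.

```python
def read_views(lines):
    """Map video ID -> view count from the lines of an update file."""
    views = {}
    for line in lines:
        fields = line.split()
        # assumption: first field is the video ID, second the number of views
        if len(fields) >= 2 and fields[1].isdigit():
            views[fields[0]] = int(fields[1])
    return views


def view_growth(earlier, later):
    """Views gained between two snapshots, for videos present in both."""
    return {vid: later[vid] - earlier[vid] for vid in earlier.keys() & later.keys()}


# Made-up sample lines: video ID, views, rate, ratings, comments.
week1 = ["vidA\t100\t4.5\t10\t3", "vidB\t50\t3.0\t2\t0"]
week2 = ["vidA\t180\t4.6\t12\t4", "vidC\t9\t0\t0\t0"]
growth = view_growth(read_views(week1), read_views(week2))
print(growth)
```

Restricting the comparison to IDs present in both snapshots avoids treating a video that was removed or newly added as zero or negative growth.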
No. | File (click to download) | Date | Total Updated Videos | Duration (seconds)
1 | 0305u.zip | Mar. 5th, 2007 | 53818 | 4858
2 | 0312u.zip | Mar. 12th, 2007 | 79091 | 12931
3 | 0319u.zip | Mar. 19th, 2007 | 140702 | 12489
4 | 0326u.zip | Mar. 26th, 2007 | 185784 | 16473
5 | 0402u.zip | Apr. 2nd, 2007 | 217170 | 17541
6 | 0409u.zip | Apr. 9th, 2007 | 247526 | 20656
7 | 0416u.zip | Apr. 16th, 2007 | 316313 | 27145
In 2008, we updated the statistics of 161085 videos once a week for 21 weeks. The new data are also presented here:
No. | File (click to download) | Date | Duration (seconds)
1 | 080416u.zip | Apr. 16th, 2008 | -
2 | 080423u.zip | Apr. 23rd, 2008 | 9694
3 | 080430u.zip | Apr. 30th, 2008 | 9895
4 | 080508u.zip | May 8th, 2008 | 6227
5 | 080514u.zip | May 14th, 2008 | 10474
6 | 080521u.zip | May 21st, 2008 | 9968
7 | 080528u.zip | May 28th, 2008 | 10846
8 | 080604u.zip | Jun. 4th, 2008 | 12064
9 | 080611u.zip | Jun. 11th, 2008 | 12476
10 | 080618u.zip | Jun. 18th, 2008 | 10244
11 | 080625u.zip | Jun. 25th, 2008 | 8651
12 | 080702u.zip | Jul. 2nd, 2008 | 8659
13 | 080709u.zip | Jul. 9th, 2008 | 8816
14 | 080716u.zip | Jul. 16th, 2008 | 13399
15 | 080723u.zip | Jul. 23rd, 2008 | 14716
16 | 080730u.zip | Jul. 30th, 2008 | 13362
17 | 080806u.zip | Aug. 6th, 2008 | 19496
18 | 080813u.zip | Aug. 13th, 2008 | 14492
19 | 080820u.zip | Aug. 20th, 2008 | 14784
20 | 080827u.zip | Aug. 27th, 2008 | 14476
21 | 080903u.zip | Sep. 3rd, 2008 | 14683

3. Datasets of Video File Size and Bitrate
We have also separately crawled the video file size and video bitrate information. To get the file size, the crawler retrieves the response information from the server when requesting to download the video file and extracts the information on the size of the download. Some videos have the bitrate embedded in the FLV video meta-data, which the crawler extracts after downloading the meta-data of the video file.
In "0523.zip", there are: "idlength.txt", containing the video IDs and video lengths; "size.txt", which adds the video file size (an integer number of bytes) to the previous information; and "log.txt", a log file.
In "0628.zip", there are: "idlength.txt", containing the video IDs and video lengths; "rate.txt", which adds the video bitrate (a double, in kbps) to the previous information; and "log.txt", a log file. VBR videos yield an invalid bitrate value, so for them we use the video lengths and sizes to calculate the average bitrate.
In "080908sizerate.zip", there are: "idlength.txt", containing the video IDs and video lengths; "sizerate.txt", which adds the video file size (an integer number) and bitrate (a double, in kbps) to the previous information; and "log.txt", a log file. VBR videos yield an invalid bitrate value, for which we record "VBR".
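The average bitrate of a video can be derived from its file size and length. The sketch below shows one plausible form of that calculation; it assumes the length is in seconds, the size is in bytes and 1 kbps equals 1000 bits per second, none of which is stated explicitly above. The function names are introduced here for illustration.

```python
def average_bitrate_kbps(size_bytes, length_seconds):
    """Average bitrate in kbps, or None if the length is not positive.

    Assumes size in bytes, length in seconds, 1 kbps = 1000 bit/s.
    """
    if length_seconds <= 0:
        return None
    return size_bytes * 8 / 1000 / length_seconds


def resolve_bitrate(recorded, size_bytes, length_seconds):
    """Use the recorded bitrate if it is a positive number.

    Otherwise (for example when the field holds "VBR") fall back to the
    average computed from size and length.
    """
    try:
        value = float(recorded)
    except ValueError:
        value = None
    if value is not None and value > 0:
        return value
    return average_bitrate_kbps(size_bytes, length_seconds)


# Made-up numbers: a 3,000,000-byte file lasting 240 seconds.
avg = average_bitrate_kbps(3_000_000, 240)
from_vbr = resolve_bitrate("VBR", 3_000_000, 240)
from_value = resolve_bitrate("320.5", 3_000_000, 240)
print(avg, from_vbr, from_value)
```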
No. | File (click to download) | Date | Total Crawled | Duration (seconds) | Note
1 | 0523.zip | May 23rd, 2007 | 199691 | 88470 | video file size
2 | 0628.zip | Jun. 28th, 2007 | 199691 | 81082 | video bitrate
3 | 080908sizerate.zip | Sep. 8th, 2008 | 153710 | 104804 | video file size and bitrate