Difference between revisions of "ImportTextFiles.php"
Revision as of 16:37, 27 January 2022
Please refer to the original MediaWiki document[1]
A Typical Importation Process
- Extract text content from a source, ideally Wikipedia or another reputable data source with an existing data model.
  - In the case of Wikipedia, every page can be treated as a single text file, and each page has a unique ID provided in either the XML or the MediaWiki data model.
  - Dump all the textual content into a directory. Store each page in its own file, and record the page names in a comma-delimited (CSV) file in which each line associates a unique ID with a page name.
  - When importing Modules, whose textual content should be handled by the content model named Scribunto, it might be useful to add the prefix Module: so that, when the content is added, the system marks the page to use Scribunto as its content handling model. (This is yet to be tried.)
- Move the files to a place that MediaWiki's maintenance scripts can easily access, such as /var/www/html/images. For Docker deployments, this directory is likely to be mapped onto the host machine's hard drive, so it is easy to put the text files under it, for example in INPUTDATA_DIR under /var/www/html/images.
- Go to the terminal of the host machine that serves MediaWiki, then run the importTextFiles.php script as follows:
root@hostname:/var/www/html/# php ./maintenance/importTextFiles.php -s "Loading Textual Content from external sources" --overwrite --use-timestamp ./images/INPUTDATA_DIR/*
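The preparation steps above (one file per page, named by its unique ID, plus a CSV file that maps each ID to a page name) can be sketched as follows. This is only an illustration under stated assumptions: the helper add_page, the file name wp_articles.csv, and the sample IDs and page names are all hypothetical, and a temporary directory stands in for /var/www/html/images.

```shell
#!/bin/sh
# Hypothetical working area; in a real deployment this would be
# /var/www/html/images/INPUTDATA_DIR as described above.
workdir=$(mktemp -d)
input_dir="$workdir/INPUTDATA_DIR"
mapping="$workdir/wp_articles.csv"   # assumed name for the ID-to-name CSV
mkdir -p "$input_dir"
: > "$mapping"

# add_page ID "Page name" "text" (hypothetical helper):
# store the text in a file named after the ID and
# record the ID-to-name association as one CSV line.
add_page() {
    printf '%s\n' "$3" > "$input_dir/$1"
    printf '%s,%s\n' "$1" "$2" >> "$mapping"
}

# Invented sample data, for illustration only.
add_page 1001 "Module:Arguments" "-- sample Lua content"
add_page 1002 "Module:Yesno" "-- more sample Lua content"

ls "$input_dir"
cat "$mapping"
```

With the files laid out this way, the directory can be handed to importTextFiles.php as shown above, and the CSV file is what the later renaming steps consume.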
Preprocess file names
Given a list of input data files, for example files from Wikipedia's Module collection, and a CSV file that maps unique IDs to page names, the following command produces a smaller file that contains only the replacement entries for those files.
ls ./images/TEMPLATE/ | xargs -I {} grep {} wp_articles_template.csv > template_replacement.txt
When performing data extraction, one usually uses a comma (,) to delimit fields. To be compatible with the moveBatch.php syntax, which separates the old and new page names with a pipe (|), we need to perform a substitution:
sed -i "s/,/|/g" template_replacement.txt
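The two commands above can be tried on sample data before touching a real wiki. The sketch below is a variation, not the exact commands: it matches each file name against the first CSV field only (so that ID 1001 does not also select 10011), and it replaces only the first comma on each line (so that a page name containing a comma stays intact). It assumes the IDs are plain numbers with no regular-expression metacharacters; the sample IDs and page names are invented.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented sample data: two imported files and a four-line mapping.
mkdir TEMPLATE
touch TEMPLATE/1001 TEMPLATE/1003
cat > wp_articles_template.csv <<'EOF'
1001,Template:Infobox
1002,Template:Unused
1003,Template:Cite web, archived
10011,Template:Other
EOF

# For each file name (the ID), keep the CSV line whose first field
# is exactly that ID, then turn the first comma into a pipe.
for f in TEMPLATE/*; do
    id=${f##*/}
    grep "^${id}," wp_articles_template.csv
done | sed 's/,/|/' > template_replacement.txt

cat template_replacement.txt
```

If the page names are known to contain no commas, the global substitution shown above gives the same result.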
Move pages to their proper names
After the replacement file is ready, simply run the moveBatch.php script to change the page names.
php ../maintenance/moveBatch.php --r="load template data" --noredirects template_replacement.txt
Note: The original instructions in the MediaWiki manual show a --u=user option. The string user should be a user name that exists in the database into which you are loading this data. For simplicity, the --u option can be omitted most of the time. If you insist on using this option, make sure that you specify a user name that already exists in the database; otherwise, the script will throw exceptions.
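Because moveBatch.php acts on every line of the list, it may be worth checking the file before running it. The sketch below assumes that each line should contain exactly one pipe with a non-empty name on either side; check_move_list is a hypothetical helper and not part of MediaWiki.

```shell
#!/bin/sh
# check_move_list FILE: hypothetical helper, not part of MediaWiki.
# Succeeds only if every line has the form "old name|new name".
check_move_list() {
    awk -F'|' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example with a throwaway file; the second line still uses a comma.
list=$(mktemp)
printf '%s\n' '1001|Template:Infobox' '1002,Template:Unused' > "$list"
check_move_list "$list" || echo "fix the list before running moveBatch.php"
```

In practice the argument would be template_replacement.txt, and the move would be run only when the check succeeds.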
Remove Temporary Page Names
Since we initially created the pages with temporary page names, we need to remove these page names. Use mw:Manual:deleteBatch.php to perform this job. Create a list of page names by using the following instructions:
ls ./TEMPLATE > tmpPageNameList.txt
php ../maintenance/deleteBatch.php tmpPageNameList.txt