Difference between revisions of "ImportTextFiles.php"

From PKC
Latest revision as of 04:44, 28 January 2022

Please refer to the original MediaWiki documentation for importTextFiles.php.[1]

A Typical Importation Process

  1. Extract text content from a source, ideally Wikipedia or another reputable data source with an existing data model.
    1. In the case of Wikipedia, every page can be treated as a single text file, and each page has a unique ID provided in either the XML or the MediaWiki data model.
    2. Dump all the textual content into a directory. Each page should be stored in its own file, and a comma-delimited (CSV) file should associate each file's unique ID with its page name, one pair per line.
    3. When importing Modules, whose text must be handled by the content model named Scribunto, it may be useful to add the prefix option[1] --prefix "Module:", so that the page titles start with Module: and the system marks the pages to use Scribunto as their content handling model. (This has not yet been tested.)
  2. Move the files to a place that MediaWiki's maintenance scripts can easily access, such as /var/www/html/images. For Docker deployments, this directory is likely to be mapped onto the host machine's hard drive, so it is easy to put the text files there, for example in a directory named MODULE under /var/www/html/images.
  3. Open a terminal on the host machine that serves MediaWiki, then run the importTextFiles.php script as follows:
root@hostname:/var/www/html# php ./maintenance/importTextFiles.php -s "Loading Textual Content from external sources" --overwrite --use-timestamp --prefix "Module:" ./images/MODULE/*
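
Before running the import, it can help to confirm that the files really are staged where the script will look. The following is only a sketch: the function name check_staging and the WIKI_ROOT variable are hypothetical, and the default path is the one used in the steps above.

```shell
#!/bin/sh
# Report how many regular files are staged in the given directory.
# Returns non-zero if the directory is missing or contains no files.
check_staging() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    # tr strips the padding that some wc implementations add.
    count=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
    echo "files staged for import: $count"
    [ "$count" -gt 0 ]
}

# WIKI_ROOT is a hypothetical override so the check can be tried
# outside a real MediaWiki installation.
WIKI_ROOT="${WIKI_ROOT:-/var/www/html}"
if check_staging "$WIKI_ROOT/images/MODULE"; then
    echo "ready to run importTextFiles.php from $WIKI_ROOT"
fi
```

The check only counts files; it does not inspect their content or contact the wiki.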

Preprocess file names

Given a list of input data files, for example the files from Wikipedia's Module text file collection, each file is named by a unique ID. Each of these IDs can be mapped to a unique Module name through a text file (here, wp_articles_module.csv). The following command looks up every file name in that CSV and writes the matching lines to a smaller file, module_replacement.txt, which holds the replacements.

ls ./MODULE/ | xargs -I {} grep {} wp_articles_module.csv > module_replacement.txt
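
To see what this lookup produces, here is a small self-contained run on made-up data. The IDs, module names, and CSV layout (ID, comma, name) are assumptions for illustration only. The sketch also anchors the pattern to the ID column ("^ID,"), a slight tightening of the plain grep, so that an ID cannot match inside a longer ID or a module name.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical dump: three files named by page ID.
mkdir MODULE
touch MODULE/575426565 MODULE/657477565 MODULE/734556565

# Hypothetical CSV mapping IDs to names; the last row has no dumped file.
cat > wp_articles_module.csv <<'EOF'
575426565,certainModuleName1
657477565,certainModuleName2
734556565,certainModuleName3
999999999,notDumped
EOF

# For every file name, keep the CSV line whose first column is that ID.
ls ./MODULE/ | xargs -I {} grep "^{}," wp_articles_module.csv > module_replacement.txt

cat module_replacement.txt
```

Only the three rows whose IDs exist as files end up in module_replacement.txt; the row for the undumped page is left out.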

Extracted data is usually comma-delimited, whereas moveBatch.php expects the old and new titles on each line to be separated by a pipe (|), so a substitution is needed. If you also want to add a string such as Module: to the beginning of each file name[2], you may use the following command in place of the earlier one:

ls ./MODULE/ | sed "s/^/Module:/" | xargs -I {} grep {} wp_articles_module.csv > module_replacement.txt

To replace the comma delimiter with a pipe, use the following command:

sed -i "s/,/|/" module_replacement.txt
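
As a worked example, the sketch below turns ID,name rows into the Module:ID|Module:name format shown later on this page. It assumes the lookup result holds bare IDs and names without a namespace; the input file name module_replacement.csv and the sample rows are hypothetical. It writes to a new file instead of using sed -i, because the in-place option is spelled differently across sed implementations.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical lookup result; the last name itself contains a comma.
cat > module_replacement.csv <<'EOF'
575426565,certainModuleName1
657477565,certainModuleName2
734556565,certain,ModuleName3
EOF

# 1) Replace only the first comma (the column delimiter) with a pipe.
# 2) Prefix the old title with "Module:".
# 3) Prefix the new title with "Module:" by rewriting the pipe.
sed -e 's/,/|/' -e 's/^/Module:/' -e 's/|/|Module:/' \
    module_replacement.csv > module_replacement.txt

cat module_replacement.txt
```

Because none of the substitutions uses the g flag, a comma inside a module name is left alone.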

Move pages to their proper names

After the replacement file is ready, one can simply run the moveBatch.php script to change the page names.

php ../maintenance/moveBatch.php --r="change page names to proper title" --noredirects module_replacement.txt

Note: The original instructions in the MediaWiki manual show a --u=user option. The string user should be a user name in the database into which you are loading the data. For simplicity, the --u option can be omitted most of the time. If you do use it, make sure the user name already exists in the database; otherwise, the script will throw an exception.
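
Since a stray comma or an empty title would make a line unusable, it may be worth checking the replacement file before handing it to moveBatch.php. The following is a minimal sketch, not part of MediaWiki: the function name check_move_list is hypothetical, and the sample file deliberately contains two bad lines.

```shell
#!/bin/sh
# Print every line that is not exactly "old|new" with both parts non-empty.
# Exit status is 0 when the file is clean, 1 otherwise.
check_move_list() {
    awk -F'|' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

workdir=$(mktemp -d)

# Hypothetical sample: line 2 still uses a comma, line 3 has no new title.
cat > "$workdir/module_replacement.txt" <<'EOF'
Module:575426565|Module:certainModuleName1
Module:657477565,Module:certainModuleName2
Module:734556565|
EOF

if check_move_list "$workdir/module_replacement.txt"; then
    echo "list looks well-formed"
else
    echo "fix the lines above before running moveBatch.php"
fi
```

The check is purely syntactic; it does not verify that the old titles exist on the wiki or that the new titles are free.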

Remove temporary page names

Since we initially created the pages with temporary page names, we need to remove these page names. Use mw:Manual:deleteBatch.php to perform this job. Create a list of page names by using the following instructions:

ls ./TEMPLATE > tmpPageNameList.txt
php ../maintenance/deleteBatch.php tmpPageNameList.txt

To create tmpPageNameList.txt, one can construct it from scratch or derive it from module_replacement.txt[3]. Note that module_replacement.txt has the format:

Module:575426565|Module:certainModuleName1
Module:657477565|Module:certainModuleName2
Module:734556565|Module:certainModuleName3

We can run the following sed instruction[4] at the command line to produce the list of temporary page names:

ls ./MODULE/ | sed "s/^/Module:/" > pageNamesToBeRemoved.txt
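
The other route mentioned above, deriving the list from module_replacement.txt, can be sketched by deleting everything from the pipe to the end of each line. The sample rows are the ones shown earlier; the working directory is a throwaway one created only for this illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > module_replacement.txt <<'EOF'
Module:575426565|Module:certainModuleName1
Module:657477565|Module:certainModuleName2
Module:734556565|Module:certainModuleName3
EOF

# Drop the pipe and the new title, keeping only the temporary page name.
sed 's/|.*$//' module_replacement.txt > pageNamesToBeRemoved.txt

cat pageNamesToBeRemoved.txt
```

This keeps the deletion list in step with the move list, since both come from the same file.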

For more sed-related instructions, see sed.


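References

  1. mw:Manual:importTextFiles.php
  2. Add character to the beginning of each line using sed, https://linuxconfig.org/add-character-to-the-beginning-of-each-line-using-sed
  3. Removing pattern at the end of a string using sed, https://stackoverflow.com/questions/9191030/removing-pattern-at-the-end-of-a-string-using-sed-or-other-bash-tools/12635437
  4. Sed Tutorial, https://www.grymoire.com/Unix/Sed.html#uh-0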
